ordering of Word elements in page-with-glyphs.xml #12
Comments
This file is from the IMPACT interoperability framework testfiles: https://github.com/impactcentre/iif-testfiles I explicitly chose an external example for glyphs that has been widely used and is established and see where it led us.... :) IMHO the order of preference for Words (and Glyphs) should be
|
I see. Well, I traced it down from there (gt.xml) to the IMPACT component USAL Line and Word Segmentation. Not having a login for IMPACT, I can only speculate what these data are and how others deal with them. But since this is announced as test data (as opposed to GT data, despite the file name), isn't it possible that these files are expected to fail, too? Back to our problem: I don't think we can have any such thing as an order of preference here. What if document ordering (your number one) is wrong (as in the case at hand)? How can this be detected, if not by comparing the other two options (ranked lower in your list)? And what if those disagree as well? Take a majority vote? And shouldn't looking at coordinates be ruled out entirely, because it is expensive (performance-wise), error-prone, and complex (at least for components that would not need to deal with them otherwise)? I agree that Looking at it from the PAGE specification, IIRC, XML ordering must agree with |
I meant the order of my personal (humble) preference, not an order for implementation. I'm no expert, but I could imagine that readingOrder is unavoidable for RTL or for some constructs in non-Latin scripts, with a fallback to document order if no readingOrder is defined. I think coordinates are too error-prone, too. As for the test data: it could very well be that this is an expected failure. I'll try to find out more; thanks for digging into it. Can @tboenig and @wrznr offer a more qualified opinion on modeling the order of inline elements? |
Oh sorry, I got you completely wrong there before. So I concur! |
Let me summarize:
Proposal Decision: |
@tboenig I agree that the XML ordering is the most practical and flexible solution for us. Especially with so-called "Schmuckdruck" in mind. We should therefore handle this issue as a bug in the assets data which should be fixed by a PR. |
@kba RTL is a special case. But there are additional mechanisms in PAGE XML for handling this. |
@tboenig Just to make sure we understand each other: Your point 1 is not the same as my point 2: I was not referring to the
Okay fine, but why does it appear in kant_aufklaerung_1784-page-block-line-word up to the
Do you mean you don't know for sure now (whether this is an error in the assets), or rather there is always a possibility of error but one can never know (and specify) with certainty? My concern is with what processors can safely assume from incoming data, so a strict rule is needed, and one that can easily be implemented (like relying on XML ordering exclusively).
Does that mean the contents of |
@bertsky The concatenation of |
The rule:
|
@tboenig Pls. repair the erroneous files and close this issue. |
@tboenig PUSH. |
@bertsky: |
Is it correct for assets/data/page-with-glyphs.xml to have its Word elements misordered w.r.t. the linear reading order as seen by TextLine?
For example, in the first line N66290 of region r0, the first word N72746 is actually the last element. This striking disorder repeats throughout this file. The only information to reproduce the TextLine content here is in the coordinates.
In principle, we could have:
readingOrder indexing in the custom attribute
What source can/must postcorrection rely on?
@tboenig
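One way to spot the kind of disorder described above is to compare, for each TextLine, the Word texts joined in XML document order against the line's own text. The following is only a minimal sketch under stated assumptions: the sample document, its namespace URI and its ids (l0, l1, w0, ...) are invented for illustration and are not taken from page-with-glyphs.xml, and the function name is hypothetical. It assumes every element carries its text in a direct TextEquiv/Unicode child and matches elements by local name so that it does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET

# Invented sample, NOT the real asset: line l0 has its words in reading
# order, line l1 has them swapped in the XML.
SAMPLE = """<PcGts xmlns="http://example.org/hypothetical-page-namespace">
  <Page>
    <TextRegion id="r0">
      <TextLine id="l0">
        <Word id="w0"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <Word id="w1"><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l1">
        <Word id="w2"><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>
        <Word id="w3"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of elem with the given local name, in document order."""
    return [child for child in elem if local_name(child.tag) == name]


def own_text(elem):
    """Text of the first direct TextEquiv/Unicode child, or None if absent."""
    for text_equiv in children(elem, "TextEquiv"):
        for unicode_elem in children(text_equiv, "Unicode"):
            return unicode_elem.text or ""
    return None


def lines_disagreeing_with_word_order(root):
    """Ids of TextLines whose text differs from their Words in XML order."""
    suspicious = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = children(elem, "Word")
        line_text = own_text(elem)
        word_texts = [own_text(word) for word in words]
        # Skip lines that cannot be compared (no words or missing text).
        if not words or line_text is None or None in word_texts:
            continue
        # Compare token sequences so that whitespace differences are ignored.
        if " ".join(word_texts).split() != line_text.split():
            suspicious.append(elem.get("id"))
    return suspicious


if __name__ == "__main__":
    print(lines_disagreeing_with_word_order(ET.fromstring(SAMPLE)))
```

A mismatch reported this way only says that the two sources disagree; it does not tell which of them is wrong, and it does not look at coordinates or at any readingOrder information at all.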