ordering of Word elements in page-with-glyphs.xml #12

Closed
bertsky opened this issue Aug 30, 2018 · 15 comments

@bertsky
Contributor

bertsky commented Aug 30, 2018

Is it correct for assets/data/page-with-glyphs.xml to have its Word elements misordered w.r.t. the linear reading order as seen by TextLine?

For example, in the first line N66290 of region r0,

Ich. Chriian Edlen von S  midt

the first word N72746 is actually the last element. This striking disorder repeats throughout the file. The only information available here to reproduce the TextLine content is in the coordinates.

In principle, we could have:

  1. coordinates
  2. readingOrder indexing in the custom attribute
  3. XML ordering

What source can/must postcorrection rely on?
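
To make the disagreement between sources concrete, here is a minimal sketch that compares XML ordering with a coordinate-based ordering for one line. The fragment is hypothetical and namespace-free, and it assumes the `points` attribute form of `Coords` and a horizontal left-to-right line; real files use a namespace and may encode coordinates differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (not taken from page-with-glyphs.xml).
LINE_XML = """
<TextLine id="l0">
  <Word id="w2"><Coords points="300,10 380,10 380,40 300,40"/></Word>
  <Word id="w0"><Coords points="10,10 90,10 90,40 10,40"/></Word>
  <Word id="w1"><Coords points="100,10 290,10 290,40 100,40"/></Word>
</TextLine>
"""


def left_edge(word):
    """Return the smallest x coordinate of the word polygon."""
    points = word.find("Coords").get("points").split()
    return min(int(point.split(",")[0]) for point in points)


def order_by_document(line):
    """Word ids in XML (document) order."""
    return [word.get("id") for word in line.findall("Word")]


def order_by_coordinates(line):
    """Word ids sorted by left edge; assumes a left-to-right line."""
    words = sorted(line.findall("Word"), key=left_edge)
    return [word.get("id") for word in words]


line = ET.fromstring(LINE_XML.strip())
doc_order = order_by_document(line)
coord_order = order_by_coordinates(line)
print("document order:  ", doc_order)
print("coordinate order:", coord_order)
print("sources agree:   ", doc_order == coord_order)
```

A check like this can flag a suspicious line, but it cannot say which source is right, which is exactly the question above.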

@tboenig

@kba
Member

kba commented Aug 30, 2018

This file is from the IMPACT interoperability framework testfiles: https://github.com/impactcentre/iif-testfiles

I explicitly chose an external example for glyphs that is widely used and established, and look where it led us... :)

IMHO the order of preference for Words (and Glyphs) should be

  • document order (XML ordering)
  • Page readingOrder mechanism (which seems overkill for inline elements)
  • coordinates

@bertsky
Contributor Author

bertsky commented Aug 30, 2018

I see. Well I traced it down from there (gt.xml) to the IMPACT component USAL Line and Word Segmentation. Not having a login for IMPACT, I can only speculate what these data are and how others deal with them. But since this is announced as testdata (as opposed to GT data, despite the file name), isn't it possible these files are expected to fail, too?

Back to our problem: I don't think we can have any such thing as an order of preference here. What if document ordering (your number one) is wrong (as in the case at hand)? How to detect this if not by comparing the other two options (ranked lower in your list)? And what if those disagree as well? Take a majority vote? And shouldn't looking at coordinates be ruled out entirely, because it is expensive (performance-wise), error-prone, and complex (at least for components that would not need to deal with them otherwise)?

I agree that readingOrder would be overkill for elements below the block/region level. Provided you share my objection to coordinates, that leaves us with XML ordering only.

Looking at it from the PAGE specification, IIRC, XML ordering must agree with textLineOrder and readingDirection.

@kba
Copy link
Member

kba commented Aug 30, 2018

I meant order of my personal (humble) preference not for implementation. I'm no expert, but I could imagine that readingOrder is unavoidable for RTL or some constructs in non-Latin scripts, with a fallback to document order if no readingOrder is defined. I also think coordinates are too error-prone.
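
The fallback idea could look roughly like the sketch below. It assumes the index is stored as `readingOrder {index:N;}` inside the `custom` attribute, which is a convention used by some tools and not something this thread confirms; `ordered_words` and the sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed syntax of the custom attribute: custom="readingOrder {index:3;}"
READING_ORDER = re.compile(r"readingOrder\s*\{\s*index\s*:\s*(\d+)\s*;?\s*\}")


def reading_index(element):
    """Return the readingOrder index from @custom, or None if absent."""
    match = READING_ORDER.search(element.get("custom", ""))
    return int(match.group(1)) if match else None


def ordered_words(line):
    """Sort words by readingOrder index; fall back to document order."""
    words = line.findall("Word")
    indices = [reading_index(word) for word in words]
    if any(index is None for index in indices):
        # At least one word carries no index: trust the XML ordering.
        return words
    pairs = sorted(zip(indices, words), key=lambda pair: pair[0])
    return [word for _, word in pairs]


LINE_XML = """
<TextLine id="l0">
  <Word id="w1" custom="readingOrder {index:1;}"/>
  <Word id="w0" custom="readingOrder {index:0;}"/>
</TextLine>
"""
line = ET.fromstring(LINE_XML.strip())
ordered_ids = [word.get("id") for word in ordered_words(line)]
print(ordered_ids)
```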

As for the test data: it could very well be that this is an expected failure. I'll try to find out more; thanks for digging into it.

Can @tboenig and @wrznr offer a more qualified opinion on modeling order of inline elements?

@bertsky
Contributor Author

bertsky commented Aug 30, 2018

> I meant order of my personal (humble) preference not for implementation.

Oh sorry, I got you completely wrong there before. So I concur!

@tboenig
Contributor

tboenig commented Sep 17, 2018

Let me summarize:

  1. There is only a "region" order, the so-called Reading Order.
    A proposal for elements below the block/region level would be overkill.
  2. The coordinates are an indicator of the Reading Order. For polygons, defining this 'order' indicator is not so simple, so a bounding box would be useful in this case. The reading direction and LineOrder must also be considered.
  3. The Reading Order is defined by the sequence of the word elements. The content of the elements <TextEquivType><Unicode> may differ. In this case, it is more likely that an entry error occurred.
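
Point 2 can be illustrated with a small sketch: reduce each polygon to its bounding box, then sort along the reading direction. The function names and sample polygons are hypothetical; the direction strings mirror PAGE's `readingDirection` values, and only the two horizontal cases are handled.

```python
def bounding_box(points):
    """Reduce a 'x1,y1 x2,y2 ...' polygon to (x_min, y_min, x_max, y_max)."""
    coords = [tuple(int(v) for v in point.split(",")) for point in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def sort_words(polygons, direction="left-to-right"):
    """Order word ids by bounding box along a horizontal reading direction."""
    boxes = {word_id: bounding_box(points) for word_id, points in polygons.items()}
    if direction == "left-to-right":
        return sorted(boxes, key=lambda word_id: boxes[word_id][0])
    if direction == "right-to-left":
        return sorted(boxes, key=lambda word_id: -boxes[word_id][2])
    raise ValueError("only horizontal reading directions are sketched here")


# Hypothetical, slightly skewed word polygons.
polygons = {
    "w1": "100,12 290,8 292,40 102,44",
    "w0": "10,12 90,8 95,40 12,44",
}
print(bounding_box(polygons["w0"]))
print(sort_words(polygons))
print(sort_words(polygons, "right-to-left"))
```

Even this simplified version needs the direction as extra input, which hints at why coordinates make a fragile primary source of order.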

Proposal Decision:
The Reading Order is defined in the sequence of the word elements. An evaluation of the elements <TextEquivType><Unicode> can be ignored. An evaluation of the elements can be neglected. If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the rich sequence of the <Word> elements is always recommended.

@wrznr
Collaborator

wrznr commented Sep 17, 2018

@tboenig I agree that the XML ordering is the most practical and flexible solution for us. Especially with so-called "Schmuckdruck" in mind. We should therefore handle this issue as a bug in the assets data which should be fixed by a PR.

@wrznr wrznr added the bug Something isn't working label Sep 17, 2018
@wrznr
Collaborator

wrznr commented Sep 17, 2018

@kba RTL is a special case. But there are additional mechanisms in PAGE XML for handling this.

@bertsky
Contributor Author

bertsky commented Sep 17, 2018

@tboenig Just to make sure we understand each other:

Your point 1 is not the same as my point 2: I was not referring to the ReadingOrder element, but the readingOrder key of the custom attribute.

> 1. A proposal for elements below the block/region level would be overkill.

Okay fine, but why does it appear in kant_aufklaerung_1784-page-block-line-word up to the Word level? Merely as an illustration of (non-standard) possibilities?

> 3. The content of the elements `<TextEquivType><Unicode>` may differ. In this case, it is more likely that an entry error occurred.

Do you mean you don't know for sure now (whether this is an error in the assets), or rather there is always a possibility of error but one can never know (and specify) with certainty? My concern is with what processors can safely assume from incoming data, so a strict rule is needed, and one that can easily be implemented (like relying on XML ordering exclusively).

> If a comparison is made between the contents of the elements `<TextEquivType><Unicode>` and the `<Word>` elements, the rich sequence of the `<Word>` elements is always recommended.

Does that mean the contents of TextLine:TextEquiv:Unicode may in fact (be allowed to) deviate from the concatenation of its constituent TextLine:Word:TextEquiv:Unicode contents (plus whitespace)? I was actually hoping for a specification that would rule out such deviations entirely.

@wrznr
Collaborator

wrznr commented Sep 17, 2018

@bertsky The concatenation of TextLine:Word:TextEquiv:Unicode contents is not allowed to deviate from the corresponding TextLine:TextEquiv:Unicode contents. That's why this issue has been marked with the label bug. The file in assets will be fixed asap.

@tboenig
Contributor

tboenig commented Sep 17, 2018

The rule:

  1. The Reading Order is defined in the sequence of the word elements.
  2. If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the contents of both elements must correspond.
  3. If a comparison between the contents of the <Word> elements and the <TextEquivType><Unicode> elements shows a difference, then there is an error. However, if this file is processed further, the order of the <Word> elements must be followed.
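
A processor could apply this rule roughly as sketched below: join the Word texts in XML order, compare the result with the line-level text, and report a mismatch as an error while still following the Word sequence. The fragment is hypothetical and namespace-free, and joining words with a single space is an assumption about how whitespace is handled.

```python
import xml.etree.ElementTree as ET


def unicode_text(element):
    """Return the text of the element's own TextEquiv/Unicode child, or ''."""
    node = element.find("TextEquiv/Unicode")
    return (node.text or "") if node is not None else ""


def check_line(line):
    """Compare line text with the Word texts joined in document order.

    Returns (consistent, word_text); word_text is what rule 3 says to follow.
    """
    line_text = unicode_text(line)
    word_text = " ".join(unicode_text(word) for word in line.findall("Word"))
    return line_text == word_text, word_text


# Hypothetical line whose Word elements are in the wrong order.
LINE_XML = """
<TextLine id="l0">
  <Word id="wB"><TextEquiv><Unicode>von</Unicode></TextEquiv></Word>
  <Word id="wA"><TextEquiv><Unicode>Ich.</Unicode></TextEquiv></Word>
  <TextEquiv><Unicode>Ich. von</Unicode></TextEquiv>
</TextLine>
"""
line = ET.fromstring(LINE_XML.strip())
consistent, word_text = check_line(line)
if not consistent:
    print("error: Word sequence and line text differ; following Word order:", word_text)
```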

@bertsky
Contributor Author

bertsky commented Sep 17, 2018

@wrznr @tboenig Thanks for clarifying!

@wrznr
Collaborator

wrznr commented Oct 4, 2018

@tboenig Pls. repair the erroneous files and close this issue.

@kba kba added this to To do in coordinate Oct 18, 2018
@kba kba moved this from Backlog to High Priority in coordinate Oct 19, 2018
@wrznr
Collaborator

wrznr commented Nov 6, 2018

@tboenig PUSH.

@tboenig
Contributor

tboenig commented Feb 12, 2019

@bertsky:
see the document:
https://github.com/OCR-D/assets/tree/master/data/kant_enlightenment_1784-page-block-block-line-word_glyph/data/OCR-D-GT-SEG-WORD_GLYPH
here you will find an example for the recording of:
Region, Word and Glyph

@tboenig tboenig closed this as completed Feb 12, 2019
coordinate automation moved this from High Priority to Done Feb 12, 2019
@bertsky
Contributor Author

bertsky commented Feb 12, 2019

@tboenig thanks!

Alas, see #26


4 participants