ordering of Word elements in page-with-glyphs.xml #12

bertsky · 2018-08-30T12:50:27Z

Is it correct for assets/data/page-with-glyphs.xml to have its Word elements misordered w.r.t. the linear reading order as seen by TextLine?

For example, in the first line N66290 of region r0,

Ich. Chriian Edlen von S  midt

the first word N72746 is actually the last element. This striking disorder repeats throughout the file. The only information from which the TextLine content can be reproduced here is the coordinates.

In principle, we could have:

coordinates
readingOrder indexing in the custom attribute
XML ordering

What source can/must postcorrection rely on?

@tboenig


kba · 2018-08-30T13:28:21Z

This file is from the IMPACT interoperability framework testfiles: https://github.com/impactcentre/iif-testfiles

I explicitly chose an external example for glyphs that has been widely used and is established, and look where it led us... :)

IMHO the order of preference for Words (and Glyphs) should be

document order (XML ordering)
Page readingOrder mechanism (which seems overkill for inline elements)
coordinates

bertsky · 2018-08-30T20:16:54Z

I see. Well I traced it down from there (gt.xml) to the IMPACT component USAL Line and Word Segmentation. Not having a login for IMPACT, I can only speculate what these data are and how others deal with them. But since this is announced as testdata (as opposed to GT data, despite the file name), isn't it possible these files are expected to fail, too?

Back to our problem: I don't think we can have any such thing as an order of preference here. What if document ordering (your number one) is wrong (as in the case at hand)? How to detect this if not by comparing the other two options (ranked lower in your list)? And what if those disagree as well? Take a majority vote? And shouldn't looking at coordinates be ruled out entirely, because it is expensive (performance-wise), error-prone, and complex (at least for components that would not need to deal with them otherwise)?

I agree that readingOrder would be overkill for elements below the block/region level. Provided you share my objection to coordinates, that leaves us with XML ordering only.

Looking at it from the PAGE specification, IIRC, XML ordering must agree with textLineOrder and readingDirection.

kba · 2018-08-30T21:02:21Z

I meant the order of my personal (humble) preference, not an order for implementation. I'm no expert, but I could imagine that readingOrder is unavoidable for RTL or for some constructs in non-Latin scripts, with a fallback to document order if no readingOrder is defined. Coordinates are too error-prone, I think so too.
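
The fallback idea (use a readingOrder index where one is defined, otherwise document order) could look roughly like the sketch below. It is only an illustration: the exact syntax of the custom attribute, assumed here to be `readingOrder {index:N;}`, and the helper names `reading_index` and `ordered_words` are assumptions, not something the PAGE schema or OCR-D prescribes. The `{*}` namespace wildcard needs Python 3.8 or later.

```python
import re
import xml.etree.ElementTree as ET

# Assumed convention for the custom attribute: custom="readingOrder {index:3;}"
_INDEX = re.compile(r"readingOrder\s*\{\s*index\s*:\s*(\d+)\s*;?\s*\}")


def reading_index(element):
    """Return the readingOrder index from the custom attribute, or None."""
    match = _INDEX.search(element.get("custom", ""))
    return int(match.group(1)) if match else None


def ordered_words(text_line):
    """Return the Word children of a TextLine in reading order.

    If every Word carries a readingOrder index, sort by it;
    otherwise fall back to document (XML) order.
    """
    words = text_line.findall("{*}Word")  # any namespace, direct children only
    indices = [reading_index(word) for word in words]
    if words and all(index is not None for index in indices):
        pairs = sorted(zip(indices, words), key=lambda pair: pair[0])
        return [word for _, word in pairs]
    return words
```

A partially annotated line (some Words with an index, some without) deliberately falls back to document order here, since mixing the two sources would be ambiguous.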

As for the test data: it could very well be that this is an expected failure. I'll try to find out more; thanks for digging into it.

Can @tboenig and @wrznr offer a more qualified opinion on modeling order of inline elements?

bertsky · 2018-08-30T21:07:16Z

> I meant order of my personal (humble) preference not for implementation.

Oh sorry, I got you completely wrong there before. So I concur!

tboenig · 2018-09-17T11:53:46Z

Let me summarize:

There is only a "region" order, the so-called Reading Order.
A proposal for elements below the block/region level would be overkill.
The coordinates are an indicator for the Reading Order. For polygons the definition of the 'order' indicator is not so simple. Therefore the definition of a bounding box would be useful in this case. Furthermore, the reading direction and LineOrder must also be considered in this case.
The Reading Order is defined in the sequence of the word elements. The content of the elements <TextEquivType><Unicode> may differ. In this case, it is more likely that an entry error occurred.
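
The coordinate-based indicator from the summary above can be sketched as follows: reduce each Word polygon to a bounding box and compare the left-to-right order of the boxes with the document order. This is a minimal sketch under several assumptions: `Coords` carries a `points="x1,y1 x2,y2 ..."` attribute (as in the 2013 and later PAGE schema versions), the line is horizontal and read left to right, and the function names are made up for illustration. Reading direction and line order, which the summary says must also be considered, are ignored here.

```python
import xml.etree.ElementTree as ET


def bounding_box(element):
    """Return (x_min, y_min, x_max, y_max) of the element's Coords polygon."""
    coords = element.find("{*}Coords")
    points = [tuple(int(value) for value in pair.split(","))
              for pair in coords.get("points").split()]
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def words_left_to_right(text_line):
    """Word children sorted by the left edge of their bounding box."""
    words = text_line.findall("{*}Word")
    # sorted() is stable, so Words with equal left edges keep document order
    return sorted(words, key=lambda word: bounding_box(word)[0])


def document_order_matches_coordinates(text_line):
    """True if the XML order of the Words equals their left-to-right order."""
    return text_line.findall("{*}Word") == words_left_to_right(text_line)
```

A check like this can flag a file such as the one discussed here, but it cannot decide which of the two orders is the intended one.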

Proposal Decision:
The Reading Order is defined in the sequence of the word elements. An evaluation of the <TextEquivType><Unicode> elements can be neglected. If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the rich sequence of the <Word> elements is always recommended.

wrznr · 2018-09-17T12:05:10Z

@tboenig I agree that the XML ordering is the most practical and flexible solution for us. Especially with so-called "Schmuckdruck" in mind. We should therefore handle this issue as a bug in the assets data which should be fixed by a PR.

wrznr · 2018-09-17T12:07:37Z

@kba RTL is a special case. But there are additional mechanisms in PAGE XML for handling this.

bertsky · 2018-09-17T12:30:50Z

@tboenig Just to make sure we understand each other:

Your point 1 is not the same as my point 2: I was not referring to the ReadingOrder element, but the readingOrder key of the custom attribute.

> 1. A proposal for elements below the block/region level would be overkill.

Okay fine, but why does it appear in kant_aufklaerung_1784-page-block-line-word up to the Word level? Merely as an illustration of (non-standard) possibilities?

> 3. The content of the elements `<TextEquivType><Unicode>` may differ. In this case, it is more likely that an entry error occurred.

Do you mean you don't know for sure now (whether this is an error in the assets), or rather there is always a possibility of error but one can never know (and specify) with certainty? My concern is with what processors can safely assume from incoming data, so a strict rule is needed, and one that can easily be implemented (like relying on XML ordering exclusively).

> If a comparison is made between the contents of the elements `<TextEquivType><Unicode>` and the `<Word>` elements, the rich sequence of the `<Word>` elements is always recommended.

Does that mean the contents of TextLine:TextEquiv:Unicode may (is allowed to) in fact deviate from the concatenation of its content TextLine:Word:TextEquiv:Unicode (plus whitespace)? I was actually hoping for a specification that would rule out such deviations entirely.

wrznr · 2018-09-17T13:22:10Z

@bertsky The concatenation of TextLine:Word:TextEquiv:Unicode contents is not allowed to deviate from the corresponding TextLine:TextEquiv:Unicode contents. That's why this issue has been marked with the label bug. The file in assets will be fixed asap.

tboenig · 2018-09-17T13:40:52Z

The rule:

The Reading Order is defined in the sequence of the word elements.
If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the contents of both elements must correspond.
If a comparison between the contents of the <Word> elements and the <TextEquivType><Unicode> elements shows a difference, then there is an error. However, if this file is processed further, the order of the <Word> elements must be followed.
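
A processor could check this rule along the following lines. The sketch joins the Word texts in document order and compares the result with the text of the TextLine itself; joining with a single space is an assumption (the thread only says "plus whitespace"), and `unicode_text` and `check_line` are hypothetical helper names. The `{*}` namespace wildcard needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def unicode_text(element):
    """Text of the element's own TextEquiv/Unicode child, or '' if absent."""
    equiv = element.find("{*}TextEquiv")  # direct child only
    node = equiv.find("{*}Unicode") if equiv is not None else None
    if node is None or node.text is None:
        return ""
    return node.text


def check_line(text_line):
    """Compare the TextLine text with its Words joined in document order.

    Returns (consistent, joined_words). If the two differ, the file
    contains an error, and joined_words is the text to continue with,
    because the order of the Word elements must be followed.
    """
    joined_words = " ".join(unicode_text(word)
                            for word in text_line.findall("{*}Word"))
    return unicode_text(text_line) == joined_words, joined_words
```

Applied to a line like the one that started this issue, `check_line` would report an inconsistency and return the Words in their (misordered) XML sequence.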

bertsky · 2018-09-17T14:24:06Z

@wrznr @tboenig Thanks for clarifying!

wrznr · 2018-10-04T07:32:24Z

@tboenig Pls. repair the erroneous files and close this issue.

wrznr · 2018-11-06T08:22:38Z

@tboenig PUSH.

tboenig · 2019-02-12T08:52:52Z

@bertsky:
see the document:
https://github.com/OCR-D/assets/tree/master/data/kant_enlightenment_1784-page-block-block-line-word_glyph/data/OCR-D-GT-SEG-WORD_GLYPH
here you will find an example for the recording of:
Region, Word and Glyph

bertsky · 2019-02-12T14:22:01Z

@tboenig thanks!

Alas, see #26

bertsky mentioned this issue Aug 30, 2018

word segmentation in kant_aufklaerung_1784 GT PageXML #13

Open

kba assigned tboenig Aug 30, 2018

bertsky mentioned this issue Sep 1, 2018

let's get practical OCR-D/ocrd_keraslm#5

Merged

wrznr self-assigned this Sep 14, 2018

wrznr added the bug Something isn't working label Sep 17, 2018

kba mentioned this issue Oct 9, 2018

PAGE: How to add TextEquiv and consistency rules, OCR-D/assets#16 OCR-D/spec#82

Merged

kba added this to To do in coordinate Oct 18, 2018

kba moved this from Backlog to High Priority in coordinate Oct 19, 2018

tboenig closed this as completed Feb 12, 2019

coordinate automation moved this from High Priority to Done Feb 12, 2019
