reader ignores index in ordered groups #13

Open
bertsky opened this issue Feb 26, 2021 · 1 comment


bertsky commented Feb 26, 2021

AFAICS, the existing implementations for all versions of PAGE-XML ignore (OrderedGroup|OrderedGroupIndexed)/@index when parsing the XML.

This is how it looks:

else if ( DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
        || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
    if (currentLogicalGroup != null) {
        if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
            currentLogicalGroup.addRegionRef(atts.getValue(i));
        }
    }

References to ATTR_index are nowhere to be found.
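
For illustration only, here is a minimal sketch of what reading the index alongside the region reference could look like in a SAX handler. It is not the library's code: the class `IndexAwareReadingOrderSketch`, the holder `IndexedRef` and the method `parse` are hypothetical names, and the sketch assumes that `RegionRefIndexed` carries the attributes `regionRef` and `index` as described in the PAGE schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not part of the PRImA library.
class IndexAwareReadingOrderSketch {

    /** Hypothetical holder for one parsed reference. */
    static final class IndexedRef {
        final String regionRef;
        final int index;

        IndexedRef(String regionRef, int index) {
            this.regionRef = regionRef;
            this.index = index;
        }
    }

    /** Collects (regionRef, index) pairs in document order. */
    static List<IndexedRef> parse(String xml) {
        final List<IndexedRef> refs = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (!"RegionRefIndexed".equals(localName)) {
                        return;
                    }
                    int i = atts.getIndex("regionRef");
                    int j = atts.getIndex("index");
                    // Unlike the excerpt above, the index is read and kept.
                    if (i >= 0 && j >= 0) {
                        refs.add(new IndexedRef(atts.getValue(i),
                                Integer.parseInt(atts.getValue(j))));
                    }
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return refs;
    }
}
```

The sketch only records the index; deciding what to do with it (sorting, validation) would still be up to the model class.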

The model class of the group, in turn, does nothing to check incoming indices; it simply appends each reference in the order it arrives:

public void addRegionRef(String id) {
    try {
        members.add(new RegionRef(this, contentFactory.getIdRegister().getId(id)));
    } catch (InvalidIdException e) {
        e.printStackTrace();
    }
}
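
As a rough sketch of the alternative, a group could accept the index together with the reference and keep its members sorted by it. Everything here is hypothetical (`OrderedGroupSketch`, `Member`, `getRegionIdsInOrder`), and plain strings stand in for the library's ID objects. The sketch also assumes that ties, should duplicate indices occur, fall back to XML order; the issue does not say how those should be handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of an index-aware group; not the library's model class.
class OrderedGroupSketch {

    private static final class Member {
        final String regionId;
        final int index;

        Member(String regionId, int index) {
            this.regionId = regionId;
            this.index = index;
        }
    }

    private final List<Member> members = new ArrayList<>();

    /** Adds a reference and re-sorts the members by their index. */
    void addRegionRef(String regionId, int index) {
        members.add(new Member(regionId, index));
        // List.sort is stable, so equal indices keep their insertion (XML) order.
        members.sort(Comparator.comparingInt((Member m) -> m.index));
    }

    /** Returns the region IDs in index order. */
    List<String> getRegionIdsInOrder() {
        List<String> ids = new ArrayList<>();
        for (Member m : members) {
            ids.add(m.regionId);
        }
        return ids;
    }
}
```

With this, adding `r_c` (index 2), `r_a` (index 0) and `r_b` (index 1) in that order yields `r_a`, `r_b`, `r_c`, regardless of where they appear in the XML.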

This means that applications like PageViewer or PageConverter will use the XML order instead of the order defined by the schema semantics, i.e. by @index. That in turn creates a problem for applications like OCR-D: which is the correct representation, the one shown by PageViewer or my strict implementation?

Here's an example of the difference this can make:

  • PAGE-XML and original image: debug-readingorder.zip
  • rendered by PageViewer: FILE_0002_ORIGINAL_pageviewer-all-order
  • rendered by ocrd-segment-extract-pages: FILE_0002_EXTRACT-LINES-EYNOLLAH pseg

Contrary to what one might suspect at first glance, here it is PageViewer that gets the order wrong, along with the producing tool eynollah (which follows the same model of just looking at the XML order). The two errors compensate for each other.

If my interpretation is wrong, please get back to me soonish for confirmation. (I don't care about the fix so much as clarity on the correct meaning of the standard for implementation in software and adoption in derived specifications like OCR-D.)

If the better place is the PAGE-XML repo, please transfer.

mikegerber commented Mar 19, 2021

I would also be very happy to know what PRImA-Research-Lab's view on the index value here is. 😀 I would interpret the schema description in the same way as @bertsky and I, too, think that the implementation in PAGE Viewer is therefore wrong as shown in the example. (In the example, XML order = correct reading order but the index values are essentially random values. These essentially random values should be interpreted as the order if our interpretation of the schema is correct.)
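
One way to spot files affected by this ambiguity is to check whether the index values increase strictly in XML order; if they do not, the two interpretations give different reading orders. The following is a minimal sketch with a hypothetical helper name, operating on the index values of one group listed in document order.

```java
import java.util.List;

// Hypothetical helper; not part of any existing tool.
class ReadingOrderConsistencySketch {

    /**
     * Returns true if the index values are strictly increasing in document
     * order, i.e. XML order and index order describe the same sequence.
     */
    static boolean xmlOrderMatchesIndexOrder(List<Integer> indicesInXmlOrder) {
        for (int k = 1; k < indicesInXmlOrder.size(); k++) {
            if (indicesInXmlOrder.get(k) <= indicesInXmlOrder.get(k - 1)) {
                return false;
            }
        }
        return true;
    }
}
```

A group whose members carry the indices 0, 1, 2 in document order passes the check; one with 3, 0, 2 does not, and would be rendered differently by an XML-order reader and an index-order reader.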
