reader ignores index in ordered groups #13

bertsky · 2021-02-26T00:34:27Z

AFAICS, the existing implementations for all versions of PAGE-XML ignore (OrderedGroup|OrderedGroupIndexed)/@index when parsing the XML.

This is how it looks:

prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2019_07_15.java

Lines 335 to 342 in 1f087a4

    
    else if (DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
            || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
        if (currentLogicalGroup != null) {
            if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
                currentLogicalGroup.addRegionRef(atts.getValue(i));
            }
        }

References to ATTR_index are nowhere to be found.
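For illustration, reading the attribute in a SAX handler could look roughly like the sketch below. This is not the PRImA code: the class `IndexAttributeSketch` and its helpers are hypothetical, and the attribute names `index` and `regionRef` are written as string literals on the assumption that they match what `DefaultXmlNames` would provide.

```java
import java.util.OptionalInt;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper, not part of prima-core-libs. */
final class IndexAttributeSketch {

    /** Returns the value of @index if it is present and a valid integer, else empty. */
    static OptionalInt readIndex(Attributes atts) {
        int i = atts.getIndex("index"); // lookup by qualified name; -1 if absent
        if (i < 0) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(atts.getValue(i).trim()));
        } catch (NumberFormatException e) {
            // Assumption: a malformed index is treated like a missing one.
            return OptionalInt.empty();
        }
    }

    /** Builds attributes resembling those of a RegionRefIndexed element (demo only). */
    static Attributes sampleAttributes(String regionRef, String index) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "regionRef", "regionRef", "CDATA", regionRef);
        if (index != null) {
            atts.addAttribute("", "index", "index", "CDATA", index);
        }
        return atts;
    }
}
```

A handler could then pass the parsed value along with the region reference instead of dropping it; what to do when the index is missing or malformed is left open here.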

The group's model class, in turn, does nothing to check incoming indices; it simply appends each member in the order it arrives:

prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Group.java

Lines 193 to 199 in 1f087a4

    
    public void addRegionRef(String id) {
        try {
            members.add(new RegionRef(this, contentFactory.getIdRegister().getId(id)));
        } catch (InvalidIdException e) {
            e.printStackTrace();
        }
    }
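One way a model class could honour the index is to store each member together with its index and hand members out sorted by it. The following is only a sketch under assumptions: `IndexedGroupSketch` and `IndexedRef` are hypothetical stand-ins (plain strings replace the real `RegionRef` and ID register), and keeping XML order for equal indices is my own choice, not something I know the schema to prescribe.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for an ordered group; not the PRImA Group class. */
final class IndexedGroupSketch {

    /** A region reference paired with the @index it was read with. */
    static final class IndexedRef {
        final String regionId;
        final int index;

        IndexedRef(String regionId, int index) {
            this.regionId = regionId;
            this.index = index;
        }
    }

    private final List<IndexedRef> members = new ArrayList<>();

    /** Appends a member in XML order but remembers its index. */
    void addRegionRef(String regionId, int index) {
        members.add(new IndexedRef(regionId, index));
    }

    /** Region IDs in the order they appeared in the XML. */
    List<String> idsInXmlOrder() {
        List<String> ids = new ArrayList<>();
        for (IndexedRef ref : members) {
            ids.add(ref.regionId);
        }
        return ids;
    }

    /** Region IDs sorted by @index; List.sort is stable, so ties keep XML order. */
    List<String> idsInIndexOrder() {
        List<IndexedRef> sorted = new ArrayList<>(members);
        sorted.sort(Comparator.comparingInt((IndexedRef ref) -> ref.index));
        List<String> ids = new ArrayList<>();
        for (IndexedRef ref : sorted) {
            ids.add(ref.regionId);
        }
        return ids;
    }
}
```

With members added as `("r1", 2)`, `("r2", 0)`, `("r3", 1)`, the XML order is r1, r2, r3 while the index order is r2, r3, r1, which is exactly the kind of divergence described in this issue.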

This means that applications like PageViewer or PageConverter will use the XML order instead of the actual order laid out by the schema semantics. That in turn creates a problem for applications like OCR-D: which is the correct representation, the one shown by PageViewer or my strict implementation?

Here's an example of the difference this can make:

PAGE-XML and original image: debug-readingorder.zip
rendered by PageViewer:
rendered by ocrd-segment-extract-pages:

Contrary to what one might suspect at first glance, it is PageViewer that gets the order wrong here, along with the producing tool eynollah (which follows the same model of just looking at the XML order). The two errors compensate for each other.

If my interpretation is wrong, please get back to me soonish for confirmation. (I don't care about the fix so much as clarity on the correct meaning of the standard for implementation in software and adoption in derived specifications like OCR-D.)

If the better place is the PAGE-XML repo, please transfer.

mikegerber · 2021-03-19T18:32:09Z

I would also be very happy to know what PRImA-Research-Lab's view on the index value here is. 😀 I would interpret the schema description in the same way as @bertsky and I, too, think that the implementation in PAGE Viewer is therefore wrong as shown in the example. (In the example, XML order = correct reading order but the index values are essentially random values. These essentially random values should be interpreted as the order if our interpretation of the schema is correct.)

bertsky mentioned this issue Feb 26, 2021

reading order representation (XML order vs index) qurator-spk/eynollah#22

Closed

bertsky mentioned this issue Aug 16, 2021

Order of regions qurator-spk/eynollah#51

Closed

bertsky mentioned this issue Oct 5, 2021

Direction, orientation, and reading order (text direction elements) altoxml/schema#74

Merged
