This repository has been archived by the owner on Nov 18, 2021. It is now read-only.

What tools are available for handling Page-XML? #10

Open
proycon opened this issue Apr 8, 2021 · 7 comments
Labels
question Further information is requested

Comments

@proycon
Member

proycon commented Apr 8, 2021

What tools do we already have available in and around CLARIAH for dealing with Page-XML? This question arose mostly out of the Golden Agents project but I think we may just as well discuss it in the CLARIAH text group here.

We developed one tool in the scope of foliautils:

  • FoLiA-page - Converts Page-XML to FoLiA, incorporates references to the original document and can
    translate existing word items. (designed as a preprocessing step to enable ticcl to work on Page-XML input)

I have one main question:

Do we have a tool that can interpret the coordinate information present in Page-XML and extract the proper 'reading order' for elements? Especially in the context of multi-column layout? (FoLiA-page doesn't do this). I think I heard @rvankoert mentioned in the Golden Agents meeting as possibly having a solution for this?
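To make the question concrete, here is a minimal sketch of the kind of heuristic I have in mind. It is not taken from any existing tool: the `Region` type and `guess_reading_order` function are hypothetical names, and the 50% overlap threshold is an arbitrary assumption. Regions whose bounding boxes overlap horizontally are grouped into a column, columns are read left to right, and regions within a column top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A text region reduced to its bounding box (hypothetical helper type)."""
    region_id: str
    left: int
    top: int
    right: int
    bottom: int


def horizontal_overlap(a: Region, b: Region) -> int:
    """Width in pixels of the horizontal span shared by two regions."""
    return max(0, min(a.right, b.right) - max(a.left, b.left))


def guess_reading_order(regions: list[Region], min_overlap: float = 0.5) -> list[Region]:
    """Group regions into columns, then order columns left-to-right
    and regions within each column top-to-bottom."""
    columns: list[list[Region]] = []
    for region in sorted(regions, key=lambda r: r.left):
        for column in columns:
            anchor = column[0]
            narrowest = min(region.right - region.left, anchor.right - anchor.left)
            # Same column if the regions share enough of the narrower one's width.
            if narrowest > 0 and horizontal_overlap(region, anchor) / narrowest >= min_overlap:
                column.append(region)
                break
        else:
            # No existing column matched, so this region starts a new one.
            columns.append([region])

    ordered: list[Region] = []
    for column in sorted(columns, key=lambda c: min(r.left for r in c)):
        ordered.extend(sorted(column, key=lambda r: r.top))
    return ordered


regions = [
    Region("right_bottom", 610, 550, 990, 900),
    Region("left_bottom", 110, 450, 490, 900),
    Region("right_top", 600, 100, 1000, 500),
    Region("left_top", 100, 100, 500, 400),
]
print([r.region_id for r in guess_reading_order(regions)])
```

A heuristic like this would break on headers that span multiple columns, marginalia, or skewed scans, which is why I am asking whether something more robust already exists.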

Tagging @marijnkoolen, @Gijsjan and @LvanWissen, @menzowindhouwer for respectively the Republic and Golden Agents projects.

@Gijsjan: If I'm not mistaken, Docere visualizes Page XML and the source image right?

@proycon proycon added the question Further information is requested label Apr 8, 2021
@marijnkoolen
Member

For Republic, I have halfway decent generic Python code to read PageXML and parse into some default elements from the physical structure: scan, page, column, textregion, textline, word. The page and column elements are generated by my code, as they're not part of the PageXML spec.
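To give an idea of what such parsing involves, here is a minimal sketch using only the standard library. It is not the Republic code: the function names, the dict-based output, and the sample document are all made up for illustration. It assumes the usual PageXML layout where `TextRegion`, `TextLine` and `Word` each carry a `Coords` element with a `points` attribute and an optional `TextEquiv/Unicode` child, and it ignores nested regions and everything else in the spec.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; real files may use a different schema version.
SAMPLE_PAGEXML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="scan_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <Word id="r1l1w1">
          <Coords points="100,100 300,100 300,200 100,200"/>
          <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

# Which child element type to descend into for each parent type.
NESTING = {"Page": "TextRegion", "TextRegion": "TextLine", "TextLine": "Word"}


def local_name(element: ET.Element) -> str:
    """Tag name without the namespace, so any schema version is accepted."""
    return element.tag.rsplit("}", 1)[-1]


def children(element: ET.Element, name: str) -> list[ET.Element]:
    """Direct children with the given local tag name."""
    return [child for child in element if local_name(child) == name]


def parse_points(points: str) -> list[tuple[int, ...]]:
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def own_coords(element: ET.Element) -> list[tuple[int, ...]]:
    for coords in children(element, "Coords"):
        return parse_points(coords.get("points", ""))
    return []


def own_text(element: ET.Element) -> str:
    """Text of the element's own TextEquiv, not that of its descendants."""
    for equiv in children(element, "TextEquiv"):
        for unicode_el in children(equiv, "Unicode"):
            return unicode_el.text or ""
    return ""


def parse_element(element: ET.Element) -> dict:
    child_kind = NESTING.get(local_name(element))
    return {
        "type": local_name(element).lower(),
        "id": element.get("id"),
        "coords": own_coords(element),
        "text": own_text(element),
        "children": [parse_element(c) for c in children(element, child_kind)] if child_kind else [],
    }


def parse_pagexml(xml_text: str) -> dict:
    """Parse a PageXML string into nested dicts: scan > textregion > textline > word."""
    root = ET.fromstring(xml_text)
    page = children(root, "Page")[0]
    scan = parse_element(page)
    # The PageXML Page element describes the whole image, so call it a scan.
    scan["type"] = "scan"
    scan["id"] = page.get("imageFilename")
    return scan


scan = parse_pagexml(SAMPLE_PAGEXML)
print(scan["children"][0]["children"][0]["text"])
```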

Strictly speaking, page is part of PageXML, but there it refers to the whole scan, which can be more or less than a single physical page. So I distinguish between scan (the whole image) and page (some region of the image that should correspond to a single page). Of course, which region corresponds to a page is image-dependent and project-specific and cannot be determined generically.

I think PageXML has some way to express reading order, but this is also pretty much image-dependent and strongly differs per scan and can be difficult to determine. So whether the reading order makes sense depends on whether this has been explicitly made part of the document model, or has been trained on or some such.
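As far as I know, the mechanism is a `ReadingOrder` element under `Page`, holding an `OrderedGroup` of `RegionRefIndexed` entries that point at region ids. A minimal sketch for reading it out, assuming a single flat `OrderedGroup` (nested or unordered groups are ignored here); the function name and the sample document are made up:

```python
import xml.etree.ElementTree as ET

# Made-up sample in which document order differs from the declared reading order.
SAMPLE_WITH_ORDER = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="scan_0001.jpg" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="r_right"/>
        <RegionRefIndexed index="0" regionRef="r_left"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_right">
      <Coords points="1100,100 1900,100 1900,2900 1100,2900"/>
    </TextRegion>
    <TextRegion id="r_left">
      <Coords points="100,100 900,100 900,2900 100,2900"/>
    </TextRegion>
  </Page>
</PcGts>
"""


def tag_name(element: ET.Element) -> str:
    """Tag name without the namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def explicit_reading_order(xml_text: str) -> list[str]:
    """Return region ids in declared reading order, or [] if none is declared."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if tag_name(element) == "OrderedGroup":
            refs = [c for c in element if tag_name(c) == "RegionRefIndexed"]
            refs.sort(key=lambda ref: int(ref.get("index", "0")))
            return [ref.get("regionRef", "") for ref in refs]
    return []


print(explicit_reading_order(SAMPLE_WITH_ORDER))
```

An empty result means the file declares no order, and then you are back to guessing from coordinates.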

I also have terrible Python code that I'm currently updating for modelling elements from the logical structure (e.g. chapters, sections, paragraphs, tables, or in the case of Republic: resolutions, attendance lists, index entries, etc.). Logical elements can also be hierarchical, and at each level, they can have elements from the physical structure, so there is a correspondence between logical elements and the PageXML and the image coordinates.

But the logical structure is very project-specific so I doubt there is much that can be made generic.

@proycon
Member Author

proycon commented Apr 9, 2021

@marijnkoolen Thanks! I suppose most of the code you mention is in https://github.com/HuygensING/republic-project ? I'll have a browse around there. Even though some things may be project-specific, it might still be useful or worth expanding upon for Golden Agents.

@rvankoert

rvankoert commented Apr 9, 2021 via email

@marijnkoolen
Member

marijnkoolen commented Apr 11, 2021

> @marijnkoolen Thanks! I suppose most of the code you mention is in https://github.com/HuygensING/republic-project ? I'll have a browse around there. Even though some things may be project-specific, it might still be useful or worth expanding upon for Golden Agents.

@proycon Please wait a few days before having a look. I'm currently rewriting the PageXML parsing bit and have created new classes for those generic elements, but still have to push my code. Once done, I'll let you know where you can find that.

@proycon
Member Author

proycon commented Apr 11, 2021

@marijnkoolen Ok! Thanks!

@marijnkoolen
Member

Okay @proycon, I've pushed some updates. A relatively generic PageXML parser (though it only assumes the elements and properties that are used in the Republic PageXML output, not the full PageXML spec) can be found here: https://github.com/HuygensING/republic-project/blob/master/republic/parser/pagexml/generic_pagexml_parser.py

The parser returns a PageXMLScan object that consists of smaller objects (PageXMLTextRegion, PageXMLTextLine and PageXMLWord), which are part of the physical document structure (and inherit from the PhysicalStructureDoc class) and therefore have coordinates in the scan. There's also a generic class for the logical structure, which can contain elements from the physical structure but itself has no direct connection to the scan.
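A stripped-down illustration of that idea, not the actual classes from the repo: `PhysicalElement` and `LogicalElement` are simplified, hypothetical stand-ins. The point it shows is that a logical element stores no coordinates of its own and can only derive its location from the physical elements it contains.

```python
from dataclasses import dataclass, field
from typing import Optional

Point = tuple[int, int]


@dataclass
class PhysicalElement:
    """Something with a location in the scan, e.g. a text region, line or word."""
    element_id: str
    kind: str
    coords: list[Point]
    text: str = ""
    children: list["PhysicalElement"] = field(default_factory=list)


@dataclass
class LogicalElement:
    """Something like a paragraph or resolution; it has no coordinates itself."""
    element_id: str
    kind: str
    physical: list[PhysicalElement] = field(default_factory=list)
    children: list["LogicalElement"] = field(default_factory=list)

    def all_physical(self) -> list[PhysicalElement]:
        """Physical elements attached here or to any nested logical element."""
        found = list(self.physical)
        for child in self.children:
            found.extend(child.all_physical())
        return found

    def bounding_box(self) -> Optional[tuple[int, int, int, int]]:
        """(left, top, right, bottom) derived from the attached physical elements."""
        points = [p for element in self.all_physical() for p in element.coords]
        if not points:
            return None
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        return (min(xs), min(ys), max(xs), max(ys))


line1 = PhysicalElement("l1", "textline", [(100, 100), (900, 100), (900, 200), (100, 200)], "First line")
line2 = PhysicalElement("l2", "textline", [(100, 220), (850, 220), (850, 320), (100, 320)], "second line")
paragraph = LogicalElement("p1", "paragraph", physical=[line1, line2])
section = LogicalElement("s1", "section", children=[paragraph])
print(section.bounding_box())
```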

The document models are here: https://github.com/HuygensING/republic-project/blob/master/republic/model/physical_document_model.py

No doubt there's still a lot of Republic-specific stuff going on, but I think the most important part is the distinction between physical and logical structures and how they map onto each other. Doing that right saves a lot of headache later on.

Anyway, I hope it can be of some use. If you think it's worth reusing, I should probably turn that part of the code into its own repo.

@proycon
Member Author

proycon commented Apr 19, 2021

Thanks for the sources! I'll have to take a deeper look still, but it will hopefully prevent doing unnecessary duplicate work!
