What tools are available for handling Page-XML? #10
Comments
For Republic, I have halfway decent generic Python code to read PageXML and parse into some default elements from the physical structure: Actually, I think PageXML has some way to express reading order, but this is also pretty much image-dependent and strongly differs per scan and can be difficult to determine. So whether the reading order makes sense depends on whether this has been explicitly made part of the document model, or has been trained on or some such. I also have terrible Python code that I'm currently updating for modelling elements from the logical structure (e.g. chapters, sections, paragraphs, tables, or in the case of Republic: resolutions, attendance lists, index entries, etc.). Logical elements can also be hierarchical, and at each level, they can have elements from the physical structure, so there is a correspondence between logical elements and the PageXML and the image coordinates. But the logical structure is very project-specific so I doubt there is much that can be made generic. |
@marijnkoolen Thanks! I suppose most of the code you mention is in https://github.com/HuygensING/republic-project ? I'll have a browse around there. Even though some things may be project-specific, it might still be useful or worth expanding upon for Golden Agents. |
Determining reading order in general is not an easy task and depends highly
on the specific documents. In Republic it's a fairly straightforward
layout most of the time. For other projects I also deal with more complex
layouts, ranging from circular layouts to newspapers to annotations on
annotations on annotations on annotations. There is some tooling available,
but most of it is in the experimental phase or can only deal with simple layouts.
At the ICDAR conference it is still a topic of active research.
Reading order can be set in PageXML. For wp2 we will create some stuff that
does basic reading order detection.
Best,
Rutger
|
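To illustrate the point that reading order can be set in PageXML: the PAGE schema has a `ReadingOrder` element whose `RegionRefIndexed` children point at region ids. The snippet below is a minimal sketch using only the Python standard library, not code from any of the projects mentioned here. The sample document and the function name `explicit_reading_order` are made up for illustration, and the sketch assumes a single flat `OrderedGroup`; nested or unordered groups would need more care.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema; other schema versions use another date.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical two-column page, with regions deliberately listed out of order.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="scan.jpg" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="r_right"/>
        <RegionRefIndexed index="0" regionRef="r_left"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_right">
      <Coords points="1100,100 1900,100 1900,2900 1100,2900"/>
    </TextRegion>
    <TextRegion id="r_left">
      <Coords points="100,100 900,100 900,2900 100,2900"/>
    </TextRegion>
  </Page>
</PcGts>"""


def explicit_reading_order(root):
    """Return region ids in the order stated by ReadingOrder, or None if absent."""
    # Take the namespace from the root tag so any PAGE schema version works.
    namespace = root.tag[1:root.tag.index("}")]
    reading_order = root.find(f".//{{{namespace}}}ReadingOrder")
    if reading_order is None:
        return None
    refs = list(reading_order.iter(f"{{{namespace}}}RegionRefIndexed"))
    # Assumption: one flat OrderedGroup, so sorting on index is sufficient.
    refs.sort(key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in refs]


if __name__ == "__main__":
    print(explicit_reading_order(ET.fromstring(SAMPLE)))
```

Whether the stated order is meaningful still depends on how the PageXML was produced, as noted above; the code only reads what is there.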
@proycon Please wait a few days before having a look. I'm currently rewriting the PageXML parsing bit and have created new classes for those generic elements, but still have to push my code. Once done, I'll let you know where you can find that. |
@marijnkoolen Ok! Thanks! |
Okay @proycon, I've pushed some updates. A relatively generic PageXML parser (though it only assumes elements and properties that are used in the Republic PageXML output, not the full PageXML spec) can be found here: https://github.com/HuygensING/republic-project/blob/master/republic/parser/pagexml/generic_pagexml_parser.py The parser returns a PageXMLScan object that consists of smaller objects (PageXMLTextRegion, PageXMLTextLine and PageXMLWord), which are part of the physical document structure (and inherit from the PhysicalStructureDoc class) and therefore have coordinates in the scan. There's also a generic class for the logical structure, which can contain elements from the physical structure but itself has no direct connection to the scan. The document models are here: https://github.com/HuygensING/republic-project/blob/master/republic/model/physical_document_model.py No doubt there's still a lot of Republic-specific stuff going on, but I think the most important part is the distinction between physical and logical structures and how they map onto each other. Doing that right saves a lot of headache later on. Anyway, I hope it can be of some use. If you think it's worth reusing, I should probably turn that part of the code into its own repo. |
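The physical/logical distinction described above can be sketched roughly as follows. This is not the republic-project code: the class names `PhysicalDoc` and `LogicalDoc`, their fields, and the example text are hypothetical and only illustrate how a logical element without its own coordinates can still be traced back to the scan through the physical elements it contains.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhysicalDoc:
    """Hypothetical physical element (region, line or word) located on the scan."""
    id: str
    coords: Tuple[int, int, int, int]  # left, top, right, bottom in image pixels
    text: str = ""


@dataclass
class LogicalDoc:
    """Hypothetical logical element (e.g. a resolution or paragraph).

    It has no coordinates of its own; it only refers to physical elements
    and, optionally, to nested logical elements.
    """
    id: str
    doc_type: str
    physical: List[PhysicalDoc] = field(default_factory=list)
    children: List["LogicalDoc"] = field(default_factory=list)

    def scan_coords(self) -> List[Tuple[int, int, int, int]]:
        """Collect the coordinates of all physical elements at every level."""
        coords = [element.coords for element in self.physical]
        for child in self.children:
            coords.extend(child.scan_coords())
        return coords

    def full_text(self) -> str:
        """Join the text of own physical elements, then of nested children."""
        parts = [element.text for element in self.physical]
        parts.extend(child.full_text() for child in self.children)
        return " ".join(part for part in parts if part)


if __name__ == "__main__":
    line1 = PhysicalDoc("l1", (100, 100, 900, 150), "Resolved that")
    line2 = PhysicalDoc("l2", (100, 160, 900, 210), "the matter be postponed.")
    paragraph = LogicalDoc("p1", "paragraph", physical=[line1, line2])
    resolution = LogicalDoc("res1", "resolution", children=[paragraph])
    print(resolution.full_text())
    print(resolution.scan_coords())
```

Keeping the coordinates only on the physical side means the logical hierarchy can be rebuilt or changed per project without touching the link to the image.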
Thanks for the sources! I'll have to take a deeper look still, but it will hopefully prevent doing unnecessary duplicate work! |
What tools do we already have available in and around CLARIAH for dealing with Page-XML? This question arose mostly out of the Golden Agents project but I think we may just as well discuss it in the CLARIAH text group here.
We developed one tool in the scope of foliautils:
translate existing word items. (designed as a preprocessing step to enable ticcl to work on Page-XML input)
I have one main question:
Do we have a tool that can interpret the coordinate information present in Page-XML and extract the proper 'reading order' for elements? Especially in the context of multi-column layout? (FoLiA-page doesn't do this). I think I heard @rvankoert mentioned in the Golden Agents meeting as possibly having a solution for this?
Tagging @marijnkoolen, @Gijsjan and @LvanWissen, @menzowindhouwer for respectively the Republic and Golden Agents projects.
@Gijsjan: If I'm not mistaken, Docere visualizes Page-XML and the source image, right?
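To make the multi-column question above concrete, here is a naive baseline for deriving a reading order from coordinates alone. It is a sketch under strong assumptions, not an existing tool: regions are reduced to bounding boxes, boxes that overlap horizontally are treated as one column, columns are read left to right, and regions within a column top to bottom. The names `Box`, `bounding_box` and `naive_reading_order` are made up for this example, and the input uses the `points` string format of PageXML `Coords`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    id: str
    left: int
    top: int
    right: int
    bottom: int


def bounding_box(region_id: str, points: str) -> Box:
    """Turn a PageXML-style points string ("x1,y1 x2,y2 ...") into a bounding box."""
    xs, ys = zip(*(map(int, point.split(",")) for point in points.split()))
    return Box(region_id, min(xs), min(ys), max(xs), max(ys))


def naive_reading_order(boxes: List[Box]) -> List[str]:
    """Group boxes into columns by horizontal overlap, then read column by column."""
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.left):
        for column in columns:
            col_left = min(b.left for b in column)
            col_right = max(b.right for b in column)
            overlap = min(box.right, col_right) - max(box.left, col_left)
            # Assumption: a box belongs to a column if more than half its width overlaps.
            if overlap > 0.5 * (box.right - box.left):
                column.append(box)
                break
        else:
            columns.append([box])
    columns.sort(key=lambda col: min(b.left for b in col))
    return [b.id for col in columns for b in sorted(col, key=lambda b: b.top)]


if __name__ == "__main__":
    regions = [
        bounding_box("right_bottom", "1100,1500 1900,1500 1900,2900 1100,2900"),
        bounding_box("left_bottom", "120,1500 900,1500 900,2900 120,2900"),
        bounding_box("right_top", "1100,100 1900,100 1900,1400 1100,1400"),
        bounding_box("left_top", "100,100 900,100 900,1400 100,1400"),
    ]
    print(naive_reading_order(regions))
```

A heuristic like this breaks down on full-width headers, marginalia, skewed scans or anything more complex than clean columns, which is exactly where the harder layouts mentioned in this thread come in.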