Missing page (alto) from article not flagged #31

Open
LydiaFrance opened this issue May 10, 2022 · 0 comments

LydiaFrance commented May 10, 2022

To create the _art0001.txt file, the text block IDs are defined in the METS file as follows:

<mets:smLinkGrp>
    <mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
    <mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
    <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
</mets:smLinkGrp>

Here, #art0001 identifies the created .txt file (PUBID_YYYYMMDD_art0001.txt).
#pa0002001 is the paragraph ID of the text block within the source .xml file (PUBID_YYYYMMDD_0002.xml). The 0002 in the paragraph ID refers to the xml file number.
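
A minimal sketch of the ID-to-file mapping described above. It assumes the ID layout is `pa` + 4-digit page number + 3-digit area number, which is inferred only from the examples in this issue; `page_file_for_block` and its `prefix` argument are hypothetical names, not part of the project's code.

```python
import re

# Assumed ID layout (inferred from the examples in this issue, not confirmed):
# "pa" + 4-digit page number + 3-digit area number, e.g. pa0002001.
BLOCK_ID = re.compile(r"^#?pa(\d{4})(\d{3})$")


def page_file_for_block(block_id: str, prefix: str) -> str:
    """Return the name of the page .xml file that a text block ID points to."""
    match = BLOCK_ID.match(block_id)
    if match is None:
        raise ValueError(f"unrecognised text block ID: {block_id!r}")
    page_number = match.group(1)  # e.g. "0002"
    return f"{prefix}_{page_number}.xml"


print(page_file_for_block("#pa0002001", "PUBID_YYYYMMDD"))
# prints PUBID_YYYYMMDD_0002.xml
```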

If ..._0002.xml can't be read or does not exist, art0001.txt will be created empty with no obvious warning (it is potentially in the log, though).
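
One possible way to make this failure visible is an explicit check on each page file before the article text is written. The sketch below is an assumption about how such a check could look, not the project's current behaviour; `page_is_readable` and the logger name `alto_check` are hypothetical.

```python
import logging
import xml.etree.ElementTree as ET
from pathlib import Path

logger = logging.getLogger("alto_check")  # hypothetical logger name


def page_is_readable(page_path: Path) -> bool:
    """Return True if the page file exists and parses as XML, else warn."""
    if not page_path.is_file():
        logger.warning("Page file missing: %s", page_path)
        return False
    try:
        ET.parse(str(page_path))
    except (ET.ParseError, OSError) as err:
        logger.warning("Page file unreadable: %s (%s)", page_path, err)
        return False
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Placeholder file name from this issue; a warning is logged if it is absent.
    page_is_readable(Path("PUBID_YYYYMMDD_0002.xml"))
```

A caller could then skip or flag any article whose pages fail this check, instead of silently writing an empty .txt file.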


Extended example with a hypothetical situation where an article crosses two pages:

<mets:smLinkGrp>
    <mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
    <mets:smLocatorLink xlink:href="#pa0001041" xlink:label="page1 area41" xlink:type="locator"/>
    <mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
    <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page1 area41" ARCTYPE="logicalphysical"/>
    <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
</mets:smLinkGrp>

In this made-up example, an article spans two physical pages and therefore two source xml files. (This scenario may not actually happen; I don't know if articles can be defined this way.) If ..._0002.xml does not exist or can't be read correctly, art0001.txt is still created, but the text expected from that page is missing, with no clear indication that this has happened.
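
A rough sketch of how the link groups could be cross-checked against the pages that were actually read, so that a partially missing article is reported. The wrapper elements and namespace declarations in `SAMPLE` are added only to make the fragment parseable; the function names and the `pa` + 4-digit page + 3-digit area ID layout are assumptions drawn from the examples in this issue.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
HREF = "{http://www.w3.org/1999/xlink}href"

# The two-page example from this issue, wrapped so it parses on its own.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structLink>
    <mets:smLinkGrp>
      <mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
      <mets:smLocatorLink xlink:href="#pa0001041" xlink:label="page1 area41" xlink:type="locator"/>
      <mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
      <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page1 area41" ARCTYPE="logicalphysical"/>
      <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
    </mets:smLinkGrp>
  </mets:structLink>
</mets:mets>
"""


def pages_per_article(mets_xml: str) -> dict[str, set[str]]:
    """Map each article ID to the page numbers its text blocks live on."""
    root = ET.fromstring(mets_xml)
    result: dict[str, set[str]] = {}
    for group in root.iter("{http://www.loc.gov/METS/}smLinkGrp"):
        hrefs = [loc.get(HREF, "") for loc in group.findall("mets:smLocatorLink", NS)]
        articles = [h.lstrip("#") for h in hrefs if h.startswith("#art")]
        # Assumed ID layout: "pa" + 4-digit page + 3-digit area.
        pages = {m.group(1) for h in hrefs if (m := re.match(r"#pa(\d{4})\d{3}$", h))}
        for article in articles:
            result.setdefault(article, set()).update(pages)
    return result


def articles_missing_pages(mets_xml: str, available_pages: set[str]) -> dict[str, list[str]]:
    """Return articles that reference pages not in available_pages."""
    missing = {}
    for article, pages in pages_per_article(mets_xml).items():
        absent = sorted(pages - available_pages)
        if absent:
            missing[article] = absent
    return missing


# Pretend only page 0001 could be read.
print(articles_missing_pages(SAMPLE, {"0001"}))
# prints {'art0001': ['0002']}
```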

@LydiaFrance LydiaFrance added the bug Something isn't working label May 10, 2022
@LydiaFrance LydiaFrance self-assigned this May 10, 2022
@LydiaFrance LydiaFrance added this to TODO for public release in Release May 26, 2022
@andrewphilipsmith andrewphilipsmith added this to the v0.5 milestone Jun 30, 2022