Add page number to metadata output #73

fedenanni · 2023-02-22T08:24:54Z

Hi all,
@kmcdono2 mentioned that it would be very useful to have the page number added to the metadata information that alto2txt keeps (so that this would be included in the db as well). Not sure about the complexity of it but I'm opening the issue so this is captured

griff-rees · 2023-02-22T12:57:13Z

Hey @fedenanni would you like to open a ticket for that in the db repo as well to coordinate? Some sort of laundry list (and sorry if some of this is already included in alto2txt...) like:

griff-rees · 2023-02-22T14:41:08Z

@fedenanni and @kmcdono2: do you mean

the number of pages in the Item
the page number the Item begins on
More comprehensive: the list of pages an Item covers (hence my page_numbers above)? I say list rather than range in case an Item is on the cover and page 4 (but none in between) etc.
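
The three readings above can be sketched as a single stored list with two derived values. This is only an illustration: the `Item` class below is a hypothetical stand-in, not the real database model, and treating the lowest page number as the page the Item "begins on" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for the database Item; not the real model."""

    item_id: str
    # Every page the Item touches, kept as a list rather than a range so that
    # non-contiguous layouts (e.g. cover and page 4) are representable.
    page_numbers: List[int] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        """Reading 1: the number of distinct pages in the Item."""
        return len(set(self.page_numbers))

    @property
    def first_page(self) -> Optional[int]:
        """Reading 2: the page the Item begins on (assumed: lowest number)."""
        return min(self.page_numbers) if self.page_numbers else None


item = Item(item_id="art0012", page_numbers=[1, 4])
print(item.page_count, item.first_page)  # 2 1
```

Storing only the list keeps the other two values derivable, so they cannot drift out of sync with it.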

kmcdono2 · 2023-02-22T15:11:56Z

As a priority, we would like the page number the Entry begins on.

If it is easy to do, then yes a list of pages an Entry covers would be nice, but it's not a necessity.

griff-rees · 2023-02-22T15:16:05Z

Cool. It's not too hard to add those options on Item model in the database but I'm much less familiar with how that relates to what currently comes out of alto2txt.

griff-rees · 2023-02-22T15:17:54Z

And apologies @kmcdono2 and @fedenanni: they're Items not Entry (ies). My mistake.

kmcdono2 · 2023-02-22T15:19:52Z

> Cool. It's not too hard to add those options on Item model in the database but I'm much less familiar with how that relates to what currently comes out of alto2txt.

Yes, we understand that the newspaper metadata captures this in a slightly awkward way. It's been more than a year since I've seen the raw metadata, but others will know more who have been working with it.

dcsw2 · 2023-03-02T16:35:40Z

Hey @griff-rees, this is great. Just to be clear: the reason for wanting this is to allow us to access the original page images, which is a vital step in our research workflow. One method for doing this would be to use the BNA's public web interface and infer a URL from the DB metadata that links directly to the actual page image. Without the page numbers you can't do this. If there's a way to get actual polygon co-ordinates for an article, that would be even better. Page numbers are also useful for citation, but not essential as a finding aid, for which there could be even better methods. I'm adding @npedrazzini to this thread to complete the circle!

griff-rees · 2023-03-03T09:12:15Z

Hey @dcsw2 and @npedrazzini: yeah actual page images was my long term hope (@DavidBeavan this ring any bells from our recent chat?) but didn't know some of that was already public. Very up for a chat about this.

For clarity: by "polygon coordinates" do you mean the page coordinates an article inhabits? The only other thought that came to mind is geographic coordinates of the location that the article refers to (or of where it was printed). Also: I'm guessing by BNA you mean British National Archives?

kmcdono2 · 2023-03-03T09:41:10Z

@griff-rees - By page images, we mean what is available via the BNA (British Newspaper Archive) interface (https://www.britishnewspaperarchive.co.uk/), hence the effort to reconstruct the URLs for that website. We know that access to images otherwise is far too complex to attempt :)

Just want to clarify @dcsw2's point above - we do care about page numbers as a unit of analysis. E.g. we want to know how many toponyms appear on page 1 vs later pages of an issue.
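
As a rough illustration of page number as a unit of analysis, the sketch below counts toponym mentions on page 1 against all later pages. The records, their layout, and the function name are invented for the example; real data would come from the database once page numbers are available.

```python
from collections import Counter

# Hypothetical records: (item id, page the item begins on, toponyms found in it).
records = [
    ("art0001", 1, ["London", "Paris"]),
    ("art0002", 1, ["Leeds"]),
    ("art0012", 2, ["London"]),
    ("art0030", 4, ["Bristol", "Bath"]),
]


def toponyms_front_vs_later(rows):
    """Count toponym mentions on page 1 versus all later pages."""
    counts = Counter()
    for _item_id, page, toponyms in rows:
        bucket = "page_1" if page == 1 else "later_pages"
        counts[bucket] += len(toponyms)
    return counts


counts = toponyms_front_vs_later(records)
print(counts["page_1"], counts["later_pages"])  # 3 3
```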

Yes @griff-rees, "polygon coordinates" will refer to the layout location of the article in the page, not the shapes of the toponyms on the page.

dcsw2 · 2023-03-03T09:53:10Z

Yes, as Katie says: I meant co-ordinates on the page – this was purely a greedy UX request as it would be great to snap straight to the relevant article on the page, as they can often take a while to find. Very much not essential; the page number is key though.

npedrazzini · 2023-03-03T16:30:28Z

Hi all, mapping article number with page number is relatively simple - I'm wondering what the easiest way would be for me to potentially share the mapping at article level? A big CSV? @griff-rees, from the PoV of the DB, what would work best?
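
One possible shape for such a CSV, sketched with the standard library only. The column names `item_id` and `page_number`, the identifiers, and both helper functions are assumptions made for illustration; nothing here is an agreed format.

```python
import csv
import io

# Hypothetical mapping rows; identifiers and column names are assumptions.
rows = [
    {"item_id": "0002647_18240217_art0012", "page_number": 2},
    {"item_id": "0002647_18240217_art0013", "page_number": 3},
]


def write_mapping(mapping_rows, handle):
    """Write one line per article with the page it starts on."""
    writer = csv.DictWriter(handle, fieldnames=["item_id", "page_number"])
    writer.writeheader()
    writer.writerows(mapping_rows)


def read_mapping(handle):
    """Load the CSV back into an {item_id: page_number} dictionary."""
    return {row["item_id"]: int(row["page_number"]) for row in csv.DictReader(handle)}


buffer = io.StringIO()  # stands in for a file on disk
write_mapping(rows, buffer)
buffer.seek(0)
mapping = read_mapping(buffer)
print(mapping["0002647_18240217_art0012"])  # 2
```

A flat two-column file like this is easy to bulk-load, but it can only hold one page per article; a list of pages would need either repeated rows or a delimiter inside the cell.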

Also, it just so happens that I have a script that, given an article, takes its coordinates and highlights them on the image. We could adapt that if it's what you're looking for?

kmcdono2 · 2023-03-03T16:42:33Z

@npedrazzini I think that's exactly what @dcsw2 is thinking of - we want to quickly see what content on the page is related to the data we have.

kallewesterling · 2023-03-24T16:46:50Z

In conversation with @kmcdono2 and @dcsw2, it became apparent that this is a crucial thing for the metadata db to include (which then also relates to the output from alto2txt as we're using that for ingestion into the db). I think it shouldn't be an add-on script (like you suggest @npedrazzini) but something that we include in the minimal metadata output from alto2txt if possible.

As for your request for polygon, @dcsw2, that is a different feature request, which I think deserves its own issue.

How can we make page numbers a reality as it relates to fairly soon-to-come research output, relating to the toponym research? Let's chat more.

kallewesterling · 2023-04-06T14:51:47Z

Recent chat between myself and @DavidBeavan (for documentation here):

D: Have you managed to locate where the info on page number is in the source mets/alto xml?

K: The problem is that it’s in the related ALTO files.. so it’s per item, and the XSLT needs to pull it out of the related document… I believe this is the bit where it gets the document:

    <xsl:for-each select="mets:fileSec//mets:fileGrp[@USE='Fulltext']/mets:file">
        <doc>
          <xsl:attribute name="ID"><xsl:value-of select="@ID" /></xsl:attribute>
          <xsl:variable name="fileloc2"><xsl:value-of select="$input_path" />/<xsl:value-of select="mets:FLocat/@xlink:href" /></xsl:variable>
          <xsl:copy-of select="document($fileloc2)" />
        </doc>
    </xsl:for-each>

Then, I should be able to do something like this (this is for the alto_namespace property in the sort-of “header” of the resulting file):

<alto_namespace><xsl:value-of select="$page_docs/doc[1]/alto/@xsi:noNamespaceSchemaLocation" /></alto_namespace>

You can see that it accesses the copied alto tag here.. So I thought this would work:

<page><xsl:value-of select="$page_docs/doc[1]/alto/Layout/Page/@ID" /></page>

But it didn’t…

D: You’re on the right track . . . Two places I can see page ID, what you propose, which I think gives a P1, P2 etc as their value. page ID may also be in the mets as ORDER="1" "2", "3" etc. Do you concur?
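
To make the METS route concrete, here is a minimal Python sketch (outside the XSLT) that reads the `ORDER` attribute from the parent of each pagearea `div`. The sample document is a hypothetical, heavily trimmed structMap with made-up IDs, and `order_by_pagearea` is an illustrative helper, not part of alto2txt.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily trimmed structMap; real files carry far more detail.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="page" ORDER="1">
      <mets:div TYPE="pagearea" ID="pa0001001"/>
    </mets:div>
    <mets:div TYPE="page" ORDER="2">
      <mets:div TYPE="pagearea" ID="pa0002001"/>
      <mets:div TYPE="pagearea" ID="pa0002002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def order_by_pagearea(xml_text):
    """Map each pagearea ID to the ORDER value of its parent div."""
    root = ET.fromstring(xml_text)
    orders = {}
    # ElementTree has no parent pointers, so walk from the parents down.
    for parent in root.iterfind(".//mets:div[@ORDER]", METS_NS):
        for child in parent.findall("mets:div[@TYPE='pagearea']", METS_NS):
            orders[child.get("ID")] = int(parent.get("ORDER"))
    return orders


orders = order_by_pagearea(SAMPLE)
print(orders["pa0002002"])  # 2
```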

I have a feeling that mets might be easier to get to, because we’re parsing those pageareas to reach out the alto for their textual contents already.

K: Maybe, if we preserve that information in alto2txt output, we should call the tag mets_order or something instead of page?

D: There’s value in that, not asserting something that isn’t explicitly page. Although I bet it is page . . . Here's the diff:

--- a/src/alto2txt/xslts/extract_text_mets18.xslt
+++ b/src/alto2txt/xslts/extract_text_mets18.xslt
@@ -59,6 +59,21 @@

         <xsl:variable name="item_page_areas" select="exsl:node-set($item_page_areas_rt)" />

+        <xsl:variable name="item_mets_order_rt">
+          <xsl:for-each select="key('smLocatorLink_href', $item_ID_hash)/../mets:smArcLink/@xlink:to">
+            <xsl:variable name="pagearea" select="key('smLocatorLink_label', .)/@xlink:href" />
+            <xsl:variable name="pagearea_unhash" select="substring($pagearea, 2)" />
+            <xsl:variable name="key_out" select="key('structMap', $pagearea_unhash)" />
+            <xsl:for-each select="$key_out[@TYPE='pagearea']|$key_out/mets:div[@TYPE='pagearea']">
+              <xsl:if test="mets:fptr/mets:area[@BETYPE='IDREF']">
+                <mets_order><xsl:value-of select="../@ORDER" /></mets_order>
+              </xsl:if>
+            </xsl:for-each>
+          </xsl:for-each>
+        </xsl:variable>
+
+        <xsl:variable name="item_mets_order" select="exsl:node-set($item_mets_order_rt)" />
+
         <exsl:document method="text" href="{$output_path}_{$item_ID}.txt">
           <xsl:choose>
             <xsl:when test="$item_page_areas//String|$item_page_areas//HYP">
@@ -127,6 +142,7 @@
                 <date><xsl:value-of select="/mets:mets/mets:dmdSec[@ID=$issue_DMDID]//mods:dateIssued" /></date>
                 <item>
                   <xsl:attribute name="id"><xsl:value-of select="$item_ID" /></xsl:attribute>
+                  <mets_order><xsl:copy-of select="$item_mets_order" /></mets_order>
                   <plain_text_file><xsl:value-of select="$output_document_stub" />_<xsl:value-of select="$item_ID" />.txt</plain_text_file>
                   <title><xsl:value-of select="/mets:mets/mets:dmdSec[@ID=$item_DMDID]//mods:title" /></title>
                   <item_type><xsl:value-of select="@TYPE" /></item_type>

That gives us this:

    <issue id="1824-02-17">
        <date>1824-02-17</date>
        <item id="art0012">
        <mets_order>
            <mets_order ORDER="2"/>
            <mets_order ORDER="2"/>
        </mets_order>
        <plain_text_file>0002647_18240217_art0012.txt</plain_text_file>

Still got to:

reduce that list to unique numbers,
change the element names etc to something that better fits and
work on a good multi-page test case and roll it out to mets13 and other xslts

K: Perhaps this is something we can do on the hack day that we’re planning? It’d facilitate some good handover to the database work

DavidBeavan · 2023-04-06T15:03:54Z

I can do one better to produce

<item id="art0012">
  <mets_orders>
    <mets_order>2</mets_order>
    <mets_order>2</mets_order>
  </mets_orders>
  <plain_text_file>0002647_18240217_art0012.txt</plain_text_file>

--- a/src/alto2txt/xslts/extract_text_mets18.xslt
+++ b/src/alto2txt/xslts/extract_text_mets18.xslt
@@ -59,6 +59,21 @@

         <xsl:variable name="item_page_areas" select="exsl:node-set($item_page_areas_rt)" />

+        <xsl:variable name="item_mets_order_rt">
+          <xsl:for-each select="key('smLocatorLink_href', $item_ID_hash)/../mets:smArcLink/@xlink:to">
+            <xsl:variable name="pagearea" select="key('smLocatorLink_label', .)/@xlink:href" />
+            <xsl:variable name="pagearea_unhash" select="substring($pagearea, 2)" />
+            <xsl:variable name="key_out" select="key('structMap', $pagearea_unhash)" />
+            <xsl:for-each select="$key_out[@TYPE='pagearea']|$key_out/mets:div[@TYPE='pagearea']">
+              <xsl:if test="mets:fptr/mets:area[@BETYPE='IDREF']">
+                <mets_order><xsl:value-of select="../@ORDER" /></mets_order>
+              </xsl:if>
+            </xsl:for-each>
+          </xsl:for-each>
+        </xsl:variable>
+
+        <xsl:variable name="item_mets_order" select="exsl:node-set($item_mets_order_rt)" />
+
         <exsl:document method="text" href="{$output_path}_{$item_ID}.txt">
           <xsl:choose>
             <xsl:when test="$item_page_areas//String|$item_page_areas//HYP">
@@ -127,6 +142,7 @@
                 <date><xsl:value-of select="/mets:mets/mets:dmdSec[@ID=$issue_DMDID]//mods:dateIssued" /></date>
                 <item>
                   <xsl:attribute name="id"><xsl:value-of select="$item_ID" /></xsl:attribute>
+                  <mets_orders><xsl:copy-of select="$item_mets_order" /></mets_orders>
                   <plain_text_file><xsl:value-of select="$output_document_stub" />_<xsl:value-of select="$item_ID" />.txt</plain_text_file>
                   <title><xsl:value-of select="/mets:mets/mets:dmdSec[@ID=$item_DMDID]//mods:title" /></title>
                   <item_type><xsl:value-of select="@TYPE" /></item_type>
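
The remaining step of reducing the repeated values to unique numbers could also be done downstream of the XSLT. The sketch below assumes the `<mets_orders>` shape shown in this comment, with the `<item>` element closed so the fragment parses; `unique_pages` is a hypothetical helper, not alto2txt code.

```python
import xml.etree.ElementTree as ET

# Fragment in the shape shown above, closed so that it is well-formed.
SAMPLE = """
<item id="art0012">
  <mets_orders>
    <mets_order>2</mets_order>
    <mets_order>2</mets_order>
  </mets_orders>
  <plain_text_file>0002647_18240217_art0012.txt</plain_text_file>
</item>
"""


def unique_pages(item_xml):
    """Return the distinct mets_order values, keeping first-seen order."""
    item = ET.fromstring(item_xml)
    values = (int(el.text) for el in item.iterfind("mets_orders/mets_order"))
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(values))


pages = unique_pages(SAMPLE)
print(pages)  # [2]
```

Deduplicating in the XSLT itself would keep the metadata output clean for every consumer; doing it at read time, as here, is only a stopgap.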

kmcdono2 · 2023-04-11T19:20:52Z

Hi both - this is really amazing. @dcsw2 and I are trying to plan a data exploration & writing sprint for our toponyms article.

Is there a potentially realistic timeframe in which we might be able to have metadata about items in our sample that contains page numbers? (E.g. is that a possible takeaway from the examples above?)

kallewesterling · 2023-04-12T15:38:30Z

We have a hack session scheduled for Friday, so hopefully we'll have a positive update after that! See #83.

kallewesterling · 2023-04-20T10:12:17Z

@DavidBeavan just got a JISC sample down here and I'm posting it below.

Looks like this is the hierarchy we're looking for here (see below). BL_newspaper -> BL_article -> image_metadata -> pageImage -> pageSequence

Still looking for other formats!

<BL_newspaper xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/elements/1.1/">
  <BL_article>
    <title_metadata>
      <title>Reynolds&apos;s Newspaper</title>
      <normalisedTitle>Reynolds&apos;s Newspaper</normalisedTitle>
      <titleAbbreviation>RDNP</titleAbbreviation>
      <changeToTitle>
        <name>Reynolds&apos;s Weekly News</name>
        <startDate day="05" month="05" year="1850"/>
        <endDate day="09" month="02" year="1851"/>
      </changeToTitle>
      <changeToTitle>
        ...
      </changeToTitle>
      <placeOfPublication>London</placeOfPublication>
      <datesOfPublication>5 May 1850 - 2 Feb 1851; 9 Feb 1851 - 30 Dec 1900</datesOfPublication>
      <typeOfPublication>Newspaper</typeOfPublication>
      <subCollection>London Weekly</subCollection>
    </title_metadata>
    <issue_metadata>
      <volumeNumber></volumeNumber>
      <issueNumber>1012</issueNumber>
      <printedDate>SUNDAY, JANUARY 02, 1870</printedDate>
      <normalisedDate>1870.01.02</normalisedDate>
      <pageCount>8</pageCount>
      <reelID>1191</reelID>
      <qualityRating>Good</qualityRating>
    </issue_metadata>
    <article_metadata>
      <dc_metadata>
        <dc:Title>FOREIGN INTELLIGENCE.</dc:Title>
        <dc:Subject></dc:Subject>
        <dcterms:issued>1870.01.02</dcterms:issued>
        <dc:Type>Image</dc:Type>
        <dc:Type>Newspaper article</dc:Type>
        <dc:Type>News</dc:Type>
        <dc:Identifier>WO1_RDNP_1870_01_02-0001-002.xml</dc:Identifier>
        <dcterms:bibliographicCitation>Reynolds&apos;s Newspaper, 1012, 0001, SUNDAY, JANUARY 02, 1870</dcterms:bibliographicCitation>
        <dc:Language>eng</dc:Language>
        <dcterms:isPartOf>Reynolds&apos;s Newspaper</dcterms:isPartOf>
        <dc:Rights>Copyright &#x00A9; The British Library Board</dc:Rights>
      </dc_metadata>
      <additional_metadata>
        <illustrations indicator="no"/>
        <conversionCredit>Apex CoVantage, LLC</conversionCredit>
        <tableCredit></tableCredit>
        <illustrationCredit></illustrationCredit>
        <authorName></authorName>
      </additional_metadata>
    </article_metadata>
    <image_metadata>
      <pageImage>
        <pageSequence>0001</pageSequence>
        <pageImageFile>WO1_RDNP_1870_01_02-0001.tif</pageImageFile>
        <pageCoordinates>407,237,5266,6875</pageCoordinates>
        <pageSkew>40</pageSkew>
      </pageImage>
      <articleImage>
        <articleSequence>0001-002</articleSequence>
        <articleImageFile>WO1_RDNP_1870_01_02-0001-002.tif</articleImageFile>
        <articleCoordinates>1661,680,2457,3963</articleCoordinates>
        <articleText>
          <articleWord coord="136,15,343,64">FOREIGN</articleWord>
          <articleWord coord="330,10,642,60">INTELLIGENCE.</articleWord>
          <articleWord coord="298,87,455,125">s&quot;xRs</articleWord>
          <articleWord coord="79,116,344,158">RE3TGNATION</articleWord>
          <articleWord coord="335,116,409,152">OF</articleWord>
          <articleWord coord="408,114,507,151">THE</articleWord>
          <articleWord coord="498,109,697,149">MIN13T&amp;Y.</articleWord>
          <articleWord coord="35,153,107,191">Tbe,</articleWord>
          <articleWord coord="102,152,245,189">Emperor</articleWord>
          <articleWord coord="239,151,308,187">has</articleWord>
          ...
        </articleText>
      </articleImage>
    </image_metadata>
  </BL_article>
</BL_newspaper>
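
A minimal sketch of walking that hierarchy in Python, using a trimmed copy of the sample above. The helper name `page_and_box` is hypothetical, and the meaning of the four `articleCoordinates` values (corner pairs versus origin plus size) is not stated in the sample, so they are returned as a plain tuple without interpretation.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the JISC sample: only the elements this sketch reads.
SAMPLE = """
<BL_newspaper>
  <BL_article>
    <image_metadata>
      <pageImage>
        <pageSequence>0001</pageSequence>
        <pageImageFile>WO1_RDNP_1870_01_02-0001.tif</pageImageFile>
      </pageImage>
      <articleImage>
        <articleCoordinates>1661,680,2457,3963</articleCoordinates>
      </articleImage>
    </image_metadata>
  </BL_article>
</BL_newspaper>
"""


def page_and_box(xml_text):
    """Return (page sequence as int, article coordinates as a 4-tuple)."""
    root = ET.fromstring(xml_text)
    image = root.find("BL_article/image_metadata")
    page = int(image.findtext("pageImage/pageSequence"))
    coords = image.findtext("articleImage/articleCoordinates")
    box = tuple(int(value) for value in coords.split(","))
    return page, box


page, box = page_and_box(SAMPLE)
print(page, box)  # 1 (1661, 680, 2457, 3963)
```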

griff-rees mentioned this issue Mar 2, 2023

Page number (and potentially other xml info) list for each Item Living-with-machines/lwmdb#82

Open

kallewesterling mentioned this issue Apr 6, 2023

[meta] Organise a hack day to address a few issues #83

Open

DavidBeavan mentioned this issue Apr 17, 2023

Page number output for mets #86

Draft

5 tasks

kallewesterling mentioned this issue Apr 19, 2023

Draft blog post documenting how to use alto2txt with the BL repository newspapers #88

Open
