partial rsync error when parsing zip file from HT rsync to index page data #539

mnaydan · 2023-07-07T18:18:14Z

Discovered when trying to add uga1.32108002998303 to the database.

Culprit from logs:

KeyError: "There is no item named '$b31619/UCAL_$B31619_00000177.txt' in the archive"

The text file does appear to be missing from the zip file, but we're not sure why the code expects it to be there. Looking at the METS XML, the pages only go up to 108.txt. There must be something wrong with how we parse the XML and load the text files.

testing notes

Suppress and then remove 'uc1.$b31619'; then use the hathi import to re-add.

Import should succeed with no 500 error; page content should be searchable.

rlskoeser · 2024-02-02T19:27:31Z

The record that's actually causing this problem is uc1.$b31619

I got an error when I tried to suppress it in QA: we don't have permission (at least in QA) to delete the pairtree files from the NFS mount point. That's something I'll have to check on with Francis, but we may want an issue to track it. (It would be nice to know if it's also a problem in production.)

mnaydan · 2024-02-02T19:33:36Z

How would we check whether that's a problem in production without actually suppressing it? I suppose I could suppress, delete, and then re-add it if you think I should. The record appears to be working and indexed in production: https://prosody.princeton.edu/archive/uc1.$b31619/?query=epic

rlskoeser · 2024-02-02T19:41:59Z

Yeah, that was my concern too - would rather not delete or suppress something we don't want to actually delete. I'll try to figure it out from the filesystem side without needing that.

It's possible that's not the record that's actually causing the problem... so weird!

When I run index_pages with just that id, it doesn't error but it also doesn't seem to index any pages.

rlskoeser · 2024-02-12T21:26:30Z

@mnaydan looking at the METS XML and zip file for uc1.$b31619: the METS definitely references page files that are not present in the zip file.

Here's the end of the METS record:

      <METS:div LABEL="CHAPTER_START, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="176">
        <METS:fptr FILEID="TXT00000176"/>
        <METS:fptr FILEID="HTML00000176"/>
        <METS:fptr FILEID="IMG00000176"/>
      </METS:div>
      <METS:div TYPE="page" ORDER="177" LABEL="BLANK, IMPLICIT_PAGE_NUMBER">
        <METS:fptr FILEID="HTML00000177"/>
        <METS:fptr FILEID="TXT00000177"/>
        <METS:fptr FILEID="IMG00000177"/>
      </METS:div>
      <METS:div LABEL="CHAPTER_START, IMAGE_ON_PAGE, UNTYPICAL_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="178">
        <METS:fptr FILEID="IMG00000178"/>
        <METS:fptr FILEID="TXT00000178"/>
        <METS:fptr FILEID="HTML00000178"/>
      </METS:div>
      <METS:div TYPE="page" ORDER="179" LABEL="INDEX, IMAGE_ON_PAGE, UNTYPICAL_PAGE, IMPLICIT_PAGE_NUMBER">
        <METS:fptr FILEID="HTML00000179"/>
        <METS:fptr FILEID="TXT00000179"/>
        <METS:fptr FILEID="IMG00000179"/>
      </METS:div>
      <METS:div LABEL="BACK_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="180">
        <METS:fptr FILEID="IMG00000180"/>
        <METS:fptr FILEID="TXT00000180"/>
        <METS:fptr FILEID="HTML00000180"/>
      </METS:div>
    </METS:div>

But the text files in the accompanying zip file only go up to 176!

$b31619/UCAL_$B31619_00000172.txt
$b31619/UCAL_$B31619_00000173.txt
$b31619/UCAL_$B31619_00000174.txt
$b31619/UCAL_$B31619_00000175.txt
$b31619/UCAL_$B31619_00000176.txt
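A mismatch like this can be checked programmatically. The following is a minimal sketch, not the project's actual code: it assumes the METS file section lists each page file via `METS:FLocat` with an `xlink:href` file name, and that zip members are stored under a directory prefix such as `$b31619/`. The function name `missing_text_files` and the sample METS snippet are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace URIs for METS and xlink (standard values).
METS_NS = {"METS": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily trimmed METS file section for illustration only.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:fileSec>
    <METS:fileGrp USE="ocr">
      <METS:file ID="TXT00000176">
        <METS:FLocat xlink:href="UCAL_$B31619_00000176.txt"/>
      </METS:file>
      <METS:file ID="TXT00000177">
        <METS:FLocat xlink:href="UCAL_$B31619_00000177.txt"/>
      </METS:file>
    </METS:fileGrp>
  </METS:fileSec>
</METS:mets>"""


def missing_text_files(mets_xml, zip_file, prefix):
    """Return METS-listed .txt files that are absent from the zip archive."""
    root = ET.fromstring(mets_xml)
    referenced = []
    for flocat in root.iterfind(".//METS:file/METS:FLocat", METS_NS):
        href = flocat.get(XLINK_HREF)
        if href and href.endswith(".txt"):
            # Assumption: zip members live under "<prefix>/<file name>".
            referenced.append(f"{prefix}/{href}")
    with zipfile.ZipFile(zip_file) as archive:
        present = set(archive.namelist())
    return [name for name in referenced if name not in present]


# Build an in-memory zip that only contains page 176.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("$b31619/UCAL_$B31619_00000176.txt", "last page with text")
buffer.seek(0)

print(missing_text_files(SAMPLE_METS, buffer, "$b31619"))
```

Run against the real METS and zip file, a check along these lines would list every page the METS references but the archive lacks, which may be useful evidence when reporting the problem upstream.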

This record was already suppressed on our staging site, so I removed it and re-imported it to get a fresh copy. So I think this is still the most recent copy of the record. Is there a way to report errors like this to HathiTrust?

We can add error handling in our page index code so this isn't a fatal error, and we can add logging so that if we ever look at log files we would see what's going on. In this case we're probably not missing too much (or maybe any?) content, and I think this is the only record where we've ever run into it, so it's probably safe to do that. Any thoughts?
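As a rough illustration of that approach (a sketch under assumptions, not the code that was eventually committed), the page loader could catch the `KeyError` that Python's `zipfile` raises for a missing member, log a warning, and move on. The function name `read_pages` and the logger setup are hypothetical.

```python
import io
import logging
import zipfile

logger = logging.getLogger(__name__)


def read_pages(archive, page_files):
    """Yield (filename, text) for each page file; skip and log missing ones."""
    for filename in page_files:
        try:
            with archive.open(filename) as page:
                text = page.read().decode("utf-8")
        except KeyError:
            # zipfile raises KeyError when the named member is not in the archive
            logger.warning(
                "%s is referenced in METS but not found in zip file", filename
            )
            continue
        yield filename, text


# Build an in-memory zip that only contains page 176.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("$b31619/UCAL_$B31619_00000176.txt", "last page with text")
buffer.seek(0)

wanted = [
    "$b31619/UCAL_$B31619_00000176.txt",
    "$b31619/UCAL_$B31619_00000177.txt",  # listed in METS, absent from zip
]
with zipfile.ZipFile(buffer) as archive:
    pages = list(read_pages(archive, wanted))
print(pages)
```

With this shape, a missing page file produces a log entry instead of aborting the whole import, and the pages that do exist are still indexed.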

mnaydan · 2024-02-12T22:07:20Z

Looking at it on HathiTrust, it looks like page 176 is the last meaningful page (and so the last one with a txt file); the subsequent pages are blank, flyleaf, and back cover, so we are not missing content. That plan sounds good to me: add error handling so it's not fatal, and log. To confirm: is requesting txt files for flyleaves, covers, etc. not a problem for other works?

I can submit a "Content Quality Correction" for the work on HathiTrust but that seems to be for scan quality on the frontend.

mnaydan · 2024-02-14T20:15:52Z

I followed the testing notes and this looks good!

mnaydan added the bug label Jul 7, 2023

mnaydan mentioned this issue Nov 30, 2023

As an admin, I want a way to reproducibly generate a full-text corpus of all public PPA content in order to support computational research on PPA materials. #556

Closed

rlskoeser self-assigned this Feb 12, 2024

rlskoeser added a commit that referenced this issue Feb 14, 2024

Add error handling for page files referenced in METS but not in zip file

84e2343

resolves #539

rlskoeser mentioned this issue Feb 14, 2024

Add error handling for page files referenced in METS but not in zip file #588

Merged

rlskoeser added a commit that referenced this issue Feb 14, 2024

Add error handling for page files referenced in METS but not in zip file

b2f9f46

resolves #539

rlskoeser added the awaiting testing label Feb 14, 2024

mnaydan closed this as completed Feb 14, 2024

mnaydan removed the awaiting testing label Feb 14, 2024

mnaydan mentioned this issue Mar 11, 2024

As a developer, I want a script to update all HathiTrust content so that I can refresh locally cached data with OCR improvements and other changes. #428

Closed
