partial rsync error when parsing zip file from HT rsync to index page data #539

Closed
mnaydan opened this issue Jul 7, 2023 · 6 comments

@mnaydan
Contributor

mnaydan commented Jul 7, 2023

Discovered when trying to add uga1.32108002998303 to the database.

Culprit from logs:

KeyError: "There is no item named '$b31619/UCAL_$B31619_00000177.txt' in the archive"

It seems accurate that the text file is not in the zipfile but we're not sure why the code expects it to be. Looking at the mets xml, the pages only go up to 108.txt. There must be something we're doing wrong with how we parse the xml and load the text files.

testing notes

Suppress and then remove 'uc1.$b31619'; then use the hathi import to re-add.

Import should succeed with no 500 error; page content should be searchable.

@rlskoeser
Contributor

The record that's actually causing this problem is uc1.$b31619

I got an error when I tried to suppress it in QA - we don't have permission (at least in qa) to delete the pairtree files from the NFS mount point. That's something I'll have to check on with Francis, but we may want an issue to track it. (Would be nice to know if it's also a problem in production)

@mnaydan
Contributor Author

mnaydan commented Feb 2, 2024

How would we check whether that's a problem in production without actually suppressing it? I suppose I could suppress, delete, and then re-add it if you think I should. The record appears to be working and indexed in production: https://prosody.princeton.edu/archive/uc1.$b31619/?query=epic

@rlskoeser
Contributor

rlskoeser commented Feb 2, 2024

Yeah, that was my concern too - would rather not delete or suppress something we don't want to actually delete. I'll try to figure it out from the filesystem side without needing that.

It's possible that's not the record that's actually causing the problem... so weird!

When I run index_pages with just that id, it doesn't error but it also doesn't seem to index any pages.

rlskoeser self-assigned this Feb 12, 2024
@rlskoeser
Contributor

@mnaydan looking at the mets-xml and zip file for uc1.$b31619 - the mets definitely references page files that are not present in the zip file.

Here's the end of the mets record:

      <METS:div LABEL="CHAPTER_START, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="176">
        <METS:fptr FILEID="TXT00000176"/>
        <METS:fptr FILEID="HTML00000176"/>
        <METS:fptr FILEID="IMG00000176"/>
      </METS:div>
      <METS:div TYPE="page" ORDER="177" LABEL="BLANK, IMPLICIT_PAGE_NUMBER">
        <METS:fptr FILEID="HTML00000177"/>
        <METS:fptr FILEID="TXT00000177"/>
        <METS:fptr FILEID="IMG00000177"/>
      </METS:div>
      <METS:div LABEL="CHAPTER_START, IMAGE_ON_PAGE, UNTYPICAL_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="178">
        <METS:fptr FILEID="IMG00000178"/>
        <METS:fptr FILEID="TXT00000178"/>
        <METS:fptr FILEID="HTML00000178"/>
      </METS:div>
      <METS:div TYPE="page" ORDER="179" LABEL="INDEX, IMAGE_ON_PAGE, UNTYPICAL_PAGE, IMPLICIT_PAGE_NUMBER">
        <METS:fptr FILEID="HTML00000179"/>
        <METS:fptr FILEID="TXT00000179"/>
        <METS:fptr FILEID="IMG00000179"/>
      </METS:div>
      <METS:div LABEL="BACK_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="180">
        <METS:fptr FILEID="IMG00000180"/>
        <METS:fptr FILEID="TXT00000180"/>
        <METS:fptr FILEID="HTML00000180"/>
      </METS:div>
    </METS:div>

But the text files in the accompanying zip file only go up to 176!

$b31619/UCAL_$B31619_00000172.txt
$b31619/UCAL_$B31619_00000173.txt
$b31619/UCAL_$B31619_00000174.txt
$b31619/UCAL_$B31619_00000175.txt
$b31619/UCAL_$B31619_00000176.txt
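
One way to confirm a mismatch like this is to compare the text files the METS references against the members actually present in the zip. The following is a minimal sketch, not the project's code: `missing_text_files` is a hypothetical helper, and it assumes the METS file section maps each `FILEID` to a file name through a `METS:FLocat` element with an `xlink:href` attribute, and that zip members sit under a single directory prefix.

```python
import xml.etree.ElementTree as ET
import zipfile

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def missing_text_files(mets_source, zip_source, prefix):
    """Return text file paths referenced by the METS but absent from the zip.

    Hypothetical helper; mets_source and zip_source may be paths or
    file-like objects. prefix is the directory inside the zip.
    """
    root = ET.parse(mets_source).getroot()

    # Assumed layout: <METS:file ID="..."><METS:FLocat xlink:href="name"/>
    names = {}
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        if flocat is not None:
            names[file_el.get("ID")] = flocat.get(f"{{{XLINK_NS}}}href")

    with zipfile.ZipFile(zip_source) as zf:
        present = set(zf.namelist())

    missing = []
    for fptr in root.iter(f"{{{METS_NS}}}fptr"):
        file_id = fptr.get("FILEID") or ""
        # Only text files matter for page indexing
        if not file_id.startswith("TXT") or file_id not in names:
            continue
        path = f"{prefix}/{names[file_id]}"
        if path not in present:
            missing.append(path)
    return missing
```

For the record discussed here, the prefix would be `$b31619`.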

This record was already suppressed on our staging site, so I removed it and re-imported it to get a fresh copy. So I think this is still the most recent copy of the record. Is there a way to report errors like this to HathiTrust?

We can add error handling in our page index code so this isn't a fatal error, and we can add logging so that if we ever look at log files we would see what's going on. In this case we're probably not missing too much (or maybe any?) content, and I think this is the only record where we've ever run into it, so it's probably safe to do that. Any thoughts?
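
A minimal sketch of the error handling described above, under stated assumptions: `read_page_texts` is a hypothetical helper rather than the project's actual indexing code, and it treats a page whose text file is missing from the zip as empty content. It relies on `zipfile.ZipFile.open` raising `KeyError` for a member that is not in the archive, which is the error shown in the original log.

```python
import logging
import zipfile

logger = logging.getLogger(__name__)


def read_page_texts(zip_source, page_paths):
    """Yield (path, text) for each page path listed in the METS.

    Hypothetical helper: a page whose text file is missing from the
    zip is logged and yielded with empty text instead of raising.
    """
    with zipfile.ZipFile(zip_source) as zf:
        for path in page_paths:
            try:
                with zf.open(path) as page_file:
                    text = page_file.read().decode("utf-8")
            except KeyError:
                # The METS references a file that the archive lacks
                logger.warning("Page file %s not found in zip; indexing as empty", path)
                text = ""
            yield path, text
```

Yielding an empty string keeps page order and count aligned with the METS; skipping the page entirely would be another reasonable choice.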

@mnaydan
Contributor Author

mnaydan commented Feb 12, 2024

Looking at it on HathiTrust, it looks like page 176 is the last meaningful page (and so the last one with a txt file); the subsequent pages are blank, flyleaf, and back cover, so we are not missing content. That plan sounds good to me -- to add error handling so it's not fatal, and log. To confirm: is requesting txt files for flyleaves, covers, etc. not a problem for other works?

I can submit a "Content Quality Correction" for the work on HathiTrust, but that seems to be for scan quality on the frontend.

@mnaydan
Contributor Author

mnaydan commented Feb 14, 2024

I followed the testing notes and this looks good!
