partial rsync error when parsing zip file from HT rsync to index page data #539
The record that's actually causing this problem is uc1.$b31619. I got an error when I tried to suppress it in QA: we don't have permission (at least in QA) to delete the pairtree files from the NFS mount point. That's something I'll have to check on with Francis, but we may want an issue to track it. (It would be nice to know whether it's also a problem in production.)
How would we check whether that's a problem in production without actually suppressing it? I suppose I could suppress, delete, and then re-add it if you think I should. The record appears to be working and indexed in production: https://prosody.princeton.edu/archive/uc1.$b31619/?query=epic
Yeah, that was my concern too; I'd rather not delete or suppress something we don't actually want to delete. I'll try to figure it out from the filesystem side without needing that. It's possible that's not the record that's actually causing the problem... so weird! When I run index_pages with just that id, it doesn't error, but it also doesn't seem to index any pages.
@mnaydan Looking at the METS XML and zip file for uc1.$b31619, the METS definitely references page files that are not present in the zip file. Here's the end of the METS record:

```xml
<METS:div LABEL="CHAPTER_START, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="176">
  <METS:fptr FILEID="TXT00000176"/>
  <METS:fptr FILEID="HTML00000176"/>
  <METS:fptr FILEID="IMG00000176"/>
</METS:div>
<METS:div TYPE="page" ORDER="177" LABEL="BLANK, IMPLICIT_PAGE_NUMBER">
  <METS:fptr FILEID="HTML00000177"/>
  <METS:fptr FILEID="TXT00000177"/>
  <METS:fptr FILEID="IMG00000177"/>
</METS:div>
<METS:div LABEL="CHAPTER_START, IMAGE_ON_PAGE, UNTYPICAL_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="178">
  <METS:fptr FILEID="IMG00000178"/>
  <METS:fptr FILEID="TXT00000178"/>
  <METS:fptr FILEID="HTML00000178"/>
</METS:div>
<METS:div TYPE="page" ORDER="179" LABEL="INDEX, IMAGE_ON_PAGE, UNTYPICAL_PAGE, IMPLICIT_PAGE_NUMBER">
  <METS:fptr FILEID="HTML00000179"/>
  <METS:fptr FILEID="TXT00000179"/>
  <METS:fptr FILEID="IMG00000179"/>
</METS:div>
<METS:div LABEL="BACK_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page" ORDER="180">
  <METS:fptr FILEID="IMG00000180"/>
  <METS:fptr FILEID="TXT00000180"/>
  <METS:fptr FILEID="HTML00000180"/>
</METS:div>
</METS:div>
```

But the text files in the accompanying zip file only go up to 176!
This record was already suppressed on our staging site, so I removed it and re-imported it to get a fresh copy; I think this is still the most recent copy of the record. Is there a way to report errors like this to HathiTrust? We can add error handling in our page index code so this isn't a fatal error, and we can add logging so that if we ever look at the log files we would see what's going on. In this case we're probably not missing much (maybe any?) content, and I think this is the only record where we've ever run into it, so that's probably safe to do. Any thoughts?
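The handling described above could be sketched roughly as follows. This is an illustrative standalone sketch, not the project's actual index_pages code; the function name `page_texts`, the parameter names, and the filename convention are all assumptions for the example.

```python
import logging
import zipfile

logger = logging.getLogger(__name__)


def page_texts(zip_path, page_filenames):
    """Yield (filename, text) for each page file found in the zip.

    A page file referenced by the METS but absent from the zip (as with
    uc1.$b31619) is logged and skipped instead of raising a fatal error.
    """
    with zipfile.ZipFile(zip_path) as zf:
        present = set(zf.namelist())
        for name in page_filenames:
            if name not in present:
                # missing page content: log it so the gap is visible later
                logger.warning("%s not found in %s; skipping", name, zip_path)
                continue
            with zf.open(name) as pagefile:
                yield name, pagefile.read().decode("utf-8")
```

With this approach an import of a record like uc1.$b31619 would index the 176 pages that do have text files and merely log warnings for the trailing blank/cover pages.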
Looking at it on HathiTrust, page 176 appears to be the last meaningful page (and so the last one with a txt file); the subsequent pages are blank, flyleaf, and back cover, so we are not missing content. That plan sounds good to me: add error handling so it's not fatal, and log. To confirm: is trying to fetch txt files for flyleaves, covers, etc. not a problem for other works? I can submit a "Content Quality Correction" for the work on HathiTrust, but that seems to be for scan quality on the frontend.
I followed the testing notes and this looks good! |
Discovered when trying to add uga1.32108002998303 to the database.
Culprit from logs:
It seems accurate that the text file is not in the zip file, but we're not sure why the code expects it to be there. Looking at the METS XML, the pages only go up to 108.txt. There must be something wrong with how we parse the XML and load the text files.
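One way to diagnose a record like this is to cross-check the text-file FILEIDs referenced in the METS structMap against the actual zip contents. This is a hedged standalone sketch: the function name `missing_text_pages` and the assumption that `TXT`-prefixed FILEIDs correspond to zero-padded `.txt` members in the zip are inferred from the example above, not from a confirmed HathiTrust specification.

```python
import zipfile
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"


def missing_text_pages(mets_path, zip_path):
    """Return METS text FILEIDs with no matching .txt member in the zip."""
    tree = ET.parse(mets_path)
    # collect every text-file pointer from the structMap page divs
    txt_ids = {
        fptr.get("FILEID")
        for fptr in tree.iter(METS_NS + "fptr")
        if fptr.get("FILEID", "").startswith("TXT")
    }
    with zipfile.ZipFile(zip_path) as zf:
        # page numbers from zip member names, e.g. "vol/00000176.txt"
        zip_pages = {
            name.rsplit("/", 1)[-1].removesuffix(".txt")
            for name in zf.namelist()
            if name.endswith(".txt")
        }
    # FILEID "TXT00000176" corresponds to page "00000176"
    return sorted(i for i in txt_ids if i.removeprefix("TXT") not in zip_pages)
```

Run against the uc1.$b31619 pairtree data, a check like this would be expected to report the TXT pointers for pages 177–180 as missing.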
Testing notes
Suppress and then remove uc1.$b31619; then use the hathi import to re-add it.
The import should succeed with no 500 error, and page content should be searchable.