As a developer, I want a script to update all HathiTrust content so that I can refresh locally cached data with OCR improvements and other changes. #428
Questions we still need to answer:
I ran our new replicate playbook to make sure the PPA pairtree matched production (including the deleted test records we added to staging), and then ran the new version of the script. Here's the output I saw in the terminal:

Synchronizing data for 5319 records
rsync: [sender] link_stat "uga1/pairtree_prefix" (in ht_text_pd) failed: No such file or directory (2)
rsync: [sender] link_stat "uga1/pairtree_version0_1" (in ht_text_pd) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [generator=3.2.7]
ERROR:ppa.archive.import_util:HathiTrust rsync failed — Partial transfer due to error / command: rsync -rLt --log-file=/tmp/ppa-rsync_hl204axt/ppa_hathi_rsync_20240308-132616.log --delete --ignore-errors --files-from=/tmp/ppa_hathi_pathlist-10jsmrx4.txt datasets.hathitrust.org::ht_text_pd /mnt/nfs/cdh/prosody/data/ht_text_pd/
Updated 9389 files for 4707 volumes; full details in ppa_rsync_changes_20240308-134553.csv

It took a while to run; I think we had a lot of changes to catch up on, since we haven't been updating. Here's a copy of the detail file generated by the script; it's a CSV generated by parsing the rsync log. I opted to include more details based on Laure's review, since the raw rsync log file is currently not preserved. Note that we don't currently handle deletions in this report.
Ran the script to update page counts based on the updated data; here's the output:
I'm starting a full page reindex now so search will be updated with the new content. [updated] The page reindex completed; it looks like we're getting more cases where the METS-XML references pages that don't have a text file (good thing we fixed that!). Because of how we count HathiTrust pages, this is going to cause more discrepancies in page counts (let's discuss how we want to handle them). Here's the page indexing script output.
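To illustrate how those page-count discrepancies arise, here's a minimal sketch (not the project's actual code; the function name and id format are hypothetical): the METS-XML lists page ids that should each have a matching OCR text file in the volume, but some don't.

```python
def missing_page_text(mets_page_ids, text_filenames):
    """Return METS page ids that have no matching .txt file in the volume.

    mets_page_ids: page identifiers listed in the METS-XML structmap
    text_filenames: OCR text filenames actually present, e.g. '00000001.txt'
    """
    # strip the .txt extension to compare against the METS page ids
    text_ids = {name.rsplit(".", 1)[0] for name in text_filenames}
    return sorted(pid for pid in mets_page_ids if pid not in text_ids)
```

Any ids this returns are pages the METS claims exist but that have no OCR text, which is why counting pages from the METS overstates the total.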
Not quite sure how to test. I looked at the CSV and I'm not sure how to read it. It looks like only 6 files were marked size_changed FALSE and 2 modification_time FALSE, and there are 3 types of rsync_flags: >f.st......, >f..t......, and >f+++++++++. What do these mean? What do size_changed and modification_time refer to? (Edit: I see your explanation of the format on #453 now.) As for the METS-XML referencing non-existent txt files, it's good to know (but unfortunate) that it's more than just the one record we found on #539. As long as the error is non-fatal, maybe it makes sense to document that page counts may be inaccurate due to this HathiTrust abnormality, rather than put development time into working around it. Do we still need to address your questions about how often to run this, and whether a content admin should manage it?
I'm going to make an attempt at answering this, but @rlskoeser correct me if I'm wrong on any of the details.
The last portion corresponds to the rsync log output, which is a fixed-length string of 11 characters. This string should always start with
The rsync attributes we're explicitly logging are
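Based on rsync's documented --itemize-changes format, here's a sketch of decoding those flag strings into the fields seen in the CSV report (the function name and exact field names are illustrative, not necessarily what the actual script uses):

```python
def decode_itemized(flags: str) -> dict:
    """Decode an 11-character rsync --itemize-changes string, e.g. '>f.st......'.

    Per the rsync man page: position 0 is the update type ('>' = file
    received), position 1 is the file type ('f' = regular file), and the
    remaining 9 positions each cover one attribute. A letter means that
    attribute changed, '.' means it did not, and '+' in every attribute
    position marks a newly created file.
    """
    return {
        "newly_created": flags[2:] == "+" * 9,
        "checksum_changed": flags[2] == "c",
        "size_changed": flags[3] == "s",
        "modification_time": flags[4] == "t",
        "permissions_changed": flags[5] == "p",
    }
```

So of the three flag strings in the report: >f.st...... means size and modification time both changed, >f..t...... means only the modification time changed, and >f+++++++++ means the file is newly transferred.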
I thought it would be good to run the rsync command in staging again before running the new excerpt script; I was curious whether there would be any changes since running rsync two weeks ago. The script output reports: "Updated 90 files for 56 volumes". Here's the CSV report generated by the rsync script.
Another set of changes just since yesterday! 12 volumes, including 11 njp records.
After talking through our decision tree we decided to retitle this issue for full works. Development notes (as I understand them):
Deletions and excerpts will be handled as part of #626
Since we rescoped this issue, we've already tested it as part of the other rsync investigations, and OCR was successfully updated.
As @mnaydan and I discovered when testing excerpts, where we found text and page-image mismatches (and as I've suspected for a while), we should probably be doing regular rsync updates of our HathiTrust pairtree data to pick up any changes and (presumably) deletions.
The manual version looks something like this:
rsync --copy-links --recursive --times --verbose --itemize-changes --files-from=/tmp/rsync_path_list.txt datasets.hathitrust.org::ht_text_pd .
(--itemize-changes was a local addition so we could see specifically which files changed; I didn't use the delete option, but we will need to add it.) Note: we might need to do something about permissions or file ownership. When I ran this in production, the permissions on some files did not allow the deploy user to set the time attributes (likely because the files were created via the admin interface, so they were owned by apache instead of deploy; we should also check that the rsync command generated by the admin uses the same options). After the rsync completes, reindex the pages:

python manage.py index_pages
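If we script this step, a minimal sketch might look like the following (paths and options copied from the logged command earlier in this thread; function names are hypothetical). Exit code 23 is rsync's "partial transfer" status, which the logged output shows being reported but not treated as fatal when --ignore-errors is used:

```python
import subprocess

SOURCE = "datasets.hathitrust.org::ht_text_pd"
DEST = "/mnt/nfs/cdh/prosody/data/ht_text_pd/"  # path from the log above


def build_rsync_command(files_from, dest=DEST):
    """Build the rsync argv, using the same options as the logged command."""
    return [
        "rsync", "-rLt", "--delete", "--ignore-errors",
        f"--files-from={files_from}", SOURCE, dest,
    ]


def run_rsync(files_from, dest=DEST):
    """Run rsync; code 23 (partial transfer) is logged but non-fatal,
    since --ignore-errors lets the rest of the transfer complete."""
    result = subprocess.run(build_rsync_command(files_from, dest))
    if result.returncode not in (0, 23):
        raise RuntimeError(f"rsync failed with code {result.returncode}")
    return result.returncode
```

This doesn't address the file-ownership issue noted above; if some files are owned by apache, the deploy user may still fail to set time attributes regardless of how the command is invoked.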
We'll need to do these steps in production because we know we have content out of date, and it affects some of the excerpts.
We probably want to automate this at some point; I think if we do, it would help us catch any deletions in our dataset, but I'm not clear on all the steps we'd need.
Currently thinking it might look like this:
I'm unclear on why volumes get removed; if we think it's unlikely for our data (since it is all public domain), maybe we can come up with a simpler approach for handling possible deletions. I'd love to have some way of feeding the HathiTrust deletion email we get now to a script that can alert us if any of those ids overlap with PPA records, but I haven't had time to think or ask about ways to accomplish that. We could ask HathiTrust (or check their documentation) to see if the deletion information is made available in another form; if there's an RSS feed or something we could consume via script, that would be easier.
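The overlap check described above could be sketched like this. This is purely illustrative: the actual format of the deletion notice is unknown, so the parser here just assumes one HathiTrust id per line, and both function names are hypothetical.

```python
def parse_deletion_notice(text):
    """Parse a deletion notice into ids, assuming one id per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def deleted_ppa_overlap(deletion_ids, ppa_hathi_ids):
    """Return the deleted HathiTrust ids that match records in PPA."""
    return sorted(set(deletion_ids) & set(ppa_hathi_ids))
```

In practice ppa_hathi_ids would come from a queryset of the HathiTrust source ids in the database, and a nonempty result would trigger an alert to the team.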