
As a developer, I want a script to update all HathiTrust content so that I can refresh locally cached data with OCR improvements and other changes. #428

Closed
rlskoeser opened this issue Jul 29, 2021 · 9 comments

rlskoeser commented Jul 29, 2021

As @mnaydan and I discovered in testing excerpts, when we found text and page image mismatches (and as I've suspected for a while), we should probably be doing regular rsync updates of our HathiTrust pairtree data to pick up any changes and (presumably) deletions.

The manual version looks something like this:

  • get list of all hathi source ids in the database
  • create a text file with one id per line
  • use the one-off script to generate a path list for rsync (per HathiTrust instructions)
  • run rsync with the generated path list, using the options HathiTrust specifies: `rsync --copy-links --recursive --times --verbose --itemize-changes --files-from=/tmp/rsync_path_list.txt datasets.hathitrust.org::ht_text_pd ..` (--itemize-changes was a local addition so we could see specifically which files changed; I didn't use the --delete option, but we will need to add it). Note: we might need to do something about permissions or file ownership. When I ran this in production, the permissions on some files did not allow the deploy user to set the time attributes (likely because those files were created via the admin interface and so were owned by apache rather than deploy); we should also check that the rsync command generated by the admin uses the same options.
  • After rsync completes, reindex pages to get updated content: `python manage.py index_pages`
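The id-to-path step above can be sketched in Python. This is a simplified illustration, not the exact HathiTrust recipe: real pairtree encoding also escapes special characters (ids like `uc1.$b31619` need cleaning first).

```python
# Sketch: build an rsync path list from HathiTrust ids.
# Simplified pairtree encoding; ids containing special characters
# (e.g. "uc1.$b31619") would need additional escaping in practice.
def pairtree_path(htid: str) -> str:
    # an id like "mdp.39015031007621" is a library prefix plus a volume id
    prefix, vol_id = htid.split(".", 1)
    # pairtree shards the volume id into two-character segments
    shards = [vol_id[i:i + 2] for i in range(0, len(vol_id), 2)]
    return "/".join([prefix, "pairtree_root", *shards, vol_id])

hathi_ids = ["mdp.39015031007621", "mdp.39015009309371"]
with open("/tmp/rsync_path_list.txt", "w") as pathlist:
    for htid in hathi_ids:
        pathlist.write(pairtree_path(htid) + "\n")
```

The resulting file is what gets passed to rsync via `--files-from`.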

We'll need to do these steps in production because we know we have content out of date, and it affects some of the excerpts.

We probably want to automate this at some point. I think if we do, it would help us catch any deletions in our dataset, but I'm not clear on all the steps we'd need.

Currently thinking it might look like this:

  • custom manage command that can be run as a cron job
    • gets a unique list of hathitrust source ids from the database
    • calls rsync with those ids, similar to the way we call rsync for adding new records now
    • identifies any records where the content was updated
      • reindexes pages for updated items
    • identifies any records that have been deleted (not sure how! file contents will be gone, but hopefully there's some way to get rsync to tell us what it removed)
      • mark database record as suppressed; add something to the note field and/or a log entry to document why/when it was suppressed (suppressing will take care of removing it from the index)
      • alert PPA admins (via email? Slack?) that an item has been removed; they should be aware, and may need to find and import a different copy of that volume
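A rough sketch of what the update/deletion detection could look like, based on parsing rsync's `--itemize-changes` output (the function names and exact flag set here are hypothetical, not the project's actual code):

```python
import subprocess

def parse_itemized(output: str) -> tuple[list[str], list[str]]:
    # with --itemize-changes and --delete, rsync reports received files
    # as ">f..." lines and removals as "*deleting" lines
    updated, deleted = [], []
    for line in output.splitlines():
        if line.startswith("*deleting"):
            deleted.append(line.split(maxsplit=1)[1])
        elif line.startswith(">f"):
            updated.append(line.split(maxsplit=1)[1])
    return updated, deleted

def hathi_rsync(pathlist: str, dest: str) -> tuple[list[str], list[str]]:
    # same options as the manual invocation above, plus --delete
    result = subprocess.run(
        ["rsync", "--copy-links", "--recursive", "--times",
         "--itemize-changes", "--delete",
         f"--files-from={pathlist}",
         "datasets.hathitrust.org::ht_text_pd", dest],
        capture_output=True, text=True, check=False)
    return parse_itemized(result.stdout)
```

Updated paths would map back to source ids for reindexing; deleted paths would trigger the suppression and admin-alert steps.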

I'm unclear on why volumes get removed; if we think it's unlikely for our data (since it is all public domain), maybe we can come up with a simpler approach for handling possible deletions. I'd love to have some way of feeding the HathiTrust deletion email we get now to a script that can alert us if any of those ids overlap with PPA records, but I haven't had time to think or ask about ways to accomplish that. We could ask HathiTrust (or check their documentation) to see if the deletion email information is made available in another form; if there's an RSS feed or something we could consume via script, that would be easier.

@mnaydan mnaydan changed the title regularly synchronize HathiTrust pairtree data As an admin, I want HathiTrust pairtree data regularly synchronized so that we have the most current version of HathiTrust content. Jan 16, 2024
@rlskoeser

questions we still need to answer:

  • how often to run this? (currently thinking weekly)
  • how can/should a content admin manage this? Maybe a log entry documenting that a work was updated via rsync would be sufficient; it would be visible on the individual work history, but could also be queried from the log entry admin list

@rlskoeser

I ran our new replicate playbook to make sure PPA pairtree matched production (including deleted test records we added to staging), and then ran the new version of the hathi_rsync manage command.

Here's the output from the command I saw in the terminal:

Synchronizing data for 5319 records
rsync: [sender] link_stat "uga1/pairtree_prefix" (in ht_text_pd) failed: No such file or directory (2)
rsync: [sender] link_stat "uga1/pairtree_version0_1" (in ht_text_pd) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [generator=3.2.7]
ERROR:ppa.archive.import_util:HathiTrust rsync failed — Partial transfer due to error / command: rsync -rLt --log-file=/tmp/ppa-rsync_hl204axt/ppa_hathi_rsync_20240308-132616.log --delete --ignore-errors  --files-from=/tmp/ppa_hathi_pathlist-10jsmrx4.txt datasets.hathitrust.org::ht_text_pd /mnt/nfs/cdh/prosody/data/ht_text_pd/
Updated 9389 files for 4707 volumes; full details in ppa_rsync_changes_20240308-134553.csv

It took a while to run - I think we had a lot of changes to catch up on since we haven't been updating.

Here's a copy of the detail file generated by the script; it's a CSV produced by parsing the rsync log. I opted to include more details based on Laure's review, since the raw rsync log file is currently not preserved.

Note that we currently don't handle deletions in this report.

ppa_rsync_changes_20240308-134553.csv


rlskoeser commented Mar 8, 2024

Ran the script to update page counts based on the updated data; here's the output:

Volumes with updated page count: 1,539
        Page count unchanged: 3,216
        Missing pairtree data: 0

I'm starting a full page reindex now so search will be updated with the new content.


[updated] Page reindex completed; it looks like we're getting more cases where the METS XML references pages that don't have a text file (good thing we fixed that!). Because of how we count HathiTrust pages, this is going to cause more discrepancies in page counts (let's discuss how we want to handle them). Here's the page indexing script output.

INFO:parasolr.django.solrclient:Connecting to default Solr http://lib-solr8d-staging.princeton.edu:8983/solr/cdh_ppa
INFO:parasolr.django.solrclient:Connecting to default Solr http://lib-solr8d-staging.princeton.edu:8983/solr/cdh_ppa
 14% (288795 of 1957295) | Elapsed Time: 0:19:16 ETA:   1:51:24
WARNING:ppa.archive.models:Indexing mdp.39015031007621 pages: 39015031007621/00000239.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015031007621 pages: 39015031007621/00000240.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015031007621 pages: 39015031007621/00000241.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015031007621 pages: 39015031007621/00000242.txt referenced in METS but not found in zip file
 24% (485097 of 1957295) | Elapsed Time: 0:30:11 ETA:   1:31:36
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000805.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000806.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000807.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000808.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000809.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000810.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000811.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000812.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000813.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing mdp.39015009309371 pages: 39015009309371/00000814.txt referenced in METS but not found in zip file
 37% (725807 of 1957295) | Elapsed Time: 0:42:43 ETA:   1:12:30
 58% (1147343 of 1957295) | Elapsed Time: 1:02:07 ETA:   0:43:51
WARNING:ppa.archive.models:Indexing uc1.$b31619 pages: $b31619/UCAL_$B31619_00000177.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing uc1.$b31619 pages: $b31619/UCAL_$B31619_00000178.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing uc1.$b31619 pages: $b31619/UCAL_$B31619_00000179.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing uc1.$b31619 pages: $b31619/UCAL_$B31619_00000180.txt referenced in METS but not found in zip file
 90% (1777800 of 1957295) | Elapsed Time: 1:43:32 ETA:   0:10:27
WARNING:ppa.archive.models:Indexing uc1.$b31567 pages: $b31567/UCAL_$B31567_00000345.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing uc1.$b31567 pages: $b31567/UCAL_$B31567_00000346.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing uc1.$b31567 pages: $b31567/UCAL_$B31567_00000347.txt referenced in METS but not found in zip file
WARNING:ppa.archive.models:Indexing uc1.$b31567 pages: $b31567/UCAL_$B31567_00000348.txt referenced in METS but not found in zip file
100% (1957295 of 1957295) | Elapsed Time: 1:51:46 Time:  1:51:46
INFO:parasolr.django.solrclient:Connecting to default Solr http://lib-solr8d-staging.princeton.edu:8983/solr/cdh_ppa


mnaydan commented Mar 11, 2024

Not quite sure how to test. I looked at the CSV and I'm not sure how to read it. It looks like only 6 files were marked as size_changed FALSE and 2 as modification_time FALSE. It looks like there are 3 types of rsync_flags: `>f.st......`, `>f..t......`, and `>f+++++++++`. What do these mean? What do size_changed and modification_time refer to? (Edit: I see your explanation of the format on #453 now.)

As far as the METS-XML referencing non-existent txt files goes, it's good to know (though unfortunate) that it's more than just the one record we found on #539. As long as the error is non-fatal, maybe it makes sense to document that page counts may be inaccurate due to this weird HathiTrust abnormality, rather than put development time into working around it.

Do we still need to address your questions about how often to run this, and whether a content admin should manage it?


laurejt commented Mar 12, 2024

I'm going to make an attempt at answering this, but @rlskoeser correct me if I'm wrong on any of the details.

  • size_changed is a boolean indicating whether the file's size has changed (according to rsync)
  • modification_time is a boolean indicating whether the file's modification time has changed (according to rsync)

The last portion corresponds to the rsync log output, which is a fixed-length string of 11 characters. This string should always start with `>f` in these logs, since we're only tracking files that have been downloaded. `>` indicates that a transfer from the remote server to the local server has taken place; `f` indicates that it's a file rather than a directory, symlink, etc. The next 9 characters correspond to specific attributes.

  • `.` indicates that the attribute hasn't changed.
  • `+` indicates a new file (so all attributes will be `+`s).
  • If there has been a change, the attribute will be marked with its corresponding letter (`cstpoguax`).

The rsync attributes we're explicitly logging are `s` and `t`, which indicate that the file's size and modification time have changed. We'll generally log the following rsync flags:

  • `>f.st......`: the size and timestamp have changed
  • `>f..t......`: the timestamp has changed
  • `>f+++++++++`: a new file has been created
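The decoding above can be expressed as a small parser (a sketch, not the project's actual CSV-generation code):

```python
def parse_rsync_flags(flags: str) -> dict:
    # flags is rsync's 11-character itemized string, e.g. ">f.st......":
    # [0] update type ('>' = file received), [1] item type ('f' = file),
    # [2:11] attribute slots in the order c, s, t, p, o, g, u, a, x
    if len(flags) != 11 or flags[1] != "f":
        raise ValueError(f"unexpected rsync flag string: {flags!r}")
    new_file = flags[2:] == "+" * 9
    return {
        "new_file": new_file,
        "size_changed": new_file or flags[3] == "s",
        "modification_time": new_file or flags[4] == "t",
    }
```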

@rlskoeser

I thought it would be good to run the rsync command in staging again before I run the new excerpt script; I was curious whether there would be any changes since running rsync two weeks ago.

The script output reports: "Updated 90 files for 56 volumes"

Here's the csv report generated by the rsync script.
ppa_rsync_changes_20240318-152446.csv

@rlskoeser

Another set of changes just since yesterday! 12 volumes, including 11 njp records

ppa_rsync_changes_20240319-140803.csv

@mnaydan mnaydan changed the title As an admin, I want HathiTrust pairtree data regularly synchronized so that we have the most current version of HathiTrust content. As an admin, I want to regularly rsync and reindex full works that HathiTrust has updated so that we have the most current version of HathiTrust content. Mar 21, 2024

mnaydan commented Mar 21, 2024

After talking through our decision tree, we decided to retitle this issue to focus on full works.

Development notes (as I understand them):

  • run cron job daily at 2am
  • check for deletions
  • check whether non-deleted files changed size or timestamp
  • if full work and file change present, reindex that work and log change

Deletions and excerpts will be handled as part of #626
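The decision logic above might reduce to something like this (model and method names are illustrative, not the project's actual API; deletions are simply deferred here, per #626):

```python
def process_sync_result(work, changed: bool, deleted: bool) -> str:
    # outcome of the daily cron job for a single synchronized work
    if deleted:
        return "deferred"  # deletions (and excerpts) handled separately
    if changed and work.is_full_work:
        work.reindex_pages()
        work.log_change("content updated via HathiTrust rsync")
        return "reindexed"
    return "unchanged"
```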

@mnaydan mnaydan changed the title As an admin, I want to regularly rsync and reindex full works that HathiTrust has updated so that we have the most current version of HathiTrust content. As a developer, I want a script to update all HathiTrust content. Mar 27, 2024
@rlskoeser rlskoeser changed the title As a developer, I want a script to update all HathiTrust content. As a developer, I want a script to update all HathiTrust content so that I can refresh locally cached data with OCR improvements and other changes. Mar 27, 2024

mnaydan commented Mar 27, 2024

Since we rescoped this issue, it has already been tested as part of the other rsync investigations, and OCR was successfully updated.
