
As a developer, I want to one-time bulk fix HathiTrust excerpt page ranges from a spreadsheet so that we can pull correct page content when we reindex. #625

Closed
6 tasks done
mnaydan opened this issue Mar 21, 2024 · 15 comments

@mnaydan
Contributor

mnaydan commented Mar 21, 2024

  • adapt import script: write script to update digital pages
  • run against this spreadsheet

acceptance criteria

  • updates digital page range in database
  • creates log entry documenting the change
  • reindexes pages in Solr based on new range
  • appropriate/reasonable error handling and reporting
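For illustration, the matching-and-update loop implied by the criteria above might look like the sketch below. The CSV columns (source_id, pages_orig, new_pages_digital) follow the layout discussed later in this thread; find_record, save_record, and log_change are hypothetical stand-ins, not the actual ppa-django code:

```python
# Sketch of a bulk page-range update driven by a CSV file.
# The helper callables are placeholders for the real lookup/save/log logic.
import csv

def adjust_excerpts(csv_path, find_record, save_record, log_change):
    """Apply new digital page ranges from a CSV; return summary counts."""
    stats = {"updated": 0, "unchanged": 0, "not_found": 0, "error": 0}
    with open(csv_path, newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            # exact match on source id + original page range
            record = find_record(row["source_id"], row["pages_orig"])
            if record is None:
                print(f"No record found for source id {row['source_id']} "
                      f"and pages_orig {row['pages_orig']}")
                stats["not_found"] += 1
                continue
            # skip records whose digital range is already correct
            if record["pages_digital"] == row["new_pages_digital"]:
                stats["unchanged"] += 1
                continue
            record["pages_digital"] = row["new_pages_digital"]
            try:
                save_record(record)
            except ValueError as err:
                print(f"Error saving {row['source_id']}: {err}")
                stats["error"] += 1
            else:
                log_change(record)
                stats["updated"] += 1
    return stats
```

The summary counts map onto the script output quoted below in this thread (updated / unchanged / not found / error).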
rlskoeser added a commit that referenced this issue Mar 28, 2024
adapted from hathi_excerpt manage command

resolves #625
@rlskoeser
Contributor

@mnaydan I'm writing the script to require a CSV that includes source_id, pages_orig, and new_pages_digital. I'd also prefer that we filter the CSV before running it through the script (it should only include rows for records we want updated). Does that sound okay to you?


For testing, I downloaded the first tab from Google Sheets, filtered out all the rows where the digital page range was marked as "correct", and renamed the "new digital range" column to new_pages_digital.

In case it's useful, I used grep to filter out correct rows:

grep --invert ,correct, excerpt_updates.csv > excerpt_update_changes.csv
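A csv-aware alternative to the grep filter might look like this (a sketch; the column layout is assumed from the spreadsheet described above). Unlike grep's substring match on ",correct,", this compares whole cells, so a field that merely contains the word "correct" inside a longer value is not dropped:

```python
# Drop rows where any cell is exactly "correct" (cell-exact, unlike grep's
# substring match). Column names/positions are assumptions for illustration.
import csv

def drop_correct_rows(src_path, dest_path):
    """Copy a CSV, omitting rows flagged as correct."""
    with open(src_path, newline="") as src, \
         open(dest_path, "w", newline="") as dest:
        writer = csv.writer(dest)
        for row in csv.reader(src):
            if "correct" not in row:
                writer.writerow(row)
```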

I didn't realize that one of the rows was marked as "SUPPRESS", but it turned out to be useful for testing error handling (that record failed to save).

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser that sounds good generally, but here's the needle in the haystack: there was at least one case where the original page range changed (roman numerals) due to a typo I discovered in the original input. Will the original page range need to match the database field exactly? Would fixing the typo in the database resolve the issue? If there's quick error handling for any matches NOT found, that would be useful.

@rlskoeser
Contributor

@mnaydan the script reports on matches that are not found; I had one where the original page range was slightly different in the spreadsheet than in my copy of the database (which is probably outdated). Updating the incorrect page range in the database would resolve that problem. I thought looking for an exact match would be best.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser okay, perfect. I'll update that page range in the database, and if the script reports no missing matches then we're good.

@rlskoeser
Contributor

I ran the script in staging with this CSV file as input (generated from the Google Sheets version as noted above):
excerpt_update_changes.csv

Here is how I ran the script and the summary output:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'
No record found for source id uc1.c2641998 and pages_orig 32-33, 66

Updated 119 records. 0 unchanged, 1 not found, 1 error.

I ran this in staging without refreshing from production because I wanted it to have the recent-ish rsync changes. (I did not run rsync immediately before).

If it's helpful for testing, you could update the original pages for the not found record in the staging database and I can run this script again. I could also run rsync. At some point before we release, we may want to test the full set of steps we will be doing in production (perhaps after we fix excerpt ids): replicate production to staging, rsync, update excerpts.

If I run the script again with the same input, it recognizes that it doesn't need to make changes:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'
No record found for source id uc1.c2641998 and pages_orig 32-33, 66

Updated 0 records. 119 unchanged, 1 not found, 1 error.
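The safe rerun above suggests the script only saves when the stored range actually differs from the CSV value. A minimal sketch of that no-op detection (illustrative; the names are placeholders, not the actual management command):

```python
# Only save when the stored digital range differs from the incoming one,
# so rerunning the same CSV reports records as "unchanged".
def apply_update(record, new_pages_digital):
    """Return True if the record was changed, False if already current."""
    if record["pages_digital"] == new_pages_digital:
        return False
    record["pages_digital"] = new_pages_digital
    return True
```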

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser this is really helpful, thanks! Let me fix the errors in production and staging and then we can re-run... give me a moment.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser how is it handling a null original page number?

@rlskoeser
Contributor

@mnaydan I don't know! Probably not correctly; I didn't know that was a possible case. Let me know what it should do. At a minimum I can make sure to filter out full works when we look for matches...

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser OK, good to know. It's not a full work; it just doesn't have any physical page numbers printed on the pages! I will just fix that one manually on the backend, and then you shouldn't have to change anything with the script. I think it's the only blank.

Edit: I went in to change it and it looks like your script fixed it already in QA! There are no other excerpts associated with that work, just the one.

@rlskoeser
Contributor

@mnaydan worth checking what it did when you test the script; if it's the only excerpt from that volume it may have done the right thing.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser great minds! Ok, can we run it again? I am expecting the uc1.c2641998 error to be handled now, and 2 additional updated records.

@rlskoeser
Contributor

Regenerated a test CSV from the Google Sheet (I forgot to exclude the suppressed row) and ran again; here's the output:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes_v2.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'

Updated 3 records. 117 unchanged, 0 not found, 1 error.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser yay! This is exactly what I expected. Do you want to close and track testing the full set of steps elsewhere?

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser wait, I just saw your full set of acceptance criteria. I'm clearly getting bleary-eyed after a long week... let me test all those steps.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

I spot-checked a few records, and everything looks great! I tested a single page, a changed range, an unchanged/correct range, discontinuous page numbers, ark:/ IDs, and the one blank original page range record... in all cases they appear in the database and are indexed as I would expect.
