
As a developer, I want to one-time bulk fix HathiTrust excerpt page ranges from a spreadsheet so that we can pull correct page content when we reindex. #625

Closed
6 tasks done
mnaydan opened this issue Mar 21, 2024 · 15 comments

@mnaydan
Contributor

mnaydan commented Mar 21, 2024

  • adapt import script: write script to update digital pages
  • run against this spreadsheet

acceptance criteria

  • updates digital page range in database
  • creates log entry documenting the change
  • reindexes pages in Solr based on new range
  • appropriate/reasonable error handling and reporting
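For illustration, the matching-and-update loop implied by the criteria above might look like the sketch below. The CSV columns (source_id, pages_orig, new_pages_digital) follow the layout discussed later in this thread; find_record, save_record, and log_change are hypothetical stand-ins, not the actual ppa-django code:

```python
# Sketch of a bulk page-range update driven by a CSV file.
# The helper callables are placeholders for the real lookup/save/log logic.
import csv

def adjust_excerpts(csv_path, find_record, save_record, log_change):
    """Apply new digital page ranges from a CSV; return summary counts."""
    stats = {"updated": 0, "unchanged": 0, "not_found": 0, "error": 0}
    with open(csv_path, newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            # exact match on source id + original page range
            record = find_record(row["source_id"], row["pages_orig"])
            if record is None:
                print(f"No record found for source id {row['source_id']} "
                      f"and pages_orig {row['pages_orig']}")
                stats["not_found"] += 1
                continue
            # skip records whose digital range is already correct
            if record["pages_digital"] == row["new_pages_digital"]:
                stats["unchanged"] += 1
                continue
            record["pages_digital"] = row["new_pages_digital"]
            try:
                save_record(record)
            except ValueError as err:
                print(f"Error saving {row['source_id']}: {err}")
                stats["error"] += 1
            else:
                log_change(record)
                stats["updated"] += 1
    return stats
```

The summary counts map onto the script output quoted below in this thread (updated / unchanged / not found / error).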
rlskoeser added a commit that referenced this issue Mar 28, 2024
adapted from hathi_excerpt manage command

resolves #625
@rlskoeser
Contributor

@mnaydan I'm writing the script to require a CSV that includes source_id, pages_orig, and new_pages_digital. I'd also prefer that we filter the CSV before running it through the script (it should only include rows for records we want updated). Does that sound okay to you?


For testing, I downloaded the first tab from Google Sheets, filtered out all the rows where the digital page range was marked as "correct", and renamed the "new digital range" column to new_pages_digital.

In case it's useful, I used grep to filter out correct rows:

grep --invert ,correct, excerpt_updates.csv > excerpt_update_changes.csv
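A csv-aware alternative to the grep filter might look like this (a sketch; the column layout is assumed from the spreadsheet described above). Unlike grep's substring match on ",correct,", this compares whole cells, so a field that merely contains the word "correct" inside a longer value is not dropped:

```python
# Drop rows where any cell is exactly "correct" (cell-exact, unlike grep's
# substring match). Column names/positions are assumptions for illustration.
import csv

def drop_correct_rows(src_path, dest_path):
    """Copy a CSV, omitting rows flagged as correct."""
    with open(src_path, newline="") as src, \
         open(dest_path, "w", newline="") as dest:
        writer = csv.writer(dest)
        for row in csv.reader(src):
            if "correct" not in row:
                writer.writerow(row)
```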

I didn't realize that one of the rows was marked as "SUPPRESS", but it turned out to be useful for testing error handling (that record failed to save).

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser that sounds good generally, but here's the needle in the haystack: there was at least one case where the original page range changed (roman numerals) due to a typo I discovered in the original input. Will the original page range need to match the database field exactly? Would fixing the typo in the database resolve the issue? If there's quick error handling for any matches NOT found, that would be useful.

@rlskoeser
Contributor

@mnaydan the script reports on matches that are not found; I had one where the original page range was slightly different in the spreadsheet than in my copy of the database (which is probably outdated). Updating the incorrect page range in the database would resolve that problem. I thought looking for an exact match would be best.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser okay, perfect. I'll update that page range in the database, and if the script reports no missing matches then we're good.

@rlskoeser
Contributor

I ran the script in staging with this CSV file as input (generated from the Google Sheets version as noted above):
excerpt_update_changes.csv

Here is how I ran the script and the summary output:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'
No record found for source id uc1.c2641998 and pages_orig 32-33, 66

Updated 119 records. 0 unchanged, 1 not found, 1 error.

I ran this in staging without refreshing from production because I wanted it to have the recent-ish rsync changes. (I did not run rsync immediately before).

If it's helpful for testing, you could update the original pages for the not found record in the staging database and I can run this script again. I could also run rsync. At some point before we release, we may want to test the full set of steps we will be doing in production (perhaps after we fix excerpt ids): replicate production to staging, rsync, update excerpts.

If I run the script again with the same input, it recognizes that it doesn't need to make changes:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'
No record found for source id uc1.c2641998 and pages_orig 32-33, 66

Updated 0 records. 119 unchanged, 1 not found, 1 error.
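The safe rerun above suggests the script only saves when the stored range actually differs from the CSV value. A minimal sketch of that no-op detection (illustrative; the names are placeholders, not the actual management command):

```python
# Only save when the stored digital range differs from the incoming one,
# so rerunning the same CSV reports records as "unchanged".
def apply_update(record, new_pages_digital):
    """Return True if the record was changed, False if already current."""
    if record["pages_digital"] == new_pages_digital:
        return False
    record["pages_digital"] = new_pages_digital
    return True
```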

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser this is really helpful, thanks! Let me fix the errors in production and staging and then we can re-run... give me a moment.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser how is it handling a null original page number?

@rlskoeser
Contributor

@mnaydan I don't know! Probably not correctly; I didn't know that was a possible case. Let me know what it should do. At a minimum I can make sure to filter out full works when we look for matches...

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser OK, good to know. It's not a full work; it just doesn't have any physical page numbers printed on the pages! I will just fix that one manually on the backend, and then you shouldn't have to change anything with the script. I think it's the only blank.

Edit: I went in to change it and it looks like your script fixed it already in QA! There are no other excerpts associated with that work, just the one.

@rlskoeser
Contributor

@mnaydan worth checking what it did when you test the script; if it's the only excerpt from that volume it may have done the right thing.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser great minds! Ok, can we run it again? I am expecting the uc1.c2641998 error to be handled now, and 2 additional updated records.

@rlskoeser
Contributor

Regenerated a test CSV from the Google Sheet (I forgot to exclude the suppressed row) and ran again; here's the output:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes_v2.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'

Updated 3 records. 117 unchanged, 0 not found, 1 error.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser yay! This is exactly what I expected. Do you want to close and track testing the full set of steps elsewhere?

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

@rlskoeser wait, I just saw your full set of acceptance criteria. I'm clearly getting bleary-eyed after a long week... let me test all those steps.

@mnaydan
Contributor Author

mnaydan commented Mar 28, 2024

I spot-checked a few records, and everything looks great! I tested a single page, a changed range, an unchanged/correct range, discontinuous page numbers, ark:/ IDs, and the one blank original page range record... in all cases they appear in the database and are indexed as I would expect.
