Change stable ID to sourceID + original page range rather than digital page range #555

mnaydan · 2023-09-27T20:33:26Z

Now that we know that HathiTrust periodically rescans material (leading to changes in the digital page range), source ID + digital page range is no longer a stable ID as we had previously thought. It might make sense for us to move to source ID + original page range as the stable ID instead.

testing notes

dev notes

add a db field to keep track of the old identifier
write a data migration to populate old identifier field with current digital page range start
modify excerpts to use original page start for id
update first_page method to use original pages instead of digital (check ramifications...)
update DigitizedWorkDetailView.get_queryset to filter on original page instead of digital; may need to use a regex to make sure we don't match inaccurate partial page range (e.g. p1 should not match p10)
add an additional case to DigitizedWorkDetailView.get_queryset to check for a match using the old logic (digital page range) and set redirect url if found

The text was updated successfully, but these errors were encountered:

quadrismegistus · 2023-11-28T19:44:25Z

Worth making a note on this separate issue that for the export_textcorpus.py command, tracked #561, we are using as a new work_id: {source_id}_{first page_orig}.

This may change depending on the results and decisions of this issue. It seems safest in this situation to use {source_id}_{first page_orig} as a work_id because that will remain stable to connect up to any new ID system we use (since both source_id and page_orig ideas aren't subject to the changes in the digital page range).

mnaydan · 2024-02-28T17:07:57Z

We'll want to redirect the old URLs and can adapt existing logic we have for that.

ref #555

mnaydan · 2024-04-08T17:10:01Z

@rlskoeser thank you for the helpful acceptance testing list -- I tested the items and this generally looks good! There are 2 issues I found, and one FYI:

It would be great to also change the IDs on the backend to original page (right now they are the old IDs, based on digital page). The main place I think these appear are on the detail page, and then also the log page. Do you agree that is important to change?
I got a 404 error for https://test-prosody.cdh.princeton.edu/archive/CB0128774422-p177/. From my preliminary sleuthing, I think this is because it is a work with a single page. https://test-prosody.cdh.princeton.edu/archive/mdp.39015060429308-p126/ is also a single page work and also 404 errors.
An FYI that I discovered we do have 9 works with no original page numbers. These resolve to just the work ID. These 9 works are all unique IDs (there aren't two excerpts from a single volume), so I don't think this will cause an issue with our current data and the existing behavior should be functional.

I'm ready whenever you can generate the text corpus to test that last piece.

rlskoeser · 2024-04-08T17:24:34Z

@mnaydan thanks for testing.

responses:

Can you provide more detail to help me understand? I'm not sure where we're showing the old ids, but if we are showing them I agree we should change them.
You must be right - it looks like my regex for first page must not be working correctly for single pages. Will remedy.
I stumbled over one of these during development and wondered if it was a data error! Good to know. I added a database constraint that should require source id + original page number to be unique - do you think this would ever cause a problem? do you want to test it?

New question, potentially related: we don't currently do any validation on the original page number field; that means it would be possible to add a range that can't actually be used for the public url. Is this a concern?

mnaydan · 2024-04-08T17:50:18Z

These are screenshots of the two places on the backend where the ID appears with the old digital page number/range, rather than the original one:

In response to number 3, not a data error but an anomaly -- those works don't have any page numbers printed on them. The only scenario I can imagine as a problem is if there are two excerpts from the same volume both with no original page numbers. As far as I can tell this doesn't occur in our current data or the data we're planning on adding, so I think we should leave it.

However, I did just test the source id + original page number database constraint. I used CW0116616499 which has multiple excerpts from it. I tested with CW0116616499 (244-247) and CW0116616499 (251-254). I changed the original page number of the second excerpt to the exact same range as the first (244-247) and I, as expected, got a validation error telling me the combination already exists. HOWEVER, I tweaked it to (244-248) (same first page, different last), and it let me save it. Both URLs then create 500 errors on the public site. Do you think this is worth fixing? I don't imagine we'd encounter this error really.

What sorts of things would be added that couldn't be used for a public url - like punctuation marks due to human error? I'm inclined not to add validation if it means more work because I don't think this would be a common enough problem.

rlskoeser · 2024-04-08T20:46:56Z

Thanks for the screenshots - my eyes totally glazed over that! And thank you for catching, this is the string method for the digitized work object, which we use to represent instances various places. Much better for it to use the original page number.

Your validation test is good - it worries me that there is a potential to create a 500 error which would be non-obvious, I got distracted to see if there's an easy way to check for this — I'm sure I've done something similar but I forget where. (We can't use database constraints.) I'm going to look a little more on that one.

I agree, the other things seem unlikely - I was thinking that if someone entered an inferred original page range with brackets it might end up looking weird, but pretty edge edge case.

ref #555

mnaydan · 2024-04-09T13:24:38Z

Almost there! Single page URL issue appears fixed, and the text corpus appears to be using the correct IDs.

I retested the 500 error conditions, and it does result in a validation error instead now. However, it appears the validation fix is now overvalidating. When I change the digital page on a record and try to save it (leaving original page untouched), it says I can't save the record because that original page number + ID combo is not unique.

ref #555

rlskoeser · 2024-04-09T14:38:11Z

@mnaydan thanks for checking! I had logic in the code for that case but it wasn't doing the right thing. Added a test case and fixed, please confirm.

mnaydan · 2024-04-09T18:12:44Z

Fixed! I tested with a Gale item and a Hathi item.

mnaydan added the bug label Jan 16, 2024

jerielizabeth assigned laurejt Mar 18, 2024

rlskoeser assigned rlskoeser and unassigned laurejt Mar 28, 2024

rlskoeser mentioned this issue Apr 4, 2024

Update excerpt ids and urls to use original pages to ensure they are stable ids #635

Merged

rlskoeser added a commit that referenced this issue Apr 5, 2024

Link to original source must use first digital page, not original

2065c1e

ref #555

rlskoeser added the awaiting testing label Apr 5, 2024

mnaydan added tested needs attention and removed awaiting testing labels Apr 8, 2024

rlskoeser added a commit that referenced this issue Apr 8, 2024

Adjust and test view regex for excerpt with single page

5d5ed2a

ref #555

rlskoeser added a commit that referenced this issue Apr 8, 2024

Update DigitizedWork string method to use original pages

c9e4f86

ref #555

rlskoeser added a commit that referenced this issue Apr 8, 2024

Add custom validation to ensure source id + first page orig is unique

83f6a63

ref #555

rlskoeser mentioned this issue Apr 8, 2024

additional changes related to excerpt stable id based on first original page #637

Merged

rlskoeser added awaiting testing and removed tested needs attention labels Apr 8, 2024

mnaydan added tested needs attention and removed awaiting testing labels Apr 9, 2024

rlskoeser added a commit that referenced this issue Apr 9, 2024

Exclude current object when validating first page unique for source

ff23f31

ref #555

rlskoeser added awaiting testing and removed tested needs attention labels Apr 9, 2024

mnaydan closed this as completed Apr 9, 2024

mnaydan removed the awaiting testing label Apr 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change stable ID to sourceID + original page range rather than digital page range #555

Change stable ID to sourceID + original page range rather than digital page range #555

mnaydan commented Sep 27, 2023 •

edited

quadrismegistus commented Nov 28, 2023 •

edited

mnaydan commented Feb 28, 2024

mnaydan commented Apr 8, 2024

rlskoeser commented Apr 8, 2024

mnaydan commented Apr 8, 2024 •

edited

rlskoeser commented Apr 8, 2024

mnaydan commented Apr 9, 2024

rlskoeser commented Apr 9, 2024

mnaydan commented Apr 9, 2024

Change stable ID to sourceID + original page range rather than digital page range #555

Change stable ID to sourceID + original page range rather than digital page range #555

Comments

mnaydan commented Sep 27, 2023 • edited

testing notes

dev notes

quadrismegistus commented Nov 28, 2023 • edited

mnaydan commented Feb 28, 2024

mnaydan commented Apr 8, 2024

rlskoeser commented Apr 8, 2024

mnaydan commented Apr 8, 2024 • edited

rlskoeser commented Apr 8, 2024

mnaydan commented Apr 9, 2024

rlskoeser commented Apr 9, 2024

mnaydan commented Apr 9, 2024

mnaydan commented Sep 27, 2023 •

edited

quadrismegistus commented Nov 28, 2023 •

edited

mnaydan commented Apr 8, 2024 •

edited