Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change stable ID to sourceID + original page range rather than digital page range #555

Closed
19 tasks done
mnaydan opened this issue Sep 27, 2023 · 9 comments
Closed
19 tasks done
Assignees
Labels

Comments

@mnaydan
Copy link
Contributor

mnaydan commented Sep 27, 2023

Now that we know that HathiTrust periodically rescans material (leading to changes in the digital page range), source ID + digital page range is no longer a stable ID as we had previously thought. It might make sense for us to move to source ID + original page range as the stable ID instead.

testing notes

  • public url for excerpt is now based on first page in original page range instead of digital
    • works for numeric page range
    • works for alpha page range (e.g. roman numerals)
    • works for single page
    • works for page range delimited by comma instead of dash (discontinuous range)
  • changing digital pages does not change the public url
  • accessing the previous url based on first page in digital pages should redirect to the new url
  • accessing the record from the archive search links to the correct new url with first page from original pages
  • link to source version uses the first page from the digital page range, not original
    • HathiTrust
    • Gale
  • confirm that text corpus uses the corrected ids
  • saving a record with same source id and same first original page as another work in the database should result in a validation error

dev notes

  • add a db field to keep track of the old identifier
  • write a data migration to populate old identifier field with current digital page range start
  • modify excerpts to use original page start for id
  • update first_page method to use original pages instead of digital (check ramifications...)
  • update DigitizedWorkDetailView.get_queryset to filter on original page instead of digital; may need to use a regex to make sure we don't match inaccurate partial page range (e.g. p1 should not match p10)
  • add an additional case to DigitizedWorkDetailView.get_queryset to check for a match using the old logic (digital page range) and set redirect url if found
@quadrismegistus
Copy link
Contributor

quadrismegistus commented Nov 28, 2023

Worth making a note on this separate issue that for the export_textcorpus.py command, tracked #561, we are using as a new work_id: {source_id}_{first page_orig}.

This may change depending on the results and decisions of this issue. It seems safest in this situation to use {source_id}_{first page_orig} as a work_id because that will remain stable to connect up to any new ID system we use (since both source_id and page_orig ideas aren't subject to the changes in the digital page range).

@mnaydan mnaydan added the bug label Jan 16, 2024
@mnaydan
Copy link
Contributor Author

mnaydan commented Feb 28, 2024

We'll want to redirect the old URLs and can adapt existing logic we have for that.

@mnaydan
Copy link
Contributor Author

mnaydan commented Apr 8, 2024

@rlskoeser thank you for the helpful acceptance testing list -- I tested the items and this generally looks good! There are 2 issues I found, and one FYI:

  1. It would be great to also change the IDs on the backend to original page (right now they are the old IDs, based on digital page). The main place I think these appear are on the detail page, and then also the log page. Do you agree that is important to change?
  2. I got a 404 error for https://test-prosody.cdh.princeton.edu/archive/CB0128774422-p177/. From my preliminary sleuthing, I think this is because it is a work with a single page. https://test-prosody.cdh.princeton.edu/archive/mdp.39015060429308-p126/ is also a single page work and also 404 errors.
  3. An FYI that I discovered we do have 9 works with no original page numbers. These resolve to just the work ID. These 9 works are all unique IDs (there aren't two excerpts from a single volume), so I don't think this will cause an issue with our current data and the existing behavior should be functional.

I'm ready whenever you can generate the text corpus to test that last piece.

@rlskoeser
Copy link
Contributor

@mnaydan thanks for testing.

responses:

  1. Can you provide more detail to help me understand? I'm not sure where we're showing the old ids, but if we are showing them I agree we should change them.
  2. You must be right - it looks like my regex for first page must not be working correctly for single pages. Will remedy.
  3. I stumbled over one of these during development and wondered if it was a data error! Good to know. I added a database constraint that should require source id + original page number to be unique - do you think this would ever cause a problem? do you want to test it?

New question, potentially related: we don't currently do any validation on the original page number field; that means it would be possible to add a range that can't actually be used for the public url. Is this a concern?

@mnaydan
Copy link
Contributor Author

mnaydan commented Apr 8, 2024

  1. These are screenshots of the two places on the backend where the ID appears with the old digital page number/range, rather than the original one:
    Screenshot 2024-04-08 at 1 26 23 PM
    Screenshot 2024-04-08 at 1 32 56 PM

In response to number 3, not a data error but an anomaly -- those works don't have any page numbers printed on them. The only scenario I can imagine as a problem is if there are two excerpts from the same volume both with no original page numbers. As far as I can tell this doesn't occur in our current data or the data we're planning on adding, so I think we should leave it.

However, I did just test the source id + original page number database constraint. I used CW0116616499 which has multiple excerpts from it. I tested with CW0116616499 (244-247) and CW0116616499 (251-254). I changed the original page number of the second excerpt to the exact same range as the first (244-247) and I, as expected, got a validation error telling me the combination already exists. HOWEVER, I tweaked it to (244-248) (same first page, different last), and it let me save it. Both URLs then create 500 errors on the public site. Do you think this is worth fixing? I don't imagine we'd encounter this error really.

What sorts of things would be added that couldn't be used for a public url - like punctuation marks due to human error? I'm inclined not to add validation if it means more work because I don't think this would be a common enough problem.

@rlskoeser
Copy link
Contributor

Thanks for the screenshots - my eyes totally glazed over that! And thank you for catching, this is the string method for the digitized work object, which we use to represent instances various places. Much better for it to use the original page number.

Your validation test is good - it worries me that there is a potential to create a 500 error which would be non-obvious, I got distracted to see if there's an easy way to check for this — I'm sure I've done something similar but I forget where. (We can't use database constraints.) I'm going to look a little more on that one.

I agree, the other things seem unlikely - I was thinking that if someone entered an inferred original page range with brackets it might end up looking weird, but pretty edge edge case.

@mnaydan
Copy link
Contributor Author

mnaydan commented Apr 9, 2024

Almost there! Single page URL issue appears fixed, and the text corpus appears to be using the correct IDs.

I retested the 500 error conditions, and it does result in a validation error instead now. However, it appears the validation fix is now overvalidating. When I change the digital page on a record and try to save it (leaving original page untouched), it says I can't save the record because that original page number + ID combo is not unique.

@rlskoeser
Copy link
Contributor

@mnaydan thanks for checking! I had logic in the code for that case but it wasn't doing the right thing. Added a test case and fixed, please confirm.

@mnaydan
Copy link
Contributor Author

mnaydan commented Apr 9, 2024

Fixed! I tested with a Gale item and a Hathi item.

@mnaydan mnaydan closed this as completed Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants