As an admin, I want a way to reproducibly generate a full-text corpus of all public PPA content in order to support computational research on PPA materials. #556

Closed
quadrismegistus opened this issue Oct 30, 2023 · 4 comments

quadrismegistus commented Oct 30, 2023

Adapt Vineet's script to export plain text corpus

mnaydan commented Nov 30, 2023

Developer steps before acceptance testing:

  • Get Jeri shell access to servers
  • Mirror production data to QA
  • Run index_pages command to fix blank changed-clusters record bug (#554)

Acceptance testing checklist:

  • Code runs on QA
  • Any RSE team member can run code
  • Number of json files output equals the number of public works in PPA (6,752 as of 11/30/23)
  • Suppressed records are being ignored
    • Sample suppressed IDs:
      • nyp.33433069255440
      • uc1.$b253881
      • dul1.ark:/13960/t07w7ct97
      • ien.35556004818043
      • CW0106468070
      • CB0131406785
  • Both full works and excerpts are contained in corpus
    • Sample full work IDs:
      • uc1.$b14645
      • mdp.39015003633594
      • wu.89099903650
      • loc.ark:/13960/t4rj4z67d
      • CW0117319378
      • CB0127088549
    • Sample excerpt/article IDs:
      • CW0114589903
      • CB0129112818
      • coo.31924051399685
      • aeu.ark:/13960/t1pg22p71
      • uc1.b3924132
      • uiuo.ark:/13960/t4qk01n82
  • Both HathiTrust and ECCO works are contained in corpus (should be able to test using IDs above)
  • All expected fields are present for each page (@quadrismegistus to provide list)
  • HathiTrust IDs with extra punctuation behave the same as regular IDs
    • loc.ark:/13960/t0jt08550
    • dul1.ark:/13960/t5j970j9d
  • Works with no pages (e.g., as a result of a bug (#539)) create a metadata entry but no json file
    • uga1.32108002998303
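
A few of the checks above (file count, suppressed records ignored, sample works present) could be scripted rather than spot-checked by hand. The following is a minimal sketch only, not the corpus script itself. It assumes one `.json` file per work in a single output directory, named after the work ID; the thread does not state the real naming scheme, so the `decode` hook is a hypothetical placeholder for whatever mapping the script uses (HathiTrust IDs containing `:` or `/` cannot be used as file names unchanged). The function names are also made up for illustration.

```python
from pathlib import Path


def exported_ids(corpus_dir, decode=lambda stem: stem):
    """Return the set of work IDs that have a JSON file in corpus_dir.

    `decode` maps a file stem back to a source ID. The identity default is
    an assumption; swap in the real file-name-to-ID mapping if it differs.
    """
    return {decode(path.stem) for path in Path(corpus_dir).glob("*.json")}


def check_corpus(found, expected_count, suppressed, required):
    """Return a list of failure messages; an empty list means all checks pass."""
    failures = []
    # Checklist item: number of json files equals number of public works.
    if len(found) != expected_count:
        failures.append(
            f"expected {expected_count} JSON files, found {len(found)}"
        )
    # Checklist item: suppressed records are ignored.
    for work_id in sorted(set(suppressed) & found):
        failures.append(f"suppressed work was exported: {work_id}")
    # Checklist item: sample full works and excerpts are in the corpus.
    for work_id in sorted(set(required) - found):
        failures.append(f"expected work is missing: {work_id}")
    return failures
```

To use it, call `exported_ids()` on the output directory and pass the result to `check_corpus()` together with the expected count (6,752 as of 11/30/23) and the sample ID lists from the checklist. Returning messages instead of raising on the first problem means one run reports every failed item.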

Things to recheck after development:

  • Confirm bugfix to (#539) resolves the problem in the corpus script output in the way we expect (should pull pages and save a json file)
  • Confirm fix to excerpt range changes in HathiTrust (#560) resolves the problem in the corpus script output in the way we expect (should pull correct/relevant ranges)

@jerielizabeth

Skipped the test for works with no pages (ID: uga1.32108002998303) because the record was suppressed during staging setup.

@jerielizabeth

Moving additional testing to the related bug issues. Any changes that require retesting this script should be batched, given the testing effort involved.

@jerielizabeth

all tests passed!! 🎊

@rlskoeser changed the title from "Adapt Vineet's script to export plain text corpus" to "As an admin, I want a way to reproducibly generate a full-text corpus of all public PPA content in order to support computational research on PPA materials." on Jan 9, 2024