
Get the METS-XML and pairtree data in a shareable form so that other Princeton researchers can use it #617

Closed
mnaydan opened this issue Mar 8, 2024 · 4 comments


mnaydan commented Mar 8, 2024

Our immediate use case is Brian Kernighan, who wants the METS-XML and pairtree data to play with potential solutions for the changing-excerpts HathiTrust problem. For Brian we decided the data for the items listed in this excerpt spreadsheet would suffice for his purposes, but it will be good to document the process here in case this comes up again in the future.

rlskoeser commented:

For the pairtree data:

  • used the spreadsheet to generate a text file of source ids, one id per line (see the sketch after this list)
  • configured my local dev instance of ppa to point to a new, empty hathi data directory:
HATHI_DATA = "/Users/rkoeser/workarea/ppa-excerpts"
  • already had my local dev instance of ppa configured to use the staging site as my remote rsync source (instead of the actual hathitrust servers), with this config:
# override rsync for dev/test
HATHITRUST_RSYNC_SERVER = "pulsys@cdh-test-prosody1"
HATHITRUST_RSYNC_PATH = "/mnt/nfs/cdh/prosody/data/ht_text_pd"
  • used the text file as input to the hathi_rsync manage command to synchronize content from the staging server to the new local excerpt dir: cat ht_excerpt_sourceids.txt | ./manage.py hathi_rsync (see the corrected command in the final comment below)
  • changed directory to the parent of the temporary hathi data directory and created a tar file of the contents (tar -cvf ppa-excerpts.tar ppa-excerpts), then gzipped the tar file (gzip ppa-excerpts.tar)
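
A minimal sketch of the first step, assuming the excerpt spreadsheet is exported as CSV with a source_id column (the filename and column name here are assumptions, not taken from the spreadsheet itself):

import csv

# hypothetical filename and column name; adjust to the actual spreadsheet export
with open("ppa-excerpts.csv", newline="") as csvfile, \
        open("ht_excerpt_sourceids.txt", "w") as outfile:
    for row in csv.DictReader(csvfile):
        # one HathiTrust source id per line, e.g. hvd.32044090278565
        outfile.write(row["source_id"].strip() + "\n")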

rlskoeser commented:

For page data from Solr as JSON:

We don't have a built-in mechanism for this, and the number of source ids in this case is too large for a single solr query (it exceeds the allowed request size). Here's an approach using ppa/parasolr code in the python console: iterate through the ids, query solr for the page content of each work, and append the results to a json file.

  • scp the text file with source ids to the staging server
  • ssh to the staging server, su to conan, and start a python/django shell: ./manage.py shell
>>> with open('/tmp/ht_excerpt_sourceids.txt') as idfile:
...    source_ids = [sid.strip() for sid in idfile]
...
>>> source_ids[:10]
['hvd.32044090278565', 'nyp.33433081683744', 'uc1.b3924132', 'mdp.39015026482151', 'uiug.30112106245936', 'hvd.32044009576562', 'nyp.33433067294433', 'coo.31924065856167', 'uc1.ax0002627784', 'wu.89001946482']
>>> from ppa.archive.solr import PageSearchQuerySet
>>> import json
>>> psqs = PageSearchQuerySet().filter(item_type='page').only('id', 'source_id', 'order', 'label', 'content')
>>> with open('/tmp/ppa-excerpt-pages.json', 'w') as outfile:
...     for source_id in source_ids:
...         current_pages = psqs.filter(source_id=source_id)
...         json.dump(list(current_pages[:current_pages.count()]), outfile, indent=2)
...
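
Note that because json.dump is called once per source id, the output file contains one JSON array per work concatenated back-to-back, not a single JSON document. A minimal sketch of a reader for that layout, using json.JSONDecoder.raw_decode (this helper is not part of ppa or parasolr, just an illustration):

import json

def iter_concatenated_json(path):
    # yield each JSON document (here, a list of page dicts) from a file
    # of back-to-back JSON arrays
    decoder = json.JSONDecoder()
    with open(path) as f:
        text = f.read()
    idx = 0
    while idx < len(text):
        # skip any whitespace between documents
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        doc, idx = decoder.raw_decode(text, idx)
        yield doc

# flatten into a single list of page records
pages = [page for doc in iter_concatenated_json("/tmp/ppa-excerpt-pages.json")
         for page in doc]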

rlskoeser commented:

Uploaded the files to a folder in Google Drive, gave Brian access, and emailed him the folder link with brief context about the contents of the two files.

rlskoeser commented:

Correction: the rsync command should be this one (hathi_rsync takes the source ids as arguments rather than on stdin, so the piped version above doesn't work):

./manage.py hathi_rsync `cat ht_excerpt_sourceids.txt`
