
Get the METS-XML and pairtree data in a shareable form so that other Princeton researchers can use it #617

Closed
mnaydan opened this issue Mar 8, 2024 · 4 comments


mnaydan commented Mar 8, 2024

Our immediate use case is Brian Kernighan, who wants the METS-XML and pairtree data to play with potential solutions for the changing-excerpts HathiTrust problem. For Brian we decided the data for the items listed in this excerpt spreadsheet would suffice for his purposes, but it will be good to document the process here in case this comes up again in the future.

rlskoeser commented:

For the pairtree data:

  • used the spreadsheet to generate a text file of source ids, one id per line (see the sketch after this list)
  • configured my local dev instance of ppa to point to a new, empty hathi data directory:
HATHI_DATA = "/Users/rkoeser/workarea/ppa-excerpts"
  • already had my local dev instance of ppa configured to use the staging site as my remote rsync source (instead of the actual hathitrust servers), with this config:
# override rsync for dev/test
HATHITRUST_RSYNC_SERVER = "pulsys@cdh-test-prosody1"
HATHITRUST_RSYNC_PATH = "/mnt/nfs/cdh/prosody/data/ht_text_pd"
  • used the text file as input to the hathi_rsync manage command to synchronize content from the staging server to the new local excerpt dir: cat ht_excerpt_sourceids.txt | ./manage.py hathi_rsync (see the corrected command in the final comment below)
  • changed directory to the parent of the temporary hathi data directory and created a tar file of the contents (tar -cvf ppa-excerpts.tar ppa-excerpts), then gzipped the tar file (gzip ppa-excerpts.tar)
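
A minimal sketch of the first step, assuming the excerpt spreadsheet is exported as CSV with a source_id column (the filename and column name here are assumptions, not taken from the spreadsheet itself):

import csv

# hypothetical filename and column name; adjust to the actual spreadsheet export
with open("ppa-excerpts.csv", newline="") as csvfile, \
        open("ht_excerpt_sourceids.txt", "w") as outfile:
    for row in csv.DictReader(csvfile):
        # one HathiTrust source id per line, e.g. hvd.32044090278565
        outfile.write(row["source_id"].strip() + "\n")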

rlskoeser commented:

For page data from Solr as JSON:

We don't have a built-in mechanism for this, and the number of source ids in this case is too large for a single solr query (it exceeds the allowed request size). Here's an approach using ppa/parasolr code in the python console: iterate through the ids, query solr for the page content of each work, and append the results to a json file.

  • scp the text file with source ids to the staging server
  • ssh to the staging server, su to conan, and start a python/django shell: ./manage.py shell
>>> with open('/tmp/ht_excerpt_sourceids.txt') as idfile:
...    source_ids = [sid.strip() for sid in idfile]
...
>>> source_ids[:10]
['hvd.32044090278565', 'nyp.33433081683744', 'uc1.b3924132', 'mdp.39015026482151', 'uiug.30112106245936', 'hvd.32044009576562', 'nyp.33433067294433', 'coo.31924065856167', 'uc1.ax0002627784', 'wu.89001946482']
>>> from ppa.archive.solr import PageSearchQuerySet
>>> import json
>>> psqs = PageSearchQuerySet().filter(item_type='page').only('id', 'source_id', 'order', 'label', 'content')
>>> with open('/tmp/ppa-excerpt-pages.json', 'w') as outfile:
...     for source_id in source_ids:
...         current_pages = psqs.filter(source_id=source_id)
...         json.dump(list(current_pages[:current_pages.count()]), outfile, indent=2)
...
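
Note that because json.dump is called once per source id, the output file contains one JSON array per work concatenated back-to-back, not a single JSON document. A minimal sketch of a reader for that layout, using json.JSONDecoder.raw_decode (this helper is not part of ppa or parasolr, just an illustration):

import json

def iter_concatenated_json(path):
    # yield each JSON document (here, a list of page dicts) from a file
    # of back-to-back JSON arrays
    decoder = json.JSONDecoder()
    with open(path) as f:
        text = f.read()
    idx = 0
    while idx < len(text):
        # skip any whitespace between documents
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        doc, idx = decoder.raw_decode(text, idx)
        yield doc

# flatten into a single list of page records
pages = [page for doc in iter_concatenated_json("/tmp/ppa-excerpt-pages.json")
         for page in doc]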

rlskoeser commented:

Uploaded the files to a folder in Google Drive, gave Brian access, and emailed him the folder link with brief context about the contents of the two files.

rlskoeser commented:

Correction: the rsync command should be this one (hathi_rsync takes the source ids as arguments rather than on stdin, so the piped version above doesn't work):

./manage.py hathi_rsync `cat ht_excerpt_sourceids.txt`
