Get the METS-XML and pairtree data in a shareable form so that other Princeton researchers can use it #617
Comments
For pairtree data:
HATHI_DATA = "/Users/rkoeser/workarea/ppa-excerpts"
# override rsync for dev/test
HATHITRUST_RSYNC_SERVER = "pulsys@cdh-test-prosody1"
HATHITRUST_RSYNC_PATH = "/mnt/nfs/cdh/prosody/data/ht_text_pd"
For page data from Solr as JSON: we don't have a built-in mechanism for this, and there are too many source ids in this case for a single Solr query (it exceeds the allowed request size). Here's an approach using ppa/parasolr code in the Python console: it iterates through the ids, queries Solr for each one's page content, and writes the results to a JSON file.
>>> with open('/tmp/ht_excerpt_sourceids.txt') as idfile:
...     source_ids = [sid.strip() for sid in idfile]
...
>>> source_ids[:10]
['hvd.32044090278565', 'nyp.33433081683744', 'uc1.b3924132', 'mdp.39015026482151', 'uiug.30112106245936', 'hvd.32044009576562', 'nyp.33433067294433', 'coo.31924065856167', 'uc1.ax0002627784', 'wu.89001946482']
>>> from ppa.archive.solr import PageSearchQuerySet
>>> import json
>>> psqs = PageSearchQuerySet().filter(item_type='page').only('id', 'source_id', 'order', 'label', 'content')
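>>> all_pages = []
>>> for source_id in source_ids:
...     # pass the id itself, quoted so Solr treats it as a literal value
...     current_pages = psqs.filter(source_id='"%s"' % source_id)
...     # slice to the full count to get every page for this id
...     all_pages.extend(current_pages[:current_pages.count()])
...
>>> with open('/tmp/ppa-excerpt-pages.json', 'w') as outfile:
...     # dump one list so the file is a single valid JSON document
...     json.dump(all_pages, outfile, indent=2)
...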
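If one query per source id turns out to be too slow, a possible middle ground is to group the ids into batches small enough to stay under the request size limit. The following is only a sketch in plain Python: `chunked` and `solr_or_filter` are hypothetical helper names (not part of ppa or parasolr), the batch size of 2 is arbitrary, and it assumes the ids contain no double quotes.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def solr_or_filter(field, values):
    """Build a filter string like field:("a" OR "b"), quoting each value.

    Hypothetical helper; assumes values contain no double quotes.
    """
    quoted = " OR ".join('"%s"' % value for value in values)
    return "%s:(%s)" % (field, quoted)


# a few ids from the list above, batched two at a time
source_ids = ["hvd.32044090278565", "nyp.33433081683744", "uc1.b3924132"]
filters = [solr_or_filter("source_id", batch) for batch in chunked(source_ids, 2)]
```

Each string in `filters` could then be sent as one filter query, so the number of requests drops while each request stays bounded in size.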
Uploaded the files to a location in Google Drive, gave Brian access, and emailed Brian with the folder link and brief context about the contents of the two files.
The correct rsync command is this one: ./manage.py hathi_rsync `cat ht_excerpt_sourceids.txt`
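The same command line can be assembled in Python, which makes it easy to skip blank lines in the id file before running it. This is a rough sketch: `hathi_rsync_command` is a hypothetical helper, and it assumes only what the command above shows, namely that `hathi_rsync` takes source ids as positional arguments.

```python
import shlex


def hathi_rsync_command(id_lines):
    """Return the argv list for hathi_rsync, skipping blank lines."""
    source_ids = [line.strip() for line in id_lines if line.strip()]
    return ["./manage.py", "hathi_rsync"] + source_ids


# in practice id_lines would be an open file such as ht_excerpt_sourceids.txt
argv = hathi_rsync_command(["hvd.32044090278565\n", "\n", "uc1.b3924132\n"])
# shell-safe rendering of the command, for logging or copy and paste
command_line = shlex.join(argv)
```

The `argv` list can be passed to `subprocess.run(argv, check=True)` to run the command without going through a shell.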
Our immediate use case is Brian Kernighan, who wants the METS-XML and pairtree data to play with potential solutions for the changing-excerpts HathiTrust problem. For Brian we decided the data for the items listed in this excerpt spreadsheet would suffice for his purposes, but it will be good to document the process here in case this comes up again in the future.