As an admin, I want a bulk import of content from Gale/ECCO so that I can add content to the site that is not available from HathiTrust. #369

Closed
4 tasks done
rlskoeser opened this issue Jun 3, 2021 · 5 comments

rlskoeser commented Jun 3, 2021

testing notes

Review in django admin:

  • filter by source = Gale to see everything imported from gale
  • confirm metadata is imported (provisional for now since we're not using MARC yet)
  • confirm there is a log entry in the record history documenting import by script
  • confirm that collection membership is set based on the flags in the import csv file

Review in public site archive search and detail pages — I think this will be most fully tested based on the user stories for public site functionality. If you want to test and review independently, I could give you access to the test Solr instance so you can see how page content is being indexed.

dev notes

New gale_import script equivalent to existing hathi_import script

  • Write new API wrapper code analogous to Hathi bib api wrapper. (No need to store content locally like we do with Hathi pairtree data — index directly from the API feed, since that’s how they provide it)
  • new add_from_gale method on DigitizedWork analogous to add_from_hathi
  • Add Gale as an option for DigitizedWork source field
  • The script will take a CSV file, and in addition to standard import, should set collection assignment based on the spreadsheet
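
A minimal sketch of the CSV-driven collection assignment, kept free of Django so it can be read in isolation. The spreadsheet layout here is an assumption: an `ID` column plus one column per collection, with `x` marking membership. The helper name `collections_by_id` and the collection names are hypothetical, not taken from the actual script or spreadsheet.

```python
import csv
import io


def collections_by_id(csv_file, id_column="ID", flag_value="x"):
    """Map each source ID to the collection names flagged in its row.

    Assumption: every column other than `id_column` is a collection
    column, and a row marks membership with `flag_value`.
    """
    membership = {}
    for row in csv.DictReader(csv_file):
        source_id = (row.get(id_column) or "").strip()
        if not source_id:
            continue  # skip rows without an ID
        membership[source_id] = [
            name
            for name, value in row.items()
            # ignore the ID column and any non-string leftovers from ragged rows
            if name and name != id_column
            and isinstance(value, str)
            and value.strip().lower() == flag_value
        ]
    return membership


# Hypothetical sample data; real column names may differ.
sample = io.StringIO(
    "ID,Collection A,Collection B\n"
    "CW0114965944,x,\n"
    "CB0127365931,x,x\n"
)
print(collections_by_id(sample))
```

In the real script, the resulting names would presumably be looked up as collection records and attached to each newly imported work; that step depends on the project's models and is left out here.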

the following (and more) may need to be refactored as part of this task:

  • Page.page_index_data should be split out into sub methods for HT (current logic) and gale
  • DigitizedWork.count_pages
  • DigitizedWork.page_index_data
  • DigitizedWork.get_metadata (assumes HT bib API)
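
One possible shape for the `page_index_data` split, sketched with a plain dataclass standing in for `DigitizedWork`. The source values (`"hathi"`, `"gale"`), the function names, and the placeholder page dictionaries are all assumptions for illustration; the real methods would read pairtree data or the Gale API respectively.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Stand-in for DigitizedWork, reduced to what the sketch needs."""
    source_id: str
    source: str  # hypothetical values: "hathi" or "gale"


def hathi_page_index_data(work):
    # Placeholder for the existing logic that reads local pairtree data.
    yield {"work_id": work.source_id, "source": "hathi"}


def gale_page_index_data(work):
    # Placeholder for logic that would read page content from the API.
    yield {"work_id": work.source_id, "source": "gale"}


# Registry of source-specific generators, keyed by source value.
PAGE_INDEXERS = {
    "hathi": hathi_page_index_data,
    "gale": gale_page_index_data,
}


def page_index_data(work):
    """Dispatch to the source-specific generator; unknown sources yield nothing."""
    indexer = PAGE_INDEXERS.get(work.source)
    if indexer is None:
        return iter(())
    return indexer(work)


print(list(page_index_data(Work("CW0114965944", "gale"))))
```

A lookup table keeps the per-source logic separate, so adding another source later would mean adding one function and one entry rather than another branch.
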
@rlskoeser

Delivering first version of Gale/ECCO import (missing MARC metadata).

Here's the summary information from the script running in qa:

Processed 1,182 items for import.
Imported 1,177; skipped 0; 5 errors; imported 385,155 pages.

Here are the 5 ids that were not found:

CB132045539
CB110589871
CB129347115
CB125450132
CB128058273

mnaydan commented Jun 15, 2021

Fixed the IDs (Excel had dropped the zero at the start of the numeric part) to the following, which RSK imported individually:
CB0132045539
CW0110589871 (duplicate - deleted)
CB0129347115
CW0125450132
CB0128058273

Added CW0107745171 and CB0131329414.
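
For future spreadsheets, a small check along these lines could restore the zero that follows the two-letter prefix before import. This is only a sketch: it assumes every ID is two uppercase letters followed by a 10-digit number, which matches the IDs quoted in this thread but is not confirmed as a general rule. The function name is hypothetical.

```python
import re

# Assumption: two uppercase letters followed by up to 10 digits.
GALE_ID = re.compile(r"^([A-Z]{2})(\d{1,10})$")


def restore_leading_zeros(raw_id, digits=10):
    """Pad the numeric part of an ID back out to `digits` digits."""
    match = GALE_ID.match(raw_id.strip())
    if match is None:
        raise ValueError(f"Unrecognized ID format: {raw_id!r}")
    prefix, number = match.groups()
    return prefix + number.zfill(digits)


print(restore_leading_zeros("CB132045539"))  # CB0132045539
```

This only repairs the missing zero. It cannot detect a wrong prefix (two of the IDs above also changed from CB to CW), so those still need checking by hand.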

mnaydan commented Jun 15, 2021

1183 records from Gale successfully imported - number in admin matches expected number in most up-to-date CSV.

On the admin site, Source ID, author, and page count metadata were successfully imported from ECCO. @rlskoeser, I'm assuming the missing Place of Publication, Publisher, and Pub date metadata will be supplied via MARC records?

Log entry for individual record history documenting import by script confirmed. (e.g., link).

Collections metadata successfully and accurately imported from CSV. Internal curation notes also successfully imported into the correct field. (Glanced through the results pages and hand-checked the following IDs, which represent a range of collection memberships and note content: CW0114965944; CW0114031766; CB0127365931; CW0111758520; CW0114122952; CW0112450946; CW0111189178)

Testing in Django admin is complete; still need to test public-facing functionality before closing.

@rlskoeser

@mnaydan yes, the metadata I want to pull from MARC records is title, subtitle, sort title, place of publication, publisher, and pub date. We need to decide whether to track that on this issue or make a new issue for that as a refinement.

mnaydan commented Jun 16, 2021

Decided to track the MARC record issue separately as #389, so closing.
