Investigate viability of EEBO-TCP integration #571

Closed
mnaydan opened this issue Jan 10, 2024 · 4 comments

@mnaydan
Contributor

mnaydan commented Jan 10, 2024

@rlskoeser Here is a working list of titles we are interested in adding (our undergraduate intern is actively working on this spreadsheet, FYI!). The spreadsheet includes URLs to the items and has a separate column for the volume IDs, which I took from the URL bar of the EEBO-TCP site. There are some sample excerpts in the spreadsheet, too, although the pagination is strange: most of the pages are "unnumbered", so I recorded the numbers that change in the page URL after the ID, which seem to identify the full section.

At this point we're interested in getting a sense of what the data structure is like and how easy or hard it might be to integrate it into the existing PPA structure. For instance, are the unnumbered pages a big blocker? Is there metadata for them somewhere else that I'm not seeing? Are image thumbnails impossible to include, since they don't seem to be on the TCP site? How would we pull the metadata? Etc.

@rlskoeser
Contributor

@mnaydan I started looking into this last week and should have added some notes while it was fresh.

The XML structure doesn't look too bad; I think we'll have to write some custom parsing code, but it doesn't need to be very complicated since we mainly want pages, page numbers, and plain text. (This is based on looking at the collection that's published on GitHub; I'm assuming the bulk downloads are similar.) I don't think unnumbered pages are a blocker; we'd just want a way to handle pages with no label.
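To make the "pages, page numbers, and plain text" idea concrete, here is a rough sketch of what lightweight parsing could look like. It assumes page breaks are marked with an empty `pb` (or `PB`) element carrying an optional `n` attribute, as is common in TEI; the actual EEBO-TCP markup should be checked before relying on this. The function name `split_pages` and the sample document are made up for illustration and are not existing PPA code.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace and lowercase, so 'pb' and 'PB' both match."""
    return tag.rsplit("}", 1)[-1].lower()


def split_pages(xml_text):
    """Return a list of (label, text) tuples, one per page break.

    ASSUMPTION: pages start at <pb>/<PB> elements with an optional
    n/N attribute. Unnumbered pages get a label of None.
    """
    root = ET.fromstring(xml_text)
    pages = []
    current = None  # text before the first page break (e.g. header) is skipped

    def walk(elem):
        nonlocal current
        if local_name(elem.tag) == "pb":
            label = elem.get("n") or elem.get("N")  # None when missing or empty
            current = {"label": label, "chunks": []}
            pages.append(current)
        elif current is not None and elem.text:
            current["chunks"].append(elem.text)
        for child in elem:
            walk(child)
        # tail text follows the element, so it belongs to whichever page is open now
        if current is not None and elem.tail:
            current["chunks"].append(elem.tail)

    walk(root)
    # join without separators so inline markup does not split words,
    # then collapse runs of whitespace
    return [(p["label"], " ".join("".join(p["chunks"]).split())) for p in pages]


# Hypothetical sample, not real EEBO-TCP content
sample = """<TEI><text><body>
<pb n="1"/><p>First page text.</p>
<pb/><p>Unnumbered <hi>page</hi> text.</p>
</body></text></TEI>"""

pages = split_pages(sample)
```

With this approach an unnumbered page is simply a page whose label is `None`, so it can still be indexed by its position in the list.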

I think the metadata would be similar to what we did for the Gale/ECCO records, where we'd have to rely on MARC records already purchased by PUL. Fortunately they provide a mapping for us: https://textcreationpartnership.org/using-tcp-content/eebo-tcp-cataloging-records/

I am wondering if there's a way for us to use PUL's [unadvertised] bib API to get MARC records as needed, instead of having to store a local copy to look them up in (as we did for ECCO). That would depend on whether we can look records up by the ID we'll need to use, and on whether the API is fast enough (assuming we're allowed to use it for this). I expect PUL owns the MARC records, but we should confirm.

I was wondering how we get the full text, and I found in the FAQ that there are some bulk downloads; I'm not certain if this is what we want or not: https://textcreationpartnership.org/faq/#faq05 (There's a collection published on GitHub, but it doesn't include all of the texts.)

They do have a list of projects using the content; we can ask them to add PPA when/if we import content: https://textcreationpartnership.org/using-tcp-content/projects-and-publications-using-tcp-texts/

I don't see any mention of images anywhere; getting access to images for thumbnails might require negotiating with ProQuest (if it's even possible).

We might want to be in contact with the TCP folks at UMich to let them know what we're working on, so we could ask them for advice on how we might get thumbnail images.

@mnaydan
Contributor Author

mnaydan commented Jan 26, 2024

After looking back at our ECCO metadata documentation, I emailed Joe Marciniak at PUL to request access to the EEBO MARC records as a locally stored copy, XML format preferred (we decided it would be smart to repurpose the code we used for ECCO as much as possible, rather than going the API route).

I also emailed the TCP folks at tcp-info@umich.edu to ask them about the full-text and thumbnails.

@rlskoeser
Contributor

@mnaydan I downloaded the bulk exports and did a quick check of the IDs in the spreadsheet you shared against the ID lists in the text files they provide. All of our IDs are present; the majority of them are in phase 1, and only three are in the phase 2 set.
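For anyone who wants to repeat that check, it could be sketched roughly as below. This is not the script actually used; it assumes the ID lists are plain text files with one ID per line, and the function names, phase labels, and IDs shown are all made up for illustration.

```python
from pathlib import Path


def load_ids(path):
    """Read one ID per line from a text file, skipping blank lines.

    ASSUMPTION: the ID lists are plain text with one ID per line.
    """
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def check_ids(wanted, phase_lists):
    """Map each wanted ID to the name of the first list containing it, or None."""
    result = {}
    for tcp_id in wanted:
        result[tcp_id] = next(
            (name for name, ids in phase_lists.items() if tcp_id in ids),
            None,
        )
    return result


# Hypothetical IDs for illustration only; in practice each set
# would come from load_ids() on one of the provided text files.
phase_lists = {
    "phase1": {"A00001", "A00002"},
    "phase2": {"B00001"},
}
wanted = ["A00001", "B00001", "Z99999"]

found = check_ids(wanted, phase_lists)
missing = [tcp_id for tcp_id, phase in found.items() if phase is None]
```

An empty `missing` list would correspond to the "all of our IDs are present" result above.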

@jerielizabeth
Contributor

Investigation task is done - should be able to integrate

  • use existing code for MARC records
  • will need some lightweight XML parsing to get the text we care about


3 participants