Investigate viability of EEBO-TCP integration #571

mnaydan · 2024-01-10T21:16:02Z

@rlskoeser Here is a working list of titles we are interested in adding (our undergraduate intern is actively working on this spreadsheet, FYI!). The spreadsheet includes URLs to the items and has a separate column for the volume IDs, which I took from the URL bar of the EEBO-TCP site. There are some sample excerpts in the spreadsheet, too: the pages are strange, however. Most of the pages are "unnumbered" so I took the numbers that change on the page URL following the ID, which seem to indicate the full section.

We're interested at this point in getting a sense of what the data structure is like and how easy or hard it might be to integrate it into the existing PPA structure (for instance, are the unnumbered pages a big blocker? Is there metadata for those somewhere else that I'm not seeing?). Are image thumbnails impossible to include since they don't seem to be on the TCP site? How would we pull the metadata? Etc.

rlskoeser · 2024-01-25T22:17:44Z

@mnaydan I started looking into this last week and should have added some notes while it was fresh.

The XML structure doesn't look too bad; I think we'll have to write some custom parsing code, but it doesn't need to be very complicated since we mainly want pages, page numbers, and plain text. (This is based on looking at the collection that's published on GitHub; I'm assuming the rest of the texts are structured similarly.) I don't think unnumbered pages are a blocker; we'd just want a way to handle pages with no label.
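To give a sense of how small that parsing code could be, here is a rough sketch using only the Python standard library. It assumes the files are TEI-style XML where each page starts at an empty `<pb>` (page break) element with an optional `n` attribute for the page label; that assumption, the sample document, and the function names are all illustrative and not taken from the actual EEBO-TCP files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_pages(xml_string):
    """Return a list of (label, text) tuples, one per <pb> page break.

    label is the value of the pb element's n attribute, or None when the
    page is unnumbered. Text before the first <pb> is ignored.
    """
    root = ET.fromstring(xml_string)
    pages = []  # each entry is [label, [text chunks]]

    def walk(element):
        if local_name(element.tag) == "pb":
            # A page break starts a new page; label may be missing.
            pages.append([element.get("n"), []])
        elif element.text and pages:
            pages[-1][1].append(element.text)
        for child in element:
            walk(child)
            # Text following a child belongs to whichever page is current.
            if child.tail and pages:
                pages[-1][1].append(child.tail)

    walk(root)
    # Collapse runs of whitespace so each page is a single plain-text string.
    return [(label, " ".join("".join(chunks).split())) for label, chunks in pages]


# Made-up sample: one unnumbered page followed by one numbered page.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb/><p>Title page text</p>
<pb n="1"/><p>First <hi>numbered</hi> page.</p>
</body></text></TEI>"""

for label, text in extract_pages(SAMPLE):
    print(label or "[unnumbered]", "|", text)
```

Unnumbered pages simply come back with a label of `None`, so the import code can decide how to display or index them.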

I think the metadata would be similar to what we did for the Gale/ECCO records, where we'd have to rely on MARC records already purchased by PUL. Fortunately they provide a mapping for us: https://textcreationpartnership.org/using-tcp-content/eebo-tcp-cataloging-records/

I am wondering if there's a way for us to use PUL's [unadvertised] bib API to get MARC records instead of having to store a local copy that we can look up as needed (as we did for ECCO). That would depend on whether we can look records up by the ID we'll need to use and whether the API is fast enough (assuming we're allowed to use it for this). I expect PUL owns the MARC records, but we should confirm.

I was wondering how we get the full text, and I found in the FAQ that there are some bulk downloads; I'm not certain if this is what we want or not: https://textcreationpartnership.org/faq/#faq05 (There's a collection published on GitHub, but it isn't all of the texts.)

They do have a list of projects using the content; we can ask them to add PPA when/if we import content: https://textcreationpartnership.org/using-tcp-content/projects-and-publications-using-tcp-texts/

I don't see any mention of images anywhere; getting access to images for thumbnails might require negotiating with ProQuest (if it's even possible).

We might want to be in contact with the TCP folks at UMich to let them know what we're working on, so we could ask them for advice on how we might get thumbnail images.

mnaydan · 2024-01-26T15:46:14Z

After looking back at our ECCO metadata documentation, I emailed Joe Marciniak at PUL to request access to the EEBO MARC records via a local stored copy, XML format preferred (we decided it would be smart to repurpose the code we used for ECCO as much as possible, rather than going the API route).
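For reference, reading a locally stored MARCXML file needs very little code. The sketch below is not the existing ECCO code; it is a standalone illustration using the standard library and the standard MARC 21 slim XML namespace. It keys records on control field 001 and pulls the title (245 $a) and main author (100 $a). Which field actually carries the identifier we need for EEBO-TCP is something to confirm against the TCP cataloging mapping, and the sample record is made up.

```python
import xml.etree.ElementTree as ET

# Standard namespace for MARC 21 "slim" XML, in ElementTree's Clark notation.
MARC = "{http://www.loc.gov/MARC21/slim}"


def first_subfield(record, tag, code):
    """Return the first matching subfield value in a record, or None."""
    for field in record.iter(MARC + "datafield"):
        if field.get("tag") != tag:
            continue
        for sub in field.iter(MARC + "subfield"):
            if sub.get("code") == code:
                return (sub.text or "").strip()
    return None


def index_records(marcxml):
    """Map control number (field 001) to a small dict of title and author."""
    root = ET.fromstring(marcxml)
    index = {}
    # iter() also matches the root, so a lone <record> document works too.
    for record in root.iter(MARC + "record"):
        control_number = None
        for control in record.iter(MARC + "controlfield"):
            if control.get("tag") == "001":
                control_number = (control.text or "").strip()
                break
        if control_number is None:
            continue  # skip records with no control number
        index[control_number] = {
            "title": first_subfield(record, "245", "a"),
            "author": first_subfield(record, "100", "a"),
        }
    return index


# Made-up record for illustration only.
SAMPLE_MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">12345</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Author, Example</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title</subfield>
    </datafield>
  </record>
</collection>"""

print(index_records(SAMPLE_MARCXML))
```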

I also emailed the TCP folks at tcp-info@umich.edu to ask them about the full-text and thumbnails.

rlskoeser · 2024-02-01T19:58:36Z

@mnaydan I downloaded the bulk exports and did a quick check of the IDs in the spreadsheet you shared against the ID lists in the text files they provide. All of our IDs are present; the majority are included in phase 1, and only three are in the phase 2 set.
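In case it's useful to rerun as the spreadsheet grows, the check amounts to a set comparison along these lines. This is a sketch and not the exact script: it assumes each ID list has one ID per line, and the IDs and phase names below are invented for illustration.

```python
def load_ids(lines):
    """Build a set of IDs from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def check_ids(wanted, phase_lists):
    """Compare the IDs we want against each phase's ID list.

    wanted is a set of IDs; phase_lists maps a phase name to a set of IDs.
    Returns (found, missing): found maps each phase name to the sorted IDs
    present in that phase, and missing is the sorted IDs found in no phase.
    """
    found = {phase: sorted(wanted & ids) for phase, ids in phase_lists.items()}
    all_ids = set().union(*phase_lists.values())
    missing = sorted(wanted - all_ids)
    return found, missing


# Invented IDs; in practice these would come from the spreadsheet export
# and from open(...) on each phase's ID list file.
wanted = load_ids(["A00001\n", "A00002\n", "\n", "B00003\n"])
phases = {
    "phase1": {"A00001", "A00002", "A99999"},
    "phase2": {"B00003"},
}
found, missing = check_ids(wanted, phases)
print(found)
print("missing:", missing)
```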

jerielizabeth · 2024-02-05T17:49:47Z

Investigation task is done - should be able to integrate

use existing code for MARC records
will need some lightweight xml parsing to get the text we care about.

mnaydan assigned rlskoeser Jan 10, 2024

mnaydan added the chore label Jan 16, 2024

jerielizabeth closed this as completed Feb 5, 2024
