
LondonStageDB/tcp-metadata-parser


EEBO-TCP Metadata Processing Scripts

This pipeline processes XML headers from the EEBO-TCP to extract key identifiers such as ESTC, STC, and Wing numbers, as well as indicators that a work belongs to a collection of plays. It is not meant to replace the official TCP metadata, the official EEBO metadata, or the more thorough records in catalogues like the ESTC.

First Run: Create the xml-processing Conda Environment

The first time you run this script, you will need to create a conda environment.

This stage requires a conda installation.

conda env create -f environment.yml

If this command fails, Python 3 and conda are not on your PATH.

Here is a set of instructions for installing Python3 through Anaconda.

Extract XML Metadata with batch-process.py

Activate the conda environment with the project dependencies.

conda activate xml-processing
  • Use the -o argument to indicate an output file path; use the .csv extension. Output is written in UTF-8 encoding.
  • Use the -f argument last, followed by a list of paths to folders containing TCP XML files.
  • Use the -p argument to specify the number of concurrent processes. Start with 3 and monitor memory usage.
python batch-process.py -o outfile.csv -f [TCP_PATH]
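The -p flag corresponds to a standard worker-pool pattern. The following is a minimal sketch under assumed names (extract_header_fields stands in for the script's real per-file parser; this is not batch-process.py's actual internals):

```python
from multiprocessing import Pool

def extract_header_fields(xml_path):
    # Stand-in for the real per-file header parsing step.
    return {"path": xml_path}

def process_files(xml_paths, n_procs=3):
    # One worker parses one file at a time; map() preserves input order,
    # so output rows line up with the input file list.
    with Pool(processes=n_procs) as pool:
        return pool.map(extract_header_fields, xml_paths)
```

Starting with -p 3 and watching memory, as recommended above, guards against many workers each holding a large parsed tree at once.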

Notes

  • EEBO, ECCO, and Evans vary in their metadata encodings.
  • Because the Evans collection does not include XML files of interest, this parser does not perform well on Evans.
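As a rough illustration of what header extraction involves, here is a minimal sketch that pulls identifiers out of idno elements. The element names, type attribute values, and sample numbers are all illustrative assumptions, not the script's actual logic, and, per the notes above, the real encodings vary across collections:

```python
import xml.etree.ElementTree as ET

# Illustrative TCP-style header fragment. Element names, attribute
# values, and the identifier numbers are assumptions for this sketch.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <sourceDesc>
      <biblFull>
        <publicationStmt>
          <idno type="STC">STC 00001</idno>
          <idno type="ESTC">S000001</idno>
        </publicationStmt>
      </biblFull>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""

def extract_idnos(xml_text):
    """Map each idno's type attribute (lowercased) to its text value."""
    root = ET.fromstring(xml_text)
    return {
        idno.get("type", "").lower(): (idno.text or "").strip()
        for idno in root.iter("idno")
    }
```

A real parser additionally has to cope with each collection's encoding quirks, which is the hard part this repository addresses.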

Create Provisional Database Metadata Tables with csv-to-tables.py

To execute this script you will need metadata extracted from the TCP with batch-process.py and a structured list of matches between works and TCP files.

Optionally, the script can query the English Short Title Catalogue API for short titles associated with any works that have ESTC IDs. This is convenient because more than 80% of the works in the TCP have one or more known ESTC identifiers.

We use these shortened titles established by bibliographers and librarians because the sprawling titles of 18th-century books do not always fit into 21st-century web interfaces.

The matches compiled by Michele Pflug in Spring 2025 are stored in this repository as matches_tcp_lsdb.csv.

There are two types of matches:

  • match: a one-to-one match between a work and a potential textual witness
  • collection: a match between a work and an item within a collection of works

For example, Shakespeare's Hamlet is a match to The tragicall historie of Hamlet Prince of Denmarke (quarto A11959) and part of the collection of plays Mr. VVilliam Shakespeares comedies, histories, & tragedies (better known as the First Folio, A11954).
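Under assumed column names (the actual headers are defined in matches_tcp_lsdb.csv itself), the Hamlet example above would look something like:

```csv
work_id,tcp_id,match_type
hamlet,A11959,match
hamlet,A11954,collection
```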

python csv-to-tables.py -m matches_tcp_lsdb.csv -t [TCP_metadata.csv]
  • Use the -m argument to specify the input file for matches.
  • Use the -t argument to specify the input file for metadata from the TCP.
  • Use the optional --tcp_outfile argument to specify a file path for the TCP table CSV.
  • Use the optional --works_tcp_outfile argument to specify a file path for the WorksTCP table CSV.
  • Use the optional --get_estc_titles flag to enable querying of the ESTC API for short titles.

This script will normally run in less than a minute, but with the --get_estc_titles flag enabled it paces its requests to avoid flooding the ESTC API, so it takes longer to run. Be courteous!
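The pacing amounts to a polite-client pattern: sleep between lookups so requests never arrive in a burst. A minimal sketch of that pattern (the function names are illustrative, and the actual HTTP call to the ESTC API is abstracted behind fetch_one):

```python
import time

def fetch_titles(estc_ids, fetch_one, delay_seconds=1.0):
    """Look up each ESTC id via fetch_one, pausing between calls.

    fetch_one is a callable that takes an ESTC id and returns its
    short title; in the real script it would wrap an HTTP request.
    """
    titles = {}
    for i, estc_id in enumerate(estc_ids):
        if i > 0:
            time.sleep(delay_seconds)  # throttle: roughly one request per delay
        titles[estc_id] = fetch_one(estc_id)
    return titles
```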

Optional: Slurm Batch Scripts

We have included the Slurm batch scripts used to run the most time-consuming steps of this pipeline for reproducibility purposes.

Because of the idiosyncrasies of Slurm, these scripts reference account parameters and shared folders specific to UOregon's high-performance computing cluster, Talapas.

To run these scripts on your own Slurm cluster:

  • adjust the --account and --partition arguments
  • replace module load miniconda3 with the conda version on your cluster
  • upload the EEBO-TCP to an appropriate networked folder on your cluster
  • update input and output filepaths to match your environment
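A skeleton of the kind of adaptation described above (the account, partition, module name, and paths are all placeholders for your cluster's values, not the repository's actual batch-process.sbatch):

```bash
#!/bin/bash
#SBATCH --account=your_account        # adjust for your cluster
#SBATCH --partition=compute           # adjust for your cluster
#SBATCH --time=00:30:00

module load miniconda3                # swap for your cluster's conda module
conda activate xml-processing
python batch-process.py -o outfile.csv -p 3 -f /path/to/eebo-tcp
```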

Running these Slurm batch scripts instead of invoking batch-process.py and csv-to-tables.py directly will not change the results, but it will be much faster.

batch-process.sbatch takes about 60 seconds to run on Talapas under the current configuration.

About

XML parsing scripts for generating the TCP and WorksTCP database tables.
