This pipeline processes XML headers from the EEBO-TCP in order to extract key identifiers like ESTC, STC, and Wing numbers, as well as indicators that a work is from a collection of plays. It is not meant to replace the official TCP metadata, the official EEBO metadata, or the more thorough information in catalogues like the ESTC.
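As a rough illustration of the kind of extraction involved, the sketch below pulls identifier values from a TEI-style header with `<IDNO TYPE="...">` elements. The header fragment and identifier values are toy examples; real TCP headers are much larger and vary in encoding.

```python
import xml.etree.ElementTree as ET

# A toy header fragment; real TCP headers are much larger and vary in encoding.
header = """
<HEADER>
  <PUBLICATIONSTMT>
    <IDNO TYPE="ESTC">R13814</IDNO>
    <IDNO TYPE="STC">Wing M2230</IDNO>
  </PUBLICATIONSTMT>
</HEADER>
"""

def extract_identifiers(xml_text):
    """Return a dict mapping identifier type (e.g. ESTC, STC) to its value."""
    root = ET.fromstring(xml_text)
    return {idno.get("TYPE"): idno.text.strip() for idno in root.iter("IDNO")}

print(extract_identifiers(header))
# prints {'ESTC': 'R13814', 'STC': 'Wing M2230'}
```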
The first time you run this script, you will need to create a conda environment.
This stage requires a conda installation.
```shell
conda env create -f environment.yml
```

If this command fails, then Python 3 and conda are not on your path. Here is a set of instructions for installing Python 3 through Anaconda.

Activate the conda environment with the project dependencies:

```shell
conda activate xml-processing
```
- Use the `-o` argument to indicate an output filepath. Use the `.csv` extension. The output will be in `utf-8` encoding.
- Use the `-f` argument last, followed by a list of paths to folders containing TCP XML files.
- Use the `-p` argument to specify a number of concurrent processes. Start with 3 and monitor memory usage.
```shell
python batch-process.py -o outfile.csv -f [TCP_PATH]
```
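The `-p` fan-out corresponds to a standard worker-pool pattern. A minimal sketch of that pattern, using threads for simplicity and a hypothetical `process_file` helper standing in for the actual header parsing:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for parsing one TCP XML file's header.
def process_file(path):
    return {"path": path, "estc": None}  # real code would parse the XML here

paths = ["A11959.xml", "A11954.xml", "A00001.xml"]

# Analogue of `-p 3`: three concurrent workers; monitor memory as you scale up.
with ThreadPoolExecutor(max_workers=3) as pool:
    rows = list(pool.map(process_file, paths))

print(len(rows))  # prints 3
```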
- EEBO, ECCO, and Evans vary in their metadata encodings.
- As Evans does not have XML files of interest, this parser doesn't perform well on Evans.
To execute this script you will need metadata extracted from the TCP with `batch-process.py` and a structured list of matches between works and TCP files.
Optionally, it will query the English Short Title Catalogue API for short titles associated with any works with ESTC ids. This is convenient because more than 80% of the works in the TCP have one or more known ESTC identifiers.
We use these shortened titles established by bibliographers and librarians because the sprawling titles of 18th-century books do not always fit into 21st-century web interfaces.
The matches compiled by Michele Pflug in Spring 2025 are stored in this repository as `matches_tcp_lsdb.csv`.
There are two types of matches:
- match: a one-to-one match between a work and a potential textual witness
- collection: a match between a work and an item within a collection of works
For example, Shakespeare's Hamlet is a match to the quarto The tragicall historie of Hamlet Prince of Denmarke (`A11959`) and part of the collection of plays Mr. VVilliam Shakespeares comedies, histories, & tragedies, better known as the First Folio (`A11954`).
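In CSV form, the two match types might look like the rows below. The column names here are illustrative only, not the actual schema of `matches_tcp_lsdb.csv`.

```python
import csv
import io

# Illustrative rows only; the real matches file has its own column layout.
sample = """work,tcp_id,match_type
Hamlet,A11959,match
Hamlet,A11954,collection
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["work"], row["tcp_id"], row["match_type"])
```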
```shell
python csv-to-tables.py -m tcp_lsdb_matches.csv -t [TCP_metadata.csv]
```
- Use the `-m` argument to specify the input file for matches.
- Use the `-t` argument to specify the input file for metadata from the TCP.
- Use the optional `--tcp_outfile` argument to specify a file path for the TCP table CSV.
- Use the optional `--works_tcp_outfile` argument to specify a file path for the WorksTCP table CSV.
- Use the optional `--get_estc_titles` flag to enable querying of the ESTC API for short titles.
This script will normally run in less than a minute, but with the `--get_estc_titles` flag enabled it will take longer, since it throttles requests to avoid flooding the ESTC API with requests. Be courteous!
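Throttling amounts to pausing between requests. A sketch of the polite-querying pattern, with a hypothetical `fetch_short_title` standing in for the actual ESTC API call:

```python
import time

def fetch_short_title(estc_id):
    # Hypothetical: a real implementation would issue an HTTP request here.
    return f"Short title for {estc_id}"

def fetch_all(estc_ids, delay=1.0):
    """Query one id at a time, sleeping between requests to spare the API."""
    titles = {}
    for i, estc_id in enumerate(estc_ids):
        if i > 0:
            time.sleep(delay)  # be courteous: never hammer the endpoint
        titles[estc_id] = fetch_short_title(estc_id)
    return titles

print(fetch_all(["R13814", "R21099"], delay=0.1))
```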
We have included the Slurm batch scripts used to run the most time-consuming steps of this pipeline for reproducibility purposes.
Because of the idiosyncrasies of Slurm, these scripts reference account parameters and shared folders specific to UOregon's high-performance computing cluster, Talapas.
To run these scripts on your own Slurm cluster:
- adjust the `--account` and `--partition` arguments
- replace `module load miniconda3` with the conda module available on your cluster
- upload the EEBO-TCP to an appropriate networked folder on your cluster
- replace input and output filepaths to match
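A skeleton of those adjustments, with placeholder values you would replace for your own cluster (the account, partition, and paths are placeholders, not the values in the repository's scripts):

```shell
#!/bin/bash
#SBATCH --account=YOUR_ACCOUNT        # replace with your Slurm account
#SBATCH --partition=YOUR_PARTITION    # replace with a partition on your cluster

# Replace `module load miniconda3` with your cluster's conda module name.
module load miniconda3
conda activate xml-processing

# Replace input/output filepaths to match your networked folders.
python batch-process.py -o /path/to/outfile.csv -f /path/to/eebo-tcp -p 3
```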
Running these Slurm batch scripts instead of `batch-process.py` and `csv-to-tables.py` directly will not give different results, but they will run much faster.
`batch-process.sbatch` takes about 60 seconds to run on Talapas under the current configuration.