
LondonStageDB/tcp-metadata-parser


EEBO-TCP Metadata Processing Scripts

This pipeline processes XML headers from the EEBO-TCP to extract key identifiers such as ESTC, STC, and Wing numbers, as well as indicators that a work belongs to a collection of plays. It is not meant to replace the official TCP metadata, the official EEBO metadata, or the more thorough records in catalogues like the ESTC.

First Run: Create the xml-processing Conda Environment

The first time you run this script, you will need to create a conda environment.

This stage requires a conda installation.

conda env create -f environment.yml

If this command fails, Python 3 and conda are not on your PATH.

Here is a set of instructions for installing Python3 through Anaconda.

Extract XML Metadata with batch-process.py

Activate the conda environment with the project dependencies.

conda activate xml-processing
  • Use the -o argument to indicate an output file path; use the .csv extension. Output is written in UTF-8 encoding.
  • Use the -f argument last, followed by a list of paths to folders containing TCP XML files.
  • Use the -p argument to specify the number of concurrent processes. Start with 3 and monitor memory usage.
python batch-process.py -o outfile.csv -f [TCP_PATH]
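The -p flag corresponds to a standard worker-pool pattern. The following is a minimal sketch under assumed names (extract_header_fields stands in for the script's real per-file parser; this is not batch-process.py's actual internals):

```python
from multiprocessing import Pool

def extract_header_fields(xml_path):
    # Stand-in for the real per-file header parsing step.
    return {"path": xml_path}

def process_files(xml_paths, n_procs=3):
    # One worker parses one file at a time; map() preserves input order,
    # so output rows line up with the input file list.
    with Pool(processes=n_procs) as pool:
        return pool.map(extract_header_fields, xml_paths)
```

Starting with -p 3 and watching memory, as recommended above, guards against many workers each holding a large parsed tree at once.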

Notes

  • EEBO, ECCO, and Evans vary in their metadata encodings.
  • Because the Evans collection does not include XML files of interest, this parser does not perform well on Evans.
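As a rough illustration of what header extraction involves, here is a minimal sketch that pulls identifiers out of idno elements. The element names, type attribute values, and sample numbers are all illustrative assumptions, not the script's actual logic, and, per the notes above, the real encodings vary across collections:

```python
import xml.etree.ElementTree as ET

# Illustrative TCP-style header fragment. Element names, attribute
# values, and the identifier numbers are assumptions for this sketch.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <sourceDesc>
      <biblFull>
        <publicationStmt>
          <idno type="STC">STC 00001</idno>
          <idno type="ESTC">S000001</idno>
        </publicationStmt>
      </biblFull>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""

def extract_idnos(xml_text):
    """Map each idno's type attribute (lowercased) to its text value."""
    root = ET.fromstring(xml_text)
    return {
        idno.get("type", "").lower(): (idno.text or "").strip()
        for idno in root.iter("idno")
    }
```

A real parser additionally has to cope with each collection's encoding quirks, which is the hard part this repository addresses.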

Create Provisional Database Metadata Tables with csv-to-tables.py

To execute this script you will need metadata extracted from the TCP with batch-process.py and a structured list of matches between works and TCP files.

Optionally, the script can query the English Short Title Catalogue API for short titles associated with any works that have ESTC IDs. This is convenient because more than 80% of the works in the TCP have one or more known ESTC identifiers.

We use these shortened titles established by bibliographers and librarians because the sprawling titles of 18th-century books do not always fit into 21st-century web interfaces.

The matches compiled by Michele Pflug in Spring 2025 are stored in this repository as matches_tcp_lsdb.csv.

There are two types of matches:

  • match: a one-to-one match between a work and a potential textual witness
  • collection: a match between a work and an item within a collection of works

For example, Shakespeare's Hamlet is a match to The tragicall historie of Hamlet Prince of Denmarke (quarto A11959) and part of the collection of plays Mr. VVilliam Shakespeares comedies, histories, & tragedies (better known as the First Folio, A11954).
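Under assumed column names (the actual headers are defined in matches_tcp_lsdb.csv itself), the Hamlet example above would look something like:

```csv
work_id,tcp_id,match_type
hamlet,A11959,match
hamlet,A11954,collection
```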

python csv-to-tables.py -m matches_tcp_lsdb.csv -t [TCP_metadata.csv]
  • Use the -m argument to specify the input file for matches.
  • Use the -t argument to specify the input file for metadata from the TCP.
  • Use the optional --tcp_outfile argument to specify a file path for the TCP table CSV.
  • Use the optional --works_tcp_outfile argument to specify a file path for the WorksTCP table CSV.
  • Use the optional --get_estc_titles flag to enable querying of the ESTC API for short titles.

This script will normally run in less than a minute, but with the --get_estc_titles flag enabled it paces its requests to avoid flooding the ESTC API, so it takes longer to run. Be courteous!
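The pacing amounts to a polite-client pattern: sleep between lookups so requests never arrive in a burst. A minimal sketch of that pattern (the function names are illustrative, and the actual HTTP call to the ESTC API is abstracted behind fetch_one):

```python
import time

def fetch_titles(estc_ids, fetch_one, delay_seconds=1.0):
    """Look up each ESTC id via fetch_one, pausing between calls.

    fetch_one is a callable that takes an ESTC id and returns its
    short title; in the real script it would wrap an HTTP request.
    """
    titles = {}
    for i, estc_id in enumerate(estc_ids):
        if i > 0:
            time.sleep(delay_seconds)  # throttle: roughly one request per delay
        titles[estc_id] = fetch_one(estc_id)
    return titles
```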

Optional: Slurm Batch Scripts

We have included the Slurm batch scripts used to run the most time-consuming steps of this pipeline for reproducibility purposes.

Because of the idiosyncrasies of Slurm, these scripts reference account parameters and shared folders specific to UOregon's high-performance computing cluster, Talapas.

To run these scripts on your own Slurm cluster:

  • adjust the --account and --partition arguments
  • replace module load miniconda3 with the conda version on your cluster
  • upload the EEBO-TCP to an appropriate networked folder on your cluster
  • update input and output filepaths to match your environment
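A skeleton of the kind of adaptation described above (the account, partition, module name, and paths are all placeholders for your cluster's values, not the repository's actual batch-process.sbatch):

```bash
#!/bin/bash
#SBATCH --account=your_account        # adjust for your cluster
#SBATCH --partition=compute           # adjust for your cluster
#SBATCH --time=00:30:00

module load miniconda3                # swap for your cluster's conda module
conda activate xml-processing
python batch-process.py -o outfile.csv -p 3 -f /path/to/eebo-tcp
```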

Running these Slurm batch scripts instead of invoking batch-process.py and csv-to-tables.py directly will not change the results, but it will be much faster.

batch-process.sbatch takes about 60 seconds to run on Talapas under the current configuration.

About

XML parsing scripts for generating the TCP and WorksTCP database tables.
