## Extracting EPL Metadata

This notebook is run as step two of the EPL Metadata Extraction notebook (.ipynb).

This notebook extracts the metadata and creates a CSV file on my local machine. There were too many files to upload them all to Google Drive, so the extraction was run locally and only the resulting CSV was uploaded to Google Drive. This file pairs with the [EPL Metadata Extraction](https://colab.research.google.com/drive/1_ZvgW_b71s9uTuQq9bZqNlsprP6jWFYJ?usp=sharing) notebook, and its contents are included in that notebook as well.


In [71]:
''' DATA QUERYING '''
from lxml import etree
parser = etree.XMLParser(collect_ids=False,encoding='utf-8')
nsamp = {'tei': 'http://www.tei-c.org/ns/1.0'} ### EPL Source


''' DATA MANAGEMENT '''
import pandas as pd
import regex as re

import json

import glob
from tqdm.notebook import tqdm

In [7]:
files = glob.glob(r"C:\Users\natal\Downloads\epmetadata-master\epmetadata-master\header\*.xml")
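
The extraction loop in this notebook takes the TCP ID from each file name with a regular expression that depends on Windows backslashes. As a hedged alternative, the sketch below does the same job with `pathlib`, so it would also work on paths that use forward slashes. The function name `tcp_id_from_path` and the example file names are hypothetical, and the sketch assumes the files are named `<TCP ID>_<suffix>.xml`.

```python
from pathlib import PureWindowsPath

def tcp_id_from_path(file_name):
    """Return the part of the file name before its last underscore."""
    # PureWindowsPath accepts both backslash and forward-slash separators
    stem = PureWindowsPath(file_name).stem  # file name without ".xml"
    return stem.rsplit("_", 1)[0]

# Hypothetical file names, used only for illustration
id_windows = tcp_id_from_path(r"C:\data\header\A00001_header.xml")
id_posix = tcp_id_from_path("data/header/A00002_header.xml")
print(id_windows, id_posix)
```

One difference from the regular expression: if a file name contains no underscore, this sketch returns the whole file name without its extension instead of raising an `IndexError`.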

In [76]:
metadata_data = [] # Empty list for data
tcps = [] # Empty list for TCP IDs

## extracting the metadata from each file
for file_name in tqdm(files,desc="📖🔍 🪄 finding data...",unit=' text',):
    ## Using the file name to extract the TCP ID
    match = re.findall(r'(?<=header\\).+(?=_)',file_name)
    tcp_id = match[0]
    ## Using the XML parser to create an xml object that holds all of the metadata
    metadata = etree.parse(file_name,parser)
    title = metadata.find(".//tei:sourceDesc//tei:title", namespaces=nsamp).text

    ## Author
    try:
        author = metadata.find(".//tei:person[@role='creator']/tei:persName", namespaces=nsamp).text
    except AttributeError:
        author = None
    
    ## Author Gender
    try:
        gender = metadata.find(".//tei:sourceDesc//tei:author", namespaces=nsamp).get("gender")
    except AttributeError:
        gender = None
        
    ## Author Birth
    try:
        birth = metadata.find(".//tei:person[@role='creator']/tei:birth", namespaces=nsamp).text
    except AttributeError:
        birth = None
        
    ## Author Death
    try:
        death = metadata.find(".//tei:person[@role='creator']/tei:death", namespaces=nsamp).text
    except AttributeError:
        death = None

    ## Publication date (only present when the date is a single value, not a range)
    try:
        date = metadata.find(".//tei:sourceDesc//tei:date", namespaces=nsamp).get("when")
    except AttributeError:
        date = None   
    ## Publisher (the person recorded as printer, if there is one)
    try:
        publisher = metadata.find(".//tei:person[@type='printer']/tei:persName", namespaces=nsamp).text
    except AttributeError:
        publisher = None  
    ## Publication location (if there is one)
    try:
        pub_location = metadata.find(".//tei:sourceDesc//tei:pubPlace", namespaces=nsamp).text
    except AttributeError:
        pub_location = None  
    ## Storing the metadata in a dictionary
    current_metadata = {'TCP ID':tcp_id,'title':title,'author':author,'gender':gender,"auth birth":birth,"auth death":death,'pub date':date,'publisher':publisher,'location':pub_location}
    metadata_data.append(current_metadata)
    ## Adding the TCP ID to the index list
    tcps.append(tcp_id)
    
    
print("✨ data has been found")
           

📖🔍 🪄 finding data...:   0%|          | 0/60331 [00:00<?, ? text/s]

✨ data has been found
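
The loop above repeats the same `try` / `except AttributeError` pattern for every field. A minimal sketch of how that pattern could be folded into one helper is shown here. It is not the code that produced the CSV: the helper name `find_text` and the tiny sample header are made up for illustration, and the sketch uses the standard library's `xml.etree.ElementTree` so that it is self-contained. The same helper should work on an `lxml` tree, since both libraries provide `find(path, namespaces=...)`.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def find_text(root, path, attr=None):
    """Return the text (or the attribute `attr`) of the first match, else None."""
    node = root.find(path, namespaces=NS)
    if node is None:
        return None
    return node.get(attr) if attr else node.text

# Made-up header for illustration only (not a real EPL record)
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sourceDesc>
    <title>Example title</title>
    <date when="1593">1593</date>
  </sourceDesc>
</TEI>"""
root = ET.fromstring(sample)

title = find_text(root, ".//tei:sourceDesc//tei:title")
date = find_text(root, ".//tei:sourceDesc//tei:date", attr="when")
# No creator is present in the sample, so this falls back to None
author = find_text(root, ".//tei:person[@role='creator']/tei:persName")
print(title, date, author)
```

With a helper like this, a missing title would also be stored as `None` instead of stopping the loop with an `AttributeError`.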


In [77]:
''' Generating A Data Frame With the Metadata '''

metadata = pd.DataFrame(metadata_data,index=tcps)
metadata.head()

| | TCP ID | title | author | gender | auth birth | auth death | pub date | publisher | location |
|---|---|---|---|---|---|---|---|---|---|
| A00001 | A00001 | [The passoinate [sic] morrice] | A., | L | | | 1593 | R. Bourne? | [London : |
| A00002 | A00002 | The brides ornaments viz. fiue meditations, mo... | Aylett, Robert, | M | 1583.0 | 1655?. | 1625 | William Stansby | London : |
| A00003 | A00003 | A sermon preached at Paules-Crosse the second ... | Ailesbury, Thomas, | M | | | 1623 | George Eld | London : |
| A00005 | A00005 | Here begynneth a shorte and abreue table on th... | | | | | 1515 | me Iulyan Notary | [Enprynted at Londo[n] : |
| A00007 | A00007 | The Cronycles of Englonde with the dedes of po... | | | | | 1528 | | [Imprynted at London : |


In [78]:
len(metadata)

60331

In [79]:
metadata.to_csv("supplementary_metadata.csv")
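
Because `to_csv` writes the TCP ID index as an unnamed first column, reading the file back with a plain `pd.read_csv` gives that column the name `Unnamed: 0`. The sketch below shows one way the CSV might be loaded later so the index is restored and values such as years stay as text. It uses a one-row stand-in DataFrame and a temporary file (both hypothetical) instead of the real `supplementary_metadata.csv`.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the real metadata, for illustration only
demo = pd.DataFrame(
    {"TCP ID": ["A00001"], "pub date": ["1593"]},
    index=["A00001"],
)
path = os.path.join(tempfile.mkdtemp(), "demo_metadata.csv")
demo.to_csv(path)

# Plain read: the saved index comes back as a column named "Unnamed: 0"
plain = pd.read_csv(path)

# index_col=0 restores the index; dtype=str keeps every column as text
restored = pd.read_csv(path, index_col=0, dtype=str)
print(list(plain.columns))
print(restored.loc["A00001", "pub date"])
```

Reading everything as text is a design choice for this sketch: fields like the birth and death dates can hold values such as `1655?.`, which do not convert cleanly to numbers.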

___
<font color="gray">
This notebook was prepared by Natalie Castro for the CU Boulder Lab for Early Modern Text Analysis, advised by Dr. David Glimp and Dr. Rachael Deagman Simonetta.

10/30/2024</font>