# EPL Corpus Creation

This notebook creates three CSV files from the [EPL Early Text Corpus](https://earlyprint.org/lab/). The first is a paragraph-split corpus of all of the texts, the second is a sentence-split corpus, and the third is a word-split corpus with the modernized, adorned, and lemma forms for all of the texts.

The texts can be found on EPL's [Bitbucket](https://bitbucket.org/eplib/eebotcp/src/master/) and downloaded if desired. The purpose of this notebook is to build a main CSV file so that the texts can be queried from a central location, reducing further preprocessing and computational resources moving forward. I forked the Bitbucket repository to my personal account to reduce traffic and API calls against the EPL repository; I did not want to overload the servers hosting their data by accident!

To do so, I will iteratively read in the files, download them, structure them in a dataframe, and then delete the files from the repository. This process will repeat until all files have been structured into the CSV files. The documentation for the Bitbucket [Python API](https://atlassian-python-api.readthedocs.io/bitbucket.html) is linked here. Additional code used to structure this data pull can be found in this [GitHub repository](https://github.com/atlassian-api/atlassian-python-api/tree/master/examples/bitbucket), provided by Bitbucket's parent company, Atlassian.

---

## 🌐 1. Environment Creation

### 1.1 Library Import

The libraries used in this section fall under four categories: data access, data query, data structuring, and a progress bar.

In [1]:
''' DATA ACCESS '''
import glob
import json
import os

''' DATA QUERY '''
from lxml import etree
parser = etree.XMLParser(collect_ids=False,encoding='utf-8')
nsmap = {'tei': 'http://www.tei-c.org/ns/1.0'} ### EPL Source

''' DATA STRUCTURING '''
import pandas as pd

''' PROGRESS BAR '''
from tqdm.notebook import tqdm

### 1.2 Data Storage Import

The data will be iteratively pulled from two places: the local file directory and the Bitbucket repository.

In [2]:
''' LOCAL FILE DIRECTORY '''

files = glob.glob(r"/scratch/alpine/naca4005/texts/*/*.xml")

In [3]:
dir_list = os.listdir(r"/scratch/alpine/naca4005/texts/") 

In [4]:
files[0]

'/scratch/alpine/naca4005/texts/A07/A07811.xml'

## 📖2. XML Parsing

To parse the XML data, I follow the documentation provided by the EPL team, [Parsing *Early Print* XML Texts](https://earlyprint.org/jupyterbook/ep_xml.html), written by [John R Ladd](https://jrladd.com/). The EPL corpus is encoded using the [TEI simplePrint](https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_simplePrint.doc.html) standard, which I referenced to extract the paragraph-like divisions.

Wherever modernized words are referenced, the modernization was carried out by the EPL using the [Northwestern MorphAdorner](https://morphadorner.northwestern.edu/morphadorner/). This data was already preprocessed and prepared by the EPL team.

In [4]:
''' CREATING AN XML PARSER '''
## Creating a parser object
parser = etree.XMLParser(collect_ids=False)

 

In [7]:
## Parse your XML file into a "tree" object -
## the tree objects will be used iteratively later to generate data for each of the text files
tree = etree.parse(files[1], parser)

XMLSyntaxError: Document is empty, line 1, column 1 (A49, line 1)
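
The error above comes from a file with no content, and a single such file would stop a long batch run. The block below is a minimal sketch of one way to guard against that, not the approach used in this notebook. It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without extra installs; with `lxml` the analogous exception is `etree.XMLSyntaxError`, as in the traceback. The helper name `safe_root` and the two throwaway files are assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def safe_root(file_path):
    """Return the root element of an XML file, or None if it cannot be parsed.

    Hypothetical helper: not part of the notebook or of any library.
    """
    try:
        return ET.parse(file_path).getroot()
    except ET.ParseError:
        # Raised for empty or malformed documents.
        return None


# Demonstration on two throwaway files: one empty, one well-formed.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.xml")
    good_path = os.path.join(tmp, "good.xml")

    open(empty_path, "w").close()
    with open(good_path, "w", encoding="utf-8") as handle:
        handle.write("<TEI xmlns='http://www.tei-c.org/ns/1.0'/>")

    empty_root = safe_root(empty_path)
    good_root = safe_root(good_path)

print(empty_root is None, good_root is not None)
```

A caller can then skip any file for which the helper returns `None` and log its path for later inspection.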

In [5]:
''' root generator function:

    INPUT: File Path from your machine
    OUTPUT: A parseable xml element that is TEI encoded

    The purpose of this function is to simplify the steps needed to get an xml file in parseable state. Any file name that
    is XML encoded will work for this function. 

'''


def root_generator(file_path):
    tree = etree.parse(file_path,parser)
    
    text_root = tree.getroot()
    
    return(text_root)

''' tei_finder function:

    INPUT: A text root
    OUTPUT: The TEI tag to identify this book

    The purpose of this function is to extract the unique TEI ID from the file. It returns the ID as a string.

'''

def tei_finder(text_root):
    tei_tag = text_root.findall(".//tei:idno[@type='DLPS']",namespaces=nsmap)
    tei = [tag.text for tag in tei_tag]
    return (tei[0])

### 2.1 Exploratory Parsing

In [10]:
a00001 = root_generator(files[0])
a00002 = root_generator(files[1])
a00003 = root_generator(files[2])
a00004 = root_generator(files[3])

In [11]:
root = root_generator(files[20])
tei = tei_finder(root)

print (tei)

A07480
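
`tei_finder` indexes the first match directly, so a header without a `DLPS` identifier would raise an `IndexError`. The following is a hedged sketch of a variant that returns `None` instead. The sample header is a made-up, heavily trimmed fragment (including the `STC` value), the name `tei_finder_safe` is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for `lxml`, whose `find` accepts the same path and `namespaces` arguments.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

nsmap = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up, heavily trimmed header; real EPL headers carry far more metadata.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <idno type="STC">0000</idno>
    <idno type="DLPS">A07480</idno>
  </teiHeader>
</TEI>"""


def tei_finder_safe(text_root):
    """Return the text of the first DLPS idno, or None when there is none."""
    tag = text_root.find(".//tei:idno[@type='DLPS']", namespaces=nsmap)
    return tag.text if tag is not None else None


with_id = tei_finder_safe(ET.fromstring(SAMPLE))
without_id = tei_finder_safe(
    ET.fromstring("<TEI xmlns='http://www.tei-c.org/ns/1.0'/>")
)

print(with_id, without_id)
```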


#### 2.1.1 Extracting All Words

Functions:

* extract_all_words
* extract_all_modernized_words
* extract_all_lemmas
* extract_all_pos

All of the functions take the text root defined by the root generator above, and return a list of words with the desired type.

In [6]:
''' extract_all_words function:

    This function is intended to extract all of the words from the text file. 
    This will include the title, author, publisher, and all other metadata that may be included in the text. 
    
    INPUT: an xml tree tag
    OUTPUT: a list containing all of the words

'''
def extract_all_words(text_root):
    all_word_tags = text_root.findall(".//tei:w",namespaces=nsmap)
    
    all_words = [w.text for w in all_word_tags]
    
    return(all_words)


''' extract_all_modernized_words function:

    This function will generate a list of the modernized words generated by the NUIT Morphadorner. This function performs similarly
    to the original text function above, but just pulls a different XML attribute.
    
    INPUT: an xml tree tag
    OUTPUT: a list containing all of the modernized words


'''

def extract_all_modernized_words(text_root):
    all_word_tags = text_root.findall(".//tei:w",namespaces=nsmap)
    
    regularized_words = [w.get('reg', w.text) for w in all_word_tags]
    
    
    return(regularized_words)

''' extract_all_lemmas function:

    This function will generate a list of the lemmatized words by EPL. It works in a similar fashion to those above.
    
    INPUT: an xml tree tag
    OUTPUT: a list containing all of the lemmatized words



'''

def extract_all_lemmas(text_root):
    all_word_tags = text_root.findall(".//tei:w",namespaces=nsmap)
    
    lemmas = [w.get('lemma') for w in all_word_tags]
    
    
    return(lemmas)


''' extract_all_pos function:

    This function will generate a list of the part-of-speech tags attributed to each individual word. These tags are read
    from the 'pos' attribute already present in the EPL XML.
    
    INPUT: an xml tree tag
    OUTPUT: a list containing all of the POS tags for each respective word

'''

def extract_all_pos(text_root):
    all_word_tags = text_root.findall(".//tei:w",namespaces=nsmap)
    
    pos = [w.get('pos') for w in all_word_tags]
    
    return(pos)


                                 
                                 

In [13]:
all_words = extract_all_words(a00004)
all_words[10:20]

['DAVID',
 'sends',
 'his',
 'PIETIE',
 'TO',
 'KING',
 'CHARLES',
 'His',
 'Subjects',
 'Being']

In [14]:
modernized_words = extract_all_modernized_words(a00004)
modernized_words[10:20]

['DAVID',
 'sends',
 'his',
 'piety',
 'TO',
 'KING',
 'CHARLES',
 'His',
 'Subjects',
 'Being']

In [15]:
lemma_words = extract_all_lemmas(a00004)
lemma_words[10:20]

['david',
 'send',
 'his',
 'piety',
 'to',
 'king',
 'charles',
 'his',
 'subject',
 'be']

In [12]:
parts_of_speech = extract_all_pos(a00004)
parts_of_speech[10:20]

['n1', 'acp', 'd', 'j', 'j', 'nn1', 'nn1', 'n1', 'acp', 'n2']

In [13]:
print (len(all_words))
print (len(modernized_words))
print (len(lemma_words))
print (len(parts_of_speech))

3950
3950
3950
3950
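
The four lists have equal length because each function walks the same `w` tags. As an illustration only, the sketch below collects all four values in a single pass over those tags, which avoids four separate `findall` calls per text. The sample markup and its attribute values are invented, `extract_word_records` is a hypothetical name, and `xml.etree.ElementTree` is used as a stand-in for `lxml`.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

nsmap = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment; the attribute values are illustrative, not copied from EPL.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<w lemma="his" pos="po">his</w>
<w lemma="piety" pos="n1" reg="piety">PIETIE</w>
</p></body></text></TEI>"""


def extract_word_records(text_root):
    """Return one dict per <w> tag with the word, modernized form, lemma and POS."""
    records = []
    for w in text_root.findall(".//tei:w", namespaces=nsmap):
        records.append({
            "word": w.text,
            # Fall back to the original spelling when no 'reg' attribute exists.
            "modernized": w.get("reg", w.text),
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
        })
    return records


records = extract_word_records(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

A list of dicts like this can be passed straight to `pd.DataFrame`, giving one row per word without a later explode step.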


#### 2.1.2 Extracting Lines

In [7]:
''' extract_all_lines function:

    This function will generate a list of the lines coded as 'l' in the TEI XML files
    
    INPUT: an xml tree tag
    OUTPUT: a list containing strings of all of the 'l' separated lines.
    
'''
def extract_all_lines(text_root):
    line_tags = text_root.findall(".//tei:l", namespaces=nsmap)

    all_lines = []
    for line in line_tags: # Iterating through every line in the text
        # Skipping empty word tags so that join() never receives None
        current_line = (' '.join([w.text for w in line.findall(".//tei:w", namespaces=nsmap) if w.text is not None]))
        all_lines.append(current_line)
        
    return(all_lines)

''' extract_line_punctuation function:

    This function will generate a list of the lines coded as 'l' in the TEI XML files.
    Unlike extract_all_lines, extract_line_punctuation keeps the punctuation provided
    
    INPUT: an xml tree tag
    OUTPUT: a list containing strings of all of the 'l' separated lines.
    
'''
def extract_line_punctuation(text_root):
    line_tags = text_root.findall(".//tei:l", namespaces=nsmap)

    all_lines = []
    for line in line_tags:
        # Joining every child tag (words and punctuation), skipping tags with no text
        current_line = (' '.join([child.text for child in line if child.text is not None]))
        all_lines.append(current_line)
        
    return(all_lines)

In [17]:
line_tags = a00001.findall(".//tei:l", namespaces=nsmap)


In [18]:
lines = extract_all_lines(a00004)
lines[0:5]

[]

In [19]:
punctuated_lines = extract_line_punctuation(a00004)
punctuated_lines[0:4]

[]
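
Both calls return an empty list for this text, which most likely means it contains no `l` elements (it reads as prose rather than verse). The sketch below illustrates that behaviour on an invented fragment with one verse line and one prose paragraph. It uses `xml.etree.ElementTree` as a stand-in for `lxml`, and the name `extract_lines_safe` is hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

nsmap = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment: one verse line (with an empty <w/>) and one prose paragraph.
VERSE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<lg><l><w>Arms</w><pc>,</pc><w>and</w><w>the</w><w>man</w><w/></l></lg>
<p><w>Plain</w><w>prose</w><pc unit="sentence">.</pc></p>
</body></text></TEI>"""

PROSE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><w>Plain</w><w>prose</w><pc unit="sentence">.</pc></p>
</body></text></TEI>"""


def extract_lines_safe(text_root):
    """Join the text of each <l> element's children, skipping empty tags."""
    lines = []
    for line in text_root.findall(".//tei:l", namespaces=nsmap):
        tokens = [child.text for child in line if child.text]
        lines.append(" ".join(tokens))
    return lines


verse_lines = extract_lines_safe(ET.fromstring(VERSE))
prose_lines = extract_lines_safe(ET.fromstring(PROSE))

print(verse_lines)
print(prose_lines)
```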

#### 2.1.3 Extracting Sentences

In [8]:
''' extract_all_sentences function:

    INPUT: an xml text root
    OUTPUT: a list of strings that are each a sentence in the text
    
    This function is heavily adapted from the EPL documentation linked above. Naming conventions were changed from 'master'
    to 'main', and the code comments were changed as well to increase readability in this file. This function, like the word
    functions, will take an xml text root and generate a split list based on the text provided. This function can take some
    time to run, as it iterates through the tags one by one.

'''

def extract_all_sentences(text_root):
    word_and_punctuation_tags = text_root.xpath("//tei:w|//tei:pc", namespaces=nsmap)
    
    
        ## Creating list storage containers for the working text
    all_sentences = []
    new_sentence = []
    
        ## Iterating through each tag
    
    for tag in word_and_punctuation_tags:
        ## First, we will test to see if the tag has the attribute to find the end of the sentence
        if 'unit' in tag.attrib and tag.get('unit') == 'sentence':
            if tag.text is not None:
                ## Adding the punctuation to the sentence
                new_sentence.append(tag.text)
            
            joined_sentence = ' '.join([word for word in new_sentence if word is not None])
            ## Storing the sentences
            all_sentences.append(joined_sentence)
            ## Reinstantiating the working sentence storage container
            new_sentence = []
        # If the tag is not at the end of a sentence, we can simply add its contents to the list
        else:
            new_sentence.append(tag.text)
            
    
    return (all_sentences)

In [21]:
sentences = extract_all_sentences(a00004)
for sentence in sentences[0:7]:
    print (sentence)

Mercurius Davidicus , OR A Patterne of Loyall Devotion .
Wherein King DAVID sends his PIETIE TO KING CHARLES , His Subjects .
Being the practice of the Primitive Christians , Martyrs , and Confessors , in all Ages ; Very fitting to be used both publick and private in these disloyall TIMES .
Likewise Prayers and Thanksgivings used in the Kings Army before and after BATTELL .
Published by His Majesties Command .
Oxford , Printed by Leonard Leichfield .
1634.
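
One property of `extract_all_sentences` worth noting: tokens that follow the last `unit="sentence"` tag are never appended, so a text that ends without a sentence marker loses its final words. The block below is a small sketch of the same splitting idea with a final flush added. It is an illustration under assumptions, not the EPL implementation: the sample markup is invented, `split_sentences` is a hypothetical name, and because `xml.etree.ElementTree` (used here as a stand-in for `lxml`) has no XPath union, the tags are filtered while iterating in document order.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
WORD = f"{{{TEI_NS}}}w"
PUNCT = f"{{{TEI_NS}}}pc"

# Invented fragment whose last tokens are not closed by a sentence marker.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<w>Published</w><w>by</w><w>Command</w><pc unit="sentence">.</pc>
<w>Oxford</w><pc>,</pc><w>1634</w>
</p></body></text></TEI>"""


def split_sentences(text_root):
    """Split <w>/<pc> tokens into sentences, keeping any trailing remainder."""
    sentences, current = [], []
    for tag in text_root.iter():
        if tag.tag not in (WORD, PUNCT):
            continue
        if tag.text:
            current.append(tag.text)
        if tag.get("unit") == "sentence":
            sentences.append(" ".join(current))
            current = []
    if current:
        # Flush tokens that follow the last sentence marker.
        sentences.append(" ".join(current))
    return sentences


sentences_demo = split_sentences(ET.fromstring(SAMPLE))
print(sentences_demo)
```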


### 2.2 Structured Parsing

In this section, I will create three workflows to generate CSV files with the text. The first will be the word split, the second the sentence split, and the third the paragraph split.

#### 2.2.1 Word Split Parsing


* extract_all_words
* extract_all_modernized_words
* extract_all_lemmas
* extract_all_pos

In [67]:
text_data = [] # Empty list for data
teis = [] # Empty list for TCP IDs

## extracting the metadata from each file
for file_name in tqdm(files,desc="📖🔍 🪄 speed reading...",unit=' text',):
        ## Finding TEI and creating a parse object
    root = root_generator(file_name)
    tei = tei_finder(root)
    teis.append(tei)
    
        ## Parsing the object using the above functions
        
    words = extract_all_words(root)
    modernized = extract_all_modernized_words(root)
    lemmas = extract_all_lemmas(root)
    pos = extract_all_pos(root)

    current_text = {'TEI':tei,'words':words,'modernized':modernized,'lemmas':lemmas,'pos':pos}
    text_data.append(current_text)
      ## Adding the tcp to the index list
    
    
    
print ("✨📚 data has been found")
           



✨📚 data has been found


In [68]:
tei_data = pd.DataFrame(text_data)
tei_data.head()

| | TEI | words | modernized | lemmas | pos |
|---|---|---|---|---|---|
| 0 | A00361 | [¶, A, deuout, treatise, vpon, the, Pater, nos... | [¶, A, devout, treatise, upon, the, Pater, nos... | [¶, a, devout, treatise, upon, the, n/a, n/a, ... | [sy, d, j, n1, acp, d, fla, fla, vvn, ord, acp... |
| 1 | A00544 | [A, DISCOVERY, OF, THE, ABHOminable, Delusions... | [A, DISCOVERY, OF, THE, Abominable, Delusions,... | [a, discovery, of, the, abominable, delusion, ... | [d, n1, acp, d, j, n2, acp, d, crq, vvb, pr, d... |
| 2 | A00630 | [THE, ARTES, OF, LOGIKE, AND, Rethorike, plain... | [THE, ARTS, OF, LOGIC, AND, Rhetoric, plainly,... | [the, art, of, logic, and, rhetoric, plain, se... | [d, n2, acp, n1, cc, n1, av-j, vvn, av, acp, d... |
| 3 | A00174 | [ARTICLES, TO, BE, MINISTRED, ENQVIRED, OF, AN... | [ARTICLES, TO, BE, ministered, ENQVIRED, OF, A... | [article, to, be, minister, enqvire, of, and, ... | [n2, acp, vvb, vvn, vvn, acp, cc, vvd, acp, d,... |
| 4 | A00341 | [THE, COMPARATION, OF, a, Uyrgin, and, a, Mart... | [THE, COMPARATION, OF, a, Uyrgin, and, a, Mart... | [the, comparation, of, a, uyrgin, and, a, mart... | [d, n1, acp, d, n1, cc, d, n1, ab, crd, sy, d,... |


In [69]:
word_data = tei_data.apply(pd.Series.explode)
word_data.head()

| | TEI | words | modernized | lemmas | pos |
|---|---|---|---|---|---|
| 0 | A00361 | ¶ | ¶ | ¶ | sy |
| 0 | A00361 | A | A | a | d |
| 0 | A00361 | deuout | devout | devout | j |
| 0 | A00361 | treatise | treatise | treatise | n1 |
| 0 | A00361 | vpon | upon | upon | acp |
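
The step from one row per text to one row per word relies on every list column in a row having the same length. As a hedged alternative to `apply(pd.Series.explode)`, the sketch below passes a list of columns to `DataFrame.explode`, which pandas supports from version 1.3 onward (that version requirement is an assumption about the environment). The toy data is invented and much smaller than the real corpus.

```python
import pandas as pd

# Toy stand-in for tei_data: one row per text, list-valued columns.
toy = pd.DataFrame({
    "TEI": ["A00001", "A00002"],
    "words": [["deuout", "treatise"], ["vpon"]],
    "modernized": [["devout", "treatise"], ["upon"]],
})

# Explode both list columns together so the word pairs stay aligned,
# then replace the repeated text index with a fresh row index.
by_word = toy.explode(["words", "modernized"]).reset_index(drop=True)

print(by_word)
```

Resetting the index is optional; keeping the repeated index, as the notebook does, preserves a pointer back to the row of the source text.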


#### 2.2.2 Sentence Split Parsing

In [71]:
sentence_data = [] # Empty list for data
teis = [] # Empty list for TCP IDs

## extracting the metadata from each file
for file_name in tqdm(files[0:10],desc="📖🔍 🪄 speed reading...",unit=' text',):
        ## Finding TEI and creating a parse object
    root = root_generator(file_name)
    tei = tei_finder(root)
    teis.append(tei)
    
        ## Parsing the object using the above functions
    sentences = extract_all_sentences(root)

    current_text = {'TEI':tei,'sentences':sentences}
    sentence_data.append(current_text)
    
    
print ("✨📚 data has been found")
           



✨📚 data has been found


In [72]:
tei_full_sentences = pd.DataFrame(sentence_data)
tei_full_sentences.head()

| | TEI | sentences |
|---|---|---|
| 0 | A00361 | [¶ A deuout treatise vpon the Pater noster / m... |
| 1 | A00544 | [A DISCOVERY OF THE ABHOminable Delusions of t... |
| 2 | A00630 | [THE ARTES OF LOGIKE AND Rethorike , plainlie ... |
| 3 | A00174 | [ARTICLES TO BE MINISTRED , ENQVIRED OF , AND ... |
| 4 | A00341 | [THE COMPARATION OF a Uyrgin and a Martyr ., A... |


In [73]:
tei_sentences = tei_full_sentences.explode(column='sentences')
tei_sentences.head()

| | TEI | sentences |
|---|---|---|
| 0 | A00361 | ¶ A deuout treatise vpon the Pater noster / ma... |
| 0 | A00361 | ¶ Richarde Hyrde / vnto the moost studyous and... |
| 0 | A00361 | S. sendeth gretynge and well to fare . |
| 0 | A00361 | I Haue herde many men put great dout● whether ... |
| 0 | A00361 | And some vtterly affyrme that it is nat onely ... |


## 💾 3. Saving The Outputs

The files generated above are saved with a naming convention similar to the one the repository uses on Bitbucket. Each file name starts with the first three characters of the TEI ID (e.g. A00, B00, ...), followed by a descriptive name.

* *Text Data*: Each row is a different text; the words are stored as lists within each row.
* *By Word Data*: Each row is a single word from one of the texts. Rows are ordered chronologically and by TEI ID, least to greatest.
* *Sentence Data*: Each row is a different text; the sentences are stored as lists within each row.
* *By Sentence Data*: Each row is an individual sentence from one of the texts. Rows are ordered chronologically and by TEI ID, least to greatest.

In [74]:
    ## Saving these with a naming convention similar to the one the original EPL repository uses
repo_tei_code_first = tei_data.at[0,'TEI']
repo_tei_code = repo_tei_code_first[0:3]

text_file_name = str(repo_tei_code)+" Text Data.csv"
by_word_file_name = str(repo_tei_code)+ " By Word Data.csv"
sentence_full_file_name = str(repo_tei_code)+" Sentence Data.csv"
sentence_file_name = str(repo_tei_code)+" By Sentence Data.csv"

In [75]:
    ## Saving the files to the current directory

tei_data.to_csv(text_file_name)
word_data.to_csv(by_word_file_name)
tei_full_sentences.to_csv(sentence_full_file_name)
tei_sentences.to_csv(sentence_file_name)
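
The naming rule can also be written as a small helper, which keeps the four file names in one place. This is a sketch only: `output_names` is a hypothetical function, and the suffixes are copied from the cell above.

```python
def output_names(tei_id, suffixes=("Text Data", "By Word Data",
                                   "Sentence Data", "By Sentence Data")):
    """Map each descriptive suffix to '<first three characters of the ID> <suffix>.csv'."""
    prefix = tei_id[:3]
    return {suffix: f"{prefix} {suffix}.csv" for suffix in suffixes}


names = output_names("A00361")
for suffix, file_name in names.items():
    print(f"{suffix}: {file_name}")
```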

## ➿ 4. Function Creation

This section combines everything above into functions, so that each set of CSV files can be generated with a single call per folder.

In [9]:
def by_word_dataframe_maker(files):
    print ("📎WORD PARSING")
    text_data = [] # Empty list for data
    
    ## extracting the metadata from each file
    for file_name in files:
        #print (file_name)
            ## Finding TEI and creating a parse object
        root = root_generator(file_name)
        tei = tei_finder(root)

            ## Parsing the object using the above functions

        words = extract_all_words(root)
        modernized = extract_all_modernized_words(root)
        lemmas = extract_all_lemmas(root)
        pos = extract_all_pos(root)

        current_text = {'TEI':tei,'words':words,'modernized':modernized,'lemmas':lemmas,'pos':pos}
        text_data.append(current_text)
          

    print ("✨📚 data has been found\n... dataframes being assembled")
    tei_data = pd.DataFrame(text_data)
    word_data = tei_data.apply(pd.Series.explode)
    
   
    
    print("✒️... writing files to directory")
    repo_tei_code_first = tei_data.at[0,'TEI']
    repo_tei_code = repo_tei_code_first[0:3]

    text_file_name = str(repo_tei_code)+" FULL Text Data.csv"
    by_word_file_name = str(repo_tei_code)+ " SPLIT Word Data.csv"
    
    tei_data.to_csv(text_file_name)
    word_data.to_csv(by_word_file_name)
    
    print (f"🤖💬 all files in the {repo_tei_code} have been proccessed and saved successfully")

In [10]:
def by_sentence_dataframe_maker(files):
    print ("📎SENTENCE PARSING")
    sentence_data = [] # Empty list for data
    teis = [] # Empty list for TCP IDs

    ## extracting the metadata from each file
    for file_name in files:
            ## Finding TEI and creating a parse object
        root = root_generator(file_name)
        tei = tei_finder(root)
        teis.append(tei)

            ## Parsing the object using the above functions
        sentences = extract_all_sentences(root)

        current_text = {'TEI':tei,'sentences':sentences}
        sentence_data.append(current_text)


    print ("✨📚 data has been found\n...dataframes are being assembled")
    tei_full_sentences = pd.DataFrame(sentence_data)
    tei_sentences = tei_full_sentences.explode(column='sentences')

    
    print("✒️... writing files to directory")
    repo_tei_code_first = tei_full_sentences.at[0,'TEI']
    repo_tei_code = repo_tei_code_first[0:3]
    
    sentence_full_file_name = str(repo_tei_code)+" FULL Sentence Data.csv"
    sentence_file_name = str(repo_tei_code)+" SPLIT Sentence Data.csv"
    tei_full_sentences.to_csv(sentence_full_file_name)
    tei_sentences.to_csv(sentence_file_name)
    
    print (f"🤖💬 all files in the {repo_tei_code} have been proccessed and saved successfully")

In [11]:
def folder_iterator(path):
    folder_path = glob.glob(r"/scratch/alpine/naca4005/texts/"+path)
    for folder in folder_path:
        xml_paths = folder+"/*.xml"
        xml_files = glob.glob(xml_paths)
        
        #by_word_dataframe_maker(xml_files)
        by_sentence_dataframe_maker(xml_files)

In [None]:
folder_iterator("A02")

📎WORD PARSING
✨📚 data has been found
... dataframes being assembled
✒️... writing files to directory
🤖💬 all files in the A02 have been proccessed and saved successfully
📎SENTENCE PARSING


In [12]:
folder_iterator("A02")

📎SENTENCE PARSING
✨📚 data has been found
...dataframes are being assembled
✒️... writing files to directory
🤖💬 all files in the A02 have been proccessed and saved successfully


In [25]:
for path in dir_list[0:4]:
    print (f"\n🆕{path}")
    folder_iterator(path)


🆕A07
📎WORD PARSING
✨📚 data has been found
... dataframes being assembled
✒️... writing files to directory
🤖💬 all files in the A07 have been proccessed and saved successfully
📎SENTENCE PARSING


KeyboardInterrupt: 
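
Because a full run can be interrupted, as it was here, it can help to skip folders whose output already exists when restarting. The block below is a minimal sketch of that idea. The helper `pending_folders` is hypothetical, and it assumes that the folder name matches the three-character prefix used in the output file name, as it does for `A02` above. The file suffix is the one written by `by_sentence_dataframe_maker`.

```python
import os
import tempfile


def pending_folders(folder_names, output_dir,
                    suffix=" SPLIT Sentence Data.csv"):
    """Return the folders that do not yet have a sentence-split CSV in output_dir."""
    return [
        name for name in folder_names
        if not os.path.exists(os.path.join(output_dir, name + suffix))
    ]


# Demonstration with a temporary directory that already holds the A02 output.
with tempfile.TemporaryDirectory() as out_dir:
    open(os.path.join(out_dir, "A02 SPLIT Sentence Data.csv"), "w").close()
    todo = pending_folders(["A02", "A07"], out_dir)

print(todo)
```

The returned list could then replace `dir_list[0:4]` in the loop above, so that only unfinished folders are parsed again.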

## 5. API Data Pull

In [20]:
pd.read_csv("A02 SPLIT Sentence Data.csv")

| | Unnamed: 0 | TEI | sentences |
|---|---|---|---|
| 0 | 0 | A02262 | CHRISTS PASSION . |
| 1 | 0 | A02262 | A TRAGEDIE . |
| 2 | 0 | A02262 | WITH ANNOTATIONS . |
| 3 | 0 | A02262 | LONDON , Printed by Iohn Legatt . |
| 4 | 0 | A02262 | M. D. C. XL. M. |
| ... | ... | ... | ... |
| 722170 | 524 | A02403 | That which is objected chiefly by covetous and... |
| 722171 | 524 | A02403 | Had indeed his Majestie any way ayded eyther t... |
| 722172 | 524 | A02403 | But now seeing his Majestie has beene alwayes ... |
| 722173 | 524 | A02403 | Wherefore seeing his Majestie protests he ente... |
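
The `Unnamed: 0` column in this output is the dataframe index that `to_csv` wrote to the file, read back as ordinary data. The sketch below reproduces the effect on a two-row toy frame and shows one way to restore the index with `index_col=0`. It writes to an in-memory buffer rather than to disk, and the toy rows are taken from the output above.

```python
import io

import pandas as pd

# Toy frame with a repeated index, like the exploded sentence data.
frame = pd.DataFrame(
    {"TEI": ["A02262", "A02262"],
     "sentences": ["CHRISTS PASSION .", "A TRAGEDIE ."]},
    index=[0, 0],
)

buffer = io.StringIO()
frame.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as a data column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index restored

print(list(naive.columns))
print(list(restored.columns))
```

If the per-text index is not needed downstream, passing `index=False` to `to_csv` avoids the extra column altogether.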
