# EPL Corpus Cleaning

The purpose of this notebook is to index the EPL corpus by section in the book and match each text with its correct metadata. This notebook is part 2 of the EPL Corpus Assemblage.

## 🌐 1. Environment Creation

### 1.1 Library Import

The libraries used in this section fall under four categories: data access, data query, data structuring, and a progress bar.

In [2]:
''' DATA ACCESS '''
import glob
import json
import os

''' DATA QUERY '''
from lxml import etree
parser = etree.XMLParser(collect_ids=False,encoding='utf-8')
nsmap = {'tei': 'http://www.tei-c.org/ns/1.0'} ### EPL Source

''' DATA STRUCTURING '''
import pandas as pd

''' PROGRESS BAR '''
from tqdm.notebook import tqdm

### 1.2 Data Import

This section reads the files in the split-by-sentence (non-exploded) folder of the EPL corpus. The second input is the EPL metadata generated in 'Extracting EPL Metadata'.

#### 1.2.1 Reading in the Metadata

This metadata was generated from the information provided in the EPL XML files. Each row represents one text in the corpus.

In [4]:
metadata_filepath = r"/scratch/alpine/naca4005/texts csvs/supplementary_metadata.csv"

In [5]:
metadata = pd.read_csv(metadata_filepath)
metadata.drop(columns='Unnamed: 0',inplace=True)
metadata.rename(columns={'TCP ID':"TEI"},inplace=True)

In [6]:
metadata.head(2)

Unnamed: 0,TEI,title,author,gender,auth birth,auth death,pub date,publisher,location
0,A00001,[The passoinate [sic] morrice],"A.,",L,,,1593.0,R. Bourne?,[London :
1,A00002,"The brides ornaments viz. fiue meditations, mo...","Aylett, Robert,",M,1583.0,1655?.,1625.0,William Stansby,London :


In [7]:
len(metadata)

60331

#### 1.2.2 Reading in the Corpus

In [8]:
csv_filepath = glob.glob(r"/scratch/alpine/naca4005/texts csvs/by sentence/*")

In [9]:
len(csv_filepath)

139

In [23]:
''' STORING THE CSV FILES '''
## Creating a storage container for the CSVs
csv_holder = []

## Iterating through the filepaths (this run reads only the files from position 70 onward)
for path in csv_filepath[70:]:
    current_csv = pd.read_csv(path)
    csv_holder.append(current_csv)

In [24]:
''' CREATING A DF OF ALL OF THE TEXTS '''
## Concatenating the list of CSV files & creating a new index
corpus = pd.concat(csv_holder)
corpus.reset_index(inplace=True)
corpus.drop(columns=['index','Unnamed: 0'],inplace=True)

## Parsing the stringified sentence lists into proper lists
## (ast.literal_eval only accepts Python literals, unlike eval)
import ast
corpus['sentences'] = corpus['sentences'].apply(ast.literal_eval)
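
List cells do not survive a CSV round trip as lists; `read_csv` returns them as strings that look like Python list literals. The following is a minimal sketch of parsing such strings with `ast.literal_eval`, which accepts only literals and raises an error on anything else. The frame `toy` and the `X0000n` identifiers are made up for illustration and are not part of the EPL corpus.

```python
import ast

import pandas as pd

# Hypothetical stand-in for one "by sentence" CSV after read_csv:
# each cell in 'sentences' is a string, not a list.
toy = pd.DataFrame({
    "TEI": ["X00001", "X00002"],
    "sentences": [
        "['First sentence .', 'Second sentence .']",
        "['Only sentence .']",
    ],
})

# literal_eval parses Python literals only, so a cell containing
# arbitrary code raises an error instead of being executed.
toy["sentences"] = toy["sentences"].apply(ast.literal_eval)

# Each cell is now a real list whose length can be measured.
parsed_lengths = toy["sentences"].apply(len).tolist()
```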

In [25]:
corpus

Unnamed: 0,TEI,sentences
0,B16700,"[Advertisement ., I AM sensible this Publick w..."
1,B16191,[❧ CERTAYNE LITEL TREATIES Set forth by Iohn V...
2,B16427,[¶ The proude wyues Pater noster that wolde go...
3,B16185,[An epistle vnto the right honorable and chris...
4,B16656,"[LONDON ss ., Ad Generalem Quarterial ' Sessio..."
...,...,...
30414,A20773,"[¶ To all Parsons , Vicares , Curates , School..."
30415,A20732,[THE COVENANT OF GRACE OR AN EXPOSITION VPON L...
30416,A20591,"[The Preface to the Reader ., GENTLE READER , ..."
30417,A20648,[A SERMON OF COMMEMORATION OF THE Lady Danuers...


## 👷🏻‍♀️ 2. Restructuring the EPL Corpus

In [26]:
corpus.head()

Unnamed: 0,TEI,sentences
0,B16700,"[Advertisement ., I AM sensible this Publick w..."
1,B16191,[❧ CERTAYNE LITEL TREATIES Set forth by Iohn V...
2,B16427,[¶ The proude wyues Pater noster that wolde go...
3,B16185,[An epistle vnto the right honorable and chris...
4,B16656,"[LONDON ss ., Ad Generalem Quarterial ' Sessio..."


In [27]:
''' MATCHING METADATA '''

corpus_and_metadata = corpus.merge(right=metadata,on='TEI')

In [28]:
print (f"The length of the corpus is: {len(corpus)}\nThe length of the metadata is: {len(metadata)}\nThe length of the merged data is {len(corpus_and_metadata)}")

The length of the corpus is: 30419
The length of the metadata is: 60331
The length of the merged data is 30419
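
The merged length equals the corpus length, which suggests that every corpus text found exactly one metadata row. The sketch below shows one way to make that check explicit with the `validate` and `indicator` options of `DataFrame.merge`. The frames and `X0000n` identifiers are hypothetical, and a left join is used here (an assumption, differing from the default inner join above) so that unmatched texts stay visible instead of being dropped.

```python
import pandas as pd

# Hypothetical toy data: one corpus text (X00003) has no metadata row.
toy_corpus = pd.DataFrame({
    "TEI": ["X00001", "X00002", "X00003"],
    "sentences": [["a ."], ["b ."], ["c ."]],
})
toy_metadata = pd.DataFrame({
    "TEI": ["X00001", "X00002", "X00009"],
    "title": ["T1", "T2", "T9"],
})

checked = toy_corpus.merge(
    right=toy_metadata,
    on="TEI",
    how="left",              # keep every corpus row, matched or not
    validate="many_to_one",  # raise if a TEI repeats in the metadata
    indicator=True,          # add a '_merge' column describing each row
)

# Corpus texts that found no metadata row.
unmatched = checked.loc[checked["_merge"] == "left_only", "TEI"].tolist()
```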


In [29]:
corpus_and_metadata.head()

Unnamed: 0,TEI,sentences,title,author,gender,auth birth,auth death,pub date,publisher,location
0,B16700,"[Advertisement ., I AM sensible this Publick w...",Advertisement. I am sensible this publick way ...,,,,,1670.0,,[London :
1,B16191,[❧ CERTAYNE LITEL TREATIES Set forth by Iohn V...,Certayne litel treaties setforth by John Veron...,"Véron, John,",M,,-1563.0,1548.0,Humfrey Powell,[Imprynted at London :
2,B16427,[¶ The proude wyues Pater noster that wolde go...,The proude wyues pater noster that wolde go ga...,,,,,1560.0,Iohn Kynge],[Imprinted at London :
3,B16185,[An epistle vnto the right honorable and chris...,An epistle vnto the right honorable and christ...,"Vermigli, Pietro Martire,",M,1499.0,1562.0,1550.0,Byllinges gate],[Imprynted at Londo[n] :
4,B16656,"[LONDON ss ., Ad Generalem Quarterial ' Sessio...",London ss. Ad generalem quarterial sessionem P...,,,,,1677.0,Andrew Clark,[London] :


In [30]:
''' EXPLODING THE DATAFRAME '''
corpus_exploded = corpus_and_metadata.explode(column='sentences')
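
As a small illustration of what `explode` does to this kind of frame, the sketch below uses a hypothetical two-text frame. Each list element becomes its own row, the other columns are repeated, and the original row label is repeated as well.

```python
import pandas as pd

# Hypothetical two-text frame with one list of sentences per text.
toy = pd.DataFrame({
    "TEI": ["X00001", "X00002"],
    "sentences": [["One .", "Two ."], ["Three ."]],
    "title": ["T1", "T2"],
})

# One row per sentence; metadata columns are copied onto every row.
exploded = toy.explode(column="sentences")

# The original row labels repeat, one per sentence of that text.
row_labels = exploded.index.tolist()
```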

In [31]:
''' CREATING INDICES '''
## Grouping the corpus by TEI to generate an index for each document
corpus_grouped = corpus_exploded.groupby(by='TEI')

## Creating an index to parse the groups
teis = corpus['TEI'].to_list()

In [19]:
## Iterating through each TEI to assign an index
indexed_groups = []
for tei in teis:
    ## Finding the current group
    current_text = corpus_grouped.get_group(tei).copy()
    
    ## Generating a list with a 1 - len index to fit the current group
    current_length = len(current_text)
    index_list = []
    tei_sentence_index = []
    for index_value in range (1,(current_length+1)):
        index_list.append(index_value)
        tei_sentence_index.append(str(tei)+("_")+str(index_value))
    
    current_text['index'] = index_list
    current_text['TEI INDEX'] = tei_sentence_index
    ## Appending a new copy of the group to a list to reinstantiate
    indexed_groups.append(current_text)
    
## Creating the dataframe
tei_index = pd.concat(indexed_groups)
tei_index.reset_index(inplace=True)
tei_index.drop(columns='level_0',inplace=True)

In [20]:
tei_index.head()

Unnamed: 0,TEI,sentences,title,author,gender,auth birth,auth death,pub date,publisher,location,index,TEI INDEX
0,A49921,"THE LABOURING PERSONS Remembrancer : OR , A Pr...","The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,1,A49921_1
1,A49921,With Suitable DEVOTIONS .,"The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,2,A49921_2
2,A49921,"OXFORD , Printed by L. Lichfield , A. D. 1690.","The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,3,A49921_3
3,A49921,THE LABOURING PERSONS Remembrancer .,"The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,4,A49921_4
4,A49921,"MAN , as Eliphaz saith , is born to Labour as ...","The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,5,A49921_5
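
The per-text numbering can also be produced without an explicit loop. The following is a sketch of one vectorized alternative using `GroupBy.cumcount`, shown on a hypothetical toy frame; it is not the method used in this notebook. Unlike the loop, it keeps the rows in their existing order rather than regrouping them by TEI.

```python
import pandas as pd

# Hypothetical exploded frame; the rows of X00001 are deliberately
# not contiguous to show that numbering follows row order per TEI.
toy_exploded = pd.DataFrame({
    "TEI": ["X00001", "X00001", "X00002", "X00001"],
    "sentences": ["One .", "Two .", "Three .", "Four ."],
})

indexed = toy_exploded.copy()

# cumcount numbers the rows of each group from 0, so add 1 for a 1-based index.
indexed["index"] = indexed.groupby("TEI").cumcount() + 1

# Same '<TEI>_<n>' identifier format as the loop version.
indexed["TEI INDEX"] = indexed["TEI"] + "_" + indexed["index"].astype(str)
```

On a corpus of this size, avoiding one `get_group` call and one Python-level inner loop per text may reduce the running time noticeably, though that has not been measured here.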


In [32]:
## Iterating through each TEI to assign an index
indexed_groups2 = []
for tei in teis:
    ## Finding the current group
    current_text = corpus_grouped.get_group(tei).copy()
    
    ## Generating a list with a 1 - len index to fit the current group
    current_length = len(current_text)
    index_list = []
    tei_sentence_index = []
    for index_value in range (1,(current_length+1)):
        index_list.append(index_value)
        tei_sentence_index.append(str(tei)+("_")+str(index_value))
    
    current_text['index'] = index_list
    current_text['TEI INDEX'] = tei_sentence_index
    ## Appending a new copy of the group to a list to reinstantiate
    indexed_groups2.append(current_text)
    
## Creating the dataframe
tei_index2 = pd.concat(indexed_groups2)
tei_index2.reset_index(inplace=True)
tei_index2.drop(columns='level_0',inplace=True)

In [33]:
tei_index2.head()

Unnamed: 0,TEI,sentences,title,author,gender,auth birth,auth death,pub date,publisher,location,index,TEI INDEX
0,B16700,Advertisement .,Advertisement. I am sensible this publick way ...,,,,,1670.0,,[London :,1,B16700_1
1,B16700,I AM sensible this Publick way of Practise has...,Advertisement. I am sensible this publick way ...,,,,,1670.0,,[London :,2,B16700_2
2,B16700,"As I have been bred to Physick , so I give Adv...",Advertisement. I am sensible this publick way ...,,,,,1670.0,,[London :,3,B16700_3
3,B16700,"An old inveterate Head-ach , though of some ye...",Advertisement. I am sensible this publick way ...,,,,,1670.0,,[London :,4,B16700_4
4,B16700,"Convulsion-Fits and Epilepsies , either in You...",Advertisement. I am sensible this publick way ...,,,,,1670.0,,[London :,5,B16700_5


## 💾 3. Saving the TEI Index Corpus

In [34]:
tei_full = pd.concat([tei_index,tei_index2])

In [39]:
print (f"The expected len for tei_full is {len(tei_index)+len(tei_index2)}, the actual len for tei_full is {len(tei_full)}")

The expected len for tei_full is 52452244, the actual len for tei_full is 52452244


In [38]:
tei_full.to_csv("EPL SENTENCE CORPUS.csv")
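
With default settings, `to_csv` also writes the row index, and `read_csv` later returns it as an extra `Unnamed: 0` column. Concatenating two frames without renumbering also leaves duplicate row labels. The sketch below shows one possible way to avoid both, using hypothetical toy frames and an in-memory buffer in place of a file.

```python
import io

import pandas as pd

# Hypothetical halves of an indexed corpus.
part_a = pd.DataFrame({"TEI": ["X00001", "X00001"], "index": [1, 2]})
part_b = pd.DataFrame({"TEI": ["X00002"], "index": [1]})

# ignore_index=True renumbers the rows 0..n-1 across both halves.
combined = pd.concat([part_a, part_b], ignore_index=True)

# index=False keeps the row labels out of the CSV entirely.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)

# Reading back yields only the real columns.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
columns_after_reload = reloaded.columns.tolist()
```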

In [40]:
tei_full.describe()

Unnamed: 0,pub date,index
count,51783160.0,52452240.0
mean,1645.001,7079.814
std,38.73029,14122.0
min,1474.0,1.0
25%,1621.0,495.0
50%,1652.0,1704.0
75%,1676.0,5837.0
max,1818.0,138591.0


## 📖 4. Testing EPL SENTENCE CORPUS 

In [3]:
save_test = pd.read_csv("EPL SENTENCE CORPUS.csv")


In [4]:
save_test.head()

Unnamed: 0.1,Unnamed: 0,TEI,sentences,title,author,gender,auth birth,auth death,pub date,publisher,location,index,TEI INDEX
0,0,A49921,"THE LABOURING PERSONS Remembrancer : OR , A Pr...","The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,1,A49921_1
1,1,A49921,With Suitable DEVOTIONS .,"The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,2,A49921_2
2,2,A49921,"OXFORD , Printed by L. Lichfield , A. D. 1690.","The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,3,A49921_3
3,3,A49921,THE LABOURING PERSONS Remembrancer .,"The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,4,A49921_4
4,4,A49921,"MAN , as Eliphaz saith , is born to Labour as ...","The labouring persons remembrancer, or, A prac...","Lee, Francis,",M,,,1690.0,L. Lichfield,Oxford :,5,A49921_5


In [5]:
save_test.describe()

Unnamed: 0.1,Unnamed: 0,pub date,index
count,52452240.0,51783160.0,52452240.0
mean,13384480.0,1645.001,7079.814
std,8022603.0,38.73029,14122.0
min,0.0,1474.0,1.0
25%,6556530.0,1621.0,495.0
50%,13113060.0,1652.0,1704.0
75%,19669590.0,1676.0,5837.0
max,29999280.0,1818.0,138591.0


In [6]:
len(save_test)

52452244
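
Beyond comparing lengths and summary statistics, the saved file can be checked for internal consistency. The sketch below, on a hypothetical toy frame with the same column names, tests that every `TEI INDEX` is unique and that each text's `index` runs from 1 to its sentence count. Taken together, the two conditions imply a gap-free numbering, assuming every `TEI INDEX` was built from its row's `TEI` and `index`.

```python
import pandas as pd

# Hypothetical stand-in for the reloaded sentence corpus.
toy_saved = pd.DataFrame({
    "TEI": ["X00001", "X00001", "X00002"],
    "index": [1, 2, 1],
    "TEI INDEX": ["X00001_1", "X00001_2", "X00002_1"],
})

# Every sentence identifier should occur exactly once.
ids_unique = toy_saved["TEI INDEX"].is_unique

# Per text: the smallest index is 1 and the largest equals the row count.
per_text = toy_saved.groupby("TEI")["index"].agg(["min", "max", "count"])
indices_contiguous = bool(
    ((per_text["min"] == 1) & (per_text["max"] == per_text["count"])).all()
)
```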