# Parsing a TEI Document

## Brief Introduction to TEI

The Text Encoding Initiative (TEI) is both a standard for encoding texts so that they are machine-actionable and the organization that oversees that standard.

As a standard, TEI gives humanities scholars a uniform way to encode literary and documentary texts so that they can be displayed, searched, or processed by machine. In practice, TEI is a vocabulary of tags built on top of basic XML.

To learn more, please see the following:
* [Text Encoding Initiative Home Page](https://tei-c.org/)
* [What is the TEI? (Women Writers Project)](https://www.wwp.northeastern.edu/outreach/seminars/tei.html)
* [TEI By Example Project](https://teibyexample.org/)
* [Introduction to XML](https://www.w3schools.com/xml/xml_whatis.asp)
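
To make "a set of tags built on XML" concrete, here is a minimal sketch of a TEI-style fragment parsed with Python's standard library. The fragment is invented for illustration and heavily simplified; it is not taken from the file used later in this notebook and is not meant to be schema-valid TEI. The prefix `tei` is an arbitrary label chosen here for the TEI namespace.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up TEI-style fragment (an assumption for illustration only)
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Title</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Body text.<note place="bottom">A sample footnote.</note></p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(sample)

# ElementTree reports each tag with its namespace in curly braces
print(root.tag)

# Map a prefix of our choosing to the TEI namespace for lookups
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
title = root.find(".//tei:title", ns)
print(title.text)
```

The header carries metadata such as the title, while the `text` element holds the encoded document itself, including notes marked with a `place` attribute.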

## Parsing TEI

### Set up

In [1]:
# ! pip install beautifulsoup4 lxml

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
# download the TEI XML file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
xml_str = response.text

In [4]:
xml_str[:250]

'<TEI xmlns="http://www.tei-c.org/ns/1.0">\n    <teiHeader>\n        <fileDesc>\n            <titleStmt>\n                <title>The history of the decline and fall of the Roman Empire: By Edward Gibbon, Esq; ... [pt.2]</title>\n                <author>Gib'

### Parse TEI

In [5]:
# use BeautifulSoup to create an XML object
# the "xml" parser requires the lxml package
xml = BeautifulSoup(xml_str, 'xml')

In [6]:
# find all footnotes
footnotes = xml.find_all('note', attrs={'place': 'bottom'})

In [7]:
len(footnotes)

828
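
The attribute filter is what separates footnotes from other kinds of notes. As a rough illustration, the same selection can be sketched with the standard library's `xml.etree.ElementTree` on a small invented fragment. The sample text, the `margin` value, and the `hi` child element are assumptions for demonstration, not content from `gibbon.xml`.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up fragment: two footnotes and one note placed elsewhere
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>First claim.<note place="bottom">Footnote one.</note></p>
<p>Second claim.<note place="margin">A marginal note.</note></p>
<p>Third claim.<note place="bottom">Footnote <hi>two</hi>.</note></p>
</body></text></TEI>"""

root = ET.fromstring(sample)

# Select only <note> elements whose place attribute equals "bottom"
bottom_notes = root.findall(".//tei:note[@place='bottom']", TEI_NS)

# itertext() gathers text from the note and any nested elements
note_texts = ["".join(n.itertext()) for n in bottom_notes]

print(len(bottom_notes))
print(note_texts)
```

Only the two notes with `place="bottom"` are returned; the marginal note is skipped, and text inside nested elements is still included in the footnote text.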

In [8]:
footnotes[323].text

'Conſtantin. Porphyr. de Statu Imper. p. 86.'

In [9]:
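# remove line breaks and pairs of spaces
# note: nothing is inserted in their place, so words separated only by a line break are joined
def remove_extra_space(text):
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    return text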
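
Because line breaks are deleted rather than replaced, words that were separated only by a newline and indentation can end up joined. One possible alternative, sketched here under the assumption that every run of whitespace should become a single space, is to split on whitespace and rejoin. The function name `normalize_whitespace` and the sample string are hypothetical and are not part of the original notebook.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Hypothetical input resembling a note broken across indented lines
sample = "Pons Aureoli,\n        thirteen miles   from Bergamo."
print(normalize_whitespace(sample))
```

This keeps a single space between words and also trims leading and trailing whitespace. Swapping it into the pipeline would change the text values shown in the outputs below.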

In [10]:
# prepare data for dataframe: one dict per footnote, numbered from 1
processed_notes = []
for i, footnote in enumerate(footnotes, start=1):
    processed_notes.append({
        "number": f'footnote {i}',
        "text": remove_extra_space(footnote.text),
    })

In [11]:
processed_notes[67]

{'number': 'footnote 68',
 'text': 'It was five hundred and thirty-ſeven miles from Seleucia, and two hundred and three from the neareſt coaſt of Syria, according to the reckoning of Pliny, who, in a few words (Hiſt. Natur. v. 21.), gives an excellent deſcription of Palmyra.'}

In [12]:
# convert the list of dicts to a dataframe
df = pd.DataFrame(processed_notes)

In [14]:
df.head()

|   | number | text |
|---|---|---|
| 0 | footnote 1 | Pons Aureoli,thirteen miles from Bergamo, and ... |
| 1 | footnote 2 | On the death of Gallienus, ſee Trebellius Poll... |
| 2 | footnote 3 | Some ſuppoſed him, oddly enough, to be a baſta... |
| 3 | footnote 4 | Notoria,a periodical and official diſpatch whi... |
| 4 | footnote 5 | Hiſt. Auguſt. p. 208. Gallienus deſcribes the ... |


In [15]:
# save dataframe as csv (index=False omits the row index column)
file_name = 'gibbon_footnote.csv'
df.to_csv(file_name, index=False)
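
Since the footnotes contain non-ASCII characters such as the long s (ſ) as well as commas, it can be worth checking that the text survives a round trip through CSV. pandas writes UTF-8 by default. The sketch below uses only the standard library `csv` module so that it runs without extra dependencies. The rows and the file name `sample_footnotes.csv` are made up for illustration and are written to a temporary directory rather than next to the notebook.

```python
import csv
import os
import tempfile

# Hypothetical rows mimicking the structure of processed_notes
rows = [
    {"number": "footnote 1", "text": "Hiſt. Auguſt. p. 208."},
    {"number": "footnote 2", "text": "A note, with a comma."},
]

path = os.path.join(tempfile.mkdtemp(), "sample_footnotes.csv")

# Write with an explicit UTF-8 encoding; newline="" lets csv manage line endings
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["number", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and compare with what was written
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded == rows)
```

Fields containing commas are quoted automatically on writing and unquoted on reading, so the text column comes back unchanged.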