# Parsing a TEI Document - Homework

## Directions

Parse the TEI of Gibbon's _Decline and Fall_ to extract all the **marginal notes** (XML file provided).

1. Extract all marginal notes
2. Remove extraneous whitespace
3. Place marginal notes in a dataframe
4. Save the dataframe as a CSV file


## Hint

Here is a snippet of what a marginal note in the xml document looks like:

`<note place="margin">A. D. 268. March 20. Death of Gallienus.</note>`

These differ from the footnotes that we saw in class in that (a) they do not have numbers and (b) the whitespace is different. You are free to accommodate that however you like.

### Set up

In [36]:
# ! pip install beautifulsoup4

In [37]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [38]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
xml_file = response.text

### Parse TEI

In [39]:
# use BeautifulSoup to create an xml object
# (naming the parser avoids bs4's guessed-parser warning; "xml" requires lxml)
xml = BeautifulSoup(xml_file, "xml")

In [40]:
# find all footnotes
footnotes = xml.find_all('note', attrs={'place': 'bottom'})
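
The cell above selects footnotes (`place="bottom"`), while the directions ask for marginal notes (`place="margin"`). A minimal sketch of the same `find_all` call with the attribute value swapped is shown here. It runs against a small inline sample rather than the Gibbon file: the `sample` string and its footnote text are hypothetical, and `html.parser` is used only so the sketch runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample; the marginal note is the one quoted in the hint,
# the footnote text is a placeholder.
sample = """
<div>
  <p>
    <note place="margin">A. D. 268. March 20. Death of Gallienus.</note>
    <note place="bottom">Example footnote text.</note>
  </p>
</div>
"""

# html.parser is an assumption made so the sketch needs no lxml install
soup = BeautifulSoup(sample, "html.parser")

# same call as for footnotes, but filtering on place="margin"
margin_notes = soup.find_all("note", attrs={"place": "margin"})
texts = [note.get_text() for note in margin_notes]
print(texts)
```

Only the note whose `place` attribute equals `margin` is returned; the footnote is skipped.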

In [42]:
# remove extra space (if needed)
def remove_extra_space(text):
    text = text.replace("\n", " ")
    text = text.replace("  ", " ")
    return text
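
Note that `text.replace("  ", " ")` makes a single pass over non-overlapping pairs, so a run of three or more spaces is only shortened, not collapsed to one. One possible alternative, sketched here with a hypothetical helper name `normalize_whitespace` and a made-up input string, collapses every run of whitespace (spaces, tabs, newlines) with a regular expression:

```python
import re

def normalize_whitespace(text):
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical input imitating a note that spans several indented lines
raw = "A. D. 268.\n        March 20.    Death of Gallienus.  "

pair_replaced = raw.replace("\n", " ").replace("  ", " ")  # single-pass approach
normalized = normalize_whitespace(raw)                      # regex approach

print(repr(pair_replaced))
print(repr(normalized))
```

The single-pass result still contains runs of spaces, while the regex version leaves exactly one space between words.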

In [43]:
# prepare data for dataframe
notes = []
# enumerate supplies the running footnote number, starting at 1
for i, footnote in enumerate(footnotes, start=1):
    notes.append({
        'number': f'footnote {i}',
        'text': remove_extra_space(footnote.text),
    })

In [44]:
# convert to dataframe
xml_df = pd.DataFrame(notes)

In [45]:
# preview the dataframe (pandas abbreviates long frames to the first and last rows)
print(xml_df)

           number                                               text
0      footnote 1   Pons Aureoli,             thirteen miles from...
1      footnote 2               On the death of Gallienus, ſee Tr...
2      footnote 3               Some ſuppoſed him, oddly enough, ...
3      footnote 4   Notoria,               a periodical and offic...
4      footnote 5  Hiſt. Auguſt. p. 208. Gallienus deſcribes the ...
..            ...                                                ...
823  footnote 824               Euſebius de Martyr. Paleſtin. c. ...
824  footnote 825               When Paleſtine was divided into t...
825  footnote 826  Ut gloriari poſſint nullum ſe innocentium pere...
826  footnote 827  Grot. Annal. de Rebus Belgicis, l. i. p. 12. E...
827  footnote 828  Fra-Paolo (Iſtoria del Concilio Tridentino, l....

[828 rows x 2 columns]

In [46]:
# save dataframe as csv
file_name = "gibbondataframe.csv"
# index=False leaves the pandas row index out of the file
xml_df.to_csv(file_name, index=False)
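
Because marginal notes carry no numbers, a single text column may be enough for them. The sketch below illustrates steps 3 and 4 of the directions under that assumption. The names `margin_texts` and `margin_df` are hypothetical, the second note is a placeholder, and the CSV is written to an in-memory buffer instead of a file so the example leaves nothing on disk.

```python
import io

import pandas as pd

# Hypothetical cleaned marginal notes; the first is the sample from the hint
margin_texts = [
    "A. D. 268. March 20. Death of Gallienus.",
    "Example marginal note.",
]

# one row per note, one column holding the note text
margin_df = pd.DataFrame({"text": margin_texts})

# write CSV to an in-memory buffer; pass a file name instead to save to disk
buffer = io.StringIO()
margin_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# read the CSV back to confirm the notes survive the round trip
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

Reading the file back is a quick way to check that the header and every note were written as intended.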