# Parsing a TEI Document - Homework

## Directions

Parse the tei of Gibbon's _Decline and Fall_ to extract all the **marginal notes**. (XML file provided)
1. Extract all marginal notes
2. Remove extraneous whitespace
3. Place marginal notes in a dataframe
4. Save teh dataframe as a csv file


## Hint

Here is a snippet of what a marginal note in the xml document looks like:

`<note place="margin">A. D. 268. March 20. Death of Gallienus.</note>`

These are different from the footnotes that we saw in class in that (a) they do not have numbers and (b) the white space is different. You are free to accomodate for that however you would like.

### Set up

In [1]:
# ! pip install beatifulsoup4

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
xml_file = response.text

### Parse TEI

In [5]:
# use BeautifulSoup to creat an xml object
xml = BeautifulSoup(xml_file)

In [6]:
# find all marginal note
marginal_notes = xml.find_all( 'note', attrs = {'place': 'margin'})

In [7]:
# remove extra space (if needed)
def removing_space(xml_file):
    xml_file = xml_file.replace("\n", " ")
    Xml_file = xml_file.replace("  ", " ")
    return xml_file

In [8]:
# prepare data for dataframe
notes = []
i = 1
for margins in marginal_notes: 
    h = {}
    margins_num = f'margins {str(i)}'
    margins_text = removing_space(margins.text)
    h['number']= margins_num
    h['text']= margins_text
    notes.append(h)
    i+=1

In [9]:
# convert to dataframe
xml_df=pd.DataFrame.from_dict(notes)

In [10]:
xml_df.head(10)

Unnamed: 0,number,text
0,margins 1,Aureolus invades Ital...
1,margins 2,A. D. 268.
2,margins 3,A. D. 268. March 20. Death of Gallienus.
3,margins 4,Character and eleva ...
4,margins 5,Death of Aureolus.
5,margins 6,Clemency and juſtice ...
6,margins 7,He under ...
7,margins 8,A. D. 269. The Goths invade the empire.
8,margins 9,Diſtreſs and firm ...
9,margins 10,His victory over the Goths.


In [11]:
# save dataframe as csv
file_name = "gibbondataframe.csv"