# DS 5001 Project Notebook: Greek and Roman Mythology

- David Vann (dv6bq@virginia.edu)
- DS 5001
- 5 May 2021

In [1]:
import os
from glob import glob

import numpy as np
import pandas as pd
import nltk

from eta_modules.preprocessing import Document, create_tables, extract_vocab

## Reading in the data

We start by loading the XML file for each work and parsing it into chapters, paragraphs, sentences, and tokens with BeautifulSoup and NLTK.

Because these works are all plays, poems, or epics, the concepts of "chapter" and "paragraph" don't map onto them as cleanly as they would onto, say, a novel. However, the Perseus Digital Library (the source of these files) has added at least top-level divisions to break up the texts. In some cases these divisions exist in the original text (for example, *The Iliad* is divided into 24 books); in others, such as the plays, they don't appear to be present in the text itself but are akin to a "scene". I treat each of these top-level divisions as a "chapter".

To approximate a "paragraph", I used a different approach depending on whether the work is a play:

- For plays, I used each speaker section (denoted by an `<sp>` tag in the files) as a "paragraph".
- For everything else, there is no built-in tag for paragraph-like divisions, but there is a self-closing `milestone` tag that marks the start of each new "card", the unit the Perseus website displays on one page. Because these tags are self-closing, they don't enclose the block of text I wanted; instead, I replaced each one with newlines and split the text on double newlines, which gave fairly satisfactory results.
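
The two splitting strategies above can be sketched roughly as follows. This is a minimal illustration, not the code in `eta_modules`: it uses the standard library (`xml.etree.ElementTree` and `re`) instead of BeautifulSoup so that it runs on its own, the function names `split_play` and `split_on_milestones` are hypothetical, and the sample markup is made up to imitate the tags described above.

```python
import re
import xml.etree.ElementTree as ET

# Made-up fragments that imitate the markup described above; not real Perseus text.
SAMPLE_PLAY = (
    '<div1 type="episode">'
    "<sp><speaker>Watchman</speaker><p>First speech, line one.</p><p>Line two.</p></sp>"
    "<sp><speaker>Chorus</speaker><p>Second speech.</p></sp>"
    "</div1>"
)
SAMPLE_POEM = (
    '<div1 type="Book" n="1">'
    '<milestone unit="card" n="1"/>Card one, line one.\nCard one, line two.\n'
    '<milestone unit="card" n="2"/>Card two, line one.\n'
    "</div1>"
)


def split_play(chapter_xml):
    """Treat each <sp> (speaker section) as one paragraph."""
    root = ET.fromstring(chapter_xml)
    paras = []
    for sp in root.iter("sp"):
        speaker = sp.findtext("speaker", default="").strip()
        # Join the text of every child except the <speaker> label itself.
        lines = [
            "".join(child.itertext()).strip()
            for child in sp
            if child.tag != "speaker"
        ]
        paras.append(f"{speaker}:\n" + "\n".join(lines))
    return paras


def split_on_milestones(chapter_xml):
    """Treat the text between consecutive <milestone/> tags as one paragraph."""
    # Assumption: each self-closing milestone becomes a blank line.
    text = re.sub(r"<milestone\b[^>]*/>", "\n\n", chapter_xml)
    # Drop any remaining tags, keeping only their text content.
    text = re.sub(r"<[^>]+>", "", text)
    # Split on blank lines and discard empty pieces.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


play_paras = split_play(SAMPLE_PLAY)
poem_paras = split_on_milestones(SAMPLE_POEM)
```

Prefixing each speech with the speaker's name and a colon mirrors the `Watchman:\n...` shape of the `para_str` values shown in the `DOC` table; whether the real parser builds the string this way is an assumption.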

In [2]:
root_dir = os.path.abspath('..')
data_dir = os.path.join(root_dir, 'data')
output_dir = os.path.join(data_dir, 'outputs')

In [3]:
docpaths = glob(os.path.join(data_dir, 'raw', '**', '*.xml'), recursive=True)

In [4]:
doc_list = []

for path in docpaths:
    doc = Document(path)
    doc_list.append(doc)

    # Split the work into paragraphs, then tokenize the result
    doc.parse_text_to_paras()
    doc.tokenize(remove_pos_tuple=True, remove_ws=True)

In [5]:
LIB, DOC, TOKEN = create_tables(doc_list)
VOCAB = extract_vocab(TOKEN)

In [10]:
DOC

| work | chapter | para | para_str |
|---|---|---|---|
| 0 | 0 | 0 | Watchman:\nRelease from this weary task of min... |
| 0 | 0 | 1 | Chorus:\nThis is now the tenth year since Pria... |
| 0 | 1 | 0 | Chorus:\nI have the power to proclaim the augu... |
| 0 | 1 | 1 | Chorus:\nThen the wise seer of the host, notic... |
| 0 | 1 | 2 | Chorus:\nAlthough, O Lovely One, you are so gr... |
| ... | ... | ... | ... |
| 18 | 11 | 25 | But haply in that place a sacred tree,\na bitt... |
| 18 | 11 | 26 | Meanwhile th' Olympian sovereign supreme\nto J... |
| 18 | 11 | 27 | After these things Jove gave his kingly mind\n... |
| 18 | 11 | 28 | Aeneas now is near; and waving wide\na spear l... |


In [6]:
LIB

| work | author | title | path |
|---|---|---|---|
| 0 | Aeschylus | Agamemnon | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 1 | Aeschylus | Eumenides | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 2 | Aeschylus | Libation Bearers | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 3 | Aeschylus | Prometheus Bound | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 4 | Euripides | Bacchae | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 5 | Euripides | Iphigenia in Aulis | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 6 | Euripides | The Trojan Women | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 7 | Hesiod | Theogony | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 8 | Hesiod | Works and Days | C:\Users\David\Documents\GitHub\latin-greek-te... |
| 9 | Homer | Iliad | C:\Users\David\Documents\GitHub\latin-greek-te... |


In [24]:
TOKEN

| work | chapter | para | sent | token | pos | token_str | term_str |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | NNP | Watchman: | watchman |
| 0 | 0 | 0 | 0 | 1 | NNP | Release | release |
| 0 | 0 | 0 | 0 | 2 | IN | from | from |
| 0 | 0 | 0 | 0 | 3 | DT | this | this |
| 0 | 0 | 0 | 0 | 4 | JJ | weary | weary |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 18 | 11 | 29 | 13 | 14 | NN | wrath | wrath |
| 18 | 11 | 29 | 13 | 15 | TO | to | to |
| 18 | 11 | 29 | 13 | 16 | VB | darkness | darkness |
| 18 | 11 | 29 | 13 | 17 | JJ | fled | fled |


In [25]:
VOCAB

| term_id | term_str | n |
|---|---|---|
| 0 | | 6141 |
| 1 | 0 | 48 |
| 2 | 0h | 1 |
| 3 | 1 | 4 |
| 4 | 10 | 1 |
| ... | ... | ... |
| 27661 | zodiacs | 1 |
| 27662 | zone | 4 |
| 27663 | zonecuriously | 2 |
| 27664 | zones | 1 |
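
Judging from the outputs above, `term_str` looks like a lowercased, punctuation-stripped form of `token_str`, and `VOCAB` looks like an alphabetically sorted count of those terms. The sketch below shows one way such a table could be built. It is a guess at the behavior, not the actual `extract_vocab` implementation; the helper names `normalize` and `build_vocab` and the toy token list are hypothetical.

```python
import re
from collections import Counter


def normalize(token_str):
    """Lowercase a token and drop every non-alphanumeric character (assumed rule)."""
    return re.sub(r"[\W_]+", "", token_str.lower())


def build_vocab(tokens):
    """Return (term_id, term_str, n) rows, sorted alphabetically by term."""
    counts = Counter(normalize(t) for t in tokens)
    return [(term_id, term, counts[term]) for term_id, term in enumerate(sorted(counts))]


# Toy tokens for illustration only.
tokens = ["Watchman:", "Release", "from", "this", "weary", "task", "--", "this", "zone--curiously"]
vocab = build_vocab(tokens)
```

Under this rule a token made up only of punctuation normalizes to the empty string, which would account for the blank `term_str` at `term_id` 0, and two words joined by punctuation collapse into a single term, which would account for entries such as `zonecuriously`.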
