# DS 5001 Project Notebook: Greek and Roman Mythology

- David Vann (dv6bq@virginia.edu)
- DS 5001
- 5 May 2021

In [1]:
import os
from glob import glob

import numpy as np
import pandas as pd
import nltk

from eta_modules.preprocessing import Document, Corpus

In [2]:
# Uncomment on first run to download the NLTK resources used below
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')

## Reading in the data

We start by loading the XML files for each work and parsing them to a reasonable degree with BeautifulSoup and NLTK.

Since these works are all either plays or poems/epics, the concept of a "chapter" or "paragraph" doesn't map onto them as cleanly as it would onto, e.g., a novel. However, the Perseus Digital Library (the source of these files) has added at least top-level divisions to break up each text. In some cases, these divisions truly exist in the work (for example, *The Iliad* is divided into 24 books); in other cases, like plays, they don't seem to be directly present in the text but are akin to something like a "scene". I treat all of these largest divisions as "chapters".
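
As a rough illustration of the "largest division as chapter" idea, here is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than BeautifulSoup. The `<div1>` tag name, the `n` attribute, the `extract_chapters` helper, and the sample markup are all assumptions made for illustration; they are not taken from `eta_modules` or verified against the actual files.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a Perseus file. The <div1> tag and its "type"/"n"
# attributes are assumptions about the markup, not confirmed by the notebook.
sample_xml = """
<text>
  <body>
    <div1 type="Book" n="1"><l>Sing, goddess, the anger of Achilles</l></div1>
    <div1 type="Book" n="2"><l>Now the other gods slept</l></div1>
  </body>
</text>
""".strip()


def extract_chapters(xml_text, tag="div1"):
    """Return (chapter_id, text) pairs, one per top-level division."""
    root = ET.fromstring(xml_text)
    chapters = []
    for position, div in enumerate(root.iter(tag), start=1):
        # Fall back to the division's position if it carries no "n" attribute
        chapter_id = div.get("n", str(position))
        # Gather all nested text and collapse runs of whitespace
        text = " ".join("".join(div.itertext()).split())
        chapters.append((chapter_id, text))
    return chapters


chapters = extract_chapters(sample_xml)
```

Each tuple pairs a division identifier with its flattened text, which is enough to populate the `chapter` level of an index.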

To get at something like a "paragraph", I used a different approach depending on whether the work is a play:

- For plays, I used each speaker section (marked by a "\<sp>" tag in the files) as a "paragraph".
- For everything else, there is no built-in tag for "paragraph"-type divisions, but there is a self-closing "milestone" tag that marks the start of a new "card", the unit the Perseus website uses for the content displayed on one page. Because these tags are self-closing, they don't enclose the block of text I wanted to get at; instead, I replaced them with newlines and split the text on double newlines, which gave fairly satisfactory results.
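
The two strategies above can be sketched roughly as follows. This uses only the standard library's `re` module instead of BeautifulSoup, so it illustrates the idea rather than the code in `eta_modules`. The helper names and sample markup are hypothetical, and replacing each milestone with a full blank line is an assumption about how the double newlines come about.

```python
import re


def clean(fragment):
    """Drop any remaining tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(no_tags.split())


def split_speeches(xml_text):
    """Plays: treat each <sp>...</sp> block as one 'paragraph'."""
    speeches = re.findall(r"<sp\b[^>]*>(.*?)</sp>", xml_text, flags=re.DOTALL)
    cleaned = (clean(speech) for speech in speeches)
    return [speech for speech in cleaned if speech]


def split_on_milestones(xml_text):
    """Other works: turn each self-closing milestone into a blank line,
    then split the text wherever a blank line occurs."""
    marked = re.sub(r"<milestone\b[^>]*/>", "\n\n", xml_text)
    chunks = re.split(r"\n\s*\n", marked)
    cleaned = (clean(chunk) for chunk in chunks)
    return [chunk for chunk in cleaned if chunk]


# Toy inputs (hypothetical markup, not copied from the corpus)
play = (
    "<sp><speaker>Oedipus</speaker><p>My children,</p></sp>"
    "<sp><speaker>Priest</speaker><p>Oedipus, ruler of my land,</p></sp>"
)
epic = (
    '<milestone unit="card" n="1"/>Sing, goddess,\n'
    'the anger of Achilles<milestone unit="card" n="33"/>So he spoke'
)

play_paras = split_speeches(play)
epic_paras = split_on_milestones(epic)
```

Single newlines inside a card are kept together because only blank lines act as split points.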

In [3]:
root_dir = os.path.abspath('..')
data_dir = os.path.join(root_dir, 'data')
output_dir = os.path.join(data_dir, 'outputs')

docpaths = glob(os.path.join(data_dir, 'raw', '**', '*.xml'), recursive=True)

OHCO = ['work', 'chapter', 'para', 'sent']

In [4]:
doc_list = []

for path in docpaths:
    doc = Document(path)
    doc_list.append(doc)

    # Split the work into "paragraphs", then tokenize them
    doc.parse_text_to_paras()
    doc.tokenize(remove_pos_tuple=True, remove_ws=True)

In [6]:
# Bag at the paragraph level: ['work', 'chapter', 'para']
paragraph_bag = OHCO[:3]

corp = Corpus(doc_list)
corp.extract_annotate_vocab()
corp.compute_tfidf(OHCO_level=paragraph_bag, methods=['n', 'max'])

ValueError: cannot join with no overlapping index names
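
The message above is raised by pandas when two `MultiIndex` objects are joined and share no level names. The notebook does not show where this happens inside `compute_tfidf`, so the following is only a sketch of one way to reproduce the same message and clear it; the table contents, level names, and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical tables whose MultiIndex levels carry different names
left = pd.DataFrame(
    {"n": [3, 1]},
    index=pd.MultiIndex.from_tuples(
        [("iliad", 1), ("iliad", 2)], names=["work", "chapter"]
    ),
)
right = pd.DataFrame(
    {"doc_freq": [2, 1]},
    index=pd.MultiIndex.from_tuples(
        [("iliad", 1), ("iliad", 2)], names=["book", "section"]
    ),
)

# With no level name in common, pandas refuses to align the two indexes
try:
    left.join(right)
    message = None
except ValueError as err:
    message = str(err)

# Giving the levels matching names lets the join proceed
right.index = right.index.set_names(["work", "chapter"])
joined = left.join(right)
```

Checking `.index.names` on each table involved in a join is a quick way to see whether level names were lost or renamed along the way.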

In [None]:
corp.token['term_str'].value_counts().to_frame('n').reset_index()

In [None]:
corp.vocab

In [None]:
# corp.save_tables(os.path.join(output_dir, 'corpus'))