In [None]:
%pip install -r requirements.txt

# Introduction

To begin, we're going to load a pandas DataFrame of Greek tragedy, with one row per line.

You might be wondering why we aren't using Pausanias as usual. Tragedy has some nice built-in features that let us get to the heart of some experiments more quickly: rather than having broad geographical labels, each line comes pre-labeled by `dramatist`, `play`, and `speaker`.

From the `speaker` label, we can derive information such as age, gender, social status, etc.

These labels thus provide a lot of categorical data essentially for free, giving us a number of variables with which to experiment.
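
As a rough sketch of how such a derivation might work, the cell below maps speaker names onto a gender label using a small lookup table. The toy DataFrame, the placeholder text, and the `speaker_gender` dictionary are all illustrative assumptions, not part of the real dataset or of `preprocess.py`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the text values are placeholders.
toy_df = pd.DataFrame(
    {
        "dramatist": ["Sophocles", "Sophocles", "Euripides"],
        "play": ["Antigone", "Antigone", "Medea"],
        "speaker": ["Antigone", "Creon", "Medea"],
        "text": ["line one", "line two", "line three"],
    }
)

# Hypothetical hand-compiled lookup table: speaker name -> gender.
speaker_gender = {"Antigone": "female", "Creon": "male", "Medea": "female"}

# Speakers missing from the lookup table become NaN instead of raising an error.
toy_df["gender"] = toy_df["speaker"].map(speaker_gender)

print(toy_df[["speaker", "gender"]])
```

The same pattern would extend to age or social status by adding further lookup tables, although compiling them for every speaker takes some manual work.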

I've gone ahead and pre-processed the Perseus XML for you, although you should take a look at [`preprocess.py`](./preprocess.py) when you have a chance so that you know what's happening.

Let's load up the dataframe and confirm that things look as expected.

In [None]:
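import pandas as pd

# Show the full text of each line instead of truncating long cells.
pd.set_option('display.max_colwidth', None)

df = pd.read_pickle("./greek-tragedy-by-line.pickle")

# Sanity check: display every row whose speaker label is "chorus", ignoring case.
df[df['speaker'].str.lower() == 'chorus']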

## TF-IDF

Before we get into the details of TF-IDF (a somewhat old-school method that still deserves attention), let's get a feel for what its results look like.

We'll divide the lines by dramatist and collapse all of the rows for each dramatist into a single very long string, giving us three strings in total. These strings are the **documents** that make up our **corpus**.

In [None]:
aeschylus_text = ' '.join(df[df['dramatist'] == 'Aeschylus']['text'].astype(str))
sophocles_text = ' '.join(df[df['dramatist'] == 'Sophocles']['text'].astype(str))
euripides_text = ' '.join(df[df['dramatist'] == 'Euripides']['text'].astype(str))

corpus = [aeschylus_text, sophocles_text, euripides_text]
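
If you would rather not write one line per dramatist, the same collapsing step can be sketched with `groupby`. This is an alternative under stated assumptions, not the approach used in the rest of the notebook: `toy_df` and its placeholder text stand in for the real DataFrame.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the text values are placeholders.
toy_df = pd.DataFrame(
    {
        "dramatist": ["Aeschylus", "Sophocles", "Aeschylus", "Euripides"],
        "text": ["first line", "second line", "third line", "fourth line"],
    }
)

# Group rows by dramatist and join each group's lines into one long string.
# sort=False keeps the dramatists in order of first appearance.
documents = toy_df.groupby("dramatist", sort=False)["text"].agg(
    lambda lines: " ".join(lines.astype(str))
)

toy_labels = documents.index.tolist()
toy_corpus = documents.tolist()

print(toy_labels)
print(toy_corpus)
```

Keeping the labels and the documents in the same Series makes it harder for the row labels to drift out of step with the corpus order later on.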


Then, we're going to use the `TfidfVectorizer` class from `scikit-learn` to calculate the TF-IDF scores for each term in the corpus, labeled by document.

In [None]:
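from sklearn.feature_extraction.text import TfidfVectorizer

# Drop scikit-learn's built-in list of common English words before scoring.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# One row per document, one column per term.
# The index labels must follow the same order as `corpus`.
tfidf_df = pd.DataFrame(
    X.toarray(),
    index=["Aeschylus", "Sophocles", "Euripides"],
    columns=vectorizer.get_feature_names_out(),
)

tfidf_df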

As you can see, we have only 3 rows but 12,121 columns, which is a bit unwieldy. Let's pick a selection of words to examine.

In [None]:
keywords = [
    "apollo",
    "death",
    "delphi",
    "divinity",
    "gods",
    "humankind",
    "humans",
    "life",
    "men",
    "sanctuary",
    "women",
    "zeus"
]

tfidf_df[keywords]

That's not so bad. But what do these numbers tell us?

> Discuss: How do you interpret the numbers above?

### Term Frequency (TF)

We have already worked with **term frequency** extensively in this course: it is the raw count of how often a given **term** appears. For TF-IDF, we measure it within each **document** rather than across the whole **corpus**.

What is a **term** in this case? It could be a word, a token, or a lemma, or it could be a lexico-grammatical or syntactic feature. We use **term** as a flexible catch-all for any countable feature of the corpus.
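
To make that flexibility concrete, here is a small sketch that counts two different kinds of term in the same string: single words and pairs of adjacent words (bigrams). The sentence is a placeholder, and splitting on whitespace is a simplifying assumption in place of a real tokenizer.

```python
from collections import Counter

# Placeholder document; any string would do.
document = "the gods see the deeds of men and the gods remember"

# Here a term is a whitespace-separated word.
tokens = document.lower().split()
word_tf = Counter(tokens)

# Here a term is a pair of adjacent words (a bigram).
bigram_tf = Counter(zip(tokens, tokens[1:]))

print(word_tf.most_common(3))
print(bigram_tf[("the", "gods")])
```

Only the definition of a term changes between the two counters; the counting itself stays the same.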

# Bibliography

Reades, Jonathan, and Jennie Williams. 2023. “Clustering and Visualising Documents Using Word Embeddings.” Programming Historian, August. https://programminghistorian.org/en/lessons/clustering-visualizing-word-embeddings.
