In [None]:
%pip install -r requirements.txt

# Introduction

To begin, we're going to initialize a pandas DataFrame of Greek tragedy by line.

You might be wondering why we aren't using Pausanias as usual. Tragedy has some nice built-in features that let us get to the heart of some experiments more quickly: rather than having broad geographical labels, each line comes pre-labeled by `dramatist`, `play`, and `speaker`.

From the `speaker` label, we can derive information such as age, gender, social status, etc.

These labels thus provide a lot of categorical data essentially for free, giving us a number of variables with which to experiment.

I've gone ahead and pre-processed the Perseus XML for you, although you should take a look at [`preprocess.py`](./preprocess.py) when you have a chance so that you know what's happening.

Let's load up the dataframe and confirm that things look as expected.

In [None]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

df = pd.read_pickle("./greek-tragedy-by-line.pickle")

df[df['speaker'].str.lower() == 'chorus']

## TF-IDF

Before we get into the details of TF-IDF --- a somewhat old-school method that still deserves attention --- let's get a feel for what its results look like.

We'll divide the lines by play and collapse all of the rows for each play into one very long string. These strings, one per play, are the **documents** that make up our **corpus**.

In [None]:
docs = df.groupby(['dramatist', 'title'])['text'].apply(' '.join).reset_index()
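
To see what that one-liner does, here is a minimal sketch on a made-up DataFrame. The rows, names, and text below are invented for illustration and are not taken from our dataset; only the column names match. Grouping by `dramatist` and `title` and joining the `text` column leaves one row, and one long string, per play.

```python
import pandas as pd

# Hypothetical toy data with the same column names as our DataFrame.
toy_df = pd.DataFrame(
    {
        "dramatist": ["A", "A", "A", "B"],
        "title": ["Play One", "Play One", "Play Two", "Play Three"],
        "text": ["first line", "second line", "another line", "a lone line"],
    }
)

# One row per (dramatist, title) pair; that pair's lines are joined with spaces.
toy_docs = (
    toy_df.groupby(["dramatist", "title"])["text"].apply(" ".join).reset_index()
)

print(toy_docs)
```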


Then, we're going to use the `TfidfVectorizer` class from `scikit-learn` to calculate the TF-IDF scores for each term in the corpus, labeled by document.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs['text'])

tfidf_df = pd.DataFrame(
    X.toarray(),
    index=docs['title'],
    columns=vectorizer.get_feature_names_out(),
)

tfidf_df

As you can see, we have only 31 rows but 12,121 columns -- a bit unwieldy. Let's pick a selection of words to examine.

In [None]:
keywords = [
    "apollo",
    "death", 
    "delphi",
    "divinity",
    "gods",
    "humankind",
    "humans",
    "life",
    "men",
    "sanctuary",
    "women",
    "zeus"
]

tfidf_df[keywords]

That's not so bad. But what do these numbers tell us?

> Discuss: How do you interpret the numbers above? What are some of the drawbacks of performing this analysis on translations?

### Term Frequency (TF)

We have already worked with **term frequency** extensively in this course: in its simplest form, it is the raw frequency of a given **term** in a **document**.

But we can intuitively see that this simple form will have a bias towards longer documents: the more terms in a document, the more chances there are for any given term to occur.

What is a **term** in this case? It could be a word, a token, a lemma, or n-grams thereof, or it could be a lexico-grammatical or syntactic feature. We use **term** as a flexible catch-all for any countable feature of the corpus.

There are various ways of normalizing TF depending on the needs of our corpus and analysis. A common normalization is simply to convert the raw count for a term into a relative count -- we've done this in nearly every class, taking the absolute frequency of term `t` and dividing it by the total number of terms in a document.

```math
tf(t, d) = \frac{f(t, d)}{\sum_{t' \in d}f(t', d)}
```

Alternatively, we can apply log normalization: 

```math
tf(t, d) = \log(1 + f(t, d))
```

Or we can even normalize according to the most frequently occurring term in the document:

```math
tf(t, d) = 0.5 + 0.5 \cdot \frac{f(t, d)}{\max_{t' \in d} f(t', d)}
```
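
The three normalizations can be sketched side by side on a single toy document. This is only an illustration under simple assumptions: the sentence is made up, tokenization is a plain whitespace split, and the function names are hypothetical.

```python
import math
from collections import Counter

# A made-up "document", tokenized by splitting on whitespace.
tokens = "the gods love the city and the gods hate the city".split()
counts = Counter(tokens)


def tf_relative(term):
    """Raw count divided by the total number of tokens in the document."""
    return counts[term] / len(tokens)


def tf_log(term):
    """Log-normalized count: log(1 + f(t, d))."""
    return math.log(1 + counts[term])


def tf_augmented(term):
    """Count scaled by the most frequent term's count, then mapped to [0.5, 1]."""
    return 0.5 + 0.5 * counts[term] / max(counts.values())


for term in ["the", "gods", "love"]:
    print(term, tf_relative(term), tf_log(term), tf_augmented(term))
```

Note that the third variant never drops below 0.5, even for a term that does not occur in the document at all.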

### (Inverse) Document Frequency (DF)

**Document frequency** is the number of documents containing the term divided by the total number of documents.

**Inverse document frequency**, then, is the logarithm of the inverse of that ratio, where $`N`$ is the total number of documents in the corpus $`D`$:

```math
idf(t, D) = \log{\frac{N}{|\{d : d \in D \text{ and } t \in d\}|}}
```
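
As a quick worked example, assume the natural logarithm and a corpus of 31 documents (the number of plays in our DataFrame). A term that appears in every document gets an IDF of zero, while a term that appears in only one document gets the largest possible value:

```math
\log{\frac{31}{31}} = 0 \qquad \log{\frac{31}{1}} \approx 3.43
```

Terms that are common across the whole corpus are therefore down-weighted, and rare terms are boosted.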

Like $`tf(t, d)`$, $`idf(t, D)`$ can be weighted in various ways.

> Discuss: For example, how might we account for a term that isn't in the corpus?

See https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting for notes on Scikit's implementation.

### Putting it all together

TF-IDF, then, is just the product of these two calculations.

```math
tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)
```
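
For instance, suppose (hypothetically) that a term accounts for 3 of the 100 terms in one play and appears in 2 of the 31 plays. Using relative frequency for TF and the natural logarithm for IDF:

```math
\frac{3}{100} \cdot \log{\frac{31}{2}} \approx 0.03 \cdot 2.74 \approx 0.082
```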

# Exercises

1. Write your own TF and IDF functions using techniques we have covered in the course so far.
2. Using the "documents" that we passed to the `TfidfVectorizer` above, calculate your own scores for the English translations of the tragedies.
3. Compare them to the scores from scikit-learn -- what might account for similarities and differences? (You can consult the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

## Using TF-IDF for document clustering

So what can we use TF-IDF for? In addition to highlighting important terms for a given document, it can also be used for clustering documents and analyzing their overlap.

We're going to follow [this tutorial](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py), which classifies text documents using "sparse features," with modifications for our own (much smaller) dataset.

"Sparse features" refers to how we represent each document here: as a vector of TF-IDF scores with one entry per term in the corpus vocabulary. Because any single document uses only a fraction of that vocabulary, most of the entries are zero. This contrasts with a dense representation (e.g., word embeddings), in which nearly every entry is non-zero.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

data_train = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=(),
)


In [None]:
from time import time
from sklearn.model_selection import train_test_split


def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6


def load_dataset(verbose=False):
    data_train, data_test = train_test_split(docs, test_size=0.4)

    target_names = data_train.dramatist.unique()

    # split target in a training set and a test set
    y_train, y_test = data_train.dramatist, data_test.dramatist

    # Extracting features from the training data using a sparse vectorizer
    t0 = time()
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train = vectorizer.fit_transform(data_train['text'])
    duration_train = time() - t0

    # Extracting features from the test data using the same vectorizer
    t0 = time()
    X_test = vectorizer.transform(data_test['text'])
    duration_test = time() - t0

    feature_names = vectorizer.get_feature_names_out()

    if verbose:
        # compute size of loaded data
        data_train_size_mb = size_mb(data_train['text'])
        data_test_size_mb = size_mb(data_test['text'])

        print(
            f"{len(data_train['text'])} documents - "
            f"{data_train_size_mb:.2f}MB (training set)"
        )
        print(f"{len(data_test['text'])} documents - {data_test_size_mb:.2f}MB (test set)")
        print(f"{len(target_names)} categories")
        print(
            f"vectorize training done in {duration_train:.3f}s "
            f"at {data_train_size_mb / duration_train:.3f}MB/s"
        )
        print(f"n_samples: {X_train.shape[0]}, n_features: {X_train.shape[1]}")
        print(
            f"vectorize testing done in {duration_test:.3f}s "
            f"at {data_test_size_mb / duration_test:.3f}MB/s"
        )
        print(f"n_samples: {X_test.shape[0]}, n_features: {X_test.shape[1]}")

    return X_train, X_test, y_train, y_test, feature_names, target_names

In [None]:
X_train, X_test, y_train, y_test, feature_names, target_names = load_dataset(
    verbose=True
)

In [None]:
from sklearn.linear_model import RidgeClassifier

clf = RidgeClassifier(tol=1e-2, solver="sparse_cg")
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

In [None]:
import matplotlib.pyplot as plt

from sklearn.metrics import ConfusionMatrixDisplay

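fig, ax = plt.subplots(figsize=(10, 5))
# `from_predictions` already labels both axes with the sorted class names found
# in `y_test` and `pred`. We therefore don't overwrite the tick labels with
# `target_names`, whose order (order of first appearance) may differ.
ConfusionMatrixDisplay.from_predictions(y_test, pred, ax=ax)
_ = ax.set_title(
    f"Confusion Matrix for {clf.__class__.__name__}\non the original documents"
)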

Try re-running the above analyses a few times. What do you notice? What does this tell us about our data?

# Exercises

1. Rewrite the `load_dataset` function to train on play titles rather than dramatists. (Hint: You will need to rerun the `.groupby` operation above so that `title` is the left-most column in the DataFrame.)
2. Rerun and analyze the results.

# Bibliography

Reades, Jonathan, and Jennie Williams. 2023. “Clustering and Visualising Documents Using Word Embeddings.” Programming Historian, August. https://programminghistorian.org/en/lessons/clustering-visualizing-word-embeddings.
