# Group Comparison: t-test & Cohen's d
Brezina ch 6, p 186-197

## t-test
The t-test compares the means of two groups. It weighs the difference between the group means against the internal *variation* within each group. To understand the t-test, we therefore first need to understand variance. **Variance** is a statistical measure of how widely the values in a dataset spread around their mean:
$$
\text{Variance} = \frac{\text{sum of squared distances from the mean}}{\text{degrees of freedom}}
$$
  
&nbsp;    
&nbsp;      
In mathematical notation, variance is:
$$
S^2 = \frac{\sum{(x_i - \bar{x})^2}}{n-1} 
$$
$S^2$ = sample variance  
$x_i$ = the value of one observation  
$\bar{x}$ = the mean value of all observations  
$n$ = the number of observations  
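
To make the formula concrete, here is a minimal sketch that computes the sample variance step by step and checks it against Python's built-in `statistics.variance`. The observations are made up for illustration, and the helper name `sample_variance` is our own.

```python
import statistics

def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

# Made-up observations, for illustration only
observations = [2, 4, 4, 4, 5, 5, 7, 9]

by_hand = sample_variance(observations)
from_library = statistics.variance(observations)

print(by_hand, from_library)
```

Dividing by $n-1$ rather than $n$ is what makes this the *sample* variance; `statistics.variance` makes the same choice, so the two numbers should agree.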

&nbsp;
&nbsp;

The **t-test** formula is as follows:

$$
t = \frac{\text{Mean of group 1} - \text{Mean of group 2}}{\sqrt{\frac{\text{Variance of group 1}}{\text{Number of cases in group 1}} + \frac{\text{Variance of group 2}}{\text{Number of cases in group 2}}}}
$$

In [None]:
%pip install scipy
%pip install pingouin
%pip install lxml

In [12]:
from lxml import etree
from scipy.stats import ttest_ind

files = [
    "./data/poetry/tlg0012.tlg001.perseus-grc1.tb.xml", #iliad
    "./data/poetry/tlg0012.tlg002.perseus-grc1.tb.xml" #odyssey
]

def get_rel_freqs_per_book(f, pos, num_books=24):
    tree = etree.parse(f)
    pos_vals=[0 for i in range(num_books)]
    total_vals=[0 for i in range(num_books)]
    rel_vals=[0 for i in range(num_books)]

    for l in tree.iterfind(".//sentence"):
        for element in l.findall(".//word"):
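            if l.get("subdoc"):
                # Caution: [0] keeps only the first character of subdoc, so a
                # two-digit book number such as "12" is counted under book 1.
                book_num = int(l.get("subdoc")[0])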
                if element.get("postag", " ")[0] == pos:
                    pos_vals[book_num] += 1 
                    total_vals[book_num] += 1
                elif element.get("postag", " ")[0]:
                    total_vals[book_num] += 1

    for i in range(num_books):
        try:
            rel_vals[i] = pos_vals[i]/total_vals[i]
        except ZeroDivisionError:
            rel_vals[i] = 0

    return rel_vals

def count_pos(words, pos: str):
    return len([w for w in words if w.get("postag", " ")[0] == pos])

iliad_freq = get_rel_freqs_per_book(files[0], "a")
odyssey_freq = get_rel_freqs_per_book(files[1], "a")

# keep indices 1-9 only (index 0 is never filled)
iliad_freq = iliad_freq[1:10]
odyssey_freq = odyssey_freq[1:10]

print(iliad_freq)
print(odyssey_freq)

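# ttest_ind pools the two variances by default (equal_var=True);
# pass equal_var=False for the unequal-variance (Welch) form shown above
t_stat, p_value = ttest_ind(iliad_freq, odyssey_freq)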

print(f"T-statistic: {t_stat}, P-value: {p_value}")


[0.11734474667019203, 0.11428004422646217, 0.133139227104288, 0.12105616469008727, 0.12001605565962002, 0.12146957520091849, 0.12077660110943016, 0.11168054665812513, 0.12669683257918551]
[0.10786882642528188, 0.10967354982435096, 0.11988911988911989, 0.11278803395768765, 0.1081267217630854, 0.10293621329733378, 0.11324570273003033, 0.11416241663397411, 0.1161764705882353]
T-statistic: 3.339001894964257, P-value: 0.004162703662638815
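
As a check on the formula, the sketch below recomputes *t* by hand from the two lists printed above, using only the standard library. The helper name `t_statistic` is our own. Because both groups contain nine values, the formula given earlier and the pooled-variance test that `ttest_ind` runs by default yield the same *t*, so the result should match the T-statistic printed above.

```python
import math
import statistics

# Values copied from the output printed above
iliad = [0.11734474667019203, 0.11428004422646217, 0.133139227104288,
         0.12105616469008727, 0.12001605565962002, 0.12146957520091849,
         0.12077660110943016, 0.11168054665812513, 0.12669683257918551]
odyssey = [0.10786882642528188, 0.10967354982435096, 0.11988911988911989,
           0.11278803395768765, 0.1081267217630854, 0.10293621329733378,
           0.11324570273003033, 0.11416241663397411, 0.1161764705882353]

def t_statistic(group1, group2):
    """Difference in means divided by the standard error of that difference."""
    standard_error = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return (statistics.mean(group1) - statistics.mean(group2)) / standard_error

t = t_statistic(iliad, odyssey)
print(f"t by hand: {t:.3f}")
```

The p-value is not reproduced here, since it requires the t-distribution, which the standard library does not provide.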


## Cohen's *d*
> "In addition to the statistical test, we also need to calculate an effect size measure to evaluate in standardized terms (i.e. units comparable across linguistic variables and corpora) the size of the difference between the two groups." [@Brezina2018]  

You might recall that we already talked about Cohen's *d* when we discussed correlation measures. It is the same metric; here we are just applying it in a different context. Brezina describes Cohen's *d* as "the difference between the two means expressed in standard deviation units".

$$
\text{Cohen's } d = \frac{\text{Mean of group 1} - \text{Mean of group 2}}{\text{pooled } SD}
$$

Interpretation of *d*: *d* > 0.3 small, *d* > 0.5 medium, *d* > 0.8 large effect
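
The formula leaves "pooled SD" undefined. A minimal sketch follows, assuming the common definition in which each group's sample variance is weighted by its degrees of freedom. That definition is our assumption rather than a quotation from Brezina, and the helper name `cohens_d_by_hand` is our own. The two lists are copied from the output of the t-test cell earlier in this section.

```python
import math
import statistics

# Values copied from the output of the t-test cell
iliad = [0.11734474667019203, 0.11428004422646217, 0.133139227104288,
         0.12105616469008727, 0.12001605565962002, 0.12146957520091849,
         0.12077660110943016, 0.11168054665812513, 0.12669683257918551]
odyssey = [0.10786882642528188, 0.10967354982435096, 0.11988911988911989,
           0.11278803395768765, 0.1081267217630854, 0.10293621329733378,
           0.11324570273003033, 0.11416241663397411, 0.1161764705882353]

def cohens_d_by_hand(group1, group2):
    """Difference in means expressed in pooled standard deviation units."""
    n1, n2 = len(group1), len(group2)
    # Assumed pooled SD: variances weighted by degrees of freedom
    pooled_variance = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    return (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(pooled_variance)

d = cohens_d_by_hand(iliad, odyssey)
print(f"Cohen's d by hand: {d:.3f}")
```

A value above 0.8 counts as a large effect on the scale given above.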

In [14]:
import pingouin as pg

cohens_d = pg.compute_effsize(
    iliad_freq, 
    odyssey_freq, 
    eftype='cohen'
)

print(f"Cohen's d: {cohens_d}")

result = pg.compute_esci(
    cohens_d,
    len(iliad_freq),
    len(odyssey_freq),
    paired=False, 
    eftype='cohen',
    confidence=0.95  # 95% confidence interval
)

print(result)

Cohen's d: 1.5740205882159723
[0.43 2.72]


# Text Classification

Text classification is a machine learning task that assigns a piece of text to one of a set of predefined categories. We saw a version of text classification in Week 6, when we looked at Greek dramas and used TF-IDF data to assign a text to a dramatist. Sentiment analysis, which was part of our journal club, is also a form of text classification.

This week, we will use BERT to attempt to classify text as prose or poetry.

## Corpus Selection

We're going to use the [Perseus Treebank](https://perseusdl.github.io/treebank_data/) data as our initial corpus. 

> Discuss: What can you observe about our corpus? What are the potential issues or advantages with using this corpus?

The data has already been preprocessed using treebank_preprocess.py. This is similar to some of the preprocessing we've seen in the past.

In [None]:

import pandas as pd

pd.set_option('display.max_colwidth', None)

df_all = pd.read_pickle("./corpus.pickle")
df_all

In [None]:
%pip install scikit-learn
%pip install torch
%pip install matplotlib
%pip install seaborn

## Exercises
Now that we've preprocessed the text, we should do some exploration to form descriptive statistics. 
1. Considering your initial observations about the corpus, explore the corpus both quantitatively and qualitatively.
2. Use data visualization to demonstrate what you've found.

In [None]:
# your code goes here
import matplotlib.pyplot as plt
import seaborn as sns



## BERT
For this exercise, we're using [this](https://huggingface.co/bowphs/GreBerta) BERT model, pretrained on Ancient Greek. BERT stands for Bidirectional Encoder Representations from Transformers. "Bidirectional" means that, when the model builds its representation of a word, it draws on the context both to the left and to the right of that word at the same time. This is useful for English, and particularly useful for Ancient Greek, where word order is more flexible. "Encoder Representations from Transformers" means that BERT uses the encoder half of the Transformer architecture to turn the input text into numerical representations.
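
To illustrate what "bidirectional" means in practice, the toy sketch below lists which positions each word can draw on under a bidirectional model versus a strictly left-to-right one. It is a simplification for illustration only: real models work on subword tokens and learn how much weight to give each position, and the helper name `visible_positions` is our own.

```python
def visible_positions(num_tokens, bidirectional):
    """For each token position, list the positions it is allowed to look at."""
    visible = []
    for i in range(num_tokens):
        if bidirectional:
            visible.append(list(range(num_tokens)))  # context on both sides
        else:
            visible.append(list(range(i + 1)))  # only itself and earlier tokens
    return visible

tokens = ["μῆνιν", "ἄειδε", "θεὰ"]  # opening words of the Iliad

for name, flag in [("bidirectional", True), ("left-to-right", False)]:
    print(name)
    for token, seen in zip(tokens, visible_positions(len(tokens), flag)):
        print(f"  {token} sees {[tokens[j] for j in seen]}")
```

In the bidirectional case, the first word already has access to the words that follow it, which matters when a noun and the words that agree with it are far apart.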

### Training BERT
You can view the code for training BERT at bert.py.  
In class, we are going to do our processing in Google Colab so that we can leverage the extra computational resources. 
If you're running the code on your own, you can copy and paste bert.py into Colab. Be sure to select Runtime -> Change runtime type -> GPU, and do not forget to upload your pickled data to Colab.

Since training takes a while, we've already provided a trained model for you. Unfortunately, the file is too large for GitHub, but you can access it [here](https://tufts.box.com/s/k44jjmvklnfkm5g30dbkpqvn9wxuxth9). Once you have downloaded it, you can try it out on whatever text you want using the code below.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import os

def predict_text(texts, model, tokenizer, device):
    model.eval()
    with torch.no_grad():
        encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=512,
            return_tensors='pt'
        ).to(device)

        outputs = model(**encodings)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        return predictions.cpu().numpy()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # uses the GPU when one is available, otherwise the CPU

# load model and tokenizer
model_name = "bowphs/GreBerta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
).to(device)

# load saved model if available
model_path = 'best_model.pt' #you may need to change the path depending on what folder you have best_model.pt in
if os.path.exists(model_path):
    print(f"Loading saved model from {model_path}")
    model.load_state_dict(torch.load(model_path, map_location=device))

    #your Greek text goes here!
    new_texts = ["ποικιλόθρον᾿ ἀθανάτ᾿ Αφρόδιτα, παῖ Δίος δολόπλοκε, λίσσομαί σε,μή μ᾿ ἄσαισι μηδ᾿ ὀνίαισι δάμνα, πότνια, θῦμον,"]

    predictions = predict_text(new_texts, model, tokenizer, device)

    for text, pred in zip(new_texts, predictions):
        prose_prob, poetry_prob = pred
        predicted_class = "Poetry" if poetry_prob > 0.5 else "Prose"
        print(f"Text: {text[:50]}...")
        print(f"Prediction: {predicted_class}")
        print(f"Poetry probability: {poetry_prob:.2%}")
        print(f"Prose probability: {prose_prob:.2%}\n")
else:
    print("No saved model found.")
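
The `softmax` call in `predict_text` is what turns the model's raw scores (logits) into the two probabilities printed above. The sketch below shows that step in plain Python. The logits are hypothetical numbers chosen for illustration, not output from the model, and the order (prose, poetry) follows the unpacking used in the cell above.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one text, in the order (prose, poetry)
logits = [-1.2, 2.3]

prose_prob, poetry_prob = softmax(logits)
predicted_class = "Poetry" if poetry_prob > 0.5 else "Prose"

print(f"{predicted_class}: poetry {poetry_prob:.2%}, prose {prose_prob:.2%}")
```

With only two classes, the two probabilities always sum to 1, so checking whether the poetry probability exceeds 0.5 is the same as picking the larger logit.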

## Exercises
1. Using the resources provided in latin-text-classification, train a Latin BERT model to identify poetry vs prose. 
2. Using either Latin or Greek, test the BERT model on prose and/or poetry that the model was NOT trained on. If you know Ancient Greek or Latin, choose a passage that you think might stump the model. What do the results tell you about the effectiveness of our classification model?