<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/2.compare/Log_odds_ratio_TODO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/2.compare/Log_odds_ratio_TODO.ipynb)

# Log odds-ratio

The log odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.
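
The third property is easiest to see with numbers. The sketch below is illustrative only: it applies the z-score formula given in Part 3 (with the same $\alpha_w = 0.01$) to a single word, and the helper name `zscore_for_word` and all counts are invented for the example. Both toy words occur three times as often in corpus $i$ as in corpus $j$, but the one backed by more observations receives the larger score.

```python
import math

def zscore_for_word(y_i, y_j, n_i, n_j, vocab_size, alpha_w=0.01):
    """Log-odds z-score for one word under a flat (uninformative) Dirichlet prior."""
    alpha_0 = vocab_size * alpha_w  # total pseudocount mass
    log_odds_i = math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
    log_odds_j = math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w))
    delta = log_odds_i - log_odds_j
    variance = 1 / (y_i + alpha_w) + 1 / (y_j + alpha_w)
    return delta / math.sqrt(variance)

# Toy counts, invented for illustration: both words have a 3:1 ratio.
common = zscore_for_word(y_i=60, y_j=20, n_i=1000, n_j=1000, vocab_size=500)
rare = zscore_for_word(y_i=3, y_j=1, n_i=1000, n_j=1000, vocab_size=500)
print(f"common word: {common:.2f}, rare word: {rare:.2f}")
```

The raw log-odds differences of the two words are close, so it is the variance term that separates them: a ratio estimated from a handful of occurrences is treated as less reliable.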

In this homework you will implement this ratio for two datasets of your choice to characterize the words that distinguish each one from the other.

## Part 1

Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies.  Be creative -- this should be driven by what interests you and should reflect your own originality. **This dataset cannot come from Kaggle**.  Feel free to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or manually copying/pasting text.  Aim for more than 10,000 tokens for each dataset.
   
Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt"

**Describe each of those datasets and their source in 100-200 words.**

I chose the transcripts of the movies "Oppenheimer" (dataset 1) and "Barbie" (dataset 2). I wanted to see whether the themes of each movie would be visible in the language its characters use. Comparisons between the two films were popular when they launched, since they were released around the same time and had almost polar opposite plots. I took the data from [scrapsfromtheloft.com](https://scrapsfromtheloft.com/), a website that hosts transcripts of various movies, and manually copied and pasted the transcripts of "Oppenheimer" and "Barbie" into their own respective files.



## Part 2

Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.

In [55]:
# make sure dependencies are installed
!pip install nltk



In [56]:
import nltk # using nltk for tokenization
nltk.download('punkt_tab')
from nltk import word_tokenize

def read_and_tokenize(filename: str) -> list[str]:
    # read the whole file, closing the handle once done
    with open(filename, encoding="utf-8") as f:
        document = f.read()
    return word_tokenize(document)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [57]:
# change these file paths to wherever the datasets you created above live.
!wget --no-check-certificate https://raw.githubusercontent.com/GoldPapaya/info256-applied-nlp/main/data/class1_dataset.txt
!wget --no-check-certificate https://raw.githubusercontent.com/GoldPapaya/info256-applied-nlp/main/data/class2_dataset.txt

class1_tokens = read_and_tokenize("class1_dataset.txt")
class2_tokens = read_and_tokenize("class2_dataset.txt")

print(class1_tokens[:10], len(class1_tokens), len(set(class1_tokens))) # Oppenheimer print output: sample set of tokens, # of tokens, # of unique tokens
print(class2_tokens[:10], len(class2_tokens), len(set(class2_tokens))) # Barbie print output: sample set of tokens, # of tokens, # of unique tokens

--2025-09-05 20:30:48--  https://raw.githubusercontent.com/GoldPapaya/info256-applied-nlp/main/data/class1_dataset.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112470 (110K) [text/plain]
Saving to: ‘class1_dataset.txt.5’


2025-09-05 20:30:48 (4.34 MB/s) - ‘class1_dataset.txt.5’ saved [112470/112470]

--2025-09-05 20:30:49--  https://raw.githubusercontent.com/GoldPapaya/info256-applied-nlp/main/data/class2_dataset.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86591 (85K) [text/plain]
Saving to: ‘class2_dataset.txt.5’


## Part 3

Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior. This value, $\widehat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\widehat{\zeta}_w^{(i-j)}= {\widehat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\widehat{d}_w^{(i-j)}\right)}}
$$

Where:

$$
\widehat{d}_w^{(i-j)} = \log \left( {y_w^i + \alpha_w \over n^i + \alpha_0 - y_w^i - \alpha_w} \right) - \log \left( {y_w^j + \alpha_w \over n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\widehat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V \cdot \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)

In this example, the two corpora are your class1 dataset (e.g., $i$ = your class1) and your class2 dataset (e.g., $j$ = class2). Using this metric, print out the 25 words most strongly aligned with class1, and 25 words most strongly aligned with class2.  Again, consult [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) for more detail.

In [58]:
from collections import Counter
import math

def logodds_with_uninformative_prior(tokens_i: list[str], tokens_j: list[str], display=25):
    n_i = len(tokens_i) # number of tokens in i
    n_j = len(tokens_j) # number of tokens in j
    counter_i = Counter(tokens_i) # freq dict of i
    counter_j = Counter(tokens_j) # freq dict of j
    vocabulary = set(tokens_i).union(set(tokens_j)) # union of all types between i and j
    V = len(vocabulary) # size of vocab
    a_w = 0.01 # same pseudocount for every word (uninformative prior)
    a_0 = V * a_w # total pseudocount mass

    z_scores = {}
    for word in vocabulary:
        y_i_w = counter_i.get(word, 0) # get word frequency from i corpus
        y_j_w = counter_j.get(word, 0) # get word frequency from j corpus

        # Compute the difference in log-odds between the two corpora
        log_odds = math.log((y_i_w + a_w)/(n_i + a_0 - y_i_w - a_w)) - math.log((y_j_w + a_w)/(n_j + a_0 - y_j_w - a_w))

        # Compute variance
        variance = (1/(y_i_w + a_w)) + (1/(y_j_w + a_w))

        # Compute z-score
        z = log_odds / math.sqrt(variance)
        z_scores[word] = z

    sorted_tokens = sorted(z_scores.items(), key=lambda x: x[1], reverse=True) # sort words by their respective z score

    # most positive z-scores: words aligned with corpus i
    print(f"Top {display} words for class1_dataset")
    count = 0
    for word, z in sorted_tokens:
        if z > 0:
            print(f"{word}: {z:.2f}")
            count += 1
            if count >= display:
                break

    # most negative z-scores: words aligned with corpus j
    print(f"\nTop {display} words for class2_dataset")
    count = 0
    for word, z in sorted_tokens[::-1]:  # Reverse for negative z-scores
        if z < 0:
            print(f"{word}: {z:.2f}")
            count += 1
            if count >= display:
                break

In [59]:
logodds_with_uninformative_prior(class1_tokens, class2_tokens)

Top 25 words for class1_dataset
the: 10.06
of: 6.77
he: 6.29
?: 5.96
to: 5.59
was: 5.39
him: 4.97
they: 4.64
would: 4.58
Well: 4.50
in: 4.01
a: 3.99
Robert: 3.95
He: 3.91
his: 3.86
not: 3.71
,: 3.62
as: 3.60
A: 3.53
security: 3.44
d: 3.43
did: 3.43
Los: 3.42
from: 3.39
years: 3.31

Top 25 words for class2_dataset
!: -10.49
]: -8.47
[: -8.47
Oh: -7.29
Okay: -5.40
Yeah: -5.29
Hi: -5.24
so: -5.14
just: -5.11
na: -4.66
I: -4.65
her: -4.45
m: -4.21
music: -4.18
playing: -4.12
night: -4.12
And: -3.98
love: -3.96
like: -3.78
go: -3.57
She: -3.54
World: -3.46
got: -3.43
she: -3.40
Hey: -3.40
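
Both lists above include punctuation tokens (`!`, `[`, `]`, `?`) and capitalisation variants (`he` and `He`, `she` and `She`). One option, which is not part of the assignment, is to normalise the tokens before scoring. The sketch below assumes a hypothetical helper named `normalize_tokens` and uses only the ASCII punctuation listed in `string.punctuation`, so curly quotes or other Unicode punctuation would pass through untouched.

```python
import string

def normalize_tokens(tokens):
    """Lower-case tokens and drop those made up only of ASCII punctuation."""
    punctuation = set(string.punctuation)
    return [
        tok.lower()
        for tok in tokens
        if not all(ch in punctuation for ch in tok)
    ]

# Small invented sample in the style of the transcripts
sample = ["He", "said", ",", "Oh", "!", "[", "music", "playing", "]", "he"]
print(normalize_tokens(sample))
```

Passing `normalize_tokens(class1_tokens)` and `normalize_tokens(class2_tokens)` to `logodds_with_uninformative_prior` would merge case variants into a single count per word. Whether that helps depends on the question being asked: for these transcripts, stage directions in brackets and exclamation marks may themselves be a meaningful stylistic signal.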


To check your work, you can run log-odds on the party platforms from the lab section. With `nltk.word_tokenize` _before_ lower-casing, these should be your top 5 words (and scores, roughly). Depending on your tokenization strategy, your scores might be slightly different.

**Democrat**:
```
president:	4.75
biden:	4.27
to:	4.11
he:	4.09
has:	4.08
```
**Republican**
```
republicans:	-13.45
our:	-11.23
will:	-10.88
american:	-10.01
restore:	-7.97
```

In [60]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_democrat_party_platform.txt
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_republican_party_platform.txt

--2025-09-05 20:30:50--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_democrat_party_platform.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 283046 (276K) [text/plain]
Saving to: ‘2024_democrat_party_platform.txt.5’


2025-09-05 20:30:50 (7.61 MB/s) - ‘2024_democrat_party_platform.txt.5’ saved [283046/283046]

--2025-09-05 20:30:50--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_republican_party_platform.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35319 (34K) [text/plain]
Saving t

In [61]:
import nltk
logodds_with_uninformative_prior(
    [w.lower() for w in nltk.word_tokenize(open("2024_democrat_party_platform.txt").read())],
    [w.lower() for w in nltk.word_tokenize(open("2024_republican_party_platform.txt").read())]
)

Top 25 words for class1_dataset
president: 4.75
biden: 4.27
to: 4.11
he: 4.09
has: 4.08
more: 3.76
democrats: 3.39
for: 3.14
also: 3.13
administration: 3.06
's: 3.06
his: 2.93
a: 2.90
is: 2.82
$: 2.77
in: 2.56
than: 2.51
care: 2.51
communities: 2.49
as: 2.30
continue: 2.28
working: 2.26
work: 2.26
americans: 2.25
year: 2.23

Top 25 words for class2_dataset
republicans: -13.45
our: -11.23
will: -10.88
american: -10.01
restore: -7.97
great: -7.11
illegal: -6.32
republican: -6.07
policies: -5.91
:: -5.80
stop: -5.79
again: -5.56
we: -5.55
inflation: -5.55
4: -5.51
must: -5.44
1: -5.32
party: -5.30
3: -5.28
common: -5.18
bring: -5.15
peace: -5.14
5: -5.12
education: -5.11
commitment: -4.94
