The log-odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.

In this homework you will implement this ratio for a dataset of your choice to characterize the words that differentiate each one.

Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies.  Be creative -- this should be driven by what interests you and should reflect your own originality. **This dataset cannot come from Kaggle**.  Feel feel to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or manually copying/pasting text.  Aim for more than 10,000 tokens for each dataset. 
   
Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt" 

Q1. Describe each of those datasets and their source in 100-200 words.

The two dataset I used are both transcripts or subtitles from Youtube videos. Because txt files are not supported in uploading of homework, the api and script used to download the transcripts are kept below.

The first dataset is the combined transcripts of OverSimplified's two videos of American Civil War (part I and part II). This Youtuber has so far posted only 29 videos in the past 7 years but gained 7.8 million subscribers, making him an oddly popular content creater under the history video topic. His videos tend to be amusing. There are always some funny details around the history event in his videos. (With the total duration of 53 minutes, as 30 & 23, the two videos together produce 9721 tokens before removing stopwords)

The second dataset is the transcript of a single long video of almost 90 minutes from Youtuber WarsofTheWorld. His video narration tends to be traditional like the ones on TV shows. As of now, he has 291 thousands of subscribers, which is a decent number but not comparable to OverSimplified. His video about American Civil War is relatively detailed and historically accurate about the events in this historic era.  (With the total duration of 90 minutes, the video produces 14281 tokens before removing stopwords)

Considering the difference in their narration style (and neglecting the animations and sound effects used in the video for this study), I would like to know specificially what are their preferences of word choices.

In [1]:
from youtube_transcript_api import YouTubeTranscriptApi
import nltk
from nltk.tokenize import word_tokenize
from string import punctuation

from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

from nltk.stem import PorterStemmer
porter = PorterStemmer()

OS_text_list_1 = YouTubeTranscriptApi.get_transcript("tsxmyL7TUJg&t=1s")
OS_text_list_2 = YouTubeTranscriptApi.get_transcript("sV6uuMAnJUE&t=410s")
OS_text_list = OS_text_list_1 + OS_text_list_2

WTW_text_list = YouTubeTranscriptApi.get_transcript("38KcVu5DkhA")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\linzh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Q2. Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.

In [2]:
def read_and_tokenize(filename):
    
    lower_str = ' '.join([filename[i]['text'] for i in range(len(filename))]).lower()
    no_punc_str = ''.join([character for character in lower_str if character not in punctuation])
    no_digit_str = ''.join([character for character in no_punc_str if not character.isdigit()])
    
    orig_tokens = word_tokenize(no_digit_str)
    
    # tokens = [word for word in orig_tokens if not word in stop_words]     # remove stopwords only
    tokens = [porter.stem(word) for word in orig_tokens if not word in stop_words]     # remove stopwords & remove stemmer
    
    return tokens

In [3]:
# change these file paths to wherever the datasets you created above live.
class1_tokens=read_and_tokenize(OS_text_list)
class2_tokens=read_and_tokenize(WTW_text_list)

In [4]:
len(class1_tokens), len(class2_tokens)

(5387, 8048)

Q3.  Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior.  This value, $\hat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\hat\zeta_w^{(i-j)}= {\hat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\hat{d}_w^{(i-j)}\right)}}
$$

Where: 

$$
\hat{d}_w^{(i-j)} = \log \left({y_w^i + \alpha_w} \over {n^i + \alpha_0 - y_w^i - \alpha_w}) \right) -  \log \left({y_w^j + \alpha_w} \over {n^j + \alpha_0 - y_w^j - \alpha_w}) \right)
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V * \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)

In this example, the two corpora are your class1 dataset (e.g., $i$ = your class1) and your class2 dataset (e.g., $j$ = class2). Using this metric, print out the 25 words most strongly aligned with class1, and 25 words most strongly aligned with class2.  Again, consult [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) for more detail.

In [5]:
from nltk.probability import FreqDist
import math

def find_freq_tokens(tokens, display_limit):
    
    freq_dict = FreqDist(tokens)
    bigram_freq = dict()
    
    sorted_items = sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)
    top_keys = [item[0] for item in sorted_items[:display_limit]]
    
    return top_keys

In [6]:
def difference_in_usage(word, one_tokens, two_tokens):
    
    freq_dict_1 = FreqDist(one_tokens)
    freq_dict_2 = FreqDist(two_tokens)
    
    # prepare variables for the equation
    y_i_w = freq_dict_1[word]
    y_j_w = freq_dict_2[word]
    alpha_w = 0.01
    n_i = len(one_tokens)
    n_j = len(one_tokens)
    alpha_0 = len(freq_dict_1 + freq_dict_2) * alpha_w
    
    # calculation
    numerator = (math.log((y_i_w + alpha_w) / (n_i + alpha_0 - y_i_w - alpha_w)) - 
                 math.log((y_j_w + alpha_w) / (n_j + alpha_0 - y_j_w - alpha_w)))
    denominator = math.sqrt(1 / (y_i_w + alpha_w) + 1 / (y_j_w + alpha_w))
    
    return numerator / denominator

In [7]:
def logodds_with_uninformative_prior(one_tokens, two_tokens, display=25):
    
    one_top_words = find_freq_tokens(one_tokens, display)
    two_top_words = find_freq_tokens(two_tokens, display)
    
    top_words_diff_one = dict()
    top_words_diff_two = dict()
    
    for word in one_top_words:
        top_words_diff_one[word] = difference_in_usage(word, one_tokens, two_tokens)
    
    for word in two_top_words:
        top_words_diff_two[word] = difference_in_usage(word, two_tokens, one_tokens)
    
    return top_words_diff_one, top_words_diff_two

In [8]:
token_ranking_1, token_ranking_2 = logodds_with_uninformative_prior(class1_tokens, class2_tokens)

In [9]:
token_ranking_1

{'lincoln': 3.5179561549133496,
 'confeder': -7.477058298601654,
 'gener': -0.9145471516186507,
 'would': -3.9235732891828667,
 'war': -2.939859356705888,
 'union': -9.627840823795536,
 'north': 0.43457451767808747,
 'one': 2.0102779549050367,
 'lee': -1.3947309376396697,
 'grant': 1.6499086108057,
 'men': -0.11331830472200768,
 'state': -4.4775747905513485,
 'south': -1.1989778333401264,
 'forc': -6.2635521855988125,
 'mcclellan': 3.5192958774958165,
 'like': 3.425984944536229,
 'armi': -2.350182855776382,
 'time': 0.7021736575593186,
 'slave': -0.2639289403597094,
 'take': 2.1722811703089446,
 'move': 3.5491082928484605,
 'could': 1.742666082606499,
 'new': -0.8187224012919115,
 'well': 2.3322916543789005,
 'presid': -0.14346736553132017}

In [10]:
top_keys_1 = [item for item in token_ranking_1.keys()]
top_vals_1 = [item for item in token_ranking_1.values()]

top_keys_2 = [item for item in token_ranking_2.keys()]
top_vals_2 = [item for item in token_ranking_2.values()]

In [11]:
import pandas as pd

df = pd.DataFrame(list(zip(top_keys_1, top_vals_1, top_keys_2, top_vals_2)),
                    columns =['word_class1', 'value_class1', 'word_class2', 'value_class2'])
df

Unnamed: 0,word_class1,value_class1,word_class2,value_class2
0,lincoln,3.517956,union,9.559404
1,confeder,-7.477058,confeder,7.424273
2,gener,-0.914547,forc,6.237067
3,would,-3.923573,would,3.906376
4,war,-2.939859,state,4.460704
5,union,-9.627841,war,2.928033
6,north,0.434575,th,5.827023
7,one,2.010278,gener,0.911134
8,lee,-1.394731,battl,3.928163
9,grant,1.649909,lee,1.390402
