This notebook explores methods for comparing two different textual datasets to identify the terms that are distinct to each one:

* Difference of proportions (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), section 3.2.2)
* Mann-Whitney rank-sums test (described in [Kilgarriff 2001, Comparing Corpora](https://www.sketchengine.eu/wp-content/uploads/comparing_corpora_2001.pdf), section 2.3)

In [1]:
import sys, operator
from collections import Counter
from scipy.stats import mannwhitneyu

In [2]:
# the convote data is already tokenized so just split on whitespace
repub_tokens=open("../data/repub.convote.txt", encoding="utf-8").read().split(" ")
dem_tokens=open("../data/dem.convote.txt", encoding="utf-8").read().split(" ")

Q1: First, calculate the simple "difference of proportions" measure from Monroe et al.'s "Fighting Words", section 3.2.2. By this measure, what are the ten most Republican terms and the ten most Democratic terms?
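
As a reminder of what is being computed, the measure can be written out as follows. The notation is a simplified version of Monroe et al.'s (the topic index is dropped), so treat the symbols as assumptions of this sketch: $y_w^{(i)}$ is the count of word $w$ in corpus $i$, and $n^{(i)}$ is the total number of tokens in corpus $i$.

$$
f_w^{(i)} = \frac{y_w^{(i)}}{n^{(i)}}, \qquad \hat{\delta}_w = f_w^{(D)} - f_w^{(R)}
$$

A large positive $\hat{\delta}_w$ marks $w$ as characteristic of corpus $D$, and a large negative value marks it as characteristic of corpus $R$.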

In [3]:
def difference_of_proportions(one_tokens, two_tokens, top_n=10):
    """Return (terms most typical of corpus one, terms most typical of corpus two)."""
    one_counts, two_counts = Counter(one_tokens), Counter(two_tokens)
    # relative frequency in corpus one minus relative frequency in corpus two;
    # a Counter returns 0 for words it has never seen
    diffs = {word: one_counts[word] / len(one_tokens) - two_counts[word] / len(two_tokens)
             for word in one_counts.keys() | two_counts.keys()}
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    # largest positive differences favor corpus one; most negative favor corpus two
    return ranked[:top_n], ranked[-top_n:][::-1]

In [4]:
# the first argument is the Democratic corpus, so the first result is the Democratic list
top_dem, top_repub = difference_of_proportions(dem_tokens, repub_tokens)
print("Top 10 republican:", top_repub)
print("Top 10 democrat:", top_dem)

Top 10 republican: ['i', 'we', 'and', 'of', ',', 'chairman', 'that', 'as', 'gentleman', 'a']
Top 10 democrat: ['not', '$', 'cuts', 'bill', 'republican', 'budget', 'billion', 'would', 'health', 'for']


Simply analyzing the difference in relative frequencies has a number of downsides:

1. As Monroe et al. (2009) point out (and as we can see here as well), it tends to emphasize high-frequency words (be sure you understand why).
2. We're not measuring whether a difference is statistically meaningful or just due to chance. The $\chi^2$ test is one method (described in Kilgarriff 2001 and, in the context of collocations, in Manning and Schuetze [here](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)) that addresses the desideratum of finding statistically significant terms, but it shares the next downside.
3. Simply counting up the total number of mentions of a term doesn't account for the "burstiness" of language: if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of words are not independent random events; they are tightly coupled with each other.

If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently *everywhere* in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties).
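
To make the $\chi^2$ option concrete, here is a minimal sketch of the test for a single term, using `scipy.stats.chi2_contingency` on the 2x2 table of (term vs. every other token) by (corpus A vs. corpus B). The helper name `chi_square_for_term` and the counts are hypothetical, invented purely for illustration; they are not taken from the convote data.

```python
from scipy.stats import chi2_contingency


def chi_square_for_term(count_a, total_a, count_b, total_b):
    """Pearson chi-square test on the 2x2 table (term vs. other tokens) x (corpus A vs. B)."""
    table = [
        [count_a, total_a - count_a],  # corpus A: occurrences of the term, all other tokens
        [count_b, total_b - count_b],  # corpus B: occurrences of the term, all other tokens
    ]
    # correction=False gives the plain Pearson statistic (no Yates continuity correction)
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    return chi2, p_value


# Hypothetical counts, made up for illustration only
chi2, p = chi_square_for_term(count_a=120, total_a=100_000, count_b=60, total_b=100_000)
print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

Note that the test only sees the total counts: it would give the same answer whether the 120 mentions were spread across the whole corpus or packed into a single speech, which is exactly the burstiness problem described above.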

Q2 (check-plus): One measure that does account for this burstiness is the adaptation by corpus linguists of the non-parametric Mann-Whitney rank-sum test. The specific adaptation of this test for text is described in Kilgarriff 2001, section 2.3. Implement this test using a fixed chunk size of 500 and the [SciPy mannwhitneyu function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html). By this measure, what are the ten most Republican terms and the ten most Democratic terms?
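
Before running this on the real data, a toy example may help show what the chunked test looks at. In the sketch below, the two corpora, the helper `chunk_frequencies`, and the chunk size of 10 are all invented for illustration. The word "tax" occurs 8 times in each corpus, so its difference of proportions is zero, but it is spread evenly through `corpus_a` and packed into a single chunk of `corpus_b`.

```python
from scipy.stats import mannwhitneyu


def chunk_frequencies(tokens, word, chunk_size):
    """Count `word` in each consecutive full chunk of `chunk_size` tokens."""
    n_chunks = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size].count(word) for i in range(n_chunks)]


# Toy corpora, invented for illustration: 40 tokens each, 8 of them "tax"
corpus_a = (["tax", "x", "x", "x", "x"] * 2) * 4  # two "tax" in every chunk of 10
corpus_b = ["tax"] * 8 + ["x"] * 32               # all eight "tax" in the first chunk

freqs_a = chunk_frequencies(corpus_a, "tax", chunk_size=10)
freqs_b = chunk_frequencies(corpus_b, "tax", chunk_size=10)

# The test compares the per-chunk counts by rank, not the corpus totals
u_stat, p_value = mannwhitneyu(freqs_a, freqs_b, alternative="two-sided")
print(freqs_a, freqs_b, u_stat, p_value)
```

Because the test works on ranks of per-chunk counts, one very bursty chunk in `corpus_b` counts for no more than one chunk, while the steady presence of the word in every chunk of `corpus_a` is what gets rewarded. One caveat: which U value `mannwhitneyu` reports has changed across SciPy releases (recent versions report the statistic for the first sample), so check the documentation for the version you have installed before interpreting its direction.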

In [10]:
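def mann_whitney_analysis(one_tokens, two_tokens, chunk_size=500, min_count=50, top_n=10):
    """Rank terms by the Mann-Whitney U statistic over fixed-size chunks (Kilgarriff 2001, section 2.3)."""
    def chunk_counts(tokens):
        # drop the final partial chunk so that every chunk has exactly chunk_size tokens
        n_chunks = len(tokens) // chunk_size
        return [Counter(tokens[i * chunk_size:(i + 1) * chunk_size]) for i in range(n_chunks)]

    one_chunks, two_chunks = chunk_counts(one_tokens), chunk_counts(two_tokens)
    totals = Counter(one_tokens) + Counter(two_tokens)
    scores = {}
    for word, total in totals.items():
        if total < min_count:
            # assumption (not part of the assignment): skip rare words to keep the loop tractable
            continue
        one_freqs = [chunk[word] for chunk in one_chunks]
        two_freqs = [chunk[word] for chunk in two_chunks]
        # recent SciPy versions return U for the first sample; older versions may differ
        u_stat, _ = mannwhitneyu(one_freqs, two_freqs, alternative="two-sided")
        # normalize so that values near 1 favor corpus one and values near 0 favor corpus two
        scores[word] = u_stat / (len(one_chunks) * len(two_chunks))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], ranked[-top_n:][::-1]

In [11]:
# the first argument is the Democratic corpus, so the first result is the Democratic list
top_dem, top_repub = mann_whitney_analysis(dem_tokens, repub_tokens)
print("Top 10 republican:", top_repub)
print("Top 10 democrat:", top_dem)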

