<h1>Using the Mann-Whitney U Test</h1>
USES PYTHON 3.x

This notebook will run through how to do the <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U test</a>, comparing the word usage across two corpora.

First we will specify all the directories and filenames that will be used in our code. Keeping them up here makes them easier to find and change later.

In [14]:
CORPUS1_PATH = "corpus1/" # The directory containing the text files for the first corpus.
CORPUS2_PATH = "corpus2/" # The directory containing the text files for the second corpus.

#The names (and locations) of the output files that will be created:
WORD_FREQUENCY_CSV_FILENAME1 = "Corpus1Frequencies.csv"
WORD_FREQUENCY_CSV_FILENAME2 = "Corpus2Frequencies.csv"
CORPUS_COMPARISON_FILENAME = "CorpusCompare.csv"
WORD_LIST_FILENAME = "words.txt"

<h1>Find All the Files to be Analyzed</h1>
As with the TF-IDF code, this next section finds all the text files and records their paths, so that later functions can open each file when they need to process it. This is cheaper than holding the contents of every text file in memory, and it is a good starting point for any sort of text analysis.

One difference from the TF-IDF code is that text files from two different corpora are being tracked, as opposed to text files from just one corpus. 

In [15]:
import os

corpus1dirs = os.listdir(CORPUS1_PATH) # returns list
corpus2dirs = os.listdir(CORPUS2_PATH) # returns list
corpus1 = []
corpus2 = []

#Loop over all of the files in the provided directory
for file in corpus1dirs:
    #Ensure that only text files are included:
    if file.endswith(".txt"):
        text_dir = os.path.join(CORPUS1_PATH, file)
        corpus1.append(text_dir)

#Loop over all of the files in the provided directory
for file in corpus2dirs:
    #Ensure that only text files are included:
    if file.endswith(".txt"):
        text_dir = os.path.join(CORPUS2_PATH, file)
        corpus2.append(text_dir)
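
Since the two loops above differ only in the directory they scan, they could be folded into one helper. The following is a minimal sketch rather than part of the original notebook: the function name list_text_files is hypothetical, and sorting the filenames is an added assumption so that the column order of the later .csv files is reproducible.

```python
import os

def list_text_files(directory):
    """Return the paths of all .txt files directly inside `directory`."""
    paths = []
    # sorted() is an addition: os.listdir() returns names in arbitrary order
    for name in sorted(os.listdir(directory)):
        # Ensure that only text files are included
        if name.endswith(".txt"):
            paths.append(os.path.join(directory, name))
    return paths
```

With this helper, corpus1 = list_text_files(CORPUS1_PATH) and corpus2 = list_text_files(CORPUS2_PATH) would replace the two loops.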

<h1>Term Frequency</h1>

The Mann-Whitney U test needs some measure to rank by, so the code below generates the Corpus1Frequencies.csv and Corpus2Frequencies.csv files, which contain the relative frequency of each word in each file of the corresponding corpus.

<b>NOTE:</b> Other measures could be used here (such as raw frequency, for example) and may be better suited to what you're trying to do.
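
To make the measure concrete, here is a minimal sketch of how relative frequency is computed for a single document. The sample sentence and the helper name relative_freqs are made up for illustration; the regular expression is the one the notebook uses to isolate words.

```python
import re
from collections import Counter

def relative_freqs(text):
    """Map each word in `text` to its count divided by the total word count."""
    words = re.findall(r'(\b\S+\b|#\w+|@\w+)', text.lower())
    raw = Counter(words)   # raw frequency: how many times each word occurs
    total = len(words)     # total number of words in this document
    return {word: count / total for word, count in raw.items()}

freqs = relative_freqs("The cat sat. The dog sat too.")
print(freqs["the"])  # 2 occurrences out of 7 words
```

The raw frequency of "the" here is 2, while its relative frequency is 2/7. Dividing by document length keeps long documents from dominating the ranking, which is one reason to prefer it when the files differ a lot in size.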

In [16]:
import csv
import string
import pandas as pd
import re
from collections import defaultdict

word_list = []

def findFreq(corpus, csvFile):
    """Write a CSV of per-document relative word frequencies for one corpus."""
    docs = {}

    for text in corpus:
        # Start a fresh tally for each document, so that words from earlier
        # documents do not carry over as zero-count entries
        counts = defaultdict(int)
        num_words = 0
        with open(text, 'r', encoding="utf-8") as f:
            for line in f:
                # Use Regex to remove punctuation and isolate words
                words = re.findall(r'(\b\S+\b|#\w+|@\w+)', line.lower())
                for word in words:
                    counts[word] += 1
                num_words += len(words)

        relativefreqs = {}
        for word, rawCount in counts.items():
            word_list.append(word)
            relativefreqs[word] = rawCount / num_words
        # add this document's relative freqs to our dictionary
        docs[os.path.basename(text)] = relativefreqs
        #print("Done with " + text)

    #output everything to a .csv file, using pandas as a go between.
    df = pd.DataFrame(docs)
    df = df.fillna(0) # words missing from a document get a frequency of zero
    df.to_csv(csvFile, encoding="utf-8") # write out to CSV

findFreq(corpus1, WORD_FREQUENCY_CSV_FILENAME1)
print("Done first corpus.")

findFreq(corpus2, WORD_FREQUENCY_CSV_FILENAME2)
print("Done second corpus.")

# Write the combined vocabulary of both corpora, one word per line
unique_words = set(word_list)
with open(WORD_LIST_FILENAME, 'w', encoding="utf-8") as target:
    for word in sorted(unique_words):
        target.write(str(word) + "\n")
print("Done.")

Done first corpus.
Done second corpus.
Done.


<h1>Mann–Whitney U Test</h1>
The code below iterates over the word list and looks each word up in both of the .csv files generated above, comparing its frequencies across the two corpora. If a word appears in only one corpus, a row of zeroes stands in for the corpus that does not contain it.

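A .csv file is then generated that ranks every word by how strongly it is associated with one corpus rather than the other. With recent SciPy versions (1.7 and later), mannwhitneyu reports the U statistic for its first sample, so high values point to corpus 1 and low values to corpus 2.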
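
For intuition about what is computed for each word, the U statistic for the first sample can be obtained by counting pairs: every (corpus 1 document, corpus 2 document) pair scores 1 if the word's frequency is higher in the corpus 1 document, 0.5 if the two are tied, and 0 otherwise. The sketch below is a plain-Python illustration of that definition with made-up frequencies. It is not the routine the notebook calls, and the name u_statistic is hypothetical.

```python
def u_statistic(sample1, sample2):
    """Mann-Whitney U for sample1, computed by direct pair counting."""
    u = 0.0
    for a in sample1:
        for b in sample2:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5  # ties count as half a win
    return u

# Made-up relative frequencies of one word in each document of two corpora
corpus1_freqs = [0.004, 0.006, 0.005]
corpus2_freqs = [0.001, 0.005]

# A word absent from a corpus is represented by a row of zeroes
absent_from_corpus2 = [0] * len(corpus2_freqs)

print(u_statistic(corpus1_freqs, corpus2_freqs))        # 4.5
print(u_statistic(corpus1_freqs, absent_from_corpus2))  # 6.0
```

The second call shows why the rows of zeroes matter: a word that never occurs in corpus 2 wins every pair, so its U reaches the maximum of 3 x 2 = 6.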

In [17]:
from scipy.stats import mannwhitneyu

df1 = pd.read_csv(WORD_FREQUENCY_CSV_FILENAME1, index_col=0) # read in the CSV, using the words as the row index
df1 = df1.fillna(0) # replace NaNs with zeroes.

df2 = pd.read_csv(WORD_FREQUENCY_CSV_FILENAME2, index_col=0) # read in the CSV, using the words as the row index
df2 = df2.fillna(0) # replace NaNs with zeroes.

total_docs = len(df1.columns) * len(df2.columns) # number of (corpus 1 document, corpus 2 document) pairs, i.e. the largest possible U

# Make "dummy" rows of all zeroes for any words that only appear in one corpus and not the other
missingInCorpus1 = [0] * df1.shape[1] # one zero per document in corpus 1
missingInCorpus2 = [0] * df2.shape[1] # one zero per document in corpus 2

# Iterate over the wordlist and the two corpora, and output to csv
with open(CORPUS_COMPARISON_FILENAME, 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['word', 'Mann Whitney U Value', 'Mann Whitney rho-value'])
    with open(WORD_LIST_FILENAME, 'r', encoding="utf-8") as f:
        for word in f:
            word = word.strip()
            if (word in df1.index):
                countsInCorpus1 = df1.loc[word].values
            else:
                countsInCorpus1 = missingInCorpus1
            if (word in df2.index):
                countsInCorpus2 = df2.loc[word].values
            else:
                countsInCorpus2 = missingInCorpus2
            try:
                mw = mannwhitneyu(countsInCorpus1, countsInCorpus2)
                mwStat = mw.statistic
                mwRho = mwStat / total_docs
            except ValueError: # Older SciPy versions raise this when every value in both samples is identical; such rows are flagged with -1
                mwStat = -1
                mwRho = -1
            writer.writerow([word, mwStat, mwRho])
print("Done.")

Done.


<h1>Tables and Graphs</h1>

Let's look at a table showing the highest ranked words (likely to be ones that are more salient to corpus 1). The table is sorted by the Mann-Whitney ρ value, which is the U statistic divided by the product of the number of documents in corpus 1 and the number of documents in corpus 2:

In [19]:
df = pd.read_csv(CORPUS_COMPARISON_FILENAME, index_col=0) # read in the CSV
ranked = df.sort_values("Mann Whitney rho-value", ascending=False) # highest rho first; a new name avoids overwriting df2
ranked.head(100)

word,Mann Whitney U Value,Mann Whitney rho-value
sit,147.0,0.918750
vexed,146.0,0.912500
curls,146.0,0.912500
cheerful,145.0,0.906250
usually,143.0,0.893750
roused,141.0,0.881250
forgetting,141.0,0.881250
cousin,140.0,0.875000
impatient,140.0,0.875000
tying,140.0,0.875000
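
The ρ column can be checked by hand. The values in the table are consistent with 160 document pairs (for example, 147 / 160 = 0.91875); that figure is inferred from the output, not stated in the notebook. The sketch below uses a hypothetical helper named rho to show the normalization.

```python
def rho(u, n_pairs):
    """Scale a U statistic to the 0-1 range by the number of document pairs."""
    return u / n_pairs

# 160 pairs is inferred from the table (147 / 0.91875), not given in the notebook
N_PAIRS = 160

for word, u in [("sit", 147.0), ("vexed", 146.0), ("cheerful", 145.0)]:
    print(word, rho(u, N_PAIRS))
```

Assuming U is the statistic for the first sample (corpus 1), ρ can be read as the share of document pairs in which the corpus 1 document uses the word relatively more often than the corpus 2 document, with ties counted as half. A value near 0.5 then suggests no difference between the corpora, and a value near 0 suggests the word is more salient to corpus 2.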
