# Word Frequency (unmodified)
This first attempt is based on the beginner's tutorial linked in the Kaggle contest overview: https://www.kaggle.com/rtatman/beginner-s-tutorial-python

The tutorial counts how often each author uses each word and turns those counts into per-author word probabilities.
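As a rough illustration of that idea (not the tutorial's code), the sketch below counts words per author with the standard library and converts the counts to relative frequencies. The toy `samples` data and the whitespace tokenizer are assumptions made for this example; the notebook itself uses `nltk.tokenize.word_tokenize` and `nltk.FreqDist`.

```python
from collections import Counter

# Hypothetical toy data standing in for train.csv: (author, sentence) pairs.
samples = [
    ("EAP", "the raven sat upon the bust"),
    ("HPL", "the thing in the dark cellar"),
    ("EAP", "the bells the bells"),
]

# Count how often each author uses each word.
counts_by_author = {}
for author, sentence in samples:
    tokens = sentence.lower().split()  # crude stand-in for word_tokenize
    counts_by_author.setdefault(author, Counter()).update(tokens)

def relative_freq(author, word):
    """Share of the author's tokens that are `word` (0.0 if never used)."""
    counts = counts_by_author[author]
    return counts[word] / sum(counts.values())

print(relative_freq("EAP", "the"))
print(relative_freq("HPL", "raven"))
```

A word the author never used gets a frequency of exactly 0, which is why the notebook adds a small constant before multiplying.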

In [2]:
# read in some helpful libraries
import nltk # the natural language toolkit, open-source NLP
import pandas as pd # dataframes

# read our data into a dataframe
texts = pd.read_csv("train.csv")

# look at the first few rows
texts.head()

,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [80]:
### Split data

# split the data by author
byAuthor = texts.groupby("author")

### Tokenize (split into individual words) our text

# word frequency by author
wordFreqByAuthor = nltk.probability.ConditionalFreqDist()

# for each author...
for name, group in byAuthor:
    # get all of the sentences they wrote and collapse them into a
    # single long string
    sentences = group['text'].str.cat(sep = ' ')
    
    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()
    
    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)
    
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)

    # add the frequencies for each author to our dictionary
    wordFreqByAuthor[name] = (frequency)
    
# now we have a dictionary where each entry is the frequency distribution
# of words for a specific author.

In [3]:
# see how often each author says "blood"
for i in wordFreqByAuthor.keys():
    print("blood: " + i)
    print(wordFreqByAuthor[i].freq('blood'))

# print a blank line
print()

# see how often each author says "scream"
for i in wordFreqByAuthor.keys():
    print("scream: " + i)
    print(wordFreqByAuthor[i].freq('scream'))
    
# print a blank line
print()

# see how often each author says "fear"
for i in wordFreqByAuthor.keys():
    print("fear: " + i)
    print(wordFreqByAuthor[i].freq('fear'))

blood: EAP
0.00014646397201676582
blood: MWS
0.00022773011333545174
blood: HPL
0.00022992337803427008

scream: EAP
1.7231055531384214e-05
scream: MWS
2.6480245736680435e-05
scream: HPL
9.196935121370803e-05

fear: EAP
0.00010338633318830528
fear: MWS
0.0006196377502383222
fear: HPL
0.0005748084450856752


In [4]:
# One way to guess authorship is to use the joint probability that each 
# author used each word in a given sentence.

# first, let's start with a test sentence
testSentence = "It was a dark and stormy night."

# and then lowercase & tokenize our test sentence
preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())

# create an empty dataframe to put our output in
testProbailities = pd.DataFrame(columns = ['author','word','probability'])

# For each author...
for i in wordFreqByAuthor.keys():
    # for each word in our test sentence...
    for j  in preProcessedTestSentence:
        # find out how frequently the author used that word
        wordFreq = wordFreqByAuthor[i].freq(j)
        # and add a very small amount to every prob. so none of them are 0
        smoothedWordFreq = wordFreq + 0.000001
        # add the author, word and smoothed freq. to our dataframe
        output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
        testProbailities = testProbailities.append(output, ignore_index = True)

# empty dataframe for the probability that each author wrote the sentence
testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])

# now let's group the dataframe with our frequency by author
for i in wordFreqByAuthor.keys():
    # get the joint probability that each author wrote each word
    oneAuthor = testProbailities.query('author == "' + i + '"')
    jointProbability = oneAuthor.product(numeric_only = True)[0]
    
    # and add that to our dataframe
    output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
    testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)

# and our winner is...
testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'author']

'HPL'

The code above comes directly from the beginner tutorial, with some minor adjustments to make it run: there was an issue with running "tokenize", and the input needed to be decoded from ASCII into Unicode before it would run properly. As expected for a beginner tutorial, the code is very basic. It only tests the probability for one sentence, and it prints only the winning author, not the probabilities.

Because the example only printed the name of the predicted author, I also printed the dataframe to see what it came up with for each author:

In [5]:
print(testProbailitiesByAuthor)

  author  jointProbability
0    EAP      1.331812e-21
1    MWS      1.748100e-21
2    HPL      2.483591e-20
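Joint probabilities like these shrink quickly as sentences get longer, and a long enough sentence would underflow to 0.0 in floating point. A common alternative is to add logarithms instead of multiplying probabilities; the ranking of authors stays the same. The sketch below is a minimal illustration under stated assumptions: `freq_by_author` is a hypothetical plain dictionary with made-up frequencies, and `SMOOTHING` mirrors the 0.000001 used above.

```python
import math

SMOOTHING = 0.000001  # same additive constant as in the notebook

# Hypothetical relative frequencies per author (made-up numbers).
freq_by_author = {
    "EAP": {"dark": 0.0004, "night": 0.0006},
    "HPL": {"dark": 0.0009, "night": 0.0008},
}

def log_score(tokens, freqs):
    """Sum of log smoothed frequencies; equals the log of the joint product."""
    return sum(math.log(freqs.get(tok, 0.0) + SMOOTHING) for tok in tokens)

def predict(tokens):
    """Return (best_author, scores), where scores maps author -> log score."""
    scores = {a: log_score(tokens, f) for a, f in freq_by_author.items()}
    return max(scores, key=scores.get), scores

best, scores = predict(["dark", "night", "unseen"])
print(best)
print(scores)
```

Returning the whole `scores` dictionary also makes it easy to inspect every author's score rather than only the winner.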


First I will see how this fares when split into training and testing data:

In [5]:
import math
from sklearn.utils import shuffle

text_shuffle = shuffle(texts)

print(text_shuffle.shape)
text_shuffle.head()

(19579, 3)


,id,text,author
16389,id10465,I looked round on the audience; the females we...,MWS
15290,id23479,"The bent, goatish giant before him seemed like...",HPL
9080,id04706,Over and above the fumes and sickening closene...,HPL
17587,id12742,That Raymond should marry Idris was more than ...,MWS
16187,id00778,Whilst I had hitherto considered this but a na...,HPL


In [6]:
# We will use 80% of the labeled data for training and 20% for testing
train_size = math.floor(text_shuffle.shape[0] * .8)
train = text_shuffle.iloc[:train_size, :]
test = text_shuffle.iloc[train_size:, :]
print(train.shape, test.shape)

(15663, 3) (3916, 3)
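Because `shuffle(texts)` is called without a seed, the split (and therefore the accuracy figures below) will differ from run to run. The sketch below shows a seeded 80/20 split using only the standard library; `split_rows` and the toy `rows` list are hypothetical names for this example. In the notebook itself, passing `random_state` to `sklearn.utils.shuffle` would serve the same purpose.

```python
import math
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = math.floor(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the labeled data: (id, text) pairs.
rows = [("id%05d" % n, "sentence %d" % n) for n in range(10)]

train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```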


In [8]:
### Split data

# split the data by author
byAuthor = train.groupby("author")

### Tokenize (split into individual words) our text

# word frequency by author
wordFreqByAuthor = nltk.probability.ConditionalFreqDist()

# for each author...
for name, group in byAuthor:
    # get all of the sentences they wrote and collapse them into a
    # single long string
    sentences = group['text'].str.cat(sep = ' ')
    
    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()
    
    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)
    
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)

    # add the frequencies for each author to our dictionary
    wordFreqByAuthor[name] = (frequency)
    
# now we have a dictionary where each entry is the frequency distribution
# of words for a specific author.

In [9]:
# first see if this works with the first sentence from test
testSentence = test.iloc[0]['text']
print(testSentence)

print(nltk.tokenize.word_tokenize(testSentence.lower()))

I love my cousin tenderly and sincerely.
['i', 'love', 'my', 'cousin', 'tenderly', 'and', 'sincerely', '.']


In [10]:
# and then lowercase & tokenize our test sentence
preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())

# create an empty dataframe to put our output in
testProbailities = pd.DataFrame(columns = ['author','word','probability'])

# For each author...
for i in wordFreqByAuthor.keys():
    # for each word in our test sentence...
    for j  in preProcessedTestSentence:
        # find out how frequently the author used that word
        wordFreq = wordFreqByAuthor[i].freq(j)
        # and add a very small amount to every prob. so none of them are 0
        smoothedWordFreq = wordFreq + 0.000001
        # add the author, word and smoothed freq. to our dataframe
        output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
        testProbailities = testProbailities.append(output, ignore_index = True)

# empty dataframe for the probability that each author wrote the sentence
testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])

# now let's group the dataframe with our frequency by author
for i in wordFreqByAuthor.keys():
    # get the joint probability that each author wrote each word
    oneAuthor = testProbailities.query('author == "' + i + '"')
    jointProbability = oneAuthor.product(numeric_only = True)[0]
    
    # and add that to our dataframe
    output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
    testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)

# and our winner is...
print(testProbailitiesByAuthor)
testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'author']

  author  jointProbability
0    EAP      6.876213e-27
1    MWS      2.647353e-23
2    HPL      1.486582e-27


'MWS'

In [11]:
# was it right?
print(test.iloc[0]['author'])

MWS


It was! Now let's test the accuracy overall:

In [12]:
correct = 0

for index, row in test.iterrows():
    testSentence = row['text']
    # and then lowercase & tokenize our test sentence
    preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())

    # create an empty dataframe to put our output in
    testProbailities = pd.DataFrame(columns = ['author','word','probability'])

    # For each author...
    for i in wordFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            # find out how frequently the author used that word
            wordFreq = wordFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedWordFreq = wordFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testProbailities = testProbailities.append(output, ignore_index = True)

    # empty dataframe for the probability that each author wrote the sentence
    testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])

    # now let's group the dataframe with our frequency by author
    for i in wordFreqByAuthor.keys():
        # get the joint probability that each author wrote each word
        oneAuthor = testProbailities.query('author == "' + i + '"')
        jointProbability = oneAuthor.product(numeric_only = True)[0]

        # and add that to our dataframe
        output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)

    # and our winner is...
    pred = testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'author']
    if(pred == row['author']):
        correct = correct + 1
        
accuracy = correct / test.shape[0]
accuracy


0.8480592441266599

Given the simplicity of the model, this is better than I expected. However, there are ways to improve it while sticking to the same basic concept.

Some simple ways this could be improved are:
* Removing stopwords from the data before calculating the word frequency by author
* Incorporating features such as the length of the sentences or lexical diversity
* Looking at long words separately, as these are more likely to be unique
* Using collocations (words that often appear together)
* Looking at overall word lengths and the frequency of each word length
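Two of the features listed above, lexical diversity and the frequency of word lengths, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, lexical diversity is taken here to mean distinct tokens divided by total tokens, and a whitespace split stands in for `word_tokenize`.

```python
from collections import Counter

def lexical_diversity(tokens):
    """Distinct tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def word_length_freq(tokens):
    """Relative frequency of each token length."""
    lengths = Counter(len(tok) for tok in tokens)
    total = sum(lengths.values())
    return {length: n / total for length, n in lengths.items()}

tokens = "it was a dark and stormy night".split()
print(lexical_diversity(tokens))
print(word_length_freq(tokens))
```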

# Word Frequency (minus stopwords)

In [13]:
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))

# We will use 80% of the labeled data for training and 20% for testing
train_size = math.floor(text_shuffle.shape[0] * .8)
train = text_shuffle.iloc[:train_size, :]
test = text_shuffle.iloc[train_size:, :]
print(train.shape, test.shape)

(15663, 3) (3916, 3)


In [14]:
### Split data

# split the data by author
byAuthor = train.groupby("author")

### Tokenize (split into individual words) our text

# word frequency by author
wordFreqByAuthor = nltk.probability.ConditionalFreqDist()

# for each author...
for name, group in byAuthor:
    # get all of the sentences they wrote and collapse them into a
    # single long string
    sentences = group['text'].str.cat(sep = ' ')
    
    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()
    
    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)
    
    #remove stopwords from tokens
    tokens_modified = []
    for i in tokens:
        if i not in stops:
            tokens_modified.append(i)
    
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens_modified)

    # add the frequencies for each author to our dictionary
    wordFreqByAuthor[name] = (frequency)
    
# now we have a dictionary where each entry is the frequency distribution
# of words for a specific author, minus stopwords

In [15]:
# testing accuracy without stopwords
correct = 0

for index, row in test.iterrows():
    testSentence = row['text']
    # and then lowercase & tokenize our test sentence
    preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())

    # create an empty dataframe to put our output in
    testProbailities = pd.DataFrame(columns = ['author','word','probability'])

    # For each author...
    for i in wordFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            # find out how frequently the author used that word
            wordFreq = wordFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedWordFreq = wordFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testProbailities = testProbailities.append(output, ignore_index = True)

    # empty dataframe for the probability that each author wrote the sentence
    testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])

    # now let's group the dataframe with our frequency by author
    for i in wordFreqByAuthor.keys():
        # get the joint probability that each author wrote each word
        oneAuthor = testProbailities.query('author == "' + i + '"')
        jointProbability = oneAuthor.product(numeric_only = True)[0]

        # and add that to our dataframe
        output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)

    # and our winner is...
    pred = testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'author']
    if(pred == row['author']):
        correct = correct + 1
        
accuracy = correct / test.shape[0]
accuracy

0.8192032686414709

Interestingly, removing the stopwords made the model less accurate: accuracy fell from 84.8% to 81.9%. Given that the model relies solely on word frequency, this makes sense, since different authors may use these words at different rates. That is, the presence of a stopword is not indicative of any particular author, but its frequency might be.
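To make the point about stopword rates concrete, the sketch below compares how often two authors use a handful of stopwords. Everything here is made up for illustration: the two token lists, the author labels "A" and "B", and the four-word `STOPWORDS` list (the notebook uses NLTK's English stopword list).

```python
from collections import Counter

# Tiny hypothetical subset; the notebook uses nltk's English stopword list.
STOPWORDS = ["and", "of", "the", "upon"]

# Made-up token lists standing in for two authors' collected text.
tokens_by_author = {
    "A": "the cat and the dog upon the hill".split(),
    "B": "a cat of the hills and of the sea".split(),
}

def stopword_rates(tokens):
    """Relative frequency of each stopword within one author's tokens."""
    counts = Counter(tokens)
    return {word: counts[word] / len(tokens) for word in STOPWORDS}

rates = {author: stopword_rates(toks) for author, toks in tokens_by_author.items()}
for word in STOPWORDS:
    print(word, {author: round(r[word], 3) for author, r in rates.items()})
```

If the rates differ noticeably between authors on the real data, the stopwords carry signal that is lost when they are removed.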

# Bigrams
This model keeps the stopwords, since removing them reduced accuracy (from 84.8% to 81.9%). In addition to the frequency of individual words, it takes into account the frequency of bigrams (pairs of adjacent words) in each author's work.
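For reference, here is a small sketch of what bigrams look like, using `zip` from the standard library in place of `nltk.bigrams`. One thing worth knowing is that `nltk.bigrams` returns an iterator, which can only be looped over once; the lazy `zip` object below behaves the same way, so wrapping it in `list(...)` is needed if the pairs are reused (for example, once per author).

```python
from collections import Counter

tokens = ["it", "was", "a", "dark", "and", "stormy", "night"]

# Pair every token with its successor and keep the pairs in a list.
pairs = list(zip(tokens, tokens[1:]))
print(pairs[:3])

# Count the bigrams, much as nltk.FreqDist does in the notebook.
bigram_counts = Counter(pairs)
print(bigram_counts[("dark", "and")])

# A lazy iterator is exhausted after one pass.
lazy_pairs = zip(tokens, tokens[1:])
first_pass = list(lazy_pairs)
second_pass = list(lazy_pairs)
print(len(first_pass), len(second_pass))
```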

In [70]:
# split the data by author
byAuthor = texts.groupby("author")

### Tokenize (split into individual words) our text

# word frequency by author
wordFreqByAuthor = nltk.probability.ConditionalFreqDist()
biFreqByAuthor = nltk.probability.ConditionalFreqDist()

# for each author...
for name, group in byAuthor:
    # get all of the sentences they wrote and collapse them into a
    # single long string
    sentences = group['text'].str.cat(sep = ' ')
    
    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()
    
    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)
    
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)
    bigrams = nltk.FreqDist(nltk.bigrams(tokens))
    # add the frequencies for each author to our dictionary
    wordFreqByAuthor[name] = (frequency)
    biFreqByAuthor[name] = (bigrams)
    
    
# now we have a dictionary where each entry is the frequency distribution
# of words for a specific author and another dictionary where each entry is
# the frequency distribution of bigrams for a specific author

In [42]:
correct = 0

for index, row in test.iterrows():
    testSentence = row['text']
    # and then lowercase & tokenize our test sentence
    preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())
    preProcessedBigrams = nltk.bigrams(preProcessedTestSentence)

    # create an empty dataframe to put our output in
    testProbailities = pd.DataFrame(columns = ['author','word','probability'])
    testBiProbailities = pd.DataFrame(columns = ['author', 'bigram', 'probability'])

    # For each author...
    for i in wordFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            # find out how frequently the author used that word
            wordFreq = wordFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedWordFreq = wordFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testProbailities = testProbailities.append(output, ignore_index = True)

    # For each author...
    for i in biFreqByAuthor.keys():
        # for each bigram in our test sentence...
        for j  in preProcessedBigrams:
            # find out how frequently the author used that bigram
            biFreq = biFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedBiFreq = biFreq + 0.000001
            # add the author, bigram and smoothed freq. to our dataframe
            biOutput = pd.DataFrame([[i, j, smoothedBiFreq]], columns = ['author','bigram','probability'])
            testBiProbailities = testBiProbailities.append(biOutput, ignore_index = True)

    # empty dataframe for the probability that each author wrote the sentence
    testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])
    testBiProbailitiesByAuthor = pd.DataFrame(columns = ['author', 'jointBiProbability'])
    
    
    # now let's group the dataframe with our frequency by author
    for i in wordFreqByAuthor.keys():
        # get the joint probability that each author wrote each word
        oneAuthor = testProbailities.query('author == "' + i + '"')
        jointProbability = oneAuthor.product(numeric_only = True)[0]
        # and add that to our dataframe
        output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)
        
    
    for i in biFreqByAuthor.keys():
        oneAuthorBigram = testBiProbailities.query('author ==" ' + i + '"')
       
        jointBiProbability = oneAuthorBigram.product(numeric_only = True)[0]
        biOutput = pd.DataFrame([[i, jointBiProbability]], columns = ['author','jointBiProbability'])
        testBiProbailitiesByAuthor = testBiProbailitiesByAuthor.append(biOutput, ignore_index = True)
        
    
    totalProbailitiesByAuthor = pd.DataFrame(testProbailitiesByAuthor.values + testBiProbailitiesByAuthor.values,
                                            columns = testProbailitiesByAuthor.columns,
                                            index = testProbailitiesByAuthor.index)
        
    # and our winner is...
    pred = testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'author']
    if(pred == row['author']):
        correct = correct + 1
        
accuracy = correct / test.shape[0]
accuracy

0.9177732379979571

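With the bigram tables added, the measured accuracy rose from 84.8% to 91.8%, an increase of almost 7 percentage points. This number should be read with caution, though. The cell above builds its frequency tables from `texts` (all of the labeled data, including the test rows) rather than from `train`, and the final prediction is still taken from `testProbailitiesByAuthor`, which holds only the unigram probabilities. The gain therefore most likely comes from the model having already seen the test sentences, not from the bigrams.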
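In the evaluation cell above, `pred` is read from `testProbailitiesByAuthor`, so the bigram probabilities never enter the prediction. A sketch of one way to let them in: score each author by the sum of log smoothed unigram frequencies plus the sum of log smoothed bigram frequencies, and pick the highest total. The `train` helper, the toy sentences and the whitespace tokenizer are assumptions for illustration, and adding the logs of both kinds of probability is one possible way to combine them, not the notebook's method.

```python
import math
from collections import Counter

SMOOTHING = 0.000001  # same additive constant as in the notebook

def train(sentences):
    """Count unigrams and bigrams over one author's sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        toks = sentence.lower().split()  # crude stand-in for word_tokenize
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def rel_freq(counter, key):
    """Relative frequency of key in counter (0.0 if unseen or empty)."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

def combined_log_score(tokens, unigrams, bigrams):
    """Log score from unigrams plus log score from bigrams."""
    pairs = list(zip(tokens, tokens[1:]))
    score = sum(math.log(rel_freq(unigrams, t) + SMOOTHING) for t in tokens)
    score += sum(math.log(rel_freq(bigrams, p) + SMOOTHING) for p in pairs)
    return score

# Hypothetical training text; only training sentences go into the counts.
models = {
    "EAP": train(["the raven sat", "the raven spoke"]),
    "HPL": train(["the cellar was dark", "the dark thing"]),
}

tokens = "the raven".split()
scores = {a: combined_log_score(tokens, u, b) for a, (u, b) in models.items()}
prediction = max(scores, key=scores.get)
print(prediction)
```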

# Word Length
This model builds off of the previous model, which included bigrams. On top of unigrams and bigrams, this model takes into account the frequency distribution of word lengths.

In [17]:
# We will use 80% of the labeled data for training and 20% for testing
train_size = math.floor(text_shuffle.shape[0] * .8)
train = text_shuffle.iloc[:train_size, :]
test = text_shuffle.iloc[train_size:, :]
print(train.shape, test.shape)

(15663, 3) (3916, 3)


In [59]:
# split the data by author
byAuthor = texts.groupby("author")

### Tokenize (split into individual words) our text

# word frequency by author
wordFreqByAuthor = nltk.probability.ConditionalFreqDist()
biFreqByAuthor = nltk.probability.ConditionalFreqDist()
LenFreqByAuthor = nltk.probability.ConditionalFreqDist()

# for each author...
for name, group in byAuthor:
    # get all of the sentences they wrote and collapse them into a
    # single long string
    sentences = group['text'].str.cat(sep = ' ')
    
    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()
    
    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)
    
    lengths = []
    for i in tokens:
        lengths.append(len(i))
    
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)
    bigrams = nltk.FreqDist(nltk.bigrams(tokens))
    length = nltk.FreqDist(lengths)
    # add the frequencies for each author to our dictionary
    wordFreqByAuthor[name] = (frequency)
    biFreqByAuthor[name] = (bigrams)
    LenFreqByAuthor[name] = (length)

# now we have a dictionary where each entry is the frequency distribution
# of words for a specific author, another dictionary where each entry is
# the frequency distribution of bigrams for a specific author, and a third
# dictionary where each entry is the frequency distribution of the lengths
# of words for a specific author

In [64]:
correct = 0

for index, row in test.iterrows():
    testSentence = row['text']
    # and then lowercase & tokenize our test sentence
    preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())
    preProcessedBigrams = nltk.bigrams(preProcessedTestSentence)

    # create an empty dataframe to put our output in
    testProbailities = pd.DataFrame(columns = ['author','word','probability'])
    testBiProbailities = pd.DataFrame(columns = ['author', 'bigram', 'probability'])
    testLenProbailities = pd.DataFrame(columns = ['author', 'len', 'probability'])

    # For each author...
    for i in wordFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            # find out how frequently the author used that word
            wordFreq = wordFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedWordFreq = wordFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testProbailities = testProbailities.append(output, ignore_index = True)

    # For each author...
    for i in biFreqByAuthor.keys():
        # for each bigram in our test sentence...
        for j  in preProcessedBigrams:
            # find out how frequently the author used that bigram
            biFreq = biFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedBiFreq = biFreq + 0.000001
            # add the author, bigram and smoothed freq. to our dataframe
            biOutput = pd.DataFrame([[i, j, smoothedBiFreq]], columns = ['author','bigram','probability'])
            testBiProbailities = testBiProbailities.append(biOutput, ignore_index = True)
            
    # For each author...
    for i in LenFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            jLen = len(j)
            # find out how frequently the author used words of that length
            lenFreq = LenFreqByAuthor[i].freq(jLen)
            # and add a very small amount to every prob. so none of them are 0
            smoothedLenFreq = lenFreq + 0.000001
            # add the author, word length and smoothed freq. to our dataframe
            lenOutput = pd.DataFrame([[i, jLen, smoothedLenFreq]], columns = ['author','len','probability'])
            testLenProbailities = testLenProbailities.append(lenOutput, ignore_index = True)

    # empty dataframe for the probability that each author wrote the sentence
    testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])
    testBiProbailitiesByAuthor = pd.DataFrame(columns = ['author', 'jointBiProbability'])
    testLenProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointLenProbability'])
    
    
    # now let's group the dataframe with our frequency by author
    for i in wordFreqByAuthor.keys():
        # get the joint probability that each author wrote each word
        oneAuthor = testProbailities.query('author == "' + i + '"')
        jointProbability = oneAuthor.product(numeric_only = True)[0]
        # and add that to our dataframe
        output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)
        
    
    for i in biFreqByAuthor.keys():
        oneAuthorBigram = testBiProbailities.query('author ==" ' + i + '"')
       
        jointBiProbability = oneAuthorBigram.product(numeric_only = True)[0]
        biOutput = pd.DataFrame([[i, jointBiProbability]], columns = ['author','jointBiProbability'])
        testBiProbailitiesByAuthor = testBiProbailitiesByAuthor.append(biOutput, ignore_index = True)
    
    for i in LenFreqByAuthor.keys():
        # get the joint probability of the word lengths for each author
        oneAuthorLen = testLenProbailities.query('author == "' + i + '"')
        jointLenProbability = oneAuthorLen['probability'].product()
        # and add that to our dataframe
        LenOutput = pd.DataFrame([[i, jointLenProbability]], columns = ['author','jointLenProbability'])
        testLenProbailitiesByAuthor = testLenProbailitiesByAuthor.append(LenOutput, ignore_index = True)
        

    totalProbailitiesByAuthor = pd.DataFrame(testProbailitiesByAuthor.values + testBiProbailitiesByAuthor.values
                                             + testLenProbailitiesByAuthor.values,
                                            columns = testProbailitiesByAuthor.columns,
                                            index = testProbailitiesByAuthor.index)
        
    # and our winner is...
    pred = testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'author']
    if(pred == row['author']):
        correct = correct + 1
        
accuracy = correct / test.shape[0]
accuracy

0.9177732379979571

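Factoring in the word length doesn't seem to have affected the accuracy of the model at all: the result is identical to the previous model's. Presumably these authors all tend to use words of a similar length. Alternatively, I made a mistake in calculating the frequency distribution of the word lengths, and I'm not sure how to check whether that distribution came out correctly. A score that matches to the last digit does suggest that the length term never reaches the prediction: `pred` is read from `testProbailitiesByAuthor`, which holds only the unigram probabilities, while `totalProbailitiesByAuthor` is computed but never used.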
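One way to sanity-check the word-length distributions is to look at them directly: each author's distribution should sum to 1, and a distance close to 0 between two authors means the feature cannot tell them apart. The sketch below uses total variation distance (half the sum of absolute differences) as that distance; the function names and the toy token lists are assumptions for illustration.

```python
from collections import Counter

def length_distribution(tokens):
    """Relative frequency of each token length."""
    counts = Counter(len(tok) for tok in tokens)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def total_variation(p, q):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Made-up tokens for two authors.
dist_a = length_distribution("a bb bb ccc".split())
dist_b = length_distribution("a a bb ccc".split())

print(sum(dist_a.values()), sum(dist_b.values()))
print(total_variation(dist_a, dist_b))
```

On the real data, the same comparison could be run on the per-author entries of `LenFreqByAuthor`.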

Because, for whatever reason, adding in the lengths had no effect on the model, I will use the previous model for the submission.

# Exporting Submission

In [78]:
test_df = pd.read_csv("test.csv")
predictions = pd.DataFrame(columns = ['EAP', 'HPL', 'MWS', 'id'])
for index, row in test_df.iterrows():
    testSentence = row['text']
    # and then lowercase & tokenize our test sentence
    preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())
    preProcessedBigrams = nltk.bigrams(preProcessedTestSentence)

    # create an empty dataframe to put our output in
    testProbailities = pd.DataFrame(columns = ['author','word','probability'])
    testBiProbailities = pd.DataFrame(columns = ['author', 'bigram', 'probability'])

    # For each author...
    for i in wordFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            # find out how frequently the author used that word
            wordFreq = wordFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedWordFreq = wordFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testProbailities = testProbailities.append(output, ignore_index = True)

    # For each author...
    for i in biFreqByAuthor.keys():
        # for each bigram in our test sentence...
        for j  in preProcessedBigrams:
            # find out how frequently the author used that bigram
            biFreq = biFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedBiFreq = biFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            biOutput = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testBiProbailities = testBiProbailities.append(biOutput, ignore_index = True)

    # empty dataframe for the probability that each author wrote the sentence
    testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])
    testBiProbailitiesByAuthor = pd.DataFrame(columns = ['author', 'jointProbability'])
    
    
    # now let's group the dataframe with our frequency by author
    for i in wordFreqByAuthor.keys():
        # get the joint probability that each author wrote each word
        oneAuthor = testProbailities.query('author == "' + i + '"')
        jointProbability = oneAuthor.product(numeric_only = True)[0]
        # and add that to our dataframe
        output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)
        
    print(testProbailitiesByAuthor)
    # repeat the joint-probability calculation, this time for the bigrams
    for i in biFreqByAuthor.keys():
        # select the rows for this author
        oneAuthorBigram = testBiProbailities.query('author ==" ' + i + '"')
        # multiply the smoothed bigram frequencies together
        jointProbability = oneAuthorBigram.product(numeric_only = True)[0]
        biOutput = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testBiProbailitiesByAuthor = testBiProbailitiesByAuthor.append(biOutput, ignore_index = True)
        
    print(testBiProbailitiesByAuthor)
    # add the word table and the bigram table together, cell by cell
    totalProbailitiesByAuthor = pd.DataFrame(testProbailitiesByAuthor.values + testBiProbailitiesByAuthor.values,
                                            columns = testProbailitiesByAuthor.columns,
                                            index = testProbailitiesByAuthor.index)
    print(totalProbailitiesByAuthor)
    totalProbailitiesByAuthor['id'] = row['id']
    
    
    prob_indexed = totalProbailitiesByAuthor.pivot(index = 'id', columns='author', values='jointProbability')
    print(prob_indexed)
    predictions = predictions.append(prob_indexed.loc[row['id']], ignore_index = True)
    
predictions['id'] = test_df['id']
predictions.to_csv("submission_word_freq.csv", index=False)

  author  jointProbability
0    HPL      3.699657e-67
1    EAP      4.663878e-66
2    MWS      2.962311e-62
  author jointProbability
0    HPL              NaN
1    EAP              NaN
2    MWS              NaN
   author jointProbability
0  HPLHPL              NaN
1  EAPEAP              NaN
2  MWSMWS              NaN
author  EAPEAP HPLHPL MWSMWS
id                          
id02310    NaN    NaN    NaN
  author  jointProbability
0    HPL     8.623553e-203
1    EAP     2.561907e-195
2    MWS     5.857305e-205
  author jointProbability
0    HPL              NaN
1    EAP              NaN
2    MWS              NaN
   author jointProbability
0  HPLHPL              NaN
1  EAPEAP              NaN
2  MWSMWS              NaN
author  EAPEAP HPLHPL MWSMWS
id                          
id24541    NaN    NaN    NaN
  author  jointProbability
0    HPL     1.828107e-110
1    EAP     3.879998e-113
2    MWS     4.388748e-118
  author jointProbability
0    HPL              NaN
1    EAP              NaN


KeyboardInterrupt: 

Printing the intermediate dataframes for the bigram model (shown above for the first few test sentences) revealed a problem with how the results were being combined. In hindsight, it makes sense that the authors ended up with their names duplicated, given the way I was joining the dataframes. I'm not sure why the bigram probabilities are coming out as NaN.
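
One way to probe both symptoms is on a toy table. The sketch below uses made-up probabilities, and `probs`, `word` and `bigram` are stand-ins rather than the notebook's variables. It suggests two things: the bigram query string has a space after its opening quote, so it looks for `" HPL"` and selects no rows, and adding the `.values` of two whole dataframes also adds the `author` strings together. An empty selection has no meaningful product, which may be where the NaN values come from, though that has not been confirmed against the full data.

```python
import pandas as pd

# Toy stand-in for the per-bigram table (values are made up for illustration)
probs = pd.DataFrame(
    [["HPL", 0.5], ["HPL", 0.1], ["EAP", 0.2]],
    columns=["author", "probability"],
)

author = "HPL"
# Query string as written in the bigram loop: note the space after the opening quote
with_space = probs.query('author ==" ' + author + '"')
# Query string as written in the word loop
without_space = probs.query('author == "' + author + '"')

print(len(with_space), "rows matched with the stray space")
print(len(without_space), "rows matched without it")

# Toy stand-ins for the two per-author tables
word = pd.DataFrame({"author": ["HPL", "EAP"], "jointProbability": [0.25, 0.5]})
bigram = pd.DataFrame({"author": ["HPL", "EAP"], "jointProbability": [0.125, 0.25]})

# Adding every cell, as in the cell above, also concatenates the author strings
summed_all = pd.DataFrame(word.values + bigram.values, columns=word.columns)
print(summed_all)

# Adding only the numeric column leaves the author names alone
combined = word.copy()
combined["jointProbability"] = word["jointProbability"] + bigram["jointProbability"]
print(combined)
```

Adding only the `jointProbability` columns keeps the author names intact. Whether the word and bigram scores should be added at all, or combined some other way, is a separate modeling question.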

I'm not sure how this model ended up with a calculated accuracy of 91% given output like this, so I suspect that my calculation was off somewhere. Below, I export the results from the original model (word frequency only) and will submit them to the contest to see the score.

In [85]:
test_df = pd.read_csv("test.csv")
predictions = pd.DataFrame(columns = ['EAP', 'HPL', 'MWS', 'id'])
for index, row in test_df.iterrows():
    testSentence = row['text']
    # and then lowercase & tokenize our test sentence
    preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())

    # create an empty dataframe to put our output in
    testProbailities = pd.DataFrame(columns = ['author','word','probability'])

    # For each author...
    for i in wordFreqByAuthor.keys():
        # for each word in our test sentence...
        for j  in preProcessedTestSentence:
            # find out how frequently the author used that word
            wordFreq = wordFreqByAuthor[i].freq(j)
            # and add a very small amount to every prob. so none of them are 0
            smoothedWordFreq = wordFreq + 0.000001
            # add the author, word and smoothed freq. to our dataframe
            output = pd.DataFrame([[i, j, smoothedWordFreq]], columns = ['author','word','probability'])
            testProbailities = testProbailities.append(output, ignore_index = True)

    # empty dataframe for the probability that each author wrote the sentence
    testProbailitiesByAuthor = pd.DataFrame(columns = ['author','jointProbability'])    
    
    # now let's group the dataframe with our frequency by author
    for i in wordFreqByAuthor.keys():
        # get the joint probability that each author wrote each word
        oneAuthor = testProbailities.query('author == "' + i + '"')
        jointProbability = oneAuthor.product(numeric_only = True)[0]
        # and add that to our dataframe
        output = pd.DataFrame([[i, jointProbability]], columns = ['author','jointProbability'])
        testProbailitiesByAuthor = testProbailitiesByAuthor.append(output, ignore_index = True)
    
    testProbailitiesByAuthor['id'] = row['id']
    prob_indexed = testProbailitiesByAuthor.pivot(index = 'id', columns='author', values='jointProbability')
    predictions = predictions.append(prob_indexed.loc[row['id']], ignore_index = True)
    
predictions['id'] = test_df['id']
predictions.to_csv("submission_word_freq.csv", index=False)

This model scored 1.09697 in the contest, which is unsurprising considering how basic it is. Unfortunately, I'm running out of time (and awakeness) to keep working out the issues with the bigram and word length models.

The Frequency Distribution was an interesting concept, but trying to work with more than one at a time was more difficult than I expected it to be.
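
For context on the 1.09697 score: if the contest metric is multi-class log loss (an assumption here), a uniform guess of 1/3 per author scores ln 3 ≈ 1.0986, so the result is only slightly better than guessing. One possible factor is that the submitted values are raw products of many small numbers (as low as 1e-205 in the output above) rather than probabilities that sum to 1 across the three authors. The following is a minimal sketch, not what was submitted, of summing log probabilities and normalizing per sentence. The function name and the toy inputs are hypothetical.

```python
import math

def normalized_author_probabilities(word_probs_by_author):
    """Turn per-word smoothed probabilities into per-author probabilities that sum to 1.

    word_probs_by_author: dict mapping an author label to a list of
    smoothed (non-zero) per-word probabilities for one sentence.
    """
    # Sum logs instead of multiplying, so long sentences do not underflow to 0.0
    log_scores = {
        author: sum(math.log(p) for p in probs)
        for author, probs in word_probs_by_author.items()
    }
    # Subtract the largest log score before exponentiating to stay in float range
    top = max(log_scores.values())
    unnormalized = {author: math.exp(s - top) for author, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {author: value / total for author, value in unnormalized.items()}

# Toy input: 400 words per author, each with a tiny smoothed probability.
# Multiplying these directly would underflow to 0.0 for every author.
toy = {
    "EAP": [1e-6] * 400,
    "HPL": [1e-6] * 399 + [2e-6],
    "MWS": [1e-6] * 400,
}
result = normalized_author_probabilities(toy)
print(result)
```

In the toy input, HPL differs from the other two authors by a single word that is twice as probable, so it should come out twice as likely as each of them after normalizing.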