In [1]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h1>
<center>
Module 4 - Gothic author identification
</center>
</h1>
<div class=h1_cell>
<p>
This week we are going to take on the task of identifying authors of gothic novels. Our authors to choose from are these three:
<ol>
<li>EAP - Edgar Allan Poe (https://en.wikipedia.org/wiki/Edgar_Allan_Poe): American writer whose poetry and short stories revolve around tales of mystery, the grisly and the grim. Arguably his most famous work is the poem "The Raven", and he is widely considered a pioneer of detective fiction.</li>
<p>
<li>HPL - H. P. Lovecraft (https://en.wikipedia.org/wiki/H._P._Lovecraft): Best known for his works of horror fiction. The stories he is most celebrated for revolve around the fictional mythology of the infamous creature "Cthulhu", a chimera with an octopus-like head and a humanoid body with wings on the back.</li>
<p>
<li>MWS - Mary Shelley (https://en.wikipedia.org/wiki/Mary_Shelley): Pursued a whole panoply of literary work as a novelist, dramatist, travel writer and biographer. She is most celebrated for the classic tale of Frankenstein, where the scientist Frankenstein, a.k.a. "The Modern Prometheus", creates the Monster that has come to be associated with his name.</li>
</ol>
<p>
What we have is a table of sentences from their books. Each sentence is labeled with the author who wrote it. Given a new sentence, our job is to predict which author wrote it.
<p>
The sentences are all jumbled up, i.e., we do not have paragraph or chapter level info.
<p>
<h2>Why is this interesting?</h2>
<p>
One application of this style of analysis is in literary studies. An ancient book is found but the author is unknown. Or perhaps the author is known but there is a suspicion that someone else ghost wrote it. Or even looking at plagiarism: some portions of a book by author X look like they were lifted from author Y.
<p>
Let's bring in the table and look at it.
</div>

In [2]:
import pandas as pd

gothic_table = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQqRwyE0ceZREKqhuaOw8uQguTG6Alr5kocggvAnczrWaimXE8ncR--GC0o_PyVDlb-R6Z60v-XaWm9/pub?output=csv',
                          encoding='utf-8')

In [3]:
gothic_table.head()


Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [4]:
len(gothic_table)

19579

<h2>
Let's devise a plan
</h2>
<div class=h1_cell>
<p>
This looks similar to our tweet problem in prior weeks. We are given some text. The text has a label. Our goal is to build a model that will predict the label using the content of the text.
<p>
<ul>
<li>Instead of using a bag of hashtags, let's use a bag of words.
<p>
<li>Naive Bayes worked well for us in the tweet problem, so let's try it again here.
<p>
<li>I want to do a bit of wrangling of the text in a sentence, more than we did for the tweets.
</ul>
<p>
I'll tackle the wrangling first.
</div>

<h2>
Wrangling a sentence into words
</h2>
<div class=h1_cell>
<p>
Once we start dealing with English, the complexity goes up a notch. One problem is that we will find words with apostrophes. For example, contractions like "I'll go", "it's easy", "won't quit". Or possessives like "John's game", "Tess' party".
<p>
A second problem is that we typically want to remove words that are so common they are useless in differentiating.
Let's start with this second problem. We will start to use the nltk package this week. nltk is like pandas in that it has lots of functions for doing a wide array of NLP tasks. For now, I know that nltk has a built-in list of words that are very common. They are called "stop words" (https://en.wikipedia.org/wiki/Stop_words). The general idea is that we want to delete these words from a sentence before doing any analysis.
<p>
Here they are.
</div>

In [5]:
from nltk.corpus import stopwords  # see more at http://xpo6.com/list-of-english-stop-words/

In [6]:
#import nltk
#nltk.download('stopwords')

swords = stopwords.words('english')
swords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

<div class=h1_cell>
<p>
If you scroll through them, you will see the contractions. But with a big caveat: you will see pieces of some contractions rather than the full contraction. For instance, you see "ll", I assume from "I'll", and "doesn", I assume from "doesn't". What does this mean? It means that some other wrangling tool must be applied before we start looking for stop words. That tool is called a word tokenizer. nltk also has a sentence tokenizer but we don't need that: some nice person already broke the books into sentences for us. The word tokenizer takes a sentence as input and produces a list of words. nltk has several word tokenizers built in. Let's look at 2 of them below.
<p>
</div>

In [7]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()          #instantiate class

from nltk.tokenize import TreebankWordTokenizer
treeb_tokenizer = TreebankWordTokenizer()            #instantiate class

<div class=h1_cell>
<p>
I am going to have a bake-off between the 2 tokenizers. A tokenizer produces a list of words, and what I want is to remove the stop words from that list. I'll try a test sentence out with each tokenizer and get its list of words. I'll then run through the stop words to see how many I would remove.
</div>

In [8]:
#First up: the punctuation tokenizer

test_sentence = "I'll say it's 6 o'clock!"

word_tokes = word_punct_tokenizer.tokenize(test_sentence)
for item in word_tokes:
    print(item)

I
'
ll
say
it
'
s
6
o
'
clock
!


<div class=h1_cell>
<p>
You can see it treats an apostrophe as a separate "word". So "I'll" becomes three words: "I", "'", "ll". This feels like what we want to match against stop words. Let's do that now and see how many we match.
<p>
</div>
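For intuition, the splitting we just saw can be approximated with a single regular expression. This is only a sketch: it assumes the pattern `\w+|[^\w\s]+` (a run of word characters, or a run of punctuation), which is my understanding of how the punct tokenizer is defined, and `punct_tokenize` is a name I made up for illustration.

```python
import re

def punct_tokenize(sentence):
    # one token per run of word characters, or per run of non-space punctuation
    return re.findall(r"\w+|[^\w\s]+", sentence)

tokens = punct_tokenize("I'll say it's 6 o'clock!")
print(tokens)
```

On the test sentence this gives the same 12 tokens as the cell above, apostrophes included.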

In [9]:
#How many matches in stop words?

for word in swords:
    c = word_tokes.count(word)
    if c > 0:
        print(word)

it
s
ll
o


<div class=h1_cell>
<p>
I like the first 3. Would have preferred "oclock" instead of "o", "'", "clock". And we still have the apostrophes in the list - 3 of them. We can deal with them later.
<p>
Next batter up.
<p>
</div>

In [10]:
#http://www.nltk.org/_modules/nltk/tokenize/treebank.html
word_tokes = treeb_tokenizer.tokenize(test_sentence)
for item in word_tokes:
    print(item)

I
'll
say
it
's
6
o'clock
!


<div class=h1_cell>
<p>
Not looking good. I predict we will match neither "'ll" nor "'s". However, I do like that o'clock stays together.
<p>
</div>

In [11]:
for word in swords:
    c = word_tokes.count(word)
    if c > 0:
        print(word)

it


<div class=h1_cell>
<p>
Only one word removed. The winner, at least in terms of stop word matching, is the punct tokenizer. So I'll use that.
<p>
As an aside, there is nothing magical about tokenizers. You can see their code with a little digging. They mostly are made up of a bunch of re pattern matches. Nothing is stopping you from extending a tokenizer to your own taste. For instance, it would not be hard to change one-word contractions into their full two-word equivalent using the re sub method.
<p>
Aside part 2: you can check out how various nltk tokenizers do on sentences you type in here: 
http://textanalysisonline.com/nltk-word-tokenize
</div>
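Here is a rough sketch of that contraction idea using `re.sub`. The `CONTRACTIONS` table and the function name `expand_contractions` are my own inventions for illustration, the table is deliberately tiny, and it only handles lowercase forms.

```python
import re

# Hypothetical, hand-picked rules; a real list would be much longer.
CONTRACTIONS = [
    (r"\bwon't\b", "will not"),   # irregular form, so it goes before the n't rule
    (r"n't\b", " not"),           # doesn't -> does not
    (r"'ll\b", " will"),          # I'll -> I will
]

def expand_contractions(sentence):
    # apply each rule in order; earlier rules take priority
    for pattern, replacement in CONTRACTIONS:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

expanded = expand_contractions("I'll go but she won't and he doesn't")
print(expanded)
```

Note that I left out "'s" on purpose: "it's" can mean "it is" or "it has", and "John's" is a possessive, so a simple substitution cannot tell them apart.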

<h2>
Challenge 1
</h2>
<div class=h1_cell>
<p>
I'd like you to work on a function `sentence_wrangler`. It will take a raw sentence from a row and tokenize it. It will then remove the following from that word list:
<p>
<ul>
<li>The stop words we have been using.
<p>
<li>Words that contain any punctuation (see string package).
<p>
</ul>
<p>
Have it return 2 lists for debugging: the list of wrangled words and the list of removed words.
</div>

In [12]:
import string
punctuation = string.punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
def sentence_wrangler(sentence, swords, punctuation):
    
    ##
    # punctuation = string
    # swords = a list
    ##
    
    ## Function Dependencies
    from nltk.tokenize import WordPunctTokenizer
    
    # tokenize sentence
    word_punct_tokenizer = WordPunctTokenizer()    
    word_tokes = word_punct_tokenizer.tokenize(sentence.lower())

    
    # remove punctuation from text
    remove_punctuation = str.maketrans('', '', punctuation)
    
    word_tokes_remove_punct = [ x.translate(remove_punctuation) for x in word_tokes ]
    
    removed = [ x for x in word_tokes_remove_punct if x in swords and x != '' ]
    wrangled = [ x for x in word_tokes_remove_punct if x not in swords and x != '' ]
    
    return (wrangled , removed)

In [14]:
test_sentence = "I'll say it's 6 o'clock!"
punctuation = string.punctuation

sentence_wrangler(test_sentence, swords, punctuation)

(['say', '6', 'clock'], ['i', 'll', 'it', 's', 'o'])

<div class=h1_cell>
<p>
Ok, let's try it on the first 10 sentences in the table. I'll print out the raw sentence and then the words I remove.
<p>
If you are matching my results, move on to challenge 2.
</div>

In [15]:
for i in range(10):
    text = gothic_table.loc[i, 'text']
    print(text+'\n')
#    print(' '.join(sentence_wrangler(text, swords, punctuation)[1]).encode('ascii')) # prints b' for beginning of line
    print(' '.join(sentence_wrangler(text, swords, punctuation)[1]))
    print('='*10)

This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.

this me no of the of my as i its and to the i out being of the so the
It never once occurred to me that the fumbling might be a mere mistake.

it once to me that the be a
In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.

in his was a from which as he down the all of he with an of the
How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.

how is as we from on the by and all as in and
Finding nothing else, not even gold, the Superintendent abandoned his attemp

<h1>
Challenge 2

</h1>
<div class=h1_cell>
Fill out `all_words` below to produce the bag of words. Use your sentence_wrangler.
<p>
Remember that we now have 3 predicted values. So you will need to follow each word with a list of 3 numbers. Make the first number in the list a count of EAP, the second number a count of HPL and the third number a count of MWS (the three authors we want to classify the text as belonging to).
</div>

In [16]:
def all_words(table, author_list, stop_words, punctuation):
    
    ##
    # table = dataframe of text
    # stop_words = standard nltk stopwords (to avoid)
    # punctuation = standard punctuation to avoid
    # author_list = list of author classes
    # uses function 'sentence_wrangler'
    ##
    
    import numpy as np # to create boolean vector for incrementing
    
    # initialize empty dictionary of words
    all_words = {}
    
    for i in range(len(table)):
        # get table text + label/author (read from the table argument, not the global)
        text = table.loc[i, 'text']
        author = table.loc[i, 'author']
   
        # wrangle it 
        word_entries = sentence_wrangler(text, stop_words, punctuation)[0]
    
        # create increment vector of '0's and '1' identifying the author
        increment = np.array([author] * len(author_list)) == np.array(author_list)
        increment = list(increment.astype(int))
        # can probably make this more efficient...

        # add to dictionary / increment count entries
        for x in word_entries:
            if x not in all_words:
                all_words[x] = [x.item() for x in increment] # .item() changes numpy int64 to int
            else:
                all_words[x] = [(x + y).item() for x, y in zip(all_words[x], increment)]  

    return all_words

In [17]:
authors = ["EAP","HPL","MWS"] # I added this as an argument to allow an easy generalization to n-many authors
bag_of_words = all_words(gothic_table, authors, swords, punctuation)
len(bag_of_words)  #unique words

24944
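The same counting can be sketched without numpy by looking up the author's position in the list and bumping that slot. This is an alternative sketch, not the function above: `count_words` and the toy rows are made up for illustration, and the words are assumed to be already wrangled.

```python
def count_words(rows, author_list):
    # rows: iterable of (list_of_words, author) pairs, already wrangled
    bag = {}
    for words, author in rows:
        slot = author_list.index(author)  # position of this author's count
        for word in words:
            counts = bag.setdefault(word, [0] * len(author_list))
            counts[slot] += 1
    return bag

toy_rows = [
    (["raven", "midnight"], "EAP"),
    (["cthulhu", "midnight"], "HPL"),
    (["raven"], "EAP"),
]
toy_bag = count_words(toy_rows, ["EAP", "HPL", "MWS"])
print(toy_bag)
```

Because the counts start out as plain Python ints, there is no need to convert from numpy types afterwards.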

In [18]:
for item in bag_of_words:
    print(type((bag_of_words[item])[0]))

<class 'int'>
<class 'int'>
<class 'int'>
...

<h2>
Do you match my length?
</h2>
<div class=h1_cell>
If not, your `sentence_wrangler` is not matching mine I suspect.
</div>

In [19]:
sorted(bag_of_words.items())[:5]

[('aaem', [1, 0, 0]),
 ('ab', [1, 0, 0]),
 ('aback', [2, 0, 0]),
 ('abaft', [0, 0, 1]),
 ('abandon', [7, 3, 1])]

<h2>
Do you match my content?
</h2>
<div class=h1_cell>
If not, you might have list ordering screwed up in `all_words`.
</div>

<h1>
Challenge 3
</h1>
<div class=h1_cell>
Let's take a look at words that are odd. Build a list of keys in the bag of words that contain at least one character that is not a letter. I am calling these odd words.
</div>

In [20]:
#build odd_words

def odd_words(bag):
    # keep only the keys with at least one character that str.isalpha() does not count as a letter
    return [key for key in bag if not key.isalpha()]

In [21]:
#len(odd_words(bag_of_words))

(odd_words(bag_of_words))

# 110

[]

In [22]:
odd_words

<function __main__.odd_words>

<div class=h1_cell>
These are words that slipped through our `sentence_wrangler`.
You can look the byte codes up, e.g., google for "\xe9". I suppose we could add further wrangling to `sentence_wrangler` at this point to clean up even more punctuation, but I am ready to move on.
</div>
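
<div class=h1_cell>
To make the byte-code lookup concrete, here is a small sketch that uses only the standard library. The sample words and the helper name `ascii_letters_only` are made up for illustration and are not part of the assignment. One assumption worth flagging: in Python 3, `str.isalpha()` is Unicode-aware, so a word containing "\xe9" still counts as all letters, and a stricter ASCII-only test is needed if you want to flag it. That may be one reason your count of odd words differs from mine.
</div>

```python
import string
import unicodedata

# "\xe9" is the code point for an accented e
print(repr('\xe9'), unicodedata.name('\xe9'))

# Hypothetical sample keys, not taken from the real bag_of_words
sample_words = ['abaft', 'caf\xe9', "don't", 'x2']

# Python 3's isalpha() accepts accented letters, so 'caf\xe9' is not flagged here
unicode_odd = [w for w in sample_words if not w.isalpha()]

def ascii_letters_only(word):
    """True only if the word is non-empty and every character is a plain a-z or A-Z letter."""
    return len(word) > 0 and all(ch in string.ascii_letters for ch in word)

# The stricter test also flags words with accented characters
ascii_odd = [w for w in sample_words if not ascii_letters_only(w)]

print(unicode_odd)
print(ascii_odd)
```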

<h2>
Challenge 4
</h2>
<div class=h1_cell>
<p>
Get ready for Naive Bayes. What are we missing? We have `bag_of_words`, which gives us the triple of counts we need for each word. We are still missing what we need for `P(O)`: the total count of sentences for each author. Build that now in `total_count`.
</div>

In [23]:
author_list = ["EAP", "HPL", "MWS"]

# Count the sentences written by each author, in author_list order (EAP, HPL, MWS)
author_counts = gothic_table['author'].value_counts()
total_count = [int(author_counts.get(author, 0)) for author in author_list]

# Background probability P(O) = author sentence count / total sentence count
total_sentences = sum(total_count)
background_prob = tuple(count / total_sentences for count in total_count)

total_count_dict = {
    'total_count': total_count,
    'background_prob': background_prob,
}

total_count_dict

# Could roll this into a function like the previous ones;
# not sure if it's worth it at this point.



{'background_prob': (0.40349353899586293,
  0.2878083661065427,
  0.30869809489759437),
 'total_count': [7900, 5635, 6044]}

<h2>
Challenge 5
</h2>
<div class=h1_cell>
<p>
Ok, let's get to it. Define `naive_bayes_gothic`. Fill in my function below and match my results. As with last week, I expect your function to return 3 probabilities, one each for EAP, HPL, MWS.
</div>

In [24]:
from functools import reduce  # used to compute a product over a list
import operator

def naive_bayes_gothic(raw_sentence, bag, counts_dict, author_list):

    ##
    # Modified from module_3 naive_bayes method
    # Returns a tuple of (unnormalized) probabilities, one per class
    # Dependencies:
    # sentence_wrangler
    # swords (stop words) and punctuation, read as globals
    ##

    ### Useful for indexing loops
    number_of_classes = len(author_list)

    ### Wrangle/tokenize raw_sentence into a list of words
    # sentence_wrangler outputs (filtered_text, removed_text)
    wrangled_text = sentence_wrangler(raw_sentence, swords, punctuation)[0]

    ### Background class probabilities are pre-computed in counts_dict['background_prob']

    ### Compute P(word = evidence | outcome = class)
    # Start with a neutral tuple of 1s; each later entry is a tuple with one value per class
    likelihoods = ((1,) * number_of_classes,)

    for word in wrangled_text:  # loop over words
        if word not in bag:  # direct dictionary lookup, no need to build a list of keys
            continue  # for the moment don't use the word as evidence

        # P(Ei | O) for this word Ei and every class O
        new_probabilities = tuple(bag[word][i] / counts_dict['total_count'][i]
                                  for i in range(number_of_classes))
        likelihoods = likelihoods + (new_probabilities,)

    ### Compute class probabilities given evidence: P(O | E1 ... Ek)
    # We ignore the P(E1 ... Ek) normalization factor since it is the same for every class
    # and so doesn't influence the comparison
    probabilities = ()
    for i in range(number_of_classes):  # multiply all corresponding probabilities for each class
        probabilities = probabilities + (counts_dict['background_prob'][i] * reduce(operator.mul, [item[i] for item in likelihoods], 1),)

    return probabilities





In [25]:
####
#### NB using BERNOULLI
####

from functools import reduce  # used to compute a product over a list
import operator

def naive_bayes_bernoulli_gothic(raw_sentence, bag, counts_dict, author_list):

    ### Useful for indexing loops
    number_of_classes = len(author_list)

    ### Wrangle/tokenize raw_sentence into a list of words
    # sentence_wrangler outputs (filtered_text, removed_text)
    wrangled_text = sentence_wrangler(raw_sentence, swords, punctuation)[0]

    ### Background class probabilities (pre-computed)
    background_prob = counts_dict['background_prob']

    ### Compute P(word = evidence | outcome = class) for every word in the bag
    # Start with a neutral tuple of 1s; each later entry is a tuple with one value per class
    likelihoods = ((1,) * number_of_classes,)

    for word in bag:  # loop over the whole vocabulary, not just the sentence
        word_present = word in wrangled_text
        new_probabilities = ()

        for i in range(number_of_classes):  # iterate over the classes
            p = bag[word][i] / counts_dict['total_count'][i]
            # use P(word | class) if the word is present, otherwise 1 - P(word | class)
            new_probabilities = new_probabilities + ((p if word_present else 1 - p),)

        likelihoods = likelihoods + (new_probabilities,)

    ### Compute class probabilities given evidence: P(O | E1 ... Ek)
    # We ignore the P(E1 ... Ek) normalization factor since it is the same for every class
    # and so doesn't influence the comparison
    probabilities = ()
    for i in range(number_of_classes):  # multiply all corresponding probabilities for each class
        probabilities = probabilities + (background_prob[i] * reduce(operator.mul, [item[i] for item in likelihoods], 1),)

    return probabilities



In [26]:
### Test Cell

import time

start_time = time.time()

for i in range(5):
    print(naive_bayes_gothic(gothic_table.loc[i, 'text'], bag_of_words, total_count_dict, author_list))
    print(gothic_table.loc[i, 'author'])

finish_time = time.time()

print("took: " + str(finish_time - start_time) + " seconds")


print('–'*5 + " versus " + '–'*5)


start_time = time.time()

for i in range(5):
    print(naive_bayes_bernoulli_gothic(gothic_table.loc[i, 'text'], bag_of_words, total_count_dict, author_list))
    print(gothic_table.loc[i, 'author'])

finish_time = time.time()


print("took: " + str(finish_time - start_time) + " seconds")



(1.1091900736782457e-50, 0.0, 0.0)
EAP
(0.0, 2.757386334036038e-15, 0.0)
HPL
(4.709053372343932e-49, 0.0, 0.0)
EAP
(0.0, 0.0, 6.192200285854468e-53)
MWS
(0.0, 3.1095121234157133e-44, 0.0)
HPL
took: 0.042639970779418945 seconds
––––– versus –––––
(6.642284615797462e-56, 0.0, 0.0)
EAP
(0.0, 1.6013824753359453e-21, 0.0)
HPL
(2.2835639300107067e-51, 0.0, 0.0)
EAP
(0.0, 0.0, 1.1755598347741017e-56)
MWS
(0.0, 1.8263282459906969e-50, 0.0)
HPL
took: 7.896744966506958 seconds


In [27]:
author_list = ["EAP", "HPL", "MWS"]

for i in range(5):
    print(naive_bayes_gothic(gothic_table.loc[i, 'text'], bag_of_words, total_count_dict, author_list))
    print(gothic_table.loc[i, 'author'])


(1.1091900736782457e-50, 0.0, 0.0)
EAP
(0.0, 2.757386334036038e-15, 0.0)
HPL
(4.709053372343932e-49, 0.0, 0.0)
EAP
(0.0, 0.0, 6.192200285854468e-53)
MWS
(0.0, 3.1095121234157133e-44, 0.0)
HPL



<div class=h1_cell>
Not bad. Five out of five correct. Notice all those zeros. We are doing well because each sentence contains at least one word that appears only in the books of a specific author. For example, I can see under challenge 2 that `abaft` only appears in MWS. Hence, P(abaft|EAP) and P(abaft|HPL) will both be 0. That will zero out the numerator for those two authors (no matter how many other words do match) and return 0 as the result.
</div>
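
<div class=h1_cell>
The zeroing-out effect is easy to reproduce on its own. The sketch below reuses the `abaft` and `abandon` counts and the per-author sentence totals printed earlier in this notebook. The names `toy_bag` and `likelihood_product` are made up for illustration, and the priors are left out to keep it short.
</div>

```python
from functools import reduce
import operator

# Counts copied from the outputs shown earlier in this notebook
total_count = [7900, 5635, 6044]  # sentences per author: EAP, HPL, MWS
toy_bag = {'abaft': [0, 0, 1], 'abandon': [7, 3, 1]}

def likelihood_product(words, bag, totals):
    """Multiply P(word | author) over the given words, one product per author."""
    products = []
    for i in range(len(totals)):
        factors = [bag[w][i] / totals[i] for w in words]
        products.append(reduce(operator.mul, factors, 1))
    return products

with_abaft = likelihood_product(['abandon', 'abaft'], toy_bag, total_count)
without_abaft = likelihood_product(['abandon'], toy_bag, total_count)

print(with_abaft)     # the EAP and HPL entries are exactly 0.0
print(without_abaft)  # all three entries are positive
```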

<h2>
Challenge 6
</h2>
<div class=h1_cell>
<p>
Generate your predictions, get the actuals and zip them up. I ended up timing prediction generation, and it took only 13 seconds. Gotta love NB and the use of fast dictionary look-up in Python.
</div>

In [28]:
import time

In [29]:
start = time.time()
author_list = ["EAP", "HPL", "MWS"] # for reference

import numpy as np # use to fetch max index fast

predictions = []

for i,row in gothic_table.iterrows():
    if i%1000 == 0: print('did 1000')
    bayes_output = naive_bayes_gothic(row['text'], bag_of_words, total_count_dict, author_list)
    index_max = np.argmax(bayes_output)
#    predictions.append([index_max, author_list[index_max]]) # wasn't sure if you wanted author name or index
    predictions.append(author_list[index_max]) # wasn't sure if you wanted author name or index
    
end = time.time()
print(end - start)  # in seconds

# mine took ~97 seconds?!?!?!
# tuples are faster than lists, I'm guessing.. time to rewrite bayes - down to ~93
# using operator.mul down to ~89
# back to ~91

did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
did 1000
91.64962911605835


<div class=h1_cell>
<p>
Go ahead and build `zipped`.
<p>

</div>
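
<div class=h1_cell>
One thing to watch for in Python 3: `zip` returns a one-shot iterator, so it cannot be sliced and it is empty after the first loop over it. The sketch below shows the difference on a few made-up values standing in for the real predictions and actuals. Wrapping the result in `list` is one simple way around it.
</div>

```python
# Hypothetical stand-ins for the real predictions and actuals
predictions = ['EAP', 'HPL', 'MWS', 'EAP']
actuals = ['EAP', 'HPL', 'EAP', 'EAP']

lazy_pairs = zip(predictions, actuals)  # one-shot iterator in Python 3
first_pass = sum(1 for p, a in lazy_pairs if p == a)
second_pass = sum(1 for p, a in lazy_pairs if p == a)  # iterator is already exhausted

pairs = list(zip(predictions, actuals))  # a list can be sliced and reused
correct = sum(1 for p, a in pairs if p == a)

print(first_pass, second_pass)
print(pairs[:2], correct / len(pairs))
```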

In [30]:
#build zipped

actuals = gothic_table['author']  #easy peasy to pull a column out of a table

### using predictions as two-item lists (index, author)
# numerical_predictions = [x[0] for x in predictions]
# author_predictions = [x[1] for x in predictions]
# zipped = zip(author_predictions,actuals)

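# In Python 3, zip returns a one-shot iterator; wrap it in list() so it can be sliced and reused
zipped = list(zip(predictions, actuals))

# zipped[:20]  # slicing works once zipped is a list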

In [31]:
predictions[:20]

confusion_dictionary = {(1,1):0, (1,0):0, (0,1):0, (0,0):0}



In [32]:
correct = 0
for pair in zipped:
    if pair[0] == pair[1]: correct += 1

correct 

18305

In [33]:
1.0*correct/len(predictions)

0.9349302824454773

<h2>
Go Naive Bayes!
</h2>
<div class=h1_cell>
<p>
I'm claiming 94% accuracy. I like it.
</div>

<h2>
Multinomial versus Bernoulli
</h2>
<div class=h1_cell>
<p>
We are using Multinomial Naive Bayes because we are counting how many times a word occurs for an author. We could also use Bernoulli Naive Bayes where we look for features that are true or false, e.g., a sentence is greater than 10 words in length. This paper discusses the difference between the two: http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf.
</div>
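
<div class=h1_cell>
As a rough illustration of the difference, here is a sketch of the two likelihood calculations for a single author. The vocabulary, the probability values and the function names are all made up. The multinomial version multiplies one factor per word occurrence in the sentence. The Bernoulli version multiplies one factor per vocabulary word, using p when the word is present and 1 - p when it is absent, which is also why it has to touch every word in the bag.
</div>

```python
# Hypothetical three-word vocabulary with made-up P(word | author) values
word_prob = {'raven': 0.20, 'tomb': 0.10, 'sea': 0.05}
sentence = ['raven', 'raven', 'tomb']  # 'raven' occurs twice, 'sea' is absent

def multinomial_likelihood(words, probs):
    """One factor per word occurrence; vocabulary words absent from the sentence are ignored."""
    result = 1.0
    for w in words:
        result *= probs[w]
    return result

def bernoulli_likelihood(words, probs):
    """One factor per vocabulary word: p if the word is present, (1 - p) if it is absent."""
    present = set(words)
    result = 1.0
    for w, p in probs.items():
        result *= p if w in present else (1 - p)
    return result

print(multinomial_likelihood(sentence, word_prob))  # 0.2 * 0.2 * 0.1
print(bernoulli_likelihood(sentence, word_prob))    # 0.2 * 0.1 * (1 - 0.05)
```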

<h2>
Write your bag_of_words out to file
</h2>
<div class=h1_cell>
Here is the code I used to write my bag_of_words out to a file. I then read it back in just to make sure it survives the round trip. You should do the same. You will need this file for the midterm.
</div>

In [34]:
import json

with open('bag_of_words.txt', 'w') as file:
    file.write(json.dumps(bag_of_words))

In [35]:
with open('bag_of_words.txt') as file:  # making sure I can read it in again
    bag2 = json.load(file)

In [36]:
sorted(bag2.items())[:5]

[('aaem', [1, 0, 0]),
 ('ab', [1, 0, 0]),
 ('aback', [2, 0, 0]),
 ('abaft', [0, 0, 1]),
 ('abandon', [7, 3, 1])]

In [37]:
bag2 == bag_of_words

True