# Challenge 4.4.2: Supervised NLP  
# Kevin Hahn  

## Challenge 0:  

<b>Recall that the logistic regression model's best performance on the test set was 93%. See what you can do to improve performance. Suggested avenues of investigation include: Other modeling techniques (SVM?), making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc), and anything else your heart desires. Make sure to design your models on the test set, or use cross_validation with multiple folds, and see if you can get accuracy above 97%.</b>  

By engineering new features including counts of punctuation per sentence and word length per sentence, I was able to improve upon the logistic regression model's original performance. The punctuation count actually decreased the accuracy of the model so I excluded that feature. Using a different modeling technique, SVM Classifier, on a 80/20 train/test split, I was able to get the accuracy to 96.4%.


## Challenge 1:  
<b>Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work.  This will involve pulling a new book from the Project Gutenberg corpus (print(gutenberg.fileids()) for a list) and processing it.</b>  

So for this challenge I loaded up another Jane Austen text, this time Sense and Sensibility. I wrote a function to use Random Forest and Logistic Regression Classifier methods to determine how good either was at identifying sentences as being from Carroll's Alice in Wonderland.  

Adding the second Jane Austen text to the dataframe seemed to help increase the accuracy of the models, with training and test set scores ranging from 95.6% to 99.3% accuracy. Here are the outcomes of the RFC and Logit Classifier methods:

RANDOM FOREST CLASSIFIER  
Training set score: 0.993304816812  

Test set score: 0.955648535565

LOGIT REGRESSION CLASSIFIER  
Training set score: 0.976008926911  

Test set score: 0.960390516039

In [43]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

import nltk
from nltk.corpus import gutenberg, stopwords
import spacy

# nltk.download()

In [4]:
# nltk.download()

In [5]:
# Import the data we just downloaded and installed.
from nltk.corpus import gutenberg, stopwords

# Grab and process the raw data.
print(gutenberg.fileids())

persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')
emma = gutenberg.raw('austen-emma.txt')
sense = gutenberg.raw('austen-sense.txt')

# Print the first 100 characters of Alice in Wonderland.
print('\nRaw:\n', alice[0:100])

print('\nRaw:\n', emma[0:100])

print('\nRaw:\n', sense[0:100])

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Raw:
 [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was

Raw:
 [Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a

Raw:
 [Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settle


In [6]:
# This pattern matches all text between square brackets.
pattern = "[\[].*?[\]]"
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)
sense = re.sub(pattern, "", sense)

# Print the first 100 characters of Alice again.
print('Title removed:\n', alice[0:100])
print('')
print(sense[0:100])

Title removed:
 

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on



CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and th


In [7]:
# Now we'll match and remove chapter headings.
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
sense = re.sub(r'CHAPTER .*', '', sense)


# Ok, what's it look like now?
print('Chapter headings removed:\n', alice[0:100])
print(sense[0:100])

Chapter headings removed:
 



Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothin





The family of Dashwood had long been settled in Sussex.
Their estate was large, and their resid


In [8]:
# Remove newlines and other extra whitespace by splitting and rejoining.
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())
sense = ' '.join(sense.split())

# All done with cleanup? Let's see how it looks.
print('Extra whitespace removed:\n', alice[0:100])

Extra whitespace removed:
 Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to


In [9]:
# Here is a list of the stopwords identified by NLTK.
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [10]:
import spacy
nlp = spacy.load('en')

# All the processing work is done here, so it may take a while.
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)
sense_doc = nlp(sense)

In [11]:
# Let's explore the objects we've built.
print("The alice_doc object is a {} object.".format(type(alice_doc)))
print("It is {} tokens long".format(len(alice_doc)))
print("The first three tokens are '{}'".format(alice_doc[:3]))
print("The type of each token is {}".format(type(alice_doc[0])))

The alice_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 34420 tokens long
The first three tokens are 'Alice was beginning'
The type of each token is <class 'spacy.tokens.token.Token'>


In [12]:
print("The emma_doc object is a {} object.".format(type(sense_doc)))
print("It is {} tokens long".format(len(sense_doc)))
print("The first three tokens are '{}'".format(sense_doc[:3]))
print("The type of each token is {}".format(type(sense_doc[0])))

The emma_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 140800 tokens long
The first three tokens are 'The family of'
The type of each token is <class 'spacy.tokens.token.Token'>


In [13]:
from collections import Counter

# Utility function to calculate how frequently words appear in the text.
def word_frequencies(text, include_stop=True):
    
    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)
    
# The most frequent words:
alice_freq = word_frequencies(alice_doc).most_common(100)
persuasion_freq = word_frequencies(persuasion_doc).most_common(100)
sense_freq = word_frequencies(sense_doc).most_common(100)
print('Alice:', alice_freq)
print('')
print('Persuasion:', persuasion_freq)
print('')
print('Sense:', sense_freq)

Alice: [('the', 1524), ('and', 796), ('to', 724), ('a', 611), ('I', 534), ('it', 524), ('she', 508), ('of', 499), ('said', 453), ('Alice', 394), ('was', 363), ('in', 355), ('you', 343), ('that', 274), ('as', 245), ('her', 243), ("n't", 205), ('at', 202), ("'s", 190), ('on', 189), ('had', 184), ('with', 175), ('all', 173), ('be', 145), ('for', 139), ('but', 132), ('not', 130), ('they', 129), ('very', 126), ('little', 124), ('so', 122), ('do', 118), ('out', 116), ('this', 111), ('The', 102), ('he', 101), ('down', 99), ('is', 98), ('up', 98), ('about', 94), ('one', 94), ('his', 94), ('what', 93), ('were', 86), ('them', 86), ('like', 84), ('went', 83), ('herself', 83), ('know', 83), ('could', 82), ('would', 82), ('again', 80), ('if', 78), ('or', 75), ('thought', 74), ('did', 74), ('have', 73), ('Queen', 73), ('then', 71), ('no', 69), ('when', 69), ('time', 68), ('into', 67), ('And', 67), ('see', 66), ('there', 65), ('It', 63), ('off', 62), ('me', 61), ('King', 61), ('Turtle', 58), ('began'

In [14]:
# Use our optional keyword argument to remove stop words.
alice_freq = word_frequencies(alice_doc, include_stop=False).most_common(100)
persuasion_freq = word_frequencies(persuasion_doc, include_stop=False).most_common(100)
sense_freq = word_frequencies(sense_doc, include_stop=False).most_common(100)
print('Alice:', alice_freq)
print('')
print('Persuasion:', persuasion_freq)
print('')
print('Sense:', sense_freq)

Alice: [('said', 453), ('Alice', 394), ("n't", 205), ("'s", 190), ('little', 124), ('like', 84), ('went', 83), ('know', 83), ('thought', 74), ('Queen', 73), ('time', 68), ('King', 61), ('Turtle', 58), ('began', 57), ("'m", 55), ('Hatter', 55), ('Mock', 55), ('Gryphon', 55), ('way', 54), ("'ll", 53), ('head', 49), ('thing', 49), ('think', 47), ('voice', 46), ('looked', 45), ('got', 45), ('Rabbit', 42), ("'ve", 42), ('Duchess', 42), ('round', 41), ('came', 40), ('tone', 40), ('Dormouse', 40), ('great', 39), ("'re", 38), ('Oh', 34), ('March', 34), ('large', 33), ('looking', 32), ('moment', 31), ('long', 31), ('Hare', 31), ('things', 30), ('right', 30), ('heard', 30), ('Mouse', 30), ('found', 29), ('door', 29), ('replied', 29), ('day', 28), ('eyes', 28), ('dear', 28), ('look', 28), ('going', 27), ('tell', 27), ("'d", 27), ('good', 26), ('Caterpillar', 26), ('Cat', 26), ('come', 25), ('away', 25), ('poor', 25), ('course', 25), ('soon', 24), ('wo', 24), ('shall', 23), ('took', 23), ('felt', 

In [15]:
# Pull out just the text from our frequency lists.
alice_common = [pair[0] for pair in alice_freq]
persuasion_common = [pair[0] for pair in persuasion_freq]
sense_common = [pair[0] for pair in sense_freq]

# Use sets to find the unique values in each top ten.
print('Unique to Alice:', set(alice_common) - set(persuasion_common) - set(sense_common))
print(len(set(alice_common) - set(persuasion_common) - set(sense_common)))
print('')
print('Unique to Persuasion:', set(persuasion_common) - set(alice_common) - set(sense_common))
print(len(set(persuasion_common) - set(alice_common) - set(sense_common)))
print('')
print('Unique to Sense:', set(sense_common) - set(alice_common) - set(persuasion_common))
print(len(set(sense_common) - set(alice_common) - set(persuasion_common)))

Unique to Alice: {'White', 'asked', "'ve", 'tone', 'Duchess', 'went', 'Come', 'wo', 'door', 'spoke', 'minute', 'jury', 'Mouse', 'curious', 'find', 'head', 'Dormouse', 'March', 'sat', 'Caterpillar', 'feet', 'tea', "'re", 'round', 'Mock', 'getting', 'Rabbit', 'began', 'Alice', 'question', 'Queen', 'ran', 'King', 'Hare', 'wonder', "'m", 'Gryphon', 'use', 'took', 'Cat', "'d", 'tried', 'old', 'right', 'court', 'course', 'sort', 'looked', 'Turtle', 'table', 'words', 'voice', "n't", 'things', 'large', 'added', 'end', 'eat', 'got', 'eyes', 'hand', 'Hatter', "'ll"}
63

Unique to Persuasion: {'Henrietta', 'best', 'Mary', 'ought', 'Benwick', 'knew', 'Bath', 'Charles', 'Walter', 'character', 'Smith', 'friend', 'short', 'Uppercross', 'Elliot', 'Croft', 'gone', 'Elizabeth', 'Captain', 'Harville', 'leave', 'evening', 'seen', 'having', 'life', 'wife', 'Admiral', 'Anne', 'years', 'Musgrove', 'party', 'Mr', 'Louisa', 'father', 'Clay', 'Mrs', 'Wentworth', 'Kellynch', 'near', 'certainly', 'possible', 'Lym

In [16]:
# Utility function to calculate how frequently lemmas appear in the text.
def lemma_frequencies(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    # Build and return a Counter object containing word counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
alice_lemma_freq = lemma_frequencies(alice_doc, include_stop=False).most_common(100)
persuasion_lemma_freq = lemma_frequencies(persuasion_doc, include_stop=False).most_common(100)
sense_lemma_freq = lemma_frequencies(sense_doc, include_stop=False).most_common(100)
print('\nAlice:', alice_lemma_freq)
print('')
print('Persuasion:', persuasion_lemma_freq)
print('')
print('Sense:', sense_lemma_freq)
print('')
print('')

# Again, identify the lemmas common to one text but not the other.
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
sense_lemma_common = [pair[0] for pair in sense_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common) - set(sense_lemma_common))
print('')
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common) - set(sense_lemma_common))
print('')
print('Uniqe to Sense:', set(sense_lemma_common) - set(alice_lemma_common) - set(persuasion_lemma_common))


Alice: [('say', 476), ('alice', 396), ('be', 214), ('not', 200), ('think', 130), ('go', 130), ('little', 126), ('look', 106), ('know', 103), ('come', 97), ('like', 92), ('begin', 91), ('thing', 79), ('time', 77), ('queen', 74), ('will', 71), ('get', 67), ('king', 63), ('turtle', 60), ('head', 59), ("'s", 57), ('hatter', 57), ('find', 56), ('way', 55), ('mock', 55), ('gryphon', 55), ('cat', 50), ('rabbit', 49), ('voice', 49), ('hear', 48), ('mouse', 47), ('oh', 44), ('try', 44), ('good', 43), ('turn', 42), ('duchess', 42), ('tone', 42), ('large', 41), ('tell', 41), ('round', 41), ('have', 40), ('dormouse', 40), ('great', 39), ('speak', 38), ('feel', 37), ('sit', 36), ('hand', 36), ('eye', 35), ('take', 35), ('march', 35), ('reply', 34), ('ask', 33), ('long', 33), ('day', 32), ('dear', 32), ('right', 32), ('minute', 32), ('shall', 31), ('moment', 31), ('door', 31), ('grow', 31), ('hare', 31), ('white', 30), ('see', 30), ('talk', 30), ('foot', 29), ('word', 29), ('run', 28), ('poor', 27)

In [17]:
# Initial exploration of sentences.
sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(sentences)))

example_sentence = sentences[2]
print("Here is an example: \n{}\n".format(example_sentence))

Alice in Wonderland has 1163 sentences.
Here is an example: 
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!



In [18]:
# Look at some metrics around this sentence.
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

There are 29 words in this sentence, and 25 of them are unique.


In [19]:
print(nlp("I need a break")[3].pos_)
print(nlp("I need to break the glass")[3].pos_)

NOUN
VERB


In [20]:
# View the part of speech for some tokens in our sentence.
print('\nParts of speech:')
for token in example_sentence[:9]:
    print(token.orth_, token.pos_)
    
## There is acting as part of the verb, compound verb


Parts of speech:
There ADV
was VERB
nothing NOUN
so ADP
VERY PROPN
remarkable ADJ
in ADP
that DET
; PUNCT


In [21]:
# View the dependencies for some tokens.
print('\nDependencies:')
for token in example_sentence[:9]:
    print(token.orth_, token.dep_, token.head.orth_)


Dependencies:
There expl was
was ROOT was
nothing attr was
so advmod remarkable
VERY compound remarkable
remarkable amod nothing
in prep remarkable
that pobj in
; punct was


In [22]:
# Extract the first ten entities.
entities = list(alice_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

print('')

sense_entities = list(sense_doc.ents)[0:10]
for entity in sense_entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))


PERSON Alice
PERSON Alice
PERSON White Rabbit
ORG VERY
PERSON Alice
ORG VERY
PERSON Rabbit
PERSON Rabbit
EVENT WATCH
ORG POCKET

ORG Dashwood
GPE Sussex
GPE Norland Park
DATE many generations
DATE many years
DATE ten years
PERSON Henry Dashwood
GPE Norland
NORP Gentleman
PERSON Henry Dashwood


In [23]:
# All of the uniqe entities spaCy thinks are people.
people = [entity.text for entity in list(alice_doc.ents) if entity.label_ == "PERSON"]
print(set(people))

print('')

sense_people = [entity.text for entity in list(sense_doc.ents) if entity.label_ == 'PERSON']
print(set(sense_people))

{'Game', 'Conqueror', 'THIS', 'Begin', 'Majesty', 'Prizes', 'then!--Bill', 'Seaography', 'Lory', 'Soup', 'ALICE', 'Mine', 'Pigeon', 'VOICE', 'Mouse', 'Knave', 'Hjckrrh', "W. RABBIT'", 'Latitude', 'Mary Ann', 'Shy', 'Dormouse', 'Morcar', 'Duck', "Rabbit's--'Pat", 'Hush', 'Silence', 'Beautiful', 'Hand', 'Longitude', 'Mock', 'White Rabbit', 'Jack', 'Tis', 'Pat', 'Rabbit', 'Cheshire Puss', 'Alice', 'THAT', 'Mock Turtle', 'Bill', 'Soles', 'Lizard', 'Queen', 'Ahem', 'Pepper', 'Last', 'Hare', 'King', 'Tut', 'Gryphon', 'THESE', 'O Mouse', 'Cheshire Cat', 'Dinn', 'Treacle', 'WILLIAM', 'Cat', 'Footman', 'Twinkle', 'Dinah', 'Brandy', 'Edwin', 'Shakespeare', 'WHAT', 'Eaglet', 'ME', 'Drawling', 'Father William', 'Mabel', 'Edgar Atheling', 'Ma', 'ALL', 'ONE', 'Tortoise', 'Beau', 'Dodo', 'Tillie;', "Mary Ann!'", 'Idiot', 'Curiouser', 'William'}

{'John Middleton', 'Godby', 'Brandon', 'Sally', 'Ferrars--"very', "Lady Middleton's", 'her;--', 'Westons', 'Ferrars', 'Fanny', 'Absence', "Ma'am", 'Palmers',

In [24]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter

In [25]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
sense = re.sub(r'Chapter .*', '', sense)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)
sense = text_cleaner(sense)

In [26]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
sense_sents = [[sent, "Austen"] for sent in sense_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents + sense_sents)
sentences.head(10)

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !, I, shall, be, late, !, ')",Carroll
4,"((, when, she, thought, it, over, afterwards, ...",Carroll
5,"(In, another, moment, down, went, Alice, after...",Carroll
6,"(The, rabbit, -, hole, went, straight, on, lik...",Carroll
7,"(Either, the, well, was, very, deep, ,, or, sh...",Carroll
8,"(First, ,, she, tried, to, look, down, and, ma...",Carroll
9,"(She, took, down, a, jar, from, one, of, the, ...",Carroll


In [27]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 500 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)
sensewords = bag_of_words(sense_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords + sensewords)

In [28]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
Processing row 8000
Processing row 8500


Unnamed: 0,seventeen,surround,wretchedness,prodigious,knock,camden,widow,represent,wise,tis,...,fair,undoubtedly,strengthen,exclaim,string,execute,protection,representation,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !, I, shall, be, late, !, ')",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"((, when, she, thought, it, over, afterwards, ...",Carroll


In [29]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], 1))


X_train, X_test, y_train, y_test = train_test_split(X, 
                                                Y,
                                                test_size=0.4,
                                                random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.992188952948

Test set score: 0.953417015342


In [30]:
def test_ml_model(X, Y):
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
    train = rfc.fit(X_train, y_train)

    print('RANDOM FOREST CLASSIFIER')
    print('Training set score:', rfc.score(X_train, y_train))
    print('\nTest set score:', rfc.score(X_test, y_test))
    print('\n')
    
    lr = LogisticRegression()
    train = lr.fit(X_train, y_train)
    print(X_train.shape, y_train.shape)
    print('LOGIT REGRESSION CLASSIFIER')
    print('Training set score:', lr.score(X_train, y_train))
    print('\nTest set score:', lr.score(X_test, y_test))

In [31]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(5377, 3599) (5377,)
Training set score: 0.976938813465

Test set score: 0.960948396095


In [32]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.954621536173

Test set score: 0.951743375174


In [33]:
# Clean the Emma data.
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma)
print(emma[:100])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to


In [34]:
# Parse our cleaned data.
emma_doc = nlp(emma)

In [35]:
# Group into sentences.
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

# Emma is quite long, let's cut it down to the same length as Alice.
# emma_sents = emma_sents[0:len(alice_sents)]

In [36]:
# Build a new Bag of Words data frame for Emma word counts.
# We'll use the same common words from Alice and Persuasion.
emma_sentences = pd.DataFrame(emma_sents)
emma_bow = bow_features(emma_sentences, common_words)

print('done')

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
done


In [37]:
# Now we can model it!
# Let's use logistic regression again.

# Combine the Emma sentence data with the Alice data from the test set.
X_Emma_test = np.concatenate((
    X_train[y_train[y_train=='Carroll'].index],
    emma_bow.drop(['text_sentence','text_source'], 1)
), axis=0)
y_Emma_test = pd.concat([y_train[y_train=='Carroll'],
                         pd.Series(['Austen'] * emma_bow.shape[0])])

# Model.
print('\nTest set score:', lr.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = lr.predict(X_Emma_test)
pd.crosstab(y_Emma_test, lr_Emma_predicted)


Test set score: 0.920713694907


col_0,Austen,Carroll
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
Austen,7724,45
Carroll,626,68


# Challenge 0:  


In [40]:
def punct_count(sentence):
    count = 0 
    for token in sentence:
        if token.is_punct:
            count += 1
    return count

In [63]:
word_counts = bow_features(sentences, common_words)
word_counts['sents_len'] = word_counts['text_sentence'].apply(lambda x: len(x))
# word_counts['punct_count'] = word_counts['text_sentence'].apply(lambda x: punct_count(x))
word_counts.head(3)

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
Processing row 8000
Processing row 8500


Unnamed: 0,seventeen,surround,wretchedness,prodigious,knock,camden,widow,represent,wise,tis,...,undoubtedly,strengthen,exclaim,string,execute,protection,representation,text_sentence,text_source,sents_len
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll,67
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll,63
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll,33


In [64]:
X = word_counts.drop(['text_sentence', 'text_source'], axis=1)
Y = word_counts['text_source']

In [65]:
X_train, X_validation, y_train, y_validation = train_test_split(X.as_matrix(), Y, test_size=0.20, random_state=99)

In [74]:
svm = SVC(kernel = 'linear', tol=0.001)
print(svm.fit(X_train, y_train))

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [75]:
svm.score(X_validation, y_validation)

0.96374790853318459

# Challenge 1:  

In [54]:
outcome = word_counts['text_source'].apply(lambda x: 1 if x == 'Carroll' else 0)
X = np.array(word_counts.drop(['text_sentence','text_source'], 1))

In [57]:
outcome.head(5)

0    1
1    1
2    1
3    1
4    1
Name: text_source, dtype: int64

In [58]:
test_ml_model(X, outcome)

RANDOM FOREST CLASSIFIER
Training set score: 0.993304816812

Test set score: 0.955648535565


(5377, 3601) (5377,)
LOGIT REGRESSION CLASSIFIER
Training set score: 0.976008926911

Test set score: 0.960390516039


In [None]:
word_counts['text_source']


outcome = 1: alice, 0: other
          1: austen, 0: other
            

In [None]:
alice_doc emma_doc
alice_sents

In [None]:
X = alice_sents + emma_sents

In [None]:
X = NewSentences[0]
Y = NewSentences[1]

In [None]:
NewSentences = pd.DataFrame(alice_sents + emma_sents)

X = NewSentences[0]
Y = NewSentences[1]

X_train, X_validation, y_train, y_validation = train_test_split(X, Y, test_size=0.20, random_state=99)

In [None]:
svm = SVC(kernel = 'linear')
print(svm.fit(X_train, y_train))
# print(svm.score(X_validation, y_validation))