In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
import nltk
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import model_selection

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

#from __future__ import unicode_literals
import en_core_web_sm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


symbolic link created for C:\Users\User\Anaconda3\lib\site-packages\spacy\data\en <<===>> C:\Users\User\Anaconda3\lib\site-packages\en_core_web_sm
[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
[+] Linking successful
C:\Users\User\Anaconda3\lib\site-packages\en_core_web_sm -->
C:\Users\User\Anaconda3\lib\site-packages\spacy\data\en
You can now load the model via spacy.load('en')


# Challenge 1 of 2:

Recall that the logistic regression model's best performance on the test set was 93%.  See what you can do to improve performance.  Suggested avenues of investigation include: Other modeling techniques (SVM?), making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc), and anything else your heart desires.  Make sure to design your models on the test set, or use cross_validation with multiple folds, and see if you can get accuracy above 90%.  

### Working the Problem

__Initial observations__
- High dimensionality (1611 columns)
- Overfitting

__The goal is to reduce overfitting and improve test-set accuracy__

__Procedure__ <br>

1) Import and clean the data with regular expressions (`re`) <br>
2) Parse and tokenize sentences with spaCy (creating "span" objects) <br>
3) Part-of-speech (POS) tagging <br>
4) 






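The first thing to try is lowercasing all of the text, so that capitalized and lowercase forms of the same word (for example, at the start of a sentence) count as one feature. <br>
This is done by adding `text = text.lower()` to the cleaning function below. <br>
One side effect to keep in mind: spaCy's tagger relies partly on capitalization, so lowercased proper nouns may be tagged differently.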

## Clean the Data

In [2]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove any text enclosed in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    text = text.lower()
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])

## Parse and Tokenize with Spacy

In [3]:
# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [4]:
# EXAMPLE CODE THAT I DIDN'T WANT TO LOSE. LOOPED THROUGH DOCS 

#dict_1 = {}

#for index, sent in enumerate (alice_doc.sents):
 #   c = Counter(([token.pos_ for token in sent]))
 #   dict_1[index] = {}
 #   for pos in c:        
 #       dict_1[index][pos] = c[pos]
    
#print(dict_1)

#______________persuasion___________________
#------------------------------------------------
#dict_2 = {}

#for index, sent in enumerate (persuasion_doc.sents):
#    c = Counter(([token.pos_ for token in sent]))
 #   dict_2[index] = {}
#    for pos in c:        
  #      dict_2[index][pos] = c[pos]
    
#print(dict_1)

In [4]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head(10)

Unnamed: 0,0,1
0,"(alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(so, she, was, considering, in, her, own, mind...",Carroll
2,"(there, was, nothing, so, very, remarkable, in...",Carroll
3,"(oh, dear, !)",Carroll
4,"(oh, dear, !)",Carroll
5,"(i, shall, be, late, !, ')",Carroll
6,"((, when, she, thought, it, over, afterwards, ...",Carroll
7,"(in, another, moment, down, went, alice, after...",Carroll
8,"(the, rabbit, -, hole, went, straight, on, lik...",Carroll
9,"(either, the, well, was, very, deep, ,, or, sh...",Carroll


In [5]:
print(sentences.loc[1,0])

so she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a white rabbit with pink eyes ran close by her.


In [6]:
# View the part of speech for some tokens in our sentence.
print('\nParts of speech:')
for token in sentences.loc[0,0]:
    print(token.orth_, token.pos_)


Parts of speech:
alice PROPN
was AUX
beginning VERB
to PART
get AUX
very ADV
tired ADJ
of ADP
sitting VERB
by ADP
her DET
sister NOUN
on ADP
the DET
bank NOUN
, PUNCT
and CCONJ
of ADP
having VERB
nothing PRON
to PART
do AUX
: PUNCT
once ADV
or CCONJ
twice ADV
she PRON
had AUX
peeped VERB
into ADP
the DET
book NOUN
her DET
sister NOUN
was AUX
reading VERB
, PUNCT
but CCONJ
it PRON
had AUX
no DET
pictures NOUN
or CCONJ
conversations NOUN
in ADP
it PRON
, PUNCT
' PUNCT
and CCONJ
what PRON
is AUX
the DET
use NOUN
of ADP
a DET
book NOUN
, PUNCT
' PUNCT
thought VERB
alice NOUN
' PUNCT
without ADP
pictures NOUN
or CCONJ
conversation NOUN
? PUNCT
' PUNCT


In [7]:
# Extract the first ten entities.
entities = list(alice_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

PERSON alice
WORK_OF_ART ' thought alice ' without pictures or conversation
DATE the hot day
ORDINAL first
LOC earth
QUANTITY four thousand miles
LOC earth
GPE new zealand
GPE australia
PERSON dinah'll


In [8]:
# Use `chunk` rather than `np` as the loop variable to avoid confusion with numpy.
nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in alice_doc.noun_chunks]
print("There were {} noun phrases found.".format(len(nounphrases)))

There were 704 noun phrases found.


## Lemmatize with pipeline

In [10]:
# EXAMPLE CODE THAT I DIDN'T WANT TO LOSE. LOOPED THROUGH DOCS 
#nlp = en_core_web_sm.load()
#c = Counter(([token.pos_ for token in alice_doc]))
#sbase = sum(c.values())
#for el, cnt in c.items():
 #   print(el, '{0:2.2f}%'.format((100.0* cnt)/sbase))

In [9]:
# Utility function to create a list of the 200 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop
                ]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(200)]

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)
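
As a small illustration of how the vocabulary is assembled, the sketch below repeats the same idea on toy data: take the top-n non-stop-word lemmas from each text, then merge them with `set`. The lemma lists, the stop list `STOP_WORDS`, and the helper `top_lemmas` are hypothetical stand-ins for the spaCy tokens used above.

```python
from collections import Counter

# Toy stop list (an assumption); the notebook uses spaCy's token.is_stop instead.
STOP_WORDS = {"the", "a", "be", "to"}

def top_lemmas(lemmas, n):
    """Return the n most frequent lemmas that are not stop words."""
    kept = [lemma for lemma in lemmas if lemma not in STOP_WORDS]
    return [word for word, _ in Counter(kept).most_common(n)]

# Made-up lemma sequences standing in for two parsed texts.
text_a = ["the", "rabbit", "run", "rabbit", "hole", "rabbit", "hole", "be", "late"]
text_b = ["the", "baronet", "be", "vain", "baronet", "family", "family", "baronet", "late"]

vocab = set(top_lemmas(text_a, 2) + top_lemmas(text_b, 2))
print(sorted(vocab))
```

Because the two bags are merged into a set, a word that is common in both texts appears only once, so the combined vocabulary has at most 2n entries.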

In [10]:
# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
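    # Scaffold the data frame and initialize counts to zero.
    # Use a list for the column labels: it fixes the column order, and
    # newer pandas versions reject a set as a .loc indexer.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0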
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                     
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400


Unnamed: 0,sir,danger,key,garden,sailor,fancy,shut,thump,choose,marry,rabbit,make,taste,sleepy,field,suddenly,hole,half,dignity,long,favour,object,oh,remember,lock,point,spirit,poison,daughter,way,flash,young,great,spend,run,distance,certainly,golden,telescope,lady,advantage,good,mile,bath,deal,friend,dark,home,evil,white,well,listen,married,longitude,pride,russell,mouse,man,want,time,pink,latitude,talk,rise,turn,heir,miss,thirteen,keep,happy,ill,head,pleasure,judgement,book,duchess,walter,end,honour,far,elizabeth,pair,hold,drop,somersetshire,pick,low,handsome,hardly,short,moment,go,use,care,glass,society,curiosity,observe,navy,dear,reach,burn,manage,cat,know,small,quit,have,picture,lie,taunton,begin,close,wish,hot,nice,table,peep,croft,pretty,wait,let,add,possible,year,remarkable,speak,conduct,clay,believe,rate,dinah,look,tear,act,near,country,daisy,bottle,shelf,passage,girl,hour,oblige,birth,marriage,hope,alas,cry,youth,beautifully,hear,acquaintance,fall,kid,word,tired,monkford,health,rich,anne,mean,watch,roof,live,round,gentleman,day,inch,stop,hall,world,principle,mr,actually,try,finish,feel,room,favourite,shall,love,bring,alice,new,see,term,history,pretend,fan,hint,box,twice,spring,deserve,bank,comfort,get,require,ear,excellent,tenant,thought,feeling,consider,ask,child,inclination,rule,curiouser,come,pocket,right,month,generally,wentworth,father,remain,baldwin,walk,pop,kellynch,glove,profession,stupid,life,high,little,london,change,old,eat,foot,late,soon,hand,bit,chain,degree,sort,mary,dry,cake,read,say,follow,neighbour,start,cupboard,door,deep,equal,earth,family,occur,fortune,think,consequence,people,will,wonder,hang,extremely,mind,forget,expect,drink,character,waistcoat,fact,house,neighbourhood,question,influence,admiral,sister,leave,funny,place,grow,take,corner,esq,bat,air,lose,eye,jar,persuade,age,reply,town,worth,large,happen,claim,give,open,mrs,`,away,bear,minute,baronet,tell,interest,consult,notice,elliot,trouble,candle,poor,curious,strong,night,natural,surprised,find,label,lovely,idea,sit,thing,woman,difficulty,true,fortunately,mark,sure,shepherd,like,hurry,conversation,mention,ought,size,mother,suppose,settle,set,present,sight,receive,charles,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,"(alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(so, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(there, was, nothing, so, very, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(oh, dear, !)",Carroll
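
The row-by-row `df.loc[i, word] += 1` updates above are what make this step slow. One possible alternative, sketched here on toy data rather than on the spaCy spans, is to count each sentence with `Counter` and build all rows first. The helper `count_rows` and the sample sentences are assumptions for illustration.

```python
from collections import Counter

def count_rows(sentences, vocab):
    """One dict per sentence: vocabulary word -> count (0 when absent)."""
    columns = sorted(vocab)  # fixed, reproducible column order
    rows = []
    for lemmas in sentences:
        # Count only lemmas that belong to the vocabulary.
        counts = Counter(lemma for lemma in lemmas if lemma in vocab)
        # Counter returns 0 for missing keys, so every column gets a value.
        rows.append({word: counts[word] for word in columns})
    return rows

# Made-up vocabulary and lemma lists standing in for parsed sentences.
vocab = {"rabbit", "hole", "dear"}
sentences = [
    ["oh", "dear", "dear"],
    ["rabbit", "hole", "go", "rabbit"],
]
rows = count_rows(sentences, vocab)
print(rows)
```

Passing such a list of dicts to `pd.DataFrame(rows)` would give a count table with one row per sentence, after which the sentence and source columns could be attached as before.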


In [11]:
dict_1 = {}

for index, sent in enumerate (sentences[0]):
    c = Counter(([token.pos_ for token in sent]))
    dict_1[index] = {}
    for pos in c:
        dict_1[index][pos] = c[pos]
        
#print(dict_1)

In [12]:
df = pd.DataFrame(dict_1).transpose()
df.head()

Unnamed: 0,PROPN,AUX,VERB,PART,ADV,ADJ,ADP,DET,NOUN,PUNCT,CCONJ,PRON,SCONJ,INTJ,NUM
0,1.0,7.0,6.0,2.0,3.0,1.0,8.0,7.0,11.0,10.0,6.0,5.0,,,
1,,2.0,9.0,,8.0,8.0,6.0,7.0,8.0,7.0,2.0,4.0,2.0,,
2,,2.0,3.0,1.0,5.0,1.0,3.0,3.0,3.0,3.0,1.0,4.0,1.0,,
3,,,,,,,,,,1.0,,,,2.0,
4,,,,,,,,,,1.0,,,,2.0,


In [13]:
df = df.fillna(0)
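
The POS-count steps above (count tags per sentence, line the counts up in columns, fill the gaps with zero) can be sketched without spaCy as follows. The `(token, POS)` pairs are hypothetical stand-ins for spaCy tokens, and the tag values are made up for the example.

```python
from collections import Counter

# Hypothetical (token, POS) pairs standing in for spaCy tokens.
tagged_sentences = [
    [("oh", "INTJ"), ("dear", "INTJ"), ("!", "PUNCT")],
    [("alice", "PROPN"), ("was", "AUX"), ("tired", "ADJ"), (".", "PUNCT")],
]

# Count the tags in each sentence.
pos_counts = [Counter(pos for _, pos in sent) for sent in tagged_sentences]

# Every tag seen anywhere becomes a column.
all_tags = sorted(set().union(*pos_counts))

# A tag that never occurs in a sentence gets 0 (the role fillna(0) plays above).
pos_rows = [[counts[tag] for tag in all_tags] for counts in pos_counts]

print(all_tags)
print(pos_rows)
```

A zero here means "this tag did not occur in the sentence", which is why filling the missing values with 0 is a safe choice for these features.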

In [14]:
# Join the word counts and the POS counts on the shared sentence index.
df_2 = pd.merge(word_counts, df, right_index=True, left_index=True)

In [15]:
df.shape

(431, 15)

In [42]:
# the commented out code is to document how columns with numeric titles don't use quotations

#sentences = sentences.rename(columns={0:'sentences', 1: 'Author'})

Time to bag some words!  Since spaCy has already tokenized and labelled our data, we can move directly to recording how often various words occur.  We will exclude stopwords and punctuation.  In addition, in an attempt to keep our feature space from exploding, we will work with lemmas (root words) rather than the raw text terms, and we'll only use the 200 most common words for each text (the `most_common(200)` call in `bag_of_words`).

In [21]:
df_2['all tokens'] = df_2['text_sentence'].str.len()
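
The `all tokens` feature counts every token in the sentence, punctuation included. As a rough sketch of how a few more sentence-level features from the challenge list (number of words, amount of punctuation) could be separated out, here is a plain-Python version. It uses `string.punctuation` as a stand-in for spaCy's `token.is_punct`, and the helper names are hypothetical.

```python
import string

def is_punct(token):
    """True when a non-empty token consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def sentence_features(tokens):
    """Token, punctuation, and word counts for one tokenized sentence."""
    n_punct = sum(is_punct(t) for t in tokens)
    return {
        "n_tokens": len(tokens),
        "n_punct": n_punct,
        "n_words": len(tokens) - n_punct,
    }

# Made-up tokenized sentence.
features = sentence_features(["oh", ",", "dear", "!"])
print(features)
```

With real spaCy spans the same counts could be taken from `token.is_punct` directly, which also handles non-ASCII punctuation.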

By transforming all the text into lowercase, we've reduced the dimensionality by almost 30 columns.

## Trying out BoW

Now let's give the bag of words features a whirl by trying a random forest.

In [16]:
X = np.array(df_2.drop(['text_sentence','text_source'], axis=1))

In [17]:
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(X)
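
One caveat about the cell above: the scaler is fit on every row before the train/test split, so the held-out rows contribute to the means and variances used for scaling. A common way around this is to put the scaler and the model in a scikit-learn pipeline, so the scaler is refit on the training part of each split. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's features; `X_demo`, `y_demo`, and the parameter values are assumptions, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); the real feature matrix would go here.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# The scaler is fit inside each training fold, never on the held-out fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)

print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Whether this changes the scores much on the notebook's data has not been checked here; the point is only that the evaluation no longer sees the held-out rows during scaling.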

In [18]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = df_2['text_source']
X = scaled
#X = np.array(df_2.drop(['text_sentence','text_source'], 1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9922480620155039

Test set score: 0.838150289017341


In [19]:
from sklearn import model_selection
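seed = 0
# Note: shuffle defaults to False, so the folds follow row order (all Carroll
# sentences first, then all Austen) and random_state has no effect here.
# Newer scikit-learn versions raise an error for this combination unless
# shuffle=True is passed.
kfold = model_selection.KFold(n_splits=10, random_state=seed)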
model = rfc
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 74.598% (26.444%)
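
The very large spread across folds may have less to do with the model than with how the folds are formed. Without shuffling, k-fold splitting takes contiguous blocks of rows, and the rows here are ordered by author. The plain-Python sketch below mimics that splitting to show which authors end up in each test fold. The class sizes (139 Carroll, 292 Austen) are inferred from the notebook's outputs (431 rows in this frame, 292 Persuasion sentences in the later class counts) and assume Persuasion is split into the same sentences both times; `contiguous_folds` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

# Label order as in the notebook: all Carroll rows first, then all Austen rows.
# The sizes are inferred from the notebook's outputs (an assumption, see above).
labels = ["Carroll"] * 139 + ["Austen"] * 292

def contiguous_folds(n_samples, n_splits):
    """Yield index ranges of unshuffled k-fold test folds.

    The first n_samples % n_splits folds get one extra row.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        yield range(start, start + size)
        start += size

fold_counts = [
    Counter(labels[i] for i in fold)
    for fold in contiguous_folds(len(labels), 10)
]
for k, counts in enumerate(fold_counts):
    print(k, dict(counts))

# Number of test folds that contain a single author only.
single_class = sum(len(counts) == 1 for counts in fold_counts)
print("single-class folds:", single_class)
```

Under these assumptions most test folds contain one author only, which would make the per-fold accuracies swing widely. Shuffled, stratified folds (for example scikit-learn's `StratifiedKFold(n_splits=10, shuffle=True, random_state=0)`) are one way to avoid this.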


Holy overfitting, Batman! Overfitting is a known problem when using bag of words, since it basically involves throwing a massive number of features at a model – some of those features (in this case, word frequencies) will capture noise in the training set. Since overfitting is also a known problem with Random Forests, the divergence between training score and test score is expected.


## BoW with Logistic Regression

Let's try a technique with some protection against overfitting due to extraneous features – logistic regression with ridge regularization (from ridge regression, also called L2 regularization).
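
For reference, one common way to write the L2-penalized logistic regression objective, with labels coded as $y_i \in \{-1, +1\}$, is shown below. Here $C$ is the inverse regularization strength (the `C` parameter of scikit-learn's `LogisticRegression`, which defaults to 1.0), so a smaller $C$ shrinks the coefficients more strongly. The symbols $w$, $b$, $x_i$ and $n$ are notation introduced here, not taken from the notebook.

$$
\min_{w,\, b} \; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + b\right)\right)\right)
$$

The first term penalizes large weights on any single feature, which is what gives some protection when many of the word-count columns carry only noise.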

In [20]:
#from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(258, 359) (258,)
Training set score: 0.9844961240310077

Test set score: 0.8265895953757225


In [21]:
seed = 0
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = lr
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 81.015% (15.842%)


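On the held-out test set, logistic regression (82.7%) is slightly behind the random forest (83.8%), but under 10-fold cross-validation it does better (81.0% vs. 74.6%) and varies less from fold to fold.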

## BoW with Gradient Boosting

And finally, let's see what gradient boosting can do:

In [22]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9612403100775194

Test set score: 0.8554913294797688


In [23]:
seed = 0
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = clf
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 75.502% (24.814%)


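Gradient boosting has the best test-set score (85.5%), but logistic regression has the best and most stable cross-validated accuracy (81.0%), so it looks like the winner. Either way, there's room for improvement.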


# Challenge 2 of 2:
Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work.  This will involve pulling a new book from the Project Gutenberg corpus (print(gutenberg.fileids()) for a list) and processing it.

Record your work for each challenge in a notebook and submit it below.

In [24]:
# Clean the Emma data.
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma[:int(len(emma)/60)])
print(emma[:100])

emma woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to


In [25]:
# Parse our cleaned data.
emma_doc = nlp(emma)

In [26]:
# Group into sentences.
persuasion_sents = [[sent, "Austen_per"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen_em"] for sent in emma_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(emma_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(emma, woodhouse, ,, handsome, ,, clever, ,, a...",Austen_em
1,"(she, was, the, youngest, of, the, two, daught...",Austen_em
2,"(her, mother, had, died, too, long, ago, for, ...",Austen_em
3,"(sixteen, years, had, miss, taylor, been, in, ...",Austen_em
4,"(between, _, them, _)",Austen_em


In [27]:
# View the part of speech for some tokens in our sentence.
print('\nParts of speech:')
for token in sentences.loc[0,0]:
    print(token.orth_, token.pos_)


Parts of speech:
emma PROPN
woodhouse PROPN
, PUNCT
handsome ADJ
, PUNCT
clever ADJ
, PUNCT
and CCONJ
rich ADJ
, PUNCT
with ADP
a DET
comfortable ADJ
home NOUN
and CCONJ
happy ADJ
disposition NOUN
, PUNCT
seemed VERB
to PART
unite VERB
some DET
of ADP
the DET
best ADJ
blessings NOUN
of ADP
existence NOUN
; PUNCT
and CCONJ
had AUX
lived VERB
nearly ADV
twenty NUM
- PUNCT
one NUM
years NOUN
in ADP
the DET
world NOUN
with ADP
very ADV
little ADJ
to PART
distress VERB
or CCONJ
vex VERB
her PRON
. PUNCT


In [28]:
# Extract the first ten entities.
entities = list(emma_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

DATE nearly twenty - one years
CARDINAL two
DATE sixteen years
PERSON taylor
PERSON woodhouse
PERSON taylor
PERSON taylor
PERSON taylor
PERSON taylor
ORDINAL first


In [29]:
# Use `chunk` rather than `np` as the loop variable to avoid confusion with numpy.
nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in emma_doc.noun_chunks]
print("There were {} noun phrases found.".format(len(nounphrases)))

There were 694 noun phrases found.


## Lemmatize with pipeline

In [30]:
# Utility function to create a list of the 200 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop
                ]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(200)]

# Set up the bags.
emmawords = bag_of_words(emma_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(emmawords + persuasionwords)

In [31]:
# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
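    # Scaffold the data frame and initialize counts to zero.
    # Use a list for the column labels: it fixes the column order, and
    # newer pandas versions reject a set as a .loc indexer.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0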
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                     
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
print('done')
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
done


Unnamed: 0,sir,danger,joke,sailor,choose,marry,woodhouse,taste,governess,servant,half,dignity,long,favour,object,sixteen,oh,clever,remember,point,humours,spirit,daughter,way,young,great,pass,spend,certainly,visitor,lady,advantage,good,mile,existence,fault,bath,deal,friend,home,promise,evil,well,married,able,pride,russell,pleasant,man,want,emma,time,usual,talk,rise,smile,turn,knightley,interested,heir,miss,hate,thirteen,sad,happy,ill,head,pleasure,judgement,book,walter,end,particularly,randall,honour,far,comfortable,elizabeth,papa,hold,somersetshire,hardly,handsome,short,go,care,society,horse,observe,navy,dinner,dear,know,small,weston,highbury,husband,quit,have,taunton,begin,wish,companion,croft,sorrow,pretty,let,add,possible,year,speak,conduct,clay,manner,believe,blessing,look,tear,act,near,country,joy,girl,carriage,hour,wedding,marriage,birth,oblige,hope,troublesome,sigh,youth,hear,cheerful,acquaintance,business,behave,james,mistress,acceptable,evening,word,monkford,beloved,health,kindness,afraid,rich,anne,mean,live,hannah,gentleman,day,fanciful,ago,hall,world,principle,mr,seven,isabella,hartfield,feel,meet,room,favourite,shall,bring,love,see,term,backgammon,rain,history,disposition,gentle,hint,spring,matrimony,deserve,match,comfort,require,temper,pay,excellent,feeling,thought,tenant,consider,body,child,inclination,come,right,month,wentworth,father,odd,circumstance,remain,baldwin,walk,kellynch,profession,welcome,wife,life,high,little,london,change,old,late,soon,hand,degree,early,mary,say,follow,neighbour,attach,equal,family,tolerably,affectionate,fortune,think,consequence,people,taylor,extremely,mind,expect,civil,character,fact,house,neighbourhood,influence,admiral,sister,leave,till,place,grow,unite,nearly,affection,take,esq,lose,distress,impossible,eye,persuade,age,reply,town,large,claim,give,glad,disagreeable,mrs,`,open,away,bear,visit,baronet,perfect,tell,aware,interest,consult,elliot,allow,poor,indulgent,strong,night,find,idea,sit,thing,woman,difficulty,true,habit,power,ah,sure,shepherd,like,vex,mention,kind,ought,mother,suppose,settle,set,fond,dirty,present,receive,charles,text_sentence,text_source
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,"(emma, woodhouse, ,, handsome, ,, clever, ,, a...",Austen_em
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(she, was, the, youngest, of, the, two, daught...",Austen_em
2,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,"(her, mother, had, died, too, long, ago, for, ...",Austen_em
3,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,"(sixteen, years, had, miss, taylor, been, in, ...",Austen_em
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(between, _, them, _)",Austen_em


In [56]:
# Build a new Bag of Words data frame for Emma word counts.
# We'll use the same common words from Alice and Persuasion.
#emma_sentences = pd.DataFrame(emma_sents)
#emma_bow = bow_features(emma_sentences, common_words)

#print('done')

In [32]:
dict_2 = {}

for index, sent in enumerate (sentences[0]):
    c = Counter(([token.pos_ for token in sent]))
    dict_2[index] = {}
    for pos in c:
        dict_2[index][pos] = c[pos]
        
#print(dict_2)

In [33]:
df = pd.DataFrame(dict_2).transpose()
df.head()

Unnamed: 0,PROPN,PUNCT,ADJ,CCONJ,ADP,DET,NOUN,VERB,PART,AUX,ADV,NUM,PRON,SCONJ,INTJ
0,2.0,8.0,7.0,4.0,5.0,4.0,6.0,5.0,2.0,1.0,2.0,2.0,1.0,,
1,,5.0,4.0,1.0,6.0,6.0,8.0,,1.0,3.0,2.0,1.0,1.0,,
2,,3.0,5.0,1.0,5.0,6.0,8.0,3.0,1.0,5.0,3.0,,2.0,2.0,
3,5.0,4.0,2.0,1.0,3.0,3.0,6.0,,1.0,2.0,2.0,1.0,,2.0,
4,1.0,,,,2.0,,,,,,,,1.0,,


In [34]:
df = df.fillna(0).copy()

In [35]:
#frames = [df, word_counts]
df_2 = pd.merge(word_counts, df, right_index=True, left_index=True)

In [36]:
df_2.text_source.value_counts()

Austen_per    292
Austen_em     141
Name: text_source, dtype: int64
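
Since the two classes are not the same size, it helps to know how well a model would do by always predicting the larger class. The short sketch below computes that majority-class baseline from the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"Austen_per": 292, "Austen_em": 141}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(f"Always predicting {majority_label}: {baseline_accuracy:.1%}")
```

Accuracies reported for this Persuasion vs. Emma comparison are best read against this baseline rather than against 50%.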

## Trying out BoW

Now let's give the bag of words features a whirl by trying a random forest.

In [37]:
X = np.array(df_2.drop(['text_sentence','text_source'], axis=1))

In [38]:
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(X)

In [39]:
rfc = ensemble.RandomForestClassifier()
Y = df_2['text_source']
X = scaled
#X = np.array(df_2.drop(['text_sentence','text_source'], 1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9922779922779923

Test set score: 0.8045977011494253


In [40]:
from sklearn import model_selection
seed = 0
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = rfc
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 72.378% (31.157%)


Holy overfitting, Batman! Overfitting is a known problem when using bag of words, since it basically involves throwing a massive number of features at a model – some of those features (in this case, word frequencies) will capture noise in the training set. Since overfitting is also a known problem with Random Forests, the divergence between training score and test score is expected.


## BoW with Logistic Regression

Let's try a technique with some protection against overfitting due to extraneous features – logistic regression with ridge regularization (from ridge regression, also called L2 regularization).

In [41]:
#from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(259, 326) (259,)
Training set score: 0.9845559845559846

Test set score: 0.8160919540229885


In [42]:
seed = 0
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = lr
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 71.586% (21.892%)


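Logistic regression does slightly better than the random forest on the test set (81.6% vs. 80.5%), while the cross-validated accuracies are about the same (71.6% vs. 72.4%).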

## BoW with Gradient Boosting

And finally, let's see what gradient boosting can do:

In [43]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9343629343629344

Test set score: 0.8160919540229885


In [44]:
seed = 0
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = clf
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 77.178% (27.009%)


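Performance is worse than in the Carroll vs. Austen comparison, as might be expected, since both texts here are by Austen. Gradient boosting matches logistic regression on the test set (81.6%) and has the highest cross-validated accuracy (77.2%), but the fold-to-fold spread (roughly 22 to 31 percentage points) is too large to name a clear winner.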