# Challenge 0

Recall that the logistic regression model's best performance on the test set was 93%. See what you can do to improve performance. Suggested avenues of investigation include: other modeling techniques (SVM?), making more features that take advantage of the spaCy information (grammar, phrases, POS, etc.), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of the previous and next sentences, words repeated from one sentence to the next, etc.), and anything else your heart desires. Make sure to design your models on the training set, or use cross-validation with multiple folds, so that the test set is only used for the final evaluation, and see if you can get accuracy above 90%.
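
The sentence-level and contextual features suggested above could be sketched roughly as follows. This is a minimal illustration, not part of the original notebook: it assumes each sentence is available as a plain string (for example via `sent.text` on a spaCy span), and the helper name `sentence_features` and its feature names are hypothetical.

```python
import string

def sentence_features(sentences):
    """Return one feature dict per sentence (sentences are plain strings)."""
    # Word counts, ignoring tokens that consist only of punctuation.
    lengths = [
        len([w for w in s.split() if w.strip(string.punctuation)])
        for s in sentences
    ]
    rows = []
    for i, s in enumerate(sentences):
        rows.append({
            "n_words": lengths[i],
            "n_punct": sum(ch in string.punctuation for ch in s),
            # Contextual features: 0 when there is no neighbouring sentence.
            "prev_n_words": lengths[i - 1] if i > 0 else 0,
            "next_n_words": lengths[i + 1] if i + 1 < len(sentences) else 0,
        })
    return rows

example = ["Oh dear!", "I shall be late!", "Alice was beginning to get very tired."]
features = sentence_features(example)
print(features[1])
```

Each dict could then be added as extra columns next to the bag-of-words counts before fitting a model.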

In [1]:
####################### Imports ##############################

##### Infrastructure #######
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import time
from nltk.corpus import gutenberg, stopwords
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import chi2, f_classif
from sklearn.feature_selection import SelectKBest

####### Models ########
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression


In [2]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove any text enclosed in square brackets.
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [3]:
# Parse the cleaned novels. This can take a bit.
start_time = time.time()
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

print("-- Execution time: %s seconds ---" % (time.time() - start_time))

-- Execution time: 18.080790281295776 seconds ---


In [4]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


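I believe spaCy or sklearn can do this a lot quicker than what is outlined here. This is a very manual approach: it updates the data frame one cell at a time with `df.loc`, so the run time grows with the number of sentences times the number of words per sentence, and every single update is slow.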
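
One possible way to speed this up is to count the words of each sentence in plain Python and build the table in a single step, rather than writing to the data frame cell by cell. The sketch below is an assumption-laden illustration, not the notebook's method: it takes sentences that are already lists of lemma strings, and the helper name `count_common_words` is hypothetical.

```python
from collections import Counter

def count_common_words(tokenized_sentences, common_words):
    """Count vocabulary words per sentence in a single pass over the tokens."""
    vocab = sorted(common_words)  # fixed column order
    rows = []
    for tokens in tokenized_sentences:
        # Count only the tokens that belong to the common-word vocabulary.
        counts = Counter(t for t in tokens if t in common_words)
        rows.append([counts[w] for w in vocab])
    return vocab, rows

# Toy input standing in for lemmatized, filtered sentences.
toy_sentences = [["alice", "be", "tired"], ["oh", "dear", "dear"]]
toy_vocab = {"alice", "dear", "tired"}
vocab, rows = count_common_words(toy_sentences, toy_vocab)
print(vocab)
print(rows)
```

The resulting `rows` and `vocab` could be passed to `pd.DataFrame(rows, columns=vocab)` to get the same kind of count table. scikit-learn's `CountVectorizer` is another option for this kind of counting.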

In [5]:
# Bag of Words
# Utility function to create a list of the 750 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(750)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    start_time = time.time()
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 100 == 0:
            print("Processing row {} - {} seconds".format(i, (time.time() - start_time)))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

In [6]:
# Create our data frame with features. This can take a while to run.
start_time = time.time()
word_counts = bow_features(sentences, common_words)
print("-- Execution time: %s seconds ---" % (time.time() - start_time))
word_counts.head()

Processing row 0 - 1.9805071353912354 seconds
Processing row 100 - 65.56992554664612 seconds
Processing row 200 - 119.88716554641724 seconds
Processing row 300 - 172.43793988227844 seconds
Processing row 400 - 226.03345704078674 seconds
Processing row 500 - 276.171106338501 seconds
Processing row 600 - 326.9330608844757 seconds
Processing row 700 - 382.7561273574829 seconds
Processing row 800 - 436.7270016670227 seconds
Processing row 900 - 481.4471776485443 seconds
Processing row 1000 - 533.5522894859314 seconds
Processing row 1100 - 592.0452156066895 seconds
Processing row 1200 - 645.963278055191 seconds
Processing row 1300 - 686.4512197971344 seconds
Processing row 1400 - 732.5760359764099 seconds
Processing row 1500 - 782.2334468364716 seconds
Processing row 1600 - 830.8626594543457 seconds
Processing row 1700 - 902.3781747817993 seconds
Processing row 1800 - 972.348582983017 seconds
Processing row 1900 - 1045.6064910888672 seconds
Processing row 2000 - 1125.3452563285828 seconds

Unnamed: 0,roof,town,sure,listen,proof,little,croft,intimate,temper,lessen,...,handsome,struggle,offended,rejoice,occasion,rest,ought,father,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll


In [7]:
len(sentences)

5318

In [8]:
# The word counts data frame has the source, the sentence, and all the word counts.
# Add one more feature: the number of unique non-punctuation words in each sentence.
# (Despite its name, number_of_words returns the unique-word count.)
def number_of_words(sentence):
    example_words = [token for token in sentence if not token.is_punct]
    unique_words = set([token.text for token in example_words])
    return len(unique_words)
    

In [9]:
# Add Unique words
new_df = word_counts
new_df['Unique_Words'] = new_df['text_sentence'].apply(lambda x: number_of_words(x))

In [10]:
# Split the data
Y = new_df['text_source']
X = new_df.drop(['text_sentence', 'text_source'], axis=1)

In [11]:
# Select the best features.
# Note: best_features is computed here but not used below; the models are fit on the full X.
start_time = time.time()
selector = SelectKBest(f_classif, k=100)
selector.fit(X,Y)

idxs_selected = selector.get_support(indices=True)
best_features = X[X.columns[idxs_selected]]
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.36888718605041504 seconds ---


In [12]:
# Do the Test Train Split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.3,
                                                    random_state=0)

In [13]:
# Check the balance
Y.value_counts()

Austen     3649
Carroll    1669
Name: text_source, dtype: int64
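
Because the classes are imbalanced, it may help to keep the majority-class baseline in mind when reading the accuracy scores that follow. The short calculation below is only a sketch that reuses the counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
class_counts = {"Austen": 3649, "Carroll": 1669}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 3))  # Austen 0.686
```

A classifier that always answers "Austen" would already be right about 69% of the time on the full data, so the reported accuracies are best judged against that figure rather than against 50%.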

In [14]:
################ Logistic Regression ######################
start_time = time.time()
parameters = {
                'penalty':['l1'],
                'C':[1]
               
              }

lr = LogisticRegression(class_weight='balanced')

grid = GridSearchCV(lr, parameters, scoring='accuracy', cv=5, verbose=0)
#Fit the Data
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
print("-- Execution time: %s seconds ---" % (time.time() - start_time))


0.9110275689223057
-- Execution time: 52.03978419303894 seconds ---


In [16]:
grid.best_params_

{'C': 1, 'penalty': 'l2'}

In [20]:
################ Gradient Boost #############################
start_time = time.time()
parameters = {'subsample':[1],
              'max_depth':[4],
              'loss':['deviance'],
             'n_estimators':[800]}

# Initialize the model.
clf = ensemble.GradientBoostingClassifier()

# Create the grid and perform 5-fold cross-validation.
gradient_grid = GridSearchCV(clf, parameters, cv=5, verbose=0)

#Fit the Data
gradient_grid.fit(X_train, y_train)
print(gradient_grid.score(X_test, y_test))
print("--- %s seconds ---" % (time.time() - start_time))

0.9078947368421053
--- 450.75105357170105 seconds ---


In [21]:
gradient_grid.best_params_

{'loss': 'deviance', 'max_depth': 4, 'n_estimators': 800, 'subsample': 1}

# Challenge 1

Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work. This will involve pulling a new book from the Project Gutenberg corpus (run `print(gutenberg.fileids())` for a list) and processing it.

In [22]:
# Clean the Emma data.
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma)

In [23]:
# Parse our cleaned data.
emma_doc = nlp(emma)

# Group into sentences.
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

# Emma is quite long, let's cut it down to the same length as Alice.
emma_sents = emma_sents[0:len(alice_sents)]

In [27]:
# Build a new Bag of Words data frame for Emma word counts.
# We'll use the same common words from Alice and Persuasion.
emma_sentences = pd.DataFrame(emma_sents)
emma_bow = bow_features(emma_sentences, common_words)

Processing row 0 - 0.3026435375213623 seconds
Processing row 100 - 8.328272104263306 seconds
Processing row 200 - 17.45161008834839 seconds
Processing row 300 - 29.218074560165405 seconds
Processing row 400 - 37.252437114715576 seconds
Processing row 500 - 43.764498233795166 seconds
Processing row 600 - 50.63793087005615 seconds
Processing row 700 - 58.030407428741455 seconds
Processing row 800 - 64.70669102668762 seconds
Processing row 900 - 71.33534455299377 seconds
Processing row 1000 - 78.088059425354 seconds
Processing row 1100 - 85.86960220336914 seconds
Processing row 1200 - 94.57090473175049 seconds
Processing row 1300 - 99.6506450176239 seconds
Processing row 1400 - 104.70747780799866 seconds
Processing row 1500 - 111.2115421295166 seconds
Processing row 1600 - 119.91576433181763 seconds


In [25]:
# Add Unique Words
# Compute the feature from Emma's own sentences (not from new_df).
emma_bow['Unique_Words'] = emma_bow['text_sentence'].apply(number_of_words)

In [28]:
# Combine the Emma sentence data with the Alice (Carroll) rows from the test set.
# The earlier attempt, X_train[<row index>], looked the labels up as columns
# and raised a KeyError; .loc with a boolean mask selects rows instead.
carroll_test = y_test == 'Carroll'
emma_X = emma_bow.drop(['text_sentence', 'text_source'], axis=1)[X.columns]
X_Emma_test = pd.concat([X_test.loc[carroll_test], emma_X], ignore_index=True)
y_Emma_test = pd.concat([y_test[carroll_test],
                         pd.Series(['Austen'] * emma_bow.shape[0])],
                        ignore_index=True)

# Model. Use the fitted logistic regression grid search from above;
# the bare `lr` estimator was never fitted.
print('\nTest set score:', grid.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = grid.predict(X_Emma_test)
pd.crosstab(y_Emma_test, lr_Emma_predicted)