# Challenge 0:
Recall that the logistic regression model's best performance on the test set was 93%. See what you can do to improve performance. Suggested avenues of investigation include: Other modeling techniques (SVM?), making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc), and anything else your heart desires. Make sure to design your models on the test set, or use cross_validation with multiple folds, and see if you can get accuracy above 90%.
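
One of the suggested avenues, sentence-level and contextual features, can be sketched in a few lines. This is a minimal illustration on plain strings, not part of the notebook's pipeline; the function name `sentence_features` and the feature names are hypothetical.

```python
import string

def sentence_features(sentence, previous=""):
    """Return simple length, punctuation, and overlap features for one sentence."""
    words = sentence.split()
    # Lower-cased words of the previous sentence, with edge punctuation stripped.
    prev_words = {w.strip(string.punctuation).lower() for w in previous.split()}
    prev_words.discard("")
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    return {
        "n_words": len(words),
        "n_punct": sum(ch in string.punctuation for ch in sentence),
        "prev_len": len(previous.split()),
        "n_repeated": sum(1 for w in cleaned if w and w in prev_words),
    }

features = sentence_features(
    "Oh dear! Oh dear!",
    previous="There was nothing so VERY remarkable in that.",
)
print(features)
```

Columns like these could be appended to the word-count features before fitting a model.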

In [29]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter

In [30]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove anything enclosed in square brackets.
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])

This challenge requires using supervised NLP to analyze a pre-labelled dataset for training and testing. 

The goal of this challenge is to predict whether a sentence comes from Alice in Wonderland by Lewis Carroll or Persuasion by Jane Austen. 

The following supervised models will be used in the challenge: Logistic Regression, Linear SVM, Stochastic Gradient Descent, Gradient Boosting, Random Forest, and Naive Bayes. 

We use a Bag of Words (BoW) representation: we count how many times each word appears in a sentence and then use those counts as features.
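
As a toy illustration of the idea, the sketch below turns two made-up sentences into count vectors over a shared vocabulary. The sentences are hypothetical and no lemmatization or stop-word filtering is applied here.

```python
from collections import Counter

# Two hypothetical, already lower-cased sentences.
sentences = ["the rabbit saw the clock", "the captain saw anne"]

# The vocabulary is every distinct word, in a fixed (sorted) column order.
vocabulary = sorted({word for s in sentences for word in s.split()})

# One row of counts per sentence, one column per vocabulary word.
rows = []
for s in sentences:
    counts = Counter(s.split())
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
print(rows)
```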

We use the F1-score and the training and test accuracy scores to judge how well each supervised learning model performs. 
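
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts, using the Carroll counts from the Naive Bayes crosstab shown later in this notebook (48 Carroll sentences predicted as Carroll, 19 Austen sentences predicted as Carroll, 5 Carroll sentences predicted as Austen). The helper name `precision_recall_f1` is an assumption for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Carroll as the positive class in the Naive Bayes crosstab.
p, r, f1 = precision_recall_f1(tp=48, fp=19, fn=5)
print(round(p, 2), round(r, 2), round(f1, 2))
```

The rounded values agree with the Carroll row of the Naive Bayes classification report (0.72, 0.91, 0.80).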

In [31]:
# Parse the cleaned novels. This can take a bit.
import en_core_web_sm
nlp = en_core_web_sm.load()
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [32]:
alice[:100]

'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to'

In [33]:
type(alice)

str

In [34]:
print(len(persuasion))
print(len(alice))

46284
14139


In [35]:
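# Intended to cut Persuasion to the same length as Alice to speed things up.
# Note: the bound below exceeds len(persuasion) (46284), and persuasion_doc
# was already parsed above, so this slice does not change any later result.
persuasion = persuasion[:141709]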

In [36]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(Oh, dear, !)",Carroll


### Function for Bag of Words 

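The functions below build a list of the 3000 most common words in each novel and count how often each of those words appears in every sentence. A progress message is printed every 700 rows (`if i % 700 == 0`).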

In [37]:
# Utility function to create a list of the 3000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(3000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 700 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

In [38]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0


Unnamed: 0,decision,Clay,rouse,intervention,represent,breathe,Miss,sound,London,throwing,...,brilliancy,limited,man,boot,reality,prevent,submit,method,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
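
Updating one data frame cell at a time with `df.loc[i, word] += 1` is what makes `bow_features` slow. A possible alternative, sketched below under the assumption that each sentence has already been reduced to a list of lemma strings, is to count each sentence once with `Counter` and build all rows in a single pass. The name `count_rows` and the toy input are hypothetical.

```python
from collections import Counter

def count_rows(tokenized_sentences, vocabulary):
    """Return a sorted vocabulary and one row of counts per sentence."""
    vocab = sorted(vocabulary)
    vocab_set = set(vocab)
    rows = []
    for tokens in tokenized_sentences:
        # Count only tokens that belong to the vocabulary.
        counts = Counter(t for t in tokens if t in vocab_set)
        rows.append([counts[w] for w in vocab])
    return vocab, rows

# Hypothetical lemma lists standing in for spaCy sentences.
tokenized = [["oh", "dear", "oh"], ["anne", "dear", "walk"]]
vocab, rows = count_rows(tokenized, {"dear", "oh", "anne"})
print(vocab)
print(rows)
# The result could then be wrapped once, e.g. pd.DataFrame(rows, columns=vocab).
```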


# Models

1. Logistic regression 
2. Naive Bayes
3. Gradient Boosting
4. Random Forest 
5. Linear SVM
6. Stochastic Gradient Descent 


# Logistic Regression model

In [40]:
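from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# X_train, X_test, y_train, and y_test are defined by the train_test_split
# call in the Random Forest cell (In [43]), which must be run first.
lr = LogisticRegression()
lr.fit(X_train, y_train)

y_hat = lr.predict(X_test)

print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))
print(classification_report(y_test, y_hat))
# Note: `cross` is not recomputed in this cell, so the table printed below is
# a stale value whose counts match the Naive Bayes report, not this model.
print(cross)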



(266, 1613) (266,)
Training set score: 0.9699248120300752

Test set score: 0.8764044943820225
              precision    recall  f1-score   support

      Austen       0.85      1.00      0.92       125
     Carroll       1.00      0.58      0.74        53

   micro avg       0.88      0.88      0.88       178
   macro avg       0.93      0.79      0.83       178
weighted avg       0.89      0.88      0.87       178

text_source  Austen  Carroll
row_0                       
Austen          106        5
Carroll          19       48


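# Results from Logistic regression

The test set score of 87.6% is reasonable, though it is below the 93% quoted in the challenge, and the gap from the 97.0% training score suggests some overfitting.

F1-scores by author:

Austen - 0.92: a strong score; with a recall of 1.00, the model identified essentially all of the Austen test sentences.

Carroll - 0.74: a weaker score. Precision is 1.00 but recall is only 0.58, so many Carroll sentences were labelled as Austen. Carroll is also the smaller class, with 53 test sentences against 125 for Austen.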
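
Because the classes are unbalanced, it helps to compare the test accuracy against a majority-class baseline. The sketch below uses the support values from the classification report (125 Austen, 53 Carroll) and the reported test score; it is plain arithmetic, not a new model.

```python
# Test-set support per class, taken from the classification report.
support = {"Austen": 125, "Carroll": 53}
total = sum(support.values())

# Accuracy of a classifier that always predicts the most frequent class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

lr_accuracy = 0.8764  # logistic regression test set score reported above
print(majority_class, round(baseline_accuracy, 3))
print("Improvement over baseline:", round(lr_accuracy - baseline_accuracy, 3))
```

Always guessing Austen would already score about 70%, so the F1-score for Carroll is the more informative number here.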

# Naive Bayes 

In [41]:
from sklearn.naive_bayes import GaussianNB


#Initialize and fit
nb = GaussianNB()
nb.fit(X_train, y_train)

# Apply to testing data
y_hat = nb.predict(X_test)

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % nb.score(X_test, y_test))
print(classification_report(y_test, y_hat))
print(cross)

Accuracy is: 0.865
              precision    recall  f1-score   support

      Austen       0.95      0.85      0.90       125
     Carroll       0.72      0.91      0.80        53

   micro avg       0.87      0.87      0.87       178
   macro avg       0.84      0.88      0.85       178
weighted avg       0.88      0.87      0.87       178

text_source  Austen  Carroll
row_0                       
Austen          106        5
Carroll          19       48


# Result for Naive Bayes

The test accuracy of 86.5% is slightly below the logistic regression model's 87.6%. 

F1-scores by author:

Austen - 0.90: slightly less than the logistic regression model, but still high. 

Carroll - 0.80: an improvement over logistic regression (0.74) in identifying Carroll sentences, driven by a recall of 0.91. 

# Gradient Boosting



In [42]:
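from sklearn import ensemble

# Initialize and fit once.
clf = ensemble.GradientBoostingClassifier()
clf.fit(X_train, y_train)

# Apply to testing data
y_hat = clf.predict(X_test)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))
print(classification_report(y_test, y_hat))
# Note: `cross` is not recomputed in this cell, so the table printed below is
# a stale value whose counts match the Naive Bayes report, not this model.
print(cross)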

Training set score: 0.9661654135338346

Test set score: 0.8202247191011236
              precision    recall  f1-score   support

      Austen       0.81      0.98      0.88       125
     Carroll       0.89      0.45      0.60        53

   micro avg       0.82      0.82      0.82       178
   macro avg       0.85      0.71      0.74       178
weighted avg       0.83      0.82      0.80       178

text_source  Austen  Carroll
row_0                       
Austen          106        5
Carroll          19       48


# Result of Gradient Boosting

Test accuracy score: 82.0%, which is less than the previous two models. 

F1-scores:

Austen - 0.88: a decent F1-score by itself, but slightly less than the previous two models. 

Carroll - 0.60: a poor score; with a recall of 0.45, the model missed more than half of the Carroll sentences.

# Random Forest

In [43]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

# Split first, so the model is fit and evaluated on the same partition.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, y_train)

# Apply to testing data
y_hat = rfc.predict(X_test)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))
print(classification_report(y_test, y_hat))
# Note: `cross` is not recomputed in this cell, so the table printed below is
# a stale value whose counts match the Naive Bayes report, not this model.
print(cross)



Training set score: 0.9849624060150376

Test set score: 0.8370786516853933
              precision    recall  f1-score   support

      Austen       0.83      0.96      0.89       125
     Carroll       0.85      0.55      0.67        53

   micro avg       0.84      0.84      0.84       178
   macro avg       0.84      0.75      0.78       178
weighted avg       0.84      0.84      0.83       178

text_source  Austen  Carroll
row_0                       
Austen          106        5
Carroll          19       48


# Result for Random forest

Test accuracy: 83.7%, slightly higher than the Gradient Boosting model (82.0%). 

F1-scores:

Austen - 0.89: about the same as the Gradient Boosting model (0.88). 

Carroll - 0.67: a slight improvement over Gradient Boosting (0.60), but still not a good score. 

# Linear SVM 

In [45]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix


# Create instance and fit
sv = SVC(kernel='linear')
sv.fit(X_train, y_train)

# Apply to testing data
y_hat = sv.predict(X_test)

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sv.score(X_test, y_test))
print(classification_report(y_test, y_hat))
print(cross)

Accuracy is: 0.871
              precision    recall  f1-score   support

      Austen       0.85      0.98      0.91       125
     Carroll       0.94      0.60      0.74        53

   micro avg       0.87      0.87      0.87       178
   macro avg       0.90      0.79      0.83       178
weighted avg       0.88      0.87      0.86       178

text_source  Austen  Carroll
row_0                       
Austen          123       21
Carroll           2       32


# Results Linear SVM 

Test accuracy: 87.1%, the second-highest test accuracy of all the models, just behind logistic regression (87.6%). 

F1-scores: roughly identical to the logistic regression F1-scores, at 0.91 (0.92 for LR) for Austen and 0.74 for both models on Carroll. 



# Stochastic Gradient Descent

In [46]:
from sklearn.linear_model import SGDClassifier
from sklearn import metrics


# Create instance and fit
sgdc = SGDClassifier(loss = 'hinge')
sgdc.fit(X_train, y_train)

# Apply to testing data
y_hat = sgdc.predict(X_test)


# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sgdc.score(X_test, y_test))
print(classification_report(y_test, y_hat))
print(cross)



Accuracy is: 0.843
              precision    recall  f1-score   support

      Austen       0.87      0.91      0.89       125
     Carroll       0.77      0.68      0.72        53

   micro avg       0.84      0.84      0.84       178
   macro avg       0.82      0.80      0.81       178
weighted avg       0.84      0.84      0.84       178

text_source  Austen  Carroll
row_0                       
Austen          114       17
Carroll          11       36


# Results Stochastic Gradient Descent 

Test accuracy: 84.3% 

F1-scores: 

Austen - 0.89: a good score, but not the best among the models tried. 

Carroll - 0.72: a decent score, but again not the best. 

# Overall

Comparing the F1-scores and test scores for all the models:

| | Logistic Regression | Naive Bayes | Gradient Boosting | Random Forest | Linear SVM | Stochastic Gradient Descent |
|---|---|---|---|---|---|---|
| Austen F1 | **0.92** | 0.90 | 0.88 | 0.89 | 0.91 | 0.89 |
| Carroll F1 | 0.74 | **0.80** | 0.60 | 0.67 | 0.74 | 0.72 |
| Test accuracy | **87.6%** | 86.5% | 82.0% | 83.7% | 87.1% | 84.3% |

Logistic regression performed best overall: it had the highest test accuracy and the highest F1-score for Austen. Naive Bayes had the highest F1-score for Carroll (0.80 against 0.74). The Linear SVM model came very close to the logistic regression results, so minor tweaks to its hyperparameters could produce more favorable results. 
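
The challenge also suggests cross-validation with multiple folds, which gives a steadier comparison than a single train/test split. The sketch below shows the general pattern with scikit-learn's `cross_val_score`. The data is a synthetic stand-in for the word counts (random Poisson counts with a few columns made more frequent for one class), so the numbers it prints say nothing about the novels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-in for word counts: 200 "sentences", 20 "word" columns.
X = rng.poisson(lam=1.0, size=(200, 20))
y = np.array(["Austen"] * 120 + ["Carroll"] * 80)

# Make the first five columns more frequent in the Carroll rows so there is signal.
X[120:, :5] += rng.poisson(lam=2.0, size=(80, 5))

# Five-fold cross-validated accuracy; folds are stratified for classifiers.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean: %0.3f, std: %0.3f" % (scores.mean(), scores.std()))
```

In the notebook, `X` and `Y` from the Random Forest cell could be passed in place of the synthetic arrays, and the same call repeated for each model.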



# Challenge 1:
Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work. This will involve pulling a new book from the Project Gutenberg corpus (print(gutenberg.fileids()) for a list) and processing it.

Record your work for each challenge in a notebook and submit it below.

In [47]:
# Clean the Emma data.
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma)
print(emma[:100])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to


In [48]:
# Parse our cleaned data.
emma_doc = nlp(emma)

In [49]:
# Group into sentences.
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

# Emma is quite long, let's cut it down to the same length as Alice.
emma_sents = emma_sents[0:len(alice_sents)]

In [50]:
# Build a new Bag of Words data frame for Emma word counts.
# We'll use the same common words from Alice and Persuasion.
emma_sentences = pd.DataFrame(emma_sents)
emma_bow = bow_features(emma_sentences, common_words)

print('done')

Processing row 0
done


# Comparing Logistic Regression with Linear SVM on the new data



In [53]:
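# Now we can model it!
# Let's use logistic regression again.

# Combine the Emma sentence data with the Alice data from the training set.
# Note: y_train's index holds the original row labels, while X_train is a
# NumPy array indexed by position, so the rows selected here are not
# guaranteed to be the Carroll rows. A positional boolean mask such as
# (y_train == 'Carroll').values would select the intended rows.
X_Emma_test = np.concatenate((
    X_train[y_train[y_train=='Carroll'].index],
    emma_bow.drop(['text_sentence','text_source'], axis=1)
), axis=0)
y_Emma_test = pd.concat([y_train[y_train=='Carroll'],
                         pd.Series(['Austen'] * emma_bow.shape[0])])

# Model.
print('\nTest set score:', lr.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = lr.predict(X_Emma_test)
print(classification_report(y_Emma_test, lr_Emma_predicted))
# Note: `cross` is not recomputed here; the table printed below is left over
# from the Stochastic Gradient Descent cell (In [46]).
print(cross)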


Test set score: 0.6926829268292682
              precision    recall  f1-score   support

      Austen       0.69      0.95      0.79       129
     Carroll       0.74      0.26      0.39        76

   micro avg       0.69      0.69      0.69       205
   macro avg       0.71      0.60      0.59       205
weighted avg       0.71      0.69      0.64       205

text_source  Austen  Carroll
row_0                       
Austen          114       17
Carroll          11       36
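
The Carroll rows above are selected with `X_train[y_train[y_train=='Carroll'].index]`, which mixes two kinds of indexing: `y_train` keeps the original data frame labels after the shuffled split, while `X_train` is a NumPy array addressed by position. The toy example below, with hypothetical values, shows how the two selections can differ and how a positional boolean mask picks the intended rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for y_train after a shuffled split: the index holds
# the original row labels, not positions 0..n-1.
y_train = pd.Series(["Austen", "Carroll", "Austen", "Carroll"], index=[7, 0, 3, 1])

# Row k of X_train belongs to y_train.iloc[k].
X_train = np.array([[70], [0], [30], [10]])

# Label-based: the index labels [0, 1] are treated as positions in X_train.
by_label = X_train[y_train[y_train == "Carroll"].index]

# Position-based: a boolean mask aligned with the rows of X_train.
by_mask = X_train[(y_train == "Carroll").values]

print(by_label.tolist())  # includes the Austen row at position 0
print(by_mask.tolist())   # only the two Carroll rows
```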


In [54]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix

# Create instance and fit
sv = SVC(kernel='linear')
sv.fit(X_train, y_train)

# Same test set construction as in the logistic regression cell; the same
# caveat about label-based indexing into X_train applies.
X_Emma_test = np.concatenate((
    X_train[y_train[y_train=='Carroll'].index],
    emma_bow.drop(['text_sentence','text_source'], axis=1)
), axis=0)

y_Emma_test = pd.concat([y_train[y_train=='Carroll'],
                         pd.Series(['Austen'] * emma_bow.shape[0])])

# Model.
print('\nTest set score:', sv.score(X_Emma_test, y_Emma_test))
sv_Emma_predicted = sv.predict(X_Emma_test)
print(classification_report(y_Emma_test, sv_Emma_predicted))
# Note: `cross` is not recomputed here; the table printed below is left over
# from the Stochastic Gradient Descent cell (In [46]).
print(cross)




Test set score: 0.6048780487804878
              precision    recall  f1-score   support

      Austen       0.65      0.81      0.72       129
     Carroll       0.44      0.26      0.33        76

   micro avg       0.60      0.60      0.60       205
   macro avg       0.55      0.53      0.53       205
weighted avg       0.57      0.60      0.58       205

text_source  Austen  Carroll
row_0                       
Austen          114       17
Carroll          11       36


# Results

Logistic Regression - 69.3% test accuracy

F1-scores:

Austen - 0.79

Carroll - 0.39


Linear SVM - 60.5% test accuracy 

F1-scores:

Austen - 0.72

Carroll - 0.33 

The logistic regression model outperformed the Linear SVM model on every metric we used: test accuracy and the F1-scores for both authors. Both models scored well below their accuracy on the original Alice/Persuasion test set (87.6% and 87.1%). 