# A supervised problem

## Challenge 0:
Recall that the logistic regression model's best performance on the test set was 93%. See what you can do to improve performance. Suggested avenues of investigation include: other modeling techniques (SVM?), making more features that take advantage of the spaCy information (grammar, phrases, POS, etc.), making sentence-level features (number of words, amount of punctuation), including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc.), and anything else your heart desires. Make sure to evaluate your models on the test set, or use cross-validation with multiple folds, and see if you can get accuracy above 90%.
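One of the avenues suggested above, sentence-level and contextual features, can be sketched without spaCy at all. The helper below is a hypothetical illustration: the name `sentence_features` and the particular features are assumptions, not part of the original notebook. It works on plain sentence strings and returns counts that could be appended as extra columns next to the bag-of-words features.

```python
import string


def sentence_features(sentence, prev_sentence=""):
    """Return simple sentence-level features for one sentence string.

    Hypothetical helper: the feature choices are illustrative only.
    """
    words = sentence.split()
    # Normalized words of the previous sentence, used for the overlap count.
    prev_words = {w.strip(string.punctuation).lower() for w in prev_sentence.split()}
    prev_words.discard("")
    cur_words = [w.strip(string.punctuation).lower() for w in words]
    return {
        "n_words": len(words),
        "n_punct": sum(ch in string.punctuation for ch in sentence),
        # Fully capitalized words such as 'VERY' (single letters excluded).
        "n_upper_words": sum(w.isupper() and len(w) > 1 for w in words),
        "prev_len": len(prev_sentence.split()),
        "n_repeated": sum(w in prev_words for w in cur_words if w),
    }


features = sentence_features(
    "Oh dear! I shall be late!",
    "There was nothing so VERY remarkable in that.",
)
print(features)
```

In the notebook, the same function could be applied to the string version of each spaCy sentence, passing the preceding sentence as `prev_sentence`.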

## Challenge 1:
Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work. This will involve pulling a new book from the Project Gutenberg corpus (`print(gutenberg.fileids())` for a list) and processing it.

## Getting the data into shape

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter

In [58]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove any text enclosed in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [59]:
alice[:100]

'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to'

In [60]:
type(alice)

str

In [61]:
print(len(persuasion))
print(len(alice))

462818
141708


In [62]:
# Cut Persuasion to about the same length as Alice to speed things up.
persuasion = persuasion[:141709]

In [63]:
# Parse the cleaned novels. This can take a bit.
import en_core_web_sm
nlp = en_core_web_sm.load()
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [92]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


In [93]:
len(sentences)

2624

In [66]:
type(sentences.iloc[0,0])

spacy.tokens.span.Span

In [67]:
sent = list(sentences.iloc[0,0])
print(sent)

[Alice, was, beginning, to, get, very, tired, of, sitting, by, her, sister, on, the, bank, ,, and, of, having, nothing, to, do, :, once, or, twice, she, had, peeped, into, the, book, her, sister, was, reading, ,, but, it, had, no, pictures, or, conversations, in, it, ,, ', and, what, is, the, use, of, a, book, ,, ', thought, Alice, ', without, pictures, or, conversation, ?, ']


In [68]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
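    # Scaffold the data frame and initialize counts to zero.
    # The columns are passed as a list because recent pandas versions
    # do not accept a set here.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0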
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 500 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

In [69]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500


Unnamed: 0,dreadfully,place,conquest,independent,knot,equality,ridicule,foot,valet,wrinkle,...,pretty,require,grace,usual,fortunately,christmas,sweet,bound,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll
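The row-by-row `df.loc[i, word] += 1` updates in `bow_features` are a likely reason this cell takes a while. A possibly faster variant, sketched here under the assumption that each sentence has already been reduced to a list of lemma strings, counts each sentence with `Counter` and builds all rows before handing them to pandas. The function name `count_rows` and the toy data are hypothetical.

```python
from collections import Counter


def count_rows(sentence_lemmas, common_words):
    """Build one {word: count} dict per sentence, restricted to common_words."""
    vocab = set(common_words)   # a set gives fast membership tests
    columns = sorted(vocab)     # fixed column order for the table
    rows = []
    for lemmas in sentence_lemmas:
        counts = Counter(w for w in lemmas if w in vocab)
        rows.append({word: counts.get(word, 0) for word in columns})
    return columns, rows


# Toy input standing in for the lemmatized, filtered spaCy sentences.
toy_sentences = [["alice", "tired", "sister", "alice"], ["oh", "dear"]]
columns, rows = count_rows(toy_sentences, ["alice", "dear", "sister"])
print(columns)
print(rows)
# With pandas available: pd.DataFrame(rows, columns=columns)
```

Building the rows first and constructing the data frame once avoids repeated single-cell writes; whether it is faster on the full corpus would need to be measured.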


In [70]:
from sklearn.model_selection import train_test_split

Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence', 'text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

In [71]:
X.shape

(2624, 3155)

# Challenge 0

## Logistic regression

In [72]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(1574, 3155) (1574,)
Training set score: 0.9752223634053367

Test set score: 0.8923809523809524


## Linear SVM

In [73]:
from sklearn import metrics
from sklearn.svm import SVC

# Create instance and fit
sv = SVC(kernel='linear')
sv.fit(X_train, y_train)

# Apply to testing data
y_hat = sv.predict(X_test)

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sv.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat))

Accuracy is: 0.875
             precision    recall  f1-score   support

     Austen       0.89      0.77      0.82       396
    Carroll       0.87      0.94      0.90       654

avg / total       0.88      0.88      0.87      1050



## Stochastic gradient descent

In [74]:
from sklearn.linear_model import SGDClassifier
from sklearn import metrics


# Create instance and fit
sgdc = SGDClassifier(loss = 'hinge')
sgdc.fit(X_train, y_train)

# Apply to testing data
y_hat = sgdc.predict(X_test)


# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sgdc.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat))
print(cross)

Accuracy is: 0.870
             precision    recall  f1-score   support

     Austen       0.82      0.84      0.83       396
    Carroll       0.90      0.89      0.90       654

avg / total       0.87      0.87      0.87      1050

text_source  Austen  Carroll
row_0                       
Austen          331       71
Carroll          65      583


In [24]:
from sklearn import model_selection

params = {'loss': ['log'],
          'penalty': ['l2'],
          'alpha': [0.0001, 0.00001],
          'average': [True, False],
          #'class_weight': ['balanced', None],
          'learning_rate':['optimal', 'invscaling'],
          # Tried 0.5 in previous grid
          'power_t': [1.5],
          'eta0': [1],
          # 'max_iter' was called 'n_iter' in older scikit-learn releases.
          'max_iter': [5, 100]
          #'tol': [0.001, 0.0001],
         }


# Initialize the model
sgdc = SGDClassifier()

# Apply GridSearch to the model
grid = model_selection.GridSearchCV(sgdc, params)
grid.fit(X_train, y_train)

# Save instance for CV
sgd_best = grid.best_estimator_
print(grid.best_estimator_)

# Metrics
print("Accuracy is: ", grid.score(X_test, y_test))

y_hat = grid.predict(X_test)
print(metrics.classification_report(y_test, y_hat))
cross = pd.crosstab(y_hat, y_test)

KeyboardInterrupt: 

## Naive Bayes

In [75]:
from sklearn.naive_bayes import GaussianNB


#Initialize and fit
nb = GaussianNB()
nb.fit(X_train, y_train)

# Apply to testing data
y_hat = nb.predict(X_test)

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % nb.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat))
print(cross)


Accuracy is: 0.909
             precision    recall  f1-score   support

     Austen       0.96      0.79      0.87       396
    Carroll       0.89      0.98      0.93       654

avg / total       0.91      0.91      0.91      1050

text_source  Austen  Carroll
row_0                       
Austen          314       14
Carroll          82      640


## Using tfidf

In [76]:
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


In [77]:
sents = sentences.iloc[:, 0]
# Convert each spaCy Span to its plain text.
sents = sents.apply(str)

In [78]:
sents.head()

0    Alice was beginning to get very tired of sitti...
1    So she was considering in her own mind (as wel...
2    There was nothing so VERY remarkable in that; ...
3                                             Oh dear!
4                                    I shall be late!'
Name: 0, dtype: object

In [79]:
X_train, X_test, y_train, y_test = train_test_split(sents, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create tf-idf matrix on train data
vectorizer = TfidfVectorizer(min_df=1, strip_accents='ascii', analyzer='word', lowercase=True,
                             ngram_range=(1,2), use_idf=True, binary=False, sublinear_tf=True)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)


In [81]:
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

#Initialize and fit
nb = GaussianNB()
nb.fit(X_train.toarray(), y_train)

# Apply to testing data
y_hat = nb.predict(X_test.toarray())

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % nb.score(X_test.toarray(), y_test))
print(metrics.classification_report(y_test, y_hat))
print(cross)


Accuracy is: 0.902
             precision    recall  f1-score   support

     Austen       0.89      0.85      0.87       396
    Carroll       0.91      0.93      0.92       654

avg / total       0.90      0.90      0.90      1050

text_source  Austen  Carroll
row_0                       
Austen          336       43
Carroll          60      611


In [82]:
from sklearn.linear_model import SGDClassifier


# Create instance and fit
sgdc = SGDClassifier(loss = 'hinge')
sgdc.fit(X_train, y_train)

# Apply to testing data
y_hat = sgdc.predict(X_test)


# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sgdc.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat))
print(cross)

Accuracy is: 0.921
             precision    recall  f1-score   support

     Austen       0.91      0.88      0.89       396
    Carroll       0.93      0.95      0.94       654

avg / total       0.92      0.92      0.92      1050

text_source  Austen  Carroll
row_0                       
Austen          347       34
Carroll          49      620


In [74]:
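from sklearn import model_selection
from sklearn.base import clone

kf = model_selection.KFold(n_splits=6, shuffle=True)

for count, (train_index, test_index) in enumerate(kf.split(sents, Y)):
    # Refit a copy of the vectorizer on each training fold so the fitted
    # vectorizer from the previous cells is left untouched.
    fold_vectorizer = clone(vectorizer)
    Xk_train = fold_vectorizer.fit_transform(sents.iloc[train_index])
    Xk_test = fold_vectorizer.transform(sents.iloc[test_index])
    yk_train, yk_test = Y.iloc[train_index], Y.iloc[test_index]
    # Same default hinge-loss SGD classifier as above, under a separate name
    # so the model fitted earlier is not overwritten.
    fold_sgdc = SGDClassifier(loss='hinge')
    fold_sgdc.fit(Xk_train, yk_train)
    print('Score for iteration {} is {}'.format(count, fold_sgdc.score(Xk_test, yk_test)))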


Score for iteration 0 is 0.9246575342465754
Score for iteration 1 is 0.8972602739726028
Score for iteration 2 is 0.919908466819222
Score for iteration 3 is 0.9016018306636155
Score for iteration 4 is 0.9130434782608695
Score for iteration 5 is 0.9382151029748284
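The fold scores above come from refitting the vectorizer by hand inside the loop. A more compact way to get the same kind of estimate, sketched here on a toy corpus, is to wrap the vectorizer and classifier in a pipeline and pass it to `cross_val_score`, so that tf-idf is fit only on the training part of each fold. The toy sentences and labels are made up for illustration, and the parameters are assumptions rather than tuned settings from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-ins for `sents` and `Y` (invented sentences, not corpus text).
toy_sents = [
    "Oh dear! I shall be late!",
    "The rabbit took a watch out of its pocket.",
    "Alice fell down the rabbit hole.",
    "Curiouser and curiouser, cried Alice.",
    "The Queen shouted at the poor gardeners.",
    "The Cat only grinned when it saw Alice.",
    "Sir Walter Elliot was a man of vanity.",
    "Anne hoped she had outlived the age of blushing.",
    "Lady Russell was fond of Anne.",
    "Captain Wentworth had no fortune.",
    "The Musgroves were a friendly family.",
    "Anne walked to Uppercross with her sister.",
]
toy_labels = ["Carroll"] * 6 + ["Austen"] * 6

# The pipeline refits the vectorizer on the training part of every fold.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    SGDClassifier(loss="hinge", random_state=0),
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, toy_sents, toy_labels, cv=cv)
print(scores, scores.mean())
```

With the notebook's data, `toy_sents` and `toy_labels` would be replaced by `sents` and `Y`.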


# Challenge 1

Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work. This will involve pulling a new book from the Project Gutenberg corpus (`print(gutenberg.fileids())` for a list) and processing it.



In [22]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [23]:
walt = gutenberg.raw('whitman-leaves.txt')
print(walt[0:1000])

[Leaves of Grass by Walt Whitman 1855]


Come, said my soul,
Such verses for my Body let us write, (for we are one,)
That should I after return,
Or, long, long hence, in other spheres,
There to some group of mates the chants resuming,
(Tallying Earth's soil, trees, winds, tumultuous waves,)
Ever with pleas'd smile I may keep on,
Ever and ever yet the verses owning--as, first, I here and now
Signing for Soul and Body, set to them my name,

Walt Whitman



[BOOK I.  INSCRIPTIONS]

}  One's-Self I Sing

One's-self I sing, a simple separate person,
Yet utter the word Democratic, the word En-Masse.

Of physiology from top to toe I sing,
Not physiognomy alone nor brain alone is worthy for the Muse, I say
    the Form complete is worthier far,
The Female equally with the Male I sing.

Of Life immense in passion, pulse, and power,
Cheerful, for freest action form'd under the laws divine,
The Modern Man I sing.



}  As I Ponder'd in Silence

As I ponder'd in silence,
Returning upon my poems, c

In [24]:

# No separate chapter handling is needed here: text_cleaner already strips
# bracketed headings such as '[BOOK I.  INSCRIPTIONS]'.
    
walt = text_cleaner(walt)
walt[0:100]

'Come, said my soul, Such verses for my Body let us write, (for we are one,) That should I after retu'

In [26]:
# Use fewer words, Walt: truncate to about the same length as Alice.
walt = walt[:141709]

In [94]:
walt_doc = nlp(walt)

# Group into sentences.
walt_sents = [[sent, "Whitman"] for sent in walt_doc.sents]

# Combine the sentences from Carroll and Whitman.
sentences = pd.DataFrame(alice_sents + walt_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


In [104]:
# Share of the most frequent author (Carroll): the majority-class baseline.
sentences.iloc[:, 1].value_counts(normalize=True).iloc[0]

0.5962843872811718

In [96]:
# # Name columns properly
# sentences.columns = ['Text', 'Author']

# # Remove Austen
# sentences_Carroll = sentences.loc[sentences['Author']!="Austen", :]

In [97]:
# Create X and y (based on all observations)
sents = sentences.iloc[:, 0]
Y = sentences.iloc[:,1]
# Convert each spaCy Span to its plain text.
sents = sents.apply(str)



In [98]:
#X_train, X_test, y_train, y_test = train_test_split(sents, 
#                                                    Y,
#                                                    test_size=0.4,
#                                                    random_state=0)





In [99]:
# Create tf-idf matrix on train data
#vectorizer = TfidfVectorizer(min_df=1, strip_accents='ascii', analyzer='word', lowercase=True,
#                             ngram_range=(1,2), use_idf=True, binary=False, sublinear_tf=True)
#X_train = vectorizer.fit_transform(X_train)

# Applying current vectorizer to new corpus 
X_test = vectorizer.transform(sents)


In [100]:
X_test.shape

(2799, 20970)

In [101]:
Y.shape

(2799,)

In [102]:
from sklearn.linear_model import SGDClassifier

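# Apply the classifier trained on Carroll vs. Austen to the new data.
# It can only predict those two labels, so every Whitman sentence
# is counted as an error in the scores below.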
y_hat = sgdc.predict(X_test)

# Showing model performance
cross = pd.crosstab(y_hat, Y)
print("Accuracy is: %0.3f" % sgdc.score(X_test, Y))
print(metrics.classification_report(Y, y_hat))
print(cross)

Accuracy is: 0.583




             precision    recall  f1-score   support

     Austen       0.00      0.00      0.00         0
    Carroll       0.70      0.98      0.82      1669
    Whitman       0.00      0.00      0.00      1130

avg / total       0.42      0.58      0.49      2799

1        Carroll  Whitman
row_0                    
Austen        36      436
Carroll     1633      694
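The 0.583 accuracy above is hard to interpret because the classifier was trained on the labels `Austen` and `Carroll` but scored against `Carroll` and `Whitman`, so it can never get a Whitman sentence right. One way to read the result as "Carroll vs. any other work" is to collapse both the predictions and the true labels to Carroll / not Carroll. The sketch below does that arithmetic directly on the four counts from the crosstab printed above; the variable names are illustrative.

```python
# Counts copied from the crosstab above: (predicted, actual) -> count.
confusion = {
    ("Austen", "Carroll"): 36,
    ("Austen", "Whitman"): 436,
    ("Carroll", "Carroll"): 1633,
    ("Carroll", "Whitman"): 694,
}

total = sum(confusion.values())

# A prediction counts as correct when "is Carroll" agrees on both sides.
correct = sum(
    n for (pred, actual), n in confusion.items()
    if (pred == "Carroll") == (actual == "Carroll")
)
binary_accuracy = correct / total

# Majority-class baseline: always answer "Carroll".
n_carroll = sum(n for (_, actual), n in confusion.items() if actual == "Carroll")
baseline = n_carroll / total

print("Carroll vs. other accuracy: %0.3f" % binary_accuracy)
print("Always-Carroll baseline:    %0.3f" % baseline)
```

On these counts the binary accuracy works out to about 0.739, against a baseline of about 0.596 for always answering Carroll. In the notebook, the same relabelling could be applied to `y_hat` and `Y` before calling the metrics.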
