## Challenge: Build Your Own NLP Model

For this challenge, choose a corpus from nltk or another source that includes categories you can predict, then build an analysis pipeline with the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods, for example BoW vs. tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.
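
To make the BoW vs. tf-idf contrast in step 2 concrete, here is a minimal sketch on toy data. The three token lists are hypothetical and not taken from the corpus, and the idf formula is one common smoothed variant; in practice a library class such as scikit-learn's `TfidfVectorizer` would normally do this work.

```python
import math
from collections import Counter

# Toy, already-tokenized "sentences" (hypothetical data, not from the corpus).
docs = [
    ["emma", "be", "clever", "and", "rich"],
    ["anne", "be", "gentle"],
    ["elinor", "be", "sensible", "and", "calm"],
]

vocab = sorted({word for doc in docs for word in doc})
counts = [Counter(doc) for doc in docs]

# Bag of words: raw term counts per document.
bow = [[c[word] for word in vocab] for c in counts]

# Document frequency: number of documents that contain each word.
n_docs = len(docs)
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed inverse document frequency (assumed variant): ln((1+n)/(1+df)) + 1.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

# tf-idf: the same counts, reweighted so ubiquitous words count for less.
tfidf = [[count * idf[word] for count, word in zip(row, vocab)] for row in bow]

be, emma = vocab.index("be"), vocab.index("emma")
print("BoW    :", bow[0][be], bow[0][emma])
print("tf-idf :", round(tfidf[0][be], 3), round(tfidf[0][emma], 3))
```

In the first toy sentence both words have a BoW count of 1, but tf-idf gives "emma" (found in one document) more weight than "be" (found in all three).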

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn
import spacy
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/HeatherKacmarski/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [2]:
# gutenberg and stopwords were already imported in the first cell,
# so we only need to list the texts available in the corpus.
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [3]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
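    # Drop anything in square brackets; the raw string avoids
    # invalid-escape warnings for '\[' and '\]'.
    text = re.sub(r"\[.*?\]", "", text)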
    text = ' '.join(text.split())
    return text
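
A quick check of what the cleaner does, as a minimal sketch. The sample string is made up for illustration, and the function is repeated here so the snippet runs on its own.

```python
import re

def text_cleaner(text):
    # Replace the double dash, drop bracketed text, collapse whitespace.
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)
    return ' '.join(text.split())

# Hypothetical sample with a bracketed header, a double dash and extra spaces.
sample = "[A Title by Some Author 1800]\n\nShe was happy--very   happy."
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that `.` does not match newlines by default, so only bracketed text that sits on a single line is removed.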

In [4]:
emma = gutenberg.raw('austen-emma.txt')
persuasion = gutenberg.raw('austen-persuasion.txt')
sense = gutenberg.raw('austen-sense.txt')

In [11]:
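# The chapter indicator is idiosyncratic, so each book gets its own pattern.
# Note: the excerpts printed further down still begin with
# 'VOLUME I CHAPTER I' (Emma) and 'Chapter 1' (Persuasion), so the first
# two patterns do not match the heading style those books actually use.
emma = re.sub(r'Chapter \d+', '', emma)
persuasion = re.sub(r'CHAPTER .*', '', persuasion)
sense = re.sub(r'CHAPTER .*','',sense)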
    
emma = text_cleaner(emma[:int(len(emma)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])
sense = text_cleaner(sense[:int(len(sense)/10)])
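
The excerpts printed further down still start with "VOLUME I CHAPTER I" and "Chapter 1", which suggests the heading patterns miss some books. Below is a hedged sketch of a single pattern that covers both styles seen in those excerpts. `HEADING` and `strip_headings` are hypothetical names introduced here, and the pattern has not been run against the full texts.

```python
import re

# Matches 'VOLUME I', 'CHAPTER I', 'CHAPTER 12', 'Chapter 1', and so on.
# Assumption: headings use either digits or upper-case Roman numerals.
HEADING = re.compile(r'\b(?:VOLUME|CHAPTER|Chapter)\s+(?:\d+|[IVXLC]+)\b')

def strip_headings(text):
    # Replace each heading with a space, then collapse the whitespace.
    return ' '.join(HEADING.sub(' ', text).split())

samples = [
    "VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich",
    "Chapter 1 Sir Walter Elliot, of Kellynch Hall",
]
stripped = [strip_headings(s) for s in samples]
for line in stripped:
    print(line)
```

One caveat: the pattern would also remove a phrase like "Chapter 3" if it appeared inside ordinary prose, so it is worth spot-checking the result.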

In [12]:
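# Parse using spaCy. The 'en' shortcut works on spaCy 2.x; on spaCy 3+
# load the model by its full name instead, e.g. spacy.load('en_core_web_sm').
nlp = spacy.load('en')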
emma_doc = nlp(emma)
persuasion_doc = nlp(persuasion)
sense_doc = nlp(sense)

In [13]:
# Group into sentences
emma_sentence = [[sent, 'emma'] for sent in emma_doc.sents]
persuasion_sentence = [[sent, 'persuasion'] for sent in persuasion_doc.sents]
sense_sentence = [[sent, 'sense'] for sent in sense_doc.sents]

# Combine
sentences = pd.DataFrame(emma_sentence + persuasion_sentence + sense_sentence)
sentences.head()

| | 0 | 1 |
|---|---|---|
| 0 | (VOLUME) | emma |
| 1 | (I, CHAPTER) | emma |
| 2 | (I, Emma, Woodhouse, ,, handsome, ,, clever, ,... | emma |
| 3 | (She, was, the, youngest, of, the, two, daught... | emma |
| 4 | (Her, mother, had, died, too, long, ago, for, ... | emma |


In [14]:
# Look at excerpts from each book
print(emma_doc[:100])
print('\nEmma book length:', len(emma_doc))

print(persuasion_doc[:100])
print('\nPersuasion book length:', len(persuasion_doc))

print(sense_doc[:100])
print('\nSense book length:', len(sense_doc))

VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her

Emma book length: 19074
Chapter 1 Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who, for his own amusement, never took up any book but the Baronetage; there he found occupation for an idle hour, and consolation in a distressed one; there his faculties were roused into admiration and respect, by contemplating the limited remnant of the earliest patents; there any unwelcome sensations, arising from domestic affairs changed naturally into pity and contempt as he turned over the almost endles

Time to get into NLP! Let's start with bag-of-words features.

In [15]:
# Create bag of words function for each text
def bag_of_words(text):
    
    # filter out punctuation and stop words
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return most common words
    return [item[0] for item in Counter(allwords).most_common(500)]

# Get bags 
emma_words = bag_of_words(emma_doc)
persuasion_words = bag_of_words(persuasion_doc)
sense_words = bag_of_words(sense_doc)

# Combine bags to create common set of unique words
common_words = set(emma_words + persuasion_words + sense_words)
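
To see why the combined vocabulary ends up smaller than the sum of the per-book lists (934 feature columns from three lists of 500 words, going by the shape printed later), here is a minimal sketch with hypothetical lemma lists standing in for the filtered spaCy output. `top_words` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]

# Hypothetical lemma lists (not real corpus data).
emma_tokens = ["emma", "emma", "emma", "father", "father", "house"]
persuasion_tokens = ["anne", "anne", "anne", "father", "father", "walter"]

emma_top = top_words(emma_tokens, 2)
persuasion_top = top_words(persuasion_tokens, 2)

# Words shared between books collapse into one entry in the set.
common = set(emma_top + persuasion_top)
print(emma_top, persuasion_top, sorted(common))
```

Two lists of two words give only three unique features here, because "father" is frequent in both.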

In [16]:
# Create bag of words data frame using combined common words and sentences
def bow_features(sentences, common_words):
    
    # Build data frame. Recent pandas versions need an ordered list of
    # column labels rather than a set.
    columns = list(common_words)
    Jane_Austen = pd.DataFrame(columns=columns)
    Jane_Austen['text_sentence'] = sentences[0]
    Jane_Austen['text_source'] = sentences[1]
    Jane_Austen.loc[:, columns] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    # The loop variable is named 'sentence' so it does not shadow the
    # 'sentences' argument.
    for i, sentence in enumerate(Jane_Austen['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            Jane_Austen.loc[i, word] += 1
    
    return Jane_Austen
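
Updating one cell at a time with `.loc` gets slow as the number of sentences grows. As a hedged alternative, the sketch below counts each sentence with a `Counter` and builds every row in one pass. It uses plain token lists (hypothetical data) instead of spaCy objects, and `count_rows` is an illustrative name.

```python
from collections import Counter

def count_rows(sentences, vocabulary):
    """Return (columns, rows), where each row holds per-word counts."""
    columns = sorted(vocabulary)
    rows = []
    for tokens in sentences:
        # Count only the tokens that belong to the vocabulary.
        counts = Counter(t for t in tokens if t in vocabulary)
        rows.append([counts[c] for c in columns])
    return columns, rows

# Hypothetical lemmatized sentences and vocabulary.
vocabulary = {"emma", "father", "house"}
toy_sentences = [["emma", "love", "father"], ["house", "house", "emma"]]

columns, rows = count_rows(toy_sentences, vocabulary)
print(columns)
print(rows)
```

The result can be wrapped in a frame with `pd.DataFrame(rows, columns=columns)` before adding the sentence and source columns.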

In [17]:
# Create our bag-of-words (bow) features
bow = bow_features(sentences, common_words)
bow.head()

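| | eld | mind | Elizabeth | death | unkind | laugh | sanguine | succeed | satisfied | safe | ... | poor | adopt | cottage | boil | Colonel | change | situation | fix | text_sentence | text_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (VOLUME) | emma |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (I, CHAPTER) | emma |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (I, Emma, Woodhouse, ,, handsome, ,, clever, ,... | emma |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (She, was, the, youngest, of, the, two, daught... | emma |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Her, mother, had, died, too, long, ago, for, ... | emma |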


In [21]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = bow['text_source']
# Pass axis as a keyword; the positional form is not accepted by recent pandas.
X = np.array(bow.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9497584541062802

Test set score: 0.7076700434153401




### BOW with Logistic Regression

In [22]:
from sklearn.linear_model import LogisticRegression

# l2 is the default penalty; it is spelled out here for demonstration.
lr = LogisticRegression(penalty='l2')
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(1035, 934) (1035,)
Training set score: 0.9053140096618357

Test set score: 0.7308248914616498




### BOW with Gradient Boosting

In [23]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.8231884057971014

Test set score: 0.7004341534008683


Let's pick a model and try to raise its accuracy by 5 percentage points. We will use logistic regression and enlarge the bag-of-words vocabulary from the 500 to the 1000 most common words per book.

In [24]:
# Update function to include 1000 most common words
def bag_of_words(text):
    
    # filter out punctuation and stop words
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return most common words
    return [item[0] for item in Counter(allwords).most_common(1000)]

emma_words = bag_of_words(emma_doc)
persuasion_words = bag_of_words(persuasion_doc)
sense_words = bag_of_words(sense_doc)

# Combine bags to create common set of unique words
common_words = set(emma_words + persuasion_words + sense_words)

In [25]:
# Create bow features 
updated_bow = bow_features(sentences, common_words)

In [26]:
updated_bow.head()

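| | Elizabeth | couple | walking | bride | join | laugh | sanguine | satisfied | opening | remembrance | ... | deal | plan | choice | mile | poor | deem | Colonel | attention | text_sentence | text_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (VOLUME) | emma |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (I, CHAPTER) | emma |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (I, Emma, Woodhouse, ,, handsome, ,, clever, ,... | emma |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (She, was, the, youngest, of, the, two, daught... | emma |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Her, mother, had, died, too, long, ago, for, ... | emma |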


In [28]:
from sklearn.model_selection import cross_val_score

X1 = updated_bow.drop(['text_sentence', 'text_source'], axis=1)
Y1 = updated_bow['text_source']

# Rerun logistic regression on the larger BoW. cross_val_score clones and
# fits the model on each fold, so no separate fit is needed; run it once
# and reuse the scores.
lr = LogisticRegression()
scores = cross_val_score(lr, X1, Y1, cv=5)
print('BoW (big) Logistic Regression Scores: ', scores)
print('Avg. Score ', np.mean(scores))



BoW (big) Logistic Regression Scores:  [0.71181556 0.69275362 0.72173913 0.70144928 0.69186047]
Avg. Score  0.7039236112122881
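
To check the result against the 5-point goal, here is a small sketch that summarizes the fold scores printed above and compares them with the earlier logistic regression test score. The numbers are copied from the outputs in this notebook. Keep in mind that the baseline is a single 60/40 hold-out score while this is a 5-fold average, so the comparison is only indicative.

```python
from statistics import mean, stdev

# Fold scores and baseline copied from the outputs in this notebook.
cv_scores = [0.71181556, 0.69275362, 0.72173913, 0.70144928, 0.69186047]
baseline_test_score = 0.7308248914616498  # logistic regression, 500-word BoW

cv_mean = mean(cv_scores)
cv_spread = stdev(cv_scores)

# Difference expressed in percentage points, not percent.
gain_points = (cv_mean - baseline_test_score) * 100

print(f"CV mean: {cv_mean:.4f} (spread {cv_spread:.4f})")
print(f"Change vs. baseline: {gain_points:+.1f} percentage points")
print("Target of +5 points met:", gain_points >= 5)
```

On these numbers the larger vocabulary did not reach the target; the cross-validated average sits slightly below the earlier hold-out score.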




In [31]:
rfc1 = ensemble.RandomForestClassifier()
Y1 = updated_bow['text_source']
X1 = np.array(updated_bow.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X1, 
                                                    Y1,
                                                    test_size=0.4,
                                                    random_state=0)
# Fit and score the new forest (rfc1) rather than reusing the earlier rfc.
train = rfc1.fit(X_train, y_train)

print('Training set score:', rfc1.score(X_train, y_train))
print('\nTest set score:', rfc1.score(X_test, y_test))

Training set score: 0.9623188405797102

Test set score: 0.6989869753979739


In [32]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.8280193236714976

Test set score: 0.6989869753979739
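
To compare the two vocabulary sizes side by side, the sketch below tabulates the training and test scores reported above for the tree-based models and computes the train-test gap and the change in test accuracy. The figures are copied from this notebook's outputs. Neither model was seeded, so small differences may just be run-to-run noise.

```python
# (train score, test score) copied from the outputs above.
results = {
    "Random forest": {
        500: (0.9497584541062802, 0.7076700434153401),
        1000: (0.9623188405797102, 0.6989869753979739),
    },
    "Gradient boosting": {
        500: (0.8231884057971014, 0.7004341534008683),
        1000: (0.8280193236714976, 0.6989869753979739),
    },
}

summary = {}
for model, runs in results.items():
    # Train-test gap as a rough indicator of overfitting.
    gap_small = runs[500][0] - runs[500][1]
    gap_large = runs[1000][0] - runs[1000][1]
    # Change in test accuracy, in percentage points.
    test_change = (runs[1000][1] - runs[500][1]) * 100
    summary[model] = (gap_small, gap_large, test_change)
    print(f"{model}: gap {gap_small:.3f} -> {gap_large:.3f}, "
          f"test change {test_change:+.2f} points")
```

For both models the larger vocabulary widens the train-test gap slightly without improving test accuracy, which hints that tuning or regularization may help more than adding features.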
