# 4.4.5 Challenge: Build your own NLP model
For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.

In [3]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
import csv
import webbrowser
import io

# NLP 
import nltk
import spacy
from nltk.corpus import gutenberg, stopwords
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
nltk.download('inaugural')
from nltk.corpus import inaugural
print(list(inaugural.fileids()))

[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\kgrosse\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reaga

In [5]:
#compare two presidential speeches
rsvlt = inaugural.raw('1905-Roosevelt.txt')
taft = inaugural.raw('1909-Taft.txt')

In [6]:
# Parse with spacy language package
nlp = spacy.load('en_core_web_sm')
rsvlt_doc = nlp(rsvlt)
taft_doc = nlp(taft)

In [7]:
# Group into sentences and create data frame of sentences
rsvlt_sents = [[sent, 'Roosevelt'] for sent in rsvlt_doc.sents]
taft_sents = [[sent, 'Taft'] for sent in taft_doc.sents]

# Combine
sentences = pd.DataFrame(rsvlt_sents + taft_sents)
sentences.head()

Unnamed: 0,0,1
0,"(My, fellow, citizens, ,, no, people, on, eart...",Roosevelt
1,"(To, us, as, a, people, it, has, been, granted...",Roosevelt
2,"(We, are, the, heirs, of, the, ages, ,, and, y...",Roosevelt
3,"(We, have, not, been, obliged, to, fight, for,...",Roosevelt
4,"(Under, such, conditions, it, would, be, our, ...",Roosevelt


In [11]:
print(taft_doc[:100])
print('\nTaft speech length:', len(taft_doc))

My fellow citizens: Anyone who has taken the oath I have just taken must feel a heavy weight of responsibility. If not, he has no conception of the powers and duties of the office upon which he is about to enter, or he is lacking in a proper sense of the obligation which the oath imposes.

The office of an inaugural address is to give a summary outline of the main policies of the new administration, so far as they can be anticipated. I have had the honor to be one of

Taft speech length: 5888


In [12]:
print(rsvlt_doc[:100])
print('\nRoosevelt speech length:', len(rsvlt_doc))

My fellow citizens, no people on earth have more cause to be thankful than ours, and this is said reverently, in no spirit of boastfulness in our own strength, but with gratitude to the Giver of Good who has blessed us with the conditions which have enabled us to achieve so large a measure of well-being and of happiness. To us as a people it has been granted to lay the foundations of our national life in a new continent. We are the heirs of the ages, and yet we have

Roosevelt speech length: 1094


# Create features using two different NLP methods

## 1. Bag of Words

In [13]:
def bow(text):
    # Keep the lemmas of tokens that are neither punctuation nor stop words
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    # Return the 500 most common lemmas
    return [item[0] for item in Counter(allwords).most_common(500)]

#bags of words
rsvlt_words = bow(rsvlt_doc)
taft_words = bow(taft_doc)

In [14]:
#combine bags
common_words = set(rsvlt_words+taft_words)

In [21]:
print(type(common_words))

<class 'set'>
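
To make the vocabulary step concrete, here is a minimal, self-contained sketch of the same idea on toy data. The token lists and the helper name `top_words` are made up for illustration and stand in for the lemmatized spaCy tokens used above.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]

# Hypothetical, already-lemmatized tokens standing in for two documents.
doc_a = ["people", "nation", "people", "free", "nation", "people"]
doc_b = ["tariff", "law", "nation", "law", "tariff", "law"]

vocab_a = top_words(doc_a, 2)
vocab_b = top_words(doc_b, 2)

# The union removes duplicates; sorting gives a reproducible column order,
# which a plain set does not guarantee.
common = sorted(set(vocab_a + vocab_b))
print(common)
```

Sorting the union is a design choice for this sketch: it keeps the feature columns in the same order from run to run.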


In [17]:
# Create a BoW data frame: one count column per common word, one row per sentence
def bow_features(sentences, common_words):
    # Use a list for column labels: recent pandas versions do not accept a set as an indexer
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0

    # Process each row
    for i, sentence in enumerate(df['text_sentence']):
        # Convert the sentence to lemmas and keep only words in the vocabulary
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        # Populate the row with word counts
        for word in words:
            df.loc[i, word] += 1

        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))

    return df
            

In [18]:
# Create bow features 
bow = bow_features(sentences, common_words)
bow.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150


Unnamed: 0,free,interfere,continue,protection,expeditionary,prime,suppression,everyday,principle,scale,...,line,secure,banking,intense,financial,control,thankful,express,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"(My, fellow, citizens, ,, no, people, on, eart...",Roosevelt
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(To, us, as, a, people, it, has, been, granted...",Roosevelt
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(We, are, the, heirs, of, the, ages, ,, and, y...",Roosevelt
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(We, have, not, been, obliged, to, fight, for,...",Roosevelt
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Under, such, conditions, it, would, be, our, ...",Roosevelt
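
Updating a data frame one cell at a time with `df.loc[i, word] += 1` tends to be slow. As an alternative sketch (not the approach used in this notebook), each sentence can be turned into a count vector with `Counter` first. The helper name `count_vector` and the toy tokens below are assumptions for illustration.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent words need no special case.
    return [counts[word] for word in vocabulary]

vocabulary = ["law", "nation", "people", "tariff"]
sentence_tokens = [
    ["people", "nation", "people"],
    ["tariff", "law", "court"],  # "court" is outside the vocabulary and is ignored
]

rows = [count_vector(tokens, vocabulary) for tokens in sentence_tokens]
print(rows)
```

The resulting `rows` could then be wrapped in one step with `pd.DataFrame(rows, columns=vocabulary)`.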


## 2. TF-IDF Features (Unsupervised)

In [23]:
#get sentences
rsvlt_sents = inaugural.sents('1905-Roosevelt.txt')
taft_sents = inaugural.sents('1909-Taft.txt')

#create lists of sentences and join them
rsvlt_list = [" ".join(sent) for sent in rsvlt_sents]
taft_list = [" ".join(sent) for sent in taft_sents]
all_sents = rsvlt_list+taft_list

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(all_sents, test_size=0.4, random_state=0)
#Vectorize
vectorizor = TfidfVectorizer(max_df=0.5,
                             min_df=2,
                             stop_words='english',
                             lowercase=True,
                             use_idf=True,
                             norm=u'l2',
                             smooth_idf=True)

In [27]:
#apply the vectorizer
taft_rsvlt_tfidf=vectorizor.fit_transform(all_sents)
print('Number of features: %d'% taft_rsvlt_tfidf.get_shape()[1])

Number of features: 474
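
To show what the vectorizer settings `use_idf=True`, `smooth_idf=True`, and `norm='l2'` compute, the following is a hand-rolled sketch on toy data. It uses raw term counts, the smoothed weight idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 row normalization. It does not model `max_df`, `min_df`, or stop-word removal, and the function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalized tf-idf rows)."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocabulary, rows

docs = [["free", "people"], ["free", "tariff", "tariff"]]
vocabulary, rows = tfidf_rows(docs)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first toy document, "people" appears in only one document and so receives a higher weight than "free", which appears in both.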


In [30]:
# Make sure the TF-IDF matrix is stored in CSR (compressed sparse row) format
taft_rsvlt_tfidf_csr = taft_rsvlt_tfidf.tocsr()

In [28]:
#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(taft_rsvlt_tfidf, test_size=0.4, random_state=0)

In [29]:
# Make sure the training matrix is stored in CSR (compressed sparse row) format
X_train_tfidf_csr = X_train_tfidf.tocsr()

# Supervised Learning Models

In [32]:
from sklearn.model_selection import cross_val_score

#create inputs

#BOW
X_bow = bow.drop(['text_sentence', 'text_source'], axis=1)
Y_bow = bow['text_source']

#Tfidf
X_tfidf = taft_rsvlt_tfidf_csr
Y_tfidf = ['Roosevelt']*len(rsvlt_list)+['Taft']*len(taft_list)

## Logistic Regression

In [38]:
from sklearn.linear_model import LogisticRegression


# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

#Bow
lr = LogisticRegression()
lr_bow = lr.fit(X_bow,Y_bow)
print('BOW LR Score: ', cross_val_score(lr_bow,X_bow,Y_bow, cv=10).mean())

#Tfidf
lr_tfidf = lr.fit(X_tfidf,Y_tfidf)
print('Tfidf LR Score: ', cross_val_score(lr_tfidf,X_tfidf,Y_tfidf, cv=10).mean())

BOW LR Score:  0.8423558897243109
Tfidf LR Score:  0.8285964912280702
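
`cross_val_score` fits a fresh clone of the estimator on each training fold and scores it on the held-out fold, so the scores do not depend on the earlier `fit` call. The sketch below shows only the fold bookkeeping, in plain Python. It uses contiguous, unstratified folds for simplicity, whereas scikit-learn stratifies by class for classifiers when `cv` is an integer. The function name `k_fold_indices` is an assumption for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in exactly one test fold, which is why the mean of the fold scores uses each sentence once for evaluation.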


## Random Forest

In [41]:
from sklearn import ensemble

#Bow
rfc = ensemble.RandomForestClassifier()
rfc_bow = rfc.fit(X_bow,Y_bow)
print('BOW RFC Score: ', cross_val_score(rfc_bow,X_bow,Y_bow, cv=10).mean())

#Tfidf
rfc_tfidf = rfc.fit(X_tfidf,Y_tfidf)
print('Tfidf RFC Score: ', cross_val_score(rfc_tfidf,X_tfidf,Y_tfidf, cv=10).mean())

BOW RFC Score:  0.8328320802005014
Tfidf RFC Score:  0.8499415204678362


## Gradient Boosting

In [42]:
#Bow
gbc = ensemble.GradientBoostingClassifier()
gbc_bow = gbc.fit(X_bow,Y_bow)
print('BOW GBC Score: ', cross_val_score(gbc_bow,X_bow,Y_bow, cv=10).mean())

#Tfidf
gbc_tfidf = gbc.fit(X_tfidf,Y_tfidf)
print('Tfidf GBC Score: ', cross_val_score(gbc_tfidf,X_tfidf,Y_tfidf, cv=10).mean())

BOW GBC Score:  0.8070175438596492
Tfidf GBC Score:  0.8080701754385965
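
The Taft address is several times longer than Roosevelt's (5888 versus 1094 tokens), so the sentence labels are probably imbalanced. In that situation it may help to compare every score above against a majority-class baseline, which is the accuracy of always predicting the more frequent label. The sketch below uses a made-up label list; in this notebook the input would be `Y_bow` or `Y_tfidf`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels for illustration only (not the notebook's real class counts).
labels = ["Taft"] * 8 + ["Roosevelt"] * 2
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

A model is only informative to the extent that its cross-validated accuracy exceeds this baseline.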


# Pick a model and try to increase accuracy by 5 percentage points

In [52]:
# Increase BoW size

# Update function to include 1000 most common words and leave in punctuation
def bow(text):
    
    # Filter out stop words only (punctuation is kept this time)
    allwords = [token.lemma_
                for token in text
                if not token.is_stop]
    
    # Return most common words
    return [item[0] for item in Counter(allwords).most_common(1000)]

# Get bags 
rsvlt_words2 = bow(rsvlt_doc)
taft_words2 = bow(taft_doc)

# Combine bags to create common set of unique words
common_words2 = set(rsvlt_words2 + taft_words2)

In [53]:
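# Create bow features from the enlarged vocabulary (common_words2).
# The output recorded below came from a run that passed common_words instead.
bow2 = bow_features(sentences, common_words2)
bow2.head()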

Processing row 0
Processing row 50
Processing row 100
Processing row 150


Unnamed: 0,free,interfere,continue,protection,expeditionary,prime,suppression,everyday,principle,scale,...,line,secure,banking,intense,financial,control,thankful,express,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"(My, fellow, citizens, ,, no, people, on, eart...",Roosevelt
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(To, us, as, a, people, it, has, been, granted...",Roosevelt
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(We, are, the, heirs, of, the, ages, ,, and, y...",Roosevelt
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(We, have, not, been, obliged, to, fight, for,...",Roosevelt
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Under, such, conditions, it, would, be, our, ...",Roosevelt


In [54]:
#Create new inputs
X_bow2 = bow2.drop(['text_sentence', 'text_source'], axis=1)
Y_bow2 = bow2['text_source']

# Rerun BoW
lr_bow2 = lr.fit(X_bow2, Y_bow2)
print('BOW 2 LR Scores: ', cross_val_score(lr_bow2, X_bow2, Y_bow2, cv=10).mean())

BOW 2 LR Scores:  0.8423558897243109


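Keeping punctuation and raising the vocabulary cap to 1,000 words did not change the logistic regression score (0.8424 in both runs). The feature columns in the recorded output are also identical to the first run, which is what happens when `bow_features` receives `common_words` rather than `common_words2`, so the larger vocabulary has not really been tested yet. Let's try it with another model.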

In [None]:
# Rerun BoW with a different model. The random forest defined earlier is an
# assumed choice here; the text above does not name the model.
rfc_bow2 = rfc.fit(X_bow2, Y_bow2)
print('BOW 2 RFC Score: ', cross_val_score(rfc_bow2, X_bow2, Y_bow2, cv=10).mean())