# 4.4.5 [Challenge: Build NLP Model](https://courses.thinkful.com/data-201v1/project/4.4.5)

**This is the challenge, not the capstone. The capstone notebook is [here](https://github.com/EileenHBO/thinkful_repo/blob/master/unit_4/3_bow_tfidf.ipynb).**

Choose a corpus from nltk for a classification model. The analysis pipeline should include:

1. [Data cleaning / processing / language parsing](#section1)
2. [Create features w/ two NLP methods: e.g. BoW vs tf-idf.](#section2)
3. [Fit classification models for each feature set.](#section3)
4. [Assess your models using cross-validation and determine whether one model performed better.](#section4)
5. [Try to increase accuracy by at least 5 percentage points for 1 model](#section5)

To keep things simple, I took a second look at the content available in NLTK's corpora. First I downloaded the question classification corpus (`qc`). When I tried to parse the dataset, it did not consistently identify the sentence-type tags as either the start of a new sentence or a separate sentence. Cleaning it was taking quite a bit of time, so I thought a different dataset would be safer.

While looking through the corpus I noticed it included inaugural addresses. I thought it might be interesting to see if I could distinguish two presidents' writing styles. On my first attempt, the sample size felt too small. The corpus includes every presidential inaugural address through 2009. To expand the dataset, I decided to group the speeches in some way. In behavioral economics, researchers have compared writing in liberal and conservative newspapers to see whether the same topics are presented with a bias that depends on the source.

If newspapers use different language to describe the same topics, politicians may also speak differently by party.

In [3]:
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import time
%matplotlib inline

from sklearn.svm import SVC
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV, cross_val_score 
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from nltk.corpus import inaugural, stopwords,gutenberg #, genesis, 
from collections import Counter

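# 'en' is a shortcut link from older spaCy releases; newer releases
# require a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')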



______
<a id='section1'></a>

## 1. Data cleaning/processing/language parsing

In [12]:
washington = inaugural.raw(fileids=('1789-Washington.txt', '1793-Washington.txt'))
jefferson = inaugural.raw(fileids=('1801-Jefferson.txt', '1805-Jefferson.txt'))

washington = re.sub(r'Fellow-Citizens of the Senate and of the House of Representatives:','', washington)
jefferson = re.sub(r'Friends and Fellow Citizens:','', jefferson)

washington = ' '.join(washington.split())
jefferson = ' '.join(jefferson.split())

washington = re.sub(r'--','', washington)
jefferson = re.sub(r'--','', jefferson)

In [19]:
washington_doc = nlp(washington)
jefferson_doc = nlp(jefferson)

In [21]:
print('washington_doc: {}'.format(len(washington_doc)))
print('jefferson_doc: {}'.format(len(jefferson_doc)))

washington_doc: 1673
jefferson_doc: 4310


In [65]:
rep = inaugural.raw(fileids=('2001-Bush.txt',
 '2005-Bush.txt','1981-Reagan.txt', '1989-Bush.txt','1969-Nixon.txt'))
dem = inaugural.raw(fileids=('2009-Obama.txt','1993-Clinton.txt',
 '1997-Clinton.txt', '1977-Carter.txt','1961-Kennedy.txt',
 '1965-Johnson.txt',))
dirty_both_parties = rep + dem

In [69]:
inaugural.paras('2001-Bush.txt')

[[['President', 'Clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',', 'the', 'peaceful', 'transfer', 'of', 'authority', 'is', 'rare', 'in', 'history', ',', 'yet', 'common', 'in', 'our', 'country', '.'], ['With', 'a', 'simple', 'oath', ',', 'we', 'affirm', 'old', 'traditions', 'and', 'make', 'new', 'beginnings', '.']], [['As', 'I', 'begin', ',', 'I', 'thank', 'President', 'Clinton', 'for', 'his', 'service', 'to', 'our', 'nation', '.']], ...]

In [66]:
def text_cleaner(text):
    text = re.sub(r'--',' ',text)
    text = re.sub(r'¡¦',' ',text)
    text = re.sub(r'¡§',' ',text)
    text = re.sub(r'¡X',' ',text)
    # Remove any text enclosed in square brackets (raw string avoids invalid-escape warnings).
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text

rep = text_cleaner(rep)
dem = text_cleaner(dem)
both_parties = text_cleaner(dirty_both_parties)
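
To make the effect of `text_cleaner` concrete, here is a minimal, self-contained sketch that applies the same kinds of substitutions to a made-up string. The sample text is an assumption for illustration (it is not taken from the corpus), and the encoding-debris substitutions are left out for brevity.

```python
import re

def text_cleaner(text):
    # Replace double dashes with a space.
    text = re.sub(r'--', ' ', text)
    # Drop anything enclosed in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return ' '.join(text.split())

# Hypothetical sample input, not from the inaugural corpus.
sample = "We continue--but whose end\n we will   not see. [Applause] Thank you."
cleaned = text_cleaner(sample)
print(cleaned)
```

Because the whitespace collapse runs last, any double spaces left behind by the earlier substitutions are cleaned up as well.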

In [48]:
#nlp = spacy.load('en')

# All the processing work is done here, so it may take a while.
dem_doc = nlp(dem)
rep_doc = nlp(rep)
all_doc = nlp(both_parties)

# Group into sentences.
rep_sents = [[sent, "rep"] for sent in rep_doc.sents]
dem_sents = [[sent, "dem"] for sent in dem_doc.sents]

# Combine the sentences from the two parties into one data frame.
sentences = pd.DataFrame(rep_sents + dem_sents)
print('Rep Sentences: {}'.format(len(rep_sents)))
print('Dem Sentences: {}'.format(len(dem_sents)))

Rep Sentences: 581
Dem Sentences: 506


In [76]:
# Utility function to calculate how frequently lemmas appear in the text.
def lemma_frequencies(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    # Build and return a Counter object containing word counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
rep_lemma_freq = lemma_frequencies(rep_doc, include_stop=False).most_common(10)
dem_lemma_freq = lemma_frequencies(dem_doc, include_stop=False).most_common(10)
print('\nRep:', rep_lemma_freq)
print('Dem:', dem_lemma_freq)

# Again, identify the lemmas common to one text but not the other.
rep_lemma_common = [pair[0] for pair in rep_lemma_freq]
dem_lemma_common = [pair[0] for pair in dem_lemma_freq]
print('Unique to Rep:', set(rep_lemma_common) - set(dem_lemma_common))
print('Unique to Dem:', set(dem_lemma_common) - set(rep_lemma_common))


Rep: [('world', 79), ('america', 78), ('nation', 71), ('freedom', 66), ('government', 66), ('people', 62), ('time', 60), ('new', 55), ('great', 55), ("'s", 53)]
Dem: [('new', 70), ('nation', 69), ('world', 63), ('america', 52), ('american', 49), ('people', 49), ('let', 47), ("'s", 45), ('time', 37), ('work', 30)]
Unique to Rep: {'government', 'great', 'freedom'}
Unique to Dem: {'american', 'let', 'work'}


In [79]:
def word_frequencies(text, include_stop=True):
    
    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)

# Use our optional keyword argument to remove stop words.
dem_freq = word_frequencies(dem_doc, include_stop=False).most_common(10)
rep_freq = word_frequencies(rep_doc, include_stop=False).most_common(10)
print('Dem:', dem_freq)
print('Rep:', rep_freq)

# Again, identify the words common to one text but not the other.
rep_common = [pair[0] for pair in rep_freq]
dem_common = [pair[0] for pair in dem_freq]
print('Unique to Rep:', set(rep_common) - set(dem_common))
print('Unique to Dem:', set(dem_common) - set(rep_common))

Dem: [('new', 70), ('world', 63), ('nation', 57), ('america', 51), ('people', 47), ('let', 47), ("'s", 45), ('time', 33), ('today', 28), ('work', 26)]
Rep: [('america', 78), ('world', 77), ('freedom', 66), ('government', 61), ('people', 60), ('nation', 57), ('new', 55), ("'s", 55), ('let', 48), ('time', 47)]
Unique to Rep: {'government', 'freedom'}
Unique to Dem: {'today', 'work'}
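
The "unique to" comparison above is a set difference between two top-k lists. The sketch below reproduces that logic on toy token lists (made up for illustration) so the mechanics are easy to check by hand; `top_words` is a hypothetical helper name, not something from the notebook.

```python
from collections import Counter

def top_words(tokens, k):
    """Return the set of the k most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(k)}

# Toy token lists, made up for illustration.
rep_tokens = ["freedom"] * 3 + ["nation"] * 2 + ["world"]
dem_tokens = ["work"] * 3 + ["nation"] * 2 + ["world"]

rep_top = top_words(rep_tokens, 2)
dem_top = top_words(dem_tokens, 2)

# Words in one party's top-k list but not in the other's.
print('Unique to Rep:', rep_top - dem_top)
print('Unique to Dem:', dem_top - rep_top)
```

Note that "unique" here only means "outside the other side's top k": a word can be missing from the other list and still be fairly frequent in that text, so the result depends on the choice of k.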


_____
<a id='section2'></a>

## 2. Create features w/2 NLP methods

In [109]:
# Create first set of features using Bag of Words:

def bag_of_words(text):
    
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    return [item[0] for item in Counter(allwords).most_common(2000)]

def bow_features(sentences, common_words):
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    for i, sentence in enumerate(df['text_sentence']):
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]

        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
            
        if i % 500 == 0:
            print("Processing row {}".format(i))
    return df

repwords = bag_of_words(rep_doc)
demwords = bag_of_words(dem_doc)

# Deduplicate, then convert to a list: newer pandas versions reject sets as column labels.
common_words = list(set(repwords + demwords))
word_counts = bow_features(sentences, common_words)

Processing row 0
Processing row 500
Processing row 1000
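
The bag-of-words step above boils down to counting, for each sentence, how often each vocabulary word occurs. Here is a minimal pure-Python sketch of that idea on toy data. The helper names `build_vocabulary` and `bow_vector` are hypothetical, and the token lists stand in for the spaCy lemmas used in the notebook.

```python
from collections import Counter

def build_vocabulary(token_lists, max_words):
    """Return the max_words most frequent tokens across all sentences."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(max_words)]

def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word appears in one sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy sentences, already tokenized (stand-ins for lemmatized spaCy sentences).
toy_sentences = [
    ["freedom", "nation", "freedom"],
    ["nation", "work"],
]

# Sort the vocabulary so the column order is deterministic.
vocabulary = sorted(build_vocabulary(toy_sentences, max_words=2000))
vectors = [bow_vector(s, vocabulary) for s in toy_sentences]

print(vocabulary)
print(vectors)
```

Counting each sentence once with `Counter` and then reading off the vocabulary columns avoids incrementing data frame cells one at a time, which is likely the slow part of `bow_features`.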


In [97]:
# Shuffle the rows by sorting on a temporary column of random numbers.
word_counts['random'] = np.random.rand(len(word_counts))
word_counts = word_counts.sort_values(by='random').reset_index(drop=True)
word_counts = word_counts.drop('random',axis=1)

### Word2vec

In [53]:
for sentence in all_doc.sents:
    print(sentence)

President Clinton, distinguished guests and my fellow citizens, the peaceful transfer of authority is rare in history, yet common in our country.
With a simple oath, we affirm old traditions and make new beginnings.
As I begin, I thank President Clinton for his service to our nation.
And I thank Vice President Gore for a contest conducted with spirit and ended with grace.
I am honored and humbled to stand here, where so many of America's leaders have come before me, and so many will follow.
We have a place, all of us, in a long story a story we continue, but whose end we will not see.
It is the story of a new world that became a friend and liberator of the old, a story of a slave-holding society that became a servant of freedom, the story of a power that went into the world to protect but not possess, to defend but not to conquer.
It is the American story a story of flawed and fallible people, united across the generations by grand and enduring ideals.
The grandest of these ideals is a

In [55]:
# Organize the parsed doc into a list of sentences, each a list of lowercased lemmas.
vocabulary = []
for sentence in all_doc.sents:
    vocab = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    vocabulary.append(vocab)
    
print('We have {} sentences and {} tokens.'.format(len(vocabulary), len(all_doc)))

We have 1089 sentences and 23602 tokens.


In [57]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    vocabulary,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (this argument is named vector_size in gensim 4 and later).
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!
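
To illustrate what `sg=0` (CBOW) and `window` mean, the sketch below builds the (context, target) pairs that a CBOW model predicts from: each word is the target, and up to `window` words on either side form its context. This is a simplified illustration with a hypothetical helper `cbow_pairs`; gensim's actual training applies further details (such as the frequent-word subsampling controlled by `sample`) that are ignored here.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

# Toy sentence, made up for illustration.
sentence = ["peaceful", "transfer", "authority", "rare", "history"]
pairs = cbow_pairs(sentence, window=2)

for context, target in pairs:
    print(context, '->', target)
```

With short sentences like the ones in this corpus, a window of 6 often covers the whole sentence, so most words in a sentence end up in each other's context.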


In [None]:
# 'freedom' is only an illustrative query word; any word that meets min_count works.
model.wv.most_similar(positive=['freedom'])

### Create features using tf-idf

In [4]:
all_paras = inaugural.paras(fileids=('2001-Bush.txt', '2005-Bush.txt','1981-Reagan.txt', 
                                    '1989-Bush.txt','1969-Nixon.txt', '2009-Obama.txt',
                                    '1993-Clinton.txt', '1997-Clinton.txt', '1977-Carter.txt',
                                    '1961-Kennedy.txt','1965-Johnson.txt'))

rep_paras = inaugural.paras(fileids=('2001-Bush.txt', '2005-Bush.txt','1981-Reagan.txt', 
                                    '1989-Bush.txt','1969-Nixon.txt')) 
dem_paras = inaugural.paras(fileids=('2009-Obama.txt', '1993-Clinton.txt', '1997-Clinton.txt', 
                                     '1977-Carter.txt', '1961-Kennedy.txt','1965-Johnson.txt'))

In [5]:
def para_cleaner(text):
    # `text` is a list of paragraphs; each paragraph is a list of tokenized sentences.
    all_paras = []
    for paragraph in text:
        # Note: only the first sentence of each paragraph is kept.
        para = paragraph[0]
        # Strip double dashes and encoding debris from each token.
        para = [re.sub('--','',word) for word in para]
        para = [re.sub('¡¦','',word) for word in para]
        para = [re.sub('¡§','',word) for word in para]
        para = [re.sub('¡X','',word) for word in para]
        all_paras.append(' '.join(para))
    return all_paras

rep_paras = para_cleaner(rep_paras)
dem_paras = para_cleaner(dem_paras)
#both_parties = para_cleaner(dirty_both_parties)



In [6]:
# Label each cleaned paragraph with its party.
rep_sents = [[sent, "rep"] for sent in rep_paras]
dem_sents = [[sent, "dem"] for sent in dem_paras]

# Combine the text from the two parties into one data frame.
sentences = pd.DataFrame(rep_sents + dem_sents)

sentences = sentences.rename(columns={0:'text', 1:'party'})

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test = train_test_split(list(sentences['text']), test_size=0.4, random_state=0)
# Step 1 
vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear at least twice
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

#Applying the vectorizer
party_tfidf=vectorizer.fit_transform(list(sentences['text']))
print("Number of features: %d" % party_tfidf.get_shape()[1])

# Note: the vectorizer above is fit on all paragraphs. To avoid leaking test data,
# call fit on the training set and call transform (without fit) on the test set.


Number of features: 511
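
The comments at the end of the cell above suggest fitting on the training set and only transforming the test set. The pure-Python sketch below shows what that means for tf-idf, using the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The helpers `fit_idf` and `transform` are hypothetical names, the documents are toy data, and options such as `max_df`, `min_df`, and stop-word removal are left out.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed idf weights from the training documents only."""
    n_docs = len(train_docs)
    # Document frequency: number of training docs containing each term.
    df = Counter(term for doc in train_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

def transform(doc, idf):
    """Weight one document with the learned idf, then L2-normalize it."""
    # Terms never seen during fitting are dropped.
    counts = Counter(t for t in doc if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy tokenized documents, made up for illustration.
train_docs = [["freedom", "nation"], ["nation", "work"], ["nation", "world", "freedom"]]
test_doc = ["freedom", "liberty", "nation"]

idf = fit_idf(train_docs)
vec = transform(test_doc, idf)
print(vec)
```

The rarer training term ("freedom") ends up with a larger weight than the term that appears in every training document ("nation"), and the unseen test word ("liberty") contributes nothing, which is exactly the behavior that keeps test-set vocabulary from leaking into the features.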


In [None]:
# Step 2 - could do any type of feature reduction
# Sparse matrix - if your data is sparse, dimensionality reduction can be done more efficiently using
# singular value decomposition (SVD) instead of PCA - the output of SVD is no longer sparse

In [12]:
X = party_tfidf
y = sentences['party']

y.value_counts()/len(y)

rep    0.53719
dem    0.46281
Name: party, dtype: float64
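
The class proportions above set a floor for any classifier: always predicting the majority class ("rep") would already score about 0.537. A minimal sketch of that baseline, using a hypothetical helper `majority_baseline_accuracy` and made-up labels:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels, made up for illustration (roughly the same balance as above).
labels = ["rep"] * 7 + ["dem"] * 6
baseline = majority_baseline_accuracy(labels)
print('Majority-class baseline: %.3f' % baseline)
```

Any model score is easier to interpret when read against this number rather than against 0.5.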

In [13]:
# create train and test data sets out of the vectorized data
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(party_tfidf,
                                                                y,
                                                                test_size=0.4, 
                                                                random_state=0)


In [14]:
rfr = RandomForestClassifier(max_depth=9, 
                             n_estimators=50,
                             criterion='gini',
                             n_jobs=-1)
clf = rfr.fit(X_train_tfidf, y_train)

train_predict = clf.predict(X_train_tfidf)
test_predict =  clf.predict(X_test_tfidf)

print('Train Score: %.3f'%clf.score(X_train_tfidf, y_train))
print('Test Score: %.3f'%clf.score(X_test_tfidf, y_test))

Train Score: 0.834
Test Score: 0.610


In [125]:
# Testing Criterion, seems gini is best
rfr = RandomForestClassifier()

pipeline = Pipeline([
    # An SVD step was planned here but never filled in, so only the classifier is used.
    ('clf', RandomForestClassifier(n_jobs=-1))
])

parameters = {
    'clf__max_depth': (4,9, None),
    'clf__n_estimators': (50, 70, 100,150),
    'clf__criterion': ('gini',),
}
    
grid_search = GridSearchCV(pipeline, parameters, verbose=1,
                           refit=True)
grid_search.fit(X, y)
print( 'Best score: %.3f'%grid_search.best_score_)

print( 'Best parameters: {}'.format(grid_search.best_estimator_))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best score: 0.584
Best parameters: Pipeline(memory=None,
     steps=[('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=9, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:   14.2s finished


In [121]:
rfr = RandomForestClassifier()

pipeline = Pipeline([
    ('clf', RandomForestClassifier(n_jobs=-1))
])

parameters = {
    'clf__max_depth': (4,9, None),
    'clf__n_estimators': (50,100),
    'clf__criterion': ('gini',)
}
    
grid_search = GridSearchCV(pipeline, parameters, verbose=1,
                           refit=True)
grid_search.fit(X, y)
print( 'Best score: %.3f'%grid_search.best_score_)

print( 'Best parameters: {}'.format(grid_search.best_estimator_))

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Best score: 0.590
Best parameters: Pipeline(memory=None,
     steps=[('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=9, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])


[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    6.7s finished
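
The "candidates" and "fits" figures in the grid search logs follow directly from the parameter grids: the number of candidates is the product of the option counts, and each candidate is fit once per fold. The sketch below checks that arithmetic for the two grids above; `count_fits` is a hypothetical helper, and the 3 folds are taken from the log output.

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Return (number of candidates, total number of fits) for a grid."""
    keys = sorted(param_grid)
    # Every combination of one value per parameter is one candidate.
    candidates = list(product(*(param_grid[k] for k in keys)))
    return len(candidates), len(candidates) * n_folds

first_grid = {
    'clf__max_depth': (4, 9, None),
    'clf__n_estimators': (50, 70, 100, 150),
    'clf__criterion': ('gini',),
}
second_grid = {
    'clf__max_depth': (4, 9, None),
    'clf__n_estimators': (50, 100),
    'clf__criterion': ('gini',),
}

print(count_fits(first_grid, n_folds=3))
print(count_fits(second_grid, n_folds=3))
```

Because the cost grows multiplicatively, adding even one more value to a parameter (or one more parameter) noticeably lengthens the search.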


In [105]:


#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(party_tfidf, test_size=0.4, random_state=0)


#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]
#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]
#List of features (newer scikit-learn versions use get_feature_names_out() instead)
terms = vectorizer.get_feature_names()
#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

#With smooth_idf=True, scikit-learn computes idf as ln((1 + n) / (1 + df)) + 1, so every term present in a paragraph gets a nonzero weight; absent terms are simply not stored in the sparse matrix.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

Original sentence: Each and every one of us , in our own way , must assume personal responsibility  not only for ourselves and our families , but for our neighbors and our nation .
Tf_idf vector: {'way': 0.39386678192627333, 'assume': 0.4740534006709175, 'neighbors': 0.43229141262647164, 'responsibility': 0.4173859139184391, 'personal': 0.43229141262647164, 'nation': 0.2686626125992759}


In [82]:
print(X_train_tfidf)

  (0, 129)	0.4049264055500716
  (0, 298)	0.38483685959223873
  (0, 274)	0.4049264055500716
  (0, 192)	0.4049264055500716
  (0, 388)	0.38483685959223873
  (0, 140)	0.31419428241337494
  (0, 430)	0.3364326889022654
  (1, 69)	0.41808047454270975
  (1, 293)	0.5276353742641202
  (1, 285)	0.5014578887878202
  (1, 317)	0.43838530975150364
  (1, 311)	0.32119765658834587
  (2, 232)	0.40221313561078387
  (2, 492)	0.40221313561078387
  (2, 409)	0.40221313561078387
  (2, 148)	0.40221313561078387
  (2, 370)	0.38225820266014005
  (2, 509)	0.3060534730004564
  (2, 329)	0.24728105159388925
  (2, 304)	0.22794822625891
  (3, 45)	0.5568957609061334
  (3, 500)	0.5292665847667112
  (3, 286)	0.5568957609061334
  (3, 304)	0.31561227038716244
  (4, 423)	0.5414847496354206
  :	:
  (213, 124)	0.3984541292074249
  (213, 217)	0.21641447707273098
  (213, 204)	0.22414297389267268
  (213, 295)	0.1947608982077708
  (213, 211)	0.23360188954174946
  (213, 261)	0.24579655275763287
  (213, 436)	0.23360188954174946
  (213

In [79]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space from 511 to 130.
svd= TruncatedSVD(130)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of paragraphs our solution considers similar, for the first five identified topics
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(5):
    print('Component {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])


Percent variance captured by all components: 86.4437588185385
Component 0:
Fellow citizens , let us build that America , a nation ever moving forward toward realizing the full potential of all its citizens .                                                                                                                                                                                                           0.586648
My fellow citizens :                                                                                                                                                                                                                                                                                                                           0.586188
Mr . President , I want our fellow citizens to know how much you did to carry on this tradition .                                                                                                                                            

____
<a id='section3'></a>

## 3. Fit Classification models for each feature set

In [101]:
word_counts.groupby('text_source').count()['text_sentence']/len(word_counts['text_sentence'])

text_source
dem    0.395622
rep    0.604378
Name: text_sentence, dtype: float64

In [123]:
X = word_counts.drop(columns=['text_sentence', 'text_source'])
y = word_counts['text_source']
#scaled_data = scaler.fit_transform(df_job[test_features])

pipeline = Pipeline([
    #('threshold', VarianceThreshold()),
    #('features', SelectKBest(f_regression)),
    ('clf', RandomForestClassifier(n_jobs=-1))
])

parameters = {
    #'threshold__threshold': (.001, .002),
    #'features__k': ( 'all',100,),
    'clf__max_depth': (9,4,3),
    'clf__n_estimators': (50,100,150)
}
    
grid_search = GridSearchCV(pipeline, parameters, verbose=1,
                           refit=True)
grid_search.fit(X, y)
print( 'Best score: {}'.format(grid_search.best_score_))
#print( 'Best score: %.3f'%(np.sqrt(grid_search.best_score_*(-1))))
print( 'Best parameters: {}'.format(grid_search.best_estimator_.get_params()))

NameError: name 'word_counts' is not defined

_________
<a id='section4'></a>

## 4. Assess Your Models

______
<a id='section5'></a>

## 5. Improve a Model

In [42]:
questions = questions.replace(' ?', '? ')

In [43]:
questions_doc = nlp(questions)

In [46]:
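# Scratch work from the abandoned question-type corpus described in the introduction.
# re.sub returns a new string; these results are not assigned, so `questions` is unchanged
# and only the result of the last expression is displayed.
re.sub(r'DESC:manner', 'DESC:manner.',questions)
re.sub(r'ENTY:cremat', 'ENTY:cremat.',questions)
re.sub(r'ABBR:exp', 'A BBR:expression.',questions)
re.sub(r'ABBR:abb', 'ABBR:manner.',questions)
re.sub(r'DESC:manner', 'DESC:manner.',questions)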

"DESC:manner. How did serfdom develop in and then leave Russia? \nENTY:cremat What films featured the character Popeye Doyle? \nDESC:manner. How can I find a list of celebrities ' real names? \nENTY:animal What fowl grabs the spotlight after the Chinese Year of the Monkey? \nABBR:exp What is the full form of .com? \nHUM:ind What contemptible scoundrel stole the cork from my lunch? \nHUM:gr What team did baseball 's St. Louis Browns become? \nHUM:title What is the oldest profession? \nDESC:def What are liver enzymes? \nHUM:ind Name the scar-faced bounty hunter of The Old West .\nNUM:date When was Ozzy Osbourne born? \nDESC:reason Why do heavier objects travel downhill faster? \nHUM:ind Who was The Pride of the Yankees? \nHUM:ind Who killed Gandhi? \nENTY:event What is considered the costliest disaster the insurance industry has ever faced? \nLOC:state What sprawling U.S. state boasts the most airports? \nDESC:desc What did the only repealed amendment to the U.S. Constitution deal with? 