# Challenge: Build Your Own NLP Model

For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout.

### Import Statements

In [26]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
import nltk

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [4]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


For this assignment, I'll build models that predict whether a sentence comes from Blake's poems or Bryant's stories.

### 1. Data Cleaning / Processing / Language Parsing

In [0]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    text = re.sub(r'--',' ',text)
    text = re.sub(r'\[.*?\]', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
poems = gutenberg.raw('blake-poems.txt')
stories = gutenberg.raw('bryant-stories.txt')

# The Chapter indicator is idiosyncratic
poems = re.sub(r'Chapter \d+', '', poems)
stories = re.sub(r'CHAPTER .*', '', stories)
    
poems = text_cleaner(poems[:int(len(poems)/10)])
stories = text_cleaner(stories[:int(len(stories)/10)])
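
To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same three substitutions to a made-up snippet. The sample string is an assumption for illustration, not text taken from the corpus, and the bracket pattern is written as a raw string.

```python
import re

def text_cleaner(text):
    # Replace double dashes with a space.
    text = re.sub(r'--', ' ', text)
    # Drop anything in square brackets (on a single line).
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse all runs of whitespace, including newlines, to single spaces.
    text = ' '.join(text.split())
    return text

# Hypothetical sample input, not taken from the Gutenberg files.
sample = "[Poems by William Blake 1789]\n\nPiping down the valleys wild--\n   Piping songs of pleasant glee"
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that `.` does not match newlines by default, so a bracketed note that spans several lines would survive this cleaner.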

I'll look at the _poems_ and _stories_ variables to see what they look like after the text cleaning.

In [13]:
poems

'SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL SONGS OF INNOCENCE INTRODUCTION Piping down the valleys wild, Piping songs of pleasant glee, On a cloud I saw a child, And he laughing said to me: "Pipe a song about a Lamb!" So I piped with merry cheer. "Piper, pipe that song again;" So I piped: he wept to hear. "Drop thy pipe, thy happy pipe; Sing thy songs of happy cheer:!" So I sang the same again, While he wept with joy to hear. "Piper, sit thee down and write In a book, that all may read." So he vanish\'d from my sight; And I pluck\'d a hollow reed, And I made a rural pen, And I stain\'d the water clear, And I wrote my happy songs Every child may joy to hear. THE SHEPHERD How sweet is the Shepherd\'s sweet lot! From the morn to the evening he stays; He shall follow his sheep all the day, And his tongue shall be filled with praise. For he hears the lambs\' innocent call, And he hears the ewes\' tender reply; He is watching while they are in peace, For they know when their 

In [14]:
stories

'TWO LITTLE RIDDLES IN RHYME There\'s a garden that I ken, Full of little gentlemen; Little caps of blue they wear, And green ribbons, very fair. (Flax.) From house to house he goes, A messenger small and slight, And whether it rains or snows, He sleeps outside in the night. (The path.) THE LITTLE YELLOW TULIP Once there was a little yellow Tulip, and she lived down in a little dark house under the ground. One day she was sitting there, all by herself, and it was very still. Suddenly, she heard a little _tap, tap, tap_, at the door. "Who is that?" she said. "It\'s the Rain, and I want to come in," said a soft, sad, little voice. "No, you can\'t come in," the little Tulip said. By and by she heard another little _tap, tap, tap_ on the window-pane. "Who is there?" she said. The same soft little voice answered, "It\'s the Rain, and I want to come in!" "No, you can\'t come in," said the little Tulip. Then it was very still for a long time. At last, there came a little rustling, whispering 

In [0]:
# Parse the cleaned text.
nlp = spacy.load('en')

poems_doc=nlp(poems)
stories_doc=nlp(stories)

To get a better sense of what happened when the cleaned text was parsed, I'll look at the new variables.

In [17]:
poems_doc

SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL SONGS OF INNOCENCE INTRODUCTION Piping down the valleys wild, Piping songs of pleasant glee, On a cloud I saw a child, And he laughing said to me: "Pipe a song about a Lamb!" So I piped with merry cheer. "Piper, pipe that song again;" So I piped: he wept to hear. "Drop thy pipe, thy happy pipe; Sing thy songs of happy cheer:!" So I sang the same again, While he wept with joy to hear. "Piper, sit thee down and write In a book, that all may read." So he vanish'd from my sight; And I pluck'd a hollow reed, And I made a rural pen, And I stain'd the water clear, And I wrote my happy songs Every child may joy to hear. THE SHEPHERD How sweet is the Shepherd's sweet lot! From the morn to the evening he stays; He shall follow his sheep all the day, And his tongue shall be filled with praise. For he hears the lambs' innocent call, And he hears the ewes' tender reply; He is watching while they are in peace, For they know when their Shepher

In [18]:
stories_doc

TWO LITTLE RIDDLES IN RHYME There's a garden that I ken, Full of little gentlemen; Little caps of blue they wear, And green ribbons, very fair. (Flax.) From house to house he goes, A messenger small and slight, And whether it rains or snows, He sleeps outside in the night. (The path.) THE LITTLE YELLOW TULIP Once there was a little yellow Tulip, and she lived down in a little dark house under the ground. One day she was sitting there, all by herself, and it was very still. Suddenly, she heard a little _tap, tap, tap_, at the door. "Who is that?" she said. "It's the Rain, and I want to come in," said a soft, sad, little voice. "No, you can't come in," the little Tulip said. By and by she heard another little _tap, tap, tap_ on the window-pane. "Who is there?" she said. The same soft little voice answered, "It's the Rain, and I want to come in!" "No, you can't come in," said the little Tulip. Then it was very still for a long time. At last, there came a little rustling, whispering sound,

In [19]:
# Group into sentences.
poems_sentences = [[sent, 'Blake'] for sent in poems_doc.sents]
stories_sentences = [[sent, 'Bryant'] for sent in stories_doc.sents]

# Combine the sentences from the two texts into one data frame.
sentences = pd.DataFrame(poems_sentences + stories_sentences)
sentences.head()

| | 0 | 1 |
|---|---|---|
| 0 | (SONGS, OF, INNOCENCE, AND, OF, EXPERIENCE, an... | Blake |
| 1 | (INNOCENCE, INTRODUCTION, Piping, down, the, v... | Blake |
| 2 | (wild) | Blake |
| 3 | (,, Piping, songs, of, pleasant, glee, ,, On, ... | Blake |
| 4 | (", Pipe, a, song, about, a, Lamb, !, ") | Blake |


### 2. Create Features Using Two Different NLP Methods, like Bag of Words (BoW) and tf-idf

_BoW_

In [0]:
# Utility function to create a list of the 2,000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    
# Create a data frame with one feature column per common word. Each value is the number of times that word appears in the sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    # Recent pandas versions reject a set as column labels, so use a list.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
poems_words = bag_of_words(poems_doc)
stories_words = bag_of_words(stories_doc)

# Combine bags to create a set of unique words.
common_words = set(poems_words + stories_words)
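
The row-by-row `df.loc` updates in `bow_features` are easy to follow but slow. As a rough sketch of the same counting idea, the snippet below builds the count rows with `Counter`, using plain whitespace tokens and a tiny hand-made stop list in place of spaCy lemmas and stop words. Both are simplifying assumptions, as are the three sample sentences.

```python
from collections import Counter

# Hypothetical sample sentences standing in for the parsed corpus.
sentences = [
    "piping songs of pleasant glee",
    "the little tulip heard the rain",
    "songs of happy cheer",
]
stop_words = {"of", "the"}  # tiny illustrative stop list (assumption)

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in s.split() if w not in stop_words] for s in sentences]

# Vocabulary: the most common words across all sentences (capped at 2,000).
vocab_counts = Counter(w for tokens in tokenized for w in tokens)
common_words = [w for w, _ in vocab_counts.most_common(2000)]

# One row of counts per sentence, with columns in the order of common_words.
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[w] for w in common_words])

print(common_words)
print(rows)
```

The resulting `rows` list could be passed to `pd.DataFrame(rows, columns=common_words)` to get a frame shaped like `word_counts`, minus the text and source columns.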

In [21]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450


Unnamed: 0,play,arise,show,pleased,thing,course,rainbow,lion,bad,town,steal,southern,fiercely,lead,decent,castor,poke,tent,give,east,rustle,lean,house,beast,care,chocolate,fearful,round,suffering,low,wet,tooth,glad,begin,Little,rejoice,pen,morn,Mouse,church,...,specially,reflection,ripe,naughty,pooh,joy,frightened,snow,lift,catch,keyhole,corner,voice,pleasant,Clouds,powerful,remember,whisper,cellar,ao,wood,teach,good,dle,shine,strong,paper,shirt,sit,man,especially,start,dost,wretch,comfort,youth,woman,kill,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(SONGS, OF, INNOCENCE, AND, OF, EXPERIENCE, an...",Blake
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(INNOCENCE, INTRODUCTION, Piping, down, the, v...",Blake
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,(wild),Blake
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"(,, Piping, songs, of, pleasant, glee, ,, On, ...",Blake
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"("", Pipe, a, song, about, a, Lamb, !, "")",Blake


_tf-idf_

In [32]:
import nltk
from nltk.corpus import gutenberg
nltk.download('punkt')
nltk.download('gutenberg')

#reading in the data, this time in the form of paragraphs
poems=gutenberg.paras('blake-poems.txt')

#processing
poems_paras=[]
for paragraph in poems:
    #each paragraph is a list of sentences; this keeps only the first sentence
    para=paragraph[0]

    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]

    #Joining the words into a single string and adding it to the list of strings.
    poems_paras.append(' '.join(para))

print(poems_paras[0:4])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
['[ Poems by William Blake 1789 ]', 'SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL', 'SONGS OF INNOCENCE', 'INTRODUCTION']


In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test = train_test_split(poems_paras, test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear in at least two paragraphs
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case (the poem titles are in ALL CAPS)
                             use_idf=True, # weight terms by inverse document frequency
                             norm='l2', # scale each paragraph vector to unit length so longer and shorter paragraphs are treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document contained every word once; prevents divide-by-zero errors
                            )


#Applying the vectorizer
poems_paras_tfidf=vectorizer.fit_transform(poems_paras)
print("Number of features: %d" % poems_paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(poems_paras_tfidf, test_size=0.4, random_state=0)

#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features (on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead)
terms = vectorizer.get_feature_names()

#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

#Only nonzero entries are stored here. With smooth_idf=True the idf of a term is always at least 1, so every word that is present in a paragraph and kept by the vectorizer gets a score above 0.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

Number of features: 404
Original sentence: The sun descending in the west , The evening star does shine ; The birds are silent in their nest , And I must seek for mine .
Tf_idf vector: {'seek': 0.3790496718281308, 'nest': 0.3790496718281308, 'silent': 0.3790496718281308, 'evening': 0.3997564247870558, 'shine': 0.3498651426862711, 'birds': 0.3291583897273461, 'does': 0.2749034743404047, 'sun': 0.32068061354441146}
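
For intuition about where these numbers come from, here is a stdlib-only sketch of the weighting that `TfidfVectorizer` applies with `smooth_idf=True` and `norm='l2'`: raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, then each vector scaled to unit length. The three toy documents are made up for illustration, and the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents (not taken from the corpus).
docs = [
    ["sun", "west", "evening", "star"],
    ["birds", "silent", "nest"],
    ["sun", "birds", "sun"],
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return an L2-normalized tf-idf dict for one tokenized document."""
    tf = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, which is never below 1.
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
for vec in vectors:
    print(vec)
```

In the first toy document, "sun" gets a lower weight than "west" because "sun" also appears in another document, which is the same reason the rarer words in the output above score higher than the more widespread ones.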


### 3. Fit Supervised Learning Models for Each Feature Set to Predict the Category Outcomes

_Setting Up the Train and Test Sets_

In [0]:
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

_Random Forest - BoW_


In [29]:
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier()

train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9756944444444444

Test set score: 0.8860103626943006




_Random Forest - tf-idf_

### 4. Assess Your Models Using Cross-Validation
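
One possible way to approach this step, shown as a minimal stdlib-only sketch rather than a result: split the rows into k folds, fit on k-1 of them, and score on the held-out fold. The labels, the helper names (`k_fold_indices`, `cross_val_accuracy`, `majority_baseline`) and the stand-in majority-class model are all assumptions for illustration; in the notebook, scikit-learn's `cross_val_score` could play the same role with the random forest.

```python
from collections import Counter

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_val_accuracy(fit_predict, X, y, n_folds=5):
    """Return the held-out accuracy of fit_predict on each fold."""
    scores = []
    for train, test in k_fold_indices(len(X), n_folds):
        preds = fit_predict([X[i] for i in train],
                            [y[i] for i in train],
                            [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return scores

def majority_baseline(X_train, y_train, X_test):
    """Hypothetical stand-in model: always predict the most common label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * len(X_test)

# Made-up features and labels, only to exercise the functions.
X = list(range(10))
y = ['Blake', 'Bryant', 'Blake', 'Blake', 'Bryant',
     'Blake', 'Blake', 'Bryant', 'Blake', 'Blake']
scores = cross_val_accuracy(majority_baseline, X, y, n_folds=5)
print(scores, sum(scores) / len(scores))
```

A majority-class baseline like this is also a useful yardstick: a real model should beat its cross-validated accuracy.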

### 5. Pick a Model and Try to Increase Its Accuracy by Five Percentage Points