Note, all imports for this notebook are below:

In [29]:
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, \
HashingVectorizer, TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_union, make_pipeline, Pipeline

import string

# Intro to Natural Language Processing 

Natural language processing is a fairly large field and one that we will not cover in nearly enough detail this week. For our purposes, it involves the processing of written language into numbers that we will then use during modeling. This puts it squarely in the preprocessing / feature engineering stage.

> Note, there are plenty of cases where we may want to predict what the next word of a statement will be or to automatically provide an answer to a question, program a chat bot, etc. These also fall under the NLP umbrella, but for this week's discussions, we'll focus on just going from language -> numbers

## A Simple Example

Suppose we are building a spam/ham classifier. The inputs are emails, and the output is a binary classification of whether or not each email is spam.

Here's an example of an input email:

> Hello, I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.

and a second example:

> Hello, I am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.

We might want to look at certain words and determine which emails they occur in:

### Check for Understanding 1 (10 Minutes)

With a partner, do the following:

1. Identify which email is a spam email and which is a ham email. 
2. What words exist in the spam email that do not exist in the ham email? What about the reverse?
3. Using these words, create a rule that would classify a spam email and do not share it with your partner.
> For example, the rule could be "If the email refers to cancer, it is a spam email"
4. Go online (or into your inbox!) and find a second example of a spam email. PM this text to your partner. Does their rule correctly classify this email?
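
As a rough illustration, the example rule from step 3 could be written as a small function. This is only a sketch: the name `is_spam` and the `trigger_words` parameter are made up for this example, and a real rule would need more than one keyword.

```python
def is_spam(email_text, trigger_words=("cancer",)):
    """Hypothetical rule: flag an email as spam if it mentions any trigger word."""
    words = email_text.lower().split()
    # Strip surrounding punctuation so "cancer." still matches "cancer"
    cleaned = {w.strip('.,!?()"') for w in words}
    return any(trigger in cleaned for trigger in trigger_words)

print(is_spam("I was diagnosed with cancer 2 years ago."))       # True
print(is_spam("We would like to invite you for an interview."))  # False
```

A rule like this is easy to write but brittle, which is why we move on to counting words systematically.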

## Bag of Words Approaches

The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. In other words, how many times does each word happen?

One easy way to do this is with the `Counter` class from the `collections` module in the standard library.

In [None]:
spam = '''
Hello, I saw your contact information on LinkedIn. 
I have carefully read through your profile and you seem to have an outstanding personality. 
This is one major reason why I am in contact with you.
My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". 
I am 86 years old and I was diagnosed with cancer 2 years ago. 
I will be going in for an operation later this week. 
I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.
'''

ham = '''
Hello, I am writing in regards to your application to the position of Data Scientist at Hooli X.
We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site 
interview with our Senior Data Scientist Mr. John Smith. 
You will find attached to this message further information on date, time and location of the interview.
Please let me know if I can be of any further assistance. Best Regards.
'''

print(Counter(spam.lower().split()))
print('\n')
print(Counter(ham.lower().split()))

In [None]:
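# Strip punctuation before counting, so that "etc." and "etc" are the same word
Counter(spam.lower().translate(str.maketrans('', '', string.punctuation)).split())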

In this case, outside of prepositions and articles, most words are unique across these documents. However, for the ham email, the following interesting words occur more than once:

- data
- scientist
- further

and in the spam email:

- contact
- years
- etc
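
The comparison above can also be done in code. The sketch below uses two short made-up snippets rather than the full emails, and the helper name `word_counts` is an assumption for this example. It lists the words repeated in one text and the words that appear in only one of the two.

```python
from collections import Counter
import string

def word_counts(text):
    """Lowercase, strip punctuation, and count whitespace-separated words."""
    table = str.maketrans('', '', string.punctuation)
    return Counter(text.lower().translate(table).split())

# Short illustrative snippets, not the full emails
spam_counts = word_counts("I am in contact with you. Contact me about the cancer.")
ham_counts = word_counts("The Data Scientist will contact you about the Data Scientist role.")

# Words that occur more than once in the ham snippet
repeated_in_ham = sorted(w for w, n in ham_counts.items() if n > 1)
# Words that occur in the spam snippet but never in the ham snippet
only_in_spam = sorted(set(spam_counts) - set(ham_counts))

print(repeated_in_ham)
print(only_in_spam)
```

Note that "the" shows up as a repeated word, which is one reason we will look at stop words later.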

### Scikit-learn Options

Scikit-learn offers two different options for Bag-of-Words approaches to NLP:

- [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [`HashingVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)

In **most** cases, you'll be using the `CountVectorizer` object, but there may be times, especially if you are on an older machine, that `HashingVectorizer` will be a huge timesaver.

> **Major Note**: NLP can take a lot of computing power. Don't be afraid to look for alternatives like `HashingVectorizer`, to cut your dataset down, to use fewer words, or to use a larger virtualized machine (such as on AWS). "Working smart, not hard" applies very well here!

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) will take a set of words and split them up into one column per word, with (by default) the count of the word for that row in that column. We'll walk through a brief example now and then dive into the options. 

This library is very full-featured and has a number of options to set or tweak -- get used to reviewing those options to make sure that you are gaining the most out of your work!

In [None]:
df = pd.DataFrame.from_dict({0: spam, 1: ham}, orient='index')

In [None]:
df.columns = ['text']

In [None]:
cv = CountVectorizer()
cv.fit(df['text'])
cv.transform(df['text'])

What is a sparse matrix? Transformations like this tend to create matrices with a very large number of zeroes (words may only show up once or twice across a number of documents, so a lot of extra memory is expended to represent a very large number of zeroes).

Sparse matrices collapse regular (dense) matrices by marking down only cases where a non-zero value is found for a certain combination of row and column. It then drops all the zeroes, allowing for a reduced memory footprint.
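
To make the idea concrete, here is a minimal pure-Python sketch of a coordinate-style sparse representation. It is only an illustration: the SciPy sparse matrices that `CountVectorizer` returns use more compact layouts, and the helper name `to_dense` is made up for this example.

```python
dense = [
    [0, 2, 0, 0, 1],
    [0, 0, 0, 3, 0],
]

# Keep only (row, column) -> value for the non-zero cells
sparse = {(r, c): v
          for r, row in enumerate(dense)
          for c, v in enumerate(row) if v != 0}

def to_dense(sparse, shape):
    """Rebuild the full matrix, filling every missing cell with zero."""
    n_rows, n_cols = shape
    return [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]

print(sparse)                              # 3 stored values instead of 10 cells
print(to_dense(sparse, (2, 5)) == dense)   # True
```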

We can call `.todense()` on the output to convert it into a dense matrix. 

In [None]:
cv.transform(df['text']).todense()

In addition, we can grab the feature names and apply them if we are interested in seeing the distribution of words:

In [None]:
cv_df = pd.DataFrame(cv.transform(df['text']).todense(),
                    columns=cv.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0

In [None]:
cv_df.head()

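Why the random numbers at the front? They come from the `8,750,000.00` and `86` sections of the spam email -- our parser does not interpret `8,750,000.00` as one number but rather splits it at the punctuation into the tokens `750`, `000`, and `00`. The leading `8` is dropped because, by default, `CountVectorizer` only keeps tokens of two or more characters.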
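
We can see this splitting directly by applying the default `token_pattern` regular expression of `CountVectorizer` with the standard `re` module. The sentence below is a shortened, made-up version of the spam text.

```python
import re

# Default CountVectorizer token_pattern: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "I am 86 years old and will donate 8,750,000.00 Euros"
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)
```

The single-character tokens `i` and `8` do not survive, while `750`, `000`, and `00` each become their own "word".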

### `CountVectorizer` Options

`CountVectorizer` takes the following (useful) keyword arguments:

| Argument | Default Value | Definition |
| :--- | :--- | :--- |
| `decode_error` | `strict` | What to do if text cannot be decoded. `strict` will raise a `UnicodeDecodeError`, `ignore` will skip the undecodable bytes, `replace` will substitute a replacement character |
| `strip_accents` | `None` | When preprocessing a word, `CountVectorizer` does nothing with the accented characters. `ascii` will convert those characters if they have a direct ASCII mapping (à -> a, for example), and `unicode` is slower but will do it for all characters | 
| `preprocessor / tokenizer` | `None` | Ways to override how to split text into words (`tokenizer`) and what to do with those words before vectorizing (`preprocessor`) -- we'll discuss shortly |
| `ngram_range` | `(1, 1)` | Sometimes we may want each sequence of _n_ words as well as each individual word. This is known as an _ngram_ and we can set that here | 
| `stop_words` | `None` | Whether or not to remove stop words. Will discuss later. |
| `max_df` | `1.0` | `df` refers to the document frequency -- how often does a given word show up across documents. If we set this to a float less than 1.0, any word that happens more frequently than that value will be discarded. Any integer will be the number of documents instead of the proportion | 
| `min_df` | `1` | Same as `max_df`, but for the number / proportion of documents that a word has to appear in before it is included |
| `max_features` | `None` | The top _n_ occurring features to include. If `None`, include all features. This is a great way to coerce `CountVectorizer` to return a matrix of a specific shape / size |
| `binary` | `False` | If `True`, return dummy variables instead of a count of occurrences |

Whew! That's a lot!
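
To get a feel for what `ngram_range`, `min_df`, and `max_df` do, here is a rough pure-Python sketch on a tiny made-up corpus. The helper names `ngrams` and `filter_by_df` are assumptions for this example, and the sketch simplifies the real options: here `min_df` is always a count and `max_df` is always a proportion.

```python
from collections import Counter

docs = [
    "the rocket launch was delayed",
    "the rocket launch went well",
    "the shuttle landed",
]

def ngrams(tokens, ngram_range=(1, 2)):
    """Return every run of n consecutive tokens for n in ngram_range."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

doc_terms = [set(ngrams(d.split())) for d in docs]

# Document frequency: in how many documents does each term appear?
df = Counter(term for terms in doc_terms for term in terms)

def filter_by_df(df, n_docs, min_df=1, max_df=1.0):
    """Keep terms found in at least min_df documents (a count)
    and in at most a max_df share of documents (a proportion)."""
    return sorted(t for t, n in df.items()
                  if n >= min_df and n / n_docs <= max_df)

print(ngrams("the rocket launch".split()))
print(filter_by_df(df, len(docs), min_df=2, max_df=0.9))
```

With these settings, "the" is dropped for appearing in every document, and one-off terms such as "delayed" are dropped for being too rare.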


### Check for Understanding 2 (15 Minutes)

For this Check for Understanding (and many others), we'll be using the [`20 Newsgroups`](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) dataset. This dataset includes a number of old messages from a set of newsgroups and is a standard dataset to practice NLP techniques with. 

For this first section, we will be using the first 100 messages from the `sci.space` newsgroup. We'll also be stripping them of any headers, footers, or quoted messages. 

## self practice review (12/08/2017)

In [None]:
space = fetch_20newsgroups(subset='train', categories=['sci.space'],
                          remove=['headers', 'footers', 'quotes'])



In [None]:
space_messages = space['data'][:100]

In [None]:
cv = CountVectorizer(stop_words='english', min_df=0.01)
space_words = cv.fit_transform(space_messages)

In [None]:
words = pd.DataFrame(space_words.todense(), 
                     columns=cv.get_feature_names_out())

In [None]:
words.head()
print(words.shape)

In [None]:
words.sum().sort_values(ascending=False).head()

In [None]:
cv = CountVectorizer(stop_words='english', 
                     min_df=0.01, ngram_range=(1,2))
space_words = pd.DataFrame(cv.fit_transform(space_messages).todense(),
                          columns=cv.get_feature_names_out())

In [None]:
space_words.sum().sort_values(ascending=False).head()
print(space_words.shape)

In [None]:
cv = CountVectorizer(stop_words='english',min_df=0.1, max_df=0.9)
space_words = pd.DataFrame(cv.fit_transform(space_messages).todense(),
                          columns=cv.get_feature_names_out())
print(space_words.sum().sort_values(ascending=False).head())
print(space_words.shape)

In [None]:
space_words.head()

In [None]:
plt.figure(figsize=(100,95))
sns.heatmap(space_words.corr(), annot=True)

In [None]:
cv = CountVectorizer(max_features=100)
space_words = pd.DataFrame(cv.fit_transform(space_messages).todense(),
                          columns=cv.get_feature_names_out())
space_words.head()

In [None]:
print(space_words.shape)
print(space_words.sum().sort_values(ascending=False).head())

## -------------------------------------------------------------

In [None]:
space = fetch_20newsgroups(subset='train',
                          categories=['sci.space'],
                          remove=('headers', 'footers', 'quotes'))
space_messages = space['data'][:100]

In [None]:
print(space_messages[0], '\n', space_messages[-1])

In [None]:
cv = CountVectorizer()
cv.fit(space_messages)
sm = cv.transform(space_messages)

In [None]:
df = pd.DataFrame(sm.todense(), columns=cv.get_feature_names_out())

In [None]:
df.sum().sort_values(ascending=False).head()

In [None]:
cv = CountVectorizer(ngram_range=(1,2))
cv.fit(space_messages)
smn = cv.transform(space_messages)
df_smn = pd.DataFrame(smn.todense(), columns=cv.get_feature_names_out())

In [None]:
df_smn.head()

In [None]:
df_smn.sum().sort_values(ascending=False).head()

In [None]:
cv = CountVectorizer(min_df=0.10, max_df=.90)
smm = cv.fit_transform(space_messages)
df_smm = pd.DataFrame(smm.todense(), columns=cv.get_feature_names_out())

In [None]:
df_smm.sum().sort_values(ascending=False).head()

In [None]:
cv= CountVectorizer(max_features=100)
smm2 = cv.fit_transform(space_messages)
df_smm2 = pd.DataFrame(smm2.todense(), columns=cv.get_feature_names_out())
df_smm2.sum().sort_values(ascending=False).head()

In [None]:
print(df.shape)
print(df_smn.shape)
print(df_smm.shape)
print(df_smm2.shape)

In [None]:

sns.heatmap(df_smm2.corr(), annot=True)

`space_messages` now contains 100 space-themed messages that we will do some feature extraction on:

Individually, please try the following:

1. Use `CountVectorizer` (at its default settings) and Pandas to create a DataFrame where each column is a word in the text. 
    1. How large is your DataFrame? 
    2. What are the top 10 occurring words?
2. Reinstantiate `CountVectorizer` and set `ngram_range` to `(1, 2)`, then fit and transform the space messages into a DataFrame
    1. What does this do?
    2. How large is your DataFrame?
    3. What are the top 10 occurring words?
3. Reinstantiate `CountVectorizer` and set `min_df` to `0.10` and `max_df` to `0.90`, then fit and transform the space messages into a DataFrame
    1. What does this do?
    2. How large is your DataFrame?
    3. What are the top 10 occurring words?
4. Reinstantiate `CountVectorizer` and set `max_features` to `100`, then fit and transform the space messages into a DataFrame
    1. What does this do?
    2. How large is your DataFrame?
    3. What are the top 10 occurring words?
5. Pick your smallest DataFrame and do a correlation heatmap on it. Do you see any trends? Do words co-occur together?

When finished, with a partner or a small group, answer the following:

6. What are each of these techniques doing? Which words are being dropped with each different technique? Which technique do you think captures the most useful words from the documents?

Certain words occur a huge amount but may not offer anything useful to us as modelers. We'll discuss removing these words shortly.

### `HashingVectorizer`

Sometimes we may have too many features or too many rows and run out of memory trying to process all of it. Sklearn provides a second option known as `HashingVectorizer` to avoid that.

- Benefit: Low memory footprint, allowing you to do more work
- Downside: No way to determine which original word a given feature corresponds to.

In most cases, we'll be using `CountVectorizer`, but `HashingVectorizer` can be a nice ace in the hole. 

#### How it Works

As you have seen, we can set the `CountVectorizer` dictionary to have a fixed size, only keeping words of certain frequencies. However, we still have to compute a dictionary and hold it in memory. This could be a problem when we have a large corpus or in streaming applications where we don't know which words we will encounter in the future.

These problems can be solved using the `HashingVectorizer`, which converts a collection of text documents to a matrix of occurrences, calculated with the hashing trick. Each word is mapped to a feature with the use of a hash function that converts it to a hash. If we encounter that word again in the text, it will be converted to the same hash, allowing us to count word occurrences without retaining a dictionary in memory. This is very convenient!

The main drawback of this trick is that it's not possible to compute the inverse transform, and thus we lose information on what words the important features correspond to. The hash function employed is the signed 32-bit version of Murmurhash3.
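
The hashing trick itself fits in a few lines. The sketch below is a simplified illustration and not scikit-learn's implementation: it uses `hashlib.md5` from the standard library as a stand-in for MurmurHash3, a tiny `n_features` of 16, and a made-up function name `hashed_counts`.

```python
import hashlib

def hashed_counts(text, n_features=16):
    """Count words into a fixed-length vector without storing a vocabulary."""
    counts = [0] * n_features
    for word in text.lower().split():
        # md5 is a stand-in here; scikit-learn uses MurmurHash3
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features   # same word -> same column
        counts[index] += 1
    return counts

vec = hashed_counts("data scientist data engineer")
print(vec, sum(vec))
```

Both occurrences of "data" land in the same column, but nothing in `vec` tells us which column that is, and two different words can collide in one column when `n_features` is small.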

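We'll instantiate `HashingVectorizer` with the keyword argument `norm=None` to get back raw counts of words rather than normalized values. The default (`norm='l2'`) rescales each row to unit length; it does not prevent hash collisions. Also note that, by default, `alternate_sign=True` flips the sign of some hashed features to soften the effect of collisions, so some totals can come out negative.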

In [None]:
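hvec = HashingVectorizer(stop_words='english', norm=None)
hvec.fit(space_messages)  # stateless: fit() learns nothing from the data
print(hvec.n_features)

# Sum each hashed column on the sparse matrix; a dense 100 x 2**20 matrix is very memory-hungry
hashed = hvec.transform(space_messages)
answer = pd.Series(np.asarray(hashed.sum(axis=0)).ravel())
answer.sort_values(ascending=False).head(10)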

In [None]:
hvec.get_stop_words()

### Stemming

Often, slightly different versions of a word exist. For example, LinkedIn sees 6000+ variations of the title "Software Engineer" and 8000+ variations of the word "IBM".

It would be wrong to consider the words "MR." and "mr" to be different features, thus we need a technique to normalize words to a common root. This technique is called Stemming.

- Science, Scientist => Scien
- Swimming, Swimmer, Swim => Swim

We could define a Stemmer based on rules that we've decided on:

In [None]:
def stem(tokens):
    '''rules-based stemming of a bunch of tokens'''
    
    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('ing'):
            new_bag.append(token[:-3])
        else:
            new_bag.append(token)

    return new_bag

stem(['Science', 'Scientist'])

But it is often easier to make use of `nltk` to do so:

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('general'))
print(stemmer.stem('scientists'))
print(stemmer.stem('scientist'))

### Stop Words

Some words are very common and provide no information on the text content.

We should remove these stop words. Every language has different stop words, and you can add your own words as well if it makes sense (for example, if there was a domain word that was prevalent but ultimately uninformative, you should use your judgement to remove it).

`nltk` provides its own list of stop words (you may need to run `nltk.download('stopwords')` once first):

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop_russian = stopwords.words('russian')
print(stop)

We can also use `CountVectorizer` to remove common stop words for us by setting the `stop_words` keyword argument to `'english'`. First without stop word removal, and then with it:

In [None]:
cv = CountVectorizer()
cv.fit(space_messages)
stopped_words = pd.DataFrame(cv.transform(space_messages).todense(),
                              columns=cv.get_feature_names_out())
print(stopped_words.sum().sort_values(ascending=False).head(10))

In [None]:
cv = CountVectorizer(stop_words='english')
cv.fit(space_messages)
stopped_words = pd.DataFrame(cv.transform(space_messages).todense(),
                              columns=cv.get_feature_names_out())
print(stopped_words.sum().sort_values(ascending=False).head(10))

In [None]:
stop.extend(['grace', 'jense'])
stop

This dramatically changes the type of "common" words available.

In [None]:
sent = 'i am a data scientist'

for word in sent.split():
    if word not in stop:
        print(word)
    else:
        pass

In [None]:
stop[:-1]

### How to combine transformers

Working with text data requires a lot of preprocessing, as the effectiveness of your model can dramatically change based on how well you've processed your text data. 

In **most** cases, tweaking the settings of `CountVectorizer` will be enough. However, sometimes you may want to create a custom function to preprocess text for `CountVectorizer`.

Imagine that I wanted to do the following to my text:

- Remove punctuation and numbers
- Turn text lowercase
- Remove stop words
- Stem remaining words

We could imagine building a function to do that:

In [None]:
def clean(text):
    stemmer = PorterStemmer()
    stop = stopwords.words('english')
    text = text.translate(str.maketrans('','', string.punctuation))
    text = text.translate(str.maketrans('','', string.digits))
    text = text.lower().strip()
    final_text = []
    for w in text.split():
        if w not in stop:
            final_text.append(stemmer.stem(w.strip()))
    return ' '.join(final_text)

In [None]:
def cleaner(text):
    stemmer = PorterStemmer()
    stop = stopwords.words('english')
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.translate(str.maketrans('', '', string.digits))
    text = text.lower().strip()
    final_text = []
    for w in text.split():
        if w not in stop:
            final_text.append(stemmer.stem(w.strip()))
    return ' '.join(final_text)

That would change our original spam message from:

In [None]:
print(spam)

to:

In [None]:
spam_cleaned = cleaner(spam)
print(spam_cleaned)

We can pass this function into `CountVectorizer` as a way to preprocess the text as a part of the fitting.

In [None]:
cv = CountVectorizer(preprocessor=cleaner)
custom_preprocess = pd.DataFrame(cv.fit_transform(space_messages).todense(),
                                columns=cv.get_feature_names_out())
print(custom_preprocess.sum().sort_values(ascending=False).head(10))

In [None]:
custom_preprocess.shape

In [None]:
cv = CountVectorizer(preprocessor=cleaner)
cv.fit(space_messages)
custom_preprocess = pd.DataFrame(cv.transform(space_messages).todense(),
                              columns=cv.get_feature_names_out())
print(custom_preprocess.sum().sort_values(ascending=False).head(10))

### When to use a custom preprocessor callable

Starting off, I'd make use of what's built into `CountVectorizer` and see if that changes your outcome. However, if you've been working on a dataset for a longer amount of time (such as during your projects or your Capstone) you may want to start developing custom preprocessors that make more sense for the domain that you are in.

### Term frequency - Inverse document Frequency (tf-idf)

Tf-idf tells us which words are the most discriminating between documents. Words that occur a lot in one document but don't occur in many other documents tell you a lot more than words that occur frequently in all documents.

Take a look at the map below. These types of maps usually highlight the most **uniquely favorite food** of a state, not the actual favorite food (which in most cases is probably pizza).

Tf-idf, conceptually, does the same thing: it highlights what is common or typical in one or two cases and rare in all others. That's usually more interesting (and predictive) than words or items that are common everywhere.

![](./images/map-foods.png)

- Term frequency is how often a certain term occurs in a document
- Inverse document frequency is the inverse of the document frequency, where document frequency is the share of documents in the whole corpus that contain that term.

This technique reweights terms to strengthen those that are highly specific to a particular document, while suppressing terms that are common to most documents.

Formally, you'll usually see it like this:

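$$ tf(t, d) = \frac{N_{\text{occurrences of term t in document d}}}{N_{\text{terms in document d}}} $$

$$ idf(t, D) = \log\left(\frac{N_{\text{documents in corpus D}}}{N_{\text{documents that contain term t}}}\right) $$

$$ tfidf(t, d, D) = tf(t, d) \cdot idf(t, D) $$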
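
As a worked example, the sketch below applies these textbook formulas by hand to a tiny made-up corpus, taking the tf-idf score as the product of the two. The corpus and the function names `tf` and `idf` are assumptions for illustration only.

```python
import math

docs = {
    "spam": "cancer donate euros donate".split(),
    "ham": "interview data scientist data".split(),
    "other": "data donate report".split(),
}

def tf(term, doc_tokens):
    """Share of the document's tokens that are this term."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    n_containing = sum(term in tokens for tokens in corpus.values())
    return math.log(len(corpus) / n_containing)

for term in ("cancer", "donate"):
    score = tf(term, docs["spam"]) * idf(term, docs)
    print(term, round(score, 3))
```

Here "donate" occurs twice in the spam document and "cancer" only once, yet "cancer" gets the higher score because it appears in no other document.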

We'll use sklearn's `TfidfVectorizer` to count vectorize and then transform that data using Tfidf:

In [None]:
tvec = TfidfVectorizer(stop_words='english')
tvec.fit([spam, ham])

df = pd.DataFrame(tvec.transform([spam, ham]).todense(),
                   columns=tvec.get_feature_names_out(),
                   index=['spam', 'ham'])
df.head()

In this case, we see that:

- words that don't show up in a document have `0.0000` for values
- words that do show up in a document are weighted based on tf-idf -- the higher the score, the more specific the term is to document _d_ compared with the other documents. 
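
The exact values will not match the textbook formula. As far as I understand its defaults, `TfidfVectorizer` uses the raw count as the term frequency, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. The pure-Python sketch below mimics that recipe on made-up counts; the function name `sklearn_style_tfidf` is hypothetical.

```python
import math

def sklearn_style_tfidf(doc_counts, doc_freq, n_docs):
    """Raw counts times smoothed idf, then scale the row to unit (L2) length."""
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in doc_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Two-document toy corpus: "hello" is in both documents, "cancer" in only one
row = sklearn_style_tfidf({"hello": 1, "cancer": 1},
                          doc_freq={"hello": 2, "cancer": 1}, n_docs=2)
print(row)
```

Because of the `+ 1`, a word that appears in every document still gets a non-zero weight, just a smaller one than a word that is specific to one document.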

### Check for Understanding 3 (15 minutes)

Let's practice some of these new techniques on `space_messages` that we established previously. As a reminder, these contain the first 100 messages in the space newsgroups dataset.

In [None]:
print(space_messages[0])

Individually, please try and tackle the following questions. You may find it helpful to use some of the code you wrote for the previous Check for Understanding.

1. Use the default `CountVectorizer()` to create a DataFrame and answer the following questions (note, we also did this same prompt in the previous CfU -- this is to remind us of where we're starting from):
    - How large is your DataFrame?
    - What are the top 10 occurring words?
2. Use `CountVectorizer()` and remove stop words with the `stop_words` keyword argument. 
    - How large is your DataFrame?
    - What are the top 10 occurring words?
3. Use `TfidfVectorizer()` and remove stop words with the `stop_words` keyword argument.
    - How large is your DataFrame?
    - Interpret the values that are in the DataFrame. What do they mean?
4. We've tried a bunch of different ways to transform the same text data into numbers across this check for understanding and the previous one. Pick two methods and compare and contrast them. What type of words are included in each DataFrame? What types of words are removed or reduced in impact?

When you're finished working individually, share your work with a partner. Did you come to the same conclusions? What happens if you try this with new data?

In [None]:
cv = CountVectorizer(stop_words='english')
cv_sm = cv.fit_transform(space_messages)
df_new = pd.DataFrame(cv_sm.todense(), columns=cv.get_feature_names_out())

In [None]:
df_new.sum().sort_values(ascending=False).head()

In [None]:
tv = TfidfVectorizer(stop_words='english')
tvsd = pd.DataFrame(tv.fit_transform(space_messages).todense(),
                   columns=tv.get_feature_names_out())
tvsd.sum().sort_values(ascending=False).head(10)

In [None]:
tvsd.shape

### How to apply NLP

Typically, we'll use NLP in the feature engineering stage to help us create new features for our data, which we'll then use in either structured or unstructured modeling techniques. What follows is an example of how we might apply this process. We'll first do this in an ad-hoc fashion, then refactor our code for a more production-ready format.

#### Getting Data

Let's use a truncated dataset about the news. This is a set of 8000 randomly selected news articles with an indicator as to whether the news article is discussing economic news or not. 

In [2]:
econ = pd.read_csv('datasets/economic_news.csv', 
                   usecols=[7, 11, 14])
econ.head()

Unnamed: 0,relevance,headline,text
0,yes,Yields on CDs Fell in the Latest Week,NEW YORK -- Yields on most certificates of dep...
1,no,The Morning Brief: White House Seeks to Limit ...,The Wall Street Journal Online</br></br>The Mo...
2,no,Banking Bill Negotiators Set Compromise --- Pl...,WASHINGTON -- In an effort to achieve banking ...
3,no,Manager's Journal: Sniffing Out Drug Abusers I...,The statistics on the enormous costs of employ...
4,yes,Currency Trading: Dollar Remains in Tight Rang...,NEW YORK -- Indecision marked the dollar's ton...


In [3]:
print(econ.iloc[61,1])

Bummer of a Recovery; On economic growth, real GDP has risen 0.8% over the 13 quarters since the recession began, compared to an average increase of 9.9% in past recoveries.


Let's dummify the `relevance` column and strip out `</br>` from our text:

In [4]:
econ['relevance'] = econ['relevance'].apply(lambda x: 1 if x == 'yes' else 0)
econ['text'] = econ['text'].apply(lambda x: x.replace('</br>', ''))
econ['headline'] = econ['headline'].apply(lambda x: x.replace('</br>',''))

In [None]:
for x in range(3):
    print(econ.iloc[0,x])

And let's split into a training set and a test set, using the `text` of the document.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(econ['text'].values, 
                                                   econ['relevance'].values,
                                                   test_size=0.33,
                                                   random_state=2017)

In [None]:
print(X_train.shape)
print(y_train.shape)

#### Transforming data using `CountVectorizer`

We'll use `CountVectorizer` in its default form to prepare a dataframe for modeling. Note that I'm going to leave this in its sparse form, because it typically won't matter for the next step.

In [None]:
cv = CountVectorizer()
cv.fit(X_train)
X = cv.transform(X_train)
print(X.shape)

#### Dimensionality Reduction with PCA

Typically, those 38,000+ columns are not individually informative. Reducing them in dimensionality using PCA can be very helpful. We'll be using a variant of PCA known as [`TruncatedSVD`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) -- it does more or less the same thing as PCA but in a slightly different fashion. It's best used with sparse matrices like these. 

**Note**: `TruncatedSVD` works on the actual matrix itself, _not_ the covariance matrix like PCA. This means that the maximum number of components you can make is limited to the smaller of your rows or columns. In this case, even though we have a large number of columns, we only have around 5,000 rows, meaning that the largest number of components `TruncatedSVD` will be able to make is that number.
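
A small NumPy sketch can illustrate both points: the decomposition is applied to the matrix itself, and the number of components is capped by the smaller dimension. This uses a random toy matrix and `np.linalg.svd` rather than scikit-learn's solver, so treat it as an illustration of the idea only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(5, 8)).astype(float)   # 5 documents, 8 words

# SVD of the matrix itself (no centering, no covariance matrix)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(len(s))                    # at most min(rows, columns) singular values

k = 2
X_reduced = U[:, :k] * s[:k]     # keep only the top-k components
print(X_reduced.shape)
```

Keeping the first `k` columns gives each document `k` new coordinates, which is the same as projecting `X` onto the top `k` rows of `Vt`.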

First, we'll pick a high number of components and graph the explained variance ratio to see if there's a useful cutoff point:

In [None]:
tsvd = TruncatedSVD(n_components=200)
tsvd.fit(X)

In [None]:
plt.plot(range(200),tsvd.explained_variance_ratio_.cumsum())

In [None]:
tsvd = TruncatedSVD(n_components=200)
tsvd.fit(X)
plt.plot(range(200), tsvd.explained_variance_ratio_.cumsum())

This graph would suggest that somewhere north of 100 components might be pretty good (it is where the rate of change begins to flatten out). I'm going to stick with a smaller number (10) for the time being for modeling speed. This will be one of the things we tweak in our Check for Understanding later.
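
Instead of reading the cutoff off the graph, we could also compute it. The sketch below uses made-up explained-variance ratios and a hypothetical helper, `components_needed`, to find the smallest number of components that reaches a target share of variance.

```python
from itertools import accumulate

# Hypothetical explained-variance ratios, largest component first
ratios = [0.30, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02]

def components_needed(ratios, target):
    """Smallest number of leading components whose ratios sum to >= target."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return k
    return None   # target is not reachable with these components

print(components_needed(ratios, 0.70))   # 4
```

With a fitted model, the same function could be called on `tsvd.explained_variance_ratio_`. The target itself (70% here) is a judgment call rather than a rule.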

In [None]:
tsvd = TruncatedSVD(n_components=10)
tsvd.fit(X)
X_tsvd = tsvd.transform(X)

#### Predicting with `RandomForestClassifier`

At this point, we can move forward with modeling on our new, reduced matrix of data (note that the output of `TruncatedSVD` is dense, not sparse):

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_tsvd, y_train)

In [None]:
print(rfc.score(X_tsvd, y_train))
print(confusion_matrix(y_train, rfc.predict(X_tsvd)))
print(classification_report(y_train, rfc.predict(X_tsvd)))

In [None]:
# trying test data with RandomForestClassifier
X_test_cv = cv.transform(X_test)
X_test_svd = tsvd.transform(X_test_cv)
print(rfc.score(X_test_svd, y_test))

print(classification_report(y_test, rfc.predict(X_test_svd)))

In [None]:
cmat = pd.DataFrame(confusion_matrix(y_test, rfc.predict(X_test_svd)),
                  columns=['Predicted 0', 'Predicted 1'], 
                  index=['Actual 0', 'Actual 1'])

In [None]:
cmat

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_tsvd, y_train)
print(rfc.score(X_tsvd, y_train))
print(confusion_matrix(y_train, rfc.predict(X_tsvd)))
print(classification_report(y_train, rfc.predict(X_tsvd)))

In [None]:
cm = confusion_matrix(y_train, rfc.predict(X_tsvd))
pd.DataFrame(cm, columns=['Predict_0', 'Predict_1'],index=['Actual_0', 'Actual_1'])

In [None]:
cr = classification_report(y_train, rfc.predict(X_tsvd))

In [None]:
type(cr)

On the training data, this looks like an exceptionally good model with high accuracy, though we do misclassify about 12% of our class 1 (economic) news as non-economic news. However, we should check to make sure that it works equally well on our test data. As before, we'll make sure that we are applying the already-fit transformations to our test data, not refitting them.

In [None]:
X_test_cv = cv.transform(X_test)
X_test_svd = tsvd.transform(X_test_cv)
print(rfc.score(X_test_svd, y_test))
print(confusion_matrix(y_test, rfc.predict(X_test_svd)))
print(classification_report(y_test, rfc.predict(X_test_svd)))

We do worse here, as expected, and by a wide margin: almost all of our economic news is misclassified as non-economic news, which is a sign that the model has overfit the training data. However, this exercise does lay out our general process for modeling with text data:

1. Use the NLP techniques that we've identified to clean and prepare the data
2. Consider using a dimensionality reduction technique like `TruncatedSVD`
3. Use the prepared data with machine learning techniques that we've already seen before to predict things.
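
The three steps above can be sketched end to end on a toy corpus. This is only an illustration under our own assumptions: the documents, labels, and the `*_demo` names are made up, and the tiny settings (`n_components=2`, `n_estimators=10`) are chosen for speed, not quality. The point to notice is that the vectorizer and the SVD are fit on the training documents only and are then reused, via `transform`, on the test documents.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: 1 = economic news, 0 = everything else
train_docs = [
    "stocks rally as interest rates fall",
    "central bank raises interest rates again",
    "inflation slows and markets rise",
    "local team wins championship game",
    "new movie opens this weekend",
    "chef shares favorite pasta recipe",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["markets react to interest rates", "team loses final game"]

# 1. Clean and prepare: fit the vocabulary on the training text only
cv_demo = CountVectorizer(stop_words='english')
X_train_counts = cv_demo.fit_transform(train_docs)

# 2. Reduce dimensionality: again, fit on training data only
svd_demo = TruncatedSVD(n_components=2, random_state=0)
X_train_reduced = svd_demo.fit_transform(X_train_counts)

# 3. Model with a technique we already know
rfc_demo = RandomForestClassifier(n_estimators=10, random_state=0)
rfc_demo.fit(X_train_reduced, train_labels)

# Test data: transform only, never refit
X_test_counts = cv_demo.transform(test_docs)
X_test_reduced = svd_demo.transform(X_test_counts)
predictions_demo = rfc_demo.predict(X_test_reduced)
print(predictions_demo)
```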

### Check for Understanding 3 (20 minutes)

This is an open-ended check for understanding. Your goal at the end of 20 minutes is to have a better model with this same data, one that maximizes the class 1 recall on the test data:

> Our current model at this point correctly identifies only 24 of 486 economic articles in the test set. We want to increase the predictive power of our model for that class!

How you do so is up to you (exploring how the data is transformed is highly encouraged). Some options:

- Use stop words and other arguments in `CountVectorizer` to change which features are engineered
- Make use of `TfidfVectorizer` to highlight more unique words
- Modify the number of components created in `TruncatedSVD`
- Try a different modeling technique
- Use `GridSearchCV` to optimize hyperparameters
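
Since the goal is class 1 recall rather than accuracy, one possible approach is to tell `GridSearchCV` to optimize for recall directly through its `scoring` argument. The sketch below is one way to set that up, not the intended solution: it uses synthetic imbalanced data from `make_classification` as a stand-in for our transformed text, and searching over `class_weight` is our own suggestion, not something covered above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the transformed text features: roughly 80% class 0
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     weights=[0.8, 0.2], random_state=0)

# Illustrative grid; 'balanced' reweights classes inversely to their frequency
grid_demo = {'C': [0.05, 0.5],
             'class_weight': [None, 'balanced']}

# scoring='recall' scores each candidate on recall for the positive class (1)
gs_demo = GridSearchCV(LogisticRegression(solver='liblinear'),
                       param_grid=grid_demo,
                       scoring='recall',
                       cv=5)
gs_demo.fit(X_demo, y_demo)

print(gs_demo.best_params_)
print(gs_demo.best_score_)
```

Keep in mind that pushing recall up usually costs precision, so it is worth checking the full `classification_report` for whichever model wins.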

In [8]:
X_train, X_test, y_train, y_test = train_test_split(econ['text'].values, 
                                                   econ['relevance'].values,
                                                   test_size=0.33,
                                                   random_state=2017)

In [9]:
print(X_train.shape)
print(y_train.shape)

(5360,)
(5360,)


In [10]:
cv = CountVectorizer(stop_words='english', min_df=0.01)
X_cv = cv.fit_transform(X_train)
X_cv

<5360x1928 sparse matrix of type '<class 'numpy.int64'>'
	with 370173 stored elements in Compressed Sparse Row format>

In [None]:
tv = TfidfVectorizer(stop_words='english', min_df=0.01)
X_tv = tv.fit_transform(X_train)
X_tv

In [11]:
tsvd_cv = TruncatedSVD(n_components=1000)
X_cv_tsvd = tsvd_cv.fit_transform(X_cv)

In [None]:
tsvd_xtv = TruncatedSVD(n_components=1000)
# store the reduced matrix under a new name so the tf-idf matrix X_tv is not overwritten
X_tv_tsvd = tsvd_xtv.fit_transform(X_tv)

In [12]:
lg = LogisticRegression()
lg.fit(X_cv_tsvd, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [13]:
lg.score(X_cv_tsvd, y_train)

0.87294776119402984

In [14]:
# liblinear supports both the 'l1' and 'l2' penalties searched below
params_grid = {'penalty': ['l1', 'l2'],
               'C': [0.05, 0.5]}

gs = GridSearchCV(LogisticRegression(solver='liblinear'),
                  param_grid=params_grid,
                  n_jobs=-1,
                  verbose=2,
                  cv=5)

In [15]:
gs.fit(X_cv_tsvd, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:   10.2s remaining:    2.5s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   11.4s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.05, 0.5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [16]:
print(gs.best_estimator_)
print(gs.best_params_)
print(gs.best_score_)

LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
{'C': 0.05, 'penalty': 'l1'}
0.823694029851


In [17]:
gsb = gs.best_estimator_

X_test_cv = cv.transform(X_test)
X_test_svd = tsvd_cv.transform(X_test_cv)

In [18]:
print(gsb.score(X_test_svd, y_test))

0.816666666667


In [19]:
cmat = confusion_matrix(y_test, gsb.predict(X_test_svd))
df = pd.DataFrame(cmat, columns=['Predict 0', 'Predict 1'], index=['Actual 0', 'Actual 1'])
df

          Predict 0  Predict 1
Actual 0       2118         36
Actual 1        448         38


In [20]:
print(classification_report(y_test, gsb.predict(X_test_svd)))

             precision    recall  f1-score   support

          0       0.83      0.98      0.90      2154
          1       0.51      0.08      0.14       486

avg / total       0.77      0.82      0.76      2640
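
As a quick check on how to read these two outputs together, the class 1 row of the classification report can be recomputed from the confusion matrix. A minimal sketch, with the counts copied by hand from the output above (the variable names are ours):

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (counts from the output above)
cm_demo = np.array([[2118, 36],
                    [448, 38]])
tn, fp, fn, tp = cm_demo.ravel()

# Recall: of the actual class 1 articles, how many did we find?
recall_1 = tp / (tp + fn)      # 38 / 486

# Precision: of the articles predicted as class 1, how many were right?
precision_1 = tp / (tp + fp)   # 38 / 74

print(round(recall_1, 2), round(precision_1, 2))
```

Rounded to two places, these give the 0.08 recall and 0.51 precision reported for class 1.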



### Revisiting Pipelines

We may also want to create a `Pipeline` object to reproducibly and reliably transform and predict data with one step versus many. We'll refactor our code here to take advantage of that and end with a check for understanding where you can practice the same. 

The first step is to break down our current model into a set of sequential steps. From my original example using the economic news dataset, my steps were:

1. Apply `CountVectorizer` to `X` (matrix of article text)
2. Apply `TruncatedSVD` with 10 components to the count vectorized data
3. Predict using `RandomForestClassifier`

These steps are _sequential_ and can feed one directly into the other. We do not need to include a feature union here.

In [21]:
pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    TruncatedSVD(n_components=10),
    RandomForestClassifier())

If we had more hyperparameters or different steps, we could change what is inside that pipeline to account for that. 
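
As a small sketch of what changing the inside of the pipeline can look like: `make_pipeline` names each step after its lowercased class name, and those names can be used with `set_params` to adjust hyperparameters without rebuilding the object. The name `pipeline_demo` and the new parameter values below are only illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

pipeline_demo = make_pipeline(
    CountVectorizer(stop_words='english'),
    TruncatedSVD(n_components=10),
    RandomForestClassifier())

# Step names are generated automatically from the class names
step_names = list(pipeline_demo.named_steps.keys())
print(step_names)

# Parameters are addressed as <step name>__<parameter name>
pipeline_demo.set_params(truncatedsvd__n_components=50,
                         randomforestclassifier__n_estimators=100)
print(pipeline_demo.named_steps['truncatedsvd'].n_components)
```

The same `<step name>__<parameter name>` keys are what a grid search over a pipeline expects.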

Next, let's fit this to our training data:

In [22]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

Then we'll score it and run the predictions from the training set.

In [23]:
print(pipeline.score(X_train, y_train))
predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))

0.982649253731
[[4425    1]
 [  92  842]]
             precision    recall  f1-score   support

          0       0.98      1.00      0.99      4426
          1       1.00      0.90      0.95       934

avg / total       0.98      0.98      0.98      5360



And finally, we'll score and predict with our test set. We do not need to refit this to the test set!

In [24]:
print(pipeline.score(X_test, y_test))
predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.800757575758
[[2077   77]
 [ 449   37]]
             precision    recall  f1-score   support

          0       0.82      0.96      0.89      2154
          1       0.32      0.08      0.12       486

avg / total       0.73      0.80      0.75      2640



### Check For Understanding 4 (15 Minutes)

Pair up and do the following:

1. Using the model you created in the last Check for Understanding, diagram out the steps involved in going from the training data to a fitted model with your partner. Highlight which steps can happen sequentially and which must happen in parallel (typically, parallel steps happen during the feature engineering stage).

2. Pick a model and refactor the code to use a `Pipeline` instead of its current version. Write this code on the machine that doesn't have the original model on it. This is to help you think about what your code is *doing* versus how you've *written* it

3. If you have time, replicate the process for the other model. 

If you used a grid search, feel free to set the best parameters manually (such as `RandomForestClassifier(n_estimators=100)`).
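
If your diagram does include parallel feature engineering steps, `make_union` is one way to express them. The sketch below assumes a made-up toy corpus and a made-up pair of feature sets (word counts alongside character n-gram tf-idf), so treat it as an illustration of the shape of the code, not as a recommended model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

# Hypothetical toy data: 1 = economic news, 0 = everything else
docs = [
    "stocks rally as interest rates fall",
    "central bank raises interest rates again",
    "inflation slows and markets rise",
    "local team wins championship game",
    "new movie opens this weekend",
    "chef shares favorite pasta recipe",
]
labels = [1, 1, 1, 0, 0, 0]

# Parallel step: both vectorizers see the same raw text,
# and their outputs are stacked side by side as columns
features = make_union(
    CountVectorizer(stop_words='english'),
    TfidfVectorizer(analyzer='char', ngram_range=(2, 3)))

# Sequential steps: the union feeds directly into the classifier
model_demo = make_pipeline(features, LogisticRegression())
model_demo.fit(docs, labels)

fitted_union = model_demo.named_steps['featureunion']
n_word = len(fitted_union.transformer_list[0][1].vocabulary_)
n_char = len(fitted_union.transformer_list[1][1].vocabulary_)
combined = fitted_union.transform(docs)
print(n_word, n_char, combined.shape)
```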

In [25]:
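# solver='liblinear' is set explicitly because the 'l1' penalty requires a solver that supports it
pipe = Pipeline([('countvec', CountVectorizer(stop_words='english')),
                 ('trunc', TruncatedSVD(n_components=1000)),
                 ('log', LogisticRegression(penalty='l1', C=0.05, solver='liblinear'))])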

pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('countvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
    ...ty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [26]:
pipe.score(X_train, y_train)

0.82667910447761195

In [27]:
pipe.score(X_test, y_test)

0.81515151515151518

In [28]:
print(confusion_matrix(y_test, pipe.predict(X_test)))
print(classification_report(y_test, pipe.predict(X_test)))

[[2118   36]
 [ 452   34]]
             precision    recall  f1-score   support

          0       0.82      0.98      0.90      2154
          1       0.49      0.07      0.12       486

avg / total       0.76      0.82      0.75      2640



In [None]:
pipe.named_steps.keys()

In [None]:

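# `grid` was never defined above; as an illustration, reuse the values from the earlier search
grid = {'log__penalty': ['l1', 'l2'], 'log__C': [0.05, 0.5]}  # keys are <step name>__<parameter>
gs = GridSearchCV(pipe, grid, verbose=2, n_jobs=-1)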

In [None]:
gs.fit(X_train, y_train)