
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP II: `CountVectorizer` and `TfidfVectorizer`

_Authors: Dave Yerrington (SF), Justin Pounders (ATL), Riley Dallas (AUS)_

---


![](https://snag.gy/uvESGH.jpg)

## Learning Objectives
---

*After this lesson, you will be able to:*

- Extract features from unstructured text with `sklearn`.
- Describe how count vectorization and TF-IDF work.
- Implement `CountVectorizer` in a spam classification model.


## Imports
---

We'll need the following libraries for today's lecture:
- `pandas`
- `CountVectorizer` and `TfidfVectorizer` from `sklearn.feature_extraction.text`
- `Pipeline`
- `train_test_split` and `GridSearchCV`
- `LogisticRegression`

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

<a name="intro"></a>
## Introduction to Text Feature Extraction

---

The models we’ve been using so far accept a two-dimensional matrix of real numbers as input `X` and a target vector of classes or numbers as `y`. What if our features are not given in the form of a table of numbers but rather are unstructured? This is the case when we work with text documents.

> We need a way to go from unstructured data to our numeric `X` matrix in order to use the same models. This is called _feature extraction_, and this lesson is dedicated to it.

Text data has a wide range of applications in statistical modeling. Some examples include:

- Sentiment analysis of Yelp reviews.
- Identifying topics of news articles.
- Classification of political authors.


<a id='simple'></a>
## A Simple Example
---

Suppose we're building a model that predicts whether a sentence is from a children's book or not. The inputs are strings and the output is a binary label.

Here are some sample inputs:

In [None]:
s1 = 'Spot is a dog.'
s2 = 'Run Spot Run.'
s3 = 'Run Forrest, Run!'
s4 = 'The quick brown fox jumped over the lazy dog.'

## Basic terminology

---

Virtually all NLP uses this base terminology:

- A collection of text is a **document**. You can think of a document as a row in your feature matrix.
- A collection of documents is a **corpus** (plural: corpora).

In [None]:
# Build a corpus (one possible split of the four sample sentences)
train_sentences = [s1, s2]
test_sentences = [s3, s4]

<a id='bow'></a>
## Bag of Words/Word Counting
---

The bag-of-words model is a simplified representation of the raw data. In this model, text (such as a sentence or document) is represented as the bag (multiset) of its words.

Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.

In [None]:
# Instantiate empty dictionary.
bag_of_words = {}

# Iterate through each word in each sentence.
for sentence in [s1, s2, s3, s4]:
    for word in sentence.split():
        
        # Create a key-value pair in the dictionary for each word,
        # with the key being the word and the value starting at 0.
        bag_of_words[word] = 0
        
# Iterate back through each word in each sentence.
for sentence in [s1, s2, s3, s4]:
    for word in sentence.split():
        
        # Every time we see word again, add 1 to the value.
        bag_of_words[word] += 1

<details><summary>What might be some of the advantages of the bag-of-words approach?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

<details><summary>What might be some of the disadvantages of the bag-of-words approach?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>
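
To make the second disadvantage concrete, here is a minimal sketch using Python's `collections.Counter`. The two review sentences are made up for illustration. They mean opposite things, yet they produce the same bag of words because word order is discarded.

```python
from collections import Counter

# Two made-up sentences with opposite meanings but the same words.
review_a = 'the food was good not bad'
review_b = 'the food was bad not good'

# A bag of words only records how often each word occurs.
bag_a = Counter(review_a.split())
bag_b = Counter(review_b.split())

# The two bags are identical, so a bag-of-words model cannot tell them apart.
print(bag_a == bag_b)
```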

<a name="countvectorizer"></a>
## Demo: Scikit-Learn `CountVectorizer`
---

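Scikit-learn's `CountVectorizer` class turns a corpus into a matrix of token counts, with one row per document and one column per word in the vocabulary.

**Note**: There are several parameters to tweak. We cover the most common ones in the sections that follow.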

In [None]:
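# Instantiate a CountVectorizer
cvec = CountVectorizer()

In [None]:
# Fit the vectorizer on our corpus
cvec.fit(train_sentences)

In [None]:
# Transform the corpus
X_train = cvec.transform(train_sentences)


In [None]:
# Convert X_train into a DataFrame
X_train_df = pd.DataFrame(X_train.toarray(), columns=cvec.get_feature_names_out())

In [None]:
# Transform test (reusing the vocabulary learned from the training sentences)
X_test = cvec.transform(test_sentences)
X_test_df = pd.DataFrame(X_test.toarray(), columns=cvec.get_feature_names_out())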

<a id='stopwords'></a>
## Stop Words

---

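Some words (such as "the", "a", and "is") are so common that they usually provide little information about the content of the text. These are called stop words. They aren't always uninformative, though, so removing them is a modeling choice rather than a rule.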

In [None]:
# Built-in English stop word list (a frozenset of strings)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
 


`CountVectorizer` gives you the option to eliminate stop words from your corpus.

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stop words that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```

<details><summary>What might be an example of a situation in which we do not want to remove stopwords?</summary>
    
- Any case where stopwords are important/predictive, we'd want to keep them.
    - Poetry.
    - Chatbots.
    - Translation services. (e.g. Google Translate)
- If our data set is a small size, there may not be a reason to remove stopwords.
</details>
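
As a quick illustration of what `stop_words='english'` does, the sketch below fits two vectorizers on the same sentence and compares their vocabularies. The variable names (`docs`, `with_stops`, `without_stops`) are just for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The quick brown fox jumped over the lazy dog.']

# Default settings keep every token of two or more characters.
with_stops = CountVectorizer().fit(docs)

# stop_words='english' drops tokens found in scikit-learn's built-in list.
without_stops = CountVectorizer(stop_words='english').fit(docs)

print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```

The second vocabulary is smaller because words like "the" are filtered out before counting.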

## Vocabulary size

---
One downside to `CountVectorizer` is that its vocabulary (`cvec.get_feature_names_out()`) can get rather large. To mitigate this problem, you can set `max_features` to keep only the $N$ most frequent vocabulary words in the corpus.

```python
cvec = CountVectorizer(max_features=1000) # Only the top 1,000 words from the entire corpus will be saved
```
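
Here is a small sketch of `max_features` on a toy corpus (the three sentences are invented for this example). With `max_features=2`, only the two terms with the highest total count across the corpus are kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: 'run' appears 4 times, 'spot' twice, every other word once.
docs = ['run spot run', 'run forrest run', 'spot is a dog']

cvec_small = CountVectorizer(max_features=2)
cvec_small.fit(docs)

# Only the two most frequent terms survive.
kept_terms = sorted(cvec_small.vocabulary_)
print(kept_terms)
```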

## N-Gram Range
---

`CountVectorizer` has the ability to capture $n$-word phrases, also called $n$-grams. Consider the following:

> The quick brown fox jumped over the lazy dog.

In the example sentence, the 2-grams (aka bi-grams) are:
- 'the quick'
- 'quick brown'
- 'brown fox'
- 'fox jumped'
- 'jumped over'
- 'over the'
- 'the lazy'
- 'lazy dog'

And the 3-grams are:
- 'the quick brown'
- 'quick brown fox'
- 'brown fox jumped'
- 'fox jumped over'
- 'jumped over the'
- 'over the lazy'
- 'the lazy dog'

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every single word and every 2-gram
```
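
To check the bi-gram list above, this sketch fits a vectorizer with `ngram_range=(2, 2)` (2-grams only) on the example sentence. The names `sentence` and `bigram_vec` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'The quick brown fox jumped over the lazy dog.'

# ngram_range=(2, 2) keeps only 2-grams; (1, 2) would keep single words as well.
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_vec.fit([sentence])

# The vectorizer lowercases the text and strips punctuation before building n-grams.
bigrams = sorted(bigram_vec.vocabulary_)
print(bigrams)
```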

## Min/Max Document Frequency
---

We can tell `CountVectorizer` to only consider words whose document frequency (the number or share of documents they appear in) falls within a given range.

For example, if we only want `CountVectorizer` to add words that occur in **at least** two documents, we can do the following:

```python
cvec = CountVectorizer(min_df=2) # A word must occur in at least two documents from the corpus
```

Conversely, we can set an upper threshold with `max_df`:

```python
cvec = CountVectorizer(max_df=.98) # Ignore words that occur in > 98% of the documents from the corpus
```
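
The sketch below shows both thresholds on a toy corpus of four made-up documents. An integer threshold is read as a document count, and a float as a proportion of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Document frequencies: 'spot' is in 3 docs, 'dog' and 'run' in 2, the rest in 1.
docs = ['spot is a dog', 'run spot run', 'run forrest run', 'see spot the dog']

# Keep terms that appear in at least two documents.
at_least_two = CountVectorizer(min_df=2).fit(docs)
print(sorted(at_least_two.vocabulary_))

# Additionally drop terms that appear in more than 50% of the documents.
bounded = CountVectorizer(min_df=2, max_df=0.5).fit(docs)
print(sorted(bounded.vocabulary_))
```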

## Spam classification model
---
Let's try this out on some SMS text data. Can we tell legitimate messages (ham) from promotional ones (spam) based only on what is written? Let's see...

> This data set was taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [None]:
df = pd.read_csv('./datasets/SMSSpamCollection', sep='\t', names=['label', 'message'])
df.head()

## Data Cleaning
---

Convert ham/spam into binary labels:
- 0 for ham
- 1 for spam

In [None]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

## Baseline accuracy
---

We need to calculate baseline accuracy in order to tell if our model is outperforming the null model (predicting the majority class).
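
One way to compute the baseline is to take the share of the most common class. The sketch below uses a small made-up label Series so that it runs on its own; in the lesson you would use `df['label']` instead.

```python
import pandas as pd

# Hypothetical labels for illustration only (0 = ham, 1 = spam).
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Share of each class in the data.
class_shares = labels.value_counts(normalize=True)

# The null model always predicts the majority class,
# so its accuracy equals the largest class share.
baseline_accuracy = class_shares.max()
print(baseline_accuracy)
```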

## Model prep
---

Let's set up our data for modeling:
- `X` will be the `message` column. **NOTE**: `CountVectorizer` expects a one-dimensional iterable of strings, so make sure you set `X` to be a `pandas` Series, **not** a DataFrame.
- `y` will be the `label` column

In [None]:
X = df['message']
y = df['label']

## Train/Test split
---

Use the `train_test_split` function to split your data into a training set and a holdout set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

## Pipeline
---

Our pipeline will consist of two stages:
1. An instance of `CountVectorizer`
2. A `LogisticRegression` instance

In [None]:
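# 'cvec' must match the prefix used in the grid search below; 'lr' is an arbitrary step name
pipe = Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression())])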

## `GridSearchCV`
---

At this point, you could use your `pipe` object as a model:

```python
from sklearn.model_selection import cross_val_score

# Evaluate how your model will perform on unseen data
cross_val_score(pipe, X_train, y_train, cv=3).mean()

# Fit your model
pipe.fit(X_train, y_train)

# Training score
pipe.score(X_train, y_train)

# Test score
pipe.score(X_test, y_test)
```

Since we want to tune the `CountVectorizer` hyperparameters, we'll load our `pipe` object into `GridSearchCV`.

In [None]:
pipe_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_
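
The keys in `pipe_params` follow scikit-learn's `<step name>__<parameter>` convention. As a small sketch, assuming the first pipeline step is named `cvec` (as the grid above implies) and using `lr` as a made-up name for the logistic regression step, you can list every tunable vectorizer parameter like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Step names are assumptions for this example: 'cvec' and 'lr'.
demo_pipe = Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression())])

# get_params() exposes each step's parameters as '<step>__<parameter>'.
cvec_param_names = sorted(
    name for name in demo_pipe.get_params() if name.startswith('cvec__')
)
print(cvec_param_names)
```

Any of these names can be used as a key in the parameter grid passed to `GridSearchCV`.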

<a name="tfidf"></a>
## Term Frequency-Inverse Document Frequency (TF-IDF)

---

A TF-IDF score tells us which words are most discriminating between documents. Words that occur often in one document but don't occur in many documents contain a great deal of discriminating power.

- This weight is a statistical measure used to evaluate how important a word is to a document in a collection (corpus).
- The importance increases in proportion to the number of times a word appears in a document but is offset by the frequency of the word in the corpus.

Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The inverse document frequency is a measure of how much information the word provides — that is, whether the term is common or rare across all documents. It's the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.

**Let's see how it's calculated:**

Term frequency (`tf`) is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

where

- $N_\text{term}$ is the number of times a term/word $t$ appears in document $d$
- $N_\text{terms in Document}$ is the number of terms/words in document $d$

Inverse document frequency (`idf`) measures how rare a term is across the whole corpus. It is the logarithm of the total number of documents divided by the number of documents that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

where

- $N_\text{Documents}$ is the number of documents in the corpus $D$
- $N_\text{Documents that contain term}$ is the number of documents in $D$ that contain term/word $t$

TF-IDF is then calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$
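
The three formulas above can be sketched in plain Python. The toy corpus and the helper names (`tf`, `idf`, `tfidf`) are made up for this example. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of `idf` and normalizes each row by default, so its numbers will not match this hand calculation exactly.

```python
import math

# Toy corpus: each document is already split into lowercase words.
corpus = [
    ['spot', 'is', 'a', 'dog'],
    ['run', 'spot', 'run'],
    ['run', 'forrest', 'run'],
]

def tf(term, doc):
    # Times the term appears in the document, divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

last_doc = corpus[2]
print(tfidf('run', last_doc, corpus))
print(tfidf('forrest', last_doc, corpus))
```

In the last document, "run" appears twice and "forrest" only once, but "forrest" gets the higher score because it occurs in just one document of the corpus.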


> **You might ask: But what is `log` used for?**<br>
> Good question! The logarithm is a sublinear transformation: it compresses the raw ratio so that extremely rare terms do not dominate the score, and a term that appears in every document gets $\log 1 = 0$.

> Informally, a sublinear function $f$ grows slower than any linear function $g$ for sufficiently large input (paraphrased from Wikipedia).

<a id='tfidf-vec'></a>
## Practice Using the `TfidfVectorizer`

---

### Why Use TF-IDF?
- Common words are penalized.
- Rare words have more influence.

Scikit-learn provides a TF-IDF vectorizer, `TfidfVectorizer`, that works similarly to `CountVectorizer`. Notice that we can also eliminate stop words to improve our analysis.

As you did above, import and initialize the `TfidfVectorizer`, then fit the spam and ham data.

In [None]:
texts = ["am a cat", "am a rat", "am a bat"]

In [None]:
# Fit the transformer
tvec = TfidfVectorizer()
tvec.fit(texts)

In [None]:
df = pd.DataFrame(tvec.transform(texts).toarray(), columns=tvec.get_feature_names_out())

<a id='resources'></a>
## Additional Resources

---

- Check out this [Yelp blog post](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) on how it completed a classification task (with more than 1,000 response variables) using restaurant review text.
- Always check documentation: 
    - [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). 
    - [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). 
    - [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
- A list of all stop words is available [here](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-05/5.04-nlp/stop-words.txt).
- Wikipedia's [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and [hash functions](https://en.wikipedia.org/wiki/Hash_function) entries are a great place to turn for more information on hashing.
- Check out Charlie Greenbacker's [introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf), which he delivered at the [DC-NLP Meetup](http://www.meetup.com/DC-NLP/).
- Wikipedia also has a [walk through](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of TF-IDF.
- We played with Google's [ngram tool](https://books.google.com/ngrams/graph?content=data+science&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0).
- A hilarious data scientist has gone rogue and used NLP and eigenfaces (eigenvalues for face recognition) [for Tinder](http://dataconomy.com/hacking-tinder-with-facial-recognition-nlp/).
- We referenced KPCB's 2016 internet trends. If you're into startups, check out [this insightful deck](http://www.kpcb.com/internet-trends).
- [Choosing a stemmer](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html).