<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Part 2: `CountVectorizer` and `TfidfVectorizer`



### Learning Objectives

- Extract features from unstructured text by fitting and transforming with `CountVectorizer` and `TfidfVectorizer`.
- Describe how `CountVectorizer` and `TfidfVectorizer` work.
- Understand `stop_words`, `max_features`, `min_df`, `max_df`, and `ngram_range`.


In [None]:
# imports
import pandas as pd
import matplotlib.pyplot as plt

# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Introduction to Text Feature Extraction

The models we've learned, like linear regression, logistic regression, and k-nearest neighbors, take in an `X` and a `y` variable.
- `X` is a matrix/dataframe of independent variables.
- `y` is a vector/series representing the target variable.

Text data (also called natural language data) is not already organized as a matrix or vector of real numbers. We say that this data is **unstructured**.

> This lesson focuses on how to transform unstructured text data into a numeric `X` matrix. This matrix is known as a term-document matrix (also called a document-term matrix): each row is a document, each column is a term, and each cell holds the number of times that term occurs in that document.



## Basic terminology

---

- A collection of text is a **document**. 
    - You can think of a document as a row in your feature matrix.
- A collection of documents is a **corpus**. 
    - You can think of your full dataframe as the corpus.
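
To make the terminology concrete, here is a minimal sketch on a made-up corpus. The three strings in `toy_corpus` are hypothetical and are not taken from the jobs data; each string is one document, and the fitted matrix has one row per document and one column per unique token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: each string is one document.
toy_corpus = [
    "python and sql required",
    "sql and statistics required",
    "python python python",
]

toy_cvec = CountVectorizer()
toy_matrix = toy_cvec.fit_transform(toy_corpus)

# One row per document, one column per unique token.
print(toy_matrix.shape)
# Columns are ordered alphabetically by token.
print(sorted(toy_cvec.vocabulary_))
# Each cell is the count of that token in that document.
print(toy_matrix.toarray())
```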

# Count Vectorizer

In order to use unstructured data, first we have to put it in a structured format. 

Consider a list of job descriptions for data scientists extracted from LinkedIn. We can treat each job description as a document, and we want to find out which job requirements are common across the documents.



In [None]:
# First define the corpus of documents (body of documents)

jobs_data = pd.read_csv('data/jobs.csv')
jobs_data.head()


In [None]:
# Have a look at the full description for one job (index label 6)
jobs_data['description'][6]

In [None]:
# CountVectorizer expects an iterable of strings, so we pull the description column out of the DataFrame as a Series.
jobs = jobs_data['description']


In [None]:
# instantiate the CountVectorizer
cvec = CountVectorizer(binary=True)


Some important default hyperparameters to consider first:
- `analyzer = 'word'` : the features to be identified are words
- `token_pattern = r'(?u)\b\w\w+\b'` : tokens consist of two or more alphanumeric characters
- `binary = False` : each occurrence of the token is counted. If `True`, the matrix only records whether the token exists in the document.
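
The effect of these defaults is easiest to see on a tiny example. The sketch below uses two made-up sentences (`toy_docs` is hypothetical): the default `token_pattern` drops one-character tokens such as "R" and "a", and `binary=True` replaces counts with 0/1 flags.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
toy_docs = ["R is a language, R is fun", "SQL is a language"]

counts = CountVectorizer()             # binary=False by default
flags = CountVectorizer(binary=True)   # presence/absence only

count_matrix = counts.fit_transform(toy_docs).toarray()
flag_matrix = flags.fit_transform(toy_docs).toarray()

# One-character tokens ("R", "a") never match the default token_pattern,
# so they are missing from the vocabulary.
print(sorted(counts.vocabulary_))
# "is" appears twice in the first document, so its count is 2 ...
print(count_matrix)
# ... but with binary=True it is recorded as 1.
print(flag_matrix)
```

Note that a job requirement like "R" silently disappears under the default pattern, which is worth keeping in mind when reading the results for the `jobs` corpus.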

Let's see the results when we fit the CountVectorizer to the `jobs` corpus.

In [None]:
# Fit the count vectorizer to the corpus
cvec.fit(jobs)

Unstructured text data carries a lot of information.
- When we force unstructured text data to follow a "spreadsheet" or "dataframe" structure, we might lose some of that information.
- For example, `CountVectorizer` creates a vector (column) for each token and counts the number of occurrences of each token in each document.

Our tokens are now stored as a **bag-of-words**. This is a simplified way of looking at and storing our data. 
- Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.
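
As a quick illustration of what "discarding order" means, the sketch below vectorizes two made-up sentences (`toy_docs` is hypothetical) that contain the same words in a different order. Their bag-of-words rows come out identical even though the sentences mean different things.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order, different meaning.
toy_docs = ["the dog bit the man", "the man bit the dog"]
toy_matrix = CountVectorizer().fit_transform(toy_docs).toarray()

print(toy_matrix)

# The two rows are indistinguishable once word order is thrown away.
same = (toy_matrix[0] == toy_matrix[1]).all()
print(same)
```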

<img src="images/countvectorizer.png" alt="drawing" width="750"/>

[Source](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061).


In [None]:
# The count vectorizer identifies each token as a feature.
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.)
cvec.get_feature_names_out()


In [None]:
# Transform into a term-document matrix
cv_matrix = cvec.transform(jobs)
cv_matrix

## Term-Document Matrix

As you can see, the matrix is stored in 'Compressed Sparse Row' format because most of its entries are zero. Many tokens appear in only one of the documents, yet each token is still assigned a whole column of the matrix.
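
To get a feel for how sparse such a matrix is, here is a small sketch on a made-up corpus (`toy_docs` is hypothetical) where no token is shared between documents. It compares the number of stored non-zero entries with the total number of cells.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus in which every token appears in exactly one document.
toy_docs = ["python sql", "tableau excel", "spark hadoop"]
sparse_counts = CountVectorizer().fit_transform(toy_docs)

n_rows, n_cols = sparse_counts.shape
stored = sparse_counts.nnz               # number of stored (non-zero) entries
density = stored / (n_rows * n_cols)     # fraction of cells that are non-zero

print(n_rows, n_cols, stored)
print(round(density, 2))
```

Only the non-zero cells are kept in memory, which is why the sparse format scales to vocabularies with many thousands of columns.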

Let's print the matrix to see what it looks like.

In [None]:
print(cv_matrix)

In [None]:
# Convert to a dense array (shows the whole matrix, including all the zeros)
cv_dense = cv_matrix.toarray()
cv_dense

In [None]:
# Format as a dataframe to see the terms and documents
df = pd.DataFrame(data=cv_dense, columns=cvec.get_feature_names_out())
df

In [None]:

# Plot the 20 most common tokens (with binary=True, each sum is the number of documents containing the token)
df.sum().sort_values(ascending=False).head(20).plot(kind='barh');

Creating a term-document matrix like this is also known as a bag-of-words approach.

<details><summary>What might be some of the advantages of using this bag-of-words approach when modeling?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

<details><summary>What might be some of the disadvantages of using this bag-of-words approach when modeling?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>

Let's see if we can process the corpus so that it gives us more meaningful information.

We will consider some of the different hyperparameters of `CountVectorizer`:
- `stop_words`
- `max_features`, `max_df`, `min_df`
- `ngram_range`

## Stopwords

Notice that there are many common words ('and', 'to', 'the') in the term-document matrix, which may not be very useful for analyzing common job requirements.

`CountVectorizer` gives you the option to eliminate stopwords from your corpus when instantiating your vectorizer.

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stopwords that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```
Or you can add more stopwords to the default set:

```python
from sklearn.feature_extraction import text

# union() returns a frozenset; recent scikit-learn versions require stop_words to be a list
xtra_stop_words = list(text.ENGLISH_STOP_WORDS.union(['list', 'of', 'words', 'to', 'stop']))
cvec = CountVectorizer(stop_words=xtra_stop_words)
```
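
To see which tokens the built-in English list actually removes, here is a minimal sketch that fits one vectorizer with and one without `stop_words='english'` on two made-up sentences (`toy_docs` is hypothetical) and compares the vocabularies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job-description fragments.
toy_docs = ["experience with python and the cloud", "knowledge of sql and python"]

with_stops = CountVectorizer().fit(toy_docs)
without_stops = CountVectorizer(stop_words="english").fit(toy_docs)

# Tokens present by default but dropped by the English stop-word list.
removed = sorted(set(with_stops.vocabulary_) - set(without_stops.vocabulary_))
print(removed)

# What is left after stop-word removal.
print(sorted(without_stops.vocabulary_))
```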


In [None]:
from sklearn.feature_extraction import text

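# union() returns a frozenset; recent scikit-learn versions require stop_words to be a list
xtra_stop_words = list(text.ENGLISH_STOP_WORDS.union(['data', 'science']))

# instantiate the CountVectorizer with the default English stopwords plus our additions
cvec_stopwords = CountVectorizer(stop_words=xtra_stop_words)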

# fit_transform is a more efficient way of performing both the fit and transform in one step
cvec_stopwords.fit_transform(jobs)

# how many features are identified?

In [None]:
cvec_stopwords.get_feature_names_out()

In [None]:
cv_matrix = cvec_stopwords.transform(jobs)

In [None]:
cv_matrix

In [None]:
# Format as a dataframe to see the terms and documents
df = pd.DataFrame(data=cv_matrix.todense(), columns=cvec_stopwords.get_feature_names_out())
df

In [None]:

# Plot the 20 most frequent tokens after removing stopwords
df.sum().sort_values(ascending=False).head(20).plot(kind='barh');

### Vocabulary size

---
One downside to `CountVectorizer` is that its vocabulary (`cvec.get_feature_names_out()`) can get really large. We're creating one column for every unique token in the corpus!

There are three hyperparameters to help you control this.

1. You can set `max_features` to only include the $N$ most frequent tokens in the corpus.

```python
cvec = CountVectorizer(max_features=1_000) # Only the top 1,000 words from the entire corpus will be saved
```

2. You can tell `CountVectorizer` to only consider words that occur in **at least** some number of documents (df = document frequency).

```python
cvec = CountVectorizer(min_df=2) # A word must occur in at least two documents from the corpus
```

3. Conversely, you can tell `CountVectorizer` to only consider words that occur in **at most** some percentage of documents.

```python
cvec = CountVectorizer(max_df=.98) # Ignore words that occur in > 98% of the documents from the corpus
```

Both `max_df` and `min_df` accept either an integer or a float.
- An integer is an absolute number of documents.
- A float between 0.0 and 1.0 is a proportion of documents.
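
The three hyperparameters can be compared side by side on a small made-up corpus. In the sketch below, `toy_docs` is hypothetical; "python" appears in all four documents, "sql" in two, and every other token in one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus. Document frequencies:
# python=4, sql=2, spark=1, tableau=1, excel=1
toy_docs = [
    "python sql spark",
    "python sql",
    "python tableau",
    "python excel",
]

# Integer min_df: keep tokens found in at least 2 documents.
at_least_two = CountVectorizer(min_df=2).fit(toy_docs)
# Float max_df: drop tokens found in more than 50% of documents.
at_most_half = CountVectorizer(max_df=0.5).fit(toy_docs)
# max_features: keep only the 2 tokens with the highest total count.
top_two = CountVectorizer(max_features=2).fit(toy_docs)

print(sorted(at_least_two.vocabulary_))  # ['python', 'sql']
print(sorted(at_most_half.vocabulary_))  # ['excel', 'spark', 'sql', 'tableau']
print(sorted(top_two.vocabulary_))       # ['python', 'sql']
```

Note that `max_df=0.5` keeps "sql" because it appears in exactly half of the documents, and only tokens above the threshold are dropped.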

<details><summary>Why might we want to control these vocabulary size hyperparameters?</summary>
    
- If we have too many features, our models may take a **very** long time to fit.
- Control for overfitting/underfitting.
- Words in 99% of documents or words occurring in only one document might not be very informative.
</details>

### N-Gram Range
---

`CountVectorizer` has the ability to capture $n$-word phrases, also called $n$-grams. 

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(2,2)) # Captures only 2-grams
```

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every 1-gram and every 2-gram
```
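
Here is a minimal sketch of how `ngram_range` changes the vocabulary, using two made-up job titles (`toy_docs` is hypothetical). With `(1, 2)` the vocabulary is the union of the 1-grams and the 2-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job titles.
toy_docs = ["machine learning engineer", "deep learning engineer"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(toy_docs)
both = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

print(sorted(unigrams.vocabulary_))  # ['deep', 'engineer', 'learning', 'machine']
print(sorted(bigrams.vocabulary_))   # ['deep learning', 'learning engineer', 'machine learning']
print(len(both.vocabulary_))         # 7 (4 unigrams + 3 bigrams)
```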
Let's see the difference with our `jobs` documents.

In [None]:
# Try different ngram_range values and add to the list of stop words.
# max_df=3 is an integer, so only bigrams that appear in at most 3 documents are kept.
cvec = CountVectorizer(ngram_range=(2,2), stop_words='english', max_df=3, binary=True)

cvec.fit(jobs)
cv_matrix = cvec.transform(jobs)

In [None]:
# What are the feature names now?
cvec.get_feature_names_out()

In [None]:
# Store as a data frame
df = pd.DataFrame(data=cv_matrix.todense(), columns=cvec.get_feature_names_out())
df

In [None]:
# Plot the 20 features that appear in the most documents (binary=True counts each document once)
df.sum().sort_values(ascending=False).head(20).plot(kind='barh');

Congratulations! We've used `CountVectorizer` to transform our text data into something we can pass into a model.

But what if we want to do something more than just count up the occurrence of each token?

## Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer

---

When modeling, which words do you think tend to be the most helpful?
- Words that are common across all documents.
- Words that are rare across all documents.
- Words that are rare across some documents, and common across some documents.

<details><summary>Answer:</summary>

- Words that are common in certain documents but rare in other documents tend to be more informative than words that are common in all documents or rare in all documents.
- Example: If we were examining poetry over time, the word "thine" might be common in some documents but rare in most documents. The word "thine" is probably pretty informative in this case.
</details>

TF-IDF is a score that tells us which words are important to one document, relative to all other documents. Words that occur often in one document but appear in few other documents tend to carry more predictive power.

Variations of the TF-IDF score are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
- If you want to see how it can be calculated, check out [the Wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [`sklearn`](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) page.
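
For a concrete feel of the calculation, the sketch below reproduces `TfidfVectorizer`'s output by hand on a made-up corpus (`toy_docs` is hypothetical). It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency). Other TF-IDF variants, including the textbook formula on the Wikipedia page, will give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: "python" is in every document, the other tokens in one each.
toy_docs = ["python sql", "python spark", "python python excel"]

# Term frequency: raw counts per document.
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term.
doc_freq = (counts > 0).sum(axis=0)

# scikit-learn's default smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf

# Each row is then scaled to unit Euclidean (L2) length.
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))  # True
```

Because "python" appears in every document, it gets the smallest possible idf (exactly 1), while the rarer tokens are weighted up.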

<img src="images/tfidfvectorizer.png" alt="drawing" width="750"/>

[Source](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061).

### Practice Using the `TfidfVectorizer`

---

`sklearn` provides a TF-IDF vectorizer that works similarly to the CountVectorizer.
- The arguments `stop_words`, `max_features`, `min_df`, `max_df`, and `ngram_range` also work here.
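
One way to see the similarity: `TfidfVectorizer` behaves like a `CountVectorizer` followed by scikit-learn's `TfidfTransformer`. The sketch below checks this on a made-up corpus; `toy_docs` and `shared_params` are hypothetical names chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical corpus and hyperparameters shared by both approaches.
toy_docs = ["senior data analyst", "junior data engineer", "data engineer intern"]
shared_params = {"stop_words": "english", "ngram_range": (1, 2), "min_df": 1}

# One step: text -> TF-IDF matrix.
one_step = TfidfVectorizer(**shared_params).fit_transform(toy_docs)

# Two steps: text -> counts -> TF-IDF matrix.
toy_counts = CountVectorizer(**shared_params).fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# Same vocabulary size and the same weights.
print(one_step.shape == two_step.shape)
print(np.allclose(one_step.toarray(), two_step.toarray()))
```

Since the vocabulary is built the same way, the number of features matches a `CountVectorizer` with the same hyperparameters. Only the values stored in the matrix change.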



In [None]:
# Instantiate the transformer.
tvec = TfidfVectorizer(ngram_range=(2,2), stop_words='english', max_df=3)

In [None]:
# fit and transform the job descriptions
tv_matrix = tvec.fit_transform(jobs)

In [None]:
# how many features are there? 

# is it the same as the count vectorizer with the same hyperparameter values?
tvec.get_feature_names_out()

In [None]:
# Put in a dataframe and view it
df = pd.DataFrame(data=tv_matrix.todense(), columns=tvec.get_feature_names_out())
# What kind of values are stored now?


In [None]:
df

In [None]:
# Plot the 20 features with the highest total TF-IDF weight. Are they the same as before?
df.sum().sort_values(ascending=False).head(20).plot(kind='barh');

Are the results different? Try with different hyperparameter values. 