# Chapter 5: Term Frequency-Inverse Document Frequency

## Instructions

- Run the cells with "assert" statements to check whether your output matches the expected output. If a cell runs without error, your answer matches. If your output differs, the assertion message gives you a hint.

In this notebook, you'll compare `CountVectorizer` and `TfidfVectorizer`.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB

We'll continue to look at the cappuccino cup reviews, but now with 500+ reviews instead of just 6 reviews.

In [2]:
data = pd.read_csv('reviews.csv')
data

Unnamed: 0,sentiment,review
0,negative,I wanted to love this. I was even prepared for...
1,positive,Grove Square Cappuccino Cups were excellent. T...
2,negative,I bought the Grove Square hazelnut cappuccino ...
3,negative,"I love my Keurig, and I love most of the Keuri..."
4,negative,It's a powdered drink. No filter in k-cup.<br ...
...,...,...
509,positive,This is my favorite K-Cup flavor. I like my co...
510,positive,If you are looking for the taste of French Van...
511,positive,I have purchased and used 3 boxes of the Hazel...
512,positive,"Yummy, great tasting and very convenient. Only..."


Our goal is to create a Naive Bayes model that will look at the review text and determine if the review is positive or negative. Let's start by prepping the data.

In [3]:
# define the input and output of the model
X = data.review
y = data.sentiment

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Create a `CountVectorizer` object that filters out English stop words and includes both unigrams and bigrams. Name the object `cv`.

In [4]:
### BEGIN SOLUTION
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))
### END SOLUTION
cv

CountVectorizer(ngram_range=(1, 2), stop_words='english')

In [5]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert cv.stop_words == 'english', "The stop_words should be the English stop words."
assert cv.ngram_range == (1,2), "The ngram_range should be (1, 2) to include both unigrams and bigrams."
### END HIDDEN TESTS
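
To see what these two settings do, here is a minimal sketch on a single made-up sentence (the sentence and the `toy_` names are assumptions for illustration, not part of the review data). It suggests that stop words are removed first and bigrams are then built from the remaining tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentence, not taken from reviews.csv
toy_sentence = ["the coffee is weak"]

toy_ngram_cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))
toy_ngram_cv.fit(toy_sentence)

# vocabulary_ maps each learned term to its column index
toy_ngram_terms = sorted(toy_ngram_cv.vocabulary_)
print(toy_ngram_terms)  # ['coffee', 'coffee weak', 'weak']
```

"the" and "is" are dropped as stop words, so the only bigram left is "coffee weak", even though the two words were not adjacent in the original sentence.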

Next, we're going to fit and transform the `X_train` dataset.
* The `fit` step identifies all of the unique terms in the `X_train` dataset and assigns each one a column.
* The `transform` step fills in the term counts for each document in the `X_train` dataset.

We only transform the `X_test` dataset (rather than fitting and transforming it) because we want the columns in the `X_train_cv` and `X_test_cv` datasets to match. Terms that appear only in `X_test` are ignored.

In [6]:
X_train_cv = cv.fit_transform(X_train)
X_test_cv  = cv.transform(X_test)
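
The difference between `fit_transform` and `transform` is easier to see on a tiny example. The sketch below uses made-up documents (the `toy_` names and texts are assumptions, not part of the review data) and a default `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from reviews.csv
toy_train = ["great coffee", "bad coffee"]
toy_test = ["great tea"]  # "tea" never appears in the training documents

toy_cv = CountVectorizer()
toy_train_dtm = toy_cv.fit_transform(toy_train)  # learn the vocabulary, then count
toy_test_dtm = toy_cv.transform(toy_test)        # count using the learned vocabulary only

toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)                 # ['bad', 'coffee', 'great']
print(toy_train_dtm.toarray())   # [[0 1 1], [1 1 0]]
print(toy_test_dtm.toarray())    # [[0 0 1]]
```

Both matrices have the same three columns. The unseen word "tea" has no column, so it does not contribute to the test matrix.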

Turn `X_train_cv` into a pandas dataframe and save the output in a variable called `dtm_cv`.

In [7]:
### BEGIN SOLUTION
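# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
dtm_cv = pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names_out())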
### END SOLUTION
dtm_cv

Unnamed: 0,00,00 cups,00 thought,0g,0g protein,10,10 00,10 2012,10 47,10 bought,...,yummy gas,yummy great,yummy kuerig,yummy perfect,yummy price,yummy run,yummy strong,yummy suitable,yummy treat,yummy won
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
355,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
356,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
357,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert dtm_cv.shape == (359,7923), "The dimensions of the document-term matrix should be 359 documents x 7923 terms."
### END HIDDEN TESTS

Next, we're going to put this document-term matrix through a Naive Bayes model and see how well the model performs.

In [9]:
mnb = MultinomialNB()
mnb.fit(X_train_cv, y_train)
mnb.score(X_test_cv, y_test)

0.8774193548387097

Using `CountVectorizer`, we are able to predict the sentiment of a review in the test set with 87.7% accuracy. Next, repeat the whole process using `TfidfVectorizer` instead to see if you can get a better prediction score.

1. Create a `TfidfVectorizer` object with the same hyperparameters as the `CountVectorizer` object we created earlier and name it `tv`.

2. Take the `X_train` data and turn it into a TF-IDF matrix called `X_train_tv`.

3. Take the `X_test` data and turn it into a TF-IDF matrix with the same columns as `X_train_tv` and call it `X_test_tv`.

4. Turn `X_train_tv` into a pandas dataframe and call it `tfidf`.

(See final step below)

In [10]:
### BEGIN SOLUTION
tv = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
X_train_tv = tv.fit_transform(X_train)
X_test_tv  = tv.transform(X_test)
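# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
tfidf = pd.DataFrame(X_train_tv.toarray(), columns=tv.get_feature_names_out())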
### END SOLUTION
tfidf

Unnamed: 0,00,00 cups,00 thought,0g,0g protein,10,10 00,10 2012,10 47,10 bought,...,yummy gas,yummy great,yummy kuerig,yummy perfect,yummy price,yummy run,yummy strong,yummy suitable,yummy treat,yummy won
0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
355,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
356,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
357,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert tv.stop_words == 'english', "The stop_words should be the English stop words."
assert tv.ngram_range == (1,2), "The ngram_range should be (1, 2) to include both unigrams and bigrams."
assert tfidf.shape == (359,7923), "The dimensions of the TF-IDF matrix should be 359 documents x 7923 terms."
### END HIDDEN TESTS
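
The corner of the TF-IDF matrix shown above is all zeros, so it does not reveal how the weights differ from raw counts. The sketch below is a small illustration on made-up documents (the `toy_` names and texts are assumptions). It relies on the `TfidfVectorizer` defaults, `smooth_idf=True` and `norm='l2'`, under which the inverse document frequency is `ln((1 + n_docs) / (1 + doc_freq)) + 1` and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from reviews.csv
toy_docs = ["great coffee", "bad coffee", "coffee coffee cup"]

toy_tv = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
toy_weights = toy_tv.fit_transform(toy_docs)

# Pair each term with its learned IDF, in column order
toy_terms = sorted(toy_tv.vocabulary_, key=toy_tv.vocabulary_.get)
toy_idf = dict(zip(toy_terms, toy_tv.idf_))

# Recompute the IDF of "bad" by hand: it appears in 1 of the 3 documents
n_docs = len(toy_docs)
expected_idf_bad = np.log((1 + n_docs) / (1 + 1)) + 1

# Each row of the TF-IDF matrix is normalized to length 1
row_norms = np.linalg.norm(toy_weights.toarray(), axis=1)

print(toy_idf)
print(toy_weights.toarray().round(3))
print(row_norms)
```

"coffee" appears in every document, so it receives the lowest IDF (exactly 1 under these defaults), while rarer terms such as "bad" and "cup" are weighted more heavily.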

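5. Fit a `GaussianNB` model on `X_train_tv` and save its accuracy on the test set as `tfidf_score`. `GaussianNB` does not accept sparse matrices, so convert both TF-IDF matrices with `.toarray()` first.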

In [12]:
### BEGIN SOLUTION
gnb = GaussianNB()
gnb.fit(X_train_tv.toarray(), y_train)
tfidf_score = gnb.score(X_test_tv.toarray(), y_test)
### END SOLUTION
tfidf_score

0.8451612903225807

In [13]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert round(tfidf_score,3) == 0.845, "The final prediction score using TfidfVectorizer and GaussianNB is incorrect."
### END HIDDEN TESTS

The final prediction accuracy using `TfidfVectorizer` was 84.5%, versus 87.7% using `CountVectorizer`.

This tells us that while TF-IDF can be a better option than simple word counts, that is not always the case. Keep in mind that the two runs also used different classifiers (`MultinomialNB` for the counts and `GaussianNB` for TF-IDF), so the gap reflects the vectorizer and the model together. The best approach is to try both vectorizers and choose the one that works best for your dataset and analysis goal.
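
One way to try both vectorizers on equal footing is to hold the classifier fixed and change only the vectorizer. The helper below is a hypothetical sketch, not part of the notebook: `compare_vectorizers` and the toy data are assumptions, and it uses `MultinomialNB` for both runs, which differs from the `GaussianNB` model fitted on the TF-IDF matrix above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def compare_vectorizers(X_train, X_test, y_train, y_test):
    """Score one classifier type on count and TF-IDF features (illustrative helper)."""
    candidates = {
        "count": CountVectorizer(stop_words='english', ngram_range=(1, 2)),
        "tfidf": TfidfVectorizer(stop_words='english', ngram_range=(1, 2)),
    }
    scores = {}
    for name, vectorizer in candidates.items():
        train_matrix = vectorizer.fit_transform(X_train)  # fit on training text only
        test_matrix = vectorizer.transform(X_test)        # reuse the training columns
        model = MultinomialNB()
        model.fit(train_matrix, y_train)
        scores[name] = model.score(test_matrix, y_test)   # accuracy on the test set
    return scores

# Hypothetical toy data, only to show the call; not taken from reviews.csv
toy_train = ["great coffee love", "excellent flavor", "terrible taste", "awful powder waste"]
toy_train_labels = ["positive", "positive", "negative", "negative"]
toy_test = ["love flavor", "awful taste"]
toy_test_labels = ["positive", "negative"]

toy_scores = compare_vectorizers(toy_train, toy_test, toy_train_labels, toy_test_labels)
print(toy_scores)
```

In the notebook, the same call would be `compare_vectorizers(X_train, X_test, y_train, y_test)`. The resulting scores are not shown here because they have not been run against the review data.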