### Required Codio Assignment 18.3: Bag of Words and TF–IDF

**Expected Time = 60 minutes**

**Total Points = 35**

In this activity you will use the Scikit-Learn vectorization tools `CountVectorizer` and  `TfidfVectorizer`  to create a bag of words representation of text in a DataFrame.  You will explore how different parameter settings affect the performance of a `LogisticRegression` estimator on a binary classification problem.

This activity focuses on using term frequency-inverse document frequency (TF-IDF) to vectorize text.  First, you will compute TF-IDF by hand on a small example.  Then, you will use Scikit-Learn to combine the `TfidfVectorizer` with a `LogisticRegression` estimator to see whether the performance on predicting the sentiment of a WhatsApp status improves with a different representation.
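As a rough illustration of what computing TF-IDF by hand can look like, the sketch below uses a tiny made-up corpus (not the assignment data) and plain Python. It assumes the smoothed variant that Scikit-Learn documents as its default: raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The names `docs`, `idf`, and `tfidf_row` are hypothetical helpers introduced only for this example.

```python
import math

# Hypothetical toy corpus, not taken from the assignment data
docs = [
    "love is love",
    "sad and alone",
    "love hurts",
]

# Whitespace tokenization is enough for this toy corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw term count times IDF for each vocabulary term, then L2-normalized."""
    raw = [doc.count(term) * idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf_matrix = [tfidf_row(doc) for doc in tokenized]

for doc, row in zip(docs, tfidf_matrix):
    print(doc, [round(v, 3) for v in row])
```

Terms that appear in many documents (here `love`) get a smaller IDF than terms that appear in only one, so rare terms carry more weight per occurrence.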

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)
- [Problem 7](#-Problem-7)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

### The Data

The cell below uses the "sad" and "happy" sentiments datasets from Kaggle to form the target of our classification models.  The data is also split and named appropriately below. 

In [3]:
happy_df = pd.read_csv('data/Emotion(happy).csv')
sad_df = pd.read_csv('data/Emotion(sad).csv.zip', compression = 'zip')

In [4]:
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)

In [5]:
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [7]:
X_train.head()

1287    ['You Hurt Me But I Still Love You.', 'True Lo...
1112    Sorry isn’t always enough. Sometimes you actua...
823     Sometimes two people have to fall apart to rea...
651     True love isn’t love at first sight but love a...
1101    i am scared of getting too close to anyone bec...
Name: content, dtype: object

[Back to top](#-Index)

### Problem 1

#### Using the `CountVectorizer`

**5 Points**

To create a bag of words representation of your text data, create an instance of `CountVectorizer` with default settings and assign it to `cvect`.

Next, use the `fit_transform` method on `cvect` to transform the training data `X_train` and assign the transformed version of the text to `dtm`.


In [8]:
### GRADED
cvect = ''
dtm = ''

    
# YOUR CODE HERE
cvect = CountVectorizer()
dtm = cvect.fit_transform(X_train)

### ANSWER CHECK
pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names_out()).head()

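|   | 0_0 | 100 | 123whatsappstatus | 204 | 30 | 404 | 44 | 45 | 55 | 805 | ... | yes | yesterday | yet | you | young | your | yours | yourself | yous | yuh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 1 | 112 | 0 | 13 | 0 | 2 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |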


### Problem 2

#### Limiting words with the `CountVectorizer`

**5 Points**

Now, to remove stop words from the text before vectorizing, create a new instance of `CountVectorizer` with the argument `stop_words = 'english'`. Assign this to the variable `cvect2`.

Next, use the `fit_transform` function on `cvect2` to transform the training data `X_train` and assign the transformed version of the text to `X_train_vect_2`.  

Finally, transform the test data `X_test` as `X_test_vect_2` below.

In [9]:
### GRADED
cvect2 = ''
X_train_vect_2 = ''
X_test_vect_2 = ''
    
# YOUR CODE HERE
cvect2 = CountVectorizer(stop_words = 'english')
X_train_vect_2 = cvect2.fit_transform(X_train)
X_test_vect_2 = cvect2.transform(X_test)

### ANSWER CHECK
X_train_vect_2

<1007x1622 sparse matrix of type '<class 'numpy.int64'>'
	with 41589 stored elements in Compressed Sparse Row format>
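The sketch below illustrates, on made-up sentences, two points from this problem: removing stop words shrinks the vocabulary, and the test data is only transformed with the vocabulary learned from the training data. The variables `train_docs`, `test_docs`, `plain`, and `no_stop` are hypothetical names used just for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data for illustration only
train_docs = ["you are the love of my life", "this is a sad and lonely day"]
test_docs = ["a happy day with you"]

# Fit one vectorizer with stop words kept and one with them removed
plain = CountVectorizer().fit(train_docs)
no_stop = CountVectorizer(stop_words="english").fit(train_docs)

print(len(plain.vocabulary_), len(no_stop.vocabulary_))

# transform reuses the training vocabulary: the column count matches the
# fitted vocabulary, and words unseen during fit (such as "happy") are ignored
test_vect = no_stop.transform(test_docs)
print(test_vect.shape)
```

Calling `fit_transform` on the test data instead would learn a different vocabulary, so the columns would no longer line up with what a downstream model was trained on.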

[Back to top](#-Index)

### Problem 3

#### Using the text with `LogisticRegression`

**5 Points**

Create a `Pipeline` object named `vect_pipe_1` below that has steps named `cvect` and `lgr`, using both a default `CountVectorizer` transformer and `LogisticRegression` estimator. 

Fit this pipeline on the training data `X_train` and `y_train`.

Finally, use the `score` method to evaluate it on the test set `X_test` and `y_test`. Assign the result to `test_acc`.

In [10]:
### GRADED
vect_pipe_1 = ''

test_acc = ''

    
# YOUR CODE HERE
vect_pipe_1 = Pipeline([('cvect', CountVectorizer()),
                       ('lgr', LogisticRegression())])
vect_pipe_1.fit(X_train, y_train)
test_acc = vect_pipe_1.score(X_test, y_test)


### ANSWER CHECK
vect_pipe_1.named_steps

{'cvect': CountVectorizer(), 'lgr': LogisticRegression()}
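One practical benefit of wrapping the vectorizer and the classifier in a `Pipeline` is that the fitted object accepts raw strings directly. The following is a minimal sketch on a hypothetical four-sentence dataset (`toy_X`, `toy_y`, and `toy_pipe` are illustrative names, not part of the assignment). For a classifier such as `LogisticRegression`, `score` returns the mean accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data for illustration only
toy_X = ["so happy today", "happy and joyful", "feeling sad", "sad and lonely"]
toy_y = ["happy", "happy", "sad", "sad"]

toy_pipe = Pipeline([("cvect", CountVectorizer()),
                     ("lgr", LogisticRegression())])
toy_pipe.fit(toy_X, toy_y)

# The pipeline vectorizes new text with the fitted vocabulary, then predicts
toy_preds = toy_pipe.predict(["a happy day"])

# score runs the same steps and reports mean accuracy (here on the training
# data, only because the toy set is too small to split)
toy_acc = toy_pipe.score(toy_X, toy_y)
print(toy_preds, toy_acc)
```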

[Back to top](#-Index)

### Problem 4

#### Pipeline and Grid Search

**5 Points**

Initialize a `GridSearchCV` object with the pipeline `vect_pipe_1` and parameter grid `params` given below. Assign this result to the variable `grid`.

Fit the `grid` object on training data `X_train` and `y_train`.

Finally, use the `score` method to evaluate it on the test set `X_test` and `y_test`. Assign the result to `test_acc`.

In [11]:
params = {'cvect__max_features': [100, 500, 1000, 2000],
         'cvect__stop_words': ['english', None]}

In [14]:
### GRADED
grid = ''
test_acc = ''

    
# YOUR CODE HERE
grid = GridSearchCV(vect_pipe_1, param_grid=params)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

### ANSWER CHECK
grid.best_params_

{'cvect__max_features': 2000, 'cvect__stop_words': None}
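Beyond `best_params_`, it can help to look at how every combination in the grid scored. The sketch below is one way to do that, using a hypothetical eight-sentence dataset and a smaller grid than the assignment's. The names `toy_X`, `toy_y`, `toy_pipe`, `toy_params`, and `toy_grid` are illustrative, and `cv=2` is chosen only because the toy data is tiny. Note the `step__parameter` naming, which tells the grid search which pipeline step each setting belongs to.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data for illustration only
toy_X = ["so happy today", "happy and joyful", "what a happy smile",
         "joyful happy heart", "feeling sad", "sad and lonely",
         "such a sad night", "lonely sad heart"]
toy_y = ["happy"] * 4 + ["sad"] * 4

toy_pipe = Pipeline([("cvect", CountVectorizer()),
                     ("lgr", LogisticRegression())])

# Keys follow the pattern <step name>__<parameter name>
toy_params = {"cvect__max_features": [2, 5],
              "cvect__stop_words": ["english", None]}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_grid.fit(toy_X, toy_y)

# One row per parameter combination, with its mean cross-validated accuracy
results = pd.DataFrame(toy_grid.cv_results_)[
    ["param_cvect__max_features", "param_cvect__stop_words",
     "mean_test_score", "rank_test_score"]
]
print(results)
print(toy_grid.best_params_, toy_grid.best_score_)
```

The value of `best_score_` is a cross-validated score on the training folds, so it will generally differ from the accuracy returned by `score` on a held-out test set.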

[Back to top](#-Index)


### Problem 5

#### Using `TfidfVectorizer` in a `Pipeline`

**5 Points** 

Now, you will use the Scikit-Learn transformer `TfidfVectorizer` to transform the WhatsApp data from Kaggle.  The data was loaded and split above.

Initialize a `TfidfVectorizer` object with default parameters and assign it to the variable `tfidf`.

Next, call the `fit_transform` method on `tfidf` with `X_train` as its argument. Assign this result to the variable `dtm`.


In [15]:
### GRADED
tfidf = ''
dtm = ''

    
# YOUR CODE HERE
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(X_train)

### ANSWER CHECK
pd.DataFrame(dtm.toarray(), columns = tfidf.get_feature_names_out()).head()

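|   | 0_0 | 100 | 123whatsappstatus | 204 | 30 | 404 | 44 | 45 | 55 | 805 | ... | yes | yesterday | yet | you | young | your | yours | yourself | yous | yuh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.01397 | 0.0 | 0.006924 | 0.398355 | 0.0 | 0.070646 | 0.0 | 0.01305 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.184358 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |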


[Back to top](#-Index)


### Problem 6

#### Pipeline with `TfidfVectorizer`

**5 Points**

Below, create a pipeline named `tfidf_pipe` with steps `tfidf` and `lgr`, given by a `TfidfVectorizer` transformer and a `LogisticRegression` estimator, respectively.

Next, use the function `fit` on `tfidf_pipe` to fit the training data `X_train` and `y_train`.

Finally, use the function `score` on `tfidf_pipe` to compute the score on the test data `X_test` and `y_test`. Assign the result to `test_acc`.

In [17]:
### GRADED
tfidf_pipe = ''

test_acc = ''

    
# YOUR CODE HERE
tfidf_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                       ('lgr', LogisticRegression())])
tfidf_pipe.fit(X_train, y_train)
test_acc = tfidf_pipe.score(X_test, y_test)


### ANSWER CHECK
tfidf_pipe.named_steps

{'tfidf': TfidfVectorizer(), 'lgr': LogisticRegression()}

[Back to top](#-Index)


### Problem 7

#### Grid Searching the Pipeline

**5 Points** 

Initialize a `GridSearchCV` object with the pipeline `tfidf_pipe` and parameter grid `params` given below. Assign this result to the variable `grid`.

Fit the `grid` object on training data `X_train` and `y_train`.

Finally, use the `score` method to evaluate it on the test set `X_test` and `y_test`. Assign the result to `test_acc`.

In [18]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],
         'tfidf__stop_words': ['english', None]}

In [19]:
### GRADED
grid = ''
test_acc = ''

    
# YOUR CODE HERE
grid = GridSearchCV(tfidf_pipe, param_grid=params)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

### ANSWER CHECK
grid.best_params_

{'tfidf__max_features': 500, 'tfidf__stop_words': 'english'}