### Codio Activity 18.4: Bag of Words: Count Vectorization

**Expected Time = 60 minutes**

**Total Points = 25**

In this activity you will use the scikit-learn vectorization tool `CountVectorizer` to create a bag of words representation of text in a DataFrame.  You will explore how different parameter settings affect the performance of a `LogisticRegression` estimator on a binary classification problem.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [1]:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

### The Data

The Kaggle data is loaded again below.  This time, the "happy" and "sad" sentiment files are concatenated into a single DataFrame, and the `sentiment` column forms the target of our classification models.  The text in the `content` column is then split into training and test sets.
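As a rough illustration of the concatenation step, here is a minimal sketch on two tiny made-up DataFrames. The frame names `happy_toy` and `sad_toy` and their rows are hypothetical; only the `content` and `sentiment` column names come from the activity.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two Kaggle files
happy_toy = pd.DataFrame({"content": ["what a great day", "feeling joyful"],
                          "sentiment": ["happy", "happy"]})
sad_toy = pd.DataFrame({"content": ["missing you"],
                        "sentiment": ["sad"]})

# Stack the rows; without reset_index the original row labels repeat
stacked = pd.concat([happy_toy, sad_toy])
toy_full = stacked.reset_index(drop=True)

print(list(stacked.index))   # row labels before resetting
print(list(toy_full.index))  # row labels after resetting
print(toy_full["sentiment"].value_counts())
```

Resetting the index gives every row a unique label, which keeps later row lookups unambiguous.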

In [2]:
happy_df = pd.read_csv('data/Emotion(happy).csv')
sad_df = pd.read_csv('data/Emotion(sad).csv.zip', compression = 'zip')

In [3]:
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)

In [4]:
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [6]:
X_train.head()

1287    ['You Hurt Me But I Still Love You.', 'True Lo...
1112    Sorry isn’t always enough. Sometimes you actua...
823     Sometimes two people have to fall apart to rea...
651     True love isn’t love at first sight but love a...
1101    i am scared of getting too close to anyone bec...
Name: content, dtype: object

[Back to top](#-Index)

### Problem 1

#### Using the `CountVectorizer`

**5 Points**

To create a bag-of-words representation of your text data, create an instance of `CountVectorizer` as `cvect` below.  Leave all the default settings, and assign the transformed version of the training text to `dtm`.  Because the vectorizer returns a `scipy.sparse` matrix, the answer check calls its `.toarray()` method to view the resulting document-term matrix and uses the vectorizer's `.get_feature_names_out()` method to retrieve the fitted vocabulary as column names.

Hint: Make sure to transform X_train

In [7]:
### GRADED
cvect = ''
dtm = ''

    
# YOUR CODE HERE
cvect = CountVectorizer()
dtm = cvect.fit_transform(X_train)

### ANSWER CHECK
pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names_out()).head()

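|   | 0_0 | 100 | 123whatsappstatus | 204 | 30 | 404 | 44 | 45 | 55 | 805 | ... | yes | yesterday | yet | you | young | your | yours | yourself | yous | yuh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 1 | 112 | 0 | 13 | 0 | 2 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |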


### Problem 2

#### Limiting words with the `CountVectorizer`

**5 Points**

Now, to remove stop words from the text before vectorizing, create a new instance of `CountVectorizer` and set `stop_words = 'english'` to remove English stop words, using the same list as in our earlier assignment.  Fit and transform the training data as `X_train_vect_2`, and transform the test data as `X_test_vect_2` below.

Hint: Use `fit_transform` for the training data, and `transform` for the test data.

In [8]:
### GRADED
cvect2 = ''
X_train_vect_2 = ''
X_test_vect_2 = ''
    
# YOUR CODE HERE
cvect2 = CountVectorizer(stop_words = 'english')
X_train_vect_2 = cvect2.fit_transform(X_train)
X_test_vect_2 = cvect2.transform(X_test)

### ANSWER CHECK
X_train_vect_2

<1007x1622 sparse matrix of type '<class 'numpy.int64'>'
	with 41589 stored elements in Compressed Sparse Row format>

### Problem 3

#### Limiting words with stopwords and higher counts

**5 Points**

Now, remove stop words using `stop_words = 'english'` and limit the features to the 300 most frequent words across the training text using the `max_features` argument.  Fit and transform the training data as `X_train_vect_3`, and transform the test data as `X_test_vect_3` below.

In [9]:
### GRADED
cvect3 = ''
X_train_vect_3 = ''
X_test_vect_3 = ''
    
# YOUR CODE HERE
cvect3 = CountVectorizer(stop_words = 'english', max_features = 300)
X_train_vect_3 = cvect3.fit_transform(X_train)
X_test_vect_3 = cvect3.transform(X_test)

### ANSWER CHECK
X_train_vect_3

<1007x300 sparse matrix of type '<class 'numpy.int64'>'
	with 33225 stored elements in Compressed Sparse Row format>

[Back to top](#-Index)

### Problem 4

#### Using the text with `LogisticRegression`

**5 Points**

Create a `Pipeline` object named `vect_pipe_1` below that has steps named `cvect` and `lgr`, using a default `CountVectorizer` transformer and a default `LogisticRegression` estimator. Fit this on the training data, and assign its accuracy on the test set to `test_acc`.
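The following sketch shows the same pipeline pattern on a tiny made-up dataset (the sentences, labels, and `toy_` names are hypothetical). It also checks that, for a classifier pipeline, `.score` returns the same value as computing accuracy from `.predict` by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the activity data
toy_text = ["so happy and glad", "happy joyful day",
            "sad and lonely", "lonely sad night"]
toy_labels = ["happy", "happy", "sad", "sad"]

toy_pipe = Pipeline([("cvect", CountVectorizer()),
                     ("lgr", LogisticRegression())])
toy_pipe.fit(toy_text, toy_labels)  # raw strings go in; vectorizing happens inside

toy_preds = toy_pipe.predict(["a happy day", "a sad night"])
score_acc = toy_pipe.score(toy_text, toy_labels)
manual_acc = accuracy_score(toy_labels, toy_pipe.predict(toy_text))
print(toy_preds, score_acc, manual_acc)
```

Because the vectorizer lives inside the pipeline, it is fit only on whatever data is passed to `.fit`, which prevents test text from leaking into the vocabulary.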

In [10]:
### GRADED
vect_pipe_1 = ''

test_acc = ''

    
# YOUR CODE HERE
vect_pipe_1 = Pipeline([('cvect', CountVectorizer()), ('lgr', LogisticRegression())])
vect_pipe_1.fit(X_train, y_train)
test_acc = vect_pipe_1.score(X_test, y_test)


### ANSWER CHECK
vect_pipe_1.named_steps

{'cvect': CountVectorizer(), 'lgr': LogisticRegression()}

[Back to top](#-Index)

### Problem 5

#### Pipeline and Grid Search

**5 Points**

Finally, to abstract this work into a single step, you can use a `Pipeline` with named steps `cvect` and `lgr` that vectorizes and models the data.  Use the parameter grid below to perform a grid search for the parameters that best represent the text for the classification model.  Assign the fitted `GridSearchCV` object to `grid` and its accuracy on the test set to `test_acc`.

Hint: Use `vect_pipe_1` from Problem 4.
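Here is a minimal sketch of a grid search over pipeline parameters, assuming a made-up eight-sentence dataset and a reduced grid. The `step__parameter` naming routes each setting to the matching pipeline step. The use of `cv=2` is only there because the toy data is so small; the activity's grid, with the default five folds, fits 8 combinations × 5 folds plus one final refit, for 41 model fits in total.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the activity data
toy_text = ["so happy and glad today", "happy joyful sunny day",
            "glad and joyful morning", "sunny happy smile",
            "sad and lonely night", "lonely sad rainy day",
            "rainy and gloomy night", "gloomy sad tears"]
toy_labels = ["happy"] * 4 + ["sad"] * 4

toy_pipe = Pipeline([("cvect", CountVectorizer()),
                     ("lgr", LogisticRegression())])

# Keys follow the <step name>__<parameter name> convention
toy_params = {"cvect__max_features": [3, 10],
              "cvect__stop_words": ["english", None]}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_grid.fit(toy_text, toy_labels)

# One row per parameter combination, with its cross-validated score
results = pd.DataFrame(toy_grid.cv_results_)[
    ["param_cvect__max_features", "param_cvect__stop_words",
     "mean_test_score", "rank_test_score"]]
print(results)
print(toy_grid.best_params_)
```

Inspecting `cv_results_` rather than only `best_params_` shows how close the competing settings are, which matters when several combinations score almost the same.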

In [11]:
params = {'cvect__max_features': [100, 500, 1000, 2000],
         'cvect__stop_words': ['english', None]}

In [26]:
### GRADED
grid = ''
test_acc = ''

    
# YOUR CODE HERE
grid = GridSearchCV(vect_pipe_1, param_grid=params)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

### ANSWER CHECK
grid.best_params_

KeyboardInterrupt: 

In [27]:
X_train.shape

(1007,)

In [29]:
X.shape

(1343, 1)