### Bag of words: Exercises


- In this Exercise, you are going to classify whether a given movie review is **positive or negative**.
- you are going to use Bag of words for pre-processing the text and apply different classification algorithms.
- Sklearn CountVectorizer has the inbuilt implementations for Bag of Words.

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from  sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- This data consists of two columns.
        - review
        - sentiment
- Reviews are the statements given by users after watching the movie.
- sentiment feature tells whether the given review is positive or negative.

In [2]:
#1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable

df=pd.read_csv("movies_sentiment_data.csv")

#2. print the shape of the data
print(df.shape)

#3. print top 5 datapoints
df.head()

(19000, 2)


Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [3]:
#creating a new column "Category" which represent 1 if the sentiment is positive or 0 if it is negative
df["category"]=df["sentiment"].apply(lambda x:1 if(x =="positive") else 0)
df = df.sample(frac=0.2)  # Adjust the fraction as needed
df.shape

(3800, 3)

#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df["category"].value_counts()

In [4]:
#Do the 'train-test' splitting with test size of 20%

from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y=train_test_split(df.review,df.category,test_size=0.2,stratify=df.category)

In [5]:
train_x.shape

(3040,)

In [6]:
train_y.shape

(3040,)

In [7]:
train_y.value_counts()

category
1    1521
0    1519
Name: count, dtype: int64

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
train_x_cv=v.fit_transform(train_x.values)
test_x_cv=v.transform(test_x.values)

In [9]:
v.get_feature_names_out()[10252]

'fascination'

In [23]:
train_x_cv.toarray()[0][496]

0

In [20]:
train_nu=train_x_cv.toarray()

In [10]:
v.vocabulary_

{'an': 1299,
 'astronaut': 1991,
 'gets': 11707,
 'lost': 16626,
 'in': 14041,
 'deep': 7269,
 'space': 25980,
 'and': 1342,
 'finds': 10589,
 'himself': 13205,
 'traveling': 28563,
 'through': 28027,
 'unknown': 29304,
 'territory': 27796,
 'on': 19576,
 'board': 3327,
 'of': 19469,
 'living': 16459,
 'spaceship': 25985,
 'accompanied': 557,
 'by': 4146,
 'group': 12323,
 'alien': 1061,
 'outlaws': 19848,
 'this': 27962,
 'incredible': 14135,
 'plotted': 21061,
 'enjoyable': 9393,
 'tv': 28846,
 'installment': 14457,
 'comes': 5634,
 'along': 1134,
 'as': 1858,
 'positive': 21294,
 'birth': 3093,
 'fantasy': 10206,
 'the': 27863,
 'individual': 14194,
 'characters': 4818,
 'conflict': 5935,
 'at': 1998,
 'beginning': 2754,
 'series': 24716,
 'have': 12822,
 'to': 28197,
 'learn': 16055,
 'get': 11705,
 'with': 30738,
 'each': 8845,
 'other': 19793,
 'evolve': 9753,
 'into': 14631,
 'powerful': 21367,
 'last': 15935,
 'most': 18389,
 'action': 640,
 'takes': 27469,
 'place': 20947,
 'i

In [21]:
np.where(train_nu[0]!=0)

(array([  486,   557,   640,   657,  1061,  1134,  1151,  1299,  1342,
         1466,  1713,  1858,  1991,  1998,  2067,  2625,  2754,  2780,
         3093,  3327,  3357,  4146,  4818,  5634,  5935,  6603,  7269,
         8845,  9393,  9615,  9705,  9753,  9778,  9917,  9970, 10202,
        10206, 10396, 10485, 10589, 10932, 11063, 11705, 11707, 11817,
        11911, 12323, 12770, 12822, 13147, 13205, 13608, 13994, 14041,
        14135, 14194, 14416, 14457, 14555, 14631, 14790, 15935, 16055,
        16459, 16626, 17888, 18180, 18389, 18476, 18892, 19469, 19576,
        19584, 19793, 19848, 20239, 20761, 20947, 20981, 21061, 21294,
        21367, 21469, 23249, 24333, 24716, 24782, 24997, 25980, 25985,
        26389, 26648, 27077, 27180, 27469, 27796, 27858, 27863, 27962,
        28027, 28197, 28221, 28400, 28563, 28600, 28846, 29304, 29571,
        30278, 30508, 30554, 30738, 30790, 30888], dtype=int64),)

In [31]:
rf_model = KNeighborsClassifier(n_neighbors=5)
rf_model.fit(train_x_cv,train_y)


In [32]:
from sklearn.metrics import classification_report

y_pred = rf_model.predict(test_x_cv)

print(classification_report(test_y, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.54      0.58       380
           1       0.59      0.66      0.62       380

    accuracy                           0.60       760
   macro avg       0.60      0.60      0.60       760
weighted avg       0.60      0.60      0.60       760



**Exercise-1**

1. using sklearn pipeline module create a classification pipeline to classify the movie review's positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.

- use **Random Forest** as the classifier with estimators as 50 and criterion as entropy.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [33]:
#1. create a pipeline object
from sklearn.pipeline import Pipeline

clf=Pipeline([
    ("vc",CountVectorizer()),
    ("rf",RandomForestClassifier())
])



#2. fit with X_train and y_train

clf.fit(train_x,train_y)

#3. get the predictions for X_test and store it in y_pred

y_pred=clf.predict(test_x)

#4. print the classfication report
print(classification_report(test_y,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.79      0.80       380
           1       0.79      0.82      0.81       380

    accuracy                           0.81       760
   macro avg       0.81      0.81      0.81       760
weighted avg       0.81      0.81      0.81       760



**Exercise-2**

1. using sklearn pipeline module create a classification pipeline to classify the movie review's positive or negative..

**Note:**
- use CountVectorizer for pre-processing the text.
- use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [34]:

#1. create a pipeline object
from sklearn.pipeline import Pipeline

clf=Pipeline([
    ("cv",CountVectorizer()),
    ("kn",KNeighborsClassifier())
])

#2. fit with X_train and y_train

clf.fit(train_x,train_y)

#3. get the predictions for X_test and store it in y_pred

y_pred=clf.predict(test_x)

#4. print the classfication report
print(classification_report(test_y,y_pred))

              precision    recall  f1-score   support

           0       0.62      0.54      0.58       380
           1       0.59      0.66      0.62       380

    accuracy                           0.60       760
   macro avg       0.60      0.60      0.60       760
weighted avg       0.60      0.60      0.60       760



**Exercise-3**

1. using sklearn pipeline module create a classification pipeline to classify the movie review's positive or negative..

**Note:**
- use CountVectorizer for pre-processing the text.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [35]:

#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classfication report


### Can you write some observations of why model like KNN fails to produce good results unlike RandomForest and MultinomialNB?



## [**Solution**](./bag_of_words_exercise_solutions.ipynb)