### Bag of words: Exercises



*   In this exercise, I will classify whether a given movie review is positive or negative.
*   I will use the Bag of Words approach to pre-process the text and then apply different classification algorithms.
*   Scikit-learn's `CountVectorizer` provides a built-in implementation of Bag of Words.


In [None]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- The data consists of two columns:
    - `review`
    - `sentiment`
- Each review is the text a user wrote after watching a movie.
- The `sentiment` column indicates whether the review is positive or negative.

In [None]:
#1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable

df = pd.read_csv('movies_sentiment_data.csv')

#2. print the shape of the data
df.shape


#3. print top 5 datapoints
df.head()


|   | review | sentiment |
|---|--------|-----------|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive |
| 1 | I enjoyed the movie and the story immensely! I... | positive |
| 2 | I had a hard time sitting through this. Every ... | negative |
| 3 | It's hard to imagine that anyone could find th... | negative |
| 4 | This is one military drama I like a lot! Tom B... | positive |


In [None]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category'] = np.where(df['sentiment'] == 'positive', 1, 0)


In [None]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.

df['Category'].value_counts()

Category
1    9500
0    9500
Name: count, dtype: int64

In [None]:
#Do the 'train-test' splitting with test size of 20%

X_train, X_test, y_train, y_test = train_test_split(df['review'], df['Category'], test_size=0.2, random_state=42)
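
The split above is purely random, so the class ratio in the test set can drift slightly from the overall ratio. If an exact ratio matters, one option is the `stratify` argument of `train_test_split`. The sketch below uses a hypothetical 20-row frame (`toy_df`) rather than the real data, and the `toy_*` names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 rows per class
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "Category": [0, 1] * 10,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_df["review"],
    toy_df["Category"],
    test_size=0.2,
    random_state=42,
    stratify=toy_df["Category"],  # keep the 50/50 class ratio in both splits
)

# 4 test rows, 2 of which belong to class 1
print(len(toy_y_test), int(toy_y_test.sum()))  # 4 2
```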

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Random Forest** as the classifier, with 50 estimators and entropy as the criterion.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
#1. create a pipeline object
rf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])




#2. fit with X_train and y_train
rf.fit(X_train, y_train)



#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)



#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.85      0.84      1864
           1       0.85      0.83      0.84      1936

    accuracy                           0.84      3800
   macro avg       0.84      0.84      0.84      3800
weighted avg       0.84      0.84      0.84      3800
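
The precision and recall columns in the report come from counts of true and false positives and negatives. As a small illustration on made-up labels (`toy_true` and `toy_pred` are hypothetical, not output of the model above), the same numbers can be derived from a confusion matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, used only to show the arithmetic
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision_pos = float(tp / (tp + fp))  # of the predicted positives, how many are right
recall_pos = float(tp / (tp + fn))     # of the actual positives, how many were found

print(precision_pos, recall_pos)  # 0.75 0.75
```

These hand-computed values match what `precision_score` and `recall_score` return for class 1.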



**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **KNN** as the classifier, with n_neighbors set to 10 and metric set to 'euclidean'.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [None]:

#1. create a pipeline object
knn = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])


#2. fit with X_train and y_train
knn.fit(X_train, y_train)



#3. get the predictions for X_test and store it in y_pred
y_pred = knn.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.64      0.64      0.64      1864
           1       0.65      0.65      0.65      1936

    accuracy                           0.65      3800
   macro avg       0.65      0.65      0.65      3800
weighted avg       0.65      0.65      0.65      3800
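
One possible factor behind the weaker KNN scores, offered as a hypothesis rather than something this notebook verifies, is that Euclidean distance on raw counts is sensitive to document length. The sketch below uses three made-up sentences (`toy_docs` is hypothetical) to show the effect and contrasts it with cosine distance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical toy documents
toy_docs = [
    "good movie",                        # short positive
    "good movie good movie good movie",  # same words, repeated three times
    "bad movie",                         # short negative
]
toy_vectors = CountVectorizer().fit_transform(toy_docs)

eu = euclidean_distances(toy_vectors)
cos = cosine_distances(toy_vectors)

# Euclidean: the short positive text is farther from its longer copy
# than from the short negative text
print(eu[0, 1] > eu[0, 2])    # True
# Cosine: the longer copy is the closer one, because only direction matters
print(cos[0, 1] < cos[0, 2])  # True
```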



**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [None]:

#1. create a pipeline object
nb = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])



#2. fit with X_train and y_train
nb.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = nb.predict(X_test)



#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.82      0.88      0.85      1864
           1       0.87      0.81      0.84      1936

    accuracy                           0.84      3800
   macro avg       0.85      0.84      0.84      3800
weighted avg       0.85      0.84      0.84      3800
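
Since the three exercises differ only in the classifier, the pipelines could also be built and scored in one loop. The following is a sketch under assumptions: `build_pipeline`, `compare_models`, and the `toy_*` data are hypothetical names introduced here, and `random_state=42` on the random forest is my addition for repeatability. The toy data exists only so the block runs on its own; with the real data you would pass `X_train`, `y_train`, `X_test`, and `y_test` instead.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def build_pipeline(classifier):
    """Wrap a classifier in the same CountVectorizer pipeline used above."""
    return Pipeline([
        ("vectorizer", CountVectorizer()),
        ("classifier", classifier),
    ])


def compare_models(X_tr, y_tr, X_te, y_te):
    """Fit each candidate pipeline and return its test accuracy by name."""
    candidates = {
        "random_forest": RandomForestClassifier(
            n_estimators=50, criterion="entropy", random_state=42
        ),
        "knn": KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, classifier in candidates.items():
        model = build_pipeline(classifier).fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, model.predict(X_te))
    return scores


# Hypothetical toy data (12 training texts, so KNN has at least 10 neighbours)
toy_train = [
    "great film", "loved it", "wonderful acting",
    "brilliant plot", "really enjoyable", "great fun",
    "terrible film", "hated it", "awful acting",
    "boring plot", "really dull", "bad movie",
]
toy_labels = [1] * 6 + [0] * 6
toy_test = ["great acting", "boring film"]
toy_test_labels = [1, 0]

toy_scores = compare_models(toy_train, toy_labels, toy_test, toy_test_labels)
print(toy_scores)
```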



In [None]:
# predict returns an array, so take its single element before comparing
p = rf.predict(['tasnu is a panda'])[0]
if p == 1:
    print('positive')
else:
    print('negative')

positive
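
For reuse, the label mapping in the cell above could be wrapped in a small helper that handles several texts at once. This is a sketch: `LABEL_NAMES`, `predict_sentiment`, and `toy_model` are hypothetical names, and the tiny training set exists only so the block runs on its own. With the notebook's objects you would call `predict_sentiment(rf, [...])` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same encoding as the "Category" column: 1 = positive, 0 = negative
LABEL_NAMES = {0: "negative", 1: "positive"}


def predict_sentiment(model, texts):
    """Return a readable label for each text using a fitted pipeline."""
    return [LABEL_NAMES[int(label)] for label in model.predict(texts)]


# Hypothetical stand-in for a fitted pipeline such as rf, knn, or nb
toy_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
]).fit(
    ["great wonderful film", "loved it great acting",
     "terrible boring film", "awful plot boring acting"],
    [1, 1, 0, 0],
)

labels = predict_sentiment(toy_model, ["great acting", "boring plot"])
print(labels)  # ['positive', 'negative']
```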
