### Bag of words: Exercises

-   In this exercise, you will classify a given movie review as **positive or negative**.
-   You will use Bag of Words to pre-process the text and then apply different classification algorithms.
-   scikit-learn's `CountVectorizer` provides a built-in implementation of Bag of Words.


In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

-   This data consists of two columns:
    -   review
    -   sentiment
-   Reviews are the statements given by users after watching the movie.
-   The sentiment column tells whether the given review is positive or negative.


In [2]:
# 1. read the data from 'IMDB Dataset.csv' in the same directory and store it in the df variable
df = pd.read_csv("IMDB Dataset.csv")


# 2. print the shape of the data
print(df.shape)

# 3. print top 5 datapoints
df.head()

(50000, 2)


|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
# create a new column "category" that is 1 if the sentiment is positive and 0 if it is negative
df["category"] = df["sentiment"].map({"positive": 1, "negative": 0})

In [4]:
# check the distribution of 'category' to see whether the target labels are balanced
df["category"].value_counts()

category
1    25000
0    25000
Name: count, dtype: int64
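
The mapping step silently turns any label that is not a key of the dictionary into `NaN`, so a quick check for unmapped values can be worthwhile before training. The sketch below uses a made-up three-row frame (the capitalised `"Positive"` is a deliberate, hypothetical typo) rather than the IMDB data.

```python
import pandas as pd

# Hypothetical labels; "Positive" (capital P) is not a key in the mapping
toy_df = pd.DataFrame({"sentiment": ["positive", "negative", "Positive"]})
toy_df["category"] = toy_df["sentiment"].map({"positive": 1, "negative": 0})

# Labels missing from the mapping become NaN, so count them explicitly
n_unmapped = toy_df["category"].isna().sum()
print(n_unmapped)  # 1
```

In this notebook the counts above add up to 50000, which matches the shape of the data, so no review was left unmapped.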

In [5]:
# split the data into train and test sets with a test size of 20%
X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["category"], test_size=0.2
)

In [6]:
X_train.shape, X_test.shape

((40000,), (10000,))
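
The split above is not seeded, so the classification reports in the exercises will differ slightly from run to run. As a sketch of one way to make the split repeatable and keep the class balance identical in both parts, `train_test_split` accepts `random_state` and `stratify`. The toy frame and the `toy_` variable names below are illustrative assumptions, and the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 5 positive and 5 negative rows
toy_df = pd.DataFrame(
    {
        "review": [f"review {i}" for i in range(10)],
        "category": [1, 0] * 5,
    }
)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_df["review"],
    toy_df["category"],
    test_size=0.2,
    random_state=42,  # fixed seed: the same rows land in the test set on every run
    stratify=toy_df["category"],  # keep the 50/50 class ratio in both parts
)

print(toy_X_train.shape, toy_X_test.shape)  # (8,) (2,)
print(sorted(toy_y_test.tolist()))  # [0, 1]
```

The reports shown in this notebook came from the unseeded split, so adding these arguments to the real split would produce somewhat different numbers.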

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**

-   use CountVectorizer for pre-processing the text.

-   use **Random Forest** as the classifier with `n_estimators` of 50 and `criterion` of 'entropy'.
-   print the classification report.

**References**:

-   https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

-   https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [7]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("count_vectorizer", CountVectorizer()),
        ("classifier", RandomForestClassifier(n_estimators=50, criterion="entropy")),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.83      0.84      5081
           1       0.83      0.84      0.84      4919

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000
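
Once a pipeline is fitted, it accepts raw text directly and its individual steps remain accessible by name. The following sketch shows this on a made-up four-review corpus; the `toy_` names and the `random_state=0` seed are assumptions added for illustration and are not part of the exercise.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set, only for illustration
toy_reviews = [
    "great film, loved it",
    "terrible film, hated it",
    "loved the acting",
    "hated the plot",
]
toy_labels = [1, 0, 1, 0]

toy_pipeline = Pipeline(
    [
        ("count_vectorizer", CountVectorizer()),
        (
            "classifier",
            RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
        ),
    ]
)
toy_pipeline.fit(toy_reviews, toy_labels)

# Steps are reachable by the names given when the pipeline was built
fitted_vectorizer = toy_pipeline.named_steps["count_vectorizer"]
print(len(fitted_vectorizer.vocabulary_))  # 9 distinct words in the toy corpus

# predict() vectorizes the raw string with the fitted vocabulary, then classifies it
new_prediction = toy_pipeline.predict(["loved it"])
print(new_prediction)
```

The same two calls work on the `pipeline` object from the cell above, where the vocabulary is far larger because it is learned from 40000 reviews.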



**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**

-   use CountVectorizer for pre-processing the text.
-   use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.
-   print the classification report.

**References**:

-   https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
-   https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


In [8]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("count_vectorizer", CountVectorizer()),
        ("classifier", KNeighborsClassifier(n_neighbors=10, metric="euclidean")),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.67      0.67      5081
           1       0.66      0.66      0.66      4919

    accuracy                           0.66     10000
   macro avg       0.66      0.66      0.66     10000
weighted avg       0.66      0.66      0.66     10000



**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**

-   use CountVectorizer for pre-processing the text.
-   use **Multinomial Naive Bayes** as the classifier.
-   print the classification report.

**References**:

-   https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
-   https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html


In [9]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("count_vectorizer", CountVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.89      0.86      5081
           1       0.88      0.82      0.85      4919

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000
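
The three exercises differ only in the classifier step, so they can be run from one loop when a side-by-side comparison is wanted. The sketch below is one possible way to do that; `compare_classifiers` is a hypothetical helper, the toy reviews are made up, and `n_neighbors` is lowered to 3 only because the toy training set has 8 rows.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def compare_classifiers(classifiers, X_train, y_train, X_test, y_test):
    """Fit one CountVectorizer pipeline per classifier and return test accuracies."""
    scores = {}
    for name, classifier in classifiers.items():
        pipeline = Pipeline(
            [
                ("count_vectorizer", CountVectorizer()),
                ("classifier", classifier),
            ]
        )
        pipeline.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(X_test))
    return scores


# Hypothetical toy data, only to show that the loop runs end to end
toy_train = [
    "loved this movie", "great acting and story", "wonderful and moving film",
    "really enjoyed it", "hated this movie", "terrible acting and story",
    "boring and dull film", "really disliked it",
]
toy_train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_test = ["great movie, loved it", "dull and terrible movie"]
toy_test_labels = [1, 0]

toy_classifiers = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, criterion="entropy", random_state=0
    ),
    # 10 neighbors would exceed the 8 toy training samples
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "multinomial_nb": MultinomialNB(),
}

toy_scores = compare_classifiers(
    toy_classifiers, toy_train, toy_train_labels, toy_test, toy_test_labels
)
for name, score in toy_scores.items():
    print(f"{name}: {score:.2f}")
```

On the notebook's data, the helper would be called with `X_train, y_train, X_test, y_test` and the classifier settings from the exercises; the accuracy rows of the three reports above (0.84, 0.66 and 0.86) are the values such a comparison summarises for this particular split.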

