### **Bag of n-grams: Exercise**

-   In this exercise, you will classify whether a given movie review is **positive or negative**.

-   You will use a Bag of n-grams to pre-process the text and then apply different classification algorithms.

-   scikit-learn's CountVectorizer has a built-in implementation of Bag of Words and, through its `ngram_range` parameter, Bag of n-grams.
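To make the idea concrete, here is a minimal sketch of what `ngram_range` does. The sentence is made up for illustration and is not taken from the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
corpus = ["the movie was not good"]

# ngram_range=(1, 2) keeps both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(corpus)

# vocabulary_ maps each n-gram to a column index; sort the keys for display
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

The bigram `not good` becomes a feature of its own, which a unigram-only Bag of Words model cannot represent.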


### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

-   The data consists of two columns: `review` and `sentiment`.
-   `review` holds the statements given by users after watching the movie.
-   `sentiment` tells whether the given review is positive or negative.


In [2]:
# import pandas library
import pandas as pd

# read the dataset and store it in a variable df
df = pd.read_csv("IMDB dataset.csv")
# keep a random sample of 5000 rows (no random_state is set, so the sample differs between runs)
df = df.sample(5000)

# print the shape of dataframe
print(df.shape)


# print top 5 rows
df.head()


(5000, 2)


|       | review | sentiment |
|-------|--------|-----------|
| 44860 | A broke would be screenwriter and his would be... | negative |
| 47924 | Tedious girls-at-reform-school flick, which pl... | negative |
| 44024 | I have a little hobby of finding really cool p... | positive |
| 16077 | It's proof that movie makers and their financi... | negative |
| 47091 | I remember seeing this movie a long time ago o... | negative |


In [3]:
# check the distribution of labels
df.sentiment.value_counts()

sentiment
negative    2518
positive    2482
Name: count, dtype: int64
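Raw counts can also be read as proportions, which makes the class balance easier to judge at a glance. A small sketch, using a made-up Series as a stand-in for `df["sentiment"]`:

```python
import pandas as pd

# Hypothetical toy labels standing in for df["sentiment"]
sentiment = pd.Series(["negative", "positive", "negative", "positive", "negative"])

# normalize=True returns class proportions instead of raw counts
proportions = sentiment.value_counts(normalize=True)
print(proportions)
```

For the sample above, 2518 negative reviews out of 5000 is a share of about 0.50, so the two classes are nearly balanced.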

In [4]:
# add a new column "label_num" that maps each label to a number (positive -> 0, negative -> 1)
df["label_num"] = df["sentiment"].map({"positive": 0, "negative": 1})

# check the results with top 5 rows
df.head()

|       | review | sentiment | label_num |
|-------|--------|-----------|-----------|
| 44860 | A broke would be screenwriter and his would be... | negative | 1 |
| 47924 | Tedious girls-at-reform-school flick, which pl... | negative | 1 |
| 44024 | I have a little hobby of finding really cool p... | positive | 0 |
| 16077 | It's proof that movie makers and their financi... | negative | 1 |
| 47091 | I remember seeing this movie a long time ago o... | negative | 1 |


### **Modelling without Pre-processing Text data**


In [5]:
# import train-test-split from sklearn
from sklearn.model_selection import train_test_split

# do the train-test split with a test size of 20%, a random state of 2024, and stratified sampling
X_train, X_test, y_train, y_test = train_test_split(
    df["review"],
    df["label_num"],
    test_size=0.2,
    random_state=2024,
    stratify=df["label_num"],
)

In [6]:
# print the shapes of X_train and X_test
X_train.shape, X_test.shape

((4000,), (1000,))
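The effect of `stratify` is easier to see on imbalanced data. The sketch below uses a made-up 80/20 toy dataset (the column names `text` and `label` are hypothetical) and checks that both splits keep the same class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1
toy = pd.DataFrame(
    {"text": [f"doc {i}" for i in range(100)], "label": [0] * 80 + [1] * 20}
)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=0,
    stratify=toy["label"],
)

# the share of class 1 should match the full dataset in both splits
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(train_ratio, test_ratio)
```

Without `stratify`, a small test set can end up with a noticeably different class mix, which makes per-class metrics harder to compare across runs.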

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   Use CountVectorizer with unigrams, bigrams, and trigrams.
-   Use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean' distance.
-   Print the classification report.


In [7]:
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 3))),
        ("classifier", KNeighborsClassifier(n_neighbors=10, metric="euclidean")),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.55      0.78      0.64       496
           1       0.63      0.37      0.46       504

    accuracy                           0.57      1000
   macro avg       0.59      0.57      0.55      1000
weighted avg       0.59      0.57      0.55      1000



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   Use CountVectorizer with unigrams, bigrams, and trigrams.
-   Use **KNN** as the classifier with n_neighbors of 10 and metric as 'cosine' distance.
-   Print the classification report.


In [8]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 3))),
        ("classifier", KNeighborsClassifier(n_neighbors=10, metric="cosine")),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.57      0.82      0.67       496
           1       0.69      0.39      0.50       504

    accuracy                           0.60      1000
   macro avg       0.63      0.61      0.59      1000
weighted avg       0.63      0.60      0.59      1000
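One plausible reason the two metrics behave differently on count vectors is that Euclidean distance is sensitive to document length, while cosine distance only compares direction. The sketch below illustrates this on made-up documents. It is not an analysis of the IMDB results above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Toy documents (hypothetical): the second is the first repeated three times
docs = ["great film", "great film great film great film", "boring plot"]

X = CountVectorizer().fit_transform(docs)

# pairwise distance matrices; entry [i, j] is the distance between docs i and j
euclid = euclidean_distances(X)
cosine = cosine_distances(X)

print("euclidean:", euclid[0, 1], euclid[0, 2])
print("cosine:   ", cosine[0, 1], cosine[0, 2])
```

Under Euclidean distance, the short review ends up closer to the unrelated "boring plot" than to its own longer repetition. Cosine distance treats the repeated review as identical and the unrelated one as maximally different among non-negative vectors.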



**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   Use CountVectorizer with only trigrams.
-   Use **RandomForest** as the classifier.
-   Print the classification report.


In [9]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(3, 3))),
        ("classifier", RandomForestClassifier()),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.62      0.92      0.74       496
           1       0.85      0.43      0.57       504

    accuracy                           0.68      1000
   macro avg       0.73      0.68      0.66      1000
weighted avg       0.73      0.68      0.66      1000
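Trigram-only features tend to be sparse: an unseen review may share few or no exact three-word sequences with the training vocabulary. The following sketch, on made-up sentences, counts how many features of an unseen document are matched under each `ngram_range`. It is an illustration of the general effect, not a diagnosis of the scores above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy training documents and one unseen document
train_docs = ["the acting was really good", "the plot was really bad"]
unseen = ["really good acting"]

matched = {}
for ngram_range in [(1, 1), (3, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(train_docs)
    # nnz = number of known features that actually occur in the unseen document
    matched[ngram_range] = int(vec.transform(unseen).nnz)

print(matched)
```

With unigrams, every word of the unseen document is recognized. With trigrams only, none of its features are, so the classifier receives an all-zero vector for it.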



**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   Use CountVectorizer with both unigrams and bigrams.
-   Use **Multinomial Naive Bayes** as the classifier with an alpha value of 0.75.
-   Print the classification report.


In [10]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", MultinomialNB(alpha=0.75)),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.85      0.81      0.83       496
           1       0.82      0.86      0.84       504

    accuracy                           0.83      1000
   macro avg       0.84      0.83      0.83      1000
weighted avg       0.84      0.83      0.83      1000
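The `alpha` parameter controls additive smoothing: each n-gram count is increased by `alpha` before the per-class probabilities are computed, so n-grams never seen in a class do not get zero probability. A minimal sketch on made-up documents, comparing a manual calculation with what `MultinomialNB` stores in `feature_log_prob_`:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: one document per class
docs = ["good good film", "bad film"]
labels = [0, 1]
alpha = 0.75

X = CountVectorizer().fit_transform(docs)
clf = MultinomialNB(alpha=alpha).fit(X, labels)

# manual smoothing for class 0: (count + alpha) / (total count + alpha * n_features)
counts = X.toarray()[0]
manual = (counts + alpha) / (counts.sum() + alpha * X.shape[1])

# the fitted model stores the same quantities as log-probabilities
sklearn_probs = np.exp(clf.feature_log_prob_[0])
print(manual, sklearn_probs)
```

The word "bad" never appears in class 0, yet its smoothed probability stays above zero, which keeps a single unseen word from ruling out a class entirely.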



<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>


In [11]:
# use this utility function to get the preprocessed text data

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [12]:
# create a new column "preprocessed_txt" and use the utility function above to get the clean data
# this will take some time, please be patient
df["preprocessed_txt"] = df["review"].apply(preprocess)

In [13]:
# print the top 5 rows
df.head()

|       | review | sentiment | label_num | preprocessed_txt |
|-------|--------|-----------|-----------|------------------|
| 44860 | A broke would be screenwriter and his would be... | negative | 1 | broke screenwriter agent Tom Wood Arye Gross f... |
| 47924 | Tedious girls-at-reform-school flick, which pl... | negative | 1 | tedious girl reform school flick play somewhat... |
| 44024 | I have a little hobby of finding really cool p... | positive | 0 | little hobby find cool pic pretty unknown -and... |
| 16077 | It's proof that movie makers and their financi... | negative | 1 | proof movie maker financier treat audience con... |
| 47091 | I remember seeing this movie a long time ago o... | negative | 1 | remember see movie long time ago television re... |


**Build a model with pre-processed text**


In [14]:
# do the train-test split with a test size of 20%, a random state of 2024, and stratified sampling
# Note: Make sure to use only the "preprocessed_txt" column for splitting

X_train, X_test, y_train, y_test = train_test_split(
    df["preprocessed_txt"],
    df["label_num"],
    test_size=0.2,
    random_state=2024,
    stratify=df["label_num"],
)

**Let's check the scores again with the models used earlier**

-   Random Forest
-   Multinomial Naive Bayes (the best model so far, with 0.83 accuracy on the raw text)


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   Use CountVectorizer with only trigrams.
-   Use **RandomForest** as the classifier.
-   Print the classification report.


In [15]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(3, 3))),
        ("classifier", RandomForestClassifier()),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.53      0.95      0.69       496
           1       0.80      0.18      0.30       504

    accuracy                           0.56      1000
   macro avg       0.67      0.57      0.49      1000
weighted avg       0.67      0.56      0.49      1000



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   Use CountVectorizer with unigrams, bigrams, and trigrams.
-   Use **RandomForest** as the classifier.
-   Print the classification report.


In [16]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 3))),
        ("classifier", RandomForestClassifier()),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.77      0.87      0.82       496
           1       0.86      0.75      0.80       504

    accuracy                           0.81      1000
   macro avg       0.82      0.81      0.81      1000
weighted avg       0.82      0.81      0.81      1000



In [18]:
# Attempt 3: CountVectorizer with unigrams and bigrams, Multinomial Naive Bayes with alpha=0.75
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", MultinomialNB(alpha=0.75)),
    ]
)


# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)


# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)


# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.78      0.80       496
           1       0.79      0.84      0.82       504

    accuracy                           0.81      1000
   macro avg       0.81      0.81      0.81      1000
weighted avg       0.81      0.81      0.81      1000



In [19]:
from sklearn.metrics import confusion_matrix

# finally, print the confusion matrix for the last model fitted above
# (Multinomial Naive Bayes on the preprocessed text)
cm = confusion_matrix(y_test, y_pred)

cm


array([[385, 111],
       [ 80, 424]], dtype=int64)
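The confusion matrix can be tied back to the classification report by recomputing precision, recall, and accuracy from it. The sketch below copies the matrix printed above. Rows are true labels and columns are predicted labels, with 0 = positive and 1 = negative as defined by `label_num`.

```python
import numpy as np

# confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[385, 111],
               [80, 424]])

# per-class recall: correct predictions divided by the number of true samples per class
recall = np.diag(cm) / cm.sum(axis=1)

# per-class precision: correct predictions divided by the number of predictions per class
precision = np.diag(cm) / cm.sum(axis=0)

# overall accuracy: all correct predictions divided by all samples
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), round(accuracy, 2))
```

Rounded to two decimals, these values agree with the last classification report: 111 positive reviews were predicted as negative and 80 negative reviews were predicted as positive.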