
In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- The data consists of two columns:
    - review
    - sentiment
- Each review is the text a user wrote after watching a movie.
- The sentiment column labels each review as positive or negative.

In [2]:
# read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable
df = pd.read_csv('movies_sentiment_data.csv')


# print the shape of the data
print(df.shape)

# print top 5 datapoints
df.head()

(19000, 2)


| | review | sentiment |
|---|---|---|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive |
| 1 | I enjoyed the movie and the story immensely! I... | positive |
| 2 | I had a hard time sitting through this. Every ... | negative |
| 3 | It's hard to imagine that anyone could find th... | negative |
| 4 | This is one military drama I like a lot! Tom B... | positive |


In [3]:
# create a new column "Category": 1 if the sentiment is positive, 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x: 0 if x=='negative' else 1)
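
The lambda above sends every value other than 'negative' to 1, which would also include typos or missing entries. A minimal sketch of a stricter alternative, assuming the labels are exactly the strings 'positive' and 'negative' (the toy Series and the name `label_map` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy labels, including one unexpected value, for illustration only
sentiment = pd.Series(['positive', 'negative', 'Positive'])

# Explicit mapping: any label not listed becomes NaN instead of silently becoming 1
label_map = {'negative': 0, 'positive': 1}
category = sentiment.map(label_map)

# Number of labels that did not match the mapping
n_unmapped = int(category.isna().sum())
print(category.tolist(), n_unmapped)
```

If `n_unmapped` is greater than zero, the raw labels probably need cleaning before training.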

In [6]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df.Category.value_counts()


1    9500
0    9500
Name: Category, dtype: int64

**Train-Test split**
- random_state=0 fixes the shuffle so the split is the same on every run.
- test_size=0.2 holds out 20% of the data for testing and leaves 80% for training.

In [7]:
#Do the 'train-test' splitting with test size of 20%

x_train, x_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2, random_state=0)
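
As a small illustration of the two settings described above, the following sketch splits ten toy items twice with the same random_state and checks that both calls pick the same test items. The toy lists and the variable names are assumptions made for this example only.

```python
from sklearn.model_selection import train_test_split

# Ten toy items standing in for reviews and labels (illustrative data only)
texts = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# Two calls with the same random_state produce the same split
a_train, a_test, _, _ = train_test_split(texts, labels, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(texts, labels, test_size=0.2, random_state=0)

print(len(a_train), len(a_test))   # sizes of the train and test parts
print(a_test == b_test)            # whether both calls chose the same test items
```

With the 19,000 reviews in this notebook, test_size=0.2 yields 3,800 test reviews, which matches the total support shown in the classification reports.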

**Attempt 1**

1. Using the sklearn Pipeline module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use CountVectorizer to pre-process the text.
- Use **Random Forest** as the classifier, with n_estimators=50 and criterion='entropy'.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
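
To make the pre-processing step concrete, here is a minimal sketch of what CountVectorizer produces with its default settings. The three toy documents are invented for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents (illustrative, not taken from the dataset)
docs = ["good movie", "bad movie", "good good acting"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row is one document and each column is one vocabulary word, so the third row records that "good" appears twice. On the real reviews the vocabulary runs to many thousands of words, which is why the matrix is stored in sparse form.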

In [8]:
# create a pipeline object
# (the forest has no random_state, so scores can vary slightly between runs)
clf = Pipeline([
        ('Vectorizer', CountVectorizer()),
        ('Classifier', RandomForestClassifier(n_estimators=50, criterion='entropy'))
    ])



# fit with x_train and y_train
clf.fit(x_train, y_train)

# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83      1908
           1       0.83      0.81      0.82      1892

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800



**Attempt 2**

1. Using the sklearn Pipeline module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use CountVectorizer to pre-process the text.
- Use **KNN** as the classifier, with n_neighbors=10 and metric='euclidean'.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [9]:

# create a pipeline object
clf = Pipeline([
        ('Vectorizer', CountVectorizer()),
        ('Classifier', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
    ])

# fit with x_train and y_train
clf.fit(x_train, y_train)

# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.65      0.65      1908
           1       0.64      0.63      0.64      1892

    accuracy                           0.64      3800
   macro avg       0.64      0.64      0.64      3800
weighted avg       0.64      0.64      0.64      3800



**Attempt 3**

1. Using the sklearn Pipeline module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use CountVectorizer to pre-process the text.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [10]:
# create a pipeline object
clf = Pipeline([
        ('Vectorizer', CountVectorizer()),
        ('Classifier', MultinomialNB())
    ])

# fit with x_train and y_train
clf.fit(x_train, y_train)

# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86      1908
           1       0.87      0.82      0.85      1892

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800
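
The three attempts share the same pipeline shape and differ only in the classifier, so they could also be run in one loop. The sketch below is one possible way to do that on a tiny invented corpus; the dictionary `classifiers`, the toy texts, and the use of `accuracy_score` are assumptions for illustration, and KNN uses 3 neighbours here only because the toy training set has 8 rows.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny toy corpus (illustrative only); 1 = positive, 0 = negative
train_texts = ["great film", "loved it", "wonderful acting", "great fun",
               "terrible film", "hated it", "awful acting", "boring mess"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great acting", "awful film"]
test_labels = [1, 0]

# Hypothetical dictionary: one entry per classifier to compare
classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "naive_bayes": MultinomialNB(),
}

scores = {}
for name, classifier in classifiers.items():
    # Same two-step pipeline as in the attempts, with the classifier swapped in
    pipeline = Pipeline([("Vectorizer", CountVectorizer()), ("Classifier", classifier)])
    pipeline.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))

print(scores)
```

Scores on a corpus this small say nothing about the real dataset; the point is only the structure of the loop.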



**OBSERVATIONS**

In these experiments, the Bag of Words technique turns each review into a high-dimensional, sparse vector of word counts. Distance-based models such as K-Nearest Neighbours (KNN) tend to struggle with this representation: with thousands of dimensions, Euclidean distances between reviews become less informative, and comparing each test review against every training review is computationally expensive. This is consistent with KNN scoring the lowest accuracy of the three attempts (0.64).
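
One plausible contributor to the weak KNN result, which this notebook does not test, is that Euclidean distance on raw counts is sensitive to review length. The sketch below uses three invented count vectors to show the effect and contrasts it with cosine similarity, which ignores length. The vectors and helper functions are assumptions for illustration.

```python
import numpy as np

# Toy count vectors over a 3-word vocabulary (illustrative only)
short_review = np.array([1, 1, 0])
long_review = np.array([3, 3, 0])    # same words as short_review, repeated three times
other_review = np.array([0, 1, 1])   # shares only one word with short_review

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(euclidean(short_review, long_review), euclidean(short_review, other_review))
print(cosine_similarity(short_review, long_review), cosine_similarity(short_review, other_review))
```

By Euclidean distance the longer review with identical wording sits farther from the short review than the review with different wording, while cosine similarity ranks them the other way round.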

Multinomial Naive Bayes is well suited to text classification: it models word counts directly, estimating a per-class probability for every word in the vocabulary from simple count tables, which makes it fast to train. It achieved the best accuracy here (0.85).
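
The per-class word probabilities can be reproduced by hand. The sketch below, using an invented 3-document count matrix, computes the smoothed estimate (count of the word in the class + alpha) / (total words in the class + alpha × vocabulary size) and compares it with what MultinomialNB learns. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 3 documents x 3 vocabulary words (illustrative only)
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 2, 2]])
y = np.array([1, 1, 0])

alpha = 1.0  # Laplace smoothing, the MultinomialNB default

# Manual estimate of P(word | class) for class 0, then class 1
manual = []
for label in (0, 1):
    word_counts = X[y == label].sum(axis=0)
    manual.append((word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

model = MultinomialNB(alpha=alpha).fit(X, y)
learned = np.exp(model.feature_log_prob_)  # rows follow model.classes_

print(manual)
print(np.allclose(manual, learned))
```

Training amounts to summing counts per class, which is why this classifier is so cheap to fit on large vocabularies.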

Random Forest copes better with high-dimensional data by training many decision trees, each on a bootstrap sample of the rows and with a random subset of features (words) considered at each split. Combining the trees reduces the variance and overfitting of a single tree, and the splits concentrate on the words that best separate the two classes. It reached 0.83 accuracy, close to Naive Bayes.
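
To see which words a fitted forest relies on, the impurity-based importances can be read from the classifier step of the pipeline. The following is a minimal sketch on an invented six-review corpus; the corpus, the name `forest_pipeline`, and the fixed random_state are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Tiny toy corpus (illustrative only); 1 = positive, 0 = negative
texts = ["great film", "great acting", "great story",
         "awful film", "awful acting", "awful story"]
labels = [1, 1, 1, 0, 0, 0]

forest_pipeline = Pipeline([
    ("Vectorizer", CountVectorizer()),
    ("Classifier", RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)),
])
forest_pipeline.fit(texts, labels)

vectorizer = forest_pipeline.named_steps["Vectorizer"]
forest = forest_pipeline.named_steps["Classifier"]

# Pair each word with its impurity-based importance, highest first
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked = sorted(zip(words, forest.feature_importances_), key=lambda pair: pair[1], reverse=True)
for word, importance in ranked:
    print(f"{word}: {importance:.3f}")
```

The same two `named_steps` lookups would work on the notebook's `clf` after the Random Forest attempt, although the ranking there would cover the full review vocabulary.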
