Sentiment analysis using Naive Bayes.

Download link: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv

About Dataset
The IMDB dataset contains 50K movie reviews for natural language processing or text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classification or deep learning algorithms.
For more information about the dataset, see the following link:
http://ai.stanford.edu/~amaas/data/sentiment/

In [6]:
import pandas as pd
import re 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [7]:
df = pd.read_csv("../../Downloads/Data for ML/IMDB Dataset.csv")
print(df.head(3))

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive


Information about the data.

--> Check for missing values.

In [8]:
df.info()                       # There are no missing values present in the data. 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


Text preprocessing. 

--> Removal of HTML tags.
--> Removal of special characters.
--> Conversion of everything to lower case.
--> Removal of stop words.
--> Stemming.

In [9]:
# Removal of html tags. 
def rem_html(text: str) -> str:
    eg_text = re.compile("<.*?>")   
    return re.sub(eg_text, " ", text)

def rem_specialChar(text: str) -> str:
    #Specific
    phrase = re.sub(r"won't", r"will not", text)
    phrase = re.sub(r"don't", r"do not", phrase)
    phrase = re.sub(r"can't", r"can not", phrase)
    
    #General
    phrase = re.sub(r"n't", r" not", phrase)
    phrase = re.sub(r"'s", r" is", phrase)
    phrase = re.sub(r"'m", r" am", phrase)
    phrase = re.sub(r"'re", r" are", phrase)
    phrase = re.sub(r"'ll", r" will", phrase)
    phrase = re.sub(r"'t", r" not", phrase)
    phrase = re.sub(r"'ve", r" have", phrase)
    
    # Clean punctuations
    phrase = re.sub(r'[?|!|\'|"|#|@|:]', r'', phrase)
    phrase = re.sub(r'[(|)|.|,|\|/]', r'', phrase)
    phrase = re.sub(r'-', r' ', phrase)    
    
    # Replace any remaining run of non-alphanumeric characters with a space
    phrase = re.sub(r'[^A-Za-z0-9]+', r' ', phrase)
    
    # Remove every token that contains a digit
    phrase = re.sub(r'\S*\d\S*', r'', phrase)
    return phrase

def lowercase(text: str) -> str:
    return text.lower()

def tokens(text: str) -> list[str]:
    return text.split()

def rem_stopwords(text: list[str]) -> list[str]:
    # Use a set so each membership test is a hash lookup instead of a list scan
    stop = set(stopwords.words('english'))
    return [word for word in text if word not in stop]

def rem_stemmer(text: list[str]) -> list[str]:
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in text]

def join_words(text: list[str]) -> str:
    return " ".join(text)
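
To see what these steps do to a single review, here is a small self-contained sketch. It follows the same cleaning order as the functions above but, as an assumption made only to keep it dependency-free, it swaps the NLTK stop-word list for a tiny hypothetical set (TOY_STOPWORDS), expands just one contraction, and skips stemming. The name clean_review is illustrative and not part of the notebook.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's English stop-word list.
TOY_STOPWORDS = {"i", "a", "the", "this", "was", "it", "do", "not"}

def clean_review(text: str) -> str:
    text = re.sub(r"<.*?>", " ", text)          # strip HTML tags such as <br />
    text = re.sub(r"don't", "do not", text)     # expand one contraction
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # other characters become a space
    text = re.sub(r"\S*\d\S*", "", text)        # drop tokens containing a digit
    words = text.lower().split()                # lowercase, then split on whitespace
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

sample = "I don't recommend this film.<br /><br />It was a 2nd-rate thriller!"
cleaned = clean_review(sample)
print(cleaned)  # recommend film rate thriller
```

Note that the negation disappears from the cleaned text. NLTK's English stop-word list also contains "not", so the same thing happens in the real pipeline.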

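Here I am applying the text preprocessing to all the data before the train/test split. For transformations that learn something from the data this would be a data leakage problem, but the cleaning steps above are stateless (each review is processed independently of the others), so applying them to the whole dataset is fine.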
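
Leakage becomes a concern for steps that learn something from the data, such as building a vocabulary. As a minimal sketch (the toy reviews and variable names are made up for illustration), a fitted transformer such as CountVectorizer can be fit on the training split only and then applied to the test split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy data, invented for this sketch.
reviews = ["great film", "terribl plot", "great act", "bore film",
           "love stori", "terribl act", "wonder stori", "bore plot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first ...
train_txt, test_txt, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=0)

# ... then learn the vocabulary from the training reviews only.
vec = CountVectorizer()
X_train = vec.fit_transform(train_txt)
X_test = vec.transform(test_txt)   # words unseen in training are ignored

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, because the test reviews are encoded with the vocabulary learned from the training reviews.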

In [10]:
# Because of memory constraints, let's take only 20,000 data points. 

df = df.sample(20000, random_state=47)

In [11]:
df['review'] = df['review'].apply(rem_html)
df['review'] = df['review'].apply(rem_specialChar)
df['review'] = df['review'].apply(lowercase)
df['review'] = df['review'].apply(tokens)
df['review'] = df['review'].apply(rem_stopwords)
df['review'] = df['review'].apply(rem_stemmer)
print(df.head(3))

                                                  review sentiment
48243  [cleo, second, husband, amateurish, attempt, p...  negative
48967  [first, would, like, clarifi, consid, one, fun...  negative
36155  [volcano, set, lo, angel, minor, earthquak, hi...  negative


In [12]:
df['review'] = df['review'].apply(join_words)

Now convert the sentiments into 0 (negative) and 1 (positive).

In [13]:
# Assign the mapped column back. Calling replace(..., inplace=True) on df['sentiment']
# triggers a pandas FutureWarning and will stop working in pandas 3.0.
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
print(df.head(3))

                                                  review  sentiment
48243  cleo second husband amateurish attempt psychod...          0
48967  first would like clarifi consid one funniest f...          0
36155  volcano set lo angel minor earthquak hit vacat...          0


Now it's time to convert each unique word into a feature. Here we build our own vocabulary (dictionary) from the data we have: each distinct word becomes one column, and each entry holds the number of times that word occurs in a review.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(df['review']).toarray()
print(X.shape)
print(f"Number of data points: {X.shape[0]}")
print(f"Number of unique words/features: {X.shape[1]}")

(20000, 58702)
Number of data points: 20000
Number of unique words/features: 58702
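
On the memory issue mentioned earlier: fit_transform returns a SciPy sparse matrix, and it is the .toarray() call that expands it into a dense array of 20,000 by 58,702 entries. As a hedged sketch on made-up toy data, MultinomialNB and BernoulliNB can be trained directly on the sparse matrix. GaussianNB expects dense input, so the dense conversion is needed only for that model.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy data, invented for this sketch.
docs = ["great film", "terribl plot", "great act", "bore film"]
y = [1, 0, 1, 0]

X_sparse = CountVectorizer().fit_transform(docs)   # note: no .toarray()
print(issparse(X_sparse))

# Both models accept the sparse matrix as-is.
mnb = MultinomialNB().fit(X_sparse, y)
bnb = BernoulliNB().fit(X_sparse, y)
print(mnb.predict(X_sparse), bnb.predict(X_sparse))
```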


In [15]:
Y = df.iloc[:, -1].values
print(f"Number of sentiments: {Y.shape[0]}")

Number of sentiments: 20000


Split the data into training and test datasets. 

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(f"Number of data points in training dataset: {X_train.shape[0]}")

Number of data points in training dataset: 16000


In [17]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

clf1 = GaussianNB()
clf2 = BernoulliNB()
clf3 = MultinomialNB()

Now let's train the models clf1, clf2, and clf3.

In [18]:
clf1.fit(X_train, Y_train)
clf2.fit(X_train, Y_train)
clf3.fit(X_train, Y_train)

Let's predict on the test dataset. 

In [19]:
y_pred1 = clf1.predict(X_test)
y_pred2 = clf2.predict(X_test)
y_pred3 = clf3.predict(X_test)

Finding accuracy for all the models. 

In [20]:
from sklearn.metrics import accuracy_score

print(f"Accuracy value for GaussianNB: {accuracy_score(Y_test, y_pred1)}")
print(f"Accuracy value for BernoulliNB: {accuracy_score(Y_test, y_pred2)}")
print(f"Accuracy value for MultinomialNB: {accuracy_score(Y_test, y_pred3)}")

Accuracy value for GaussianNB: 0.628
Accuracy value for BernoulliNB: 0.8425
Accuracy value for MultinomialNB: 0.85475


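GaussianNB scores far lower than the other two models, which suggests that the features are not normally distributed: word counts are non-negative integers and mostly zero, so a Gaussian likelihood fits them poorly.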

How can we increase the accuracy of the models?

1. We can use cross-validation to find the best hyperparameter values for the models. The defaults are var_smoothing=1e-09 for GaussianNB and alpha=1.0 for BernoulliNB and MultinomialNB.
2. We can change the max_features value in CountVectorizer and see how it changes the output. 
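
The first option could look roughly like the sketch below. It is an illustration on made-up toy data, not part of this notebook's run: the candidate alpha values are arbitrary examples, and the names pipe, param_grid, and search are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch; in the notebook these would be
# the cleaned training reviews and their 0/1 labels.
reviews = ["great film", "terribl plot", "great act", "bore film",
           "love stori", "terribl act", "wonder stori", "bore plot"] * 3
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 3

pipe = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])

# Candidate values are arbitrary examples, not recommendations.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0, 2.0]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(reviews, labels)

print(search.best_params_, search.best_score_)
```

Wrapping the vectorizer and the classifier in one Pipeline means the vocabulary is rebuilt from the training folds in every cross-validation round, so the held-out fold never influences it.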

For now, we are only going with the second option. 

In [None]:
def optimum_maxFeature():
    max_feature = [i for i in range(10000, 55000, 5000)]
    accuracy1 = []
    accuracy2 = []
    accuracy3 = []
    for i in max_feature:
        cv_ = CountVectorizer(max_features=i)
        X1 = cv_.fit_transform(df['review']).toarray()
        Y1 = df.iloc[:, -1].values
        X_train_, X_test_, Y_train_, Y_test_ = train_test_split(X1, Y1, test_size=0.2, random_state=0)
        clf1_ = GaussianNB()
        clf2_ = BernoulliNB()
        clf3_ = MultinomialNB()
        clf1_.fit(X_train_, Y_train_)
        clf2_.fit(X_train_, Y_train_)
        clf3_.fit(X_train_, Y_train_)
        y_pred1_ = clf1_.predict(X_test_)
        y_pred2_ = clf2_.predict(X_test_)
        y_pred3_ = clf3_.predict(X_test_)
        accuracy1.append(accuracy_score(Y_test_, y_pred1_))
        accuracy2.append(accuracy_score(Y_test_, y_pred2_))
        accuracy3.append(accuracy_score(Y_test_, y_pred3_))
    return accuracy1, accuracy2, accuracy3

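accuracy1, accuracy2, accuracy3 = optimum_maxFeature()
max_feature = list(range(10000, 55000, 5000))

import matplotlib.pyplot as plt
# x-axis: max_features, y-axis: test accuracy
plt.plot(max_feature, accuracy1, label="GaussianNB")
plt.plot(max_feature, accuracy2, label="BernoulliNB")
plt.plot(max_feature, accuracy3, label="MultinomialNB")
plt.xlabel("max_features")
plt.ylabel("Accuracy")
plt.legend()
plt.show()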