Part 1:
1. Load the dataset and preprocess the reviews.
a. Convert all text to lowercase.
b. Remove non-alphabetic characters (punctuation).
c. Tokenize the reviews and remove common stopwords.
d. Apply stemming to reduce words to their root form.
2. Split the dataset into training and testing sets (80% training, 20% testing).
3. Use a Naive Bayes classifier to classify the reviews into positive and negative categories.
a. Implement a Bag-of-Words model using CountVectorizer.
b. Train the Naive Bayes classifier using the training set.

In [31]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [32]:
#load dataset
df = pd.read_csv("/content/drive/MyDrive/Datasets/IMDB Dataset.csv")

In [33]:
#length of dataframe before any sampling
length_before_sampling = len(df)
length_before_sampling

50000

In [34]:
#randomly sample 1000 rows
df = df.sample(n=1000)
length_after_sampling = len(df)
length_after_sampling

1000
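
The sampling cell above does not fix a seed, so each run draws different rows. A minimal sketch of a reproducible draw, assuming a toy DataFrame (`toy_df`, illustrative values only) and an arbitrary seed of 42, neither of which comes from the notebook:

```python
import pandas as pd

# Toy frame standing in for the IMDB data (illustrative values only)
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

# Passing the same random_state returns the same rows on every run;
# the seed value 42 is an arbitrary choice
sample_a = toy_df.sample(n=5, random_state=42)
sample_b = toy_df.sample(n=5, random_state=42)

# Both draws select identical row labels
print(sample_a.index.equals(sample_b.index))
```

Fixing the seed keeps the class counts and the downstream metrics stable between runs.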

In [35]:
df.head()

,review,sentiment
35447,Have I seen a worse movie? No I can't say that...,negative
45230,When Gundam0079 became the movie trilogy most ...,positive
47845,Moon Child is the story of two brothers and a ...,positive
40263,I first saw this movie about 4 years ago and i...,positive
26592,It is hard to know what category to put this f...,positive


In [36]:
df['sentiment'].value_counts()

sentiment,count
positive,515
negative,485


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 35447 to 9109
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   object
dtypes: object(2)
memory usage: 23.4+ KB


In [38]:
#Initialize an empty list to store processed reviews
corpus = []
ps = PorterStemmer()  #create the stemmer once and reuse it for every review
stop_words = set(stopwords.words('english'))  #build the stopword set once instead of once per word
#loop through each review
for text in df["review"]:
  #Remove non-alphabetic characters
  review = re.sub("[^a-zA-Z]", " ", text)
  #Convert text to lowercase
  review = review.lower()
  review = review.split()  #split review into individual words
  #Remove stopwords and apply stemming to each word
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = " ".join(review) #join processed words back into single string
  corpus.append(review) #append processed review to corpus list
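
The preprocessing steps in the loop above can also be wrapped in a single function, which makes them easy to check on one sentence. This is a minimal sketch: `clean_review` and `TOY_STOPWORDS` are hypothetical names, and the tiny hand-written stopword set is an assumption so the example runs without downloading NLTK data (the notebook itself uses `stopwords.words('english')`).

```python
import re
from nltk.stem.porter import PorterStemmer

# Hand-written stopword set for illustration only;
# the notebook uses the full NLTK English list instead
TOY_STOPWORDS = {"have", "i", "a", "the", "is", "this"}
stemmer = PorterStemmer()

def clean_review(text, stop_words=TOY_STOPWORDS):
    """Hypothetical helper: strip non-letters, lowercase, drop stopwords, stem."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)   # replace non-alphabetic characters with spaces
    tokens = letters_only.lower().split()           # lowercase, then split on whitespace
    stemmed = [stemmer.stem(tok) for tok in tokens if tok not in stop_words]
    return " ".join(stemmed)

cleaned = clean_review("Have I seen a worse movie?")
print(cleaned)  # seen wors movi
```

The stems "wors" and "movi" match the first row of the `CleanedText` column produced by the notebook.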

In [39]:
#Create copy of dataset to avoid modifying original
df_copy = df.copy()
#Add processed reviews as new column
df_copy["CleanedText"] = corpus
df_copy.head()

,review,sentiment,CleanedText
35447,Have I seen a worse movie? No I can't say that...,negative,seen wors movi say pathet director still aliv ...
45230,When Gundam0079 became the movie trilogy most ...,positive,gundam becam movi trilog us familiar lot sheer...
47845,Moon Child is the story of two brothers and a ...,positive,moon child stori two brother friend tri make f...
40263,I first saw this movie about 4 years ago and i...,positive,first saw movi year ago expect someth funni si...
26592,It is hard to know what category to put this f...,positive,hard know categori put film film set time peri...


In [40]:
#Check number of unique values in 'sentiment' column
df_copy["sentiment"].nunique()

2

In [41]:
#Extract feature and target data
X = df_copy.loc[:, "CleanedText"].values
y = df_copy.loc[:, "sentiment"].values

In [42]:
#Initialize TfidfVectorizer with a maximum of 5000 features
tfidf = TfidfVectorizer(max_features=5000)
#Transform text data into TF-IDF features and convert result into dense NumPy array
X = tfidf.fit_transform(X).toarray()

In [43]:
#Split the data into train and test sets (test_size=0.25 gives a 75%/25% split)
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.25, random_state= 42)
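
The task statement asks for an 80%/20% split, while the cell above passes `test_size=0.25`. A minimal sketch of the 80/20 variant on toy arrays follows. `X_toy` and `y_toy` are illustrative, and `stratify` is an optional addition that the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and balanced labels (illustrative only)
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array(["positive", "negative"] * 25)

# test_size=0.2 keeps 80% for training; stratify preserves the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_te))  # 40 10
```

Stratifying matters most on small or imbalanced samples, where a purely random split can skew the class balance of the test set.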

In [44]:
model = MultinomialNB() #Initialize multinomial Naive Bayes classifier
#Fit model to the training data
model.fit(X_train, y_train)

Part 2:
1. Evaluate the performance of the model using the following metrics:
a. Accuracy
b. Precision, Recall, and F1-score
c. Confusion Matrix
d. ROC-AUC Score
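
ROC-AUC, unlike the other metrics listed above, is normally computed from continuous scores rather than hard class labels. A minimal sketch of scoring with `predict_proba` follows, assuming a toy corpus (`texts` and `labels` are illustrative, not from the IMDB sample) and evaluating on the training data purely to keep the example short:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Toy, already-cleaned reviews (illustrative data only)
texts = ["great movi love", "love great act", "bad movi wast",
         "wast time bad plot", "great plot", "bad act"]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(texts)
clf = MultinomialNB().fit(X_toy, labels)

# predict_proba columns follow the order of clf.classes_
pos_idx = list(clf.classes_).index("positive")
pos_scores = clf.predict_proba(X_toy)[:, pos_idx]

# Encode the true labels as 1 (positive) / 0 (negative), then score with probabilities
y_true = [1 if label == "positive" else 0 for label in labels]
auc_from_scores = roc_auc_score(y_true, pos_scores)
print("ROC AUC from probabilities:", auc_from_scores)
```

Using probabilities lets the metric take the ranking of all predictions into account, not only the predictions at the default 0.5 threshold.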

In [45]:
#Predict and evaluate the model
y_pred = model.predict(X_test)
y_pred = [1 if label == "positive" else 0 for label in y_pred]
y_test = [1 if label == "positive" else 0 for label in y_test]
#Accuracy score
print("Accuracy: ", accuracy_score(y_test, y_pred))
#Classification report
print("classification Report: ", classification_report(y_test, y_pred))

Accuracy:  0.84
classification Report:                precision    recall  f1-score   support

           0       0.90      0.77      0.83       124
           1       0.80      0.91      0.85       126

    accuracy                           0.84       250
   macro avg       0.85      0.84      0.84       250
weighted avg       0.85      0.84      0.84       250



In [46]:
#ROC AUC score (computed here from hard 0/1 predictions, not predicted probabilities)
roc_auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC Score: ", roc_auc_value)

#Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", conf_matrix)

ROC AUC Score:  0.8394137224782385
Confusion Matrix: 
 [[ 95  29]
 [ 11 115]]
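
The reported metrics can be re-derived by hand from the confusion matrix above, which is a useful sanity check. The sketch below uses only the four printed counts. Because `roc_auc_score` was given hard 0/1 predictions, the ROC curve has a single operating point and the area reduces to the mean of the two per-class recalls.

```python
# Counts read from the printed confusion matrix (rows: true class, columns: predicted class)
tn, fp = 95, 29    # true negatives, false positives
fn, tp = 11, 115   # false negatives, true positives

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)     # true positive rate
recall_neg = tn / (tn + fp)     # true negative rate
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# With hard labels, ROC AUC equals the average of the two recalls
auc_hard_labels = (recall_pos + recall_neg) / 2

print(round(accuracy, 2))         # 0.84
print(round(precision_pos, 2))    # 0.8
print(round(recall_pos, 2))       # 0.91
print(round(f1_pos, 2))           # 0.85
print(round(auc_hard_labels, 4))  # 0.8394
```

These values agree with the accuracy, the class-1 row of the classification report, and the printed ROC AUC score.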
