# **Exercise 1 - Implementation of Naive Bayes Algorithm**

# *A.2. Naive Bayes Classifier for Movie Review Classification*

*Building a Sentiment Analysis Model for IMDB Movie Review Dataset*

**Step 1: Import Required Libraries**

In [60]:
# Basic libraries
import pandas as pd
import numpy as np
import re

# NLP libraries
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Machine Learning libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Step 2: Load the IMDB Dataset**

In [61]:
# Load dataset
df = pd.read_csv("/content/IMDB Dataset.csv")

# Convert sentiment to numbers
# positive → 1, negative → 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Check first 5 rows
df.head()


                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1
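
One thing worth checking after the label mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray label (different capitalisation, trailing space) would slip through silently. The sketch below uses a small made-up series, not the IMDB data, to show the behaviour.

```python
import pandas as pd

# Made-up labels for illustration; 'Positive' (capitalised) is not a key in the mapping.
toy = pd.Series(['positive', 'negative', 'Positive'])
mapped = toy.map({'positive': 1, 'negative': 0})

# Count labels that did not match any key and became NaN
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("Unmapped labels:", n_unmapped)
```

On the real frame, the same check would be `df['sentiment'].isna().sum()`, which should be 0 if every row is labelled exactly `positive` or `negative`.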


**Step 3: Text Preprocessing (Cleaning Reviews)**

In [62]:
# Initialize stemmer and stopwords
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Create empty list to store cleaned reviews
corpus = []

for review in df['review']:
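    # Replace every non-letter character (punctuation, digits, symbols) with a space;
    # HTML tags such as <br /> are not stripped first, so their letters ("br") remain as tokens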
    review = re.sub('[^a-zA-Z]', ' ', review)

    # Convert to lowercase
    review = review.lower()

    # Split into words
    review = review.split()

    # Remove stopwords and apply stemming
    review = [ps.stem(word) for word in review if word not in stop_words]

    # Join words back to sentence
    review = ' '.join(review)

    corpus.append(review)
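
The reviews contain HTML line breaks (`<br />`, visible in the second row of `df.head()`). The letters-only substitution removes the angle brackets and slash but keeps the letters, so `br` ends up in the corpus as a word. The sketch below shows this on a made-up sentence modelled on that row, and compares it with a hypothetical variant that drops tags first. Stopword removal and stemming are left out to keep the example dependency-free. The results reported in this notebook come from the original cleaning, so switching to the variant may change them.

```python
import re

def clean_letters_only(text):
    """Mirror the letters-only substitution used in Step 3 (no stopwords or stemming)."""
    return re.sub('[^a-zA-Z]', ' ', text).lower().split()

def clean_strip_html_first(text):
    """Hypothetical variant: remove HTML tags before the letters-only substitution."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    return re.sub('[^a-zA-Z]', ' ', no_tags).lower().split()

# Made-up sample sentence containing two HTML line breaks
sample = "A wonderful little production. <br /><br />The filming is great."

tokens_original = clean_letters_only(sample)
tokens_variant = clean_strip_html_first(sample)

print(tokens_original)  # includes 'br' twice
print(tokens_variant)   # no 'br' tokens
```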


**Step 4: Create Feature Matrix (Bag of Words)**

In [63]:
# Convert text into word counts, keeping only the 5,000 most frequent terms
cv = CountVectorizer(max_features=5000)

X = cv.fit_transform(corpus).toarray()
y = df['sentiment'].values


**Step 5: Train–Test Split (80% Training, 20% Testing)**

In [64]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
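
A small sketch of what the split does, on toy arrays rather than the review matrix. Fixing `random_state` makes the split reproducible. The `stratify` argument is an optional extra that the original call does not use; it keeps the class proportions the same in both parts, which matters more on imbalanced data than on a roughly balanced set like this one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, 5 samples per class
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

# 80/20 split; stratify is optional and shown here as an assumption
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Same random_state gives the same split again
_, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("Test labels:", sorted(y_te.tolist()))
```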


**Step 6: Train Naive Bayes Model**

In [65]:
# Create Naive Bayes model
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)
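
For intuition about what `fit` estimates, the following is a minimal from-scratch sketch of multinomial Naive Bayes on a tiny made-up count matrix. It is not scikit-learn's implementation, only the same idea: a log prior per class, plus smoothed log word probabilities per class. It assumes Laplace smoothing with `alpha = 1.0`, which is also the default for `MultinomialNB`.

```python
import numpy as np

# Toy word-count matrix: rows are documents, columns are vocabulary terms
X_toy = np.array([[2, 0, 1],
                  [1, 0, 2],
                  [0, 3, 0],
                  [0, 2, 1]])
y_toy = np.array([1, 1, 0, 0])
alpha = 1.0  # Laplace smoothing constant (assumed)

classes = np.unique(y_toy)

# log P(class): fraction of training documents in each class
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# Total count of each term within each class, then smoothed and normalised
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes])
smoothed = counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

def predict_toy(x):
    """Return the class with the highest log posterior score for count vector x."""
    scores = log_prior + log_likelihood @ x
    return classes[np.argmax(scores)]

pred_pos = predict_toy(np.array([1, 0, 1]))
pred_neg = predict_toy(np.array([0, 2, 0]))
print(pred_pos, pred_neg)
```

Smoothing is what keeps a term that never appears in one class from driving that class's probability to zero.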


**Step 7: Make Predictions**

In [66]:
# Predict on test data
y_pred = model.predict(X_test)
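
Predicting on text the model has not seen requires passing it through the already fitted vectorizer with `transform`, not `fit_transform`, so the columns line up with the training vocabulary. A minimal sketch on made-up, pre-stemmed strings (the names `toy_cv` and `toy_model` are illustrative and separate from the notebook's `cv` and `model`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration only
train_texts = ["great wonder film", "love great act", "bad bore plot", "terribl bad film"]
train_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(train_texts), train_labels)

# New text: reuse the fitted vocabulary with transform (no refitting)
new_texts = ["great act", "bore terribl plot"]
new_X = toy_cv.transform(new_texts)
new_pred = toy_model.predict(new_X)

print(new_pred)
```

For the notebook's model, a new review would first need the same cleaning as in Step 3 and then `cv.transform([...])` before `model.predict`.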


**Step 8: Model Evaluation**

*Accuracy*

In [67]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.8489


*Classification Report (Precision, Recall, F1-Score)*

In [68]:
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.85      0.85      4961
           1       0.85      0.85      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



*Confusion Matrix*

In [69]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[4230  731]
 [ 780 4259]]
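
The accuracy and the per-class figures in the classification report can all be recomputed from this matrix. scikit-learn lays it out with true labels as rows and predicted labels as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. A short arithmetic check using the four numbers printed above:

```python
# Entries of the confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 4230, 731
fn, tp = 780, 4259

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Metrics for the positive class (label 1)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Metrics for the negative class (label 0)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print("Accuracy:", round(accuracy, 4))
print("Class 1 precision/recall/F1:", round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
print("Class 0 precision/recall:", round(precision_neg, 2), round(recall_neg, 2))
```

The rounded values agree with the accuracy of 0.8489 and with the 0.84 and 0.85 entries in the classification report.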


*ROC-AUC Score*

In [70]:
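# Note: y_pred holds hard 0/1 labels, so this value equals balanced accuracy
# rather than a threshold-independent AUC computed from predicted probabilities
roc_auc = roc_auc_score(y_test, y_pred)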
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.8489290288421147
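
This score was computed from hard 0/1 predictions. In that case the ROC curve has a single operating point and the area reduces to balanced accuracy, the mean of the recall on each class. The sketch below checks this against the confusion matrix above and then uses a small made-up example to show that scores and thresholded labels can give different AUC values. A ranking-based AUC for this model would need continuous scores such as `model.predict_proba(X_test)[:, 1]`; that value is not reported in this notebook, so none is given here.

```python
from sklearn.metrics import roc_auc_score

# Balanced accuracy from the confusion matrix printed earlier
tn, fp, fn, tp = 4230, 731, 780, 4259
balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2
print("Balanced accuracy:", balanced_accuracy)

# Made-up toy example: the same predictions as scores and as thresholded labels
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.7, 0.4, 0.9, 0.2]
y_label = [int(s >= 0.5) for s in y_score]

auc_from_labels = roc_auc_score(y_true, y_label)  # uses only the 0/1 decisions
auc_from_scores = roc_auc_score(y_true, y_score)  # uses the full ranking

print("AUC from labels:", auc_from_labels)
print("AUC from scores:", auc_from_scores)
```

The balanced accuracy comes out at about 0.848929, matching the ROC-AUC value printed above.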
