<h2>***Problem Statement: Movie Review Sentiment Classification***</h2>

You are given a dataset of movie reviews labeled as either Positive or Negative sentiment.
Your task is to build a Naive Bayes classifier to predict the sentiment of a new review.

Steps to Solve:
<ol>
<li>Load the dataset using Pandas.</li>
<li>Preprocess the text (lowercase, remove punctuation, stopwords).</li>
<li>Convert text into numerical features using CountVectorizer or TF-IDF.</li>
<li>Split data into training and test sets.</li>
<li>Train a Naive Bayes classifier (MultinomialNB from scikit-learn works well for text classification).</li>
<li>Evaluate the model using accuracy, precision, recall, and confusion matrix.</li>
<li>Predict sentiment for a new movie review (e.g., "The cinematography was stunning, but the plot was dull").</li>
</ol>

In [2]:
# Import Data
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords

df = pd.read_csv("/content/sample_data/movie_reviews.csv")
df.head()

Unnamed: 0,Sentiment,Review
0,positive,This movie was fantastic! The acting and plot ...
1,negative,The movie was too long and incredibly boring.
2,positive,Loved the visuals and the soundtrack. Highly r...
3,negative,The acting was terrible and the story made no ...
4,positive,An absolute masterpiece with breathtaking perf...


In [3]:
nltk.download("stopwords")

# Preprocessing
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    text = text.lower()  # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = text.split()
    words = [w for w in words if w not in stop_words]  # remove stopwords
    return " ".join(words)

df["Cleaned_Review"] = df["Review"].apply(preprocess_text)
df = df.drop(columns=["Review"])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
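
To see what the preprocessing does to a single review, the sketch below runs the same three steps (lowercase, strip punctuation, drop stopwords) on the second review from the dataset. It uses a tiny hard-coded stopword set, `demo_stop_words`, which is an assumption made so the example runs without the NLTK download; the notebook itself uses the full NLTK English list.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stopword list.
demo_stop_words = {"the", "was", "too", "and"}

def preprocess_demo(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(w for w in text.split() if w not in demo_stop_words)

cleaned = preprocess_demo("The movie was too long and incredibly boring.")
print(cleaned)  # movie long incredibly boring
```

The result matches the `Cleaned_Review` value the notebook shows for that row.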


In [4]:
df.groupby('Sentiment').describe()

Unnamed: 0_level_0,Cleaned_Review,Cleaned_Review,Cleaned_Review,Cleaned_Review
Unnamed: 0_level_1,count,unique,top,freq
Sentiment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
negative,5,5,movie long incredibly boring,1
positive,5,5,movie fantastic acting plot brilliant,1


In [5]:
# Binary indicator column: 1 for negative reviews, 0 for positive ones
df['negative'] = (df['Sentiment'] == 'negative').astype(int)
df.head()

Unnamed: 0,Sentiment,Cleaned_Review,negative
0,positive,movie fantastic acting plot brilliant,0
1,negative,movie long incredibly boring,1
2,positive,loved visuals soundtrack highly recommend,0
3,negative,acting terrible story made sense,1
4,positive,absolute masterpiece breathtaking performances,0


In [6]:
# Train Test Split
from sklearn.model_selection import train_test_split

X = df["Cleaned_Review"]
y = df["Sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
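
With so few rows, a purely random split can leave the test set unbalanced. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The sketch below uses a hypothetical 10-row frame (`toy`, not the notebook's CSV) and a `test_size` of 0.4 so that each class can contribute the same number of test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 positive and 5 negative reviews.
toy = pd.DataFrame({
    "Cleaned_Review": [f"review text {i}" for i in range(10)],
    "Sentiment": ["positive", "negative"] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Cleaned_Review"],
    toy["Sentiment"],
    test_size=0.4,               # 4 of the 10 rows go to the test set
    random_state=42,
    stratify=toy["Sentiment"],   # preserve the 50/50 class ratio in both splits
)

print(y_te.value_counts().to_dict())
```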

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vectorizer on the training text only, then reuse its vocabulary for the test text
v = CountVectorizer()
X_train_v = v.fit_transform(X_train)  # learn the vocabulary and count words in the training text
X_test_v = v.transform(X_test)        # words not seen during training are ignored
X_train_v.toarray()[:2]

array([[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]])

In [8]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_v,y_train)

In [9]:
# Predictions & Evaluation
y_pred = model.predict(X_test_v)

In [10]:
model.score(X_test_v, y_test)

0.3333333333333333

In [11]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),   # Step 1: convert text → numbers
    ('nb', MultinomialNB())              # Step 2: train Naive Bayes
])

In [12]:
clf.fit(X_train, y_train)

In [13]:
clf.score(X_test,y_test)

0.3333333333333333
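
A score computed on three test reviews is very noisy. One way to get a steadier estimate from a small dataset is stratified k-fold cross-validation over the whole pipeline. The sketch below is an illustration only: the first five texts are the cleaned reviews shown earlier, the remaining five are hypothetical, and the fold count is an arbitrary choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "movie fantastic acting plot brilliant",
    "movie long incredibly boring",
    "loved visuals soundtrack highly recommend",
    "acting terrible story made sense",
    "absolute masterpiece breathtaking performances",
    # Hypothetical reviews added to reach five per class:
    "dull plot weak acting",
    "brilliant story loved soundtrack",
    "boring terrible waste time",
    "fantastic visuals highly recommend",
    "long dull made sense",
]
labels = ["positive", "negative"] * 5

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Each fold refits the vectorizer on its own training part, so no test text leaks in.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```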

In [22]:
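from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Metrics for the CountVectorizer + MultinomialNB model; "positive" is treated as the positive class
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label="positive"))
print("Recall:", recall_score(y_test, y_pred, pos_label="positive"))
# Rows are true labels and columns are predicted labels, both ordered [negative, positive]
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred, labels=["negative", "positive"]))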

Accuracy: 0.3333333333333333
Precision: 0.3333333333333333
Recall: 1.0
Confusion Matrix:
 [[0 2]
 [0 1]]
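
The matrix above says that both negative test reviews were predicted as positive and the single positive review was predicted correctly. The sketch below rebuilds label lists consistent with that matrix (the order of the samples is an assumption) and prints a per-class breakdown with `classification_report`, which makes the zero recall on the negative class explicit.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Reconstructed from the confusion matrix above; sample order is assumed.
y_true = ["negative", "negative", "positive"]
y_hat = ["positive", "positive", "positive"]

labels = ["negative", "positive"]
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# zero_division=0 avoids a warning for the class that was never predicted.
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```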


In [16]:
# Clean the new review the same way as the training text before vectorizing it
review = [preprocess_text("The cinematography was stunning, but the plot was dull")]
review_count = v.transform(review)
print("Prediction : ",model.predict(review_count))

Prediction :  ['positive']


In [15]:
print("Prediction : ",clf.predict(review))

Prediction :  ['positive']
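
For a mixed review like this one, the class probabilities can be more informative than the bare label. The sketch below shows `predict_proba` on a pipeline trained only on the five cleaned reviews visible in the `df.head()` output, so its numbers are illustrative and will not match the notebook's model exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# The five cleaned reviews shown earlier (a subset of the real training data).
texts = [
    "movie fantastic acting plot brilliant",
    "movie long incredibly boring",
    "loved visuals soundtrack highly recommend",
    "acting terrible story made sense",
    "absolute masterpiece breathtaking performances",
]
labels = ["positive", "negative", "positive", "negative", "positive"]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
pipe.fit(texts, labels)

# The new review after the same cleaning steps as the training text.
proba = pipe.predict_proba(["cinematography stunning plot dull"])
print(dict(zip(pipe.classes_, proba[0])))
```

Only "plot" appears in this small vocabulary, so the prediction rests on one word plus the class priors.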
