# **Project Name**
IMDB Movie Review Sentiment Analysis

##### **Project Type** - Classification (Supervised Learning)
##### **Contribution** - Individual

# **Project Summary**
In this project, we build a Machine Learning model to classify IMDB movie reviews into **positive** and **negative** sentiments. The dataset contains two columns: `review` (text data) and `sentiment` (label).

The workflow includes:
- Preprocessing the text reviews
- Converting text into numerical features using vectorization
- Training classification models (Logistic Regression, Multinomial Naive Bayes, Random Forest)
- Evaluating performance using accuracy, precision, recall, and F1-score

The workflow started with text preprocessing, where the raw reviews were cleaned and prepared for modeling. Key steps included lowercasing, stripping HTML tags and non-letter characters, whitespace tokenization, stopword removal, and lemmatization, with the stopword list and lemmatizer taken from the NLTK library. These steps standardize the text and reduce noise so that the models can focus on meaningful features.

After preprocessing, the text data was converted into numerical features using TF-IDF vectorization (up to 5,000 unigram and bigram features), which weights each term by its frequency in a review and its rarity across the corpus. This representation was then used to train and evaluate three machine learning models: Logistic Regression, Multinomial Naive Bayes, and Random Forest Classifier.

The models were assessed on a held-out test set of 10,000 reviews using accuracy, precision, recall, and F1-score. All three models performed reasonably well, and Logistic Regression scored highest on all four metrics (accuracy 0.8912, F1 0.8923). Multinomial Naive Bayes performed competitively (accuracy 0.8611, F1 0.8634), as expected for text classification tasks, whereas Random Forest scored slightly lower (accuracy 0.8405, F1 0.8466), likely due to the high dimensionality and sparsity of TF-IDF features.

Overall, the project demonstrates the importance of text preprocessing, feature extraction, and model selection in building an effective sentiment analysis pipeline. The results highlight that for sparse, high-dimensional text data like IMDB reviews, linear models such as Logistic Regression often provide strong performance with simpler implementation compared to more complex ensemble methods.

# **GitHub Link -**

https://github.com/SathwikThotapally

# **Problem Statement**
Movie reviews often express clear sentiment about films. Automatically classifying these reviews as positive or negative helps companies understand audience feedback at scale. The objective is to design a machine learning model that predicts the sentiment of a review accurately.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [18]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sathwik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sathwik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Dataset Loading and Dataset First View

In [3]:
df = pd.read_csv('IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


### Dataset Rows & Columns count

In [4]:
df.shape

(50000, 2)

### Dataset Information

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


#### Missing Values/Null Values

In [6]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

### What did you know about your dataset?
- Dataset: IMDB Movie Reviews
- Columns:
  - `review`: textual content of movie review
  - `sentiment`: target label (positive / negative)
- There are no missing values in the dataset.

## ***2. Understanding Your Variables***
- **review:** Feature (unstructured text → needs preprocessing and vectorization).
- **sentiment:** Target variable (categorical: positive / negative).

*(No need for correlation analysis here since features are text, not numerical.)*

In [7]:
# Count positive vs. negative sentiments
review_counts = df['sentiment'].value_counts()
print(review_counts)

sentiment
positive    25000
negative    25000
Name: count, dtype: int64


#### The counts above show that the two sentiment labels are perfectly balanced (25,000 reviews each).

## ***3. Data Preprocessing***

#### ***This step includes text preprocessing techniques like:***
- Convert text to lowercase
- Remove HTML tags
- Remove punctuation, special characters, and numbers
- Tokenization (splitting on whitespace, which also removes extra spaces)
- Stopword removal
- Lemmatization

All these steps are wrapped into a single function: `preprocess_text(text)`

In [8]:
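def preprocess_text(text):
    # Lowercase so that "Good" and "good" map to the same token
    text = text.lower()
    # Strip HTML tags such as <br />
    text = re.sub(r'<.*?>', '', text)
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Tokenize on whitespace, which also collapses repeated spaces
    tokens = text.split()
    # Drop English stopwords, then lemmatize the remaining tokens
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)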
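
The regex stage of `preprocess_text` can be tried in isolation, without NLTK. The sketch below uses a hypothetical helper, `strip_markup`, that repeats only the lowercasing and the two `re.sub` calls. It also illustrates one side effect worth knowing: because tags are replaced with an empty string, two words separated only by a tag are glued together. Replacing tags with a space is one possible refinement, shown here as an assumption rather than as what the notebook does.

```python
import re

def strip_markup(text, tag_replacement=""):
    """Hypothetical helper: the regex steps of preprocess_text only."""
    text = text.lower()
    text = re.sub(r"<.*?>", tag_replacement, text)   # remove HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)           # keep letters only
    return " ".join(text.split())                    # collapse whitespace

sample = "A wonderful little production. <br /><br />The filming"
cleaned = strip_markup(sample)
print(cleaned)  # a wonderful little production the filming

# Words separated only by a tag are merged when the tag becomes ""
glued = strip_markup("Great acting<br />Terrible script")
spaced = strip_markup("Great acting<br />Terrible script", tag_replacement=" ")
print(glued)    # great actingterrible script
print(spaced)   # great acting terrible script
```

In the IMDB reviews most `<br />` tags follow punctuation, which the second substitution already turns into a space, so the effect on the reported results is probably small.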

In [9]:
df['clean_review'] = df['review'].apply(preprocess_text)
df.head()

| | review | sentiment | clean_review |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one reviewer mentioned watching oz episode hoo... |
| 1 | A wonderful little production. `<br /><br />`The... | positive | wonderful little production filming technique ... |
| 2 | I thought this was a wonderful way to spend ti... | positive | thought wonderful way spend time hot summer we... |
| 3 | Basically there's a family where a little boy ... | negative | basically family little boy jake think zombie ... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter mattei love time money visually stunnin... |


#### The code above applies the `preprocess_text` function to the `review` column and stores the result in a new column, `clean_review`.

### ***3b. Data Splitting***

In [10]:
X = df['clean_review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
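
The split above passes `stratify=y`, so the 80/20 split is made within each class and the test set keeps the 50/50 label balance. The pure-Python sketch below illustrates the idea with a hypothetical `stratified_split` function. It does not reproduce scikit-learn's exact shuffling, so the selected indices will differ from those of `train_test_split`; only the resulting sizes are comparable.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Illustrative only: split indices so each class keeps its share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class counts as the dataset: 25,000 positive and 25,000 negative
labels = ["positive"] * 25000 + ["negative"] * 25000
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)

print(len(train_idx), len(test_idx))  # 40000 10000
print(test_counts["positive"], test_counts["negative"])  # 5000 5000
```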

### ***3c. Vectorization*** 

In [13]:
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

X_train_vec = tfidf.fit_transform(X_train)

X_test_vec = tfidf.transform(X_test)
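
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` uses by default, but it handles unigrams only and ignores `max_features`, so it is a simplification of the cell above. The function names `fit_idf` and `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn IDF weights from the training documents only."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform(doc, idf):
    """Weight a document with the learned IDF; unseen terms are dropped."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

train_docs = ["wonderful little production", "boring movie", "wonderful movie"]
idf = fit_idf(train_docs)

# "unseen" never occurred in training, so it receives no feature at all
vec = transform("wonderful unseen movie movie", idf)
print(sorted(vec))
print(vec["movie"] > vec["wonderful"])
```

Fitting on `X_train` and only transforming `X_test`, as the notebook does, keeps test-set vocabulary and document frequencies out of the learned weights.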

## ***4. Model Implementation***

### ML Model - 1 - Logistic Regression

In [14]:
# Initialize the model
clf = LogisticRegression(max_iter=1000)

# Fit the algorithm
clf.fit(X_train_vec, y_train)

# Predict on the test set
y_pred_clf = clf.predict(X_test_vec)

# Evaluate the model
print("Accuracy :", accuracy_score(y_test, y_pred_clf))
print("Precision:", precision_score(y_test, y_pred_clf, pos_label="positive"))
print("Recall   :", recall_score(y_test, y_pred_clf, pos_label="positive"))
print("F1-score :", f1_score(y_test, y_pred_clf, pos_label="positive"))

Accuracy : 0.8912
Precision: 0.8835294117647059
Recall   : 0.9012
F1-score : 0.8922772277227723
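
The four scores above all follow from a single confusion matrix. The sketch below recomputes them from counts. The counts are not printed anywhere in the notebook; they are inferred from the reported recall and precision, assuming the stratified test set contains 5,000 positive and 5,000 negative reviews. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Inferred counts (assumption): recall 0.9012 of 5,000 positives gives
# tp = 4506, and precision 4506 / (4506 + fp) = 0.8835... gives fp = 594
tp, fn = 4506, 494
fp, tn = 594, 4406

accuracy, precision, recall, f1 = binary_metrics(tp, fp, fn, tn)
print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
print("F1-score :", f1)
```

Read this way, the Logistic Regression model misclassifies 594 negative reviews as positive and 494 positive reviews as negative.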


#### Make predictions on new data

In [15]:
new_rev = "I love this movie, it was amazing!"

clean_text = preprocess_text(new_rev)

new_vec = tfidf.transform([clean_text])

prediction1 = clf.predict(new_vec)[0]

print("Predicted Sentiment: ",prediction1)

Predicted Sentiment:  positive


In [16]:
new_rev1 = "I hate this movie, it was boring!"

clean_text1 = preprocess_text(new_rev1)

new_vec1 = tfidf.transform([clean_text1])

prediction2 = clf.predict(new_vec1)[0]

print("Predicted Sentiment: ",prediction2)

Predicted Sentiment:  negative


### ML Model - 2 - Naive Bayes Classifier

In [17]:
# Initialize the model
mnb = MultinomialNB(alpha=1.0)

# Fit the algorithm
mnb.fit(X_train_vec, y_train)

# Predict on the test set
y_pred_mnb = mnb.predict(X_test_vec)

print("Accuracy :", accuracy_score(y_test, y_pred_mnb))
print("Precision:", precision_score(y_test, y_pred_mnb, pos_label="positive"))
print("Recall   :", recall_score(y_test, y_pred_mnb, pos_label="positive"))
print("F1-score :", f1_score(y_test, y_pred_mnb, pos_label="positive"))

Accuracy : 0.8611
Precision: 0.8494290690923166
Recall   : 0.8778
F1-score : 0.8633815284744762
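
The `alpha=1.0` argument is the additive (Laplace) smoothing term: it keeps a word that never appears in one class from driving that class's probability to zero. The sketch below is a minimal multinomial Naive Bayes on raw word counts, written only to illustrate the role of `alpha`. The notebook feeds TF-IDF weights to scikit-learn's `MultinomialNB` instead, and the function names and toy reviews here are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {t for doc in docs for t in doc.split()}
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())

    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(term_counts[label].values())
        denom = total + alpha * len(vocab)        # smoothing enlarges the denominator
        log_like = {t: math.log((term_counts[label][t] + alpha) / denom)
                    for t in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    scores = {
        label: log_prior + sum(log_like[t] for t in doc.split() if t in log_like)
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

docs = ["love amazing movie", "wonderful love", "hate boring movie", "boring awful"]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels, alpha=1.0)

pred_a = predict_nb(model, "love movie")
pred_b = predict_nb(model, "boring movie")
print(pred_a, pred_b)  # positive negative
```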


#### Make predictions on new data

In [23]:
new_rev1 = "I absolutely hate this movie, it was not at all good!"

clean_text1 = preprocess_text(new_rev1)

new_vec1 = tfidf.transform([clean_text1])

prediction3 = mnb.predict(new_vec1)[0]

print("Predicted Sentiment: ",prediction3)

Predicted Sentiment:  negative


In [24]:
new_rev2 = "I loved this movie, it was nice to watch!"

clean_text2 = preprocess_text(new_rev2)

new_vec2 = tfidf.transform([clean_text2])

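# Note: this call uses clf (the Logistic Regression model), so the output
# below is not a Naive Bayes prediction; use mnb.predict to query Naive Bayes
prediction4 = clf.predict(new_vec2)[0]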

print("Predicted Sentiment: ",prediction4)

Predicted Sentiment:  positive


### ML Model - 3 - Random Forest Classifier

In [25]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42)

rf.fit(X_train_vec, y_train)

y_pred_rf = rf.predict(X_test_vec)

print("Accuracy :", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf, pos_label="positive"))
print("Recall   :", recall_score(y_test, y_pred_rf, pos_label="positive"))
print("F1-score :", f1_score(y_test, y_pred_rf, pos_label="positive"))

Accuracy : 0.8405
Precision: 0.8153361733654381
Recall   : 0.8804
F1-score : 0.8466198672949322


#### Make predictions on new data

In [26]:
new_rev3 = "I absolutely hate this movie, it was not at all good!"

clean_text3 = preprocess_text(new_rev3)

new_vec3 = tfidf.transform([clean_text3])

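# Note: this call uses mnb (Naive Bayes), so the output below is not a
# Random Forest prediction; use rf.predict to query the Random Forest
prediction5 = mnb.predict(new_vec3)[0]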

print("Predicted Sentiment: ",prediction5)

Predicted Sentiment:  negative


In [27]:
new_rev4 = "It was honestly a great movie, it was very nice to watch!"

clean_text4 = preprocess_text(new_rev4)

new_vec4 = tfidf.transform([clean_text4])

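# Note: this call uses mnb (Naive Bayes), so the output below is not a
# Random Forest prediction; use rf.predict to query the Random Forest
prediction6 = mnb.predict(new_vec4)[0]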

print("Predicted Sentiment: ",prediction6)

Predicted Sentiment:  positive
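
To compare the three models side by side, the test-set scores printed in the evaluation cells can be collected in one place. The snippet below is a small convenience sketch: the numbers are the notebook's own outputs rounded to four decimals, while the dictionary layout and variable names are assumptions.

```python
# Reported test-set scores, rounded: (accuracy, precision, recall, f1)
reported = {
    "Logistic Regression":     (0.8912, 0.8835, 0.9012, 0.8923),
    "Multinomial Naive Bayes": (0.8611, 0.8494, 0.8778, 0.8634),
    "Random Forest":           (0.8405, 0.8153, 0.8804, 0.8466),
}
metric_names = ("accuracy", "precision", "recall", "f1")

# Rank the models by F1-score, best first
ranked = sorted(reported, key=lambda name: reported[name][3], reverse=True)

print(f"{'model':<26}" + "".join(f"{m:>11}" for m in metric_names))
for name in ranked:
    print(f"{name:<26}" + "".join(f"{v:>11.4f}" for v in reported[name]))

# Which model is best on each individual metric?
best_per_metric = {
    metric: max(reported, key=lambda name: reported[name][i])
    for i, metric in enumerate(metric_names)
}
print(best_per_metric)
```

Logistic Regression leads on every metric. Random Forest has slightly higher recall than Naive Bayes (0.8804 vs. 0.8778) but lower precision, which is why its F1-score is the lowest of the three.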


# **Conclusion**
The IMDB Sentiment Analysis project demonstrated the end-to-end process of building a text classification system. Through preprocessing, feature engineering, and model evaluation, we classified movie reviews with test accuracies between 84% and 89%. Logistic Regression was the most effective model for this dataset (accuracy 0.8912, F1-score 0.8923), highlighting the suitability of linear models for high-dimensional, sparse text data.

The study also shows that while ensemble methods like Random Forest can be applied to text classification, they do not necessarily surpass simpler models such as Logistic Regression or Naive Bayes. None of the models was systematically tuned, so future improvements could include hyperparameter tuning (for example with `GridSearchCV`, which is imported but not yet used), experimenting with a larger vocabulary or n-grams beyond the unigrams and bigrams used here, or using word embeddings to capture semantic meaning. Overall, this project provides a solid foundation for sentiment analysis tasks and illustrates good practice in text preprocessing, feature extraction, and model evaluation.