# **IMDB Reviews Sentiment Analysis**

## Introduction

Sentiment Analysis on IMDb Movie Reviews: A Natural Language Processing Approach

In the digital era, user-generated content has become a cornerstone for business intelligence. For the entertainment industry, movie reviews are a goldmine of information that reflects audience reception and market trends. However, manually analyzing thousands of reviews is inefficient and prone to human bias.

This project leverages the IMDb Dataset of 50,000 Movie Reviews, a benchmark dataset for binary sentiment classification. By applying Natural Language Processing (NLP) techniques and Machine Learning algorithms, we aim to automate the process of categorizing reviews as either positive or negative based on their textual content.

## Objective

The primary goal of this notebook is to build a robust sentiment classifier. To achieve this, we will execute the following technical milestones:

- Data Preprocessing: Clean and normalize raw text by removing HTML tags, punctuation, and stopwords, followed by Lemmatization to reduce vocabulary complexity.

- Feature Engineering: Transform textual data into numerical representations using techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or Word Embeddings.

- Model Development: Compare multiple classification algorithms (e.g., Logistic Regression, Naive Bayes, or Deep Learning architectures) to identify the best-performing model.

- Evaluation: Assess model performance using a comprehensive suite of metrics including Accuracy, F1-Score, and Confusion Matrices to ensure reliable sentiment detection.

## IMDB Dataset of 50K Movie Reviews

Source: [Kaggle - IMDb Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data)

Brief: A large-scale dataset of movie reviews used for binary sentiment classification. It is a standard benchmark in Natural Language Processing (NLP) for training models to distinguish between positive and negative opinions.

Dataset Structure: 50,000 rows, 2 columns.
Columns

- review (string) The raw text of the movie review. This is the primary feature for the model. It contains natural language, including punctuation, potential HTML tags (<br />), and varying lengths of text.

- sentiment (target: categorical "positive"/"negative") The label indicating the overall tone of the review. The dataset is perfectly balanced with 25,000 positive and 25,000 negative reviews, making it ideal for training without bias towards one class.

Notes and Typical Preprocessing Steps

- HTML Cleaning: The text frequently contains <br /> tags that must be removed using libraries like BeautifulSoup or Regular Expressions.

- Text Normalization: Standardize the text by converting it to lowercase and removing special characters/numbers to reduce the "noise" in the data.

- Stopword Removal: Words like "the", "a", "is" are often removed as they do not carry significant emotional weight for sentiment detection.

- Tokenization & Vectorization: Text must be converted into numerical format. Common techniques include:

    - Bag-of-Words / TF-IDF: For traditional machine learning (e.g., Logistic Regression).

    - Word Embeddings (Word2Vec, GloVe): For deep learning models (RNNs/LSTMs).

- Sequence Padding: Since reviews have different lengths, they need to be padded or truncated to a uniform length if using neural networks.

- Evaluation Metrics: Since the classes are balanced, Accuracy is a reliable metric, though F1-Score and Confusion Matrix are recommended for a deeper analysis of false positives vs. false negatives.

In [2]:
# Import necessary libraries for data analysis and machine learning

# Libraries to read and manipulate data
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for machine learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### **Show basic information about dataset**

In [3]:
# Import dataset
df = pd.read_csv('../data/IMDB-Dataset.csv')

In [4]:
# Show DataFrame summary (dtypes, non-null counts)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
# Display column data types
df.dtypes

review       object
sentiment    object
dtype: object

In [6]:
# Show dataset shape (rows, columns)
df.shape

(50000, 2)

In [7]:
# Display basic statistical summary of numerical columns
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [8]:
# Preview the first few rows of the dataset
df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## **Exploratory Data Analysis (EDA)**

In [9]:
# Count missing values in each column
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [10]:
from bs4 import BeautifulSoup

# Function to clean HTML tags from reviews
def clean_html(review):
    soup = BeautifulSoup(review, "html.parser")
    return soup.get_text()

df['review_cleaned'] = df['review'].apply(clean_html)

In [11]:
df['review_cleaned'].head(5)

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review_cleaned, dtype: object

In [12]:
df.head(5)

Unnamed: 0,review,sentiment,review_cleaned
0,One of the other reviewers has mentioned that ...,positive,One of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. The filming tec...
2,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love in the Time of Money"" is..."


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['review_cleaned'], df['sentiment'], test_size=0.2, random_state=42
)

In [15]:
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [16]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_pred = model.predict(X_test_vectorized)
print(f'Presicion: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Presicion: 0.8951
              precision    recall  f1-score   support

    negative       0.90      0.88      0.89      4961
    positive       0.89      0.91      0.90      5039

    accuracy                           0.90     10000
   macro avg       0.90      0.89      0.90     10000
weighted avg       0.90      0.90      0.90     10000



In [17]:
def predict_sentiment(review):
    review_cleaned = clean_html(review)
    review_vectorized = vectorizer.transform([review_cleaned])
    prediction = model.predict(review_vectorized)
    return 'positive' if prediction[0] == 1 else 'negative'

sample_review = "<br />I absolutely loved this movie! The plot was thrilling and the characters were well-developed. A must-watch!<br />"
predicted_sentiment = predict_sentiment(sample_review)
print(f'The predicted sentiment for the sample review is: {predicted_sentiment}')

The predicted sentiment for the sample review is: negative
