<span style="font-size:24px; font-weight:bold;">SENTIMENT ANALYSIS ON REVIEWS GIVEN BY VIEWERS ON IMDB</span>

The Dataset and Problem Statement:
In this notebook, we work with a dataset of 50,000 movie reviews from IMDB. The dataset has two columns: "review", which holds the text of the review, and "sentiment", which labels each review as positive or negative.

Problem Objective:
Our aim is to find a suitable machine learning model for predicting the sentiment (positive or negative) of a movie review from its text. This notebook evaluates one candidate: logistic regression on TF-IDF features.

<img src="sentiment.png" alt="Sentiment Image" width="500">

<span style="font-size:24px; font-weight:bold;">Importing the necessary libraries</span>

In [6]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [8]:
import nltk

# Download NLTK stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<span style="font-size:24px; font-weight:bold;">Loading the dataset</span>

In [11]:
df = pd.read_csv("IMDB Dataset.csv")

In [13]:
df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


<span style="font-size:24px; font-weight:bold;">Exploratory Data Analysis</span>

In [16]:
# Summary of the dataset
df.describe()

                                                   review  sentiment
count                                               50000      50000
unique                                              49582          2
top     Loved today's show!!! It was a variety and not...   positive
freq                                                    5      25000


In [18]:
# Count of reviews per sentiment label
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [20]:
# Check for null values in the dataset
print(df.isnull().sum())

review       0
sentiment    0
dtype: int64


In [22]:
# Convert labels to binary (0 = negative, 1 = positive)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

In [24]:
# Extract reviews and sentiments
X = df['review']  # Features (reviews)
y = df['sentiment']  # Labels (sentiments)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<span style="font-size:24px; font-weight:bold;">Text Preprocessing</span>

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

# English stop words from NLTK (nltk and stopwords were imported above);
# scikit-learn expects the custom stop-word collection as a list
stop_words = list(stopwords.words('english'))

# TF-IDF vectorizer that ignores the stop words
vectorizer = TfidfVectorizer(stop_words=stop_words)
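
The sample rows printed by df.head() contain HTML line breaks such as `<br />`, and the notebook does not strip them before vectorizing, so the tag name would most likely end up as a token (`br`). The `re` module imported at the top could be used for this. The following is a minimal sketch under that assumption; `clean_review` is a hypothetical helper that is not part of the original pipeline, and it only targets `<br>` variants, not arbitrary HTML.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case (assumed to be the only markup present)
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def clean_review(text: str) -> str:
    """Replace HTML line breaks with spaces and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = "A wonderful little production. <br /><br />The filming"
print(clean_review(sample))
```

If adopted, it could be applied to the review column before the split, for example with `df['review'] = df['review'].apply(clean_review)`. Whether this changes the accuracy reported in this notebook has not been measured here.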


In [29]:
# Fit the vectorizer on the training reviews only, then transform both sets
# with the learned vocabulary so that no test-set information reaches the features
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
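
To make the vectorization step less of a black box, here is a toy pure-Python sketch of the weighting that scikit-learn documents for TfidfVectorizer with its default settings: raw term counts, smoothed IDF, then L2 normalization of each document vector. The function `tfidf_vectors`, the whitespace tokenizer, and the three example documents are illustrative assumptions. The real vectorizer uses its own tokenizer, removes the stop words, and returns sparse matrices.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: term count * smoothed IDF, followed by L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]  # simplistic tokenizer (assumption)
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors

docs = ["great movie great cast", "boring movie", "great fun"]
idf, vectors = tfidf_vectors(docs)

# Rarer terms receive a larger IDF than terms shared by several documents
print(idf["boring"] > idf["movie"])
print(vectors[1])
```

In the second toy document, "boring" occurs in fewer documents than "movie", so it receives the larger weight. This is the property that lets distinctive words influence the classifier more than common ones.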


<span style="font-size:24px; font-weight:bold;">Model Training</span>

In [32]:
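# Train logistic regression on the TF-IDF features (default hyperparameters)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

# Predict sentiment labels for the held-out test reviews
y_pred = model.predict(X_test_vectorized)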

<span style="font-size:24px; font-weight:bold;">Results</span>

In [35]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(cm)


Confusion Matrix:
[[4365  596]
 [ 437 4602]]


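The confusion matrix summarizes the model's predictions on the 10,000 test reviews. Following scikit-learn's convention, rows are the actual classes and columns are the predicted classes, with class 0 (negative) listed first: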

4365 (True Negatives - TN): The model correctly predicted negative class (0).

596 (False Positives - FP): The model incorrectly predicted positive class (1) when it was actually negative (0).

437 (False Negatives - FN): The model incorrectly predicted negative class (0) when it was actually positive (1).

4602 (True Positives - TP): The model correctly predicted positive class (1).
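
The four counts are enough to recompute accuracy, precision, and recall by hand. The sketch below is plain Python arithmetic on the numbers printed in the confusion matrix; the variable names (`tn`, `fp`, `fn`, `tp`) are chosen here for illustration.

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted)
tn, fp, fn, tp = 4365, 596, 437, 4602
total = tn + fp + fn + tp

# Accuracy: share of all test reviews classified correctly
accuracy = (tp + tn) / total

# Class 1 (positive): precision over predicted positives, recall over actual positives
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Class 0 (negative): the same ratios with the roles of the classes swapped
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy:  {accuracy:.4f}")
print(f"class 0:   precision={precision_neg:.2f} recall={recall_neg:.2f}")
print(f"class 1:   precision={precision_pos:.2f} recall={recall_pos:.2f}")
```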

In [38]:
# Predict the sentiment on the test set
y_pred = model.predict(X_test_vectorized)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

Accuracy: 0.8967
              precision    recall  f1-score   support

           0       0.91      0.88      0.89      4961
           1       0.89      0.91      0.90      5039

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



The classification report summarizes the model's performance on the 10,000 test reviews (4,961 negative, 5,039 positive).

Accuracy (0.8967, about 90%): the share of all test reviews that the model classifies correctly.

Precision (0.91 for class 0, 0.89 for class 1): of the reviews predicted to belong to a class, the proportion that actually belong to it. Predictions of class 0 are slightly more reliable than predictions of class 1.

Recall (0.88 for class 0, 0.91 for class 1): of the reviews that actually belong to a class, the proportion the model finds. Positive reviews are captured slightly better than negative ones.

F1-score (0.89 for class 0, 0.90 for class 1): the harmonic mean of precision and recall. The two classes perform similarly.

Macro average (0.90 for precision, recall, and F1-score) is the unweighted mean over the two classes, while the weighted average (0.90) weights each class by its support. Because the test set is nearly balanced, the two averages coincide.
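
The difference between the two averages can be checked with a few lines of arithmetic. This sketch recomputes the per-class F1-scores from the confusion matrix counts and then averages them both ways; the helper `f1` and the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted)
tn, fp, fn, tp = 4365, 596, 437, 4602

def f1(hits, false_alarms, misses):
    """F1 for one class: harmonic mean of precision and recall, written in counts."""
    return 2 * hits / (2 * hits + false_alarms + misses)

f1_neg = f1(tn, fn, fp)  # class 0: false alarms are positives predicted as negative
f1_pos = f1(tp, fp, fn)  # class 1: false alarms are negatives predicted as positive

support_neg = tn + fp  # number of actual negative reviews in the test set
support_pos = tp + fn  # number of actual positive reviews in the test set
total = support_neg + support_pos

# Macro: plain mean over classes; weighted: mean weighted by class support
macro_f1 = (f1_neg + f1_pos) / 2
weighted_f1 = (support_neg * f1_neg + support_pos * f1_pos) / total

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

With supports of 4,961 and 5,039 the weights are almost equal, which is why both averages round to the same value. On a strongly imbalanced test set they could differ noticeably.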

<span style="font-size:24px; font-weight:bold;">Conclusion</span>

The logistic regression model, trained on TF-IDF features, reached an accuracy of 0.8967 (about 90%) on the 10,000 held-out test reviews.
Precision and recall were well balanced, ranging from 0.88 to 0.91 across the two classes, which gives F1-scores of 0.89 for negative reviews and 0.90 for positive reviews. The model therefore handles both sentiment classes about equally well.