# IMDB Reviews Text Sentiment Analysis

This project performs **sentiment analysis** on the IMDB movie review dataset, classifying each review as either positive or negative. Sentiment analysis is a natural language processing (NLP) technique for determining the sentiment that a piece of text expresses.


In [1]:
# importing important libraries
import pandas as pd
import numpy as np
import nltk
import sklearn
import warnings
warnings.filterwarnings('ignore')

In [2]:
# loading the dataset using pandas
ds = pd.read_csv('IMDB Dataset.csv')

In [3]:
ds.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [4]:
ds.shape

(50000, 2)

In [5]:
# Count how many reviews are labelled positive and negative
ds['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## Preprocessing the Data
Each review is cleaned by:
- Converting to lowercase
- Stripping HTML tags such as `<br />`
- Tokenizing, then dropping stopwords and non-alphabetic tokens (punctuation, numbers)
- Lemmatizing the remaining words

This leaves a smaller, more consistent vocabulary for the vectorizer and the model.


In [6]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PMLS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\PMLS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PMLS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# Get English stopwords
stop_words = set(stopwords.words('english'))

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to preprocess text
def preprocess_nltk(text):
    # Lowercase
    text = text.lower()
    
    # Replace HTML tags with a space so the words on either side of a tag stay separate
    text = re.sub(r"<.*?>", " ", text)

    # Tokenize text
    words = word_tokenize(text)

    # Keep alphabetic, non-stopword tokens and lemmatize them (as nouns, the default POS)
    cleaned_words = [
        lemmatizer.lemmatize(word)
        for word in words
        if word.isalpha() and word not in stop_words
    ]

    return " ".join(cleaned_words)

ds['review'] = ds['review'].apply(preprocess_nltk)
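
To see what these cleaning steps do to a single review, here is a minimal standard-library sketch. It is not the notebook's function: `preprocess_sketch` and the tiny `STOP_WORDS` set are hypothetical stand-ins, a regular expression replaces `word_tokenize`, HTML tags are replaced with a space, and lemmatization is left out entirely, so its output will differ from `preprocess_nltk` on real data.

```python
import re

# Hypothetical, deliberately tiny stop word list (NLTK's English list is much longer)
STOP_WORDS = {"a", "the", "was", "is", "and", "it", "of", "this"}

def preprocess_sketch(text):
    """Lowercase, strip HTML tags, keep alphabetic tokens, drop stop words."""
    text = text.lower()
    # Replace tags with a space so neighbouring words are not glued together
    text = re.sub(r"<.*?>", " ", text)
    # Simplified tokenizer: runs of ASCII letters only (an assumption of this sketch)
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "A wonderful little production.<br /><br />The filming technique is GREAT!"
cleaned = preprocess_sketch(sample)
print(cleaned)  # wonderful little production filming technique great
```

Punctuation, the `<br />` tags, and the stop words disappear, while the order of the remaining words is preserved.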

In [8]:
ds.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | one reviewer mentioned watching oz episode hoo... | positive |
| 1 | wonderful little production filming technique ... | positive |
| 2 | thought wonderful way spend time hot summer we... | positive |
| 3 | basically family little boy jake think zombie ... | negative |
| 4 | petter mattei love time money visually stunnin... | positive |


## Splitting and Vectorizing the Dataset using TF-IDF Method
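- The dataset is divided into training and testing sets using `train_test_split()` so the model can be evaluated on reviews it did not see during training. In this notebook the split itself happens in the training cell further down.
- Text data is converted into numerical features using `TfidfVectorizer` so it can be processed by machine learning algorithms. Here the vectorizer is fitted on all 50,000 reviews before the split, so the test reviews also contribute to the vocabulary and IDF weights; fitting on the training split only would give a stricter estimate.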


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,  # Keep the 5000 most frequent terms (single words and word pairs)
    ngram_range=(1, 2)  # Use single words and word pairs as features
)

X_tfidf = tfidf.fit_transform(ds['review'])
print("Shape:", X_tfidf.shape)  # (50000 reviews, 5000 features)

Shape: (50000, 5000)
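
To make the TF-IDF weighting concrete, the following sketch recomputes it by hand on a toy corpus. It assumes the scheme that `TfidfVectorizer` uses with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the three documents and the helper names `ngrams` and `tfidf_vector` are made up for illustration, and whitespace splitting stands in for scikit-learn's tokenizer.

```python
import math
from collections import Counter

# Toy corpus, already cleaned (hypothetical examples)
docs = ["great movie great cast", "boring movie", "great fun"]

def ngrams(tokens, n_min=1, n_max=2):
    """Return all unigrams and bigrams, mirroring ngram_range=(1, 2)."""
    terms = []
    for n in range(n_min, n_max + 1):
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms

tokenized = [ngrams(doc.split()) for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for terms in tokenized for term in set(terms))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(terms):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(terms)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(terms) for terms in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in many documents, such as `movie` here, receive a lower IDF than rare ones such as `boring`, which is why frequent but uninformative words carry less weight in the feature matrix.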


## Training the Model
- A logistic regression model is trained on the training split (80% of the reviews) to learn how to classify sentiments.
- The trained model then predicts sentiments for the held-out test split (the remaining 20%), and its **accuracy score**, the fraction of test reviews classified correctly, is reported.


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

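# Encode labels (positive=1, negative=0) and hold out 20% of the reviews for testing
# No random_state is set, so the split and the exact scores vary between runs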
y = ds['sentiment'].map({'positive': 1, 'negative': 0})
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2)

# Train and test
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))

Accuracy: 0.8853
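
The notebook fits TF-IDF on every review and only then splits the data. One possible alternative, sketched below on a handful of made-up reviews, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vocabulary and IDF weights are learned from the training split only. The toy `texts`, the `random_state=42` seed, and the `stratify` argument are assumptions added for this sketch and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews and labels (1 = positive, 0 = negative)
texts = [
    "wonderful film great acting",
    "loved every minute great story",
    "brilliant cast wonderful direction",
    "great fun loved ending",
    "boring plot terrible acting",
    "awful film waste time",
    "terrible script boring cast",
    "waste money awful ending",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; a fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
predictions = pipeline.predict(X_test)
print("Test accuracy:", test_accuracy)
```

With a pipeline, `pipeline.predict` can also be called directly on new cleaned review strings, because the fitted vectorizer is applied automatically before the classifier.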


## Evaluating the Model

The model's performance is evaluated using the following metrics:

- **Precision**: Measures how many of the predicted positive sentiments are actually positive.  
  *High precision = low false positives.*

- **Recall**: Measures how many of the actual positive sentiments were correctly predicted.  
  *High recall = low false negatives.*

- **F1-Score**: The harmonic mean of precision and recall.  
  *Useful when you need a balance between precision and recall, especially with imbalanced datasets.*

These metrics provide a deeper insight into the model's performance beyond simple accuracy.
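
The definitions above can be sketched in a few lines of plain Python. The labels below are invented for illustration and are not taken from the IMDB test set; `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 5 positive and 5 negative reviews
y_true_demo = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

positive_scores = binary_metrics(y_true_demo, y_pred_demo, positive=1)
negative_scores = binary_metrics(y_true_demo, y_pred_demo, positive=0)
print(positive_scores)
print(negative_scores)
```

For the positive class this gives 3 true positives, 1 false positive, and 2 false negatives, so precision is 3/4 and recall is 3/5. Calling the helper once per class mirrors the per-class rows that `classification_report` prints.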


In [17]:
y_pred = model.predict(X_test)  # Predicted sentiments (0 or 1)

In [18]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.88      0.88      4962
           1       0.88      0.89      0.89      5038

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



## Summary and Insights

In this project, we developed a **sentiment analysis model** to classify IMDB movie reviews as either positive (1) or negative (0). The workflow included cleaning and preprocessing the text, transforming it into numerical features using TF-IDF, training a logistic regression model, and evaluating its performance on a held-out test set.

### 📌 Final Results:
- **Accuracy Score:** The model achieved an accuracy of approximately **88.5%** on the test set.
- **Evaluation Metrics:**
  - **Negative (0):**
    - Precision: **0.89**
    - Recall: **0.88**
    - F1-Score: **0.88**
    - Support: **4962 samples**
  - **Positive (1):**
    - Precision: **0.88**
    - Recall: **0.89**
    - F1-Score: **0.89**
    - Support: **5038 samples**

### 🔍 Key Insights:
- The model demonstrated **balanced performance** across both positive and negative classes.
- The **F1-scores for both classes (~0.88–0.89)** indicate strong, consistent performance in identifying sentiments.
- **Precision and recall of 0.88–0.89** for both classes show that false positives and false negatives occur at similar rates; overall, about 11% of test reviews are misclassified.
- **Text preprocessing** (e.g., lowercasing, lemmatization, stopword removal) and **TF-IDF vectorization** supply the features the model relies on. Their individual contribution was not measured here, since no run without them was compared.

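These results suggest the approach is a reasonable baseline for binary sentiment classification of movie reviews. Because they come from a single random 80/20 split of one dataset, performance on other domains or text types would need to be evaluated separately.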
