# Sentiment Analysis on IMDB Movie Reviews

This notebook walks through binary sentiment classification on the IMDB movie reviews dataset, which contains 50,000 labeled reviews, evenly split between positive and negative. The original benchmark ships as 25,000 training and 25,000 test reviews; here the combined CSV is loaded and 20% of it is held out for testing. The goal is to classify each review as either positive or negative.

We explore traditional NLP techniques and deep learning models by:

* Preprocessing text data using NLTK,

* Generating embeddings using TF-IDF and Word2Vec,

* Training and evaluating models including Logistic Regression, LSTM, and BERT.

Through this project, we aim to understand how different modeling techniques perform on real-world sentiment data.

## Load the Dataset

In [1]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("IMDB Dataset.csv")

In [2]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [3]:
# View first and last few records
print("Dataset Overview:")
print(df.head())
print("-"*80)
print(df.tail())

Dataset Overview:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
--------------------------------------------------------------------------------
                                                  review sentiment
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative


In [4]:
# View structure and stats
print(df.info())
print("-"*80)
print(df.describe())
print("-"*80)
df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
None
--------------------------------------------------------------------------------
                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000
--------------------------------------------------------------------------------


Index(['review', 'sentiment'], dtype='object')

In [5]:
# Check class balance
print("Sentiment Distribution:\n")
print(df['sentiment'].value_counts())
print("-"*80)
print(df.isnull().sum())
print("-"*80)
print(df.shape)

Sentiment Distribution:

sentiment
positive    25000
negative    25000
Name: count, dtype: int64
--------------------------------------------------------------------------------
review       0
sentiment    0
dtype: int64
--------------------------------------------------------------------------------
(50000, 2)


In [6]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ayumi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ayumi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ayumi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ayumi\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [8]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens]

    return " ".join(lemmatized)

In [9]:
# Apply to a small sample first to test
sample_df = df.head(5).copy()
sample_df['clean_review'] = sample_df['review'].apply(preprocess_text)
print(sample_df[['review', 'clean_review']])

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                        clean_review  
0  one reviewer mentioned watching oz episode you...  
1  wonderful little production filming technique ...  
2  thought wonderful way spend time hot summer we...  
3  basically there family little boy jake think t...  
4  petter matteis love time money visually stunni...  
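The cleaned sample above shows a side effect of the regex steps: apostrophes are deleted, so "Mattei's" becomes "matteis". A related effect is that HTML tags are replaced with an empty string, which can glue two words together when no space surrounds the tag. The sketch below reproduces only the lowercasing and regex steps of `preprocess_text` with the standard library (no tokenization, stopword removal, or lemmatization). The helper name `strip_markup`, its `tag_replacement` parameter, and the sample sentence are hypothetical and used for illustration only.

```python
import re

def strip_markup(text, tag_replacement=""):
    """Lowercase, drop HTML tags, and keep only letters and whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", tag_replacement, text)   # remove HTML tags
    text = re.sub(r"[^a-z\s]", "", text)             # remove non-alphabetic characters
    return " ".join(text.split())                    # collapse repeated whitespace

raw = "Great film.<br /><br />Didn't expect that!"

merged = strip_markup(raw)                        # tags replaced with "", as in preprocess_text
spaced = strip_markup(raw, tag_replacement=" ")   # hypothetical variant: tags replaced with a space

print(merged)
print(spaced)
```

Replacing tags with a space keeps "film" and "didnt" as separate tokens. Whether that matters for the final accuracy is not measured in this notebook.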


In [10]:
from tqdm import tqdm
tqdm.pandas()

df['clean_review'] = df['review'].progress_apply(preprocess_text)

100%|██████████| 50000/50000 [00:54<00:00, 912.40it/s] 


## TF-IDF Vectorization (classic and fast)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(df['clean_review'])

print(f"TF-IDF matrix shape: {X_tfidf.shape}")

TF-IDF matrix shape: (50000, 5000)
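To make the numbers inside the TF-IDF matrix less opaque, here is a minimal standard-library sketch of the weighting that scikit-learn documents for its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The three toy documents and the names `df_counts`, `idf`, and `tfidf_vector` are assumptions for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

docs = [
    "great movie great cast",
    "terrible movie",
    "great story",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df_counts = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df_counts.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    weights = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in fewer documents ("terrible") receive a larger weight than terms shared across documents ("movie"), which is why rare, sentiment-bearing words tend to dominate the features.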


## Word2Vec Embeddings (static word vectors)

In [12]:
from gensim.models import Word2Vec

# Prepare tokenized sentences (list of tokens per review)
tokenized_reviews = [review.split() for review in df['clean_review']]

# Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=2, workers=4)

print(w2v_model.wv['movie'])

[ 0.14054464  0.26114455  0.2861747  -0.01132216 -1.9468822  -0.01961933
 -0.12557174 -0.5686028  -0.57852674 -2.7938473   0.01985246 -1.2567481
  2.3373716  -0.18140516 -0.45398542 -1.0842849   3.025554    0.69989866
 -1.608329   -1.2478198   1.2674246  -1.2045821   1.9984994  -0.03028275
  1.314082   -1.9821984  -0.19048803 -0.26103324 -0.37805805  1.351212
  0.39358902 -0.7836576   1.6210145  -1.312569   -0.6867903   0.22486292
 -0.74402297 -0.42990008  0.31368807  2.4010155  -3.5769436   0.639443
  0.40140754  1.286355   -0.07115585  0.9736012  -0.24862228 -1.8691294
  1.4374658   0.22365648 -1.8970941  -0.24963419 -1.5368038   1.7550664
  2.9744265   0.6803544   0.4785899   0.03902685 -0.7566226   0.3237298
  0.22365047 -0.2927757   1.4900663  -0.49961895  0.53170764  0.6116781
  0.34949964  1.5971588  -1.4182926   0.40802798 -0.21277958 -0.35354707
 -0.13932127  0.11379347 -0.57939005  0.20816168 -0.1943101   1.4107716
 -0.11497167 -0.9540536   0.14611368 -0.31462672 -1.9710107  

### Create document vectors for each review by averaging word vectors

In [13]:
def get_w2v_vector(tokens, model, vector_size=100):
    vec = np.zeros(vector_size)
    count = 0
    for token in tokens:
        if token in model.wv:
            vec += model.wv[token]
            count += 1
    if count != 0:
        vec /= count
    return vec

# Apply to all reviews
X_w2v = np.array([get_w2v_vector(review.split(), w2v_model) for review in df['clean_review']])

print(f"Word2Vec embedding matrix shape: {X_w2v.shape}")

Word2Vec embedding matrix shape: (50000, 100)
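Averaging word vectors produces a fixed-size representation, but it discards word order. The sketch below illustrates this with hand-made 2-dimensional vectors; the values in `toy_vectors` and the helper `average_vector` are made up for illustration and do not come from the trained `w2v_model`.

```python
# Hand-made 2-d "embeddings" (illustrative values, not from the trained model)
toy_vectors = {
    "not": [0.0, -1.0],
    "good": [1.0, 1.0],
    "bad": [-1.0, 0.5],
}

def average_vector(tokens, vectors, size=2):
    """Average the vectors of known tokens; return zeros if none are known."""
    total = [0.0] * size
    count = 0
    for tok in tokens:
        if tok in vectors:
            total = [a + b for a, b in zip(total, vectors[tok])]
            count += 1
    return [v / count for v in total] if count else total

vec_a = average_vector("good not bad".split(), toy_vectors)
vec_b = average_vector("bad not good".split(), toy_vectors)

print(vec_a)
print(vec_b)
print("Same vector:", vec_a == vec_b)
```

Both phrases map to the same vector even though they read differently, which is one motivation for trying a sequence model such as the LSTM later in the notebook.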


## Logistic Regression Training & Evaluation (Using TF-IDF Features)

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df['label'] = df['sentiment'].map({'positive':1, 'negative':0})

# Train-test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_tfidf, df['label'], test_size=0.2, random_state=42)

# Initialize and train Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train1, y_train1)

# Predictions
y_pred1 = lr_model.predict(X_test1)

# Evaluation
print(f"Logistic Regression Accuracy: {accuracy_score(y_test1, y_pred1):.4f}")
print("-"*80)
print("classification report:\n", classification_report(y_test1, y_pred1, target_names=['negative', 'positive']))

Logistic Regression Accuracy: 0.8845
--------------------------------------------------------------------------------
classification report:
               precision    recall  f1-score   support

    negative       0.89      0.87      0.88      4961
    positive       0.88      0.90      0.89      5039

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000



## LSTM Model with PyTorch using Word2Vec Embeddings

In [15]:
# Import necessary libraries
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence

In [16]:
# Build vocabulary from Word2Vec model
vocab = {word: idx+1 for idx, word in enumerate(w2v_model.wv.index_to_key)}
vocab_size = len(vocab) + 1

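# Convert reviews to list of word indices
# Note: words missing from the Word2Vec vocabulary (min_count=2) map to 0,
# the same index that is reserved for padding.
def text_to_indices(text):
    return [vocab.get(word, 0) for word in text.split()]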

In [17]:
# Dataset class for sentiment data
class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = [torch.tensor(text_to_indices(t)) for t in texts]
        self.labels = torch.tensor(labels.values, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

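# Function to pad sequences within a batch
def collate_batch(batch):
    texts, labels = zip(*batch)
    # Right-pad every review in the batch to the length of the longest one
    texts_padded = pad_sequence(texts, batch_first=True, padding_value=0)
    # Labels are 0-d tensors already, so stack them instead of re-wrapping
    labels = torch.stack(labels)
    return texts_padded, labels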

In [18]:
# Split data into training and test sets
X_train_texts, X_test_texts, y_train_labels, y_test_labels = train_test_split(df['clean_review'], df['label'], test_size=0.2, random_state=42)

In [19]:
# Create datasets and data loaders
train_dataset = SentimentDataset(X_train_texts, y_train_labels)
test_dataset = SentimentDataset(X_test_texts, y_test_labels)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, collate_fn=collate_batch)

In [20]:
# Define the LSTM-based classification model
class LSTMClassifier(nn.Module):
    def __init__(self, w2v_model, embedding_dim, hidden_dim, output_dim):
        super(LSTMClassifier, self).__init__()

        # Prepare embedding matrix
        pretrained_vectors = w2v_model.wv.vectors
        vocab_size = pretrained_vectors.shape[0] + 1  # +1 for padding

        embedding_matrix = np.zeros((vocab_size, embedding_dim))
        embedding_matrix[1:] = pretrained_vectors  # shift by 1 to reserve idx 0 for padding

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))
        self.embedding.weight.requires_grad = True  # Set False to freeze embeddings

        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Fully connected output layer
        self.fc = nn.Linear(hidden_dim, output_dim)

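    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        # Use the output at the last time step. Batches are right-padded,
        # so for shorter reviews this step is a padding position.
        out = self.fc(lstm_out[:, -1, :])
        return out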

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [21]:
# Hyperparameters
embedding_dim = 100
hidden_dim = 128
output_dim = 1  # binary classification

# Instantiate the model and move it to the device
model = LSTMClassifier(w2v_model, embedding_dim, hidden_dim, output_dim)
model.to(device)

# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [22]:
# Training function
def train(model, loader):
    model.train()
    total_loss = 0
    for texts, labels in loader:
        texts, labels = texts.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(texts).squeeze(1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

In [23]:
# Evaluation function
def evaluate(model, loader):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for texts, labels in loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts).squeeze(1)
            preds = torch.round(torch.sigmoid(outputs))
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    return np.array(all_preds), np.array(all_labels)

In [24]:
# Training loop
# Note: "Validation Accuracy" is computed on the held-out test split (test_loader)
epochs = 15
for epoch in range(epochs):
    loss = train(model, train_loader)
    preds, labels = evaluate(model, test_loader)
    acc = (preds == labels).mean()
    print(f"Epoch {epoch+1}: Train Loss = {loss:.4f}, Validation Accuracy = {acc:.4f}")

Epoch 1: Train Loss = 0.6925, Validation Accuracy = 0.5073
Epoch 2: Train Loss = 0.6714, Validation Accuracy = 0.5067
Epoch 3: Train Loss = 0.6423, Validation Accuracy = 0.7152
Epoch 4: Train Loss = 0.6470, Validation Accuracy = 0.5015
Epoch 5: Train Loss = 0.6572, Validation Accuracy = 0.7608
Epoch 6: Train Loss = 0.5330, Validation Accuracy = 0.7935
Epoch 7: Train Loss = 0.5282, Validation Accuracy = 0.7960
Epoch 8: Train Loss = 0.5061, Validation Accuracy = 0.4995
Epoch 9: Train Loss = 0.6898, Validation Accuracy = 0.5071
Epoch 10: Train Loss = 0.6418, Validation Accuracy = 0.5052
Epoch 11: Train Loss = 0.5676, Validation Accuracy = 0.7982
Epoch 12: Train Loss = 0.5311, Validation Accuracy = 0.8343
Epoch 13: Train Loss = 0.3532, Validation Accuracy = 0.8650
Epoch 14: Train Loss = 0.2733, Validation Accuracy = 0.8756
Epoch 15: Train Loss = 0.2416, Validation Accuracy = 0.8772
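The accuracy above swings between roughly 0.50 and 0.80 for several epochs before it settles. One possible contributor (an assumption, not verified in this notebook) is that `forward` reads `lstm_out[:, -1, :]`, while `pad_sequence` pads on the right, so for every review shorter than the longest one in its batch the last time step is a padding token. The standard-library sketch below shows the difference between "last position" and "last real token" on a toy batch; `pad_batch` and the index values are hypothetical.

```python
PAD = 0  # same padding value as in collate_batch

def pad_batch(sequences, pad_value=PAD):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

batch = [[5, 8, 2], [7, 3], [9]]          # toy word-index sequences
padded = pad_batch(batch)
lengths = [len(s) for s in batch]

# What index -1 reads: padding for every sequence except the longest
last_step_tokens = [row[-1] for row in padded]

# What the last real token is: position (length - 1) in each row
last_valid_tokens = [row[n - 1] for row, n in zip(padded, lengths)]

print("padded:           ", padded)
print("token at index -1:", last_step_tokens)
print("last real token:  ", last_valid_tokens)
```

In PyTorch, the same per-row positions could be used to index the LSTM output, or the batch could be packed with `torch.nn.utils.rnn.pack_padded_sequence` so that the final hidden state corresponds to the last real token. Neither variant was run here, so their effect on the training curve is unknown.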


In [25]:
# Final evaluation
print("\nClassification Report:")
print(classification_report(labels, preds, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(labels, preds))


Classification Report:
              precision    recall  f1-score   support

         0.0     0.8902    0.8583    0.8740      4961
         1.0     0.8652    0.8958    0.8803      5039

    accuracy                         0.8772     10000
   macro avg     0.8777    0.8771    0.8771     10000
weighted avg     0.8776    0.8772    0.8771     10000

Confusion Matrix:
[[4258  703]
 [ 525 4514]]
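The figures in the classification report can be recomputed directly from the confusion matrix, which helps when reading the two outputs side by side. In scikit-learn's convention, rows are true labels and columns are predicted labels. The sketch below uses the four counts printed above; the variable names are chosen for illustration.

```python
# Confusion matrix from the LSTM evaluation above
# rows = true label (0, 1), columns = predicted label (0, 1)
tn, fp = 4258, 703
fn, tp = 525, 4514

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Metrics for the positive class (label 1)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Metrics for the negative class (label 0)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy        = {accuracy:.4f}")
print(f"negative: P = {precision_neg:.4f}, R = {recall_neg:.4f}, F1 = {f1_neg:.4f}")
print(f"positive: P = {precision_pos:.4f}, R = {recall_pos:.4f}, F1 = {f1_pos:.4f}")
```

The recomputed values agree with the report to four decimal places.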


## BERT Model

In [26]:
# Import necessary libraries (only the names used below)
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import get_scheduler
from torch.optim import AdamW

In [27]:
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode the input texts using BERT tokenizer
def encode_texts(texts, max_length=128):
    return tokenizer(
        texts.tolist(),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

In [28]:
# Custom PyTorch Dataset class to hold BERT inputs and labels
class BERTDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels.values, dtype=torch.long)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

In [29]:
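# Split the dataset into train and test sets
# Note: no random_state is passed, so this split is not reproducible and
# differs from the random_state=42 split used for the earlier models.
X_train, X_test, y_train, y_test = train_test_split(df['clean_review'], df['label'], test_size=0.2)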

In [30]:
# Tokenize the training and test data
train_encodings = encode_texts(X_train)
test_encodings = encode_texts(X_test)

# Create PyTorch dataset objects
train_dataset = BERTDataset(train_encodings, y_train)
test_dataset = BERTDataset(test_encodings, y_test)

# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [31]:
# Create DataLoaders for training and testing
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

# Define optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
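With 40,000 training reviews and a batch size of 16, each epoch has 2,500 steps, so `num_training_steps` is 7,500 for three epochs. The `"linear"` schedule with zero warmup decays the learning rate from its initial value to zero over those steps. The sketch below reimplements that rule in plain Python to show the values involved; `linear_lr` is a hypothetical helper written for illustration, not the library's function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=7500, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp down from base_lr to 0
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch = 40000 // 16          # 2500 batches per epoch
total = 3 * steps_per_epoch            # 7500 optimizer steps

for step in (0, steps_per_epoch, 2 * steps_per_epoch, total):
    print(f"step {step:>4}: lr = {linear_lr(step, total_steps=total):.2e}")
```

By the start of the third epoch the learning rate has dropped to a third of its initial value, which limits how much the final epoch can change the model.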

In [32]:
# Training loop
def train(model, loader):
    model.train()
    total_loss = 0
    for batch in tqdm(loader, desc="Training"):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    return total_loss / len(loader)

In [33]:
# Evaluation function to compute accuracy
def evaluate(model, loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in tqdm(loader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            correct += (predictions == batch["labels"]).sum().item()
            total += batch["labels"].size(0)
    return correct / total

In [34]:
# Train the model and evaluate after each epoch
for epoch in range(num_epochs):
    train_loss = train(model, train_loader)
    val_accuracy = evaluate(model, test_loader)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}, Val Accuracy = {val_accuracy:.4f}")

Training: 100%|██████████| 2500/2500 [19:43<00:00,  2.11it/s]
Evaluating: 100%|██████████| 625/625 [01:27<00:00,  7.18it/s]


Epoch 1: Loss = 0.3188, Val Accuracy = 0.8917


Training: 100%|██████████| 2500/2500 [19:02<00:00,  2.19it/s]
Evaluating: 100%|██████████| 625/625 [01:27<00:00,  7.15it/s]


Epoch 2: Loss = 0.1813, Val Accuracy = 0.8974


Training: 100%|██████████| 2500/2500 [19:01<00:00,  2.19it/s]
Evaluating: 100%|██████████| 625/625 [01:28<00:00,  7.10it/s]

Epoch 3: Loss = 0.0852, Val Accuracy = 0.8958





In [35]:
# Function to get predictions and true labels for evaluation metrics
def get_predictions(model, loader):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()
            labels = batch["labels"].cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels)
    return np.array(all_preds), np.array(all_labels)

In [36]:
# Generate classification report and confusion matrix
preds, labels = get_predictions(model, test_loader)
print(confusion_matrix(labels, preds))
print(classification_report(labels, preds, digits=4))

[[4483  514]
 [ 528 4475]]
              precision    recall  f1-score   support

           0     0.8946    0.8971    0.8959      4997
           1     0.8970    0.8945    0.8957      5003

    accuracy                         0.8958     10000
   macro avg     0.8958    0.8958    0.8958     10000
weighted avg     0.8958    0.8958    0.8958     10000



In [37]:
# Function to predict sentiment for a single input text
def predict_sentiment(text):
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512).to(device)
    with torch.no_grad():  # no gradients needed for prediction
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    return "Positive" if prediction == 1 else "Negative"

In [38]:
# Example predictions
print(predict_sentiment("This movie was an absolute masterpiece."))
print(predict_sentiment("This movie was an absolute disaster."))

Positive
Negative
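`predict_sentiment` returns only the arg-max label. If a confidence score is wanted, the two logits can be turned into probabilities with a softmax. The sketch below does this in plain Python and also checks a useful identity: for two classes, the softmax probability of class 1 equals the sigmoid of the logit difference, which connects the two-logit BERT head to the single-logit `BCEWithLogitsLoss` setup used for the LSTM. The logit values are made up for illustration and are not model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [-1.2, 2.3]                 # made-up [negative, positive] logits
probs = softmax(logits)
label = "Positive" if probs[1] > probs[0] else "Negative"

print(label, f"(p_positive = {probs[1]:.3f})")
print("softmax vs sigmoid:", probs[1], sigmoid(logits[1] - logits[0]))
```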


## Conclusion

In this project, we explored sentiment analysis on the IMDB 50K movie reviews dataset using a combination of classical machine learning and deep learning approaches. Starting with text preprocessing in NLTK (HTML removal, lowercasing, stopword removal, and lemmatization), we built TF-IDF features for Logistic Regression and trained Word2Vec embeddings to initialize the LSTM.

Our experiments revealed that:

* Logistic Regression with TF-IDF yielded a strong baseline with 88.45% accuracy, showcasing the enduring power of linear models when paired with good feature engineering.

* The LSTM, a sequence-aware architecture initialized with the Word2Vec embeddings, reached 87.72% accuracy after 15 epochs, slightly below the TF-IDF baseline. Its validation accuracy fluctuated between roughly 50% and 80% for the first ten epochs before settling.

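* BERT, a transformer-based model with contextualized word embeddings, achieved the highest accuracy at 89.58% after three epochs of fine-tuning, about one percentage point above the TF-IDF baseline. It was evaluated on a different random test split than the other two models, so the comparison is approximate.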

All three models land within about two percentage points of each other. The TF-IDF baseline is fast and competitive, while BERT gives the best accuracy in this run at a much higher training cost (roughly 19 to 20 minutes per training epoch here).
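When comparing the three accuracies reported above, it helps to keep the sampling noise of a 10,000-review test set in mind. The sketch below computes an approximate 95% interval for each accuracy using the normal approximation to the binomial. This is a rough guide under simplifying assumptions: it treats test reviews as independent and ignores that BERT was evaluated on a different random split. The helper `accuracy_interval` is hypothetical.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1 - acc) / n)   # binomial standard error
    return acc - z * se, acc + z * se

# Accuracies reported in this notebook, each on 10,000 test reviews
results = {
    "Logistic Regression (TF-IDF)": 0.8845,
    "LSTM (Word2Vec)": 0.8772,
    "BERT": 0.8958,
}

for name, acc in results.items():
    low, high = accuracy_interval(acc, n=10_000)
    print(f"{name}: {acc:.4f} (approx. 95% interval {low:.4f} to {high:.4f})")
```

Each interval is a little over one percentage point wide, so gaps of about one point between models should be read with some caution.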

 -By Akshara S, Data Science Intern at inGrade