### Sentiment Analysis on IMDB Movie Reviews

In [None]:
!nvidia-smi

Sun Jan 18 12:27:51 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   52C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import random
import numpy as np

# setting seed for reproducibility
random.seed(42)
np.random.seed(42)

#Task1:Load Dataset

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

# loading IMDB dataset from Hugging Face
imdb = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

plain_text/test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

plain_text/unsupervised-00000-of-00001.p(…):   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
#Train / Validation / Test Split
split_data = imdb["train"].train_test_split(test_size=0.2, seed=42)

train_data = split_data["train"]
val_data   = split_data["test"]
test_data  = imdb["test"]

In [None]:
#Text Preprocessing
import re
import string

def clean_text(text):
    # lowercase
    text = text.lower()
    # remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

In [None]:
# applying preprocessing
train_data = train_data.map(lambda x: {"text": clean_text(x["text"])})
val_data   = val_data.map(lambda x: {"text": clean_text(x["text"])})
test_data  = test_data.map(lambda x: {"text": clean_text(x["text"])})

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

#Task 2(a): TF–IDF Model

In [None]:
#Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=20000)

X_train_tfidf = tfidf.fit_transform(train_data["text"])
X_test_tfidf  = tfidf.transform(test_data["text"])

y_train = train_data["label"]
y_test  = test_data["label"]

In [None]:
#Classification
from sklearn.linear_model import LogisticRegression

tfidf_model = LogisticRegression(max_iter=1000)
tfidf_model.fit(X_train_tfidf, y_train)

#Task 2(b): Word2Vec Model

In [None]:
!pip install nltk gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.9/27.9 MB[0m [31m92.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
#Tokenization
import nltk

nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# tokenizing text
from nltk.tokenize import word_tokenize

train_tokens = [word_tokenize(text) for text in train_data["text"]]
test_tokens  = [word_tokenize(text) for text in test_data["text"]]

In [None]:
#Train Word2Vec
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=train_tokens,
    vector_size=100,
    window=5,
    min_count=2
)

In [None]:
#Convert Reviews to Average Vectors
def get_avg_vector(words, model):
    vectors = []
    for w in words:
        if w in model.wv:
            vectors.append(model.wv[w])
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

In [None]:
X_train_w2v = np.array([get_avg_vector(t, w2v_model) for t in train_tokens])
X_test_w2v  = np.array([get_avg_vector(t, w2v_model) for t in test_tokens])

In [None]:
#Classification
w2v_clf = LogisticRegression(max_iter=1000)
w2v_clf.fit(X_train_w2v, y_train)

#Task 2(c): BERT Embeddings

In [None]:
!pip install transformers torch



In [None]:
from transformers import BertTokenizer, BertModel
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [None]:
# loading pretrained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
# inference mode
bert_model.to(device)
bert_model.eval()

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [None]:
#Convert Dataset Columns to Lists
bert_train_texts = list(train_data["text"])
bert_test_texts  = list(test_data["text"])

In [None]:
#Extract CLS Embeddings
def get_bert_embeddings(texts, batch_size=16):
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]

        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=256,
            return_tensors="pt"
        )

        # move inputs to GPU
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = bert_model(**inputs)

        # CLS token embedding, moved back to CPU
        cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu()
        all_embeddings.append(cls_embeddings)

    return torch.cat(all_embeddings).numpy()

In [None]:
X_train_bert = get_bert_embeddings(bert_train_texts)
X_test_bert  = get_bert_embeddings(bert_test_texts)

In [None]:
#Classification Head
bert_clf = LogisticRegression(max_iter=1000)
bert_clf.fit(X_train_bert, y_train)

#Task 3: Evaluation

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [None]:
def evaluate_model(model, X):
    predictions = model.predict(X)
    acc = accuracy_score(y_test, predictions)
    p, r, f, _ = precision_recall_fscore_support(y_test, predictions, average="binary")
    return acc, p, r, f

#Results:

In [None]:
tfidf_acc, tfidf_p, tfidf_r, tfidf_f = evaluate_model(tfidf_model, X_test_tfidf)
print("TF-IDF Results")
print("Accuracy :", tfidf_acc)
print("Precision:", tfidf_p)
print("Recall   :", tfidf_r)
print("F1-score :", tfidf_f)
print("-"*40)

TF-IDF Results
Accuracy : 0.87996
Precision: 0.878597050617776
Recall   : 0.88176
F1-score : 0.8801756837692154
----------------------------------------


In [None]:
w2v_acc, w2v_p, w2v_r, w2v_f = evaluate_model(w2v_clf, X_test_w2v)
print("Word2Vec Results")
print("Accuracy :", w2v_acc)
print("Precision:", w2v_p)
print("Recall   :", w2v_r)
print("F1-score :", w2v_f)
print("-"*40)

Word2Vec Results
Accuracy : 0.82504
Precision: 0.8246244806647491
Recall   : 0.82568
F1-score : 0.8251519027822194
----------------------------------------


In [None]:
bert_acc, bert_p, bert_r, bert_f = evaluate_model(bert_clf, X_test_bert)
print("BERT Results")
print("Accuracy :", bert_acc)
print("Precision:", bert_p)
print("Recall   :", bert_r)
print("F1-score :", bert_f)
print("-"*40)

BERT Results
Accuracy : 0.84488
Precision: 0.8502599935001625
Recall   : 0.8372
F1-score : 0.8436794582392777
----------------------------------------


#Task 4: Comparison Table

In [None]:
import pandas as pd

results_table = pd.DataFrame({
    "Accuracy": [tfidf_acc, w2v_acc, bert_acc],
    "Precision": [tfidf_p, w2v_p, bert_p],
    "Recall": [tfidf_r, w2v_r, bert_r],
    "F1-score": [tfidf_f, w2v_f, bert_f]
}, index=["TF-IDF", "Word2Vec", "BERT"])

print("Final Comparison Table")
results_table

Final Comparison Table


Unnamed: 0,Accuracy,Precision,Recall,F1-score
TF-IDF,0.87996,0.878597,0.88176,0.880176
Word2Vec,0.82504,0.824624,0.82568,0.825152
BERT,0.84488,0.85026,0.8372,0.843679


#Again For finetue the BERT

In [None]:
#Loading the fine-tunable BERT again

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
#Prepare data

In [None]:
train_texts = list(train_data["text"])
train_labels = train_data["label"]

test_texts = list(test_data["text"])
test_labels = test_data["label"]

In [None]:
#Tokenization

In [None]:
def tokenize_data(texts):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="pt"
    )

In [None]:
train_enc = tokenize_data(train_texts)
test_enc  = tokenize_data(test_texts)

In [None]:
#Move to GPU for less computation

In [None]:
train_enc = {k: v.to(device) for k, v in train_enc.items()}
test_enc  = {k: v.to(device) for k, v in test_enc.items()}

train_labels = torch.tensor(train_labels).to(device)
test_labels  = torch.tensor(test_labels).to(device)

In [None]:
#Trainig setup for finetune bert

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
#Fine-tuning loop

In [None]:
epochs = 2
batch_size = 16

for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")

    for i in range(0, len(train_labels), batch_size):
        optimizer.zero_grad()

        input_ids = train_enc["input_ids"][i:i+batch_size]
        attention_mask = train_enc["attention_mask"][i:i+batch_size]
        labels = train_labels[i:i+batch_size]

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss
        loss.backward()
        optimizer.step()


Epoch 1/2
Epoch 2/2


In [None]:
# BERT (Fine-tuned) Evaluation

finetuned_model = model.eval()

all_finetune_preds = []
batch_size = 16

with torch.no_grad():
    for i in range(0, len(y_test), batch_size):
        input_ids = test_enc["input_ids"][i:i+batch_size].to(device)
        attention_mask = test_enc["attention_mask"][i:i+batch_size].to(device)

        outputs = finetuned_model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        batch_preds = torch.argmax(outputs.logits, dim=1)
        all_finetune_preds.extend(batch_preds.cpu().numpy())

finetune_pred = np.array(all_finetune_preds)

bert_ft_acc = accuracy_score(y_test, finetune_pred)
bert_ft_p, bert_ft_r, bert_ft_f, _ = precision_recall_fscore_support(
    y_test, finetune_pred, average="binary"
)

print("BERT (Fine-tuned) Evaluation")
print("Accuracy :", bert_ft_acc)
print("Precision:", bert_ft_p)
print("Recall   :", bert_ft_r)
print("F1-score :", bert_ft_f)
print("-" * 40)


BERT (Fine-tuned) Evaluation
Accuracy : 0.91544
Precision: 0.9215781782756941
Recall   : 0.90816
F1-score : 0.9148198887903941
----------------------------------------


#Task4:FINAL Comparison Table

In [None]:
comparison_table = pd.DataFrame({
    "Accuracy": [
        tfidf_acc,
        w2v_acc,
        bert_acc,
        bert_ft_acc
    ],
    "Precision": [
        tfidf_p,
        w2v_p,
        bert_p,
        bert_ft_p
    ],
    "Recall": [
        tfidf_r,
        w2v_r,
        bert_r,
        bert_ft_r
    ],
    "F1-score": [
        tfidf_f,
        w2v_f,
        bert_f,
        bert_ft_f
    ]
}, index=[
    "TF-IDF",
    "Word2Vec",
    "BERT (Embeddings + LR)",
    "BERT (Fine-tuned)"
])

print("Final Comparison Table")
comparison_table


Final Comparison Table


Unnamed: 0,Accuracy,Precision,Recall,F1-score
TF-IDF,0.87996,0.878597,0.88176,0.880176
Word2Vec,0.82504,0.824624,0.82568,0.825152
BERT (Embeddings + LR),0.84488,0.85026,0.8372,0.843679
BERT (Fine-tuned),0.91544,0.921578,0.90816,0.91482


### Comparison & Analysis

- **Fine-tune BERT had the best overall performance**, achieving the highest accuracy and F1-score due to end-to-end fine-tuning on the IMDB sentiment classification task.
- BERT significantly improved its performance through fine-tuning by optimizing its contextual representations for movie review sentiment.
- **TF–IDF performed well despite its simplicity**, indicating that IMDB sentiment classification relies heavily on keyword-based features.
- **Word2Vec showed the lowest performance** because averaging word embeddings removes word order and weakens important sentiment cues such as negation.
- In terms of **training time**, TF–IDF was the fastest model, followed by Word2Vec, while BERT required the longest training time.
- Regarding **computational cost**, TF–IDF was the most efficient, Word2Vec had moderate cost, and BERT was the most resource-intensive due to transformer-based computations.
- BERT required **GPU acceleration** to achieve reasonable runtime, whereas TF–IDF and Word2Vec ran efficiently on CPU.
- A common error pattern across all models was difficulty handling **sarcasm, irony, and mixed sentiment** within the same review.
- All models performed well on short reviews, while **BERT handled longer reviews better** due to its contextual understanding.


###Final Conclusion:
In this project , several sentiment analysis models were assessed on IMDB Movie Reviews dataset to determine the effect of various text representation methods on classification accuracy. The approaches that are based on traditional and deep learning were compared, with TFIDF features, Word2Vec embeddings, and BERT-based models.

The findings show that the simplicity of the TF -IDF led to good performance, which implies that the use of the sentiment classification feature on IMDB reviews largely relies on the presence of keywords. Word2Vec was inferior largely due to averaging word embeddings eliminating order of words and any significant negative sentiment term negation. Application of BERT as a feature extractor demonstrated better performance when compared to Word2Vec as it uses contextual information, but remained inferior to TF–IDF during performance when the model parameters were kept constant.

The most successful overall performance was achieved with the fine-tuned BERT model which had the highest accuracy and F1-score. Fine-tuning helped BERT to fine-tune its contextual representations to the sentiment of movie review, which slightly improved its performance. This gain, though, was at the cost of more time training and more computations, so to keep the running time reasonable, GPU acceleration was required.

In all models, error patterns were identified as patterns of common errors especially when dealing with sarcasm, irony and mixed sentiment reviews. Models that were simpler worked well when the reviews were shorter whereas BERT-based models worked better when the review was longer because of the capacity to model contextual relationships.

All in all, the results indicate that there is a distinct trade-off between predictive performance and the computational efficiency, with fine-tuned BERT achieving the highest accuracy, and TF–IDF offering a good and efficient baseline.