# SentimentScope: IMDB Sentiment Analysis with Transformers

A transformer-based sentiment classifier trained from scratch on the IMDB movie review dataset. The project implements a small GPT-style architecture (causal self-attention followed by mean pooling) for binary classification and reaches 76.27% accuracy on the test set after 3 epochs.

## Project Overview

- **Task**: Binary sentiment classification (positive/negative)
- **Dataset**: IMDB Movie Reviews (50,000 reviews: 25,000 train, 25,000 test)
- **Architecture**: Custom transformer with 4 layers, 4 attention heads
- **Performance**: 76.27% test accuracy


## Data Loading and Preparation

In [None]:
import pandas as pd
import os
import urllib.request
import tarfile

if not os.path.exists('aclImdb'):
    print("Downloading IMDB dataset...")
    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    urllib.request.urlretrieve(url, 'aclImdb_v1.tar.gz')
    
    print("Extracting dataset...")
    with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
        tar.extractall()
    
    print("Dataset ready!")
else:
    print("Dataset already exists.")

In [4]:
train_pos_path = 'aclImdb/train/pos'
train_neg_path = 'aclImdb/train/neg'
test_pos_path = 'aclImdb/test/pos'
test_neg_path = 'aclImdb/test/neg'

In [5]:
def load_dataset(folder):
    reviews = []
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)
        if filename.endswith(".txt"):
            with open(file_path, 'r', encoding='utf-8') as f:
                reviews.append(f.read())
    return reviews

In [6]:
train_pos = load_dataset(train_pos_path)
train_neg = load_dataset(train_neg_path)
test_pos = load_dataset(test_pos_path)
test_neg = load_dataset(test_neg_path)

In [7]:
train_df = pd.DataFrame({
    'review': train_pos + train_neg,
    'label': [1] * len(train_pos) + [0] * len(train_neg)
})

test_df = pd.DataFrame({
    'review': test_pos + test_neg,
    'label': [1] * len(test_pos) + [0] * len(test_neg)
})

print(train_df.head())

                                              review  label
0  Silly, hilarious, tragic, sad, inevitable.<br ...      1
1  I actually like the original, and this film ha...      1
2  For my humanities quarter project for school, ...      1
3  To me this was more a wake up call, and realiz...      1
4  This movie is a lot of fun. What makes it grea...      1


In [13]:
train_size = int(0.9 * len(train_df))
shuffled_df = train_df.sample(frac=1, random_state=42).reset_index(drop=True)
train_data = shuffled_df.iloc[:train_size]
val_data = shuffled_df.iloc[train_size:]
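
Because the positive and negative reviews are concatenated before the shuffle, it can be worth confirming that the split sizes and class balance look as expected. The following is a minimal sketch on a toy frame; `split_frame` and `toy_df` are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd

def split_frame(df, train_frac=0.9, seed=42):
    """Shuffle df, then split it into train and validation parts (mirrors the cell above)."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    cut = int(train_frac * len(shuffled))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Hypothetical stand-in for train_df: 50 positive rows followed by 50 negative rows.
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "label": [1] * 50 + [0] * 50,
})

toy_train, toy_val = split_frame(toy_df)
print(len(toy_train), len(toy_val))
# Share of each label in the training part; values near 0.5 indicate a balanced split.
print(toy_train["label"].value_counts(normalize=True))
```

The same two `print` calls can be run on `train_data` and `val_data` directly.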

## Tokenization Setup

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

## Custom Dataset and DataLoader

In [17]:
import torch
from torch.utils.data import Dataset

MAX_LENGTH = 128
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [18]:
class IMDBDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=MAX_LENGTH):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]['review']
        label = self.data.iloc[idx]['label']

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )

        # The tokenizer returns tensors of shape (1, max_length); drop only the batch dimension.
        input_ids = encoding['input_ids'].squeeze(0)
        attention_mask = encoding['attention_mask'].squeeze(0)

        return input_ids, attention_mask, label

In [19]:
train_dataset = IMDBDataset(train_data, tokenizer)
val_dataset = IMDBDataset(val_data, tokenizer)
test_dataset = IMDBDataset(test_df, tokenizer)

In [20]:
from torch.utils.data import DataLoader

BATCH_SIZE = 32

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
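
The batch counts implied by these loaders can be worked out by hand, which helps when reading the `Step [.../704]` lines in the training log. This is a small arithmetic sketch; it assumes the standard 25,000-review IMDB training split and the 90/10 split used earlier.

```python
import math

BATCH_SIZE = 32
official_train_rows = 25_000                    # assumed: standard IMDB training split
train_rows = int(0.9 * official_train_rows)     # rows kept for training after the 90/10 split
val_rows = official_train_rows - train_rows     # rows held out for validation

# DataLoader keeps the final partial batch by default (drop_last=False),
# so the number of batches is a ceiling division.
train_steps = math.ceil(train_rows / BATCH_SIZE)
val_steps = math.ceil(val_rows / BATCH_SIZE)
print(train_rows, val_rows, train_steps, val_steps)
```

The resulting 704 training batches per epoch match the step count shown in the training output.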

## Model Architecture

In [22]:
config = {
    "vocabulary_size": tokenizer.vocab_size,
    "num_classes": 2,
    "d_embed": 128,
    "context_size": MAX_LENGTH,
    "layers_num": 4,
    "heads_num": 4,
    "head_size": 32,
    "dropout_rate": 0.1,
    "use_bias": True
}
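
To get a feel for the model size this configuration implies, the parameter count can be estimated directly from the layer shapes defined in the cells that follow. The sketch below is an illustration, not part of the notebook: `count_parameters` and `sketch_config` are hypothetical names, and the vocabulary size of 30,522 is assumed to be what `tokenizer.vocab_size` reports for `bert-base-uncased`.

```python
def count_parameters(cfg):
    """Estimate the trainable parameters of the model described by cfg."""
    d, h, n_heads = cfg["d_embed"], cfg["head_size"], cfg["heads_num"]
    bias = 1 if cfg["use_bias"] else 0

    # Token and positional embedding tables.
    embeddings = cfg["vocabulary_size"] * d + cfg["context_size"] * d
    # Q, K, V projections of one head (bias controlled by use_bias).
    qkv_per_head = 3 * (d * h + bias * h)
    # All heads plus the output projection (nn.Linear default: with bias).
    attention = n_heads * qkv_per_head + (n_heads * h * d + d)
    # Two linear layers with a 4x hidden expansion.
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    block = attention + feed_forward + layer_norms

    # Final LayerNorm plus the classification layer.
    head = 2 * d + (d * cfg["num_classes"] + bias * cfg["num_classes"])
    return embeddings + cfg["layers_num"] * block + head

sketch_config = {
    "vocabulary_size": 30522,  # assumed value of tokenizer.vocab_size
    "num_classes": 2,
    "d_embed": 128,
    "context_size": 128,
    "layers_num": 4,
    "heads_num": 4,
    "head_size": 32,
    "dropout_rate": 0.1,
    "use_bias": True,
}

total = count_parameters(sketch_config)
token_table = sketch_config["vocabulary_size"] * sketch_config["d_embed"]
print(f"total: {total:,}  token embedding share: {token_table / total:.1%}")
```

Under these assumptions the model has roughly 4.7 million parameters, of which more than 80% sit in the token embedding table. Note also that `heads_num * head_size` equals `d_embed`, so the concatenated head outputs have the same width as the residual stream.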

In [23]:
import torch.nn as nn
import math

class AttentionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.Q_weights = nn.Linear(config["d_embed"], config["head_size"], bias=config["use_bias"])
        self.K_weights = nn.Linear(config["d_embed"], config["head_size"], bias=config["use_bias"])
        self.V_weights = nn.Linear(config["d_embed"], config["head_size"], bias=config["use_bias"])
        self.dropout = nn.Dropout(config["dropout_rate"])

        # Lower-triangular mask: position i may attend only to positions <= i.
        causal_attention_mask = torch.tril(torch.ones(config["context_size"], config["context_size"]))
        self.register_buffer('causal_attention_mask', causal_attention_mask)

    def forward(self, input):
        batch_size, tokens_num, d_embed = input.shape
        Q = self.Q_weights(input)
        K = self.K_weights(input)
        V = self.V_weights(input)

        attention_scores = Q @ K.transpose(1, 2)
        attention_scores = attention_scores.masked_fill(
            self.causal_attention_mask[:tokens_num, :tokens_num] == 0,
            float('-inf')
        )
        attention_scores = attention_scores / math.sqrt(K.shape[-1])
        attention_scores = torch.softmax(attention_scores, dim=-1)
        attention_scores = self.dropout(attention_scores)

        return attention_scores @ V
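
The effect of the mask can be checked in isolation. The sketch below repeats the masking and softmax steps on small random tensors (the sizes are arbitrary and chosen only for readability): filling the upper triangle with `-inf` makes the corresponding softmax weights exactly zero, so each token attends only to itself and earlier tokens.

```python
import math
import torch

torch.manual_seed(0)
tokens_num, head_size = 4, 8          # arbitrary toy sizes
Q = torch.randn(1, tokens_num, head_size)
K = torch.randn(1, tokens_num, head_size)

# Lower-triangular matrix of ones: entry (i, j) is 1 when j <= i.
mask = torch.tril(torch.ones(tokens_num, tokens_num))

scores = (Q @ K.transpose(1, 2)) / math.sqrt(head_size)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)

print(weights[0])          # upper triangle is zero
print(weights.sum(dim=-1)) # every row sums to 1
```

The first row always places its full weight on the first token, since that is the only position it is allowed to see.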

In [25]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        heads_list = [AttentionHead(config) for _ in range(config["heads_num"])]
        self.heads = nn.ModuleList(heads_list)
        self.linear = nn.Linear(config["heads_num"] * config["head_size"], config["d_embed"])
        self.dropout = nn.Dropout(config["dropout_rate"])

    def forward(self, input):
        heads_outputs = [head(input) for head in self.heads]
        x = torch.cat(heads_outputs, dim=-1)
        x = self.linear(x)
        x = self.dropout(x)
        return x

In [27]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_layers = nn.Sequential(
            nn.Linear(config["d_embed"], 4 * config["d_embed"]),
            nn.GELU(),
            nn.Linear(4 * config["d_embed"], config["d_embed"]),
            nn.Dropout(config["dropout_rate"])
        )

    def forward(self, input):
        return self.linear_layers(input)

In [29]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.multi_head = MultiHeadAttention(config)
        self.layer_norm_1 = nn.LayerNorm(config["d_embed"])
        self.feed_forward = FeedForward(config)
        self.layer_norm_2 = nn.LayerNorm(config["d_embed"])

    def forward(self, input):
        x = input
        x = x + self.multi_head(self.layer_norm_1(x))
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [31]:
class DemoGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding_layer = nn.Embedding(config["vocabulary_size"], config["d_embed"])
        self.positional_embedding_layer = nn.Embedding(config["context_size"], config["d_embed"])
        
        blocks = [Block(config) for _ in range(config["layers_num"])]
        self.layers = nn.Sequential(*blocks)
        
        self.layer_norm = nn.LayerNorm(config["d_embed"])
        self.classifier = nn.Linear(config["d_embed"], config["num_classes"], bias=config["use_bias"])
        
    def forward(self, token_ids):
        batch_size, tokens_num = token_ids.shape

        x = self.token_embedding_layer(token_ids)
        positions = torch.arange(tokens_num, device=token_ids.device)
        pos_embed = self.positional_embedding_layer(positions)
        x = x + pos_embed.unsqueeze(0)
        
        x = self.layers(x)
        x = self.layer_norm(x)
        
        x = torch.mean(x, dim=1)
        logits = self.classifier(x)
        return logits
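
Note that `torch.mean(x, dim=1)` averages over all 128 positions, including `[PAD]` positions of short reviews, and that the `attention_mask` returned by the dataset is not used by the model. One possible alternative, shown here only as a sketch and not as what the notebook does, is to average over the non-padding positions. The helper name `masked_mean` is hypothetical.

```python
import torch

def masked_mean(hidden, attention_mask):
    """Average hidden states over non-padding positions only.

    hidden:         (batch, tokens, d_embed) float tensor
    attention_mask: (batch, tokens) tensor with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)                    # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)                # avoid division by zero
    return summed / counts

# Toy example: two real tokens followed by one padding position with large values.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])
print(masked_mean(hidden, attention_mask))   # padding row is ignored
print(hidden.mean(dim=1))                    # plain mean is pulled toward the padding row
```

Using this in the model would require passing `attention_mask` into `forward`, and whether it improves accuracy here is untested.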

## Training

In [34]:
def calculate_accuracy(model, data_loader, device):
    model.eval()
    total_correct = 0
    total_samples = 0
    with torch.no_grad():
        for batch in data_loader:
            token_ids, attention_mask, labels = batch
            token_ids = token_ids.to(device)
            labels = labels.to(device)

            logits = model(token_ids)
            predictions = torch.argmax(logits, dim=1)

            total_correct += (predictions == labels).sum().item()
            total_samples += labels.size(0)
    accuracy = (total_correct / total_samples) * 100
    return accuracy

In [37]:
import torch.optim as optim

EPOCHS = 3

model = DemoGPT(config).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0

    for step, (input_ids, attention_mask, labels) in enumerate(train_loader):
        input_ids = input_ids.to(device)
        labels = labels.to(device)

        logits = model(input_ids)
        loss = criterion(logits, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()

        if (step + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{EPOCHS}], Step [{step+1}/{len(train_loader)}], "
                  f"Loss: {running_loss/100:.4f}")
            running_loss = 0.0

    val_accuracy = calculate_accuracy(model, val_loader, device)
    print(f"Epoch {epoch+1} - Validation Accuracy: {val_accuracy:.2f}%")

Epoch [1/3], Step [100/704], Loss: 0.6844
Epoch [1/3], Step [200/704], Loss: 0.6625
Epoch [1/3], Step [300/704], Loss: 0.6354
Epoch [1/3], Step [400/704], Loss: 0.6052
Epoch [1/3], Step [500/704], Loss: 0.6034
Epoch [1/3], Step [600/704], Loss: 0.5757
Epoch [1/3], Step [700/704], Loss: 0.5551
Epoch 1 - Validation Accuracy: 66.48%
Epoch [2/3], Step [100/704], Loss: 0.5232
Epoch [2/3], Step [200/704], Loss: 0.5198
Epoch [2/3], Step [300/704], Loss: 0.4939
Epoch [2/3], Step [400/704], Loss: 0.4928
Epoch [2/3], Step [500/704], Loss: 0.4971
Epoch [2/3], Step [600/704], Loss: 0.4993
Epoch [2/3], Step [700/704], Loss: 0.4706
Epoch 2 - Validation Accuracy: 76.24%
Epoch [3/3], Step [100/704], Loss: 0.4485
Epoch [3/3], Step [200/704], Loss: 0.4402
Epoch [3/3], Step [300/704], Loss: 0.4376
Epoch [3/3], Step [400/704], Loss: 0.4331
Epoch [3/3], Step [500/704], Loss: 0.4349
Epoch [3/3], Step [600/704], Loss: 0.4305
Epoch [3/3], Step [700/704], Loss: 0.4128
Epoch 3 - Validation Accuracy: 77.12%
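
A useful reference point for reading this log: on a balanced two-class problem, a model that assigns probability 0.5 to each class has a cross-entropy of ln 2 ≈ 0.693, so the first logged loss of 0.6844 is only slightly better than chance. The check below is a small sketch using made-up logits and labels.

```python
import math
import torch
import torch.nn as nn

chance_loss = math.log(2)

# All-zero logits correspond to a 50/50 prediction for every example.
logits = torch.zeros(8, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(logits, labels).item()

print(f"ln(2) = {chance_loss:.4f}, loss at uniform prediction = {loss:.4f}")
```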


## Evaluation

In [38]:
test_accuracy = calculate_accuracy(model, test_loader, device)
print(f"Test Accuracy: {test_accuracy:.2f}%")

Test Accuracy: 76.27%


In [39]:
MODEL_PATH = "sentiment_model.pth"
torch.save(model.state_dict(), MODEL_PATH)
print("Model saved to:", MODEL_PATH)

Model saved to: sentiment_model.pth
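
Since only the `state_dict` is saved, reloading requires building the model first and then loading the weights into it. The sketch below shows the round trip with a tiny `nn.Linear` as a hypothetical stand-in; for the real model, the stand-in would be replaced by `DemoGPT(config)` built with the same `config`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the trained DemoGPT(config) instance.
stand_in = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pth")
torch.save(stand_in.state_dict(), path)

# Rebuild the same architecture, then load the saved weights into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # disable dropout before running inference

print(torch.equal(stand_in.weight, restored.weight))
```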


## Results Summary

### Performance Metrics
- **Validation Accuracy**: 77.12%
- **Test Accuracy**: 76.27%

### Key Achievements
- Implemented a transformer architecture from scratch for sentiment classification
- Validation and test accuracy differ by less than one percentage point (77.12% vs. 76.27%), suggesting the model is not overfitting to the validation split
- Reached this accuracy in only 3 epochs; the training loss was still decreasing at the end of the third epoch

### Technical Highlights
- Custom GPT-style architecture with 4 transformer blocks
- Multi-head attention mechanism (4 heads, 32-dimensional each)
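- Mean pooling over all token positions (padding included) to turn the sequence into a single vector
- Pretrained `bert-base-uncased` tokenizer for subword tokenization (tokenizer only; no pretrained model weights are used)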

### Future Improvements
- Increase model capacity (more layers, larger embeddings)
- Extended training with learning rate scheduling
- Fine-tuning pretrained models (BERT, RoBERTa)
- Ensemble methods for improved accuracy
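
As an illustration of the learning rate scheduling item, the sketch below wires a linear warmup followed by cosine decay into an `AdamW` optimizer through `torch.optim.lr_scheduler.LambdaLR`. It is one possible schedule, not something the notebook uses: `warmup_cosine`, `toy_model`, and the step counts are hypothetical, and in the real loop `total_steps` would be `EPOCHS * len(train_loader)`.

```python
import math

import torch
import torch.nn as nn
import torch.optim as optim

def warmup_cosine(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

toy_model = nn.Linear(4, 2)  # hypothetical stand-in for DemoGPT(config)
optimizer = optim.AdamW(toy_model.parameters(), lr=3e-4)
total_steps, warmup_steps = 100, 10
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: warmup_cosine(step, warmup_steps, total_steps)
)

learning_rates = []
for step in range(total_steps):
    # Record the rate that will be used for this update.
    learning_rates.append(optimizer.param_groups[0]["lr"])
    loss = toy_model(torch.randn(8, 4)).sum()  # dummy loss on random data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per optimizer step

print(f"first: {learning_rates[0]:.2e}, peak: {max(learning_rates):.2e}, "
      f"last: {learning_rates[-1]:.2e}")
```

The scheduler is stepped once per batch, right after `optimizer.step()`, so the rate ramps up over the first 10 updates and then decays toward zero.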