- Full Name: **Seyyed Amirmahdi Sadrzadeh**
- Student ID: **401102015**

# 🧠 Homework 4: Transformer for Sentiment Analysis

## 📌 Objective

In this assignment, you will **implement a Transformer-based model for sentiment analysis** on the IMDb movie review dataset. You will:

- 🧹 Preprocess and clean real-world text data.
- 🏗️ Build a Transformer classifier from scratch (including positional encoding).
- 🧠 Train the model to classify IMDb reviews as **positive** or **negative**.
- 📈 Evaluate model performance on the test set.

---

## 📚 Learning Goals

By the end of this assignment, you should be able to:

- Understand how the Transformer encoder works in NLP.
- Implement tokenization, padding, and vocabulary creation.
- Train a Transformer-based model for text classification.
- Measure and interpret model performance on a real-world dataset.

---

## 📦 Dataset

We use the **IMDb movie reviews dataset**:

- Contains 50,000 highly polar movie reviews (25,000 for training and 25,000 for testing).
- Each review is labeled as either **positive (1)** or **negative (0)**.
- You will clean the raw text, tokenize it, and build a vocabulary before training.
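
As a rough illustration of the cleaning, tokenization, and vocabulary steps listed above, the following sketch runs them on a two-review toy corpus. The reviews and the `min_freq=1` threshold are made up for the example (the notebook applies a cutoff of 5 to the real training split), and `build_vocab` is a hypothetical helper name.

```python
import re
from collections import Counter

def clean_text(text):
    text = re.sub(r"<.*?>", "", text)            # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # drop punctuation
    return text.lower()

def build_vocab(texts, min_freq=1):
    # Hypothetical helper: count whitespace tokens, keep those seen often enough.
    counter = Counter()
    for t in texts:
        counter.update(t.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # Words missing from the vocabulary map to the <unk> id.
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

toy_reviews = ["Great movie! <br /> Loved it.", "Terrible movie..."]  # made-up data
cleaned = [clean_text(r) for r in toy_reviews]
toy_vocab = build_vocab(cleaned)
encoded = encode(clean_text("Great acting!"), toy_vocab)
print(encoded)  # "acting" is not in the toy vocabulary, so it becomes <unk>
```

Ids 0 and 1 are reserved for `<pad>` and `<unk>`, so the first real word receives id 2.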

---

## 🏗️ Model Architecture

You will build a **Transformer Encoder** model that includes:

- Word Embedding Layer
- Positional Encoding Layer
- Multi-head Self-Attention Blocks
- Feedforward Layers
- Final Classification Head
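
For reference, the positional encoding and self-attention components are usually written with the standard formulas below. Here $d_k$ is assumed to be the per-head dimension (`head_dim` in the code further down), $pos$ is the token position, and $i$ indexes pairs of embedding dimensions.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$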

---

## ⚙️ Training Details

- Optimizer: `Adam`
- Loss Function: `CrossEntropyLoss`
- Batch Size: `32`
- Learning Rate: `1e-3`
- Epochs: `5`

---

## 🎯 Evaluation Criteria

Your final implementation will be evaluated on:

- ✅ Correct implementation of the Transformer classifier.
- ✅ Clean and modular code (e.g., `Dataset`, `DataLoader`, `Model`, `Train` functions).
- ✅ Accuracy on the IMDb test set.
- ✅ Proper text preprocessing and vocabulary handling.
- ✅ Well-commented and readable code.

---



In [1]:
import os
import re
import tarfile
import requests
import torch
import torch.nn as nn
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import copy

In [2]:
## Do not edit part

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def download_imdb(data_path="./imdb"):
    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    os.makedirs(data_path, exist_ok=True)
    filepath = os.path.join(data_path, "aclImdb_v1.tar.gz")

    if not os.path.exists(filepath):
        print("Downloading IMDb dataset...")
        r = requests.get(url, stream=True)
        with open(filepath, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024): f.write(chunk)

        print("Extracting...")
        with tarfile.open(filepath, "r:gz") as tar:
            tar.extractall(path=data_path)
    print("Done.")

download_imdb()

def clean_text(text):
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return text.lower()

def load_imdb_data(base_path, split='train'):
    data = []
    for label in ['pos', 'neg']:
        folder = os.path.join(base_path, f'aclImdb/{split}/{label}')
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), 'r', encoding='utf8') as f:
                text = clean_text(f.read())
                data.append((text, 1 if label == 'pos' else 0))
    return data

train_raw = load_imdb_data("./imdb", split='train')
test_raw = load_imdb_data("./imdb", split='test')

Downloading IMDb dataset...
Extracting...
Done.


In [3]:
## Do not edit part
def tokenize(text):
    return text.split()

# Build vocab
counter = Counter()
for text, _ in train_raw:
    counter.update(tokenize(text))

vocab = {"<pad>": 0, "<unk>": 1}
for word, freq in counter.items():
    if freq >= 5:
        vocab[word] = len(vocab)

def encode(text):
    return [vocab.get(w, vocab["<unk>"]) for w in tokenize(text)]

class IMDBDataset(Dataset):
    def __init__(self, data):
        self.data = [(encode(text), label) for text, label in data]
    def __len__(self): return len(self.data)
    def __getitem__(self, idx): return self.data[idx]

def collate_fn(batch):
    texts, labels = zip(*batch)
    texts = [torch.tensor(x) for x in texts]
    texts = torch.nn.utils.rnn.pad_sequence(texts, batch_first=True, padding_value=vocab["<pad>"])
    return texts.to(device), torch.tensor(labels).to(device)

train_dataset = IMDBDataset(train_raw)
test_dataset = IMDBDataset(test_raw)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_fn)
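
The collate function above pads every batch to its longest review, so one very long review sets the sequence length for the whole batch. A possible variation, sketched here in plain Python, also caps the length before padding. `pad_and_truncate` and `max_len` are hypothetical names that are not part of the provided code, and whether truncation is allowed depends on the assignment rules.

```python
def pad_and_truncate(sequences, max_len, pad_id=0):
    """Cut each sequence to max_len tokens, then right-pad to the batch maximum."""
    clipped = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in clipped)
    return [seq + [pad_id] * (width - len(seq)) for seq in clipped]

batch = [[5, 8, 2], [7], [4, 4, 4, 4, 4, 4]]   # made-up token ids
padded = pad_and_truncate(batch, max_len=4)
print(padded)
```

Because self-attention builds a `seq_len x seq_len` score matrix per head, capping the length reduces memory quadratically.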


In [4]:
## To do: complete the PositionalEncoding module
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        ## Sin , Cos positional encoding
        pe[:, 0::2] = torch.sin(position * div_term)        # even dims
        pe[:, 1::2] = torch.cos(position * div_term)        # odd dims
        # Register as a buffer so the table moves with the module on .to(device)
        self.register_buffer("pe", pe.unsqueeze(0))
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
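
To make the sinusoidal table concrete, the sketch below computes the encoding for a single position in plain Python, using the same even/odd layout as the module above. The function name and the tiny `d_model=4` are chosen only for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Same frequency as exp(i * -(log(10000) / d_model)) in the module.
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe

pe0 = positional_encoding(0, 4)   # position 0: sin(0) = 0 and cos(0) = 1 in every pair
pe1 = positional_encoding(1, 4)   # position 1: first pair uses angle 1, second uses 0.01
print(pe0, pe1)
```

Low dimensions oscillate quickly with position while high dimensions change slowly, which is what lets the model tell nearby and distant positions apart.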


In [11]:
## To do: complete MultiheadAttention module
class MultiheadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"

        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)

        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, key_padding_mask=None):
        batch_size = query.size(1)

        # Project inputs to query, key, value
        q = self.q_proj(query).view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
        k = self.k_proj(key).view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
        v = self.v_proj(value).view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)

        # Scaled dot-product attention
        attn_output, attn_weights = self.scaled_dot_product_attention(q, k, v)

        # Concatenate heads and project back to original dimension
        attn_output = attn_output.transpose(0, 1).contiguous().view(-1, batch_size, self.embed_dim)
        attn_output = self.dropout(self.out_proj(attn_output))

        return attn_output, attn_weights

    ## To do
    def scaled_dot_product_attention(self, q, k, v):
        attn_scores = torch.bmm(q, k.transpose(1, 2)) / np.sqrt(self.head_dim)
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        output = torch.matmul(attn_weights, v)
        return output, attn_weights
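
The `forward` method above accepts a `key_padding_mask` argument, but the attention computation does not use it, so padded positions take part in the softmax. One common way to exclude them is sketched below as a standalone function. It is a single-head illustration under the assumption that the mask is `True` at `<pad>` positions, not the notebook's implementation.

```python
import torch

def masked_attention(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, seq_len, head_dim); key_padding_mask: (batch, seq_len), True at <pad>."""
    scores = torch.bmm(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
    if key_padding_mask is not None:
        # Broadcast over the query axis so every query ignores padded keys.
        scores = scores.masked_fill(key_padding_mask.unsqueeze(1), float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, v), weights

torch.manual_seed(0)
q = k = v = torch.randn(1, 3, 4)               # made-up input: 1 sequence, 3 tokens
mask = torch.tensor([[False, False, True]])    # last position is padding
out, w = masked_attention(q, k, v, mask)
print(w[0])                                    # last column is all zeros
```

With the `(batch * num_heads, seq_len, head_dim)` layout used in this notebook, the mask would additionally need one copy per head, for example via `mask.repeat_interleave(num_heads, dim=0)`.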





In [6]:
## To do: complete the forward function

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src):
        # Self attention
        src2 = self.self_attn(src, src, src)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        # Feedforward
        src2 = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src


In [7]:
class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(encoder_layer) for _ in range(num_layers)])

    ## To do
    def forward(self, src):
        output = src
        for layer in self.layers:
            output = layer(output)
        return output

In [13]:
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_encoder = PositionalEncoding(d_model)

        encoder_layer = TransformerEncoderLayer(d_model, nhead, d_model*2, dropout=0.1)
        self.transformer = TransformerEncoder(encoder_layer, num_layers)

        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, src):
        ## get the embedding
        x = self.embedding(src)
        ## pos encode
        x = self.pos_encoder(x)
        x = x.permute(1, 0, 2)   # (seq_len, batch, dim)
        x = self.transformer(x)
        return self.fc(x[0])  # Use first token as representation

model = TransformerClassifier(len(vocab)).to(device)

In [14]:
## To do: complete the training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train(model, loader):
    model.train()
    total_loss, correct = 0, 0
    for x, y in loader:
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        correct    += (out.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

def evaluate(model, loader):
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in loader:
            out = model(x)
            correct += (out.argmax(1) == y).sum().item()
    return correct / len(loader.dataset)



In [16]:
## Run and enjoy

for epoch in range(20):
    train_loss, train_acc = train(model, train_loader)
    test_acc = evaluate(model, test_loader)
    print(f"Epoch {epoch+1}: Train Loss {train_loss:.4f}, Train Acc {train_acc:.4f}, Test Acc {test_acc:.4f}")

Epoch 1: Train Loss 0.5923, Train Acc 0.6885, Test Acc 0.6845
Epoch 2: Train Loss 0.5996, Train Acc 0.6794, Test Acc 0.6487
Epoch 3: Train Loss 0.6078, Train Acc 0.6720, Test Acc 0.6748
Epoch 4: Train Loss 0.5844, Train Acc 0.6966, Test Acc 0.6848
Epoch 5: Train Loss 0.5936, Train Acc 0.6892, Test Acc 0.6618
Epoch 6: Train Loss 0.5986, Train Acc 0.6721, Test Acc 0.6918
Epoch 7: Train Loss 0.5494, Train Acc 0.7220, Test Acc 0.7378
Epoch 8: Train Loss 0.5393, Train Acc 0.7330, Test Acc 0.7312
Epoch 9: Train Loss 0.5330, Train Acc 0.7370, Test Acc 0.7374
Epoch 10: Train Loss 0.5252, Train Acc 0.7412, Test Acc 0.7369
Epoch 11: Train Loss 0.5198, Train Acc 0.7480, Test Acc 0.7436
Epoch 12: Train Loss 0.5054, Train Acc 0.7586, Test Acc 0.7472
Epoch 13: Train Loss 0.5006, Train Acc 0.7628, Test Acc 0.7480
Epoch 14: Train Loss 0.4905, Train Acc 0.7668, Test Acc 0.7528
Epoch 15: Train Loss 0.4743, Train Acc 0.7781, Test Acc 0.7657
Epoch 16: Train Loss 0.4748, Train Acc 0.7771, Test Acc 0.7671
E

## 📝 Bonus (Optional)

- 🔍 Experiment with different model hyperparameters (e.g., `nhead`, `d_model`, `num_layers`).
- 📊 Plot training and validation accuracy over epochs.
- 📌 Use attention weights to interpret model focus.




In [17]:
import os
import re
import tarfile
import requests
import torch
import torch.nn as nn
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
import copy
import matplotlib.pyplot as plt
from itertools import product

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download IMDb dataset if not exists
def download_imdb(data_path="./imdb"):
    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    os.makedirs(data_path, exist_ok=True)
    filepath = os.path.join(data_path, "aclImdb_v1.tar.gz")
    if not os.path.exists(filepath):
        print("Downloading IMDb dataset...")
        r = requests.get(url, stream=True)
        with open(filepath, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024): f.write(chunk)
        print("Extracting...")
        with tarfile.open(filepath, "r:gz") as tar:
            tar.extractall(path=data_path)
    print("Done.")

download_imdb()

def clean_text(text):
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return text.lower()

def load_imdb_data(base_path, split='train'):
    data = []
    for label in ['pos', 'neg']:
        folder = os.path.join(base_path, f'aclImdb/{split}/{label}')
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), 'r', encoding='utf8') as f:
                text = clean_text(f.read())
                data.append((text, 1 if label == 'pos' else 0))
    return data

train_raw = load_imdb_data("./imdb", split='train')
test_raw = load_imdb_data("./imdb", split='test')

def tokenize(text):
    return text.split()

# Build vocab
counter = Counter()
for text, _ in train_raw:
    counter.update(tokenize(text))

vocab = {"<pad>": 0, "<unk>": 1}
for word, freq in counter.items():
    if freq >= 5:
        vocab[word] = len(vocab)
vocab_inv = {idx: word for word, idx in vocab.items()}

def encode(text):
    return [vocab.get(w, vocab["<unk>"]) for w in tokenize(text)]

class IMDBDataset(Dataset):
    def __init__(self, data):
        self.data = [(encode(text), label) for text, label in data]
    def __len__(self): return len(self.data)
    def __getitem__(self, idx): return self.data[idx]

def collate_fn(batch):
    texts, labels = zip(*batch)
    texts = [torch.tensor(x) for x in texts]
    texts = torch.nn.utils.rnn.pad_sequence(texts, batch_first=True, padding_value=vocab["<pad>"])
    return texts.to(device), torch.tensor(labels).to(device)

train_dataset = IMDBDataset(train_raw)
test_dataset = IMDBDataset(test_raw)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Register as a buffer so the table moves with the module on .to(device)
        self.register_buffer("pe", pe.unsqueeze(0))
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

class MultiheadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"

        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, return_weights=False):
        batch_size = query.size(1)

        q = self.q_proj(query).view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
        k = self.k_proj(key).view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
        v = self.v_proj(value).view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)

        attn_output, attn_weights = self.scaled_dot_product_attention(q, k, v)
        attn_output = attn_output.transpose(0, 1).contiguous().view(-1, batch_size, self.embed_dim)
        attn_output = self.dropout(self.out_proj(attn_output))

        if return_weights:
            return attn_output, attn_weights
        return attn_output, None

    def scaled_dot_product_attention(self, q, k, v):
        scores = torch.bmm(q, k.transpose(1, 2)) / np.sqrt(self.head_dim)
        weights = torch.softmax(scores, dim=-1)
        weights = self.dropout(weights)
        output = torch.bmm(weights, v)
        return output, weights

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src):
        src2 = self.self_attn(src, src, src)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        ff = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        src = src + self.dropout2(ff)
        src = self.norm2(src)
        return src

class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(encoder_layer) for _ in range(num_layers)])
    def forward(self, src):
        output = src
        for layer in self.layers:
            output = layer(output)
        return output

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_encoder = PositionalEncoding(d_model)
        encoder_layer = TransformerEncoderLayer(d_model, nhead, d_model*2, dropout=0.1)
        self.transformer = TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, num_classes)
    def forward(self, src):
        x = self.embedding(src)              # (batch, seq_len, d_model)
        x = self.pos_encoder(x)              # add positional encoding
        x = x.permute(1, 0, 2)               # (seq_len, batch, d_model)
        x = self.transformer(x)              # encode
        return self.fc(x[0])                 # use first token for classification

# Initialize model, criterion, optimizer
model = TransformerClassifier(len(vocab)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(model, loader):
    model.train()
    total_loss, correct = 0.0, 0
    for x, y in loader:
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        correct += (out.argmax(1) == y).sum().item()

    return total_loss / len(loader), correct / len(loader.dataset)

def evaluate(model, loader):
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in loader:
            out = model(x)
            correct += (out.argmax(1) == y).sum().item()
    return correct / len(loader.dataset)

# Train for some epochs and track accuracy
num_epochs = 5
train_acc_history = []
test_acc_history = []

for epoch in range(1, num_epochs+1):
    train_loss, train_acc = train(model, train_loader)
    test_acc = evaluate(model, test_loader)

    train_acc_history.append(train_acc)
    test_acc_history.append(test_acc)

    print(f"Epoch {epoch}: Train Loss={train_loss:.4f}, Train Acc={train_acc:.4f}, Test Acc={test_acc:.4f}")

# Plot training & test accuracy
plt.figure()
plt.plot(range(1, num_epochs+1), train_acc_history, label="Train Acc")
plt.plot(range(1, num_epochs+1), test_acc_history, label="Test Acc")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Train/Test Accuracy")
plt.legend()
plt.show()

# --------------------------
# BONUS: Visualize attention weights from first encoder layer on a sample input
sample_idx = 0
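sample_ids, sample_label = test_dataset[sample_idx]
# Dataset items are plain Python lists of token ids, so convert to a tensor first
sample_text = torch.tensor(sample_ids)
sample_tensor = sample_text.unsqueeze(0).to(device)  # batch=1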

model.eval()
with torch.no_grad():
    x = model.embedding(sample_tensor)
    x = model.pos_encoder(x)
    x = x.permute(1,0,2)  # (seq_len, batch, d_model)

    # extract first layer self-attention weights
    attn_module = model.transformer.layers[0].self_attn
    # get attn weights from forward
    _, attn_weights = attn_module(x, x, x, return_weights=True)

# attn_weights shape: (batch * heads, seq_len, seq_len); batch is 1 here
num_heads = attn_module.num_heads
attn_weights = attn_weights.view(num_heads, 1, attn_weights.size(-2), attn_weights.size(-1)).squeeze(1)

tokens = [vocab_inv[idx.item()] for idx in sample_text]

fig, axs = plt.subplots(1, num_heads, figsize=(num_heads * 3, 3))
for i in range(num_heads):
    ax = axs[i] if num_heads > 1 else axs
    im = ax.imshow(attn_weights[i].cpu(), aspect='auto')
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90, fontsize=6)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens, fontsize=6)
    ax.set_title(f"Head {i}")
plt.tight_layout()
plt.show()


Done.


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.86 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.56 GiB is free. Process 2557 has 13.18 GiB memory in use. Of the allocated memory 10.19 GiB is allocated by PyTorch, and 2.86 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
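
The failed allocation is consistent with the size of the attention score tensor, which has shape `(batch * heads, seq_len, seq_len)` and grows quadratically with the longest review in a batch. The back-of-the-envelope sketch below uses the `batch_size=32` and `nhead=4` settings from this cell. The 2400-token length is an assumed value for illustration, not a measured one, and the 256-token cap is a hypothetical alternative.

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_value=4):
    """Size of one (batch * heads, seq_len, seq_len) float32 score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

GIB = 1024 ** 3
full = attention_scores_bytes(32, 4, 2400) / GIB    # assumed longest review of ~2400 tokens
capped = attention_scores_bytes(32, 4, 256) / GIB   # same batch truncated to 256 tokens
print(f"{full:.2f} GiB vs {capped:.3f} GiB")
```

Under these assumptions a single score tensor is close to the size reported in the error, and autograd keeps further tensors of the same shape for each layer. Truncating reviews or using a smaller batch size are the usual ways to bring this down.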