<a href="https://colab.research.google.com/github/LeibGit/-DI_Bootcamp/blob/main/Copy_of_Custom_Attention_SMS_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Daily Challenge: Custom Attention Mechanism & SMS Spam Classification

Welcome to the guided notebook for the *Custom Attention Mechanism & SMS Spam* daily challenge. Cells tagged as **PREFILLED** are ready to run as-is. Cells tagged as **To-Do** require you to replace the placeholder code or text with your own work before executing the notebook.


## Why are we doing this?
Modern NLP systems rely on attention. By rolling your own attention block and contrasting it with a pre-trained GPT-2 classifier, you will demystify how query/key/value flows shape downstream predictions on a real SMS spam dataset.

![Image](https://github.com/user-attachments/assets/bc4d5315-983b-4fc1-9011-25fa743bb25f)


## Learning objectives
- Implement a custom scaled dot-product attention layer from scratch.
- Explain the respective roles of queries, keys, and values.
- Fine-tune GPT-2 for binary spam classification and compare it to a custom model.
- Evaluate both systems with accuracy, precision, recall, and F1.
- Reflect on trade-offs between transformer-based and lightweight attention models.


> **Learning point**
> Work through each part sequentially. Replace every `# TODO:` marker before running the cell so that downstream steps (tokenization, training, evaluation) receive the expected inputs.


# Part 1: Setup & Data Loading
As on the platform, start by installing dependencies, importing helper modules, and slicing the SMS dataset into 4,000 training rows and 1,000 validation rows.


**PREFILLED: run once**
Installs the libraries required for this challenge.


In [2]:
%pip install --quiet datasets evaluate transformers[sentencepiece]


**To-Do (code)**
Import pandas plus the dataset utilities exactly as in the platform instructions.


In [5]:
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict

**To-Do (code)**
Load the UCI SMS Spam parquet file, convert it to a Hugging Face Dataset, then build the 4,000 / 1,000 splits described in the instructions.


In [7]:
# TODO: load and inspect the SMS Spam dataset
DATA_PATH = 'hf://datasets/ucirvine/sms_spam/plain_text/train-00000-of-00001.parquet'
df = pd.read_parquet(DATA_PATH)
hf_dataset = Dataset.from_pandas(df)

TRAIN_START = 0
TRAIN_END = 4_000  # TODO: use 4,000 samples for training
VAL_START = 4_000 # TODO: begin validation split at 4,000
VAL_END = 5_000    # TODO: stop validation split at 5,000

if None in (TRAIN_END, VAL_START, VAL_END):
    raise ValueError('Set TRAIN_END, VAL_START, and VAL_END according to the instructions.')

train_ds = hf_dataset.select(range(TRAIN_START, TRAIN_END))
val_ds = hf_dataset.select(range(VAL_START, VAL_END))
display(df.head())


| | sms | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni...\n | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |


# Part 2: Tokenization Setup
Initialize the GPT-2 tokenizer, set a padding token, and prepare batched tokenization for both splits.


> **Learning point**
> GPT-2 does not define a pad token. Reusing the EOS token keeps inputs aligned with how the model was pretrained.


In [9]:
# TODO: initialize the tokenizer and padding behavior
from transformers import GPT2Tokenizer

MODEL_NAME = "gpt2"  # TODO: set to 'gpt2', you can also try 'gpt2-medium' or 'gpt2-large'
if MODEL_NAME is None:
    raise ValueError("Set MODEL_NAME to the pretrained checkpoint (e.g., 'gpt2').")

tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # TODO: verify pad token is mapped to eos



In [10]:
# TODO: complete the tokenization function
TEXT_COLUMN = 'sms'       # TODO: set to 'sms' (the name of the text column in the dataset)
PADDING_STRATEGY = 'max_length'   # TODO: set to 'max_length' so every sequence is padded to MAX_SEQ_LEN
TRUNCATION_FLAG = True    # TODO: set to True so sequences longer than MAX_SEQ_LEN are truncated
MAX_SEQ_LEN = 64        # TODO: set to 64 because SMS messages are short

for setting in (TEXT_COLUMN, PADDING_STRATEGY, TRUNCATION_FLAG, MAX_SEQ_LEN):
    if setting is None:
        raise ValueError('Complete TEXT_COLUMN, PADDING_STRATEGY, TRUNCATION_FLAG, and MAX_SEQ_LEN.')


def tokenize_fn(examples):
    return tokenizer(
        examples[TEXT_COLUMN],
        padding=PADDING_STRATEGY,
        truncation=TRUNCATION_FLAG,
        max_length=MAX_SEQ_LEN,
    )


train_tok = train_ds.map(tokenize_fn, batched=True)
val_tok = val_ds.map(tokenize_fn, batched=True)
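
To make the `padding='max_length'` and `truncation=True` settings concrete, here is a minimal plain-Python sketch of what they do to a single sequence. The helper name `pad_or_truncate` and the toy token ids are hypothetical; the sketch assumes right-side padding and uses 50256, GPT-2's end-of-text id, as the pad id. It illustrates the idea and is not the tokenizer's actual implementation.

```python
GPT2_EOS_ID = 50256  # GPT-2's end-of-text id, reused here as the pad id


def pad_or_truncate(token_ids, max_len, pad_id=GPT2_EOS_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                        # truncation: drop tokens past max_len
    n_pad = max_len - len(kept)                       # padding: fill the remainder on the right
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad    # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Toy ids, not real GPT-2 tokens.
ids, mask = pad_or_truncate([11, 22, 33], max_len=5)
print(ids)   # [11, 22, 33, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0]
```

The attention mask is what lets a model tell padding apart from a genuine end-of-text token, since both share the same id here.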



# Part 3: Pre-trained GPT-2 Classifier
Load GPT-2 with a classification head suited for binary spam detection.


In [11]:
# TODO: instantiate GPT-2 for sequence classification
import torch
from transformers import GPT2ForSequenceClassification

NUM_LABELS = 2  # TODO: set to 2 for spam vs. ham because this is binary classification
if NUM_LABELS is None:
    raise ValueError('Set NUM_LABELS to 2 for binary classification.')

model = GPT2ForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    pad_token_id=tokenizer.eos_token_id,  # TODO: confirm pad token id so training does not error out
)
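
The `pad_token_id` argument matters because, per the Transformers documentation, `GPT2ForSequenceClassification` classifies from the last token of each row and needs the pad id to locate the last non-padding token. The sketch below shows that lookup in plain Python. The function name is hypothetical, and it reads an attention mask rather than comparing ids, which is an assumption made here to sidestep the ambiguity of pad and EOS sharing one id. It is an illustration, not the library's code.

```python
def last_real_token_index(attention_mask):
    """Index of the last position marked 1 in a right-padded attention mask."""
    if 1 not in attention_mask:
        raise ValueError("sequence contains no real tokens")
    # Scan from the right for the first real token.
    return len(attention_mask) - 1 - attention_mask[::-1].index(1)


print(last_real_token_index([1, 1, 1, 0, 0]))  # 2
```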





Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Part 4: Custom Attention Implementation
Build the simple attention layer, classifier, and data pipeline for the scratch model.


> **Learning point**
> Scaling the dot products by $1/\sqrt{d_k}$ keeps gradients stable and prevents the softmax from saturating as the embedding dimension grows. This operation is crucial for training deep attention models.

In [14]:
# TODO: implement the Attention layer
import torch.nn.functional as F
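# NOTE: this `Dataset` shadows `datasets.Dataset` imported in Part 1;
# re-run the Part 1 import cell before re-running the data-loading cell.
from torch.utils.data import Dataset, DataLoader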
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.scale = embed_dim ** -0.5  # TODO: use embed_dim ** -0.5

    def forward(self, query, key, value, mask=None):
        scores = torch.matmul(
            query, # TODO: multiply query with the transposed key
            key.transpose(-2, -1),  # swap the last two axes (seq_len, embed_dim)
        ) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)  # TODO: use the last dimension
        return torch.matmul(attn, value), attn  # TODO: apply attention weights to values


class SimpleAttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # TODO: vocab_size, embed_dim
        self.attn = Attention(embed_dim)               # TODO: pass embed_dim
        self.fc = nn.Linear(embed_dim, num_classes)            # TODO: embed_dim to num_classes

    def forward(self, x):
        embed = self.embedding(x)              # TODO: pass input x
        attn_output, _ = self.attn(query=embed, key=embed, value=embed)  # TODO: self-attention (q=k=v=embed)
        pooled = attn_output.mean(dim=1)       # TODO: mean over sequence dimension
        return self.fc(pooled)                      # TODO: classify pooled representation
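
Because every message is padded to a fixed length, the classifier above attends to padding positions and includes them in the mean. The sketch below shows one way to exclude them, assuming a `pad_mask` tensor (1 for real tokens, 0 for padding) is available for each batch. The function name and the mask input are assumptions for illustration; this is a variation on the notebook's model, not the code it runs.

```python
import torch
import torch.nn.functional as F


def masked_self_attention_pool(embed, pad_mask):
    """embed: (batch, seq, dim); pad_mask: (batch, seq) with 1 = real token, 0 = padding."""
    scale = embed.size(-1) ** -0.5
    scores = torch.matmul(embed, embed.transpose(-2, -1)) * scale   # (batch, seq, seq)
    # Hide padded keys from every query by broadcasting over the query axis.
    key_mask = pad_mask[:, None, :].bool()                          # (batch, 1, seq)
    scores = scores.masked_fill(~key_mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    out = torch.matmul(attn, embed)                                 # (batch, seq, dim)
    # Average only over real tokens.
    weights = pad_mask.unsqueeze(-1).to(out.dtype)                  # (batch, seq, 1)
    pooled = (out * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)
    return pooled, attn


torch.manual_seed(0)
embed = torch.randn(2, 4, 8)
pad_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
pooled, attn = masked_self_attention_pool(embed, pad_mask)
print(pooled.shape)          # torch.Size([2, 8])
print(attn[0, 0, 3].item())  # 0.0, the padded key receives no weight
```

The sketch assumes each sequence contains at least one real token; a row with only padding would have every score set to `-inf` and produce NaNs in the softmax.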


> **Learning point**
> Tokenize once and reuse the same 64-token cap so both models receive comparable context windows.


In [15]:
# TODO: preprocess datasets for the custom attention model
ATTN_TEXT_COLUMN = 'sms'  # TODO: set to 'sms'
ATTN_MAX_LEN = 64      # TODO: set to 64
if ATTN_TEXT_COLUMN is None or ATTN_MAX_LEN is None:
    raise ValueError('Complete ATTN_TEXT_COLUMN and ATTN_MAX_LEN.')


def preprocess_for_attention(example):
    tokens = tokenizer.encode(
        example[ATTN_TEXT_COLUMN],
        max_length=ATTN_MAX_LEN,
        truncation=True,
        padding='max_length',
    )
    return {'input_ids': tokens, 'label': example['label']}


train_ds_attn = train_ds.map(preprocess_for_attention)
val_ds_attn = val_ds.map(preprocess_for_attention)



In [16]:
# TODO: create PyTorch DataLoaders
class SMSDataset(Dataset):
    def __init__(self, hf_dataset):
        self.data = hf_dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        return {
            'input_ids': torch.tensor(item['input_ids'], dtype=torch.long),
            'label': torch.tensor(item['label'], dtype=torch.long),
        }


TRAIN_DATA_FOR_LOADER = train_ds_attn # TODO: set to train_ds_attn
VAL_DATA_FOR_LOADER = val_ds_attn    # TODO: set to val_ds_attn
if TRAIN_DATA_FOR_LOADER is None or VAL_DATA_FOR_LOADER is None:
    raise ValueError('Assign TRAIN_DATA_FOR_LOADER and VAL_DATA_FOR_LOADER before creating loaders.')


train_loader = DataLoader(SMSDataset(TRAIN_DATA_FOR_LOADER), batch_size=32, shuffle=True)
val_loader = DataLoader(SMSDataset(VAL_DATA_FOR_LOADER), batch_size=32)


In [18]:
# TODO: train the custom attention classifier
vocab_size = len(tokenizer)  # TODO: derive from tokenizer; len() also counts any added tokens
embed_dim = 64
num_classes = 2   # TODO: set to 2
learning_rate = 1e-3 # TODO: set to 1e-3
if None in (vocab_size, num_classes, learning_rate):
    raise ValueError('Set vocab_size, num_classes, and learning_rate before training.')


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
attn_model = SimpleAttentionClassifier(vocab_size, embed_dim, num_classes).to(device)
optimizer = torch.optim.Adam(attn_model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

attn_model.train()
for batch in train_loader:
    inputs = batch['input_ids'].to(device)
    labels = batch['label'].to(device)
    optimizer.zero_grad()
    outputs = attn_model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

print('Custom Attention model trained on SMS dataset. Sample batch loss:', loss.item())

Custom Attention model trained on SMS dataset. Sample batch loss: 0.27820178866386414


# Part 5: Metrics & Evaluation
Load accuracy, precision, recall, and F1 from `evaluate`, then implement the shared `compute_metrics` helper.


In [19]:
# TODO: configure evaluation metrics
import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')   # TODO: 'accuracy'
precision = evaluate.load('precision')  # TODO: 'precision'
recall = evaluate.load('recall')     # TODO: 'recall'
f1 = evaluate.load('f1')         # TODO: 'f1'


def compute_metrics(pred):
    # `pred` is a (logits, labels) pair; take the highest-scoring class per example.
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'precision': precision.compute(predictions=preds, references=labels)['precision'],
        'recall': recall.compute(predictions=preds, references=labels)['recall'],
        'f1': f1.compute(predictions=preds, references=labels)['f1'],
    }



> **Learning point**
> Use the same helper dictionary pattern for both GPT-2 and the custom model so you can compare metrics side by side.


In [22]:
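# TODO: evaluate GPT-2 on the validation split
# NOTE: no training step for `model` appears in this notebook, so its classification
# head (`score.weight`) is still randomly initialized when these metrics are computed.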
gpt2_preds = []
gpt2_labels = []
model.eval()
for ex in val_tok:
    inputs = torch.tensor(ex['input_ids']).unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(inputs).logits
    pred = torch.argmax(logits, dim=-1).cpu().item()
    gpt2_preds.append(pred)
    gpt2_labels.append(ex['label'])


gpt2_metrics = {
    'accuracy': accuracy.compute(predictions=gpt2_preds, references=gpt2_labels)['accuracy'],
    'precision': precision.compute(predictions=gpt2_preds, references=gpt2_labels)['precision'],
    'recall': recall.compute(predictions=gpt2_preds, references=gpt2_labels)['recall'],
    'f1': f1.compute(predictions=gpt2_preds, references=gpt2_labels)['f1'],
}
print('GPT-2 Metrics:', gpt2_metrics)

GPT-2 Metrics: {'accuracy': 0.143, 'precision': 0.1395582329317269, 'recall': 1.0, 'f1': 0.24493392070484582}


In [24]:
# TODO: evaluate the custom attention model
attn_preds = []
attn_labels = []
attn_model.eval()
for batch in val_loader:
    inputs = batch['input_ids'].to(device)
    labels = batch['label'].to(device)
    with torch.no_grad():
        outputs = attn_model(inputs)
        preds = torch.argmax(outputs, dim=1)
    attn_preds.extend(preds.cpu().tolist())
    attn_labels.extend(labels.cpu().tolist())


attn_metrics = {
    'accuracy': accuracy.compute(predictions=attn_preds, references=attn_labels)['accuracy'],    # TODO
    'precision': precision.compute(predictions=attn_preds, references=attn_labels)['precision'],  # TODO
    'recall': recall.compute(predictions=attn_preds, references=attn_labels)['recall'],          # TODO
    'f1': f1.compute(predictions=attn_preds, references=attn_labels)['f1']                      # TODO
}
print('Attention Model Metrics:', attn_metrics)

Attention Model Metrics: {'accuracy': 0.86, 'precision': 0.4, 'recall': 0.014388489208633094, 'f1': 0.027777777777777776}
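
To see how the four metrics relate, the sketch below recomputes them from confusion-matrix counts in plain Python. The counts (2 true positives, 3 false positives, 137 false negatives, 858 true negatives) are inferred from the metrics printed above and were not logged by the notebook; the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts inferred from the printed attention-model metrics (an assumption).
attn_like = binary_metrics(tp=2, fp=3, fn=137, tn=858)
# Baseline that labels every message as ham, using the same inferred class counts.
always_ham = binary_metrics(tp=0, fp=0, fn=139, tn=861)
print(attn_like)
print(always_ham)
```

Under these inferred counts, a baseline that never predicts spam reaches slightly higher accuracy than the attention model, which is why recall and F1 are the more informative numbers on this imbalanced split.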


# Part 6: Reflection Questions
Answer directly in the markdown cells below once your experiments finish.


### 1. What are the roles of query, key, and value in the attention mechanism?
TODO: Write your explanation here.
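- **Query**: the information the current token is looking for.
- **Key**: the information each token in the sequence makes available for matching against the query.
- **Value**: the content that is returned. Each value is weighted by how well its key matches the query, so the best matches contribute the most.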
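
The lookup analogy can be sketched for a single query in plain Python. The vectors and the helper names `softmax` and `attend` are made up for illustration; the arithmetic is the scaled dot-product form used in Part 4.

```python
import math


def softmax(xs):
    m = max(xs)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)                 # how well each key matches the query
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attend([3.0, 0.0], keys, values)
print(weights)  # largest weight on the first key, which points the same way as the query
print(output)   # a blend of the values, pulled toward 10.0
```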


### 2. Why do we use a scaling factor in the dot-product attention?
TODO: Summarize the numerical stability rationale.
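It prevents the dot products from becoming too large. Dot products grow with the embedding dimension, and large scores push the softmax toward a near one-hot output where gradients become very small. Dividing by $\sqrt{d_k}$ keeps the scores in a range where training stays stable.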
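
A small simulation can make the scaling argument tangible. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not a property of the trained embeddings) and measures the spread of their dot products with and without the $1/\sqrt{d_k}$ factor. The function name is hypothetical.

```python
import math
import random
import statistics


def dot_product_std(d_k, scaled, n_samples=2000, seed=0):
    """Empirical standard deviation of q.k for unit-variance random q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scaled else dot)
    return statistics.pstdev(dots)


for d_k in (4, 64, 256):
    raw = dot_product_std(d_k, scaled=False)
    scaled = dot_product_std(d_k, scaled=True)
    print(d_k, round(raw, 2), round(scaled, 2))
```

Under this assumption the unscaled spread grows roughly like $\sqrt{d_k}$, while the scaled spread stays near 1 regardless of dimension.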

### 3. How does self-attention differ from traditional sequence models like RNNs?
TODO: Compare processing style, dependency capture, and efficiency.
Self-attention processes all the words in a sequence at once and can efficiently connect any two of them directly, even in long sentences. An RNN processes words one at a time, so it struggles to retain connections across longer prompts.

### 4. Performance analysis
TODO: Discuss which model performed better, describe trade-offs, and suggest one improvement for the custom attention classifier.
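The GPT-2 model performed better on recall (1.0 versus about 0.014), but the attention model was more accurate (0.86 versus 0.143) and more precise (0.4 versus about 0.14). Neither result is strong. GPT-2's perfect recall with low precision means it labeled nearly every message as spam, and no fine-tuning step was run on it in this notebook. The attention model's near-zero recall means it almost always predicted ham. As a trade-off, GPT-2 is a much larger model, whereas the custom classifier is lightweight. One possible improvement for the custom attention classifier is to train for more than one epoch, or to weight the loss toward the spam class.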
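
Given how rarely the attention model predicts spam (recall of about 0.014 above), one option worth trying is class weighting. The sketch below computes inverse-frequency weights in plain Python. The function name is hypothetical, and the label list is a stand-in with a spam share of roughly 14%, inferred from the validation metrics rather than measured on the training split.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Hypothetical labels: 861 ham (0) and 139 spam (1).
labels = [0] * 861 + [1] * 139
weights = balanced_class_weights(labels)
print(weights)  # the spam class gets the larger weight
```

Weights like these could be passed to `nn.CrossEntropyLoss` through its `weight` argument as a tensor ordered by class index; whether that improves recall here would need to be checked experimentally.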