# CSC 8614 - Language Models
## CI2 - Fine-tuning a language model for text classification

In this TP, you will work on fine-tuning a language model to move from text generation to text classification, specifically working on Spam Detection.

The exercise (and code) has been adapted from the book _Build a Large Language Model (From Scratch)_, by Sebastian Raschka, and its [official github repository](https://github.com/rasbt/LLMs-from-scratch).

This TP will be done in this notebook, and requires some additional files (available from the course website). You will have to fill in the missing portions of code, and run some additional experiments with different parameters.

Working on this TP:
- The easiest way is probably to work directly on the notebook, using Jupyter Notebook or Visual Studio Code. An alternative is to use Google Colab.
- You should be able to run everything on your machine, but you can connect to the GPUs if needed.

Some files are required, and are available on the course website:
- `requirements.txt`
- `gpt_utils.py`


## About the report
You will have to return this notebook (completed), as well as a mini-report (`TP2/rapport.md`).

The notebook and report shall be submitted via a GitHub repository, similarly to what you did for the first session (remember to use a different folder: `TP2`).
For the notebook, it is sufficient to complete the code and submit the final version.

For the mini-report, you have to answer the questions asked in this notebook, and discuss some of your findings as requested.
As for the first session:
- "You must include: short answers, observed results (copies of outputs), the requested screenshots, and a brief interpretation."
- "Do not paste entire pages: be concise and select the relevant elements."

Reproducibility:
- Fix a random seed and write it in the report.
- Indicate in the report the specific Python version, the OS, and the library versions.
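
A minimal sketch of how these two items could be collected in one cell. The seed value and the list of package names are assumptions for illustration; adapt them to what you actually use. The `torch` import is wrapped in `try/except` only so that the snippet also runs where PyTorch is not installed.

```python
import platform
import random
import sys
from importlib import metadata

SEED = 123  # hypothetical value; report whichever seed you actually use

# Seed Python's own generator, and PyTorch's if it is available.
random.seed(SEED)
try:
    import torch
    torch.manual_seed(SEED)  # seeds PyTorch's global random number generator
except ImportError:
    pass  # torch is not installed in this environment


def library_versions(names):
    """Return {package name: installed version, or 'not installed'}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


print(f"Seed: {SEED}")
print(f"Python: {sys.version.split()[0]}")
print(f"OS: {platform.platform()}")
for name, version in library_versions(["torch", "tiktoken", "pandas"]).items():
    print(f"{name}: {version}")
```

The printed lines can be copied as-is into the header of the report.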

**Question 1**: In `TP2/rapport.md`, add a short header right away (a few lines) containing: (i) your last name and first name, (ii) the command you used to install/activate the environment, (iii) the versions (Python + main libraries).

Then, as you progress through the TP, add sections/headings as you see fit, as long as your answers and your proof of execution can be found easily.

In [1]:
# [Instructor code: install requirements]
!pip install -r requirements.txt



## Preparing the model

In [4]:
# --- [INSTRUCTOR CODE: load the model weights into memory] ---
import torch
import tiktoken
from gpt_utils import GPTModel, download_and_load_gpt2, load_weights_into_gpt

# Download the GPT-2 weights (124M-parameter version); this helper from gpt_utils handles the download
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2_weights")
print("Weights downloaded and loaded into memory.")

checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 19.0kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 1.17MiB/s]
hparams.json: 100%|██████████| 90.0/90.0 [00:00<00:00, 19.4kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 498M/498M [04:16<00:00, 1.94MiB/s]  
model.ckpt.index: 100%|██████████| 5.21k/5.21k [00:00<00:00, 3.02MiB/s]
model.ckpt.meta: 100%|██████████| 471k/471k [00:00<00:00, 753kiB/s] 
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 775kiB/s] 


Weights downloaded and loaded into memory.


The `settings` and `params` objects returned by `download_and_load_gpt2` come from the GPT-2 checkpoint made publicly available by OpenAI.

**Question 2**: What type is the object `settings`, and what is its structure (e.g. if it is a list, its length; if a dictionary, its keys, etc.)?

**Question 3**: What type is the object `params`, and what is its structure?

In [5]:
# Inspect `settings`
print(f"Type of 'settings': {type(settings)}")
print(f"Structure (dictionary keys): {settings.keys()}")
print(f"Values: {settings}")

print("-" * 30)

# Inspect `params`
print(f"Type of 'params': {type(params)}")
print(f"Number of top-level keys: {len(params)}")
print(f"Top-level keys: {params.keys()}")
# Example shape for one specific layer
print(f"Shape of the token embedding weights (wte): {params['wte'].shape}")

Type of 'settings': <class 'dict'>
Structure (dictionary keys): dict_keys(['n_vocab', 'n_ctx', 'n_embd', 'n_head', 'n_layer'])
Values: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
------------------------------
Type of 'params': <class 'dict'>
Number of top-level keys: 5
Top-level keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])
Shape of the token embedding weights (wte): (50257, 768)


Look at the `GPTModel` in the file `gpt_utils.py`. In the `__init__` method, we have to pass a config (parameter `cfg`). 

**Question 4:** 
Analyse the `__init__` method and check the structure required for the `cfg` parameter. Is the `settings` variable we obtained in the right format? If not, map the variable `settings` to a variable `model_config` with the right structure.

In [6]:
# Configure the model, mapping OpenAI specific keys to our model's keys
model_config = {
    "vocab_size": settings["n_vocab"],     # Mapped from n_vocab
    "context_length": settings["n_ctx"],   # Mapped from n_ctx
    "emb_dim": settings["n_embd"],         # Mapped from n_embd
    "n_heads": settings["n_head"],         # Mapped from n_head
    "n_layers": settings["n_layer"],       # Mapped from n_layer
    "drop_rate": 0.1,                      # Required by GPTModel
    "qkv_bias": True,                      # Required by GPTModel
}

# Check the resulting dictionary
print("model_config structure:", model_config)

model_config structure: {'vocab_size': 50257, 'context_length': 1024, 'emb_dim': 768, 'n_heads': 12, 'n_layers': 12, 'drop_rate': 0.1, 'qkv_bias': True}


In [7]:
model = GPTModel(model_config)

# Load the pre-trained weights
load_weights_into_gpt(model, params)
model.eval() 

print("GPT-2 Model Loaded and Configured successfully!")

GPT-2 Model Loaded and Configured successfully!


## Preparing the data

Context from the lecture: the raw data is just text messages, but the model needs numbers (token IDs).

We also need to pad the messages so that all sequences in a batch have the same length.

We will use a `SpamDataset` class (provided below) to tokenize the text.
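
As a rough illustration of the padding and truncation step, here is a small plain-Python sketch. The helper name `pad_or_truncate` and the token IDs are made up for the example; the default `max_length=120` and `pad_token_id=50256` mirror the defaults of the `SpamDataset` class.

```python
def pad_or_truncate(token_ids, max_length=120, pad_token_id=50256):
    """Clip a list of token IDs to max_length, then right-pad it to max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_token_id] * (max_length - len(clipped))


# Hypothetical token IDs, with a small max_length so the effect is visible.
short_message = pad_or_truncate([10, 20, 30], max_length=5)
long_message = pad_or_truncate(list(range(8)), max_length=5)

print(short_message)  # [10, 20, 30, 50256, 50256]
print(long_message)   # [0, 1, 2, 3, 4]
```

Because every sequence ends up with the same length, a batch can be stacked into a single tensor of shape `[batch_size, max_length]`.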

In [8]:
# --- [INSTRUCTOR CODE: Run this cell to define the Dataset Class] ---
from torch.utils.data import Dataset
import pandas as pd
import urllib.request
import zipfile
import os

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=120, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.pad_token_id = pad_token_id
        # Encode labels: "spam" -> 1, "ham" -> 0
        self.data["label_encoded"] = self.data["Label"].map({"spam": 1, "ham": 0})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]["Text"]
        label = self.data.iloc[idx]["label_encoded"]
        # Tokenize
        encoded = self.tokenizer.encode(text, allowed_special={'<|endoftext|>'})       
        # Truncate if too long
        encoded = encoded[:self.max_length]
        # Pad if too short
        pad_len = self.max_length - len(encoded)
        encoded += [self.pad_token_id] * pad_len
        return torch.tensor(encoded, dtype=torch.long), torch.tensor(label, dtype=torch.long)

# Download the dataset zip file
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extract_path = "sms_spam_collection"
data_file_path = os.path.join(extract_path, "SMSSpamCollection")
if not os.path.exists(zip_path):
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, zip_path)
    print("Download complete.")
# Unzip
if not os.path.exists(extract_path):
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_path)
# Read the TSV file
df = pd.read_csv(
    data_file_path, 
    sep="\t", 
    header=None, 
    names=["Label", "Text"]
)
print(f"Total samples loaded: {len(df)}")

# Create the train/test split (80% train / 20% test)
df = df.sample(frac=1, random_state=123).reset_index(drop=True)
# Split index
split_idx = int(0.8 * len(df))

# TODO: if needed (for performance reasons), you can come back here and reduce the size of the training set.
train_df = df.iloc[:split_idx]  # [:2000]  # Re-add this slice to use only 2000 training samples
test_df = df.iloc[split_idx:]

# Save as CSVs, so the SpamDataset class can read them.
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print("Created 'train.csv' and 'test.csv' successfully!")
print(f"Train size: {len(train_df)}")
print(f"Test size: {len(test_df)}")

Downloading dataset...
Download complete.
Total samples loaded: 5572
Created 'train.csv' and 'test.csv' successfully!
Train size: 4457
Test size: 1115


**Question 5.1**: In the cell above, why did we do `df = df.sample(frac=1, random_state=123)` when creating the train/test split?

**Question 5.2**: Analyse the datasets: what is the distribution of the two classes in the train set? Are they balanced or unbalanced? If they are unbalanced, could this cause issues when fine-tuning the model?

In [9]:
# TODO: Your code here.

distribution = train_df['Label'].value_counts()
proportions = train_df['Label'].value_counts(normalize=True)

print("Class distribution (train):")
print(distribution)
print("\nProportions:")
print(proportions)

Class distribution (train):
Label
ham     3860
spam     597
Name: count, dtype: int64

Proportions:
Label
ham     0.866053
spam    0.133947
Name: proportion, dtype: float64
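
To put these counts in perspective, the sketch below computes the accuracy of a hypothetical classifier that always answers "ham". It is plain arithmetic on the counts printed above, not something the notebook's model does by construction.

```python
# Class counts taken from the train-set output above.
ham_count, spam_count = 3860, 597
total = ham_count + spam_count

# A classifier that always predicts "ham" gets every ham right and every spam wrong.
global_acc = ham_count / total
spam_acc = 0 / spam_count

print(f"Always-ham baseline: global accuracy {global_acc:.4f}, spam accuracy {spam_acc:.4f}")
```

A high global accuracy can therefore coexist with a spam accuracy of zero, which is one reason to look at per-class accuracy and not only at the global figure.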


**Question 6**: Create the dataloaders for training and test.

In [12]:
# TODO: add any imports which are needed
from torch.utils.data import DataLoader
import torch

# Create the Tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Instantiate the Dataset
train_dataset = SpamDataset("train.csv", tokenizer)
test_dataset = SpamDataset("test.csv", tokenizer)

# --- TODO: Create DataLoaders ---
# 1. Create a train_loader with batch_size=16 and shuffle=True
train_loader = DataLoader(
    dataset=train_dataset, 
    batch_size=16, 
    shuffle=True, 
    drop_last=True
)

# 2. Create a test_loader with batch_size=16 and shuffle=False
test_loader = DataLoader(
    dataset=test_dataset, 
    batch_size=16, 
    shuffle=False, 
    drop_last=False
)

In [13]:
# Check your work
for input_batch, target_batch in train_loader:
    print("Input batch shape:", input_batch.shape) # Should be [16, 120] (unless you use batch_size != 16)
    print("Target batch shape:", target_batch.shape) # Should be [16]
    break

Input batch shape: torch.Size([16, 120])
Target batch shape: torch.Size([16])


**Question 7**: Given the batch size and the training set size, how many batches will you have in total? If you reduced the training set because of performance constraints, please also report the size of the subsampled training data.

In [14]:
# TODO: add your code.

# 1. Size of the training dataset
train_size = len(train_dataset)

# 2. Number of batches in the DataLoader
num_batches = len(train_loader)

# 3. Manual calculation to confirm
batch_size = train_loader.batch_size
theoretical_batches = train_size // batch_size  # Integer division (mirrors drop_last=True)

print(f"Training data size (after any subsampling): {train_size}")
print(f"Batch size used: {batch_size}")
print(f"Total number of batches reported by the DataLoader: {num_batches}")
print(f"Theoretical check ({train_size} // {batch_size}): {theoretical_batches}")

Training data size (after any subsampling): 4457
Batch size used: 16
Total number of batches reported by the DataLoader: 278
Theoretical check (4457 // 16): 278


## Fine-tuning

**Context**: GPT-2 was trained to predict the next token (output size 50,257, the vocabulary size). We want to predict one of two classes (output size 2), so we must replace the final layer.
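
To make "output size 2" concrete, here is a small plain-Python sketch of how two logits can be turned into a class prediction. The logit values are invented for the example; the label mapping (0 = ham, 1 = spam) follows the `SpamDataset` class. The notebook's own code takes the argmax of the logits directly, which gives the same class because softmax does not change their ordering.

```python
import math

LABELS = {0: "ham", 1: "spam"}  # same encoding as in SpamDataset


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (label, probabilities) for one pair of class logits."""
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best_index], probs


label, probs = predict([-1.2, 2.3])  # hypothetical logits from a 2-unit head
print(label, [round(p, 3) for p in probs])
```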

**Question 8**:

**8.1**: In the cell below, define the number of output classes (`num_classes`) for the new spam detection task.

**8.2**: Also, print the original and updated output heads (hint: `out_head` from `GPTModel`).

**8.3**: Why do we freeze the internal layers with `param.requires_grad = False`?

In [15]:
# --- TODO: Complete Question 8 ---

# 1. Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# 8.2: Print the original output head (50257 output units)
print(f"Original output head: {model.out_head}")

# 8.1: Define the number of classes (spam vs. ham = 2 classes)
num_classes = 2

# 8.2: Replace the head with a new linear layer (768 -> 2)
# model_config["emb_dim"] is equal to 768
model.out_head = torch.nn.Linear(model_config["emb_dim"], num_classes)

# Make only the new head and the normalization layer (norm2) of the last block trainable
for param in model.out_head.parameters():
    param.requires_grad = True

for param in model.trf_blocks[-1].norm2.parameters():
    param.requires_grad = True

# 8.2: Print the new output head
print(f"New output head: {model.out_head}")

Original output head: Linear(in_features=768, out_features=50257, bias=False)
New output head: Linear(in_features=768, out_features=2, bias=True)


You now have to **finalise the code for the training loop** (see individual steps below).

In the first cell below you can find the code to move the model to GPU (if available), define the optimizer, and calculate the accuracy. The following cell contains the code for the training (fine-tuning) loop.

You will have to complete the code of the training loop, by answering the following questions:

**Question 9.1**: Reset the gradients of the `optimizer` ([hint](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html)).

**Question 9.2**: Compute cross-entropy loss ([hint](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html)).

**Question 9.3**: Add code for the backward pass, to compute the gradients ([hint](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.backward.html)).

**Question 9.4**: Add code for the optimizer step, to update the weights ([hint](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html))

**Question 9.5**: Add code to calculate the accuracy on train and test (hint: you can use the `calc_accuracy` function).

**Note about the speed**: On my laptop's CPU 1 epoch with the full training dataset (~4400 samples, batch_size=16) took ~20 minutes; 1 epoch with a train set of 2000 samples (batch_size=16) took ~12 minutes. 

To iterate more quickly, you could:
- i) set `num_epochs = 1` (only at the beginning), just to make sure that the code works;
- ii) increase `batch_size` to 32 or 64 (but watch out for possible memory issues);
- iii) reduce the size of the training dataset, by going back to the *Preparing the data* section and changing the line `train_df = df.iloc[:split_idx]` to `train_df = df.iloc[:split_idx][:2000]` or similar. Be careful: if you reduce the training data too much, the model will not have enough data for fine-tuning;
- iv) use a GPU, which would be much quicker (a few minutes on the whole training data).
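
To see how options ii) and iii) change the number of optimizer steps per epoch, here is a small arithmetic sketch. The helper `batches_per_epoch` is hypothetical (it is not part of `gpt_utils.py`); it only reproduces the counting rule of a loader with or without `drop_last`.

```python
def batches_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of batches one pass over the data yields."""
    full_batches, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full_batches
    return full_batches + 1  # the final, smaller batch is kept


# Full training set (4457 samples) versus a 2000-sample subset.
for num_samples in (4457, 2000):
    for batch_size in (16, 32, 64):
        steps = batches_per_epoch(num_samples, batch_size)
        print(f"{num_samples} samples, batch_size={batch_size}: {steps} batches")
```

Fewer batches per epoch also means fewer weight updates per epoch, so changing the batch size or the training set size may call for revisiting the number of epochs.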


In [16]:
# [--- INSTRUCTOR CODE ---]

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Measure imbalance
count_ham = len(train_df[train_df['Label']=='ham'])
count_spam = len(train_df[train_df['Label']=='spam'])

# Calculate weight: penalize missing the minority class (Spam) more
# Weight = Count(Majority) / Count(Minority)
pos_weight = count_ham / count_spam  # approx 6.46 (for full training dataset)
class_weights = torch.tensor([1.0, pos_weight]).to(device)
print(f"Using class weights: {class_weights}")

# Define Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

# Calculate Accuracy Helper Function
def calc_accuracy(loader, model, device):
    correct, total = 0, 0
    # Track spam specifically
    spam_correct, spam_total = 0, 0
    model.eval()
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)[:, -1, :]
            predicted = torch.argmax(logits, dim=-1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            # Filter for Spam (Label 1)
            spam_mask = (labels == 1)
            spam_total += spam_mask.sum().item()
            spam_correct += (predicted[spam_mask] == labels[spam_mask]).sum().item()
    # Avoid division by zero
    spam_acc = spam_correct / spam_total if spam_total > 0 else 0.0
    global_acc = correct / total
    return global_acc, spam_acc


Using class weights: tensor([1.0000, 6.4657])
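
To illustrate what these class weights do to the loss, here is a plain-Python sketch of a class-weighted cross-entropy. It follows the weighted-mean formula documented for `torch.nn.functional.cross_entropy` with a `weight` argument and the default mean reduction, but it is a re-implementation for illustration, not the library code. The logits are invented.

```python
import math


def weighted_cross_entropy(logits_batch, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    weighted_sum, weight_total = 0.0, 0.0
    for logits, target in zip(logits_batch, targets):
        largest = max(logits)
        log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
        nll = log_sum_exp - logits[target]  # -log softmax(logits)[target]
        weighted_sum += class_weights[target] * nll
        weight_total += class_weights[target]
    return weighted_sum / weight_total


class_weights = [1.0, 3860 / 597]  # ham weight, spam weight (about 6.47)
logits = [[2.0, 0.0], [2.0, 0.0]]  # hypothetical: the model leans "ham" twice
targets = [0, 1]                   # the second message is actually spam

loss_plain = weighted_cross_entropy(logits, targets, [1.0, 1.0])
loss_weighted = weighted_cross_entropy(logits, targets, class_weights)
print(f"unweighted: {loss_plain:.4f}, weighted: {loss_weighted:.4f}")
```

With the weights, the misclassified spam message dominates the average, so the loss is larger than in the unweighted case for the same predictions.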


In [17]:
import torch.nn.functional as F

# Number of epochs (1 for a quick check, 3 for the reported results)
num_epochs = 3

for epoch in range(num_epochs):
    model.train()  # Training mode
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # 9.1. Reset gradients: zero them so that gradients from previous batches do not accumulate
        optimizer.zero_grad()

        # Forward pass: use the logits of the LAST token in the sequence
        logits = model(inputs)[:, -1, :]

        # 9.2. Compute cross-entropy loss, using the class weights
        loss = F.cross_entropy(logits, targets, weight=class_weights)

        # 9.3. Backward pass: compute the gradients by backpropagation
        loss.backward()

        # 9.4. Optimizer step: update the model weights
        optimizer.step()

        if batch_idx % 10 == 0:
            print(f"Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item():.4f}")

    # 9.5. Compute accuracy: evaluate at the end of each epoch
    train_acc, train_spam_acc = calc_accuracy(train_loader, model, device)
    test_acc, test_spam_acc = calc_accuracy(test_loader, model, device)

    print(f"Epoch {epoch+1}: Train Acc: {train_acc*100:.2f}% (Spam: {train_spam_acc*100:.2f}%) | "
          f"Test Acc: {test_acc*100:.2f}% (Spam: {test_spam_acc*100:.2f}%)")

Epoch 1, Batch 0, Loss: 1.3201
Epoch 1, Batch 10, Loss: 1.3599
Epoch 1, Batch 20, Loss: 0.6295
Epoch 1, Batch 30, Loss: 1.3252
Epoch 1, Batch 40, Loss: 0.3653
Epoch 1, Batch 50, Loss: 0.4632
Epoch 1, Batch 60, Loss: 0.4864
Epoch 1, Batch 70, Loss: 1.0512
Epoch 1, Batch 80, Loss: 0.4719
Epoch 1, Batch 90, Loss: 0.7207
Epoch 1, Batch 100, Loss: 0.6826
Epoch 1, Batch 110, Loss: 0.4762
Epoch 1, Batch 120, Loss: 0.6782
Epoch 1, Batch 130, Loss: 0.7033
Epoch 1, Batch 140, Loss: 0.5628
Epoch 1, Batch 150, Loss: 0.6409
Epoch 1, Batch 160, Loss: 0.7631
Epoch 1, Batch 170, Loss: 0.6190
Epoch 1, Batch 180, Loss: 0.9017
Epoch 1, Batch 190, Loss: 0.6654
Epoch 1, Batch 200, Loss: 0.7522
Epoch 1, Batch 210, Loss: 0.7225
Epoch 1, Batch 220, Loss: 0.6773
Epoch 1, Batch 230, Loss: 0.8247
Epoch 1, Batch 240, Loss: 0.7210
Epoch 1, Batch 250, Loss: 0.5224
Epoch 1, Batch 260, Loss: 0.7781
Epoch 1, Batch 270, Loss: 0.5193
Epoch 1: Train Acc: 86.62% (Spam: 0.00%) | Test Acc: 86.55% (Spam: 0.00%)
Epoch 2, Batc

**Question 10**: 

Now run the cell above. You should see how the training loss changes after each batch (and epoch).
Describe this trend: what do you see? Is the model learning?

**Question 11 (optional)**: Change the number of epochs and/or the learning rate and/or the size of the training data, and investigate how the loss/accuracy of the model changes. You can do this by editing and re-running the cells above, or by creating new cells below.

In [None]:
# TODO if needed, your code for the additional analysis can go here.

**Question 12 (optional)**: Now test the model *on your own text*.

In [None]:
def classify_text(text, model, tokenizer, device, max_length=120, pad_token_id=50256):
    model.eval()
    
    # Encode the text
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    
    # Pad/Truncate 
    # (Matches the logic in SpamDataset so the model sees familiar input structures)
    encoded = encoded[:max_length]
    pad_len = max_length - len(encoded)
    encoded += [pad_token_id] * pad_len

    # Create tensor and add batch dimension
    encoded_tensor = torch.tensor(encoded).unsqueeze(0).to(device) # Shape: [1, max_length]

    # Get prediction
    with torch.no_grad():
        logits = model(encoded_tensor)[:, -1, :] # Logits for the last token
        predicted_label = torch.argmax(logits, dim=-1).item()

    return "SPAM" if predicted_label == 1 else "NOT SPAM"

# --- TODO: Test the model ---
# Create 2 strings: one clearly spam, one normal.
text_1 = "<YOUR TEXT>"  # YOUR TEXT HERE
text_2 = "<YOUR TEXT>"  # YOUR TEXT HERE

print(f"Text 1: {text_1} -> {classify_text(text_1, model, tokenizer, device)}")
print(f"Text 2: {text_2} -> {classify_text(text_2, model, tokenizer, device)}")

---