
Fine-Tuning GPT-2 for SMS Spam

Last Updated: May 10th, 2025

Daily Challenge: Fine-Tuning GPT-2 for SMS Spam Classification (Legacy transformers API)


In this daily challenge, you’ll fine-tune a pre-trained GPT-2 model to classify SMS messages as spam or ham (not spam). We’ll work through loading the dataset, inspecting its schema, tokenizing examples, adapting to an older transformers version, and running training and evaluation with the classic do_train/do_eval flags.


👩‍🏫 👩🏿‍🏫 What You’ll learn

    How to load and explore a custom text-classification dataset
    Inspecting and aligning column names for tokenization
    Tokenizing text for GPT-2 (with its peculiar padding setup)
    Initializing GPT2ForSequenceClassification
    Defining and computing multiple evaluation metrics
    Configuring TrainingArguments for transformers < 4.4 (using do_train, eval_steps, etc.)
    Running fine-tuning with Trainer and interpreting results
    Common pitfalls when using legacy APIs


🛠️ What you will create

By the end of this challenge, you will have built:

    A tokenized SMS dataset compatible with GPT-2’s requirements, including custom padding and truncation.
    A fine-tuned GPT2ForSequenceClassification model that can accurately label incoming SMS messages as spam or ham.
    A complete training pipeline using the legacy do_train/do_eval flags in TrainingArguments, with periodic checkpointing, logging, and evaluation.
    A set of evaluation metrics (accuracy, precision, recall, F1) computed at each validation step and summarized after training.
    A reusable Jupyter notebook that ties everything together—from dataset loading and inspection, through model initialization and tokenization, to training, evaluation, and results interpretation.


💼 Prerequisites

    Python 3.7+
    Installed packages: datasets, evaluate, transformers>=4.0.0,<4.4.0
    Basic familiarity with Hugging Face’s datasets and transformers libraries
    GitHub or Colab access for executing the notebook
    A Hugging Face API token and a Weights & Biases API key. For instructions on how to get them, click here.


Task

We will guide you through fine-tuning a GPT-2 model to classify SMS messages as spam or ham using an older version of transformers (<4.4). Follow the steps below and complete each “TODO” in the code.

1. Setup : Install required packages datasets, evaluate and transformers[sentencepiece].

%pip install --quiet datasets evaluate transformers[sentencepiece]


2. Load & Inspect Dataset :

from datasets import TODO #import load_dataset
TODO # import pandas

# Load the UCI SMS Spam dataset (sms_spam) from Hugging Face hub
raw = TODO

# We'll use 4,000 for train, 1,000 for validation
train_ds = TODO
val_ds   = TODO

TODO  # print the features of the train dataset. It should show 'sms' and 'label'


3. Tokenization :

from transformers import TODO # import GPT2Tokenizer


model_name = TODO  # checkpoint name; we will use GPT-2
tokenizer  = TODO  # load the tokenizer for model_name
# GPT-2 has no pad token by default—set it to eos
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    # returns input_ids, attention_mask; keep max_length small for SMS
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=64
    )

train_tok = TODO #apply the tokenization by loading the subset using .map function
val_tok   = TODO #apply the tokenization by loading the subset using .map function


4. Model Initialization

import torch
TODO  #import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained( # Load GPT-2 with sequence classification head
    model_name,
    num_labels=TODO,           # spam vs. ham
    pad_token_id=tokenizer.eos_token_id
)


5. Metrics Definition

import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
precision = TODO  # load the metric the same way as accuracy, but for precision
recall    = TODO  # load the metric the same way as accuracy, but for recall
f1        = TODO  # load the metric the same way as accuracy, but for F1

def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"], 
        "precision": TODO,  # same call as for accuracy, but with the precision metric
        "recall":    TODO,  # same call as for accuracy, but with the recall metric
        "f1":        TODO   # same call as for accuracy, but with the F1 metric
    }


    In an imbalanced dataset like SMS spam (often more “ham” than “spam”), why is it important to track precision and recall alongside accuracy?
    How would you interpret a model that achieves high accuracy but low recall on the spam class?


6. TrainingArguments Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=TODO,
    do_train=True,                 # turn on training
    do_eval=True,                  # turn on evaluation
    eval_steps=TODO,                # run .evaluate() every 500 steps
    save_steps=TODO,                # save a checkpoint every 500 steps
    logging_dir="./logs",
    logging_steps=TODO,             # log metrics every 500 steps

    per_device_train_batch_size=TODO,
    per_device_eval_batch_size=TODO,
    num_train_epochs=TODO,
    learning_rate=TODO,
    weight_decay=TODO,

    report_to="none",              # disable integrations (None falls back to all installed ones in many versions)
    save_total_limit=1,            # only keep last checkpoint
)


    What effect does weight_decay have during fine-tuning? When might you choose a higher or lower value?


7. Train & Evaluate

# Train
from transformers import Trainer
# if Weights & Biases logging is enabled, have your wandb API key ready to paste when prompted
trainer = Trainer(
    model=TODO,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    compute_metrics=compute_metrics,
)
trainer.train()

#Evaluate
metrics = TODO
print(metrics)
# Expect something like: {"eval_loss": ..., "eval_accuracy": 0.98, ...}


    Interpret your results.


In [2]:
from datasets import load_dataset
import transformers as tf_transformers  # alias used for AutoTokenizer in a later cell
import pandas as pd
import torch
import evaluate
import numpy as np
from transformers import GPT2ForSequenceClassification, TrainingArguments, Trainer



In [3]:
# Load the UCI SMS Spam dataset (sms_spam) from Hugging Face hub
raw = load_dataset("sms_spam", split="train")

# We'll use 4,000 for train, 1,000 for validation.
# Shuffle once so both splits come from the same permutation and cannot overlap.
shuffled = raw.shuffle(seed=42)
train_ds = shuffled.select(range(4000))
val_ds = shuffled.select(range(4000, 5000))

# print the features of the train dataset. It should show 'sms' and 'label'
print(train_ds.features)
# Print the first 5 rows of the train dataset
print(train_ds[:5])

{'sms': Value('string'), 'label': ClassLabel(names=['ham', 'spam'])}
{'sms': ['sports fans - get the latest sports news str* 2 ur mobile 1 wk FREE PLUS a FREE TONE Txt SPORT ON to 8007 www.getzed.co.uk 0870141701216+ norm 4txt/120p \n', "It's justbeen overa week since we broke up and already our brains are going to mush!\n", 'Not directly behind... Abt 4 rows behind ü...\n', 'Haha, my legs and neck are killing me and my amigos are hoping to end the night with a burn, think I could swing by in like an hour?\n', 'Me too baby! I promise to treat you well! I bet you will take good care of me...\n'], 'label': [1, 0, 0, 0, 0]}


In [4]:
# load the tokenizer; we will use GPT-2
model_name = "gpt2"  # or "gpt2-medium", "gpt2-large", etc.
tokenizer  = tf_transformers.AutoTokenizer.from_pretrained(model_name)
# GPT-2 has no pad token by default—set it to eos
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    # returns input_ids, attention_mask; keep max_length small for SMS
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=64
    )

# tokenize_fn also returns the attention_mask: by default, the Hugging Face tokenizer
# adds both "input_ids" and "attention_mask" to its output dictionary.
# After .map and the rename below, each example in train_tok and val_tok has the fields
# "input_ids", "attention_mask" and "labels". The print at the end of this cell checks that.

# apply the tokenization to each subset using the .map function
train_tok = train_ds.map(tokenize_fn, batched=True)
val_tok   = val_ds.map(tokenize_fn, batched=True)

train_tok = train_tok.rename_column("label", "labels")
val_tok   = val_tok.rename_column("label", "labels")
print(train_tok[0].keys())  # Prints: dict_keys(['sms', 'labels', 'input_ids', 'attention_mask'])

dict_keys(['sms', 'labels', 'input_ids', 'attention_mask'])
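
To make the padding step concrete, here is a minimal pure-Python sketch of what `padding="max_length"` combined with `truncation=True` does to a single sequence of token ids. The helper `pad_or_truncate` and the example ids are hypothetical and only illustrate the idea; this is not the tokenizer's actual implementation. The value 50256 is GPT-2's end-of-text id, which the cell above reuses as the pad token, and the sketch assumes right-side padding.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop extra tokens
    n_pad = max_length - len(kept)                   # number of pad positions needed
    input_ids = kept + [pad_id] * n_pad              # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask


EOS_ID = 50256  # GPT-2's end-of-text id, reused here as the pad id

# A short message gets padded, a long one gets truncated (hypothetical ids).
short_ids, short_mask = pad_or_truncate([101, 102, 103], max_length=5, pad_id=EOS_ID)
long_ids, long_mask = pad_or_truncate(list(range(8)), max_length=5, pad_id=EOS_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because the pad id equals the end-of-text id, the attention mask is what tells the model which positions are real text and which are filler.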


In [12]:
#import GPT2ForSequenceClassification
model = GPT2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # spam vs ham
    pad_token_id=tokenizer.eos_token_id
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:


accuracy  = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall    = evaluate.load("recall")
f1        = evaluate.load("f1")


def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"], 
        "precision": precision.compute(predictions=preds, references=labels, average="binary")["precision"],
        "recall":   recall.compute(predictions=preds, references=labels, average="binary")["recall"],
        "f1":        f1.compute(predictions=preds, references=labels, average="binary")["f1"]
    }


# In an imbalanced dataset like SMS spam (often more “ham” than “spam”), why is it important to track precision and recall alongside accuracy?
# How would you interpret a model that achieves high accuracy but low recall on the spam class?
#
# In an imbalanced dataset like SMS spam (where "ham" is usually far more frequent than "spam"),
# it is important to track precision and recall in addition to accuracy.
# Accuracy can be misleading: a model that always predicts "ham" gets high accuracy when spam is rare,
# yet it never detects any spam (zero recall). Precision is the share of messages predicted as spam
# that really are spam, while recall measures how many of the actual spam messages the model finds.
# A model with high accuracy but low recall on the spam class misses most spam,
# which is a serious problem for an anti-spam application.
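
The point about misleading accuracy can be checked with a few lines of plain Python. This is a minimal sketch with hypothetical labels (13 spam out of 100 messages) and a hypothetical helper `binary_metrics`; it does not use the `evaluate` library and is only meant to show the arithmetic.

```python
def binary_metrics(preds, labels, positive=1):
    """Accuracy, precision and recall for the positive (spam) class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == positive and y == positive)
    fp = sum(1 for p, y in zip(preds, labels) if p == positive and y != positive)
    fn = sum(1 for p, y in zip(preds, labels) if p != positive and y == positive)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return {
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # no spam predicted -> 0
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


labels = [1] * 13 + [0] * 87   # hypothetical data: 13% spam, 87% ham
always_ham = [0] * 100         # a "model" that never predicts spam

baseline = binary_metrics(always_ham, labels)
print(baseline)
```

The always-ham baseline scores 0.87 accuracy while catching no spam at all, which is why recall on the spam class has to be read next to accuracy.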

In [14]:
training_args = TrainingArguments(
    output_dir="./gpt2-sms-spam",
    do_train=True,                 # turn on training
    do_eval=True,                  # turn on evaluation
    eval_steps=500,                # evaluate every 500 steps (only applies when step-based evaluation is enabled)
    save_steps=500,                # save a checkpoint every 500 steps
    logging_dir="./logs",
    logging_steps=500,             # log metrics every 500 steps

    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.01,

    report_to="none",              # disable integrations (None falls back to all installed ones in many versions)
    save_total_limit=1,            # only keep last checkpoint
)
# What effect does weight_decay have during fine-tuning? When might you choose a higher or lower value?
# Weight decay is a regularization technique that penalizes large weights in the model in order to limit overfitting.
# A higher weight_decay value increases regularization, which can help when the model fits the training data too closely (overfits).
# Conversely, a lower value reduces regularization, which can help when the model underfits or when the data is already well regularized.
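
As a rough illustration of the shrinking effect, the sketch below applies a decoupled weight-decay update to a single scalar weight. It uses a plain SGD-style step for simplicity, which is an assumption made for readability: `Trainer` defaults to an AdamW-style optimizer, and the function name `sgd_step` is hypothetical. The gradient is set to zero so that only the decay term acts.

```python
def sgd_step(weight, grad, lr, weight_decay):
    """One update: gradient step plus a decoupled pull of the weight toward zero."""
    return weight - lr * grad - lr * weight_decay * weight


w_plain = 1.0   # trained without weight decay
w_decay = 1.0   # trained with weight_decay=0.01

for _ in range(1000):
    # grad=0.0 isolates the effect of the decay term
    w_plain = sgd_step(w_plain, grad=0.0, lr=0.1, weight_decay=0.0)
    w_decay = sgd_step(w_decay, grad=0.0, lr=0.1, weight_decay=0.01)

print(w_plain, w_decay)
```

Each step multiplies the decayed weight by `1 - lr * weight_decay`, so weights that the loss gradient does not actively support drift toward zero; a larger `weight_decay` makes that pull stronger.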


In [15]:
# Train
# if Weights & Biases logging is enabled, have your wandb API key ready to paste when prompted
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    compute_metrics=compute_metrics,
)
trainer.train()

#Evaluate
metrics = trainer.evaluate()
print(metrics)
# Expect something like: {"eval_loss": ..., "eval_accuracy": 0.98, ...}

Step,Training Loss
500,0.4362
1000,0.4166
1500,0.4036
2000,0.4298
2500,0.4024
3000,0.4265
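
The step numbers in this log follow from the configuration. The sketch below reproduces the arithmetic, assuming a single device and no gradient accumulation (neither is stated in the notebook); the helper name `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 4,000 training examples, per_device_train_batch_size=4, num_train_epochs=3
steps = total_training_steps(num_examples=4000, batch_size=4, num_epochs=3)

# logging_steps=500 produces one log line at every multiple of 500
logged_at = list(range(500, steps + 1, 500))

print(steps, logged_at)
```

With 1,000 steps per epoch, each pair of log lines covers one pass over the training set.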


{'eval_loss': nan, 'eval_accuracy': 0.874, 'eval_precision': 0.125, 'eval_recall': 0.008333333333333333, 'eval_f1': 0.015625, 'eval_runtime': 140.7296, 'eval_samples_per_second': 7.106, 'eval_steps_per_second': 0.888, 'epoch': 3.0}


These results show a **high accuracy (0.874)** but very low **precision (0.125)**, **recall (0.008)** and **F1-score (0.016)**. This means the model almost always predicts the majority class ("ham", not spam), which yields good accuracy on an imbalanced dataset. However, it detects very few spam messages (very low recall) and is wrong most of the time when it does predict "spam" (low precision). The training loss also stays flat, between roughly 0.40 and 0.44, across all 3,000 steps, and `eval_loss` is `nan`.
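
The reported metrics can be traced back to a confusion matrix. The counts below are not printed by the notebook; they are inferred here under the assumption that the validation set has exactly 1,000 examples (the size selected earlier), and they reproduce the reported accuracy, precision, recall and F1.

```python
# Inferred counts (assumption: 1,000 validation examples, spam is the positive class)
tp, fp, fn, tn = 1, 7, 119, 873
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial baseline that labels every message as ham
always_ham_accuracy = (tn + fp) / total

print(total, accuracy, precision, recall, f1, always_ham_accuracy)
```

Under these counts the model flags only 8 messages as spam and gets 1 of them right, while 119 of the 120 spam messages slip through. A baseline that always answers "ham" would reach 0.88 accuracy on the same split, slightly above the fine-tuned model.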

**Conclusion:**  
The model is not useful for detecting spam. Likely next steps:
- Rebalance the classes (oversampling/undersampling, loss weighting)
- Check the tokenization and training pipeline, starting with the `nan` evaluation loss
- Tune the hyperparameters or use more labeled data for the minority class.
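
As one possible way to approach loss weighting, the sketch below computes inverse-frequency class weights and applies them to a cross-entropy loss for a single example. It is written in plain Python for illustration; the helper names and the 880/120 label split are hypothetical, and in a real run the weights would have to be wired into the model's loss (for example through a custom `Trainer` subclass), which is not shown here.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_prob = logits[label] - log_sum_exp
    return -weights[label] * log_prob


labels = [0] * 880 + [1] * 120          # hypothetical split: 0 = ham, 1 = spam
weights = inverse_frequency_weights(labels)

logits = [2.0, 0.0]                     # a prediction leaning toward "ham"
ham_loss = weighted_cross_entropy(logits, 0, weights)
spam_loss = weighted_cross_entropy(logits, 1, weights)

print(weights, ham_loss, spam_loss)
```

With these weights, missing a spam message costs far more than misjudging a ham message, which pushes the model away from the always-ham solution.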