This notebook walks you through fine-tuning a GPT-2 model to classify SMS messages as spam or ham, using an older version of transformers (<4.4). Follow the steps below and complete each “TODO” in the code.

1. Setup: Install the required packages: datasets, evaluate, and transformers[sentencepiece].

%pip install --quiet datasets evaluate transformers[sentencepiece]

In [1]:
%pip install -U datasets

Collecting datasets
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.0.0-py3-none-any.whl (494 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency r

In [2]:
!pip install --quiet evaluate transformers[sentencepiece]

2. Load & Inspect Dataset

In [3]:
from datasets import load_dataset
import pandas as pd

# 1. Load correct dataset
dataset = load_dataset("ucirvine/sms_spam")

# 2. Convert to DataFrame
df = pd.DataFrame(dataset["train"])

# 3. Shuffle and split
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_df = df_shuffled.iloc[:4000]
val_df   = df_shuffled.iloc[4000:5000]

# 4. Inspect columns
print(train_df.columns)
# Output: Index(['sms', 'label'], dtype='object')



Index(['sms', 'label'], dtype='object')


3. Tokenization



In [4]:
from transformers import GPT2Tokenizer
from datasets import Dataset

model_name = "openai-community/gpt2"  # GPT-2 checkpoint used for both the tokenizer and the model
tokenizer  = GPT2Tokenizer.from_pretrained(model_name)
# GPT-2 has no pad token by default, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    # returns input_ids, attention_mask; keep max_length small for SMS
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=64
    )

# Convert pandas DataFrame to Hugging Face Dataset
train_raw = Dataset.from_pandas(train_df)
val_raw   = Dataset.from_pandas(val_df)

# Tokenize both splits in batches with Dataset.map
train_tok = train_raw.map(tokenize_fn, batched=True)
val_tok   = val_raw.map(tokenize_fn, batched=True)

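To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the shape of the tokenizer's output. The helper `pad_or_truncate` and the example token IDs are hypothetical; 50256 is GPT-2's end-of-text ID, which doubles as the pad ID after the assignment in the cell above. Right-side padding is assumed, which is the tokenizer's default.

```python
EOS_ID = 50256  # GPT-2's <|endoftext|> ID, reused as the pad ID in this notebook


def pad_or_truncate(token_ids, max_length, pad_id=EOS_ID):
    """Mimic padding="max_length" with truncation=True (right-side padding assumed)."""
    kept = token_ids[:max_length]   # truncation drops tokens past max_length
    n_pad = max_length - len(kept)  # number of pad positions to append
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }


# Hypothetical token IDs, with a small max_length for readability
short_example = pad_or_truncate([101, 202, 303], max_length=6)
long_example = pad_or_truncate(list(range(10)), max_length=6)
print(short_example)
print(long_example)
```

Every example ends up with exactly `max_length` entries, which is what lets the Trainer stack them into fixed-size batches.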

4. Model Initialization


In [5]:
import torch
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained( # Load GPT-2 with sequence classification head
    model_name,
    num_labels=2,           # spam vs. ham
    pad_token_id=tokenizer.eos_token_id
)


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
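The `pad_token_id` argument matters because GPT-2's classification head scores a single position per sequence: the last token that is not padding. The sketch below illustrates that lookup in plain Python. `last_non_pad_index` is a hypothetical helper written for illustration, not the library's implementation, and it assumes right-side padding with the EOS ID used as the pad ID.

```python
PAD_ID = 50256  # GPT-2's EOS ID, used as the pad ID in this notebook


def last_non_pad_index(input_ids, pad_id=PAD_ID):
    """Return the index of the last non-padding token, or -1 if every token is padding."""
    for i in range(len(input_ids) - 1, -1, -1):  # scan from the right
        if input_ids[i] != pad_id:
            return i
    return -1  # illustrative convention for an all-padding row


padded = [11, 22, 33, PAD_ID, PAD_ID]  # hypothetical token IDs
unpadded = [11, 22, 33]
print(last_non_pad_index(padded), last_non_pad_index(unpadded))
```

Without a pad ID the model has no way to tell where a padded message really ends, which is why it is passed explicitly when the model is loaded.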


5. Metrics Definition

In [6]:
import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
precision = evaluate.load("precision")  # same loader as accuracy, for precision
recall    = evaluate.load("recall")     # same loader, for recall
f1        = evaluate.load("f1")         # same loader, for F1

def compute_metrics(pred):
    # pred unpacks into the raw logits and the reference labels
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)  # predicted class = highest logit
    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": precision.compute(predictions=preds, references=labels)["precision"],
        "recall":    recall.compute(predictions=preds, references=labels)["recall"],
        "f1":        f1.compute(predictions=preds, references=labels)["f1"],
    }


In an imbalanced dataset like SMS spam (often more “ham” than “spam”), why is it important to track precision and recall alongside accuracy?
How would you interpret a model that achieves high accuracy but low recall on the spam class?

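Accuracy alone can mislead on imbalanced data: a model that labels every message as ham is right most of the time yet catches no spam. Precision is the share of messages flagged as spam that really are spam, so low precision means ham is being flagged as spam. Recall is the share of actual spam that the model catches, so low recall means spam is slipping through as ham. A model with high accuracy but low recall on the spam class therefore handles ham well but lets a fair share of spam through.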
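A small worked example makes the point concrete. The sketch below scores a degenerate "always ham" classifier on a hypothetical validation set of 870 ham and 130 spam messages; the counts are illustrative and not taken from the dataset. `binary_metrics` is a helper written for this example, and it reports precision as 0.0 when nothing is predicted positive.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined case reported as 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 0 = ham, 1 = spam
y_true = [0] * 870 + [1] * 130
always_ham = [0] * 1000  # classifier that never predicts spam

acc, prec, rec = binary_metrics(y_true, always_ham)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

The baseline reaches 87% accuracy while its recall on spam is zero, which is exactly the failure that accuracy alone hides.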

6. TrainingArguments Configuration

In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",         # Where to save checkpoints and logs
    do_train=True,
    do_eval=True,
    eval_steps=500,                 # Only used when evaluation is scheduled by steps
    save_steps=500,
    logging_dir="./logs",
    logging_steps=500,

    per_device_train_batch_size=4,  # Conservative batch size
    per_device_eval_batch_size=4,
    num_train_epochs=3,             # Good for small datasets
    learning_rate=5e-5,             # Standard starting LR
    weight_decay=0.01,              # Helps generalization

    report_to=None,                 # None leaves installed integrations (e.g. W&B) enabled; use "none" to disable them
    save_total_limit=1,             # Keep only the latest checkpoint
)
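As a sanity check on these settings, the total number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation, neither of which is stated in the notebook; the 4,000-example training split comes from step 2.

```python
import math

train_examples = 4000   # size of train_df from step 2
per_device_batch = 4    # per_device_train_batch_size
epochs = 3              # num_train_epochs
logging_steps = 500

# Assumption: one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(train_examples / per_device_batch)
total_steps = steps_per_epoch * epochs
log_points = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps)  # 1000 3000
print(log_points)                    # [500, 1000, 1500, 2000, 2500, 3000]
```

Under those assumptions training runs for 3,000 steps and logs the loss six times, which agrees with the training log in step 7, where the last logged step is 3000.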

What effect does weight_decay have during fine-tuning? When might you choose a higher or lower value?

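What does weight_decay do?
In essence, it penalizes large weights. With plain SGD this is equivalent to adding an L2 penalty term to the loss; the AdamW optimizer that Trainer uses by default instead shrinks each weight slightly at every update step.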

When to use lower weight_decay (e.g., 0.0 to 0.01):
*   When you have a lot of data relative to model size.
*   When your model is already pretrained and the task is very close to the pretraining task.
*   If you're seeing signs of underfitting (low training and validation accuracy).
*   On small models or lightweight fine-tuning where regularization is less critical.

When to use higher weight_decay (e.g., 0.05 to 0.1):
*   When your dataset is very small (e.g., under 10k examples) — like SMS Spam with ~5.5k rows.
*   When you're fine-tuning a large pretrained model such as GPT-2 or BERT on a task that differs a lot from the original pretraining objective.
*   When you see overfitting (training loss dropping but validation stagnating or increasing).
*   If you want to add regularization without using dropout.
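The shrinking effect of weight decay can be seen in a single scalar weight. The sketch below is a simplified SGD-style update with decoupled decay, not AdamW itself (AdamW also rescales the gradient term). The function name `decayed_step` and the exaggerated values of `lr` and `weight_decay` are hypothetical, chosen only to make the effect visible.

```python
def decayed_step(w, grad, lr, weight_decay):
    """One update with decoupled weight decay: shrink the weight, then apply the gradient."""
    return w - lr * weight_decay * w - lr * grad


w_decayed = 1.0
w_plain = 1.0
for _ in range(3):
    # A zero gradient isolates the decay term
    w_decayed = decayed_step(w_decayed, grad=0.0, lr=0.1, weight_decay=0.1)
    w_plain = decayed_step(w_plain, grad=0.0, lr=0.1, weight_decay=0.0)

print(w_decayed, w_plain)
```

Each step multiplies the weight by `1 - lr * weight_decay`, so a higher `weight_decay` pulls weights toward zero faster, while a value of 0.0 leaves them to the gradient alone.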

7. Train & Evaluate

In [8]:
# Train
from transformers import Trainer

# If wandb is installed, Trainer asks for a W&B API key when training starts; have it ready to paste.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    compute_metrics=compute_metrics,
)
trainer.train()

# Evaluate
metrics = trainer.evaluate()
print(metrics)
# Expect something like: {"eval_loss": ..., "eval_accuracy": 0.98, ...}





Step,Training Loss
500,0.1728
1000,0.0538
1500,0.0567
2000,0.02
2500,0.0157
3000,0.0136


{'eval_loss': 0.050086814910173416, 'eval_accuracy': 0.99, 'eval_precision': 0.991869918699187, 'eval_recall': 0.9312977099236641, 'eval_f1': 0.9606299212598425, 'eval_runtime': 4.2014, 'eval_samples_per_second': 238.014, 'eval_steps_per_second': 59.504, 'epoch': 3.0}
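The reported metrics can be traced back to a confusion matrix. The counts below are inferred from the printed precision, recall, and accuracy on the 1,000-example validation split; the notebook does not print them. The sketch assumes label 1 is the spam (positive) class.

```python
n = 1000                # validation examples (val_df from step 2)
tp, fp, fn = 122, 1, 9  # inferred counts, with spam as the positive class
tn = n - tp - fp - fn   # remaining examples are correctly classified ham

accuracy = (tp + tn) / n
precision = tp / (tp + fp)  # 122 / 123
recall = tp / (tp + fn)     # 122 / 131
f1 = 2 * precision * recall / (precision + recall)

print(tn, accuracy, precision, recall, f1)
```

Read this way, the model flagged only 1 ham message as spam but missed 9 of the 131 spam messages, so the 0.99 accuracy sits alongside a noticeably lower recall of about 0.93.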
