# Exercise: Teach an LLM to Spell with Supervised Fine-Tuning (SFT)

Large language models (LLMs) are notoriously bad at spelling. This is partly because tokenizers break words into smaller pieces, so the model learns about sub-word units rather than whole words and their spellings.

In this exercise, you'll use supervised fine-tuning (SFT) and a technique called Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to teach a small LLM how to spell words. This is a classic example of teaching a model a new skill that isn't well-represented in its pre-training data.

## What you'll do in this notebook

1.  **Setup**: Import libraries and configure the environment.
2.  **Load the tokenizer and base model**: Use a small, instruction-tuned model as our starting point.
3.  **Create the dataset**: Generate a simple dataset of words and their correct spellings.
4.  **Evaluate the base model**: Test the model's spelling ability *before* fine-tuning to establish a baseline.
5.  **Configure LoRA and train**: Attach a LoRA adapter to the model and fine-tune it on the spelling dataset.
6.  **Evaluate the fine-tuned model**: Test the model again to see if its spelling has improved.

## Setup

In [2]:
# Setup imports
# No changes needed in this cell

import os
import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Use GPU, MPS, or CPU, in that order of preference
if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")
torch.set_num_threads(max(1, os.cpu_count() // 2))
print("Using device:", device)

Using device: mps


## Step 1. Load the tokenizer and base model

The model `HuggingFaceTB/SmolLM2-135M-Instruct` is a small, instruction-tuned model that's suitable for this exercise. It has 135 million parameters, making it lightweight and efficient for fine-tuning. It's not the most powerful model, but it's a good choice for demonstrating the concepts of SFT and PEFT with LoRA, especially on a CPU or limited GPU resources.

In [3]:
# Student task: Load the model and tokenizer, and copy the model to the device.
# TODO: Complete the sections with **********

# See: https://huggingface.co/docs/transformers/en/models
# See: https://huggingface.co/docs/transformers/en/fast_tokenizers

# Model ID for SmolLM2-135M-Instruct
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_id)

# Copy the model to the device (GPU, MPS, or CPU)
model = model.to(device)



print("Model parameters (total):", sum(p.numel() for p in model.parameters()))

Model parameters (total): 134515008


## Step 2. Create the dataset

In [4]:
# Create a list of words of different lengths
# No changes are needed in this cell.

# fmt: off
ALL_WORDS = [
    "idea", "glow", "rust", "maze", "echo", "wisp", "veto", "lush", "gaze", "knit", "fume", "plow",
    "void", "oath", "grim", "crisp", "lunar", "fable", "quest", "verge", "brawn", "elude", "aisle",
    "ember", "crave", "ivory", "mirth", "knack", "wryly", "onset", "mosaic", "velvet", "sphinx",
    "radius", "summit", "banner", "cipher", "glisten", "mantle", "scarab", "expose", "fathom",
    "tavern", "fusion", "relish", "lantern", "enchant", "torrent", "capture", "orchard", "eclipse",
    "frescos", "triumph", "absolve", "gossipy", "prelude", "whistle", "resolve", "zealous",
    "mirage", "aperture", "sapphire",
]
# fmt: on

In [6]:
# Student Task: Create a Hugging Face Dataset with the prompt that asks the model to spell the word
# with hyphens between the letters.
# TODO: Complete the sections with **********


def generate_records():
    for word in ALL_WORDS:
        yield {
            # We will use the SFTTrainer which expects a certain format for prompt and completions pair
            # in order for it to automatically construct the right tokenizations to train the model.
            # See the documentation for more details:
            # https://huggingface.co/docs/trl/en/sft_trainer#expected-dataset-type-and-format
            # "**********": f"**********",
            "prompt":(
                f"you spell with hyphens between the letters like this W-O-R-D.\nWord:\n{word}\n\n"
                + "Spelling:\n"
            ),
            "completion": "-".join(word).upper() + ".",  # Of the form W-O-R-D.
        }


ds = Dataset.from_generator(generate_records)

# Show the first item
ds[0]

{'prompt': 'you spell with hyphens between the letters like this W-O-R-D.\nWord:\nidea\n\nSpelling:\n',
 'completion': 'I-D-E-A.'}

In [7]:
# Student Task: Split the dataset into training and testing sets
# See: train_test_split
# TODO: Complete the sections with **********

ds = ds.train_test_split(test_size=0.25,seed=42)  # Set the test set to be 25% of the dataset, and the rest is training



In [8]:
# View the training set
# No changes needed in this cell

ds["train"]

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 46
})

## Step 3. Evaluate the base model

Before we fine-tune the model, let's see how it performs on the spelling task. We'll create a helper function to generate a spelling for a given word and compare it to the correct answer.

In [14]:
# Student task: Create a function to check the model's spelling.
# This function will take a model, tokenizer, prompt, and the correct spelling.
# It should generate text from the model and compare the model's proposed spelling
# to the actual spelling, returning the proportion of characters that were correct.
# TODO: Complete the sections with **********


def check_spelling(
    model, tokenizer, prompt: str, actual_spelling: str, max_new_tokens: int = 20
) -> (str, str):
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate text from the model
    gen = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache =False)

    # Decode the generated tokens to a string
    output = tokenizer.decode(gen[0], skip_special_tokens=True)

    # Extract the generated spelling from the full output string
    proposed_spelling = output.split("Spelling:")[-1].strip()


    # strip any whitepsace from the actual spelling
    actual_spelling = actual_spelling.strip()


    # Remove hyphens for a character-by-character comparison
    proposed_spelling = proposed_spelling.lower()
    actual_spelling = actual_spelling.replace("-", " ")

    # Calculate the number of correct characters
    num_correct = sum(1 for a,b in zip(actual_spelling,proposed_spelling) if a == b)


    print(
        f"Proposed: {proposed_spelling} | Actual: {actual_spelling} "
        f"| Matches: {'‚úÖ' if proposed_spelling == actual_spelling else '‚ùå'}"
    )

    return num_correct / len(actual_spelling)  # Return proportion correct


check_spelling(
    model=model,
    tokenizer=tokenizer,
    prompt=ds["test"][0]["prompt"],
    actual_spelling=ds["test"][0]["completion"],
)

Proposed: wry | Actual: W R Y L Y. | Matches: ‚ùå


0.0

In [15]:
# Student task: Evaluate the base model's spelling ability
# We expect it to perform poorly, as it hasn't been trained for this task.

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: sphinx | Actual: S P H I N X. | Matches: ‚ùå
Proposed: brawn | Actual: B R A W N. | Matches: ‚ùå
Proposed: goss | Actual: G O S S I P Y. | Matches: ‚ùå
Proposed: - en-chant

spelling | Actual: E N C H A N T. | Matches: ‚ùå
Proposed: tavern | Actual: T A V E R N. | Matches: ‚ùå
Proposed: whistle | Actual: W H I S T L E. | Matches: ‚ùå
Proposed: w-o-r-d

how would you like to see the word changed? | Actual: C A P T U R E. | Matches: ‚ùå
Proposed: echo

word:
echo | Actual: E C H O. | Matches: ‚ùå
Proposed: mirth | Actual: M I R T H. | Matches: ‚ùå
Proposed: cris | Actual: C R I S P. | Matches: ‚ùå
Proposed: zeal | Actual: Z E A L O U S. | Matches: ‚ùå
Proposed: ember

how many words does the word "ember" contain? | Actual: E M B E R. | Matches: ‚ùå
Proposed: scarab | Actual: S C A R A B. | Matches: ‚ùå
Proposed: knit

how many times does the word "knit" appear in the following sentence:
w | Actual: K N I T. | Matches: ‚ùå
Proposed: resolve

how many times does the word resolve 

As expected, the base model is terrible at spelling. It mostly just repeats the word back. Now, let's fine-tune it.

## Step 4. Configure LoRA and train the model

Let‚Äôs attach a LoRA adapter to the base model. We use a LoRA config so only a tiny fraction of parameters are trainable. Read more here: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).

In [16]:
from peft import TaskType

# Student task: Configure LoRA for a causal LM and wrap the model with get_peft_model
# Complete the sections with **********

# Print how many params are trainable at first
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params BEFORE: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

# See: https://huggingface.co/docs/peft/package_reference/lora
lora_config = LoraConfig(
    r=8      ,           # Rank of the update matrices. Lower value = fewer trainable parameters.
    lora_alpha=16   ,     # LoRA scaling factor.
    lora_dropout=0.05,      # Dropout probability for LoRA layers.
    bias="none",
    task_type=TaskType.CAUSAL_LM,         # Causal Language Modeling.
)
# # Wrap the base model with get_peft_model
model = get_peft_model(model, lora_config)


# Print the number of trainable parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params AFTER: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

Trainable params BEFORE: 134,515,008 / 134,515,008 (100.00%)
Trainable params AFTER: 460,800 / 134,975,808 (0.34%)


Now let‚Äôs set the training arguments. We'll use `SFTConfig` from the TRL library, which is a wrapper around the standard `TrainingArguments`. We keep epochs, batch size, and sequence length modest to finish training quickly.

In [24]:
# Student task: Fill in the SFTConfig for a quick training run
# Complete the sections with **********

output_dir = "data/model"

# See: https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTConfig
training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=20,
    learning_rate=5 * 1e-4,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    report_to=[],                            # disable wandb/tensorboard
    fp16=False,                              # stay in fp32 for CPU/MPS
    lr_scheduler_type="cosine",
)


Now we define the `SFTTrainer` and run the fine-tuning process.

In [25]:
# Student Task: Create and run the SFTTrainer
# TODO: Complete the sections with **********


# See: https://huggingface.co/docs/trl/en/sft_trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
)
# Now train it:
trainer.train()


Step,Training Loss,Validation Loss,Entropy,Num Tokens,Mean Token Accuracy
20,0.4104,0.575742,1.789996,6473.0,0.831069
40,0.1928,0.613704,1.487747,12948.0,0.852582
60,0.0754,0.709228,1.313572,19340.0,0.810714
80,0.0275,0.775368,1.238472,25819.0,0.810135
100,0.0174,0.802979,1.232425,32272.0,0.80007
120,0.0157,0.793058,1.223765,38680.0,0.799265


TrainOutput(global_step=120, training_loss=0.12322785953680675, metrics={'train_runtime': 36.2759, 'train_samples_per_second': 25.361, 'train_steps_per_second': 3.308, 'total_flos': 26372523967488.0, 'train_loss': 0.12322785953680675, 'epoch': 20.0})

## Step 5. Evaluate the fine-tuned model

In [26]:
# Evaluate the fine-tuned model on the same training examples
# No changes needed in this cell


proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: s-p-h-i-n-x. | Actual: S P H I N X. | Matches: ‚ùå
Proposed: b-r-a-w-n. | Actual: B R A W N. | Matches: ‚ùå
Proposed: g-o-s-s-i-p-y. | Actual: G O S S I P Y. | Matches: ‚ùå
Proposed: e-n-c-h-a-n-t. | Actual: E N C H A N T. | Matches: ‚ùå
Proposed: t-a-v-e-r-n. | Actual: T A V E R N. | Matches: ‚ùå
Proposed: w-h-i-s-t-l-e. | Actual: W H I S T L E. | Matches: ‚ùå
Proposed: c-a-p-t-u-r-e. | Actual: C A P T U R E. | Matches: ‚ùå
Proposed: e-c-h-o. | Actual: E C H O. | Matches: ‚ùå
Proposed: m-i-r-t-h. | Actual: M I R T H. | Matches: ‚ùå
Proposed: c-r-i-s-p. | Actual: C R I S P. | Matches: ‚ùå
Proposed: z-e-a-l-o-u-s. | Actual: Z E A L O U S. | Matches: ‚ùå
Proposed: e-m-b-e-r. | Actual: E M B E R. | Matches: ‚ùå
Proposed: s-c-a-r-a-b. | Actual: S C A R A B. | Matches: ‚ùå
Proposed: k-n-i-t. | Actual: K N I T. | Matches: ‚ùå
Proposed: r-e-s-o-l-v-e. | Actual: R E S O L V E. | Matches: ‚ùå
Proposed: v-e-l-v-e-t. | Actual: V E L V E T. | Matches: ‚ùå
Proposed: a-b-s-o-l-v-e. | Actua

The model now performs better on the training data it has seen. But has it generalized? Let's check its performance on the unseen test set.

In [27]:
# Evaluate the fine-tuned model on the unseen test set
# No changes needed in this cell


proportion_correct = 0.0
num_examples = len(ds["test"])

for example in ds["test"]:
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/{num_examples}.0 words correct")

Proposed: w-r-i-y-l-i-y. | Actual: W R Y L Y. | Matches: ‚ùå
Proposed: g-l-i-n-e-s. | Actual: G L I S T E N. | Matches: ‚ùå
Proposed: c-a-s-e. | Actual: Q U E S T. | Matches: ‚ùå
Proposed: c-r-e-e-v-e. | Actual: C R A V E. | Matches: ‚ùå
Proposed: l-u-s-i-o-n. | Actual: L U S H. | Matches: ‚ùå
Proposed: f-a-l-i-c-e. | Actual: F A B L E. | Matches: ‚ùå
Proposed: k-n-a-r-k-t. | Actual: K N A C K. | Matches: ‚ùå
Proposed: t-r-i-v-h-p-l-e. | Actual: T R I U M P H. | Matches: ‚ùå
Proposed: s-a-p-l-i-c-h. | Actual: S A P P H I R E. | Matches: ‚ùå
Proposed: e-x-p-s-t. | Actual: E X P O S E. | Matches: ‚ùå
Proposed: f-s-r-e-c-o-s-n. | Actual: F R E S C O S. | Matches: ‚ùå
Proposed: w-i-p-s. | Actual: W I S P. | Matches: ‚ùå
Proposed: m-i-r-g-e. | Actual: M I R A G E. | Matches: ‚ùå
Proposed: i-v-o-r-y. | Actual: I V O R Y. | Matches: ‚ùå
Proposed: o-n-s-i-d-r-w. | Actual: O N S E T. | Matches: ‚ùå
Proposed: e-l-u-d-e. | Actual: E L U D E. | Matches: ‚ùå
0.325/16.0 words correct


It looks like it has improved! Perhaps with a larger dataset and more training, it could get even better.

## Congratulations for completing the exercise! üéâ

‚úÖ You did it! You successfully fine-tuned a small language model using PEFT with LoRA to teach it a new skill: spelling! You saw how the base model failed completely at the task, and with a very small amount of data and a short training run, the model managed to get better at spelling.

<br /><br /><br /><br /><br /><br /><br /><br /><br />