# Exercise: Teach an LLM to Spell with Supervised Fine-Tuning (SFT)

Large language models (LLMs) are notoriously bad at spelling. This is partly because tokenizers break words into smaller pieces, so the model learns about sub-word units rather than whole words and their spellings.
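
To make the tokenization point concrete, here is a toy sketch. The vocabulary `TOY_VOCAB` and the greedy longest-match splitter `greedy_tokenize` are both hypothetical and exist only for illustration. Real LLM tokenizers (for example, byte-pair encoders) are built differently, but the effect is similar: the model receives a few multi-letter pieces rather than individual letters.

```python
# Toy illustration only: a made-up sub-word vocabulary, not SmolLM2's real one.
TOY_VOCAB = {"sap", "ph", "ire", "s", "a", "p", "h", "i", "r", "e"}


def greedy_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"No vocabulary piece matches {word[start:]!r}")
    return pieces


pieces = greedy_tokenize("sapphire", TOY_VOCAB)
print(pieces)  # ['sap', 'ph', 'ire']
```

The model in this toy setting sees three opaque pieces, none of which tells it directly that the word contains a double "p".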

In this exercise, you'll use supervised fine-tuning (SFT) and a technique called Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to teach a small LLM how to spell words. This is a classic example of teaching a model a new skill that isn't well-represented in its pre-training data.

## What you'll do in this notebook

1.  **Setup**: Import libraries and configure the environment.
2.  **Load the tokenizer and base model**: Use a small, instruction-tuned model as our starting point.
3.  **Create the dataset**: Generate a simple dataset of words and their correct spellings.
4.  **Evaluate the base model**: Test the model's spelling ability *before* fine-tuning to establish a baseline.
5.  **Configure LoRA and train**: Attach a LoRA adapter to the model and fine-tune it on the spelling dataset.
6.  **Evaluate the fine-tuned model**: Test the model again to see if its spelling has improved.

## Setup

In [34]:
# Setup imports
# No changes needed in this cell

import os
import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Use GPU, MPS, or CPU, in that order of preference
if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")
torch.set_num_threads(max(1, os.cpu_count() // 2))
print("Using device:", device)

Using device: cuda


## Step 1. Load the tokenizer and base model

The model `HuggingFaceTB/SmolLM2-135M-Instruct` is a small, instruction-tuned model that's suitable for this exercise. It has 135 million parameters, making it lightweight and efficient to fine-tune. It's not the most powerful model, but it's a good choice for demonstrating the concepts of SFT and PEFT with LoRA, especially on a CPU or with limited GPU resources.

In [35]:
# Student task: Load the model and tokenizer, and copy the model to the device.
# TODO: Complete the sections with **********

# See: https://huggingface.co/docs/transformers/en/models
# See: https://huggingface.co/docs/transformers/en/fast_tokenizers

# Model ID for SmolLM2-135M-Instruct
# model_id = "***********"
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Load the tokenizer
# tokenizer = "***********"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
# model = "***********"
model = AutoModelForCausalLM.from_pretrained(model_id)

# Copy the model to the device (GPU, MPS, or CPU)
# model = "***********"
model = model.to(device)


print("Model parameters (total):", sum(p.numel() for p in model.parameters()))

Model parameters (total): 134515008
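
As a rough check on why this model is practical on modest hardware, the sketch below estimates the memory taken by the weights alone. It assumes the weights are held as 32-bit floats, which may not hold for every library version or loading option, and it ignores activations, gradients, and optimizer state.

```python
# Parameter count copied from the cell output above.
num_params = 134_515_008

# Assumption: each weight is stored as a 32-bit float (4 bytes).
bytes_per_param = 4

weight_bytes = num_params * bytes_per_param
print(f"{weight_bytes:,} bytes ~= {weight_bytes / 2**20:.0f} MiB")
# 538,060,032 bytes ~= 513 MiB
```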


## Step 2. Create the dataset

In [36]:
# Create a list of words of different lengths
# No changes are needed in this cell.

# fmt: off
ALL_WORDS = [
    "idea", "glow", "rust", "maze", "echo", "wisp", "veto", "lush", "gaze", "knit", "fume", "plow",
    "void", "oath", "grim", "crisp", "lunar", "fable", "quest", "verge", "brawn", "elude", "aisle",
    "ember", "crave", "ivory", "mirth", "knack", "wryly", "onset", "mosaic", "velvet", "sphinx",
    "radius", "summit", "banner", "cipher", "glisten", "mantle", "scarab", "expose", "fathom",
    "tavern", "fusion", "relish", "lantern", "enchant", "torrent", "capture", "orchard", "eclipse",
    "frescos", "triumph", "absolve", "gossipy", "prelude", "whistle", "resolve", "zealous",
    "mirage", "aperture", "sapphire",
]
# fmt: on

In [37]:
# Student Task: Create a Hugging Face Dataset with the prompt that asks the model to spell the word
# with hyphens between the letters.
# TODO: Complete the sections with **********


def generate_records():
    for word in ALL_WORDS:
        yield {
            # We will use the SFTTrainer, which expects prompt/completion pairs in a
            # certain format so that it can construct the right tokenizations for training.
            # See the documentation for more details:
            # https://huggingface.co/docs/trl/en/sft_trainer#expected-dataset-type-and-format
            # "**********": f"**********",
            "prompt": (
                f"You spell words with hyphens between the letters like this W-O-R-D.\nWord:\n{word}\n\n"
                + "Spelling:\n"
            ),
            "completion": "-".join(word.upper()) + ".",  # Of the form W-O-R-D.
        }


ds = Dataset.from_generator(generate_records)

# Show the first item
ds[0]

{'prompt': 'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nidea\n\nSpelling:\n',
 'completion': 'I-D-E-A.'}
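
A common way to train on prompt/completion pairs is to concatenate the two and compute the loss only on the completion tokens. The sketch below illustrates that idea with a hypothetical character-level encoder (`toy_encode`) and helper (`build_example`). Whether `SFTTrainer` masks the prompt in exactly this way depends on the TRL version and its configuration, so treat this as a conceptual illustration and not as the library's implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def toy_encode(text):
    """Hypothetical stand-in for a tokenizer: one integer ID per character."""
    return [ord(ch) for ch in text]


def build_example(prompt, completion):
    """Concatenate prompt and completion; mask the prompt positions in the labels."""
    prompt_ids = toy_encode(prompt)
    completion_ids = toy_encode(completion)
    return {
        "input_ids": prompt_ids + completion_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + completion_ids,
    }


prompt = "Word:\nidea\n\nSpelling:\n"
example = build_example(prompt, "I-D-E-A.")

# Only the completion positions carry a real label.
scored = "".join(chr(t) for t in example["labels"] if t != IGNORE_INDEX)
print(scored)  # I-D-E-A.
```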

In [38]:
# Student Task: Split the dataset into training and testing sets
# See: train_test_split
# TODO: Complete the sections with **********

# ds = **********  # Set the test set to be 25% of the dataset, and the rest is training
ds = ds.train_test_split(test_size=0.25, seed=42)

In [39]:
# View the training set
# No changes needed in this cell

ds["train"][0]

{'prompt': 'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nsphinx\n\nSpelling:\n',
 'completion': 'S-P-H-I-N-X.'}

## Step 3. Evaluate the base model

Before we fine-tune the model, let's see how it performs on the spelling task. We'll create a helper function to generate a spelling for a given word and compare it to the correct answer.

In [40]:
# Student task: Create a function to check the model's spelling.
# This function takes a model, tokenizer, prompt, and the correct spelling.
# It generates text from the model and compares the model's proposed spelling
# to the actual spelling, returning 1 for an exact match and 0 otherwise.
# TODO: Complete the sections with **********


def check_spelling(
    model, tokenizer, prompt: str, actual_spelling: str, max_new_tokens: int = 20
) -> int:
    # Tokenize the prompt
    # inputs = **********
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate text from the model
    # gen = **********
    gen = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=False)

    # Decode the generated tokens to a string
    # output = **********
    output: str = tokenizer.decode(gen[0], skip_special_tokens=True)

    # Extract the generated spelling: the first line after the last "Spelling:" marker
    # proposed_spelling = "**********"
    proposed_spelling = output.split("Spelling:")[-1].strip().split("\n")[0].strip()

    # Strip any whitespace from the actual spelling
    # actual_spelling = "**********"
    actual_spelling = actual_spelling.strip()

    # Exact-match comparison: hyphens and the trailing period must match too
    is_correct = proposed_spelling == actual_spelling

    print(
        f"Proposed: {proposed_spelling} | Actual: {actual_spelling} "
        f"| Matches: {'✅' if is_correct else '❌'}"
    )
    return int(is_correct)

check_spelling(
    model=model,
    tokenizer=tokenizer,
    prompt=ds["test"][0]["prompt"],
    actual_spelling=ds["test"][0]["completion"],
)

Proposed: wry | Actual: W-R-Y-L-Y. | Matches: ❌


0
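
`check_spelling` counts only exact matches, so a nearly correct answer scores the same as an empty one. A softer, per-character score can be sketched as follows. The helpers `normalize` and `char_accuracy` are hypothetical and not part of the exercise; the position-by-position comparison is one simple choice among several (an edit distance would be another).

```python
def normalize(spelling: str) -> str:
    """Drop surrounding whitespace, the trailing period, and hyphens; uppercase the rest."""
    return spelling.strip().rstrip(".").replace("-", "").upper()


def char_accuracy(proposed: str, actual: str) -> float:
    """Fraction of positions where the proposed letters match the actual letters."""
    p, a = normalize(proposed), normalize(actual)
    if not a:
        return 0.0
    matches = sum(pc == ac for pc, ac in zip(p, a))
    # Dividing by the longer length penalizes both missing and extra letters.
    return matches / max(len(p), len(a))


print(char_accuracy("wry", "W-R-Y-L-Y."))     # 0.6
print(char_accuracy("L-U-S-H.", "L-U-S-H."))  # 1.0
```

Because hyphens are stripped before comparing, this score gives partial credit to unformatted answers such as `wry`, so it complements the exact-match check but does not replace it.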

In [41]:
# Student task: Evaluate the base model's spelling ability
# We expect it to perform poorly, as it hasn't been trained for this task.

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: sphinx | Actual: S-P-H-I-N-X. | Matches: ❌
Proposed: brawn | Actual: B-R-A-W-N. | Matches: ❌
Proposed: goss | Actual: G-O-S-S-I-P-Y. | Matches: ❌
Proposed: enchant | Actual: E-N-C-H-A-N-T. | Matches: ❌
Proposed: tavern | Actual: T-A-V-E-R-N. | Matches: ❌
Proposed: whistle | Actual: W-H-I-S-T-L-E. | Matches: ❌
Proposed: W-O-R-D | Actual: C-A-P-T-U-R-E. | Matches: ❌
Proposed: echo | Actual: E-C-H-O. | Matches: ❌
Proposed: mirth | Actual: M-I-R-T-H. | Matches: ❌
Proposed: cris | Actual: C-R-I-S-P. | Matches: ❌
Proposed: zeal | Actual: Z-E-A-L-O-U-S. | Matches: ❌
Proposed:  | Actual: E-M-B-E-R. | Matches: ❌
Proposed: scarab | Actual: S-C-A-R-A-B. | Matches: ❌
Proposed:  | Actual: K-N-I-T. | Matches: ❌
Proposed: resolve | Actual: R-E-S-O-L-V-E. | Matches: ❌
Proposed: velvet | Actual: V-E-L-V-E-T. | Matches: ❌
Proposed:  | Actual: A-B-S-O-L-V-E. | Matches: ❌
Proposed: lunar | Actual: L-U-N-A-R. | Matches: ❌
Proposed: maze | Actual: M-A-Z-E. | Mat

As expected, the base model fails at this task. It mostly repeats the word back, sometimes truncated, instead of spelling it out letter by letter. Now, let's fine-tune it.

## Step 4. Configure LoRA and train the model

Let's attach a LoRA adapter to the base model. We use a LoRA config so that only a small fraction of the parameters is trainable. Read more here: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).
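
The core idea of LoRA can be sketched in a few lines of plain Python. The base weight `W` stays frozen, and a low-rank product `B @ A`, scaled by `lora_alpha / r`, is added on top. The tiny matrices, the `matvec` helper, and `lora_forward` below are made up for illustration; the zero initialization of `B` mirrors the usual LoRA setup, assumed here, in which the adapter starts out as a no-op.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Frozen base weight W with d_out=2 and d_in=3 (arbitrary values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]

r, lora_alpha = 1, 16
A = [[0.5, -0.5, 1.0]]  # r x d_in, trainable
B = [[0.0], [0.0]]      # d_out x r, trainable, starts at zero
scaling = lora_alpha / r


def lora_forward(x):
    """Base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


x = [1.0, 2.0, 3.0]
at_init = lora_forward(x)
print(at_init)  # [7.0, -1.0], identical to W @ x while B is all zeros

# Pretend training has moved B away from zero; W itself is never modified.
B = [[0.5], [0.0]]
after_update = lora_forward(x)
print(after_update)  # [27.0, -1.0]
```

With the values used later in this notebook (`r=64`, `lora_alpha=16`), the same scaling formula gives 16 / 64 = 0.25.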

In [42]:
# Student task: Configure LoRA for a causal LM and wrap the model with get_peft_model
# Complete the sections with **********

# Print how many params are trainable at first
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params BEFORE: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

# See: https://huggingface.co/docs/peft/package_reference/lora
# lora_config = LoraConfig(
#     r=**********,                 # Rank of the update matrices. Lower value = fewer trainable parameters.
#     lora_alpha=**********,        # LoRA scaling factor.
#     lora_dropout=**********,      # Dropout probability for LoRA layers.
#     bias="none",
#     task_type=**********,         # Causal Language Modeling.
# )
# # Wrap the base model with get_peft_model
# model = get_peft_model(**********, **********)

lora_config = LoraConfig(
    r=64,                   # Rank of the update matrices. Lower value = fewer trainable parameters.
    lora_alpha=16,          # LoRA scaling factor.
    lora_dropout=0.05,      # Dropout probability for LoRA layers.
    bias="none",
    task_type="CAUSAL_LM",  # Causal Language Modeling.
)
# Wrap the base model with get_peft_model
model = get_peft_model(model, lora_config)

# Print the number of trainable parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params AFTER: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

Trainable params BEFORE: 134,515,008 / 134,515,008 (100.00%)
Trainable params AFTER: 3,686,400 / 138,201,408 (2.67%)
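
The trainable count above can be reproduced by hand. The sketch below assumes architecture values that are not stated in this notebook (hidden size 576, a 192-wide value projection, 30 layers) and assumes that PEFT's default for this model family adapts only the `q_proj` and `v_proj` layers. Under those assumptions, each adapted layer of shape `d_in x d_out` gains `r * (d_in + d_out)` parameters.

```python
# Assumed architecture values for SmolLM2-135M (not stated in this notebook).
hidden_size = 576
kv_size = 192    # assumed output width of v_proj
num_layers = 30
r = 64           # rank from the LoraConfig above


def lora_params(d_in, d_out, rank):
    """Parameters added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


per_layer = (
    lora_params(hidden_size, hidden_size, r)  # q_proj
    + lora_params(hidden_size, kv_size, r)    # v_proj
)
total_lora = per_layer * num_layers
base_params = 134_515_008  # from the cell output in Step 1

print(f"{total_lora:,}")                                          # 3,686,400
print(f"{100 * total_lora / (base_params + total_lora):.2f}%")    # 2.67%
```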


Now let's set the training arguments. We'll use `SFTConfig` from the TRL library, which extends the standard `TrainingArguments`. We keep the batch size small and the number of epochs modest so that training finishes quickly.

In [43]:
# Student task: Fill in the SFTConfig for a quick training run
# Complete the sections with **********

output_dir = "data/model"

# See: https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTConfig
# training_args = SFTConfig(
#     output_dir=output_dir,
#     per_device_train_batch_size=**********,
#     per_device_eval_batch_size=**********,
#     gradient_accumulation_steps=**********,
#     num_train_epochs=**********,
#     learning_rate=**********,
#     logging_steps=**********,
#     eval_strategy="steps",
#     eval_steps=**********,
#     save_strategy="no",
#     report_to=[],                            # disable wandb/tensorboard
#     fp16=False,                              # stay in fp32 for CPU/MPS
#     lr_scheduler_type="cosine",
# )
training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=20,
    learning_rate=1e-3,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    report_to=[],                            # disable wandb/tensorboard
    fp16=False,
    lr_scheduler_type="cosine",
)
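
To give a feel for `lr_scheduler_type="cosine"`, here is a minimal sketch of a cosine decay from the peak learning rate down to zero. It assumes no warmup and a single half-cosine over the whole run; the function `cosine_lr` is hypothetical and only approximates what the library's scheduler does. The total of 120 steps matches the step count this run reports after training.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1e-3):
    """Half-cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 30, 60, 90, 120):
    print(step, f"{cosine_lr(step, 120):.2e}")
```

The rate stays close to its peak early on, reaches half of it at the midpoint, and flattens out near zero at the end.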

Now we define the `SFTTrainer` and run the fine-tuning process.

In [44]:
ds['test'][0]

{'prompt': 'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nwryly\n\nSpelling:\n',
 'completion': 'W-R-Y-L-Y.'}

In [45]:
# Student Task: Create and run the SFTTrainer
# TODO: Complete the sections with **********


# See: https://huggingface.co/docs/trl/en/sft_trainer
# trainer = SFTTrainer(
#     model=**********,
#     train_dataset=**********,
#     eval_dataset=**********,
#     args=**********,
# )
# Now train it:
# trainer.**********
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
)

trainer.train()

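| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 20 | 0.8615 | 0.590896 | 1.864297 | 6627.0 | 0.799492 |
| 40 | 0.3468 | 0.540267 | 1.483384 | 13256.0 | 0.861841 |
| 60 | 0.1488 | 0.617988 | 1.363995 | 19800.0 | 0.861841 |
| 80 | 0.0466 | 0.676387 | 1.286945 | 26433.0 | 0.856406 |
| 100 | 0.0216 | 0.697395 | 1.260177 | 33040.0 | 0.856406 |
| 120 | 0.0172 | 0.702987 | 1.248737 | 39600.0 | 0.856406 |
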

TrainOutput(global_step=120, training_loss=0.24040335938334464, metrics={'train_runtime': 16.7276, 'train_samples_per_second': 54.999, 'train_steps_per_second': 7.174, 'total_flos': 27776639121408.0, 'train_loss': 0.24040335938334464, 'epoch': 20.0})
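
The `global_step=120` in the output can be checked by hand from the dataset size and the training arguments. The sketch below assumes a single device and assumes that the split size and the last partial batch are rounded up, which is consistent with the numbers this run reports.

```python
import math

num_words = 62        # length of ALL_WORDS
test_fraction = 0.25
per_device_batch = 4
grad_accum = 2
epochs = 20
eval_every = 20

# Assumed rounding: the test split is rounded up, matching the 16 test words evaluated later.
test_size = math.ceil(num_words * test_fraction)
train_size = num_words - test_size

batches_per_epoch = math.ceil(train_size / per_device_batch)
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = updates_per_epoch * epochs
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(train_size, test_size, total_steps)  # 46 16 120
print(eval_points)                         # [20, 40, 60, 80, 100, 120]
```

Each optimizer step therefore covers up to 4 x 2 = 8 examples, and the six evaluation points line up with the six rows of the training log.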

## Step 5. Evaluate the fine-tuned model

In [46]:
# Evaluate the fine-tuned model on the same training examples
# No changes needed in this cell


proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: S-P-H-I-N-X. | Actual: S-P-H-I-N-X. | Matches: ✅
Proposed: B-R-A-W-N. | Actual: B-R-A-W-N. | Matches: ✅
Proposed: G-O-S-S-I-P-Y. | Actual: G-O-S-S-I-P-Y. | Matches: ✅
Proposed: E-N-C-H-A-N-T. | Actual: E-N-C-H-A-N-T. | Matches: ✅
Proposed: T-A-V-E-R-N. | Actual: T-A-V-E-R-N. | Matches: ✅
Proposed: W-H-I-S-T-L-E. | Actual: W-H-I-S-T-L-E. | Matches: ✅
Proposed: C-A-P-T-U-R-E. | Actual: C-A-P-T-U-R-E. | Matches: ✅
Proposed: E-C-H-O. | Actual: E-C-H-O. | Matches: ✅
Proposed: M-I-R-T-H. | Actual: M-I-R-T-H. | Matches: ✅
Proposed: C-R-I-S-P. | Actual: C-R-I-S-P. | Matches: ✅
Proposed: Z-E-A-L-O-U-S. | Actual: Z-E-A-L-O-U-S. | Matches: ✅
Proposed: E-M-B-E-R. | Actual: E-M-B-E-R. | Matches: ✅
Proposed: S-C-A-R-A-B. | Actual: S-C-A-R-A-B. | Matches: ✅
Proposed: K-N-I-T. | Actual: K-N-I-T. | Matches: ✅
Proposed: R-E-S-O-L-V-E. | Actual: R-E-S-O-L-V-E. | Matches: ✅
Proposed: V-E-L-V-E-T. | Actual: V-E-L-V-E-T. | Matches: ✅
Proposed: A-B-S-O-L-V-E. | Actua

The model now performs better on the training data it has seen. But has it generalized? Let's check its performance on the unseen test set.

In [47]:
# Evaluate the fine-tuned model on the unseen test set
# No changes needed in this cell


proportion_correct = 0.0
num_examples = len(ds["test"])

for example in ds["test"]:
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/{num_examples}.0 words correct")

Proposed: W-R-I-Y-L-Y. | Actual: W-R-Y-L-Y. | Matches: ❌
Proposed: G-L-I-S-N-E. | Actual: G-L-I-S-T-E-N. | Matches: ❌
Proposed: S-C-A-W-E. | Actual: Q-U-E-S-T. | Matches: ❌
Proposed: C-R-E-A-V-E. | Actual: C-R-A-V-E. | Matches: ❌
Proposed: L-U-S-H. | Actual: L-U-S-H. | Matches: ✅
Proposed: F-A-C-I-L-E. | Actual: F-A-B-L-E. | Matches: ❌
Proposed: K-N-A-R-K-E. | Actual: K-N-A-C-K. | Matches: ❌
Proposed: T-R-I-U-M-P-H. | Actual: T-R-I-U-M-P-H. | Matches: ✅
Proposed: S-A-P-L-I-C-H. | Actual: S-A-P-P-H-I-R-E. | Matches: ❌
Proposed: E-X-P-S-T-E. | Actual: E-X-P-O-S-E. | Matches: ❌
Proposed: F-S-R-E-C-O-S. | Actual: F-R-E-S-C-O-S. | Matches: ❌
Proposed: W-I-P-S. | Actual: W-I-S-P. | Matches: ❌
Proposed: M-I-R-G-E. | Actual: M-I-R-A-G-E. | Matches: ❌
Proposed: I-V-O-R-Y. | Actual: I-V-O-R-Y. | Matches: ✅
Proposed: O-S-L-D-E. | Actual: O-N-S-E-T. | Matches: ❌
Proposed: E-L-U-D-E. | Actual: E-L-U-D-E. | Matches: ✅
4.0/16.0 words correct


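It looks like it has improved! The fine-tuned model spells 4 of the 16 unseen test words exactly right (25%), and even its misses now follow the hyphenated, uppercase format. The gap between this and its results on the training words shown above suggests that it has partly memorized the training set. Perhaps with a larger dataset and more training, it could get even better.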
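
One way to look for signs of overfitting is to compare the training and validation losses from the training log. The sketch below copies those values by hand and finds the step with the lowest validation loss; the variable names are made up for illustration. In this run, the validation loss is lowest at step 40 and rises afterwards while the training loss keeps falling, which is a pattern often associated with overfitting.

```python
# (step, training loss, validation loss), copied from the training log above.
log = [
    (20, 0.8615, 0.590896),
    (40, 0.3468, 0.540267),
    (60, 0.1488, 0.617988),
    (80, 0.0466, 0.676387),
    (100, 0.0216, 0.697395),
    (120, 0.0172, 0.702987),
]

# Step at which the validation loss is lowest.
best_step, _, best_val = min(log, key=lambda row: row[2])
print(best_step, best_val)  # 40 0.540267

# Gap between validation and training loss at the end of the run.
final_step, final_train, final_val = log[-1]
print(f"{final_val - final_train:.4f}")
```

A possible follow-up experiment, not part of this exercise, would be to stop training around the best validation step or to add more words to the dataset.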

## Congratulations on completing the exercise! 🎉

✅ You did it! You successfully fine-tuned a small language model using PEFT with LoRA to teach it a new skill: spelling! You saw how the base model failed completely at the task, and with a very small amount of data and a short training run, the model managed to get better at spelling.
