# 3. Model Adaptation: Changing What the Model Knows

> **GPU Required.** This section requires a CUDA-capable GPU. The lab environment provides an NVIDIA L4 with 24GB of VRAM. An 8B parameter model in full precision would not fit in 24GB for training, so we use **QLoRA** (Quantized Low-Rank Adaptation) to make it work: the base model is loaded in 4-bit precision via bitsandbytes (roughly 5GB), and only small LoRA adapter matrices are trained on top. This keeps the total memory footprint well within the L4's budget.

If you are running this outside the lab without GPU access, you can follow the setup and data preparation cells, but training will not execute. Pre-built outputs are provided for that scenario.

**Purpose:**

We have now exhausted the options that do not involve changing the model. RAG gave us grounded answers for 6 out of 10 questions. Inference-time scaling (Best-of-N) recovered 2 more. But for the remaining failures, particularly questions requiring implicit domain reasoning, every approach hit the same wall: the model's weights do not contain a reliable path to the correct answer.

This is where model adaptation becomes justified. Not because it is the next thing on the list, but because two independent evaluation methods pointed to the same conclusion: the gap is in the model, not in the pipeline.

We will use **LoRA SFT (Low-Rank Adaptation with Supervised Fine-Tuning)**, a parameter-efficient method that freezes the base model and adds small trainable matrices alongside the frozen weights. Combined with 4-bit quantization, this is the standard approach for fine-tuning large models on consumer or mid-range GPUs.

The training data comes directly from Section 2: synthetic question-answer pairs generated from the customer's own documents. The model comes from HuggingFace. The training runs locally on GPU.

By the end of this section, you will have an adapted model and evidence of whether the adaptation closed the gaps that RAG and inference-time scaling could not.

## 3.1 Install Training Hub

`training_hub` is an open-source library from the Red Hat AI Innovation Team. It provides a single Python interface for multiple post-training algorithms, each backed by a community implementation. You call a function, pass your model and data, and the library handles the rest.

We install the base package first, then the `[cuda]` extra for GPU support and the `[lora]` extra for the Unsloth-based LoRA backend.

This step typically takes 3 to 5 minutes on this hardware.

In [None]:
# Install base package first (provides torch, packaging, wheel, ninja)
! pip install training-hub -q

# Then install CUDA extras and LoRA backend
! pip install 'training-hub[cuda]' --no-build-isolation -q
! pip install 'training-hub[lora]' -q

You may see dependency conflict warnings from pip. These are advisory, not fatal. The lab environment has many pre-installed packages (Docling, KFP, Feast, etc.) that pin older versions of shared dependencies like `transformers` and `click`. As long as the import in the next cell succeeds, the conflicts do not affect training_hub.

If `flash-attn` fails to build, that is also acceptable. Training will fall back to standard attention. Flash-attention is a performance optimization, not a requirement.
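To see which attention path your environment is likely to take, a quick check like the one below may help. This is a minimal sketch and not part of `training_hub`: the helper name `has_module` is hypothetical, and it only tests whether the `flash_attn` package can be found, not whether the training backend will actually use it.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# flash_attn is the import name of the flash-attn package.
if has_module("flash_attn"):
    print("flash-attn is installed.")
else:
    print("flash-attn not found; expect the standard attention fallback.")
```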

In [None]:
# Verify the install
from training_hub import AlgorithmRegistry
print("training_hub imported successfully")
print(f"Available algorithms: {AlgorithmRegistry.list_algorithms()}")

## 3.2 Environment Setup

Same credentials, same endpoint pattern. We reuse the `.env` file and config helper from earlier sections.

In [None]:
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")

Next we confirm that the GPU is visible and check how much memory we have to work with. You should see an NVIDIA L4 with approximately 24GB. If CUDA is not available, the training cells later in this notebook will not run.

In [None]:
import torch
import os
import json
import time

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    print(f"Memory:          {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU detected. Training cells will not execute.")

**Important:** Even with 4-bit quantization and LoRA, training still needs most of the L4's 24GB. If you ran Sections 1 or 2 in this same kernel session, embedding models or other objects may still be occupying GPU memory. Before continuing, **restart your kernel** (Kernel > Restart) and then run only the cells in this section from the top. This ensures the full 24GB is available for training.

You do not need to re-run Sections 1 or 2. The training data file (`synthetic_qa_pairs.csv`) is already saved to disk from Section 2.
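After the restart, you can confirm how much GPU memory is actually free before training. The sketch below is one way to do that, assuming PyTorch is installed; `free_gpu_memory_gb` is a hypothetical helper name, and it returns `None` rather than failing when no CUDA device is visible.

```python
def free_gpu_memory_gb():
    """Return free GPU memory in GB, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # mem_get_info returns (free_bytes, total_bytes) for the current device.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


free_gb = free_gpu_memory_gb()
if free_gb is None:
    print("No CUDA device visible.")
else:
    print(f"Free GPU memory: {free_gb:.1f} GB")
```

On a freshly restarted kernel the free figure should be close to the card's total; a much lower number suggests another process or notebook is still holding memory.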

## 3.3 Discover Available Algorithms

Before writing any training code, we ask the library what it supports. This is the same discovery-first pattern we used with `sdg_hub` in Section 2.

In [None]:
from training_hub import AlgorithmRegistry

algorithms = AlgorithmRegistry.list_algorithms()
print("Available algorithms:", algorithms)

for algo_name in algorithms:
    backends = AlgorithmRegistry.list_backends(algo_name)
    print(f"  {algo_name}: backends = {backends}")

You should see three algorithms: `sft`, `osft`, and `lora_sft`. Each maps to one or more backend implementations.

**SFT (Supervised Fine-Tuning)** updates all model weights. It is the most straightforward approach but also the most destructive: the model can lose general capability while learning domain-specific behavior. It is also the most memory-intensive. An 8B model in half precision needs 16GB just for weights, plus optimizer states and activations. That does not fit in 24GB.

**OSFT (Orthogonal Subspace Fine-Tuning)** is more surgical. It analyzes the model's weight matrices via singular value decomposition, identifies which directions are most critical to the model's existing behavior, freezes those, and only updates the orthogonal (least critical) directions. However, it still requires the full model in half precision for the SVD step, which exceeds our 24GB budget.

**LoRA SFT (Low-Rank Adaptation)** adds small trainable matrices alongside the frozen base weights. Combined with 4-bit quantization (the QLoRA pattern), it compresses the base model to roughly 5GB and trains only the lightweight adapter layers. This is the right choice for our L4 GPU.

The takeaway for customer conversations: LoRA with quantization is not a compromise. It is the standard production approach for adapting large models on mid-range hardware. The quality difference versus full fine-tuning is minimal for most domain adaptation tasks, and you gain the ability to swap adapters in and out without touching the base model.

### 3.3.1 Inspect LoRA SFT Parameters

Let us look at what the LoRA SFT algorithm expects. The `create_algorithm` function returns an object that can describe its own interface.

In [None]:
from training_hub import create_algorithm

lora_algo = create_algorithm('lora_sft')

required_params = lora_algo.get_required_params()
print("Required parameters:")
for k, v in required_params.items():
    print(f"  {k}: {v}")

print()
optional_params = lora_algo.get_optional_params()
print(f"Optional parameters ({len(optional_params)} total):")
for k, v in list(optional_params.items())[:15]:
    print(f"  {k}: {v}")
if len(optional_params) > 15:
    print(f"  ... and {len(optional_params) - 15} more")

The required parameters are minimal: a model path, a data path, and an output directory. Everything else has sensible defaults. This is intentional.

Take a moment to appreciate what this means. Under the hood, the library is loading an 8-billion parameter model in 4-bit quantization, injecting LoRA adapter matrices into the attention and MLP layers, configuring gradient computation only for the adapter weights, managing mixed-precision training, handling checkpointing, and coordinating all of it within our GPU memory budget. That is a significant amount of engineering.

And the interface is three required arguments and a function call.

The library absorbs the complexity of backend configuration, quantization setup, and checkpointing so that the interface stays focused on the decisions that actually matter: which model, which data, which hyperparameters. This is the same design philosophy behind `sdg_hub` and `its_hub`. Field teams should not need a PhD in optimization theory to run a training job. They need to make good decisions about inputs and interpret the outputs. The tooling handles the rest.

## 3.4 Prepare Training Data

We take the synthetic QA pairs from Section 2 and convert them into the format the training pipeline expects. The standard for chat-based fine-tuning is JSONL where each line contains a `messages` array representing one training conversation.

If the CSV from Section 2 is not available (for example, if you skipped that section or are running in a fresh environment), we fall back to a set of built-in sample QA pairs so that training can still proceed.

In [None]:
import pandas as pd

SDG_CSV_PATH = "../02SyntheticDataGen/synthetic_qa_pairs.csv"

if os.path.exists(SDG_CSV_PATH):
    qa_df = pd.read_csv(SDG_CSV_PATH)
    print(f"Loaded {len(qa_df)} QA pairs from Section 2")
else:
    print(f"CSV not found at {SDG_CSV_PATH}. Using built-in sample QA pairs.")
    sample_data = [
        {
            "question": "What is the Thief's Open Locks ability score at level 8?",
            "response": "At level 8, the Thief's Open Locks ability score is 60. This information is found in the Thief Abilities table in the document, under the column for Open Locks, corresponding to level 8.",
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "How does the Thief's Move Silently ability progress from level 1 to level 15?",
            "response": 'Based on the provided document, the Thief\'s \"Move Silently\" ability progresses as follows from level 1 to level 15: Level 1: 25, Level 2: 30, Level 3: 35, Level 4: 40, Level 5: 45, Level 6: 50, Level 7: 55, Level 8: 60, Level 9: 65, Level 10: 68, Level 11: 71, Level 12: 74, Level 13: 77, Level 14: 80, Level 15: 83.',
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "Can a GM adjust the Thief's ability scores, and for what reason?",
            "response": "Yes, a Game Master (GM) can adjust a Thief's ability scores in certain situations. The document specifies that the GM may apply situational adjustments, which can either be positive or negative percentage points, to reflect the difficulty of particular actions under specific conditions.",
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "Why might a GM apply a situational adjustment when a Thief is climbing a wall?",
            "response": "A Game Master (GM) might apply a situational adjustment when a Thief is climbing a wall to account for the specific conditions that could make the task more difficult or easier. For instance, if a wall is slick with slime, the GM might implement a penalty such as a 20% adjustment to represent the increased difficulty.",
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "What is the Thief's Climb Walls ability score at level 17?",
            "response": "At level 17, the Thief's Climb Walls ability score is 96. This can be found in the Thief Abilities table under the Climb Walls column for level 17.",
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "If a Thief is level 10, what is their Pick Pockets ability score?",
            "response": "If a Thief is level 10, their Pick Pockets ability score is 74. According to the provided table, this score is determined by looking under the Pick Pockets column for level 10.",
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "What does a Thief gain when they advance from level 3 to level 4?",
            "response": "When a Thief advances from level 3 to level 4, they gain the following increases in their abilities: Open Locks: 35 to 40, Remove Traps: 30 to 35, Pick Pockets: 40 to 45, Move Silently: 35 to 40, Climb Walls: 82 to 83, Hide: 20 to 25, Listen: 38 to 42.",
            "faithfulness_judgment": "faithful",
        },
        {
            "question": "Describe a situation where a GM might apply a bonus to a Thief's ability score.",
            "response": "A GM might apply a bonus to a Thief's ability score when the circumstances are particularly advantageous. For example, if a Thief is attempting to pick a lock that they have extensive practice with or specific knowledge about, the GM might grant a bonus of +10% to the Open Locks ability score.",
            "faithfulness_judgment": "faithful",
        },
    ]
    qa_df = pd.DataFrame(sample_data)
    print(f"Created {len(qa_df)} built-in sample QA pairs")

print(f"Columns: {list(qa_df.columns)}")

for i, row in qa_df.iterrows():
    print(f"\n  Q{i+1}: {row['question'][:80]}...")
    print(f"  A{i+1}: {row['response'][:80]}...")

### 3.4.1 Convert to JSONL Chat Format

Now we convert the QA pairs into JSONL chat format. Each training example becomes a three-turn conversation: a system prompt that establishes the model's role, a user question, and an assistant response. The model learns to produce the assistant turn given the other two.

The system prompt is a design choice. It tells the model who it is during training, and the same prompt should be used at inference time. If you change the persona later, the model may not behave as expected.
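One way to keep the persona consistent is to build inference requests from the same system prompt that was used for training. The sketch below illustrates the idea; `build_messages` is a hypothetical helper, and the prompt string is a placeholder rather than the one used in this notebook.

```python
def build_messages(system_prompt: str, question: str) -> list[dict]:
    """Build the system and user turns; the assistant turn is left to the model."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


# Placeholder prompt for illustration only.
example_prompt = "You are a rules expert."
messages = build_messages(example_prompt, "What is the Thief's Open Locks score at level 8?")
print([m["role"] for m in messages])  # ['system', 'user']
```

Passing the prompt in as an argument, rather than retyping it, makes it harder for the training and inference personas to drift apart.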

In [None]:
SYSTEM_PROMPT = (
    "You are a rules expert for the Basic Fantasy Role-Playing Game. "
    "Answer questions accurately based on the official rules. "
    "Be specific and cite page references or table values where possible."
)

# Convert to chat messages JSONL format
training_data = []
for _, row in qa_df.iterrows():
    example = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["response"]},
        ]
    }
    training_data.append(example)

# Write JSONL
data_path = "training_data.jsonl"
with open(data_path, "w", encoding="utf-8") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Wrote {len(training_data)} examples to {data_path}")

# Verify structure
with open(data_path, "r", encoding="utf-8") as f:
    first_line = json.loads(f.readline())
print(f"\nFirst example structure:")
print(f"  Keys: {list(first_line.keys())}")
print(f"  Roles: {[m['role'] for m in first_line['messages']]}")
print(f"  User: {first_line['messages'][1]['content'][:80]}...")

Eight training examples is very small. In a production engagement, you would generate hundreds or thousands of pairs across the full document corpus, filter on faithfulness, deduplicate, and review samples before committing to training. We use eight because the goal is to demonstrate the process and observe the behavior, not to produce a production-quality adapter.

Even with eight examples, if the training works correctly, the model should show measurable change on questions that directly overlap with the training data. Whether it generalizes beyond those specific examples is a separate question, and one that a larger dataset would answer.
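The filtering and deduplication steps mentioned above can be sketched in a few lines. This is an illustration under assumptions, not the workshop's pipeline: `filter_and_dedupe` is a hypothetical helper, it treats any label other than `faithful` as a rejection (the `unfaithful` label in the example is assumed), and it deduplicates on a normalized question string only.

```python
def filter_and_dedupe(rows):
    """Keep rows judged faithful and drop repeated questions."""
    seen = set()
    kept = []
    for row in rows:
        label = row.get("faithfulness_judgment", "").strip().lower()
        if label != "faithful":
            continue
        # Normalize case and whitespace so near-identical questions collide.
        key = " ".join(row["question"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


rows = [
    {"question": "What is X?", "response": "A", "faithfulness_judgment": "faithful"},
    {"question": "what is  x?", "response": "A again", "faithfulness_judgment": "faithful"},
    {"question": "What is Y?", "response": "B", "faithfulness_judgment": "unfaithful"},
]
print(len(filter_and_dedupe(rows)))  # 1
```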

## 3.5 Download the Base Model

LoRA operates on the model weights directly, so the model must be on disk, not behind an API. We download the same Granite model we have been using throughout the workshop.

This downloads approximately 16GB of model files. The files are large, but training will load them in 4-bit quantization (roughly 5GB in GPU memory).

This step typically takes 2 to 4 minutes depending on network speed.
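Before starting a download of this size, it can be worth checking that the target disk has room. The sketch below uses only the standard library; `has_free_space` is a hypothetical helper, and the 20GB threshold is an assumed margin above the roughly 16GB of model files.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes / 1e9 >= required_gb


# Assumed threshold: about 16GB of model files plus some headroom.
if has_free_space(".", 20.0):
    print("Enough disk space for the model download.")
else:
    print("Less than 20GB free; the download may fail partway through.")
```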

In [None]:
# Set to False to skip the download and use a pre-existing local copy.
RUN_LIVE = True

In [None]:
from huggingface_hub import snapshot_download

MODEL_ID = "ibm-granite/granite-3.2-8b-instruct"
LOCAL_MODEL_DIR = "./models/granite-3.2-8b-instruct"

if RUN_LIVE:
    start_time = time.time()

    if os.path.exists(LOCAL_MODEL_DIR) and len(os.listdir(LOCAL_MODEL_DIR)) > 5:
        print(f"Model already exists at {LOCAL_MODEL_DIR}, skipping download.")
    else:
        print(f"Downloading {MODEL_ID}...")
        print("This may take several minutes.")
        snapshot_download(
            repo_id=MODEL_ID,
            local_dir=LOCAL_MODEL_DIR,
        )

    elapsed = time.time() - start_time
    minutes = int(elapsed // 60)
    seconds = int(elapsed % 60)

    # Verify
    model_files = [f for f in os.listdir(LOCAL_MODEL_DIR) if not f.startswith('.')]
    safetensor_files = [f for f in model_files if f.endswith('.safetensors')]
    total_size = sum(
        os.path.getsize(os.path.join(LOCAL_MODEL_DIR, f)) for f in model_files
    )

    print(f"\nModel directory: {len(model_files)} items, {len(safetensor_files)} safetensor shards")
    print(f"Total size:      {total_size / 1e9:.1f} GB")
    print(f"Elapsed:         {minutes}m {seconds}s")
else:
    print("RUN_LIVE is False. Skipping model download.")
    print(f"Expecting model at: {LOCAL_MODEL_DIR}")

## 3.6 Run LoRA SFT Training

This is the core of the section: a single function call that trains the model.

We use the QLoRA pattern, which combines two techniques:

1. **4-bit quantization** (`load_in_4bit=True`): The base model's weights are loaded in NF4 format via bitsandbytes. This reduces the 16GB model to roughly 5GB in GPU memory. The quantized weights are frozen and never updated.

2. **LoRA adapters** (`lora_r=8`, `lora_alpha=16`): Small trainable matrices are injected into each attention and MLP layer. These are the only parameters that receive gradients. With rank 8, the adapters add less than 1% new parameters relative to the base model.

The training parameters are conservative for the L4's 24GB budget:

- `effective_batch_size=2`: Small batch to stay within memory
- `max_seq_len=512`: Our training examples are short; no need for longer
- `learning_rate=2e-4`: Standard for LoRA (higher than full fine-tuning because fewer parameters are updated)
- `num_epochs=5`: Multiple passes over our small dataset

This step typically takes 3 to 6 minutes on this hardware.
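To put these settings in perspective, the arithmetic below estimates how many optimizer steps the run performs. It is a rough sketch that assumes one optimizer step per effective batch and that a final partial batch is kept; the backend may count steps differently. The dataset size of 8 matches the built-in sample set.

```python
import math


def training_steps(num_examples: int, effective_batch_size: int, num_epochs: int):
    """Return (steps per epoch, total steps), assuming one step per effective batch."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


per_epoch, total = training_steps(num_examples=8, effective_batch_size=2, num_epochs=5)
warmup_steps = 2
print(per_epoch, total, warmup_steps / total)  # 4 20 0.1
```

Under these assumptions the whole run is about 20 steps, so `warmup_steps=2` covers roughly the first 10% of training.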

In [None]:
# Set to True to run training live. Set to False to use pre-built adapter.
RUN_LIVE = True

In [None]:
from training_hub import lora_sft

CKPT_DIR = "./lora_output"

if RUN_LIVE:
    print("Starting QLoRA training...")
    print(f"  Model:          {LOCAL_MODEL_DIR}")
    print(f"  Data:           {data_path}")
    print(f"  Output:         {CKPT_DIR}")
    print(f"  Quantization:   4-bit NF4 (bitsandbytes)")
    print(f"  LoRA rank:      8")
    print(f"  LoRA alpha:     16")
    print(f"  Epochs:         5")
    print(f"  Batch size:     2")
    print(f"  Max seq len:    512")
    print(f"  Learning rate:  2e-4")
    print()

    start_time = time.time()

    result = lora_sft(
        model_path=LOCAL_MODEL_DIR,
        data_path=data_path,
        ckpt_output_dir=CKPT_DIR,
        # QLoRA: load base model in 4-bit
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="bfloat16",  # match bf16=True below
        bnb_4bit_use_double_quant=True,
        # LoRA adapter configuration
        lora_r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        # Training parameters (conservative for 24GB L4)
        num_epochs=5,
        effective_batch_size=2,
        max_seq_len=512,
        learning_rate=2e-4,
        lr_scheduler="cosine",
        warmup_steps=2,
        bf16=True,
    )

    elapsed = time.time() - start_time
    minutes = int(elapsed // 60)
    seconds = int(elapsed % 60)
    print(f"\nTraining complete in {minutes}m {seconds}s")
    print(f"Result: {result}")
else:
    print("Using pre-built checkpoint. Skipping training.")
    CKPT_DIR = "../prebuilt/lora_output"

### 3.6.1 Inspect Training Output

After training completes, the output directory contains the LoRA adapter weights and training metadata. Let us look at what was produced. The adapter files should be small (tens of megabytes) compared to the full model (16GB), because only the LoRA matrices were saved.

In [None]:
if os.path.exists(CKPT_DIR):
    print(f"Output directory: {CKPT_DIR}")
    print()
    for root, dirs, files in os.walk(CKPT_DIR):
        # Skip hidden directories
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        level = root.replace(CKPT_DIR, '').count(os.sep)
        indent = '  ' * level
        subdir = os.path.basename(root)
        if level > 0:
            print(f"{indent}{subdir}/")
        for f in sorted(files):
            fpath = os.path.join(root, f)
            size = os.path.getsize(fpath)
            if size > 1_000_000:
                print(f"{indent}  {f:45s} {size / 1e6:>8.1f} MB")
            else:
                print(f"{indent}  {f:45s} {size:>8,} bytes")
else:
    print(f"Output directory not found at {CKPT_DIR}")

### 3.6.2 Free GPU Memory

Training is done. Let us release the GPU memory so later cells (or other notebooks) have a clean slate.

In [None]:
import gc

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
    print(f"GPU memory reserved:  {torch.cuda.memory_reserved() / 1e9:.1f} GB")

## 3.7 Understanding What Happened

Let us step back and make sure the mechanics are clear, because this is what you will explain to customers.

**What the base model looks like in memory.** The Granite 3.2 8B model has about 8 billion parameters. In 16-bit precision each parameter is 2 bytes, so the model needs roughly 16GB just for weights. Full fine-tuning also needs gradients (another 16GB at 16-bit) and optimizer states (standard Adam keeps two values per parameter, so roughly 32GB at 16-bit and more at 32-bit), plus activations. That is well past 24GB, which is why full SFT and OSFT do not fit on an L4.
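The weight-memory arithmetic is easy to reproduce. The sketch below computes raw storage for the weights alone at a few precisions, using a round 8 billion parameters; `weights_gb` is a hypothetical helper, and the figures ignore gradients, optimizer states, activations, and any quantization overhead.

```python
def weights_gb(num_params: float, bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


N = 8e9  # round figure for an 8B-parameter model
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:10s} {weights_gb(N, bits):5.1f} GB")
```

The raw 4-bit figure comes out at 4GB. The roughly 5GB quoted in this section is somewhat higher, most likely because quantization constants and any layers kept in higher precision add to the total.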

**What 4-bit quantization does.** When we set `load_in_4bit=True`, the bitsandbytes library compresses each weight from 16 bits to 4 bits using the NF4 (Normal Float 4) format. NF4 is designed for normally distributed weights, which is a good match for transformer models. The result: the same 8B model fits in about 5GB. The quantized weights are frozen and never receive gradients.
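To make the idea of 4-bit storage concrete, here is a deliberately simplified sketch. It uses evenly spaced levels with a per-block absmax scale, which is not the NF4 codebook that bitsandbytes uses (NF4 places its 16 levels according to a normal distribution). The function names are hypothetical, and the point is only to show that each weight becomes one of 16 codes plus a shared scale.

```python
def quantize_block(values, levels=16):
    """Uniform absmax quantization of one block to integer codes 0..levels-1."""
    scale = max(abs(v) for v in values) or 1.0
    half = (levels - 1) / 2  # 7.5 for 16 levels
    codes = [round((v / scale) * half + half) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Map integer codes back to approximate floating-point values."""
    half = (levels - 1) / 2
    return [((c - half) / half) * scale for c in codes]


block = [0.12, -0.40, 0.03, 0.25]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes)
print([round(v, 3) for v in restored])
```

Each code fits in 4 bits, and the restored values are close to, but not identical to, the originals. That small rounding error is the price of the memory saving.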

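**What LoRA adds on top.** For each target weight matrix W (shape m x n), LoRA inserts two small matrices: A (shape m x r) and B (shape r x n), where r is the rank (we used r=8). The effective weight becomes W + (alpha/r) * A*B, where alpha is the `lora_alpha` setting; with `lora_alpha=16` and `lora_r=8` the update is scaled by 2. Only A and B receive gradients. Together they hold r x (m + n) values per target matrix instead of m x n, which is why, with r=8 on an 8B model, the trainable parameters are less than 1% of the total. These adapters are what get saved to disk.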
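The parameter savings follow directly from the shapes. The sketch below counts the values in one adapter pair and compares them with the full matrix; the 4096 x 4096 shape is a hypothetical example rather than Granite's actual layer size, and `lora_params` is an illustrative helper. It also prints the alpha/r factor by which LoRA conventionally scales the adapter update.

```python
def lora_params(m: int, n: int, r: int) -> int:
    """Trainable values in one adapter pair: A is m x r, B is r x n."""
    return r * (m + n)


m = n = 4096  # hypothetical square projection, not Granite's actual shape
r, alpha = 8, 16
full = m * n
added = lora_params(m, n, r)
print(full, added, f"{added / full:.2%}", alpha / r)
# 16777216 65536 0.39% 2.0
```

For this example shape, the adapter pair is well under 1% of the size of the matrix it modifies, which is consistent with the overall figure quoted above.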

**Why this works.** The intuition is that domain adaptation does not require changing every weight in the model. The base model already knows how to form sentences, follow instructions, and reason. What it lacks is specific domain knowledge. LoRA provides a low-rank "correction" that nudges the model's representations toward the domain without disrupting its general capabilities. The 4-bit base model handles the heavy lifting; the adapters handle the specialization.

## 3.8 Training Methods at a Glance

The library offers three algorithms. Here is how they compare, and why we chose LoRA for this lab.

In [None]:
comparison = [
    ["What gets updated",       "All model weights",                                "Orthogonal (least critical) weight directions",  "Small adapter matrices alongside frozen weights"],
    ["Quantization support",    "No (needs full precision)",                         "No (needs half precision for SVD)",               "Yes (4-bit or 8-bit via bitsandbytes)"],
    ["Min GPU memory (8B)",     "~40GB",                                             "~35GB",                                           "~10GB with 4-bit quantization"],
    ["Forgetting risk",         "High; no protection for existing knowledge",        "Low; critical directions frozen by design",       "Low; base weights are completely frozen"],
    ["Output artifact",         "Full model checkpoint",                             "Full model checkpoint",                           "Small adapter files (tens of MB)"],
    ["Swappable adapters",      "No",                                                "No",                                              "Yes; multiple adapters per base model"],
    ["Backend",                 "instructlab-training",                               "mini-trainer",                                    "unsloth"],
    ["Fits on L4 (24GB)",       "No",                                                "No",                                              "Yes"],
]

header = f"{'Property':<28s} {'SFT':<42s} {'OSFT':<47s} {'LoRA SFT (QLoRA)':<45s}"
print(header)
print("-" * len(header))
for row in comparison:
    print(f"{row[0]:<28s} {row[1]:<42s} {row[2]:<47s} {row[3]:<45s}")

## Summary

In this notebook we demonstrated model adaptation using QLoRA via `training_hub`.

The base Granite 3.2 8B model was loaded in 4-bit quantization, reducing its memory footprint from 16GB to roughly 5GB. LoRA adapter matrices (rank 8) were injected into the model's layers, adding less than 1% new trainable parameters. Only these adapters received gradient updates during training. The result is a small set of adapter files that can be loaded on top of the base model at inference time.

This is the standard production pattern for adapting large models on mid-range hardware. When a customer has an L4, a T4, or even a good consumer GPU, QLoRA is the answer. The quality tradeoff versus full fine-tuning is minimal for domain adaptation tasks, and you gain practical benefits: smaller artifacts, swappable adapters, and the ability to run on hardware that customers actually have.

For customers with access to larger GPUs (A100 80GB, H100, or similar), the library also offers SFT and OSFT. OSFT is particularly interesting for production use because it restricts updates to directions orthogonal to those most critical to the model's existing behavior, which is designed to preserve existing capabilities. But for this lab environment with a 24GB L4, LoRA is the right tool for the job.

| Scenario | Recommendation |
|----------|---------------|
| Limited GPU memory (24GB or less) | QLoRA (this notebook) |
| Multiple domain adapters needed | QLoRA, with one adapter per domain |
| Large GPU, must preserve general capabilities | OSFT |
| Large GPU, full retraining acceptable | SFT |
| Customer asks "will this break existing behavior?" | OSFT or LoRA; OSFT freezes the most critical weight directions, LoRA freezes all base weights |