# 3. Model Adaptation: Changing What the Model Knows

> **GPU Required.** This section requires a CUDA-capable GPU. The lab environment provides an NVIDIA L4 with 24GB of VRAM. If you are running this outside the lab without GPU access, you can follow the setup and data preparation cells, but training will not execute. Pre-built outputs are provided for that scenario.

**Purpose:**

We have now exhausted the options that do not involve changing the model. RAG gave us grounded answers for 6 out of 10 questions. Inference-time scaling (Best-of-N) recovered 2 more. But for the remaining failures, particularly questions requiring implicit domain reasoning, every approach hit the same wall: the model's weights do not contain a reliable path to the correct answer.

This is where model adaptation becomes justified. Not because it is the next thing on the list, but because two independent evaluation methods pointed to the same conclusion: the gap is in the model, not in the pipeline.

We will use **LoRA SFT (Low-Rank Adaptation with Supervised Fine-Tuning)**, a parameter-efficient method that adds small trainable matrices alongside the frozen base model weights. The base model itself is not modified. Instead, LoRA learns a compact set of adjustments that shift the model's behavior toward our domain without destroying what it already knows.

To fit an 8-billion parameter model into 24GB of GPU memory, we load the base weights in **4-bit quantization** using `bitsandbytes`. This reduces the model's memory footprint from roughly 16GB (in half precision) to around 5GB, leaving room for LoRA's trainable parameters, optimizer states, and activations. This combination of LoRA with 4-bit quantization is commonly known as **QLoRA**.
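The memory figures above follow from simple arithmetic on bytes per parameter. The sketch below reproduces that back-of-the-envelope estimate; the helper `weight_memory_gb` is illustrative and not part of any library, and it counts weights only. The roughly 5GB observed in practice is somewhat above the raw 4-bit figure, plausibly because quantization stores extra scaling constants and some layers are typically kept in higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # nominal parameter count for an "8B" model (assumption)

half_precision_gb = weight_memory_gb(NUM_PARAMS, 16)  # fp16 / bf16
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)         # 4-bit quantized

print(f"Half-precision weights: {half_precision_gb:.1f} GB")
print(f"4-bit weights:          {four_bit_gb:.1f} GB")
```

Neither number includes optimizer states, activations, or the adapter itself, which is why the headroom left by quantization matters.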

The training data comes directly from Section 2: synthetic question-answer pairs generated from the customer's own documents. The model comes from Hugging Face. The training runs locally on GPU.

By the end of this section, you will have an adapted model and evidence of whether the adaptation closed the gaps that RAG and inference-time scaling could not.

## 3.1 Install Training Hub

`training_hub` is an open-source library from the Red Hat AI Innovation Team. It provides a single Python interface for multiple post-training algorithms, each backed by a community implementation. You call a function, pass your model and data, and the library handles the rest.

We install `training-hub` with two extras, `[cuda]` for GPU dependencies including PyTorch with CUDA support and `[lora]` for the LoRA backend, plus the separate `bitsandbytes` package for 4-bit quantization support.

_This step typically takes 3 to 5 minutes. Package installation involves compiling several native extensions._

In [None]:
# Install training-hub with CUDA and LoRA support
! pip install training-hub -q
! pip install 'training-hub[cuda]' --no-build-isolation -q
! pip install 'training-hub[lora]' -q

# bitsandbytes provides 4-bit quantization, required to fit 8B models in 24GB VRAM
! pip install bitsandbytes -q

# Remove mamba-ssm and causal-conv1d: these ship pre-compiled CUDA extensions
# built against an older PyTorch ABI, causing "undefined symbol" errors on import.
# They are only needed for Mamba-architecture models. Granite is a transformer,
# so removing them is safe. Unsloth handles their absence gracefully.
! pip uninstall mamba-ssm causal-conv1d -y 2>/dev/null || true

# Patch Unsloth compiler: peft 0.18+ added VARIANT_KWARG_KEYS to Linear forward(),
# but Unsloth's code generator only imports symbols from the parent module, missing
# constants defined elsewhere. This adds a fallback import for the missing constant.
import unsloth_zoo.compiler, inspect
_src = inspect.getsource(unsloth_zoo.compiler.create_new_function)
if "VARIANT_KWARG_KEYS" not in _src:
    _old = '    new_source = imports + "\\n\\n" + new_source'
    _patch = (
        '    if "VARIANT_KWARG_KEYS" in new_source and "VARIANT_KWARG_KEYS" not in items:\n'
        '        imports += "\\ntry:\\n    from peft.tuners.lora.layer import VARIANT_KWARG_KEYS\\n'
        'except ImportError:\\n    VARIANT_KWARG_KEYS = []\\n"\n'
    )
    _src = _src.replace(_old, _patch + _old, 1)
    exec(compile(_src, unsloth_zoo.compiler.__file__, "exec"), unsloth_zoo.compiler.__dict__)
    print("Applied VARIANT_KWARG_KEYS patch to Unsloth compiler")
else:
    print("Unsloth compiler already handles VARIANT_KWARG_KEYS")

# Clear any stale compiled cache from previous failed runs
import shutil, os
_cache = os.path.join(os.getcwd(), "unsloth_compiled_cache")
if os.path.exists(_cache):
    shutil.rmtree(_cache)
    print(f"Cleared stale compiled cache at {_cache}")

**Note:** You may see dependency conflict warnings from pip. These are advisory, not fatal. The lab environment has many pre-installed packages (Docling, KFP, Feast, etc.) that pin older versions of shared dependencies like `transformers` and `click`. As long as the import in the next cell succeeds, the conflicts do not affect training_hub.

If `flash-attn` fails to build, that is also acceptable. Training will fall back to standard attention. Flash-attention is a performance optimization, not a requirement.

The install cell applies two compatibility fixes for this environment:

1. **Removes `mamba-ssm` and `causal-conv1d`**: These packages contain pre-compiled CUDA extensions that are binary-incompatible with PyTorch 2.10.0+cu128, causing an `undefined symbol` error when Unsloth tries to import them. Since they are only used for Mamba-architecture models and Granite is a transformer, removing them is safe.

2. **Patches the Unsloth compiler**: PEFT 0.18+ introduced a `VARIANT_KWARG_KEYS` constant used in LoRA Linear layer forward methods. Unsloth's code generator extracts these methods but only auto-imports symbols from the immediate parent module. When the constant is defined in a different submodule (`peft.tuners.lora.layer` vs `peft.tuners.lora.inc`), the generated code fails with a `NameError`. The patch adds a fallback import for this constant.

### 3.1.1 Verify the Installation

A quick import check confirms that `training_hub` is installed and can enumerate its available algorithms. If this cell fails, revisit the install step above.

In [2]:
# Verify the install
from training_hub import AlgorithmRegistry
print("training_hub imported successfully")
print(f"Available algorithms: {AlgorithmRegistry.list_algorithms()}")

training_hub imported successfully
Available algorithms: ['sft', 'osft', 'lora_sft']


## 3.2 Environment Setup

Same credentials, same endpoint pattern. We reuse the `.env` file and config helper from earlier sections. The API key and endpoint are needed later when we compare the adapted model's outputs against the baseline.

In [3]:
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")

Endpoint: https://litellm-prod.apps.maas.redhatworkshops.io/v1
API Key:  sk-UFHcL...


### 3.2.1 Verify GPU Access

Before we go any further, we confirm that PyTorch can see the GPU and report its memory. The L4 should show approximately 24GB. If no GPU is detected, training cells will not execute, but you can still follow the data preparation steps and use pre-built outputs.

In [4]:
import torch
import os
import json
import time

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU:             {gpu_name}")
    print(f"Memory:          {gpu_mem_gb:.1f} GB")
    if gpu_mem_gb < 20:
        print("WARNING: Less than 20GB VRAM. Training may fail even with 4-bit quantization.")
    elif gpu_mem_gb < 40:
        print("NOTE: 4-bit quantization (QLoRA) will be used to fit the 8B model in available memory.")
    else:
        print("NOTE: Sufficient memory for full-precision LoRA. 4-bit quantization is optional.")
else:
    print("WARNING: No GPU detected. Training cells will not execute.")

PyTorch version: 2.10.0+cu128
CUDA available:  True
GPU:             NVIDIA L4
Memory:          23.6 GB
NOTE: 4-bit quantization (QLoRA) will be used to fit the 8B model in available memory.




**Important:** LoRA training on an 8B model with 4-bit quantization requires approximately 12 to 18GB of GPU memory depending on batch size and sequence length. The L4 provides 24GB, which gives us reasonable headroom.

If you ran Sections 1 or 2 in this same kernel session, embedding models or other objects may still be occupying GPU memory. Before continuing, **restart your kernel** (Kernel > Restart) and then run only the cells in this section from the top. This clears GPU memory for training.

You do not need to re-run Sections 1 or 2. The training data file (`synthetic_qa_pairs.csv`) is already saved to disk from Section 2.

## 3.3 Discover Available Algorithms

Before writing any training code, we ask the library what it supports. This is the same discovery-first pattern we used with `sdg_hub` in Section 2. Rather than guessing at function names or reading source code, we let the registry tell us what is available and what each algorithm requires.

In [5]:
from training_hub import AlgorithmRegistry

algorithms = AlgorithmRegistry.list_algorithms()
print("Available algorithms:", algorithms)

for algo_name in algorithms:
    backends = AlgorithmRegistry.list_backends(algo_name)
    print(f"  {algo_name}: backends = {backends}")

Available algorithms: ['sft', 'osft', 'lora_sft']
  sft: backends = ['instructlab-training']
  osft: backends = ['mini-trainer']
  lora_sft: backends = ['unsloth']


You should see three algorithms: `sft`, `osft`, and `lora_sft`. Each maps to one or more backend implementations.

**SFT (Supervised Fine-Tuning)** updates all model weights. It is the most straightforward approach but also the most destructive: the model can lose general capability while learning domain-specific behavior. It is also the most memory-intensive, requiring well over 24GB for an 8B model. That puts it out of reach on our hardware.

**OSFT (Orthogonal Subspace Fine-Tuning)** is more surgical. It analyzes the model's weight matrices via singular value decomposition, identifies which directions are most critical to the model's existing behavior, freezes those, and only updates the orthogonal (least critical) directions. However, OSFT requires the full model in memory at half precision (roughly 16GB just for weights), plus optimizer states and activations. On a 24GB GPU, this does not leave sufficient margin.

**LoRA SFT (Low-Rank Adaptation)** adds small trainable matrices alongside the frozen base weights. Because the base weights are frozen, they can be loaded in a quantized format (4-bit), dramatically reducing memory requirements. This makes LoRA the practical choice for our hardware.

We use LoRA because it fits our hardware constraints and is the most widely adopted parameter-efficient method in the field. When a customer asks "will fine-tuning break what the model already does well?", LoRA gives you a concrete answer: the base weights are never modified. The adapter learns adjustments that sit on top of the original model.
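To make "small trainable matrices" concrete, the sketch below counts parameters for a single linear layer. A full update to a `d_out x d_in` weight matrix trains every entry, while LoRA trains two low-rank factors, `B` (`d_out x rank`) and `A` (`rank x d_in`). The dimensions and rank used here are illustrative assumptions, not values read from the Granite configuration or from `training_hub` defaults.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full_matrix_params, lora_params) for one linear layer.

    LoRA replaces a trainable d_out x d_in update with two factors:
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Illustrative dimensions only; not taken from the Granite config.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)

print(f"Full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

Under these assumed shapes the adapter trains well under 1% of the parameters in that layer, which is where the memory savings for gradients and optimizer states come from.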

### 3.3.1 Inspect LoRA Parameters

Let's see what the LoRA algorithm requires and what options it exposes. This tells us what decisions we actually need to make versus what the library handles for us.

In [6]:
from training_hub import create_algorithm

lora_algo = create_algorithm('lora_sft')

required_params = lora_algo.get_required_params()
print("Required parameters:")
for k, v in required_params.items():
    print(f"  {k}: {v}")

print()
optional_params = lora_algo.get_optional_params()
print(f"Optional parameters ({len(optional_params)} total):")
for k, v in list(optional_params.items())[:15]:
    print(f"  {k}: {v}")
if len(optional_params) > 15:
    print(f"  ... and {len(optional_params) - 15} more")

Required parameters:
  model_path: <class 'str'>
  data_path: <class 'str'>
  ckpt_output_dir: <class 'str'>

Optional parameters (56 total):
  num_epochs: <class 'int'>
  effective_batch_size: <class 'int'>
  learning_rate: <class 'float'>
  max_seq_len: <class 'int'>
  max_tokens_per_gpu: <class 'int'>
  data_output_dir: <class 'str'>
  save_samples: <class 'int'>
  warmup_steps: <class 'int'>
  accelerate_full_state_at_epoch: <class 'bool'>
  checkpoint_at_epoch: <class 'bool'>
  nproc_per_node: typing.Union[str, int]
  nnodes: <class 'int'>
  node_rank: <class 'int'>
  rdzv_id: typing.Union[str, int]
  rdzv_endpoint: <class 'str'>
  ... and 41 more


The required parameters are minimal: a model path, a data path, and an output directory.

Everything else has sensible defaults. This is intentional.

Take a moment to appreciate what this means.

Under the hood, LoRA is decomposing weight update matrices into low-rank factors, inserting trainable adapter layers at strategic points in an 8-billion parameter model, managing 4-bit quantization of the frozen base, configuring gradient computation only for the adapter weights, handling mixed-precision training, and coordinating all of it within a 24GB memory budget. That is a significant amount of engineering.

And the interface is three required arguments and a function call.

The library absorbs the complexity of backend configuration, quantization setup, and checkpointing so that the interface stays focused on the decisions that actually matter: which model, which data, which hyperparameters. This is the same design philosophy behind `sdg_hub` and `its_hub`. Field teams should not need a PhD in optimization theory to run a training job. They need to make good decisions about inputs and interpret the outputs. The tooling handles the rest.

## 3.4 Prepare Training Data

We take the synthetic QA pairs from Section 2 and convert them into the format the training pipeline expects. The standard for chat-based fine-tuning is JSONL where each line contains a `messages` array representing one training conversation.

If Section 2 was not completed (or the CSV file is missing), this cell provides fallback training data drawn from the Basic Fantasy RPG rules. In a full workshop run, the synthetic data generation step produces this file automatically.

In [7]:
import pandas as pd

# Load the SDG output from Section 2
sdg_path = "../02SyntheticDataGen/synthetic_qa_pairs.csv"

if not os.path.exists(sdg_path):
    print(f"NOTE: {sdg_path} not found.")
    print("Using fallback training data from the handbook's sample questions.")
    print("In a full run, Section 2 (Synthetic Data Generation) produces this file.")
    print()
    fallback_data = [
        {
            "question": "Why can't Elves roll higher than a d6 for hit points?",
            "response": (
                "In Basic Fantasy RPG, Elves use a d6 for hit points because they are a "
                "combination class (Fighter/Magic-User). Their hit die is determined by the "
                "lower of the two classes. Magic-Users use d4 and Fighters use d8, but the "
                "combination class compromise gives Elves d6. This is specified in the "
                "Character Races section."
            )
        },
        {
            "question": "What happens if a Thief fails an Open Locks attempt?",
            "response": (
                "Open Locks allows the Thief to unlock a lock without a proper key. It may "
                "only be tried once per lock. If the attempt fails, the Thief must wait until "
                "they have gained another level of experience before trying again."
            )
        },
        {
            "question": "What is the maximum number of retainers a character with Charisma 18 can hire?",
            "response": (
                "A character with Charisma 18 receives a +3 bonus. According to the Charisma "
                "table, this affects the number of retainers a character may hire and their "
                "loyalty. The Retainers section specifies that the base number is modified by "
                "the Charisma bonus."
            )
        },
        {
            "question": "How does the Turning Undead table work for Clerics?",
            "response": (
                "The Turning Undead table cross-references the Cleric's level against the "
                "undead type. The number shown is the target on 2d6 that must be rolled to "
                "successfully turn the undead. A 'T' means the undead are automatically turned, "
                "and a 'D' means they are automatically destroyed. If the roll fails, the Cleric "
                "cannot attempt to turn that group again during the same encounter."
            )
        },
        {
            "question": "What is the movement rate for a character in plate armor?",
            "response": (
                "Characters wearing plate mail or plate mail with shield have a base movement "
                "rate of 20 feet per round. This is the slowest standard movement rate. "
                "Unarmored characters move at 40 feet per round. Movement rates are detailed "
                "in the Equipment section."
            )
        },
        {
            "question": "How are saving throws determined for a first-level Fighter?",
            "response": (
                "Saving throws for a first-level Fighter are listed in the Fighter Saving "
                "Throws table. The five categories are Death Ray or Poison, Magic Wands, "
                "Paralysis or Turn to Stone, Dragon Breath, and Spells. Each has a target "
                "number that must be rolled on a d20 or higher to succeed. Fighters generally "
                "have the best saving throws against physical threats like Dragon Breath."
            )
        }
    ]
    qa_df = pd.DataFrame(fallback_data)
else:
    qa_df = pd.read_csv(sdg_path)

print(f"Loaded {len(qa_df)} QA pairs")
print(f"Columns: {list(qa_df.columns)}")

for i, row in qa_df.iterrows():
    print(f"\n  Q{i+1}: {row['question'][:100]}")
    print(f"  A{i+1}: {row['response'][:100]}...")

Loaded 8 QA pairs
Columns: ['question', 'response', 'faithfulness_judgment']

  Q1: What is the Thief's Open Locks ability score at level 8?
  A1: At level 8, the Thief's Open Locks ability score is 60. This information is found in the Thief Abili...

  Q2: How does the Thief's Move Silently ability progress from level 1 to level 15?
  A2: Based on the provided document, the Thief's "Move Silently" ability progresses as follows from level...

  Q3: Can a GM adjust the Thief's ability scores, and for what reason?
  A3: Yes, a Game Master (GM) can adjust a Thief's ability scores in certain situations. The document spec...

  Q4: Why might a GM apply a situational adjustment when a Thief is climbing a wall?
  A4: A Game Master (GM) might apply a situational adjustment when a Thief is climbing a wall to account f...

  Q5: What is the Thief's Climb Walls ability score at level 17?
  A5: At level 17, the Thief's Climb Walls ability score is 96. This can be found in the Thief Abilities t...


### 3.4.1 Convert to Chat JSONL Format

Now we convert the QA pairs into JSONL chat format. Each training example becomes a three-turn conversation: a system prompt that establishes the model's role, a user question, and an assistant response. The model learns to produce the assistant turn given the other two.

The system prompt is a design choice. It tells the model who it is during training, and the same prompt should be used at inference time. If you change the persona later, the model may not behave as expected.
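Because every record must follow the same three-turn shape, a small structural check can catch malformed lines before they reach the trainer. The sketch below is one way to do that; `validate_chat_record` and `EXPECTED_ROLES` are hypothetical names for this illustration, not part of `training_hub`, and the sample records use placeholder text.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_chat_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]

    problems = []
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        content = m.get("content") if isinstance(m, dict) else None
        if not isinstance(content, str) or not content.strip():
            problems.append("empty or malformed message")
    return problems


# Placeholder records for demonstration only.
good = json.dumps({"messages": [
    {"role": "system", "content": "Example system prompt."},
    {"role": "user", "content": "Example question?"},
    {"role": "assistant", "content": "Example answer."},
]})
bad = json.dumps({"messages": [{"role": "user", "content": ""}]})

print(validate_chat_record(good))
print(validate_chat_record(bad))
```

Running a check like this over each line of the JSONL file is cheap compared with discovering a formatting problem several minutes into a training run.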

In [8]:
SYSTEM_PROMPT = (
    "You are a rules expert for the Basic Fantasy Role-Playing Game. "
    "Answer questions accurately based on the official rules. "
    "Be specific and cite page references or table values where possible."
)

# Convert to chat messages JSONL format
training_data = []
for _, row in qa_df.iterrows():
    example = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["response"]},
        ]
    }
    training_data.append(example)

# Write JSONL
data_path = "training_data.jsonl"
with open(data_path, "w", encoding="utf-8") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Wrote {len(training_data)} examples to {data_path}")

# Verify structure
with open(data_path, "r", encoding="utf-8") as f:
    first_line = json.loads(f.readline())
print(f"\nFirst example structure:")
print(f"  Keys: {list(first_line.keys())}")
print(f"  Roles: {[m['role'] for m in first_line['messages']]}")
print(f"  User: {first_line['messages'][1]['content'][:80]}...")

Wrote 8 examples to training_data.jsonl

First example structure:
  Keys: ['messages']
  Roles: ['system', 'user', 'assistant']
  User: What is the Thief's Open Locks ability score at level 8?...


Eight training examples (six if the fallback set was used) is a very small dataset. In a production engagement, you would generate hundreds or thousands of pairs across the full document corpus, filter on faithfulness, deduplicate, and review samples before committing to training. We use so few because the goal is to demonstrate the process and observe the behavior, not to produce a production-quality adapter.
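The filtering and deduplication steps mentioned above might look roughly like the following sketch. It assumes the `faithfulness_judgment` column holds a label such as `"YES"` for faithful answers; that label is a guess for illustration, so check the actual values in your SDG output. The function name `filter_and_dedupe` and the sample rows are hypothetical.

```python
def filter_and_dedupe(pairs, keep_label="YES"):
    """Keep rows judged faithful and drop repeated questions.

    `keep_label` is an ASSUMED value for the `faithfulness_judgment`
    column; inspect your own data before relying on it.
    """
    seen = set()
    kept = []
    for row in pairs:
        judgment = str(row.get("faithfulness_judgment", "")).strip().upper()
        if judgment != keep_label:
            continue
        # Normalize case and whitespace so near-identical questions collide.
        key = " ".join(row["question"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Placeholder rows for demonstration only.
sample = [
    {"question": "Example question A?", "response": "...", "faithfulness_judgment": "YES"},
    {"question": "example  question a?", "response": "...", "faithfulness_judgment": "YES"},
    {"question": "Example question B?", "response": "...", "faithfulness_judgment": "NO"},
]

kept = filter_and_dedupe(sample)
print(f"Kept {len(kept)} of {len(sample)} rows")
```

Exact-match deduplication after normalization is the simplest option; a larger corpus would likely also need near-duplicate detection and a manual review of samples.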

Even with this few examples, if the training works correctly, the model should show measurable change on questions that directly overlap with the training data. Whether it generalizes beyond those specific examples is a separate question, and one that a larger dataset would answer.

## 3.5 Download the Base Model

LoRA operates on the model weights directly, so the model must be on disk, not behind an API. We download the same Granite model we have been using throughout the workshop.

The download is approximately 16GB. The weights will be loaded in 4-bit quantization during training, so we need the full download on disk but only a fraction of that will occupy GPU memory.

The `RUN_LIVE_DOWNLOAD` toggle controls whether the download executes. If set to `False`, the cell assumes a pre-built copy exists at the expected path.

_This step typically takes 2 to 5 minutes depending on network speed._

In [9]:
# Set to True to download the model live. Set to False to use a pre-downloaded copy.
RUN_LIVE_DOWNLOAD = True

In [10]:
from huggingface_hub import snapshot_download

MODEL_ID = "ibm-granite/granite-3.2-8b-instruct"
LOCAL_MODEL_DIR = "./models/granite-3.2-8b-instruct"

if RUN_LIVE_DOWNLOAD:
    start_time = time.time()

    if os.path.exists(LOCAL_MODEL_DIR) and len(os.listdir(LOCAL_MODEL_DIR)) > 5:
        print(f"Model already exists at {LOCAL_MODEL_DIR}, skipping download.")
    else:
        print(f"Downloading {MODEL_ID}...")
        print("This may take several minutes.")
        snapshot_download(
            repo_id=MODEL_ID,
            local_dir=LOCAL_MODEL_DIR,
        )

    elapsed = time.time() - start_time
    minutes = int(elapsed // 60)
    seconds = int(elapsed % 60)

    # Verify
    model_files = [f for f in os.listdir(LOCAL_MODEL_DIR) if not f.startswith('.')]
    safetensor_files = [f for f in model_files if f.endswith('.safetensors')]
    total_size = sum(
        os.path.getsize(os.path.join(LOCAL_MODEL_DIR, f)) for f in model_files
    )

    print(f"\nModel directory: {len(model_files)} items, {len(safetensor_files)} safetensor shards")
    print(f"Total size:      {total_size / 1e9:.1f} GB")
    print(f"Elapsed:         {minutes}m {seconds}s")
else:
    print(f"Skipping download. Using pre-downloaded model at {LOCAL_MODEL_DIR}")
    if not os.path.exists(LOCAL_MODEL_DIR):
        print("WARNING: Model directory does not exist. Training will fail.")
        print("Set RUN_LIVE_DOWNLOAD = True and re-run this cell.")

Model already exists at ./models/granite-3.2-8b-instruct, skipping download.

Model directory: 14 items, 4 safetensor shards
Total size:      16.3 GB
Elapsed:         0m 0s


## 3.6 Run LoRA SFT Training

This is the core of the section: a single function call that trains the model.

We set conservative parameters tuned for a lab environment on a single L4 (24GB):

| Parameter | Value | Why |
|-----------|-------|-----|
| `num_epochs` | 5 | Multiple passes over our small dataset to ensure the model sees each example enough times |
| `effective_batch_size` | 2 | Small batch to stay well within GPU memory limits |
| `max_seq_len` | 512 | Our training examples are short; longer sequences waste memory |
| `max_tokens_per_gpu` | 512 | Conservative memory cap per GPU to avoid OOM errors |
| `learning_rate` | 5e-6 | Conservative rate to avoid catastrophic forgetting of general knowledge |

These values are deliberately conservative. On a 24GB GPU, we do not have the margin to be aggressive with batch sizes or sequence lengths. The goal is a successful training run that demonstrates the process, not maximum throughput.
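To get a feel for how much training these settings amount to, it helps to count weight updates. The sketch below is a rough estimate that assumes one update per effective batch and that a final partial batch is kept; the actual backend may batch differently. The helper `optimizer_steps` is illustrative only.

```python
import math


def optimizer_steps(num_examples: int, num_epochs: int, effective_batch_size: int) -> int:
    """Approximate number of weight updates, assuming a final partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch * num_epochs


# Six examples (fallback set) and eight examples (run shown earlier).
for n in (6, 8):
    steps = optimizer_steps(n, num_epochs=5, effective_batch_size=2)
    print(f"{n} examples -> about {steps} weight updates")
```

A run this short is enough to exercise the pipeline end to end, which is worth keeping in mind when interpreting how much the adapted model's answers change.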

The `RUN_LIVE` toggle controls execution. When set to `False`, the cell skips training and points to a pre-built adapter checkpoint.

_Training with these parameters on a dataset this small typically takes 3 to 8 minutes on an L4. You will see progress output as epochs complete._

In [11]:
# Set to True to run training live. Set to False to use pre-built adapter.
RUN_LIVE = True

In [None]:
from training_hub import lora_sft

CKPT_DIR = "./lora_output"

if RUN_LIVE:
    # Clear any leftover GPU memory from previous sections
    if torch.cuda.is_available():
        # Collect unreachable Python objects first so their tensors can be released
        import gc
        gc.collect()
        torch.cuda.empty_cache()
        mem_used = torch.cuda.memory_allocated() / 1e9
        mem_total = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU memory before training: {mem_used:.1f} GB used / {mem_total:.1f} GB total")
        print()

    print("Starting LoRA SFT training...")
    print(f"  Model:          {LOCAL_MODEL_DIR}")
    print(f"  Data:           {data_path}")
    print(f"  Output:         {CKPT_DIR}")
    print(f"  Epochs:         5")
    print(f"  Batch size:     2")
    print(f"  Max seq len:    512")
    print(f"  Max tokens/GPU: 512")
    print(f"  Learning rate:  5e-6")
    print(f"  Quantization:   4-bit (QLoRA)")
    print()
    print("This typically takes 3 to 8 minutes. Progress will appear below.")
    print()

    start_time = time.time()

    result = lora_sft(
        model_path=LOCAL_MODEL_DIR,
        data_path=data_path,
        ckpt_output_dir=CKPT_DIR,
        data_output_dir="./lora_data_output",
        num_epochs=5,
        effective_batch_size=2,
        max_tokens_per_gpu=512,
        max_seq_len=512,
        learning_rate=5e-6,
        load_in_4bit=True,
    )

    elapsed = time.time() - start_time
    minutes = int(elapsed // 60)
    seconds = int(elapsed % 60)
    print(f"\nTraining complete in {minutes}m {seconds}s")
    print(f"Result: {result}")
else:
    print("Skipping live training. Using pre-built adapter checkpoint.")
    CKPT_DIR = "../prebuilt/lora_output"
    if not os.path.exists(CKPT_DIR):
        print(f"WARNING: Pre-built checkpoint not found at {CKPT_DIR}")
        print("Set RUN_LIVE = True and re-run to train from scratch.")
    else:
        print(f"Using checkpoint at: {CKPT_DIR}")

### 3.6.1 Inspect Training Artifacts

After training completes, the output directory contains the LoRA adapter weights and a training log. The adapter is small (typically tens of megabytes) compared to the full model (16GB). This is one of LoRA's key advantages: the artifact you produce, store, and version is compact.

Let's verify that the output directory was populated and look at the training log for loss progression.

In [None]:
# Check what training produced
if os.path.exists(CKPT_DIR):
    contents = os.listdir(CKPT_DIR)
    print(f"Checkpoint directory ({CKPT_DIR}): {len(contents)} items")
    for item in sorted(contents):
        item_path = os.path.join(CKPT_DIR, item)
        if os.path.isfile(item_path):
            size_mb = os.path.getsize(item_path) / 1e6
            print(f"  {item:40s} {size_mb:8.1f} MB")
        else:
            print(f"  {item:40s} (directory)")
else:
    print(f"Checkpoint directory not found: {CKPT_DIR}")

In [None]:
# Check for training log
log_candidates = [
    os.path.join(CKPT_DIR, "training_log_node0.log"),
    os.path.join("./lora_output", "training_log_node0.log"),
    os.path.join("./lora_data_output", "training_log_node0.log"),
]

log_found = False
for log_path in log_candidates:
    if os.path.exists(log_path):
        print(f"Training log found at: {log_path}")
        print("Last 30 lines:")
        print()
        with open(log_path, "r") as f:
            lines = f.readlines()
            for line in lines[-30:]:
                print(line.rstrip())
        log_found = True
        break

if not log_found:
    print("No training log found. This is expected if using pre-built outputs.")

### 3.6.2 Verify GPU State After Training

Training loads the model, optimizer states, and gradient buffers into GPU memory. After training completes, these should be released. This cell confirms that GPU memory has been freed and reports current usage. If memory is still high, a kernel restart may be needed before running inference with the adapted model.

In [None]:
if torch.cuda.is_available():
    # Collect unreachable Python objects first so their tensors can be released
    import gc
    gc.collect()
    torch.cuda.empty_cache()
    print(f"GPU memory used:  {torch.cuda.memory_allocated()/1e9:.1f} GB")
    print(f"GPU memory cached: {torch.cuda.memory_reserved()/1e9:.1f} GB")
    print(f"GPU memory total:  {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
    print("No GPU available.")

## 3.7 What Just Happened

If training completed successfully, you now have a LoRA adapter in the `lora_output` directory. This adapter contains only the learned adjustments, not a full copy of the model. To use it, you load the adapter on top of the base model at inference time, or merge it into the base weights ahead of time as a one-time operation.
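The two ways of using an adapter give the same output, which a toy calculation can show. In the sketch below, keeping the adapter separate computes `W x + (alpha / r) * B (A x)`, while merging first builds `W + (alpha / r) * B A` and applies it once. The matrices, `alpha`, and `r` are made-up values for illustration, and this is plain arithmetic rather than PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[s * x for x in row] for row in a]


# Toy shapes: W is 2x3 (d_out x d_in), adapter rank r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
B = [[0.5], [1.0]]         # d_out x r
A = [[1.0, 2.0, 0.0]]      # r x d_in
alpha, r = 2.0, 1
x = [[1.0], [1.0], [1.0]]  # one input column vector

scaling = alpha / r

# Adapter kept separate: base output plus the low-rank correction.
separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged ahead of time into a single weight matrix.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print("separate:", separate)
print("merged:  ", merged)
```

Keeping the adapter separate preserves the ability to discard or swap it; merging trades that flexibility for a single set of weights at serving time.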

A few things to note about what we just did and what it means:

The base model was never modified. Every weight in Granite 3.2 8B is exactly as it was before training. The adapter sits alongside those weights and adds small corrections to the model's behavior. If the adapter makes things worse, you can discard it and return to the base model immediately. There is no rollback problem.

A handful of training examples is not enough for production. We used a minimal dataset to demonstrate the mechanics. In a real engagement, you would generate hundreds or thousands of pairs, filter for quality, and evaluate the adapter against a held-out test set before declaring success.

The real question is not "did it train" but "did the adapted model close the gaps we identified in evaluation?" That question gets answered in the next section, where we compare the adapted model's outputs against the baseline and RAG results from earlier.

This is where the escalation ladder from Section 5 comes full circle. We started with a baseline. We improved with RAG. We evaluated. We identified specific, consistent failures that pointed to the model itself. And only then did we adapt the model, with a clear hypothesis about what should improve and a way to measure whether it did.

## 3.8 Did Training Help?

We started this section because the evidence from Sections 0 and 1 pointed to a knowledge gap, not a retrieval gap and not a sampling gap. Now we close the loop. We load the adapter, run the same questions that failed earlier, and check whether the answers changed.

> **Note:** If training ran in this kernel session, the model and optimizer may still occupy GPU memory. If the next cell fails with an out-of-memory error, restart the kernel (Kernel > Restart), re-define `LOCAL_MODEL_DIR` and `CKPT_DIR` with the values used above, and then run only the cells in this subsection. The adapter and base model are already saved to disk.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Free GPU memory from training
if torch.cuda.is_available():
    # Collect unreachable Python objects first so their tensors can be released
    import gc
    gc.collect()
    torch.cuda.empty_cache()

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    LOCAL_MODEL_DIR,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_MODEL_DIR)
model = PeftModel.from_pretrained(base_model, CKPT_DIR)
model.eval()
print("Adapter loaded.")

In [None]:
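# Use the same system prompt the adapter was trained with (see 3.4.1)
SYSTEM_PROMPT = (
    "You are a rules expert for the Basic Fantasy Role-Playing Game. "
    "Answer questions accurately based on the official rules. "
    "Be specific and cite page references or table values where possible."
)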

TARGET_QUESTIONS = [
    {
        "id": "q02",
        "question": "Why do Elves use a d6 for hit dice instead of a d8?",
        "expected": "Elves are a combination class. The d6 reflects the compromise between Fighter and Magic-User hit dice.",
    },
    {
        "id": "q06",
        "question": "What melee attack and damage bonus does a character with Strength 16 receive?",
        "expected": "+2 to attack and +2 to damage.",
    },
]

def run_inference(question):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
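    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        return_dict=True,  # also returns the attention mask for generate()
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)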

print("Running target questions through fine-tuned model...\n")
for q in TARGET_QUESTIONS:
    answer = run_inference(q["question"])
    print(f"[{q['id']}] {q['question']}")
    print(f"  Expected: {q['expected']}")
    print(f"  Got:      {answer[:200]}")
    print()

These are the two questions that resisted both RAG and inference-time scaling. If the adapted model produces better answers here, the training had the intended effect. If not, the training data or the number of examples may be insufficient.

Either result is informative. Section 4 runs the full evaluation with all 10 questions and a systematic comparison.

**Transition to Section 4:**

"We have an adapted model. The next step is to evaluate it properly: all 10 questions, side-by-side comparison, and an honest assessment of what improved and what did not."