# 3. Model Adaptation: Changing What the Model Knows

> **GPU Required.** This section requires a CUDA-capable GPU with at least 40GB of memory. The lab environment provides an NVIDIA L40S (46GB). If you are running this outside the lab without GPU access, you can follow the setup and data preparation cells, but training will not execute. Pre-built outputs are provided for that scenario.

**Purpose:**

We have now exhausted the options that do not involve changing the model. RAG gave us grounded answers for 6 out of 10 questions. Inference-time scaling (Best-of-N) recovered 2 more. But for the remaining failures, particularly questions requiring implicit domain reasoning, every approach hit the same wall: the model's weights do not contain a reliable path to the correct answer.

This is where model adaptation becomes justified. Not because it is the next thing on the list, but because two independent evaluation methods pointed to the same conclusion: the gap is in the model, not in the pipeline.

We will use **OSFT (Orthogonal Subspace Fine-Tuning)**, a parameter-efficient method that updates only the least critical directions in the model's weight matrices while preserving the most important ones. This is not full fine-tuning. It is a controlled modification designed to teach the model new domain knowledge without destroying what it already knows.

The training data comes directly from Section 2: synthetic question-answer pairs generated from the customer's own documents. The model comes from HuggingFace. The training runs locally on GPU.

By the end of this section, you will have an adapted model and evidence of whether the adaptation closed the gaps that RAG and inference-time scaling could not.

## 3.1 Install Training Hub

`training_hub` is an open-source library from the Red Hat AI Innovation Team. It provides a single Python interface for multiple post-training algorithms, each backed by a community implementation. You call a function, pass your model and data, and the library handles the rest.

The `[cuda]` extra installs GPU dependencies including PyTorch with CUDA support and flash-attention. 

_This step may take several minutes._


In [1]:
# Install base package first (provides torch, packaging, wheel, ninja)
! pip install training-hub -q

# Then install CUDA extras
! pip install 'training-hub[cuda]' --no-build-isolation -q

**Note:** You may see dependency conflict warnings from pip. These are advisory, not fatal. The lab environment has many pre-installed packages (Docling, KFP, Feast, etc.) that pin older versions of shared dependencies like `transformers` and `click`. As long as the import in the next cell succeeds, the conflicts do not affect training_hub.

If `flash-attn` fails to build, that is also acceptable. Training will fall back to standard attention. Flash-attention is a performance optimization, not a requirement.
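To see which case applies to your environment, you can check whether the `flash_attn` module is importable. This is a minimal standard-library sketch; the helper name `flash_attention_installed` is ours, and the check only reports that the package is present, not that a given backend will select it.

```python
import importlib.util


def flash_attention_installed() -> bool:
    """Return True if the flash_attn package can be located by this interpreter."""
    # find_spec looks the module up without importing it, so a missing
    # package yields None rather than raising.
    return importlib.util.find_spec("flash_attn") is not None


has_flash_attn = flash_attention_installed()
print(f"flash-attn installed: {has_flash_attn}")
```

If this prints `False`, expect the standard attention fallback described above.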

In [2]:
# Verify the install
from training_hub import AlgorithmRegistry
print("training_hub imported successfully")
print(f"Available algorithms: {AlgorithmRegistry.list_algorithms()}")

Skipping import of cpp extensions due to incompatible torch version 2.7.1+cu128 for torchao version 0.16.0             Please see https://github.com/pytorch/ao/issues/2919 for more info


training_hub imported successfully
Available algorithms: ['sft', 'osft', 'lora_sft']


## 3.2 Environment Setup

Same credentials, same endpoint pattern. We reuse the `.env` file and config helper from earlier sections.

In [3]:
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")

Endpoint: https://litellm-prod.apps.maas.redhatworkshops.io/v1
API Key:  sk-UFHcL...


In [4]:
import torch
import os
import json
import time

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    print(f"Memory:          {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU detected. Training cells will not execute.")

PyTorch version: 2.7.1+cu128
CUDA available:  True
GPU:             NVIDIA L40S
Memory:          47.7 GB


**Important:** OSFT training on an 8B model requires approximately 35-40GB of GPU memory. The L40S provides 46GB, which is sufficient but does not leave much margin.

If you ran Sections 1 or 2 in this same kernel session, embedding models or other objects may still be occupying GPU memory. Before continuing, **restart your kernel** (Kernel > Restart) and then run only the cells in this section from the top. This ensures the full 46GB is available for training.

You do not need to re-run Sections 1 or 2. The training data file (`synthetic_qa_pairs.csv`) is already saved to disk from Section 2.
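After restarting, a quick check of free GPU memory can confirm that nothing else is holding onto the device. The sketch below is illustrative: `has_headroom` is a hypothetical helper, and the 40 GB threshold is simply the upper end of the estimate quoted above, not a value the library enforces.

```python
def has_headroom(free_bytes: int, required_gb: float = 40.0) -> bool:
    """Return True if the free memory covers the assumed training footprint."""
    return free_bytes / 1e9 >= required_gb


try:
    import torch

    if torch.cuda.is_available():
        # mem_get_info returns (free, total) in bytes for the current device.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print(f"Free:  {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
        print(f"Enough headroom for training: {has_headroom(free_bytes)}")
    else:
        print("No CUDA device visible.")
except ImportError:
    print("PyTorch is not installed in this environment.")
```

If the free figure is well below the total right after a restart, another process or notebook is probably using the GPU.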

## 3.3 Discover Available Algorithms

Before writing any training code, we ask the library what it supports. This is the same discovery-first pattern we used with `sdg_hub` in Section 2.

In [5]:
from training_hub import AlgorithmRegistry

algorithms = AlgorithmRegistry.list_algorithms()
print("Available algorithms:", algorithms)

for algo_name in algorithms:
    backends = AlgorithmRegistry.list_backends(algo_name)
    print(f"  {algo_name}: backends = {backends}")

Available algorithms: ['sft', 'osft', 'lora_sft']
  sft: backends = ['instructlab-training']
  osft: backends = ['mini-trainer']
  lora_sft: backends = ['unsloth']


You should see three algorithms: `sft`, `osft`, and `lora_sft`. Each maps to one or more backend implementations.

**SFT (Supervised Fine-Tuning)** updates all model weights. It is the most straightforward approach but also the most destructive: the model can lose general capability while learning domain-specific behavior. It is also the most memory-intensive.

**OSFT (Orthogonal Subspace Fine-Tuning)** is more surgical. It analyzes the model's weight matrices via singular value decomposition, identifies which directions are most critical to the model's existing behavior, freezes those, and only updates the orthogonal (least critical) directions. The `unfreeze_rank_ratio` parameter controls how much of each weight matrix is trainable. At `0.25`, the top 75% of singular directions (the most critical ones) are frozen and the remaining 25% are available for new learning.

**LoRA SFT (Low-Rank Adaptation)** adds small trainable matrices alongside the frozen base weights. It is widely used and well understood. LoRA and OSFT solve similar problems from different angles: LoRA adds capacity, OSFT reallocates existing capacity.

We use OSFT because it is developed by the same team that built the rest of this toolkit, and its continual learning properties are well suited to iterative domain adaptation. When a customer asks "will fine-tuning break what the model already does well?", OSFT gives you a concrete answer: we control exactly how much of the model changes, and we protect what matters most.
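To make the `unfreeze_rank_ratio` idea concrete, here is a small NumPy illustration of splitting one weight matrix into a frozen high-singular-value part and a trainable low-singular-value part. This is a toy sketch of the concept only, not the `mini-trainer` implementation: the function name `split_by_rank`, the 8x8 random matrix, and the rounding rule are all assumptions made for the example.

```python
import numpy as np


def split_by_rank(weight, unfreeze_rank_ratio):
    """Split a matrix into frozen (top singular directions) and trainable parts."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)  # s is sorted descending
    rank = len(s)
    n_train = int(round(rank * unfreeze_rank_ratio))
    n_frozen = rank - n_train
    # Rebuild each part from its own slice of singular vectors and values.
    frozen = (u[:, :n_frozen] * s[:n_frozen]) @ vt[:n_frozen]
    trainable = (u[:, n_frozen:] * s[n_frozen:]) @ vt[n_frozen:]
    return frozen, trainable, n_frozen, n_train


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # stand-in for one weight matrix
frozen, trainable, n_frozen, n_train = split_by_rank(W, unfreeze_rank_ratio=0.25)

print(f"Frozen directions:    {n_frozen}")
print(f"Trainable directions: {n_train}")
print(f"Parts sum back to W:  {np.allclose(frozen + trainable, W)}")
print(f"Parts are orthogonal: {np.allclose(frozen.T @ trainable, 0.0)}")
```

The last check is the point of the method: the trainable part lives in a subspace orthogonal to the frozen one, so changes confined to it leave the high-importance directions untouched.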

### 3.3.1 Inspect OSFT Parameters

In [6]:
from training_hub import create_algorithm

osft_algo = create_algorithm('osft')

required_params = osft_algo.get_required_params()
print("Required parameters:")
for k, v in required_params.items():
    print(f"  {k}: {v}")

print()
optional_params = osft_algo.get_optional_params()
print(f"Optional parameters ({len(optional_params)} total):")
for k, v in list(optional_params.items())[:15]:
    print(f"  {k}: {v}")
if len(optional_params) > 15:
    print(f"  ... and {len(optional_params) - 15} more")

Required parameters:
  model_path: <class 'str'>
  data_path: <class 'str'>
  unfreeze_rank_ratio: <class 'float'>
  effective_batch_size: <class 'int'>
  max_tokens_per_gpu: <class 'int'>
  max_seq_len: <class 'int'>
  learning_rate: <class 'float'>
  ckpt_output_dir: <class 'str'>

Optional parameters (26 total):
  target_patterns: list[str]
  seed: <class 'int'>
  use_liger: <class 'bool'>
  lr_scheduler: <class 'str'>
  warmup_steps: <class 'int'>
  lr_scheduler_kwargs: <class 'dict'>
  beta1: <class 'float'>
  beta2: <class 'float'>
  eps: <class 'float'>
  weight_decay: <class 'float'>
  checkpoint_at_epoch: <class 'bool'>
  save_final_checkpoint: <class 'bool'>
  num_epochs: <class 'int'>
  use_processed_dataset: <class 'bool'>
  unmask_messages: <class 'bool'>
  ... and 11 more


The required parameters are few:
* a model path and a data path,
* an output directory for checkpoints, and
* five training settings: `unfreeze_rank_ratio`, `effective_batch_size`, `max_tokens_per_gpu`, `max_seq_len`, and `learning_rate`.

Everything else has sensible defaults. This is intentional.

Take a moment to appreciate what this means.

Under the hood, OSFT is performing singular value decomposition on every weight matrix in an 8-billion parameter model, ranking directions by criticality, freezing the most important ones, configuring gradient computation only for the orthogonal subspace, managing mixed-precision training, handling checkpointing, and coordinating all of it across GPU memory boundaries. That is a significant amount of engineering.

And the interface is eight required arguments and a function call.

The library absorbs the complexity of backend configuration, distributed setup, and checkpointing so that the interface stays focused on the decisions that actually matter: which model, which data, which hyperparameters. This is the same design philosophy behind `sdg_hub` and `its_hub`. Field teams should not need a PhD in optimization theory to run a training job. They need to make good decisions about inputs and interpret the outputs. The tooling handles the rest.

## 3.4 Prepare Training Data

We take the synthetic QA pairs from Section 2 and convert them into the format the training pipeline expects. The standard for chat-based fine-tuning is JSONL where each line contains a `messages` array representing one training conversation.

In [7]:
import pandas as pd

# Load the SDG output from Section 2
qa_df = pd.read_csv("../02SyntheticDataGen/synthetic_qa_pairs.csv")

print(f"Loaded {len(qa_df)} QA pairs from Section 2")
print(f"Columns: {list(qa_df.columns)}")

for i, row in qa_df.iterrows():
    print(f"\n  Q{i+1}: {row['question'][:80]}...")
    print(f"  A{i+1}: {row['response'][:80]}...")

Loaded 8 QA pairs from Section 2
Columns: ['question', 'response', 'faithfulness_judgment']

  Q1: What is the Thief's Open Locks ability score at level 8?...
  A1: At level 8, the Thief's Open Locks ability score is 60. This information is foun...

  Q2: How does the Thief's Move Silently ability progress from level 1 to level 15?...
  A2: Based on the provided document, the Thief's "Move Silently" ability progresses a...

  Q3: Can a GM adjust the Thief's ability scores, and for what reason?...
  A3: Yes, a Game Master (GM) can adjust a Thief's ability scores in certain situation...

  Q4: Why might a GM apply a situational adjustment when a Thief is climbing a wall?...
  A4: A Game Master (GM) might apply a situational adjustment when a Thief is climbing...

  Q5: What is the Thief's Climb Walls ability score at level 17?...
  A5: At level 17, the Thief's Climb Walls ability score is 96. This can be found in t...

  Q6: If a Thief is level 10, what is their Pick Pockets ability scor

Now we convert the CSV into JSONL chat format. Each training example becomes a three-turn conversation: a system prompt that establishes the model's role, a user question, and an assistant response. The model learns to produce the assistant turn given the other two.

The system prompt is a design choice. It tells the model who it is during training, and the same prompt should be used at inference time. If you change the persona later, the model may not behave as expected.

In [8]:
SYSTEM_PROMPT = (
    "You are a rules expert for the Basic Fantasy Role-Playing Game. "
    "Answer questions accurately based on the official rules. "
    "Be specific and cite page references or table values where possible."
)

# Convert to chat messages JSONL format
training_data = []
for _, row in qa_df.iterrows():
    example = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["response"]},
        ]
    }
    training_data.append(example)

# Write JSONL
data_path = "training_data.jsonl"
with open(data_path, "w", encoding="utf-8") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Wrote {len(training_data)} examples to {data_path}")

# Verify structure
with open(data_path, "r", encoding="utf-8") as f:
    first_line = json.loads(f.readline())
print(f"\nFirst example structure:")
print(f"  Keys: {list(first_line.keys())}")
print(f"  Roles: {[m['role'] for m in first_line['messages']]}")
print(f"  User: {first_line['messages'][1]['content'][:80]}...")

Wrote 8 examples to training_data.jsonl

First example structure:
  Keys: ['messages']
  Roles: ['system', 'user', 'assistant']
  User: What is the Thief's Open Locks ability score at level 8?...


Eight training examples is a very small dataset. In a production engagement, you would generate hundreds or thousands of pairs across the full document corpus, filter on faithfulness, deduplicate, and review samples before committing to training. We use eight because the goal is to demonstrate the process and observe the behavior, not to produce a production-quality adapter.

Even with eight examples, if the training works correctly, the model should show measurable change on questions that directly overlap with the training data. Whether it generalizes beyond those specific examples is a separate question, and one that a larger dataset would answer.
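For a larger dataset, the faithfulness filter and deduplication step mentioned above could look roughly like this. It is a sketch under stated assumptions: the rows are invented for illustration, the helper `filter_and_dedupe` is hypothetical, and the label value `"YES"` is a guess at how `faithfulness_judgment` marks a faithful answer, so check the actual values in your CSV before using it.

```python
def filter_and_dedupe(rows, keep_label="YES"):
    """Keep rows judged faithful, dropping repeated questions."""
    seen = set()
    kept = []
    for row in rows:
        # ASSUMPTION: faithful answers are labeled with `keep_label`.
        if str(row["faithfulness_judgment"]).strip().upper() != keep_label:
            continue
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(row["question"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Invented sample rows, using the same column names as the Section 2 CSV.
sample_rows = [
    {"question": "What is the score at level 8?", "response": "60", "faithfulness_judgment": "YES"},
    {"question": "what is the  score at level 8?", "response": "60", "faithfulness_judgment": "YES"},
    {"question": "Can a GM adjust scores?", "response": "Yes", "faithfulness_judgment": "YES"},
    {"question": "What about level 99?", "response": "Unknown", "faithfulness_judgment": "NO"},
]

kept_rows = filter_and_dedupe(sample_rows)
print(f"Kept {len(kept_rows)} of {len(sample_rows)} rows")
```

Exact-match deduplication on normalized text is the simplest option; near-duplicate detection would need a similarity measure and is out of scope here.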

## 3.5 Download the Base Model

OSFT operates on the model weights directly, so the model must be on disk, not behind an API. We download the same Granite model we have been using throughout the workshop.

This downloads approximately 16GB. In our tests it took under 2 minutes, but allow several minutes on a slower network.
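The download size follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 8 billion parameters stored at 2 bytes each (16-bit precision); both figures are approximations for illustration, and the exact count and dtype are recorded in the model's own configuration files.

```python
def weights_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate on-disk size of model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8e9  # ASSUMPTION: "8B" taken as a round 8 billion

size_16bit_gb = weights_size_gb(NUM_PARAMS, 2)  # 16-bit weights
size_32bit_gb = weights_size_gb(NUM_PARAMS, 4)  # 32-bit weights, for comparison

print(f"16-bit weights: ~{size_16bit_gb:.0f} GB")
print(f"32-bit weights: ~{size_32bit_gb:.0f} GB")
```

The same arithmetic explains why training needs far more memory than the weights alone: gradients and optimizer state are held in addition to the parameters themselves.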

In [9]:
from huggingface_hub import snapshot_download

MODEL_ID = "ibm-granite/granite-3.2-8b-instruct"
LOCAL_MODEL_DIR = "./models/granite-3.2-8b-instruct"

start_time = time.time()

if os.path.exists(LOCAL_MODEL_DIR) and len(os.listdir(LOCAL_MODEL_DIR)) > 5:
    print(f"Model already exists at {LOCAL_MODEL_DIR}, skipping download.")
else:
    print(f"Downloading {MODEL_ID}...")
    print("This may take several minutes.")
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=LOCAL_MODEL_DIR,
    )

elapsed = time.time() - start_time
minutes = int(elapsed // 60)
seconds = int(elapsed % 60)

# Verify
model_files = [f for f in os.listdir(LOCAL_MODEL_DIR) if not f.startswith('.')]
safetensor_files = [f for f in model_files if f.endswith('.safetensors')]
total_size = sum(
    os.path.getsize(os.path.join(LOCAL_MODEL_DIR, f)) for f in model_files
)

print(f"\nModel directory: {len(model_files)} items, {len(safetensor_files)} safetensor shards")
print(f"Total size:      {total_size / 1e9:.1f} GB")
print(f"Elapsed:         {minutes}m {seconds}s")

Model already exists at ./models/granite-3.2-8b-instruct, skipping download.

Model directory: 14 items, 4 safetensor shards
Total size:      16.3 GB
Elapsed:         0m 0s


## 3.6 Run OSFT Training

This is the core of the section: a single function call that trains the model.

We set conservative parameters for a lab environment on a single L40S:

- `unfreeze_rank_ratio=0.25`: 75% of critical weights frozen, 25% trainable
- `num_epochs=5`: Multiple passes over our small dataset
- `effective_batch_size=4`: Small batch for memory safety
- `max_seq_len=1024`: Our training examples are short, no need for longer
- `max_tokens_per_gpu=2048`: Hard cap on tokens processed per GPU at once, which bounds memory use
- `learning_rate=5e-6`: Conservative to avoid catastrophic forgetting
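The settings above can be collected into a single set of keyword arguments and checked against the required parameters reported in Section 3.3.1. This is a sketch, not a recorded run: it assumes `training_hub` exposes a top-level `osft` function (by analogy with the other algorithm entry points), that it accepts these names as keyword arguments, and that `./osft_output` is the checkpoint directory. The helper `run_osft` is hypothetical and is defined but not called here.

```python
import math

# Required parameter names, as printed by get_required_params() in Section 3.3.1.
REQUIRED_OSFT_PARAMS = {
    "model_path", "data_path", "unfreeze_rank_ratio", "effective_batch_size",
    "max_tokens_per_gpu", "max_seq_len", "learning_rate", "ckpt_output_dir",
}

osft_kwargs = dict(
    model_path="./models/granite-3.2-8b-instruct",
    data_path="training_data.jsonl",
    ckpt_output_dir="./osft_output",  # ASSUMPTION: output directory name
    unfreeze_rank_ratio=0.25,
    effective_batch_size=4,
    max_tokens_per_gpu=2048,
    max_seq_len=1024,
    learning_rate=5e-6,
    num_epochs=5,                     # optional parameter
)

missing = REQUIRED_OSFT_PARAMS - osft_kwargs.keys()
print(f"Missing required parameters: {sorted(missing)}")

# Rough step count, assuming each optimizer step consumes one effective batch.
NUM_EXAMPLES = 8
steps_per_epoch = math.ceil(NUM_EXAMPLES / osft_kwargs["effective_batch_size"])
total_steps = steps_per_epoch * osft_kwargs["num_epochs"]
print(f"Approximate optimizer steps: {total_steps}")


def run_osft(kwargs):
    """Launch training. ASSUMPTION: training_hub exports a callable named `osft`."""
    from training_hub import osft  # imported lazily; needs the GPU environment
    return osft(**kwargs)
```

The step count is worth noticing: with eight examples, five epochs amounts to roughly ten weight updates, which is why expectations for this run should stay modest.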

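The `RUN_LIVE` toggle works the same way as previous sections. Note that the run recorded below calls `lora_sft` (Unsloth backend) rather than `osft`, so `unfreeze_rank_ratio` is not passed, and it stops with an `ImportError` before any training step executes.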

In [10]:
# Set to True to run training live. Set to False to use pre-built adapter.
RUN_LIVE = True

In [11]:
! pip install 'training-hub[lora]' -q

In [12]:
import time

from training_hub import lora_sft

CKPT_DIR = "./lora_output"

if RUN_LIVE:
    print("Starting LoRA SFT training...")
    print(f"  Model:          {LOCAL_MODEL_DIR}")
    print(f"  Data:           {data_path}")
    print(f"  Output:         {CKPT_DIR}")
    print(f"  Epochs:         5")
    print(f"  Batch size:     4")
    print(f"  Learning rate:  5e-6")
    print()

    start_time = time.time()

    result = lora_sft(
        model_path=LOCAL_MODEL_DIR,
        data_path=data_path,
        ckpt_output_dir=CKPT_DIR,
        data_output_dir="./lora_data_output",
        num_epochs=5,
        effective_batch_size=4,
        max_tokens_per_gpu=2048,
        max_seq_len=1024,
        learning_rate=5e-6,
    )

    elapsed = time.time() - start_time
    minutes = int(elapsed // 60)
    seconds = int(elapsed % 60)
    print(f"\nTraining complete in {minutes}m {seconds}s")
    print(f"Result: {result}")
else:
    print("Using pre-built checkpoint. Skipping training.")
    CKPT_DIR = "../prebuilt/lora_output"

Starting LoRA SFT training...
  Model:          ./models/granite-3.2-8b-instruct
  Data:           training_data.jsonl
  Output:         ./lora_output
  Epochs:         5
  Batch size:     4
  Learning rate:  5e-6

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed editing tqdm to replace Inductor Compilation:



Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


ImportError: Failed to import dependencies for Unsloth backend: cannot import name 'CompileCounterInt' from 'torch._dynamo.utils' (/opt/app-root/lib64/python3.12/site-packages/torch/_dynamo/utils.py)
Install LoRA dependencies with: pip install 'training-hub[lora]'

In [None]:
! tail -n 50 osft_output/training_log_node0.log

In [None]:
import torch
torch.cuda.empty_cache()
print(f"GPU memory used: {torch.cuda.memory_allocated()/1e9:.1f} GB")

In [13]:
import torch
print(torch.__version__)

2.7.1+cu128


In [14]:
! pip show unsloth 2>/dev/null | grep Version

Version: 2025.11.1
