# 4. Evaluation (continued): Cross-Architecture Comparison

> **GPU Required.** This notebook loads four models sequentially. Each model is loaded, evaluated, and unloaded before the next. An NVIDIA GPU with at least 16 GB of memory is required.

**Purpose:**

The lab fine-tuned Granite 8B with LoRA on 8 BFRPG Thief Q&A pairs using a conservative learning rate (5e-6). Three additional, smaller base models were fully trained ahead of time on the **exact same 8 training examples** using the same LoRA SFT technique but with a higher learning rate (2e-4). This gives us a natural experiment:

**When training data is identical, does architecture and size matter?**

| Model | Base Architecture | Parameters | Quantization | Learning Rate |
|-------|------------------|------------|-------------|---------------|
| Lab Granite 8B + LoRA | `granite-3.2-8b-instruct` | ~8B | 4-bit (BnB) | 5e-6 |
| Granite 2B (merged) | `granite-3.2-2b-instruct` | ~2.5B | bf16 | 2e-4 |
| Phi-3 Mini (merged) | `phi-3-mini-4k-instruct` | ~3.8B | bf16 | 2e-4 |
| Qwen2.5 3B (merged) | `Qwen2.5-3B-Instruct` | ~3B | bf16 | 2e-4 |

We evaluate all four on the same 10 questions from Day 2, in two modes:
- **Mode 1 — Without RAG context:** Pure knowledge test from weights alone
- **Mode 2 — With RAG context:** Same retrieved chunks as Day 2 baseline

This lets us compare both memorization (Mode 1) and reasoning-over-context (Mode 2) across architectures.

**Relationship to Section 4.1–4.8:** The previous notebook evaluates the model you trained live in Section 3. If that training run completed successfully, both notebooks contribute to the full evaluation story. If Section 3 encountered issues (out-of-memory, timeout, network failure), this notebook still works — the three fully trained models are hosted on HuggingFace, so they download and run independently of your local training. This guarantees every participant gets a complete evaluation experience with meaningful results to discuss in Section 5.

## 4.9 Install Dependencies

These packages were installed in earlier sections. The cell below ensures they are available if running in a fresh kernel.

In [1]:
! pip install peft bitsandbytes accelerate transformers -q

In [2]:
import json
import gc
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from IPython.display import display, HTML

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    print(f"Memory:          {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

PyTorch version: 2.7.1+cu128
CUDA available:  True
GPU:             NVIDIA L4
Memory:          23.6 GB




## 4.10 Load Evaluation Data

We reuse the same 10 evaluation questions and retrieved context from Day 2. This ensures an apples-to-apples comparison across all models.

In [3]:
with open("../prebuilt/eval_results.json") as f:
    eval_baseline = json.load(f)

with open("../prebuilt/eval_with_context.json") as f:
    eval_with_context = json.load(f)

baseline_by_id = {r["id"]: r for r in eval_baseline["results"]}
context_by_id = {r["id"]: r for r in eval_with_context["results"]}

print(f"Loaded {len(baseline_by_id)} evaluation questions")
print(f"Loaded {len(context_by_id)} questions with retrieved context")

# Show the questions
for r in eval_baseline["results"]:
    print(f"  {r['id']} [{r['category']}]: {r['question'][:70]}")

Loaded 10 evaluation questions
Loaded 10 questions with retrieved context
  q01 [explicit_rule]: What happens if a Thief fails an Open Locks attempt?
  q02 [terminology]: Why can't Elves roll higher than a d6 for hit points?
  q03 [implicit_reasoning]: Can a character wear leather armor and cast spells?
  q04 [table_lookup]: What is the saving throw for a 3rd level Fighter against Dragon Breath
  q05 [multi_step_rule]: How does a Cleric turn undead?
  q06 [table_lookup]: If a character has a Strength of 16, what bonus do they get on melee a
  q07 [terminology]: What is the difference between a retainer and a hireling?
  q08 [implicit_reasoning]: When can a Magic-User learn new spells?
  q09 [explicit_rule]: What happens to a character at exactly 0 hit points?
  q10 [implicit_reasoning]: Can a Halfling use a longbow?


## 4.11 Model Registry

Each entry describes how to load a model. The lab's Granite 8B uses 4-bit quantization with a separate LoRA adapter, while the three fully trained models are pre-merged and loaded directly in bf16.

**Why the difference?** The Granite 8B model requires ~16 GB in bf16 — too large to fine-tune on an L4 (24 GB) without quantization. Section 3 uses QLoRA (4-bit quantization via BitsAndBytes) to compress the base model to ~5 GB, then trains a small LoRA adapter on top. At inference time we must reload the base in 4-bit and apply the adapter separately. The three smaller models (2–4B parameters) were fine-tuned on a DGX Spark with ample unified memory, so their LoRA adapters could be merged directly into the base weights and uploaded as complete bf16 models. Training these models took longer than this lab allows, so they are provided pre-trained on HuggingFace for you to download and evaluate here.

The `load_style` field controls the loading path:
- `"quantized_adapter"` — load base in 4-bit, then apply PEFT adapter
- `"merged"` — load directly from HuggingFace (weights already merged)

In [4]:
MODEL_REGISTRY = [
    {
        "name": "Granite 8B + LoRA (lab)",
        "base_model": "../03ModelAdaptation/models/granite-3.2-8b-instruct",
        "adapter": "../03ModelAdaptation/lora_output",
        "load_style": "quantized_adapter",
        "params": "~8B (4-bit)",
        "lr": "5e-6",
    },
    {
        "name": "Granite 2B (fully trained)",
        "base_model": "FrankDigsData/granite-2b-rhai-finetuned",
        "adapter": None,
        "load_style": "merged",
        "params": "~2.5B (bf16)",
        "lr": "2e-4",
    },
    {
        "name": "Phi-3 Mini (fully trained)",
        "base_model": "FrankDigsData/phi3-mini-rhai-finetuned",
        "adapter": None,
        "load_style": "merged",
        "params": "~3.8B (bf16)",
        "lr": "2e-4",
    },
    {
        "name": "Qwen2.5 3B (fully trained)",
        "base_model": "FrankDigsData/qwen25-3b-rhai-finetuned",
        "adapter": None,
        "load_style": "merged",
        "params": "~3B (bf16)",
        "lr": "2e-4",
    },
]

SYSTEM_PROMPT = (
    "You are a rules expert for the Basic Fantasy Role-Playing Game. "
    "Answer questions accurately based on the official rules. "
    "Be specific and cite page references or table values where possible."
)

print(f"Registered {len(MODEL_REGISTRY)} models for evaluation:")
for m in MODEL_REGISTRY:
    print(f"  - {m['name']} ({m['params']}, LR={m['lr']})")

Registered 4 models for evaluation:
  - Granite 8B + LoRA (lab) (~8B (4-bit), LR=5e-6)
  - Granite 2B (fully trained) (~2.5B (bf16), LR=2e-4)
  - Phi-3 Mini (fully trained) (~3.8B (bf16), LR=2e-4)
  - Qwen2.5 3B (fully trained) (~3B (bf16), LR=2e-4)


## 4.12 Helper Functions

Three helper functions handle the full model lifecycle:

1. **`load_model()`** — Loads model + tokenizer. Handles both quantized+adapter and merged loading styles.
2. **`generate_answer()`** — Generates an answer given a model, tokenizer, and question. Uses `pad_token_id` fallback for cross-architecture compatibility.
3. **`cleanup_model()`** — Deletes model and tokenizer, runs garbage collection, and clears GPU cache.

The key difference from Section 4's evaluation: these functions take `model` and `tokenizer` as parameters instead of using globals, so we can swap models in a loop.

In [5]:
def load_model(entry):
    """Load a model and tokenizer based on the registry entry.

    Returns (model, tokenizer, timing) where timing is a dict with
    'download_s' and 'gpu_load_s' keys.
    """
    timing = {}

    if entry["load_style"] == "quantized_adapter":
        t0 = time.time()
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
        )
        base = AutoModelForCausalLM.from_pretrained(
            entry["base_model"],
            quantization_config=bnb_config,
            device_map="auto",
            dtype=torch.float16,
        )
        tokenizer = AutoTokenizer.from_pretrained(entry["base_model"])
        timing["download_s"] = 0  # local model, no download
        timing["gpu_load_s"] = time.time() - t0

        t1 = time.time()
        model = PeftModel.from_pretrained(base, entry["adapter"])
        timing["adapter_s"] = time.time() - t1
    else:
        # For HF Hub models, the download happens inside from_pretrained.
        # We time the whole call — on repeat runs the cache makes download ~0.
        t0 = time.time()
        model = AutoModelForCausalLM.from_pretrained(
            entry["base_model"],
            device_map="auto",
            torch_dtype=torch.bfloat16,
        )
        timing["download_and_load_s"] = time.time() - t0

        t1 = time.time()
        tokenizer = AutoTokenizer.from_pretrained(entry["base_model"])
        timing["tokenizer_s"] = time.time() - t1

    model.eval()

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    return model, tokenizer, timing


def generate_answer(model, tokenizer, question, context=None, max_new_tokens=512):
    """Generate an answer using the given model and tokenizer."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    if context:
        user_content = (
            f"Use the following context to answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}"
        )
    else:
        user_content = question

    messages.append({"role": "user", "content": user_content})

    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=None,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )

    generated = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return answer


def cleanup_model(model, tokenizer):
    """Delete model and tokenizer, free GPU memory."""
    del model
    del tokenizer
    gc.collect()
    torch.cuda.empty_cache()
    mem = torch.cuda.memory_allocated() / 1e9
    print(f"  GPU memory after cleanup: {mem:.1f} GB")


print("Helper functions defined.")

Helper functions defined.


## 4.13 Classification Function

We reuse the same keyword-based classification from Section 4. An answer passes if it contains at least half of the key terms from the expected answer. This is a rough heuristic — the detailed answer tables below allow manual review.

In [6]:
KEY_CHECKS = {
    "q01": ["another level", "wait"],
    "q02": ["hit die", "combination class", "d6"],
    "q03": ["magic-user", "cannot", "cleric"],
    "q04": ["15"],
    "q05": ["turn undead", "table", "roll"],
    "q06": ["+2", "bonus"],
    "q07": ["adventure", "non-adventure", "hired", "do not"],
    "q08": ["spell", "learn", "level"],
    "q09": ["0", "dead", "may"],
    "q10": ["large", "cannot", "halfling"],
}


def classify_answer(answer, qid):
    """Simple keyword-based classification.

    Returns 'pass' if the answer contains at least half of the
    key terms for that question, 'fail' otherwise.
    """
    answer_lower = answer.lower()
    checks = KEY_CHECKS.get(qid, [])
    if not checks:
        return "fail"
    matches = sum(1 for kw in checks if kw in answer_lower)
    return "pass" if matches >= len(checks) / 2 else "fail"


# Quick sanity check
assert classify_answer("The Thief must wait until another level.", "q01") == "pass"
assert classify_answer("I don't know.", "q01") == "fail"
print("Classification function ready.")

Classification function ready.


## 4.14 Run Evaluation

The main loop loads each model one at a time, runs all 10 questions in both modes (without and with RAG context), stores the results, and frees GPU memory before moving to the next model.

**Error handling:** If a model fails to download or load (e.g., network timeout), the loop logs the error and continues with the remaining models. This is important in a workshop setting where network conditions vary.

**Expected runtime:** ~3-5 minutes per model depending on GPU and model size.

In [7]:
all_results = {}  # model_name -> {"no_context": [...], "with_context": [...]}

for entry in MODEL_REGISTRY:
    name = entry["name"]
    print(f"\n{'=' * 70}")
    print(f"Loading: {name}")
    print(f"{'=' * 70}")

    try:
        t0 = time.time()
        model, tokenizer, timing = load_model(entry)
        total_load = time.time() - t0
        mem = torch.cuda.memory_allocated() / 1e9
        print(f"  Total load time: {total_load:.0f}s | GPU memory: {mem:.1f} GB")
        for k, v in timing.items():
            print(f"    {k}: {v:.1f}s")
    except Exception as e:
        print(f"  FAILED to load: {e}")
        print(f"  Skipping {name} and continuing...")
        all_results[name] = {"no_context": [], "with_context": [], "error": str(e)}
        continue

    # --- Mode 1: Without RAG context ---
    print(f"\n  Mode 1: Without RAG context")
    t_mode1 = time.time()
    no_ctx_results = []
    for r in eval_baseline["results"]:
        qid = r["id"]
        try:
            answer = generate_answer(model, tokenizer, r["question"], context=None)
        except Exception as e:
            answer = f"[Generation error: {e}]"
        classification = classify_answer(answer, qid)
        no_ctx_results.append({
            "id": qid,
            "question": r["question"],
            "expected": r["expected"],
            "category": r["category"],
            "answer": answer,
            "classification": classification,
        })
        status = "PASS" if classification == "pass" else "FAIL"
        print(f"    {qid}: {status}")

    no_ctx_pass = sum(1 for r in no_ctx_results if r["classification"] == "pass")
    print(f"  Mode 1 score: {no_ctx_pass}/10 ({time.time() - t_mode1:.0f}s)")

    # --- Mode 2: With RAG context ---
    print(f"\n  Mode 2: With RAG context")
    t_mode2 = time.time()
    ctx_results = []
    for r in eval_with_context["results"]:
        qid = r["id"]
        try:
            answer = generate_answer(
                model, tokenizer, r["question"], context=r["retrieved_context"]
            )
        except Exception as e:
            answer = f"[Generation error: {e}]"
        classification = classify_answer(answer, qid)
        ctx_results.append({
            "id": qid,
            "question": r["question"],
            "expected": r["expected"],
            "category": r["category"],
            "answer": answer,
            "classification": classification,
        })
        status = "PASS" if classification == "pass" else "FAIL"
        print(f"    {qid}: {status}")

    ctx_pass = sum(1 for r in ctx_results if r["classification"] == "pass")
    print(f"  Mode 2 score: {ctx_pass}/10 ({time.time() - t_mode2:.0f}s)")

    all_results[name] = {"no_context": no_ctx_results, "with_context": ctx_results}

    # --- Cleanup ---
    print(f"\n  Cleaning up {name}...")
    cleanup_model(model, tokenizer)

print(f"\n{'=' * 70}")
print(f"Evaluation complete. {len([v for v in all_results.values() if 'error' not in v])} models evaluated successfully.")


Loading: Granite 8B + LoRA (lab)




Loading weights:   0%|          | 0/362 [00:00<?, ?it/s]

  Total load time: 128s | GPU memory: 4.7 GB
    download_s: 0.0s
    gpu_load_s: 126.7s
    adapter_s: 1.3s

  Mode 1: Without RAG context
    q01: FAIL
    q02: FAIL
    q03: FAIL
    q04: FAIL
    q05: PASS
    q06: PASS
    q07: PASS
    q08: PASS
    q09: PASS
    q10: FAIL
  Mode 1 score: 5/10 (318s)

  Mode 2: With RAG context
    q01: PASS
    q02: FAIL
    q03: FAIL
    q04: PASS
    q05: PASS
    q06: PASS
    q07: PASS
    q08: PASS
    q09: PASS


`torch_dtype` is deprecated! Use `dtype` instead!


    q10: PASS
  Mode 2 score: 8/10 (159s)

  Cleaning up Granite 8B + LoRA (lab)...
  GPU memory after cleanup: 4.7 GB

Loading: Granite 2B (fully trained)


Loading weights:   0%|          | 0/362 [00:00<?, ?it/s]



  Total load time: 24s | GPU memory: 9.2 GB
    download_and_load_s: 23.1s
    tokenizer_s: 0.4s

  Mode 1: Without RAG context
    q01: FAIL
    q02: FAIL
    q03: FAIL
    q04: PASS
    q05: PASS
    q06: PASS
    q07: FAIL
    q08: PASS
    q09: FAIL
    q10: FAIL
  Mode 1 score: 4/10 (42s)

  Mode 2: With RAG context
    q01: PASS
    q02: FAIL
    q03: PASS
    q04: PASS
    q05: PASS
    q06: PASS
    q07: PASS
    q08: PASS
    q09: PASS
    q10: PASS
  Mode 2 score: 9/10 (20s)

  Cleaning up Granite 2B (fully trained)...
  GPU memory after cleanup: 5.1 GB

Loading: Phi-3 Mini (fully trained)


Loading weights:   0%|          | 0/195 [00:00<?, ?it/s]

  Total load time: 58s | GPU memory: 7.7 GB
    download_and_load_s: 57.7s
    tokenizer_s: 0.4s

  Mode 1: Without RAG context
    q01: FAIL
    q02: FAIL
    q03: FAIL
    q04: FAIL
    q05: FAIL
    q06: PASS
    q07: FAIL
    q08: PASS
    q09: FAIL
    q10: FAIL
  Mode 1 score: 2/10 (72s)

  Mode 2: With RAG context
    q01: PASS
    q02: FAIL
    q03: FAIL
    q04: PASS
    q05: PASS
    q06: PASS
    q07: PASS
    q08: PASS
    q09: PASS
    q10: PASS
  Mode 2 score: 8/10 (32s)

  Cleaning up Phi-3 Mini (fully trained)...
  GPU memory after cleanup: 7.7 GB

Loading: Qwen2.5 3B (fully trained)


Loading weights:   0%|          | 0/434 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


  Total load time: 48s | GPU memory: 6.2 GB
    download_and_load_s: 46.5s
    tokenizer_s: 1.2s

  Mode 1: Without RAG context
    q01: FAIL
    q02: FAIL
    q03: FAIL
    q04: FAIL
    q05: FAIL
    q06: PASS
    q07: FAIL
    q08: PASS
    q09: FAIL
    q10: PASS
  Mode 1 score: 3/10 (31s)

  Mode 2: With RAG context
    q01: PASS
    q02: FAIL
    q03: PASS
    q04: PASS
    q05: PASS
    q06: PASS
    q07: PASS
    q08: PASS
    q09: PASS
    q10: PASS
  Mode 2 score: 9/10 (24s)

  Cleaning up Qwen2.5 3B (fully trained)...
  GPU memory after cleanup: 6.2 GB

Evaluation complete. 4 models evaluated successfully.


## 4.15 Per-Model Detail Tables

These tables show the full generated answer for each model and question, enabling manual review. Scroll through to spot-check the classification results.

In [8]:
def truncate(text, max_len=200):
    """Truncate text for table display."""
    if len(text) <= max_len:
        return text
    return text[:max_len] + "..."


for model_name, results in all_results.items():
    if "error" in results:
        display(HTML(f"<h3>{model_name} — SKIPPED (load error)</h3><p>{results['error']}</p>"))
        continue

    for mode_key, mode_label in [("no_context", "Without RAG"), ("with_context", "With RAG")]:
        mode_results = results[mode_key]
        pass_count = sum(1 for r in mode_results if r["classification"] == "pass")

        html = f"<h3>{model_name} — {mode_label} ({pass_count}/10)</h3>"
        html += '<table style="border-collapse: collapse; width: 100%; font-size: 12px;">'
        html += '<tr style="background-color: #f0f0f0;">'
        html += '<th style="border: 1px solid #ddd; padding: 6px;">ID</th>'
        html += '<th style="border: 1px solid #ddd; padding: 6px;">Category</th>'
        html += '<th style="border: 1px solid #ddd; padding: 6px;">Expected</th>'
        html += '<th style="border: 1px solid #ddd; padding: 6px;">Generated Answer</th>'
        html += '<th style="border: 1px solid #ddd; padding: 6px;">Result</th>'
        html += '</tr>'

        for r in mode_results:
            color = "#4CAF50" if r["classification"] == "pass" else "#f44336"
            label = "PASS" if r["classification"] == "pass" else "FAIL"
            html += '<tr>'
            html += f'<td style="border: 1px solid #ddd; padding: 6px;">{r["id"]}</td>'
            html += f'<td style="border: 1px solid #ddd; padding: 6px;">{r["category"]}</td>'
            html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(r["expected"], 120)}</td>'
            html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(r["answer"], 200)}</td>'
            html += f'<td style="border: 1px solid #ddd; padding: 6px; color: {color}; font-weight: bold;">{label}</td>'
            html += '</tr>'

        html += '</table>'
        display(HTML(html))

ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"According to the Basic Fantasy Role-Playing Game rules (page 10), if a Thief fails an Open Locks attempt, the following happens: 1. The Thief takes damage equal to 1d6 points. 2. The lock remains clo...",FAIL
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"In the Basic Fantasy Role-Playing Game, Elves, like all demi-humans, have a different hit point maximum compared to humans. According to the rules on page 10, Elves have a maximum hit point die of d6,...",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, a character can wear leather armor and still cast spells in the Basic Fantasy Role-Playing Game. There are no restrictions in the rules that prevent a spellcaster from wearing leather armor. On...",FAIL
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","In the Basic Fantasy Role-Playing Game, there is no specific saving throw listed for a 3rd level Fighter against Dragon Breath. However, the game does provide a general rule for saving throws against ...",FAIL
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,"In the Basic Fantasy Role-Playing Game, a Cleric can turn undead using their special ability. Here's how it works: 1. The Cleric must have a certain level of experience. The ability to turn undead is...",PASS
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","In the Basic Fantasy Role-Playing Game, a character's Strength score does not directly provide a bonus to melee attack rolls. Instead, it affects the damage dealt by melee attacks. However, if the c...",PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"In the Basic Fantasy Role-Playing Game, both retainers and hirelings are non-player characters (NPCs) who can assist the player characters (PCs), but they serve different roles and have different leve...",PASS
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...","According to the Basic Fantasy Role-Playing Game rules, a Magic-User can learn new spells under the following conditions: 1. **Spellbooks**: Magic-Users start with a spellbook containing four 1st-lev...",PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","According to the Basic Fantasy Role-Playing Game rules (page 10), when a character's hit points reach exactly 0, they are considered to be at negative hit points. At this point, the character is not y...",PASS
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","Yes, a Halfling can use a longbow in the Basic Fantasy Role-Playing Game. There are no restrictions in the rules that prevent Halflings from using longbows. They have the same proficiency with ranged ...",FAIL


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, they cannot try to open the same lock again until they have gained another level of experience. This rule is stated on page 19 of the Basic Fantasy Role-Playing...",PASS
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"According to the rules provided, Elves are restricted to never rolling larger than a six-sided die (d6) for hit points. This restriction is explicitly stated under the ""Restrictions"" section of the El...",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, an Elf character can wear leather armor and cast spells. This is because Elves are a combination class, advancing as both Fighters and Magic-Users simultaneously. Unlike a pure Magic-User, Elves ...",FAIL
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","According to the provided context, a 3rd level Fighter's saving throw against Dragon Breath is 15. This is listed under the ""Dragon Breath"" column for the 2-3 level range, which includes 3rd level.",PASS
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,"A Cleric turns undead by using the Turn Undead ability, which is detailed on the Clerics vs. Undead table. The Game Master (GM) looks up the Cleric's level on this table and cross-references it with t...",PASS
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","According to the Basic Fantasy Role-Playing Game rules, a Strength score of 16 corresponds to a bonus of +2. This bonus is applied to all attack and damage rolls in melee (hand to hand) combat. This i...",PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"A retainer and a hireling differ primarily in their roles and the level of risk they are willing to undertake. Retainers, as per the rules, are Non-Player Characters who are close associates of the p...",PASS
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...","A Magic-User can learn new spells at any point, but there are specific conditions that must be met. They can learn from a teacher or by acquiring a reference work, such as another Magic-User's spellbo...",PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","At exactly 0 hit points, a character is considered to be dead. However, this does not necessarily mean the end of the character. According to the rules, players should not tear up their character shee...",PASS
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","No, a Halfling cannot use a longbow. According to the Weapons Table, a longbow is a Large weapon. The restrictions for Halflings state that they may not use Large weapons. They are, however, allowed t...",PASS


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, the lock remains closed. The Thief cannot attempt to open the lock again until they succeed on a new Open Locks attempt.",FAIL
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"According to the official rules, Elves cannot roll higher than a d6 for hit points because their maximum hit points are determined by their race. The table for hit points by race shows that Elves have...",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, a character can wear leather armor and cast spells. According to the rules, leather armor does not impose any armor class penalty. The table on page 100 of the official rules lists the armor clas...",FAIL
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.",The saving throw for a 3rd level Fighter against Dragon Breath is a Constitution check. The DC for this check is 15.,PASS
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,"According to the official rules, a Cleric can turn undead by rolling a d20 and adding their Cleric level. The table below shows the results for each level of Cleric: | Cleric Level | Turn Undead Resu...",PASS
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.",A character with a Strength of 16 gets a +3 bonus on melee attack rolls.,PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"According to the official rules, the main differences between a retainer and a hireling are: 1. **Loyalty**: Retainers are more loyal to their master than hirelings. They are more likely to follow or...",FAIL
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...",A Magic-User can learn new spells by spending experience points (XP) to increase their level. The number of spells a Magic-User can learn at each level is listed in the table below: | Level | Spells ...,PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","According to the rules, when a character's hit points reach exactly 0, they are considered unconscious. This is stated on page 100 of the official rules.",FAIL
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","Yes, a Halfling can use a longbow. According to the rules, Halflings are proficient with all simple weapons, including longbows. This is stated on page 10 of the rules, under the ""Weapons"" section.",FAIL


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying again.",PASS
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"Elves cannot roll higher than a d6 for hit points because it is a restriction stated in the rules for the Elf class. This restriction applies to all Elves, regardless of their level or abilities.",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"No, a character cannot wear leather armor and cast spells. According to the Magic-User restrictions, Magic-Users may not wear armor of any sort nor use shields, as armor interferes with spell casting....",PASS
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.",The saving throw for a 3rd level Fighter against Dragon Breath is 15.,PASS
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,A Cleric turns undead by rolling a 1d20 and comparing the result to the minimum number indicated on the Clerics vs. Undead table for their level and the type of undead they are attempting to turn. If ...,PASS
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.",Answer: The character with a Strength of 16 gets a +3 bonus on melee attack rolls.,PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"A retainer is a Non-Player Character (NPC) who is hired by a player character for their loyalty and willingness to accompany them into dangerous places like dungeons, lairs, or ruins. Retainers are ty...",PASS
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...","A Magic-User can learn new spells at any point, but there are specific conditions. They must find a teacher or acquire a reference work, such as another Magic-User's spellbook. The cost of such learni...",PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","Answer: At exactly 0 hit points, a character is considered dead. However, the game does not explicitly state that the character is permanently dead. The text suggests that the character's sheet should...",PASS
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","Yes, a Halfling can use a longbow. According to the Weapons Table, a Longbow is listed as a Large weapon, and Halflings may use Large weapons.",PASS


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, the outcome depends on the difficulty of the lock and the Thief's skill level. According to the Basic Fantasy Role-Playing Game rules, the Thief has a chance to...",FAIL
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"In the Basic Fantasy Role-Playing Game, the hit point system is designed to reflect the different physical resilience of various races. Elves, being a race with a natural affinity for magic and agilit...",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, a character can wear leather armor and cast spells. According to the Basic Fantasy Role-Playing Game rules, leather armor is a type of light armor that does not impose any additional armor class ...",FAIL
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","According to the Basic Fantasy Role-Playing Game rules, a 3rd level Fighter has a saving throw of +3 against Dragon Breath. This information can be found in the ""Saving Throws"" table on page 10 of the...",FAIL
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,"A Cleric can turn undead using the spell ""Sacred Flame,"" which is found on page 202 of the Basic Fantasy Role-Playing Game rulebook. The spell description states that the Cleric can use their spellcas...",FAIL
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","According to the Basic Fantasy Role-Playing Game rules, a character with a Strength of 16 would receive a +3 bonus on melee attack rolls. This is outlined in the ""Combat"" section of the rulebook, spec...",PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"In the Basic Fantasy Role-Playing Game, a retainer and a hireling are two different types of characters that can be part of a player's party. A retainer is a character that is part of the player's pa...",FAIL
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...",A Magic-User can learn new spells during the following times: 1. At the start of a new campaign: Magic-Users can learn new spells from the spellbook provided by the game master (GM). The spellbook co...,PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","When a character in the Basic Fantasy Role-Playing Game reaches 0 hit points, they fall unconscious and are subject to the effects of the ""Falling Unconscious"" table on page 11. If they do not receive...",FAIL
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","Yes, a Halfling can use a longbow. According to the Basic Fantasy Role-Playing Game rules, Halflings are skilled archers. This is detailed in the ""Halfling"" section of the Player's Handbook (p. 29). T...",FAIL


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying again. This is based on the rules for the Thief class in the Basic Fantasy Role-...",PASS
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"According to the context provided, Elves are restricted to using only six-sided dice (d6) for determining their hit points. This rule is explicitly stated in the ""Restrictions"" section under the Elf e...",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, a character can wear leather armor and cast spells. According to the context provided, Magic-Users are restricted from wearing armor or using shields as it interferes with spell casting. However,...",FAIL
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","According to the Fighter class table, a 3rd level Fighter has a saving throw of 15 against Dragon Breath.",PASS
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,A Cleric turns undead by using their Turning Undead ability. The GM looks up the Cleric's level on the Clerics vs. Undead table and cross-references it with the undead type or Hit Dice. If the table i...,PASS
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","According to the context provided, a character with a Strength score of 16 would receive a +3 bonus on melee attack rolls. This information can be found in the table under the ""Strength"" column, where...",PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"A retainer is a Non-Player Character (NPC) who follows a player character into danger and is typically very loyal. They are hired for a share of treasure and support costs, and their number is limited...",PASS
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...","A Magic-User can learn new spells at any point during their adventure, as long as they meet the specific conditions mentioned in the context. These conditions include finding a teacher or acquiring a ...",PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","At exactly 0 hit points, a character may be considered dead. However, this does not necessarily mean the end of the character's story. The player should not tear up their character sheet. The characte...",PASS
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","No, a Halfling cannot use a longbow. According to the context provided, Halflings may not use Large weapons. The longbow is classified as a Large weapon, which means it is not suitable for Halflings t...",PASS


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks check, they do not gain any additional time to continue the attempt. The DM (Game Master) should determine what happens next based on the situation. For example, the loc...",FAIL
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"Elves cannot roll higher than a d6 for hit points because their Constitution modifier is capped at +2, which corresponds to a d6 (1d6). This limitation is detailed in the ""Hit Points"" section of the B...",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, a character can wear leather armor and cast spells. According to Table 1-2: Armor Class, leather armor has an AC of 10, which is within the range that allows spellcasting. Additionally, the spell...",FAIL
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","For a 3rd level Fighter, the saving throw against Dragon Breath would be Fortitude. This can be found in the ""Saving Throws"" section of the Basic Fantasy Role-Playing Game core rulebook, specifically ...",FAIL
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,"A Cleric can turn undead by casting the Turn Undead spell, which is a 3rd-level spell. This spell requires the Cleric to be at least 4th level and have the Turn Undead ability. The spell allows the Cl...",FAIL
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","A character with a Strength score of 16 gains a +2 bonus to their melee attack rolls. This is directly stated in the Basic Fantasy Role-Playing Game core rulebook, specifically on page 14 under the ""A...",PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"A retainer and a hireling are both types of NPCs (Non-Player Characters) that can be hired by the party, but they differ in their loyalty and cost. A retainer is a more loyal NPC who will remain with...",FAIL
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...","A Magic-User can learn new spells when they have enough experience points (XP) to qualify for a higher level spell slot, which occurs at the end of each adventure. Specifically, this happens after the...",PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","At exactly 0 hit points, a character is considered to be unconscious and dying. They must make a Constitution saving throw at the start of each of their turns. If they fail, they fall unconscious unti...",FAIL
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","No, a Halfling cannot use a longbow. According to Table 1-2: Weapon Proficiency in the Basic Fantasy Role-Playing Game (BFRPG) core rulebook, Halflings have proficiency only with shortbows and light c...",PASS


ID,Category,Expected,Generated Answer,Result
q01,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, they must wait until they gain another level of experience before they can try again with that lock.",PASS
q02,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"Elves cannot roll higher than a d6 for hit points because their restriction states explicitly, ""Elves never roll larger than six-sided dice (d6) for hit points.""",FAIL
q03,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"No, an Elf cannot wear leather armor and cast spells. According to the rules, Elves are a combination class that allows them to advance as both Fighters and Magic-Users. However, the restriction for E...",PASS
q04,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","The saving throw for a 3rd level Fighter against Dragon Breath is 15. This can be found directly from the table provided, under the ""Dragon Breath"" column for a Fighter at levels 2-3.",PASS
q05,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,A Cleric turns undead by rolling a d20 and comparing the result to the number given in the Clerics vs. Undead table for the specific type of undead they wish to affect. If the result is equal to or hi...,PASS
q06,table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","A Strength score of 16 grants a +4 bonus on melee attack rolls. This can be determined directly from the table provided in the Basic Fantasy Role-Playing Game, which lists bonuses for each Strength sc...",PASS
q07,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"A retainer is a close associate of the player character who is willing to accompany them into dangerous situations such as dungeons, lairs, or ruins. They are typically very loyal and are hired for a ...",PASS
q08,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...",A Magic-User can learn new spells under the following conditions: 1. **From a Teacher or Reference Work**: They can learn new spells by finding a teacher or acquiring a reference work like another Ma...,PASS
q09,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","At exactly 0 hit points, a character may be considered dead. However, according to the rules, ""don't tear up your character sheet"" when a character reaches 0 hit points. This indicates that the charac...",PASS
q10,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","No, a Halfling cannot use a longbow. According to the restrictions for Halflings, they may only use Medium weapons two-handed. The longbow is listed as a Large weapon, which does not align with the Ha...",PASS


## 4.16 Cross-Model Comparison: Mode 1 (Without RAG)

This table compares all four models side by side on the pure knowledge test. Look for:
- Do any models pass questions others fail?
- Does the larger Granite 8B outperform the smaller models despite its lower learning rate?

In [9]:
model_names = [name for name, res in all_results.items() if "error" not in res]

html = '<h3>Mode 1: Without RAG — All Models</h3>'
html += '<table style="border-collapse: collapse; width: 100%; font-size: 13px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 8px;">ID</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Question</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Category</th>'
for name in model_names:
    short = name.split(" (")[0]  # trim the (lab)/(fully trained) suffix for column header
    html += f'<th style="border: 1px solid #ddd; padding: 8px;">{short}</th>'
html += '</tr>'

for i, r in enumerate(eval_baseline["results"]):
    qid = r["id"]
    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{qid}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{truncate(r["question"], 55)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{r["category"]}</td>'
    for name in model_names:
        result = all_results[name]["no_context"][i]["classification"]
        color = "#4CAF50" if result == "pass" else "#f44336"
        label = "PASS" if result == "pass" else "FAIL"
        html += f'<td style="border: 1px solid #ddd; padding: 8px; color: {color}; font-weight: bold; text-align: center;">{label}</td>'
    html += '</tr>'

# Totals row
html += '<tr style="background-color: #e8e8e8; font-weight: bold;">'
html += f'<td style="border: 1px solid #ddd; padding: 8px;" colspan="3">Total</td>'
for name in model_names:
    total = sum(1 for r in all_results[name]["no_context"] if r["classification"] == "pass")
    html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{total}/10</td>'
html += '</tr></table>'

display(HTML(html))

ID,Question,Category,Granite 8B + LoRA,Granite 2B,Phi-3 Mini,Qwen2.5 3B
q01,What happens if a Thief fails an Open Locks attempt?,explicit_rule,FAIL,FAIL,FAIL,FAIL
q02,Why can't Elves roll higher than a d6 for hit points?,terminology,FAIL,FAIL,FAIL,FAIL
q03,Can a character wear leather armor and cast spells?,implicit_reasoning,FAIL,FAIL,FAIL,FAIL
q04,What is the saving throw for a 3rd level Fighter agains...,table_lookup,FAIL,PASS,FAIL,FAIL
q05,How does a Cleric turn undead?,multi_step_rule,PASS,PASS,FAIL,FAIL
q06,"If a character has a Strength of 16, what bonus do they...",table_lookup,PASS,PASS,PASS,PASS
q07,What is the difference between a retainer and a hirelin...,terminology,PASS,FAIL,FAIL,FAIL
q08,When can a Magic-User learn new spells?,implicit_reasoning,PASS,PASS,PASS,PASS
q09,What happens to a character at exactly 0 hit points?,explicit_rule,PASS,FAIL,FAIL,FAIL
q10,Can a Halfling use a longbow?,implicit_reasoning,FAIL,FAIL,FAIL,PASS


## 4.17 Cross-Model Comparison: Mode 2 (With RAG)

Same comparison with retrieved context included. The Day 2 baseline column shows the pre-SFT Granite 8B result for reference. Look for:
- Do all fine-tuned models beat the Day 2 baseline?
- Which models close the implicit reasoning gaps (q02, q04, q06, q07)?

In [10]:
# Mode 2: With RAG — include Day 2 baseline for reference
html = '<h3>Mode 2: With RAG — All Models + Day 2 Baseline</h3>'
html += '<table style="border-collapse: collapse; width: 100%; font-size: 13px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 8px;">ID</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Question</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Day 2 Baseline</th>'
for name in model_names:
    short = name.split(" (")[0]
    html += f'<th style="border: 1px solid #ddd; padding: 8px;">{short}</th>'
html += '</tr>'

for i, r in enumerate(eval_baseline["results"]):
    qid = r["id"]
    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{qid}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{truncate(r["question"], 55)}</td>'

    # Day 2 baseline
    day2_class = r["classification"]
    color = "#4CAF50" if day2_class == "pass" else "#f44336"
    label = "PASS" if day2_class == "pass" else "FAIL"
    html += f'<td style="border: 1px solid #ddd; padding: 8px; color: {color}; font-weight: bold; text-align: center;">{label}</td>'

    for name in model_names:
        result = all_results[name]["with_context"][i]["classification"]
        color = "#4CAF50" if result == "pass" else "#f44336"
        label = "PASS" if result == "pass" else "FAIL"
        html += f'<td style="border: 1px solid #ddd; padding: 8px; color: {color}; font-weight: bold; text-align: center;">{label}</td>'
    html += '</tr>'

# Totals row
html += '<tr style="background-color: #e8e8e8; font-weight: bold;">'
html += '<td style="border: 1px solid #ddd; padding: 8px;" colspan="2">Total</td>'
day2_total = sum(1 for r in eval_baseline["results"] if r["classification"] == "pass")
html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{day2_total}/10</td>'
for name in model_names:
    total = sum(1 for r in all_results[name]["with_context"] if r["classification"] == "pass")
    html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{total}/10</td>'
html += '</tr></table>'

display(HTML(html))

ID,Question,Day 2 Baseline,Granite 8B + LoRA,Granite 2B,Phi-3 Mini,Qwen2.5 3B
q01,What happens if a Thief fails an Open Locks attempt?,PASS,PASS,PASS,PASS,PASS
q02,Why can't Elves roll higher than a d6 for hit points?,FAIL,FAIL,FAIL,FAIL,FAIL
q03,Can a character wear leather armor and cast spells?,PASS,FAIL,PASS,FAIL,PASS
q04,What is the saving throw for a 3rd level Fighter agains...,FAIL,PASS,PASS,PASS,PASS
q05,How does a Cleric turn undead?,PASS,PASS,PASS,PASS,PASS
q06,"If a character has a Strength of 16, what bonus do they...",FAIL,PASS,PASS,PASS,PASS
q07,What is the difference between a retainer and a hirelin...,FAIL,PASS,PASS,PASS,PASS
q08,When can a Magic-User learn new spells?,PASS,PASS,PASS,PASS,PASS
q09,What happens to a character at exactly 0 hit points?,PASS,PASS,PASS,PASS,PASS
q10,Can a Halfling use a longbow?,PASS,PASS,PASS,PASS,PASS


## 4.18 Summary Scorecard

A consolidated view of all models showing parameter count, learning rate, scores in both modes, and the RAG lift (the difference between Mode 2 and Mode 1). RAG lift measures how much each model benefits from retrieved context.

In [11]:
html = '<h3>Summary Scorecard</h3>'
html += '<table style="border-collapse: collapse; width: 100%; font-size: 14px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 10px;">Model</th>'
html += '<th style="border: 1px solid #ddd; padding: 10px;">Params</th>'
html += '<th style="border: 1px solid #ddd; padding: 10px;">LR</th>'
html += '<th style="border: 1px solid #ddd; padding: 10px;">No RAG</th>'
html += '<th style="border: 1px solid #ddd; padding: 10px;">With RAG</th>'
html += '<th style="border: 1px solid #ddd; padding: 10px;">RAG Lift</th>'
html += '</tr>'

for entry in MODEL_REGISTRY:
    name = entry["name"]
    results = all_results.get(name, {})

    if "error" in results:
        html += '<tr>'
        html += f'<td style="border: 1px solid #ddd; padding: 10px;">{name}</td>'
        html += f'<td style="border: 1px solid #ddd; padding: 10px;">{entry["params"]}</td>'
        html += f'<td style="border: 1px solid #ddd; padding: 10px;">{entry["lr"]}</td>'
        html += '<td style="border: 1px solid #ddd; padding: 10px; color: #999;" colspan="3">Load failed</td>'
        html += '</tr>'
        continue

    no_rag = sum(1 for r in results["no_context"] if r["classification"] == "pass")
    with_rag = sum(1 for r in results["with_context"] if r["classification"] == "pass")
    lift = with_rag - no_rag
    lift_str = f"+{lift}" if lift >= 0 else str(lift)

    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 10px; font-weight: bold;">{name}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 10px;">{entry["params"]}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 10px;">{entry["lr"]}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 10px; text-align: center; font-weight: bold;">{no_rag}/10</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 10px; text-align: center; font-weight: bold;">{with_rag}/10</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 10px; text-align: center;">{lift_str}</td>'
    html += '</tr>'

# Day 2 baseline row for reference
day2_total = sum(1 for r in eval_baseline["results"] if r["classification"] == "pass")
html += '<tr style="background-color: #f9f9f9; font-style: italic;">'
html += '<td style="border: 1px solid #ddd; padding: 10px;">Day 2 Baseline (Granite 8B, no SFT)</td>'
html += '<td style="border: 1px solid #ddd; padding: 10px;">~8B (4-bit)</td>'
html += '<td style="border: 1px solid #ddd; padding: 10px;">N/A</td>'
html += '<td style="border: 1px solid #ddd; padding: 10px; text-align: center;">N/A</td>'
html += f'<td style="border: 1px solid #ddd; padding: 10px; text-align: center;">{day2_total}/10</td>'
html += '<td style="border: 1px solid #ddd; padding: 10px; text-align: center;">—</td>'
html += '</tr>'

html += '</table>'
display(HTML(html))

Model,Params,LR,No RAG,With RAG,RAG Lift
Granite 8B + LoRA (lab),~8B (4-bit),5e-06,5/10,8/10,+3
Granite 2B (fully trained),~2.5B (bf16),0.0002,4/10,9/10,+5
Phi-3 Mini (fully trained),~3.8B (bf16),0.0002,2/10,8/10,+6
Qwen2.5 3B (fully trained),~3B (bf16),0.0002,3/10,9/10,+6
"Day 2 Baseline (Granite 8B, no SFT)",~8B (4-bit),,,6/10,—


## 4.19 Category-Level Analysis

The 10 evaluation questions span five categories. This breakdown shows whether certain model architectures handle specific reasoning types better than others:

- **explicit_rule** (q01, q09): Direct rule lookups
- **terminology** (q02, q07): Understanding game-specific terms
- **table_lookup** (q04, q06): Reading values from tables in context
- **implicit_reasoning** (q03, q10): Multi-step inference
- **multi_step_rule** (q05): Complex procedure understanding
- **implicit_reasoning** also covers q08: applying rules not explicitly stated

In [12]:
# Build category mapping from eval data
categories = {}
for r in eval_baseline["results"]:
    cat = r["category"]
    if cat not in categories:
        categories[cat] = []
    categories[cat].append(r["id"])

html = '<h3>Pass Rate by Question Category (With RAG)</h3>'
html += '<table style="border-collapse: collapse; width: 100%; font-size: 13px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Category</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Questions</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Day 2 Baseline</th>'
for name in model_names:
    short = name.split(" (")[0]
    html += f'<th style="border: 1px solid #ddd; padding: 8px;">{short}</th>'
html += '</tr>'

# Build lookup for quick access
baseline_class_by_id = {r["id"]: r["classification"] for r in eval_baseline["results"]}

for cat, qids in sorted(categories.items()):
    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px; font-weight: bold;">{cat}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{", ".join(qids)}</td>'

    # Day 2 baseline
    day2_pass = sum(1 for qid in qids if baseline_class_by_id.get(qid) == "pass")
    html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{day2_pass}/{len(qids)}</td>'

    for name in model_names:
        results_by_id = {r["id"]: r for r in all_results[name]["with_context"]}
        model_pass = sum(1 for qid in qids if results_by_id.get(qid, {}).get("classification") == "pass")
        html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{model_pass}/{len(qids)}</td>'

    html += '</tr>'

html += '</table>'
display(HTML(html))

Category,Questions,Day 2 Baseline,Granite 8B + LoRA,Granite 2B,Phi-3 Mini,Qwen2.5 3B
explicit_rule,"q01, q09",2/2,2/2,2/2,2/2,2/2
implicit_reasoning,"q03, q08, q10",3/3,2/3,3/3,2/3,3/3
multi_step_rule,q05,1/1,1/1,1/1,1/1,1/1
table_lookup,"q04, q06",0/2,2/2,2/2,2/2,2/2
terminology,"q02, q07",0/2,1/2,1/2,1/2,1/2


## 4.20 Interpretation

### What the results show

**Model Size Matters for Standalone Knowledge:**
In Mode 1 (no RAG), the Granite 8B scored 5/10 despite using a 40x lower learning rate (5e-6 vs 2e-4) than the three smaller models (4/10, 2/10, 3/10). The larger model's pre-training knowledge gave it a baseline advantage that a higher learning rate on 8 training examples could not overcome. This suggests that for narrow-domain SFT with minimal data, model size provides a knowledge floor that hyperparameter tuning alone cannot replicate.

**RAG Is the Great Equalizer:**
In Mode 2 (with RAG), all four models converged to nearly identical scores: 8/10, 9/10, 8/10, 9/10. The size advantage that Granite 8B held in Mode 1 disappeared once retrieved context was provided. This is the most actionable finding: when RAG is available, what matters is the model's ability to extract answers from provided text, not how much it memorized during pre-training. Smaller models did this just as well as the larger one.

**RAG Lift:**
The smaller models showed larger RAG lifts (Granite 2B: +5, Phi-3 Mini: +6, Qwen2.5 3B: +6) compared to Granite 8B (+3). This is consistent: the smaller models had less built-in knowledge to start with, so they gained more from external context. The 8B model gained less because it was already answering some questions correctly from its weights alone.

**Inference Speed:**
The smaller models were 4-10x faster at inference. Granite 8B took over 5 minutes for 10 questions without context, while the 2-3B models completed the same task in 30-70 seconds. With RAG context, the gap narrowed but persisted. In a production deployment, this speed difference compounds: lower latency per request, higher throughput, and lower GPU cost at scale.

**Customer Implications:**
- When RAG is always available, smaller models deliver comparable accuracy at significantly lower cost and latency — this is the strongest signal in the data
- If the domain requires standalone knowledge without retrieval, larger models justify their cost through better baseline performance
- The learning rate difference (5e-6 vs 2e-4) is a confound — the 8B model's Mode 1 advantage could widen or narrow with a higher learning rate, and a fair head-to-head would use the same LR across all models
- With only 8 training examples, all conclusions should be treated as preliminary indicators, not production benchmarks

In [13]:
gc.collect()
torch.cuda.empty_cache()
print(f"Final GPU memory: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
print("\nCross-architecture evaluation complete.")

Final GPU memory: 6.2 GB

Cross-architecture evaluation complete.
