# 4. Evaluation: Did Model Adaptation Close the Gap?

> **GPU Required.** This section loads the base model (about 16GB in half precision, about 5GB once quantized to 4-bit) together with the LoRA adapter. An NVIDIA GPU with at least 16GB of memory is required. The lab environment provides an NVIDIA L40S (46GB) or L4 (24GB).

**Purpose:**

We have now completed three approaches to improving the model's performance on our evaluation questions:

1. **RAG (Day 2):** Retrieval-Augmented Generation gave us 6/10 correct answers. Four questions failed due to implicit reasoning gaps.
2. **Best-of-N (Section 1):** Inference-time scaling recovered 3 of the 4 failures (q04, q06, q07). Question q02 remained stuck.
3. **Model Adaptation (Section 3):** LoRA SFT on 8 Thief-specific training examples. Loss decreased from 2.93 to 2.26 over 5 epochs.

This section evaluates whether the fine-tuned model closes the remaining gaps. We run two evaluation modes:

- **Mode 1 — Without RAG Context:** A pure knowledge test. Did the training change what the model knows? We ask the 10 evaluation questions with no retrieved context. This measures whether domain knowledge was absorbed into the weights.
- **Mode 2 — With RAG Context:** An apples-to-apples comparison with the Day 2 baseline. Same questions, same retrieved chunks, but with the adapted model. This measures whether the model reasons better over the same information.

The training data covered only Thief-related topics (8 examples). The evaluation questions span Thieves, Elves, Fighters, Clerics, Magic-Users, Halflings, retainers, and hirelings. We should expect improvement on Thief-adjacent questions and minimal change elsewhere. If the model generalizes beyond its training data, that is a bonus. If it does not, that is a valid and instructive finding.

## 4.1 Install Dependencies

These packages were installed in Section 3. If you are running this notebook in a fresh kernel, the cell below ensures they are available.

In [1]:
! pip install peft bitsandbytes accelerate transformers -q
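
If reinstalling on every run is undesirable, a quick check of which packages are already importable can come first. The sketch below is one possible way to do that; `missing_packages` is a hypothetical helper, and the package list is copied from the `pip` command above.

```python
import importlib.util

# Package names copied from the pip command above; the helper is hypothetical.
REQUIRED = ["peft", "bitsandbytes", "accelerate", "transformers"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only locates a package without importing it, so the check stays cheap even for heavy libraries.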

In [2]:
import json
import torch
import os
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    print(f"Memory:          {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

PyTorch version: 2.10.0+cu128
CUDA available:  True
GPU:             NVIDIA L40S
Memory:          47.8 GB




## 4.2 Load Baseline Results

We load the prebuilt evaluation results from Day 2 and the Best-of-N results from Section 1. These provide the comparison baselines.
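
Later cells read the fields `id`, `question`, `expected`, `category`, and `classification` from each baseline record. A small validation step can catch a malformed file early. This is a minimal sketch on toy records; `find_missing_fields` is a hypothetical helper, not part of the lab code.

```python
def find_missing_fields(records, required_fields):
    """Map each record's id to the required fields it lacks."""
    problems = {}
    for index, record in enumerate(records):
        missing = [field for field in required_fields if field not in record]
        if missing:
            # Fall back to the list position when the record has no id
            problems[record.get("id", f"record #{index}")] = missing
    return problems


# Toy records standing in for the contents of a results file
sample = [
    {"id": "q01", "question": "...", "expected": "...",
     "category": "explicit_rule", "classification": "pass"},
    {"id": "q02", "question": "...", "expected": "..."},
]
required = ["id", "question", "expected", "category", "classification"]

print(find_missing_fields(sample, required))
# {'q02': ['category', 'classification']}
```

In the notebook, the same helper could be applied to `eval_baseline["results"]` once the file has been loaded.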

In [3]:
# Load Day 2 baseline results (without retrieved context details)
with open("../prebuilt/eval_results.json") as f:
    eval_baseline = json.load(f)

# Load Day 2 results with retrieved context (for Mode 2)
with open("../prebuilt/eval_with_context.json") as f:
    eval_with_context = json.load(f)

# Load Best-of-N results from Section 1
with open("../prebuilt/bon_results.json") as f:
    bon_results = json.load(f)

# Build lookup dictionaries
baseline_by_id = {r["id"]: r for r in eval_baseline["results"]}
context_by_id = {r["id"]: r for r in eval_with_context["results"]}
bon_by_id = {r["id"]: r for r in bon_results}

print(f"Loaded {len(baseline_by_id)} baseline results")
print(f"Loaded {len(context_by_id)} results with context")
print(f"Loaded {len(bon_by_id)} Best-of-N results")

# Show baseline summary
baseline_pass = sum(1 for r in eval_baseline["results"] if r["classification"] == "pass")
print(f"\nDay 2 baseline: {baseline_pass}/10 pass")
print("Failures:")
for r in eval_baseline["results"]:
    if r["classification"] != "pass":
        print(f"  {r['id']}: {r['question'][:60]}... ({r['classification']})")

Loaded 10 baseline results
Loaded 10 results with context
Loaded 10 Best-of-N results

Day 2 baseline: 6/10 pass
Failures:
  q02: Why can't Elves roll higher than a d6 for hit points?... (implicit_reasoning_failure)
  q04: What is the saving throw for a 3rd level Fighter against Dra... (implicit_reasoning_failure)
  q06: If a character has a Strength of 16, what bonus do they get ... (implicit_reasoning_failure)
  q07: What is the difference between a retainer and a hireling?... (implicit_reasoning_failure)


## 4.3 Load Fine-Tuned Model

We load the base Granite model in 4-bit quantization and apply the LoRA adapter from Section 3. This is the standard pattern for inference with a PEFT adapter: load the base model, then load the adapter on top.

4-bit quantization via `BitsAndBytesConfig` reduces memory usage from ~16GB to ~5GB, making inference feasible alongside the adapter weights.
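
The memory figures follow from simple arithmetic on the weight storage. The sketch below assumes a nominal 8 billion parameters (taken from the "8b" in the model name, not measured) and counts weights only.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: nominal parameter count implied by the model name

print(f"16-bit weights: {weight_memory_gb(NUM_PARAMS, 16):.1f} GB")  # 16.0 GB
print(f" 4-bit weights: {weight_memory_gb(NUM_PARAMS, 4):.1f} GB")   # 4.0 GB
```

The gap between the 4 GB estimate and the roughly 5 GB observed in practice is plausibly due to quantization metadata, layers that stay in higher precision, and the adapter weights, though this sketch does not measure any of those.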

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Paths
BASE_MODEL_PATH = "../03ModelAdaptation/models/granite-3.2-8b-instruct"
ADAPTER_PATH = "../03ModelAdaptation/lora_output"

# Verify paths exist
assert os.path.exists(BASE_MODEL_PATH), f"Base model not found at {BASE_MODEL_PATH}"
assert os.path.exists(ADAPTER_PATH), f"Adapter not found at {ADAPTER_PATH}"

print(f"Base model: {BASE_MODEL_PATH}")
print(f"Adapter:    {ADAPTER_PATH}")

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print("\nLoading base model in 4-bit...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.float16,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

print("Applying LoRA adapter...")
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()

print(f"\nModel loaded successfully")
print(f"GPU memory used: {torch.cuda.memory_allocated() / 1e9:.1f} GB")

Base model: ../03ModelAdaptation/models/granite-3.2-8b-instruct
Adapter:    ../03ModelAdaptation/lora_output

Loading base model in 4-bit...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Loading tokenizer...
Applying LoRA adapter...

Model loaded successfully
GPU memory used: 4.7 GB


## 4.4 Define Inference Function

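We reuse the system prompt from training. The adapter was trained with this prompt, so using it again at inference keeps the input format consistent with what the model saw.

Sampling is disabled (`do_sample=False`), so the model decodes greedily: at each step it picks the highest-scoring next token. The effect is the same as a temperature of 0, and it makes the outputs deterministic and reproducible.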
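
To see why turning sampling off gives repeatable output, the toy sketch below contrasts greedy selection with temperature sampling. The scores are made up, and `greedy_pick` and `sample_pick` are hypothetical helpers that illustrate the idea rather than the library's internals.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Draw an index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.3]  # hypothetical next-token scores

print([greedy_pick(toy_logits) for _ in range(5)])  # [0, 0, 0, 0, 0]

rng = random.Random(0)
print([sample_pick(toy_logits, 1.0, rng) for _ in range(5)])  # varies with the seed
```

Greedy selection returns the same token every time, while sampling can return any token with non-zero probability.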

In [5]:
SYSTEM_PROMPT = (
    "You are a rules expert for the Basic Fantasy Role-Playing Game. "
    "Answer questions accurately based on the official rules. "
    "Be specific and cite page references or table values where possible."
)


def generate_answer(question, context=None, max_new_tokens=512):
    """Generate an answer using the fine-tuned model.
    
    Args:
        question: The question to answer.
        context: Optional retrieved context to include in the prompt.
        max_new_tokens: Maximum tokens to generate.
    
    Returns:
        The generated answer string.
    """
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    
    if context:
        user_content = (
            f"Use the following context to answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}"
        )
    else:
        user_content = question
    
    messages.append({"role": "user", "content": user_content})
    
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=None,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the generated tokens (skip the input)
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return answer


# Quick test
test_answer = generate_answer("What is a Thief's Open Locks score at level 1?")
print(f"Test answer: {test_answer[:200]}..." if len(test_answer) > 200 else f"Test answer: {test_answer}")

Test answer: In the Basic Fantasy Role-Playing Game, a Thief's Open Locks score at level 1 is determined by rolling 3 six-sided dice (3d6) and adding the results together. This total is then modified by the Thief'...


## 4.5 Fine-Tuned Model: Knowledge Test

This is a pure knowledge test. We ask all 10 evaluation questions with no retrieved 
context. The model must answer from its weights alone, with no documents, no chunks, 
and no retrieval pipeline.

This isolates what the LoRA training actually changed inside the model. Since the 
training data covered Thief-related topics across 8 examples, we expect possible 
improvement on q01 but limited change elsewhere. Questions spanning topics outside 
the training data should behave similarly to the base model baseline.


In [6]:
# Run all 10 questions without context
print("Running evaluation Mode 1: Without RAG context")
print("=" * 60)

no_context_results = []

for r in eval_baseline["results"]:
    qid = r["id"]
    question = r["question"]
    expected = r["expected"]
    
    print(f"\n{qid}: {question}")
    answer = generate_answer(question, context=None)
    print(f"  Answer: {answer[:150]}..." if len(answer) > 150 else f"  Answer: {answer}")
    
    no_context_results.append({
        "id": qid,
        "question": question,
        "expected": expected,
        "category": r["category"],
        "ft_answer_no_context": answer,
    })

print("\n" + "=" * 60)
print(f"Mode 1 complete: {len(no_context_results)} questions answered")

Running evaluation Mode 1: Without RAG context

q01: What happens if a Thief fails an Open Locks attempt?
  Answer: According to the Basic Fantasy RPG rules (page 10), if a Thief fails an Open Locks attempt, the following can happen:

1. The lock remains stubbornly ...

q02: Why can't Elves roll higher than a d6 for hit points?
  Answer: In the Basic Fantasy Role-Playing Game, Elves, like all demi-humans, have a different hit point maximum compared to humans. According to the rules on ...

q03: Can a character wear leather armor and cast spells?
  Answer: Yes, a character can wear leather armor and still cast spells in the Basic Fantasy Role-Playing Game. There are no restrictions in the rules that prev...

q04: What is the saving throw for a 3rd level Fighter against Dragon Breath?
  Answer: In the Basic Fantasy Role-Playing Game, there is no specific saving throw listed for a 3rd level Fighter against Dragon Breath. However, the game does...

q05: How does a Cleric turn undead?
  Ans

### 4.5.1 Fine-Tuned Model Without RAG Context

Review each answer against the expected answer. Mark as **pass** if the answer 
contains the key facts from the expected answer, **fail** otherwise.

Note: Without RAG context, even a well-trained model may not know specific table 
values or exact rule wordings. The question is whether it demonstrates domain 
understanding.

In [7]:
from IPython.display import display, HTML

def truncate(text, max_len=200):
    """Truncate text for table display."""
    if len(text) <= max_len:
        return text
    return text[:max_len] + "..."

# Build results table for Mode 1
html = "<h3>4.5.1 Fine-Tuned Model Without RAG Context</h3>"
html += '<table style="border-collapse: collapse; width: 100%; font-size: 13px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 8px; text-align: left;">ID</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Question</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Category</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Expected</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Fine-Tuned Answer</th>'
html += '</tr>'

for r in no_context_results:
    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{r["id"]}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{truncate(r["question"], 80)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{r["category"]}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{truncate(r["expected"], 150)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{truncate(r["ft_answer_no_context"], 200)}</td>'
    html += '</tr>'

html += '</table>'
display(HTML(html))

ID,Question,Category,Expected,Fine-Tuned Answer
q01,What happens if a Thief fails an Open Locks attempt?,explicit_rule,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"According to the Basic Fantasy RPG rules (page 10), if a Thief fails an Open Locks attempt, the following can happen: 1. The lock remains stubbornly closed, and the Thief must start over. 2. The lock..."
q02,Why can't Elves roll higher than a d6 for hit points?,terminology,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"In the Basic Fantasy Role-Playing Game, Elves, like all demi-humans, have a different hit point maximum compared to humans. According to the rules on page 10, Elves have a maximum hit point die of d6,..."
q03,Can a character wear leather armor and cast spells?,implicit_reasoning,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, a character can wear leather armor and still cast spells in the Basic Fantasy Role-Playing Game. There are no restrictions in the rules that prevent a spellcaster from wearing leather armor. On..."
q04,What is the saving throw for a 3rd level Fighter against Dragon Breath?,table_lookup,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.","In the Basic Fantasy Role-Playing Game, there is no specific saving throw listed for a 3rd level Fighter against Dragon Breath. However, the game does provide a general rule for saving throws against ..."
q05,How does a Cleric turn undead?,multi_step_rule,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the type of undead.,"In the Basic Fantasy Role-Playing Game, a Cleric can turn undead using their special ability. Here's how it works: 1. The Cleric must have a certain level of experience. The ability to turn undead is..."
q06,"If a character has a Strength of 16, what bonus do they get on melee attack roll...",table_lookup,"A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","In the Basic Fantasy Role-Playing Game, a character's Strength score does not directly provide a bonus to melee attack rolls. Instead, it affects the damage dealt by melee attacks. However, if the c..."
q07,What is the difference between a retainer and a hireling?,terminology,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adventuring tasks.,"In the Basic Fantasy Role-Playing Game, both retainers and hirelings are non-player characters (NPCs) who provide services to the player character (PC). However, there are some differences between the..."
q08,When can a Magic-User learn new spells?,implicit_reasoning,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures at the GM's discretion.","According to the Basic Fantasy Role-Playing Game rules, a Magic-User can learn new spells under the following conditions: 1. **Spellbooks**: Magic-Users start with a spellbook containing four 1st-lev..."
q09,What happens to a character at exactly 0 hit points?,explicit_rule,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the character.","According to the Basic Fantasy Role-Playing Game rules (page 10), when a character's hit points reach exactly 0, they are considered to be at negative hit points. At this point, the character is not y..."
q10,Can a Halfling use a longbow?,implicit_reasoning,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling cannot use one.","Yes, a Halfling can use a longbow in the Basic Fantasy Role-Playing Game. There are no restrictions in the rules that prevent Halflings from using longbows. They have no penalty for using them, just a..."


## 4.6 Fine-Tuned Model: With RAG Context

This mode runs the same 10 questions with retrieved context included in each 
prompt, matching the Day 2 setup. ChromaDB is not required here: the retrieval 
step ran on Day 2 and its results were saved to `eval_with_context.json`. This 
notebook reads that file and passes the pre-retrieved chunks directly to the 
model, so no vector store query or embedding model is involved. Because the 
retrieved context is identical to what the Day 2 baseline used, differences in 
the answers can be attributed to the model rather than to retrieval, which is 
what makes the comparison apples-to-apples.
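
The comparison only holds if both result files cover the same questions in the same order, and the summary cell later pairs records by list position. A minimal sketch of a check for that assumption follows; `check_alignment` is a hypothetical helper, shown here on toy records.

```python
def check_alignment(baseline_results, context_results):
    """Raise ValueError unless both lists hold the same questions in the same order."""
    if len(baseline_results) != len(context_results):
        raise ValueError(
            f"Length mismatch: {len(baseline_results)} vs {len(context_results)}"
        )
    for base, ctx in zip(baseline_results, context_results):
        if base["id"] != ctx["id"] or base["question"] != ctx["question"]:
            raise ValueError(f"Mismatch: {base['id']} vs {ctx['id']}")
    return True


# Toy records standing in for the two result files
baseline = [{"id": "q01", "question": "A?"}, {"id": "q02", "question": "B?"}]
with_ctx = [{"id": "q01", "question": "A?"}, {"id": "q02", "question": "B?"}]

print(check_alignment(baseline, with_ctx))  # True
```

In the notebook this could be called as `check_alignment(eval_baseline["results"], eval_with_context["results"])` before running Mode 2.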


In [8]:
# Run all 10 questions with context
print("Running evaluation Mode 2: With RAG context")
print("=" * 60)

with_context_results = []

for r in eval_with_context["results"]:
    qid = r["id"]
    question = r["question"]
    expected = r["expected"]
    context = r["retrieved_context"]
    
    print(f"\n{qid}: {question}")
    answer = generate_answer(question, context=context)
    print(f"  Answer: {answer[:150]}..." if len(answer) > 150 else f"  Answer: {answer}")
    
    with_context_results.append({
        "id": qid,
        "question": question,
        "expected": expected,
        "category": r["category"],
        "baseline_answer": r["answer"],
        "baseline_classification": r["classification"],
        "ft_answer_with_context": answer,
    })

print("\n" + "=" * 60)
print(f"Mode 2 complete: {len(with_context_results)} questions answered")

Running evaluation Mode 2: With RAG context

q01: What happens if a Thief fails an Open Locks attempt?
  Answer: If a Thief fails an Open Locks attempt, they cannot try to open the same lock again until they have gained another level of experience. This rule is s...

q02: Why can't Elves roll higher than a d6 for hit points?
  Answer: According to the rules provided, Elves have a restriction that they "never roll larger than six-sided dice (d6) for hit points." This means that regar...

q03: Can a character wear leather armor and cast spells?
  Answer: Yes, an Elf character can wear leather armor and cast spells. This is because Elves are a combination class, advancing as both Fighters and Magic-User...

q04: What is the saving throw for a 3rd level Fighter against Dragon Breath?
  Answer: According to the provided context, a 3rd level Fighter's saving throw against Dragon Breath is 15. This is listed under the "Dragon Breath" column for...

q05: How does a Cleric turn undead?
  Answer

In [9]:
# Build comparison table for Mode 2
html = "<h3>4.6.1 Fine-Tuned Model vs. Baselines: RAG Context</h3>"
html += '<table style="border-collapse: collapse; width: 100%; font-size: 12px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 6px;">ID</th>'
html += '<th style="border: 1px solid #ddd; padding: 6px;">Question</th>'
html += '<th style="border: 1px solid #ddd; padding: 6px;">Expected</th>'
html += '<th style="border: 1px solid #ddd; padding: 6px;">Day 2 Baseline</th>'
html += '<th style="border: 1px solid #ddd; padding: 6px;">Day 2</th>'
html += '<th style="border: 1px solid #ddd; padding: 6px;">Best-of-N</th>'
html += '<th style="border: 1px solid #ddd; padding: 6px;">Fine-Tuned + RAG</th>'
html += '</tr>'

for r in with_context_results:
    qid = r["id"]
    bon = bon_by_id.get(qid, {})
    bon_answer = bon.get("bon_answer", "N/A")
    
    # Color the baseline classification
    baseline_class = r["baseline_classification"]
    if baseline_class == "pass":
        class_color = "#4CAF50"
        class_label = "PASS"
    else:
        class_color = "#f44336"
        class_label = "FAIL"
    
    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px;">{qid}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(r["question"], 60)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(r["expected"], 120)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(r["baseline_answer"], 120)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px; color: {class_color}; font-weight: bold;">{class_label}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(bon_answer, 120)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 6px;">{truncate(r["ft_answer_with_context"], 150)}</td>'
    html += '</tr>'

html += '</table>'
display(HTML(html))

ID,Question,Expected,Day 2 Baseline,Day 2,Best-of-N,Fine-Tuned + RAG
q01,What happens if a Thief fails an Open Locks attempt?,The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.,"If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying ...",PASS,"If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying ...","If a Thief fails an Open Locks attempt, they cannot try to open the same lock again until they have gained another level of experience. This rule is s..."
q02,Why can't Elves roll higher than a d6 for hit points?,Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.,"According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. The reason for this ...",FAIL,"According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. This is a restrictio...","According to the rules provided, Elves have a restriction that they ""never roll larger than six-sided dice (d6) for hit points."" This means that regar..."
q03,Can a character wear leather armor and cast spells?,Magic-Users and Elves cannot cast spells while wearing armor. Clerics can wear armor and cast spells.,"Yes, according to the provided context, characters who can cast spells may wear leather armor.",PASS,"Yes, an Elf can wear leather armor and cast spells. According to the context, Elves are a combination class that advance...","Yes, an Elf character can wear leather armor and cast spells. This is because Elves are a combination class, advancing as both Fighters and Magic-User..."
q04,What is the saving throw for a 3rd level Fighter against Dra...,"Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.",The context does not provide specific information on the saving throw for a 3rd level Fighter against Dragon Breath. How...,FAIL,"3rd level Fighters have a saving throw of 15 against Dragon Breath. This is listed in the ""Dragon Breath"" column of the ...","According to the provided context, a 3rd level Fighter's saving throw against Dragon Breath is 15. This is listed under the ""Dragon Breath"" column for..."
q05,How does a Cleric turn undead?,The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ...,"To turn undead, the GM looks up the Cleric's level on the Clerics vs. Undead table and cross-references it with the unde...",PASS,A Cleric turns undead by looking up their level on the Clerics vs. Undead table and cross-referencing it with the undead...,"A Cleric turns undead by using the ""Turn Undead"" ability, which is detailed on the Clerics vs. Undead table. The Game Master (GM) looks up the Cleric'..."
q06,"If a character has a Strength of 16, what bonus do they get ...","A Strength score of 16 gives a +2 bonus, which applies to melee attack rolls and damage rolls.","The provided context does not specify how Strength affects melee attack rolls. In Basic Fantasy RPG, it is recommended n...",FAIL,A character with a Strength of 16 receives a bonus of +2 on melee attack rolls. This is determined by the ability bonus ...,"According to the Basic Fantasy Role-Playing Game rules, a Strength score of 16 would grant a +2 bonus on melee attack rolls. This is because a Strengt..."
q07,What is the difference between a retainer and a hireling?,Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven...,"In the Basic Fantasy RPG system, a retainer and a hireling are both types of NPCs that can be hired by player characters...",FAIL,The primary difference between a retainer and a hireling lies in their roles and willingness to engage in dangerous acti...,A retainer and a hireling in the Basic Fantasy Role-Playing Game differ primarily in their roles and the level of risk they are willing to undertake. ...
q08,When can a Magic-User learn new spells?,"A Magic-User can add spells to their spell book when gaining a level, and may also learn spells found during adventures ...","A Magic-User can learn new spells at any point, but there are specific conditions. They must find a teacher or acquire a...",PASS,"A Magic-User can learn new spells at any point, but there are specific conditions. They must find a teacher or acquire a...","A Magic-User can learn new spells at any point, but there are specific conditions that must be met. They can learn from a teacher or by acquiring a re..."
q09,What happens to a character at exactly 0 hit points?,"When a character's hit point total reaches 0, the character may be dead. The rules note this may not be the end for the ...","When a character's hit point total reaches exactly 0, the character may be dead. However, this is not necessarily the en...",PASS,"At exactly 0 hit points, a character may be dead. However, this is not necessarily the end for the character. The rules ...","At exactly 0 hit points, a character is considered to be dead. However, this does not necessarily mean the end of the character. According to the Basi..."
q10,Can a Halfling use a longbow?,"Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c...","No, according to the provided context, Halflings may not use Large weapons, and longbows are typically around 6 feet uns...",PASS,"No, a Halfling cannot use a longbow. According to the provided context, Halflings must use Medium weapons two-handed and...","No, a Halfling cannot use a longbow. According to the Weapons Table, a longbow is a Large weapon. The restrictions for Halflings state that they may n..."


## 4.7 Full Comparison Summary

This cell brings together all evaluation results into a single comparison. For each question, we show the outcome across all four evaluation modes:

1. **Day 2 RAG Baseline** — the original pipeline from Day 2
2. **Best-of-N** — inference-time scaling from Section 1
3. **Fine-Tuned (no RAG)** — the adapted model without any context
4. **Fine-Tuned + RAG** — the adapted model with the same retrieved context
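
Totals alone can hide regressions, so it helps to list which questions changed between two modes. The sketch below does this for the Day 2 RAG and FT + RAG columns, with labels copied from the summary table at the end of this section; `compare_modes` is a hypothetical helper.

```python
def compare_modes(before, after):
    """Return (improved, regressed) question ids between two pass/fail mappings."""
    improved = [q for q in before if before[q] == "fail" and after[q] == "pass"]
    regressed = [q for q in before if before[q] == "pass" and after[q] == "fail"]
    return improved, regressed


# Labels copied from the Day 2 RAG and FT + RAG columns of the summary table
day2 = {"q01": "pass", "q02": "fail", "q03": "pass", "q04": "fail", "q05": "pass",
        "q06": "fail", "q07": "fail", "q08": "pass", "q09": "pass", "q10": "pass"}
ft_rag = {"q01": "pass", "q02": "fail", "q03": "fail", "q04": "pass", "q05": "pass",
          "q06": "pass", "q07": "pass", "q08": "pass", "q09": "pass", "q10": "pass"}

improved, regressed = compare_modes(day2, ft_rag)
print("Improved: ", improved)   # ['q04', 'q06', 'q07']
print("Regressed:", regressed)  # ['q03']
```

Read this way, the fine-tuned model with RAG recovers three of the four Day 2 failures but loses q03 under the automatic scoring.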

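The Day 2 column uses the classifications stored in `eval_results.json`. The other three columns are scored in this notebook by `classify_answer`, a keyword heuristic: an answer passes if it contains at least half of a hand-picked list of key terms for that question. Because keyword matching cannot tell a correct statement from a wrong one that happens to use the same words, treat these labels as a rough guide and confirm them against the answer tables above.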
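
The keyword rule used for scoring can be isolated in a few lines to show where it is lenient. In this sketch, `keyword_pass` is a hypothetical stand-in for the matching step. The key terms are the ones listed for q06 in the scoring cell, and the second answer paraphrases the opening of the q06 no-context answer shown earlier.

```python
def keyword_pass(answer, key_terms):
    """Pass when at least half of the key terms occur in the answer."""
    answer_lower = answer.lower()
    matches = sum(1 for term in key_terms if term in answer_lower)
    return matches >= len(key_terms) / 2


terms = ["+2", "bonus"]  # key terms listed for q06

correct = "A Strength score of 16 gives a +2 bonus on melee attack rolls."
wrong = ("A character's Strength score does not directly provide a bonus "
         "to melee attack rolls.")

print(keyword_pass(correct, terms))  # True: both terms match
print(keyword_pass(wrong, terms))    # True: only "bonus" matches, but 1 of 2 is enough
```

The second result is a false positive, which is why the automatic labels should be checked by hand against the answer tables.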

In [10]:
def classify_answer(answer, expected, question):
    """Simple keyword-based classification.
    
    Checks if key terms from the expected answer appear in the generated answer.
    This is a rough heuristic — manual review of the tables above is more reliable.
    """
    answer_lower = answer.lower()
    
    # Hand-picked key terms per question ID. They are hard-coded here rather
    # than derived from `expected`, which is kept only for reference.
    key_checks = {
        "q01": ["another level", "wait"],
        "q02": ["hit die", "combination class", "d6"],
        "q03": ["magic-user", "cannot", "cleric"],
        "q04": ["15"],
        "q05": ["turn undead", "table", "roll"],
        "q06": ["+2", "bonus"],
        "q07": ["adventure", "non-adventure", "hired", "do not"],
        "q08": ["spell", "learn", "level"],
        "q09": ["0", "dead", "may"],
        "q10": ["large", "cannot", "halfling"],
    }
    
    qid = None
    for r in eval_baseline["results"]:
        if r["question"] == question:
            qid = r["id"]
            break
    
    if qid and qid in key_checks:
        checks = key_checks[qid]
        matches = sum(1 for kw in checks if kw in answer_lower)
        # Require at least half of key terms to match
        if matches >= len(checks) / 2:
            return "pass"
    
    return "fail"


# Build the full comparison
print("Full Comparison Summary")
print("=" * 80)

summary_rows = []

for i, r in enumerate(eval_baseline["results"]):
    qid = r["id"]
    
    # Day 2 baseline
    day2_class = r["classification"]
    
    # Best-of-N: re-score the saved answer with the same keyword heuristic
    bon = bon_by_id.get(qid, {})
    bon_answer = bon.get("bon_answer", "")
    bon_result = classify_answer(bon_answer, r["expected"], r["question"])
    
    # Fine-tuned without context
    ft_no_ctx = no_context_results[i]
    ft_no_ctx_class = classify_answer(
        ft_no_ctx["ft_answer_no_context"], r["expected"], r["question"]
    )
    
    # Fine-tuned with context
    ft_ctx = with_context_results[i]
    ft_ctx_class = classify_answer(
        ft_ctx["ft_answer_with_context"], r["expected"], r["question"]
    )
    
    summary_rows.append({
        "id": qid,
        "question": r["question"],
        "category": r["category"],
        "day2": day2_class,
        "bon": bon_result,
        "ft_no_context": ft_no_ctx_class,
        "ft_with_context": ft_ctx_class,
    })

# Display summary table
html = '<h3>Evaluation Results Across All Modes</h3>'
html += '<table style="border-collapse: collapse; width: 100%; font-size: 13px;">'
html += '<tr style="background-color: #f0f0f0;">'
html += '<th style="border: 1px solid #ddd; padding: 8px;">ID</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Question</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Category</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Day 2 RAG</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">Best-of-N</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">FT (no RAG)</th>'
html += '<th style="border: 1px solid #ddd; padding: 8px;">FT + RAG</th>'
html += '</tr>'

for row in summary_rows:
    html += '<tr>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{row["id"]}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{truncate(row["question"], 60)}</td>'
    html += f'<td style="border: 1px solid #ddd; padding: 8px;">{row["category"]}</td>'
    
    for mode in ["day2", "bon", "ft_no_context", "ft_with_context"]:
        val = row[mode]
        if val == "pass":
            color = "#4CAF50"
            label = "PASS"
        else:
            color = "#f44336"
            label = "FAIL"
        html += f'<td style="border: 1px solid #ddd; padding: 8px; color: {color}; font-weight: bold; text-align: center;">{label}</td>'
    
    html += '</tr>'

# Totals row (denominator comes from the data instead of a hard-coded 10)
n_questions = len(summary_rows)
day2_total = sum(1 for r in summary_rows if r["day2"] == "pass")
bon_total = sum(1 for r in summary_rows if r["bon"] == "pass")
ft_no_ctx_total = sum(1 for r in summary_rows if r["ft_no_context"] == "pass")
ft_ctx_total = sum(1 for r in summary_rows if r["ft_with_context"] == "pass")

html += '<tr style="background-color: #e8e8e8; font-weight: bold;">'
html += '<td style="border: 1px solid #ddd; padding: 8px;" colspan="3">Total</td>'
html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{day2_total}/{n_questions}</td>'
html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{bon_total}/{n_questions}</td>'
html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{ft_no_ctx_total}/{n_questions}</td>'
html += f'<td style="border: 1px solid #ddd; padding: 8px; text-align: center;">{ft_ctx_total}/{n_questions}</td>'
html += '</tr>'

html += '</table>'
display(HTML(html))

# Print text summary
print("\nScore Summary:")
print(f"  Day 2 RAG Baseline:    {day2_total}/{n_questions}")
print(f"  Best-of-N:             {bon_total}/{n_questions}")
print(f"  Fine-Tuned (no RAG):   {ft_no_ctx_total}/{n_questions}")
print(f"  Fine-Tuned + RAG:      {ft_ctx_total}/{n_questions}")

Full Comparison Summary


| ID | Question | Category | Day 2 RAG | Best-of-N | FT (no RAG) | FT + RAG |
|---|---|---|---|---|---|---|
| q01 | What happens if a Thief fails an Open Locks attempt? | explicit_rule | PASS | PASS | FAIL | PASS |
| q02 | Why can't Elves roll higher than a d6 for hit points? | terminology | FAIL | FAIL | FAIL | FAIL |
| q03 | Can a character wear leather armor and cast spells? | implicit_reasoning | PASS | FAIL | FAIL | FAIL |
| q04 | What is the saving throw for a 3rd level Fighter against Dra... | table_lookup | FAIL | PASS | FAIL | PASS |
| q05 | How does a Cleric turn undead? | multi_step_rule | PASS | PASS | PASS | PASS |
| q06 | If a character has a Strength of 16, what bonus do they get ... | table_lookup | FAIL | PASS | PASS | PASS |
| q07 | What is the difference between a retainer and a hireling? | terminology | FAIL | PASS | FAIL | PASS |
| q08 | When can a Magic-User learn new spells? | implicit_reasoning | PASS | PASS | PASS | PASS |
| q09 | What happens to a character at exactly 0 hit points? | explicit_rule | PASS | PASS | PASS | PASS |
| q10 | Can a Halfling use a longbow? | implicit_reasoning | PASS | PASS | FAIL | PASS |



Score Summary:
  Day 2 RAG Baseline:    6/10
  Best-of-N:             8/10
  Fine-Tuned (no RAG):   4/10
  Fine-Tuned + RAG:      8/10
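
The totals hide which questions changed status between modes. As a minimal sketch, the snippet below transcribes the pass/fail values from the table above into a plain dictionary and lists the questions that were recovered or regressed between two modes. The names `RESULTS`, `MODES`, and `transitions` are illustrative and do not appear elsewhere in the notebook.

```python
# Pass/fail outcomes transcribed from the comparison table above.
# Tuple order matches MODES: Day 2 RAG, Best-of-N, FT (no RAG), FT + RAG.
RESULTS = {
    "q01": ("PASS", "PASS", "FAIL", "PASS"),
    "q02": ("FAIL", "FAIL", "FAIL", "FAIL"),
    "q03": ("PASS", "FAIL", "FAIL", "FAIL"),
    "q04": ("FAIL", "PASS", "FAIL", "PASS"),
    "q05": ("PASS", "PASS", "PASS", "PASS"),
    "q06": ("FAIL", "PASS", "PASS", "PASS"),
    "q07": ("FAIL", "PASS", "FAIL", "PASS"),
    "q08": ("PASS", "PASS", "PASS", "PASS"),
    "q09": ("PASS", "PASS", "PASS", "PASS"),
    "q10": ("PASS", "PASS", "FAIL", "PASS"),
}
MODES = ("day2", "bon", "ft_no_context", "ft_with_context")


def transitions(results, before, after):
    """Return (recovered, regressed) question IDs between two modes."""
    i, j = MODES.index(before), MODES.index(after)
    recovered = [q for q, r in results.items() if r[i] == "FAIL" and r[j] == "PASS"]
    regressed = [q for q, r in results.items() if r[i] == "PASS" and r[j] == "FAIL"]
    return recovered, regressed


recovered, regressed = transitions(RESULTS, "day2", "ft_with_context")
print("Recovered by FT + RAG:", recovered)  # ['q04', 'q06', 'q07']
print("Regressed with FT + RAG:", regressed)  # ['q03']
```

Read this way, the two-point gain of FT + RAG over the Day 2 baseline is three recoveries offset by one regression (q03).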


## 4.8 Conclusions

### What the results tell us

This evaluation demonstrates the full pipeline: RAG, inference-time scaling, model adaptation, and structured evaluation. The fine-tuned model's performance should be interpreted in the context of the training setup:

**Training constraints:**
- **8 training examples** — all focused on Thief abilities. The 10 eval questions span 7 different topics.
- **Conservative learning rate** (5e-6 vs typical 1e-4 to 2e-4) — the model updated slowly.
- **Modest loss reduction** (2.93 to 2.26) — the model learned something, but didn't fully converge.

**What this means:**
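- The only question directly covered by the training data is q01. It already passed in the Day 2 RAG baseline and still passes with FT + RAG, so there was no room to improve there. Without context it fails, which means the fine-tuned model could not answer it from its weights alone in this run.
- The non-Thief questions (q02-q10) were not expected to improve from the training data itself. You cannot teach a model about Fighter saving throws by training it on Thief abilities. q02 fails in every mode.
- FT + RAG scored 8/10 against the 6/10 baseline: it recovered q04, q06, and q07 but lost q03. The model may have developed better reasoning patterns even from limited training, but a net gain of two questions on a 10-question set is suggestive rather than conclusive.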
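
To separate the directly covered question from the rest, the scores can be tallied per group. The sketch below is illustrative only: the outcomes are copied from the comparison table, and the names `OUTCOMES`, `THIEF_IDS`, and `pass_counts` are hypothetical helpers, not part of the notebook.

```python
# Outcomes transcribed from the comparison table: 1 = PASS, 0 = FAIL.
# Tuple order: (Day 2 RAG, FT without RAG, FT + RAG).
OUTCOMES = {
    "q01": (1, 0, 1), "q02": (0, 0, 0), "q03": (1, 0, 0), "q04": (0, 0, 1),
    "q05": (1, 1, 1), "q06": (0, 1, 1), "q07": (0, 0, 1), "q08": (1, 1, 1),
    "q09": (1, 1, 1), "q10": (1, 0, 1),
}
# Per the text, q01 is the only eval question on a Thief topic.
THIEF_IDS = {"q01"}


def pass_counts(ids):
    """Return ((day2, ft_no_rag, ft_rag) pass counts, number of questions)."""
    rows = [OUTCOMES[q] for q in sorted(ids)]
    return tuple(sum(col) for col in zip(*rows)), len(rows)


thief_scores, thief_n = pass_counts(THIEF_IDS)
other_scores, other_n = pass_counts(set(OUTCOMES) - THIEF_IDS)

labels = ("Day 2 RAG", "FT (no RAG)", "FT + RAG")
for name, scores, n in (("Thief", thief_scores, thief_n),
                        ("Other", other_scores, other_n)):
    summary = ", ".join(f"{lab}: {s}/{n}" for lab, s in zip(labels, scores))
    print(f"{name:5s} -> {summary}")
```

With a single covered question, the Thief group cannot show a trend in either direction; a larger eval set per topic would be needed for that.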

**The bigger picture:**

In a production engagement, you would:
1. Generate hundreds or thousands of training examples covering the full domain
2. Use a typical learning rate (1e-4 to 2e-4) for LoRA
3. Train for more epochs until loss converges (target < 1.0)
4. Evaluate on a held-out test set, not the same questions used to guide training
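
Step 4 can be sketched in a few lines. The example below assumes the training examples are a list of dictionaries with an `id` field; the function name `split_examples`, the 20% test fraction, and the placeholder IDs are illustrative choices, not values taken from this lab.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A seeded RNG keeps the split reproducible across notebook runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder examples; in practice these would be the generated Q&A pairs.
examples = [{"id": f"ex{i:03d}"} for i in range(100)]
train, test = split_examples(examples)

# The held-out set must not share any example with the training set.
overlap = {e["id"] for e in train} & {e["id"] for e in test}
print(len(train), len(test), len(overlap))
```

The split should happen before any prompt or hyperparameter tuning, so that the held-out questions never influence training decisions.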

The value of this exercise is not the final score — it is the process. You now have a complete, repeatable pipeline from document ingestion through synthetic data generation, model adaptation, and structured evaluation. Each stage has clear inputs, outputs, and decision points. That pipeline is what you bring to a customer engagement.

For a detailed discussion of the training caveats and what would change in production, see `caveats.md` in this directory.

In [11]:
# Clean up GPU memory
del model
del base_model
torch.cuda.empty_cache()
print(f"GPU memory after cleanup: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
print("\nEvaluation complete.")

GPU memory after cleanup: 4.1 GB

Evaluation complete.
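
`torch.cuda.memory_allocated()` reports memory occupied by live tensors, so the 4.1 GB above suggests that something is still referenced after the two `del` statements. Which object that is cannot be determined from this output. The sketch below uses plain Python and hypothetical names (`FakeModel`, `fake_wrapper`, `leftover`) to illustrate the general mechanism: `del` removes a name, and the object is only freed once no other reference remains.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object."""


fake_base = FakeModel()
fake_wrapper = {"wrapped": fake_base}  # a wrapper that holds a second reference
leftover = [fake_base]                 # hypothetical reference kept elsewhere

# A weak reference lets us observe the object without keeping it alive.
probe = weakref.ref(fake_base)

del fake_wrapper
del fake_base
gc.collect()
still_alive_after_del = probe() is not None  # leftover still points at it

leftover.clear()
gc.collect()
freed_after_clear = probe() is None

print("Alive after del:", still_alive_after_del)
print("Freed after clearing leftover reference:", freed_after_clear)
```

In the notebook, candidates for such leftover references might include cached outputs or tokenized inputs that still live on the GPU; that is an assumption to check, not a diagnosis.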
