# 3. Model Adaptation: Giving the Model Domain Knowledge

**Purpose:**

A general-purpose language model is trained on broad internet text. When you ask it about a specific domain, such as the rules of a particular tabletop RPG, it may give plausible-sounding but incorrect answers, or admit it does not know. The model's weights do not reliably contain the domain knowledge you need.

There are several ways to close that gap. The simplest and cheapest is **Retrieval-Augmented Generation (RAG)**: instead of changing the model, you give it the right information at query time by retrieving relevant document chunks and including them in the prompt. The model reads the context and answers based on it.

This notebook demonstrates the RAG approach:

1. **Baseline**: Ask the model domain-specific questions with no context. Observe where it fails.
2. **RAG Pipeline**: Load the domain document, chunk it, embed the chunks into a vector store, and retrieve relevant context at query time.
3. **Comparison**: Ask the same questions with retrieved context and compare the results.

RAG is the first technique to try before escalating to more expensive approaches like inference-time scaling or fine-tuning. If RAG solves the problem, you avoid that extra cost and complexity, and the model itself stays unchanged.

**Source Document:** `Basic-Fantasy-RPG-Rules-r142.md` — the complete Basic Fantasy RPG rulebook, converted to markdown by Docling in the previous section.

## 3.1 Install Dependencies

`chromadb` is a lightweight vector database that stores document embeddings and supports similarity search. `sentence-transformers` is already installed in the lab environment and provides the embedding model.

`pysqlite3-binary` is needed because the system sqlite3 (3.34) is below ChromaDB's minimum requirement (3.35). The binary package ships a newer version.
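To confirm whether the workaround is needed on your machine, you can check the SQLite version that Python is linked against. The sketch below is illustrative only: `MIN_SQLITE`, `parse_version`, and `meets_minimum` are hypothetical names, and the minimum version is taken from the text above rather than from ChromaDB itself.

```python
import sqlite3

# Minimum version quoted in this section; treat it as an assumption if your
# ChromaDB release documents a different requirement.
MIN_SQLITE = (3, 35)


def parse_version(version):
    """Turn a string like '3.34.1' into (3, 34, 1) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version, minimum=MIN_SQLITE):
    """Return True when the given SQLite version is at least `minimum`."""
    return parse_version(version) >= minimum


# sqlite3.sqlite_version is the version of the linked SQLite library,
# not the version of the Python module.
print(f"SQLite library version: {sqlite3.sqlite_version}")
print(f"Meets minimum:          {meets_minimum(sqlite3.sqlite_version)}")
```

If the check prints `False`, the `pysqlite3-binary` swap used later in this notebook is the relevant fix.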

If you see dependency conflict warnings from pip, they are usually advisory and should not affect this notebook.

In [1]:
! pip install chromadb pysqlite3-binary -q

## 3.2 Environment Setup

Same credentials, same endpoint pattern. We reuse the `.env` file and config helper from earlier sections.

In [2]:
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")

Endpoint: https://litellm-prod.apps.maas.redhatworkshops.io/v1
API Key:  sk-UFHcL...


In [3]:
from openai import OpenAI

client = OpenAI(
    api_key=key,
    base_url=endpoint_base,
)

MODEL = "granite-3-2-8b-instruct"

# Quick connectivity check
test = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=50,
)
print(f"Model: {MODEL}")
print(f"Test:  {test.choices[0].message.content.strip()}")
print("Connection OK.")

Model: granite-3-2-8b-instruct
Test:  Hello, it's a pleasure to assist you today!
Connection OK.


## 3.3 Baseline: Model Responses Without Context

Before building anything, we establish a baseline. We ask the model a set of domain-specific questions about the Basic Fantasy RPG rules **without providing any context**. The model must rely entirely on whatever it learned during pre-training.

These questions come from the Day 2 evaluation set. They cover different types of domain knowledge: explicit rules, table lookups, terminology, and questions that require reasoning across multiple rules.

In [4]:
# Questions from the Day 2 evaluation set
questions = [
    {
        "id": "q01",
        "question": "What happens if a Thief fails an Open Locks attempt?",
        "expected": "The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.",
        "category": "explicit_rule",
    },
    {
        "id": "q04",
        "question": "What is the saving throw for a 3rd level Fighter against Dragon Breath?",
        "expected": "Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.",
        "category": "table_lookup",
    },
    {
        "id": "q05",
        "question": "How does a Cleric turn undead?",
        "expected": "The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the type of undead.",
        "category": "multi_step_rule",
    },
    {
        "id": "q07",
        "question": "What is the difference between a retainer and a hireling?",
        "expected": "Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adventuring tasks.",
        "category": "terminology",
    },
    {
        "id": "q10",
        "question": "Can a Halfling use a longbow?",
        "expected": "Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling cannot use one.",
        "category": "implicit_reasoning",
    },
]

print(f"Loaded {len(questions)} evaluation questions")
for q in questions:
    print(f"  [{q['id']}] ({q['category']}) {q['question']}")

Loaded 5 evaluation questions
  [q01] (explicit_rule) What happens if a Thief fails an Open Locks attempt?
  [q04] (table_lookup) What is the saving throw for a 3rd level Fighter against Dragon Breath?
  [q05] (multi_step_rule) How does a Cleric turn undead?
  [q07] (terminology) What is the difference between a retainer and a hireling?
  [q10] (implicit_reasoning) Can a Halfling use a longbow?


In [5]:
def ask_model(question, context=None):
    """Ask the model a question, optionally with retrieved context."""
    if context:
        system_msg = (
            "Answer the question using only the provided context. "
            "Be specific and cite rules where possible."
        )
        user_msg = f"Context:\n{context}\n\nQuestion: {question}"
    else:
        system_msg = (
            "Answer the question about Basic Fantasy RPG rules. "
            "Be specific and concise."
        )
        user_msg = question

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        temperature=0,
        max_tokens=300,
    )
    return response.choices[0].message.content.strip()

In [6]:
print("BASELINE: No context provided")
print("=" * 70)

baseline_answers = {}

for q in questions:
    answer = ask_model(q["question"])
    baseline_answers[q["id"]] = answer

    print(f"\n[{q['id']}] {q['question']}")
    print(f"  Expected: {q['expected'][:120]}")
    print(f"  Model:    {answer[:120]}")

BASELINE: No context provided



[q01] What happens if a Thief fails an Open Locks attempt?
  Expected: The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.
  Model:    If a Thief fails an Open Locks attempt in Basic Fantasy RPG, they can try again immediately, but there's a cumulative -1



[q04] What is the saving throw for a 3rd level Fighter against Dragon Breath?
  Expected: Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.
  Model:    In Basic Fantasy RPG, a 3rd level Fighter would make a "Save vs. Breath Weapon" for a Dragon's breath attack. This is no



[q05] How does a Cleric turn undead?
  Expected: The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ
  Model:    In Basic Fantasy RPG, a Cleric can turn undead using the "Turn Undead" ability. Here's how it works:

1. The Cleric must



[q07] What is the difference between a retainer and a hireling?
  Expected: Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven
  Model:    In Basic Fantasy RPG, both retainers and hirelings are NPCs who can assist the player characters, but they differ in the



[q10] Can a Halfling use a longbow?
  Expected: Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c
  Model:    Yes, in Basic Fantasy RPG, Halflings can use longbows. There are no racial restrictions in the rules that prevent Halfli


Look at the baseline answers. Some may be partially correct — the model has seen tabletop RPG content during pre-training. But for specific rules, table values, and game-specific terminology, the model is guessing or hedging. It does not have reliable access to the exact rules of Basic Fantasy RPG.

This is the gap that RAG addresses: not by changing the model, but by giving it the right information when it needs it.

## 3.4 Building the RAG Pipeline

A RAG pipeline has three stages:

1. **Chunk** the source document into passages small enough to fit in a prompt.
2. **Embed** each chunk into a vector representation using an embedding model.
3. **Store** the vectors in a database that supports similarity search.

At query time, the user's question is embedded with the same model, the most similar chunks are retrieved, and they are included in the prompt as context.
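Before building the real pipeline, the flow can be sketched with a toy example that needs no external libraries. Everything here is an assumption made for illustration: the `embed` function is a bag-of-words stand-in for a real embedding model, the "store" is a plain list, and the passages are hand-written sentences, not quotations from the rulebook.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    """Embed the query the same way as the chunks and return the k closest."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Stages 1-3: chunk (here, hand-written passages), embed, store.
passages = [
    "A Thief may try to open a lock once.",
    "Clerics can turn undead creatures.",
    "Halflings may not use large weapons.",
]
store = [(p, embed(p)) for p in passages]

# Query time: embed the question and place the nearest passage in the prompt.
question = "Can a Thief open a lock?"
context = retrieve(question, store, k=1)
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

The real pipeline in this section follows the same shape, with a neural embedding model in place of word counts and a vector database in place of the list.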

### 3.4.1 Load the Source Document

The document is the Basic Fantasy RPG rulebook, converted to markdown by Docling in Section 2. At ~900KB of text, it is too large to fit in a single prompt. Chunking breaks it into manageable pieces.
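A rough calculation shows why the document cannot simply be pasted into a prompt. The sketch below assumes about four characters per token, which is a common rule of thumb for English text and not a measurement from this model's tokenizer; `estimate_tokens` is a hypothetical helper.

```python
def estimate_tokens(num_chars, chars_per_token=4.0):
    """Rough token estimate; chars_per_token is a rule of thumb, not a measurement."""
    return round(num_chars / chars_per_token)


# Character count that this notebook reports when the rulebook is loaded.
doc_chars = 908_392
print(f"Estimated size: ~{estimate_tokens(doc_chars):,} tokens")
```

Compare the estimate with the context window of the model you are serving. Even when a document fits, sending all of it with every question costs far more than sending a few relevant chunks.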

In [7]:
doc_path = "../02SyntheticDataGen/Basic-Fantasy-RPG-Rules-r142.md"

with open(doc_path, "r", encoding="utf-8") as f:
    document = f.read()

print(f"Document: {doc_path}")
print(f"Size:     {len(document):,} characters")
print(f"Lines:    {document.count(chr(10)):,}")
print(f"Preview:  {document[:200]}...")

Document: ../02SyntheticDataGen/Basic-Fantasy-RPG-Rules-r142.md
Size:     908,392 characters
Lines:    11,260
Preview:  <!-- image -->

Copyright © 2006-2025 Chris Gonnerman All Rights Reserved.  See next page for license information. www.basicfantasy.org

Dedicated to Gary Gygax, Dave Arneson, Tom Moldvay, David Cook,...


### 3.4.2 Chunk the Document

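We split the document into chunks of approximately 1000 characters with 200 characters of overlap. The overlap reduces the chance that a rule straddling a chunk boundary is cut in half. Splitting happens on paragraph breaks only: paragraphs are accumulated until the next one would push the chunk past 1000 characters, so chunks stay semantically coherent. A single paragraph longer than the limit, such as a large table, is kept whole, which is why the maximum chunk length reported below is far above the target.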

These parameters match the chunking strategy used in the Day 2 evaluation pipeline.

In [8]:
import re

def chunk_document(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    # Split on double newlines (paragraph breaks) first
    paragraphs = re.split(r"\n\n+", text)

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue

        # If adding this paragraph would exceed chunk_size, save current and start new
        if len(current_chunk) + len(para) + 2 > chunk_size and current_chunk:
            chunks.append(current_chunk.strip())
            # Overlap: keep the tail of the previous chunk
            if len(current_chunk) > overlap:
                current_chunk = current_chunk[-overlap:].strip() + "\n\n" + para
            else:
                current_chunk = para
        else:
            if current_chunk:
                current_chunk += "\n\n" + para
            else:
                current_chunk = para

    # Don't forget the last chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

chunks = chunk_document(document, chunk_size=1000, overlap=200)

print(f"Total chunks: {len(chunks)}")
print(f"Avg length:   {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
print(f"Min length:   {min(len(c) for c in chunks)} chars")
print(f"Max length:   {max(len(c) for c in chunks)} chars")
print(f"\nFirst chunk preview:")
print(f"  {chunks[0][:200]}...")

Total chunks: 1246
Avg length:   927 chars
Min length:   164 chars
Max length:   20378 chars

First chunk preview:
  <!-- image -->

Copyright © 2006-2025 Chris Gonnerman All Rights Reserved.  See next page for license information. www.basicfantasy.org

Dedicated to Gary Gygax, Dave Arneson, Tom Moldvay, David Cook,...
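The statistics above show a maximum chunk length far above the 1000-character target, because `chunk_document` never splits inside a paragraph. Embedding models generally truncate input beyond their maximum sequence length, so text late in a very long chunk may not influence its vector. One possible mitigation, sketched here as an assumption and not as part of the workshop pipeline, is to pre-split oversize paragraphs on sentence boundaries. `split_oversize` is a hypothetical helper; for table-heavy paragraphs, splitting on row boundaries may suit better.

```python
import re


def split_oversize(paragraph, chunk_size=1000):
    """Split one long paragraph into pieces of at most chunk_size characters.

    Prefers sentence boundaries; a single sentence longer than chunk_size
    is hard-cut as a last resort.
    """
    if len(paragraph) <= chunk_size:
        return [paragraph]

    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    pieces, current = [], ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in a chunk by itself.
        while len(sentence) > chunk_size:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        if not sentence:
            continue
        # Start a new piece when adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > chunk_size:
            pieces.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        pieces.append(current)
    return pieces


long_paragraph = " ".join(f"Sentence number {i} ends here." for i in range(100))
pieces = split_oversize(long_paragraph, chunk_size=200)
print(f"{len(long_paragraph)} chars -> {len(pieces)} pieces")
print(f"Longest piece: {max(len(p) for p in pieces)} chars")
```

Applying such a helper to each paragraph before the accumulation loop would change the chunk count and the retrieval results shown in this notebook, so treat it as an experiment to run separately.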


### 3.4.3 Load the Embedding Model

We use IBM's `granite-embedding-30m-english`, a compact embedding model designed for retrieval tasks. At 30 million parameters, it is small enough to run on CPU and fast enough for interactive use.

The model is downloaded from HuggingFace and cached locally.

In [9]:
import os
from huggingface_hub import snapshot_download

EMBEDDING_MODEL_ID = "ibm-granite/granite-embedding-30m-english"
EMBEDDING_MODEL_DIR = "./models/granite-embedding-30m-english"

if os.path.exists(EMBEDDING_MODEL_DIR) and len(os.listdir(EMBEDDING_MODEL_DIR)) > 3:
    print(f"Embedding model already exists at {EMBEDDING_MODEL_DIR}, skipping download.")
else:
    print(f"Downloading {EMBEDDING_MODEL_ID}...")
    snapshot_download(
        repo_id=EMBEDDING_MODEL_ID,
        local_dir=EMBEDDING_MODEL_DIR,
    )
    print("Download complete.")

model_files = [f for f in os.listdir(EMBEDDING_MODEL_DIR) if not f.startswith(".")]
print(f"Model directory: {len(model_files)} files")

Embedding model already exists at ./models/granite-embedding-30m-english, skipping download.
Model directory: 13 files


In [10]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer(EMBEDDING_MODEL_DIR)

# Verify it works
test_embedding = embedder.encode(["test sentence"])
print(f"Embedding model loaded: {EMBEDDING_MODEL_DIR}")
print(f"Embedding dimension:    {test_embedding.shape[1]}")



Embedding model loaded: ./models/granite-embedding-30m-english
Embedding dimension:    384


### 3.4.4 Build the Vector Store

We embed all document chunks and store them in a ChromaDB collection configured for cosine similarity search. Here ChromaDB runs in in-memory mode: it is lightweight and requires no external services, and the entire database lives in this process, so the index must be rebuilt after a kernel restart.

We provide a custom embedding function that wraps our local Granite model so ChromaDB uses it for both indexing and querying.

In [11]:
# Workaround: system sqlite3 (3.34) is below chromadb's minimum (3.35).
# pysqlite3-binary ships a newer sqlite3 that satisfies the requirement.
__import__("pysqlite3")
import sys

sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

import chromadb


class LocalEmbeddingFunction(chromadb.EmbeddingFunction):
    """Wraps sentence-transformers model for use with ChromaDB."""

    def __init__(self, model):
        self.model = model

    def __call__(self, input):
        embeddings = self.model.encode(input)
        return embeddings.tolist()


embedding_fn = LocalEmbeddingFunction(embedder)

# Create in-memory ChromaDB client and collection
chroma_client = chromadb.Client()

# Delete collection if it already exists (re-run safety)
try:
    chroma_client.delete_collection("bfrpg_rules")
except Exception:
    pass

collection = chroma_client.create_collection(
    name="bfrpg_rules",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"},
)

print(f"Created collection: bfrpg_rules")
print(f"Embedding {len(chunks)} chunks...")

# Add chunks in batches to avoid memory issues
BATCH_SIZE = 100
for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i : i + BATCH_SIZE]
    ids = [f"chunk_{j}" for j in range(i, i + len(batch))]
    collection.add(documents=batch, ids=ids)
    print(f"  Added chunks {i} to {i + len(batch) - 1}")

print(f"\nVector store ready: {collection.count()} chunks indexed")

Created collection: bfrpg_rules
Embedding 1246 chunks...
  Added chunks 0 to 99
  Added chunks 100 to 199
  Added chunks 200 to 299
  Added chunks 300 to 399
  Added chunks 400 to 499
  Added chunks 500 to 599
  Added chunks 600 to 699
  Added chunks 700 to 799
  Added chunks 800 to 899
  Added chunks 900 to 999
  Added chunks 1000 to 1099
  Added chunks 1100 to 1199
  Added chunks 1200 to 1245

Vector store ready: 1246 chunks indexed


### 3.4.5 Test Retrieval

Before using retrieval in the full pipeline, let's verify it works. We query the vector store with one of our test questions and inspect the returned chunks.

In [12]:
test_query = "What happens if a Thief fails an Open Locks attempt?"

results = collection.query(
    query_texts=[test_query],
    n_results=3,
)

print(f"Query: {test_query}")
print(f"Retrieved {len(results['documents'][0])} chunks:\n")

for i, (doc, distance) in enumerate(zip(results["documents"][0], results["distances"][0])):
    print(f"  Chunk {i + 1} (distance: {distance:.4f}):")
    print(f"  {doc[:200]}...")
    print()

Query: What happens if a Thief fails an Open Locks attempt?
Retrieved 3 chunks:

  Chunk 1 (distance: 0.1897):
  82 |             98 |              91 |            98 |     72 |       92 |
|            20 |           88 |             83 |             99 |              93 |            99 |     73 |       95 |

Th...

  Chunk 2 (distance: 0.2443):
  rolled by the GM.    The   Thief   will   usually   believe   they   are   moving silently regardless of the die roll, but opponents they are trying to avoid will hear the Thief if the roll is failed....

  Chunk 3 (distance: 0.2502):
  -->

## Doors

A stuck door can be opened on a roll of 1 on 1d6; add the character's Strength bonus to the range, so that a character with a bonus of +2 can open a stuck door on a roll of 1-3 on 1d6.
...



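The retrieved chunks should contain relevant rules about Thief abilities. Each preview shows only the first 200 characters of a chunk, so the relevant rule may sit further in; the top chunk here opens with the tail of a table, most likely overlap carried over from the previous chunk. The distances are cosine distances between each chunk and the query, so lower values mean the chunk is semantically closer.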
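To make the distance values concrete, the sketch below computes cosine distance as one minus cosine similarity for a few hand-picked vectors. This is the standard definition and is assumed to match what the collection reports for the `cosine` space; `cosine_distance` is a hypothetical helper, not a ChromaDB function.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity, for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal      -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> 2.0
```

Under this definition, the top distance of 0.1897 in the output above corresponds to a cosine similarity of about 0.81. Vector length does not matter, only direction.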

## 3.5 RAG-Enhanced Responses

Now we ask the same questions again, but this time we retrieve the top 3 relevant chunks from the vector store and include them as context in the prompt. The model reads the context and answers based on it.

This is the core idea of RAG: the model's weights have not changed, but it now has access to the right information.

In [13]:
def ask_with_rag(question, collection, n_results=3):
    """Retrieve relevant chunks and ask the model with context."""
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
    )

    # Combine retrieved chunks into a single context string
    context = "\n\n---\n\n".join(results["documents"][0])
    distances = results["distances"][0]

    answer = ask_model(question, context=context)
    return answer, context, distances

In [14]:
print("RAG-ENHANCED: With retrieved context")
print("=" * 70)

rag_answers = {}

for q in questions:
    answer, context, distances = ask_with_rag(q["question"], collection)
    rag_answers[q["id"]] = {
        "answer": answer,
        "distances": distances,
    }

    print(f"\n[{q['id']}] {q['question']}")
    print(f"  Distances: {[f'{d:.4f}' for d in distances]}")
    print(f"  Expected:  {q['expected'][:120]}")
    print(f"  Model:     {answer[:120]}")

RAG-ENHANCED: With retrieved context



[q01] What happens if a Thief fails an Open Locks attempt?
  Distances: ['0.1897', '0.2443', '0.2502']
  Expected:  The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.
  Model:     If a Thief fails an Open Locks attempt, they cannot try again until they have gained another level of experience.



[q04] What is the saving throw for a 3rd level Fighter against Dragon Breath?
  Distances: ['0.2249', '0.2582', '0.2635']
  Expected:  Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.
  Model:     The context does not provide specific saving throw figures for a 3rd level Fighter against Dragon Breath. However, it do



[q05] How does a Cleric turn undead?
  Distances: ['0.1782', '0.1891', '0.2030']
  Expected:  The Cleric rolls 2d6 and compares the result to the Turn Undead table. Success depends on the Cleric's level and the typ
  Model:     A Cleric turns undead by rolling a 20-sided die (1d20) and then referencing their level on the Cleric's vs. Undead table



[q07] What is the difference between a retainer and a hireling?
  Distances: ['0.1546', '0.2220', '0.2292']
  Expected:  Retainers are NPCs who accompany the party on adventures and gain experience. Hirelings are hired for specific non-adven
  Model:     The context does not explicitly define the difference between a retainer and a hireling. However, it does provide specif



[q10] Can a Halfling use a longbow?
  Distances: ['0.1621', '0.1883', '0.2424']
  Expected:  Halflings may not use Large weapons and must use Medium weapons two-handed. A longbow is a Large weapon, so a Halfling c
  Model:     Yes, a Halfling can use a longbow. Despite their small stature, Halflings are not restricted from using longbows, which 


## 3.6 Side-by-Side Comparison

Now we put the baseline and RAG results next to each other. For each question, compare what the model said without context versus what it said with retrieved document chunks.

In [15]:
print("=" * 70)
print("COMPARISON: BASELINE vs. RAG")
print("=" * 70)

for q in questions:
    print(f"\n{'─' * 70}")
    print(f"[{q['id']}] {q['question']}")
    print(f"  Category: {q['category']}")
    print(f"  Expected: {q['expected']}")
    print()
    print(f"  BASELINE (no context):")
    print(f"    {baseline_answers[q['id']][:200]}")
    print()
    print(f"  RAG (with retrieved context):")
    print(f"    {rag_answers[q['id']]['answer'][:200]}")
    print(f"    Retrieval distances: {[f'{d:.4f}' for d in rag_answers[q['id']]['distances']]}")

COMPARISON: BASELINE vs. RAG

──────────────────────────────────────────────────────────────────────
[q01] What happens if a Thief fails an Open Locks attempt?
  Category: explicit_rule
  Expected: The Thief must wait until gaining another level of experience before trying again. It may only be tried once per lock.

  BASELINE (no context):
    If a Thief fails an Open Locks attempt in Basic Fantasy RPG, they can try again immediately, but there's a cumulative -1 penalty for each failure. This penalty resets to 0 after a successful attempt. 

  RAG (with retrieved context):
    If a Thief fails an Open Locks attempt, they cannot try again until they have gained another level of experience.
    Retrieval distances: ['0.1897', '0.2443', '0.2502']

──────────────────────────────────────────────────────────────────────
[q04] What is the saving throw for a 3rd level Fighter against Dragon Breath?
  Category: table_lookup
  Expected: Based on the Fighter saving throw table, a 3rd level Fight

## 3.7 What This Tells Us

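Compare the two answers for each question:

- **Baseline answers** are either vague or confidently wrong on specifics. For q01 the model invents a cumulative retry penalty, and for q10 it claims there are no racial restrictions. The model is interpolating from its general training data, not citing actual rules.

- **RAG answers** are grounded in the retrieved text. When retrieval finds the right chunk, as in q01, the model can read and cite the rule. When the retrieved chunks do not contain the answer, the model says so (q04, q07) instead of inventing one. Retrieval did not fix q10, and for q05 the RAG answer (1d20) disagrees with the expected answer (2d6), so that pair is worth checking against the rulebook.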
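Reading answers side by side is subjective, so a crude numeric check can help as a first pass. The sketch below measures what fraction of the expected answer's distinct words appear in a model answer. It is an illustrative assumption and not the Day 2 evaluation method; `expected_recall` is a hypothetical helper, and the strings are shortened paraphrases of the q01 answers above.

```python
import re


def tokens(text):
    """Lowercase word tokens as a set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expected_recall(expected, answer):
    """Fraction of the expected answer's distinct words found in the answer."""
    expected_tokens = tokens(expected)
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & tokens(answer)) / len(expected_tokens)


expected = "The Thief must wait until gaining another level of experience."
baseline = "The Thief can try again immediately with a penalty."
rag = "The Thief cannot try again until they have gained another level of experience."

print(f"baseline recall: {expected_recall(expected, baseline):.2f}")  # 0.20
print(f"RAG recall:      {expected_recall(expected, rag):.2f}")       # 0.70
```

Word overlap cannot tell "can" from "cannot", so a high score does not prove an answer is correct. Use it to flag candidates for manual review, not as a final grade.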

### Where RAG Works Well

RAG excels at **explicit rule lookups** and **terminology questions** where the answer is stated directly in a single passage. If the retriever finds the right chunk, the model just needs to read it.

### Where RAG Has Limits

RAG struggles when:
- The answer requires reasoning across **multiple separate sections** of the document
- The answer requires reading a **table** and extracting a specific value
- The question uses different phrasing than the document, causing the retriever to miss the best chunk

These failure modes do not mean RAG is broken. They mean that RAG alone may not be sufficient for all question types. That is the signal that tells you whether to stop here or escalate to more expensive techniques.

### The Decision Framework

1. **RAG solves it** → Deploy RAG. No model changes needed. This is the cheapest option.
2. **RAG gets close but is not reliable** → Try inference-time scaling (Best-of-N). Let the model try multiple times and pick the best answer. Still no model changes.
3. **Neither works** → The gap is in the model's weights. Model adaptation (fine-tuning) is justified.
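As a preview of step 2, the sketch below shows one simple selection rule for multiple samples: majority voting over normalized answers, sometimes called self-consistency. This is an assumption chosen for brevity; Best-of-N setups often score candidates with a verifier or reward model instead, and the workshop's own implementation may differ. `best_of_n` and the canned generator are hypothetical.

```python
from collections import Counter


def normalize(answer):
    """Collapse case and whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def best_of_n(generate, question, n=5):
    """Sample n answers and return the most common one with its vote share.

    `generate` is any callable that returns one answer string per call. It
    must sample (temperature > 0), or all n answers will be identical.
    """
    samples = [generate(question) for _ in range(n)]
    counts = Counter(normalize(s) for s in samples)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / n


# Hypothetical stand-in for a sampling model call, so the sketch runs offline.
canned = iter(["15", "14", "15", "15 ", "16"])
answer, share = best_of_n(lambda q: next(canned), "Dragon Breath save?", n=5)
print(f"Selected answer: {answer} (vote share {share:.0%})")
```

Note that `ask_model` in this notebook uses `temperature=0`, so it would need a higher temperature before repeated sampling could produce different candidates.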

This progression — RAG → inference-time scaling → fine-tuning — is the central framework of this workshop. Each step is more expensive and more invasive than the last. You escalate only when you have evidence that the cheaper approach is insufficient.