# 0 Recap and the Evidence That Brought Us Here (10 min)
Purpose: 

Reconnect participants to the Day 2 workflow, surface the specific failures that RAG alone could not resolve, and build the evidence-based case for why we are now considering model adaptation.
No one should leave this section wondering "why are we here today?" The answer should be concrete, traceable, and grounded in outputs they already generated.

## 0.1 Where Day 2 Left Us 

Day 2 followed a deliberate progression. We started with nothing: a model, no customer data, no retrieval, no context. We asked questions and watched the model improvise. It sounded fluent. It was often wrong.

Then we added structure. We ingested documents with Docling. We preserved tables, headings, and semantic layout. We chunked those documents with care, respecting word boundaries and choosing chunk sizes that balanced coherence against precision. We embedded those chunks, stored them in a vector database, and wired up a retrieval pipeline.

And it worked. For many questions, the system returned grounded, defensible answers drawn directly from the customer's documents. That was real progress.

But it was not complete progress.

Some questions still failed. Not because the system was broken, but because the nature of the failure had changed. The model was no longer hallucinating from ignorance. It was receiving relevant context and still producing answers that missed the mark.

That distinction matters enormously, because the fix for "model never saw the data" is completely different from the fix for "model saw the data and still got it wrong."

Day 2 gave us a working RAG pipeline. It also gave us something more important: a clear, observable boundary where RAG stops being sufficient.

Today starts at that boundary.

## 0.2 The Escalation Ladder (Revisited) 
On Day 2 we introduced the escalation ladder. It looked like this:

1. Improve chunking and structure
2. Improve retrieval strategy
3. Improve prompting and grounding
4. Improve evaluation coverage
5. Only then: consider changing the model itself

We spent Day 2 working through steps 1 through 4. We improved ingestion quality. We tuned chunk boundaries. We tested retrieval. We ran questions, inspected outputs, and identified where the system performed well and where it did not.

Today we are standing at step 5.

But we are not here because someone decided fine-tuning sounded interesting. We are here because the evidence from the previous steps told us that the remaining failures cannot be fixed by better retrieval, better chunking, or better prompts.

That is the only legitimate reason to be here.

>**Facilitator note**: Pause here and reinforce this framing. If participants skipped Day 2, give them the short version: "We built a RAG pipeline. It solved most problems. The ones it didn't solve are why we're in this room." If anyone asks whether they can skip to the model training section, point them back to this moment.
>The escalation ladder is not a suggestion. It is the governing logic of the entire workshop.
>Say explicitly: "We earned the right to be at step 5 because we did steps 1 through 4 first."

## 0.3 The Questions That Still Fail 

Not all failures are equal. Before we can talk about what to do next, we need to look at what is actually going wrong and sort those failures into categories that tell us something useful.

At the end of Day 2, we ran a structured set of questions through the RAG pipeline and captured the results: the question, the model's answer, the retrieved sources, the similarity distances, and the expected answer. Those results were saved so we could pick up exactly where we left off.

That is what we are loading now.

### 0.3.1 Pre-Built Evaluation Set (From Day 2 Results)

The first cell loads the Day 2 evaluation results directly from the saved JSON artifact.


In [1]:
import json

with open("../prebuilt/eval_results.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)

metadata = eval_data["metadata"]
results = eval_data["results"]

print("Day 2 Evaluation Results Loaded")
print("=" * 60)
print(f"  Source document:   {metadata['source_document']}")
print(f"  Model:             {metadata['model']}")
print(f"  Chunking strategy: {metadata['chunking']['strategy']}")
print(f"  Chunk size:        {metadata['chunking']['chunk_size']}")
print(f"  Overlap:           {metadata['chunking']['overlap']}")
print(f"  Total chunks:      {metadata['chunking']['total_chunks']}")
print(f"  Embedding model:   {metadata['embedding_model']}")
print(f"  Retrieval:         top-{metadata['retrieval']['n_results']}, {metadata['retrieval']['similarity']} similarity")
print(f"  Questions:         {len(results)}")
print("=" * 60)

Day 2 Evaluation Results Loaded
  Source document:   multi-document corpus
  Model:             granite-3-2-8b-instruct
  Chunking strategy: markdown_aware
  Chunk size:        1000
  Overlap:           200
  Total chunks:      4732
  Embedding model:   ibm-granite/granite-embedding-30m-english
  Retrieval:         top-3, cosine similarity
  Questions:         10


These are the same results you generated on Day 2. Same model. Same chunks. Same retrieval parameters. Nothing has been modified. We are picking up exactly where the previous lab left off.
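Before relying on the saved artifact, a quick structural check can confirm that every record carries the fields the later cells read. The sketch below is an illustration, not part of the Day 2 notebook. The field names are taken from the cells in this section; `REQUIRED_FIELDS` and `find_incomplete_records` are hypothetical names, and the sample records are toy data.

```python
# Fields that the cells in this section read from each result record.
REQUIRED_FIELDS = {
    "id", "question", "category", "classification",
    "answer", "expected", "sources", "distances",
}


def find_incomplete_records(records):
    """Return {record id: sorted missing field names} for incomplete records."""
    problems = {}
    for index, record in enumerate(records):
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            # Fall back to the list position when the id itself is missing.
            problems[record.get("id", f"index {index}")] = sorted(missing)
    return problems


# Toy records for illustration only; in the notebook, pass `results` instead.
sample_records = [
    {
        "id": "q01", "question": "...", "category": "explicit_rule",
        "classification": "pass", "answer": "...", "expected": "...",
        "sources": ["doc.pdf"], "distances": [0.17],
    },
    {"id": "q02", "question": "...", "classification": "pass"},
]

print(find_incomplete_records(sample_records))
```

In the notebook, `find_incomplete_records(results)` would be expected to return an empty dictionary if the artifact is intact. A non-empty result is a reason to stop and regenerate the file before drawing conclusions from it.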

Now let's look at what came back.


In [2]:
print(f"\n{'='*70}")
print("DAY 2 RAG RESULTS")
print(f"{'='*70}")

for r in results:
    # Flag every non-passing question with ">>" so failures stand out.
    status = "PASS" if r["classification"] == "pass" else "FAIL"
    marker = "  " if status == "PASS" else ">>"

    print(f"\n{marker} [{r['id']}] {r['question']}")
    print(f"   Category:       {r['category']}")
    print(f"   Classification: {r['classification']}")
    # distances[0] belongs to the closest retrieved chunk.
    print(f"   Top distance:   {r['distances'][0]:.4f}")
    print(f"   Answer preview: {r['answer'][:120]}...")
    print("   ---")



DAY 2 RAG RESULTS

   [q01] What happens if a Thief fails an Open Locks attempt?
   Category:       explicit_rule
   Classification: pass
   Top distance:   0.1726
   Answer preview: If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying ...
   ---

>> [q02] Why can't Elves roll higher than a d6 for hit points?
   Category:       terminology
   Classification: implicit_reasoning_failure
   Top distance:   0.2227
   Answer preview: According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. The reason for this ...
   ---

   [q03] Can a character wear leather armor and cast spells?
   Category:       implicit_reasoning
   Classification: pass
   Top distance:   0.1730
   Answer preview: Yes, according to the provided context, characters who can cast spells may wear leather armor....
   ---

>> [q04] What is the saving throw for a 3rd level Fighter against Dragon Breath?
   Category:

Take a moment to scan the output. Some questions passed cleanly. The model found the right chunks, read the context, and produced a grounded answer. 
Those are the wins from Day 2.


Now look at the ones marked with >>. Those are the questions we are here to talk about.

>**Facilitator note**: Give the room 30 seconds to read the output. Don't narrate every line. Let the pattern emerge on its own. If someone notices that all four failures share the same classification, call it out. That is exactly the observation this section is designed to produce.

### 0.3.2 Sorting the Failures

Now we sort. Not every wrong answer has the same cause, and the fix depends entirely on correctly diagnosing the failure mode.

Before looking at the code output, here is the framework for thinking about failure categories:

**Retrieval failures** occur when the system pulls back the wrong chunks. The model never had a chance because the relevant information was not in the context window. These are infrastructure problems. The fix is upstream: better chunking, better embeddings, better retrieval strategy. Model adaptation will not help here.

**Terminology failures** occur when the retrieved context contains the right information but uses domain-specific language that the model maps to the wrong concept. The model sees "hit die" and interprets it through the lens of its pre-training rather than through the customer's specific system. The context is present but the model's prior training creates interference.

**Implicit reasoning failures** occur when the answer requires combining information from multiple rules, applying a rule that is stated indirectly, or making an inference that the document assumes the reader already understands. The context is retrieved. The terminology is not the issue. But the model fails to connect the pieces, hedges when it should commit, or says the information is not present when it actually is.
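One way to keep this framework actionable is to map each classification label to the layer that owns the fix. The following is a minimal sketch, not the workshop's own code. The label strings match those used in the evaluation results; `RESPONSIBLE_LAYER` and `responsible_layer` are hypothetical names introduced for illustration.

```python
# Classification labels as they appear in the evaluation results,
# mapped to the layer that the framework above holds responsible.
RESPONSIBLE_LAYER = {
    "pass": "none",
    "retrieval_failure": "pipeline",        # chunking, embeddings, retrieval
    "terminology_failure": "model",         # candidate for model adaptation
    "implicit_reasoning_failure": "model",  # candidate for model adaptation
}


def responsible_layer(classification):
    """Return the layer that owns the fix, or raise on an unknown label."""
    try:
        return RESPONSIBLE_LAYER[classification]
    except KeyError:
        raise ValueError(f"Unknown classification: {classification!r}") from None


for label in RESPONSIBLE_LAYER:
    print(f"{label:<28} -> {responsible_layer(label)}")
```

Raising on an unknown label is a deliberate choice in this sketch: a misspelled or new classification should surface immediately rather than be silently counted as a pass.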


In [3]:
# Group results by their assigned classification label.
categories = {}
for r in results:
    categories.setdefault(r["classification"], []).append(r)

print(f"\n{'='*70}")
print("FAILURE ANALYSIS BY CLASSIFICATION")
print(f"{'='*70}")

for cat in ["pass", "retrieval_failure", "terminology_failure", "implicit_reasoning_failure"]:
    items = categories.get(cat, [])
    label = cat.replace("_", " ").title()
    print(f"\n  {label}: {len(items)} question(s)")
    for item in items:
        print(f"      [{item['id']}] [{item['category']}] {item['question']}")


FAILURE ANALYSIS BY CLASSIFICATION

  Pass: 6 question(s)
      [q01] [explicit_rule] What happens if a Thief fails an Open Locks attempt?
      [q03] [implicit_reasoning] Can a character wear leather armor and cast spells?
      [q05] [multi_step_rule] How does a Cleric turn undead?
      [q08] [implicit_reasoning] When can a Magic-User learn new spells?
      [q09] [explicit_rule] What happens to a character at exactly 0 hit points?
      [q10] [implicit_reasoning] Can a Halfling use a longbow?

  Retrieval Failure: 0 question(s)

  Terminology Failure: 0 question(s)

  Implicit Reasoning Failure: 4 question(s)
      [q02] [terminology] Why can't Elves roll higher than a d6 for hit points?
      [q04] [table_lookup] What is the saving throw for a 3rd level Fighter against Dragon Breath?
      [q06] [table_lookup] If a character has a Strength of 16, what bonus do they get on melee attack rolls?
      [q07] [terminology] What is the difference between a retainer and a hireling?


Now let's look at the failures in detail. For each one, we inspect what the model actually said, what it should have said, and what sources it was working from.


In [4]:
failures = [r for r in results if r["classification"] != "pass"]

print(f"\n{'='*70}")
print(f"DETAILED FAILURE INSPECTION ({len(failures)} failures)")
print(f"{'='*70}")

for i, r in enumerate(failures, start=1):
    print(f"\n--- Failure {i}: {r['id']} ---")
    print(f"Classification: {r['classification'].replace('_', ' ').upper()}")
    print(f"Category:       {r['category']}")
    print(f"Question:       {r['question']}")
    print(f"Expected:       {r['expected']}")
    print(f"Got:            {r['answer'][:200]}")
    print(f"Sources:        {r['sources']}")
    print(f"Distances:      {[f'{d:.4f}' for d in r['distances']]}")
    print()



DETAILED FAILURE INSPECTION (4 failures)

--- Failure 1: q02 ---
Classification: IMPLICIT REASONING FAILURE
Category:       terminology
Question:       Why can't Elves roll higher than a d6 for hit points?
Expected:       Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.
Got:            According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. The reason for this restriction is not explicitly stated in the context.
Sources:        ['Basic-Fantasy-RPG-Rules-r142.pdf', 'Basic-Fantasy-RPG-Rules-r142.pdf', 'Basic-Fantasy-RPG-Rules-r107-bookmarked.pdf']
Distances:      ['0.2227', '0.2454', '0.2882']


--- Failure 2: q04 ---
Classification: IMPLICIT REASONING FAILURE
Category:       table_lookup
Question:       What is the saving throw for a 3rd level Fighter against Dragon Breath?
Expected:       Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath savin

>**Facilitator note**: Walk through the failures one at a time with the room. For each, ask two questions:
>* "Did the retriever find relevant content?" Check the distances. For this corpus and embedding model, a cosine distance under roughly 0.30 suggests retrieval was in the right neighborhood.
>* "If the model had relevant context, why did it get the answer wrong?"


Look at the specifics:

* **q02** (Elf hit dice): The model retrieved relevant chunks (distances around 0.22 to 0.29) but said "the reason is not explicitly stated." The answer is in the rules, but it requires understanding that Elves are a combination class and that the d6 is tied to that class structure. The model hedged instead of reasoning through it.

* **q04** (Fighter saving throw, Dragon Breath): Distances around 0.20, so retrieval was reasonable. But the model could not extract the specific value from a saving throw table. It acknowledged the concept exists but could not produce the number 15.

* **q06** (Strength 16 melee bonus): Same pattern. The ability score bonus table was likely in the chunks, but the model could not perform the lookup. It said the context "does not specify" when the table was right there.

* **q07** (Retainer vs. hireling): The model gave a detailed answer about retainers but admitted it could not find specific details about hirelings. Partial retrieval, partial reasoning, incomplete answer.

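Two of these failures (q04, q06) are table lookups. The other two (q02, q07) were written as terminology questions, but the model failed them by hedging or leaving the answer incomplete rather than by misreading a term, which is why they are classified as implicit reasoning failures. All four share one trait: the model received relevant context and still could not produce the right answer.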
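The relationship between a question's authored category and its assigned classification can be tabulated directly. The sketch below hard-codes the ten (id, category, classification) triples from the classification output above so that it runs on its own. In the notebook the same counts could be derived from `results`. `RECORDS` and `pair_counts` are names chosen for this illustration.

```python
from collections import Counter

# (id, authored category, assigned classification), copied from the output above.
RECORDS = [
    ("q01", "explicit_rule", "pass"),
    ("q02", "terminology", "implicit_reasoning_failure"),
    ("q03", "implicit_reasoning", "pass"),
    ("q04", "table_lookup", "implicit_reasoning_failure"),
    ("q05", "multi_step_rule", "pass"),
    ("q06", "table_lookup", "implicit_reasoning_failure"),
    ("q07", "terminology", "implicit_reasoning_failure"),
    ("q08", "implicit_reasoning", "pass"),
    ("q09", "explicit_rule", "pass"),
    ("q10", "implicit_reasoning", "pass"),
]

# Count how often each (category, classification) pair occurs.
pair_counts = Counter((category, label) for _, category, label in RECORDS)

for (category, label), count in sorted(pair_counts.items()):
    print(f"{category:<20} {label:<28} {count}")
```

In this run, every `table_lookup` and `terminology` question failed, while every `implicit_reasoning` question passed. The category describes what the question was written to test; the classification describes how the answer actually went wrong.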

These are not retrieval problems. These are model problems.

## 0.4 What the Evidence Shows

Let's make the summary explicit.



In [5]:
total = len(results)
passed = len(categories.get("pass", []))
retrieval = len(categories.get("retrieval_failure", []))
terminology = len(categories.get("terminology_failure", []))
implicit = len(categories.get("implicit_reasoning_failure", []))

print("=" * 70)
print("EVIDENCE SUMMARY")
print("=" * 70)

print(f"\n  Total questions evaluated:     {total}")
print(f"  Passed:                        {passed} ({passed/total*100:.0f}%)")
print(f"  Retrieval failures:            {retrieval} ({retrieval/total*100:.0f}%)")
print(f"  Terminology failures:          {terminology} ({terminology/total*100:.0f}%)")
print(f"  Implicit reasoning failures:   {implicit} ({implicit/total*100:.0f}%)")

print(f"\n{'='*70}")
print("DIAGNOSIS")
print(f"{'='*70}")

if retrieval > 0:
    print(f"\n  Retrieval failures present ({retrieval}).")
    print("  These should be addressed with chunking/embedding improvements,")
    print("  NOT model adaptation. Fix the pipeline first.")

if terminology > 0:
    print(f"\n  Terminology failures present ({terminology}).")
    print("  The model confuses domain-specific language with general knowledge.")
    print("  This is a candidate for model adaptation.")

if implicit > 0:
    print(f"\n  Implicit reasoning failures present ({implicit}).")
    print("  The model receives relevant context but cannot connect the pieces.")
    print("  This is the strongest signal that model adaptation may be warranted.")

model_failures = terminology + implicit
if model_failures > 0:
    print(f"\n  CONCLUSION: {model_failures} failure(s) fall outside what")
    print("  retrieval improvements alone can fix. This is the evidence that")
    print("  justifies exploring model adaptation today.")
elif retrieval > 0:
    print("\n  All failures appear to be retrieval-related.")
    print("  Model adaptation is NOT yet justified. Fix retrieval first.")
else:
    # Every question passed, so there is nothing to attribute to any layer.
    print("\n  No failures recorded. Model adaptation is NOT justified.")

print()




EVIDENCE SUMMARY

  Total questions evaluated:     10
  Passed:                        6 (60%)
  Retrieval failures:            0 (0%)
  Terminology failures:          0 (0%)
  Implicit reasoning failures:   4 (40%)

DIAGNOSIS

  Implicit reasoning failures present (4).
  The model receives relevant context but cannot connect the pieces.
  This is the strongest signal that model adaptation may be warranted.

  CONCLUSION: 4 failure(s) fall outside what
  retrieval improvements alone can fix. This is the evidence that
  justifies exploring model adaptation today.



This is the moment to be explicit about what we just did and why it matters.

We did not wake up this morning and decide to fine-tune a model. We loaded the evaluation results from Day 2. We categorized the failures. We identified which failures are retrieval problems (none, in this case) and which are model problems (all four). The retrieval was reasonable. The distances were low. The sources were relevant. The model simply could not do the work.

That is the evidence. That is why we are here.
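The "distances were low" part of that evidence can be made checkable rather than asserted. The sketch below applies the rough 0.30 cutoff from the facilitator note to a record's retrieval distances. It is an illustration under stated assumptions: `DISTANCE_THRESHOLD` and `retrieval_in_neighborhood` are hypothetical names, the q02 distances are the ones printed earlier, and `qXX` is a made-up counterexample.

```python
# Heuristic cutoff quoted in the facilitator note; an assumption to tune
# per corpus and embedding model, not a universal constant.
DISTANCE_THRESHOLD = 0.30


def retrieval_in_neighborhood(record, threshold=DISTANCE_THRESHOLD):
    """True when the closest retrieved chunk is under the distance cutoff."""
    distances = record["distances"]
    return bool(distances) and min(distances) < threshold


# q02 uses the distances printed earlier; "qXX" is a made-up counterexample.
sample_failures = [
    {"id": "q02", "distances": [0.2227, 0.2454, 0.2882]},
    {"id": "qXX", "distances": [0.6120, 0.6555, 0.7031]},
]

for record in sample_failures:
    if retrieval_in_neighborhood(record):
        verdict = "retrieval looks reasonable; inspect the model's answer"
    else:
        verdict = "check retrieval first"
    print(f"{record['id']}: top distance {min(record['distances']):.4f} -> {verdict}")
```

A low distance only indicates that the retrieved chunk is topically close to the question. It does not prove the chunk contains the answer, so this check narrows the diagnosis without replacing a manual look at the sources.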


## 0.5 What This Session Is

This session is:
* A continuation of the pipeline you built yesterday
* Focused on diagnosing failures and preparing data for model adaptation
* Designed to teach you when and how synthetic data generation supports that process

This session is not:

* Starting over
* Introducing a new architecture
* Treating fine-tuning as the default answer

## 0.6 What Success Looks Like Today
At the end of Day 3, participants should be able to:
* Trace a specific eval failure to its root cause
* Explain why retrieval alone cannot fix it
* Articulate the role of synthetic data generation as preparation, not shortcut
* Describe the conditions under which model adaptation is justified

What success is not:

* 10/10 on the eval
* A trained model
* A production deployment

The eval is a diagnostic tool, not a scoreboard.

## 0.7 How the Lab Will Run
Same rules as yesterday:

* Guided Jupyter notebook, code and explanation interleaved
* Pre-generated outputs available for every major step
* Participants are not expected to write code from scratch
* If a cell takes longer than 2 to 3 minutes, move forward using pre-built results

>**Facilitator guidance**:
>* Anchor every discussion on the eval results
>* When something improves, ask "why did it improve?"
>* When something does not improve, ask "what layer is responsible?"
>* Resist the urge to skip ahead to model training

## 0.8 Setting the Tone
"Yesterday you built a system that works 60% of the time. Today you figure out why the other 40% fails, and whether the fix lives in the data or in the model. That is how real engineering works. That is what customers trust."

**Transition to Section 1:**

"Now that we know what the system cannot do and why, we can talk about what synthetic data generation actually is, what it is not, and how to use it responsibly to prepare for model adaptation."
