# 5. Synthesis: From Lab Results to Customer Conversations

**Duration:** 20 minutes 

**Format:** Facilitated discussion. No notebook. No code. 

**Purpose:** Translate the two-day arc into language the field can use in front of a customer next week.

## 5.1 What Just Happened (The Full Arc, Told Simply)

Before we talk about customers, take a moment to name what we actually built.

Two days ago, we started with a model that had never seen the customer's documents. It answered questions fluently and confidently. It was often wrong. That was not a model failure. That was the expected behavior of a capable general-purpose model operating outside its domain.

We did not respond to that by swapping the model. We responded by asking what it would take to make the model useful for this specific problem.

We ingested the documents. We made a deliberate decision about how to chunk them. We built a retrieval pipeline. We evaluated the results honestly. We identified failures by category, not by gut feeling. We tried inference-time scaling before touching the model itself. We generated synthetic training data from the customer's own documents. We adapted the model. We evaluated again.

That sequence took two days in a lab. In a real engagement it takes weeks. But the shape of the work is identical, and that shape is what the field needs to carry into customer conversations.

> **Facilitator note**: Resist the urge to summarize the technical steps here. The participants just lived through them. The goal is to help them see the shape of the work, not replay the details.

## 5.2 The Three Questions Customers Actually Ask

In practice, enterprise AI conversations almost always reduce to three questions. The field should be able to answer all three without reaching for a slide deck.

**"Can we just use ChatGPT for this?"**

This is not a naive question. It deserves a direct answer. General-purpose models are genuinely capable for a wide range of tasks. The honest answer is: it depends on what "this" is.

If the task requires reasoning over proprietary documents that no public model has ever seen, a general-purpose model without context will hallucinate. If the task requires consistent, defensible answers that can be traced back to source material, context stuffing is fragile and expensive at scale. If the task involves domain-specific terminology that maps to concepts a general model has learned differently, you will get fluent wrong answers.

The question is not "is this model smart enough." It is "can this model be grounded in the customer's data in a way that is reliable, repeatable, and auditable." That is a different question, and it has a different answer.

**"Why isn't RAG enough?"**

RAG is enough for a lot of problems. The honest answer is that it depends on what kind of failures you are seeing.

If answers are wrong because the retriever pulled the wrong chunks, that is a retrieval problem. The fix is better chunking, better embeddings, or a better retrieval strategy. Improving the RAG pipeline fixes that.

If answers are wrong because the retrieved context contains the right information and the model still cannot produce the right answer, that is a different failure. The model is receiving everything it needs and something in its prior training is interfering. That is where retrieval stops being sufficient.

The distinction matters because the fix is completely different. You cannot fine-tune your way out of a retrieval failure, and you cannot chunk your way out of a model reasoning failure. Evaluation is the only way to know which problem you have.
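The distinction above can be made mechanical once each evaluated question carries two labels. The following is a minimal sketch, not the lab's evaluation code: the `EvalRecord` fields, the `triage` function, and the example question IDs are all hypothetical names introduced for illustration, and it assumes a reviewer has already marked whether the answer was correct and whether the retrieved chunks contained the needed passage.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    NONE = "correct"
    RETRIEVAL = "retrieval failure"
    MODEL = "model failure"


@dataclass
class EvalRecord:
    # Hypothetical record shape; field names are assumptions for this sketch.
    question_id: str
    answer_correct: bool      # graded against a reference answer
    evidence_retrieved: bool  # retrieved chunks contained the needed passage


def triage(record: EvalRecord) -> Failure:
    """Attribute a wrong answer to the retriever or to the model."""
    if record.answer_correct:
        return Failure.NONE
    if not record.evidence_retrieved:
        # The model never saw the right information: fix retrieval first.
        return Failure.RETRIEVAL
    # The right information was in context and the answer was still wrong.
    return Failure.MODEL


# Illustrative records only; these are not results from the lab.
records = [
    EvalRecord("example-a", answer_correct=True, evidence_retrieved=True),
    EvalRecord("example-b", answer_correct=False, evidence_retrieved=False),
    EvalRecord("example-c", answer_correct=False, evidence_retrieved=True),
]
labels = {r.question_id: triage(r) for r in records}
for question_id, label in labels.items():
    print(question_id, "->", label.value)
```

Only the questions that land in the model-failure bucket are candidates for adaptation. Everything in the retrieval bucket should be fixed in the pipeline before any training run is discussed.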

**"How long will this take?"**

The honest answer is that it depends on where you start and what success looks like.

A RAG pipeline on clean documents can show meaningful results in days. Evaluation and iteration add time, but that time is not wasted. It is the work that makes the system defensible. Model adaptation is measured in weeks, not days, when you include data generation, training, evaluation, and validation against held-out questions.

The more useful question to ask the customer is: what does "good enough" look like, and who has to trust the answer? If the answer is legal or compliance, the bar is higher and the timeline reflects that. If the answer is an internal productivity tool, the bar may be lower and you can ship faster.

## 5.3 The Escalation Ladder as a Customer Conversation

The escalation ladder is not just an engineering framework. It is a sales and trust-building tool.

When a customer comes in asking for fine-tuning, the instinct is to either agree immediately or push back with a list of prerequisites. Neither approach builds trust. The escalation ladder gives you a third option: a structured conversation about what the evidence says.

The conversation sounds like this:

"Before we talk about changing the model, we want to understand what the model is doing with your data today. That tells us whether the problem is in the data, in the retrieval, or in the model itself. If it turns out to be the model, we have a clear path. If it turns out to be something else, we save you the cost and time of a training run that would not have fixed it."

That framing does three things. It positions the field as rigorous rather than reactive. It gives the customer a concrete deliverable at each stage rather than a promise at the end. And it reduces the risk of the engagement failing for the wrong reasons.

> **Facilitator note**: Ask the room: "What does a customer hear when you say this?" The answer should be "they hear that you have a process and that you are not guessing." If participants are uncertain about how to deliver this message, spend five minutes on the wording. This is the part of the lab that directly affects revenue.

## 5.4 What the Results Actually Showed

> **Note:** The analysis below reflects the reference run included with this lab. Your live results may vary depending on training randomness and endpoint load. The patterns described are representative, not guaranteed.

The evaluation results from Sections 4.1–4.8 (single-model evaluation) and 4.9–4.20 (cross-architecture comparison) are worth interpreting out loud before participants leave the room, because they contain a more nuanced story than the pass/fail counts suggest.

### Single-Model Evaluation (Sections 4.1–4.8)

In our pre-built results, the summary was 6/10, then 8/10 after Best-of-N, then 8/10 after fine-tuning with RAG. On the surface, fine-tuning matched inference-time scaling and that looks like a wash.

But look at what moved and what did not.

In the reference run, Best-of-N recovered two questions through sampling. The model had the capability; it just did not surface it on the first try. Fine-tuning recovered different questions, particularly table lookups, by changing how the model reads structured information in context. The two approaches did not fix the same problems. In a production system, you would use both.

In the reference run, the fine-tuned model without RAG context scored 5/10, lower than the Day 2 baseline of 6/10. That is not a failure of training. It is evidence that 8 examples of Thief-related content cannot teach a model about Fighters, Clerics, and Halflings. The model without context was operating on partial domain knowledge and general priors, a combination that is less reliable than either the base model or the RAG pipeline alone. The training loss itself tells the same story: it dropped from 2.93 to 2.26 but stayed well above the < 1.0 target that indicates full convergence, so the model was still learning when training ended.

In the reference run, the regression on q03 is the most instructive result. The Day 2 baseline answered it correctly. The fine-tuned model failed it in both evaluation modes. Fine-tuning introduced a specific wrong belief about whether spellcasters can wear armor. That is catastrophic forgetting at small scale. It is also exactly why you evaluate on a held-out test set rather than the same questions that guided your training, and why you compare against the baseline before shipping anything.
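A per-question comparison is what surfaces a regression like q03, because matching totals can hide it. The sketch below is one hedged way to do that comparison. The `compare_runs` helper is a hypothetical name, and the pass/fail values are made up for illustration: only the q03 pattern (correct at baseline, wrong after fine-tuning) mirrors the reference run.

```python
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Split shared question IDs into regressions and recoveries."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        # Correct before, wrong now: the cases that should block a release.
        "regressions": [q for q in shared if baseline[q] and not candidate[q]],
        # Wrong before, correct now: the improvements the change bought.
        "recoveries": [q for q in shared if not baseline[q] and candidate[q]],
    }


# Illustrative pass/fail values, not the lab's full result set.
baseline = {"q01": False, "q02": True, "q03": True}
fine_tuned = {"q01": True, "q02": True, "q03": False}

report = compare_runs(baseline, fine_tuned)
print("baseline total:", sum(baseline.values()))
print("fine-tuned total:", sum(fine_tuned.values()))
print("regressions:", report["regressions"])
print("recoveries:", report["recoveries"])
```

In this toy data the two totals are identical even though one question regressed and another recovered, which is why the comparison is done per question rather than per score.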

### Cross-Architecture Comparison (Sections 4.9–4.20)

Sections 4.9–4.20 expanded the evaluation to four models: the lab's Granite 8B plus three community-contributed models (Granite 2B, Phi-3 Mini, Qwen2.5 3B). All four were fine-tuned on the same 8 training examples, but the community models used a higher learning rate (2e-4 vs 5e-6 for the lab model).

The results tell a clear story about where model size matters and where it does not.

Without RAG context, the models diverged. Scores ranged from 2/10 to 5/10. The larger Granite 8B led, which is expected: a larger model stores more general knowledge in its weights. Without retrieved context to ground the answer, size buys you something.

With RAG context, the models converged. All four scored 8/10 or 9/10. The smaller models matched or exceeded the 8B model. When the retriever provides the right information, what matters is the model's ability to extract the answer from the context, not how much it memorized during pretraining. Two of the smaller community models (Granite 2B, Qwen2.5 3B) actually beat the lab's 8B model by one question.

The RAG lift tells the rest of the story. Smaller models showed larger lifts (+5 to +6) compared to the 8B model (+3). That is not a weakness of smaller models. It means they benefit more from retrieval because they have less built-in knowledge to fall back on. In a deployment where RAG is always available, that is a feature, not a bug: you get comparable accuracy at lower inference cost.

The speed difference reinforces this. The smaller models were 3-10x faster at inference: the 8B model needed over 5 minutes for 10 questions without context, while the smaller models finished in 30-70 seconds. That gap compounds at scale: lower latency per request, higher throughput per GPU, and a meaningfully smaller infrastructure bill. When accuracy is equivalent with RAG, the speed advantage makes the smaller models hard to argue against.

The learning rate difference (40x higher for the community models) is a significant confound. A fair comparison would hold LR constant. But the practical takeaway stands: for narrow-domain SFT with minimal training data and a strong retrieval pipeline, smaller models can be a cost-effective choice.
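The figures quoted in this comparison can be re-derived with a few lines of arithmetic, which is useful when a participant asks where a number came from. This is a minimal sketch using only values stated above. The helper names are hypothetical, and the 300-second figure is an assumption standing in for "over 5 minutes", so the speed-ups it produces are approximate and cover only the no-context timings.

```python
def rag_lift(score_without_rag: int, score_with_rag: int) -> int:
    """Questions gained (out of 10) when retrieved context is added."""
    return score_with_rag - score_without_rag


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast run was than the slow run."""
    return slow_seconds / fast_seconds


# Granite 8B in the reference run: 5/10 without context, 8/10 with RAG.
granite_8b_lift = rag_lift(score_without_rag=5, score_with_rag=8)

# Assumption: treat "over 5 minutes" as 300 seconds for the 8B model.
# The smaller models took between 30 and 70 seconds.
slowest_small_speedup = speedup(slow_seconds=300, fast_seconds=70)
fastest_small_speedup = speedup(slow_seconds=300, fast_seconds=30)

# Community models used 2e-4; the lab's 8B model used 5e-6.
learning_rate_ratio = 2e-4 / 5e-6

print(f"Granite 8B RAG lift: +{granite_8b_lift}")
print(f"Speed-up, slowest small model: {slowest_small_speedup:.1f}x")
print(f"Speed-up, fastest small model: {fastest_small_speedup:.1f}x")
print(f"Learning-rate ratio: {learning_rate_ratio:.0f}x")
```

The per-model lifts for the three community models are not reproduced here because the section above reports them only as a range.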

### The Takeaway

The takeaway for the field is not "fine-tuning is risky" or "bigger is always better." It is "evaluation is what tells you whether your choices worked, and you cannot skip it." The cross-architecture results add a practical dimension: when RAG is part of the deployment, the field can credibly recommend smaller, cheaper, faster models, provided evaluation on the customer's own questions confirms that answer quality holds.

## 5.5 This Works in Production

The approach demonstrated in this lab is not theoretical.

ATOSS, a German workforce management software company, faced a version of this problem with their support and documentation systems. Their documents were complex, domain-specific, and built up over years. General-purpose models answered questions fluently but not reliably. They needed answers that were grounded, consistent, and traceable.

They worked through the same escalation the field just practiced. Document ingestion. Structured retrieval. Evaluation against real customer questions. Synthetic data generation from their own documentation. Model adaptation. A three-way evaluation comparing the base model, the RAG pipeline, and the adapted model.

The adapted model improved meaningfully on the failure categories that RAG could not address. The pipeline was repeatable and survived model updates without requiring a full retraining cycle. The answers were traceable to source material, which satisfied their internal trust and compliance requirements.

That is the story the field can tell. Not "we have a fine-tuning product." Rather: "we have a process that starts with your documents, produces defensible answers, and tells you when and whether model adaptation is justified."

> **Facilitator note**: If you have access to additional customer case studies from your own engagements, this is the right place to include them. The ATOSS example establishes that the approach works at production scale. Local examples establish that the field has already done this work.

## 5.6 What to Bring Into the Next Customer Meeting

This is not about what to demo. It is about what to ask.

Bring one failing question. Not a hypothetical. A real question the customer's system gets wrong today, with the wrong answer on record. That question is the starting point of the conversation about where the failure lives and what it would take to fix it.

Bring the escalation ladder, not as a slide but as a mental model. When the customer describes their problem, you are already mapping it: is this a data quality issue, a retrieval issue, or a model issue? That mapping shapes everything that follows.

Bring a timeline that separates stages. RAG pipeline with evaluation: days for first results, weeks once iteration is included. Model adaptation with proper data generation and validation: additional weeks on top of that. Being clear about what each stage costs and what it delivers prevents the engagement from stalling when the customer realizes fine-tuning takes longer than they expected.

Do not bring a promise that you will solve it. Bring a process that will tell you whether it can be solved and what solving it requires. That is a more credible offer, and it is the one that ages well.

## 5.7 Closing

"You now have a complete pipeline from document ingestion to model adaptation, with evaluation at every stage. More importantly, you have the judgment to know when each stage is necessary and when it is not.

The field that leaves this room is not the field that says yes to every fine-tuning request, and it is not the field that says RAG is always enough. It is the field that asks what the evidence shows and knows what to do with the answer.

That is what enterprise customers trust. That is what keeps engagements from failing for the wrong reasons. And that is what this lab was actually teaching you."