# What Comes Next

You just spent four hours building a pipeline that took a document, chunked it, embedded it, retrieved from it, evaluated it, and adapted a model against it. That is real. The pipeline works. The eval scores are real scores.

And it was four hours.

What follows is an honest map of what a production engagement with this technology actually looks like, so you can set expectations clearly with your team, your leadership, and your customers.

---

## What You Built Is a Mental Model

The lab was not designed to produce a deliverable. It was designed to give you felt experience with a problem space that your customers are trying to navigate, so you can ask better questions, spot bad assumptions earlier, and know when to escalate to the people who do the build.

That is a different kind of value than a working system. It is also the right kind for your role.

Production is a different problem than the lab. Not a harder version of the same problem. A different one.

The difference is not the model. It is everything around the model.

---

## A Realistic Engagement Timeline

### Weeks 1-2: Discovery and Corpus Assessment

Before writing a line of pipeline code, you need to understand what you are actually working with.

- What documents exist, and where do they live?
- What formats are they in? PDFs, HTML, Word docs, wikis, SharePoint, scanned images?
- How are they versioned? Who owns them? When were they last updated?
- What questions does the customer actually need the system to answer?
- Who are the subject matter experts who can validate answers?

This phase surfaces the real shape of the problem. In the lab, we used a single clean markdown file. Real customers have twelve versions of the same policy document, a SharePoint site nobody has audited in three years, and a wiki that contradicts the official handbook.

Corpus curation is often the highest-value work in the entire engagement. It is also the work that gets skipped when timelines are compressed.

### Weeks 3-4: Pipeline Build and Evaluation Framework

This is the work you did in the lab, done properly against real documents.

- Ingestion tuned for the actual document types in the corpus
- Chunking strategy validated against domain-specific content (the right chunk size for a legal policy document is not the same as for a technical manual)
- Embedding model selected and benchmarked against the domain vocabulary
- A real evaluation set built with subject matter experts: not five questions, but fifty or more, covering different question types, edge cases, and known failure modes
- Baseline scores established before any optimization

The evaluation set is the most important deliverable of this phase. Everything that follows depends on having a reliable way to measure whether things are getting better or worse.
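
As a rough illustration of what "a reliable way to measure" can look like in code, the sketch below stores expectations per question and computes a keyword-based baseline, in the spirit of the keyword matching used in the lab. The names `EvalItem`, `keyword_score`, `run_baseline`, and `stub_pipeline`, along with the two sample questions, are invented for this example and are not the lab's actual code.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    """One evaluation question with expectations agreed with an SME."""
    question: str
    required_keywords: list[str]   # terms a correct answer is expected to contain
    category: str = "general"      # question type, e.g. "lookup" or "edge case"


def keyword_score(answer: str, item: EvalItem) -> float:
    """Fraction of required keywords found in the answer (case-insensitive)."""
    if not item.required_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in item.required_keywords if kw.lower() in text)
    return hits / len(item.required_keywords)


def run_baseline(items, answer_fn):
    """Score every item; answer_fn is any question -> answer callable."""
    per_question = {
        item.question: keyword_score(answer_fn(item.question), item)
        for item in items
    }
    mean_score = sum(per_question.values()) / len(per_question)
    return per_question, mean_score


# Illustrative data only: a real set needs fifty or more SME-reviewed items.
eval_set = [
    EvalItem("How many vacation days do new hires get?", ["15", "days"], "lookup"),
    EvalItem("Who approves travel over $5,000?", ["vice president"], "policy"),
]


def stub_pipeline(question: str) -> str:
    """Stand-in for the real RAG pipeline; always returns the same answer."""
    return "New hires receive 15 days of paid vacation."


per_question, baseline = run_baseline(eval_set, stub_pipeline)
print(f"baseline mean score: {baseline:.2f}")
```

Keeping the per-question scores, not just the mean, is what makes later changes explainable: you can see which questions moved.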

### Weeks 5-6: Iteration and Retrieval Optimization

The first pipeline run is expected to be wrong. Not catastrophically wrong. But wrong in ways that are now visible because you have an evaluation framework.

Common failure patterns at this stage:

- Retrieval returning near-duplicate chunks from different document versions, wasting retrieval slots
- Chunk boundaries cutting across the information needed to answer a question
- Domain-specific terminology not aligning with how the embedding model represents concepts
- Questions that fail because the answer simply is not in the corpus at all

Each of these requires a different fix. Evaluation tells you which problem you have. Iteration fixes it. A targeted change should produce an explainable improvement. If you cannot explain why it got better, you got lucky, and luck is not a deployment strategy.

### Weeks 7-10: Model Adaptation (If Warranted)

Most customers will not need fine-tuning. RAG, done well, solves the majority of use cases.

Some customers will absolutely need it. The signal is specific: retrieval is returning the right chunks, the model has everything it needs in context, and it still fails consistently on a particular class of question. That is evidence, not a hunch, that the model itself needs to change.
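
One way to turn that signal into something checkable is to record, for each evaluation question, whether retrieval found the needed passage and whether the answer passed, then look for question classes that fail even when retrieval succeeded. This is a minimal sketch: `EvalResult`, `adaptation_candidates`, the sample records, and the thresholds `min_cases` and `max_pass_rate` are all assumptions to be tuned, not values from the lab.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question: str
    category: str          # question class, e.g. "date arithmetic"
    retrieval_hit: bool    # did the retrieved chunks contain the needed passage?
    answer_correct: bool   # did the generated answer pass evaluation?


def adaptation_candidates(results, min_cases=3, max_pass_rate=0.5):
    """Return categories that fail consistently even when retrieval succeeded."""
    outcomes_by_category = {}
    for r in results:
        if r.retrieval_hit:  # retrieval failures need a retrieval fix, so skip them
            outcomes_by_category.setdefault(r.category, []).append(r.answer_correct)

    flagged = {}
    for category, outcomes in outcomes_by_category.items():
        pass_rate = sum(outcomes) / len(outcomes)
        if len(outcomes) >= min_cases and pass_rate <= max_pass_rate:
            flagged[category] = pass_rate
    return flagged


# Made-up results for illustration.
results = [
    EvalResult("q1", "date arithmetic", True, False),
    EvalResult("q2", "date arithmetic", True, False),
    EvalResult("q3", "date arithmetic", True, True),
    EvalResult("q4", "lookup", True, True),
    EvalResult("q5", "lookup", True, True),
    EvalResult("q6", "lookup", False, False),  # a retrieval failure, not a model failure
    EvalResult("q7", "lookup", True, True),
]

flagged = adaptation_candidates(results)
print(flagged)
```

In this toy data only "date arithmetic" is flagged. The failed "lookup" question is excluded because its problem is upstream of the model.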

If that signal is present, model adaptation is a real commitment:

- Synthetic data generation from domain documents: hundreds to thousands of examples, not eight
- Training run validation: loss curves, convergence checks, held-out test sets
- Three-way evaluation: base model, RAG pipeline, adapted model with RAG
- Regression testing to catch catastrophic forgetting (in the lab, fine-tuning broke a question the baseline answered correctly)
- Iteration, because the first training run rarely produces the best result
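
The three-way evaluation and the regression check in the list above can be sketched together. The function below assumes each configuration has already been scored as pass/fail per question. `three_way_report` and the sample results are hypothetical, and counting a regression against either earlier configuration is one possible definition among several.

```python
def three_way_report(base, rag, adapted_rag):
    """Compare pass/fail results across three configurations.

    Each argument maps question -> bool, where True means the answer passed.
    """
    questions = sorted(base)
    total = len(questions)
    summary = {
        "base": sum(base[q] for q in questions) / total,
        "rag": sum(rag[q] for q in questions) / total,
        "adapted_rag": sum(adapted_rag[q] for q in questions) / total,
    }
    # Regression: passed in an earlier configuration, fails after adaptation.
    regressions = [
        q for q in questions if (base[q] or rag[q]) and not adapted_rag[q]
    ]
    return summary, regressions


# Made-up results for illustration.
base = {"q1": False, "q2": True, "q3": False, "q4": False}
rag = {"q1": True, "q2": True, "q3": False, "q4": True}
adapted_rag = {"q1": True, "q2": False, "q3": True, "q4": True}

summary, regressions = three_way_report(base, rag, adapted_rag)
print(summary, regressions)
```

In this example the RAG pipeline and the adapted model reach the same aggregate pass rate, yet the adapted model broke `q2`. An aggregate score alone would hide that, which is why the per-question comparison matters.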

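Estimated time from "we need fine-tuning" to "we have a validated adapted model": six to ten weeks, depending on data availability and iteration cycles. When adaptation is in scope, it typically extends past week 10, and production hardening shifts accordingly.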

### Weeks 10+: Production Hardening

The pipeline worked in a notebook. Production requires:

- Automated ingestion and re-indexing as documents change
- Latency and throughput testing under real load
- Monitoring for retrieval quality drift over time
- Guardrails and safety evaluation appropriate to the use case
- Integration with the customer's existing systems and auth infrastructure
- Ongoing evaluation as the corpus grows and the question set evolves

This is not a one-time build. A RAG system deployed against a living corpus requires ongoing maintenance.
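
As one possible shape for the drift monitoring mentioned above, the sketch below tracks the rolling mean of each query's top-1 similarity score and flags a sustained drop against a baseline. `RetrievalDriftMonitor`, the window size, and the `max_drop` threshold are assumptions for illustration. Similarity is only a proxy, so a flag here is a prompt to re-run the evaluation set, not a verdict.

```python
from collections import deque
from statistics import mean


class RetrievalDriftMonitor:
    """Flag when the rolling mean of top-1 similarity falls below a baseline."""

    def __init__(self, baseline_mean: float, window: int = 100, max_drop: float = 0.05):
        self.baseline_mean = baseline_mean
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, top_score: float) -> bool:
        """Add one query's top-1 similarity; return True if drift is suspected."""
        self.scores.append(top_score)
        if len(self.scores) < self.scores.maxlen:
            return False  # window not full yet, too early to judge
        return self.baseline_mean - mean(self.scores) > self.max_drop


# Tiny window so the behavior is visible in four queries.
monitor = RetrievalDriftMonitor(baseline_mean=0.80, window=3, max_drop=0.05)
alerts = [monitor.record(score) for score in (0.79, 0.78, 0.80, 0.60)]
print(alerts)
```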

---

## Things We Skipped in the Lab (And Why)

The lab made deliberate choices about what to include. Here is what was left out and why it matters in production.

**Corpus deduplication.** In the multi-document section of Day 2, three retrieval slots were consumed by three versions of the same rulebook. The current edition was not retrieved at all. In a customer corpus with hundreds of versioned documents, this problem gets worse, not better. The fix is upstream corpus curation, not a model change.
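
A minimal sketch of version-level deduplication at ingestion time follows. It assumes every document already carries a stable `doc_id` shared across its versions plus an effective date, which in a real corpus is often the hard part to establish. The `Document` class and `keep_latest` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    doc_id: str       # stable identifier shared by all versions (assumed to exist)
    version: int
    effective: date
    text: str


def keep_latest(documents):
    """Keep only the most recent version of each document."""
    latest = {}
    for doc in documents:
        current = latest.get(doc.doc_id)
        # Compare by effective date first, then version number as a tiebreaker.
        if current is None or (doc.effective, doc.version) > (current.effective, current.version):
            latest[doc.doc_id] = doc
    return list(latest.values())


# Made-up corpus: three editions of one rulebook plus an unrelated handbook.
corpus = [
    Document("rulebook", 1, date(2019, 1, 1), "first edition text"),
    Document("rulebook", 3, date(2024, 1, 1), "current edition text"),
    Document("rulebook", 2, date(2021, 6, 1), "second edition text"),
    Document("handbook", 1, date(2022, 3, 1), "handbook text"),
]

curated = keep_latest(corpus)
print([(d.doc_id, d.version) for d in curated])
```

Only the current rulebook edition reaches the index, so older editions can no longer occupy retrieval slots.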

**Metadata filtering.** Once you have a multi-document corpus, retrieval needs to be smarter than "find the most similar chunks." You often need to filter by document type, date range, department, or authority level before ranking by similarity. This requires metadata strategy at ingestion time.
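
The filter-then-rank idea can be sketched with plain Python. The chunk layout (dicts with `text`, `vector`, and `meta` keys), the `status` field, and the two-dimensional toy vectors are assumptions for illustration. A production system would normally push the filter into the vector store's own query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, top_k=3, **required):
    """Drop chunks whose metadata does not match, then rank the rest by similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in required.items())
    ]
    candidates.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return candidates[:top_k]


# Toy index: the superseded rulebook is the closest match by similarity alone.
chunks = [
    {"text": "2019 rulebook", "vector": [1.0, 0.0], "meta": {"status": "superseded"}},
    {"text": "2024 rulebook", "vector": [0.9, 0.1], "meta": {"status": "current"}},
    {"text": "2024 travel policy", "vector": [0.0, 1.0], "meta": {"status": "current"}},
]
query = [1.0, 0.0]

unfiltered = filtered_search(query, chunks, top_k=1)
current_only = filtered_search(query, chunks, top_k=1, status="current")
print(unfiltered[0]["text"], "|", current_only[0]["text"])
```

Without the filter the superseded rulebook wins on similarity. With `status="current"` it never enters the ranking. The filter only works if that metadata was captured at ingestion.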

**Chunking strategy comparison.** The lab used a single chunking approach. In practice, the right strategy depends on document structure. Fixed-size chunking, semantic chunking, hierarchical chunking, and document-structure-aware chunking produce meaningfully different retrieval behavior against different content types.
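
To make the contrast concrete, here is a small sketch of two of those strategies applied to markdown: a fixed-size sliding window and a structure-aware split at headings. Both functions are simplified illustrations with hypothetical names, not the chunker used in the lab. Semantic and hierarchical chunking are not shown.

```python
import re


def fixed_size_chunks(text, size=200, overlap=50):
    """Sliding character window; ignores document structure entirely."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def heading_chunks(markdown):
    """Split markdown so that each chunk starts at a heading line."""
    # Zero-width split before any line that begins with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


doc = (
    "# Leave\nNew hires get 15 days.\n\n"
    "# Travel\nTrips over $5,000 need VP approval.\n"
)

by_heading = heading_chunks(doc)
by_size = fixed_size_chunks(doc, size=40, overlap=10)
print(len(by_heading), len(by_size))
```

The heading-based split keeps each policy intact, while the fixed window cuts wherever the character count lands. Which one retrieves better is an empirical question for the evaluation set to answer.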

**LLM-as-judge evaluation.** Keyword matching catches clear failures. It misses nuanced correctness issues. Production evaluation typically requires a second model evaluating the first model's answers against reference answers, with human spot-checking on a sample.
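
A hedged sketch of the judge loop follows. `judge_fn` is a hypothetical callable that sends a prompt to whatever judge model you use and returns its text reply. The prompt wording, the one-word verdict format, and the 10 percent spot-check fraction are assumptions, not a recommended standard.

```python
import random

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge_answer(question, reference, candidate, judge_fn):
    """Return True or False from the judge, or None if the reply is unparseable."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    verdict = judge_fn(prompt).strip().upper()
    if verdict not in {"CORRECT", "INCORRECT"}:
        return None  # do not guess; route this one to a human reviewer
    return verdict == "CORRECT"


def spot_check_sample(records, fraction=0.1, seed=0):
    """Pick a reproducible random subset of judged records for human review."""
    rng = random.Random(seed)
    k = min(len(records), max(1, round(len(records) * fraction)))
    return rng.sample(records, k)


# Stub judges standing in for a real model call.
passed = judge_answer("Q?", "15 days", "Fifteen days", lambda prompt: " correct\n")
unclear = judge_answer("Q?", "15 days", "It depends", lambda prompt: "maybe")
sample = spot_check_sample(list(range(20)))
print(passed, unclear, len(sample))
```

Returning `None` for an unparseable reply, instead of defaulting to pass or fail, keeps judge errors from silently skewing the scores.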

**Embedding model selection.** We used a specific Granite embedding model. In production, embedding model choice depends on domain vocabulary, context length requirements, multilingual needs, and latency constraints. This is a benchmarking exercise, not a default.

**Reranking.** Retrieval returns candidates. Reranking re-scores them using a separate, more expensive model before passing them to the generator. It often improves answer quality in production systems and is just as often skipped in proof-of-concept work.
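
The two-stage shape can be sketched as follows. `score_fn` is a hypothetical `(query, text) -> float` callable. In practice it would wrap a dedicated reranking model, whereas the word-overlap scorer here is only a stand-in so the example runs without one.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Re-score retrieved candidates with a stronger scorer and keep the best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def overlap_score(query, text):
    """Stand-in scorer: share of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


# Candidates in the order a first-stage retriever might have returned them.
retrieved = [
    "Travel policy for new hires",
    "New hires get 15 vacation days",
    "Office hours",
]

top = rerank("vacation days for new hires", retrieved, overlap_score, top_k=2)
print(top)
```

The first stage stays cheap and broad, and the expensive scorer only sees the short candidate list, which is what keeps the added latency bounded.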

---

## The Conversation to Have with Your Customer

When a customer says they want to build a RAG system, the instinct is to talk about models. The right conversation starts somewhere else.

Ask them: how are you measuring "correct"?

If they cannot answer that, neither can you. And any change you make, to retrieval, to chunking, to the model, becomes unjustifiable, because there is no before and after to compare.

The evaluation set you build with their subject matter experts, even if it starts as only ten or twenty questions with agreed-upon reference answers, is often the most productive artifact of the first month. It forces the customer to articulate standards they have never written down. It gives you a shared definition of done. And it gives every subsequent recommendation a defensible basis.

When they ask whether they need fine-tuning, the answer that earns trust is not yes or no. It is: "Here is what the evaluation shows. Here is what is failing and why. Here is what we would try before we consider changing the model."

That positions you as someone who solves problems methodically, not someone who sells the most expensive option.

---

## A Quick Reference: Stages and Realistic Timelines

| Stage | What It Produces | Realistic Duration |
|---|---|---|
| Discovery and corpus assessment | Document inventory, question set draft, scope definition | 1-2 weeks |
| Pipeline build | Working ingestion, chunking, retrieval, and generation against real documents | 2-3 weeks |
| Evaluation framework | 50+ question eval set with reference answers, baseline scores | 2-3 weeks (overlaps pipeline build) |
| Retrieval optimization | Improved retrieval scores against evaluation set, documented changes | 2-4 weeks |
| Model adaptation (if warranted) | Adapted model validated against held-out test set | 6-10 weeks |
| Production hardening | Deployed, monitored, maintainable system | 4-8 weeks |

These ranges assume a real customer corpus, not a clean single-document lab environment. Discovery of corpus problems, stakeholder alignment on evaluation criteria, and access delays are the most common sources of schedule expansion.

---

## Further Reading

The lab gave you working code and a mental model. These resources go deeper on the concepts that will matter most in real engagements.

**Retrieval and RAG**
- "RAGAS: Automated Evaluation of Retrieval Augmented Generation" (paper) -- the academic basis for automated RAG evaluation
- LangChain and LlamaIndex documentation on chunking strategies -- practical comparisons of fixed-size, semantic, and hierarchical approaches
- "Lost in the Middle: How Language Models Use Long Contexts" (paper) -- explains why what you retrieve and where it appears in the prompt both matter

**Evaluation**
- "Evaluating RAG Systems" (Hugging Face blog) -- practical guidance on moving beyond keyword matching
- HELM (Holistic Evaluation of Language Models) -- framework for thinking about evaluation dimensions

**Fine-Tuning and Model Adaptation**
- "LoRA: Low-Rank Adaptation of Large Language Models" (paper) -- the foundational paper for the adaptation technique used in Day 3
- "LIMA: Less Is More for Alignment" (paper) -- evidence that a small, carefully curated dataset (1,000 examples in the paper) can go a long way, so data quality matters at least as much as data quantity
- Hugging Face TRL documentation -- the library used in the lab, with production-grade examples

**Production Considerations**
- "Building LLM Applications for Production" (Chip Huyen blog post) -- one of the most referenced practical guides on the gap between demos and production
- MLflow documentation -- experiment tracking and model registry, relevant to managing iteration cycles

---

*This document is part of the extras folder. Nothing here is required. All of it is real.*