# What Comes Next

You just spent four hours building a pipeline that took a document, chunked it,
embedded it, retrieved from it, evaluated it, and adapted a model against it.
That is real. The pipeline works. The eval scores are real scores.

And it was four hours.

What follows is an honest map of what a production engagement with this
technology actually looks like.

---

## What You Built Is a Mental Model

The lab was not designed to produce a deliverable. It was designed to give you
felt experience with a problem space that your customers are trying to navigate,
so you can ask better questions, spot bad assumptions earlier, and know when to
escalate to the people who do the build.

That is a different kind of value than a working system. It is also the right
kind for your role.

Production is a different problem than the lab. Not a harder version of the same
problem. A different one.

The difference is not the model. It is everything around the model.

---

## This Is a Process, and It Works as a Chain

Every stage of the pipeline depends on the stage before it. Bad ingestion
produces bad chunks. Bad chunks produce bad embeddings. Bad embeddings produce
bad retrieval. Bad retrieval means the model never sees the right information,
and no amount of fine-tuning will fix that.

This is not a collection of independent problems. It is a chain. And a chain is
only as strong as its weakest link.

In the lab, we controlled for this. The document was clean. The format was
consistent. The evaluation questions were chosen to be answerable. Real
engagements do not come pre-cleaned.

The work of a real engagement is largely the work of finding the weak link,
fixing it, and confirming the fix improved the system. Then finding the next
one. That process takes time not because the technology is immature, but because
the problem is genuinely complex and the complexity lives in the details of each
customer's specific corpus, questions, and definition of correct.

---

### Corpus deduplication
In the multi-document section of Day 2, three retrieval
slots were consumed by three versions of the same rulebook. The current edition
was not retrieved at all. In a customer corpus with hundreds of versioned
documents, this problem gets worse, not better. The fix is upstream corpus
curation, not a model change.
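
One way to picture upstream curation is a filter that runs before anything is embedded. The sketch below is a minimal illustration, not the lab's code. It assumes each document carries a `family` key that groups editions of the same document and a `published` date; both field names are hypothetical, and in a real corpus they usually have to be derived.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    doc_id: str
    family: str      # hypothetical key grouping editions of the same document
    published: date  # hypothetical edition date
    text: str


def keep_latest_editions(docs):
    """Return one document per family: the most recently published edition."""
    latest = {}
    for doc in docs:
        current = latest.get(doc.family)
        if current is None or doc.published > current.published:
            latest[doc.family] = doc
    return list(latest.values())


corpus = [
    Document("rules-v1", "rulebook", date(2021, 1, 1), "Old rules."),
    Document("rules-v2", "rulebook", date(2022, 6, 1), "Newer rules."),
    Document("rules-v3", "rulebook", date(2024, 3, 1), "Current rules."),
    Document("faq-v1", "faq", date(2023, 9, 1), "Frequently asked questions."),
]

curated = keep_latest_editions(corpus)
print(sorted(d.doc_id for d in curated))  # ['faq-v1', 'rules-v3']
```

Only the curated list would go on to chunking and embedding, so superseded editions never compete for retrieval slots. Whether old editions should be dropped or kept behind a filter is a customer decision, not a technical one.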

### Metadata filtering
Once you have a multi-document corpus, retrieval needs
to be smarter than "find the most similar chunks." You often need to filter by
document type, date range, department, or authority level before ranking by
similarity. This requires metadata strategy at ingestion time. Without it, a
question about current policy might return a chunk from a three-year-old version
that scored higher on similarity.
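
The filter-then-rank order can be sketched in a few lines. This is an illustration under stated assumptions: the metadata fields `doc_type` and `effective`, the two-dimensional vectors, and the function `filtered_search` are all made up for the example. A real vector store would apply the filter inside the query rather than in Python.

```python
import math
from datetime import date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, doc_type, not_before, top_k=3):
    """Filter on metadata first, then rank the survivors by similarity."""
    eligible = [
        c for c in chunks
        if c["doc_type"] == doc_type and c["effective"] >= not_before
    ]
    ranked = sorted(eligible, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]


# Toy chunks with hypothetical metadata attached at ingestion time.
chunks = [
    {"id": "policy-2021", "doc_type": "policy", "effective": date(2021, 5, 1), "vector": [0.9, 0.1]},
    {"id": "policy-2024", "doc_type": "policy", "effective": date(2024, 2, 1), "vector": [0.7, 0.3]},
    {"id": "memo-2024", "doc_type": "memo", "effective": date(2024, 4, 1), "vector": [0.95, 0.05]},
]
query = [1.0, 0.0]

# Similarity alone ranks the memo and the outdated policy above the current policy.
unfiltered = sorted(chunks, key=lambda c: cosine(query, c["vector"]), reverse=True)
results = filtered_search(query, chunks, doc_type="policy", not_before=date(2023, 1, 1))

print([c["id"] for c in unfiltered])
print([c["id"] for c in results])
```

In this toy data the current policy is the least similar chunk, so similarity alone ranks it last. The filter is what puts it first.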

### Chunking strategy comparison
The lab used a single chunking approach. In
practice, the right strategy depends on document structure. Fixed-size chunking,
semantic chunking, hierarchical chunking, and document-structure-aware chunking
produce meaningfully different retrieval behavior against different content types.
A chunk boundary that splits a table in half means neither piece is retrievable
on its own, and no retrieval tuning will fix that.
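
The table problem is easy to reproduce. The sketch below compares fixed-size chunking with a simple structure-aware variant that never cuts inside a blank-line-separated block. Both functions are illustrative stand-ins written for this example, not the lab's chunker or any library's splitter.

```python
def fixed_size_chunks(text, size):
    """Split every `size` characters, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def block_aware_chunks(text, max_size):
    """Pack blank-line-separated blocks into chunks without splitting a block.

    A single block longer than max_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for block in text.split("\n\n"):
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_size:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


table = "| Tier | Fee |\n| A | 10 |\n| B | 25 |"
text = f"Fees depend on the account tier.\n\n{table}\n\nFees are reviewed yearly."

fixed = fixed_size_chunks(text, 50)
aware = block_aware_chunks(text, 50)

print(any(table in c for c in fixed))  # False: the table is cut at character 50
print(any(table in c for c in aware))  # True: the table survives as one chunk
```

The block-aware version trades uniform chunk sizes for intact structure. Which trade is right depends on the content type, which is why this is a comparison exercise rather than a default.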

### LLM-as-judge evaluation
Keyword matching catches clear failures. It misses subtler errors, such as an
answer that contains the expected terms but draws the wrong conclusion.
Production evaluation typically requires a second model evaluating the first
model's answers against reference answers, with human spot-checking on a sample.
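
The shape of a judge loop can be sketched without committing to a model or an API. In the example below the judge is passed in as a plain callable, and `stub_judge` is a crude stand-in so the sketch runs on its own. The prompt wording, the one-word verdict format, and every function name here are assumptions for illustration.

```python
import random

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge_answers(records, call_judge):
    """Grade each record with a judge model supplied as a plain callable."""
    graded = []
    for rec in records:
        reply = call_judge(JUDGE_PROMPT.format(**rec)).strip().upper()
        # Anything other than an exact CORRECT counts as a failure,
        # so a malformed judge reply never passes silently.
        graded.append({**rec, "passed": reply == "CORRECT"})
    return graded


def spot_check_sample(graded, n, seed=0):
    """Pick a reproducible random sample of graded records for human review."""
    rng = random.Random(seed)
    return rng.sample(graded, min(n, len(graded)))


def stub_judge(prompt):
    """Stand-in for a real model call: a substring check, for demo only."""
    reference = prompt.split("Reference answer: ")[1].split("\n")[0]
    candidate = prompt.split("Candidate answer: ")[1].split("\n")[0]
    return "CORRECT" if reference.lower() in candidate.lower() else "INCORRECT"


records = [
    {"question": "What is the filing deadline?",
     "reference": "30 days",
     "candidate": "You must file within 30 days."},
    {"question": "Who approves exceptions?",
     "reference": "the compliance officer",
     "candidate": "Your manager approves them."},
]

graded = judge_answers(records, stub_judge)
pass_rate = sum(g["passed"] for g in graded) / len(graded)
for_review = spot_check_sample(graded, n=1)
print(pass_rate)
```

In a real engagement `call_judge` would wrap a call to a second model, and the human review sample is what tells you whether the judge itself can be trusted.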

### Embedding model selection
We used a specific Granite embedding model. In
production, embedding model choice depends on domain vocabulary, context length
requirements, multilingual needs, and latency constraints. This is a
benchmarking exercise, not a default. A model trained on general web text can
consistently fail to retrieve the right chunks simply because the domain uses
technical terminology the embedding space has never seen.
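
A benchmarking harness for this can be small. The sketch below measures recall at k for each candidate embedder against a handful of labeled queries. The two "models" are toy bag-of-words embedders with different vocabularies, standing in for a general-purpose model and a domain-aware one. They are not real embedding models, and the chunks and queries are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed, chunks, labeled_queries, k):
    """Fraction of queries whose expected chunk id appears in the top k."""
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for query, expected_id in labeled_queries:
        q = embed(query)
        ranked = sorted(chunk_vecs, key=lambda cid: cosine(q, chunk_vecs[cid]), reverse=True)
        hits += expected_id in ranked[:k]
    return hits / len(labeled_queries)


def bag_of_words_embedder(vocabulary):
    """Toy stand-in for an embedding model: counts words from a fixed vocabulary."""
    def embed(text):
        words = text.lower().split()
        return [words.count(term) for term in vocabulary]
    return embed


chunks = {
    "c1": "monthly account report schedule",
    "c2": "escheatment rules for dormant account balances",
    "c3": "subrogation rights after a payment",
}
labeled_queries = [
    ("escheatment deadline", "c2"),
    ("subrogation claim", "c3"),
    ("account report", "c1"),
]
candidates = {
    "general": bag_of_words_embedder(["payment", "account", "report"]),
    "domain": bag_of_words_embedder(
        ["payment", "account", "report", "escheatment", "subrogation"]
    ),
}

scores = {name: recall_at_k(embed, chunks, labeled_queries, k=1)
          for name, embed in candidates.items()}
print(scores)
```

The "general" embedder has no representation for the two domain terms, so those queries collapse to an empty vector and retrieval falls back to arbitrary order. Swapping in real models means replacing the values in `candidates` with functions that call them. The harness stays the same.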

### Reranking
Retrieval returns candidates. Reranking re-scores them using a
separate, more expensive model before passing them to the generator. It
often improves answer quality in production systems and is just as often
skipped in early-stage work.
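
The two-stage shape looks like this in miniature. Both scorers below are toys written for this example: `word_overlap` stands in for fast vector retrieval and `phrase_aware_score` stands in for a slower, more precise reranking model. Neither is a real retriever or reranker.

```python
def retrieve_then_rerank(query, chunks, first_stage_score, rerank_score,
                         n_candidates, top_k):
    """Cheap scorer narrows the corpus; the expensive scorer orders the shortlist."""
    shortlist = sorted(
        chunks, key=lambda c: first_stage_score(query, c), reverse=True
    )[:n_candidates]
    return sorted(
        shortlist, key=lambda c: rerank_score(query, c), reverse=True
    )[:top_k]


def word_overlap(query, chunk):
    """Toy first-stage scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def phrase_aware_score(query, chunk):
    """Toy stand-in for a reranking model: rewards the exact query phrase."""
    bonus = 5 if query.lower() in chunk.lower() else 0
    return word_overlap(query, chunk) + bonus


chunks = [
    "policy on parking refund requests",
    "the refund policy allows returns within 30 days",
    "office parking policy",
]
query = "refund policy"

first_stage_top = max(chunks, key=lambda c: word_overlap(query, c))
final = retrieve_then_rerank(
    query, chunks, word_overlap, phrase_aware_score, n_candidates=2, top_k=1
)

print(first_stage_top)
print(final[0])
```

The expensive scorer only ever sees `n_candidates` chunks, which is what keeps the cost manageable. The first stage has to be good enough to get the right chunk onto the shortlist, because the reranker cannot recover what retrieval never returned.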

Each of these is a place where the chain can break.

---

## The Conversation to Have with Your Customer

When a customer says they want to build a RAG system, the instinct is to talk
about models. The right conversation starts somewhere else.

Ask them: how are you measuring correct?

If they cannot answer that, neither can you. And any change you make, to
retrieval, to chunking, to the model, becomes unjustifiable, because there is no
before and after to compare.

The evaluation set you build with their subject matter experts, even ten or
twenty questions with agreed-upon reference answers, is often the most productive
artifact of the first month. It forces the customer to articulate standards they
have never written down. It gives you a shared definition of done. And it gives
every subsequent recommendation a defensible basis.
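
An evaluation set does not need tooling to be useful. The sketch below shows one possible shape: a list of questions with reference answers and the terms a correct answer must contain, plus a scoring function that lets you compare a baseline run against a changed one. The `EvalItem` fields, the keyword-based pass rule, and the sample questions are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    question: str
    reference: str          # answer agreed with the subject matter experts
    required_terms: tuple   # hypothetical pass rule: all terms must appear


def score_run(eval_set, answers):
    """Fraction of questions whose answer contains every required term."""
    passed = 0
    for item in eval_set:
        answer = answers.get(item.question, "").lower()
        passed += all(term.lower() in answer for term in item.required_terms)
    return passed / len(eval_set)


q1 = "How long do customers have to request a refund?"
q2 = "Who approves policy exceptions?"

eval_set = [
    EvalItem(q1, "Customers have 30 days from delivery.", ("30 days",)),
    EvalItem(q2, "The compliance officer approves exceptions.", ("compliance officer",)),
]

# Answers from the system before and after a change, keyed by question.
baseline = {
    q1: "Refunds are available for 30 days after delivery.",
    q2: "Your manager can approve an exception.",
}
after_change = {
    q1: "Within 30 days of delivery.",
    q2: "Exceptions go to the compliance officer.",
}

before_score = score_run(eval_set, baseline)
after_score = score_run(eval_set, after_change)
print(f"before={before_score:.2f} after={after_score:.2f}")  # before=0.50 after=1.00
```

The scoring rule here is deliberately crude. What matters is that the same questions and the same rule are applied to both runs, so a recommendation can point to a number that moved.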

When they ask whether they need fine-tuning, the answer that earns trust is not
yes or no. It is: "Here is what the evaluation shows. Here is what is failing
and why. Here is what we would try before we consider changing the model."

That positions you as someone who solves problems methodically, not someone who
sells the most expensive option.

---

## Further Reading

### Retrieval and RAG

[RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217)
(paper) -- the academic basis for automated RAG evaluation, introducing metrics
that do not require human-annotated ground truth.

[LangChain Text Splitters documentation](https://python.langchain.com/docs/concepts/text_splitters/)
-- practical comparisons of fixed-size, semantic, and hierarchical chunking
approaches with code examples.

[Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172)
(paper) -- explains why what you retrieve and where it appears in the prompt
both matter. Performance degrades significantly when relevant information appears
in the middle of a long context.

### Evaluation

[Hugging Face TRL documentation](https://huggingface.co/docs/trl/index) -- the
library used in Day 3, with production-grade examples for training and
evaluation workflows.

[MLflow documentation](https://mlflow.org/docs/latest/index.html) -- experiment
tracking and model registry, relevant to managing iteration cycles across a real
engagement.

### Fine-Tuning and Model Adaptation

[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
(paper) -- the foundational paper for the adaptation technique used in Day 3.
Explains why you can train a tiny fraction of parameters and still get
meaningful results.

[LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206) (paper) --
evidence that data quality can matter more than data quantity. A model fine-tuned
on 1,000 carefully curated examples held its own against, and in human
evaluations was often preferred over, models trained on far larger datasets.

### Production Considerations

[Building LLM Applications for Production](https://huyenchip.com/2023/04/11/llm-engineering.html)
(Chip Huyen) -- one of the most referenced practical guides on the gap between
demos and production. The core observation maps directly to what you just built:
it is easy to make something cool with LLMs, and very hard to make something
production-ready.

---

*This document is part of the extras folder. Nothing here is required.
All of it is real.*