# 03 — Retrieval-Augmented Generation (RAG)

RAG exists largely to work around **context window limits**. An LLM cannot read a 100+ page document in a single call if the text does not fit into its context window (its short-term memory). RAG retrieves only the **relevant** pieces and feeds them to the model at generation time.

## Why RAG? Options for using your data with LLMs

- **Train an LLM from scratch with your data.**  
  *Extremely expensive.* Practically infeasible for most teams.

- **Fine-tune an existing LLM with your data.**  
  *Expensive and technically complex.* Often overkill and hard to maintain.

- **RAG (Retrieval-Augmented Generation).**  
  Split your data into small chunks, embed them into vectors ("embeddings"), and store them in a **vector database**. At query time, retrieve the most similar chunks and pass them to the LLM.

**In practice, RAG is the usual starting point for LLM applications that must answer questions over private or frequently changing data.**
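The "retrieve the most similar chunks" step in the RAG option needs a way to score similarity between vectors. A minimal sketch follows, assuming cosine similarity (one common choice, not the only one) and hand-written 3-dimensional vectors that stand in for real embeddings. The chunk names and numbers are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real models emit hundreds or thousands of dimensions.
chunk_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "office address": [0.0, 0.2, 0.9],
}
# Pretend embedding of the question "How do I get my money back?"
query_vector = [0.8, 0.2, 0.1]

# Rank chunk names from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of ranking, but over many vectors and usually with an approximate index so that it stays fast.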

## Division of responsibilities in RAG

- **Language generation** comes from the **foundation LLM**.
- **Knowledge representation** comes from the **vector database**.

In other words:
- The foundation LLM behaves like someone who can *speak and reason* but doesn't know your private data.
- The vector database behaves like your *domain knowledge*, providing grounded facts on demand.

## How a RAG app works

1. **Question** — The user asks a question.
2. **Retrieval** — A *retriever* searches an indexed document set (PDFs, databases, websites, etc.) and returns the most relevant chunks.
3. **Augmented generation** — The LLM generates an answer **using** those retrieved chunks as context.
4. **Result** — The app returns a new answer synthesized from the retrieved information (not just copy-paste).

```
User Question → Retriever (Vector DB) → Context Chunks → LLM → Answer
```

## Why it's useful

- **Accuracy & relevance** — Answers are grounded in retrieved sources.
- **Efficiency** — Combines search + generation in a single flow.
- **Versatility** — Works for study aids, research assistants, enterprise knowledge, etc.

## RAG helps reduce cost

- **Selective queries** — Retrieve relevant snippets first; the LLM sees **only** the filtered context.
- **Lower compute** — Less text for the model to process → faster and cheaper.

👉 In essence, RAG ensures that the most expensive resource (LLM compute) is used **only** where it adds value.
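To make the saving concrete, here is a back-of-the-envelope sketch. It assumes roughly 4 characters per token (a rough rule of thumb for English text, not an exact tokenizer) and hypothetical sizes for the document and the retrieved chunks.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

# Hypothetical sizes: a 100-page document at ~3,000 characters per page,
# versus 4 retrieved chunks of ~1,000 characters each.
full_document = "x" * (100 * 3_000)
retrieved_chunks = ["x" * 1_000 for _ in range(4)]

full_tokens = estimate_tokens(full_document)
rag_tokens = sum(estimate_tokens(chunk) for chunk in retrieved_chunks)

print(f"full document:     ~{full_tokens} input tokens")
print(f"retrieved context: ~{rag_tokens} input tokens")
print(f"reduction:         ~{full_tokens / rag_tokens:.0f}x")
```

Actual savings depend on chunk size, the value of *k*, and the tokenizer of the model in use.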

## Privacy in RAG

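- Data is provided to the LLM **only at generation time**, as prompt context. RAG itself does not train the model on it.
- Therefore, your information is **not stored in the model's weights**; it is shown temporarily as context.
- After answering, the model itself **does not remember** those inputs. A hosted provider may still log or retain requests, so check its data-retention policy.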

⚠️ In production-grade RAG, carefully classify stored data and apply strong security controls for sensitive content.
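One way to act on that warning is to label each chunk at indexing time and filter by the caller's clearance before anything reaches the prompt. The sketch below is illustrative only: the level names, the `classification` field, and `allowed_chunks` are hypothetical, and a real system would enforce this in the vector store or an access-control layer.

```python
# Hypothetical sensitivity levels, lowest to highest.
LEVELS = {"public": 0, "internal": 1, "confidential": 2}

def allowed_chunks(chunks: list[dict], user_clearance: str) -> list[dict]:
    """Drop any chunk labeled above the caller's clearance before it can reach a prompt."""
    limit = LEVELS[user_clearance]
    return [c for c in chunks if LEVELS[c["classification"]] <= limit]

chunks = [
    {"text": "Office hours are 9 to 5.", "classification": "public"},
    {"text": "Q3 roadmap draft.", "classification": "internal"},
    {"text": "Salary bands by level.", "classification": "confidential"},
]

visible = allowed_chunks(chunks, "internal")
print([c["text"] for c in visible])
```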

## Indexing pipeline (offline)

1. **Load**: Ingest documents (PDFs, HTML, databases, etc.).
2. **Split**: Chunk the text with a robust splitter (e.g., LangChain's `RecursiveCharacterTextSplitter`).
3. **Embed**: Convert chunks into vectors using an embedding model.
4. **Store**: Persist vectors into a **vector store** (FAISS, Chroma, Pinecone, etc.).
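The four indexing steps can be sketched with the standard library alone. This is a toy under stated assumptions: `split_text` is a fixed-size character splitter with overlap (simpler than a recursive splitter), `embed` uses word counts instead of a real embedding model, and the "vector store" is a plain Python list. The file names and texts are made up.

```python
from collections import Counter

def split_text(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
    """Fixed-size character chunks; neighbors share `overlap` characters."""
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not just a repeat of the previous tail.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(text.lower().split())

def build_index(documents: dict[str, str]) -> list[dict]:
    index = []
    for source, text in documents.items():        # 1. Load
        for chunk in split_text(text):            # 2. Split
            vector = embed(chunk)                 # 3. Embed
            index.append(                         # 4. Store
                {"source": source, "text": chunk, "vector": vector}
            )
    return index

# Hypothetical in-memory "documents" keyed by file name.
documents = {
    "handbook.txt": "Employees accrue vacation days monthly. " * 5,
    "faq.txt": "Refunds are issued within 14 days of purchase.",
}
index = build_index(documents)
print(len(index), "chunks indexed")
```

Keeping the source name next to each vector is what later makes citations possible.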

## Retrieval + generation flow (online)

1. Receive a **user question**.
2. **Retrieve** top-*k* most similar chunks from the vector store.
3. Build a **prompt** that includes the retrieved context + the user question.
4. Call the **LLM** to generate the final answer.
5. (Optional) **Cite sources**, validate JSON, apply guardrails, or cache results.
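Steps 2 and 3 can be sketched as follows, assuming the vector store has already returned similarity scores. The chunk data, the field names, and the prompt wording are hypothetical. The LLM call itself (step 4) is left out so the example runs offline.

```python
import heapq

def top_k(scored_chunks: list[dict], k: int = 2) -> list[dict]:
    """Keep the k highest-scoring chunks, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda c: c["score"])

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "Cite sources by number. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieval results with similarity scores.
scored_chunks = [
    {"source": "faq.txt", "text": "Refunds are issued within 14 days.", "score": 0.91},
    {"source": "blog.txt", "text": "We opened a new office in Lisbon.", "score": 0.12},
    {"source": "policy.txt", "text": "Refunds require a receipt.", "score": 0.78},
]

prompt = build_prompt("How long do refunds take?", top_k(scored_chunks))
print(prompt)  # this string is what would be sent to the LLM
```

Numbering the chunks in the prompt is a simple way to support the optional "cite sources" step, since the answer can refer back to `[1]` or `[2]`.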

### Controlling hallucinations with temperature

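- Lower temperature → more deterministic output. The model is less likely to sample low-probability tokens, which tends to reduce (but does not eliminate) hallucinations.
- Higher temperature → more varied and creative output, but riskier for factual answers.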

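```python
from langchain_openai import ChatOpenAI  # requires the langchain-openai package

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)  # 0.0 = least random sampling
```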
Combine **low temperature** with **good retrieval** and **structured prompts** for best grounding.
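
To see why temperature changes determinism, here is a minimal sketch of the usual softmax-with-temperature formulation. The logits are hypothetical scores for three candidate tokens. The function requires a temperature above zero; APIs typically treat a setting of 0 as "always pick the most likely token" rather than dividing by zero.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw token scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability mass sits on the top token. At high temperature the distribution flattens, so less likely tokens get sampled more often.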