# Week 5 Backend: RAG at Scale (Multi-Model + Agentic RAG)

This notebook is the backend deep dive for a production-grade RAG system. It pairs with the
scaffold under `research/week5-backend/week5_backend/` and summarizes three real-world case studies that the scaffold covers in full.

## Table of Contents
1. Architecture and goals
2. Real-world case studies
3. Multi-model routing
4. Ingestion and indexing
5. Retrieval, reranking, and citations
6. Agentic RAG workflow
7. Evaluation and monitoring
8. Ops and scaling checklist


## 1. Architecture and goals

Read the design doc: `research/week5-backend/week5_backend/architecture.md`.
Run the API: `python research/week5-backend/week5_backend/run_api.py`.

Key constraints:
- multi-tenant isolation and ACL filtering
- multiple vector stores and LLM providers
- explicit citations and traceability
- agentic planner for tool usage (RAG, SQL, external APIs)
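
The isolation constraint can be illustrated with a small filter. This is a minimal sketch, not the scaffold's implementation: the `Chunk` type, its fields, and `visible_chunks` are hypothetical names, and hiding untagged chunks (deny by default) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical record shape; the scaffold's chunk type may differ.
    chunk_id: str
    tenant_id: str
    acl_tags: frozenset = frozenset()


def visible_chunks(chunks, tenant_id, user_tags):
    """Return chunks owned by tenant_id that share at least one ACL tag with the caller."""
    allowed = frozenset(user_tags)
    return [c for c in chunks if c.tenant_id == tenant_id and c.acl_tags & allowed]


corpus = [
    Chunk("c1", "acme", frozenset({"support"})),
    Chunk("c2", "globex", frozenset({"support"})),  # other tenant: always hidden
    Chunk("c3", "acme", frozenset({"legal"})),      # same tenant, tag not held by caller
    Chunk("c4", "acme"),                            # untagged: hidden (deny by default)
]
print([c.chunk_id for c in visible_chunks(corpus, "acme", {"support"})])
```

In a real deployment the same predicate would typically be pushed into the vector store query as a metadata filter, so out-of-tenant rows are never fetched in the first place.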


## 2. Real-world case studies (all included in scaffold)

Full examples live in `research/week5-backend/week5_backend/case_studies.md`.

### A) Customer Support RAG (multi-tenant)
- Sources: ticket system, internal KB, product docs
- SLA: P95 latency under 2 s, strict per-tenant ACL
- Fallback: cached answers and BM25
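
A BM25 fallback needs no embedding provider, so it can still rank documents when the vector path is slow or unavailable. The sketch below is a self-contained illustration of the standard BM25 formula, not the scaffold's code: `bm25_rank` is a hypothetical name, whitespace tokenization is a simplification, and `k1=1.5`, `b=0.75` are commonly used defaults rather than tuned values.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices ordered from most to least relevant under BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization penalizes long documents relative to the average.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


kb = [
    "refunds are issued within 30 days",
    "shipping takes five days",
    "our office is closed on sunday",
]
print(bm25_rank("refunds days", kb))
```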

### B) Compliance/Legal RAG (strict citations)
- Must produce citations for every claim
- No cross-tenant data mixing
- Mandatory verifier step before response
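
One way to picture the verifier step is a check that every sentence carries a valid citation marker. This is a rough sketch under stated assumptions: citations appear as `[n]` markers before the sentence's closing punctuation, `n` indexes into the retrieved sources starting at 1, and `uncited_sentences` is a hypothetical helper, not a function from the scaffold.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer, num_sources):
    """Return sentences that have no citation or cite a source index that does not exist."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    problems = []
    for sentence in sentences:
        ids = [int(m) for m in CITATION.findall(sentence)]
        if not ids or any(i < 1 or i > num_sources for i in ids):
            problems.append(sentence)
    return problems


draft = (
    "Records are kept for seven years [1]. "
    "Backups are encrypted. "
    "Logs are deleted after 90 days [3]."
)
# With two retrieved sources, the second sentence is uncited and the third cites a missing source.
print(uncited_sentences(draft, num_sources=2))
```

A response would only be released when the returned list is empty; otherwise the pipeline can regenerate or escalate.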

### C) Engineering KB (code-aware)
- Source: repos, RFCs, ADRs
- Chunking: code + prose split
- Reranker required to prioritize exact API usage
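
The code + prose split can be sketched for Markdown sources as follows. This is an illustrative assumption about how such a splitter might work (fenced blocks become `code` segments, everything else becomes `prose`); `split_code_and_prose` is a hypothetical name and the scaffold's chunker may behave differently.

```python
FENCE = "`" * 3  # Markdown code fence marker


def split_code_and_prose(markdown):
    """Return (kind, text) segments where kind is 'code' or 'prose'."""
    segments, buffer, in_code = [], [], False

    def flush():
        # Emit the buffered lines as one segment, skipping whitespace-only buffers.
        if any(line.strip() for line in buffer):
            kind = "code" if in_code else "prose"
            segments.append((kind, "\n".join(buffer).strip("\n")))
        buffer.clear()

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            flush()
            in_code = not in_code  # a fence line toggles between prose and code
        else:
            buffer.append(line)
    flush()
    return segments


doc = "Call the client like this:\n" + FENCE + "python\nclient.get('/v1/items')\n" + FENCE + "\nRetry on 429."
for kind, text in split_code_and_prose(doc):
    print(kind, "|", text)
```

Keeping code segments intact lets the indexer tag them separately, which helps a reranker favor chunks that contain the exact API call.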


## 3. Multi-model routing

Providers supported in the scaffold:
- OpenAI (`providers/openai_provider.py`)
- Anthropic (`providers/anthropic_provider.py`)
- Local vLLM (`providers/local_vllm_provider.py`)

Routing policies live in `agents/policies.py`.


In [ ]:
from agents.policies import RoutingPolicy

policy = RoutingPolicy(default_provider="openai", fallback_provider="anthropic")
policy.choose(task="qa")
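
To make the routing idea concrete, here is a self-contained stand-in. It is not the scaffold's `RoutingPolicy`: the class name `SketchRoutingPolicy`, the `task_overrides` and `unhealthy` fields, and the provider labels are all hypothetical, chosen only to show per-task routing with a fallback.

```python
from dataclasses import dataclass, field


@dataclass
class SketchRoutingPolicy:
    default_provider: str
    fallback_provider: str
    task_overrides: dict = field(default_factory=dict)  # task name -> preferred provider
    unhealthy: set = field(default_factory=set)         # providers currently failing

    def choose(self, task):
        """Pick the task's preferred provider, falling back if it is marked unhealthy."""
        preferred = self.task_overrides.get(task, self.default_provider)
        if preferred not in self.unhealthy:
            return preferred
        if self.fallback_provider not in self.unhealthy:
            return self.fallback_provider
        raise RuntimeError("no healthy provider available")


sketch = SketchRoutingPolicy(
    default_provider="openai",
    fallback_provider="anthropic",
    task_overrides={"classification": "local_vllm"},
)
print(sketch.choose("qa"), sketch.choose("classification"))

sketch.unhealthy.add("openai")  # simulate an outage of the default provider
print(sketch.choose("qa"))
```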


## 4. Ingestion and indexing

See `rag/ingestion.py` and `pipelines/offline_index.py`.
- Parse -> chunk -> embed -> upsert into vector store.
- Store metadata (tenant_id, doc_type, ACL tags).


In [ ]:
from rag.ingestion import index_text
from rag.embeddings import EmbeddingService
from providers.openai_embeddings import OpenAIEmbeddings
from storage.pgvector_store import PgVectorStore

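# text-embedding-3-large returns 3072-dimensional vectors by default,
# so the store's embedding_dim must match.
embedder = EmbeddingService(OpenAIEmbeddings(model="text-embedding-3-large"))
store = PgVectorStore(dsn="postgresql://localhost/rag", table="rag_chunks", embedding_dim=3072)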

index_text(
    doc_id="doc-001",
    text="Example doc content for ingestion...",
    embedder=embedder,
    vector_store=store,
    metadata={"tenant_id": "acme"},
)
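
The chunk -> embed -> upsert steps can be sketched without any external service. Everything here is a hypothetical illustration rather than the behavior of `index_text`: the function names, the word-window chunking strategy, the record layout, and the toy embedder are assumptions.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into word windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def build_records(doc_id, text, embed, metadata, size=200, overlap=40):
    """Chunk, embed, and attach metadata; returns rows ready for a vector store upsert."""
    records = []
    for i, chunk in enumerate(chunk_words(text, size, overlap)):
        records.append({
            "id": f"{doc_id}:{i}",  # stable id lets an upsert overwrite the same chunk later
            "text": chunk,
            "embedding": embed(chunk),
            "metadata": {**metadata, "doc_id": doc_id, "chunk_index": i},
        })
    return records


def fake_embed(chunk):
    # Toy embedder: a real pipeline would call the embedding provider here.
    return [float(len(chunk))]


rows = build_records("doc-001", "a b c d e f g h i j", fake_embed, {"tenant_id": "acme"}, size=4, overlap=1)
print([r["text"] for r in rows])
```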


## 5. Retrieval, reranking, and citations

See `rag/retriever.py`, `rag/reranker.py`, and `rag/citations.py`.
- Retrieve with metadata filters (tenant, ACL, recency).
- Rerank for higher precision.
- Format citations for client consumption.


In [ ]:
from rag.retriever import HybridRetriever
from rag.reranker import Reranker
from rag.citations import format_citations
from rag.embeddings import EmbeddingService
from providers.openai_embeddings import OpenAIEmbeddings
from storage.pgvector_store import PgVectorStore

# text-embedding-3-large returns 3072-dimensional vectors by default, so embedding_dim must match.
store = PgVectorStore(dsn="postgresql://localhost/rag", table="rag_chunks", embedding_dim=3072)
embedder = EmbeddingService(OpenAIEmbeddings(model="text-embedding-3-large"))
retriever = HybridRetriever(store, embedder)

# Reuse one query string so retrieval and reranking score against the same text.
query = "What is our refund policy?"
chunks = retriever.retrieve(query=query, top_k=8, filters={"tenant_id": "acme"})
chunks = Reranker().rerank(query, chunks)
format_citations(chunks)
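
Hybrid retrieval has to merge a dense (vector) ranking with a lexical one. One common technique is reciprocal rank fusion (RRF); whether `HybridRetriever` uses it is not stated here, so treat the sketch below as an assumption. The function name and chunk ids are hypothetical, and `k=60` is the conventional smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id, ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c1", "c2", "c3"]    # hypothetical vector search order
lexical_hits = ["c2", "c4", "c1"]  # hypothetical BM25 order
print(reciprocal_rank_fusion([dense_hits, lexical_hits]))
```

RRF only needs ranks, not raw scores, which avoids having to calibrate cosine similarities against BM25 scores.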


## 6. Agentic RAG workflow

Agentic RAG combines a planner with tool execution. The minimal loop is:
1) The planner decides which tools to call.
2) The tools run (RAG, SQL, web).
3) The verifier checks for missing citations or conflicting evidence.


In [ ]:
from agents.executor import AgentExecutor
from agents.planner import Planner
from agents.tools import Tool, ToolRegistry

tools = ToolRegistry()
# Stub handler for illustration; a real RAG tool would call the retriever and the LLM.
tools.register(Tool(name="rag", handler=lambda q: "rag-answer"))

executor = AgentExecutor(Planner(), tools)
executor.run("Summarize our legal obligations for data retention.")
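
The planner -> tools -> verifier loop can be written out in a few lines. This is a simplified sketch, not the logic of `AgentExecutor`: `run_agent`, the toy planner and verifier, and the observation format are all hypothetical.

```python
def run_agent(question, plan, tools, verify, max_rounds=2):
    """Run plan -> execute -> verify until the verifier reports no issues or rounds run out."""
    observations, issues = {}, []
    for _ in range(max_rounds):
        for tool_name in plan(question, observations):
            observations[tool_name] = tools[tool_name](question)
        issues = verify(observations)
        if not issues:
            return {"status": "ok", "observations": observations}
    return {"status": "needs_review", "observations": observations, "issues": issues}


def toy_plan(question, observations):
    # Toy planner: request the RAG tool once, then nothing further.
    return [] if "rag" in observations else ["rag"]


def toy_verify(observations):
    # Toy verifier: flag any tool output that carries no citations.
    return [name for name, obs in observations.items() if not obs.get("citations")]


toy_tools = {
    "rag": lambda q: {"answer": "Retain records for seven years.", "citations": ["doc-001"]},
}
result = run_agent("What are our data retention obligations?", toy_plan, toy_tools, toy_verify)
print(result["status"])
```

Bounding the loop with `max_rounds` keeps cost and latency predictable when the verifier keeps rejecting the output.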


## 7. Evaluation and monitoring

See `evaluation/` for the harness. In production, track:
- faithfulness, citation coverage
- latency and cost per query
- failure and fallback rates
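
As a rough illustration of the latency, cost, and rate metrics, the sketch below aggregates a small query log. The log schema (`latency_ms`, `cost_usd`, `used_fallback`, `failed`) and the function names are assumptions, and the sample values are made up for the example.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(query_log):
    """Aggregate per-query records into the headline monitoring numbers."""
    n = len(query_log)
    return {
        "p95_latency_ms": nearest_rank_percentile([q["latency_ms"] for q in query_log], 95),
        "avg_cost_usd": sum(q["cost_usd"] for q in query_log) / n,
        "fallback_rate": sum(q["used_fallback"] for q in query_log) / n,
        "failure_rate": sum(q["failed"] for q in query_log) / n,
    }


# Made-up sample log for illustration only.
sample_log = [
    {"latency_ms": 120, "cost_usd": 0.002, "used_fallback": False, "failed": False},
    {"latency_ms": 340, "cost_usd": 0.004, "used_fallback": False, "failed": False},
    {"latency_ms": 180, "cost_usd": 0.002, "used_fallback": False, "failed": False},
    {"latency_ms": 2500, "cost_usd": 0.001, "used_fallback": True, "failed": False},
    {"latency_ms": 260, "cost_usd": 0.003, "used_fallback": False, "failed": False},
]
stats = summarize(sample_log)
print(stats)
```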


## 8. Ops and scaling checklist

- Separate ingestion workers from query API
- Cache embeddings and retrieval results
- Rate limit per tenant and enforce budgets
- Multi-provider fallback and circuit breakers

See `research/week5-backend/week5_backend/runbook.md` for ops playbooks.
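
For the circuit breaker item, a minimal sketch is shown below. It is an illustration under assumptions, not code from the scaffold or the runbook: the class shape, the thresholds, and the simplified half-open behavior (any call is allowed once the cool-down has elapsed) are all hypothetical.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock        # injectable so tests can control time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def allow(self):
        """Return True if a call to the provider may proceed."""
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let trial calls through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial call


now = [0.0]  # fake clock for a deterministic demo
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # circuit is open, so the caller should use the fallback provider
now[0] = 10.0
print(breaker.allow())  # cool-down elapsed, a trial call is allowed
```

Keeping one breaker per provider lets the router skip a failing provider immediately instead of waiting for each request to time out.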
