# Lab 3.5.7: Production RAG System

**Module:** 3.5 - RAG Systems & Vector Databases  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐⭐ (Expert)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

- [ ] Build a production-ready RAG pipeline with hybrid search, reranking, and caching
- [ ] Implement error handling, retries, and graceful degradation
- [ ] Add caching for improved performance
- [ ] Implement logging and monitoring
- [ ] Handle edge cases (empty results, long queries)
- [ ] Benchmark throughput and latency

---

## 📚 Prerequisites

- Completed: All previous labs in Module 3.5
- Understanding of: RAG, chunking, hybrid search, reranking, evaluation

---

## 🌍 Real-World Context

**The Situation:** Your RAG prototype impressed stakeholders. Now they want it in production serving real users. Demo code won't cut it - you need reliability, monitoring, and performance.

**Production Requirements:**
- 99.9% uptime
- < 3s response time P95
- Graceful handling of failures
- Observable metrics and logging
- Scalable to 100+ concurrent users
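
To make these targets concrete, the sketch below turns the 99.9% uptime figure into a downtime budget and checks a list of latencies against the 3 s P95 target. The 30-day window, the nearest-rank percentile method, and the sample latencies are illustrative assumptions, not values from this lab.

```python
import math
from typing import List

def downtime_budget_minutes(uptime: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` for a given uptime fraction."""
    return (1.0 - uptime) * days * 24 * 60

def p95(latencies_s: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies (seconds) for 20 requests
sample = [0.8, 1.1, 0.9, 1.4, 2.2, 1.0, 0.7, 1.3, 2.9, 1.2,
          0.9, 1.0, 1.6, 1.1, 3.4, 0.8, 1.5, 1.2, 1.0, 2.0]

budget = downtime_budget_minutes(0.999)
print(f"99.9% uptime allows about {budget:.1f} min of downtime per 30 days")
print(f"P95 = {p95(sample):.1f}s, meets the < 3s target: {p95(sample) < 3.0}")
```

A single slow request (3.4 s here) does not break a P95 target, which is why percentiles are a better fit for latency requirements than the maximum or the mean.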

---

## 🧒 ELI5: Production vs Demo

> **Demo Code**: Like cooking dinner for yourself - if something goes wrong, you just start over.
>
> **Production Code**: Like running a restaurant - you need:
> - Backup ingredients (error handling)
> - Health inspections (monitoring)
> - Multiple chefs (scalability)
> - Recipe records (logging)
> - Pre-prepped ingredients (caching)
> - "We're out of fish" plan (graceful degradation)

---

## Part 1: Setup

In [None]:
# Install dependencies
!pip install -q \
    langchain langchain-community langchain-huggingface \
    chromadb sentence-transformers \
    rank_bm25 \
    ollama \
    cachetools \
    tenacity \
    structlog

print("✅ Dependencies installed!")

In [None]:
import os
import time
import json
import hashlib
import logging
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass, field, asdict
from datetime import datetime
from functools import wraps
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import threading

import numpy as np
from cachetools import TTLCache, LRUCache
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import structlog

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from sentence_transformers import CrossEncoder
from rank_bm25 import BM25Okapi
import ollama

import torch
import gc

print(f"CUDA: {torch.cuda.is_available()}")

---

## Part 2: Production RAG Architecture

### System Design

```
┌─────────────────────────────────────────────────────────────────────────┐
│                          PRODUCTION RAG SYSTEM                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────┐     ┌──────────┐     ┌─────────┐     ┌──────────┐          │
│  │ Request │ ──► │ Validate │ ──► │  Cache  │ ──► │ Retrieve │          │
│  └─────────┘     └──────────┘     │  Check  │     │ + Rerank │          │
│                                   └─────────┘     └──────────┘          │
│                                        │               │                │
│                                   [Cache Hit]     [Cache Miss]          │
│                                        ▼               ▼                │
│                                   ┌─────────┐     ┌──────────┐          │
│                                   │ Return  │ ◄── │ Generate │          │
│                                   │ Cached  │     │ + Cache  │          │
│                                   └─────────┘     └──────────┘          │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                       MONITORING & LOGGING                        │  │
│  │  • Request metrics  • Latency tracking  • Error rates             │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
```
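
The flow in the diagram can also be read as a short function. The sketch below uses stub callables (`retrieve`, `generate`) and a plain dict as the cache; all of these names are placeholders for illustration, and the full implementation is in Part 4.

```python
from typing import Callable, Dict, List

def handle_request(
    query: str,
    cache: Dict[str, str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    max_len: int = 1000,
) -> str:
    # 1. Validate
    if not query.strip() or len(query) > max_len:
        return "Error: invalid query"
    # 2. Cache check (normalized key so trivial variations still hit)
    key = query.lower().strip()
    if key in cache:
        return cache[key]
    # 3. Retrieve (+ rerank), then 4. generate and cache
    contexts = retrieve(query)
    answer = generate(query, contexts)
    cache[key] = answer
    return answer

retrieve_calls = []

def fake_retrieve(q: str) -> List[str]:
    retrieve_calls.append(q)
    return ["some context"]

def fake_generate(q: str, contexts: List[str]) -> str:
    return f"answer to {q!r} from {len(contexts)} context(s)"

cache: Dict[str, str] = {}
first = handle_request("What is RAG?", cache, fake_retrieve, fake_generate)
second = handle_request("  what is rag? ", cache, fake_retrieve, fake_generate)
print(first == second, len(retrieve_calls))
```

The second call is served from the cache, so the retriever stub runs only once.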

---

## Part 3: Core Components

In [None]:
# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logging.basicConfig(level=logging.INFO)
logger = structlog.get_logger()

In [None]:
@dataclass
class RAGConfig:
    """Configuration for Production RAG System."""
    # Model settings
    embedding_model: str = "BAAI/bge-large-en-v1.5"
    reranker_model: str = "BAAI/bge-reranker-large"
    llm_model: str = "llama3.1:8b"
    
    # Retrieval settings
    chunk_size: int = 512
    chunk_overlap: int = 50
    first_stage_k: int = 50
    final_k: int = 5
    use_reranking: bool = True
    use_hybrid_search: bool = True
    hybrid_alpha: float = 0.5
    
    # Cache settings
    cache_ttl_seconds: int = 3600  # 1 hour
    cache_max_size: int = 1000
    
    # Timeouts
    retrieval_timeout_s: float = 5.0
    generation_timeout_s: float = 30.0
    
    # Quality thresholds
    min_similarity_score: float = 0.3
    max_query_length: int = 1000
    
    # Retry settings
    max_retries: int = 3
    retry_delay_s: float = 1.0


@dataclass
class RAGResponse:
    """Structured response from RAG system."""
    query: str
    answer: str
    sources: List[Dict[str, Any]]
    metadata: Dict[str, Any] = field(default_factory=dict)
    
    @property
    def success(self) -> bool:
        return bool(self.answer and not self.answer.startswith("Error"))


@dataclass
class RAGMetrics:
    """Metrics for monitoring."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    cache_hits: int = 0
    cache_misses: int = 0
    total_latency_ms: float = 0
    
    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.successful_requests / self.total_requests
    
    @property
    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        if total == 0:
            return 0.0
        return self.cache_hits / total
    
    @property
    def avg_latency_ms(self) -> float:
        # Latency is only recorded for successful requests, so average over those
        if self.successful_requests == 0:
            return 0.0
        return self.total_latency_ms / self.successful_requests
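
`RAGMetrics` tracks only the mean latency, while the requirement is stated as a P95. One possible extension, sketched here with a hypothetical `LatencyTracker` class that is not part of the lab code, keeps a bounded window of recent latencies and reports nearest-rank percentiles from it.

```python
import math
import threading
from collections import deque

class LatencyTracker:
    """Keeps the most recent `window` latencies and reports percentiles."""

    def __init__(self, window: int = 1000):
        self._samples = deque(maxlen=window)  # old samples fall off automatically
        self._lock = threading.Lock()

    def record(self, latency_ms: float) -> None:
        with self._lock:
            self._samples.append(latency_ms)

    def percentile(self, pct: float) -> float:
        with self._lock:
            ordered = sorted(self._samples)
        if not ordered:
            return 0.0
        rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based nearest rank
        return ordered[rank - 1]

tracker = LatencyTracker(window=5)
for ms in [100, 200, 300, 400, 500, 600]:
    tracker.record(ms)  # the first sample (100) is evicted by the sixth

print(tracker.percentile(50), tracker.percentile(95))
```

A bounded window keeps memory constant and makes the percentile reflect recent behaviour rather than the whole process lifetime.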

---

## Part 4: Production RAG Implementation

In [None]:
class ProductionRAG:
    """
    Production-ready RAG system with:
    - Hybrid search (dense + sparse)
    - Cross-encoder reranking
    - Caching
    - Error handling with retries
    - Logging and monitoring
    """
    
    def __init__(self, config: RAGConfig = None):
        self.config = config or RAGConfig()
        self.logger = structlog.get_logger().bind(component="ProductionRAG")
        self.metrics = RAGMetrics()
        self._lock = threading.Lock()
        
        # Caches
        self.query_cache = TTLCache(
            maxsize=self.config.cache_max_size,
            ttl=self.config.cache_ttl_seconds
        )
        self.embedding_cache = LRUCache(maxsize=10000)
        
        # Models (loaded lazily)
        self._embedding_model = None
        self._reranker = None
        self._vectorstore = None
        self._bm25 = None
        self._documents = None
        
        self.logger.info("ProductionRAG initialized", config=asdict(self.config))
    
    def load_documents(self, documents: List[Document]):
        """
        Load and index documents.
        """
        self.logger.info("Loading documents", count=len(documents))
        start = time.time()
        
        # Chunk documents
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.config.chunk_size,
            chunk_overlap=self.config.chunk_overlap
        )
        self._documents = splitter.split_documents(documents)
        
        # Load embedding model
        self._embedding_model = HuggingFaceEmbeddings(
            model_name=self.config.embedding_model,
            model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
            encode_kwargs={"normalize_embeddings": True, "batch_size": 32}
        )
        
        # Build vector store
        import shutil
        db_path = "./production_chroma_db"
        if Path(db_path).exists():
            shutil.rmtree(db_path)
        
        self._vectorstore = Chroma.from_documents(
            documents=self._documents,
            embedding=self._embedding_model,
            persist_directory=db_path
        )
        
        # Build BM25 index for hybrid search
        if self.config.use_hybrid_search:
            tokenized = [doc.page_content.lower().split() for doc in self._documents]
            self._bm25 = BM25Okapi(tokenized)
        
        # Load reranker
        if self.config.use_reranking:
            self._reranker = CrossEncoder(
                self.config.reranker_model,
                device="cuda" if torch.cuda.is_available() else "cpu"
            )
        
        elapsed = time.time() - start
        self.logger.info(
            "Documents loaded",
            chunks=len(self._documents),
            elapsed_s=elapsed
        )
    
    def _get_cache_key(self, query: str) -> str:
        """Generate cache key for query."""
        return hashlib.md5(query.lower().strip().encode()).hexdigest()
    
    def _validate_query(self, query: str) -> Tuple[bool, str]:
        """
        Validate query before processing.
        Returns (is_valid, error_message).
        """
        if not query or not query.strip():
            return False, "Query cannot be empty"
        
        if len(query) > self.config.max_query_length:
            return False, f"Query too long (max {self.config.max_query_length} chars)"
        
        return True, ""
    
    def _dense_retrieve(self, query: str, k: int) -> List[Tuple[Document, float]]:
        """Dense retrieval using embeddings."""
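        results = self._vectorstore.similarity_search_with_score(query, k=k)
        # Chroma returns a distance (squared L2 by default), where lower is better.
        # For normalized embeddings, cosine similarity = 1 - distance / 2.
        return [(doc, 1 - score / 2) for doc, score in results]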
    
    def _sparse_retrieve(self, query: str, k: int) -> List[Tuple[Document, float]]:
        """Sparse retrieval using BM25."""
        tokenized_query = query.lower().split()
        scores = self._bm25.get_scores(tokenized_query)
        top_indices = np.argsort(scores)[-k:][::-1]
        return [(self._documents[i], scores[i]) for i in top_indices if scores[i] > 0]
    
    def _hybrid_retrieve(self, query: str, k: int) -> List[Tuple[Document, float]]:
        """Hybrid retrieval combining dense and sparse."""
        dense_results = self._dense_retrieve(query, k)
        sparse_results = self._sparse_retrieve(query, k)
        
        # RRF fusion
        rrf_k = 60
        doc_scores = {}
        
        # Key by chunk text: dense hits come back from Chroma as new Document
        # objects, so id(doc) would never match the BM25 copy of the same chunk.
        for rank, (doc, _) in enumerate(dense_results):
            doc_id = doc.page_content
            doc_scores[doc_id] = doc_scores.get(doc_id, {"doc": doc, "score": 0})
            doc_scores[doc_id]["score"] += self.config.hybrid_alpha / (rrf_k + rank + 1)
        
        for rank, (doc, _) in enumerate(sparse_results):
            doc_id = doc.page_content
            doc_scores[doc_id] = doc_scores.get(doc_id, {"doc": doc, "score": 0})
            doc_scores[doc_id]["score"] += (1 - self.config.hybrid_alpha) / (rrf_k + rank + 1)
        
        sorted_results = sorted(doc_scores.values(), key=lambda x: -x["score"])
        return [(r["doc"], r["score"]) for r in sorted_results[:k]]
    
    def _rerank(self, query: str, candidates: List[Tuple[Document, float]]) -> List[Tuple[Document, float]]:
        """Rerank candidates using cross-encoder."""
        if not candidates:
            return []
        
        pairs = [[query, doc.page_content] for doc, _ in candidates]
        scores = self._reranker.predict(pairs, batch_size=32, show_progress_bar=False)
        
        reranked = sorted(
            zip([doc for doc, _ in candidates], scores),
            key=lambda x: -x[1]
        )
        
        return reranked
    
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        retry=retry_if_exception_type(Exception)
    )
    def _generate_answer(self, query: str, contexts: List[str]) -> str:
        """
        Generate answer using LLM with retry logic.
        """
        if not contexts:
            return "I couldn't find relevant information to answer your question."
        
        context_str = "\n\n---\n\n".join(contexts[:self.config.final_k])
        
        prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, acknowledge this.
Be concise and accurate.

CONTEXT:
{context_str}

QUESTION: {query}

ANSWER:"""
        
        response = ollama.chat(
            model=self.config.llm_model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": 0.1}
        )
        
        return response["message"]["content"]
    
    def query(self, query: str) -> RAGResponse:
        """
        Main query method with full production handling.
        """
        request_id = hashlib.md5(f"{query}{time.time()}".encode()).hexdigest()[:8]
        start_time = time.time()
        
        self.logger.info("Query received", request_id=request_id, query=query[:100])
        
        with self._lock:
            self.metrics.total_requests += 1
        
        # Validate query
        is_valid, error_msg = self._validate_query(query)
        if not is_valid:
            self.logger.warning("Invalid query", request_id=request_id, error=error_msg)
            with self._lock:
                self.metrics.failed_requests += 1
            return RAGResponse(
                query=query,
                answer=f"Error: {error_msg}",
                sources=[],
                metadata={"request_id": request_id, "error": error_msg}
            )
        
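        # Check cache. cachetools caches are not thread-safe, so read under the lock.
        cache_key = self._get_cache_key(query)
        with self._lock:
            cached = self.query_cache.get(cache_key)
        if cached is not None:
            elapsed_ms = (time.time() - start_time) * 1000
            with self._lock:
                self.metrics.cache_hits += 1
                self.metrics.successful_requests += 1
                self.metrics.total_latency_ms += elapsed_ms
            self.logger.info("Cache hit", request_id=request_id)
            # Return a copy so the cached entry is not mutated
            return RAGResponse(
                query=query,
                answer=cached.answer,
                sources=cached.sources,
                metadata={**cached.metadata, "request_id": request_id,
                          "latency_ms": elapsed_ms, "cached": True},
            )
        
        with self._lock:
            self.metrics.cache_misses += 1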
        
        try:
            # Retrieve
            if self.config.use_hybrid_search:
                candidates = self._hybrid_retrieve(query, self.config.first_stage_k)
            else:
                candidates = self._dense_retrieve(query, self.config.first_stage_k)
            
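            # Filter low-quality results. Hybrid search returns RRF scores, which
            # never exceed 1/61, so the similarity threshold only applies to
            # dense-only retrieval; otherwise every candidate would be dropped.
            if not self.config.use_hybrid_search:
                candidates = [(doc, score) for doc, score in candidates
                              if score >= self.config.min_similarity_score]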
            
            # Rerank
            if self.config.use_reranking and candidates:
                candidates = self._rerank(query, candidates)
            
            # Extract contexts and sources
            contexts = [doc.page_content for doc, _ in candidates[:self.config.final_k]]
            sources = [
                {
                    "source": doc.metadata.get("source", "unknown"),
                    "score": float(score),
                    "preview": doc.page_content[:100] + "..."
                }
                for doc, score in candidates[:self.config.final_k]
            ]
            
            # Generate answer
            answer = self._generate_answer(query, contexts)
            
            # Build response
            elapsed_ms = (time.time() - start_time) * 1000
            response = RAGResponse(
                query=query,
                answer=answer,
                sources=sources,
                metadata={
                    "request_id": request_id,
                    "latency_ms": elapsed_ms,
                    "cached": False,
                    "candidates_count": len(candidates)
                }
            )
            
            # Cache response (under the lock, since TTLCache is not thread-safe)
            with self._lock:
                self.query_cache[cache_key] = response
            
            with self._lock:
                self.metrics.successful_requests += 1
                self.metrics.total_latency_ms += elapsed_ms
            
            self.logger.info(
                "Query completed",
                request_id=request_id,
                latency_ms=elapsed_ms,
                sources_count=len(sources)
            )
            
            return response
            
        except Exception as e:
            self.logger.error(
                "Query failed",
                request_id=request_id,
                error=str(e)
            )
            with self._lock:
                self.metrics.failed_requests += 1
            
            return RAGResponse(
                query=query,
                answer="Error: An unexpected error occurred. Please try again.",
                sources=[],
                metadata={"request_id": request_id, "error": str(e)}
            )
    
    def get_metrics(self) -> Dict[str, Any]:
        """Get current metrics."""
        return {
            "total_requests": self.metrics.total_requests,
            "successful_requests": self.metrics.successful_requests,
            "failed_requests": self.metrics.failed_requests,
            "success_rate": self.metrics.success_rate,
            "cache_hits": self.metrics.cache_hits,
            "cache_hit_rate": self.metrics.cache_hit_rate,
            "avg_latency_ms": self.metrics.avg_latency_ms,
            "cache_size": len(self.query_cache)
        }
    
    def health_check(self) -> Dict[str, Any]:
        """Health check endpoint."""
        checks = {
            "vectorstore": self._vectorstore is not None,
            "embedding_model": self._embedding_model is not None,
            "reranker": self._reranker is not None or not self.config.use_reranking,
            "documents_loaded": self._documents is not None and len(self._documents) > 0,
        }
        
        # Test LLM connection
        try:
            ollama.chat(
                model=self.config.llm_model,
                messages=[{"role": "user", "content": "test"}],
                options={"num_predict": 1}
            )
            checks["llm"] = True
        except Exception:
            checks["llm"] = False
        
        return {
            "healthy": all(checks.values()),
            "checks": checks,
            "timestamp": datetime.now().isoformat()
        }
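
The fusion step in `_hybrid_retrieve` is a weighted Reciprocal Rank Fusion: each document receives `alpha / (60 + rank)` from the dense list plus `(1 - alpha) / (60 + rank)` from the sparse list, with ranks starting at 1. The standalone sketch below applies the same formula to two small ranked lists of made-up document IDs; `weighted_rrf` is an illustrative helper, not a method of the class.

```python
from typing import Dict, List, Tuple

def weighted_rrf(dense: List[str], sparse: List[str],
                 alpha: float = 0.5, rrf_k: int = 60) -> List[Tuple[str, float]]:
    """Fuse two ranked lists of document IDs; returns (id, score), best first."""
    scores: Dict[str, float] = {}
    for rank, doc_id in enumerate(dense, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rrf_k + rank)
    for rank, doc_id in enumerate(sparse, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

# "a" and "b" appear in both lists, "c" and "d" in only one
fused = weighted_rrf(dense=["a", "b", "c"], sparse=["b", "d", "a"])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents found by both retrievers ("b", then "a") rise to the top. Note the scale as well: even a document ranked first in both lists scores only 1/61 ≈ 0.016, so RRF scores cannot be compared against a cosine-similarity threshold such as `min_similarity_score`.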

---

## Part 5: Building and Testing

In [None]:
# Load documents
DOCS_PATH = Path("../data/sample_documents")

documents = []
for file_path in sorted(DOCS_PATH.glob("*.md")):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    documents.append(Document(
        page_content=content,
        metadata={"source": file_path.name}
    ))

print(f"📚 Loaded {len(documents)} documents")

In [None]:
# Initialize Production RAG
print("🔄 Initializing Production RAG System...")

config = RAGConfig(
    use_hybrid_search=True,
    use_reranking=True,
    first_stage_k=50,
    final_k=5,
    cache_ttl_seconds=3600
)

rag = ProductionRAG(config)
rag.load_documents(documents)

print("\n✅ Production RAG System ready!")

In [None]:
# Health check
health = rag.health_check()

print("\n🏥 Health Check:")
print(f"   Overall: {'✅ Healthy' if health['healthy'] else '❌ Unhealthy'}")
for check, status in health['checks'].items():
    print(f"   {check}: {'✅' if status else '❌'}")

In [None]:
# Test queries
test_queries = [
    "What is the memory capacity of DGX Spark?",
    "How does LoRA work?",
    "What are the benefits of hybrid search?",
]

print("\n🧪 Testing queries...")
print("=" * 70)

for query in test_queries:
    print(f"\n❓ Query: {query}")
    response = rag.query(query)
    
    print(f"\n💬 Answer: {response.answer[:300]}...")
    print(f"\n📚 Sources: {[s['source'] for s in response.sources]}")
    print(f"⏱️ Latency: {response.metadata.get('latency_ms', 0):.1f}ms")
    print("-" * 70)

In [None]:
# Test cache
print("\n🔄 Testing cache...")

# First query (cache miss)
response1 = rag.query("What is GPTQ?")
print(f"First query: cached={response1.metadata.get('cached', False)}, latency={response1.metadata.get('latency_ms', 0):.1f}ms")

# Second query (cache hit)
response2 = rag.query("What is GPTQ?")
print(f"Second query: cached={response2.metadata.get('cached', False)}")

metrics = rag.get_metrics()
print(f"\nCache hit rate: {metrics['cache_hit_rate']:.0%}")
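
The behaviour tested above depends on two things: queries are normalized before hashing, and entries expire after a TTL. The sketch below reproduces both with the standard library only. `TinyTTLCache` and its injectable `clock` parameter are hypothetical stand-ins for `cachetools.TTLCache`, written so that expiry can be demonstrated without waiting an hour.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Optional, Tuple

def cache_key(query: str) -> str:
    # Same normalization as _get_cache_key: case and outer whitespace are ignored
    return hashlib.md5(query.lower().strip().encode()).hexdigest()

class TinyTTLCache:
    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop the entry and report a miss
            return None
        return value

now = [0.0]  # fake clock that can be advanced by hand
cache = TinyTTLCache(ttl=3600, clock=lambda: now[0])
cache.set(cache_key("What is GPTQ?"), "cached answer")

hit = cache.get(cache_key("  what is gptq?  "))  # different spelling, same key
now[0] += 3601
miss = cache.get(cache_key("What is GPTQ?"))     # expired after one hour
print(hit, miss)
```

Normalizing more aggressively (for example, stripping punctuation) raises the hit rate but also raises the risk of returning an answer to a subtly different question.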

---

## Part 6: Throughput Benchmark

In [None]:
# Benchmark throughput
benchmark_queries = [
    "What is DGX Spark?",
    "Explain attention mechanism",
    "How does LoRA reduce memory?",
    "What is GPTQ quantization?",
    "Benefits of RAG?",
    "Compare ChromaDB and FAISS",
    "What are Tensor Cores?",
    "Explain positional encoding",
    "What is QLoRA?",
    "How does hybrid search work?",
]

print("\n⚡ Throughput Benchmark")
print("=" * 60)

# Sequential benchmark
start = time.time()
latencies = []

for query in benchmark_queries:
    q_start = time.time()
    response = rag.query(query)
    latencies.append((time.time() - q_start) * 1000)

total_time = time.time() - start

print(f"\n📊 Sequential Results ({len(benchmark_queries)} queries):")
print(f"   Total time: {total_time:.2f}s")
print(f"   Throughput: {len(benchmark_queries) / total_time:.2f} queries/sec")
print(f"   Avg latency: {np.mean(latencies):.0f}ms")
print(f"   P50 latency: {np.percentile(latencies, 50):.0f}ms")
print(f"   P95 latency: {np.percentile(latencies, 95):.0f}ms")
print(f"   P99 latency: {np.percentile(latencies, 99):.0f}ms")
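
The benchmark above is sequential, so it says little about the requirement of 100+ concurrent users. A concurrent variant could look like the sketch below. It uses a stub `fake_query` that only sleeps, so it runs anywhere; to benchmark the real system you would pass `rag.query` instead. The worker count and sleep time are arbitrary choices for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

def run_concurrent(query_fn: Callable[[str], object], queries: List[str],
                   workers: int = 8) -> Tuple[float, List[float]]:
    """Run queries on a thread pool; return (wall time in s, per-query latencies in ms)."""
    def timed(q: str) -> float:
        t0 = time.perf_counter()
        query_fn(q)
        return (time.perf_counter() - t0) * 1000

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(timed, queries))
    return time.perf_counter() - start, latencies

def fake_query(q: str) -> str:
    time.sleep(0.05)  # stand-in for retrieval + generation
    return f"answer: {q}"

queries = [f"question {i}" for i in range(16)]
wall_s, latencies_ms = run_concurrent(fake_query, queries, workers=8)
print(f"{len(queries) / wall_s:.1f} queries/sec, "
      f"max latency {max(latencies_ms):.0f}ms")
```

Threads help mainly when time is spent waiting on I/O such as the LLM server; how much the real pipeline gains depends on how that server handles parallel requests. Use distinct queries per run, otherwise the query cache dominates the result.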

In [None]:
# Get final metrics
final_metrics = rag.get_metrics()

print("\n" + "=" * 60)
print("📊 FINAL METRICS")
print("=" * 60)

print(f"\n📈 Request Statistics:")
print(f"   Total requests: {final_metrics['total_requests']}")
print(f"   Successful: {final_metrics['successful_requests']}")
print(f"   Failed: {final_metrics['failed_requests']}")
print(f"   Success rate: {final_metrics['success_rate']:.1%}")

print(f"\n💾 Cache Statistics:")
print(f"   Cache hits: {final_metrics['cache_hits']}")
print(f"   Cache hit rate: {final_metrics['cache_hit_rate']:.1%}")
print(f"   Cache size: {final_metrics['cache_size']} entries")

print(f"\n⏱️ Latency Statistics:")
print(f"   Average latency: {final_metrics['avg_latency_ms']:.0f}ms")

---

## Part 7: Edge Case Handling

In [None]:
# Test edge cases
print("\n🧪 Edge Case Testing")
print("=" * 60)

edge_cases = [
    ("", "Empty query"),
    ("   ", "Whitespace only"),
    ("a" * 2000, "Query too long"),
    ("What is the recipe for chocolate cake?", "Out of domain"),
    ("LPDDR5X", "Single word query"),
]

for query, description in edge_cases:
    print(f"\n🔹 Testing: {description}")
    response = rag.query(query)
    
    if response.success:
        print(f"   ✅ Success: {response.answer[:100]}...")
    else:
        print(f"   ⚠️ Handled: {response.answer}")
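
One edge case is not covered above: a slow dependency. `RAGConfig` defines `retrieval_timeout_s` and `generation_timeout_s`, but `ProductionRAG.query` does not enforce them. One way to do so, sketched here with a hypothetical `call_with_timeout` helper, is to run the slow call on a worker thread and stop waiting once the deadline passes. This only abandons the result; the worker thread keeps running until the call returns.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s: float, fallback, *args, **kwargs):
    """Return fn(*args, **kwargs), or `fallback` if it takes longer than timeout_s."""
    future = _pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # no effect once running, but skips work still queued
        return fallback

def slow_generate(query: str) -> str:
    time.sleep(0.5)  # stand-in for a slow LLM call
    return f"answer to {query}"

def fast_generate(query: str) -> str:
    return f"answer to {query}"

fallback_msg = "Error: The request timed out. Please try again."
slow_result = call_with_timeout(slow_generate, 0.1, fallback_msg, "q1")
fast_result = call_with_timeout(fast_generate, 0.1, fallback_msg, "q2")
print(slow_result)
print(fast_result)
```

Because abandoned calls still occupy a worker, a bounded pool also acts as a crude limit on how many slow requests can pile up.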

---

## ⚠️ Common Mistakes

### Mistake 1: No Error Handling
```python
# ❌ Wrong: Unhandled exceptions crash the service
def query(self, query):
    results = self.vectorstore.search(query)  # Can throw!
    return self.llm.generate(results)

# ✅ Right: Catch and handle gracefully
def query(self, query):
    try:
        results = self.vectorstore.search(query)
        return self.llm.generate(results)
    except Exception as e:
        logger.error("Query failed", error=str(e))
        return RAGResponse(query=query, answer="Error: An error occurred", sources=[])
```
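
Catching exceptions is only half of the story: transient failures are often worth retrying before giving up, which is what the `tenacity` decorator on `_generate_answer` does. The sketch below shows the same idea with the standard library so the mechanics are visible. `retry_with_backoff` and `flaky_call` are illustrative names, and the delays are shortened so the example runs quickly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(fn: Callable[[], T], attempts: int = 3,
                       base_delay_s: float = 0.01, max_delay_s: float = 0.1) -> T:
    """Call fn until it succeeds or `attempts` is exhausted, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller handle it
            time.sleep(min(base_delay_s * 2 ** (attempt - 1), max_delay_s))
    raise AssertionError("not reached")

calls = {"n": 0}

def flaky_call() -> str:
    # Fails twice, then succeeds, to simulate a transient outage
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = retry_with_backoff(flaky_call)
print(result, calls["n"])
```

In practice, retry only errors that are plausibly transient, such as timeouts or connection resets. Retrying on every `Exception` also repeats calls that can never succeed, for example a request for a model that is not installed.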

### Mistake 2: No Caching
```python
# ❌ Wrong: Compute everything every time
def query(self, query):
    return expensive_computation(query)

# ✅ Right: Cache results
def query(self, query):
    cache_key = hash(query)
    if cache_key in self.cache:
        return self.cache[cache_key]
    result = expensive_computation(query)
    self.cache[cache_key] = result
    return result
```

### Mistake 3: No Monitoring
```python
# ❌ Wrong: No visibility into system health
def query(self, query):
    return process(query)

# ✅ Right: Track metrics
def query(self, query):
    start = time.time()
    result = process(query)
    self.metrics.record(latency=time.time() - start)
    return result
```
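
Timing code repeated in every method gets noisy. A decorator keeps it in one place; the sketch below uses `functools.wraps` and a plain dict as the metrics store. `track_latency` and the dict layout are assumptions for illustration, not part of `RAGMetrics`.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict

def track_latency(metrics: Dict[str, Any]) -> Callable:
    """Decorator factory that records call count, failures, and total latency."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics["failures"] = metrics.get("failures", 0) + 1
                raise
            finally:
                # Runs on success and on failure alike
                metrics["calls"] = metrics.get("calls", 0) + 1
                elapsed_ms = (time.perf_counter() - start) * 1000
                metrics["total_ms"] = metrics.get("total_ms", 0.0) + elapsed_ms
        return wrapper
    return decorator

metrics: Dict[str, Any] = {}

@track_latency(metrics)
def process(query: str) -> str:
    if not query:
        raise ValueError("empty query")
    return query.upper()

process("hello")
try:
    process("")
except ValueError:
    pass  # the failure is still counted by the decorator

print(metrics["calls"], metrics["failures"])
```

The dict is not protected by a lock here; under concurrent use, guard the updates the same way `ProductionRAG` guards its counters.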

---

## 🎉 Checkpoint

You've built a production-ready RAG system with:
- ✅ Hybrid search (dense + sparse)
- ✅ Cross-encoder reranking
- ✅ Query caching with TTL
- ✅ Error handling with retries
- ✅ Structured logging
- ✅ Metrics and monitoring
- ✅ Health check endpoint
- ✅ Edge case handling

**Congratulations on completing Module 3.5!** 🎊

---

## 🧹 Cleanup

In [None]:
# Clean up
import shutil

del rag
gc.collect()
torch.cuda.empty_cache()

if Path("./production_chroma_db").exists():
    shutil.rmtree("./production_chroma_db")

print("✅ Cleanup complete!")

---

## Next Steps

Congratulations on completing Module 3.5: RAG Systems!

You've mastered:
1. Building RAG pipelines from scratch
2. Chunking strategies and their trade-offs
3. Vector databases (ChromaDB, FAISS, Qdrant)
4. Hybrid search (dense + sparse)
5. Reranking with cross-encoders
6. Evaluation with RAGAS metrics
7. Production-ready implementation

➡️ Continue to [Module 3.6: AI Agents & Agentic Systems](../../module-3.6-ai-agents/)