# Multi-Index Routing RAG with PDF Documents - Interactive Learning Notebook

## 🎯 Learning Objectives

In this notebook, you will learn:

1. **Multi-Index RAG Architecture** - How to route queries to domain-specific indexes
2. **PDF Processing** - Load and process real PDF documents
3. **Document Chunking Strategies** - Split documents effectively for retrieval
4. **LLM-Based Routing** - Use AI to intelligently route queries
5. **Citation Tracking** - Track sources and page numbers for answers
6. **Evaluation Metrics** - Measure system performance

## 📚 Dataset

We'll process 4 PDF documents from the `pdf_documents/` folder:
- **Legal/Compliance Domain**: 3 PDPA Advisory Guidelines
- **HR/Business Domain**: 1 Employee Handbook
- **General Domain**: Fallback route for other queries (none of the 4 PDFs is assigned to it)

## 🔧 Prerequisites

Make sure you have:
- OpenAI API key set as environment variable: `OPENAI_API_KEY`
- Python 3.8+
- Required packages (installed in the next cell)

---
## Part 1: Setup and Dependencies

First, let's install and import all required libraries.

In [None]:
# Install required packages
!pip install -q openai langchain langchain-openai langchain-community chromadb pypdf python-dotenv matplotlib plotly pandas

In [None]:
# Import libraries
import os
import time
from pathlib import Path
from typing import List, Dict, Literal, Tuple
from pydantic import BaseModel, Field
from dotenv import load_dotenv

# LangChain imports
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.utils.openai_functions import convert_pydantic_to_openai_function
from langchain_core.output_parsers.openai_functions import PydanticAttrOutputFunctionsParser

# Visualization
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

# Load environment variables
load_dotenv()

# Verify API key is set
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("⚠️ OPENAI_API_KEY not found in environment variables!")

print("✅ All dependencies loaded successfully!")
print(f"📁 Working directory: {os.getcwd()}")

---
## Part 2: Explore PDF Documents

Let's examine what PDFs we have and understand their structure.

In [None]:
# Define paths
PDF_DIR = Path("pdf_documents")

# List all PDF files
pdf_files = list(PDF_DIR.glob("*.pdf"))

print(f"📂 Found {len(pdf_files)} PDF documents:\n")
for i, pdf_path in enumerate(pdf_files, 1):
    file_size = pdf_path.stat().st_size / 1024  # KB
    print(f"{i}. {pdf_path.name}")
    print(f"   Size: {file_size:.1f} KB\n")

# Domain mapping based on file names
domain_mapping = {
    "legal": ["PDPA", "Advisory", "Guidelines", "Enforcement"],
    "hr": ["Employee", "Handbook"],
}

def classify_pdf(filename: str) -> str:
    """Classify PDF into domain based on filename."""
    for domain, keywords in domain_mapping.items():
        if any(keyword.lower() in filename.lower() for keyword in keywords):
            return domain
    return "general"

# Classify documents
classified_docs = {}
for pdf_path in pdf_files:
    domain = classify_pdf(pdf_path.name)
    if domain not in classified_docs:
        classified_docs[domain] = []
    classified_docs[domain].append(pdf_path)

print("\n📊 Domain Classification:")
for domain, docs in classified_docs.items():
    print(f"\n{domain.upper()}: {len(docs)} document(s)")
    for doc in docs:
        print(f"  • {doc.name}")

---
## Part 3: Load PDF Documents

Now let's load the PDF documents with proper metadata tagging.

In [None]:
def load_pdf_with_metadata(pdf_path: Path, domain: str) -> List[Document]:
    """
    Load a PDF file and add domain metadata.
    
    Args:
        pdf_path: Path to PDF file
        domain: Domain classification (legal, hr, general)
    
    Returns:
        List of Document objects (one per page)
    """
    print(f"📄 Loading: {pdf_path.name}")
    
    loader = PyPDFLoader(str(pdf_path))
    documents = loader.load()
    
    # Add comprehensive metadata
    for doc in documents:
        doc.metadata["domain"] = domain
        doc.metadata["source_type"] = "pdf"
        doc.metadata["filename"] = pdf_path.name
        doc.metadata["file_path"] = str(pdf_path)
    
    print(f"   ✓ Loaded {len(documents)} pages")
    return documents

# Load all documents by domain
all_documents = {}
document_stats = {}

print("🔄 Loading all PDF documents...\n")

for domain, pdf_paths in classified_docs.items():
    domain_docs = []
    for pdf_path in pdf_paths:
        docs = load_pdf_with_metadata(pdf_path, domain)
        domain_docs.extend(docs)
    
    all_documents[domain] = domain_docs
    document_stats[domain] = {
        "num_files": len(pdf_paths),
        "num_pages": len(domain_docs),
        "total_chars": sum(len(doc.page_content) for doc in domain_docs)
    }

print("\n" + "="*60)
print("📈 DOCUMENT LOADING SUMMARY")
print("="*60)

for domain, stats in document_stats.items():
    print(f"\n{domain.upper()}:")
    print(f"  Files: {stats['num_files']}")
    print(f"  Pages: {stats['num_pages']}")
    print(f"  Characters: {stats['total_chars']:,}")
    # Guard against a domain with zero pages to avoid ZeroDivisionError
    print(f"  Avg chars/page: {stats['total_chars'] // max(stats['num_pages'], 1):,}")

---
## Part 4: Document Chunking Strategies

PDF pages can be very long. We'll split them into smaller chunks for better retrieval.

### 📝 Chunking Parameters:
- **chunk_size**: Maximum characters per chunk (1000)
- **chunk_overlap**: Characters shared between chunks (200)
- **separators**: Split at paragraphs, then lines, then sentences, then words, and finally individual characters
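
To make the interplay between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also honors the separators listed above); `sliding_window_chunks` is a hypothetical helper written only for illustration.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of one chunk are repeated at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.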

In [None]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split documents into chunks
chunked_documents = {}
chunk_stats = {}

print("✂️ Splitting documents into chunks...\n")

for domain, docs in all_documents.items():
    chunks = text_splitter.split_documents(docs)
    chunked_documents[domain] = chunks
    
    chunk_stats[domain] = {
        "num_chunks": len(chunks),
        "avg_chunk_size": sum(len(c.page_content) for c in chunks) / len(chunks) if chunks else 0,
        "min_chunk_size": min(len(c.page_content) for c in chunks) if chunks else 0,
        "max_chunk_size": max(len(c.page_content) for c in chunks) if chunks else 0,
    }
    
    print(f"{domain.upper()}: {len(docs)} pages → {len(chunks)} chunks")

print("\n" + "="*60)
print("📊 CHUNKING STATISTICS")
print("="*60)

for domain, stats in chunk_stats.items():
    print(f"\n{domain.upper()}:")
    print(f"  Total chunks: {stats['num_chunks']}")
    print(f"  Avg size: {stats['avg_chunk_size']:.0f} chars")
    print(f"  Range: {stats['min_chunk_size']:.0f} - {stats['max_chunk_size']:.0f} chars")

### 📊 Visualize Chunk Size Distribution

In [None]:
# Create visualization of chunk sizes
fig = make_subplots(
    rows=1, cols=len(chunked_documents),
    subplot_titles=[f"{domain.upper()}" for domain in chunked_documents.keys()],
    specs=[[{"type": "histogram"}] * len(chunked_documents)]
)

for i, (domain, chunks) in enumerate(chunked_documents.items(), 1):
    chunk_sizes = [len(chunk.page_content) for chunk in chunks]
    
    fig.add_trace(
        go.Histogram(
            x=chunk_sizes,
            name=domain.upper(),
            nbinsx=20,
            marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1'][i-1]
        ),
        row=1, col=i
    )

fig.update_layout(
    title_text="Chunk Size Distribution by Domain",
    showlegend=False,
    height=400
)
fig.update_xaxes(title_text="Chunk Size (characters)")
fig.update_yaxes(title_text="Frequency")

fig.show()

print("\n💡 Check: most chunks should sit near the 1000-character limit; many very short chunks usually mean pages with little extractable text.")

---
## Part 5: Create Vector Indexes

We'll create separate vector stores for each domain using ChromaDB.

In [None]:
# Initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

print("🔄 Creating vector indexes...\n")

# Create vector stores
vectorstores = {}
retrievers = {}

for domain, chunks in chunked_documents.items():
    if not chunks:
        print(f"⚠️ Skipping {domain} - no chunks available")
        continue
    
    print(f"📦 Creating {domain.upper()} vector store...")
    
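    # Create vector store with persistence.
    # Note: re-running this cell adds the same chunks to the persisted collection again;
    # delete ./chroma_db/ first if you want a clean rebuild.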
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=f"{domain}_docs",
        persist_directory=f"./chroma_db/{domain}"
    )
    
    vectorstores[domain] = vectorstore
    retrievers[domain] = vectorstore.as_retriever(
        search_kwargs={"k": 3}  # Retrieve top 3 chunks
    )
    
    print(f"   ✓ Indexed {len(chunks)} chunks")

print(f"\n✅ Created {len(vectorstores)} domain-specific indexes!")
print(f"📁 Vector stores persisted to: ./chroma_db/")

---
## Part 6: LLM-Based Query Router

The router decides which domain index to query based on the question.
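
Before relying on an LLM for this decision, it can help to have a cheap baseline to compare against. The sketch below is a keyword-count router, not the LLM-based router built in the next cell; the keyword lists and the `keyword_route` helper are assumptions made up for illustration.

```python
from typing import Dict, List

# Hypothetical keyword lists; tune them to your own documents.
ROUTE_KEYWORDS: Dict[str, List[str]] = {
    "legal": ["pdpa", "personal data", "consent", "penalt", "compliance"],
    "hr": ["employee", "vacation", "leave", "dress code", "overtime", "benefit"],
}

def keyword_route(question: str, default: str = "general") -> str:
    """Pick the domain whose keywords match most often; fall back to `default`."""
    text = question.lower()
    scores = {
        domain: sum(1 for keyword in keywords if keyword in text)
        for domain, keywords in ROUTE_KEYWORDS.items()
    }
    best_domain = max(scores, key=scores.get)  # ties go to the first domain listed
    return best_domain if scores[best_domain] > 0 else default

for q in [
    "What are the penalties for PDPA violations?",
    "What is the company's vacation policy?",
    "How do I reset my laptop password?",
]:
    print(f"{q} -> {keyword_route(q)}")
```

A baseline like this breaks down on paraphrases and ambiguous wording, which is the gap the LLM router is meant to close. Running both on the same questions shows where the extra latency of the LLM call buys better routing.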

In [None]:
# Define routing schema
class RouteQuery(BaseModel):
    """Route a user query to the most relevant domain-specific index."""
    
    datasource: Literal["legal", "hr", "general"] = Field(
        ...,
        description="""
        Choose the most relevant datasource for the query:
        - legal: Data privacy, PDPA, enforcement guidelines, compliance, personal data protection
        - hr: Employee handbook, workplace policies, HR procedures, benefits, employee conduct
        - general: Questions that don't fit legal or HR categories
        """
    )
    
    confidence: float = Field(
        ...,
        description="Confidence score between 0 and 1 for this routing decision",
        ge=0.0,
        le=1.0
    )

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

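# Create router chain: the prompt turns {"question": ...} into chat messages
# (a chat model cannot be invoked with a plain dict), the LLM is forced to call
# RouteQuery, and the parser extracts the `datasource` field.
router_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at routing user questions to the most relevant datasource."),
    ("human", "{question}"),
])
router_function = convert_pydantic_to_openai_function(RouteQuery)
router_chain = (
    router_prompt
    | llm.bind(
        functions=[router_function],
        function_call={"name": "RouteQuery"}
    )
    | PydanticAttrOutputFunctionsParser(
        pydantic_schema=RouteQuery,
        attr_name="datasource"
    )
)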

print("✅ Router configured successfully!")
print("\n🔀 Available routes:")
for route in retrievers.keys():
    print(f"  • {route.upper()}")

### Test the Router

In [None]:
# Test routing with sample queries
test_routing_queries = [
    "What are the penalties for PDPA violations?",
    "What is the company's vacation policy?",
    "How should personal data be collected?",
    "What are the dress code requirements?"
]

print("🧪 Testing Router with Sample Queries\n")
print("="*70)

routing_results = []
for query in test_routing_queries:
    route = router_chain.invoke({"question": query})
    routing_results.append({"query": query, "route": route})
    print(f"❓ {query}")
    print(f"   ➜ Routed to: {route.upper()}\n")

print("✅ Router test complete!")

---
## Part 7: Complete Multi-Index RAG Pipeline

Now let's build the complete pipeline with citation tracking.
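
Citation tracking mostly comes down to carrying `filename` and `page` metadata through the pipeline and presenting it cleanly. The sketch below collapses a list of source dictionaries, shaped like the `sources` list the pipeline returns, into de-duplicated citations. It assumes the `page` value is a 0-based index (as `PyPDFLoader` typically stores it); `unique_citations` and the sample filename are hypothetical.

```python
from typing import Any, Dict, List

def unique_citations(sources: List[Dict[str, Any]]) -> List[str]:
    """Collapse retrieved-chunk sources into de-duplicated, readable citations."""
    seen = set()
    citations = []
    for src in sources:
        filename = src.get("filename") or "Unknown"
        page = src.get("page")
        # Assumption: `page` is a 0-based index, so add 1 for display.
        page_label = str(page + 1) if isinstance(page, int) else "?"
        key = (filename, page_label)
        if key not in seen:  # several chunks can come from the same page
            seen.add(key)
            citations.append(f"{filename}, p. {page_label}")
    return citations

example_sources = [
    {"filename": "handbook.pdf", "page": 0, "content_preview": "sample text"},
    {"filename": "handbook.pdf", "page": 0, "content_preview": "more sample text"},
    {"filename": "handbook.pdf", "page": 4, "content_preview": "sample text"},
    {"filename": None, "page": None, "content_preview": "sample text"},
]
example_citations = unique_citations(example_sources)
print(example_citations)
```

De-duplicating matters because overlapping chunks from the same page are often retrieved together, and repeating the same citation three times adds noise without adding evidence.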

In [None]:
# RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant answering questions based on PDF documents.

Context from PDF documents:
{context}

Question: {question}

Instructions:
1. Provide a detailed, accurate answer based ONLY on the information in the PDF documents above
2. If the answer isn't in the documents, clearly state that
3. Be specific and cite relevant details from the documents
4. Keep the answer concise but comprehensive

Answer:""")

def format_docs_with_citations(docs: List[Document]) -> str:
    """Format documents with source information for citations."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("filename", "Unknown")
        # Note: PyPDFLoader stores a 0-based page index, so page 0 is the first page
        page = doc.metadata.get("page", "?")
        formatted.append(
            f"[Source {i}: {source}, Page {page}]\n{doc.page_content}"
        )
    return "\n\n".join(formatted)

def multi_index_rag(question: str, verbose: bool = True) -> Dict:
    """
    Complete Multi-Index RAG pipeline with routing, retrieval, and generation.
    
    Args:
        question: User's question
        verbose: Print detailed information
    
    Returns:
        Dictionary with question, route, answer, sources, and metrics
    """
    start_time = time.time()
    
    if verbose:
        print(f"\n{'='*70}")
        print(f"❓ Question: {question}")
        print(f"{'='*70}")
    
    # Step 1: Route query
    route_start = time.time()
    selected_route = router_chain.invoke({"question": question})
    route_time = time.time() - route_start
    
    if verbose:
        print(f"\n🔀 Routing Decision: {selected_route.upper()}")
        print(f"   Time: {route_time:.3f}s")
    
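    # Step 2: Retrieve from selected index
    retrieval_start = time.time()
    if selected_route in retrievers:
        retrieved_docs = retrievers[selected_route].invoke(question)
    else:
        # No index exists for this route (e.g. "general" has no PDFs),
        # so query every available index instead of raising a KeyError.
        retrieved_docs = [
            doc
            for retriever in retrievers.values()
            for doc in retriever.invoke(question)
        ]
    retrieval_time = time.time() - retrieval_start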
    
    if verbose:
        print(f"\n📚 Retrieved {len(retrieved_docs)} relevant chunks")
        print(f"   Time: {retrieval_time:.3f}s")
        print(f"\n📄 Sources:")
        for doc in retrieved_docs:
            source = doc.metadata.get("filename", "Unknown")
            page = doc.metadata.get("page", "?")
            preview = doc.page_content[:100].replace('\n', ' ')
            print(f"   • {source} (Page {page})")
            print(f"     Preview: {preview}...")
    
    # Step 3: Generate answer
    generation_start = time.time()
    context = format_docs_with_citations(retrieved_docs)
    rag_chain = rag_prompt | llm | StrOutputParser()
    answer = rag_chain.invoke({
        "context": context,
        "question": question
    })
    generation_time = time.time() - generation_start
    
    total_time = time.time() - start_time
    
    if verbose:
        print(f"\n💡 Answer:\n{answer}")
        print(f"\n⏱️ Performance:")
        print(f"   Routing: {route_time:.3f}s")
        print(f"   Retrieval: {retrieval_time:.3f}s")
        print(f"   Generation: {generation_time:.3f}s")
        print(f"   Total: {total_time:.3f}s")
    
    return {
        "question": question,
        "route": selected_route,
        "answer": answer,
        "sources": [
            {
                "filename": doc.metadata.get("filename"),
                "page": doc.metadata.get("page"),
                "content_preview": doc.page_content[:200]
            }
            for doc in retrieved_docs
        ],
        "metrics": {
            "route_time": route_time,
            "retrieval_time": retrieval_time,
            "generation_time": generation_time,
            "total_time": total_time,
            "num_chunks_retrieved": len(retrieved_docs)
        }
    }

print("✅ RAG pipeline ready!")

---
## Part 8: Test with Real Queries

Let's test the system with questions relevant to our documents.

In [None]:
# Define test questions
test_questions = [
    # Legal/PDPA questions
    "What are the key obligations for organizations under PDPA?",
    "What are the penalties for data protection violations?",
    "How should consent be obtained for collecting personal data?",
    
    # HR/Employee Handbook questions
    "What are the employee benefits mentioned in the handbook?",
    "What is the policy on working hours and overtime?",
    "What are the grounds for employee termination?",
]

# Run queries and collect results
print("\n" + "="*70)
print("🧪 TESTING MULTI-INDEX RAG SYSTEM")
print("="*70)

all_results = []
for question in test_questions:
    result = multi_index_rag(question, verbose=True)
    all_results.append(result)
    print("\n" + "-"*70)

---
## Part 9: Visualize Routing Decisions

Let's visualize how queries were routed across domains.

In [None]:
# Analyze routing distribution
routing_distribution = {}
for result in all_results:
    route = result["route"]
    routing_distribution[route] = routing_distribution.get(route, 0) + 1

# Create visualization
fig = go.Figure()

fig.add_trace(go.Bar(
    x=list(routing_distribution.keys()),
    y=list(routing_distribution.values()),
    marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1'],
    text=list(routing_distribution.values()),
    textposition='auto',
))

fig.update_layout(
    title="Query Routing Distribution",
    xaxis_title="Domain",
    yaxis_title="Number of Queries",
    height=400
)

fig.show()

# Print routing summary
print("\n📊 Routing Summary:")
for route, count in routing_distribution.items():
    percentage = (count / len(all_results)) * 100
    print(f"  {route.upper()}: {count} queries ({percentage:.1f}%)")

---
## Part 10: Performance Metrics & Evaluation

Let's analyze the system's performance.
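
Latency is only one side of performance: a fast router that picks the wrong index still produces poor answers. Measuring routing accuracy needs expected routes labeled by hand. The sketch below assumes such labels exist; the `(expected, predicted)` pairs are hypothetical, and in this notebook the predicted value would come from `result["route"]`.

```python
from collections import Counter
from typing import Dict, List, Tuple

def routing_accuracy(
    predictions: List[Tuple[str, str]],
) -> Tuple[float, Dict[Tuple[str, str], int]]:
    """Return overall accuracy and a count of (expected, predicted) mistakes."""
    if not predictions:
        return 0.0, {}
    correct = sum(1 for expected, predicted in predictions if expected == predicted)
    confusions = Counter(
        (expected, predicted)
        for expected, predicted in predictions
        if expected != predicted
    )
    return correct / len(predictions), dict(confusions)

# Hypothetical (expected, predicted) pairs for four labeled questions.
labeled = [("legal", "legal"), ("legal", "legal"), ("hr", "hr"), ("hr", "legal")]
accuracy, mistakes = routing_accuracy(labeled)
print(f"Routing accuracy: {accuracy:.0%}")
print(f"Mistakes: {mistakes}")
```

The confusion counts show which domain descriptions in the routing schema need sharpening, which a single accuracy number cannot do.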

In [None]:
# Extract performance metrics
metrics_df = pd.DataFrame([
    {
        "Question": result["question"][:50] + "...",
        "Route": result["route"],
        "Route Time (s)": result["metrics"]["route_time"],
        "Retrieval Time (s)": result["metrics"]["retrieval_time"],
        "Generation Time (s)": result["metrics"]["generation_time"],
        "Total Time (s)": result["metrics"]["total_time"],
        "Chunks Retrieved": result["metrics"]["num_chunks_retrieved"]
    }
    for result in all_results
])

print("\n📈 PERFORMANCE METRICS")
print("="*70)
print(metrics_df.to_string(index=False))

# Calculate summary statistics
print("\n📊 SUMMARY STATISTICS")
print("="*70)
print(f"Average Total Time: {metrics_df['Total Time (s)'].mean():.3f}s")
print(f"Average Route Time: {metrics_df['Route Time (s)'].mean():.3f}s")
print(f"Average Retrieval Time: {metrics_df['Retrieval Time (s)'].mean():.3f}s")
print(f"Average Generation Time: {metrics_df['Generation Time (s)'].mean():.3f}s")

# Visualize time breakdown
time_components = [
    metrics_df['Route Time (s)'].mean(),
    metrics_df['Retrieval Time (s)'].mean(),
    metrics_df['Generation Time (s)'].mean()
]

fig = go.Figure(data=[go.Pie(
    labels=['Routing', 'Retrieval', 'Generation'],
    values=time_components,
    marker_colors=['#FF6B6B', '#4ECDC4', '#45B7D1']
)])

fig.update_layout(
    title="Average Time Distribution by Pipeline Stage",
    height=400
)

fig.show()

---
## Part 11: Interactive Testing Section

Now it's your turn! Try asking your own questions.

In [None]:
# Interactive query function
def ask_question(question: str):
    """Ask a question and get an answer with full tracking."""
    result = multi_index_rag(question, verbose=True)
    return result

# Example: Try your own question!
# Uncomment and modify the question below:

# my_question = "What are data breach notification requirements?"
# my_result = ask_question(my_question)

### 🎯 Exercise: Test Different Query Types

Try these different types of questions:

1. **Specific factual questions**: "What is the maximum financial penalty for PDPA violations?"
2. **Comparative questions**: "What's the difference between consent and deemed consent?"
3. **Policy questions**: "What should I do if an employee violates the code of conduct?"
4. **Edge cases**: "How do I handle customer complaints?" (tests routing)

Observe:
- Which domain gets selected
- How relevant the retrieved chunks are
- Answer quality and accuracy

---
## Part 12: Advanced Features & Improvements

Let's explore some advanced techniques to improve the system.

### 🔍 Feature 1: Metadata Filtering

Restrict retrieval to a specific source file. The same `filter` mechanism can target other metadata fields, such as `page`.

In [None]:
def rag_with_filter(question: str, domain: str, filename_filter: str = None):
    """
    RAG with metadata filtering.
    
    Args:
        question: Query
        domain: Domain to search
        filename_filter: Optional filename to filter by
    """
    retriever = vectorstores[domain].as_retriever(
        search_kwargs={
            "k": 3,
            "filter": {"filename": filename_filter} if filename_filter else None
        }
    )
    
    docs = retriever.invoke(question)
    print(f"📚 Retrieved {len(docs)} chunks from {domain}")
    for doc in docs:
        print(f"  • {doc.metadata['filename']} (Page {doc.metadata.get('page', '?')})")
    
    return docs

# Example: Search only in enforcement guidelines
# filtered_docs = rag_with_filter(
#     "What are the penalties?",
#     "legal",
#     "Advisory Guidelines on Enforcement of DP Provisions_1oct2022.pdf"
# )

### 📈 Feature 2: Confidence Scoring

Add confidence scores to answers based on retrieval scores.
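
Vector stores usually return a distance rather than a similarity, so the score has to be converted before it can act as a confidence. For unit-length vectors, squared Euclidean distance and cosine similarity are linked by ‖a − b‖² = 2 − 2·cos θ, so `1 - distance / 2` recovers the cosine similarity. The sketch below checks that identity numerically. It assumes the store reports squared L2 distances over normalized embeddings, which you should verify for your own configuration; all helper names are hypothetical.

```python
import math
from typing import List

def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_to_confidence(squared_distance: float) -> float:
    """Map a squared L2 distance between unit vectors to a 0-1 confidence."""
    return max(0.0, min(1.0, 1.0 - squared_distance / 2.0))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.0])
cosine = dot(a, b)
distance = squared_l2(a, b)
print(f"cosine similarity:   {cosine:.4f}")
print(f"squared L2 distance: {distance:.4f}")
print(f"confidence:          {distance_to_confidence(distance):.4f}")
```

The clamp matters because vectors pointing in opposite directions give distances above 2, which would otherwise produce a negative confidence.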

In [None]:
def rag_with_confidence(question: str) -> Dict:
    """
    RAG with confidence scoring based on retrieval similarity.
    """
    # Route query
    selected_route = router_chain.invoke({"question": question})
    
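    # Retrieve with scores
    vectorstore = vectorstores[selected_route]
    docs_with_scores = vectorstore.similarity_search_with_score(question, k=3)
    if not docs_with_scores:
        return {"confidence": 0.0, "docs": [], "scores": []}
    
    # Average distance score (lower is better). Chroma uses squared L2 distance by
    # default; for unit-length embeddings that equals 2 - 2 * cosine similarity.
    avg_score = sum(score for _, score in docs_with_scores) / len(docs_with_scores)
    
    # Convert to confidence: under the assumption above, 1 - distance/2 recovers
    # the cosine similarity, which is then clamped to the 0-1 range
    confidence = max(0, min(1, 1 - (avg_score / 2)))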
    
    print(f"🎯 Confidence Score: {confidence:.2f}")
    print(f"📊 Retrieval Scores: {[f'{score:.3f}' for _, score in docs_with_scores]}")
    
    return {
        "confidence": confidence,
        "docs": [doc for doc, _ in docs_with_scores],
        "scores": [score for _, score in docs_with_scores]
    }

# Example usage:
# result = rag_with_confidence("What are the data protection principles?")

### 🔄 Feature 3: Fallback Strategy

If confidence is low, try searching multiple domains.
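
When the fallback searches several indexes, the hits have to be merged into one ranked list. The sketch below pools `(text, distance)` pairs per domain and keeps the closest `k`. It assumes every index uses the same embedding model and distance metric, since otherwise the scores are not comparable; `merge_domain_hits` and the sample hits are hypothetical.

```python
from typing import Dict, List, Tuple

ScoredHit = Tuple[str, float]  # (chunk text, distance); lower distance is better

def merge_domain_hits(
    hits_by_domain: Dict[str, List[ScoredHit]], k: int = 3
) -> List[Tuple[str, str, float]]:
    """Pool hits from every domain and keep the k closest, tagged with their domain."""
    pooled = [
        (domain, text, distance)
        for domain, hits in hits_by_domain.items()
        for text, distance in hits
    ]
    pooled.sort(key=lambda hit: hit[2])  # ascending distance
    return pooled[:k]

# Hypothetical hits, as if returned by a scored similarity search per domain.
example_hits = {
    "legal": [("Consent must be obtained", 0.42), ("Penalties include fines", 0.90)],
    "hr": [("Employees must report hazards", 0.31)],
}
merged = merge_domain_hits(example_hits, k=2)
for domain, text, distance in merged:
    print(f"{domain}: {text} ({distance:.2f})")
```

Ranking by score lets a strongly matching domain contribute more than one chunk, whereas taking a fixed number from each domain gives weak matches equal weight.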

In [None]:
def rag_with_fallback(question: str, confidence_threshold: float = 0.5):
    """
    RAG with fallback to other domains if confidence is low.
    """
    # Try primary route (rag_with_confidence performs the routing itself,
    # so there is no need to call the router separately here)
    primary_result = rag_with_confidence(question)
    
    if primary_result["confidence"] < confidence_threshold:
        print(f"⚠️ Low confidence ({primary_result['confidence']:.2f}), trying other domains...")
        
        # Search every domain index (including the primary one), one chunk each
        all_docs = []
        for domain in vectorstores.keys():
            docs = vectorstores[domain].similarity_search(question, k=1)
            all_docs.extend(docs)
        
        print(f"📚 Expanded search: retrieved {len(all_docs)} chunks from all domains")
        return all_docs
    
    return primary_result["docs"]

# Example:
# docs = rag_with_fallback("Tell me about workplace safety", confidence_threshold=0.6)

---
## Part 13: Evaluation Framework

Let's create a simple evaluation framework to assess answer quality.
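
LLM judges do not always follow the requested output format exactly, so score parsing should tolerate small variations and reject impossible values. The sketch below is one way to do that; `parse_scores` and the sample evaluation text are hypothetical and independent of the evaluation cell that follows.

```python
import re
from typing import Dict

CRITERIA = ["ACCURACY", "COMPLETENESS", "RELEVANCE", "CLARITY", "OVERALL"]

def parse_scores(evaluation_text: str) -> Dict[str, int]:
    """Extract 'CRITERION: X/5' scores, ignoring case and out-of-range values."""
    scores = {}
    for criterion in CRITERIA:
        match = re.search(
            rf"{criterion}\s*:\s*(\d+)\s*/\s*5", evaluation_text, re.IGNORECASE
        )
        if match:
            value = int(match.group(1))
            if 1 <= value <= 5:  # drop scores outside the 1-5 scale
                scores[criterion.lower()] = value
    return scores

sample_evaluation = """ACCURACY: 4/5
Completeness: 3 / 5
RELEVANCE: 5/5
CLARITY: 9/5
Brief explanation: The answer is mostly grounded in the context."""
parsed = parse_scores(sample_evaluation)
print(parsed)
```

Criteria that are missing or invalid are left out of the result instead of being set to 0, so a formatting slip by the judge does not drag the averages down.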

In [None]:
# Define evaluation criteria
eval_prompt = ChatPromptTemplate.from_template("""
You are an expert evaluator assessing the quality of RAG system answers.

Question: {question}
Answer: {answer}
Retrieved Context: {context}

Evaluate the answer on these criteria (score 1-5 for each):

1. ACCURACY: Is the answer factually correct based on the context?
2. COMPLETENESS: Does the answer fully address the question?
3. RELEVANCE: Is the answer relevant to the question?
4. CLARITY: Is the answer clear and well-structured?

Provide scores in this format:
ACCURACY: X/5
COMPLETENESS: X/5
RELEVANCE: X/5
CLARITY: X/5
OVERALL: X/5

Brief explanation (1-2 sentences):
""")

def evaluate_answer(question: str, answer: str, context: str) -> Dict:
    """
    Evaluate answer quality using LLM.
    """
    eval_chain = eval_prompt | llm | StrOutputParser()
    evaluation = eval_chain.invoke({
        "question": question,
        "answer": answer,
        "context": context
    })
    
    # Parse scores
    import re
    scores = {}
    for criterion in ["ACCURACY", "COMPLETENESS", "RELEVANCE", "CLARITY", "OVERALL"]:
        # Raw f-string so \s and \d are regex escapes, not invalid string escapes
        match = re.search(rf"{criterion}:\s*(\d+)/5", evaluation)
        if match:
            scores[criterion.lower()] = int(match.group(1))
    
    return {
        "scores": scores,
        "evaluation_text": evaluation
    }

# Evaluate all previous results
print("\n📊 EVALUATING ANSWERS")
print("="*70)

evaluation_results = []
for result in all_results[:2]:  # Evaluate first 2 to save time
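    # Rebuild Document objects from the saved sources; format_docs_with_citations
    # expects Document instances, not plain dicts. Note that the previews are
    # truncated to 200 characters, so the evaluator sees only part of each chunk.
    source_docs = [
        Document(
            page_content=s["content_preview"],
            metadata={"filename": s["filename"], "page": s["page"]}
        )
        for s in result["sources"]
    ]
    context = format_docs_with_citations(source_docs)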
    eval_result = evaluate_answer(
        result["question"],
        result["answer"],
        context
    )
    evaluation_results.append(eval_result)
    
    print(f"\n❓ {result['question'][:60]}...")
    print(f"Scores: {eval_result['scores']}")

# Calculate average scores
if evaluation_results:
    avg_scores = {}
    for criterion in ["accuracy", "completeness", "relevance", "clarity", "overall"]:
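        # Skip evaluations where this criterion could not be parsed, rather than counting them as 0
        scores = [r["scores"][criterion] for r in evaluation_results if criterion in r["scores"]]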
        avg_scores[criterion] = sum(scores) / len(scores) if scores else 0
    
    print(f"\n📈 Average Scores:")
    for criterion, score in avg_scores.items():
        print(f"  {criterion.capitalize()}: {score:.2f}/5")

---
## Part 14: Key Learnings & Best Practices

### 🎓 What You've Learned:

1. **Multi-Index Architecture**: Separate indexes improve relevance by domain specialization
2. **Intelligent Routing**: LLM-based routing handles complex query classification
3. **Document Processing**: Chunking strategies balance context and retrieval precision
4. **Citation Tracking**: Metadata enables source attribution and verification
5. **Performance Monitoring**: Metrics help identify bottlenecks and optimize

### ✅ Best Practices:

1. **Domain Design**:
   - Create clear, distinct domains
   - Ensure routing descriptions are specific
   - Include fallback/general category

2. **Chunking Strategy**:
   - Balance chunk size (800-1200 characters is a reasonable starting point; tune it per corpus)
   - Use overlap to preserve context
   - Respect document structure (paragraphs, sections)

3. **Retrieval Optimization**:
   - Start with k=3 chunks, adjust based on domain
   - Use metadata filtering for specific sources
   - Implement confidence thresholds

4. **Answer Quality**:
   - Include source citations in prompts
   - Set clear answer guidelines
   - Handle "no answer found" gracefully

5. **Monitoring**:
   - Track routing accuracy
   - Monitor retrieval relevance
   - Measure end-to-end latency
   - Evaluate answer quality regularly

### 🚀 Next Steps:

1. Add more documents to existing domains
2. Create new domain indexes as needed
3. Implement user feedback collection
4. Fine-tune routing decisions based on metrics
5. Explore hybrid search (semantic + keyword)
6. Add query rewriting for better retrieval
7. Implement caching for common queries
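
As one illustration of the caching step, the sketch below wraps any question-answering function in an in-memory cache keyed on a normalized form of the question. `make_cached` and `fake_answer` are hypothetical; in this notebook the wrapped function could be a thin wrapper around `multi_index_rag`. An exact-match cache like this does not catch paraphrases.

```python
from typing import Callable, Dict

def make_cached(answer_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap `answer_fn` with a cache keyed on the normalized question text."""
    cache: Dict[str, str] = {}

    def cached(question: str) -> str:
        key = " ".join(question.lower().split())  # normalize case and whitespace
        if key not in cache:
            cache[key] = answer_fn(question)
        return cache[key]

    return cached

calls = {"count": 0}

def fake_answer(question: str) -> str:
    """Stand-in for an expensive pipeline call."""
    calls["count"] += 1
    return f"answer to: {question}"

cached_answer = make_cached(fake_answer)
first = cached_answer("What is the vacation policy?")
second = cached_answer("  what is the VACATION policy? ")
print(f"Pipeline calls: {calls['count']}")
```

Normalizing the key lets trivially different spellings of the same question share one entry. A production cache would also need an eviction policy and invalidation when the underlying documents change.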

---
## Part 15: Cleanup & Save Results

Let's save our results and clean up resources.

In [None]:
import json
from datetime import datetime

# Save results to JSON
results_file = f"rag_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"

output_data = {
    "metadata": {
        "timestamp": datetime.now().isoformat(),
        "num_domains": len(vectorstores),
        "domains": list(vectorstores.keys()),
        "total_queries": len(all_results)
    },
    "document_stats": document_stats,
    "chunk_stats": chunk_stats,
    "results": all_results,
    "performance_summary": {
        "avg_total_time": metrics_df['Total Time (s)'].mean(),
        "avg_route_time": metrics_df['Route Time (s)'].mean(),
        "avg_retrieval_time": metrics_df['Retrieval Time (s)'].mean(),
        "avg_generation_time": metrics_df['Generation Time (s)'].mean(),
    },
    "routing_distribution": routing_distribution
}

with open(results_file, 'w') as f:
    json.dump(output_data, f, indent=2)

print(f"\n✅ Results saved to: {results_file}")
print(f"\n📊 Session Summary:")
print(f"  • Processed {sum(stats['num_files'] for stats in document_stats.values())} PDF files")
print(f"  • Created {len(vectorstores)} domain indexes")
print(f"  • Answered {len(all_results)} queries")
print(f"  • Average response time: {metrics_df['Total Time (s)'].mean():.3f}s")

print("\n🎉 Tutorial Complete! You've mastered Multi-Index RAG with PDF documents!")

---
## 🎯 Practice Exercises

Try these exercises to reinforce your learning:

### Exercise 1: Add a New Domain
1. Add new PDFs to a `pdf_documents/` subfolder (e.g., `technical/`)
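2. Update the domain mapping (note that `PDF_DIR.glob("*.pdf")` does not scan subfolders, so switch to `rglob` or keep the files at the top level)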
3. Create a new vector index
4. Update the routing schema
5. Test with relevant queries

### Exercise 2: Improve Chunking
1. Experiment with different chunk sizes (500, 1500, 2000)
2. Try different overlap values (0, 100, 300)
3. Compare retrieval quality
4. Document your findings

### Exercise 3: Enhanced Routing
1. Return the `confidence` field that `RouteQuery` already defines (the router chain currently extracts only `datasource`)
2. Implement multi-domain retrieval for ambiguous queries
3. Add query intent classification (factual vs. procedural)
4. Log routing decisions for analysis

### Exercise 4: Build a Q&A Interface
1. Create a simple web UI with Streamlit/Gradio
2. Display routing decisions visually
3. Show source citations with clickable links
4. Add query history and favorites

### Exercise 5: Production Readiness
1. Add error handling and retries
2. Implement caching layer
3. Add rate limiting
4. Create health check endpoints
5. Add comprehensive logging
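
For the error handling and retries step in Exercise 5, a minimal sketch of retrying with exponential backoff is shown below. `with_retries` is a hypothetical helper; catching every `Exception` keeps the example short, and real code should retry only on transient errors such as timeouts or rate limits.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on failure with delays of base_delay * 2**n seconds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # narrow this to transient errors in real code
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")

attempts = {"count": 0}
delays: List[float] = []

def flaky_call() -> str:
    """Fails twice, then succeeds, to simulate a transient API error."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

# Record the delays instead of sleeping so the demo finishes immediately.
retry_result = with_retries(flaky_call, max_attempts=3, sleep=delays.append)
print(retry_result, attempts["count"], delays)
```

Passing `sleep` as a parameter keeps the helper testable. In the notebook, the wrapped callable could be a lambda around `router_chain.invoke` or the answer-generation chain.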