# Multi-Index Routing RAG with PDF Documents - FREE Open Source Version

## 🎯 Learning Objectives

In this notebook, you will learn:

1. **Multi-Index RAG Architecture** - How to route queries to domain-specific indexes
2. **PDF Processing** - Load and process real PDF documents
3. **Document Chunking Strategies** - Split documents effectively for retrieval
4. **LLM-Based Routing** - Use AI to intelligently route queries
5. **Citation Tracking** - Track sources and page numbers for answers
6. **Evaluation Metrics** - Measure system performance

## 📚 Dataset

We'll process 4 PDF documents from the `pdf_documents/` folder:
- **Legal/Compliance Domain**: 3 PDPA Advisory Guidelines
- **HR/Business Domain**: 1 Employee Handbook
- **General Domain**: Fallback for other queries

## 🔧 Prerequisites

**100% FREE Tools Used:**
- **Ollama** - Local or self-hosted LLM runtime (no API key needed!)
- **Sentence Transformers** - Free embedding models
- **ChromaDB** - Open-source vector database

### Installation:
1. Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
2. Pull a model: `ollama pull llama3.2` or `ollama pull mistral`
3. Install Python packages (next cell)

---
## Part 1: Setup and Dependencies

First, let's install and import all required libraries.

In [6]:
# Install packages in the current kernel's Python environment
# (no -q flag, so installation progress is shown)
import sys

print("📦 Installing packages (this may take 1-2 minutes)...\n")

!{sys.executable} -m pip install langchain langchain-community langchain-ollama langchain-core langchain-huggingface matplotlib plotly pandas chromadb pypdf sentence-transformers

print("\n✅ Installation complete!")

📦 Installing packages (this may take 1-2 minutes)...

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
Downloading langchain_huggingface-0.3.1-py3-none-any.whl (27 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.3.1
✅ Installation complete!


In [1]:
# Import libraries
import os
import time
import json
from pathlib import Path
from typing import List, Dict, Literal, Tuple
from pydantic import BaseModel, Field

# LangChain imports
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_huggingface import HuggingFaceEmbeddings  # ✅ Updated import
from langchain_ollama import ChatOllama
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Visualization
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

print("✅ All dependencies loaded successfully!")
print(f"📁 Working directory: {os.getcwd()}")


✅ All dependencies loaded successfully!
📁 Working directory: /root/Programming Projects/Personal/Pag-aaral/RAG/Multi-Index Routing


### Check Ollama Installation

In [2]:
# Check if Ollama is installed LOCALLY (Optional - skip if using Railway)
# If Ollama is not installed locally, the FileNotFoundError branch below prints a notice - that's OK!

import subprocess

try:
    result = subprocess.run(['ollama', 'list'], capture_output=True, text=True)
    print("✅ Local Ollama detected:")
    print(result.stdout)
except FileNotFoundError:
    print("ℹ️  Local Ollama not installed (this is fine if using Railway)")
    print("\n💡 You're using Railway Ollama, so local installation is optional.")
    print("\n📝 If you want to install Ollama locally for development:")
    print("   curl -fsSL https://ollama.com/install.sh | sh")
    print("   ollama pull llama3.2:3b")

ℹ️  Local Ollama not installed (this is fine if using Railway)

💡 You're using Railway Ollama, so local installation is optional.

📝 If you want to install Ollama locally for development:
   curl -fsSL https://ollama.com/install.sh | sh
   ollama pull llama3.2:3b


---
## Part 2: Initialize FREE Models

Set up our local LLM and embedding model - completely free!

In [3]:
# Initialize FREE embedding model
print("🔄 Loading embedding model (first time may take a minute)...")

# Use the GPU when one is available; otherwise fall back to CPU
import torch  # installed as a dependency of sentence-transformers

device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",  # Higher quality than MiniLM (~420MB)
    model_kwargs={"device": device}
)

# Alternative options:
# For even better quality (1.3GB): "BAAI/bge-large-en-v1.5"
# For faster/smaller (90MB): "sentence-transformers/all-MiniLM-L6-v2"

print(f"✅ Embedding model loaded on {'GPU' if device == 'cuda' else 'CPU'}!")

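# Initialize LLM - Connect to Railway Ollama
MODEL_NAME = "llama3.2:3b"  # Must match a model available on the Ollama server

# 🚂 RAILWAY CONFIGURATION
# ⚠️ This URL uses plain HTTP. If requests fail with "405 method not allowed",
# the host may be redirecting HTTP to HTTPS (a redirect can turn the POST into a GET);
# in that case, try the https:// form of the URL.
OLLAMA_BASE_URL = "http://ollama-production-4331.up.railway.app"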

# Toggle: Use local or Railway Ollama
USE_LOCAL_OLLAMA = False  # Set to True to use local Ollama instead

if USE_LOCAL_OLLAMA:
    base_url = "http://localhost:11434"
    print("🔧 Using LOCAL Ollama")
else:
    base_url = OLLAMA_BASE_URL
    print(f"🚂 Using RAILWAY Ollama: {base_url}")

llm = ChatOllama(
    model=MODEL_NAME,
    temperature=0,
    base_url=base_url
)

# Test the LLM
print(f"\n🧪 Testing {MODEL_NAME}...")
try:
    test_response = llm.invoke("Say 'Hello! I am ready to help with RAG.'")
    print(f"✅ Response: {test_response.content}")
    print("\n✅ Connection successful! Ready to go!")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\n💡 Troubleshooting:")
    print("   1. Railway might require TCP proxy instead of HTTP")
    print("   2. Check Railway logs for connection errors")
    print("   3. Try: railway logs")
    print("   4. Verify model is running: railway ssh -> ollama list")

🔄 Loading embedding model (first time may take a minute)...
✅ Embedding model loaded on GPU!
🚂 Using RAILWAY Ollama: http://ollama-production-4331.up.railway.app

🧪 Testing llama3.2:3b...
❌ Connection failed: 405 method not allowed (status code: 405)

💡 Troubleshooting:
   1. Railway might require TCP proxy instead of HTTP
   2. Check Railway logs for connection errors
   3. Try: railway logs
   4. Verify model is running: railway ssh -> ollama list


---
## Part 3: Explore PDF Documents

Let's examine what PDFs we have and understand their structure.

In [None]:
# Define paths
PDF_DIR = Path("pdf_documents")

# List all PDF files
pdf_files = list(PDF_DIR.glob("*.pdf"))

print(f"📂 Found {len(pdf_files)} PDF documents:\n")
for i, pdf_path in enumerate(pdf_files, 1):
    file_size = pdf_path.stat().st_size / 1024  # KB
    print(f"{i}. {pdf_path.name}")
    print(f"   Size: {file_size:.1f} KB\n")

# Domain mapping based on file names
domain_mapping = {
    "legal": ["PDPA", "Advisory", "Guidelines", "Enforcement"],
    "hr": ["Employee", "Handbook"],
}

def classify_pdf(filename: str) -> str:
    """Classify PDF into domain based on filename."""
    for domain, keywords in domain_mapping.items():
        if any(keyword.lower() in filename.lower() for keyword in keywords):
            return domain
    return "general"

# Classify documents
classified_docs = {}
for pdf_path in pdf_files:
    domain = classify_pdf(pdf_path.name)
    if domain not in classified_docs:
        classified_docs[domain] = []
    classified_docs[domain].append(pdf_path)

print("\n📊 Domain Classification:")
for domain, docs in classified_docs.items():
    print(f"\n{domain.upper()}: {len(docs)} document(s)")
    for doc in docs:
        print(f"  • {doc.name}")

---
## Part 4: Load PDF Documents

Now let's load the PDF documents with proper metadata tagging.

In [None]:
def load_pdf_with_metadata(pdf_path: Path, domain: str) -> List[Document]:
    """
    Load a PDF file and add domain metadata.
    
    Args:
        pdf_path: Path to PDF file
        domain: Domain classification (legal, hr, general)
    
    Returns:
        List of Document objects (one per page)
    """
    print(f"📄 Loading: {pdf_path.name}")
    
    loader = PyPDFLoader(str(pdf_path))
    documents = loader.load()
    
    # Add comprehensive metadata
    for doc in documents:
        doc.metadata["domain"] = domain
        doc.metadata["source_type"] = "pdf"
        doc.metadata["filename"] = pdf_path.name
        doc.metadata["file_path"] = str(pdf_path)
    
    print(f"   ✓ Loaded {len(documents)} pages")
    return documents

# Load all documents by domain
all_documents = {}
document_stats = {}

print("🔄 Loading all PDF documents...\n")

for domain, pdf_paths in classified_docs.items():
    domain_docs = []
    for pdf_path in pdf_paths:
        docs = load_pdf_with_metadata(pdf_path, domain)
        domain_docs.extend(docs)
    
    all_documents[domain] = domain_docs
    document_stats[domain] = {
        "num_files": len(pdf_paths),
        "num_pages": len(domain_docs),
        "total_chars": sum(len(doc.page_content) for doc in domain_docs)
    }

print("\n" + "="*60)
print("📈 DOCUMENT LOADING SUMMARY")
print("="*60)

for domain, stats in document_stats.items():
    print(f"\n{domain.upper()}:")
    print(f"  Files: {stats['num_files']}")
    print(f"  Pages: {stats['num_pages']}")
    print(f"  Characters: {stats['total_chars']:,}")
    # max(..., 1) avoids a ZeroDivisionError if a domain ended up with no pages
    print(f"  Avg chars/page: {stats['total_chars'] // max(stats['num_pages'], 1):,}")

---
## Part 5: Document Chunking Strategies

PDF pages can be very long. We'll split them into smaller chunks for better retrieval.

### 📝 Chunking Parameters:
- **chunk_size**: Maximum characters per chunk (1000)
- **chunk_overlap**: Characters shared between chunks (200)
- **separators**: Split at paragraphs, then sentences, then words
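
To make the size and overlap parameters concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break at the listed separators, so its chunks will differ. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.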

In [None]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split documents into chunks
chunked_documents = {}
chunk_stats = {}

print("✂️ Splitting documents into chunks...\n")

for domain, docs in all_documents.items():
    chunks = text_splitter.split_documents(docs)
    chunked_documents[domain] = chunks
    
    chunk_stats[domain] = {
        "num_chunks": len(chunks),
        "avg_chunk_size": sum(len(c.page_content) for c in chunks) / len(chunks) if chunks else 0,
        "min_chunk_size": min(len(c.page_content) for c in chunks) if chunks else 0,
        "max_chunk_size": max(len(c.page_content) for c in chunks) if chunks else 0,
    }
    
    print(f"{domain.upper()}: {len(docs)} pages → {len(chunks)} chunks")

print("\n" + "="*60)
print("📊 CHUNKING STATISTICS")
print("="*60)

for domain, stats in chunk_stats.items():
    print(f"\n{domain.upper()}:")
    print(f"  Total chunks: {stats['num_chunks']}")
    print(f"  Avg size: {stats['avg_chunk_size']:.0f} chars")
    print(f"  Range: {stats['min_chunk_size']:.0f} - {stats['max_chunk_size']:.0f} chars")

### 📊 Visualize Chunk Size Distribution

In [None]:
# Create visualization of chunk sizes
fig = make_subplots(
    rows=1, cols=len(chunked_documents),
    subplot_titles=[f"{domain.upper()}" for domain in chunked_documents.keys()],
    specs=[[{"type": "histogram"}] * len(chunked_documents)]
)

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#95E1D3', '#F38181']
for i, (domain, chunks) in enumerate(chunked_documents.items(), 1):
    chunk_sizes = [len(chunk.page_content) for chunk in chunks]
    
    fig.add_trace(
        go.Histogram(
            x=chunk_sizes,
            name=domain.upper(),
            nbinsx=20,
            marker_color=colors[i-1] if i <= len(colors) else '#999999'
        ),
        row=1, col=i
    )

fig.update_layout(
    title_text="Chunk Size Distribution by Domain",
    showlegend=False,
    height=400
)
fig.update_xaxes(title_text="Chunk Size (characters)")
fig.update_yaxes(title_text="Frequency")

fig.show()

print("\n💡 Insight: Look for chunk sizes clustered near the 1000-character limit in every domain; similar distributions help keep retrieval comparable across indexes.")

---
## Part 6: Create Vector Indexes

We'll create separate vector stores for each domain using ChromaDB and FREE embeddings.
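
Conceptually, each domain index is its own collection of vectors, and a query is compared only against the index it was routed to. The sketch below illustrates that idea in plain Python; the two-dimensional vectors are made up and stand in for real embeddings, and `toy_indexes` and `search` are hypothetical names, not ChromaDB APIs.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy indexes: domain -> list of (chunk_text, vector). Vectors are invented for illustration.
toy_indexes: Dict[str, List[Tuple[str, List[float]]]] = {
    "legal": [("consent obligation", [1.0, 0.0]), ("breach notification", [0.8, 0.6])],
    "hr": [("annual leave", [0.0, 1.0]), ("dress code", [0.6, 0.8])],
}

def search(domain: str, query_vec: List[float], k: int = 1) -> List[str]:
    """Return the k most similar chunk texts from a single domain index."""
    scored = sorted(
        toy_indexes[domain],
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

# The same query vector returns different chunks depending on the selected index
print(search("legal", [1.0, 0.1]))  # ['consent obligation']
print(search("hr", [1.0, 0.1]))     # ['dress code']
```

Because the search never crosses index boundaries, a routing mistake cannot be repaired at retrieval time, which is why the router in Part 7 matters.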

In [None]:
print("🔄 Creating vector indexes with FREE embeddings...\n")

# Create vector stores
vectorstores = {}
retrievers = {}

for domain, chunks in chunked_documents.items():
    if not chunks:
        print(f"⚠️ Skipping {domain} - no chunks available")
        continue
    
    print(f"📦 Creating {domain.upper()} vector store...")
    
    # Create vector store with persistence
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=f"{domain}_docs",
        persist_directory=f"./chroma_db_free/{domain}"
    )
    
    vectorstores[domain] = vectorstore
    retrievers[domain] = vectorstore.as_retriever(
        search_kwargs={"k": 3}  # Retrieve top 3 chunks
    )
    
    print(f"   ✓ Indexed {len(chunks)} chunks")

print(f"\n✅ Created {len(vectorstores)} domain-specific indexes!")
print(f"📁 Vector stores persisted to: ./chroma_db_free/")
print("\n💰 Cost: $0.00 - Everything runs locally!")

---
## Part 7: LLM-Based Query Router (FREE)

The router decides which domain index to query based on the question.
It asks the LLM for a JSON object and parses it with `JsonOutputParser`, an approach that does not depend on the model supporting tool (function) calling.
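
Small local models sometimes wrap their JSON in extra prose or return a label outside the allowed set. As a rough sketch of defensive parsing using only the standard library, something like the following could be used; `parse_route` and `VALID_ROUTES` are hypothetical helpers, not part of LangChain.

```python
import json
import re
from typing import Tuple

VALID_ROUTES = {"legal", "hr", "general"}

def parse_route(raw: str) -> Tuple[str, str]:
    """Extract (datasource, reasoning) from raw model text, falling back to 'general'."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # first '{' through last '}'
    if not match:
        return "general", "No JSON object found"
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return "general", "Invalid JSON"
    datasource = str(data.get("datasource", "general")).lower()
    if datasource not in VALID_ROUTES:
        datasource = "general"  # unknown label: use the fallback route
    return datasource, str(data.get("reasoning", "No reasoning provided"))

print(parse_route('Sure! {"datasource": "hr", "reasoning": "leave policy"}'))
# ('hr', 'leave policy')
print(parse_route("I am not sure."))
# ('general', 'No JSON object found')
```

Falling back to `general` instead of raising keeps the pipeline running when the model misbehaves, at the cost of occasionally querying a less relevant index.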

In [None]:
# Define routing schema
class RouteQuery(BaseModel):
    """Route a user query to the most relevant domain-specific index."""
    
    datasource: Literal["legal", "hr", "general"] = Field(
        ...,
        description="Choose the most relevant datasource for the query"
    )
    
    reasoning: str = Field(
        ...,
        description="Brief explanation of why this datasource was chosen"
    )

# Create routing prompt (JSON-based for Ollama)
routing_prompt = ChatPromptTemplate.from_template("""
You are a query router. Analyze the question and route it to the most relevant datasource.

Available datasources:
- legal: Data privacy, PDPA, enforcement guidelines, compliance, personal data protection, legal regulations
- hr: Employee handbook, workplace policies, HR procedures, benefits, employee conduct, workplace rules
- general: Questions that don't fit legal or HR categories

Question: {question}

Respond with ONLY a JSON object in this exact format:
{{
    "datasource": "legal" or "hr" or "general",
    "reasoning": "brief explanation"
}}

JSON Response:""")

# Create router chain
router_chain = routing_prompt | llm | JsonOutputParser()

def route_query(question: str) -> Tuple[str, str]:
    """Route a query to a domain. Returns (datasource, reasoning)."""
    try:
        result = router_chain.invoke({"question": question})
        datasource = result.get("datasource", "general")
        # Guard against labels outside the schema (small models occasionally invent one)
        if datasource not in ("legal", "hr", "general"):
            datasource = "general"
        reasoning = result.get("reasoning", "No reasoning provided")
        return datasource, reasoning
    except Exception as e:
        print(f"⚠️ Routing error: {e}")
        return "general", "Fallback due to error"

print("✅ Router configured successfully!")
print("\n🔀 Available routes:")
for route in retrievers.keys():
    print(f"  • {route.upper()}")

### Test the Router

In [None]:
# Test routing with sample queries
test_routing_queries = [
    "What are the penalties for PDPA violations?",
    "What is the company's vacation policy?",
    "How should personal data be collected?",
    "What are the dress code requirements?"
]

print("🧪 Testing Router with Sample Queries\n")
print("="*70)

routing_results = []
for query in test_routing_queries:
    route, reasoning = route_query(query)
    routing_results.append({"query": query, "route": route, "reasoning": reasoning})
    print(f"❓ {query}")
    print(f"   ➜ Routed to: {route.upper()}")
    print(f"   💭 Reasoning: {reasoning}\n")

print("✅ Router test complete!")

---
## Part 8: Complete Multi-Index RAG Pipeline (FREE)

Now let's build the complete pipeline with citation tracking.
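
When several retrieved chunks come from the same page, listing each one separately makes citations noisy. The sketch below collapses chunk metadata into unique, order-preserving citations. It assumes the `filename` key added in Part 4 and the 0-based `page` index that `PyPDFLoader` stores; the function name and the sample filenames are hypothetical.

```python
from typing import Dict, List

def unique_citations(metadatas: List[Dict]) -> List[str]:
    """Build de-duplicated 'file, p. N' strings from chunk metadata dicts."""
    seen = set()
    citations = []
    for meta in metadatas:
        key = (meta.get("filename", "Unknown"), meta.get("page"))
        if key in seen:
            continue  # this file/page pair was already cited
        seen.add(key)
        filename, page = key
        # Convert the 0-based page index to a human-readable 1-based page number
        page_label = page + 1 if isinstance(page, int) else "?"
        citations.append(f"{filename}, p. {page_label}")
    return citations

sample = [
    {"filename": "handbook.pdf", "page": 0},
    {"filename": "handbook.pdf", "page": 0},
    {"filename": "guidelines.pdf", "page": 4},
]
print(unique_citations(sample))  # ['handbook.pdf, p. 1', 'guidelines.pdf, p. 5']
```

In the pipeline, the input could be built with `[doc.metadata for doc in retrieved_docs]`.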

In [None]:
# RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant answering questions based on PDF documents.

Context from PDF documents:
{context}

Question: {question}

Instructions:
1. Provide a detailed, accurate answer based ONLY on the information in the PDF documents above
2. If the answer isn't in the documents, clearly state that
3. Be specific and cite relevant details from the documents
4. Keep the answer concise but comprehensive

Answer:""")

def format_docs_with_citations(docs: List[Document]) -> str:
    """Format documents with source information for citations."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("filename", "Unknown")
        page = doc.metadata.get("page")
        # PyPDFLoader stores 0-based page indexes; show 1-based page numbers in citations
        page_label = page + 1 if isinstance(page, int) else "?"
        formatted.append(
            f"[Source {i}: {source}, Page {page_label}]\n{doc.page_content}"
        )
    return "\n\n".join(formatted)

def multi_index_rag(question: str, verbose: bool = True) -> Dict:
    """
    Complete Multi-Index RAG pipeline with routing, retrieval, and generation.
    
    Args:
        question: User's question
        verbose: Print detailed information
    
    Returns:
        Dictionary with question, route, answer, sources, and metrics
    """
    start_time = time.time()
    
    if verbose:
        print(f"\n{'='*70}")
        print(f"❓ Question: {question}")
        print(f"{'='*70}")
    
    # Step 1: Route query
    route_start = time.time()
    selected_route, reasoning = route_query(question)
    route_time = time.time() - route_start
    
    if verbose:
        print(f"\n🔀 Routing Decision: {selected_route.upper()}")
        print(f"   💭 Reasoning: {reasoning}")
        print(f"   ⏱️ Time: {route_time:.3f}s")
    
    # Step 2: Retrieve from selected index
    retrieval_start = time.time()
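    # If the chosen route has no index (e.g. "general" when no general PDFs were loaded),
    # fall back to the first available retriever
    selected_retriever = retrievers.get(selected_route, list(retrievers.values())[0])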
    retrieved_docs = selected_retriever.invoke(question)
    retrieval_time = time.time() - retrieval_start
    
    if verbose:
        print(f"\n📚 Retrieved {len(retrieved_docs)} relevant chunks")
        print(f"   ⏱️ Time: {retrieval_time:.3f}s")
        print(f"\n📄 Sources:")
        for doc in retrieved_docs:
            source = doc.metadata.get("filename", "Unknown")
            page = doc.metadata.get("page")
            page = page + 1 if isinstance(page, int) else "?"  # PyPDFLoader pages are 0-based
            preview = doc.page_content[:100].replace('\n', ' ')
            print(f"   • {source} (Page {page})")
            print(f"     Preview: {preview}...")
    
    # Step 3: Generate answer
    generation_start = time.time()
    context = format_docs_with_citations(retrieved_docs)
    rag_chain = rag_prompt | llm | StrOutputParser()
    answer = rag_chain.invoke({
        "context": context,
        "question": question
    })
    generation_time = time.time() - generation_start
    
    total_time = time.time() - start_time
    
    if verbose:
        print(f"\n💡 Answer:\n{answer}")
        print(f"\n⏱️ Performance:")
        print(f"   Routing: {route_time:.3f}s")
        print(f"   Retrieval: {retrieval_time:.3f}s")
        print(f"   Generation: {generation_time:.3f}s")
        print(f"   Total: {total_time:.3f}s")
    
    return {
        "question": question,
        "route": selected_route,
        "reasoning": reasoning,
        "answer": answer,
        "sources": [
            {
                "filename": doc.metadata.get("filename"),
                "page": doc.metadata.get("page"),
                "content_preview": doc.page_content[:200]
            }
            for doc in retrieved_docs
        ],
        "metrics": {
            "route_time": route_time,
            "retrieval_time": retrieval_time,
            "generation_time": generation_time,
            "total_time": total_time,
            "num_chunks_retrieved": len(retrieved_docs)
        }
    }

print("✅ RAG pipeline ready!")
print("💰 Cost: $0.00 - Everything runs locally with Ollama!")

---
## Part 9: Test with Real Queries

Let's test the system with questions relevant to our documents.

In [None]:
# Define test questions
test_questions = [
    # Legal/PDPA questions
    "What are the key obligations for organizations under PDPA?",
    "What are the penalties for data protection violations?",
    
    # HR/Employee Handbook questions
    "What are the employee benefits mentioned in the handbook?",
    "What is the policy on working hours?",
]

# Run queries and collect results
print("\n" + "="*70)
print("🧪 TESTING MULTI-INDEX RAG SYSTEM (FREE VERSION)")
print("="*70)

all_results = []
for question in test_questions:
    result = multi_index_rag(question, verbose=True)
    all_results.append(result)
    print("\n" + "-"*70)

---
## Part 10: Visualize Routing Decisions

Let's visualize how queries were routed across domains.

In [None]:
# Analyze routing distribution
routing_distribution = {}
for result in all_results:
    route = result["route"]
    routing_distribution[route] = routing_distribution.get(route, 0) + 1

# Create visualization
fig = go.Figure()

fig.add_trace(go.Bar(
    x=list(routing_distribution.keys()),
    y=list(routing_distribution.values()),
    marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1'],
    text=list(routing_distribution.values()),
    textposition='auto',
))

fig.update_layout(
    title="Query Routing Distribution (Free Local Models)",
    xaxis_title="Domain",
    yaxis_title="Number of Queries",
    height=400
)

fig.show()

# Print routing summary
print("\n📊 Routing Summary:")
for route, count in routing_distribution.items():
    percentage = (count / len(all_results)) * 100
    print(f"  {route.upper()}: {count} queries ({percentage:.1f}%)")

---
## Part 11: Performance Metrics & Evaluation

Let's analyze the system's performance.
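
Latency is only one side of evaluation. If you label each test question with the domain you expect, routing accuracy can be measured as well. The following is a minimal sketch; `routing_accuracy` is a hypothetical helper, and the expected labels are something you would supply yourself.

```python
from typing import Dict, List

def routing_accuracy(predicted: List[str], expected: List[str]) -> Dict[str, float]:
    """Return per-domain and overall routing accuracy."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for pred, exp in zip(predicted, expected):
        totals[exp] = totals.get(exp, 0) + 1
        if pred == exp:
            correct[exp] = correct.get(exp, 0) + 1
    # Accuracy per expected domain, then one overall figure
    report = {domain: correct.get(domain, 0) / count for domain, count in totals.items()}
    report["overall"] = sum(correct.values()) / len(expected) if expected else 0.0
    return report

print(routing_accuracy(["legal", "hr", "legal", "hr"], ["legal", "legal", "hr", "hr"]))
# {'legal': 0.5, 'hr': 0.5, 'overall': 0.5}
```

For the four questions in Part 9, the call could look like `routing_accuracy([r["route"] for r in all_results], ["legal", "legal", "hr", "hr"])`, following the order in which the questions are defined.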

In [None]:
# Extract performance metrics
metrics_df = pd.DataFrame([
    {
        "Question": result["question"][:40] + "...",
        "Route": result["route"],
        "Route (s)": result["metrics"]["route_time"],
        "Retrieval (s)": result["metrics"]["retrieval_time"],
        "Generation (s)": result["metrics"]["generation_time"],
        "Total (s)": result["metrics"]["total_time"],
        "Chunks": result["metrics"]["num_chunks_retrieved"]
    }
    for result in all_results
])

print("\n📈 PERFORMANCE METRICS (FREE LOCAL MODELS)")
print("="*70)
print(metrics_df.to_string(index=False))

# Calculate summary statistics
print("\n📊 SUMMARY STATISTICS")
print("="*70)
print(f"Average Total Time: {metrics_df['Total (s)'].mean():.3f}s")
print(f"Average Route Time: {metrics_df['Route (s)'].mean():.3f}s")
print(f"Average Retrieval Time: {metrics_df['Retrieval (s)'].mean():.3f}s")
print(f"Average Generation Time: {metrics_df['Generation (s)'].mean():.3f}s")
print(f"\n💰 Total Cost: $0.00 (vs ~$0.10-0.50 with OpenAI)")

# Visualize time breakdown
time_components = [
    metrics_df['Route (s)'].mean(),
    metrics_df['Retrieval (s)'].mean(),
    metrics_df['Generation (s)'].mean()
]

fig = go.Figure(data=[go.Pie(
    labels=['Routing', 'Retrieval', 'Generation'],
    values=time_components,
    marker_colors=['#FF6B6B', '#4ECDC4', '#45B7D1']
)])

fig.update_layout(
    title="Average Time Distribution - Free Local Models",
    height=400
)

fig.show()

---
## Part 12: Interactive Testing Section

Now it's your turn! Try asking your own questions.

In [None]:
# Interactive query function
def ask_question(question: str):
    """Ask a question and get an answer with full tracking."""
    result = multi_index_rag(question, verbose=True)
    return result

# Example: Try your own question!
# Uncomment and modify the question below:

# my_question = "What are data breach notification requirements?"
# my_result = ask_question(my_question)

---
## Part 13: Save Results

Let's save our results for future reference.

In [None]:
from datetime import datetime

# Save results to JSON
results_file = f"rag_results_free_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"

output_data = {
    "metadata": {
        "timestamp": datetime.now().isoformat(),
        "model": MODEL_NAME,
        "embedding_model": embeddings.model_name,
        "num_domains": len(vectorstores),
        "domains": list(vectorstores.keys()),
        "total_queries": len(all_results),
        "cost": "$0.00 (Free Local Models)"
    },
    "document_stats": document_stats,
    "chunk_stats": chunk_stats,
    "results": all_results,
    "performance_summary": {
        "avg_total_time": metrics_df['Total (s)'].mean(),
        "avg_route_time": metrics_df['Route (s)'].mean(),
        "avg_retrieval_time": metrics_df['Retrieval (s)'].mean(),
        "avg_generation_time": metrics_df['Generation (s)'].mean(),
    },
    "routing_distribution": routing_distribution
}

with open(results_file, 'w') as f:
    json.dump(output_data, f, indent=2)

print(f"\n✅ Results saved to: {results_file}")
print(f"\n📊 Session Summary:")
print(f"  • Model: {MODEL_NAME} (Free Local)")
print(f"  • Embeddings: {embeddings.model_name} (Free)")
print(f"  • Processed {sum(stats['num_files'] for stats in document_stats.values())} PDF files")
print(f"  • Created {len(vectorstores)} domain indexes")
print(f"  • Answered {len(all_results)} queries")
print(f"  • Average response time: {metrics_df['Total (s)'].mean():.3f}s")
print(f"  • 💰 Total cost: $0.00")

print("\n🎉 Tutorial Complete! You've mastered Multi-Index RAG with FREE local models!")

---
## 💡 Key Advantages of Free Local Models

### ✅ Pros:
1. **Zero Cost** - No API fees, run unlimited queries
2. **Privacy** - Data never leaves your machine (when Ollama runs locally)
3. **No Rate Limits** - Process as much as you want
4. **Offline Capable** - Works without internet
5. **Customizable** - Fine-tune models for your domain

### ⚠️ Considerations:
1. **Speed** - Local models may be slower (especially on CPU)
2. **Quality** - Smaller models may produce less accurate results
3. **Setup** - Requires Ollama installation
4. **Hardware** - Better with GPU, but works on CPU

### 🚀 Performance Tips:
1. **Use GPU** - If available, configure HuggingFaceEmbeddings with `model_kwargs={'device': 'cuda'}`
2. **Choose Right Model**:
   - `phi3` - Fastest, smallest (2GB)
   - `llama3.2` - Good balance (2GB)
   - `mistral` - Best quality (4GB)
3. **Optimize Chunk Size** - Smaller chunks (500-800) work better with smaller models
4. **Batch Processing** - Process multiple queries together

### 🔄 Switching Models:
```python
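# Pick ONE model, then re-create the ChatOllama client so the change takes effect:
MODEL_NAME = "phi3"            # Fastest
# MODEL_NAME = "llama3.2"      # Balanced
# MODEL_NAME = "mistral"       # Best quality
# MODEL_NAME = "gemma2"        # Google's model
llm = ChatOllama(model=MODEL_NAME, temperature=0, base_url=base_url)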
```

---
## 🎯 Practice Exercises

### Exercise 1: Try Different Models
1. Install multiple Ollama models: `ollama pull phi3`, `ollama pull mistral`
2. Run the same queries with different models
3. Compare speed and quality
4. Document which model works best for your use case

### Exercise 2: Optimize for Speed
1. Reduce chunk size to 500 characters
2. Retrieve fewer chunks (k=2)
3. Use a faster model (phi3)
4. Measure performance improvement

### Exercise 3: Add New Documents
1. Add more PDFs to `pdf_documents/`
2. Create a new domain category
3. Update the routing logic
4. Test with new queries
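
As a starting point for this exercise, the sketch below extends the keyword mapping from Part 3 with a hypothetical `finance` domain; the keywords and the filenames are illustrative only. Remember that the router also needs the new label in both the `RouteQuery` schema and the routing prompt.

```python
from typing import Dict, List

# Same structure as the notebook's domain_mapping, plus a hypothetical "finance" domain
domain_mapping: Dict[str, List[str]] = {
    "legal": ["PDPA", "Advisory", "Guidelines", "Enforcement"],
    "hr": ["Employee", "Handbook"],
    "finance": ["Invoice", "Budget", "Expense"],  # illustrative keywords only
}

def classify_pdf(filename: str) -> str:
    """Return the first domain whose keyword appears in the filename, else 'general'."""
    name = filename.lower()
    for domain, keywords in domain_mapping.items():
        if any(keyword.lower() in name for keyword in keywords):
            return domain
    return "general"

print(classify_pdf("Expense_Policy_2024.pdf"))  # finance
print(classify_pdf("Employee_Handbook.pdf"))    # hr
print(classify_pdf("notes.pdf"))                # general
```

Since the first matching domain wins, keyword order matters when a filename could match more than one domain.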

### Exercise 4: Build a Web Interface
1. Create a Streamlit app for the RAG system
2. Add file upload capability
3. Display routing decisions visually
4. Show source citations with links

### Exercise 5: Hybrid Approach
1. Keep embeddings local (HuggingFace)
2. Use OpenAI for generation only (fallback)
3. Compare costs and performance
4. Find the optimal balance
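
The fallback logic for this exercise can be prototyped without any real API. The sketch below uses plain callables as stand-ins for the local and hosted generators; `generate_with_fallback`, `broken_local`, and `stub_remote` are all hypothetical names.

```python
from typing import Callable, Tuple

def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary generator; on error or empty output, use the fallback."""
    try:
        answer = primary(prompt)
        if answer and answer.strip():
            return answer, "primary"
    except Exception as exc:
        print(f"Primary generator failed: {exc}")
    return fallback(prompt), "fallback"

def broken_local(prompt: str) -> str:
    """Stand-in for a local model that cannot be reached."""
    raise ConnectionError("local model unreachable")

def stub_remote(prompt: str) -> str:
    """Stand-in for a hosted model."""
    return f"remote answer to: {prompt}"

print(generate_with_fallback("What is PDPA?", broken_local, stub_remote))
# ('remote answer to: What is PDPA?', 'fallback')
```

Returning the label alongside the answer makes it easy to count how often the paid fallback is used when comparing costs.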