# Module 9: Retrieval Strategies & Optimization

## 🎯 Learning Objectives
By the end of this module, you will:
- Implement advanced retrieval strategies including multi-stage pipelines
- Master re-ranking techniques with cross-encoders and LLM judges
- Optimize retrieval parameters for different use cases and domains
- Handle edge cases, failures, and implement robust fallback strategies
- Measure and evaluate retrieval quality using comprehensive metrics
- Build production-ready retrieval pipelines with monitoring and optimization

## 📚 Key Concepts

### Why Retrieval Optimization Matters

**Retrieval quality directly determines RAG system performance:**

```
Poor Retrieval → Irrelevant Context → Bad LLM Output
Great Retrieval → Relevant Context → Excellent LLM Output
```

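Even the best LLM cannot compensate for poor retrieval: if the relevant passage never reaches the context window, the model has nothing to ground its answer on. Retrieval quality is often cited as accounting for **70-80% of RAG system performance**, though the exact share depends on the task and on how performance is measured.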

### 🏗️ Modern Retrieval Architecture (2025)

**Multi-Stage Retrieval Pipeline:**

1. **Stage 1 - Initial Retrieval**: Fast, broad search (top 100-500 candidates)
   - Vector similarity search
   - BM25 keyword matching
   - Hybrid fusion (RRF)

2. **Stage 2 - Re-ranking**: Precise, compute-intensive (top 10-50 candidates)
   - Cross-encoder models
   - LLM-based judges
   - Multi-criteria scoring

3. **Stage 3 - Post-processing**: Final optimization
   - Diversity injection
   - Deduplication
   - Context window fitting
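
The hybrid fusion step in Stage 1 can be illustrated with Reciprocal Rank Fusion (RRF), which merges ranked lists using only rank positions. The sketch below is a minimal standard-library illustration, not the pipeline built later in this module: the function name `rrf_fuse` and the toy document IDs are hypothetical, and `k=60` is simply the constant commonly used for RRF.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy rankings from two hypothetical retrievers
vector_ranking = ["d3", "d1", "d2"]
bm25_ranking = ["d1", "d4", "d3"]

fused = rrf_fuse([vector_ranking, bm25_ranking])
print(fused)
```

Documents that appear near the top of both lists (here `d1` and `d3`) rise above documents that only one retriever found, without any need to put the two score scales on a common footing.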

### 🔄 Advanced Retrieval Techniques (2025)

#### Multi-Query Retrieval
- **Concept**: Generate multiple query variations, retrieve for each, fuse results
- **Benefit**: 25-40% improvement in recall
- **Cost**: 3-5x more retrieval operations
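
As a rough illustration of why multiple query variations help recall, the sketch below runs a toy keyword retriever once per variation and merges the results. Everything here is an assumption made for the example: `keyword_search` stands in for a real retriever, the variations are written by hand rather than generated by an LLM, and a simple order-preserving union stands in for the fusion step.

```python
def keyword_search(query, corpus, top_k=2):
    """Toy retriever: rank documents by the number of shared lowercase words."""
    q_words = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(q_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

def multi_query_search(queries, corpus, top_k=2):
    """Run every query variation and merge results, keeping first-seen order."""
    merged = []
    for q in queries:
        for doc_id in keyword_search(q, corpus, top_k):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged

corpus = {
    "a": "speed up model training on gpu",
    "b": "tuning hyperparameters for faster convergence",
    "c": "quarterly revenue report",
}
variations = ["speed up training", "faster convergence tuning"]

single = keyword_search(variations[0], corpus)
multi = multi_query_search(variations, corpus)
print(single, multi)
```

The second phrasing surfaces document `b`, which shares no words with the original query. The price is one retrieval call per variation, which is where the extra cost comes from.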

#### Parent-Child Retrieval
- **Concept**: Retrieve small chunks for precision, return large chunks for context
- **Benefit**: Better balance of specificity and completeness
- **Implementation**: Hierarchical document structure
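
The idea can be sketched with a child-to-parent mapping: search over small chunks, then return the parent each matching chunk came from. This is a minimal illustration under stated assumptions, not the implementation used later in the module: fixed-size word chunks and word-overlap scoring are placeholders for a real splitter and an embedding search, and both function names are hypothetical.

```python
def split_into_children(parents, chunk_words=5):
    """Split each parent into fixed-size word chunks and remember its parent ID."""
    children = []  # list of (child_text, parent_id)
    for parent_id, text in enumerate(parents):
        words = text.split()
        for i in range(0, len(words), chunk_words):
            children.append((" ".join(words[i:i + chunk_words]), parent_id))
    return children

def parent_child_search(query, parents, children, top_k=1):
    """Score child chunks, then return the parents of the best-scoring chunks."""
    q_words = set(query.lower().split())
    best = {}  # parent_id -> best child score
    for text, parent_id in children:
        overlap = len(q_words & set(text.lower().split()))
        best[parent_id] = max(best.get(parent_id, 0), overlap)
    ranked = sorted((p for p in best if best[p] > 0), key=lambda p: (-best[p], p))
    return [parents[p] for p in ranked[:top_k]]

parents = [
    "Tokens expire after 24 hours. Refresh them with the refresh endpoint.",
    "Rate limit is 1000 requests per hour for the predict endpoint.",
]
children = split_into_children(parents)
result = parent_child_search("tokens expire", parents, children)
print(result)
```

The match is found in a five-word chunk, but the caller receives the full parent text, including the refresh instructions that the matching chunk alone would have omitted.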

#### Contextual Retrieval (Anthropic, 2024)
- **Concept**: Add contextual information to chunks before embedding
- **Benefit**: Up to 49% fewer failed retrievals when contextual embeddings are combined with contextual BM25 (as reported by Anthropic)
- **Method**: LLM generates chunk context within document
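
In outline, each chunk is prefixed with a short description of where it sits in the document, and the combined text is what gets embedded and indexed. The sketch below shows only that data flow. `generate_context` is a hypothetical stand-in for the LLM call, and the template it returns is invented for the example.

```python
def generate_context(document_title, chunk):
    """Hypothetical stand-in for an LLM call that situates a chunk in its document."""
    return f"This chunk is from '{document_title}'."

def contextualize_chunks(document_title, chunks):
    """Prepend generated context to every chunk before it is embedded or indexed."""
    return [f"{generate_context(document_title, c)} {c}" for c in chunks]

chunks = [
    "Revenue grew 22% year over year.",
    "The AI division contributed $45M.",
]
contextualized = contextualize_chunks("Q4 2024 earnings report", chunks)
for text in contextualized:
    print(text)
```

A chunk such as "Revenue grew 22% year over year." says nothing about which company or period it describes. The prefix restores that information so that both embedding and keyword search can match on it.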

#### RAPTOR (Recursive Tree Retrieval)
- **Concept**: Build hierarchical tree of document summaries
- **Benefit**: Better handling of multi-document queries
- **Use case**: Large document collections, complex questions

### 📊 Retrieval Quality Metrics

| Metric | Definition | Good Score | Use Case |
|--------|------------|------------|---------|
| **Recall@K** | Fraction of all relevant items that appear in the top-K results | >0.8 | Coverage measurement |
| **Precision@K** | Fraction of the top-K results that are relevant | >0.6 | Relevance measurement |
| **MRR** | Mean Reciprocal Rank: average of 1/rank of the first relevant result | >0.7 | Ranking quality |
| **NDCG@K** | Normalized Discounted Cumulative Gain | >0.8 | Ranked relevance |
| **Hit Rate** | Queries with ≥1 relevant result | >0.9 | System reliability |
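
The metrics in the table can be computed per query with a few lines of standard-library Python. The sketch below assumes binary relevance (a document is either relevant or not), and the function names and toy data are illustrative. MRR and Hit Rate are obtained by averaging the per-query reciprocal rank and the per-query "at least one hit" indicator over all test queries.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d2", "d7", "d1", "d9"]   # system output, best first
relevant = {"d1", "d2", "d5"}       # ground-truth relevant documents

recall = recall_at_k(ranked, relevant, 3)
precision = precision_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, 3)
print(recall, precision, rr, ndcg)
```

Here two of the three relevant documents appear in the top 3, so recall and precision are both 2/3, while NDCG is lower than 1 because one relevant document is missing and another sits at rank 3 instead of rank 2.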

### ⚡ Performance vs Quality Trade-offs

```python
# Speed vs Quality Spectrum
retrieval_strategies = {
    "basic_vector": {"speed": "⚡⚡⚡⚡⚡", "quality": "⭐⭐⭐"},
    "hybrid_search": {"speed": "⚡⚡⚡⚡", "quality": "⭐⭐⭐⭐"},
    "cross_encoder": {"speed": "⚡⚡", "quality": "⭐⭐⭐⭐⭐"},
    "llm_rerank": {"speed": "⚡", "quality": "⭐⭐⭐⭐⭐"},
    "multi_query": {"speed": "⚡⚡", "quality": "⭐⭐⭐⭐⭐"}
}
```

## 🛠️ Setup
Let's install the required packages and set up our advanced retrieval lab.

In [None]:
# Install required packages
!pip install -q sentence-transformers transformers torch
!pip install -q langchain langchain-community langchain-openai
!pip install -q rank-bm25 faiss-cpu numpy pandas matplotlib seaborn
!pip install -q scikit-learn nltk textstat python-dotenv
!pip install -q plotly wordcloud
# Note: CrossEncoder ships with sentence-transformers (installed above).
# For production, also add the LLM client you plan to use, e.g. openai or anthropic

In [None]:
import os
import re
import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import List, Dict, Any, Tuple, Optional, Union
import json
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# Core ML libraries
from sentence_transformers import SentenceTransformer, CrossEncoder
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import ndcg_score
from rank_bm25 import BM25Okapi
import faiss

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import textstat

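# LangChain components
# (current package layout; the legacy `langchain.schema`, `langchain.text_splitter`,
# `langchain.vectorstores` and `langchain.embeddings` paths are deprecated or removed
# in recent releases)
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings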

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from dotenv import load_dotenv
load_dotenv()

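# Download required NLTK data
# Recent NLTK releases load 'punkt_tab' for word_tokenize; older ones use 'punkt'
for resource_path, resource_name in [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_name, quiet=True)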

# Set up visualization
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("✅ Setup complete!")
print(f"📅 Today's date: {datetime.now().strftime('%Y-%m-%d')}")

## 🧪 Exercise 1: Multi-Stage Retrieval Pipeline

Let's build a production-ready multi-stage retrieval pipeline with initial search, re-ranking, and post-processing.

In [None]:
class MultiStageRetriever:
    """Advanced multi-stage retrieval pipeline with re-ranking and optimization"""
    
    def __init__(self, embedding_model_name: str = 'all-MiniLM-L6-v2'):
        # Initialize models
        self.embedding_model = SentenceTransformer(embedding_model_name)
        # Note: In production, load a real cross-encoder, e.g.
        # CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
        self.cross_encoder = None  # Will simulate cross-encoder for demo
        
        # Initialize components
        self.documents = []
        self.embeddings = None
        self.bm25 = None
        self.tokenized_docs = []
        self.document_metadata = []
        
        # Preprocessing tools
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        
        # Performance tracking
        self.retrieval_stats = defaultdict(list)
    
    def create_comprehensive_dataset(self) -> List[Dict[str, Any]]:
        """Create a comprehensive dataset with metadata for advanced retrieval testing"""
        
        dataset = [
            # Technical Documentation
            {
                "text": "TensorFlow 2.14.0 introduces optimized GPU kernels for transformer models, reducing training time by up to 35%. The new tf.keras.layers.MultiHeadAttention layer includes improved memory management and support for mixed precision training.",
                "category": "technical",
                "domain": "machine_learning",
                "difficulty": "advanced",
                "date": "2024-01-15",
                "source": "tensorflow_docs",
                "relevance_keywords": ["tensorflow", "gpu", "optimization", "training"]
            },
            {
                "text": "PyTorch 2.1 brings native support for distributed training with automatic mixed precision. The torch.compile() function can optimize models for inference with up to 60% speedup on modern GPUs.",
                "category": "technical",
                "domain": "machine_learning",
                "difficulty": "advanced",
                "date": "2024-02-20",
                "source": "pytorch_docs",
                "relevance_keywords": ["pytorch", "distributed", "inference", "speedup"]
            },
            # Business Analytics
            {
                "text": "Q4 2024 revenue reached $125M, representing a 22% year-over-year increase. The AI products division contributed $45M, driven by strong enterprise adoption of our machine learning platform.",
                "category": "business",
                "domain": "finance",
                "difficulty": "beginner",
                "date": "2024-12-31",
                "source": "earnings_report",
                "relevance_keywords": ["revenue", "growth", "ai products", "enterprise"]
            },
            {
                "text": "Customer acquisition costs decreased by 18% in 2024 due to improved targeting algorithms. The machine learning-powered recommendation system increased customer lifetime value by 31%.",
                "category": "business",
                "domain": "marketing",
                "difficulty": "intermediate",
                "date": "2024-11-15",
                "source": "marketing_analysis",
                "relevance_keywords": ["customer acquisition", "targeting", "recommendations", "lifetime value"]
            },
            # Research Papers
            {
                "text": "Attention mechanisms in transformer architectures enable models to focus on relevant parts of the input sequence. Multi-head attention allows the model to jointly attend to information from different representation subspaces.",
                "category": "research",
                "domain": "nlp",
                "difficulty": "advanced",
                "date": "2024-03-10",
                "source": "research_paper",
                "relevance_keywords": ["attention", "transformer", "multi-head", "sequence"]
            },
            {
                "text": "Large language models demonstrate emergent capabilities at scale, including few-shot learning, chain-of-thought reasoning, and cross-lingual transfer. Performance scaling follows predictable power laws with model size and compute.",
                "category": "research",
                "domain": "nlp",
                "difficulty": "advanced",
                "date": "2024-04-22",
                "source": "research_paper",
                "relevance_keywords": ["large language models", "emergent", "few-shot", "scaling"]
            },
            # API Documentation
            {
                "text": "The /api/v1/predict endpoint accepts POST requests with JSON payload containing feature vectors. Returns prediction scores and confidence intervals. Rate limit: 1000 requests per hour.",
                "category": "documentation",
                "domain": "api",
                "difficulty": "intermediate",
                "date": "2024-05-01",
                "source": "api_docs",
                "relevance_keywords": ["api", "predict", "endpoint", "rate limit"]
            },
            {
                "text": "Authentication uses Bearer tokens with JWT format. Include Authorization header in all requests. Tokens expire after 24 hours and must be refreshed using the /auth/refresh endpoint.",
                "category": "documentation",
                "domain": "api",
                "difficulty": "intermediate",
                "date": "2024-05-01",
                "source": "api_docs",
                "relevance_keywords": ["authentication", "bearer token", "jwt", "authorization"]
            },
            # Tutorials and Guides
            {
                "text": "To train a neural network for image classification, start by preprocessing your images to a consistent size and format. Use data augmentation techniques like rotation and scaling to improve model generalization.",
                "category": "tutorial",
                "domain": "computer_vision",
                "difficulty": "beginner",
                "date": "2024-06-15",
                "source": "tutorial",
                "relevance_keywords": ["neural network", "image classification", "preprocessing", "augmentation"]
            },
            {
                "text": "Natural language processing pipelines typically include tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis. Modern approaches use transformer-based models for end-to-end processing.",
                "category": "tutorial",
                "domain": "nlp",
                "difficulty": "intermediate",
                "date": "2024-07-20",
                "source": "tutorial",
                "relevance_keywords": ["nlp", "tokenization", "named entity", "sentiment analysis"]
            },
            # News and Updates
            {
                "text": "OpenAI releases GPT-4 Turbo with improved performance and reduced costs. The new model supports 128K context windows and demonstrates better reasoning capabilities across multiple domains.",
                "category": "news",
                "domain": "ai_updates",
                "difficulty": "beginner",
                "date": "2024-08-01",
                "source": "tech_news",
                "relevance_keywords": ["openai", "gpt-4 turbo", "context window", "reasoning"]
            },
            {
                "text": "Google announces Gemini Ultra, achieving state-of-the-art performance on 30 of 32 academic benchmarks. The model excels at multimodal reasoning, combining text, images, and code understanding.",
                "category": "news",
                "domain": "ai_updates",
                "difficulty": "beginner",
                "date": "2024-08-15",
                "source": "tech_news",
                "relevance_keywords": ["google", "gemini ultra", "benchmarks", "multimodal"]
            },
            # Code Examples
            {
                "text": "def train_model(X_train, y_train, epochs=100, batch_size=32): model = Sequential([Dense(128, activation='relu'), Dense(64, activation='relu'), Dense(1, activation='sigmoid')]) model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) return model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)",
                "category": "code",
                "domain": "machine_learning",
                "difficulty": "intermediate",
                "date": "2024-09-01",
                "source": "code_example",
                "relevance_keywords": ["train_model", "sequential", "dense", "binary_crossentropy"]
            },
            {
                "text": "class DataProcessor: def __init__(self, config): self.config = config def preprocess(self, raw_data): normalized = self.normalize(raw_data) cleaned = self.remove_outliers(normalized) return self.feature_engineering(cleaned)",
                "category": "code",
                "domain": "data_science",
                "difficulty": "intermediate",
                "date": "2024-09-05",
                "source": "code_example",
                "relevance_keywords": ["DataProcessor", "preprocess", "normalize", "feature_engineering"]
            }
        ]
        
        return dataset
    
    def build_indexes(self, dataset: List[Dict[str, Any]]) -> None:
        """Build all necessary indexes for multi-stage retrieval"""
        print("🏗️ Building multi-stage retrieval indexes...")
        
        # Extract documents and metadata
        self.documents = [item["text"] for item in dataset]
        self.document_metadata = [{k: v for k, v in item.items() if k != "text"} for item in dataset]
        
        # Build semantic index (embeddings)
        print("   Building semantic embeddings...")
        self.embeddings = self.embedding_model.encode(self.documents, show_progress_bar=True)
        self.embeddings = self.embeddings / np.linalg.norm(self.embeddings, axis=1, keepdims=True)
        
        # Build BM25 index
        print("   Building BM25 index...")
        self.tokenized_docs = [self._preprocess_text(doc) for doc in self.documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)
        
        print(f"✅ Indexes built for {len(self.documents)} documents")
    
    def _preprocess_text(self, text: str) -> List[str]:
        """Preprocess text for BM25"""
        tokens = word_tokenize(text.lower())
        tokens = [token for token in tokens if token.isalpha() and token not in self.stop_words]
        return [self.stemmer.stem(token) for token in tokens]
    
    def stage1_initial_retrieval(self, query: str, top_k: int = 100, alpha: float = 0.7) -> List[Tuple[int, float, str]]:
        """Stage 1: Fast initial retrieval with hybrid search"""
        start_time = time.time()
        
        # Semantic search
        query_embedding = self.embedding_model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        semantic_similarities = np.dot(self.embeddings, query_embedding.T).flatten()
        
        # BM25 search
        query_tokens = self._preprocess_text(query)
        bm25_scores = self.bm25.get_scores(query_tokens)
        
        # Weighted Reciprocal Rank Fusion (RRF):
        # alpha weights the semantic ranking, (1 - alpha) the BM25 ranking
        semantic_ranks = np.argsort(semantic_similarities)[::-1]
        bm25_ranks = np.argsort(bm25_scores)[::-1]
        
        # Calculate RRF scores
        rrf_scores = {}
        k = 60  # RRF parameter
        
        for rank, doc_id in enumerate(semantic_ranks):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + alpha / (k + rank + 1)
        
        for rank, doc_id in enumerate(bm25_ranks):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1 - alpha) / (k + rank + 1)
        
        # Sort by RRF score
        sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
        
        results = []
        for doc_id, score in sorted_results:
            results.append((doc_id, score, self.documents[doc_id]))
        
        stage1_time = time.time() - start_time
        self.retrieval_stats["stage1_time"].append(stage1_time)
        
        return results
    
    def stage2_reranking(self, query: str, initial_results: List[Tuple[int, float, str]], 
                        top_k: int = 20) -> List[Tuple[int, float, str]]:
        """Stage 2: Re-ranking with simulated cross-encoder"""
        start_time = time.time()
        
        # Simulate cross-encoder scoring
        # In production, use: self.cross_encoder.predict([(query, doc) for _, _, doc in initial_results])
        reranked_results = []
        
        for doc_id, initial_score, doc_text in initial_results:
            # Simulate cross-encoder with sophisticated heuristics
            cross_encoder_score = self._simulate_cross_encoder_score(query, doc_text, doc_id)
            reranked_results.append((doc_id, cross_encoder_score, doc_text))
        
        # Sort by cross-encoder score
        reranked_results.sort(key=lambda x: x[1], reverse=True)
        
        stage2_time = time.time() - start_time
        self.retrieval_stats["stage2_time"].append(stage2_time)
        
        return reranked_results[:top_k]
    
    def _simulate_cross_encoder_score(self, query: str, document: str, doc_id: int) -> float:
        """Simulate cross-encoder scoring with advanced heuristics"""
        base_score = 0.5
        
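        # Query-document similarity (semantic)
        # Normalize the query embedding so the dot product with the already
        # normalized document embedding is a true cosine similarity
        query_emb = self.embedding_model.encode([query])
        query_emb = query_emb / np.linalg.norm(query_emb)
        doc_emb = self.embeddings[doc_id:doc_id+1]
        semantic_sim = float(np.dot(query_emb, doc_emb.T)[0, 0])
        base_score += semantic_sim * 0.3
        
        # Keyword overlap bonus (Jaccard similarity; guarded against an empty union)
        query_words = set(query.lower().split())
        doc_words = set(document.lower().split())
        union = query_words.union(doc_words)
        overlap = len(query_words.intersection(doc_words)) / len(union) if union else 0.0
        base_score += overlap * 0.2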
        
        # Metadata relevance
        metadata = self.document_metadata[doc_id]
        
        # Boost for relevant keywords in metadata
        if 'relevance_keywords' in metadata:
            keyword_matches = sum(1 for keyword in metadata['relevance_keywords'] 
                                if keyword.lower() in query.lower())
            base_score += keyword_matches * 0.1
        
        # Boost for recent documents
        if 'date' in metadata:
            doc_date = datetime.strptime(metadata['date'], '%Y-%m-%d')
            days_old = (datetime.now() - doc_date).days
            recency_score = max(0, 1 - days_old / 365)  # Decay over 1 year
            base_score += recency_score * 0.1
        
        # Category-specific adjustments
        if 'category' in metadata:
            if metadata['category'] == 'research' and any(word in query.lower() 
                                                        for word in ['study', 'research', 'paper', 'analysis']):
                base_score += 0.1
            elif metadata['category'] == 'tutorial' and any(word in query.lower() 
                                                          for word in ['how', 'tutorial', 'guide', 'learn']):
                base_score += 0.1
        
        # Add some randomness to simulate real cross-encoder variance
        base_score += np.random.normal(0, 0.05)
        
        return max(0.0, min(1.0, base_score))
    
    def stage3_postprocessing(self, query: str, reranked_results: List[Tuple[int, float, str]], 
                            final_k: int = 10, diversity_threshold: float = 0.8) -> List[Tuple[int, float, str, Dict]]:
        """Stage 3: Post-processing with diversity and deduplication"""
        start_time = time.time()
        
        final_results = []
        selected_embeddings = []
        
        for doc_id, score, doc_text in reranked_results:
            if len(final_results) >= final_k:
                break
            
            doc_embedding = self.embeddings[doc_id]
            
            # Check diversity (avoid too similar documents)
            is_diverse = True
            for selected_emb in selected_embeddings:
                similarity = np.dot(doc_embedding, selected_emb)
                if similarity > diversity_threshold:
                    is_diverse = False
                    break
            
            if is_diverse:
                # Add metadata for context
                metadata = self.document_metadata[doc_id].copy()
                metadata['retrieval_score'] = score
                metadata['rank'] = len(final_results) + 1
                
                final_results.append((doc_id, score, doc_text, metadata))
                selected_embeddings.append(doc_embedding)
        
        stage3_time = time.time() - start_time
        self.retrieval_stats["stage3_time"].append(stage3_time)
        
        return final_results
    
    def retrieve(self, query: str, stage1_k: int = 100, stage2_k: int = 20, 
                final_k: int = 10, alpha: float = 0.7) -> Dict[str, Any]:
        """Complete multi-stage retrieval pipeline"""
        total_start = time.time()
        
        # Stage 1: Initial retrieval
        stage1_results = self.stage1_initial_retrieval(query, top_k=stage1_k, alpha=alpha)
        
        # Stage 2: Re-ranking
        stage2_results = self.stage2_reranking(query, stage1_results, top_k=stage2_k)
        
        # Stage 3: Post-processing
        final_results = self.stage3_postprocessing(query, stage2_results, final_k=final_k)
        
        total_time = time.time() - total_start
        self.retrieval_stats["total_time"].append(total_time)
        
        return {
            'query': query,
            'results': final_results,
            'stage1_candidates': len(stage1_results),
            'stage2_candidates': len(stage2_results),
            'final_results': len(final_results),
            'total_time': total_time,
            'stage_times': {
                'stage1': self.retrieval_stats["stage1_time"][-1],
                'stage2': self.retrieval_stats["stage2_time"][-1],
                'stage3': self.retrieval_stats["stage3_time"][-1]
            }
        }

# Initialize the multi-stage retriever
retriever = MultiStageRetriever()
print("🔍 Multi-Stage Retriever initialized!")
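
Stage 2 above only simulates a cross-encoder. One way to keep the re-ranking logic independent of the scoring model is to pass the scorer in as a function, as in the sketch below. This is an illustration under stated assumptions, not part of the class above: `rerank` and `toy_scorer` are hypothetical names, and the toy scorer exists only so the example runs without downloading a model. The commented lines show how `CrossEncoder.predict` from sentence-transformers could be plugged in instead.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(query: str, docs: Sequence[str],
           score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair jointly and return the top_k documents."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]

def toy_scorer(pairs):
    """Stand-in scorer: fraction of query words that occur in the document."""
    scores = []
    for q, d in pairs:
        q_words = set(q.lower().split())
        scores.append(len(q_words & set(d.lower().split())) / max(len(q_words), 1))
    return scores

docs = ["jwt bearer tokens expire", "revenue grew", "bearer tokens"]
top = rerank("bearer tokens jwt", docs, toy_scorer, top_k=2)
print(top)

# With sentence-transformers installed, a real model could be used instead:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
#   top = rerank("bearer tokens jwt", docs, model.predict, top_k=2)
```

Because a cross-encoder reads the query and document together, it is far slower per pair than a vector lookup, which is why it is applied only to the Stage 1 candidates.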

In [None]:
# Create dataset and build indexes
dataset = retriever.create_comprehensive_dataset()
retriever.build_indexes(dataset)

print(f"📊 Dataset Overview:")
print(f"   Total documents: {len(dataset)}")

# Analyze dataset composition
categories = Counter(item['category'] for item in dataset)
domains = Counter(item['domain'] for item in dataset)
difficulties = Counter(item['difficulty'] for item in dataset)

print(f"   Categories: {dict(categories)}")
print(f"   Domains: {dict(domains)}")
print(f"   Difficulty levels: {dict(difficulties)}")

# Display sample documents
print("\n📄 Sample Documents:")
for i, item in enumerate(dataset[:3]):
    print(f"   {i+1}. [{item['category']}] {item['text'][:70]}...")

In [None]:
# Test multi-stage retrieval with different query types
test_queries = [
    "How to optimize TensorFlow training performance?",  # Technical query
    "What are the latest developments in large language models?",  # Research query
    "API authentication methods and security",  # Documentation query
    "Business revenue growth from AI products",  # Business query
    "Machine learning model training best practices"  # Tutorial query
]

print("🧪 MULTI-STAGE RETRIEVAL TESTING")
print("=" * 80)

all_retrieval_results = []

for i, query in enumerate(test_queries):
    print(f"\n🔍 Query {i+1}: '{query}'")
    print("-" * 60)
    
    # Perform multi-stage retrieval
    result = retriever.retrieve(query, stage1_k=50, stage2_k=10, final_k=5)
    all_retrieval_results.append(result)
    
    # Display timing information
    print(f"⏱️ Performance:")
    print(f"   Total time: {result['total_time']:.3f}s")
    print(f"   Stage 1 (Initial): {result['stage_times']['stage1']:.3f}s")
    print(f"   Stage 2 (Re-rank): {result['stage_times']['stage2']:.3f}s")
    print(f"   Stage 3 (Post-proc): {result['stage_times']['stage3']:.3f}s")
    
    # Display pipeline funnel
    print(f"\n🔄 Retrieval Funnel:")
    print(f"   Stage 1 → {result['stage1_candidates']} candidates")
    print(f"   Stage 2 → {result['stage2_candidates']} re-ranked")
    print(f"   Stage 3 → {result['final_results']} final results")
    
    # Display top results with metadata
    print(f"\n📋 Top Results:")
    for j, (doc_id, score, doc_text, metadata) in enumerate(result['results']):
        print(f"   {j+1}. [{score:.4f}] [{metadata['category']}] {doc_text[:60]}...")
        print(f"       Domain: {metadata['domain']}, Difficulty: {metadata['difficulty']}")
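
The Stage 3 diversity check used by the pipeline above is a greedy filter: walk the candidates in score order and keep one only if it is not too similar to anything already kept. The standalone sketch below reproduces that idea with plain Python lists. The two-dimensional vectors are made up for the example, and `diversity_filter` is an illustrative name rather than a method of `MultiStageRetriever`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def diversity_filter(candidates, threshold=0.8, final_k=10):
    """Greedy filter over (doc_id, vector) pairs that are already sorted best-first."""
    selected = []
    for doc_id, vec in candidates:
        if len(selected) >= final_k:
            break
        # Keep the candidate only if it is not a near-duplicate of a kept one
        if all(cosine(vec, kept_vec) <= threshold for _, kept_vec in selected):
            selected.append((doc_id, vec))
    return [doc_id for doc_id, _ in selected]

candidates = [
    ("a", [1.0, 0.0]),
    ("b", [0.99, 0.1]),   # nearly identical direction to "a"
    ("c", [0.0, 1.0]),
]
kept = diversity_filter(candidates)
print(kept)
```

Lowering `threshold` makes the filter stricter and the final list more varied, at the risk of dropping a second document that was relevant in its own right.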

## 📈 Exercise 2: Advanced Retrieval Techniques

Let's implement cutting-edge retrieval techniques including multi-query, parent-child, and contextual retrieval.

In [None]:
class AdvancedRetrievalTechniques:
    """Implementation of advanced retrieval techniques from 2025 research"""
    
    def __init__(self, base_retriever: MultiStageRetriever):
        self.base_retriever = base_retriever
        self.embedding_model = base_retriever.embedding_model
        
        # For parent-child retrieval
        self.parent_documents = []
        self.child_chunks = []
        self.parent_child_mapping = {}
        
        # Query expansion templates
        self.query_expansion_templates = [
            "What is {query}?",
            "How to {query}?",
            "Explain {query}",
            "{query} examples",
            "{query} best practices",
            "Latest developments in {query}"
        ]
    
    def multi_query_retrieval(self, query: str, num_variations: int = 3, 
                            final_k: int = 10) -> Dict[str, Any]:
        """Multi-query retrieval with query expansion and result fusion"""
        print(f"🔍 Multi-Query Retrieval for: '{query}'")
        
        # Generate query variations
        query_variations = self._generate_query_variations(query, num_variations)
        print(f"   Generated {len(query_variations)} query variations")
        
        all_results = []
        query_results = {}
        
        # Retrieve for each query variation
        for i, var_query in enumerate(query_variations):
            print(f"   Querying: '{var_query}'", end=" ")
            result = self.base_retriever.retrieve(var_query, final_k=20)
            query_results[var_query] = result
            all_results.extend(result['results'])
            print(f"({len(result['results'])} results)")
        
        # Fuse results (max-score aggregation plus a frequency bonus)
        fused_results = self._fuse_multi_query_results(all_results, final_k)
        
        return {
            'original_query': query,
            'query_variations': query_variations,
            'individual_results': query_results,
            'fused_results': fused_results,
            'total_candidates': len(all_results),
            'final_count': len(fused_results)
        }
    
    def _generate_query_variations(self, query: str, num_variations: int) -> List[str]:
        """Generate query variations using templates and paraphrasing"""
        variations = [query]  # Include original
        
        # Template-based variations
        for template in self.query_expansion_templates[:num_variations]:
            if len(variations) >= num_variations + 1:
                break
            
            # Simple template filling
            if '{query}' in template:
                variation = template.format(query=query)
            else:
                variation = f"{template} {query}"
            
            if variation not in variations:
                variations.append(variation)
        
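        # Synonym-based variations (simplified)
        # Note: only reached when num_variations exceeds the number of templates,
        # because the template loop above otherwise fills every slot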
        synonyms = {
            'machine learning': ['ML', 'artificial intelligence', 'AI algorithms'],
            'optimization': ['improvement', 'enhancement', 'tuning'],
            'training': ['learning', 'education', 'instruction'],
            'performance': ['efficiency', 'speed', 'effectiveness'],
            'api': ['application programming interface', 'service interface', 'endpoint'],
            'authentication': ['auth', 'login', 'access control']
        }
        
        query_lower = query.lower()
        for term, syns in synonyms.items():
            if term in query_lower and len(variations) < num_variations + 1:
                for syn in syns:
                    if len(variations) >= num_variations + 1:
                        break
                    syn_query = query_lower.replace(term, syn)
                    if syn_query not in variations:
                        variations.append(syn_query)
        
        return variations[:num_variations + 1]
    
    def _fuse_multi_query_results(self, all_results: List[Tuple], final_k: int) -> List[Tuple]:
        """Fuse results from multiple queries using max-score aggregation plus a frequency bonus"""
        doc_scores = defaultdict(list)
        doc_info = {}
        
        # Collect all scores for each document
        for doc_id, score, doc_text, metadata in all_results:
            doc_scores[doc_id].append(score)
            doc_info[doc_id] = (doc_text, metadata)
        
        # Calculate fused score (max score aggregation)
        fused_scores = []
        for doc_id, scores in doc_scores.items():
            # Use maximum score as fusion method
            max_score = max(scores)
            # Also consider frequency (how many queries returned this doc)
            frequency_bonus = len(scores) * 0.1
            final_score = max_score + frequency_bonus
            
            doc_text, metadata = doc_info[doc_id]
            # Copy so the per-query results in `individual_results` are not mutated
            metadata = dict(metadata)
            metadata['fusion_score'] = final_score
            metadata['appearance_count'] = len(scores)
            
            fused_scores.append((doc_id, final_score, doc_text, metadata))
        
        # Sort by fused score
        fused_scores.sort(key=lambda x: x[1], reverse=True)
        
        return fused_scores[:final_k]
    
    def create_parent_child_structure(self, documents: List[str], 
                                     chunk_size: int = 200, overlap: int = 50) -> None:
        """Create parent-child document structure"""
        print("🏗️ Building parent-child document structure...")
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=overlap,
            length_function=len
        )
        
        self.parent_documents = documents
        self.child_chunks = []
        self.parent_child_mapping = {}
        
        for parent_id, parent_doc in enumerate(documents):
            # Split parent document into child chunks
            chunks = text_splitter.split_text(parent_doc)
            
            for chunk in chunks:
                child_id = len(self.child_chunks)
                self.child_chunks.append(chunk)
                self.parent_child_mapping[child_id] = parent_id
        
        print(f"   Created {len(self.child_chunks)} child chunks from {len(self.parent_documents)} parent documents")
    
    def parent_child_retrieval(self, query: str, top_k: int = 10, 
                              child_retrieval_k: int = 20) -> Dict[str, Any]:
        """Parent-child retrieval: search children, return parents"""
        print(f"👨‍👧‍👦 Parent-Child Retrieval for: '{query}'")
        
        if not self.child_chunks:
            self.create_parent_child_structure(self.base_retriever.documents)
        
        # Search in child chunks for precision
        child_embeddings = self.embedding_model.encode(self.child_chunks)
        child_embeddings = child_embeddings / np.linalg.norm(child_embeddings, axis=1, keepdims=True)
        
        query_embedding = self.embedding_model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        
        # Find most relevant child chunks
        similarities = np.dot(child_embeddings, query_embedding.T).flatten()
        top_child_indices = np.argsort(similarities)[::-1][:child_retrieval_k]
        
        # Map to parent documents and aggregate scores
        parent_scores = defaultdict(list)
        child_results = []
        
        for child_id in top_child_indices:
            parent_id = self.parent_child_mapping[child_id]
            score = similarities[child_id]
            
            parent_scores[parent_id].append(score)
            child_results.append({
                'child_id': child_id,
                'parent_id': parent_id,
                'child_text': self.child_chunks[child_id],
                'score': score
            })
        
        # Rank parents by best child score
        parent_results = []
        for parent_id, scores in parent_scores.items():
            best_score = max(scores)
            avg_score = np.mean(scores)
            
            # If we have original metadata, include it
            if parent_id < len(self.base_retriever.document_metadata):
                metadata = self.base_retriever.document_metadata[parent_id].copy()
            else:
                metadata = {}
            
            metadata.update({
                'best_child_score': best_score,
                'avg_child_score': avg_score,
                'matching_children': len(scores)
            })
            
            parent_results.append((
                parent_id, 
                best_score, 
                self.parent_documents[parent_id], 
                metadata
            ))
        
        # Sort by best child score
        parent_results.sort(key=lambda x: x[1], reverse=True)
        
        return {
            'query': query,
            'child_results': child_results,
            'parent_results': parent_results[:top_k],
            'total_children_searched': len(self.child_chunks),
            'relevant_children': len(child_results),
            'unique_parents': len(parent_scores)
        }
    
    def contextual_retrieval(self, query: str, top_k: int = 10) -> Dict[str, Any]:
        """Contextual retrieval with chunk augmentation (simplified version)"""
        print(f"🌍 Contextual Retrieval for: '{query}'")
        
        # Create contextual versions of documents
        contextual_documents = []
        
        for i, doc in enumerate(self.base_retriever.documents):
            metadata = self.base_retriever.document_metadata[i]
            
            # Generate context based on metadata
            context_parts = []
            
            if 'category' in metadata:
                context_parts.append(f"This is a {metadata['category']} document")
            
            if 'domain' in metadata:
                context_parts.append(f"in the domain of {metadata['domain']}")
            
            if 'source' in metadata:
                context_parts.append(f"from {metadata['source']}")
            
            if 'date' in metadata:
                context_parts.append(f"published on {metadata['date']}")
            
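            # Add context to document as one sentence (skip the prefix when no metadata is available)
            context_sentence = " ".join(context_parts)
            context_prefix = f"{context_sentence}. Content: " if context_parts else ""
            contextual_doc = context_prefix + doc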
            contextual_documents.append(contextual_doc)
        
        # Generate embeddings for contextual documents
        contextual_embeddings = self.embedding_model.encode(contextual_documents)
        contextual_embeddings = contextual_embeddings / np.linalg.norm(contextual_embeddings, axis=1, keepdims=True)
        
        # Perform semantic search with contextual embeddings
        query_embedding = self.embedding_model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        
        similarities = np.dot(contextual_embeddings, query_embedding.T).flatten()
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            metadata = self.base_retriever.document_metadata[idx].copy()
            metadata['contextual_score'] = similarities[idx]
            
            results.append((
                idx,
                similarities[idx],
                self.base_retriever.documents[idx],  # Return original document
                metadata
            ))
        
        return {
            'query': query,
            'results': results,
            'contextual_documents_sample': contextual_documents[:2]  # Show examples
        }

# Initialize advanced retrieval techniques
advanced_retrieval = AdvancedRetrievalTechniques(retriever)
print("🚀 Advanced Retrieval Techniques initialized!")
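
`_fuse_multi_query_results` combines per-query scores by taking the maximum and adding a frequency bonus. Reciprocal Rank Fusion (RRF), listed under Stage 1 in the architecture overview, works from ranks alone, so it does not depend on scores from different queries being on the same scale. The following is a minimal sketch and not part of the class above: the function name `reciprocal_rank_fusion`, the toy rankings, and the smoothing constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[int]], k: int = 60) -> List[Tuple[int, float]]:
    """Fuse ranked lists of doc IDs; returns (doc_id, rrf_score) pairs, best first."""
    rrf_scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains
        for rank, doc_id in enumerate(ranking, start=1):
            rrf_scores[doc_id] += 1.0 / (k + rank)
    return sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings returned for three query variations
rankings = [
    [3, 1, 7],
    [1, 3, 9],
    [1, 7, 3],
]
fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"doc {doc_id}: {score:.4f}")
```

Documents that rank well in several lists rise to the top, and a document seen only once near the bottom of one list stays last. Unlike the max-score approach, no additive bonus needs tuning, but the size of the original scores is discarded.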

In [None]:
# Test advanced retrieval techniques
test_query = "machine learning model optimization techniques"

print("🧪 ADVANCED RETRIEVAL TECHNIQUES COMPARISON")
print(f"Query: '{test_query}'")
print("=" * 80)

# 1. Multi-Query Retrieval
print("\n1️⃣ Multi-Query Retrieval:")
multi_query_result = advanced_retrieval.multi_query_retrieval(test_query, num_variations=3, final_k=5)

print(f"   Query variations: {multi_query_result['query_variations']}")
print(f"   Total candidates: {multi_query_result['total_candidates']}")
print(f"   Final results: {multi_query_result['final_count']}")
print("   Top results:")
for i, (doc_id, score, doc_text, metadata) in enumerate(multi_query_result['fused_results'][:3]):
    print(f"      {i+1}. [{score:.4f}] (appeared {metadata['appearance_count']}x) {doc_text[:60]}...")

# 2. Parent-Child Retrieval
print("\n2️⃣ Parent-Child Retrieval:")
parent_child_result = advanced_retrieval.parent_child_retrieval(test_query, top_k=5, child_retrieval_k=15)

print(f"   Children searched: {parent_child_result['total_children_searched']}")
print(f"   Relevant children: {parent_child_result['relevant_children']}")
print(f"   Unique parents: {parent_child_result['unique_parents']}")
print("   Top parent results:")
for i, (parent_id, score, doc_text, metadata) in enumerate(parent_child_result['parent_results'][:3]):
    print(f"      {i+1}. [{score:.4f}] ({metadata['matching_children']} children) {doc_text[:60]}...")

# 3. Contextual Retrieval
print("\n3️⃣ Contextual Retrieval:")
contextual_result = advanced_retrieval.contextual_retrieval(test_query, top_k=5)

print("   Example contextual document:")
print(f"      {contextual_result['contextual_documents_sample'][0][:100]}...")
print("   Top results:")
for i, (doc_id, score, doc_text, metadata) in enumerate(contextual_result['results'][:3]):
    print(f"      {i+1}. [{score:.4f}] [{metadata['category']}] {doc_text[:60]}...")

# 4. Standard Multi-Stage (for comparison)
print("\n4️⃣ Standard Multi-Stage (Baseline):")
baseline_result = retriever.retrieve(test_query, final_k=5)
print("   Top results:")
for i, (doc_id, score, doc_text, metadata) in enumerate(baseline_result['results'][:3]):
    print(f"      {i+1}. [{score:.4f}] [{metadata['category']}] {doc_text[:60]}...")

## 📊 Exercise 3: Retrieval Quality Evaluation

Let's implement comprehensive evaluation metrics for retrieval quality assessment.
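
Before reading the evaluator class, it may help to work through the metrics once on a single toy ranking. The sketch below assumes binary relevance; the document IDs in `retrieved` and `relevant` are hypothetical and unrelated to the ground truth used later. Precision here divides by `k`, which matches dividing by the number of retrieved documents whenever at least `k` are returned.

```python
import math

retrieved = [4, 8, 9, 2, 11]   # ranked doc IDs from a retriever (hypothetical)
relevant = {8, 9, 10, 11}      # ground-truth relevant doc IDs (hypothetical)
k = 5

# 1 where the document at that rank is relevant, else 0
hits = [1 if doc_id in relevant else 0 for doc_id in retrieved[:k]]

precision_at_k = sum(hits) / k
recall_at_k = sum(hits) / len(relevant)

# Reciprocal rank: 1 / rank of the first relevant document (0 if none)
reciprocal_rank = next((1.0 / rank for rank, hit in enumerate(hits, start=1) if hit), 0.0)

# DCG discounts each hit by log2(rank + 1); IDCG is the DCG of a perfect ranking
dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
ideal_hit_count = min(len(relevant), k)
idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hit_count + 1))
ndcg_at_k = dcg / idcg if idcg > 0 else 0.0

hit_rate_at_k = 1.0 if any(hits) else 0.0

print(f"hits={hits}")
print(f"P@{k}={precision_at_k:.2f}  R@{k}={recall_at_k:.2f}  RR={reciprocal_rank:.2f}")
print(f"NDCG@{k}={ndcg_at_k:.3f}  HitRate@{k}={hit_rate_at_k:.0f}")
```

Three of the five retrieved documents are relevant, so precision is 0.6 and recall is 0.75. The first relevant document sits at rank 2, which gives a reciprocal rank of 0.5. Averaging the reciprocal rank over many queries yields MRR.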

In [None]:
class RetrievalEvaluator:
    """Comprehensive evaluation framework for retrieval systems"""
    
    def __init__(self):
        self.evaluation_results = []
        self.ground_truth = {}
    
    def create_evaluation_dataset(self) -> Dict[str, List[int]]:
        """Create ground truth dataset for evaluation"""
        # Simplified ground truth based on document categories and keywords
        ground_truth = {
            "TensorFlow optimization and performance": [0, 1],  # TensorFlow and PyTorch docs
            "API authentication and security methods": [6, 7],  # API docs
            "business revenue and AI products growth": [2, 3],  # Business docs
            "transformer attention mechanisms research": [4, 5],  # Research papers
            "machine learning model training tutorial": [8, 9, 10, 11],  # Tutorial and code
            "latest developments in large language models": [5, 10],  # Research and news
            "neural network image classification guide": [8, 9],  # Tutorial docs
            "natural language processing pipeline steps": [9, 4],  # Tutorial and research
        }
        
        self.ground_truth = ground_truth
        return ground_truth
    
    def calculate_precision_recall(self, retrieved_docs: List[int], 
                                  relevant_docs: List[int], k: int = 10) -> Dict[str, float]:
        """Calculate precision and recall at k"""
        retrieved_set = set(retrieved_docs[:k])
        relevant_set = set(relevant_docs)
        
        true_positives = len(retrieved_set.intersection(relevant_set))
        
        precision_at_k = true_positives / len(retrieved_set) if retrieved_set else 0.0
        recall_at_k = true_positives / len(relevant_set) if relevant_set else 0.0
        
        return {
            'precision_at_k': precision_at_k,
            'recall_at_k': recall_at_k,
            'true_positives': true_positives,
            'retrieved_count': len(retrieved_set),
            'relevant_count': len(relevant_set)
        }
    
    def calculate_mrr(self, retrieved_docs: List[int], relevant_docs: List[int]) -> float:
        """Calculate the reciprocal rank for one query (averaging across queries gives MRR)"""
        relevant_set = set(relevant_docs)
        
        for rank, doc_id in enumerate(retrieved_docs, 1):
            if doc_id in relevant_set:
                return 1.0 / rank
        
        return 0.0
    
    def calculate_ndcg(self, retrieved_docs: List[int], relevant_docs: List[int], k: int = 10) -> float:
        """Calculate Normalized Discounted Cumulative Gain"""
        relevant_set = set(relevant_docs)
        
        # Create relevance scores (1 for relevant, 0 for not relevant)
        relevance_scores = [1 if doc_id in relevant_set else 0 for doc_id in retrieved_docs[:k]]
        
        # Ideal relevance scores (all relevant docs first)
        ideal_scores = [1] * min(len(relevant_docs), k) + [0] * max(0, k - len(relevant_docs))
        
        if not any(relevance_scores):
            return 0.0
        
        # Binary-relevance NDCG: DCG of the observed ranking divided by the DCG
        # of the ideal ranking (all relevant documents placed first)
        dcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(relevance_scores) if rel > 0)
        idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_scores) if rel > 0)
        return dcg / idcg if idcg > 0 else 0.0
    
    def calculate_hit_rate(self, retrieved_docs: List[int], relevant_docs: List[int], k: int = 10) -> float:
        """Calculate hit rate (whether any relevant doc is in top-k)"""
        retrieved_set = set(retrieved_docs[:k])
        relevant_set = set(relevant_docs)
        
        return 1.0 if retrieved_set.intersection(relevant_set) else 0.0
    
    def evaluate_retrieval_method(self, retrieval_function, method_name: str, 
                                queries: List[str], k_values: Tuple[int, ...] = (5, 10)) -> pd.DataFrame:
        """Evaluate a retrieval method across multiple queries and k values"""
        print(f"📊 Evaluating {method_name}...")
        
        results = []
        
        for query in queries:
            if query not in self.ground_truth:
                continue
                
            relevant_docs = self.ground_truth[query]
            
            # Get retrieval results
            retrieval_result = retrieval_function(query)
            
            # Extract document IDs (handle different result formats)
            if isinstance(retrieval_result, dict):
                if 'results' in retrieval_result:
                    retrieved_docs = [item[0] for item in retrieval_result['results']]
                elif 'fused_results' in retrieval_result:
                    retrieved_docs = [item[0] for item in retrieval_result['fused_results']]
                elif 'parent_results' in retrieval_result:
                    retrieved_docs = [item[0] for item in retrieval_result['parent_results']]
                else:
                    continue
            else:
                retrieved_docs = [item[0] for item in retrieval_result]
            
            # Calculate metrics for different k values
            for k in k_values:
                precision_recall = self.calculate_precision_recall(retrieved_docs, relevant_docs, k)
                mrr = self.calculate_mrr(retrieved_docs, relevant_docs)
                ndcg = self.calculate_ndcg(retrieved_docs, relevant_docs, k)
                hit_rate = self.calculate_hit_rate(retrieved_docs, relevant_docs, k)
                
                results.append({
                    'method': method_name,
                    'query': query[:40] + '...' if len(query) > 40 else query,
                    'k': k,
                    'precision_at_k': precision_recall['precision_at_k'],
                    'recall_at_k': precision_recall['recall_at_k'],
                    'mrr': mrr,
                    'ndcg_at_k': ndcg,
                    'hit_rate_at_k': hit_rate,
                    'true_positives': precision_recall['true_positives'],
                    'relevant_count': precision_recall['relevant_count']
                })
        
        return pd.DataFrame(results)
    
    def compare_retrieval_methods(self, methods: Dict[str, callable], 
                                queries: List[str]) -> pd.DataFrame:
        """Compare multiple retrieval methods"""
        print("🔍 COMPREHENSIVE RETRIEVAL EVALUATION")
        print("=" * 60)
        
        all_results = []
        
        for method_name, method_func in methods.items():
            method_results = self.evaluate_retrieval_method(method_func, method_name, queries)
            all_results.append(method_results)
        
        # Combine all results
        combined_results = pd.concat(all_results, ignore_index=True)
        
        return combined_results
    
    def generate_evaluation_report(self, results_df: pd.DataFrame) -> Dict[str, Any]:
        """Generate comprehensive evaluation report"""
        print("\n📋 EVALUATION REPORT")
        print("=" * 40)
        
        # Overall performance by method
        method_performance = results_df.groupby(['method', 'k']).agg({
            'precision_at_k': 'mean',
            'recall_at_k': 'mean',
            'mrr': 'mean',
            'ndcg_at_k': 'mean',
            'hit_rate_at_k': 'mean'
        }).round(4)
        
        print("\n📊 Average Performance by Method:")
        print(method_performance.to_string())
        
        # Best performing method for each metric, at every k present in the results
        metric_columns = {
            'precision': 'precision_at_k',
            'recall': 'recall_at_k',
            'mrr': 'mrr',
            'ndcg': 'ndcg_at_k',
            'hit_rate': 'hit_rate_at_k'
        }
        best_methods = {}
        method_averages = {}
        for k in sorted(results_df['k'].unique()):
            k_results = results_df[results_df['k'] == k]
            # Average only the metric columns; the string 'query' column cannot be averaged
            method_avg = k_results.groupby('method')[list(metric_columns.values())].mean()
            
            method_averages[f'k={k}'] = method_avg
            best_methods[f'k={k}'] = {
                metric: method_avg[column].idxmax()
                for metric, column in metric_columns.items()
            }
        
        print("\n🏆 Best Methods by Metric:")
        for k_label, metrics in best_methods.items():
            print(f"   {k_label}:")
            for metric, method in metrics.items():
                score = method_averages[k_label].loc[method, metric_columns[metric]]
                print(f"      {metric.capitalize()}: {method} ({score:.3f})")
        
        return {
            'method_performance': method_performance,
            'best_methods': best_methods,
            'raw_results': results_df
        }

# Initialize evaluator
evaluator = RetrievalEvaluator()
ground_truth = evaluator.create_evaluation_dataset()

print("📊 Retrieval Evaluator initialized!")
print(f"   Created ground truth for {len(ground_truth)} queries")

In [None]:
# Run comprehensive evaluation
evaluation_queries = list(ground_truth.keys())

# Define retrieval methods to compare
retrieval_methods = {
    'Multi-Stage': lambda q: retriever.retrieve(q, final_k=10),
    'Multi-Query': lambda q: advanced_retrieval.multi_query_retrieval(q, final_k=10),
    'Parent-Child': lambda q: advanced_retrieval.parent_child_retrieval(q, top_k=10),
    'Contextual': lambda q: advanced_retrieval.contextual_retrieval(q, top_k=10)
}

# Run evaluation
evaluation_results = evaluator.compare_retrieval_methods(retrieval_methods, evaluation_queries)

# Generate report
report = evaluator.generate_evaluation_report(evaluation_results)

# Visualize results
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Precision@K comparison
precision_data = evaluation_results.groupby(['method', 'k'])['precision_at_k'].mean().unstack()
precision_data.plot(kind='bar', ax=ax1, width=0.8)
ax1.set_title('Precision@K by Method', fontweight='bold', fontsize=14)
ax1.set_ylabel('Precision@K')
ax1.set_xlabel('Retrieval Method')
ax1.legend(title='K Value')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# 2. Recall@K comparison
recall_data = evaluation_results.groupby(['method', 'k'])['recall_at_k'].mean().unstack()
recall_data.plot(kind='bar', ax=ax2, width=0.8)
ax2.set_title('Recall@K by Method', fontweight='bold', fontsize=14)
ax2.set_ylabel('Recall@K')
ax2.set_xlabel('Retrieval Method')
ax2.legend(title='K Value')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)

# 3. NDCG@K comparison
ndcg_data = evaluation_results.groupby(['method', 'k'])['ndcg_at_k'].mean().unstack()
ndcg_data.plot(kind='bar', ax=ax3, width=0.8)
ax3.set_title('NDCG@K by Method', fontweight='bold', fontsize=14)
ax3.set_ylabel('NDCG@K')
ax3.set_xlabel('Retrieval Method')
ax3.legend(title='K Value')
ax3.tick_params(axis='x', rotation=45)
ax3.grid(True, alpha=0.3)

# 4. Hit Rate@K comparison
hit_rate_data = evaluation_results.groupby(['method', 'k'])['hit_rate_at_k'].mean().unstack()
hit_rate_data.plot(kind='bar', ax=ax4, width=0.8)
ax4.set_title('Hit Rate@K by Method', fontweight='bold', fontsize=14)
ax4.set_ylabel('Hit Rate@K')
ax4.set_xlabel('Retrieval Method')
ax4.legend(title='K Value')
ax4.tick_params(axis='x', rotation=45)
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Evaluation Insights (general tendencies; check them against the metrics above):")
print("- Multi-Query retrieval often improves recall by finding diverse relevant documents")
print("- Parent-Child retrieval can balance precision and completeness")
print("- Contextual retrieval can help with ambiguous queries")
print("- Multi-Stage baseline is a reasonable default to compare against")
print("- Choice of method should depend on specific use case requirements")

## ⚠️ Exercise 4: Failure Handling and Robust Retrieval

Let's implement robust failure handling, edge case management, and confidence scoring.
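
A confidence score is typically a weighted combination of several signals, and some signals may be unavailable for a given query (for example, score spread needs at least two results). One way to keep the score comparable in that case is to renormalize by the weights actually used. The sketch below is an assumption-laden illustration: the function name `weighted_confidence`, the signal values, and the weights are hypothetical and not taken from the class that follows.

```python
from typing import List, Optional, Tuple


def weighted_confidence(signals: List[Tuple[Optional[float], float]]) -> float:
    """Weighted average of (value, weight) pairs, skipping signals whose value is None."""
    present = [(value, weight) for value, weight in signals if value is not None]
    total_weight = sum(weight for _, weight in present)
    if total_weight == 0:
        return 0.0
    score = sum(value * weight for value, weight in present) / total_weight
    # Clamp to [0, 1] so the result can be compared against fixed thresholds
    return min(1.0, max(0.0, score))


# All five signals available (hypothetical values and weights)
full = weighted_confidence([(0.9, 0.3), (0.8, 0.2), (1.0, 0.2), (0.8, 0.15), (0.5, 0.15)])

# Second signal missing: the remaining weights are rescaled to sum to 1
partial = weighted_confidence([(0.9, 0.3), (None, 0.2), (1.0, 0.2), (0.8, 0.15), (0.5, 0.15)])

print(f"full={full:.3f}, partial={partial:.3f}")
```

Without the renormalization, a missing signal would silently lower the score and push otherwise good results below the confidence threshold.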

In [None]:
class RobustRetrievalSystem:
    """Production-ready retrieval system with failure handling and monitoring"""
    
    def __init__(self, base_retriever: MultiStageRetriever):
        self.base_retriever = base_retriever
        self.embedding_model = base_retriever.embedding_model
        
        # Failure tracking
        self.failure_stats = defaultdict(int)
        self.query_history = []
        
        # Confidence thresholds
        self.min_confidence_threshold = 0.3
        self.high_confidence_threshold = 0.7
        
        # Query preprocessing patterns
        self.query_cleaning_patterns = [
            (r'[^\w\s]', ' '),  # Remove special characters
            (r'\s+', ' '),      # Normalize whitespace
            (r'^\s+|\s+$', ''), # Strip leading/trailing whitespace
        ]
        
        # Fallback strategies, tried in this order after a low-confidence attempt
        self.fallback_strategies = [
            'expand_query',
            'simplify_query', 
            'broaden_search',
            'use_keywords_only'
        ]
    
    def preprocess_query(self, query: str) -> Dict[str, Any]:
        """Clean and analyze query before retrieval"""
        original_query = query
        
        # Clean query
        cleaned_query = query.lower().strip()
        
        for pattern, replacement in self.query_cleaning_patterns:
            cleaned_query = re.sub(pattern, replacement, cleaned_query)
        
        # Analyze query characteristics
        analysis = {
            'original': original_query,
            'cleaned': cleaned_query,
            'length': len(cleaned_query.split()),
            'is_empty': len(cleaned_query.strip()) == 0,
            'is_too_short': len(cleaned_query.split()) < 2,
            'is_too_long': len(cleaned_query.split()) > 50,
            'has_question_words': any(word in cleaned_query.split() 
                                    for word in ['what', 'how', 'why', 'when', 'where', 'who']),
            'complexity_score': textstat.flesch_reading_ease(cleaned_query) if cleaned_query else 0
        }
        
        return analysis
    
    def calculate_confidence_score(self, results: List[Tuple], query_analysis: Dict) -> float:
        """Calculate confidence score for retrieval results"""
        if not results:
            return 0.0
        
        # Each entry is (factor in [0, 1], weight); pairing them keeps the weights
        # aligned with their factors even when an optional factor is skipped
        weighted_factors = []
        
        # Factor 1: Top result score (scaled up; assumes scores roughly in [0, 1])
        top_score = results[0][1]
        weighted_factors.append((min(1.0, max(0.0, top_score * 2)), 0.3))
        
        # Factor 2: Score distribution (heuristic: less variation = higher confidence)
        scores = [r[1] for r in results]
        if len(scores) > 1:
            score_std = np.std(scores)
            score_uniformity = 1.0 - min(1.0, score_std * 2)
            weighted_factors.append((score_uniformity, 0.2))
        
        # Factor 3: Number of results found
        result_count_factor = min(1.0, len(results) / 5.0)  # Normalize to 5 results
        weighted_factors.append((result_count_factor, 0.2))
        
        # Factor 4: Query length (neither too short nor too long)
        if not query_analysis['is_too_short'] and not query_analysis['is_too_long']:
            weighted_factors.append((0.8, 0.15))
        else:
            weighted_factors.append((0.4, 0.15))
        
        # Factor 5: Keyword matching in the top result
        query_words = set(query_analysis['cleaned'].split())
        if query_words:
            top_doc_words = set(results[0][2].lower().split())
            word_overlap = len(query_words.intersection(top_doc_words)) / len(query_words)
            weighted_factors.append((word_overlap, 0.15))
        
        # Weighted average, normalized by the weights actually used
        total_weight = sum(w for _, w in weighted_factors)
        confidence = sum(f * w for f, w in weighted_factors) / total_weight
        
        return float(min(1.0, max(0.0, confidence)))
    
    def expand_query(self, query: str) -> str:
        """Expand query with related terms"""
        expansion_terms = {
            'ml': 'machine learning',
            'ai': 'artificial intelligence',
            'llm': 'large language model',
            'api': 'application programming interface',
            'gpu': 'graphics processing unit',
            'nlp': 'natural language processing'
        }
        
        expanded = query
        for abbrev, full_term in expansion_terms.items():
            # Match whole words only, so 'ai' does not fire inside 'training'
            pattern = rf'\b{re.escape(abbrev)}\b'
            expanded = re.sub(pattern, f"{abbrev} {full_term}", expanded, flags=re.IGNORECASE)
        
        # Add context words
        if 'optimization' in query.lower():
            expanded += " performance improvement efficiency"
        elif 'training' in query.lower():
            expanded += " learning education tutorial"
        elif 'authentication' in query.lower():
            expanded += " security login access control"
        
        return expanded
    
    def simplify_query(self, query: str) -> str:
        """Simplify complex query to key terms"""
        # Remove question words and common words
        stop_words = {'what', 'how', 'why', 'when', 'where', 'who', 'is', 'are', 'the', 'a', 'an'}
        words = query.lower().split()
        key_words = [w for w in words if w not in stop_words and len(w) > 2]
        # Keep the first 5 key words; fall back to the original query if none remain
        return ' '.join(key_words[:5]) or query
    
    def extract_keywords(self, query: str) -> str:
        """Extract only the most important keywords"""
        # Simple keyword extraction (in production, use more sophisticated NLP)
        important_terms = []
        words = query.lower().split()
        
        # Priority terms
        priority_terms = ['tensorflow', 'pytorch', 'api', 'authentication', 'training', 
                         'optimization', 'machine learning', 'neural network']
        
        for term in priority_terms:
            if term in query.lower():
                important_terms.append(term)
        
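        # Add remaining significant words, skipping any already covered by a priority term
        covered_words = {part for term in important_terms for part in term.split()}
        for word in words:
            if len(word) > 4 and word not in covered_words:
                important_terms.append(word)
                covered_words.add(word)
        
        # Keep the first 3 terms; fall back to the original query if none remain
        return ' '.join(important_terms[:3]) or query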
    
    def robust_retrieve(self, query: str, max_attempts: int = 3, 
                       min_results: int = 1) -> Dict[str, Any]:
        """Robust retrieval with fallback strategies"""
        start_time = time.time()
        
        # Preprocess query
        query_analysis = self.preprocess_query(query)
        
        # Handle obvious failures early
        if query_analysis['is_empty']:
            self.failure_stats['empty_query'] += 1
            return {
                'success': False,
                'error': 'Empty query provided',
                'results': [],
                'confidence': 0.0,
                'attempts': 0,
                'total_time': time.time() - start_time
            }
        
        # Try different strategies
        attempts = []
        current_query = query_analysis['cleaned']
        
        for attempt in range(max_attempts):
            try:
                print(f"   Attempt {attempt + 1}: '{current_query[:50]}{'...' if len(current_query) > 50 else ''}'")
                
                # Perform retrieval
                result = self.base_retriever.retrieve(current_query, final_k=10)
                
                # Calculate confidence
                confidence = self.calculate_confidence_score(result['results'], query_analysis)
                
                attempts.append({
                    'attempt': attempt + 1,
                    'query': current_query,
                    'results': result['results'],
                    'confidence': confidence,
                    'result_count': len(result['results'])
                })
                
                # Check if results are satisfactory
                if (len(result['results']) >= min_results and 
                    confidence >= self.min_confidence_threshold):
                    
                    # Success!
                    self.query_history.append({
                        'query': query,
                        'success': True,
                        'attempts': attempt + 1,
                        'confidence': confidence,
                        'timestamp': datetime.now()
                    })
                    
                    return {
                        'success': True,
                        'results': result['results'],
                        'confidence': confidence,
                        'attempts': attempt + 1,
                        'final_query': current_query,
                        'all_attempts': attempts,
                        'total_time': time.time() - start_time
                    }
                
            except Exception as e:
                self.failure_stats['retrieval_error'] += 1
                attempts.append({
                    'attempt': attempt + 1,
                    'query': current_query,
                    'error': str(e),
                    'results': [],
                    'confidence': 0.0
                })
            
            # Try a different strategy for the next attempt; this also runs after
            # an error, so a failing query is not retried unchanged
            if attempt < max_attempts - 1:
                strategy = self.fallback_strategies[attempt % len(self.fallback_strategies)]
                
                if strategy == 'expand_query':
                    current_query = self.expand_query(current_query)
                elif strategy == 'simplify_query':
                    current_query = self.simplify_query(current_query)
                elif strategy == 'use_keywords_only':
                    current_query = self.extract_keywords(current_query)
                elif strategy == 'broaden_search':
                    # Use broader terms
                    current_query = current_query + " guide tutorial introduction"
        
        # All attempts failed
        self.failure_stats['max_attempts_reached'] += 1
        self.query_history.append({
            'query': query,
            'success': False,
            'attempts': max_attempts,
            'timestamp': datetime.now()
        })
        
        # Return best attempt (highest confidence); the default covers max_attempts=0
        best_attempt = max(attempts, key=lambda x: x.get('confidence', 0), default={})
        
        return {
            'success': False,
            'results': best_attempt.get('results', []),
            'confidence': best_attempt.get('confidence', 0.0),
            'attempts': max_attempts,
            'final_query': best_attempt.get('query', current_query),
            'all_attempts': attempts,
            'warning': 'Low confidence results - manual review recommended',
            'total_time': time.time() - start_time
        }
    
    def get_system_health(self) -> Dict[str, Any]:
        """Get system health and performance statistics"""
        total_queries = len(self.query_history)
        successful_queries = sum(1 for q in self.query_history if q['success'])
        
        if total_queries > 0:
            success_rate = successful_queries / total_queries
            avg_attempts = np.mean([q['attempts'] for q in self.query_history])
        else:
            success_rate = avg_attempts = 0
        
        # Average confidence over successful queries only; np.mean([]) would return nan
        successful_confidences = [q.get('confidence', 0) for q in self.query_history if q['success']]
        avg_confidence = np.mean(successful_confidences) if successful_confidences else 0
        
        return {
            'total_queries': total_queries,
            'successful_queries': successful_queries,
            'success_rate': success_rate,
            'average_attempts': avg_attempts,
            'average_confidence': avg_confidence,
            'failure_breakdown': dict(self.failure_stats),
            'health_status': 'Healthy' if success_rate > 0.8 else 'Warning' if success_rate > 0.6 else 'Critical'
        }

# Initialize robust retrieval system
robust_system = RobustRetrievalSystem(retriever)
print("🛡️ Robust Retrieval System initialized!")

In [None]:
# Test robust retrieval with various edge cases
test_cases = [
    "tensorflow optimization performance",  # Normal query
    "",  # Empty query
    "   ",  # Whitespace only
    "how",  # Too short
    "xyz123 nonexistent made-up-term",  # No matches expected
    "What are the best practices for optimizing machine learning model training performance with GPU acceleration?",  # Long complex query
    "api auth",  # Abbreviations
    "ML performance tuning",  # Mixed abbreviations and terms
]

print("🧪 ROBUST RETRIEVAL TESTING")
print("=" * 60)

for i, test_query in enumerate(test_cases):
    print(f"\n🔍 Test Case {i+1}: '{test_query}'")
    print("-" * 40)
    
    result = robust_system.robust_retrieve(test_query, max_attempts=3)
    
    print(f"✅ Success: {result['success']}")
    print(f"🎯 Confidence: {result['confidence']:.3f}")
    print(f"🔄 Attempts: {result['attempts']}")
    print(f"⏱️ Time: {result['total_time']:.3f}s")
    
    if 'final_query' in result:
        print(f"🔧 Final Query: '{result['final_query']}'")
    
    if result['results']:
        print(f"📄 Top Result: {result['results'][0][2][:60]}...")
    
    if 'warning' in result:
        print(f"⚠️ Warning: {result['warning']}")
    
    if 'error' in result:
        print(f"❌ Error: {result['error']}")

# Display system health
health = robust_system.get_system_health()
print("\n🏥 SYSTEM HEALTH REPORT")
print("=" * 30)
print(f"Status: {health['health_status']}")
print(f"Success Rate: {health['success_rate']:.1%}")
print(f"Average Attempts: {health['average_attempts']:.1f}")
print(f"Average Confidence: {health['average_confidence']:.3f}")
print(f"Total Queries: {health['total_queries']}")
print(f"Failures: {health['failure_breakdown']}")

## 🎯 Key Takeaways

From this module, you should now understand:

### 🏗️ Multi-Stage Retrieval Architecture:
1. **Stage 1 - Initial Retrieval**: Fast, broad search (vector + BM25 hybrid)
2. **Stage 2 - Re-ranking**: Precise scoring with cross-encoders or LLM judges
3. **Stage 3 - Post-processing**: Diversity, deduplication, context optimization
4. **Benefits**: Balances speed and quality, handles large-scale retrieval efficiently
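The three stages above can be sketched in a few lines. This is a minimal illustration, not the module's actual pipeline: `keyword_overlap`, `three_stage_retrieve`, the `(doc_id, text)` document shape, and the length-penalized stand-in re-ranker are all assumptions made for the example. In practice Stage 1 would be a vector/BM25 hybrid and Stage 2 a cross-encoder or LLM judge.

```python
from typing import Callable, List, Tuple

Doc = Tuple[str, str]  # (doc_id, text): hypothetical document shape


def keyword_overlap(query: str, text: str) -> float:
    """Cheap Stage 1 score: fraction of query terms that appear in the text."""
    q_terms = set(query.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & set(text.lower().split())) / len(q_terms)


def three_stage_retrieve(
    query: str,
    corpus: List[Doc],
    rerank_fn: Callable[[str, str], float],
    stage1_k: int = 100,
    stage2_k: int = 10,
    max_chars: int = 2000,
) -> List[Doc]:
    # Stage 1: fast, broad candidate selection
    candidates = sorted(
        corpus, key=lambda d: keyword_overlap(query, d[1]), reverse=True
    )[:stage1_k]
    # Stage 2: more expensive scoring, applied only to the candidates
    reranked = sorted(
        candidates, key=lambda d: rerank_fn(query, d[1]), reverse=True
    )[:stage2_k]
    # Stage 3: drop exact duplicates and stay within the context budget
    seen, final, used = set(), [], 0
    for doc_id, text in reranked:
        key = text.strip().lower()
        if key in seen or used + len(text) > max_chars:
            continue
        seen.add(key)
        final.append((doc_id, text))
        used += len(text)
    return final


corpus = [
    ("a", "GPU acceleration speeds up model training"),
    ("b", "GPU acceleration speeds up model training"),  # duplicate of "a"
    ("c", "Baking bread requires patience"),
    ("d", "Training tips: tune batch size for GPU memory"),
]
# Stand-in re-ranker (assumption): overlap score with a small length penalty
results = three_stage_retrieve(
    "gpu training",
    corpus,
    rerank_fn=lambda q, t: keyword_overlap(q, t) - 0.001 * len(t),
    stage1_k=3,
    stage2_k=3,
)
print([doc_id for doc_id, _ in results])
```

Keeping the re-ranker as an injected callable means the expensive model can be swapped without touching the candidate selection or post-processing logic.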

### 🚀 Advanced Retrieval Techniques (2025):

#### Multi-Query Retrieval:
- **Improvement**: 25-40% better recall through query diversification
- **Cost**: 3-5x more retrieval operations
- **Best for**: High recall requirements, diverse terminology domains
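One common way to fuse the per-variant results is Reciprocal Rank Fusion (RRF), where every appearance of a document contributes `1 / (k + rank)` to its score. The sketch below assumes a hypothetical `retrieve_fn` that maps a query string to a ranked list of document IDs, and uses a dictionary as a stand-in index. The constant `k = 60` is the value commonly used for RRF.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc IDs; each appearance adds 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def multi_query_retrieve(
    query_variants: List[str],
    retrieve_fn: Callable[[str], List[str]],
    top_k: int = 5,
) -> List[str]:
    # One retrieval call per variant: this is where the 3-5x cost comes from
    ranked_lists = [retrieve_fn(q) for q in query_variants]
    return reciprocal_rank_fusion(ranked_lists)[:top_k]


# Stand-in index (assumption): maps each query variant to ranked doc IDs
fake_index = {
    "speed up training": ["d1", "d2", "d3"],
    "faster model training": ["d2", "d4"],
    "training performance optimization": ["d2", "d1"],
}
fused = multi_query_retrieve(list(fake_index), fake_index.__getitem__)
print(fused)
```

Documents that rank well for several variants (here `d2`) rise to the top, even if no single variant ranks them first.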

#### Parent-Child Retrieval:
- **Strategy**: Search small chunks (precision), return large chunks (context)
- **Benefit**: Optimal balance of specificity and completeness
- **Implementation**: Hierarchical document chunking with mapping
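A minimal sketch of the child-to-parent mapping follows. It assumes word-based child chunks and a simple term-overlap score in place of vector search. `build_children`, `term_overlap`, and `parent_child_retrieve` are illustrative names, not part of any library.

```python
from typing import Dict, List, Tuple

Child = Tuple[str, str, str]  # (child_id, parent_id, child_text)


def build_children(parents: Dict[str, str], child_words: int = 5) -> List[Child]:
    """Split each parent document into small word-based child chunks."""
    children: List[Child] = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_words):
            child_text = " ".join(words[start:start + child_words])
            children.append((f"{parent_id}#{start // child_words}", parent_id, child_text))
    return children


def term_overlap(query: str, text: str) -> int:
    """Stand-in relevance score: number of shared lowercase terms."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def parent_child_retrieve(
    query: str, children: List[Child], parents: Dict[str, str], top_k: int = 2
) -> List[str]:
    # Search over the small chunks for precision
    ranked = sorted(children, key=lambda c: term_overlap(query, c[2]), reverse=True)
    parent_ids: List[str] = []
    for _, parent_id, child_text in ranked:
        if term_overlap(query, child_text) == 0:
            break  # all remaining children have no overlap
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == top_k:
            break
    # Return the full parent text for context
    return [parents[pid] for pid in parent_ids]


parents = {
    "p1": "Install the package with pip. Then configure the API key in settings. Finally run the server.",
    "p2": "Bread recipes need flour, water, yeast, and salt.",
}
children = build_children(parents)
context = parent_child_retrieve("configure API key", children, parents)
print(context)
```

The match happens on the five-word child `"Then configure the API key"`, but the caller receives the whole parent, including the surrounding installation and startup steps.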

#### Contextual Retrieval (Anthropic 2025):
- **Innovation**: Add context to chunks before embedding
- **Improvement**: Up to 49% reduction in failed retrievals, as reported by Anthropic when contextual embeddings are combined with contextual BM25
- **Method**: LLM generates contextual information for each chunk
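The mechanics can be sketched as follows, under the assumption that the context generator is a pluggable function. Here `title_context` is a placeholder that only uses the document title; in the approach described above, that function would be an LLM call that sees the full document and the chunk. `contextualize_chunks` is an illustrative name.

```python
from typing import Callable, List


def title_context(doc_title: str, chunk: str) -> str:
    # Placeholder (assumption): a real system would call an LLM here
    return f"This chunk is from the document '{doc_title}'."


def contextualize_chunks(
    doc_title: str,
    chunks: List[str],
    context_fn: Callable[[str, str], str] = title_context,
) -> List[str]:
    """Prepend generated context to each chunk before it is embedded or indexed."""
    return [f"{context_fn(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]


chunks = ["Revenue grew by 3% over the previous quarter."]
contextualized = contextualize_chunks("ACME Q2 2023 report", chunks)
print(contextualized[0])
```

The contextualized text, not the raw chunk, is what gets embedded and indexed, so a query mentioning "ACME" or "Q2 2023" can match a chunk that never names either.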

### 📊 Evaluation Metrics Hierarchy:
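1. **Precision@K**: Fraction of the top-K results that are relevant
2. **Recall@K**: Fraction of all relevant documents that appear in the top K
3. **NDCG@K**: Ranking quality with position-discounted gains, normalized against the ideal ordering
4. **MRR**: Mean Reciprocal Rank, the average of 1/rank of the first relevant result across queries
5. **Hit Rate**: Share of queries with at least one relevant result in the top K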
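For reference, these metrics can be computed as below. This is a minimal sketch assuming binary relevance (a document is either relevant or not) and ranked lists of document IDs. The function names are illustrative.

```python
import math
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Binary gains: a relevant doc at 0-based position i contributes 1 / log2(i + 2)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[List[str]], relevants: List[Set[str]]) -> float:
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)


def hit_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return 1.0 if any(d in relevant for d in retrieved[:k]) else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(f"P@3={precision_at_k(retrieved, relevant, 3):.3f}")
print(f"R@3={recall_at_k(retrieved, relevant, 3):.3f}")
print(f"NDCG@3={ndcg_at_k(retrieved, relevant, 3):.3f}")
print(f"RR={reciprocal_rank(retrieved, relevant):.3f}")
print(f"Hit@3={hit_at_k(retrieved, relevant, 3):.0f}")
```

Hit Rate over a query set is the mean of `hit_at_k` across queries, in the same way that MRR is the mean of `reciprocal_rank`.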

### 🛡️ Production Robustness:

#### Failure Handling:
- **Query preprocessing**: Clean, validate, and analyze queries
- **Fallback strategies**: Expand, simplify, broaden, extract keywords
- **Confidence scoring**: Multi-factor confidence assessment
- **Graceful degradation**: Return best effort results with warnings

#### Monitoring & Health:
- **Success rate tracking**: Monitor query success/failure rates
- **Confidence distribution**: Track confidence score patterns
- **Performance metrics**: Query latency and throughput monitoring
- **Error categorization**: Systematic failure analysis
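The monitoring points above could be tracked with a small rolling-window recorder such as the sketch below. `RetrievalMonitor` is a hypothetical helper, separate from the `RobustRetrievalSystem` class in this module. It assumes the caller measures latency and supplies an error label when a query fails.

```python
import math
from collections import Counter, deque
from typing import Any, Deque, Dict, Optional, Tuple


class RetrievalMonitor:
    """Rolling-window statistics over the most recent queries."""

    def __init__(self, window: int = 100) -> None:
        # Each event is (success, confidence, latency_ms); old events fall off
        self.events: Deque[Tuple[bool, float, float]] = deque(maxlen=window)
        # Error counts are cumulative, not windowed
        self.errors: Counter = Counter()

    def record(
        self,
        success: bool,
        confidence: float,
        latency_ms: float,
        error_type: Optional[str] = None,
    ) -> None:
        self.events.append((success, confidence, latency_ms))
        if error_type is not None:
            self.errors[error_type] += 1

    def snapshot(self) -> Dict[str, Any]:
        n = len(self.events)
        if n == 0:
            return {"success_rate": 0.0, "avg_confidence": 0.0,
                    "p95_latency_ms": 0.0, "errors": {}}
        latencies = sorted(e[2] for e in self.events)
        p95_index = math.ceil(0.95 * n) - 1  # nearest-rank percentile
        return {
            "success_rate": sum(1 for e in self.events if e[0]) / n,
            "avg_confidence": sum(e[1] for e in self.events) / n,
            "p95_latency_ms": latencies[p95_index],
            "errors": dict(self.errors),
        }


monitor = RetrievalMonitor(window=100)
monitor.record(True, 0.9, 100.0)
monitor.record(True, 0.7, 200.0)
monitor.record(False, 0.2, 900.0, error_type="no_results")
monitor.record(True, 0.8, 150.0)
stats = monitor.snapshot()
print(stats)
```

A bounded window keeps the statistics responsive to recent regressions instead of averaging them away over the full query history.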

### 🎯 Performance Optimization Guidelines:

| Stage | Optimization Focus | Typical Latency |
|-------|-------------------|----------------|
| **Stage 1** | Vector index efficiency, hybrid fusion | 50-200ms |
| **Stage 2** | Cross-encoder batching, model size | 100-500ms |
| **Stage 3** | Diversity algorithms, metadata processing | 10-50ms |
| **Total** | End-to-end pipeline optimization | 160-750ms |
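To check a pipeline against per-stage budgets like those in the table, each stage can be wrapped in a timer. The sketch below uses a hypothetical `StageTimer` helper. The budget values are the upper bounds from the table, and the stage bodies are placeholders for real work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.timings_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def over_budget(self, budgets_ms: Dict[str, float]) -> List[str]:
        """Return the stages whose measured time exceeds their budget."""
        return [n for n, b in budgets_ms.items() if self.timings_ms.get(n, 0.0) > b]


budgets_ms = {"stage_1": 200.0, "stage_2": 500.0, "stage_3": 50.0}
timer = StageTimer()
with timer.stage("stage_1"):
    candidates = list(range(500))            # placeholder for initial retrieval
with timer.stage("stage_2"):
    reranked = sorted(candidates)[:50]       # placeholder for re-ranking
with timer.stage("stage_3"):
    final = reranked[:10]                    # placeholder for post-processing
print(timer.timings_ms)
print("Over budget:", timer.over_budget(budgets_ms))
```

Measuring stages separately shows where the time goes, which is usually more actionable than a single end-to-end latency figure.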

### 🔄 Retrieval Strategy Selection:

1. **High Precision Needs**: Multi-stage with cross-encoder re-ranking
2. **High Recall Needs**: Multi-query retrieval with fusion
3. **Context Rich Results**: Parent-child retrieval
4. **Ambiguous Queries**: Contextual retrieval with metadata
5. **Production Deployment**: Robust system with fallbacks

### 📈 Implementation Roadmap:
1. **Start Simple** → 2. **Add Multi-Stage** → 3. **Implement Advanced Techniques** → 4. **Add Robustness** → 5. **Monitor & Optimize**

## 🎯 Next Steps

In the next modules, we'll explore:
- **Module 10**: Prompt engineering for optimal RAG performance
- **Module 11**: LLM integration and model selection strategies  
- **Module 12**: Complete RAG system integration and deployment

Mastering retrieval strategies is crucial for building high-quality RAG systems that can handle diverse queries and scale to production workloads!

## 🤔 Discussion Questions

1. When would you choose multi-query retrieval over parent-child retrieval?
2. How would you balance retrieval latency against quality in a real-time system?
3. What additional confidence factors would you implement for your domain?
4. How would you handle retrieval for multilingual documents?
5. What metrics would you use to trigger re-indexing in production?

## 📝 Optional Exercises

1. **Real Cross-Encoder**: Implement actual cross-encoder re-ranking with Hugging Face models
2. **LLM Judge**: Use GPT-4 or Claude for LLM-based re-ranking
3. **Custom Confidence**: Design domain-specific confidence scoring
4. **A/B Testing**: Compare retrieval strategies with statistical significance
5. **Production Monitoring**: Build dashboards for retrieval system health