# Module 8: Search Methods Comparison

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand the fundamental differences between semantic and lexical search
- Implement both BM25 (lexical) and vector similarity (semantic) search
- Compare results on different query types and identify when to use each approach
- Design and implement hybrid search strategies using Reciprocal Rank Fusion (RRF)
- Implement advanced techniques like HyDE and contextual retrieval

## 📚 Key Concepts

### The Search Method Spectrum

Modern information retrieval spans a spectrum from **exact matching** to **semantic understanding**:

```
Lexical Search ←→ Semantic Search ←→ Hybrid Search
(Exact Terms)      (Meaning)        (Best of Both)
```

### 🔤 Lexical Search (Traditional)

**How it works**: Matches words and phrases directly
**Algorithm**: BM25 (Best Matching 25), the de facto standard ranking function for lexical retrieval
**Strengths**:
- Excellent for exact terms, names, IDs
- Fast and resource-efficient
- Explainable results
- No training required

**Weaknesses**:
- Misses synonyms and paraphrases
- No understanding of context or meaning
- Sensitive to spelling variations

**Example**:
```
Query: "machine learning algorithms"
✅ Matches: "machine learning algorithms are powerful"
❌ Misses: "AI techniques for pattern recognition" (same concept, different words)
```
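
To make the "exact terms only" behavior concrete, here is a minimal pure-Python sketch of BM25 scoring on the two example sentences above. It assumes whitespace tokenization and a smoothed IDF variant; `bm25_scores` is an illustrative helper, not the `rank_bm25` API, and libraries differ in the exact IDF formula, so absolute scores will not match theirs.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi BM25-style formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # no literal match, so no contribution
            # Smoothed IDF (assumed variant; the "1 +" keeps it positive)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "machine learning algorithms are powerful",
    "AI techniques for pattern recognition",
]
tokenized = [doc.lower().split() for doc in docs]
scores = bm25_scores("machine learning algorithms".split(), tokenized)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second sentence shares no token with the query, so its score is exactly zero regardless of how related the topic is.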

### 🧠 Semantic Search (Modern)

**How it works**: Understands meaning through vector embeddings
**Algorithm**: Vector similarity (cosine, dot product)
**Strengths**:
- Captures meaning and intent
- Handles synonyms and paraphrases
- Works across languages (when the embedding model is multilingual)
- Good for conceptual queries

**Weaknesses**:
- May miss exact terms if not semantically emphasized
- Computationally expensive
- Less explainable
- Requires quality embeddings

**Example**:
```
Query: "machine learning algorithms"
✅ Matches: "AI techniques for pattern recognition" (understands the concept)
❌ Might miss: "ML-2024-Report-v3.pdf" (exact filename)
```
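
The two similarity measures named above are closely related. The sketch below uses made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions) to show that once vectors are L2-normalized, a plain dot product equals cosine similarity. All vector values are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical embeddings, chosen only for illustration
query_vec = [0.9, 0.1, 0.3]
doc_vecs = {
    "AI techniques for pattern recognition": [0.8, 0.2, 0.4],
    "Q4 revenue increased by 15%": [0.1, 0.9, 0.2],
}

for text, vec in doc_vecs.items():
    cos = cosine_similarity(query_vec, vec)
    # After normalization, the dot product gives the same value
    dot_normalized = dot(normalize(query_vec), normalize(vec))
    print(f"cosine={cos:.3f}  normalized dot={dot_normalized:.3f}  {text}")
```

This is why an index can store normalized embeddings once and then rank with a single matrix-vector product.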

### 🔄 Hybrid Search (Best Practice 2025)

**How it works**: Combines both approaches intelligently
**Standard**: BM25 + vector similarity, merged with Reciprocal Rank Fusion (RRF)
**Result**: Gets benefits of both methods
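
As a rough illustration of how the merge works, the sketch below fuses two ranked lists with unweighted RRF. Each document receives `1 / (k + rank)` from every list it appears in, using 1-based ranks. The document IDs and the helper name `rrf_fuse` are hypothetical; `k=60` is the commonly used default.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical lexical results
vector_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical semantic results

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale. Documents that rank well in both lists (`doc_a`, `doc_c`) rise above documents found by only one method.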

### 2025 Production Standards 🏆

| Method | Use Case | Speed / Quality | Adoption |
|--------|----------|-----------------|----------|
| **BM25 Only** | Legacy systems, exact matching | Fast | Declining |
| **Semantic Only** | Research, conceptual queries | Moderate speed | Common |
| **Hybrid (BM25+Semantic)** | Production systems | Best relevance (runs both retrievers) | Standard |

### Advanced Techniques (2025)

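1. **Reciprocal Rank Fusion (RRF)**: Rank-based algorithm for merging result lists; it needs no score normalization
2. **HyDE (Hypothetical Document Embeddings)**: Have an LLM draft a hypothetical answer document, then search with that draft's embedding to improve zero-shot retrieval
3. **Contextual Retrieval**: Prepend a short, document-aware context to each chunk before indexing (Anthropic reported up to 49% fewer failed retrievals)
4. **Multi-Query Expansion**: Generate multiple query variations for better coverage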
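
To illustrate the idea behind contextual retrieval, the sketch below prepends a situating sentence to a chunk before it would be indexed. In a real pipeline an LLM writes that context from the full document; here a fixed template stands in for it. The helper names, the sample chunk, and the document title are all hypothetical.

```python
def add_context(doc_title, chunk, describe=None):
    """Return the chunk with a short situating sentence prepended."""
    if describe is None:
        # Stand-in for an LLM call that would write chunk-specific context
        context = f"This chunk is from the document '{doc_title}'."
    else:
        context = describe(doc_title, chunk)
    return f"{context} {chunk}"

def term_overlap(query, text):
    """Count query terms that literally appear in the text (a crude lexical signal)."""
    text_terms = set(text.lower().replace("'", " ").replace(".", " ").split())
    return sum(term in text_terms for term in query.lower().split())

chunk = "Revenue grew by 3% over the previous quarter."
contextual_chunk = add_context("ACME Corp Q2 2023 report", chunk)

query = "acme revenue growth"
print("plain chunk overlap:     ", term_overlap(query, chunk))
print("contextual chunk overlap:", term_overlap(query, contextual_chunk))
```

The plain chunk never mentions the company, so a query naming it has less to match against. The contextualized chunk carries that information into both the BM25 index and the embedding.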

## 🛠️ Setup
Let's install the required packages and set up our search comparison lab.

In [None]:
# Install required packages
!pip install -q rank-bm25 sentence-transformers scikit-learn numpy pandas matplotlib seaborn
!pip install -q langchain langchain-community openai python-dotenv
!pip install -q nltk textstat wordcloud plotly
# Note: For production, also add: elasticsearch, weaviate-client

In [None]:
import os
import re
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import List, Dict, Any, Tuple, Optional, Union
import json
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# Search libraries
from rank_bm25 import BM25Okapi, BM25L, BM25Plus
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import textstat

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud

# LangChain for advanced techniques
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

from dotenv import load_dotenv
load_dotenv()

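# Download required NLTK data
# (newer NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt')
for resource_path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package, quiet=True)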

# Set up visualization
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Setup complete!")
print(f"📅 Today's date: {datetime.now().strftime('%Y-%m-%d')}")

## 🧪 Exercise 1: Implementing and Comparing Search Methods

Let's implement both lexical and semantic search to understand their characteristics.

In [None]:
class SearchMethodComparison:
    """Compare different search methods: lexical, semantic, and hybrid"""
    
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        
        # Initialize search indexes
        self.bm25_index = None
        self.embeddings = None
        self.documents = []
        self.tokenized_docs = []
        
    def create_sample_dataset(self) -> List[str]:
        """Create a diverse dataset for search comparison"""
        documents = [
            # AI/ML Documents (conceptual)
            "Machine learning algorithms enable computers to learn patterns from data without explicit programming.",
            "Deep learning uses neural networks with multiple layers to model complex patterns.",
            "Artificial intelligence techniques include supervised learning, unsupervised learning, and reinforcement learning.",
            "Natural language processing helps computers understand and generate human language.",
            "Computer vision algorithms analyze and interpret visual information from images and videos.",
            
            # Technical Documents (specific terms)
            "The TensorFlow library version 2.14.0 includes new optimizations for GPU computation.",
            "PyTorch framework provides dynamic computational graphs for research applications.",
            "CUDA programming enables parallel computing on NVIDIA graphics processing units.",
            "Docker containers provide consistent deployment environments across different systems.",
            "Kubernetes orchestrates containerized applications at scale in production environments.",
            
            # Business Documents (mixed)
            "Q4 revenue increased by 15% due to strong performance in the AI products division.",
            "The company's machine learning initiatives generated $50M in additional revenue this quarter.",
            "Customer satisfaction scores improved significantly after implementing AI-powered support tools.",
            "Data science team delivered 12 predictive models for various business units.",
            "Investment in artificial intelligence research and development reached $100M this year.",
            
            # Research Papers (academic)
            "Attention mechanisms in transformer architectures enable better sequence modeling.",
            "Pre-trained language models achieve state-of-the-art results on downstream NLP tasks.",
            "Contrastive learning improves representation learning in self-supervised settings.",
            "Graph neural networks effectively model relationships between entities in complex datasets.",
            "Federated learning enables distributed training while preserving data privacy.",
            
            # Code Documentation (technical)
            "The function calculate_accuracy() computes classification performance metrics.",
            "Use train_model(X_train, y_train) to fit the machine learning model on training data.",
            "The API endpoint /api/v1/predict accepts JSON payload with feature vectors.",
            "Error handling in ML pipelines prevents crashes during data preprocessing steps.",
            "Model versioning tracks changes to algorithm parameters and training configurations.",
            
            # News Articles (current events style)
            "Tech giants invest billions in generative AI research to compete with OpenAI ChatGPT.",
            "New breakthrough in quantum computing could revolutionize machine learning algorithms.",
            "Regulatory concerns grow around AI safety and potential risks of advanced systems.",
            "Healthcare industry adopts AI diagnostic tools to improve patient outcomes.",
            "Educational institutions integrate AI tutoring systems into online learning platforms."
        ]
        
        return documents
    
    def preprocess_text(self, text: str) -> List[str]:
        """Preprocess text for BM25 indexing"""
        # Convert to lowercase and tokenize
        tokens = word_tokenize(text.lower())
        
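        # Remove stop words and pure-punctuation tokens, but keep tokens that
        # contain digits or underscores (e.g. "2.14.0", "calculate_accuracy")
        # so exact identifiers and version numbers stay searchable
        tokens = [token for token in tokens
                  if any(ch.isalnum() for ch in token) and token not in self.stop_words]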
        
        # Apply stemming
        tokens = [self.stemmer.stem(token) for token in tokens]
        
        return tokens
    
    def build_indexes(self, documents: List[str]) -> None:
        """Build both BM25 and semantic indexes"""
        self.documents = documents
        
        print("Building search indexes...")
        
        # 1. Build BM25 index
        print("   Building BM25 index...")
        self.tokenized_docs = [self.preprocess_text(doc) for doc in documents]
        self.bm25_index = BM25Okapi(self.tokenized_docs)
        
        # 2. Build semantic index
        print("   Building semantic index...")
        self.embeddings = self.embedding_model.encode(documents)
        # Normalize for cosine similarity
        self.embeddings = self.embeddings / np.linalg.norm(self.embeddings, axis=1, keepdims=True)
        
        print(f"✅ Indexes built for {len(documents)} documents")
    
    def lexical_search(self, query: str, top_k: int = 10) -> List[Tuple[int, float, str]]:
        """Perform BM25 lexical search"""
        query_tokens = self.preprocess_text(query)
        scores = self.bm25_index.get_scores(query_tokens)
        
        # Get top-k results
        top_indices = np.argsort(scores)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            if scores[idx] > 0:  # Only include documents with non-zero scores
                results.append((idx, scores[idx], self.documents[idx]))
        
        return results
    
    def semantic_search(self, query: str, top_k: int = 10) -> List[Tuple[int, float, str]]:
        """Perform semantic vector search"""
        query_embedding = self.embedding_model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        
        # Calculate cosine similarities
        similarities = np.dot(self.embeddings, query_embedding.T).flatten()
        
        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append((idx, similarities[idx], self.documents[idx]))
        
        return results
    
    def compare_search_methods(self, queries: List[str]) -> List[Dict[str, Any]]:
        """Compare lexical and semantic search across multiple queries"""
        comparison_results = []
        
        for query in queries:
            print(f"\n🔍 Query: '{query}'")
            
            # Get results from both methods
            lexical_results = self.lexical_search(query, top_k=5)
            semantic_results = self.semantic_search(query, top_k=5)
            
            # Analyze overlap
            lexical_indices = set([r[0] for r in lexical_results])
            semantic_indices = set([r[0] for r in semantic_results])
            overlap = len(lexical_indices.intersection(semantic_indices))
            
            print(f"   📊 Overlap: {overlap}/5 documents")
            
            # Display top results
            print("   🔤 BM25 (Lexical) Top 3:")
            for i, (idx, score, doc) in enumerate(lexical_results[:3]):
                print(f"      {i+1}. [{score:.3f}] {doc[:80]}...")
            
            print("   🧠 Semantic Top 3:")
            for i, (idx, score, doc) in enumerate(semantic_results[:3]):
                print(f"      {i+1}. [{score:.3f}] {doc[:80]}...")
            
            # Normalize overlap by the shorter result list (BM25 can return fewer than 5 hits)
            shorter = min(len(lexical_results), len(semantic_results))
            comparison_results.append({
                'query': query,
                'lexical_results': lexical_results,
                'semantic_results': semantic_results,
                'overlap': overlap,
                'overlap_ratio': overlap / shorter if shorter > 0 else 0
            })
        
        return comparison_results
    
    def analyze_query_characteristics(self, query: str) -> Dict[str, Any]:
        """Analyze query characteristics to predict which search method might work better"""
        characteristics = {}
        
        # Basic statistics
        characteristics['length'] = len(query.split())
        characteristics['has_quotes'] = '"' in query
        characteristics['has_technical_terms'] = bool(re.search(r'\b(API|GPU|CPU|v\d+\.\d+|[A-Z]{2,})\b', query))
        characteristics['has_numbers'] = bool(re.search(r'\d+', query))
        
        # Readability (Flesch Reading Ease: higher = easier to read)
        characteristics['readability'] = textstat.flesch_reading_ease(query)
        
        # Predict best method
        if characteristics['has_technical_terms'] or characteristics['has_quotes'] or characteristics['has_numbers']:
            characteristics['predicted_best'] = 'lexical'
            characteristics['reason'] = 'Contains specific terms, numbers, or exact phrases'
        elif characteristics['length'] > 5 and characteristics['readability'] > 60:
            characteristics['predicted_best'] = 'semantic'
            characteristics['reason'] = 'Long conceptual query'
        else:
            characteristics['predicted_best'] = 'hybrid'
            characteristics['reason'] = 'Mixed characteristics, hybrid approach recommended'
        
        return characteristics

# Initialize the search comparison
search_lab = SearchMethodComparison()
print("🔍 Search Method Comparison Lab initialized!")

In [None]:
# Create dataset and build indexes
documents = search_lab.create_sample_dataset()
search_lab.build_indexes(documents)

print(f"📚 Dataset Statistics:")
print(f"   Total documents: {len(documents)}")
print(f"   Average document length: {np.mean([len(doc.split()) for doc in documents]):.1f} words")
print(f"   Document categories: AI/ML, Technical, Business, Research, Code, News")

# Display sample documents
print("\n📄 Sample Documents:")
for i, doc in enumerate(documents[:3]):
    print(f"   {i+1}. {doc}")

In [None]:
# Test different types of queries
test_queries = [
    # Conceptual queries (should favor semantic)
    "AI techniques for pattern recognition",
    "Methods to improve business performance",
    "Understanding human language with computers",
    
    # Specific term queries (should favor lexical)
    "TensorFlow version 2.14.0",
    "calculate_accuracy() function",
    "Q4 revenue 15%",
    
    # Mixed queries (could benefit from hybrid)
    "machine learning model training",
    "AI research investment",
    "neural network optimization"
]

print("🧪 SEARCH METHOD COMPARISON")
print("=" * 60)

comparison_results = search_lab.compare_search_methods(test_queries)

# Analyze query characteristics
print("\n🔍 QUERY ANALYSIS")
print("=" * 60)

for query in test_queries[:6]:  # Analyze first 6 queries
    characteristics = search_lab.analyze_query_characteristics(query)
    print(f"\n📝 Query: '{query}'")
    print(f"   Length: {characteristics['length']} words")
    print(f"   Technical terms: {characteristics['has_technical_terms']}")
    print(f"   Contains numbers: {characteristics['has_numbers']}")
    print(f"   Predicted best method: {characteristics['predicted_best']}")
    print(f"   Reason: {characteristics['reason']}")

In [None]:
# Visualize comparison results
overlap_ratios = [result['overlap_ratio'] for result in comparison_results]
query_lengths = [len(result['query'].split()) for result in comparison_results]
query_labels = [result['query'][:30] + "..." if len(result['query']) > 30 else result['query'] 
               for result in comparison_results]

# Create visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Overlap between search methods
ax1.bar(range(len(comparison_results)), overlap_ratios, color='skyblue')
ax1.set_title('Overlap Between Lexical and Semantic Search', fontweight='bold', fontsize=14)
ax1.set_xlabel('Query Index')
ax1.set_ylabel('Overlap Ratio (0-1)')
ax1.set_xticks(range(len(comparison_results)))
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# Add value labels
for i, ratio in enumerate(overlap_ratios):
    ax1.text(i, ratio + 0.02, f'{ratio:.2f}', ha='center', va='bottom', fontweight='bold')

# 2. Query length vs overlap
ax2.scatter(query_lengths, overlap_ratios, s=100, alpha=0.7, color='orange')
ax2.set_title('Query Length vs Search Method Overlap', fontweight='bold', fontsize=14)
ax2.set_xlabel('Query Length (words)')
ax2.set_ylabel('Overlap Ratio')
ax2.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(query_lengths, overlap_ratios, 1)
p = np.poly1d(z)
ax2.plot(sorted(query_lengths), p(sorted(query_lengths)), "r--", alpha=0.8)

# 3. Score distribution comparison for one query
sample_query_idx = 0
sample_result = comparison_results[sample_query_idx]
lexical_scores = [r[1] for r in sample_result['lexical_results']]
semantic_scores = [r[1] for r in sample_result['semantic_results']]

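x_lexical = np.arange(len(lexical_scores))
x_semantic = np.arange(len(semantic_scores))
width = 0.35

# BM25 may return fewer than top_k hits (zero-score documents are dropped),
# so each series gets its own x positions
ax3.bar(x_lexical - width/2, lexical_scores, width, label='BM25 (Lexical)', alpha=0.8)
ax3.bar(x_semantic + width/2, semantic_scores, width, label='Semantic', alpha=0.8)
ax3.set_title(f'Score Distribution: "{sample_result["query"]}"', fontweight='bold', fontsize=14)
ax3.set_xlabel('Rank Position')
ax3.set_ylabel('Score')
n_ranks = max(len(lexical_scores), len(semantic_scores))
ax3.set_xticks(np.arange(n_ranks))
ax3.set_xticklabels([f'#{i+1}' for i in range(n_ranks)])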
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Method performance heatmap
performance_matrix = np.zeros((len(comparison_results), 2))
for i, result in enumerate(comparison_results):
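    # Average raw score per method. BM25 scores and cosine similarities are on
    # different scales, so compare values within a method, not across methods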
    lexical_avg = np.mean([r[1] for r in result['lexical_results']]) if result['lexical_results'] else 0
    semantic_avg = np.mean([r[1] for r in result['semantic_results']]) if result['semantic_results'] else 0
    performance_matrix[i] = [lexical_avg, semantic_avg]

im = ax4.imshow(performance_matrix.T, cmap='YlOrRd', aspect='auto')
ax4.set_title('Average Retrieval Score per Query', fontweight='bold', fontsize=14)
ax4.set_xlabel('Query Index')
ax4.set_ylabel('Search Method')
ax4.set_yticks([0, 1])
ax4.set_yticklabels(['BM25 (Lexical)', 'Semantic'])
ax4.set_xticks(range(len(comparison_results)))

# Add colorbar
plt.colorbar(im, ax=ax4, label='Average Score')

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("- Lower overlap indicates methods find different relevant documents")
print("- Conceptual queries often show lower overlap (semantic finds different results)")
print("- Specific term queries show higher overlap (both methods find same exact matches)")
print("- Complementary results like these motivate hybrid approaches in production systems")

## 🔄 Exercise 2: Hybrid Search with Reciprocal Rank Fusion (RRF)

Let's implement hybrid search using Reciprocal Rank Fusion (RRF), a widely used way to merge ranked result lists.
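
For reference, the alpha-weighted variant implemented in this exercise can be written as the formula below. The notation is ours: $r_{\text{lex}}(d)$ and $r_{\text{sem}}(d)$ are the 1-based ranks of document $d$ in the BM25 and semantic result lists, and a term is dropped when $d$ is absent from that list.

$$
\text{score}(d) = (1 - \alpha)\,\frac{1}{k + r_{\text{lex}}(d)} + \alpha\,\frac{1}{k + r_{\text{sem}}(d)}
$$

As a worked example with $k = 60$ and $\alpha = 0.5$, a document ranked 1st by BM25 and 3rd by semantic search scores

$$
0.5 \cdot \frac{1}{61} + 0.5 \cdot \frac{1}{63} \approx 0.00820 + 0.00794 = 0.01613
$$

With $\alpha = 0.5$ every score is exactly half of its unweighted RRF value, so the ordering matches standard RRF. Other values of $\alpha$ shift weight toward lexical ($\alpha \to 0$) or semantic ($\alpha \to 1$) ranks.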

In [None]:
class HybridSearchEngine:
    """Advanced hybrid search with multiple fusion strategies"""
    
    def __init__(self, search_lab: SearchMethodComparison):
        self.search_lab = search_lab
        self.embedding_model = search_lab.embedding_model
        
    def reciprocal_rank_fusion(self, lexical_results: List[Tuple], semantic_results: List[Tuple], 
                              k: int = 60, alpha: float = 0.5) -> List[Tuple[int, float, str]]:
        """Combine results with alpha-weighted Reciprocal Rank Fusion (alpha = semantic weight)"""
        
        # Create document score dictionaries
        lexical_scores = {}
        semantic_scores = {}
        
        # Calculate RRF scores for lexical results
        for rank, (doc_id, score, doc_text) in enumerate(lexical_results):
            lexical_scores[doc_id] = 1 / (k + rank + 1)
        
        # Calculate RRF scores for semantic results
        for rank, (doc_id, score, doc_text) in enumerate(semantic_results):
            semantic_scores[doc_id] = 1 / (k + rank + 1)
        
        # Combine scores
        all_doc_ids = set(lexical_scores.keys()) | set(semantic_scores.keys())
        combined_scores = {}
        
        for doc_id in all_doc_ids:
            lexical_rrf = lexical_scores.get(doc_id, 0)
            semantic_rrf = semantic_scores.get(doc_id, 0)
            
            # Weighted combination
            combined_scores[doc_id] = (1 - alpha) * lexical_rrf + alpha * semantic_rrf
        
        # Sort by combined score and return top results
        sorted_docs = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        
        results = []
        for doc_id, score in sorted_docs:
            doc_text = self.search_lab.documents[doc_id]
            results.append((doc_id, score, doc_text))
        
        return results
    
    def weighted_sum_fusion(self, lexical_results: List[Tuple], semantic_results: List[Tuple], 
                           alpha: float = 0.7) -> List[Tuple[int, float, str]]:
        """Combine search results using weighted sum of normalized scores"""
        
        # Normalize scores to [0, 1] range
        if lexical_results:
            max_lexical = max(r[1] for r in lexical_results)
            min_lexical = min(r[1] for r in lexical_results)
            lexical_range = max_lexical - min_lexical if max_lexical != min_lexical else 1
        else:
            max_lexical = min_lexical = lexical_range = 1
        
        if semantic_results:
            max_semantic = max(r[1] for r in semantic_results)
            min_semantic = min(r[1] for r in semantic_results)
            semantic_range = max_semantic - min_semantic if max_semantic != min_semantic else 1
        else:
            max_semantic = min_semantic = semantic_range = 1
        
        # Create normalized score dictionaries
        lexical_scores = {}
        semantic_scores = {}
        
        for doc_id, score, doc_text in lexical_results:
            normalized_score = (score - min_lexical) / lexical_range
            lexical_scores[doc_id] = normalized_score
        
        for doc_id, score, doc_text in semantic_results:
            normalized_score = (score - min_semantic) / semantic_range
            semantic_scores[doc_id] = normalized_score
        
        # Combine scores
        all_doc_ids = set(lexical_scores.keys()) | set(semantic_scores.keys())
        combined_scores = {}
        
        for doc_id in all_doc_ids:
            lexical_norm = lexical_scores.get(doc_id, 0)
            semantic_norm = semantic_scores.get(doc_id, 0)
            
            # Weighted combination
            combined_scores[doc_id] = (1 - alpha) * lexical_norm + alpha * semantic_norm
        
        # Sort and return results
        sorted_docs = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        
        results = []
        for doc_id, score in sorted_docs:
            doc_text = self.search_lab.documents[doc_id]
            results.append((doc_id, score, doc_text))
        
        return results
    
    def hybrid_search(self, query: str, top_k: int = 10, fusion_method: str = "rrf", 
                     alpha: float = 0.7) -> List[Tuple[int, float, str]]:
        """Perform hybrid search with specified fusion method"""
        
        # Get results from both search methods
        lexical_results = self.search_lab.lexical_search(query, top_k=top_k)
        semantic_results = self.search_lab.semantic_search(query, top_k=top_k)
        
        # Apply fusion method
        if fusion_method == "rrf":
            combined_results = self.reciprocal_rank_fusion(lexical_results, semantic_results, alpha=alpha)
        elif fusion_method == "weighted_sum":
            combined_results = self.weighted_sum_fusion(lexical_results, semantic_results, alpha=alpha)
        else:
            raise ValueError(f"Unknown fusion method: {fusion_method}")
        
        return combined_results[:top_k]
    
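    def compare_fusion_methods(self, queries: List[str], alpha_values: Optional[List[float]] = None) -> pd.DataFrame:
        """Compare different fusion methods and alpha values"""
        # Avoid a mutable default argument; fall back to a small alpha sweep
        if alpha_values is None:
            alpha_values = [0.3, 0.5, 0.7]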
        
        results = []
        
        for query in queries:
            print(f"\n🔍 Testing: '{query}'")
            
            # Get baseline results
            lexical_results = self.search_lab.lexical_search(query, top_k=5)
            semantic_results = self.search_lab.semantic_search(query, top_k=5)
            
            for alpha in alpha_values:
                # Test RRF
                rrf_results = self.hybrid_search(query, top_k=5, fusion_method="rrf", alpha=alpha)
                
                # Test Weighted Sum
                ws_results = self.hybrid_search(query, top_k=5, fusion_method="weighted_sum", alpha=alpha)
                
                # Calculate diversity metrics
                lexical_docs = set(r[0] for r in lexical_results)
                semantic_docs = set(r[0] for r in semantic_results)
                rrf_docs = set(r[0] for r in rrf_results)
                ws_docs = set(r[0] for r in ws_results)
                
                results.append({
                    'query': query[:30] + "..." if len(query) > 30 else query,
                    'alpha': alpha,
                    'method': 'RRF',
                    'unique_docs': len(rrf_docs),
                    'overlap_with_lexical': len(rrf_docs.intersection(lexical_docs)),
                    'overlap_with_semantic': len(rrf_docs.intersection(semantic_docs)),
                    'avg_score': np.mean([r[1] for r in rrf_results]) if rrf_results else 0
                })
                
                results.append({
                    'query': query[:30] + "..." if len(query) > 30 else query,
                    'alpha': alpha,
                    'method': 'Weighted Sum',
                    'unique_docs': len(ws_docs),
                    'overlap_with_lexical': len(ws_docs.intersection(lexical_docs)),
                    'overlap_with_semantic': len(ws_docs.intersection(semantic_docs)),
                    'avg_score': np.mean([r[1] for r in ws_results]) if ws_results else 0
                })
        
        return pd.DataFrame(results)

# Initialize hybrid search engine
hybrid_engine = HybridSearchEngine(search_lab)
print("🔄 Hybrid Search Engine initialized!")

In [None]:
# Test hybrid search with different methods
test_query = "machine learning algorithms for business applications"

print(f"🧪 HYBRID SEARCH COMPARISON")
print(f"Query: '{test_query}'")
print("=" * 80)

# Compare different fusion methods
methods = [
    ("lexical", "BM25 Only"),
    ("semantic", "Semantic Only"),
    ("rrf", "RRF Fusion (α=0.7)"),
    ("weighted_sum", "Weighted Sum (α=0.7)")
]

all_results = {}

for method_key, method_name in methods:
    print(f"\n📊 {method_name}:")
    
    if method_key == "lexical":
        results = search_lab.lexical_search(test_query, top_k=5)
    elif method_key == "semantic":
        results = search_lab.semantic_search(test_query, top_k=5)
    else:
        results = hybrid_engine.hybrid_search(test_query, top_k=5, fusion_method=method_key, alpha=0.7)
    
    all_results[method_key] = results
    
    for i, (doc_id, score, doc_text) in enumerate(results):
        print(f"   {i+1}. [{score:.4f}] {doc_text[:70]}...")

# Analyze result diversity
print("\n🔍 RESULT DIVERSITY ANALYSIS")
print("=" * 50)

# Calculate unique documents found by each method
for method_key, method_name in methods:
    doc_ids = set(r[0] for r in all_results[method_key])
    print(f"{method_name}: {len(doc_ids)} unique documents")

# Calculate overlap between methods
lexical_docs = set(r[0] for r in all_results['lexical'])
semantic_docs = set(r[0] for r in all_results['semantic'])
rrf_docs = set(r[0] for r in all_results['rrf'])

print(f"\nOverlap Analysis:")
print(f"  Lexical ∩ Semantic: {len(lexical_docs.intersection(semantic_docs))} documents")
print(f"  RRF ∩ Lexical: {len(rrf_docs.intersection(lexical_docs))} documents")
print(f"  RRF ∩ Semantic: {len(rrf_docs.intersection(semantic_docs))} documents")
print(f"  All three methods: {len(lexical_docs.intersection(semantic_docs).intersection(rrf_docs))} documents")

In [None]:
# Test different alpha values for hybrid search
print("\n⚖️ ALPHA PARAMETER TUNING")
print("=" * 50)

alpha_values = [0.1, 0.3, 0.5, 0.7, 0.9]
sample_queries = [
    "TensorFlow version optimization",  # Technical query
    "AI techniques for pattern recognition",  # Conceptual query
    "revenue performance business metrics"  # Mixed query
]

# Test alpha sensitivity
alpha_results = []

for query in sample_queries:
    print(f"\n🔍 Query: '{query}'")
    
    for alpha in alpha_values:
        rrf_results = hybrid_engine.hybrid_search(query, top_k=3, fusion_method="rrf", alpha=alpha)
        
        # Calculate method bias (how much each method contributes)
        lexical_only = set(r[0] for r in search_lab.lexical_search(query, top_k=10))
        semantic_only = set(r[0] for r in search_lab.semantic_search(query, top_k=10))
        hybrid_docs = set(r[0] for r in rrf_results)
        
        lexical_bias = len(hybrid_docs.intersection(lexical_only)) / len(hybrid_docs) if hybrid_docs else 0
        semantic_bias = len(hybrid_docs.intersection(semantic_only)) / len(hybrid_docs) if hybrid_docs else 0
        
        alpha_results.append({
            'query_type': 'technical' if 'TensorFlow' in query else ('conceptual' if 'pattern recognition' in query else 'mixed'),
            'alpha': alpha,
            'lexical_bias': lexical_bias,
            'semantic_bias': semantic_bias,
            'avg_score': np.mean([r[1] for r in rrf_results]) if rrf_results else 0
        })
        
        print(f"   α={alpha}: Lexical bias {lexical_bias:.2f}, Semantic bias {semantic_bias:.2f}")

# Visualize alpha effects
alpha_df = pd.DataFrame(alpha_results)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# 1. Alpha vs method bias
for query_type in alpha_df['query_type'].unique():
    subset = alpha_df[alpha_df['query_type'] == query_type]
    ax1.plot(subset['alpha'], subset['semantic_bias'], marker='o', label=f'{query_type} (semantic bias)')
    ax1.plot(subset['alpha'], subset['lexical_bias'], marker='s', linestyle='--', label=f'{query_type} (lexical bias)')

ax1.set_title('Effect of Alpha on Method Bias', fontweight='bold', fontsize=14)
ax1.set_xlabel('Alpha Value (0=lexical, 1=semantic)')
ax1.set_ylabel('Bias Score')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Alpha vs average score
for query_type in alpha_df['query_type'].unique():
    subset = alpha_df[alpha_df['query_type'] == query_type]
    ax2.plot(subset['alpha'], subset['avg_score'], marker='o', label=query_type)

ax2.set_title('Effect of Alpha on Average Score', fontweight='bold', fontsize=14)
ax2.set_xlabel('Alpha Value')
ax2.set_ylabel('Average Score')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Alpha Tuning Insights:")
print("- α=0.7 often provides good balance for most query types")
print("- Technical queries may benefit from lower α (more lexical weight)")
print("- Conceptual queries may benefit from higher α (more semantic weight)")
print("- Consider dynamic α adjustment based on query characteristics")
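
The last insight, dynamic α adjustment, can be sketched with a simple heuristic. This is only an illustration and not part of the notebook's `HybridSearchEngine`: the function name `choose_alpha`, the regular expression, and the three α values are assumptions chosen for demonstration and would need tuning on real queries.

```python
import re

# Hypothetical heuristic: tokens containing digits or separators
# (e.g. "v2.1", "Q4", "ML-2024") suggest the user wants exact matches.
EXACT_TOKEN = re.compile(r"\d|[_\-./]")

def choose_alpha(query: str, low: float = 0.3, default: float = 0.7,
                 high: float = 0.85) -> float:
    """Return a semantic weight in [0, 1]; lower values favour lexical search."""
    tokens = query.split()
    if not tokens:
        return default
    # Strip sentence punctuation so a trailing "." is not counted as a separator
    exact_like = sum(1 for t in tokens if EXACT_TOKEN.search(t.strip(".,;:?!")))
    ratio = exact_like / len(tokens)
    if ratio >= 0.5:
        return low       # mostly identifier-like tokens: lean lexical
    if ratio == 0 and len(tokens) >= 4:
        return high      # longer natural-language query: lean semantic
    return default

for q in ["ML-2024-Report-v3.pdf",
          "AI methods for pattern recognition",
          "Q4 business revenue analysis"]:
    print(q, "->", choose_alpha(q))
```

The returned value could be passed as the `alpha` argument of `hybrid_search`. Hyphenated ordinary words (for example "pre-trained") also count as identifier-like here, which is one reason to treat this as a starting point only.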

## 🚀 Exercise 3: Advanced Techniques - HyDE and Contextual Retrieval

Let's implement several advanced retrieval techniques: HyDE, contextual retrieval, and multi-query expansion.

In [None]:
class AdvancedSearchTechniques:
    """Implement HyDE and Contextual Retrieval techniques"""
    
    def __init__(self, hybrid_engine: HybridSearchEngine):
        self.hybrid_engine = hybrid_engine
        self.search_lab = hybrid_engine.search_lab
        self.embedding_model = hybrid_engine.embedding_model
        
    def generate_hypothetical_document(self, query: str) -> str:
        """Generate a hypothetical document for HyDE (simplified version)"""
        # In production, this would use an LLM like GPT-4 or Claude
        # For this demo, we'll create rule-based hypothetical documents
        
        query_lower = query.lower()
        
        # Template-based hypothetical document generation
        # Match 'ai' as a whole word; a substring test would also hit "training", "explain", etc.
        if 'machine learning' in query_lower or 'ai' in query_lower.split():
            hypothetical = f"""Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience. 
            Key algorithms include supervised learning methods like decision trees and neural networks, unsupervised learning techniques such as clustering, 
            and reinforcement learning for decision-making tasks. These approaches help solve problems in {query} by finding patterns in data and making predictions."""
        
        elif 'tensorflow' in query_lower or 'pytorch' in query_lower:
            hypothetical = f"""TensorFlow and PyTorch are leading deep learning frameworks that provide tools for building and training neural networks. 
            They offer optimized computation for GPUs, automatic differentiation, and high-level APIs for common machine learning tasks. 
            Version updates typically include performance improvements, new layer types, and better optimization algorithms relevant to {query}."""
        
        elif 'business' in query_lower or 'revenue' in query_lower:
            hypothetical = f"""Business performance metrics and revenue analysis are crucial for understanding company growth and market position. 
            Key indicators include quarterly revenue growth, customer acquisition costs, lifetime value, and market share. 
            Analytics and data science help optimize these metrics through {query} and strategic decision-making."""
        
        else:
            # Generic hypothetical document
            hypothetical = f"""This document discusses {query} and covers the main concepts, applications, and implications. 
            It provides comprehensive information about the topic including technical details, practical examples, and real-world use cases. 
            The content is designed to answer questions and provide insights related to {query}."""
        
        return hypothetical.strip()
    
    def hyde_search(self, query: str, top_k: int = 10) -> Tuple[List[Tuple[int, float, str]], str]:
        """Perform HyDE (Hypothetical Document Embeddings) search"""
        # Generate hypothetical document
        hypothetical_doc = self.generate_hypothetical_document(query)
        
        # Use hypothetical document for semantic search instead of original query
        hyp_embedding = self.embedding_model.encode([hypothetical_doc])
        hyp_embedding = hyp_embedding / np.linalg.norm(hyp_embedding)
        
        # Calculate similarities with document embeddings
        similarities = np.dot(self.search_lab.embeddings, hyp_embedding.T).flatten()
        
        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append((idx, similarities[idx], self.search_lab.documents[idx]))
        
        return results, hypothetical_doc
    
    def add_contextual_information(self, documents: List[str], context_window: int = 1) -> List[str]:
        """Add contextual information to document chunks (simplified contextual retrieval)"""
        # In production, this would use LLMs to generate context for each chunk
        # For demo, we'll add simple positional and categorical context
        
        contextual_documents = []
        
        for i, doc in enumerate(documents):
            # Determine document category
            doc_lower = doc.lower()
            if any(term in doc_lower for term in ['tensorflow', 'pytorch', 'api', 'function', 'version']):
                category = "Technical Documentation"
            elif any(term in doc_lower for term in ['revenue', 'business', 'customer', 'market']):
                category = "Business Analysis"
            # Check the more specific terms first; 'language model' would otherwise be caught by 'model'
            elif any(term in doc_lower for term in ['attention', 'transformer', 'language model']):
                category = "AI Research Papers"
            elif any(term in doc_lower for term in ['learning', 'algorithm', 'model', 'neural']):
                category = "Machine Learning Research"
            else:
                category = "General Content"
            
            # Add context prefix
            context_prefix = f"[Document {i+1} of {len(documents)} - Category: {category}] "
            
            # Add surrounding document context if available
            if context_window > 0:
                surrounding_context = []
                for j in range(max(0, i-context_window), min(len(documents), i+context_window+1)):
                    if j != i:
                        # Add brief context from nearby documents
                        nearby_words = documents[j].split()[:5]  # First 5 words
                        surrounding_context.append(" ".join(nearby_words))
                
                if surrounding_context:
                    context_prefix += f"Related content: {'; '.join(surrounding_context)}. "
            
            contextual_doc = context_prefix + doc
            contextual_documents.append(contextual_doc)
        
        return contextual_documents
    
    def contextual_retrieval_search(self, query: str, top_k: int = 10) -> List[Tuple[int, float, str]]:
        """Perform search with contextual retrieval"""
        # Add contextual information to documents
        contextual_docs = self.add_contextual_information(self.search_lab.documents)
        
        # Generate embeddings for contextual documents
        # (recomputed on every call for simplicity; cache them in a real system)
        contextual_embeddings = self.embedding_model.encode(contextual_docs)
        contextual_embeddings = contextual_embeddings / np.linalg.norm(contextual_embeddings, axis=1, keepdims=True)
        
        # Perform semantic search with contextual embeddings
        query_embedding = self.embedding_model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        
        similarities = np.dot(contextual_embeddings, query_embedding.T).flatten()
        
        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            # Return original document (without context prefix) but with contextual score
            results.append((idx, similarities[idx], self.search_lab.documents[idx]))
        
        return results
    
    def multi_query_expansion(self, query: str, num_expansions: int = 3) -> List[str]:
        """Generate multiple query variations for better coverage"""
        # In production, this would use LLMs to generate diverse query reformulations
        # For demo, we'll use template-based expansion
        
        expansions = [query]  # Include original query
        
        # Synonym-based expansions
        synonyms = {
            'machine learning': ['artificial intelligence', 'ML algorithms', 'automated learning'],
            'algorithms': ['methods', 'techniques', 'approaches'],
            'business': ['enterprise', 'commercial', 'corporate'],
            'performance': ['efficiency', 'effectiveness', 'results'],
            'optimization': ['improvement', 'enhancement', 'tuning']
        }
        
        query_lower = query.lower()
        for term, syns in synonyms.items():
            if term in query_lower and len(expansions) < num_expansions + 1:
                for syn in syns[:num_expansions - len(expansions) + 1]:
                    expanded = query_lower.replace(term, syn)
                    expansions.append(expanded)
        
        # Ensure we have enough expansions, using distinct generic templates
        # (the same template is never appended twice)
        generic_templates = [f"how to {query}", f"what is {query}", f"{query} examples", f"{query} overview"]
        for candidate in generic_templates:
            if len(expansions) >= num_expansions + 1:
                break
            if candidate not in expansions:
                expansions.append(candidate)
        
        return expansions[:num_expansions + 1]
    
    def multi_query_search(self, query: str, top_k: int = 10) -> Tuple[List[Tuple[int, float, str]], List[str]]:
        """Perform search with multiple query expansions and fusion"""
        # Generate query expansions
        expanded_queries = self.multi_query_expansion(query)
        
        # Collect results from all expanded queries
        all_results = []
        for exp_query in expanded_queries:
            results = self.search_lab.semantic_search(exp_query, top_k=top_k)
            all_results.extend(results)
        
        # Aggregate scores for documents that appear multiple times
        doc_scores = defaultdict(list)
        for doc_id, score, doc_text in all_results:
            doc_scores[doc_id].append(score)
        
        # Use maximum score aggregation
        final_results = []
        for doc_id, scores in doc_scores.items():
            max_score = max(scores)
            doc_text = self.search_lab.documents[doc_id]
            final_results.append((doc_id, max_score, doc_text))
        
        # Sort and return top-k
        final_results.sort(key=lambda x: x[1], reverse=True)
        
        return final_results[:top_k], expanded_queries

# Initialize advanced search techniques
advanced_search = AdvancedSearchTechniques(hybrid_engine)
print("🚀 Advanced Search Techniques initialized!")

In [None]:
# Test advanced search techniques
test_query = "improving business performance with AI"

print("🧪 ADVANCED SEARCH TECHNIQUES COMPARISON")
print(f"Query: '{test_query}'")
print("=" * 80)

# 1. Standard semantic search
print("\n1️⃣ Standard Semantic Search:")
standard_results = search_lab.semantic_search(test_query, top_k=5)
for i, (doc_id, score, doc_text) in enumerate(standard_results):
    print(f"   {i+1}. [{score:.4f}] {doc_text[:70]}...")

# 2. HyDE search
print("\n2️⃣ HyDE (Hypothetical Document Embeddings):")
hyde_results, hypothetical_doc = advanced_search.hyde_search(test_query, top_k=5)
print(f"   Generated hypothetical document: {hypothetical_doc[:100]}...")
print("   Results:")
for i, (doc_id, score, doc_text) in enumerate(hyde_results):
    print(f"   {i+1}. [{score:.4f}] {doc_text[:70]}...")

# 3. Contextual retrieval
print("\n3️⃣ Contextual Retrieval:")
contextual_results = advanced_search.contextual_retrieval_search(test_query, top_k=5)
for i, (doc_id, score, doc_text) in enumerate(contextual_results):
    print(f"   {i+1}. [{score:.4f}] {doc_text[:70]}...")

# 4. Multi-query expansion
print("\n4️⃣ Multi-Query Expansion:")
multi_query_results, expanded_queries = advanced_search.multi_query_search(test_query, top_k=5)
print(f"   Expanded queries: {expanded_queries}")
print("   Results:")
for i, (doc_id, score, doc_text) in enumerate(multi_query_results):
    print(f"   {i+1}. [{score:.4f}] {doc_text[:70]}...")

# 5. Hybrid with RRF
print("\n5️⃣ Hybrid Search (RRF):")
hybrid_results = hybrid_engine.hybrid_search(test_query, top_k=5, fusion_method="rrf", alpha=0.7)
for i, (doc_id, score, doc_text) in enumerate(hybrid_results):
    print(f"   {i+1}. [{score:.4f}] {doc_text[:70]}...")

In [None]:
# Comprehensive comparison across different query types
diverse_queries = [
    "TensorFlow optimization techniques",  # Technical/specific
    "AI methods for pattern recognition",  # Conceptual
    "Q4 business revenue analysis",       # Business/factual
    "neural network training methods"      # Research/academic
]

print("\n📊 COMPREHENSIVE SEARCH METHOD EVALUATION")
print("=" * 80)

# Collect results for analysis
evaluation_results = []

for query in diverse_queries:
    print(f"\n🔍 Evaluating: '{query}'")
    
    # Get results from all methods
    methods = {
        'Lexical (BM25)': search_lab.lexical_search(query, top_k=5),
        'Semantic': search_lab.semantic_search(query, top_k=5),
        'Hybrid (RRF)': hybrid_engine.hybrid_search(query, top_k=5, fusion_method="rrf", alpha=0.7),
        'HyDE': advanced_search.hyde_search(query, top_k=5)[0],
        'Contextual': advanced_search.contextual_retrieval_search(query, top_k=5),
        'Multi-Query': advanced_search.multi_query_search(query, top_k=5)[0]
    }
    
    # Calculate diversity metrics
    for method_name, results in methods.items():
        doc_ids = set(r[0] for r in results)
        avg_score = np.mean([r[1] for r in results]) if results else 0
        max_score = max([r[1] for r in results]) if results else 0
        
        evaluation_results.append({
            'query': query,
            'method': method_name,
            'unique_docs': len(doc_ids),
            'avg_score': avg_score,
            'max_score': max_score,
            'top_doc_id': results[0][0] if results else None
        })
        
        print(f"   {method_name}: {len(doc_ids)} docs, avg_score={avg_score:.4f}")

# Create evaluation DataFrame
eval_df = pd.DataFrame(evaluation_results)

# Visualize comprehensive comparison
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

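# 1. Average scores by method
# Note: BM25, cosine-similarity, and RRF scores live on different scales,
# so these averages describe each method's raw scores, not relative quality.
method_scores = eval_df.groupby('method')['avg_score'].mean().sort_values(ascending=False)
bars1 = ax1.bar(range(len(method_scores)), method_scores.values)
ax1.set_title('Average Raw Score by Search Method', fontweight='bold', fontsize=14)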
ax1.set_ylabel('Average Score')
ax1.set_xticks(range(len(method_scores)))
ax1.set_xticklabels(method_scores.index, rotation=45, ha='right')
ax1.grid(True, alpha=0.3)

# Add value labels
for i, bar in enumerate(bars1):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.001,
             f'{height:.3f}', ha='center', va='bottom', fontweight='bold')

# 2. Score distribution heatmap
score_matrix = eval_df.pivot(index='method', columns='query', values='avg_score')
im2 = ax2.imshow(score_matrix.values, cmap='YlOrRd', aspect='auto')
ax2.set_title('Average Raw Score by Method and Query', fontweight='bold', fontsize=14)
ax2.set_xlabel('Query')
ax2.set_ylabel('Search Method')
ax2.set_xticks(range(len(score_matrix.columns)))
ax2.set_xticklabels([q[:20] + '...' for q in score_matrix.columns], rotation=45, ha='right')
ax2.set_yticks(range(len(score_matrix.index)))
ax2.set_yticklabels(score_matrix.index)
plt.colorbar(im2, ax=ax2, label='Average Score')

# 3. Document diversity comparison
diversity_data = eval_df.groupby('method')['unique_docs'].mean().sort_values(ascending=False)
bars3 = ax3.bar(range(len(diversity_data)), diversity_data.values, color='lightgreen')
ax3.set_title('Document Diversity by Method', fontweight='bold', fontsize=14)
ax3.set_ylabel('Average Unique Documents')
ax3.set_xticks(range(len(diversity_data)))
ax3.set_xticklabels(diversity_data.index, rotation=45, ha='right')
ax3.grid(True, alpha=0.3)

# Add value labels
for i, bar in enumerate(bars3):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + 0.02,
             f'{height:.1f}', ha='center', va='bottom', fontweight='bold')

# 4. Method correlation analysis
# Calculate how often methods agree on top document
agreement_matrix = np.zeros((len(method_scores), len(method_scores)))
methods_list = list(method_scores.index)

for query in diverse_queries:
    query_results = eval_df[eval_df['query'] == query]
    top_docs = {row['method']: row['top_doc_id'] for _, row in query_results.iterrows()}
    
    for i, method1 in enumerate(methods_list):
        for j, method2 in enumerate(methods_list):
            if top_docs.get(method1) == top_docs.get(method2):
                agreement_matrix[i, j] += 1

# Normalize by number of queries
agreement_matrix = agreement_matrix / len(diverse_queries)

im4 = ax4.imshow(agreement_matrix, cmap='Blues', aspect='auto')
ax4.set_title('Method Agreement (Top Document)', fontweight='bold', fontsize=14)
ax4.set_xlabel('Search Method')
ax4.set_ylabel('Search Method')
ax4.set_xticks(range(len(methods_list)))
ax4.set_xticklabels([m[:10] for m in methods_list], rotation=45, ha='right')
ax4.set_yticks(range(len(methods_list)))
ax4.set_yticklabels([m[:10] for m in methods_list])
plt.colorbar(im4, ax=ax4, label='Agreement Rate')

plt.tight_layout()
plt.show()

print("\n🎯 SEARCH METHOD RECOMMENDATIONS")
print("=" * 60)
print("✅ Best Overall: Hybrid (RRF) - Balances precision and recall")
print("🎯 High Precision: HyDE - Good for zero-shot complex queries")
print("📚 Rich Context: Contextual Retrieval - Better document understanding")
print("🔍 High Recall: Multi-Query - Covers more query variations")
print("⚡ Fast & Simple: BM25 - For exact term matching")
print("🧠 Conceptual: Semantic - For meaning-based search")
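
The raw scores compared in this cell come from different scoring functions (BM25, cosine similarity, RRF), so they are not directly comparable across methods. A more dependable comparison needs relevance judgments. The sketch below assumes a hypothetical set of relevant document IDs for a query, which this notebook does not provide, and computes recall@k and the reciprocal rank for a single ranked list.

```python
from typing import List, Set

def recall_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids: List[int], relevant_ids: Set[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical judgments and ranking, for illustration only
relevant = {3, 7}
ranking = [5, 3, 9, 7, 1]
print(recall_at_k(ranking, relevant, k=3))   # 0.5
print(reciprocal_rank(ranking, relevant))    # 0.5
```

With the result tuples used in this notebook, the ranked list would be `[r[0] for r in results]`. Averaging the reciprocal rank over many queries gives MRR.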

## 🎯 Key Takeaways

From this module, you should now understand:

### 🔤 Lexical Search Characteristics:
1. **Excellent for**: Exact terms, names, IDs, technical specifications
2. **Algorithm**: BM25 remains the gold standard for keyword matching
3. **Strengths**: Fast, explainable, no training required
4. **Weaknesses**: Misses synonyms, paraphrases, and conceptual matches

### 🧠 Semantic Search Characteristics:
1. **Excellent for**: Conceptual queries, cross-lingual search, intent understanding
2. **Algorithm**: Vector similarity (cosine, dot product) with transformer embeddings
3. **Strengths**: Handles meaning, synonyms, context
4. **Weaknesses**: May miss exact terms, computationally expensive

### 🔄 Hybrid Search (2025 Standard):
1. **Best Practice**: Combine BM25 + semantic search with RRF fusion
2. **Starting Alpha**: ~0.7 (70% semantic, 30% lexical) is a reasonable default for many use cases; tune it on your own queries
3. **Production Ready**: Balances precision and recall effectively
4. **Flexibility**: Adjust alpha based on query characteristics
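
As a rough illustration of how an alpha weight can enter RRF, the sketch below fuses two ranked lists of document IDs. It shows one plausible formulation and is not necessarily what `hybrid_engine.hybrid_search` does internally; the function name `weighted_rrf` is hypothetical, and `k=60` is a commonly used smoothing constant rather than a value taken from this module.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def weighted_rrf(lexical_ids: List[int], semantic_ids: List[int],
                 alpha: float = 0.7, k: int = 60) -> List[Tuple[int, float]]:
    """Fuse two rankings; alpha weights the semantic list, (1 - alpha) the lexical one."""
    scores: Dict[int, float] = defaultdict(float)
    for rank, doc_id in enumerate(lexical_ids, start=1):
        scores[doc_id] += (1.0 - alpha) / (k + rank)
    for rank, doc_id in enumerate(semantic_ids, start=1):
        scores[doc_id] += alpha / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

fused = weighted_rrf(lexical_ids=[1, 2, 3], semantic_ids=[3, 4, 1], alpha=0.7)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because only ranks are used, the fusion does not depend on how BM25 and cosine scores are scaled, which is the main reason RRF is convenient for combining heterogeneous retrievers.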

### 🚀 Advanced Techniques (2025):

#### HyDE (Hypothetical Document Embeddings):
- **Concept**: Generate hypothetical answer, then search for similar documents
- **Best for**: Zero-shot scenarios, complex queries
- **Improvement**: Roughly 15-30% better performance on out-of-domain queries (an indicative range; verify on your own data)
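
A minimal sketch of the HyDE idea with pluggable components follows. The `generate` and `embed` callables are stand-ins: in practice `generate` would call an LLM and `embed` an embedding model. The toy bag-of-words embedding and the canned hypothetical answer below are assumptions made so the example runs without external services.

```python
import numpy as np
from typing import Callable, List, Tuple

def hyde_search(query: str,
                generate: Callable[[str], str],
                embed: Callable[[List[str]], np.ndarray],
                documents: List[str],
                top_k: int = 3) -> Tuple[List[Tuple[int, float]], str]:
    """Embed a generated hypothetical answer instead of the raw query."""
    hypothetical = generate(query)
    doc_vecs = embed(documents)
    hyp_vec = embed([hypothetical])[0]
    # Cosine similarity; the small epsilon guards against zero vectors
    doc_norms = np.linalg.norm(doc_vecs, axis=1) + 1e-12
    sims = doc_vecs @ hyp_vec / (doc_norms * (np.linalg.norm(hyp_vec) + 1e-12))
    top = np.argsort(sims)[::-1][:top_k]
    return [(int(i), float(sims[i])) for i in top], hypothetical

# Toy stand-ins (assumptions, for illustration only)
docs = ["neural networks learn patterns from data",
        "quarterly revenue grew in the fourth quarter",
        "clustering groups similar data points"]
vocab = sorted({w for d in docs for w in d.split()})

def toy_embed(texts: List[str]) -> np.ndarray:
    """Count how often each vocabulary word occurs in each text."""
    return np.array([[t.lower().split().count(w) for w in vocab] for t in texts],
                    dtype=float)

def toy_generate(query: str) -> str:
    """Pretend LLM output: a fixed hypothetical answer."""
    return "neural networks learn patterns from training data"

results, hyp = hyde_search("how do machines learn?", toy_generate, toy_embed, docs)
print(results[0][0])  # index of the best-matching document
```

The query shares no vocabulary with the documents under this toy embedding, while the hypothetical answer does, which is the effect HyDE relies on.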

#### Contextual Retrieval:
- **Concept**: Add context information to document chunks
- **Best for**: Large document collections, ambiguous chunks
- **Improvement**: Anthropic reports a 49% reduction in failed retrievals when contextual embeddings are combined with contextual BM25 (35% with contextual embeddings alone)
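
The core step, prepending situating context to each chunk before it is indexed, can be sketched as follows. The `describe` callable stands in for an LLM call that explains where a chunk sits within its document; `simple_describe`, the document title, and the chunk texts are hypothetical values for illustration.

```python
from typing import Callable, List

def contextualize_chunks(doc_title: str, chunks: List[str],
                         describe: Callable[[str, str], str]) -> List[str]:
    """Prepend a short situating description to each chunk before indexing."""
    return [f"{describe(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]

def simple_describe(doc_title: str, chunk: str) -> str:
    """Hypothetical stand-in for an LLM-generated context sentence."""
    return f"This chunk is from '{doc_title}'."

chunks = ["Revenue grew 3% over the previous quarter.",
          "Operating costs were flat."]
indexed = contextualize_chunks("ACME Q2 2023 report", chunks, simple_describe)
print(indexed[0])
```

The contextualized text is what gets embedded (and, optionally, indexed for BM25), while the original chunk is what is returned to the user, mirroring `contextual_retrieval_search` in Exercise 3.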

#### Multi-Query Expansion:
- **Concept**: Generate multiple query variations, search all, fuse results
- **Best for**: High recall requirements, diverse terminology
- **Improvement**: Better coverage of relevant documents
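
The fusion step can be sketched independently of any particular retriever. The version below keeps each document's best score across query variants, the same max-score aggregation used in Exercise 3. The `search` callable and the canned results are hypothetical stand-ins for a real retriever.

```python
from typing import Callable, Dict, List, Tuple

def multi_query_fuse(queries: List[str],
                     search: Callable[[str], List[Tuple[int, float]]],
                     top_k: int = 5) -> List[Tuple[int, float]]:
    """Run every query variant and keep each document's best score."""
    best: Dict[int, float] = {}
    for q in queries:
        for doc_id, score in search(q):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical canned results standing in for a real retriever
canned = {
    "ml algorithms": [(1, 0.80), (2, 0.55)],
    "machine learning methods": [(2, 0.75), (3, 0.60)],
}
fused = multi_query_fuse(list(canned), canned.__getitem__)
print(fused)
```

Max-score aggregation assumes scores are comparable across variants, which holds when every variant goes through the same retriever. When variants are sent to different retrievers, a rank-based fusion such as RRF is usually the safer choice.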

### 📊 Performance Comparison:
| Method | Speed | Precision | Recall | Use Case |
|--------|-------|-----------|--------|---------|
| **BM25** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Exact matching |
| **Semantic** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Conceptual queries |
| **Hybrid** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Production default |
| **HyDE** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Complex queries |
| **Contextual** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Ambiguous content |

### 🛠️ Implementation Guidelines:
1. **Start with hybrid**: RRF fusion with α=0.7
2. **Add HyDE**: For domains with complex queries
3. **Use contextual retrieval**: When chunks lack sufficient context
4. **Implement multi-query**: When recall is critical
5. **Monitor and tune**: Alpha values based on query characteristics

### 🔄 Search Method Selection Workflow:
1. **Analyze Query**
2. **Detect Intent**
3. **Choose Method**
4. **Execute Search**
5. **Fuse Results**
6. **Return Ranked List**

## 🎯 Next Steps

In the next modules, we'll explore:
- **Module 9**: Advanced retrieval strategies and re-ranking techniques
- **Module 10**: Prompt engineering for optimal RAG performance
- **Module 11**: LLM integration and model selection strategies

Understanding search methods is fundamental for building effective RAG systems that can handle diverse query types and user needs!

## 🤔 Discussion Questions

1. In what scenarios would you prefer pure semantic search over hybrid search?
2. How would you dynamically adjust the alpha parameter based on query characteristics?
3. What are the computational trade-offs between different advanced search techniques?
4. How would you evaluate search quality in a production system?
5. What factors would influence your choice of fusion algorithm (RRF vs weighted sum)?

## 📝 Optional Exercises

1. **Real LLM Integration**: Implement HyDE with actual LLM calls (GPT-4, Claude)
2. **Production Elasticsearch**: Set up hybrid search with Elasticsearch
3. **Custom Fusion**: Develop your own fusion algorithm based on query features
4. **A/B Testing**: Design experiments to compare search methods in production
5. **Domain-Specific**: Adapt techniques for your specific domain (medical, legal, technical)