# Lab 3.5.4: Hybrid Search Implementation

**Module:** 3.5 - RAG Systems & Vector Databases  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

- [ ] Understand the limitations of pure dense or sparse retrieval
- [ ] Implement BM25 sparse retrieval using the `rank_bm25` library
- [ ] Combine dense and sparse retrieval with Reciprocal Rank Fusion (RRF)
- [ ] Find the optimal fusion weights for your use case
- [ ] Measure the improvement from hybrid search

---

## 📚 Prerequisites

- Completed: Labs 3.5.1-3.5.3
- Understanding of: Embeddings, vector search, basic statistics

---

## 🌍 Real-World Context

**The Problem:** Your semantic search is great for conceptual queries ("how does memory work?") but fails for specific terms ("LPDDR5X" or "bge-large-en-v1.5"). Keyword search is the opposite - great for exact terms, poor for concepts.

**The Solution:** Hybrid search combines the strengths of both. Vector databases such as Pinecone and many enterprise search systems support this approach.

---

## 🧒 ELI5: Hybrid Search

> **Imagine you're looking for a book in a library with two librarians:**
>
> **Librarian A (Semantic/Dense)**: "You want a book about love? Let me find you romance novels, relationship guides, poetry about affection..." - They understand MEANING.
>
> **Librarian B (Keyword/Sparse)**: "You want 'Romeo and Juliet'? It's in aisle 5, shelf 3, exactly where that title is." - They find EXACT MATCHES.
>
> **Hybrid Search**: Ask BOTH librarians! Librarian A finds conceptually similar books, Librarian B finds exact matches. Then combine their recommendations.
>
> For "CUDA memory management", the keyword librarian finds docs with "CUDA" exactly, while the semantic librarian finds docs about "GPU programming" and "device memory."

---

## Part 1: Setup

In [None]:
# Install dependencies
!pip install -q \
    langchain langchain-community langchain-huggingface \
    chromadb sentence-transformers \
    rank_bm25 \
    nltk

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)

print("✅ Dependencies installed!")

In [None]:
import os
import time
import re
from pathlib import Path
from typing import List, Dict, Tuple, Any, Optional
from dataclasses import dataclass
import numpy as np

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import torch
import gc

print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Load and chunk documents
DOCS_PATH = Path("../data/sample_documents")

documents = []
for file_path in sorted(DOCS_PATH.glob("*.md")):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    documents.append(Document(
        page_content=content,
        metadata={"source": file_path.name}
    ))

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

print(f"📚 Loaded {len(documents)} documents → {len(chunks)} chunks")

In [None]:
# Load embedding model
print("🔄 Loading embedding model...")

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True, "batch_size": 32}
)

print("✅ Model loaded!")

---

## Part 2: Understanding Dense vs Sparse Retrieval

### Dense Retrieval (Embeddings)
- Maps text to dense vectors (e.g., 1024 dimensions)
- Every dimension has a value
- Captures semantic meaning
- Good for: conceptual queries, paraphrases

### Sparse Retrieval (BM25)
- Maps text to sparse vectors (one dimension per vocabulary term)
- Most dimensions are zero
- Based on word frequency (TF-IDF style)
- Good for: exact terms, rare words, technical jargon
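
The contrast above can be illustrated without any model at all. The following is a minimal sketch that uses plain term overlap as a stand-in for sparse matching; the helper `term_overlap` and the example queries are made up for illustration and are not part of the lab's retrievers.

```python
def term_overlap(query: str, document: str) -> set:
    """Return the lowercase terms shared by the query and the document."""
    return set(query.lower().split()) & set(document.lower().split())


doc = "The DGX Spark uses LPDDR5X unified memory"

# Exact technical term: keyword matching has something to latch onto.
exact_overlap = term_overlap("LPDDR5X bandwidth", doc)

# Paraphrase of the same topic: no shared vocabulary, so a purely
# term-based scorer has nothing to match on.
paraphrase_overlap = term_overlap("shared RAM architecture", doc)

print(f"Exact-term query overlap: {exact_overlap}")
print(f"Paraphrase query overlap: {paraphrase_overlap}")
```

A dense retriever can still relate the paraphrase to the document through its embedding, which is the gap hybrid search is meant to close.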

In [None]:
# Visualize the difference
sample_text = "The DGX Spark uses LPDDR5X unified memory"

# Dense representation
dense_vec = embedding_model.embed_query(sample_text)
print("📊 Dense Representation:")
print(f"   Dimensions: {len(dense_vec)}")
print(f"   Non-zero values: {sum(1 for v in dense_vec if abs(v) > 0.001)}")
print(f"   First 10 values: {[f'{v:.3f}' for v in dense_vec[:10]]}")

# Sparse representation (conceptual)
words = sample_text.lower().split()
print(f"\n📊 Sparse Representation (BM25 style):")
print(f"   Vocabulary terms: {words}")
print(f"   Non-zero dimensions: {len(set(words))} (only terms in document)")

---

## Part 3: Implementing BM25 Sparse Retrieval

### 🧒 ELI5: BM25

> **BM25 is like counting word importance:**
>
> 1. Words that appear often in a document are more important FOR that document
> 2. Words that appear in MANY documents are less important overall (like "the", "is")
> 3. Short documents get a boost (word density matters)
>
> It's basically: "How much does this document talk about these specific words compared to other documents?"
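
To make the three points above concrete, here is a minimal from-scratch sketch of BM25 scoring in plain Python. It assumes the common defaults `k1=1.5` and `b=0.75` and uses a non-negative IDF variant, so absolute scores may differ from what `BM25Okapi` returns; the function name `bm25_score_all` and the toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import List


def bm25_score_all(query: List[str], docs: List[List[str]],
                   k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Point 3: documents longer than average are penalized
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Point 2: terms found in many documents get a small IDF
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Point 1: more occurrences help, with diminishing returns
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "the gpu uses lpddr5x memory".split(),
    "the gpu is fast and the gpu is efficient".split(),
    "the cat sat on the mat".split(),
]
toy_scores = bm25_score_all("lpddr5x gpu".split(), toy_docs)
for doc, score in zip(toy_docs, toy_scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The first document should score highest because it contains the rare term `lpddr5x`; the second mentions `gpu` twice but is longer and matches only the more common term; the third matches nothing and scores zero.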

In [None]:
class BM25Retriever:
    """
    BM25 sparse retrieval implementation.
    """
    
    def __init__(self, documents: List[Document], remove_stopwords: bool = True):
        """
        Initialize BM25 with documents.
        
        Args:
            documents: List of LangChain Documents
            remove_stopwords: Whether to remove common words
        """
        self.documents = documents
        self.remove_stopwords = remove_stopwords
        
        if remove_stopwords:
            self.stop_words = set(stopwords.words('english'))
        else:
            self.stop_words = set()
        
        # Tokenize all documents
        self.tokenized_docs = [self._tokenize(doc.page_content) for doc in documents]
        
        # Build BM25 index
        self.bm25 = BM25Okapi(self.tokenized_docs)
        
    def _tokenize(self, text: str) -> List[str]:
        """Tokenize and optionally remove stopwords."""
        # Lowercase and tokenize
        tokens = word_tokenize(text.lower())
        
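        # Remove stopwords and pure-punctuation tokens. Keep identifiers such as
        # "q_proj" or "bge-large-en-v1.5", which a strict isalnum() check would drop.
        tokens = [
            t for t in tokens 
            if any(c.isalnum() for c in t) and t not in self.stop_words
        ]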
        
        return tokens
    
    def search(self, query: str, k: int = 5) -> List[Tuple[Document, float]]:
        """
        Search for documents matching the query.
        
        Args:
            query: Search query
            k: Number of results to return
            
        Returns:
            List of (document, score) tuples
        """
        tokenized_query = self._tokenize(query)
        scores = self.bm25.get_scores(tokenized_query)
        
        # Get top-k indices
        top_indices = np.argsort(scores)[-k:][::-1]
        
        results = []
        for idx in top_indices:
            if scores[idx] > 0:  # Only include matches
                results.append((self.documents[idx], scores[idx]))
        
        return results
    
    def get_scores(self, query: str) -> np.ndarray:
        """Get BM25 scores for all documents."""
        tokenized_query = self._tokenize(query)
        return self.bm25.get_scores(tokenized_query)


# Build BM25 index
print("📝 Building BM25 index...")
start = time.time()
bm25_retriever = BM25Retriever(chunks)
print(f"✅ BM25 index built in {time.time() - start:.2f}s")

In [None]:
# Test BM25 retrieval
test_query = "LPDDR5X memory bandwidth"

print(f"🔍 BM25 Search: '{test_query}'")
print("-" * 60)

bm25_results = bm25_retriever.search(test_query, k=3)

for i, (doc, score) in enumerate(bm25_results):
    print(f"\n🔹 Result {i+1} (BM25 Score: {score:.2f}):")
    print(f"   Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:150]}...")

---

## Part 4: Building Dense Retrieval

In [None]:
class DenseRetriever:
    """
    Dense (embedding-based) retrieval implementation.
    """
    
    def __init__(self, documents: List[Document], embedding_model: HuggingFaceEmbeddings):
        """
        Initialize with documents and embedding model.
        """
        self.documents = documents
        self.embedding_model = embedding_model
        
        # Pre-compute document embeddings
        texts = [doc.page_content for doc in documents]
        self.embeddings = np.array(embedding_model.embed_documents(texts))
        
    def search(self, query: str, k: int = 5) -> List[Tuple[Document, float]]:
        """
        Search for semantically similar documents.
        """
        query_emb = np.array(self.embedding_model.embed_query(query))
        
        # Cosine similarity (embeddings are normalized)
        scores = np.dot(self.embeddings, query_emb)
        
        # Get top-k
        top_indices = np.argsort(scores)[-k:][::-1]
        
        return [(self.documents[idx], scores[idx]) for idx in top_indices]
    
    def get_scores(self, query: str) -> np.ndarray:
        """Get similarity scores for all documents."""
        query_emb = np.array(self.embedding_model.embed_query(query))
        return np.dot(self.embeddings, query_emb)


# Build dense retriever
print("🔄 Building dense retriever (computing embeddings)...")
start = time.time()
dense_retriever = DenseRetriever(chunks, embedding_model)
print(f"✅ Dense retriever built in {time.time() - start:.2f}s")

In [None]:
# Test dense retrieval
print(f"🔍 Dense Search: '{test_query}'")
print("-" * 60)

dense_results = dense_retriever.search(test_query, k=3)

for i, (doc, score) in enumerate(dense_results):
    print(f"\n🔹 Result {i+1} (Similarity: {score:.3f}):")
    print(f"   Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:150]}...")

---

## Part 5: Implementing Hybrid Search

### Reciprocal Rank Fusion (RRF)

RRF combines rankings from multiple retrievers:

```
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
```

Where:
- `d` is a document
- `k` is a constant (usually 60)
- `rank_i(d)` is the rank of document `d` from retriever `i`
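
A small worked example may help here. The sketch below applies the formula to two made-up rankings of document IDs, using 1-based ranks and `k = 60`; the function name `rrf_fuse` and the IDs are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse ranked lists of document IDs (best first) into RRF scores."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = rrf_fuse([dense_ranking, sparse_ranking])
for doc_id, score in sorted(fused.items(), key=lambda item: -item[1]):
    print(f"{doc_id}: {score:.4f}")
```

Here `doc_b` comes out on top (1/62 + 1/61) because it ranks near the top of both lists, narrowly ahead of `doc_a` (1/61 + 1/63). Only ranks enter the formula, so the raw dense and BM25 scores never need to be put on a common scale.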

In [None]:
class HybridRetriever:
    """
    Hybrid retrieval combining dense and sparse methods.
    """
    
    def __init__(
        self,
        dense_retriever: DenseRetriever,
        sparse_retriever: BM25Retriever,
        alpha: float = 0.5,
        fusion_method: str = "rrf"  # "rrf" or "linear"
    ):
        """
        Initialize hybrid retriever.
        
        Args:
            dense_retriever: Dense (embedding) retriever
            sparse_retriever: Sparse (BM25) retriever
            alpha: Weight for dense scores (1-alpha for sparse); only used by "linear" fusion
            fusion_method: "rrf" (Reciprocal Rank Fusion) or "linear" (weighted sum)
        """
        self.dense = dense_retriever
        self.sparse = sparse_retriever
        self.alpha = alpha
        self.fusion_method = fusion_method
        
        # Ensure same documents
        assert len(dense_retriever.documents) == len(sparse_retriever.documents)
        self.documents = dense_retriever.documents
        
    def search(self, query: str, k: int = 5, first_stage_k: int = 50) -> List[Tuple[Document, float]]:
        """
        Hybrid search combining dense and sparse retrieval.
        """
        if self.fusion_method == "rrf":
            return self._rrf_search(query, k, first_stage_k)
        else:
            return self._linear_search(query, k)
    
    def _rrf_search(self, query: str, k: int, first_stage_k: int) -> List[Tuple[Document, float]]:
        """
        Reciprocal Rank Fusion.
        """
        rrf_k = 60  # Standard RRF constant
        
        # Get rankings from both retrievers
        dense_results = self.dense.search(query, k=first_stage_k)
        sparse_results = self.sparse.search(query, k=first_stage_k)
        
        # Build document to rank mapping
        doc_to_id = {id(doc): i for i, doc in enumerate(self.documents)}
        
        # Calculate RRF scores
        rrf_scores = {}
        
        for rank, (doc, _) in enumerate(dense_results):
            doc_id = doc_to_id[id(doc)]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)
        
        for rank, (doc, _) in enumerate(sparse_results):
            doc_id = doc_to_id[id(doc)]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)
        
        # Sort by RRF score
        sorted_docs = sorted(rrf_scores.items(), key=lambda x: -x[1])
        
        return [(self.documents[doc_id], score) for doc_id, score in sorted_docs[:k]]
    
    def _linear_search(self, query: str, k: int) -> List[Tuple[Document, float]]:
        """
        Linear combination of normalized scores.
        """
        # Get all scores
        dense_scores = self.dense.get_scores(query)
        sparse_scores = self.sparse.get_scores(query)
        
        # Normalize to [0, 1]
        def normalize(scores):
            min_s, max_s = scores.min(), scores.max()
            if max_s - min_s < 1e-6:
                return np.zeros_like(scores)
            return (scores - min_s) / (max_s - min_s)
        
        dense_norm = normalize(dense_scores)
        sparse_norm = normalize(sparse_scores)
        
        # Combine with weights
        hybrid_scores = self.alpha * dense_norm + (1 - self.alpha) * sparse_norm
        
        # Get top-k
        top_indices = np.argsort(hybrid_scores)[-k:][::-1]
        
        return [(self.documents[idx], hybrid_scores[idx]) for idx in top_indices]


# Create hybrid retriever
print("🔀 Creating hybrid retriever...")
hybrid_retriever = HybridRetriever(
    dense_retriever=dense_retriever,
    sparse_retriever=bm25_retriever,
    alpha=0.5,  # Equal weight to dense and sparse (only used by linear fusion)
    fusion_method="rrf"
)
print("✅ Hybrid retriever ready!")

In [None]:
# Compare all three retrieval methods
test_queries = [
    "LPDDR5X memory bandwidth",  # Technical term - sparse should help
    "How does GPU memory work?",  # Conceptual - dense should help
    "bge-large-en-v1.5 embedding model",  # Specific model name
    "advantages of unified memory architecture",  # Conceptual + specific
]

for query in test_queries:
    print(f"\n{'='*70}")
    print(f"🔍 Query: '{query}'")
    print(f"{'='*70}")
    
    # Get top result from each method
    dense_top = dense_retriever.search(query, k=1)[0]
    sparse_top = bm25_retriever.search(query, k=1)
    sparse_top = sparse_top[0] if sparse_top else (None, 0)
    hybrid_top = hybrid_retriever.search(query, k=1)[0]
    
    print(f"\n🔵 Dense Top: {dense_top[0].metadata['source'] if dense_top[0] else 'N/A'}")
    print(f"   Score: {dense_top[1]:.3f}")
    
    print(f"\n🟢 Sparse Top: {sparse_top[0].metadata['source'] if sparse_top[0] else 'N/A'}")
    print(f"   Score: {sparse_top[1]:.2f}")
    
    print(f"\n🟣 Hybrid Top: {hybrid_top[0].metadata['source']}")
    print(f"   Score: {hybrid_top[1]:.4f}")

---

## Part 6: Finding Optimal Fusion Weights

Let's evaluate different alpha values to find the best balance.

In [None]:
# Evaluation dataset
eval_dataset = [
    {
        "question": "What is the memory capacity of DGX Spark?",
        "expected_source": "dgx_spark_technical_guide.md",
    },
    {
        "question": "LPDDR5X bandwidth specifications",
        "expected_source": "dgx_spark_technical_guide.md",
    },
    {
        "question": "How does self-attention work in transformers?",
        "expected_source": "transformer_architecture_explained.md",
    },
    {
        "question": "QK^T divided by sqrt(d_k)",
        "expected_source": "transformer_architecture_explained.md",
    },
    {
        "question": "What is low-rank adaptation?",
        "expected_source": "lora_finetuning_guide.md",
    },
    {
        "question": "target_modules q_proj k_proj",
        "expected_source": "lora_finetuning_guide.md",
    },
    {
        "question": "How does GPTQ quantization work?",
        "expected_source": "quantization_methods.md",
    },
    {
        "question": "Q4_K_M GGUF format",
        "expected_source": "quantization_methods.md",
    },
    {
        "question": "Benefits of retrieval augmented generation",
        "expected_source": "rag_architecture_patterns.md",
    },
    {
        "question": "ChromaDB vs FAISS performance",
        "expected_source": "vector_database_comparison.md",
    },
]

print(f"📋 Evaluation dataset: {len(eval_dataset)} queries")

In [None]:
def evaluate_retriever(retriever, eval_dataset: List[Dict], k: int = 5) -> Dict[str, float]:
    """
    Evaluate retriever on the evaluation dataset.
    """
    correct_at_1 = 0
    correct_at_3 = 0
    correct_at_5 = 0
    
    for item in eval_dataset:
        question = item["question"]
        expected = item["expected_source"]
        
        results = retriever.search(question, k=k)
        sources = [r[0].metadata.get('source') for r in results]
        
        if sources and sources[0] == expected:
            correct_at_1 += 1
        if expected in sources[:3]:
            correct_at_3 += 1
        if expected in sources[:5]:
            correct_at_5 += 1
    
    n = len(eval_dataset)
    return {
        "recall@1": correct_at_1 / n,
        "recall@3": correct_at_3 / n,
        "recall@5": correct_at_5 / n,
    }


# Evaluate baseline retrievers
print("📊 Evaluating baseline retrievers...")

dense_metrics = evaluate_retriever(dense_retriever, eval_dataset)
sparse_metrics = evaluate_retriever(bm25_retriever, eval_dataset)

print(f"\n🔵 Dense Retriever:")
print(f"   Recall@1: {dense_metrics['recall@1']:.0%}")
print(f"   Recall@5: {dense_metrics['recall@5']:.0%}")

print(f"\n🟢 Sparse Retriever (BM25):")
print(f"   Recall@1: {sparse_metrics['recall@1']:.0%}")
print(f"   Recall@5: {sparse_metrics['recall@5']:.0%}")

In [None]:
# Find optimal alpha
print("🔬 Finding optimal alpha for hybrid search...")
print("-" * 60)

alphas = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
results = []

for alpha in alphas:
    # RRF hybrid (_rrf_search ignores alpha, so the RRF rows are identical for every alpha)
    hybrid = HybridRetriever(
        dense_retriever, bm25_retriever,
        alpha=alpha, fusion_method="rrf"
    )
    metrics = evaluate_retriever(hybrid, eval_dataset)
    results.append((alpha, "RRF", metrics))
    
    # Linear hybrid
    hybrid_linear = HybridRetriever(
        dense_retriever, bm25_retriever,
        alpha=alpha, fusion_method="linear"
    )
    metrics_linear = evaluate_retriever(hybrid_linear, eval_dataset)
    results.append((alpha, "Linear", metrics_linear))

# Display results
print(f"\n{'Alpha':<8} {'Method':<10} {'R@1':<10} {'R@3':<10} {'R@5':<10}")
print("-" * 50)

best_result = None
best_score = -1.0  # below any achievable score, so best_result is always set

for alpha, method, metrics in results:
    print(f"{alpha:<8} {method:<10} {metrics['recall@1']:<10.0%} "
          f"{metrics['recall@3']:<10.0%} {metrics['recall@5']:<10.0%}")
    
    score = metrics['recall@1'] + metrics['recall@5']
    if score > best_score:
        best_score = score
        best_result = (alpha, method, metrics)

print(f"\n🏆 Best Configuration:")
print(f"   Alpha: {best_result[0]}")
print(f"   Method: {best_result[1]}")
print(f"   Recall@5: {best_result[2]['recall@5']:.0%}")

---

## Part 7: Improvement Analysis

In [None]:
# Create optimized hybrid retriever
optimal_alpha, optimal_method = best_result[0], best_result[1]

optimal_hybrid = HybridRetriever(
    dense_retriever, bm25_retriever,
    alpha=optimal_alpha, 
    fusion_method=optimal_method.lower()
)

# Compare improvements
print("\n📈 IMPROVEMENT ANALYSIS")
print("=" * 60)

print(f"\n{'Method':<20} {'R@1':<10} {'R@5':<10}")
print("-" * 40)
print(f"{'Dense Only':<20} {dense_metrics['recall@1']:<10.0%} {dense_metrics['recall@5']:<10.0%}")
print(f"{'Sparse Only':<20} {sparse_metrics['recall@1']:<10.0%} {sparse_metrics['recall@5']:<10.0%}")

hybrid_metrics = evaluate_retriever(optimal_hybrid, eval_dataset)
print(f"{'Hybrid (Optimal)':<20} {hybrid_metrics['recall@1']:<10.0%} {hybrid_metrics['recall@5']:<10.0%}")

# Calculate improvement
dense_r5 = dense_metrics['recall@5']
hybrid_r5 = hybrid_metrics['recall@5']
improvement = (hybrid_r5 - dense_r5) / dense_r5 * 100 if dense_r5 > 0 else 0

print(f"\n🎯 Improvement over Dense: {improvement:.1f}%")

---

## ⚠️ Common Mistakes

### Mistake 1: Not Normalizing Scores Before Linear Fusion
```python
# ❌ Wrong: Raw scores have different scales
hybrid_scores = alpha * dense_scores + (1 - alpha) * bm25_scores

# ✅ Right: Normalize each score array to [0, 1] first
dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min())
sparse_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
hybrid_scores = alpha * dense_norm + (1 - alpha) * sparse_norm
```
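
To see why this matters, here is a minimal numeric sketch with made-up scores for three documents. Cosine similarities stay within a narrow range while BM25 scores are unbounded, so without normalization the BM25 term effectively decides the ranking on its own. The helper `min_max` mirrors the `normalize` function used in `_linear_search`.

```python
import numpy as np


def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1]; return zeros if all scores are equal."""
    span = scores.max() - scores.min()
    if span < 1e-6:
        return np.zeros_like(scores, dtype=float)
    return (scores - scores.min()) / span


# Made-up scores for documents 0, 1, 2
dense_scores = np.array([0.82, 0.80, 0.40])   # cosine similarities
sparse_scores = np.array([1.0, 12.0, 11.0])   # unbounded BM25 scores
alpha = 0.5

raw = alpha * dense_scores + (1 - alpha) * sparse_scores
fused = alpha * min_max(dense_scores) + (1 - alpha) * min_max(sparse_scores)

raw_order = np.argsort(raw)[::-1].tolist()
fused_order = np.argsort(fused)[::-1].tolist()
print("Raw ranking:       ", raw_order)
print("Normalized ranking:", fused_order)
```

With raw scores, document 2 outranks document 0 purely on its BM25 value even though its dense similarity is the weakest of the three. After normalization both signals carry comparable weight and document 0 moves ahead of document 2.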

### Mistake 2: Using RRF with Too Few Candidates
```python
# ❌ Wrong: Only get top 5 from each, then RRF
dense_top5 = dense.search(query, k=5)
sparse_top5 = sparse.search(query, k=5)

# ✅ Right: Get more candidates for better RRF
dense_top50 = dense.search(query, k=50)
sparse_top50 = sparse.search(query, k=50)
# Then RRF and take top 5
```
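
The effect of candidate depth can be shown with a toy example. In this sketch, with made-up document IDs and a hypothetical helper `rrf_top1`, document `x` is ranked third by both retrievers. With a shallow cutoff it never enters the candidate pool; with a deeper cutoff its two moderate ranks add up and it wins.

```python
from typing import Dict, List


def rrf_top1(dense: List[str], sparse: List[str], depth: int, k: int = 60) -> str:
    """Return the top RRF document when each ranking is cut off at `depth`."""
    scores: Dict[str, float] = {}
    for ranking in (dense[:depth], sparse[:depth]):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return max(scores, key=scores.get)


dense_ranking = ["a", "b", "x", "c", "d"]
sparse_ranking = ["e", "f", "x", "g", "h"]

shallow_winner = rrf_top1(dense_ranking, sparse_ranking, depth=2)
deep_winner = rrf_top1(dense_ranking, sparse_ranking, depth=5)

print(f"Winner with depth=2: {shallow_winner}")
print(f"Winner with depth=5: {deep_winner}")
```

At depth 5, `x` scores 2/63, which beats the 1/61 earned by a document that is first in only one list. This is the reason `search` defaults to `first_stage_k=50`.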

### Mistake 3: Same Alpha for All Query Types
```python
# ❌ Wrong: Fixed alpha for everything
hybrid = HybridRetriever(dense_retriever, bm25_retriever, alpha=0.5, fusion_method="linear")

# ✅ Better: Adjust based on query type
# (looks_like_keyword_query is a placeholder for your own heuristic)
if looks_like_keyword_query(query):  # e.g., contains model names
    alpha = 0.3  # Weight sparse more
else:
    alpha = 0.7  # Weight dense more
```

---

## ✋ Try It Yourself

### Exercise 1: Query-Adaptive Alpha
Implement a function that adjusts alpha based on whether the query contains technical jargon.

### Exercise 2: Weighted RRF
Modify the RRF implementation to accept different weights for dense and sparse.

### Exercise 3: Three-Way Hybrid
Add a third retriever (e.g., based on document titles only) to the fusion.

<details>
<summary>💡 Hint for Exercise 1</summary>

```python
import re

def adaptive_alpha(query: str) -> float:
    # Check for technical patterns
    technical_patterns = [
        r'\b[A-Z]{2,}[A-Z0-9]*\b',  # Acronyms like LPDDR5X, GPU
        r'\b\d+GB\b',          # Memory sizes
        r'\b[a-z]+-[a-z]+',    # Hyphenated terms
    ]
    
    for pattern in technical_patterns:
        if re.search(pattern, query):
            return 0.3  # More weight to sparse
    
    return 0.7  # Default: more weight to dense
```
</details>

---

## 🎉 Checkpoint

You've learned:
- ✅ The difference between dense (semantic) and sparse (keyword) retrieval
- ✅ How to implement BM25 retrieval with `rank_bm25`
- ✅ How to combine retrieval methods with RRF and linear fusion
- ✅ How to find optimal fusion weights for your data

**Key Insight:** Hybrid search often outperforms either method alone, especially for queries mixing concepts and specific terms!

---

## 🧹 Cleanup

In [None]:
# Clean up
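# The hybrid objects hold references to the retrievers, so delete them too
del hybrid, hybrid_linear, optimal_hybrid, hybrid_retriever
del embedding_model, dense_retriever, bm25_retriever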
gc.collect()
torch.cuda.empty_cache()

print("✅ Cleanup complete!")

---

## Next Steps

In the next lab, we'll add **reranking** to further improve retrieval quality!

➡️ Continue to [Lab 3.5.5: Reranking Pipeline](./lab-3.5.5-reranking.ipynb)