# 🔍 Week 6: Dense Retrieval Systems

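This notebook walks through building production-ready retrieval systems: sparse (BM25), dense, and hybrid retrieval, plus vector stores and a complete search pipeline.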

## Table of Contents
1. [Retrieval Fundamentals](#1-retrieval-fundamentals)
2. [BM25 (Sparse Retrieval)](#2-bm25-sparse-retrieval)
3. [Dense Retrieval](#3-dense-retrieval)
4. [Hybrid Retrieval](#4-hybrid-retrieval)
5. [Vector Databases](#5-vector-databases)
6. [Building a Search System](#6-building-a-search-system)

---

In [1]:
# Setup
import sys
sys.path.insert(0, '../..')

import numpy as np

from src.retrieval import (
    BM25Retriever,
    DenseRetriever,
    HybridRetriever,
    RetrievalPipeline,
)
from src.retrieval.retrieval import Document, TextPreprocessor

print("✅ Setup complete!")

✅ Setup complete!


---

## 1. Retrieval Fundamentals

### 1.1 The Information Retrieval Problem

**Goal:** Given a query, find the most relevant documents from a corpus.

```
Query: "How do neural networks learn?"
         ↓
    [Retrieval System]
         ↓
Ranked Results:
  1. "Neural networks learn through backpropagation..."
  2. "Learning in deep networks involves..."
  3. "Training neural models requires..."
```

### 1.2 Retrieval Methods Comparison

| Method | Representation | Pros | Cons |
|--------|---------------|------|------|
| **BM25** | Sparse (term frequencies) | Fast, interpretable | No semantic matching |
| **Dense** | Dense vectors | Captures meaning | Needs an embedding model |
| **Hybrid** | Both | Combines the strengths of both | More complex |

---

## 2. BM25 (Sparse Retrieval)

### 2.1 BM25 Formula

$$score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}$$

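Where:
- $q_1, \dots, q_n$ = the terms of query $Q$
- $IDF(q_i)$ = inverse document frequency of term $q_i$ (rarer terms receive higher weight)
- $f(q_i, D)$ = frequency of term $q_i$ in document $D$
- $|D|$ = length of document $D$ in terms
- $avgdl$ = average document length across the corpus
- $k_1$ = term-frequency saturation parameter (typically 1.5)
- $b$ = document-length normalization parameter (typically 0.75)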
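
To make the formula concrete, here is a minimal pure-Python sketch of BM25 scoring. It is independent of `BM25Retriever`, whose internals are not shown in this notebook. The whitespace tokenizer, the toy documents, and the smoothed IDF variant $\ln\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$ are assumptions (one common choice among several), so the scores will not necessarily match the library output below.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; real systems also strip
    # punctuation and often remove stopwords.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        length_norm = 1 - b + b * len(toks) / avgdl
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, not the notebook's `documents`.
toy_docs = [
    "neural networks learn from data",
    "computer vision interprets images",
    "neural networks mimic the brain",
]
toy_scores = bm25_scores("neural networks learn", toy_docs)
toy_order = sorted(range(len(toy_docs)), key=lambda i: -toy_scores[i])
print(toy_order)  # document indices, best match first
```

The first document matches all three query terms and is shorter than the third, so it ranks first; the second document shares no terms with the query and scores zero.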

In [2]:
# Create sample documents
documents = [
    Document(id="1", content="Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed."),
    Document(id="2", content="Deep learning uses neural networks with many layers to learn representations of data with multiple levels of abstraction."),
    Document(id="3", content="Natural language processing helps computers understand, interpret, and generate human language in useful ways."),
    Document(id="4", content="Computer vision enables machines to interpret and understand visual information from the world."),
    Document(id="5", content="Reinforcement learning trains agents to make sequences of decisions by rewarding desired behaviors."),
    Document(id="6", content="Neural networks are inspired by biological neural networks in the human brain."),
]

# Initialize BM25
bm25 = BM25Retriever(k1=1.5, b=0.75)
bm25.index(documents)

# Search
query = "How do neural networks learn from data?"
results = bm25.retrieve(query, top_k=3)

print(f"Query: '{query}'\n")
print("BM25 Results:")
print("=" * 60)

for r in results:
    print(f"\n[Score: {r.score:.3f}] Doc {r.document.id}")
    print(f"  {r.document.content[:80]}...")

INFO:src.retrieval.retrieval:Initialized BM25Retriever with k1=1.5, b=0.75
INFO:src.retrieval.retrieval:Indexed 6 documents with 51 unique terms


Query: 'How do neural networks learn from data?'

BM25 Results:

[Score: 3.778] Doc 2
  Deep learning uses neural networks with many layers to learn representations of ...

[Score: 3.212] Doc 6
  Neural networks are inspired by biological neural networks in the human brain....

[Score: 1.889] Doc 1
  Machine learning is a subset of artificial intelligence that enables computers t...


In [3]:
# Explain BM25 Score
contributions = bm25.explain_score(query, 0)

print("Score Breakdown for Doc 1:")
print("-" * 40)
for term, score in sorted(contributions.items(), key=lambda x: -x[1]):
    if score > 0:
        print(f"  {term:15s}: {score:.4f}")

Score Breakdown for Doc 1:
----------------------------------------
  learn          : 0.9446
  data           : 0.9446


---

## 3. Dense Retrieval

### 3.1 How Dense Retrieval Works

```
Query: "How do neural networks learn?"
         ↓
   [Encoder Model]
         ↓
   Query Vector: [0.2, -0.1, 0.5, ...]
         ↓
   Compare with Document Vectors (cosine similarity)
         ↓
   Ranked Results
```
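
The comparison step in the diagram can be sketched without any encoder at all. The example below ranks hand-written toy vectors by cosine similarity; the vectors and the helper names `cosine_similarity` and `rank_by_cosine` are hypothetical and only stand in for what an encoder and `DenseRetriever` would provide.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors; avoids division by zero
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional embeddings; all-MiniLM-L6-v2 produces
# 384-dimensional vectors.
toy_query_vec = [0.2, -0.1, 0.5]
toy_doc_vecs = [
    [0.4, -0.2, 1.0],   # same direction as the query, twice the length
    [-0.2, 0.1, -0.5],  # opposite direction
    [0.5, 0.0, 0.0],    # partially aligned
]
toy_dense_ranking = rank_by_cosine(toy_query_vec, toy_doc_vecs)
print([i for i, _ in toy_dense_ranking])
```

Because cosine similarity ignores vector length, the first toy document scores 1.0 even though it is twice as long as the query vector.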

In [None]:
# Initialize Dense Retriever
dense = DenseRetriever(model_name="all-MiniLM-L6-v2", use_faiss=False)
dense.index(documents)

# Search
results = dense.retrieve(query, top_k=3)

print(f"Query: '{query}'\n")
print("Dense Retrieval Results:")
print("=" * 60)

for r in results:
    print(f"\n[Score: {r.score:.3f}] Doc {r.document.id}")
    print(f"  {r.document.content[:80]}...")

INFO:src.retrieval.retrieval:Initialized DenseRetriever with model: all-MiniLM-L6-v2
INFO:src.retrieval.retrieval:Indexed 6 documents with dense embeddings


ValueError: shapes (6,62) and (7,) not aligned: 62 (dim 1) != 7 (dim 0)


### 3.2 BM25 vs Dense: When to Use What

| Scenario | BM25 | Dense |
|----------|------|-------|
| **Exact keyword match** | ✅ Better | May miss |
| **Semantic similarity** | ❌ Limited | ✅ Better |
| **Speed (no GPU)** | ✅ Faster | Slower |
| **Out-of-vocabulary** | ❌ Fails | ✅ Can handle |
| **Interpretability** | ✅ Clear | ❌ Black box |

In [None]:
# Compare on semantic query
semantic_query = "AI that understands text and language"

print(f"Query: '{semantic_query}'\n")

bm25_results = bm25.retrieve(semantic_query, top_k=2)
dense_results = dense.retrieve(semantic_query, top_k=2)

print("BM25 Top Result:")
print(f"  {bm25_results[0].document.content[:60]}...")

print("\nDense Top Result:")
print(f"  {dense_results[0].document.content[:60]}...")

print("\n💡 Dense retrieval better captures 'NLP' as relevant to 'understands text'")

---

## 4. Hybrid Retrieval

### 4.1 Combining Sparse and Dense

Hybrid retrieval combines both methods:
1. Retrieve candidates independently with BM25 and with the dense retriever
2. Merge the two rankings using Reciprocal Rank Fusion (RRF) or weighted score fusion
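
As a rough illustration of the fusion step, the sketch below implements Reciprocal Rank Fusion over two ranked lists of document ids. The function name, the example rankings, and the constant `k=60` (a commonly used default) are assumptions; how `HybridRetriever` implements `fusion="rrf"` internally is not shown in this notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical ranked outputs from the two retrievers (doc ids, best first).
toy_bm25_ranking = ["2", "6", "1"]
toy_dense_ids = ["6", "3", "2"]
fused_results = reciprocal_rank_fusion([toy_bm25_ranking, toy_dense_ids])
print([doc_id for doc_id, _ in fused_results])
```

RRF uses only rank positions, so BM25 scores and cosine similarities never need to be normalized onto a common scale. Documents ranked well by both retrievers rise to the top.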

In [None]:
# Hybrid Retriever
hybrid = HybridRetriever(alpha=0.5, fusion="rrf")
hybrid.index(documents)

# Compare all three
test_queries = [
    "machine learning algorithms",
    "understanding human language",
    "neural network brain"
]

print("Retrieval Comparison")
print("=" * 70)

for q in test_queries:
    bm25_top = bm25.retrieve(q, top_k=1)[0].document.id
    dense_top = dense.retrieve(q, top_k=1)[0].document.id
    hybrid_top = hybrid.retrieve(q, top_k=1)[0].document.id
    
    print(f"\nQuery: '{q}'")
    print(f"  BM25: Doc {bm25_top} | Dense: Doc {dense_top} | Hybrid: Doc {hybrid_top}")

---

## 5. Vector Databases

### 5.1 Why Vector Databases?

For production, you need efficient similarity search at scale:

| Database | Type | Key Features |
|----------|------|-------------|
| **FAISS** | Library | Open source from Meta (Facebook AI Research), very fast, GPU support |
| **Pinecone** | Managed | Serverless, easy to use |
| **Milvus** | Open source | Distributed, scalable |
| **Qdrant** | Open source | Modern, filtering support |
| **ChromaDB** | Open source | Simple, good for prototyping |

In [None]:
# Example: Simple Vector Store
class SimpleVectorStore:
    """Basic in-memory vector store for demonstration."""
    
    def __init__(self):
        self.documents = []
        self.embeddings = None
    
    def add(self, documents, embeddings):
        """Add documents with embeddings."""
        self.documents.extend(documents)
        if self.embeddings is None:
            self.embeddings = embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, embeddings])
    
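    def search(self, query_embedding, top_k=5):
        """Search by cosine similarity between the query and stored embeddings."""
        # Normalize both sides so the dot product equals cosine similarity;
        # the small epsilon guards against division by zero.
        doc_norms = np.linalg.norm(self.embeddings, axis=1) + 1e-12
        query_norm = np.linalg.norm(query_embedding) + 1e-12
        similarities = (self.embeddings @ query_embedding) / (doc_norms * query_norm)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        return [
            (self.documents[i], float(similarities[i]))
            for i in top_indices
        ]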

print("✅ Vector store pattern demonstrated!")
print("\nIn production, use FAISS, Pinecone, or similar for efficient search.")

---

## 6. Building a Search System

### 6.1 Complete Pipeline

In [None]:
class SearchSystem:
    """Complete search system with hybrid retrieval."""
    
    def __init__(self, use_hybrid=True):
        if use_hybrid:
            self.retriever = HybridRetriever(alpha=0.5, fusion="rrf")
        else:
            self.retriever = DenseRetriever()
        
        self.documents = []
    
    def index(self, documents):
        """Index documents for search."""
        self.documents = documents
        self.retriever.index(documents)
        print(f"Indexed {len(documents)} documents")
    
    def search(self, query, top_k=5):
        """Search for relevant documents."""
        results = self.retriever.retrieve(query, top_k=top_k)
        
        return [
            {
                "id": r.document.id,
                "content": r.document.content,
                "score": r.score,
                "rank": r.rank
            }
            for r in results
        ]

# Build and use
search = SearchSystem(use_hybrid=True)
search.index(documents)

# Search
results = search.search("How do machines learn from experience?", top_k=3)

print("\nSearch Results:")
for r in results:
    print(f"  [{r['score']:.3f}] {r['content'][:50]}...")

---

## 📝 Summary

### Key Takeaways

1. **BM25** - Great for keyword matching, fast, interpretable
2. **Dense Retrieval** - Captures semantics, needs embeddings
3. **Hybrid** - Best of both worlds, recommended for production
4. **Vector DBs** - Essential for scale (FAISS, Pinecone, etc.)

### Production Checklist

- [ ] Choose retrieval method based on use case
- [ ] Index documents with appropriate chunking
- [ ] Use vector database for scale
- [ ] Implement caching for efficiency
- [ ] Add reranking for improved quality
- [ ] Monitor latency and relevance metrics
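
For the relevance-metrics item, one possible starting point is to track Recall@k and Mean Reciprocal Rank (MRR) on a small labeled query set. The sketch below assumes you have, for each query, the retrieved ids in rank order and a set of ids judged relevant; the helper names and the example data are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled queries: (retrieved ids in rank order, ids judged relevant).
eval_set = [
    (["2", "6", "1"], {"2", "6"}),
    (["4", "3", "5"], {"3", "5"}),
]
mean_recall = sum(recall_at_k(r, rel, k=2) for r, rel in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(f"Recall@2: {mean_recall:.2f}  MRR: {mrr:.2f}")
```

Computing these on the same query set before and after a change (for example, switching from BM25 to hybrid retrieval) gives a simple regression check for relevance.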