# 🔍 Retrieval Techniques - Beyond Simple Similarity Search

> **Complete Guide with Code Explanations and Expected Outputs**

This notebook demonstrates four retrieval techniques for finding relevant documents: **Basic Similarity Search**, **Sparse Retrieval (BM25)**, **Hybrid Search**, and **Maximal Marginal Relevance (MMR)**.

---

## 🛠️ Setup and Sample Data

### 📦 Required Libraries Installation

In [6]:
! pip install sentence-transformers rank-bm25 scikit-learn numpy



| Library | Purpose |
|---------|---------|
| `sentence-transformers` | Pre-trained embedding models for dense retrieval |
| `rank-bm25` | Implementation of BM25 algorithm for sparse retrieval |
| `scikit-learn` | Cosine similarity calculations |
| `numpy` | Numerical operations |

### 📄 Sample Documents

In [1]:
documents = [
    "Machine learning algorithms are powerful tools for data analysis and prediction",
    "Deep learning neural networks can process complex patterns in data",
    "Python is a popular programming language for artificial intelligence",
    "Data science involves extracting insights from large datasets",
    "Natural language processing helps computers understand human language",
    "Computer vision enables machines to interpret and analyze visual information",
    "Supervised learning uses labeled data to train predictive models",
    "Unsupervised learning finds hidden patterns in unlabeled data",
    "Reinforcement learning trains agents through reward and punishment",
    "Big data analytics requires specialized tools and techniques"
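]

# Print the numbered list shown under "Expected Output"
print("Sample Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")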


#### 📋 **Expected Output:**
```
Sample Documents:
1. Machine learning algorithms are powerful tools for data analysis and prediction
2. Deep learning neural networks can process complex patterns in data
3. Python is a popular programming language for artificial intelligence
4. Data science involves extracting insights from large datasets
5. Natural language processing helps computers understand human language
6. Computer vision enables machines to interpret and analyze visual information
7. Supervised learning uses labeled data to train predictive models
8. Unsupervised learning finds hidden patterns in unlabeled data
9. Reinforcement learning trains agents through reward and punishment
10. Big data analytics requires specialized tools and techniques
```

---

## 1️⃣ Basic Similarity Search (Dense Retrieval)

### 🧠 Code Explanation

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents)

**What's happening here:**
- 🔄 Loads a pre-trained sentence transformer model
- 🎯 Converts all documents into dense vector embeddings
- 📊 Each document becomes a **384-dimensional vector**
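
To make the similarity score concrete, here is a minimal dependency-free sketch of cosine similarity, the measure that scikit-learn's `cosine_similarity` computes for each query-document pair. The three-dimensional vectors and the names `cosine`, `query_vec`, `doc_close`, and `doc_far` are made up for illustration; real embeddings from this model have 384 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" standing in for real model output
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
doc_far = [0.0, 3.0, 0.0]     # orthogonal to the query

print(cosine(query_vec, doc_close))  # approximately 1.0: direction matters, not length
print(cosine(query_vec, doc_far))    # 0.0: nothing in common
```

Because only the direction of the vectors matters, a long and a short document about the same topic can still score close to 1.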

### 🔍 The Search Function

In [27]:
def basic_similarity_search(query, top_k=3):
    # Encode the input query into an embedding vector using the model
    query_embedding = model.encode([query])
    
    # Compute cosine similarity between the query embedding and all document embeddings
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Get the indices of the top_k most similar documents, sorted in descending order of similarity
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    # Prepare a list to store the final results
    results = []
    
    # Iterate over the top indices and gather the corresponding scores and documents
    for idx in top_indices:
        results.append({
            'score': similarities[idx],       # Similarity score between query and document
            'document': documents[idx]        # The actual document text
        })
    
    # Return the list of top_k results with scores and corresponding documents
    return results


#### ✨ **Expected Output for Query: "machine learning algorithms"**

In [4]:
basic_similarity_search("machine learning algorithms")

[{'score': np.float32(0.7338959),
  'document': 'Machine learning algorithms are powerful tools for data analysis and prediction'},
 {'score': np.float32(0.45129114),
  'document': 'Supervised learning uses labeled data to train predictive models'},
 {'score': np.float32(0.43556952),
  'document': 'Deep learning neural networks can process complex patterns in data'}]

🔍 Query: 'machine learning algorithms'

📊 Top 3 results:
1. Score: 0.734 - Machine learning algorithms are powerful tools for data analysis and prediction
2. Score: 0.451 - Supervised learning uses labeled data to train predictive models
3. Score: 0.436 - Deep learning neural networks can process complex patterns in data


> **💡 Why this works:** Dense embeddings capture semantic meaning, so phrases like *"machine learning algorithms"* can effectively match documents related to ML concepts—even if the exact keywords aren't present.


## 2️⃣ Sparse Retrieval (BM25)

### 📝 Code Explanation


In [7]:
from rank_bm25 import BM25Okapi

tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)


**Process breakdown:**
- ✂️ Tokenizes documents by splitting on whitespace and converting to lowercase
- 🏗️ Creates BM25 index for term frequency-based search
- ⚖️ BM25 weighs **term frequency**, **inverse document frequency** (rarer terms count more), and **document length**
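
Whitespace tokenization is simple but literal: punctuation stays attached to words, and there is no stemming. The sketch below contrasts it with a slightly more forgiving regex tokenizer. `simple_tokenize` is a hypothetical helper, not part of `rank-bm25`; if you adopt something like it, apply it to both the documents and the query.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters and digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

sentence = "Machine learning, explained."

# Whitespace splitting keeps punctuation glued to the tokens
print(sentence.lower().split())    # ['machine', 'learning,', 'explained.']

# The regex variant yields clean tokens that can match a query term like "learning"
print(simple_tokenize(sentence))   # ['machine', 'learning', 'explained']

# Neither approach stems words: "machines" and "machine" remain different tokens
print(simple_tokenize("machines") == simple_tokenize("machine"))  # False
```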

### 🎯 The Search Function

In [28]:
def sparse_search(query, top_k=3):
    # Convert the query to lowercase and split it into tokens (words)
    query_tokens = query.lower().split()
    
    # Use the BM25 model to calculate relevance scores for each document based on the query tokens
    scores = bm25.get_scores(query_tokens)
    
    # Get the indices of the top_k documents with the highest BM25 scores
    top_indices = np.argsort(scores)[::-1][:top_k]
    
    # Prepare a list to store the final top_k results
    results = []
    
    # Iterate over the top indices and collect the corresponding scores and documents
    for idx in top_indices:
        results.append({
            'score': scores[idx],        # BM25 relevance score for the document
            'document': documents[idx]   # The actual document content
        })
    
    # Return the list of top_k results with scores and corresponding documents
    return results


#### ✨ **Expected Output for Query: "machine learning algorithms"**

In [9]:
sparse_search("machine learning algorithms")

[{'score': np.float64(3.3372996537196844),
  'document': 'Machine learning algorithms are powerful tools for data analysis and prediction'},
 {'score': np.float64(0.0),
  'document': 'Big data analytics requires specialized tools and techniques'},
 {'score': np.float64(0.0),
  'document': 'Reinforcement learning trains agents through reward and punishment'}]

🔍 Query: 'machine learning algorithms'

📊 Top 3 BM25 results:
1. Score: 3.337 - Machine learning algorithms are powerful tools for data analysis and prediction
2. Score: 0.000 - Big data analytics requires specialized tools and techniques
3. Score: 0.000 - Reinforcement learning trains agents through reward and punishment

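> **💡 Why this works:** BM25 rewards documents that contain the exact query terms, weighting each term by how rare it is. Only the first document contains "machine" and "algorithms", so it is the only one with a positive score. "learning" appears in exactly half of the documents (5 of 10), which gives it an inverse document frequency of zero under the Okapi formula, so it contributes nothing. Every other document therefore scores 0.0, and the order of ranks 2 and 3 is arbitrary. BM25 is excellent for keyword matching, but it cannot rank documents that share no distinctive term with the query.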
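
The BM25 scores above can be reproduced by hand. The following is a minimal sketch of Okapi BM25 scoring, assuming the parameters `k1=1.5` and `b=0.75` (believed to be the `BM25Okapi` defaults) and leaving out the library's special handling of negative IDF values, which does not affect this query. The helper names `idf` and `bm25_score` are illustrative only.

```python
import math

documents = [
    "Machine learning algorithms are powerful tools for data analysis and prediction",
    "Deep learning neural networks can process complex patterns in data",
    "Python is a popular programming language for artificial intelligence",
    "Data science involves extracting insights from large datasets",
    "Natural language processing helps computers understand human language",
    "Computer vision enables machines to interpret and analyze visual information",
    "Supervised learning uses labeled data to train predictive models",
    "Unsupervised learning finds hidden patterns in unlabeled data",
    "Reinforcement learning trains agents through reward and punishment",
    "Big data analytics requires specialized tools and techniques",
]
tokenized = [doc.lower().split() for doc in documents]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N  # average document length in tokens

def idf(term):
    """Okapi IDF: log((N - n + 0.5) / (n + 0.5)), where n = documents containing the term."""
    n = sum(1 for d in tokenized if term in d)
    return math.log((N - n + 0.5) / (n + 0.5))

def bm25_score(query_tokens, doc_tokens, k1=1.5, b=0.75):
    """Sum the per-term BM25 contributions for one document (assumed k1 and b)."""
    length_norm = 1 - b + b * len(doc_tokens) / avgdl  # penalizes longer documents
    score = 0.0
    for term in query_tokens:
        tf = doc_tokens.count(term)
        score += idf(term) * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

query = "machine learning algorithms".split()
for term in query:
    print(term, round(idf(term), 3))   # "learning" is in 5 of 10 documents, so its IDF is 0.0

scores = [bm25_score(query, d) for d in tokenized]
print([round(s, 3) for s in scores])   # document 0 scores about 3.337, all others 0.0
```

Under these assumptions, "machine" and "algorithms" each contribute roughly 1.669 to the first document, while "learning" contributes nothing anywhere.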

## 3️⃣ Hybrid Search (Dense + Sparse)

### 🔄 Code Explanation


In [30]:
def hybrid_search(query, alpha=0.7, top_k=3):
    """
    Perform a hybrid search combining dense (semantic) and sparse (BM25) scores.

    Parameters:
    - query (str): The search query.
    - alpha (float): Weight for dense scores; (1 - alpha) is used for sparse scores.
    - top_k (int): Number of top results to return.

    Returns:
    - List[dict]: Top documents with hybrid, dense, and sparse scores.
    """

    # Step 1: Dense retrieval using cosine similarity between query and document embeddings
    query_embedding = model.encode([query])  # Encode the query into an embedding
    dense_scores = cosine_similarity(query_embedding, doc_embeddings)[0]  # Compute dense similarity scores

    # Step 2: Sparse retrieval using BM25 (bag-of-words based relevance)
    query_tokens = query.lower().split()  # Tokenize the query
    sparse_scores = bm25.get_scores(query_tokens)  # Compute BM25 scores for the query

    # Step 3: Normalize both dense and sparse scores to the 0–1 range
    dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-8)
    sparse_norm = (sparse_scores - sparse_scores.min()) / (sparse_scores.max() - sparse_scores.min() + 1e-8)

    # Step 4: Combine both scores using a weighted average controlled by alpha
    hybrid_scores = alpha * dense_norm + (1 - alpha) * sparse_norm  # Final score balances semantic and lexical relevance

    # Step 5: Identify indices of the top_k documents with the highest hybrid scores
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

    # Step 6: Build the final list of top results with relevant metadata
    results = []
    for idx in top_indices:
        results.append({
            "rank": len(results) + 1,             # Rank of the document
            "document": documents[idx],           # The actual document text
            "hybrid_score": float(hybrid_scores[idx]),  # Combined score
            "dense_score": float(dense_norm[idx]),      # Normalized dense score
            "sparse_score": float(sparse_norm[idx]),    # Normalized sparse score
        })

    # Return the list of results
    return results


**Process steps:**
1. 🔄 Runs both dense and sparse retrieval
2. 📏 Normalizes scores to comparable ranges (0-1)
3. ⚖️ Combines using weighted average where **alpha** controls the balance
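
The normalization and weighting steps are easiest to follow on a small example. The sketch below uses plain Python and made-up raw scores for three documents; `min_max` and `combine` are hypothetical helpers that mirror Steps 3 and 4 of `hybrid_search`.

```python
def min_max(scores, eps=1e-8):
    """Rescale scores to roughly [0, 1]; eps avoids division by zero if all scores are equal."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]

def combine(dense, sparse, alpha):
    """Weighted average of the normalized dense and sparse scores."""
    dense_norm, sparse_norm = min_max(dense), min_max(sparse)
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense_norm, sparse_norm)]

# Hypothetical raw scores for three documents (not taken from the model above)
dense = [0.8, 0.5, 0.2]    # cosine similarities
sparse = [0.0, 4.0, 2.0]   # BM25 scores on a different scale

for alpha in (0.7, 0.3):
    hybrid = combine(dense, sparse, alpha)
    best = max(range(len(hybrid)), key=hybrid.__getitem__)
    print(f"alpha={alpha}: {[round(h, 2) for h in hybrid]} -> top document: {best}")
```

With these numbers, document 0 (the best semantic match) leads at `alpha=0.7`, while document 1 (the best keyword match) takes over at `alpha=0.3`. Without normalization, the larger BM25 values would dominate regardless of alpha.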

#### **Alpha = 0.7** (70% dense, 30% sparse):

In [16]:
hybrid_search("machine learning algorithms", alpha=0.7, top_k=3)

[{'rank': 1,
  'document': 'Machine learning algorithms are powerful tools for data analysis and prediction',
  'hybrid_score': 0.9999999871801407,
  'dense_score': 1.0,
  'sparse_score': 0.9999999970035655},
 {'rank': 2,
  'document': 'Supervised learning uses labeled data to train predictive models',
  'hybrid_score': 0.27922937273979187,
  'dense_score': 0.398899108171463,
  'sparse_score': 0.0},
 {'rank': 3,
  'document': 'Deep learning neural networks can process complex patterns in data',
  'hybrid_score': 0.2558214068412781,
  'dense_score': 0.3654591739177704,
  'sparse_score': 0.0}]

📊 Top 3 hybrid results:
1. Hybrid: 1.000 (Dense: 1.000, Sparse: 1.000) - Machine learning algorithms are powerful tools for data analysis and prediction
2. Hybrid: 0.279 (Dense: 0.399, Sparse: 0.000) - Supervised learning uses labeled data to train predictive models
3. Hybrid: 0.256 (Dense: 0.365, Sparse: 0.000) - Deep learning neural networks can process complex patterns in data


#### **Alpha = 0.3** (30% dense, 70% sparse):

In [17]:
hybrid_search("machine learning algorithms", alpha=0.3, top_k=3)

[{'rank': 1,
  'document': 'Machine learning algorithms are powerful tools for data analysis and prediction',
  'hybrid_score': 1.0000000098234247,
  'dense_score': 1.0,
  'sparse_score': 0.9999999970035655},
 {'rank': 2,
  'document': 'Supervised learning uses labeled data to train predictive models',
  'hybrid_score': 0.11966973543167114,
  'dense_score': 0.398899108171463,
  'sparse_score': 0.0},
 {'rank': 3,
  'document': 'Deep learning neural networks can process complex patterns in data',
  'hybrid_score': 0.10963775962591171,
  'dense_score': 0.3654591739177704,
  'sparse_score': 0.0}]

📊 Top 3 hybrid results:
1. Hybrid: 1.000 (Dense: 1.000, Sparse: 1.000) - Machine learning algorithms are powerful tools for data analysis and prediction
2. Hybrid: 0.120 (Dense: 0.399, Sparse: 0.000) - Supervised learning uses labeled data to train predictive models
3. Hybrid: 0.110 (Dense: 0.365, Sparse: 0.000) - Deep learning neural networks can process complex patterns in data

> **💡 Why this works:** Hybrid search combines the semantic understanding of dense retrieval with the precise keyword matching of sparse retrieval, often providing the best of both worlds.

## 4️⃣ Maximal Marginal Relevance (MMR)

### 🎯 Code Explanation

In [29]:
def mmr_search(query, lambda_param=0.7, top_k=3):
    """
    Perform Maximal Marginal Relevance (MMR) based search for diverse and relevant documents.

    Parameters:
    - query (str): The user query.
    - lambda_param (float): Trade-off between relevance and diversity (0 to 1).
    - top_k (int): Number of top documents to return.

    Returns:
    - List[dict]: Top documents with metadata including scores.
    """

    # Step 1: Embed the input query using the model to get its dense vector representation
    query_embedding = model.encode([query])
    
    # Compute cosine similarity between the query and all document embeddings
    relevance_scores = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Initialize lists to track selected and unselected document indices
    selected_indices = []
    remaining_indices = list(range(len(documents)))

    # Step 2: Select the most relevant document (highest cosine similarity)
    first_idx = np.argmax(relevance_scores)  # Index of the most relevant document
    selected_indices.append(first_idx)       # Add it to selected list
    remaining_indices.remove(first_idx)      # Remove it from remaining list

    # Step 3: Iteratively select the rest of the top_k documents using MMR
    for _ in range(top_k - 1):
        mmr_scores = []  # List to store MMR scores for each candidate document

        for idx in remaining_indices:
            relevance = relevance_scores[idx]  # Relevance score of the current document

            # Calculate max similarity to already selected documents to ensure diversity
            max_sim_to_selected = max(
                cosine_similarity(
                    doc_embeddings[idx].reshape(1, -1),
                    doc_embeddings[sel_idx].reshape(1, -1)
                )[0][0] for sel_idx in selected_indices
            )

            # Apply the MMR formula: trade-off between relevance and diversity
            mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim_to_selected

            # Append index and its MMR score
            mmr_scores.append((idx, mmr_score))

        # Select the document with the highest MMR score
        best_idx = max(mmr_scores, key=lambda x: x[1])[0]
        selected_indices.append(best_idx)        # Add to selected
        remaining_indices.remove(best_idx)       # Remove from remaining

    # Step 4: Build the final result list with ranks, documents, and scores
    results = []
    for rank, idx in enumerate(selected_indices, 1):
        results.append({
            "rank": rank,                            # Rank of the document
            "document": documents[idx],              # The document text
            "relevance_score": float(relevance_scores[idx]),  # Original relevance score
            "index": idx                             # Index of the document in original list
        })

    # Return the list of selected top_k documents with metadata
    return results



**Algorithm steps:**
1. 🎯 First selects the most relevant document
2. ⚖️ For each subsequent selection, balances relevance against similarity to already-selected documents
3. 🎚️ Lambda parameter controls **relevance vs diversity** trade-off
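
A toy example shows how lambda changes the second pick. This is a minimal sketch that runs greedy MMR on precomputed, made-up similarity values instead of real embeddings; `mmr_select` is a hypothetical helper that follows the same formula as `mmr_search`.

```python
def mmr_select(relevance, doc_sim, lambda_param=0.7, top_k=2):
    """Greedy MMR over precomputed similarities; returns the selected document indices."""
    remaining = list(range(len(relevance)))
    # Start with the most relevant document
    first = max(remaining, key=lambda i: relevance[i])
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < top_k:
        def mmr(i):
            # Redundancy = similarity to the closest already-selected document
            redundancy = max(doc_sim[i][j] for j in selected)
            return lambda_param * relevance[i] - (1 - lambda_param) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores: documents 0 and 1 are near-duplicates, document 2 is different
relevance = [0.90, 0.85, 0.40]
doc_sim = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]

print(mmr_select(relevance, doc_sim, lambda_param=0.7))  # [0, 1]: relevance wins
print(mmr_select(relevance, doc_sim, lambda_param=0.3))  # [0, 2]: the near-duplicate is skipped
```

At `lambda_param=0.7` the near-duplicate scores 0.7 × 0.85 − 0.3 × 0.95 = 0.31 against 0.25 for the different document; at `lambda_param=0.3` the redundancy penalty pushes it to −0.41 against 0.05.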

### ✨ **Expected Output for Query: "machine learning"**

#### **Lambda = 0.7** (70% relevance, 30% diversity):


In [25]:
mmr_search("machine learning", lambda_param=0.7, top_k=3)

[{'rank': 1,
  'document': 'Machine learning algorithms are powerful tools for data analysis and prediction',
  'relevance_score': 0.6522308588027954,
  'index': np.int64(0)},
 {'rank': 2,
  'document': 'Supervised learning uses labeled data to train predictive models',
  'relevance_score': 0.4839949309825897,
  'index': 6},
 {'rank': 3,
  'document': 'Deep learning neural networks can process complex patterns in data',
  'relevance_score': 0.41273051500320435,
  'index': 1}]

🔍 **MMR Search** — Query: *"machine learning"* (λ = 0.7)

1. 🎯 **Rank 1** (Most Relevant):  
   *Machine learning algorithms are powerful tools for data analysis and prediction*  
   **Relevance Score:** 0.652

2. 🔄 **Rank 2** (MMR Selected for Diversity):  
   *Supervised learning uses labeled data to train predictive models*  
   **Relevance Score:** 0.484

3. 🔄 **Rank 3** (MMR Selected for Diversity):  
   *Deep learning neural networks can process complex patterns in data*  
   **Relevance Score:** 0.413



#### **Lambda = 0.3** (30% relevance, 70% diversity):

In [26]:
mmr_search("machine learning", lambda_param=0.3, top_k=3)

[{'rank': 1,
  'document': 'Machine learning algorithms are powerful tools for data analysis and prediction',
  'relevance_score': 0.6522308588027954,
  'index': np.int64(0)},
 {'rank': 2,
  'document': 'Reinforcement learning trains agents through reward and punishment',
  'relevance_score': 0.278318852186203,
  'index': 8},
 {'rank': 3,
  'document': 'Unsupervised learning finds hidden patterns in unlabeled data',
  'relevance_score': 0.3412044644355774,
  'index': 7}]

🔍 **MMR Search** — Query: *"machine learning"* (λ = 0.3)

1. 🎯 **Rank 1** (Most Relevant):  
   *Machine learning algorithms are powerful tools for data analysis and prediction*  
   **Relevance Score:** 0.652

2. 🌟 **Rank 2** (MMR Selected for Diversity):  
   *Reinforcement learning trains agents through reward and punishment*  
   **Relevance Score:** 0.278

3. 🌟 **Rank 3** (MMR Selected for Diversity):  
   *Unsupervised learning finds hidden patterns in unlabeled data*  
   **Relevance Score:** 0.341


> **💡 Why this works:** MMR prevents returning multiple very similar documents. With higher lambda, you get more relevant but potentially similar results. With lower lambda, you get more diverse results that cover different aspects of the topic.

## 5️⃣ Method Comparison
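### 🔍 **Expected Output**

The dense, BM25, and hybrid rows repeat the results for the query *"machine learning algorithms"* (hybrid with α = 0.7). The MMR row uses the query *"machine learning"* with λ = 0.7.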

<table>
<tr><th>Method</th><th>Results</th></tr>

<tr>
<td><strong>🧠 DENSE SIMILARITY</strong></td>
<td>
<pre>
1. Score: 0.734 - Machine learning algorithms are powerful tools for data analysis and prediction  
2. Score: 0.451 - Supervised learning uses labeled data to train predictive models  
3. Score: 0.436 - Deep learning neural networks can process complex patterns in data  
</pre>
</td>
</tr>

<tr>
<td><strong>📝 SPARSE (BM25)</strong></td>
<td>
<pre>
1. Score: 3.337 - Machine learning algorithms are powerful tools for data analysis and prediction  
2. Score: 0.000 - Big data analytics requires specialized tools and techniques  
3. Score: 0.000 - Reinforcement learning trains agents through reward and punishment  
</pre>
</td>
</tr>

<tr>
<td><strong>🔄 HYBRID</strong></td>
<td>
<pre>
1. Hybrid: 1.000 (Dense: 1.000, Sparse: 1.000) - Machine learning algorithms are powerful tools for data analysis and prediction  
2. Hybrid: 0.279 (Dense: 0.399, Sparse: 0.000) - Supervised learning uses labeled data to train predictive models  
3. Hybrid: 0.256 (Dense: 0.365, Sparse: 0.000) - Deep learning neural networks can process complex patterns in data  
</pre>
</td>
</tr>

<tr>
<td><strong>🎯 MMR</strong></td>
<td>
<pre>
1. Selected (relevance: 0.652): Machine learning algorithms are powerful tools for data analysis and prediction  
2. Selected (relevance: 0.484): Supervised learning uses labeled data to train predictive models  
3. Selected (relevance: 0.413): Deep learning neural networks can process complex patterns in data  
</pre>
</td>
</tr>
</table>


## 📊 Key Insights and When to Use Each Method

### 🚀 Performance Characteristics

<table>
<tr><th>Method</th><th>Strengths</th><th>Weaknesses</th><th>Best Use Cases</th></tr>
<tr>
<td><strong>🧠 Basic Similarity</strong></td>
<td>✅ Fast and simple<br>✅ Good semantic understanding</td>
<td>❌ May miss exact keyword matches</td>
<td>Quick prototyping, semantics-first search</td>
</tr>
<tr>
<td><strong>📝 Sparse (BM25)</strong></td>
<td>✅ Excellent keyword matching<br>✅ Works well with technical terms</td>
<td>❌ Poor semantic understanding</td>
<td>Legal documents, technical specs, exact terms</td>
</tr>
<tr>
<td><strong>🔄 Hybrid Search</strong></td>
<td>✅ Often the strongest overall relevance<br>✅ Semantic + keyword matching</td>
<td>❌ More complex to tune</td>
<td>Production systems, balanced requirements</td>
</tr>
<tr>
<td><strong>🎯 MMR</strong></td>
<td>✅ Prevents redundant results<br>✅ Good for exploration</td>
<td>❌ May sacrifice relevance for diversity</td>
<td>Research, content discovery, avoiding redundancy</td>
</tr>
</table>


### ⚙️ Parameter Tuning Guidelines

#### 🔄 **Hybrid Search Alpha Values:**
| Alpha Range | Focus | Use Case |
|-------------|-------|----------|
| `α = 0.8-0.9` | 🧠 Semantic-heavy | Research papers, technical documentation |
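| `α = 0.5-0.6` | 📝 More keyword weight | Legal, compliance, exact specifications |
| `α = 0.7` | ⚖️ Balanced | General-purpose applications |

Because α is the weight on the dense score, BM25 only becomes the dominant signal below α = 0.5, as in the α = 0.3 example earlier.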

#### 🎯 **MMR Lambda Values:**
| Lambda Range | Focus | Use Case |
|-------------|-------|----------|
| `λ = 0.8-0.9` | 🎯 Relevance-focused | Specific information needs |
| `λ = 0.5-0.6` | 🌟 More diversity | Brainstorming, exploration |
| `λ = 0.7` | ⚖️ Balanced | Most common use case |

### 🌍 Real-World Applications

| Application | Recommended Method | Reasoning |
|-------------|-------------------|-----------|
| 🔍 **Search Engines** | Hybrid + MMR | Diverse, relevant results |
| 📋 **Document Q&A** | Basic Similarity | Semantic matching priority |
| ⚖️ **Legal Research** | BM25 | Exact term matching critical |
| 📺 **Content Recommendation** | MMR | Diverse suggestions needed |
| 📖 **Technical Documentation** | Hybrid (high α) | Semantic understanding important |

## 🏁 Conclusion

This guide covers the building blocks for retrieval systems that go beyond simple similarity search. Each method has its own strengths and best-fit use cases:

- **Start with Basic Similarity** for quick prototypes
- **Use BM25** when exact keywords matter most
- **Implement Hybrid Search** for production systems
- **Add MMR** when diversity is important

> **🚀 Pro Tip:** Most production systems benefit from a hybrid approach with MMR post-processing to balance relevance, precision, and diversity.
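
One way to read "hybrid with MMR post-processing" is as a two-stage pipeline: shortlist candidates by hybrid score, then pick the final results from that shortlist with MMR, using the hybrid score as the relevance term. The sketch below illustrates that idea on made-up numbers. `rerank_with_mmr`, `shortlist_size`, and the toy scores are assumptions for illustration, not part of the notebook's functions.

```python
def rerank_with_mmr(hybrid_scores, doc_sim, shortlist_size=3, lambda_param=0.5, top_k=2):
    """Shortlist by hybrid score, then choose top_k from the shortlist with greedy MMR."""
    ranked = sorted(range(len(hybrid_scores)), key=lambda i: hybrid_scores[i], reverse=True)
    candidates = ranked[:shortlist_size]
    selected = [candidates.pop(0)]  # the best hybrid hit always comes first

    while candidates and len(selected) < top_k:
        best = max(
            candidates,
            key=lambda i: lambda_param * hybrid_scores[i]
            - (1 - lambda_param) * max(doc_sim[i][j] for j in selected),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical data: documents 0 and 1 are near-duplicates, document 3 is irrelevant
hybrid_scores = [1.0, 0.9, 0.6, 0.1]
doc_sim = [
    [1.00, 0.95, 0.20, 0.00],
    [0.95, 1.00, 0.20, 0.00],
    [0.20, 0.20, 1.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
]

print(rerank_with_mmr(hybrid_scores, doc_sim))  # [0, 2]
```

Restricting MMR to a shortlist keeps low-scoring documents such as document 3 out of the results, no matter how different they are from the ones already selected.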