# Week 1, Day 1: Hybrid Retrieval Implementation

**Sprint**: Week 1 (LLM & RAG Mastery)  
**Goal**: Validate and demonstrate the existing `HybridRetriever` on complex queries.

---

## 🎯 Problem Statement

We need to show that **Hybrid Retrieval** (BM25 + semantic) is more robust than either single mode, because each mode wins on a different kind of query:
1. **Concept queries**: "How does transformer attention work?" (dense wins)
2. **Specific entity queries**: "What is the error code in Protocol 3.2?" (sparse wins)

## 📦 Setup

In [None]:
import sys
import os

# Add project root to path to import src modules
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))
if project_root not in sys.path:
    sys.path.append(project_root)

from src.retrieval.retrieval import HybridRetriever, Document
import pandas as pd

## 📊 Data Loading

Create a small synthetic dataset with documents that favor each method: two that reward semantic matching and three that hinge on exact tokens.

In [None]:
documents = [
    # Semantic-heavy docs
    Document(id="1", content="Machine learning algorithms improve automatically through experience and by the use of data."),
    Document(id="2", content="Deep learning architectures such as deep neural networks have been applied to fields including computer vision."),
    
    # Keyword-heavy docs
    Document(id="3", content="Error Code 505: HTTP Version Not Supported response status code."),
    Document(id="4", content="The specifics of Protocol 3.2 require a 256-bit encryption key."),
    Document(id="5", content="Configuration parameter 'max_retries' should be set to 5 for production environments.")
]

# Initialize Retriever
retriever = HybridRetriever(
    alpha=0.5,           # Equal weight to start
    fusion="rrf",        # Reciprocal Rank Fusion
    dense_model="all-MiniLM-L6-v2"
)

print("Indexing documents...")
retriever.index(documents)
print("Done.")
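
The `alpha` argument above is described only as an equal weight. The source does not show how `HybridRetriever` uses it, so the block below is a minimal sketch of one common interpretation: a linear blend of min-max normalized dense and sparse scores. The function names `min_max` and `weighted_fusion` and all score values are hypothetical and are not taken from `src.retrieval`.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: treat every document as equally strong
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """Blend normalized scores: alpha weights dense, (1 - alpha) weights sparse.

    This is an assumed meaning of alpha, not the confirmed behavior of HybridRetriever.
    """
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    doc_ids = set(dense_n) | set(sparse_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in doc_ids
    }
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical raw scores on different scales (cosine similarity vs BM25)
dense_scores = {"1": 0.8, "2": 0.6, "3": 0.2}
sparse_scores = {"3": 12.0, "4": 3.0}
fused = weighted_fusion(dense_scores, sparse_scores, alpha=0.5)
```

Normalization is what makes the blend meaningful here, since raw BM25 scores are unbounded while cosine similarities are not. Rank-based fusion such as RRF avoids this step altogether.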

## 🔧 Experiment: Semantic vs Keyword Queries

In [None]:
queries = [
    "How do neural networks learn?",        # Semantic
    "Error Code 505 status",                # Exact Keyword
    "max_retries configuration value",      # Code/Technical
    "computer vision deep learning"         # Mixed
]

results_data = []

for q in queries:
    # 1. Hybrid
    hybrid_res = retriever.retrieve(q, top_k=1)
    
    # 2. Dense only
    dense_res = retriever.dense_retriever.retrieve(q, top_k=1)
    
    # 3. Sparse only
    sparse_res = retriever.sparse_retriever.retrieve(q, top_k=1)
    
    results_data.append({
        "Query": q,
        "Hybrid Top 1": hybrid_res[0].document.content if hybrid_res else "None",
        "Dense Top 1": dense_res[0].document.content if dense_res else "None",
        "Sparse Top 1": sparse_res[0].document.content if sparse_res else "None"
    })

df = pd.DataFrame(results_data)
df

## 📈 Analysis

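Observe how:
1. `Dense` performs well on the first query, where the match depends mostly on meaning rather than exact wording.
2. `Sparse` (BM25) ensures the exact error code is found in the second query, where `Dense` might drift toward documents that are only topically similar.
3. `Hybrid` should return the relevant document in both cases, since a document ranked highly by either retriever tends to stay near the top after fusion.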
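
Reading the table by eye works for four queries, but the comparison can also be made numeric. The sketch below computes a top-1 hit rate from document ids. The relevance labels are my own reading of the synthetic dataset, and the `example_top1` values are placeholders for illustration, not output from this notebook.

```python
def top1_hit_rate(top1_ids, relevant_ids):
    """Fraction of queries whose top-1 document id is in that query's relevant set."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = sum(
        1 for query, relevant in relevant_ids.items()
        if top1_ids.get(query) in relevant
    )
    return hits / len(relevant_ids)


# Assumed relevance labels for the four queries (document ids from the dataset)
relevant_ids = {
    "How do neural networks learn?": {"1", "2"},
    "Error Code 505 status": {"3"},
    "max_retries configuration value": {"5"},
    "computer vision deep learning": {"2"},
}

# Placeholder top-1 ids for one retriever; replace with real results
example_top1 = {
    "How do neural networks learn?": "2",
    "Error Code 505 status": "3",
    "max_retries configuration value": "4",
    "computer vision deep learning": "2",
}

rate = top1_hit_rate(example_top1, relevant_ids)
```

To use it with the experiment loop, collect `res[0].document.id` for each mode instead of the content string (this assumes the result objects expose the `Document` shown earlier), then call `top1_hit_rate` once per mode.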

## 🎤 Interview Connection

**Q: When would you choose Hybrid over just Dense retrieval?**

**A:** Dense retrieval struggles with exact matches (IDs, error codes, specific numbers) and out-of-domain vocabulary. Hybrid keeps the precision of keyword search while adding the semantic understanding of embeddings. RRF (Reciprocal Rank Fusion) combines the two result lists by rank position rather than raw score, so BM25 scores and embedding similarities, which live on different scales, do not need to be normalized or carefully weighted against each other.
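
To make the RRF point concrete, here is a minimal sketch of the standard formula, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The function name, the example rankings, and the choice of `k=60` (a commonly used default) are assumptions for illustration. This is not the implementation inside `HybridRetriever`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; earlier positions contribute more."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from each retriever (best match first)
dense_ranking = ["2", "1", "3"]
sparse_ranking = ["3", "2", "5"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Document "2" comes out on top because it sits near the top of both lists, even though it is first in only one of them. Only positions enter the formula, so the raw scores from either retriever never need to be compared.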