# 04 – Advanced Retrieval: Hybrid Search & Reranking

**Learning Goals:**
- Understand dense vs sparse retrieval trade-offs
- Implement BM25 keyword-based search
- Combine retrieval methods using Reciprocal Rank Fusion (RRF)
- Apply cross-encoder reranking for improved precision

**What we'll cover:**
1. **Step 1: Data Loading** - Load dermatology corpus from text files
2. **Step 2: Dense Retrieval** - Vector similarity search with ChromaDB
3. **Step 3: Sparse Retrieval** - BM25 keyword matching
4. **Step 4: Hybrid Fusion** - Combine dense + sparse with RRF
5. **Step 5: Reranking** - Cross-encoder for refined relevance

**Prerequisites:** Notebooks 01, 02, 03 completed

**Key Insight:** No single retrieval method is best for all queries. Hybrid approaches combine the semantic understanding of dense retrieval with the keyword precision of sparse retrieval.


In [1]:
# ⚙️ Global Config & Services (using centralized modules)

import json
import sys
from pathlib import Path
from datetime import datetime
from dotenv import load_dotenv

# Add parent directory to path and change to project root
import os

# Get the notebook's current directory and find project root
notebook_dir = Path.cwd()
if notebook_dir.name == "notebooks":
    project_root = notebook_dir.parent
else:
    project_root = notebook_dir

# Change to project root and add to path
os.chdir(project_root)
sys.path.insert(0, str(project_root))

print(f"📂 Working directory: {os.getcwd()}")

from src.services.llm_services import (
    load_config,
    get_llm,
    get_text_embeddings,
    validate_api_keys,
    print_config_summary
)

# Load environment variables
load_dotenv()

# Load configuration from config.yaml (now we're in project root)
config = load_config("src/config/config.yaml")

# Validate API keys
validate_api_keys(config, verbose=True)

# Print summary
print_config_summary(config)


📂 Working directory: /Users/machinelearningzuu/Dropbox/Zuu Crew/Courses/🚧 AI Engineer Essentials/Live Classes/Week 03
✅ Config loaded:
  LLM: openrouter (openai/gpt-4o-mini)
  Embeddings: sbert / sentence-transformers/all-MiniLM-L6-v2
  Temperature: 0.2
  Artifacts: ./artifacts




In [2]:
# Initialize LLM, Embeddings, and Reranker
from sentence_transformers import CrossEncoder

llm = get_llm(config)
embeddings = get_text_embeddings(config)

# CrossEncoder: A reranker model that scores query-document pairs
# Unlike bi-encoders (embeddings), cross-encoders see query AND document together
# This gives higher accuracy but is slower (can't pre-compute embeddings)
reranker = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2"  # Model trained on MS MARCO dataset
                                             # Other options: "cross-encoder/ms-marco-TinyBERT-L-2-v2" (faster)
                                             #                "cross-encoder/ms-marco-MiniLM-L-12-v2" (more accurate)
)

print(f"✅ LLM: {config['llm_provider']} / {config.get('openrouter_model', config.get('llm_model'))}")
print(f"✅ Embeddings: {config['text_emb_model']}")
print(f"✅ Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2")

# Verify API key with test completion
print("\n🔍 Testing LLM API connection...")
try:
    test_response = llm.invoke("Say 'API working!' if you can read this.")
    test_msg = test_response.content if hasattr(test_response, 'content') else str(test_response)
    print(f"✅ LLM API verified: {test_msg[:50]}")
except Exception as e:
    print(f"❌ LLM API test failed: {e}")
    print("⚠️  Please check your .env file and API key configuration.")


✅ LLM: openrouter / gpt-4o-mini
✅ Embeddings: sentence-transformers/all-MiniLM-L6-v2
✅ Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2

🔍 Testing LLM API connection...
✅ LLM API verified: API working!


---

## Step 1: Load or Create Data


In [3]:
from langchain.schema import Document
import re

# Load corpus dynamically from raw_text files
text_dir = Path(config["data_root"]) / "raw_text"

def load_and_chunk_text_files(directory: Path, chunk_size: int = 500):
    """Load text files and split them into cleaned paragraphs.

    Note: chunk_size is accepted but not used yet; paragraphs are kept whole.
    """
    corpus = []
    
    for txt_file in directory.glob("*.txt"):
        print(f"  Loading: {txt_file.name}")
        content = txt_file.read_text(encoding='utf-8')
        
        # Split by double newlines (paragraphs) or section markers
        paragraphs = re.split(r'\n\n+|⸻', content)
        
        for para in paragraphs:
            # Clean and normalize
            para = para.strip()
            
            # Skip very short paragraphs, headers, or empty lines
            if len(para) < 50 or para.startswith('•') or para.startswith('#'):
                continue
            
            # Remove excessive whitespace and bullet points
            para = re.sub(r'\s+', ' ', para)
            para = re.sub(r'^\s*[•\-]\s*', '', para)
            
            # Skip if still too short after cleaning
            if len(para) < 100:
                continue
                
            corpus.append(para)
    
    return corpus

print("📚 Loading dermatology corpus from raw_text files...")
corpus = load_and_chunk_text_files(text_dir)

# Create documents with metadata
documents = [
    Document(
        page_content=text, 
        metadata={
            'doc_id': i, 
            'source': 'dermatology_corpus',
            'length': len(text)
        }
    ) 
    for i, text in enumerate(corpus)
]

print(f"✅ Loaded {len(documents)} dermatology documents from text files")

# Check if loaded successfully - add fallback if empty
if len(documents) == 0:
    print("⚠️  No documents from raw_text. Creating sample corpus...")
    sample_texts = [
        "Eczema (atopic dermatitis) is a chronic inflammatory skin condition. Treatment includes daily moisturizing, topical corticosteroids during flare-ups, and avoiding triggers.",
        "Psoriasis is an autoimmune condition causing rapid skin cell turnover, resulting in thick, silvery scales. Common treatments include topical corticosteroids, phototherapy, and systemic medications.",
        "Fungal infections (tinea) such as ringworm are caused by dermatophytes. Treatment involves topical antifungal creams like terbinafine applied for 2-4 weeks.",
        "Acne vulgaris occurs when hair follicles become clogged. Treatment options include topical retinoids, benzoyl peroxide, and oral antibiotics for severe cases.",
        "Contact dermatitis results from skin exposure to irritants or allergens. Management involves identifying and avoiding triggers.",
        "Rosacea causes facial redness and visible blood vessels. Treatment includes avoiding triggers and topical medications like metronidazole.",
        "Seborrheic dermatitis causes scaly patches on the scalp. Treatment includes medicated shampoos containing ketoconazole.",
        "Vitiligo causes loss of skin pigmentation. Management includes sun protection, topical corticosteroids, and phototherapy.",
    ]
    documents = [
        Document(page_content=text, metadata={"doc_id": i, "source": "sample_corpus", "length": len(text)})
        for i, text in enumerate(sample_texts)
    ]
    print(f"✅ Created {len(documents)} sample documents")

if len(documents) > 0:
    avg_len = sum(len(d.page_content) for d in documents) // len(documents)
    print(f"  Average length: {avg_len} chars")
    print(f"  Topics: eczema, psoriasis, fungal infections, treatments")
    print(f"\nSample: {documents[0].page_content[:120]}...")


📚 Loading dermatology corpus from raw_text files...
  Loading: Understanding Skin Diseases.txt
  Loading: skin-care habits.txt
✅ Loaded 34 dermatology documents from text files
  Average length: 348 chars
  Topics: eczema, psoriasis, fungal infections, treatments

Sample: Sure — here’s a detailed and comprehensive overview of skin diseases, written in an informative, medically accurate styl...
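
The `chunk_size` argument of `load_and_chunk_text_files` is accepted but never used, so long paragraphs pass through unsplit. One way to honor it, sketched here under the assumption that splitting on sentence boundaries is acceptable for this corpus, is to pack sentences greedily up to a character budget. `pack_sentences` is a hypothetical helper, not part of the notebook's modules.

```python
import re
from typing import List


def pack_sentences(paragraph: str, chunk_size: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size chars.

    A single sentence longer than chunk_size is kept whole rather than cut
    mid-word, so a chunk can exceed the budget in that one case.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample_text = "Eczema is chronic. It causes itching. Moisturize daily. Avoid known triggers."
for chunk in pack_sentences(sample_text, chunk_size=40):
    print(len(chunk), chunk)
```

With a 40-character budget the four sentences end up in two chunks of two sentences each. The sentence splitter is deliberately simple and will mis-split abbreviations such as "e.g.".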


---

## Step 2: Dense Retrieval (ChromaDB)

Build a vector store using dense embeddings.
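
Dense retrieval ranks documents by how close their embedding vectors are to the query's vector. The toy sketch below uses hand-written 3-dimensional vectors and cosine similarity to show the ranking step only. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and the distance metric Chroma applies depends on how the collection is configured, so treat the numbers as illustrative. All names and vectors here are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Invented vectors: each axis loosely stands for one topic.
toy_doc_vectors = {
    "eczema_care": [0.9, 0.1, 0.0],
    "acne_care": [0.1, 0.9, 0.0],
    "sun_protection": [0.0, 0.2, 0.9],
}
toy_query_vector = [0.8, 0.2, 0.1]

for name, score in top_k_by_cosine(toy_query_vector, toy_doc_vectors):
    print(f"{name}: {score:.3f}")
```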


In [4]:
from langchain_chroma import Chroma

# Setup persistence directory for ChromaDB
chroma_root = Path(config["artifacts_root"]) / "chroma"
chroma_root.mkdir(parents=True, exist_ok=True)

print("🔵 Building dense vector store...")

# Chroma.from_documents: Creates a vector store from LangChain Documents
# Note: with a persist_directory, re-running this cell may add the same documents
# to the existing collection again, which would show up as repeated search results.
dense_vectorstore = Chroma.from_documents(
    documents=documents,        # documents: List of Document objects to index
    embedding=embeddings,       # embedding: Embedding model to convert text → vectors
    collection_name="advanced_dense",  # collection_name: Name of the collection in ChromaDB
    persist_directory=str(chroma_root / "advanced_dense"),  # persist_directory: Where to save the index
)

print(f"✅ Dense index built: {len(documents)} docs")

# Test dense retrieval
query = "What are treatments for eczema?"

# similarity_search: Find documents with vectors closest to query vector
dense_results = dense_vectorstore.similarity_search(
    query,  # query: Search query (will be embedded automatically)
    k=8     # k: Number of top results to return
)

print(f"\n🔍 Dense search: '{query}'")
for i, doc in enumerate(dense_results, 1):
    print(f"  [{i}] {doc.page_content[:100]}...")

🔵 Building dense vector store...
✅ Dense index built: 34 docs

🔍 Dense search: 'What are treatments for eczema?'
  [1] Eczema (atopic dermatitis) • Core remedies: daily emollients, short lukewarm baths/showers, fragranc...
  [2] Eczema (atopic dermatitis) • Core remedies: daily emollients, short lukewarm baths/showers, fragranc...
  [3] Eczema (atopic dermatitis) • Core remedies: daily emollients, short lukewarm baths/showers, fragranc...
  [4] Eczema (atopic dermatitis) • Core remedies: daily emollients, short lukewarm baths/showers, fragranc...
  [5] Treatment depends on the underlying cause and may include: • Topical medications: Corticosteroids, a...
  [6] Treatment depends on the underlying cause and may include: • Topical medications: Corticosteroids, a...
  [7] Treatment depends on the underlying cause and may include: • Topical medications: Corticosteroids, a...
  [8] Treatment depends on the underlying cause and may include: • Topical medications: Corti
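
The dense results above return the same two passages four times each, which is what one would expect if the persisted `advanced_dense` collection had been populated by several runs of the cell. That cause is an assumption; the notebook does not confirm it. Whatever the cause, duplicates can be collapsed before fusion by keeping only the first hit per `doc_id`. `dedupe_by_doc_id` and the `Hit` stand-in class are hypothetical helpers for illustration, and the IDs in the example are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hit:
    """Minimal stand-in for a LangChain Document (assumption: only these fields matter)."""
    page_content: str
    metadata: Dict = field(default_factory=dict)


def dedupe_by_doc_id(hits: List[Hit]) -> List[Hit]:
    """Keep the first (best-ranked) hit for each doc_id, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        doc_id = hit.metadata["doc_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(hit)
    return unique


# Invented doc_ids; the real ones are not shown in the output above.
raw_hits = [
    Hit("Eczema (atopic dermatitis) ...", {"doc_id": 12}),
    Hit("Eczema (atopic dermatitis) ...", {"doc_id": 12}),
    Hit("Treatment depends on the underlying cause ...", {"doc_id": 5}),
    Hit("Treatment depends on the underlying cause ...", {"doc_id": 5}),
]
print([h.metadata["doc_id"] for h in dedupe_by_doc_id(raw_hits)])
```

Deduplicating after retrieval still spends the `k` budget on copies, so clearing or recreating the persisted collection before re-indexing is likely the cleaner fix.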

---

## Step 3: Sparse Retrieval (BM25)

Use BM25 for keyword-based retrieval.
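
For intuition about what a BM25 index computes, here is a from-scratch sketch of one common textbook form of the scoring function, with `k1=1.5` and `b=0.75` chosen as defaults for this example. It should match `rank_bm25` only in spirit: `BM25Okapi` has its own IDF handling, so absolute scores will differ.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    corpus_tokens: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized in proportion to b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "eczema treatment includes daily moisturizing".split(),
    "psoriasis treatment includes phototherapy".split(),
    "sun protection prevents skin damage".split(),
]
print(bm25_scores("eczema treatment".split(), toy_corpus))
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third shares no terms and scores zero.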


In [5]:
from rank_bm25 import BM25Okapi
import numpy as np

print("🟠 Building BM25 index...")

# BM25 (Best Match 25): Classic sparse retrieval algorithm
# Unlike dense retrieval, BM25 uses exact keyword matching with TF-IDF-like scoring

# Step 1: Tokenize corpus (lowercase + split by whitespace)
# Note: punctuation stays attached to tokens, so "eczema?" and "eczema" are different terms
tokenized_corpus = [doc.page_content.lower().split() for doc in documents]

# BM25Okapi: BM25 variant with Okapi weighting
# Other variants: BM25L, BM25Plus (handle long documents better)
bm25 = BM25Okapi(tokenized_corpus)

print(f"✅ BM25 index built")


def bm25_search(query: str, top_k: int = 3):
    """
    Search using BM25 (sparse retrieval algorithm).
    
    Args:
        query: Search query string
        top_k: Number of top results to return
        
    Returns:
        List of dictionaries with doc, score, and doc_id
    """
    # Tokenize query the same way as corpus
    tokenized_query = query.lower().split()
    
    # Get BM25 scores for all documents
    scores = bm25.get_scores(tokenized_query)
    
    # Find top-k indices (argsort ascending, then reverse for descending)
    top_indices = np.argsort(scores)[::-1][:top_k]
    
    # Build results list
    results = []
    for idx in top_indices:
        results.append({
            "doc": documents[idx],
            "score": float(scores[idx]),
            "doc_id": int(idx)  # plain int rather than numpy.int64
        })
    
    return results

# Test BM25
bm25_results = bm25_search(query, top_k=8)

print(f"\n🔍 BM25 search: '{query}'")
for i, res in enumerate(bm25_results, 1):
    print(f"  [{i}] (score: {res['score']:.2f}) {res['doc'].page_content[:100]}...")


🟠 Building BM25 index...
✅ BM25 index built

🔍 BM25 search: 'What are treatments for eczema?'
  [1] (score: 3.76) Actinic keratoses, BCC, SCC, melanoma • What helps: prevention & early detection. Follow the ABCDE s...
  [2] (score: 2.61) Urticaria (hives) • What helps: for most, second-generation oral antihistamines (non-sedating) are f...
  [3] (score: 2.21) 2) Fungal (tinea/ringworm, athlete’s foot, jock itch) • What helps at home: OTC antifungals (creams,...
  [4] (score: 2.20) Skin diseases are among the most prevalent health problems worldwide. According to WHO and the Globa...
  [5] (score: 1.60) These are caused by microorganisms such as bacteria, viruses, fungi, or parasites. • Bacterial infec...
  [6] (score: 1.33) Acne (common, though not in your earlier list) • At home: gentle cleanse; benzoyl peroxide, adapalen...
  [7] (score: 0.91) Vitiligo • What helps: strict sun protection to minimize contrast; dermatologist-directed topical co...
  [8] (score: 0.
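
Notice that no eczema passage appears in the BM25 results, even though the dense search found one. A plausible reason, offered as a hypothesis rather than a verified diagnosis, is the whitespace tokenizer: `"eczema?"` keeps its question mark, so the query token never matches the corpus token `eczema`, and the ranking is driven by filler words such as "what". The sketch below contrasts the two approaches. `simple_tokenize` is a hypothetical helper, and it would need to be applied to both the corpus and the query.

```python
import re
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """The tokenization used in the BM25 cell: lowercase, split on whitespace."""
    return text.lower().split()


def simple_tokenize(text: str) -> List[str]:
    """Lowercase and keep only runs of letters, digits, or underscores."""
    return re.findall(r"\w+", text.lower())


sample_query = "What are treatments for eczema?"
sample_passage = "Eczema (atopic dermatitis): daily emollients."

for tokenize in (whitespace_tokenize, simple_tokenize):
    overlap = set(tokenize(sample_query)) & set(tokenize(sample_passage))
    print(f"{tokenize.__name__}: {sorted(overlap)}")
```

The whitespace tokenizer finds no shared terms, while the regex tokenizer recovers `eczema`. Dropping stopwords would be a natural next step, though it is not shown here.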

---

## Step 4: Hybrid Fusion (Dense + BM25)

Combine dense and sparse retrieval using Reciprocal Rank Fusion (RRF).
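
Each ranking contributes 1 / (k + rank) to a document's fused score, so `k` controls how much a top rank outweighs a lower one. A quick calculation using only that formula shows the effect; `rrf_contribution` is a helper named for this example.

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Score that a single ranking contributes for a document at a 1-based rank."""
    return 1.0 / (k + rank)


for k in (1, 60):
    ratio = rrf_contribution(1, k) / rrf_contribution(10, k)
    print(f"k={k}: rank 1 is worth {ratio:.2f}x rank 10")
```

With `k=1`, first place is worth 5.5 times tenth place. With `k=60`, the ratio drops to about 1.15, so agreement between the two rankings matters more than the exact position within either one.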


In [6]:
from typing import List

def rrf_fusion(dense_docs: List, bm25_results: List, k: int = 60) -> List:
    """
    Reciprocal Rank Fusion (RRF) - combines dense and sparse retrieval.
    
    RRF Formula: score(d) = Σ 1/(k + rank(d)) for each ranking
    
    - k=60 is the default constant (from original paper)
    - Higher k → flatter scores, so lower-ranked documents count relatively more
    - Lower k → steeper scores, so top-ranked documents dominate
    
    Args:
        dense_docs: Results from dense (vector) retrieval
        bm25_results: Results from BM25 (sparse) retrieval
        k: RRF constant (default 60, typical range 1-100)
        
    Returns:
        Fused results sorted by RRF score (higher = more relevant)
    """
    rrf_scores = {}
    
    # Add scores from dense retrieval
    for rank, doc in enumerate(dense_docs, 1):  # rank starts at 1
        doc_id = doc.metadata["doc_id"]
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    
    # Add scores from BM25 retrieval
    for rank, doc in enumerate(bm25_results, 1):
        doc_id = doc["doc_id"]
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    
    # Sort by combined RRF score (descending)
    sorted_ids = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Build final results list
    fused_docs = []
    for doc_id, score in sorted_ids:
        fused_docs.append({
            "doc": documents[doc_id],
            "score": score,
            "doc_id": doc_id
        })
    
    return fused_docs


# Test hybrid fusion
fused_results = rrf_fusion(dense_results, bm25_results)

print(f"🔀 Hybrid (RRF) search: '{query}'")
for i, res in enumerate(fused_results, 1):
    print(f"  [{i}] (RRF: {res['score']:.3f}) {res['doc'].page_content[:100]}...")


🔀 Hybrid (RRF) search: 'What are treatments for eczema?'
  [1] (RRF: 0.064) Eczema (atopic dermatitis) • Core remedies: daily emollients, short lukewarm baths/showers, fragranc...
  [2] (RRF: 0.060) Treatment depends on the underlying cause and may include: • Topical medications: Corticosteroids, a...
  [3] (RRF: 0.016) Actinic keratoses, BCC, SCC, melanoma • What helps: prevention & early detection. Follow the ABCDE s...
  [4] (RRF: 0.016) Urticaria (hives) • What helps: for most, second-generation oral antihistamines (non-sedating) are f...
  [5] (RRF: 0.016) 2) Fungal (tinea/ringworm, athlete’s foot, jock itch) • What helps at home: OTC antifungals (creams,...
  [6] (RRF: 0.016) Skin diseases are among the most prevalent health problems worldwide. According to WHO and the Globa...
  [7] (RRF: 0.015) These are caused by microorganisms such as bacteria, viruses, fungi, or parasites. • Bacterial infec...
  [8] (RRF: 0.015) Acne (common, though not in your earlier list)
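
The fused scores above can be reproduced by hand. Assuming the four identical dense hits share one `doc_id` (they print the same text, but the IDs are not shown), their reciprocal ranks are summed into a single entry, which is why the top score is roughly four times that of a document seen once. `rrf_score` is a helper written for this check.

```python
from typing import Iterable


def rrf_score(ranks: Iterable[int], k: int = 60) -> float:
    """Sum of reciprocal-rank contributions for one document across its appearances."""
    return sum(1.0 / (k + rank) for rank in ranks)


# Hypothesis: the eczema passage occupies dense ranks 1-4, the generic
# treatment passage dense ranks 5-8, and each BM25 hit appears exactly once.
print(f"dense ranks 1-4:  {rrf_score([1, 2, 3, 4]):.3f}")
print(f"dense ranks 5-8:  {rrf_score([5, 6, 7, 8]):.3f}")
print(f"BM25 rank 1 only: {rrf_score([1]):.3f}")
```

The three values round to 0.064, 0.060, and 0.016, matching the first three rows of the output. Under this reading, the top two scores reflect repeated copies within the dense list rather than agreement between the dense and BM25 rankings.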

---

## Step 5: Reranking with Cross-Encoder

Refine results using a cross-encoder for more accurate relevance scoring.


### Why Rerank?

**The Problem:**
- Dense retrieval is fast but approximate
- Getting top-10 from 10,000 chunks may miss relevant docs

**The Solution: Two-Stage Retrieval**
```
Stage 1: Fast retrieval (dense + BM25) → 100 candidates
Stage 2: Rerank with cross-encoder → 10 final results
```

**Cross-encoders vs Bi-encoders:**
- **Bi-encoder** (embeddings): Encodes query and doc separately → fast but less accurate
- **Cross-encoder**: Encodes query+doc together → slow but more accurate
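
The two-stage pattern can be sketched without any models. Below, `cheap_score` (word overlap) stands in for the fast first stage and `expensive_score` (overlap weighted by how early the match occurs) stands in for the cross-encoder. Both are toy functions invented for this illustration, and the call counter only shows that the costly scorer runs on the shortlist rather than on the whole corpus.

```python
from typing import List

calls = {"expensive": 0}


def cheap_score(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def expensive_score(query: str, doc: str) -> float:
    """Toy second-stage scorer that counts its own invocations."""
    calls["expensive"] += 1
    query_words = set(query.lower().split())
    words = doc.lower().split()
    # Earlier matches count more: position i contributes 1 / (1 + i).
    return sum(1.0 / (1 + i) for i, word in enumerate(words) if word in query_words)


def two_stage_search(query: str, docs: List[str], shortlist: int, final_k: int) -> List[str]:
    """Shortlist with the cheap scorer, then reorder the shortlist with the costly one."""
    stage1 = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:shortlist]
    stage2 = sorted(stage1, key=lambda d: expensive_score(query, d), reverse=True)
    return stage2[:final_k]


toy_docs = [
    "eczema treatment uses emollients",
    "common treatment for eczema flares",
    "acne treatment uses retinoids",
    "sun protection basics",
    "hair care basics",
]
print(two_stage_search("eczema treatment", toy_docs, shortlist=3, final_k=1))
print("expensive calls:", calls["expensive"], "of", len(toy_docs), "docs")
```

Only three of the five documents reach the second scorer, which is the saving the two-stage design is after.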

In [7]:
def rerank(query: str, results: List, top_k: int = 3):
    """
    Rerank results using a cross-encoder for more accurate relevance scoring.
    
    Cross-encoder sees [query, document] together, enabling deeper understanding
    of relevance than separate embeddings.
    
    Args:
        query: Search query string
        results: List of initial results to rerank (from fusion)
        top_k: Number of top results to return after reranking
        
    Returns:
        Reranked results with rerank_score added (higher = more relevant)
    """
    # Create query-document pairs for cross-encoder
    # Format: [[query, doc1], [query, doc2], ...]
    pairs = [[query, res['doc'].page_content] for res in results]
    
    # Cross-encoder predicts relevance score for each pair
    # Returns array of scores (can be negative, higher = more relevant)
    scores = reranker.predict(pairs)
    
    # Add rerank scores to results
    for i, res in enumerate(results):
        res['rerank_score'] = float(scores[i])
    
    # Sort by rerank score (descending) and take top-k
    reranked = sorted(results, key=lambda x: x['rerank_score'], reverse=True)[:top_k]
    return reranked

# Test reranking
reranked_results = rerank(query, fused_results[:6], top_k=3)

print(f"🏆 Reranked results: '{query}'")
for i, res in enumerate(reranked_results, 1):
    print(f"  [{i}] (rerank: {res['rerank_score']:.3f}) {res['doc'].page_content[:100]}...")


🏆 Reranked results: 'What are treatments for eczema?'
  [1] (rerank: 4.700) Eczema (atopic dermatitis) • Core remedies: daily emollients, short lukewarm baths/showers, fragranc...
  [2] (rerank: -1.330) Treatment depends on the underlying cause and may include: • Topical medications: Corticosteroids, a...
  [3] (rerank: -1.724) Actinic keratoses, BCC, SCC, melanoma • What helps: prevention & early detection. Follow the ABCDE s...
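
The raw scores returned by `reranker.predict` are unbounded and can be negative, which makes them awkward to threshold. If they are treated as logits (an assumption about this particular model's output), a sigmoid maps them to the 0-1 range without changing their order. The helper names and the 0.5 cutoff below are hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def keep_confident(results: List[Dict], min_prob: float = 0.5) -> List[Dict]:
    """Attach a 0-1 relevance estimate and drop results below min_prob."""
    kept = []
    for res in results:
        prob = sigmoid(res["rerank_score"])
        if prob >= min_prob:
            # Copy the dict so the caller's results are not mutated.
            kept.append({**res, "rerank_prob": prob})
    return kept


# Rerank scores copied from the output above.
sample_results = [
    {"rerank_score": 4.700},
    {"rerank_score": -1.330},
    {"rerank_score": -1.724},
]
for res in keep_confident(sample_results):
    print(f"{res['rerank_score']:.3f} -> {res['rerank_prob']:.3f}")
```

Under that reading only the first passage clears the cutoff, which suggests the second and third results are weak matches that a stricter pipeline might leave out of the LLM context.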


---

## Bonus Exercise: Complete Hybrid RAG Pipeline

**Challenge:** Combine all techniques into a single end-to-end pipeline!

In [8]:
def hybrid_rag_pipeline(
    query: str, 
    dense_top_n: int = 10,   # dense_top_n: How many docs to retrieve with dense search
    bm25_top_n: int = 10,    # bm25_top_n: How many docs to retrieve with BM25
    rerank_top_n: int = 6,   # rerank_top_n: How many fused results to rerank
    final_top_k: int = 3     # final_top_k: Final number of docs for LLM context
):
    """
    Complete hybrid RAG pipeline combining all advanced techniques.
    
    Pipeline: Dense → BM25 → Fusion (RRF) → Rerank → LLM Generation
    
    Args:
        query: User question
        dense_top_n: Number of results from dense retrieval
        bm25_top_n: Number of results from BM25
        rerank_top_n: Number of fused results to rerank (reduces cross-encoder calls)
        final_top_k: Final number of chunks to use for generation
        
    Returns:
        Dictionary with query, answer, retrieved_docs, and pipeline stats
    """
    # Stage 1: Dense retrieval (semantic similarity)
    dense_results = dense_vectorstore.similarity_search(query, k=dense_top_n)
    
    # Stage 2: Sparse retrieval (keyword matching)
    bm25_results = bm25_search(query, top_k=bm25_top_n)
    
    # Stage 3: Fusion (combine rankings with RRF)
    fused_results = rrf_fusion(dense_results, bm25_results)[:rerank_top_n]
    
    # Stage 4: Reranking (refine with cross-encoder)
    reranked_results = rerank(query, fused_results, top_k=final_top_k)
    
    # Stage 5: Build context from top results
    context = "\n\n".join([res["doc"].page_content for res in reranked_results])
    
    # Stage 6: Generate answer with LLM
    prompt = f"""Use the following context to answer the question. Be concise and accurate.

Context:
{context}

Question: {query}

Answer:"""
    
    response = llm.invoke(prompt)
    answer = response.content if hasattr(response, 'content') else str(response)
    
    return {
        "query": query,
        "answer": answer,
        "retrieved_docs": reranked_results,
        "num_dense": len(dense_results),
        "num_bm25": len(bm25_results),
        "num_fused": len(fused_results),
        "num_final": len(reranked_results)
    }


# Test the complete pipeline
print("🚀 Testing Complete Hybrid RAG Pipeline\n")
test_query = "What are treatments for eczema?"
result = hybrid_rag_pipeline(test_query)

print(f"Query: {test_query}")
print(f"\nPipeline stats:")
print(f"  Dense retrieval: {result['num_dense']} docs")
print(f"  BM25 retrieval: {result['num_bm25']} docs")
print(f"  After fusion: {result['num_fused']} docs")
print(f"  After reranking: {result['num_final']} docs")
print(f"\nFinal Answer:\n{result['answer']}")

🚀 Testing Complete Hybrid RAG Pipeline

Query: What are treatments for eczema?

Pipeline stats:
  Dense retrieval: 10 docs
  BM25 retrieval: 10 docs
  After fusion: 6 docs
  After reranking: 3 docs

Final Answer:
Treatments for eczema include daily emollients, short lukewarm baths/showers, fragrance-free products, and trigger avoidance. For flares, clinicians may prescribe topical anti-inflammatories (steroids or non-steroids), wet-wraps, and selected phototherapy for moderate to severe cases. Additionally, dilute bleach baths may be recommended by a dermatologist to reduce Staph burden and itch.
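
The pipeline joins the retrieved passages with blank lines before prompting. A small variation, shown here as a sketch rather than a change to `hybrid_rag_pipeline`, numbers each passage so the model can be asked to cite them and returns early when nothing was retrieved. `build_prompt` and its wording are assumptions.

```python
from typing import List, Optional


def build_prompt(query: str, passages: List[str]) -> Optional[str]:
    """Return a prompt with numbered context passages, or None if there is no context."""
    if not passages:
        return None
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, 1))
    return (
        "Use the following context to answer the question. "
        "Cite passages by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


print(build_prompt("What are treatments for eczema?", ["Daily emollients.", "Avoid triggers."]))
```

Returning `None` for an empty context lets the caller decide whether to skip the LLM call or fall back to a "no relevant passages found" reply.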


---

## Summary

**What we learned:**

### Retrieval Methods
- ✅ **Dense Retrieval** - Vector similarity search with embeddings (ChromaDB)
- ✅ **Sparse Retrieval** - BM25 keyword matching (exact terms)
- ✅ **Hybrid Fusion** - Reciprocal Rank Fusion (RRF) combining both
- ✅ **Reranking** - Cross-encoder for refined relevance scoring

### Complete Pipeline
```
Query → Dense (top-N) + BM25 (top-N) → Fusion (RRF) → Rerank (cross-encoder) → Final top-k → LLM
```

### When to Use Each Method
| Method | Best For | Trade-off |
|--------|----------|-----------|
| **BM25** | Keyword/exact match queries | Fast, misses synonyms |
| **Dense** | Semantic/paraphrase queries | Good accuracy, slower |
| **Hybrid** | General queries | Best coverage, more computation |
| **Reranking** | High precision needed | Highest accuracy, slowest |

### Key Takeaways
- No single retrieval method works best for all queries
- Hybrid approaches combine strengths of different methods
- Two-stage retrieval (fast → accurate) balances speed and quality
- Cross-encoders are powerful but should be used on small candidate sets

**Artifacts:**
- `./artifacts/chroma/advanced_dense/`
- `./artifacts/manifests/advanced_retrieval.json`
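
The manifest listed above is not written by any cell shown in this notebook. A minimal way to produce one, assuming a flat JSON record is enough, is sketched below. Every field name is an illustrative guess rather than a schema used elsewhere in the course, and the example writes to a temporary directory instead of `./artifacts`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(manifest_dir: Path, num_documents: int, collection_name: str) -> Path:
    """Write a small JSON manifest describing the retrieval artifacts."""
    manifest_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = manifest_dir / "advanced_retrieval.json"
    manifest = {
        # Hypothetical fields chosen for this sketch.
        "created_at": datetime.now(timezone.utc).isoformat(),
        "num_documents": num_documents,
        "collection_name": collection_name,
        "reranker": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    }
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(Path(tmp) / "manifests", num_documents=34, collection_name="advanced_dense")
    print(json.loads(path.read_text(encoding="utf-8"))["num_documents"])
```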