# Hybrid Search: BM25 + Dense Retrieval

## Welcome!
So far, we've used only semantic search (embeddings). But sometimes keywords matter!

## The Problem with Semantic-Only Search

```
Query: "What is GPT-4?"

Semantic search might return:
- "Large language models have revolutionized AI..." (semantically similar but no GPT-4 mention!)

Keyword search would find:
- "GPT-4 is OpenAI's latest model..." (exact match!)
```

## The Solution: Combine Both!

**Hybrid Search** = Keyword Search (BM25) + Semantic Search (Dense)

- **BM25**: Finds exact keyword matches ("GPT-4", "LoRA", specific terms)
- **Dense**: Finds semantically similar content (meaning, context)
- **Hybrid**: Gets the best of both worlds!

## What You'll Learn
1. How BM25 (keyword search) works
2. How to combine BM25 with semantic search
3. Reciprocal Rank Fusion (RRF) for combining results
4. When to use hybrid search

## Step 1: Environment Setup

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()  # This should work since .env is in the same folder
print("OPENAI_API_KEY loaded:", "OPENAI_API_KEY" in os.environ)


OPENAI_API_KEY loaded: True


In [2]:
# Install required packages (run once)
# !pip install rank_bm25

## Step 2: Load and Prepare Documents

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF
pdf_path = "llm_fundamentals.pdf"  # look in the notebook's folder first
if not os.path.exists(pdf_path):
    pdf_path = "../RAG/llm_fundamentals.pdf"

loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(documents)

# Extract just the text for our examples
chunk_texts = [chunk.page_content for chunk in chunks]

print(f"Loaded {len(chunks)} chunks")
print(f"\nSample chunk:")
print(chunk_texts[0][:200] + "...")


Loaded 37 chunks

Sample chunk:
@genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ....................................................................


---
## Part 1: Understanding BM25 (Keyword Search)

**BM25** (Best Matching 25) is a classic keyword search algorithm.

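**How it works (simplified):**
1. Looks for exact token matches between the query and each document
2. Gives higher scores to:
   - Rare words ("LoRA" is more important than "the")
   - Repeated matches (more occurrences = higher score, with diminishing returns)
3. Normalizes by document length, so long documents don't win just by containing more words

**Think of it like:** a classic search engine - pure keyword matching, with no understanding of meaning.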
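
To make the three ingredients above concrete, here is a minimal from-scratch sketch of BM25 scoring on a toy corpus. It is an illustration, not the `rank_bm25` implementation: the function name `bm25_scores`, the toy sentences, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and the IDF uses the common `log(1 + ...)` variant, which may differ from what `BM25Okapi` computes internally.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (illustrative BM25 sketch)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: how many documents contain each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # only exact token matches contribute
            # Rare terms get a larger weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Repeated matches saturate; b controls the document-length penalty
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized
toy_corpus = [
    "lora adds low rank adapters to the model".split(),
    "the model is trained on the data".split(),
    "the data is split into the batches".split(),
]
toy_scores = bm25_scores(["lora", "the"], toy_corpus)
for doc, score in zip(toy_corpus, toy_scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The first document should score highest: it is the only one containing the rare term "lora", while "the" appears everywhere and contributes very little.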

In [4]:
from rank_bm25 import BM25Okapi
import numpy as np

# Step 1: Tokenize documents (split into words)
def simple_tokenize(text):
    """
    Simple tokenization: lowercase and split on whitespace.
    Punctuation stays attached to words ('LoRA?' becomes 'lora?'),
    so in production you'd use better tokenization.
    """
    return text.lower().split()

# Tokenize all chunks
tokenized_chunks = [simple_tokenize(chunk) for chunk in chunk_texts]

# Step 2: Create BM25 index
bm25 = BM25Okapi(tokenized_chunks)

print("BM25 index created!")
print(f"Vocabulary size: {len(bm25.idf)} unique words")

BM25 index created!
Vocabulary size: 914 unique words


In [5]:
def bm25_search(query: str, k: int = 5):
    """
    Search using BM25 (keyword matching).
    
    Args:
        query: Search query
        k: Number of results to return
    
    Returns:
        List of (chunk_text, score) tuples
    """
    # Tokenize the query
    tokenized_query = simple_tokenize(query)
    
    # Get BM25 scores for all chunks
    scores = bm25.get_scores(tokenized_query)
    
    # Get top-k indices
    top_indices = np.argsort(scores)[::-1][:k]
    
    # Return results with scores
    results = [(chunk_texts[i], scores[i]) for i in top_indices]
    return results

# Test BM25 search
query = "What is LoRA?"
bm25_results = bm25_search(query, k=3)

print(f"BM25 Search Results for: '{query}'")
print("="*80)

for i, (text, score) in enumerate(bm25_results, 1):
    print(f"\nResult {i} (BM25 Score: {score:.2f}):")
    print(f"{text[:200]}...")
    print("-"*40)

BM25 Search Results for: 'What is LoRA?'

Result 1 (BM25 Score: 2.92):
9. Data Pipelines → Automated data processing and validation workflows 
10. Model Governance → Policies and processes for responsible model deployment 
11. Rollback Strategies → Safe deployment and qu...
----------------------------------------

Result 2 (BM25 Score: 0.00):
15. Jailbreaking Prevention → Protecting against attempts to bypass safety measures...
----------------------------------------

Result 3 (BM25 Score: 0.00):
9. Red-Teaming → Stress-test models for safety, robustness, and alignment 
10. Robustness Evaluation → Test model resilience against noise, domain shifts, and edge 
cases 
11. Mechanistic Interpretabi...
----------------------------------------
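
In the output above only the first result has a non-zero score, and it is not about LoRA at all. A likely cause is the whitespace-only tokenizer: the query token `lora?` keeps its question mark, so it cannot match `lora` in the chunks. The sketch below shows one possible replacement; the name `regex_tokenize` and the exact pattern are assumptions for illustration, not part of `rank_bm25`.

```python
import re

def regex_tokenize(text):
    """Lowercase, then keep runs of letters/digits, allowing internal hyphens (e.g. 'gpt-4')."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

print("What is LoRA?".lower().split())  # ['what', 'is', 'lora?']
print(regex_tokenize("What is LoRA?"))  # ['what', 'is', 'lora']
```

To try it, use `regex_tokenize` for both the chunks and the query, then rebuild the BM25 index. The outputs recorded in this notebook were produced with the original `simple_tokenize`, so scores and rankings would change.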


---
## Part 2: Setting Up Dense (Semantic) Search

This is what we've been doing - using embeddings for semantic similarity.

In [6]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="hybrid_demo"
)

print("Dense (Semantic) search ready!")

Dense (Semantic) search ready!


In [7]:
def dense_search(query: str, k: int = 5):
    """
    Search using dense embeddings (semantic similarity).
    """
    results = vectorstore.similarity_search_with_score(query, k=k)
    # Return in same format as BM25
    return [(doc.page_content, score) for doc, score in results]

# Test dense search
dense_results = dense_search(query, k=3)

print(f"Dense Search Results for: '{query}'")
print("="*80)

for i, (text, score) in enumerate(dense_results, 1):
    print(f"\nResult {i} (Distance: {score:.4f} - lower is better):")
    print(f"{text[:200]}...")
    print("-"*40)

Dense Search Results for: 'What is LoRA?'

Result 1 (Distance: 1.5255 - lower is better):
9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest hardware 
10. PEFT → Family of methods (e.g., LoRA, QLoRA, adapters) updating only small parts of the 
model 
11. Instruct...
----------------------------------------

Result 2 (Distance: 1.6433 - lower is better):
@genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ....................................................................
----------------------------------------

Result 3 (Distance: 1.7043 - lower is better):
requirements 
11. Explainability / Interpretability Evaluation → Assess clarity and transparency of model 
reasoning 
Extensions 
1. Multimodality → Combine text, images, audio, and video for richer u...
----------------------------------------
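
Why is a lower distance better? The sketch below relates distance to cosine similarity under two assumptions that this notebook does not confirm: that the vector store reports squared Euclidean (L2) distance, and that the embeddings are unit-length. In that case `distance = 2 - 2 * cosine_similarity`, so a distance of 1.5255 would correspond to a cosine similarity of about 0.24. The vectors here are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical 3-dimensional "embeddings"
query_vec = normalize([1.0, 2.0, 0.5])
close_vec = normalize([1.1, 1.9, 0.4])   # points in almost the same direction
far_vec = normalize([-2.0, 0.3, 1.0])    # points in a very different direction

for name, vec in [("close", close_vec), ("far", far_vec)]:
    dist = squared_l2(query_vec, vec)
    from_cosine = 2 - 2 * cosine_similarity(query_vec, vec)
    print(f"{name}: squared L2 = {dist:.4f}, 2 - 2*cos = {from_cosine:.4f}")
```

Both columns agree for unit vectors, and the vector that points in a similar direction gets the smaller distance.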


---
## Part 3: Comparing BM25 vs Dense

Let's see where each method shines!

In [8]:
def compare_search_methods(query: str):
    """
    Compare BM25 and Dense search side by side.
    """
    print(f"Query: '{query}'")
    print("\n" + "="*80)
    
    # BM25 results
    print("\nBM25 (Keyword) Results:")
    print("-"*40)
    bm25_results = bm25_search(query, k=2)
    for i, (text, score) in enumerate(bm25_results, 1):
        print(f"{i}. (score: {score:.2f}) {text[:100]}...")
    
    # Dense results
    print("\nDense (Semantic) Results:")
    print("-"*40)
    dense_results = dense_search(query, k=2)
    for i, (text, score) in enumerate(dense_results, 1):
        print(f"{i}. (dist: {score:.4f}) {text[:100]}...")
    
    print("\n" + "="*80 + "\n")

# Test with different query types
print("COMPARISON 1: Specific term query")
compare_search_methods("LoRA")

print("COMPARISON 2: Conceptual query")
compare_search_methods("How can I make my model smaller?")

print("COMPARISON 3: Acronym query")
compare_search_methods("QLoRA")

COMPARISON 1: Specific term query
Query: 'LoRA'


BM25 (Keyword) Results:
----------------------------------------
1. (score: 2.52) 3. Sharded / Distributed Training → Scale across multiple GPUs/nodes 
4. Continual / Lifelong Learni...
2. (score: 2.31) 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest hardware 
10. PEFT → F...

Dense (Semantic) Results:
----------------------------------------


1. (dist: 1.4533) 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest hardware 
10. PEFT → F...
2. (dist: 1.5870) @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Co...


COMPARISON 2: Conceptual query
Query: 'How can I make my model smaller?'


BM25 (Keyword) Results:
----------------------------------------
1. (score: 2.83) 3. Top-k / Top-p → Sampling filters, Higher = safer, looser = more diverse 
4. Repetition Penalty → ...
2. (score: 2.61) @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Co...

Dense (Semantic) Results:
----------------------------------------
1. (dist: 1.5048) @genieincodebottle 
11. Sharded / Distributed Training → Split model parameters across devices for m...
2. (dist: 1.5145) 15. Distillation → Transfer knowledge from a large model into a smaller one 
16. Gradient Descent & ...


COMPARISON 3: Acronym query
Query: 'QLoRA'


BM25 (Keyword

---
## Part 4: Hybrid Search with Reciprocal Rank Fusion (RRF)

**The Problem:**
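BM25 and Dense scores are on different scales, and they even point in opposite directions (a higher BM25 score is better, a lower distance is better) - we can't just add them!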

**The Solution: Reciprocal Rank Fusion (RRF)**

Instead of combining scores, we combine RANKS:

```
RRF Score = 1/(k + rank_bm25) + 1/(k + rank_dense)

Where k is a constant (usually 60)
```

**Why this works:**
- If a document ranks high in BOTH methods, it gets a high combined score
- Documents that rank high in only one method still get credit
- Handles different score scales automatically
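
To make the arithmetic concrete, here is a small worked example of the formula with `k = 60`. The document names and rank pairs are hypothetical; `doc_c` shows a document that only one method returned, using a large placeholder rank (1000 here) for the list that doesn't contain it.

```python
def rrf_score(ranks, rrf_k=60):
    """Sum 1 / (rrf_k + rank) over the rankings a document appears in (ranks start at 1)."""
    return sum(1.0 / (rrf_k + rank) for rank in ranks)

# (bm25_rank, dense_rank) for three hypothetical documents
toy_ranks = {
    "doc_a": (4, 1),      # good in both lists
    "doc_b": (8, 2),      # slightly worse in both lists
    "doc_c": (1, 1000),   # top BM25 hit, missing from the dense list
}

for name, ranks in toy_ranks.items():
    print(f"{name}: {rrf_score(ranks):.4f}")
# doc_a: 0.0320
# doc_b: 0.0308
# doc_c: 0.0173
```

Note that `doc_c` still gets credit for its BM25 rank, but a document that does reasonably well in both lists beats a document that tops only one.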

In [9]:
def hybrid_search_rrf(query: str, k: int = 5, rrf_k: int = 60):
    """
    Hybrid search combining BM25 and Dense using Reciprocal Rank Fusion.
    
    Args:
        query: Search query
        k: Number of final results to return
        rrf_k: RRF constant (default 60, standard value)
    
    Returns:
        List of (chunk_text, info) tuples, where info is a dict with
        'rrf_score', 'bm25_rank', and 'dense_rank'
    """
    # Get more results than needed from each method
    num_candidates = k * 3
    
    # Step 1: Get BM25 results
    bm25_results = bm25_search(query, k=num_candidates)
    
    # Step 2: Get Dense results  
    dense_results = dense_search(query, k=num_candidates)
    
    # Step 3: Create rank dictionaries
    # Map chunk text to its rank in each method
    bm25_ranks = {text: rank for rank, (text, _) in enumerate(bm25_results, 1)}
    dense_ranks = {text: rank for rank, (text, _) in enumerate(dense_results, 1)}
    
    # Step 4: Get all unique chunks
    all_chunks = set(bm25_ranks.keys()) | set(dense_ranks.keys())
    
    # Step 5: Calculate RRF scores
    rrf_scores = {}
    for chunk in all_chunks:
        # Get rank from each method (use large number if not in results)
        bm25_rank = bm25_ranks.get(chunk, 1000)
        dense_rank = dense_ranks.get(chunk, 1000)
        
        # Calculate RRF score
        rrf_score = (1 / (rrf_k + bm25_rank)) + (1 / (rrf_k + dense_rank))
        rrf_scores[chunk] = {
            'rrf_score': rrf_score,
            'bm25_rank': bm25_rank,
            'dense_rank': dense_rank
        }
    
    # Step 6: Sort by RRF score and return top-k
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1]['rrf_score'], reverse=True)
    
    return sorted_results[:k]

print("Hybrid search function ready!")

Hybrid search function ready!


In [10]:
# Test hybrid search
query = "What is LoRA and how does it help with fine-tuning?"

print(f"Hybrid Search Results for: '{query}'")
print("="*80)

hybrid_results = hybrid_search_rrf(query, k=5)

for i, (text, scores) in enumerate(hybrid_results, 1):
    print(f"\nResult {i}:")
    print(f"  RRF Score: {scores['rrf_score']:.4f}")
    print(f"  BM25 Rank: {scores['bm25_rank']} | Dense Rank: {scores['dense_rank']}")
    print(f"  Text: {text[:150]}...")
    print("-"*40)

Hybrid Search Results for: 'What is LoRA and how does it help with fine-tuning?'

Result 1:
  RRF Score: 0.0320
  BM25 Rank: 4 | Dense Rank: 1
  Text: 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest hardware 
10. PEFT → Family of methods (e.g., LoRA, QLoRA, adapters) upd...
----------------------------------------

Result 2:
  RRF Score: 0.0308
  BM25 Rank: 8 | Dense Rank: 2
  Text: 3. Sharded / Distributed Training → Scale across multiple GPUs/nodes 
4. Continual / Lifelong Learning → Update models without forgetting old knowledg...
----------------------------------------

Result 3:
  RRF Score: 0.0292
  BM25 Rank: 6 | Dense Rank: 11
  Text: @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ..................
----------------------------------------

Result 4:
  RRF Score: 0.0286
  BM25 Rank: 13 | Dense Rank: 7
  Text: 16. Elastic / Dynamic Batching → Adjust batch size d

---
## Part 5: Weighted Hybrid Search

Sometimes you want to give more weight to one method over the other.

**Use cases:**
- Technical docs with many acronyms → Weight BM25 higher
- Conceptual queries → Weight Dense higher

In [11]:
def weighted_hybrid_search(query: str, k: int = 5, bm25_weight: float = 0.5):
    """
    Weighted hybrid search - control the balance between BM25 and Dense.
    
    Args:
        query: Search query
        k: Number of results
        bm25_weight: Weight for BM25 (0.0 to 1.0)
                     0.0 = Pure dense, 1.0 = Pure BM25, 0.5 = Equal
    """
    dense_weight = 1 - bm25_weight
    num_candidates = k * 3
    rrf_k = 60
    
    # Get results from both methods
    bm25_results = bm25_search(query, k=num_candidates)
    dense_results = dense_search(query, k=num_candidates)
    
    # Create rank dictionaries
    bm25_ranks = {text: rank for rank, (text, _) in enumerate(bm25_results, 1)}
    dense_ranks = {text: rank for rank, (text, _) in enumerate(dense_results, 1)}
    
    # Get all unique chunks
    all_chunks = set(bm25_ranks.keys()) | set(dense_ranks.keys())
    
    # Calculate WEIGHTED RRF scores
    rrf_scores = {}
    for chunk in all_chunks:
        bm25_rank = bm25_ranks.get(chunk, 1000)
        dense_rank = dense_ranks.get(chunk, 1000)
        
        # Weighted RRF
        rrf_score = (
            bm25_weight * (1 / (rrf_k + bm25_rank)) + 
            dense_weight * (1 / (rrf_k + dense_rank))
        )
        rrf_scores[chunk] = rrf_score
    
    # Sort and return
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:k]

# Compare different weights
query = "QLoRA quantization"

print(f"Query: '{query}'")
print("\n" + "="*80)

for weight in [0.0, 0.5, 1.0]:
    weight_name = {0.0: "Pure Dense", 0.5: "Balanced", 1.0: "Pure BM25"}[weight]
    print(f"\n{weight_name} (BM25 weight: {weight}):")
    results = weighted_hybrid_search(query, k=2, bm25_weight=weight)
    for i, (text, score) in enumerate(results, 1):
        print(f"  {i}. {text[:80]}...")

Query: 'QLoRA quantization'


Pure Dense (BM25 weight: 0.0):
  1. 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest ha...
  2. 16. Elastic / Dynamic Batching → Adjust batch size dynamically to optimize throu...

Balanced (BM25 weight: 0.5):
  1. 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest ha...
  2. 16. Elastic / Dynamic Batching → Adjust batch size dynamically to optimize throu...

Pure BM25 (BM25 weight: 1.0):
  1. 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest ha...
  2. 16. Elastic / Dynamic Batching → Adjust batch size dynamically to optimize throu...
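
In the output above all three weights happen to return the same two chunks, so the effect of the weight is not visible. The toy example below, with two hypothetical documents that the methods disagree on, shows how the weight can flip the order. The helper `weighted_rrf` mirrors the weighted formula used in the cell but is written separately for illustration.

```python
def weighted_rrf(bm25_rank, dense_rank, bm25_weight=0.5, rrf_k=60):
    """Weighted RRF for one document, given its rank in each list."""
    dense_weight = 1.0 - bm25_weight
    return bm25_weight / (rrf_k + bm25_rank) + dense_weight / (rrf_k + dense_rank)

# (bm25_rank, dense_rank): the two methods disagree about these documents
toy_ranks = {
    "keyword_hit": (1, 10),   # BM25 loves it, dense search less so
    "semantic_hit": (10, 1),  # the other way around
}

def top_document(bm25_weight):
    """Return the name of the document with the highest weighted RRF score."""
    return max(
        toy_ranks,
        key=lambda name: weighted_rrf(*toy_ranks[name], bm25_weight=bm25_weight),
    )

print("BM25 weight 0.8 ->", top_document(0.8))  # keyword_hit
print("BM25 weight 0.2 ->", top_document(0.2))  # semantic_hit
```

With equal weights the two documents tie, which is why the weight only matters when the two rankings actually disagree.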


---
## Part 6: Using LangChain's Built-in Ensemble Retriever

LangChain provides an easy way to do hybrid search!

In [12]:
from langchain_classic.retrievers import BM25Retriever, EnsembleRetriever

# Create BM25 retriever from documents
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5  # Number of results

# Create dense retriever from vector store
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Create ensemble (hybrid) retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5]  # Equal weights
)

print("LangChain Ensemble Retriever ready!")

LangChain Ensemble Retriever ready!


In [13]:
# Test the ensemble retriever
query = "What is attention mechanism in transformers?"

print(f"Ensemble Retriever Results for: '{query}'")
print("="*80)

results = ensemble_retriever.invoke(query)

for i, doc in enumerate(results[:5], 1):
    print(f"\nResult {i}:")
    print(f"{doc.page_content[:200]}...")
    print("-"*40)

Ensemble Retriever Results for: 'What is attention mechanism in transformers?'

Result 1:
5. Attention → Highlights the most relevant tokens in context 
6. Self-Attention → Each token attends to every other token for context 
7. Cross-Attention → Connect encoder and decoder (in encoder-dec...
----------------------------------------

Result 2:
17. ALiBi / Relative Positional Encoding → Alternative to RoPE for long contexts 
18. Linear / Performer Attention → Efficient attention variants for very long sequences 
19. Grouped Query Attention (...
----------------------------------------

Result 3:
9. Data Pipelines → Automated data processing and validation workflows 
10. Model Governance → Policies and processes for responsible model deployment 
11. Rollback Strategies → Safe deployment and qu...
----------------------------------------

Result 4:
11. Layer Normalization → Normalizes activations to stabilize and speed up training 
12. Output Projection (LM Head) → Final linear layer mapp

---
## Part 7: Complete Hybrid RAG Pipeline

In [14]:
from langchain_openai import ChatOpenAI
from langchain_classic.chains.retrieval_qa.base import RetrievalQA

# Initialize LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,
    api_key=os.environ["OPENAI_API_KEY"]
)

# Create RAG chain with hybrid retriever
hybrid_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,
    return_source_documents=True
)

print("Hybrid RAG pipeline ready!")

Hybrid RAG pipeline ready!


In [15]:
# Test the hybrid RAG pipeline
def ask_hybrid_rag(question: str):
    """
    Ask a question using the hybrid RAG pipeline.
    """
    result = hybrid_qa_chain.invoke({"query": question})
    
    print(f"Question: {question}")
    print(f"\n" + "="*80)
    print(f"\nAnswer:\n{result['result']}")
    print(f"\n" + "="*80)
    print(f"\nSources ({len(result['source_documents'])} documents):")
    for i, doc in enumerate(result['source_documents'][:3], 1):
        print(f"  {i}. {doc.page_content[:100]}...")

# Test with different types of queries
ask_hybrid_rag("What is LoRA and how does it differ from QLoRA?")

Question: What is LoRA and how does it differ from QLoRA?


Answer:
LoRA (Low-Rank Adaptation) is a method that involves parameter-efficient adapters for fine-tuning machine learning models. It allows for cheap fine-tuning by updating only a small subset of model parameters, which can be beneficial for adapting models to specific tasks or domains without the need for extensive retraining.

QLoRA (Quantized LoRA) combines LoRA with quantization techniques. This approach enables fine-tuning of large models on modest hardware by reducing the memory footprint of the model. By quantizing the parameters, QLoRA makes it feasible to work with larger models even on hardware with limited resources.

In summary, while both LoRA and QLoRA are designed for efficient fine-tuning, QLoRA specifically incorporates quantization to facilitate the training of larger models on less powerful hardware.


Sources (9 documents):
  1. 3. Sharded / Distributed Training → Scale across multiple GPUs/nodes 
4. Cont

In [16]:
# Test with a more conceptual query
ask_hybrid_rag("How can I efficiently fine-tune a large model on limited hardware?")

Question: How can I efficiently fine-tune a large model on limited hardware?


Answer:
You can efficiently fine-tune a large model on limited hardware by using techniques such as:

1. **LoRA (Low-Rank Adaptation)**: This method allows for parameter-efficient fine-tuning, enabling you to adapt the model without needing to adjust all of its parameters.

2. **QLoRA**: This combines LoRA with quantization, allowing for fine-tuning of large models while using modest hardware resources.

3. **Gradient Checkpointing**: This technique saves memory by recomputing intermediate activations on demand instead of storing them all.

4. **Mixed Precision Training**: Using FP16 or BF16 can speed up training and reduce memory usage.

5. **Sharded / Distributed Training**: If you have access to multiple devices, you can scale across multiple GPUs or nodes to distribute the training workload.

These strategies can help you maximize efficiency while working within the constraints of limited hardware.


Sou

---
## When to Use Hybrid Search?

| Scenario | Recommendation |
|----------|----------------|
| Technical docs with acronyms | Hybrid (weight BM25 higher) |
| Conceptual questions | Hybrid or Pure Dense |
| Exact term lookups | BM25 or Hybrid |
| Synonyms/paraphrasing | Dense or Hybrid |
| General RAG | Hybrid (balanced) |

**Rule of thumb:** When in doubt, start with Hybrid and equal weights. It's usually a safe default, but check it against your own queries before relying on it.
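
One way to check which setup suits your own data is to measure a simple hit rate on a handful of labeled queries. The sketch below is a minimal version of that idea: `hit_rate_at_k`, the toy documents, and `toy_keyword_search` are all hypothetical stand-ins. It assumes a search function that takes `(query, k)` and returns a list of `(text, score)` pairs, which is the shape `bm25_search` and `dense_search` use in this notebook.

```python
def hit_rate_at_k(search_fn, labeled_queries, k=5):
    """Fraction of queries whose expected substring appears in at least one top-k text."""
    hits = 0
    for query, expected in labeled_queries:
        top_texts = [text for text, _ in search_fn(query, k)]
        if any(expected.lower() in text.lower() for text in top_texts):
            hits += 1
    return hits / len(labeled_queries)

# Toy stand-in for a real search function (counts query words found in each document)
toy_docs = [
    "QLoRA combines LoRA with quantization",
    "Distillation trains a smaller model",
]

def toy_keyword_search(query, k=5):
    words = query.lower().split()
    scored = [(doc, sum(word in doc.lower() for word in words)) for doc in toy_docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# (query, substring we expect to see in a good result)
labeled_queries = [
    ("qlora", "quantization"),
    ("make my model smaller", "distillation"),
]

print(hit_rate_at_k(toy_keyword_search, labeled_queries, k=1))
```

Running the same labeled queries through each search function gives a rough, data-specific comparison instead of relying on the table alone.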

---
## Summary

### What You've Learned:
1. **BM25**: Keyword-based search (exact matches)
2. **Dense**: Semantic search (meaning-based)
3. **Hybrid**: Combines both for better retrieval
4. **RRF**: Reciprocal Rank Fusion for combining results
5. **Weighted Hybrid**: Control the balance between methods

### Key Takeaways:
- Neither BM25 nor Dense is always better - it depends on the query!
- Hybrid search gives you the best of both worlds
- RRF is a simple but effective way to combine rankings
- LangChain's EnsembleRetriever makes hybrid search easy

### BM25 vs Dense vs Hybrid:
```
BM25:   "GPT-4" → Finds exact "GPT-4" mentions
Dense:  "GPT-4" → Might find "large language model" (semantically similar)
Hybrid: "GPT-4" → Finds both! Best coverage.
```

### Next Up:
**Re-ranking** - After retrieval, use a more powerful model to re-order results for even better quality!