# Hybrid Retrieval Implementation

This notebook implements the hybrid retrieval layer of the system, combining semantic and lexical signals to improve precision and robustness.

---

## Scope

- Implement BM25 keyword-based retrieval  
- Retain FAISS semantic retrieval  
- Apply score fusion between lexical and semantic signals  
- Introduce cross-encoder reranking  
- Compare retrieval quality metrics  
- Measure latency impact  

This notebook represents the core intelligence layer of the RAG architecture and is critical for production-grade systems and technical interviews.

---

## Why Hybrid Retrieval?

Initial baseline evaluation showed:

Precision@5 = 0.20  
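
For reference, Precision@k is usually computed as the share of the top-k retrieved items that are relevant. The baseline evaluation code is not part of this notebook, so the sketch below is only an assumption about how such a number is obtained; the function name and the chunk ids are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical chunk ids, for illustration only.
retrieved = [12, 7, 33, 4, 19]
relevant = {33, 101}

print(precision_at_k(retrieved, relevant, k=5))  # 1 hit out of 5 -> 0.2
```

A value of 0.20 therefore corresponds to one relevant chunk among the five returned.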

Pure vector search captures semantic similarity effectively, but it does not always prioritize explicit keyword intent (e.g., terms like “limitations”, “drawbacks”, “constraints”).

BM25, on the other hand, emphasizes exact keyword matching and term frequency signals.

Hybrid retrieval combines:

- Semantic understanding (vector similarity)  
- Lexical precision (BM25 keyword scoring)  

Fusing these signals aims to improve retrieval precision while maintaining contextual relevance.


In [1]:
import sys
import os

# Add project root to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)


In [2]:
import pickle

with open("processed_chunks.pkl", "rb") as f:
    documents = pickle.load(f)

print(f"Loaded {len(documents)} documents for hybrid retrieval.")

Loaded 1333 documents for hybrid retrieval.


In [3]:
# Load Embedding Model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")



In [4]:
# Load FAISS Index
import faiss

index = faiss.read_index("faiss_index.bin")
print("FAISS index loaded. Total Vectors:", index.ntotal)

FAISS index loaded. Total Vectors: 1333


## 1 – Prepare Corpus for BM25

To enable lexical retrieval, the document corpus is prepared for BM25 indexing.

Since BM25 operates on tokenized text, each document is converted to lowercase and split into tokens. This creates a structured corpus suitable for term-frequency and inverse document frequency calculations.

The BM25 index is then initialized over the tokenized corpus, enabling efficient keyword-based scoring.

This step establishes the lexical retrieval backbone required for hybrid search fusion.
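
The lowercase-and-split scheme described above can be illustrated on a single query string. The sketch below also shows one possible refinement, a regex tokenizer that drops punctuation; `simple_tokenize` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

query = "How many AI publications were there in 2023?"

# Same scheme as the notebook: lowercase, then split on whitespace.
whitespace_tokens = query.lower().split()
print(whitespace_tokens)
# ['how', 'many', 'ai', 'publications', 'were', 'there', 'in', '2023?']


# Hypothetical alternative: keep only runs of ASCII letters and digits.
def simple_tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


print(simple_tokenize(query))
# ['how', 'many', 'ai', 'publications', 'were', 'there', 'in', '2023']
```

With whitespace splitting, the token `2023?` keeps its question mark and will not match a plain `2023` token. If a different tokenizer is adopted, the same one must be applied to both the corpus and the query.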


In [5]:
from rank_bm25 import BM25Okapi

tokenized_corpus = [
    doc.page_content.lower().split()
    for doc in documents
]

bm25 = BM25Okapi(tokenized_corpus)


## 2 – BM25 Search Function

A dedicated BM25 search function is implemented to retrieve documents based on lexical relevance.

The query is tokenized and scored against the preprocessed corpus using BM25. Documents are then ranked by their relevance scores, and the top-k results are returned along with their ranking position and score.

This function provides:

- Deterministic keyword-based retrieval  
- Transparent scoring behavior  
- Structured result format for later score fusion  

BM25 retrieval complements semantic search by prioritizing exact keyword matches, forming the lexical component of the hybrid retrieval pipeline.
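
To make the scoring step less of a black box, here is a minimal pure-Python sketch of an Okapi BM25 scorer. It is a simplified illustration, not the `rank_bm25` implementation: it uses the `log(... + 1)` IDF variant, so its numbers will not match `BM25Okapi` exactly, and the `k1` and `b` values are assumed defaults. The function assumes a non-empty corpus.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document in a non-empty tokenized corpus against the query."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through this factor.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, for illustration only.
toy_corpus = [
    "ai publications doubled by 2023".split(),
    "hardware trends in 2023".split(),
    "table of contents".split(),
]
scores = bm25_scores("ai publications 2023".split(), toy_corpus)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked)  # the document matching all three terms comes first
```

Rare terms receive a higher IDF weight, which is why a document matching `ai` and `publications` outranks one that only matches the more common `2023`.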


In [6]:
def bm25_search(query, documents, top_k=5):
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)

    ranked_indices = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=True
    )[:top_k]

    results = []

    for rank, idx in enumerate(ranked_indices):
        results.append({
            "rank": rank+1,
            "score": scores[idx],
            "content": documents[idx].page_content,
            "doc": documents[idx]
        })

    return results

### Validate BM25 Retrieval

The BM25 search function is tested using a targeted query to evaluate lexical relevance performance.


In [7]:
bm25_results = bm25_search("How many AI publications were there in 2023?",
                           documents)

for r in bm25_results:
    print(r["rank"], r["score"])
    print(r["content"][:300])
    print("-"*300)

1 16.18384673243373
Table of Contents 40
Artificial Intelligence
Index Report 2025Chapter 1 Preview
Academia Industry Industry and academia Mixed Other
Sector
Number of highly cited publications in top 100
Number of highly cited publications in top 100 by sector, 2021–23
Source: AI Index, 2025 | Chart: 2025 AI Index re
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2 14.632243749264937
database. As a result, the numbers in this year’s report differ 
slightly from those in previous editions.
1 Given that there is a 
significant lag in the collection of publication metadata, and 
that in some cases it takes until the middle of any given year 
to fully capture the previous year’s pub
--------------------------------------------------------

This validation step confirms:

- Proper tokenization of the query  
- Correct BM25 scoring behavior  
- Accurate ranking of documents based on keyword match strength  

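Inspecting the ranked results helps verify that chunks containing query terms such as “publications” are prioritized. Note that whitespace tokenization keeps trailing punctuation, so the query token “2023?” does not match a plain “2023” token in the corpus.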

This qualitative check establishes the lexical baseline before integrating semantic fusion and reranking.

## 3 – Score Normalization for Fusion

Before combining semantic and lexical signals, both score types must be normalized.

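Vector search (FAISS) returns distances, where lower values indicate higher similarity. Assuming the index was built with the L2 metric (e.g., `IndexFlatL2`), these are squared L2 distances.  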
BM25 returns positive relevance scores, where higher values indicate stronger keyword match.

To enable meaningful fusion:

- L2 distances are converted into similarity scores.  
- Both vector and BM25 scores are scaled into a comparable range (0–1).  



In [8]:
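# Normalize L2 distances before fusion (lower distance -> higher similarity)
def normalize_vector_scores(distances):
    max_dist = max(distances)
    if max_dist == 0:
        # All candidates coincide with the query; treat them as perfect matches.
        return [1.0 for _ in distances]
    return [1 - (d / max_dist) for d in distances]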

In [9]:
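# Normalize BM25 scores (higher score -> stronger keyword match)
def normalize_bm25_scores(scores):
    max_score = max(scores)
    if max_score <= 0:
        # No query term matched any document; avoid dividing by zero.
        return [0.0 for _ in scores]
    return [s / max_score for s in scores]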

This normalization step puts both score types on a comparable 0–1 scale, so that neither retrieval method dominates the final ranking simply because of its raw score magnitude.
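
One property of max-scaling is worth noting: the closest vector is mapped to `1 - d_min / d_max`, which is not necessarily 1, while the best BM25 match always gets exactly 1. Min-max scaling is one possible alternative that stretches both signals over the full 0–1 range. The sketch below is an illustration under that assumption, not the method used in this notebook; `min_max_normalize` is a hypothetical helper.

```python
def min_max_normalize(values, higher_is_better=True):
    """Scale values to [0, 1]; invert when lower raw values are better."""
    values = list(values)
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All candidates tie; give them the same score.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


# Toy values, for illustration only.
distances = [0.5, 1.0, 1.5]   # lower is better
bm25_raw = [16.0, 8.0, 0.0]   # higher is better

print(min_max_normalize(distances, higher_is_better=False))  # [1.0, 0.5, 0.0]
print(min_max_normalize(bm25_raw))                           # [1.0, 0.5, 0.0]
```

Either scheme is sensitive to outliers, since a single extreme value compresses all other scores.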


## 4 – Hybrid Score Fusion

Hybrid retrieval is implemented using weighted score fusion between semantic similarity and BM25 lexical relevance.

The process consists of:

1. Retrieving top candidates from the FAISS index (semantic signal).  
2. Computing BM25 scores for the full corpus (lexical signal).  
3. Normalizing both score distributions to a common scale.  
4. Combining them using a weighted sum:

   Hybrid Score = α × Semantic Score + (1 − α) × BM25 Score  

The `alpha` parameter controls the influence of semantic versus lexical signals.  
A higher alpha favors semantic similarity, while a lower alpha emphasizes keyword matching.
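
The effect of `alpha` is easiest to see on a few hand-picked scores. The sketch below applies the weighted sum to toy, already-normalized scores; `fuse` and the document ids are hypothetical. A document missing from one signal contributes 0 for that signal.

```python
def fuse(semantic, lexical, alpha=0.6):
    """Weighted sum of two normalized score dicts keyed by document id."""
    doc_ids = set(semantic) | set(lexical)
    return {
        doc_id: alpha * semantic.get(doc_id, 0.0)
        + (1 - alpha) * lexical.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Toy normalized scores, for illustration only.
semantic = {"a": 0.9, "b": 0.4}
lexical = {"b": 1.0, "c": 0.7}

for alpha in (0.8, 0.6, 0.2):
    fused = fuse(semantic, lexical, alpha)
    best = max(fused, key=fused.get)
    print(alpha, best, round(fused[best], 2))
# 0.8 a 0.72
# 0.6 b 0.64
# 0.2 b 0.88
```

With `alpha=0.8` the semantically closest document wins; at 0.6 and below, the document with the strongest keyword match overtakes it.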

In [10]:
def hybrid_search(query, model, index, documents, top_k=5,
                  alpha=0.6):
    
    # Semantic
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, top_k*2)
    
    semantic_scores = normalize_vector_scores(distances[0])
    
    semantic_results = {
        indices[0][i]: semantic_scores[i]
        for i in range(len(indices[0]))
    }
    
    # BM25
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    
    bm25_norm = normalize_bm25_scores(bm25_scores)
    
    # Combine scores
    combined_scores = {}
    
    for idx in range(len(documents)):
        sem_score = semantic_results.get(idx, 0)
        bm_score = bm25_norm[idx]
        
        combined_scores[idx] = alpha * sem_score + (1 - alpha) * bm_score
    
    # Rank
    ranked = sorted(
        combined_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]
    
    results = []
    
    for rank, (idx, score) in enumerate(ranked):
        results.append({
            "rank": rank + 1,
            "score": score,
            "content": documents[idx].page_content,
            "doc": documents[idx]
        })
    
    return results


This fusion strategy:

- Preserves contextual understanding from embeddings  
- Reinforces keyword intent through BM25  
- Produces a more balanced and precise ranking  

The output returns top-k results ranked by the combined hybrid score, forming the core retrieval mechanism of the production RAG system.

## 5 – Evaluate Hybrid Retrieval

The hybrid retrieval pipeline is executed using the same domain-specific query to assess ranking improvements over standalone semantic or BM25 search.

This evaluation helps verify:

- Proper score normalization and fusion behavior  
- Improved prioritization of keyword-specific intent  
- Better alignment between semantic relevance and lexical precision  


In [11]:
hybrid_results = hybrid_search(
    "How many AI publications were there in 2023?",
    model,
    index,
    documents
)

hybrid_results

[{'rank': 1,
  'score': 0.42924363835809254,
  'content': 'database. As a result, the numbers in this year’s report differ \nslightly from those in previous editions.\n1 Given that there is a \nsignificant lag in the collection of publication metadata, and \nthat in some cases it takes until the middle of any given year \nto fully capture the previous year’s publications, in this year’s \nreport, the AI Index team elected to examine publication \ntrends only through 2023.\nOverview\nThe following section reports on trends in the total number of \nEnglish-language AI publications. \nTotal Number of AI Publications\nFigure 1.1.1 displays the global count of AI publications. These \nare the publications with a computer science (CS) label in the \nOpenAlex catalog that were classified by the AI Index as being \nrelated to AI.',
  'doc': Document(page_content='database. As a result, the numbers in this year’s report differ \nslightly from those in previous editions.\n1 Given that there is a

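Comparing these results with the vector-only and BM25-only outputs shows how fusion changes the ranking: here, the chunk that BM25 ranked second moves to the top. A single query gives only a qualitative check, so precision gains should be confirmed on a labeled query set.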
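
A side-by-side view makes such comparisons easier to read. The sketch below is a hypothetical helper that maps each chunk (identified by a content prefix) to its rank under each method. It assumes result dicts with `rank` and `content` keys, as returned by the search functions in this notebook; the demo data is synthetic.

```python
def rank_table(result_sets, key_len=60):
    """Map a short content prefix to its rank under each retrieval method."""
    table = {}
    for method, results in result_sets.items():
        for r in results:
            key = r["content"][:key_len]
            table.setdefault(key, {})[method] = r["rank"]
    return table


# Synthetic results, for illustration only.
bm25_demo = [
    {"rank": 1, "content": "chunk A text"},
    {"rank": 2, "content": "chunk B text"},
]
hybrid_demo = [
    {"rank": 1, "content": "chunk B text"},
    {"rank": 2, "content": "chunk C text"},
]

table = rank_table({"bm25": bm25_demo, "hybrid": hybrid_demo})
for key, ranks in table.items():
    print(f"{key!r}: bm25={ranks.get('bm25', '-')}, hybrid={ranks.get('hybrid', '-')}")
```

A `-` marks a chunk that one method did not return in its top-k at all.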


## 6 – Cross-Encoder Reranking Implementation

A cross-encoder model is initialized to refine the ranking of hybrid retrieval results.

The reranking function:

- Constructs query–document pairs from the hybrid candidates  
- Computes relevance scores using the cross-encoder  
- Sorts candidates based on predicted relevance  
- Returns the reordered results  

Unlike embedding-based retrieval, this approach evaluates each query–document pair jointly, enabling deeper semantic interaction and more precise ranking.

This layer serves as the final refinement stage in the retrieval pipeline. It aims to improve top-k precision while keeping latency manageable by operating on a limited candidate set.


In [12]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


In [13]:
def rerank(query, candidates):
    
    pairs = [[query, c["content"]] for c in candidates]
    
    scores = reranker.predict(pairs)
    
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    
    return [r[0] for r in ranked]
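
Note that `rerank` returns the candidates in their new order but leaves the `rank` and `score` fields at their hybrid-stage values. One possible variant, sketched below, records the new position and the reranker score alongside the original rank. The `score_fn` parameter and `overlap_scorer` are hypothetical stand-ins so the example runs without downloading a model; in the notebook, `reranker.predict` could be passed instead.

```python
def rerank_with_scores(query, candidates, score_fn):
    """Reorder candidates by score_fn and record both old and new ranks."""
    pairs = [[query, c["content"]] for c in candidates]
    scores = score_fn(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

    reranked = []
    for new_rank, i in enumerate(order, start=1):
        item = dict(candidates[i])  # shallow copy so the input list is untouched
        item["hybrid_rank"] = item.pop("rank", None)
        item["rank"] = new_rank
        item["rerank_score"] = float(scores[i])
        reranked.append(item)
    return reranked


# Stand-in scorer for illustration: counts query words found in the passage.
def overlap_scorer(pairs):
    return [sum(w in doc.lower() for w in q.lower().split()) for q, doc in pairs]


# Synthetic candidates, for illustration only.
demo = [
    {"rank": 1, "score": 0.43, "content": "Publication metadata lags behind."},
    {"rank": 2, "score": 0.37, "content": "AI publications rose to 242,000 in 2023."},
]
reranked_demo = rerank_with_scores("AI publications in 2023", demo, overlap_scorer)
for item in reranked_demo:
    print(item["rank"], item["hybrid_rank"], item["rerank_score"])
```

Keeping `hybrid_rank` next to the new `rank` makes it easy to see how far the reranker moved each passage.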


## 7 – Measure Reranking Latency

The reranking stage is executed on the hybrid retrieval results, and total inference time is recorded.

This step evaluates:

- End-to-end cross-encoder latency  
- Practical feasibility of reranking in production  
- Trade-off between precision gain and response time  
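
A single timing can be skewed by one-off costs such as model warm-up. A slightly more robust measurement, sketched below, runs the call a few times and reports the median. `measure_latency` is a hypothetical helper, and the workload here is a stand-in for the rerank call.

```python
import statistics
import time


def measure_latency(fn, *args, warmup=1, repeats=5, **kwargs):
    """Run fn several times and summarize wall-clock latency in seconds."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up runs are not timed

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(timings),
        "max_s": max(timings),
        "runs": repeats,
    }


# Stand-in workload; in the notebook this would be the rerank call.
stats = measure_latency(sorted, list(range(10_000)), repeats=5)
print(stats)
```

`time.perf_counter()` is preferred over `time.time()` for short intervals because it is monotonic and has higher resolution.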



In [14]:
import time

# Time only the rerank call; printing stays outside the timed region.
start = time.perf_counter()
reranked_results = rerank(query="How many AI publications were there in 2023?",
                          candidates=hybrid_results)
elapsed = time.perf_counter() - start

print(reranked_results)
print("Rerank latency:", round(elapsed, 3), "seconds")


[{'rank': 4, 'score': 0.3722453043396091, 'content': 'Table of Contents 30\nArtificial Intelligence\nIndex Report 2025Chapter 1 Preview\npublications more than doubled, rising from approximately \n102,000 in 2013 to more than 242,000 in 2023. The increase \nover the last year was a meaningful 19.7%. Many fields within \ncomputer science, from hardware and software engineering \nto human-computer interaction, are now contributing to \nAI. As a result, the observed growth reflects a broader and \nincreased interest in AI across the discipline.\nAI publications in CS (% of total)\nAI publications in CS (% of total) worldwide, 2013–23\nSource: AI Index, 2025 | Chart: 2025 AI Index report\nFigure 1.1.2\n1.1 Publications\nChapter 1: Research and Development', 'doc': Document(page_content='Table of Contents 30\nArtificial Intelligence\nIndex Report 2025Chapter 1 Preview\npublications more than doubled, rising from approximately \n102,000 in 2013 to more than 242,000 in 2023. The increase \nov

## 8 – Reranking Results Analysis

The cross-encoder successfully reordered the hybrid retrieval results based on deeper query–document interaction.

### Observations

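- The reranker moved the chunk that the hybrid stage ranked 4th to the top. This chunk states that AI publications rose to more than 242,000 in 2023, which directly answers the query.
- The `rank` and `score` fields in the output still hold the hybrid-stage values, because `rerank` reorders the candidates without updating them.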

### Latency Impact

- Reranking latency: 0.11 seconds  
- This is acceptable given that reranking is applied only to a small candidate set (top-k).

### Interpretation

For this query, the cross-encoder improves the ordering by prioritizing the passage that explicitly states the 2023 publication count over passages that only discuss publication data in general.

This validates the full hybrid pipeline:

Semantic Retrieval → BM25 → Score Fusion → Cross-Encoder Reranking

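With reranking in place, the pipeline covers the retrieval stages expected of a production-grade RAG system. Quantitative metrics on a labeled query set are still needed to confirm the precision gain over the baseline.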


------------------