## 🧠 Fusion Retrieval for Smarter Search | RAG100X

This notebook implements **Fusion Retrieval** — a technique that blends both **vector-based similarity search** and **keyword-based BM25 retrieval** to improve the **relevance and coverage** of retrieved chunks.

Instead of relying on a single retrieval method, fusion combines the **semantic power of dense embeddings** with the **precision of keyword matching**, leading to more accurate and diverse retrieval — especially useful when queries are ambiguous or domain-specific.

---

### ✅ What You’ll Learn

- Why using only vector or keyword search can miss important results  
- How to build a **hybrid retriever** that fuses FAISS + BM25  
- How to normalize and combine relevance scores from both sources  
- How to tune the fusion ratio (`alpha`) to balance semantic vs. lexical signals  
- Why fusion improves robustness across different types of queries  

---

### 🔍 Real-world Analogy

Imagine searching a library for info on "artificial intelligence":

> 🔍 BM25 finds books with the exact phrase in titles or chapters  
> 🧠 FAISS retrieves books that conceptually relate to AI, even if the term isn’t mentioned directly  
> 🧪 Fusion blends both to give you the **most relevant and well-rounded** set of sources

✅ That’s fusion retrieval — **semantic + lexical synergy**.

---

### 🧠 How Fusion Retrieval Works Under the Hood

| Step                        | What Happens                                                                 |
|-----------------------------|------------------------------------------------------------------------------|
| 1. Document Chunking        | PDF is split into overlapping chunks using a recursive text splitter         |
| 2. FAISS Vector Index       | Chunks are embedded using OpenAI embeddings and stored in FAISS              |
| 3. BM25 Index               | Same chunks are indexed via BM25 for keyword-based retrieval                 |
| 4. Dual Retrieval           | A query is run through both FAISS and BM25 retrievers                        |
| 5. Score Normalization      | Results from both methods are scored and normalized                          |
| 6. Weighted Fusion          | Scores are combined using a tunable `alpha` (e.g. 0.5 for equal weight)      |
| 7. Top-k Selection          | Highest-ranking fused results are passed to the LLM                         |

🧠 You get results that combine **semantic depth** with **lexical precision**.

---

### 🚀 Why Fusion Retrieval Works So Well

- ⚖️ **Balanced relevance**: Captures both meanings and exact terms  
- 🧠 **Resilient to query type**: Works well for vague or specific queries  
- 🧪 **Customizable weighting**: Adjust fusion ratio for different use cases  
- 📈 **Better coverage**: Reduces missed hits from one-sided retrieval  

---

### 🏗️ Why This Matters in Production

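Vector search alone might miss:

> A query built around a rare exact term (an acronym, error code, or statute number) → Embeddings may rank loosely related passages above the one chunk that contains the literal term

BM25 alone might miss:

> "How does GPT learn?" → Will not match docs that only say “language model training”  
> "LLM capabilities in summarization" → Finds docs with those keywords, but not conceptually related ones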

✅ Fusion ensures both **surface-level matches and deep semantic connections** are retrieved.

This is critical for:

- Complex search queries  
- Knowledge-heavy domains (e.g. law, medicine, finance)  
- Maximizing both recall and precision in production RAG systems  

---

### 🔄 Where This Fits in RAG100X

In previous projects, you explored:

1. Raw vector retrieval from PDF, CSV, Web  
2. Chunking enhancements (Semantic, Propositional, Header-based)  
3. Compression and Query Expansion techniques (RSE, CEW, HyDE, HyPE)  
4. Windowed and segment-aware context methods

Now, in **Day 14**, you enhance the **retrieval strategy itself**:  
> 💡 **Blend the strengths of different retrievers for smarter, richer context.**


## 📦 Installation & Setup

In [None]:
# Install required packages
!pip install langchain langchain-community langchain-openai langchain-text-splitters faiss-cpu pypdf numpy python-dotenv rank-bm25

import os
import sys
from dotenv import load_dotenv
from langchain_core.documents import Document

from typing import List
from rank_bm25 import BM25Okapi
import numpy as np

# Load environment variables from a .env file
load_dotenv()

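# Make sure the OpenAI API key is available and fail early with a clear message if not
# (assigning None to os.environ would raise a TypeError)
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise EnvironmentError("OPENAI_API_KEY is not set. Add it to your .env file.")
os.environ["OPENAI_API_KEY"] = api_key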

### Define document path

In [None]:

path = "data/Understanding_Climate_Change.pdf"

## 📄 Step 1: Encode PDF to Vector Store and Prepare for BM25 | RAG100X

Before we can perform fusion retrieval, we need to prepare our **document corpus** in two formats:

1. A **vector store** (for semantic search via FAISS)  
2. A **split document list** (for lexical search via BM25)

This function does exactly that — it takes in a PDF file, splits its content into manageable chunks, and creates a vector store using OpenAI embeddings.

---

### 🧠 What’s Happening Here?

- **PDF Loading**: The `PyPDFLoader` reads the full document and extracts its pages as raw text.
- **Chunking**: We split the text into smaller overlapping chunks using `RecursiveCharacterTextSplitter`. This helps preserve context while fitting within LLM token limits.
- **Cleaning**: A custom function `replace_t_with_space()` is applied to fix formatting issues (like `\t`).
- **Embedding**: We encode the chunks using `OpenAIEmbeddings`, turning them into high-dimensional vectors.
- **Vector Store**: These embeddings are saved into a FAISS index for fast similarity-based retrieval.
- **Return Value**: We return both the FAISS vector store and the cleaned split documents (needed later for BM25 indexing).
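
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `chunk_with_overlap` helper does not attempt. The sample string and the small sizes are chosen only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap with their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last 4 characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both chunks.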

---

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


# Function for Cleaning the document
def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents




def encode_pdf_and_get_split_documents(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF into a FAISS vector store using OpenAI embeddings,
    and returns the split text chunks for BM25 indexing.

    Args:
        path (str): Path to the PDF file.
        chunk_size (int): Number of characters in each chunk.
        chunk_overlap (int): Overlap between adjacent chunks for context preservation.

    Returns:
        vectorstore (FAISS): Vector store with OpenAI-embedded chunks.
        cleaned_texts (List[Document]): List of cleaned, split chunks (for BM25).
    """

    # Step 1: Load the PDF document
    loader = PyPDFLoader(path)
    documents = loader.load()  # Extract pages as Document objects

    # Step 2: Split text into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    texts = text_splitter.split_documents(documents)

    # Step 3: Clean up formatting issues (e.g. remove tab characters)
    cleaned_texts = replace_t_with_space(texts)

    # Step 4: Embed chunks using OpenAI and store in FAISS vector DB
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    # Return both vector store and chunked docs for BM25
    return vectorstore, cleaned_texts

### Create vectorstore and get the chunked documents

In [None]:

vectorstore, cleaned_texts = encode_pdf_and_get_split_documents(path)

## 🔍 Create BM25 Index for Keyword-Based Retrieval | RAG100X

While vector stores handle **semantic similarity**, we also want to support **exact keyword-based matching**. That’s where **BM25** comes in.

BM25 (Best Matching 25) is a traditional, **lexical retrieval algorithm** that works great when users search with exact terms. It builds on **TF-IDF**, but adds term-frequency saturation and document-length normalization, so long or repetitive documents are not unfairly favored.
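
The scoring idea can be sketched in plain Python. This is a minimal illustration of the Okapi BM25 formula, not the `rank_bm25` implementation: the values `k1=1.5` and `b=0.75` are common defaults assumed here, the IDF uses a non-negative variant, and the library's exact IDF handling may differ, so the numbers will not match it exactly. The function name `bm25_scores` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher IDF weight (non-negative variant)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized via b
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            # Saturation: repeated occurrences add less and less
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    "climate change is caused by carbon emissions".split(),
    "carbon dioxide leads to global warming".split(),
    "the weather was sunny today".split(),
]
scores = bm25_scores("carbon emissions".split(), docs)
print(scores)
```

The first document matches both query terms and scores highest, the second matches only `carbon`, and the third matches nothing and scores zero.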



### 🧠 What’s Happening Here?

- **Purpose**: We’re building a **BM25 index** from the previously split and cleaned document chunks.
- **Input**: The function expects a list of `Document` objects (already chunked and cleaned).
- **Tokenization**: For each document, we split the text into words using simple whitespace-based tokenization.
- **Indexing**: `BM25Okapi` (from the `rank_bm25` library) is used to build the index over these tokenized lists.
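
Whitespace tokenization keeps case and punctuation, which can silently prevent matches. The sketch below compares `str.split()` with a hypothetical `normalize_tokens` helper (an assumption for illustration, not part of this notebook's pipeline). The regex is deliberately simple and would need adjusting for text with non-ASCII terms.

```python
import re


def normalize_tokens(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", text.lower())


doc = "Carbon dioxide leads to global warming."
query = "carbon warming"

whitespace_doc = doc.split()            # keeps 'Carbon' and 'warming.' as-is
normalized_doc = normalize_tokens(doc)  # lowercased, punctuation removed

# Collect the query tokens that literally appear among the document tokens
whitespace_hits = [t for t in query.split() if t in whitespace_doc]
normalized_hits = [t for t in normalize_tokens(query) if t in normalized_doc]

print(whitespace_hits)
print(normalized_hits)
```

If a normalizing tokenizer is used for indexing, the same tokenizer has to be applied to the query, otherwise the two sides will not line up.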



### ⚙️ Under the Hood

Let’s say we have two documents:

- Doc 1: `"Climate change is caused by carbon emissions."`
- Doc 2: `"Carbon dioxide leads to global warming."`

After tokenization:

```python
[
  ["Climate", "change", "is", "caused", "by", "carbon", "emissions."],
  ["Carbon", "dioxide", "leads", "to", "global", "warming."]
]
```

When someone searches for "carbon emissions", BM25 scores each document based on how often the query terms appear in it and how rare those terms are across the corpus (via IDF weighting). The result is a ranked list of documents most likely to answer the query. Note that whitespace tokenization is case-sensitive and keeps punctuation attached, so here the query token `carbon` matches Doc 1 but not `Carbon` in Doc 2, and `emissions` does not match `emissions.`.

BM25 complements semantic search by surfacing documents with strong lexical overlap, even if their meaning isn’t captured well by embeddings. This sets us up to later fuse both scores for better results.



In [None]:
from rank_bm25 import BM25Okapi
from typing import List
from langchain_core.documents import Document

def create_bm25_index(documents: List[Document]) -> BM25Okapi:
    """
    Creates a BM25 index over the cleaned document chunks.

    Args:
        documents (List[Document]): Chunks of text wrapped as LangChain Document objects.

    Returns:
        BM25Okapi: A keyword-aware index that can score chunks based on query relevance.
    """

    # Tokenize each document using simple whitespace split.
    # More advanced NLP tokenization (e.g., stemming, stopword removal) could improve results.
    tokenized_docs = [doc.page_content.split() for doc in documents]

    # Create the BM25 index using the tokenized text
    bm25_index = BM25Okapi(tokenized_docs)

    return bm25_index


In [None]:
bm25 = create_bm25_index(cleaned_texts) # Create BM25 index from the cleaned texts (chunks)

## ♾️ Fusion-Based Retrieval | Combine Semantic + Keyword Matching | RAG100X

In this step, we build a **hybrid retriever** that merges two powerful techniques:

- 🧠 **Vector Search** — captures semantic similarity based on meaning
- 🔍 **BM25 Keyword Search** — captures exact word overlap

Instead of choosing one or the other, we **combine the scores** of both approaches using **score-level fusion**. This balances precision (exact match) and recall (semantic understanding), leading to more accurate retrieval.


### 🧠 Why Combine BM25 and Vector Search?

Let’s say the user query is:

> "causes of global warming"

- **BM25** will surface documents that literally contain “causes”, “global”, and “warming”.
- **Vector Search** might find:
  - “burning fossil fuels has increased Earth’s temperature”
  - “CO₂ emissions driving climate change”

While the second set might not contain the original words, they’re clearly **semantically aligned**.

By **fusing** both, we ensure the retrieved chunks:
- Are *semantically relevant*
- And *lexically grounded*

This makes the retrieval much more reliable for downstream LLMs.


### ⚙️ Step-by-Step Breakdown

#### 1. **Get All Documents**
We collect every indexed chunk from the vector store, not just the top-k, because each chunk needs both a BM25 score and a vector score before the two can be compared and fused.

#### 2. **BM25 Scoring**
We compute keyword-based scores using BM25, which combines term frequency, inverse document frequency, and document length.

#### 3. **Vector Scoring**
We run a semantic search using FAISS and get a distance for each chunk (lower is better).

#### 4. **Score Normalization**
Since BM25 scores and FAISS distances live on very different scales, we min-max normalize both to the range `[0, 1]`:
- BM25 scores are already "higher = better", so they are only rescaled.
- FAISS distances are "lower = better", so after rescaling we flip them with `1 - normed`.

We also add a small epsilon to the denominator to prevent divide-by-zero errors when all scores are equal.
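
A minimal sketch of this normalization in plain Python, assuming made-up score values (they are illustrative, not real model output). The helper name `min_max` is hypothetical.

```python
def min_max(values, epsilon=1e-8):
    """Rescale values to [0, 1]; epsilon avoids dividing by zero when all values are equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + epsilon) for v in values]


# Illustrative numbers for three documents
bm25_raw = [0.0, 2.4, 7.1]          # higher = better
faiss_distances = [0.9, 0.3, 0.5]   # lower = better

bm25_norm = min_max(bm25_raw)
vector_norm = [1 - v for v in min_max(faiss_distances)]  # flip so higher = better

print([round(v, 3) for v in bm25_norm])
print([round(v, 3) for v in vector_norm])
```

After the flip, the document with the smallest distance gets the highest vector score, so both lists can be read the same way.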

#### 5. **Score Fusion**
We blend the two normalized scores using a weighted average:

- `final_score = alpha * vector_score + (1 - alpha) * bm25_score`

- `alpha = 1.0` → only vector search
- `alpha = 0.0` → only BM25
- `alpha = 0.5` → equal weight

#### 6. **Rank and Return**
We sort documents by this combined score and return the top `k` most relevant ones.
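
The effect of `alpha` is easiest to see on a toy example. The sketch below assumes both score lists are already normalized and aligned, meaning position `i` refers to the same document in each list. The helper `fuse_and_rank` and the scores are hypothetical.

```python
def fuse_and_rank(vector_scores, bm25_scores, alpha=0.5, k=2):
    """Blend two normalized score lists and return the indices of the top-k documents."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    combined = [alpha * v + (1 - alpha) * b for v, b in zip(vector_scores, bm25_scores)]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]


# Illustrative, already-normalized scores (index = document position)
vector_scores = [0.0, 1.0, 0.7]
bm25_scores = [1.0, 0.1, 0.6]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, fuse_and_rank(vector_scores, bm25_scores, alpha=alpha))
```

With `alpha=0.0` the BM25 favourite (document 0) wins, with `alpha=1.0` the vector favourite (document 1) wins, and with `alpha=0.5` document 2 comes first because it scores reasonably well on both signals.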

### ✅ Benefits of Fusion Retrieval

- ✅ Balances *meaning* and *exact matches*
- ✅ Handles synonyms, paraphrasing, and misspellings better than BM25 alone
- ✅ Improves precision by grounding abstract matches with lexical overlap


In [None]:
import numpy as np
from typing import List
from rank_bm25 import BM25Okapi
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

def fusion_retrieval(vectorstore, bm25, query: str, k: int = 5, alpha: float = 0.5) -> List[Document]:
    """
    Combines semantic (vector-based) and lexical (BM25-based) retrieval scores,
    normalizes them, and returns the top-k documents ranked by the fused score.
    
    Args:
        vectorstore: FAISS vector index for semantic similarity.
        bm25: BM25 index for keyword matching.
        query: The user's input query.
        k: Number of top documents to return.
        alpha: Weight for semantic score in fusion (between 0 and 1).
        
    Returns:
        List[Document]: Top-k most relevant documents based on fused ranking.
    """
    
    epsilon = 1e-8  # To avoid division-by-zero during normalization

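    # Step 1: Collect all documents in FAISS index order.
    # This assumes the BM25 index was built from the same chunks in the same order
    # (true here: both come from `cleaned_texts`), so position i is the same chunk in both.
    all_docs = [
        vectorstore.docstore.search(vectorstore.index_to_docstore_id[i])
        for i in range(vectorstore.index.ntotal)
    ]

    # Step 2: Compute BM25 keyword scores for all documents (corpus order)
    bm25_scores = np.asarray(bm25.get_scores(query.split()))

    # Step 3: Perform vector similarity search and get FAISS distances (lower = better)
    vector_results = vectorstore.similarity_search_with_score(query, k=len(all_docs))
    # Results arrive sorted by distance, so map them back to index order before fusing.
    # Keying by text is safe: identical chunks share an embedding and hence a distance.
    distance_by_text = {doc.page_content: score for doc, score in vector_results}
    vector_scores = np.array([distance_by_text[doc.page_content] for doc in all_docs])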

    # Step 4: Normalize both score sets to [0, 1] range
    # Invert vector scores because FAISS returns distances
    vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores) + epsilon)

    # Normalize BM25 scores (already higher = better)
    bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores) + epsilon)

    # Step 5: Combine scores using weighted average
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores

    # Step 6: Sort documents by combined score (high to low)
    sorted_indices = np.argsort(combined_scores)[::-1]

    # Step 7: Return top-k documents
    return [all_docs[i] for i in sorted_indices[:k]]


### Use Case example

In [None]:
def show_context(context):
    """
    Display the contents of the provided context list.

    Args:
        context (list): A list of context items to be displayed.

    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i + 1}:")
        print(c)
        print("\n")


# Query
query = "What are the impacts of climate change on the environment?"

# Perform fusion retrieval
top_docs = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=0.5)
docs_content = [doc.page_content for doc in top_docs]
show_context(docs_content)

---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use Fusion Retrieval in RAG?

Semantic search alone can miss important keywords. Keyword search alone can miss meaning.

**Fusion Retrieval** combines the best of both:
- 🧠 Captures **semantic relevance** using vector embeddings  
- 🔍 Preserves **keyword precision** using BM25 scoring  
- ⚖️ Produces balanced, high-quality document retrieval for LLMs  

---

## 🧠 What’s New in This Version?

This implementation includes:

- 🔄 A **score-level fusion** of FAISS (vector) and BM25 (keyword) retrieval  
- ⚙️ **Score normalization** to align semantic and lexical scales  
- ⚖️ **Weighted blending** of scores with customizable alpha control  
- 📚 End-to-end retrieval pipeline returning the top-K most relevant documents  

---

## 📈 Inferences & Key Takeaways

- ✅ Fusion can improve **recall and precision** over using BM25 or vector search alone (not yet measured in this notebook)  
- 🔍 Helps catch **keyword-sensitive** queries while retaining semantic flexibility  
- ⚖️ Tunable alpha allows dynamic prioritization of meaning vs. match  
- 📦 Easy to integrate with any LangChain retriever plus a BM25 wrapper  

---

## 🚀 What Could Be Added Next?

- 📊 Add evaluation with retrieval metrics (precision, recall, MRR)  
- 🔬 Explore rank fusion techniques (Reciprocal Rank Fusion, Borda Count)  
- 🧪 Integrate rerankers post-fusion to boost top-K accuracy  
- 🌐 Extend to multi-query or multi-modal fusion (e.g., images + text)  
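
As a starting point for the rank fusion idea, here is a minimal sketch of Reciprocal Rank Fusion. It is not part of this notebook's pipeline: the function name, the document ids, and the two rankings are hypothetical, and `k=60` is a commonly used constant assumed here. Because RRF works on ranks rather than raw scores, it needs no score normalization.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids; each list is ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank); higher ranks contribute more
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from the two retrievers
vector_ranking = ["doc_b", "doc_c", "doc_a"]
bm25_ranking = ["doc_a", "doc_c", "doc_d"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever.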

---
## 💡 Final Word

This notebook is part of my larger personal project: **RAG100X**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.