## 🔁 Reranking for Smarter Retrieval | RAG100X

This notebook implements **Reranking** — a powerful technique that refines the output of initial retrieval by **reassessing document relevance** using more sophisticated models like **LLMs** or **Cross-Encoders**.

Instead of blindly trusting the first batch of documents retrieved via vector similarity, reranking helps prioritize **truly relevant** content — filtering out noise and boosting the overall quality of inputs sent to the LLM.

---

### ✅ What You’ll Learn

- Why initial vector retrieval may return weak or generic matches  
- How to apply **LLM-based scoring** to rate document relevance  
- How to use **Cross-Encoder models** for fine-grained query-doc comparisons  
- How to create custom retrievers that plug reranking into LangChain workflows  
- When and why reranking improves final answer quality in RAG systems  

---

### 🔍 Real-world Analogy

Imagine you ask a librarian for books on "climate change impacts on biodiversity":

> 🔎 The librarian gives you 15 random books about climate, nature, ecosystems…  
> 🧠 You skim each one and pick the **3 most useful** for your question  

✅ That’s reranking — **you refine the initial list based on deeper understanding**.

---

### 🧠 How Reranking Works Under the Hood

| Step                 | What Happens                                                                 |
|----------------------|------------------------------------------------------------------------------|
| 1. Initial Retrieval | Use a vector store to fetch top-K relevant chunks                            |
| 2. Pairing           | Form (query, document) pairs for relevance assessment                        |
| 3. Scoring           | Use LLM or Cross-Encoder to assign relevance scores to each pair             |
| 4. Sorting           | Rank documents by score in descending order                                  |
| 5. Selection         | Keep only the top-N most relevant documents                                  |

💡 This ensures that only the **most meaningful context** reaches the LLM for generation.
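
The five steps in the table can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `toy_score` is a hypothetical word-overlap scorer standing in for an LLM or Cross-Encoder, and the initial retrieval step is stubbed with a hard-coded candidate list.

```python
from typing import Callable, List, Tuple


def toy_score(query: str, document: str) -> float:
    """Hypothetical scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Steps 2-5: pair, score, sort in descending order, keep the top N."""
    pairs = [(query, doc) for doc in candidates]   # Step 2: pairing
    scores = [score_fn(q, d) for q, d in pairs]    # Step 3: scoring
    ranked = sorted(                               # Step 4: sorting
        zip(candidates, scores), key=lambda item: item[1], reverse=True
    )
    return ranked[:top_n]                          # Step 5: selection


# Step 1 (initial retrieval) is stubbed with a fixed list for illustration.
candidates = [
    "The weather in France is unpredictable in spring.",
    "Climate change drives habitat loss and threatens biodiversity.",
    "Biodiversity surveys count species in a region.",
]
top = rerank("climate change biodiversity", candidates, toy_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Swapping `toy_score` for a stronger scorer changes only step 3; the surrounding pipeline stays the same.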

---

### 🚀 Why Reranking Makes a Difference

- 🧠 **Deeper understanding**: Goes beyond cosine similarity to semantic alignment  
- 🔍 **Fine-grained filtering**: Discards off-topic or tangential results  
- ⚙️ **Model flexibility**: Works with both general LLMs and pretrained relevance models  
- 🎯 **Precision boost**: Especially helpful for nuanced, domain-heavy queries  

---

### 🏗️ Why This Matters in Production

Basic vector search can rank an irrelevant chunk highly simply because it shares a broad topic with the query and therefore sits nearby in embedding space:

> Query: “What are the biodiversity threats from climate change?”  
> Initial result: “The weather in France is unpredictable in spring.” ❌  

Reranking ensures:

✅ “Warming leads to habitat shifts and species loss in fragile ecosystems.”  
gets prioritized — **leading to better answers**.

This is critical for:

- Knowledge-dense domains (e.g. healthcare, policy, research)  
- Use cases where precision matters (e.g. chatbots, summarizers, assistants)  
- Scenarios with large corpora and high retrieval noise  

---

### 🔄 Where This Fits in RAG100X

In earlier projects, you’ve built:

1. Basic vector retrieval from various sources  
2. Chunking and compression enhancements  
3. Retrieval enrichment via query reformulation, CEW, HyDE, etc.  

Now, in **Day 15**, you take it further:

> 💡 **Don’t just retrieve — rethink what you keep.**  
> Reranking boosts quality at the final mile of retrieval.


## 📦 Installation & Setup

In [None]:

# Install required packages
!pip install langchain langchain-openai langchain-community python-dotenv sentence-transformers pypdf faiss-cpu


import os
import sys
from dotenv import load_dotenv
from langchain.docstore.document import Document
from typing import List, Dict, Any, Tuple
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.retrievers import BaseRetriever
from sentence_transformers import CrossEncoder


# Load environment variables from a .env file
load_dotenv()

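# load_dotenv() has already copied the key into os.environ if it is present in .env.
# Fail early with a clear message when it is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY is not set; add it to your .env file or environment")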

### Define the document's path

In [None]:

path = "data/Understanding_Climate_Change.pdf"

### Create a vector store

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


# Cleaning the document

def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and vector store
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore


vectorstore = encode_pdf(path)

### 🧠 Building a Custom LLM-Based Reranker

In this section, we implement a **custom reranking function** using a **Large Language Model (LLM)**.  
The idea is simple: rather than trusting the initial similarity-based retrieval, we ask an LLM to **rate the relevance** of each document **on a scale of 1 to 10** — then sort and select the top ones.


### 💡 Why Use an LLM to Score Relevance?

Traditional retrieval systems (like FAISS) rank documents based on **vector similarity**.  
But this doesn’t always capture the **true meaning or intent** of a query. For example:

> Query: *"What are the threats to biodiversity due to climate change?"*  
> Vector result: *"Climate policies affect economic behavior..."* ❌ (related but not relevant)

By contrast, an LLM can **understand context**, **reason about meaning**, and score based on **true relevance**.
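
To make "vector similarity" concrete, here is a minimal sketch of cosine similarity, one common measure. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the FAISS store in this notebook may use a different distance such as L2.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only.
query_vec = [1.0, 0.0, 1.0]
policy_doc_vec = [1.0, 0.2, 0.9]     # same broad topic as the query
unrelated_doc_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

policy_sim = cosine_similarity(query_vec, policy_doc_vec)
unrelated_sim = cosine_similarity(query_vec, unrelated_doc_vec)
print(f"policy doc: {policy_sim:.3f}, unrelated doc: {unrelated_sim:.3f}")
```

A high score here only says the two vectors point in a similar direction. It says nothing about whether the document actually answers the question, which is the gap a reranker is meant to close.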

### ⚙️ How This Reranker Works Step-by-Step

| Step | What Happens |
|------|---------------|
| 1️⃣   | A prompt template is created to ask the LLM: "Rate how relevant this document is to the query (1–10)" |
| 2️⃣   | For every document in the initial results, a query-doc pair is fed to the LLM |
| 3️⃣   | The LLM returns a structured `relevance_score` |
| 4️⃣   | All scores are collected and sorted from high to low |
| 5️⃣   | The top N most relevant documents are returned |

This method turns the LLM into a **relevance judge** — helping your system choose the most meaningful evidence before generating answers.
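
The prompt asks for a score from 1 to 10, but a plain `float` field does not enforce that range. The sketch below shows one possible guard; `normalize_score` is a hypothetical helper that is not part of the notebook, and the fallback of 0 simply mirrors the idea of pushing unusable scores to the bottom of the ranking.

```python
import math
from typing import Any


def normalize_score(
    raw: Any, low: float = 1.0, high: float = 10.0, fallback: float = 0.0
) -> float:
    """Coerce a raw model output to a float in [low, high]; use fallback if unusable."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return fallback
    if math.isnan(value):
        return fallback
    # Clamp out-of-range values instead of discarding them.
    return max(low, min(high, value))


print(normalize_score("7"))    # numeric string is accepted
print(normalize_score(42))     # clamped to the upper bound
print(normalize_score("n/a"))  # unusable value falls back
```

Because the fallback sits below the valid range, documents with unusable scores sort last without breaking the sort.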

In [None]:
from pydantic import BaseModel, Field
from typing import List
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.docstore.document import Document

# Define the expected structure of the LLM's output (must return a float score)
class RatingScore(BaseModel):
    relevance_score: float = Field(..., description="The relevance score of a document to a query.")

# Main reranking function using an LLM
def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    
    # Define the prompt that asks the LLM to score each document
    prompt_template = PromptTemplate(
        input_variables=["query", "doc"],
        template="""On a scale of 1-10, rate the relevance of the following document to the query. 
        Consider the specific context and intent of the query, not just keyword matches.
        
        Query: {query}
        Document: {doc}
        Relevance Score:"""
    )
    
    # Initialize the LLM (GPT-4o in this case) with structured output
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)
    llm_chain = prompt_template | llm.with_structured_output(RatingScore)
    
    scored_docs = []
    
    # Go through each retrieved document and get a relevance score from the LLM
    for doc in docs:
        input_data = {"query": query, "doc": doc.page_content}
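        # Ask the LLM for a structured score. Parsing and validation failures
        # typically surface as ValueError subclasses, so fall back to 0 when the
        # output is not a usable number.
        try:
            score = float(llm_chain.invoke(input_data).relevance_score)
        except (ValueError, TypeError, AttributeError):
            score = 0.0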
        
        # Store the document with its score
        scored_docs.append((doc, score))
    
    # Sort documents by score in descending order
    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    
    # Return only the top N highest-scoring documents
    return [doc for doc, _ in reranked_docs[:top_n]]

### 🔍 Comparing Initial vs. Reranked Documents — Example Run

Now that we’ve defined our custom reranking function, let’s see it in action using a **real-world query**:

> **Query:** "What are the impacts of climate change on biodiversity?"

We'll compare:

- 🔹 The **top 3 results** from the initial vector similarity search  
- 🔸 The **top reranked documents** after applying the LLM-based scorer

This gives you a clear picture of how reranking boosts retrieval quality by filtering out noise and emphasizing true relevance.


### 🧪 What’s Happening Here?

| Step | What We Do |
|------|------------|
| 1️⃣   | Use `vectorstore.similarity_search()` to retrieve top 15 initial matches |
| 2️⃣   | Pass them to `rerank_documents()` to get relevance scores via the LLM |
| 3️⃣   | Sort and select the top N (default = 3) most relevant results |
| 4️⃣   | Print both sets for visual comparison |
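
Beyond printing both lists, it can help to quantify how far each document moved. The sketch below uses a hypothetical `rank_changes` helper and plain string IDs as stand-ins for documents. With LangChain `Document` objects you would need some stable key, such as `page_content` or a metadata ID (an assumption about your data).

```python
from typing import Dict, List, Optional


def rank_changes(initial: List[str], reranked: List[str]) -> Dict[str, Optional[int]]:
    """Map each reranked item to (old position - new position).

    Positive values mean the item moved up. Items absent from the
    initial list map to None.
    """
    old_position = {item: index for index, item in enumerate(initial)}
    changes: Dict[str, Optional[int]] = {}
    for new_index, item in enumerate(reranked):
        old_index = old_position.get(item)
        changes[item] = None if old_index is None else old_index - new_index
    return changes


initial_ids = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
reranked_ids = ["doc_d", "doc_a", "doc_e"]
changes = rank_changes(initial_ids, reranked_ids)
for doc_id, shift in changes.items():
    print(f"{doc_id}: {shift:+d}")
```

Large positive shifts point to documents that vector search ranked low but the reranker judged highly relevant.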

In [None]:
# Define a real-world query
query = "What are the impacts of climate change on biodiversity?"

# Step 1: Retrieve documents using vector similarity
initial_docs = vectorstore.similarity_search(query, k=15)

# Step 2: Rerank those documents using the LLM-based scorer
reranked_docs = rerank_documents(query, initial_docs)

# Step 3: Print the first few from the original results
print("Top initial documents:")
for i, doc in enumerate(initial_docs[:3]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Show first 200 characters for brevity

# Step 4: Print the reranked results
print(f"\nQuery: {query}\n")
print("Top reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")

### 🧠 Wrapping Reranking into a Custom Retriever

Now that we’ve built and tested our LLM-based reranker, let’s integrate it **seamlessly into a LangChain pipeline** by creating a **custom retriever**.

This allows us to use reranking as a drop-in replacement for any standard retriever — making it easier to plug into `RetrievalQA` or other chains.



### 🧩 Why Build a Custom Retriever?

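LangChain’s retrieval pipeline expects every retriever to expose a document-fetching method; this notebook overrides `get_relevant_documents(query)`. Newer LangChain releases prefer overriding `_get_relevant_documents` and calling the retriever through `invoke`, and they emit a deprecation warning for the older pattern.  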
To use our reranking logic with LangChain, we need to:

1. Start with an initial set of documents from a vector store  
2. Rerank them using our LLM scoring logic  
3. Return the top-N most relevant ones  

💡 So we wrap this logic inside a class that inherits from `BaseRetriever`.


### ⚙️ What Happens Under the Hood?

| Step | What It Does |
|------|---------------|
| 1️⃣   | `CustomRetriever` fetches 30 documents via standard vector search |
| 2️⃣   | It passes them into `rerank_documents()` to get LLM relevance scores |
| 3️⃣   | It returns only the **top N** reranked documents (e.g. top 2 or 3) |
| 4️⃣   | These documents are used by the LLM to generate the final answer |

This setup is now compatible with any LangChain workflow — just like a default retriever.

In [None]:
from langchain_core.retrievers import BaseRetriever
from langchain.chains import RetrievalQA
from typing import Any


# Step 1: Create a custom retriever class that applies reranking
class CustomRetriever(BaseRetriever, BaseModel):
    
    # Pass in a vectorstore (e.g., FAISS or Chroma)
    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True  # Allows storing non-standard types like FAISS

    def get_relevant_documents(self, query: str, num_docs=2) -> List[Document]:
        # Retrieve more documents than we need (for reranking to work well)
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        
        # Apply the reranking logic and return the top N
        return rerank_documents(query, initial_docs, top_n=num_docs)

# Step 2: Instantiate the custom retriever with our vectorstore
custom_retriever = CustomRetriever(vectorstore=vectorstore)

# Step 3: Create the language model (for answer generation, not reranking)
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

# Step 4: Connect everything into a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple chain type that concatenates documents
    retriever=custom_retriever,
    return_source_documents=True
)

### Example query

In [None]:
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document

### 🧠 Method 2: Reranking with Cross-Encoder Models

In this section, we’ll use a **Cross-Encoder model** for reranking retrieved documents — an **alternative to LLM-based scoring**.

Cross-Encoders are especially useful in production setups when:
- You want faster and cheaper reranking (compared to GPT)
- You care about precise similarity scoring
- You don’t need deep reasoning — just **textual relevance**



### 🤖 What Is a Cross-Encoder?

Unlike **bi-encoders** (used in vector search) that encode query and doc separately,  
**Cross-Encoders jointly encode a (query, document) pair** and predict a **relevance score**.

This makes them **slower but more accurate** at matching.

We use the model:  
> `cross-encoder/ms-marco-MiniLM-L-6-v2` — a lightweight and fast option.



### 🧱 Under the Hood: How It Works

| Step | Description |
|------|-------------|
| 1️⃣   | Retrieve top `k` documents using a vector search |
| 2️⃣   | Form `(query, document)` pairs for each result |
| 3️⃣   | Run them through the cross-encoder model |
| 4️⃣   | Rank documents by predicted relevance scores |
| 5️⃣   | Return the top-N documents for the final answer |
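
Depending on the checkpoint's configuration, a cross-encoder may return unbounded raw scores rather than values between 0 and 1; this is an assumption worth checking against your own `predict()` output. The sketch below shows an optional extra step that the notebook does not perform: squashing raw scores with a sigmoid and dropping candidates below a threshold. The helper names, the scores, and the 0.5 cutoff are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def filter_by_score(
    docs: Sequence[str],
    raw_scores: Sequence[float],
    min_probability: float = 0.5,
) -> List[Tuple[str, float]]:
    """Sort docs by score (descending) and drop those below min_probability."""
    scored = [(doc, sigmoid(score)) for doc, score in zip(docs, raw_scores)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(doc, prob) for doc, prob in scored if prob >= min_probability]


# Hypothetical raw scores standing in for cross_encoder.predict(pairs).
docs = ["habitat loss", "spring weather", "species decline"]
raw_scores = [4.2, -6.1, 1.3]
kept = filter_by_score(docs, raw_scores)
for doc, prob in kept:
    print(f"{prob:.3f}  {doc}")
```

The sigmoid is monotonic, so it never changes the ranking. It only makes a fixed cutoff easier to reason about.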

### ⚙️ Why We Only Implemented the Synchronous Retriever

#### 1. `CrossEncoder.predict()` is a Synchronous, CPU/GPU-bound Operation

- The `predict()` method from the 🤗 `sentence-transformers` library is a **blocking**, CPU/GPU-intensive operation.
- It does **not have an async version** like `await cross_encoder.apredict(...)` — such a method doesn’t exist.
- This is because the model runs **locally on your machine** and performs **matrix computations**, not I/O operations (e.g., API calls, file reads, or DB queries).
- As a result, using async would **not provide any performance gain** — it would still block the CPU while executing.

---

#### 2. LangChain Supports Both Sync and Async Retriever Interfaces

LangChain provides two interfaces for custom retrievers:

- `get_relevant_documents(query: str)` → **Synchronous**
- `aget_relevant_documents(query: str)` → **Asynchronous**

In this implementation, we only define the **synchronous version** because:

- Both the **vectorstore** and **CrossEncoder model** are synchronous.
- Async is only helpful when dealing with **I/O-bound tasks**, such as:
  - Calling OpenAI or Groq LLM APIs
  - Querying remote vector databases
  - Fetching data from external APIs or files

---

### 🧠 TL;DR

- We don’t implement `aget_relevant_documents()` because there's **no async benefit** — our reranking pipeline is **fully local and CPU-bound**.
- If needed, you *can* add an async wrapper using `loop.run_in_executor(...)` or `asyncio.to_thread(...)`, but that’s mainly useful in **web server contexts** (like FastAPI apps).
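
An async wrapper of the kind mentioned above can be sketched with the event loop's `run_in_executor` method. Here `blocking_rerank` is a hypothetical stand-in for a cross-encoder call, with a short sleep simulating model latency and a placeholder ordering instead of real relevance scores.

```python
import asyncio
import time
from typing import List


def blocking_rerank(query: str, docs: List[str]) -> List[str]:
    """Stand-in for a blocking reranker such as a cross-encoder predict() call."""
    time.sleep(0.05)              # simulate model latency
    return sorted(docs, key=len)  # placeholder ordering, not a relevance score


async def rerank_async(query: str, docs: List[str]) -> List[str]:
    """Run the blocking reranker in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_rerank, query, docs)


async def main() -> List[str]:
    return await rerank_async("climate", ["bbb", "a", "cc"])


result = asyncio.run(main())
print(result)
```

This does not make reranking faster; it only lets the event loop serve other requests while the model runs. On Python 3.9+, `asyncio.to_thread(blocking_rerank, query, docs)` is a shorter equivalent. Inside a notebook, where an event loop is already running, use `await main()` instead of `asyncio.run(main())`.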


In [None]:
from sentence_transformers import CrossEncoder

# Load a pre-trained cross-encoder model for relevance scoring
# This model takes (query, document) pairs and outputs a score indicating semantic similarity
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


class CrossEncoderRetriever(BaseRetriever, BaseModel):
    """
    A custom retriever that performs initial dense retrieval from a vectorstore,
    then reranks the results using a cross-encoder model.
    """

    vectorstore: Any = Field(description="Vector store backend (e.g., FAISS, Chroma) for initial similarity search")
    cross_encoder: Any = Field(description="Cross-encoder model that scores (query, doc) pairs for reranking")
    
    k: int = Field(default=5, description="Number of documents to retrieve in initial search")
    rerank_top_k: int = Field(default=3, description="Number of top documents to keep after reranking")

    class Config:
        # Allows the use of complex, non-standard types like FAISS or SentenceTransformer
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str) -> List[Document]:
        """
        Main retrieval method:
        - Retrieves top-k documents using vector similarity (bi-encoder)
        - Reranks those documents based on semantic matching using a cross-encoder
        """
        
        # STEP 1: Retrieve initial set of documents using vector similarity
        # Here we use a standard vector search to quickly find approximate matches
        initial_docs = self.vectorstore.similarity_search(query, k=self.k)

        # STEP 2: Create input pairs (query, document) for each result
        # The cross-encoder requires both query and doc as input together to assess their match
        pairs = [[query, doc.page_content] for doc in initial_docs]

        # STEP 3: Score each (query, doc) pair using the cross-encoder
        # The model returns a float score for each pair representing relevance
        scores = self.cross_encoder.predict(pairs)

        # STEP 4: Zip the documents with their scores and sort by score (descending)
        # Higher scores indicate better semantic alignment with the query
        scored_docs = sorted(zip(initial_docs, scores), key=lambda x: x[1], reverse=True)

        # STEP 5: Return top N reranked documents
        # These are the most relevant documents as judged by the cross-encoder
        return [doc for doc, _ in scored_docs[:self.rerank_top_k]]

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        # This retriever currently only supports synchronous execution
        raise NotImplementedError("Async retrieval not implemented")

### 🔧 Creating and Using the CrossEncoderRetriever
Now that we’ve defined the `CrossEncoderRetriever` class, let’s put it into action by:

1. **Instantiating the retriever**
2. **Creating a `RetrievalQA` chain with a GPT-4o LLM**
3. **Running a sample query to see reranked results**

#### Instantiate `CrossEncoderRetriever`

We create an instance of our custom retriever by passing:
- A `vectorstore` for initial retrieval
- A `cross_encoder` model for reranking
- `k=10` — Retrieve 10 documents initially from the vectorstore
- `rerank_top_k=5` — Rerank them and keep only the top 5

This modular design makes our retriever easily pluggable into LangChain pipelines.
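
Since `rerank_top_k` documents are selected from the `k` retrieved ones, the two settings constrain each other. The sketch below is a hypothetical sanity check (`check_rerank_config` is not part of the notebook or of LangChain) that could run before the retriever is constructed.

```python
def check_rerank_config(k: int, rerank_top_k: int) -> None:
    """Raise ValueError for settings the retriever cannot satisfy."""
    if k < 1 or rerank_top_k < 1:
        raise ValueError("k and rerank_top_k must both be at least 1")
    if rerank_top_k > k:
        raise ValueError(
            f"rerank_top_k ({rerank_top_k}) exceeds k ({k}); "
            "only k documents are available to rerank"
        )


check_rerank_config(k=10, rerank_top_k=5)  # valid: rerank 10, keep 5
```

When `rerank_top_k` equals `k`, reranking only reorders the candidates without filtering any out, which may or may not be what you want.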



#### Set Up the LLM (GPT-4o)

We use OpenAI's **GPT-4o** model via the `ChatOpenAI` wrapper with a temperature of 0 for deterministic output.


#### Build the `RetrievalQA` Chain

The chain uses:

- The **CrossEncoderRetriever** for smarter, reranked retrieval
- The **"stuff"** chain type (concatenates all retrieved documents)
- `return_source_documents=True` so we can see what evidence was used


#### Run an Example Query

We use a relevant query:
> *"What are the impacts of climate change on biodiversity?"*

The model uses reranked results to answer and also shows the top documents that influenced its response.

### 🧪 Example Output
This helps verify that the reranking logic meaningfully improves the relevance of context used for LLM generation.


In [None]:
# Instantiate the CrossEncoderRetriever
cross_encoder_retriever = CrossEncoderRetriever(
    vectorstore=vectorstore,               # Initial retrieval using vector similarity
    cross_encoder=cross_encoder,           # Reranker model (cross-encoder)
    k=10,                                  # Retrieve top 10 candidates
    rerank_top_k=5                         # Keep top 5 after reranking
)

# Set up the LLM using GPT-4o (deterministic output with temperature=0)
llm = ChatOpenAI(
    temperature=0,
    model_name="gpt-4o"
)

# Create the RetrievalQA pipeline using LangChain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                        # Concatenates all retrieved docs into one prompt
    retriever=cross_encoder_retriever,         # Our custom reranking retriever
    return_source_documents=True               # Enables access to source documents
)

# Run an example query
query = "What are the impacts of climate change on biodiversity?"
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call

# Print results
print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")

# Display the top reranked source documents
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Show first 200 characters of each


---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use Cross-Encoder Reranking in RAG?

Vector similarity retrieval is fast but imprecise — it only considers embeddings, not the actual query-token alignment.

**Cross-Encoder Reranking** upgrades your RAG pipeline by:
- 🧠 Performing **deep pairwise comparison** between query and each candidate document  
- 📏 Scoring based on **actual token-level attention**, not just vector proximity  
- 🧹 Filtering out semantically close but contextually irrelevant chunks  

---

## 🧠 What’s New in This Version?

This implementation includes:

- 🤖 A custom `CrossEncoderRetriever` compatible with LangChain  
- 🔁 Initial recall with **FAISS vectorstore**, followed by CrossEncoder reranking  
- 🧮 **Local reranking** using `sentence-transformers` without external API calls  
- ⚙️ Configurable `k` and `rerank_top_k` for control over retrieval and precision  

---

## 📈 Inferences & Key Takeaways

- ✅ Cross-encoders significantly **improve retrieval precision** by modeling query-doc interactions  
- 🧠 Best used when you need **fewer but highly relevant chunks**  
- ⚡ Reranking is CPU/GPU-bound and its cost grows with the number of candidates `k`, so it works best when the initial retrieval returns a modest candidate set  
- 🧩 Easily pluggable into LangChain’s `RetrievalQA` pipeline  

---

## 🚀 What Could Be Added Next?

- 📊 Evaluate the impact of reranking using faithfulness and relevancy metrics  
- 🧪 Compare performance vs. embedding-only retrievers like HyDE or HyPE  
- 🔗 Try **Hybrid Fusion + Reranking** for improved recall *and* precision  
- 🌍 Serve CrossEncoder via a lightweight API for use in web apps or agents  

---

## 💡 Final Word

This notebook is part of my larger personal project: **RAG100X**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.