# **Module 3: Improving Precision with a Re-Ranker**

**Objective:**
In this module, we will solve the **precision failure** identified at the end of Module 2. We will add a **Re-Ranker** to our pipeline so that the most relevant document chunks are ranked first and the LLM is no longer confused by noisy context.

**Recap of Module 2's Problem**
Our powerful Hybrid Search retriever successfully found the correct table containing the answer to our query about "tax withholding." However, it passed the entire dense table to the LLM as context. The LLM then got confused and extracted a value from the wrong year. This is a classic precision failure.

**Core Concept: Re-Ranking with a Cross-Encoder**
A Re-ranker works *after* the initial retrieval stage. It takes a list of candidate documents and uses a more powerful model to re-order them based on a deeper understanding of their relevance to the query.

  * **Bi-Encoder (Our Retriever):** Processes the query and documents separately to create vectors. It's very fast and great for finding a broad set of relevant candidates (good recall).
  * **Cross-Encoder (Our Re-Ranker):** Processes the query and *each* candidate document *together*. This allows it to understand the fine-grained interaction and nuance between them, making it excellent at identifying the single best answer (high precision).

We will add a Cross-Encoder to our pipeline to intelligently filter the results from our hybrid search before they reach the LLM.
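
To make the difference in call shape concrete, here is a toy sketch in plain Python. It is not how either model works internally: `embed`, `bi_encoder_score`, and `cross_encoder_score` are hypothetical stand-ins, with a bag-of-words vector in place of a neural embedding and a word-pair overlap in place of a transformer. The only point is that the first scorer compares two independently built vectors, while the second reads the query and the document together.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy encoder: a bag-of-words count vector (stand-in for a neural embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bi_encoder_score(query: str, doc_vector: Counter) -> float:
    """Query and document are encoded separately; doc_vector can be precomputed."""
    return cosine(embed(query), doc_vector)


def cross_encoder_score(query: str, doc: str) -> float:
    """Toy joint scorer: reads both texts at once and rewards query word pairs
    that appear adjacent and in the same order in the document."""
    q_words = query.lower().split()
    d_words = doc.lower().split()
    query_pairs = list(zip(q_words, q_words[1:]))
    doc_pairs = set(zip(d_words, d_words[1:]))
    if not query_pairs:
        return 0.0
    return sum(pair in doc_pairs for pair in query_pairs) / len(query_pairs)


query = "tax withholding april 2025"
doc_a = "tax withholding for april 2025 was reported"
doc_b = "april tax 2025 reported withholding was for"  # same words, scrambled

# Identical bags of words, so the separately encoded vectors cannot tell them apart.
print(bi_encoder_score(query, embed(doc_a)), bi_encoder_score(query, embed(doc_b)))
# The joint scorer sees word order and prefers doc_a.
print(cross_encoder_score(query, doc_a), cross_encoder_score(query, doc_b))
```

Because the document vector never sees the query, it can be computed once at indexing time, which is what makes the first stage fast. The joint scorer has to run once per (query, document) pair, which is why it is reserved for a short candidate list.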

### Learning Objectives

By the end of this module, you will be able to:

  * **Explain the role of a Re-ranker** in a RAG pipeline and how it improves precision.
  * **Understand the architectural difference** between Bi-Encoder retrievers and Cross-Encoder re-rankers.
  * **Implement a Re-ranking stage** using a powerful open-source Cross-Encoder model.
  * **Integrate the Re-ranker** into your existing RAG chain.
  * **Validate that the Re-ranker fixes** the specific precision failure we identified in Module 2.

-----

#### **Step 1: Install Dependencies**

The dependencies are the same as in the previous module.

In [None]:
!pip install -q langchain langchain-community langchain-groq langchain_huggingface qdrant-client pypdf fastembed

In [None]:
import warnings
warnings.filterwarnings('ignore')

-----

#### **Step 2: Setup (API Key, Document Loading, and Qdrant Population)**

This cell contains all the setup code from Module 2. It will load the document, create the dense and sparse embeddings, and populate our in-memory Qdrant collection. We will process the **full document** this time to ensure our system is robust.

In [None]:
import os
from google.colab import userdata
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models
#from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from fastembed import SparseTextEmbedding
from tqdm.auto import tqdm

# --- 1. Setup API Key ---
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

# --- 2. Load and Split Document ---
pdf_path = "./NVIDIA-Q1-FY26-Financial-Results.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
print(f"Document loaded and split into {len(docs)} chunks.")

# --- 3. Initialize Qdrant Client and Collection ---

# Initialize an in-memory Qdrant client
client = QdrantClient(location=":memory:")
collection_name = "rag_foundations_qdrant_hybrid"

# Create collection
client.recreate_collection(
    collection_name=collection_name,
    vectors_config={
        "dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)
    },
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(
                on_disk=False
            )
        )
    }
)

print("Collection created successfully.")

# --- 4. Initialize Embedding Models ---
dense_embed_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
sparse_embed_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
print("Embedding models initialized.")

# --- 5. Embed and Upsert Full Document ---
points_to_upsert = []
for i, doc in enumerate(tqdm(docs, desc="Processing and Upserting All Docs")):
    doc_text = doc.page_content
    dense_vec = dense_embed_model.embed_query(doc_text)
    sparse_vec = list(sparse_embed_model.embed([doc_text]))[0]
    points_to_upsert.append(
        models.PointStruct(
            id=i,
            payload={"text": doc_text, **doc.metadata},
            vector={
                "dense": dense_vec,
                "text-sparse": models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist()
                ),
            },
        )
    )

client.upsert(
    collection_name=collection_name,
    points=points_to_upsert,
    wait=True
)
print(f"Successfully embedded and upserted all {len(docs)} documents.")

Document loaded and split into 191 chunks.
Collection created successfully.


Embedding models initialized.


Successfully embedded and upserted all 191 documents.


-----

#### **Step 3: Initialize the Re-Ranker**

Now, we introduce our new component. We will load the `BAAI/bge-reranker-base` Cross-Encoder model through the `sentence-transformers` library. This model is specifically trained to predict a relevance score for a query-document pair.

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder

# Initialize the CrossEncoder model
# This model is specifically trained for re-ranking tasks.
cross_encoder = CrossEncoder('BAAI/bge-reranker-base')

print("Re-ranker model initialized successfully.")

Re-ranker model initialized successfully.


-----

#### **Step 4: Build the RAG Chain with Re-Ranking**

This is the core of our upgrade. We will create a new retrieval function that first uses our hybrid search to get a broad set of candidate documents and then uses our new `cross_encoder` to re-rank them for precision.

In [None]:
from langchain_groq import ChatGroq
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# Initialize the Groq LLM
llm = ChatGroq(temperature=0, model_name="meta-llama/llama-4-scout-17b-16e-instruct")


# --- New Retrieval Function with Re-Ranking ---

# A helper function to format and print documents
def pretty_print_docs(docs):
    for i, doc in enumerate(docs):
        # The 'source' metadata key comes from PyPDFLoader
        source = doc.metadata.get('source', 'Unknown Source')
        page = doc.metadata.get('page', 'Unknown Page')

        # Check if the doc object is a Qdrant ScoredPoint or a LangChain Document
        score = ""
        if hasattr(doc, 'score') and doc.score is not None:
            score = f" | Score: {doc.score:.4f}"

        print(f"  [{i+1}] Source: {source} (Page: {page}){score}")
        # Print the first 120 characters of the content
        print(f"      Content: '{doc.page_content[:120]}...'")
    print()

def rerank_and_retrieve_with_prints(query: str) -> str:
    """
    This function performs and visualizes a two-stage retrieval process:
    1. Initial hybrid search to get candidate documents.
    2. Re-ranking with a Cross-Encoder to find the most precise documents.
    """
    # === Stage 1: Initial Retrieval (Casting a Wider Net) ===
    print("--- 1. Performing Initial Hybrid Search ---")
    top_k_retrieval = 10

    dense_query_vec = dense_embed_model.embed_query(query)
    sparse_query_vec = list(sparse_embed_model.embed([query]))[0]

    dense_results = client.search(
        collection_name=collection_name,
        query_vector=models.NamedVector(name="dense", vector=dense_query_vec),
        limit=top_k_retrieval,
        with_payload=True
    )

    sparse_results = client.search(
        collection_name=collection_name,
        query_vector=models.NamedSparseVector(
            name="text-sparse",
            vector=models.SparseVector(
                indices=sparse_query_vec.indices.tolist(),
                values=sparse_query_vec.values.tolist()
            )
        ),
        limit=top_k_retrieval,
        with_payload=True
    )

    # Combine and deduplicate initial results
    seen_ids = set()
    candidate_docs_lc = [] # LangChain Document objects
    candidate_docs_text = [] # Just the text for the re-ranker

    # We use a helper to create LangChain documents from Qdrant results
    all_results = dense_results + sparse_results
    for result in all_results:
        if result.id not in seen_ids:
            doc = Document(
                page_content=result.payload.get('text', ''),
                metadata={k: v for k, v in result.payload.items() if k != 'text'}
            )
            candidate_docs_lc.append(doc)
            candidate_docs_text.append(result.payload['text'])
            seen_ids.add(result.id)

    print(f"\nFound {len(candidate_docs_lc)} unique candidate documents from Hybrid Search:")
    pretty_print_docs(candidate_docs_lc)

    # === Stage 2: Re-Ranking for Precision ===
    print("--- 2. Applying Cross-Encoder to Re-Rank for Precision ---")
    rerank_pairs = [[query, doc] for doc in candidate_docs_text]
    rerank_scores = cross_encoder.predict(rerank_pairs)

    # Combine documents with their new scores
    doc_with_scores = list(zip(candidate_docs_lc, rerank_scores))
    # Sort documents by their new re-ranker score in descending order
    sorted_docs = sorted(doc_with_scores, key=lambda x: x[1], reverse=True)

    # === 3. Select Top-K Re-Ranked Documents for LLM Context ===
    top_k_rerank = 3
    final_docs = [doc[0] for doc in sorted_docs[:top_k_rerank]]

    print(f"\nPassing the TOP {top_k_rerank} Re-Ranked documents to the LLM:")
    pretty_print_docs(final_docs)

    final_context = "\n---\n".join([doc.page_content for doc in final_docs])

    return final_context

# --- Build the RAG Chain ---
prompt_template = "Answer the question based only on the following context:\n\nContext:\n{context}\n\nQuestion: {question}"
prompt = ChatPromptTemplate.from_template(prompt_template)

rag_chain_with_reranker = (
    # Use the new function with prints
    {"context": RunnablePassthrough() | (lambda q: rerank_and_retrieve_with_prints(q)), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("RAG chain with Re-Ranker (and printing) initialized successfully.")

RAG chain with Re-Ranker (and printing) initialized successfully.
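
Stripped of the Qdrant calls and the printing, the retrieval function above reduces to two small steps: merge the hits while dropping duplicate ids, then score every (query, text) pair and keep the best few. The sketch below shows that skeleton in plain Python so it can be run without any model. `merge_unique`, `rerank`, and `word_overlap_scores` are hypothetical helpers, and the sample hits are made up; in the real pipeline the scorer would be `cross_encoder.predict`.

```python
from typing import Callable, Hashable, Iterable, Sequence

Hit = tuple[Hashable, str]  # (point id, chunk text)


def merge_unique(*result_lists: Iterable[Hit]) -> list[Hit]:
    """Concatenate hits from several searches, keeping the first occurrence of each id."""
    seen: set = set()
    merged: list[Hit] = []
    for results in result_lists:
        for doc_id, text in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged


def rerank(
    query: str,
    candidates: Sequence[Hit],
    score_pairs: Callable[[list[list[str]]], Sequence[float]],
    top_k: int = 3,
) -> list[Hit]:
    """Score each (query, text) pair and return the top_k highest-scoring candidates."""
    pairs = [[query, text] for _, text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]


def word_overlap_scores(pairs: list[list[str]]) -> list[float]:
    """Stand-in scorer (NOT a cross-encoder): number of words shared by query and text."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


# Made-up hits for illustration only.
dense_hits = [(1, "revenue grew in data center"),
              (2, "tax withholding related to common stock")]
sparse_hits = [(2, "tax withholding related to common stock"),
               (3, "stock-based compensation expense")]

candidates = merge_unique(dense_hits, sparse_hits)
print([doc_id for doc_id, _ in candidates])
print(rerank("tax withholding common stock", candidates, word_overlap_scores, top_k=1))
```

Passing the scorer in as an argument keeps the ranking logic testable on its own and makes it easy to swap in a different re-ranker later.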


-----

#### **Step 5: Validate the Solution**

It's time to test our new, high-precision system. We will run the exact same query that failed in Module 2 and see if the re-ranker fixed the problem.

In [None]:
# Let's run the same query that failed in Module 2 due to a precision error

query = "What was the exact value for Tax withholding related to common stock from stock plans for the period ending April 27, 2025?"

print(f"Query: {query}\n")
answer = rag_chain_with_reranker.invoke(query)
# Add ANSI escape codes for blue color
print(f"\033[94mAnswer: {answer}\033[0m\n")
print("-" * 50)

Query: What was the exact value for Tax withholding related to common stock from stock plans for the period ending April 27, 2025?

--- 1. Performing Initial Hybrid Search ---

Found 12 unique candidate documents from Hybrid Search:
  [1] Source: ./NVIDIA-Q1-FY26-Financial-Results.pdf (Page: 36)
      Content: '(1)     Average price paid per share includes broker commissions, but excludes our liability under the 1% excise tax on ...'
  [2] Source: ./NVIDIA-Q1-FY26-Financial-Results.pdf (Page: 5)
      Content: 'Stock-based compensation — — 1,470 — — 1,470 
Balances as of Apr 27, 2025 24,388 $ 24 $ 11,475 $ 186 $ 72,158 $ 83,843 
...'
  [3] Source: ./NVIDIA-Q1-FY26-Financial-Results.pdf (Page: 36)
      Content: 'incentive program. During the first quarter of fiscal year 2026, we withheld approximately 13 million shares for a total...'
  [4] Source: ./NVIDIA-Q1-FY26-Financial-Results.pdf (Page: 17)
      Content: 'ultimate outcome of these actions will not have a material adverse effect

### **Module 3: Conclusion & Analysis**

After running our query, `"What was the exact value for Tax withholding related to common stock from stock plans for the period ending April 27, 2025?"`, you should see that the system now provides the **correct** answer: **$1,532 million**.

This is a critical result, especially because the Module 2 pipeline (with `k=4`) failed on this exact query, stating the information could not be found. Let's analyze why the Module 3 pipeline succeeded where the previous one failed.
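
Rather than checking the printed answer by eye, the comparison can be made programmatic. The helper below is a minimal sketch: `contains_amount` is a hypothetical name, and it only checks that the expected figure appears as a standalone number in the answer text, ignoring thousands separators. It does not check units or which year the figure belongs to.

```python
import re


def contains_amount(answer: str, expected: str) -> bool:
    """Return True if `expected` appears as a standalone number in `answer`,
    ignoring thousands separators (so "1,532" matches "1532" but not "15,320")."""
    def normalize(number: str) -> str:
        return number.replace(",", "")

    numbers = re.findall(r"\d[\d,]*(?:\.\d+)?", answer)
    return normalize(expected) in {normalize(n) for n in numbers}


# Hypothetical answer string, standing in for the `answer` variable from Step 5.
sample_answer = "The tax withholding related to common stock from stock plans was $1,532 million."
print(contains_amount(sample_answer, "1,532"))
```

In the notebook, the same call would be `contains_amount(answer, "1,532")`, using the `answer` returned by `rag_chain_with_reranker.invoke(query)`.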

### Why it Worked: A Two-Stage Solution to a Two-Part Problem

Our success is not just because of the re-ranker alone. It's because we implemented a more robust, two-stage retrieval strategy that is a common pattern in production-grade systems.

**1. Part One: Solving the Recall Failure (The Wider Net)**

First, we addressed the **Recall Failure** we saw in Module 2. Our original system with `k=4` was too "narrow" and failed to retrieve the correct document chunk from the dense table on page 6.

* In this module, the *first* part of our solution was to increase the initial retrieval size to `k=10`.
* This "casts a wider net," ensuring that even if the correct chunk has a low initial score, it's very likely to be included in our list of candidates. This step is all about maximizing **recall**—making sure the right answer is found in the first place.
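
The effect of `k` on recall can be sketched with a simple hit check. The ranking below is made up for illustration (the chunk ids and the 7th-place position are assumptions, not values from our collection), and `hit_at_k` is a hypothetical helper.

```python
from typing import Hashable, Sequence


def hit_at_k(ranked_ids: Sequence[Hashable], target_id: Hashable, k: int) -> bool:
    """True if the chunk we need is among the first k retrieved ids."""
    return target_id in ranked_ids[:k]


# Hypothetical ranking: the chunk we need (id 130) sits in 7th place.
ranked_ids = [42, 17, 8, 93, 5, 61, 130, 77, 12, 3]

print(hit_at_k(ranked_ids, target_id=130, k=4))   # False: cut off before 7th place
print(hit_at_k(ranked_ids, target_id=130, k=10))  # True: the wider net includes it
```

A larger `k` can only add candidates, so it never lowers recall; the cost is that every extra candidate is potential noise for the next stage.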

**2. Part Two: Solving the Precision Failure (The Intelligent Filter)**

Casting a wider net creates a new, more subtle problem: our candidate list is now much **noisier**. It contains the correct answer, but mixed with many other documents that are only tangentially related. Two searches of 10 results each yield between 10 and 20 unique candidates; the run above produced 12. If we passed this messy list to the LLM, we would be hoping for a "fragile success" and risking a wrong answer.

* This is where the **re-ranker** proves its value. It acts as an intelligent, high-precision filter.
* The `bge-reranker-base` Cross-Encoder analyzed this noisy list of candidates. It performed a deep comparison of the query's specific intent ("exact value," "tax withholding," "April 27, 2025") against each one.
* It scored the chunk containing the specific table row from page 6 as a more precise match than the other candidates and promoted it into the top results.
* By passing only the top 3 re-ranked chunks, we provided clean, unambiguous context to the LLM, allowing it to easily extract the correct number.


### You might ask: "Aren't we just fixing the problem by using a larger k?"

The answer is: **Yes, but that's only the first half of the solution.** This is a deliberate choice to demonstrate the two distinct problems every advanced RAG system must solve:

**The Recall Problem**: In Module 2, our system suffered from a Recall Failure. The search was too narrow (k=4) and couldn't find the correct document chunk in the first place. The first step to building a robust system is to solve this by casting a wider net (k=10), ensuring the correct information is almost certainly in our initial candidate pool.

**The Precision Problem**: Casting a wider net solves the recall issue, but it creates a new problem: our candidate list is now much larger and noisier. If we were to pass this entire messy list to the LLM, we would be creating a Precision Failure, where the LLM gets confused and likely provides the wrong answer.

This module is designed to solve the second, more subtle problem. We first solved for recall by increasing k, and then we implemented a Re-Ranker as an "intelligent filter" to provide the high precision needed for a trustworthy and reliable answer. This two-stage "wide net, then precise filter" approach is the core architectural pattern we learnt here.


### Key Takeaway: Engineering a Robust RAG System

The key lesson is that building a trustworthy RAG system requires engineering a pipeline that solves for both recall and precision.

* **The Retriever (`k=10`)** is our fast, high-recall component. Its job is to cast a wide net and make sure the answer is on the table.
* **The Re-ranker** is our slow, high-precision component. Its job is to inspect everything the retriever found and surface the best pieces of evidence (the top 3 in our pipeline).
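
A rough count of model forward passes per query shows why the slow component is only applied to the candidates. The sketch below uses the numbers from this notebook (191 indexed chunks, 12 unique candidates in the Step 5 run). It treats every pass as equally expensive, which is a simplifying assumption, and `passes_per_query` is a hypothetical helper.

```python
def passes_per_query(num_scored_docs: int, query_encodes: int = 0) -> int:
    """Crude cost model: forward passes needed to answer one query."""
    return query_encodes + num_scored_docs


corpus_chunks = 191  # chunks indexed in Step 2
candidates = 12      # unique candidates found in the Step 5 run

# Cross-encode every chunk: one pass per (query, chunk) pair.
rerank_everything = passes_per_query(corpus_chunks)

# Two-stage: encode the query once per retriever (dense + sparse),
# then cross-encode only the candidates.
two_stage = passes_per_query(candidates, query_encodes=2)

print(rerank_everything, two_stage)  # 191 14
```

The gap widens as the corpus grows: the first number scales with the number of indexed chunks, while the second depends only on how many candidates the retriever returns.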

This two-stage process is the foundation of building systems that are not just powerful, but also **reliable and trustworthy**. We are no longer just hoping the LLM can figure it out; we are engineering the best possible conditions for it to succeed.

**Next Up:** Now that we have built a powerful and precise RAG pipeline, how do we prove it? In **Module 4**, we will learn how to **evaluate our system quantitatively** using the RAGAS framework to measure its performance on key metrics like faithfulness and relevancy.