# **Guided Notebook: Module 3 - Improving Precision with a Re-Ranker**

*This notebook is designed for learners to complete. Code sections marked with `YOUR CODE HERE` are to be filled in by the learner.*

-----

### **An Architect's Note: The "Why" Behind Our `k` Values**

Before we dive in, let's address a critical design choice for this module. Astute learners will notice that our Module 2 pipeline (with `k=4`) failed to find the correct document, while the pipeline we're building in this module uses an initial retrieval of `k=10`.

You might ask: "Aren't we just fixing the problem by using a larger `k`?"

The answer is: **Yes, but that's only the first half of the solution.** This is a deliberate choice to demonstrate the two distinct problems every advanced RAG system must solve:

1.  **The Recall Problem:** In Module 2, our system suffered from a Recall Failure. The search was too narrow (`k=4`) and couldn't find the correct document chunk in the first place. The first step to building a robust system is to solve this by casting a wider net (`k=10`), which makes it much more likely that the correct information is in our initial candidate pool.

2.  **The Precision Problem:** Casting a wider net solves the recall issue, but it creates a new problem: our candidate list is now much larger and noisier (up to 20 chunks once the dense and sparse results are merged). If we were to pass this entire messy list to the LLM, we would risk a Precision Failure, where the relevant chunk is buried among irrelevant ones and the LLM may answer from the wrong one.

This module is designed to solve the second, more subtle problem. We will first solve for recall by increasing `k`, and then we will implement a **Re-Ranker** as an "intelligent filter" to provide the high precision needed for a trustworthy and reliable answer. This two-stage "wide net, then precise filter" approach is the core architectural pattern you will learn here.
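
To make the two failure modes concrete, here is a minimal sketch in plain Python. The chunk ids and the ranking are made up for illustration (they are not taken from the NVIDIA document), and `hit_at_k` and `context_precision` are hypothetical helper names, not part of any library used in this notebook.

```python
def hit_at_k(ranked_ids, relevant_id, k):
    """Recall check: is the relevant chunk among the top-k retrieved ids?"""
    return relevant_id in ranked_ids[:k]


def context_precision(context_ids, relevant_ids):
    """Share of the chunks passed to the LLM that are actually relevant."""
    if not context_ids:
        return 0.0
    relevant_count = sum(1 for cid in context_ids if cid in relevant_ids)
    return relevant_count / len(context_ids)


# Hypothetical ranking: chunk 42 holds the answer but only ranks 7th.
ranked_ids = [11, 5, 23, 8, 30, 17, 42, 3, 19, 27]
relevant_id = 42

recall_narrow = hit_at_k(ranked_ids, relevant_id, k=4)   # False: recall failure
recall_wide = hit_at_k(ranked_ids, relevant_id, k=10)    # True: wider net finds it

# Passing all 10 candidates gives a noisy context (1 relevant chunk out of 10).
noisy_precision = context_precision(ranked_ids[:10], {relevant_id})

# A filtered context of 3 chunks that still contains chunk 42 is much cleaner.
filtered_precision = context_precision([42, 5, 23], {relevant_id})

print(recall_narrow, recall_wide, noisy_precision, round(filtered_precision, 2))
```

The wider net fixes recall but lowers precision, and the filter restores it. That filtering role is what the re-ranker plays in this module.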

-----

### **Step 1: Install Dependencies**

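The dependencies are the same as in the previous module. The Cross-Encoder we add in this module ships with `sentence-transformers`, so no new packages are required.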

In [None]:
!pip install -q langchain langchain-community langchain-groq qdrant-client sentence-transformers pypdf fastembed

In [None]:
import warnings
warnings.filterwarnings('ignore')

-----

### **Step 2: Setup (API Key, Document Loading, and Qdrant Population)**

This cell contains all the setup code from Module 2. It will load the document, create the dense and sparse embeddings, and populate our in-memory Qdrant collection. We will process the **full document** this time to ensure our system is robust.

In [None]:
import os
from google.colab import userdata
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from fastembed import SparseTextEmbedding
from tqdm.auto import tqdm

# --- 1. Setup API Key ---
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

# --- 2. Load and Split Document ---
pdf_path = "./NVIDIA-Q1-FY26-Financial-Results.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
print(f"Document loaded and split into {len(docs)} chunks.")

# --- 3. Initialize Qdrant Client and Collection ---
client = QdrantClient(location=":memory:")
collection_name = "rag_foundations_qdrant_hybrid"
client.recreate_collection(
    collection_name=collection_name,
    vectors_config={
        "dense": models.VectorParams(size=1024, distance=models.Distance.DOT)
    },
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(index=models.SparseIndexParams(on_disk=False))
    }
)
print("Qdrant collection created.")

# --- 4. Initialize Embedding Models ---
dense_embed_model = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
sparse_embed_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
print("Embedding models initialized.")

# --- 5. Embed and Upsert Full Document ---
points_to_upsert = []
for i, doc in enumerate(tqdm(docs, desc="Processing and Upserting All Docs")):
    doc_text = doc.page_content
    dense_vec = dense_embed_model.embed_query(doc_text)
    sparse_vec = list(sparse_embed_model.embed([doc_text]))[0]
    points_to_upsert.append(
        models.PointStruct(
            id=i,
            payload={"text": doc_text, **doc.metadata},
            vector={
                "dense": dense_vec,
                "text-sparse": models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist()
                ),
            },
        )
    )

client.upsert(
    collection_name=collection_name,
    points=points_to_upsert,
    wait=True
)
print(f"Successfully embedded and upserted all {len(docs)} documents.")

-----

### **Step 3: Initialize the Re-Ranker**

Now, we introduce our new component. We will load a Cross-Encoder model from the `sentence-transformers` library. Unlike the embedding models from Step 2, which encode the query and each document separately, a Cross-Encoder reads a query and a document together and outputs a single relevance score for that pair.

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder

# YOUR CODE HERE
# Task: Initialize the CrossEncoder model.
# HINT: Use the model name 'BAAI/bge-reranker-base'.
cross_encoder = # ... complete this line


print("Re-ranker model initialized successfully.")

-----

### **Step 4: Build the RAG Chain with Re-Ranking**

This is the core of our upgrade. We will create a new retrieval function that first uses our hybrid search to get a broad set of candidate documents and then uses our new `cross_encoder` to re-rank them for precision.
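
The fusion logic provided in the cell below merges the dense and sparse hits into one candidate list, keeping the first occurrence of each point id. Here is a minimal sketch of that idea, using a hypothetical `Hit` dataclass as a stand-in for a Qdrant search result and a hypothetical `fuse_unique` helper; neither name appears in the notebook itself.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Stand-in for a search result: a point id plus its payload."""
    id: int
    payload: dict


def fuse_unique(dense_hits, sparse_hits):
    """Concatenate both result lists and drop repeated ids, preserving order."""
    seen_ids = set()
    fused = []
    for hit in dense_hits + sparse_hits:
        if hit.id not in seen_ids:
            fused.append(hit)
            seen_ids.add(hit.id)
    return fused


# Made-up results: ids 1 and 3 are returned by both searches.
dense_hits = [Hit(1, {"text": "a"}), Hit(2, {"text": "b"}), Hit(3, {"text": "c"})]
sparse_hits = [Hit(3, {"text": "c"}), Hit(4, {"text": "d"}), Hit(1, {"text": "a"})]

fused = fuse_unique(dense_hits, sparse_hits)
fused_ids = [hit.id for hit in fused]
print(fused_ids)  # [1, 2, 3, 4]
```

Because each search returns up to `top_k_retrieval = 10` hits, the fused pool holds between 10 and 20 unique candidates, depending on how much the two result lists overlap. This simple union does not combine the dense and sparse scores; ordering is left entirely to the re-ranker in Stage 2.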

In [None]:
from langchain_groq import ChatGroq
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# Initialize the Groq LLM (as provided by instructor)
llm = ChatGroq(temperature=0, model_name="meta-llama/llama-4-scout-17b-16e-instruct")

# --- Helper function to visualize the context ---
def pretty_print_docs(docs, title):
    print(f"--- {title} ---")
    print(f"Found {len(docs)} documents.\n")
    for i, doc in enumerate(docs):
        source = doc.metadata.get('source', 'Unknown Source')
        page = doc.metadata.get('page', 'Unknown Page')
        score = f" | Score: {doc.score:.4f}" if hasattr(doc, 'score') and doc.score is not None else ""
        print(f"  [{i+1}] Source: {source} (Page: {page}){score}")
        print(f"      Content: '{doc.page_content[:120]}...'")
    print("-" * 50)

# --- New Retrieval Function with Re-Ranking ---
def rerank_and_retrieve(query: str) -> str:
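    """
    Performs and visualizes a two-stage retrieval process.

    Stage 1 runs a hybrid (dense + sparse) search to collect candidates.
    Stage 2 re-ranks the candidates with the cross-encoder.
    Returns the top re-ranked chunks joined into a single context string.
    """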
    # === Stage 1: Initial Retrieval (Casting a Wider Net) ===
    print("--- 1. Performing Initial Hybrid Search ---")
    top_k_retrieval = 10

    # YOUR CODE HERE (Part 1)
    # Task: Create the dense and sparse vectors for the query.
    dense_query_vec = # ... complete this line
    sparse_query_vec = # ... complete this line

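    # YOUR CODE HERE (Part 2)
    # Task: Perform the dense and sparse searches using the client.search() method.
    # Make sure to use top_k_retrieval for the limit.
    # HINT: The collection uses named vectors ("dense" and "text-sparse"), so wrap
    # each query vector in models.NamedVector / models.NamedSparseVector.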
    dense_results = # ... complete this line
    sparse_results = # ... complete this line

    # The fusion logic is provided for you
    seen_ids = set()
    candidate_docs_lc, candidate_docs_text = [], []
    all_results = dense_results + sparse_results
    for result in all_results:
        if result.id not in seen_ids:
            # Split the payload into the chunk text and the remaining metadata
            text = result.payload.get('text', '')
            metadata = {k: v for k, v in result.payload.items() if k != 'text'}
            doc = Document(page_content=text, metadata=metadata)
            candidate_docs_lc.append(doc)
            candidate_docs_text.append(text)
            seen_ids.add(result.id)
    pretty_print_docs(candidate_docs_lc, "Initial Hybrid Search Candidates")

    # === Stage 2: Re-Ranking for Precision ===
    print("\n--- 2. Applying Cross-Encoder to Re-Rank for Precision ---")
    # YOUR CODE HERE (Part 3)
    # Task: Create the pairs of [query, document_text] for the re-ranker.
    rerank_pairs = # ... complete this line

    # YOUR CODE HERE (Part 4)
    # Task: Get the relevance scores by calling cross_encoder.predict().
    rerank_scores = # ... complete this line

    # The sorting and selection logic is provided for you
    doc_with_scores = list(zip(candidate_docs_lc, rerank_scores))
    sorted_docs = sorted(doc_with_scores, key=lambda x: x[1], reverse=True)
    top_k_rerank = 3
    final_docs = [doc[0] for doc in sorted_docs[:top_k_rerank]]
    pretty_print_docs(final_docs, f"Top {top_k_rerank} Re-Ranked Documents")

    final_context = "\n---\n".join([doc.page_content for doc in final_docs])
    return final_context


# --- Build the RAG Chain (This part is provided for you) ---
prompt_template = "Answer the question based only on the following context:\n\nContext:\n{context}\n\nQuestion: {question}"
prompt = ChatPromptTemplate.from_template(prompt_template)
rag_chain_with_reranker = (
    {"context": RunnablePassthrough() | (lambda q: rerank_and_retrieve(q)), "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)
print("\nRAG chain with Re-Ranker initialized successfully.")

-----

### **Step 5: Validate the Solution**

It's time to test our new, high-precision system. We will run the exact same query that failed in Module 2 and see if the re-ranker fixed the problem.

In [None]:
# This part is provided for you

# The query that failed in Module 2 due to a recall failure
query = "What was the exact value for \"Tax withholding related to common stock from stock plans\" for the period ending April 27, 2025?"

print(f"Query: {query}\n")
answer = rag_chain_with_reranker.invoke(query)
print(f"Answer: {answer}\n")
print("-" * 50)

-----

### **Module 3 Conclusion & Analysis**

After running our query, `"What was the exact value for 'Tax withholding related to common stock from stock plans' for the period ending April 27, 2025?"`, you should see that the system now provides the **correct** answer: **$1,532 million**.

This is a critical result, especially because the Module 2 pipeline (with `k=4`) failed on this exact query. Let's analyze why the Module 3 pipeline succeeded where the previous one failed.

### Why it Worked: A Two-Stage Solution to a Two-Part Problem

Our success is not due to the re-ranker alone. It comes from a more robust, two-stage retrieval strategy that is a common pattern in production-grade systems.

**1. Part One: Solving the Recall Failure (The Wider Net)**
First, we addressed the **Recall Failure** we saw in Module 2. Our original system with `k=4` was too "narrow" and failed to retrieve the correct document chunk from the dense table on page 6. In this module, the *first* part of our solution was to increase the initial retrieval size to `k=10` for both the dense and the sparse search. This "casts a wider net," so that even if the correct chunk has a low initial score, it is much more likely to be included in our list of candidates. This step is all about maximizing **recall**: making sure the right answer is found in the first place.

**2. Part Two: Solving the Precision Failure (The Intelligent Filter)**
Casting a wider net creates a new, more subtle problem: our candidate list is now much **noisier**. If we passed this entire messy list to the LLM, we would be hoping for a "fragile success" and risking a wrong answer.

This is where the **re-ranker** proves its value. It acts as an intelligent, high-precision filter. The `bge-reranker-base` Cross-Encoder read the query together with each candidate chunk and scored how well that chunk matches the query's specific intent. It scored the table row from page 6 as a more precise match than the other chunks and promoted it to the top of the list. By passing only the top 3 re-ranked chunks, we provided clean, focused context to the LLM, allowing it to extract the correct number.
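
The selection step can be sketched in isolation. The chunk labels and scores below are invented for illustration; they are not the values the Cross-Encoder produced in this notebook. `select_top_k` is a hypothetical helper that mirrors the sorting logic provided in Step 4.

```python
def select_top_k(docs, scores, k=3):
    """Pair each doc with its re-ranker score, sort descending, keep the top k."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Made-up candidates in their original retrieval order, with made-up scores.
candidates = [
    "revenue summary",
    "cash flow table row",
    "cover page",
    "risk factors",
    "share repurchases",
]
scores = [0.12, 0.91, 0.02, 0.08, 0.47]

top_docs = select_top_k(candidates, scores, k=3)
print(top_docs)
# ['cash flow table row', 'share repurchases', 'revenue summary']
```

The chunk that was second in the retrieval order ends up first after re-ranking, and the two lowest-scoring chunks never reach the LLM.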


### Key Takeaway: Engineering a Robust RAG System

The key lesson is that building a trustworthy RAG system requires engineering a pipeline that solves for both recall and precision.

* **The Retriever (`k=10`)** is our fast, high-recall component. Its job is to cast a wide net and make sure the answer is on the table.
* **The Re-ranker** is our slow, high-precision component. Its job is to inspect everything the retriever found and keep only the best pieces of evidence (here, the top 3 chunks).

This two-stage process is the foundation of building systems that are not just powerful, but also **reliable and trustworthy**. We are no longer just hoping the LLM can figure it out; we are engineering the best possible conditions for it to succeed.

**Next Up:** Now that we have built a powerful and precise RAG pipeline, how do we prove it? In **Module 4**, we will learn how to **evaluate our system quantitatively** using the RAGAS framework to measure its performance on key metrics like faithfulness and relevancy.