# **Guided Notebook: Module 2 - Hybrid Search**

**Improving Recall with Hybrid Search**

*This notebook is designed for learners to complete. Code sections marked with `YOUR CODE HERE` are to be filled in by the learner.*

-----


**Objective:**
In our first module, we saw a critical **Recall Failure**. Our basic RAG system, using only semantic search, completely missed the correct document chunk for a query about "share repurchases." It failed to find the right information in the knowledge base.

The objective of this module is to solve that recall problem by implementing a more powerful **Hybrid Search** system. We will combine traditional keyword-based search with the semantic search we've already learned. This will create a much more reliable retriever.

**Learning Objectives:**
By the end of this module, you will be able to:
- Explain the core concept of Hybrid Search and understand the distinct roles of dense (semantic) and sparse (keyword) vectors.
- Implement a hybrid data strategy by creating both dense and sparse embeddings for your documents using open-source models.
- Configure and populate a Qdrant collection that handles a sophisticated hybrid search workload.
- Build a custom retrieval function that performs both dense and sparse searches and fuses the results.
- Diagnose a **Recall Failure** and understand why a narrow search (`k=4`) can cause the system to fail, even with a better algorithm.

**Core Concept: Hybrid Search with Qdrant**
We will create and store two types of vectors for each document chunk:
1.  **Dense Vector (from `bge-m3`):** Captures the *semantic meaning* and conceptual relationships.
2.  **Sparse Vector (from `Splade`):** Captures the *keyword importance*.

When a query comes in, our system will perform two separate searches—one for meaning and one for keywords—and then combine the results. This gives us the best of both worlds, making our system far more robust against the type of keyword-based failure we saw in Module 1.
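
To make the two vector types concrete, here is a toy sketch in plain Python. All vectors, token ids, and weights are invented for illustration and are not real outputs of `bge-m3` or SPLADE. It only shows that a dense score uses every dimension, while a sparse score uses just the few indices the two vectors share.

```python
# Toy illustration only: every number below is invented, not a real model output.

def dense_score(a: list[float], b: list[float]) -> float:
    """Dot product over every dimension of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def sparse_score(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product over only the indices (token ids) present in both sparse vectors."""
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)

# Dense: every dimension carries a little of the meaning.
query_dense = [0.6, 0.8, 0.0]
chunk_dense = [0.8, 0.6, 0.0]

# Sparse: mostly empty; only a handful of vocabulary indices carry weight.
# The ids 101, 2045, and 777 are hypothetical token ids.
query_sparse = {101: 1.5, 2045: 0.9}
chunk_sparse = {101: 2.0, 777: 0.4}

print(dense_score(query_dense, chunk_dense))
print(sparse_score(query_sparse, chunk_sparse))
```

In this sketch only token id 101 is shared, so the sparse score depends entirely on that one exact-term match. That is the behavior that helps with keyword-heavy queries.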


### **Step 1: Install Dependencies**

In [None]:
# Install all required libraries
!pip install -q langchain langchain-community langchain-groq qdrant-client pypdf fastembed sentence-transformers

# Ignore standard warnings
import warnings
warnings.filterwarnings('ignore')

-----

### **Step 2: Setup API Key & Document Loading**

This step is the same as in Module 1: we reuse the Module 1 API key, load the NVIDIA financial report PDF, and split it into chunks.
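
As a reminder of what chunk size and overlap mean, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). The helper name `sliding_chunks` and the sample text are hypothetical.

```python
# Simplified sketch: fixed-size character windows with overlap.
# NOT the RecursiveCharacterTextSplitter algorithm, which also respects separators.

def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for c in chunks:
    print(c)
```

Each window repeats the last four characters of the previous one. Overlap exists so that a fact straddling a chunk boundary still appears whole in at least one chunk.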

In [None]:
import os
from google.colab import userdata
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# --- Setup API Key ---
# Make sure you have added your GROQ_API_KEY to the Colab secrets manager
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

# --- Load and Split Document ---
# Make sure you have uploaded the NVIDIA Q1 FY26 PDF to your Colab session
pdf_path = "./NVIDIA-Q1-FY26-Financial-Results.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Use the same chunking strategy as Module 1
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

print(f"Document loaded and split into {len(docs)} chunks.")

-----

### **Step 3: Initialize Qdrant for Hybrid Search**


This is a key step. We will create a Qdrant client and then create a new **collection** that is specifically configured to handle both dense and sparse vectors. This is different from Module 1 where we only had one type of vector.
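
The hint in the cell below suggests dot-product distance for the dense vectors, and Step 4 initializes the dense model with `normalize_embeddings=True`. A small sketch, using made-up two-dimensional vectors, shows why that pairing is reasonable: for unit-length vectors, the dot product and cosine similarity give the same value.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors, purely for illustration.
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # dot product of the normalized vectors
```

Both printed values agree up to floating-point rounding, so ranking by dot product on normalized embeddings orders results the same way cosine similarity would.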

In [None]:
from qdrant_client import QdrantClient, models

# Initialize an in-memory Qdrant client
client = QdrantClient(location=":memory:")

# Define the collection name
collection_name = "rag_foundations_m2_guided"

# Create the collection with configurations for both dense and sparse vectors
print(f"Creating Qdrant collection '{collection_name}' for hybrid search...")

# YOUR CODE HERE
# Task: Configure and create the Qdrant collection.
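# HINT: Use the client.recreate_collection() method. (Newer qdrant-client releases deprecate it in favor of
# client.create_collection(), which accepts the same arguments.)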
# You need to configure two types of vectors inside the 'vectors_config' and 'sparse_vectors_config' arguments.
#   1. A 'dense' vector using models.VectorParams. Set the size to 1024 (for bge-m3) and the distance to models.Distance.DOT.
#   2. A 'text-sparse' sparse vector using models.SparseVectorParams.
# The final structure should look like: client.recreate_collection(collection_name=..., vectors_config={...}, sparse_vectors_config={...})




print("Collection created successfully.")

-----

### **Step 4: Embed and Store Documents**

Now we will perform the main data processing. We will loop through every document chunk, create both a dense and a sparse vector for it, and then store them together in our new Qdrant collection.


In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from fastembed import SparseTextEmbedding
from tqdm.auto import tqdm

print("Initializing local embedding models...")
# 1. Initialize our embedding models
dense_embed_model = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-m3", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": True}
)
sparse_embed_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
print("Models initialized.")

# 2. Embed and prepare all documents for upsert
print("Embedding and preparing all documents for upsert...")
points_to_upsert = []
for i, doc in enumerate(tqdm(docs, desc="Processing All Documents")):
    doc_text = doc.page_content

    # YOUR CODE HERE (Part 1)
    # Task: Create the dense vector for the doc_text.
    # HINT: Use the dense_embed_model.embed_query() method.
    dense_vec = # ... complete this line

    # YOUR CODE HERE (Part 2)
    # Task: Create the sparse vector for the doc_text.
    # HINT: The sparse_embed_model.embed() method returns a list, so you'll need the first element.
    sparse_vec = # ... complete this line

    # YOUR CODE HERE (Part 3)
    # Task: Create a Qdrant PointStruct to hold all the data.
    # HINT: It needs an 'id', a 'payload' (with the text and metadata), and a 'vector' dictionary.
    # The vector dictionary should have keys "dense" and "text-sparse".
    # For the sparse vector, you will need to convert its indices and values to a list.
    # e.g., models.SparseVector(indices=sparse_vec.indices.tolist(), values=sparse_vec.values.tolist())
    point = # ... complete this line

    points_to_upsert.append(point)

# 3. Upsert the points to Qdrant
# YOUR CODE HERE (Part 4)
# Task: Upload the prepared points to your Qdrant collection.
# HINT: Use the client.upsert() method, passing the collection_name and the points_to_upsert list.



print(f"Successfully embedded and upserted all {len(docs)} documents.")

-----

### **Step 5: Build the Hybrid RAG Chain**


Now we'll build our retrieval function. This function needs to perform two separate searches in Qdrant (one for dense vectors, one for sparse) and then intelligently combine the results before passing them to the LLM.
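
The fusion logic provided in the cell below is a de-duplicating union: it concatenates the dense hits and the sparse hits, then keeps the first occurrence of each point id. The toy sketch here mirrors that idea on plain `(id, text)` tuples. The ids, texts, and the helper name `fuse_unique` are hypothetical.

```python
# Toy stand-ins for search hits: (point_id, text). Ids and texts are hypothetical.
dense_hits = [(7, "chunk about buybacks"), (2, "chunk about revenue"), (9, "chunk about outlook")]
sparse_hits = [(2, "chunk about revenue"), (5, "chunk mentioning share repurchases"), (7, "chunk about buybacks")]

def fuse_unique(*result_lists):
    """Concatenate result lists in order and keep only the first occurrence of each id."""
    seen_ids = set()
    fused = []
    for results in result_lists:
        for point_id, text in results:
            if point_id not in seen_ids:
                fused.append((point_id, text))
                seen_ids.add(point_id)
    return fused

fused = fuse_unique(dense_hits, sparse_hits)
print([point_id for point_id, _ in fused])
```

Two consequences are worth noticing. First, assuming each search returns `top_k` hits, the LLM receives between `top_k` and `2 * top_k` chunks, depending on how much the two result lists overlap. Second, this approach never compares dense and sparse scores: every dense hit is placed ahead of the sparse-only hits, regardless of how strong each match was.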


In [None]:
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# Initialize the Groq LLM
llm = ChatGroq(temperature=0, model_name="meta-llama/llama-4-scout-17b-16e-instruct")

# --- Helper function to visualize the context ---
def pretty_print_docs(docs):
    print(f"Found {len(docs)} documents to pass to the LLM.\n")
    for i, doc in enumerate(docs):
        source = doc.metadata.get('source', 'Unknown Source')
        page = doc.metadata.get('page', 'Unknown Page')
        print(f"  [{i+1}] Source: {source} (Page: {page})")
        print(f"      Content: '{doc.page_content[:150]}...'")
    print("-" * 50)

# --- Custom Retrieval Function ---
def qdrant_hybrid_retrieve(query: str, top_k=4) -> list[Document]:
    """
    Performs hybrid search and returns a list of LangChain Document objects.
    We are deliberately keeping k=4 to demonstrate recall failure.
    """
    # YOUR CODE HERE (Part 1)
    # Task: Create the dense and sparse vectors for the input 'query'.
    # This is the same process as in the previous step.
    dense_query_vec = # ... complete this line
    sparse_query_vec = # ... complete this line

    # YOUR CODE HERE (Part 2)
    # Task: Perform the two separate searches (dense and sparse) using the client.search() method.
    # Remember to set the limit=top_k and with_payload=True.
    # For each search, you must specify which vector you are querying against ('dense' or 'text-sparse').
    dense_results = # ... complete this line
    sparse_results = # ... complete this line

    # The fusion logic is provided for you
    seen_ids = set()
    combined_documents = []
    all_results = dense_results + sparse_results
    for result in all_results:
        if result.id not in seen_ids:
            doc = Document(page_content=result.payload.get('text', ''), metadata={k: v for k, v in result.payload.items() if k != 'text'})
            combined_documents.append(doc)
            seen_ids.add(result.id)

    print(f"--- Context Being Passed to LLM (from Hybrid Search with k={top_k}) ---")
    pretty_print_docs(combined_documents)

    return combined_documents

# --- Build the RAG Chain (This part is provided for you) ---

prompt_template = "Answer the question based only on the following context:\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = ChatPromptTemplate.from_template(prompt_template)


rag_chain = (
    {"context": qdrant_hybrid_retrieve, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
)

print("RAG chain with Qdrant hybrid retrieval is ready.")

-----

### **Step 6: Test the Hybrid RAG Chain**

In [None]:
# --- Run the Test Queries ---
# This part is provided for you

# Query #1: The query that failed in Module 1
module_1_failure_query = "How much did NVIDIA spend on share repurchases in the first quarter of fiscal year 2026?"

# Query #2: Our new, more difficult query for this module
module_2_failure_query = "What was the exact value for Tax withholding related to common stock from stock plans for the period ending April 27, 2025?"

print("--- Testing Query #1 (The Module 1 Failure) ---")
print(f"Query: {module_1_failure_query}\n")
answer_1 = rag_chain.invoke(module_1_failure_query)
print(f"Answer: {answer_1}\n")
print("-" * 100)


print("\n\n--- Testing Query #2 (Our New Challenge) ---")
print(f"Query: {module_2_failure_query}\n")
answer_2 = rag_chain.invoke(module_2_failure_query)
print(f"Answer: {answer_2}\n")
print("-" * 100)

# Module 2 Conclusion: A Step Forward, and a Critical Failure

Once you have completed the notebook, the results from this module offer a **real-world lesson in building RAG systems.**

**1. A Major Success:**
Our new Hybrid Search retriever **solves the critical failure from Module 1**. For the query about "share repurchases," the system correctly finds the relevant chunks and provides the right answer ($14.5 billion).

This shows that combining dense (semantic) and sparse (keyword) vectors improves **recall**: the ability to find relevant documents even when the query relies on specific keywords.

**2. A New, More Subtle Failure:**
However, when we test the system with our second, more difficult query, it fails in a critical way.

* **The Query:** `"What was the exact value for 'Tax withholding related to common stock from stock plans' for the period ending April 27, 2025?"`
* **The Result:** The system returns the wrong value: **$1,752 million** (the value from the wrong year).
* **The Diagnosis: A Recall Failure.** This is not a case of the LLM getting confused. The root cause is that our retriever, with its narrow search of `k=4`, **never finds the correct chunk of text from the document.** The combination of a basic document loader (`PyPDFLoader`) that struggles with tables and a small `k` value means that the correct information from page 6 never reaches the LLM. Instead, the system retrieves other, less relevant chunks that happen to contain the wrong number.
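
One way to confirm a diagnosis like this is to check whether the chunk that contains the answer appears anywhere in the top-`k` results. The sketch below does that on made-up data: the ranking, the relevant id, and the helper name `hit_at_k` are all hypothetical, and in practice you would first need to identify the correct chunk by hand.

```python
def hit_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int) -> bool:
    """True if at least one relevant chunk id appears in the top-k retrieved ids."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])

# Hypothetical ranking from a retriever, best match first.
ranked_ids = [12, 3, 40, 8, 27, 19, 5]
# Hypothetical id of the chunk that actually contains the answer.
relevant = {19}

print(hit_at_k(ranked_ids, relevant, k=4))   # the relevant chunk sits at rank 6, outside the top 4
print(hit_at_k(ranked_ids, relevant, k=10))  # a wider net includes it
```

If the relevant chunk is missing from the retrieved set, no prompt or LLM change can recover it, which is what separates a recall failure from a generation failure. A wider `k` only helps when the correct chunk exists and ranks somewhere in the list; if the loader garbled the table, the chunk may not match well at any `k`.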

### Key Takeaway

Hybrid Search is a powerful tool, but it's not a magic bullet. A RAG pipeline is only as strong as its weakest link. We've just seen that even with a strong search algorithm, a basic loading and **chunking strategy** combined with an overly **narrow retrieval setting (`k=4`)** can cause the entire system to fail. We have not yet built a truly high-recall system capable of handling this difficult query.

### Next Up

In **Module 3**, we will address this problem with a more robust, two-stage architecture. First, we will tackle the **recall** problem by casting a wider net (increasing `k` to 10). Then, we will tackle the resulting **precision** problem with a **Re-Ranker**: an intelligent filter that analyzes the noisy results and promotes the best match to the top, so that the LLM receives the cleanest, most direct context.