### **Module 4: Measuring Performance with RAGAs**

**Objective:**
In the previous modules, we improved our RAG pipeline based on intuition. Now we introduce a **quantitative and automated** way to measure its quality. The objective of this module is to build an evaluation pipeline with the widely used open-source **RAGAs** framework and score our system on key metrics such as faithfulness (factual consistency with the retrieved context) and relevance.

**Core Concept: RAG Evaluation**
You can't improve what you can't measure. A RAG evaluation framework lets us move beyond "it feels better" to a data-driven approach. It takes a small, curated set of questions with "golden" answers (ground truth), runs the questions through our pipeline, and scores the results. RAGAs uses LLMs as judges to assess the quality of both the retrieved context and the generated answer, giving us a "report card" for our system.

### Learning Objectives

By the end of this module, you will be able to:

  * Understand the importance of quantitative evaluation for RAG systems.
  * Explain the core RAGAs metrics: **Faithfulness**, **Answer Relevancy**, **Context Precision**, and **Context Recall**.
  * Prepare a test dataset in the format required by RAGAs, including creating "ground truth" answers.
  * Execute an evaluation pipeline using RAGAs to score the advanced system we built in Module 3.
  * Analyze the RAGAs scores to identify the specific strengths and weaknesses of the RAG pipeline.

---

### **Step 1: Install Dependencies & Import Libraries**

First, let's get our environment ready by installing the necessary Python packages and importing all the modules we'll need for this exercise.

In [None]:
!pip install -q ragas datasets
!pip install -q langchain langchain-community langchain-groq qdrant-client pypdf fastembed langchain_huggingface sentence_transformers

import os
import asyncio
import pandas as pd
import numpy as np
from google.colab import userdata
from datasets import Dataset
from tqdm.auto import tqdm

# LangChain and related imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

# Qdrant, Reranking, and Sparse Embedding imports
from qdrant_client import QdrantClient, models
from fastembed import SparseTextEmbedding
from sentence_transformers.cross_encoder import CrossEncoder

# RAGAs evaluation imports
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

print("All libraries installed and imported.")

---

### **Step 2: Configure API Key and Load Data**

Here, we'll set up our Groq API key and load the NVIDIA financial report PDF. This part is provided for you.

In [None]:
# Ensure you have 'GROQ_API_KEY' saved as a secret in your Colab environment
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

# Ensure the PDF file is uploaded to your Colab session
pdf_path = "./NVIDIA-Q1-FY26-Financial-Results.pdf"
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"The file {pdf_path} was not found. Please upload it to the Colab environment.")

loader = PyPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

print(f"Document loaded and split into {len(docs)} chunks.")

---

### **Step 3: Initialize RAG Components**

> **Your Task:** Initialize all the core components for our RAG pipeline. This includes:
> 1.  The **Qdrant** client and a new collection named `"rag_foundations_m4"` with both `dense` and `sparse` vector configurations.
> 2.  The **dense embedding model** (`BAAI/bge-m3`).
> 3.  The **sparse embedding model** (`prithivida/Splade_PP_en_v1`).
> 4.  The **cross-encoder model** for re-ranking (`BAAI/bge-reranker-base`).

In [None]:
# 1. Initialize Qdrant Client
client = QdrantClient(location=":memory:")
collection_name = "rag_foundations_m4"

# 2. Create the collection with dense and sparse vector support
# (This call is provided for you. Recent qdrant-client releases may warn that recreate_collection is deprecated.)
client.recreate_collection(
    collection_name=collection_name,
    vectors_config={"dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
    sparse_vectors_config={"text-sparse": models.SparseVectorParams(index=models.SparseIndexParams(on_disk=False))}
)

# 3. Initialize the dense embedding model
# YOUR CODE HERE
dense_embed_model = ...

# 4. Initialize the sparse embedding model
# YOUR CODE HERE
sparse_embed_model = ...

# 5. Initialize the reranker model
# YOUR CODE HERE
cross_encoder = ...

print("Models and Qdrant collection initialized.")

---

### **Step 4: Embed and Upsert Documents**

> **Your Task:** Loop through all the document chunks (`docs`). For each chunk, generate its dense and sparse vector embeddings and create a Qdrant `PointStruct` to be upserted. Collect all these points and perform a single bulk `upsert` operation.
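
To make the shape of a hybrid point concrete, here is a minimal plain-Python sketch. It assumes a sparse embedding can be treated as parallel lists of token indices and weights, and it reuses the vector names `dense` and `text-sparse` and the payload key `text` that appear elsewhere in this notebook. The helpers `to_sparse` and `build_point` are hypothetical and use plain dicts, not `qdrant-client` objects.

```python
from typing import Dict, List


def to_sparse(weights: Dict[int, float]) -> Dict[str, List]:
    """Convert {token_id: weight} into parallel index/value lists, dropping zeros."""
    items = sorted((idx, w) for idx, w in weights.items() if w != 0.0)
    return {
        "indices": [idx for idx, _ in items],
        "values": [w for _, w in items],
    }


def build_point(point_id: int, text: str, dense: List[float],
                sparse_weights: Dict[int, float]) -> dict:
    """Assemble one record carrying a payload and two named vectors."""
    return {
        "id": point_id,
        "payload": {"text": text},
        "vector": {
            "dense": dense,                               # fixed-length list of floats
            "text-sparse": to_sparse(sparse_weights),     # only non-zero entries are kept
        },
    }


# Toy values, invented for illustration only.
point = build_point(0, "Revenue grew year over year.",
                    [0.1, 0.2, 0.3], {1012: 0.8, 7: 0.0, 42: 1.5})
print(point["vector"]["text-sparse"])
```

The dense vector always has the same length, while the sparse one stores only the non-zero positions, which is why the two are kept under separate names.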

In [None]:
points_to_upsert = []
for i, doc in enumerate(tqdm(docs, desc="Upserting documents")):
    # YOUR CODE HERE: Generate dense and sparse vectors
    dense_vec = ...
    sparse_vec = ...

    # YOUR CODE HERE: Create a Qdrant PointStruct
    # Remember to include payload and the two vector types.
    points_to_upsert.append(models.PointStruct(
        id=i,
        payload=...,
        vector={...}
    ))

# YOUR CODE HERE: Upsert the points to the Qdrant collection
client.upsert(collection_name=..., points=..., wait=True)
print(f"Upserted all {len(docs)} documents.")

---

### **Step 5: Build the RAG Chain**

> **Your Task:** This is a crucial step. You need to build an efficient RAG chain that performs retrieval, re-ranking, and generation in a single call.
>
> 1.  Initialize the LLM (`meta-llama/llama-4-scout-17b-16e-instruct` via Groq).
> 2.  Complete the `rerank_and_retrieve` function to perform hybrid search and reranking.
> 3.  Define the `prompt` using the provided template string.
> 4.  Construct the final `rag_chain` using LCEL. This chain must take a dictionary `{"question": "..."}` as input and produce a dictionary `{"answer": "...", "contexts": [...]}` as output.
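
The input/output contract in item 4 can be sketched without LangChain. The snippet below is a plain-Python analogy of the "add keys as the data flows" pattern, not the LCEL API itself. `assign`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins for illustration.

```python
from typing import Callable, List


def assign(**steps: Callable[[dict], object]) -> Callable[[dict], dict]:
    """Return a function that copies its input dict and adds one key per step."""
    def run(data: dict) -> dict:
        out = dict(data)
        for key, fn in steps.items():
            out[key] = fn(data)  # each step sees the incoming dict
        return out
    return run


def fake_retrieve(question: str) -> List[str]:
    """Stand-in for the retriever: returns a list of context strings."""
    return [f"context about: {question}"]


def fake_generate(question: str, contexts: List[str]) -> str:
    """Stand-in for prompt + LLM + output parser."""
    return f"answer based on {len(contexts)} context(s)"


add_contexts = assign(contexts=lambda d: fake_retrieve(d["question"]))
add_answer = assign(answer=lambda d: fake_generate(d["question"], d["contexts"]))


def toy_chain(data: dict) -> dict:
    # First add "contexts", then use them to add "answer".
    return add_answer(add_contexts(data))


result = toy_chain({"question": "What was Q1 revenue?"})
print(sorted(result))  # ['answer', 'contexts', 'question']
```

The point of this design is that the contexts are computed once and stay in the dictionary, so the evaluation step can read both `answer` and `contexts` from a single call.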

In [None]:
# YOUR CODE HERE: Initialize the LLM
llm = ...

def rerank_and_retrieve(query: str):
    """
    Performs hybrid search (dense + sparse) and then re-ranks the results.
    """
    top_k_retrieval = 10
    query_with_prefix = f"query: {query}"

    # YOUR CODE HERE: Generate dense and sparse query vectors
    dense_query_vec = ...
    sparse_query_vec = ...

    # YOUR CODE HERE: Perform dense and sparse search against Qdrant
    dense_results = client.search(...)
    sparse_results = client.search(...)

    # Combine and de-duplicate results (this part is provided)
    seen_ids = set()
    candidate_docs = []
    for result in dense_results + sparse_results:
        if result.id not in seen_ids:
            candidate_docs.append(result.payload['text'])
            seen_ids.add(result.id)

    # YOUR CODE HERE: Rerank the results using the cross_encoder
    rerank_pairs = ...
    rerank_scores = ...
    doc_with_scores = list(zip(candidate_docs, rerank_scores))
    sorted_docs = sorted(doc_with_scores, key=lambda x: x[1], reverse=True)

    # Return the top_k documents after re-ranking
    top_k_rerank = 3
    return [doc[0] for doc in sorted_docs[:top_k_rerank]]

# YOUR CODE HERE: Define the prompt template
prompt_template = """..."""
prompt = ChatPromptTemplate.from_template(prompt_template)

retriever = RunnableLambda(rerank_and_retrieve)

# YOUR CODE HERE: Construct the efficient RAG chain using LCEL
# Hint: Use RunnablePassthrough.assign(...) to add keys to the data as it flows through the chain.
rag_chain = (
    ...
)

print("--- RAG chain created ---")
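
The merge-then-rerank logic inside `rerank_and_retrieve` can be tried out on toy data. This is a minimal sketch under stated assumptions: search hits are modeled as `(id, text)` tuples, and `overlap_score` is a hypothetical word-overlap scorer standing in for the cross-encoder, so the numbers it produces are not real relevance scores.

```python
from typing import Callable, List, Tuple


def merge_unique(*result_lists: List[Tuple[int, str]]) -> List[str]:
    """Concatenate (id, text) hits, keeping the first occurrence of each id."""
    seen, merged = set(), []
    for hits in result_lists:
        for doc_id, text in hits:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(text)
    return merged


def rerank(query: str, docs: List[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> List[str]:
    """Score every (query, doc) pair and keep the top_k highest-scoring docs."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable sort
    return [doc for doc, _ in scored[:top_k]]


def overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


# Toy hits, invented for illustration only.
dense_hits = [(1, "share repurchases totaled billions"), (2, "gaming revenue rose")]
sparse_hits = [(2, "gaming revenue rose"), (3, "export license for h20 products")]

candidates = merge_unique(dense_hits, sparse_hits)
top = rerank("h20 export license", candidates, overlap_score, top_k=2)
print(top)
```

Document 2 appears in both hit lists but is scored only once, and the chunk that only the sparse search found ends up first after re-ranking.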

---

### **Step 6: Create the Evaluation Dataset**


> **This task is already completed for you.** The cell below will:
> 1.  Define the test `questions` and their corresponding `ground_truths` answers.
> 2.  Loop through the questions, `invoke` the chain you just built, and collect the resulting `answers` and `contexts`.
> 3.  Create the final `ragas_dataset` using `Dataset.from_dict()`.

In [None]:
# Provided: the test questions and their ground-truth answers
questions = [
    "How much did NVIDIA spend on share repurchases in the first quarter of fiscal year 2026?",
    "What was the exact value for 'Tax withholding related to common stock from stock plans' for the period ending April 27, 2025?",
    "What specific action did the U.S. government take on April 9, 2025, that impacted H20 products?"
]

ground_truths = [
    "During the first quarter of fiscal year 2026, NVIDIA repurchased 126 million shares of its common stock for $14.5 billion.",
    "The exact value for tax withholding related to common stock from stock plans for the period ending April 27, 2025 (Q1 FY26) was $1,532 million.",
    "On April 9, 2025, the U.S. government informed NVIDIA that it requires a license for the export of its H20 integrated circuits to China.",
]

answers = []
contexts = []

for query in tqdm(questions, desc="Generating answers and contexts"):
    result = rag_chain.invoke({"question": query})
    answers.append(result["answer"])
    contexts.append(result["contexts"])

ragas_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths
})

print("\nEvaluation dataset created successfully.")
print(ragas_dataset)
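
Before handing the columns to an evaluator, a quick sanity check can catch the most common mistakes, such as mismatched list lengths or `contexts` that are plain strings instead of lists of strings. The sketch below is one way to do that in plain Python. The column names are taken from the cell above, and `validate_eval_columns` is a hypothetical helper, not part of RAGAs or `datasets`.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_eval_columns(columns: Dict[str, list]) -> List[str]:
    """Return a list of problems; an empty list means the columns look consistent."""
    problems = []
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")

    for i, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
        elif not ctx:
            problems.append(f"row {i}: no contexts retrieved")
    return problems


# Toy data, invented for illustration only.
good = {
    "question": ["q1"],
    "answer": ["a1"],
    "contexts": [["chunk one", "chunk two"]],
    "ground_truth": ["g1"],
}
bad = {**good, "contexts": ["chunk one"], "ground_truth": ["g1", "g2"]}

print(validate_eval_columns(good))  # []
for problem in validate_eval_columns(bad):
    print(problem)
```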

---

### **Step 7: Configure and Run the RAGAs Evaluation**

> **Your Task:**
> 1. Define the list of `metrics` you want RAGAs to compute.
> 2. Call the `evaluate` function from RAGAs, passing the `ragas_dataset`, the `metrics` list, the `llm` object, and the `dense_embed_model` object.
> 3. Display the final results as a pandas DataFrame.

In [None]:
# YOUR CODE HERE: Define the list of metrics
metrics = [
    ...
]

def run_evaluation():
    print("Running RAGAs evaluation...")
    # YOUR CODE HERE: Call the RAGAs evaluate function
    result = evaluate(
        dataset=...,
        metrics=...,
        llm=...,
        embeddings=...
    )
    print("Evaluation complete.")
    return result

# evaluate() is a regular (synchronous) function, so we call it directly.
# Wrapping it in asyncio.run() would fail in Colab, which already runs an event loop.
result = run_evaluation()

# YOUR CODE HERE: Convert the result to a pandas DataFrame and display it
df = ...
df

---

### **Module 4: Conclusion & Analysis**

After running the evaluation, you will have a "report card" for your RAG system.

**How to Interpret the Scores:**

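  * **`faithfulness`:** This is arguably the most important metric. It checks whether the answer is factually consistent with the retrieved context. Scores range from 0 to 1: a score of 1.0 means every claim in the answer is supported by the context; a score of 0 means none of them are. Note that it measures grounding in the context, not correctness against the ground truth.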
  * **`answer_relevancy`:** This measures how well the answer addresses the actual question. It ignores factual accuracy and just focuses on whether the answer is "on-topic."
  * **`context_precision`:** This scores the retriever. It asks: "Of the chunks we retrieved, how many were actually useful, and were the useful ones ranked near the top?" A high score means the relevant chunks come first and we are not passing a lot of "noise" to the LLM.
  * **`context_recall`:** This also scores the retriever. It asks: "Did we retrieve all the information needed to produce the ground-truth answer?" A high score means our retriever didn't miss any critical information.
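
The arithmetic behind three of these scores can be shown on toy data. This is a minimal sketch, assuming the ratio-style definitions described in the RAGAs documentation. The true/false verdicts below are hand-written for illustration, whereas in RAGAs they come from the LLM judge. `answer_relevancy` is left out because it relies on embeddings rather than simple counts.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims supported by the retrieved context (needs >= 1 claim)."""
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(truth_attributed: List[bool]) -> float:
    """Fraction of ground-truth statements attributable to the context (needs >= 1)."""
    return sum(truth_attributed) / len(truth_attributed)


def context_precision_score(chunk_relevant: List[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this rank
    return total / hits if hits else 0.0


# Answer makes 4 claims, 3 of them supported by the context.
print(faithfulness_score([True, True, False, True]))            # 0.75
# Ground truth has 2 statements, 1 found in the context.
print(context_recall_score([True, False]))                      # 0.5
# Same two relevant chunks, ranked differently.
print(round(context_precision_score([True, False, True]), 3))   # 0.833
print(round(context_precision_score([False, True, True]), 3))   # 0.583
```

The last two lines show why ranking matters for `context_precision`: both retrievals contain two relevant chunks out of three, but the one that puts a relevant chunk first scores higher.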

By analyzing these scores, you can back up claims about your RAG system's quality with numbers and diagnose where it needs improvement. For example, if `context_recall` is low, you need to improve your retriever. If `faithfulness` is low, you may need to improve your prompt or use a better generation model. Keep in mind that three questions is a very small sample, so treat these scores as a first signal rather than proof.
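
This kind of diagnosis can be automated in a few lines. The sketch below averages each metric over per-question rows and groups the weak ones by the pipeline component they point to. The `diagnose` helper, the `0.7` threshold, and the example scores are all assumptions made for illustration. They are not RAGAs defaults or real results.

```python
from typing import Dict, List, Tuple

# Which part of the pipeline each metric mainly reflects.
COMPONENT = {
    "context_precision": "retriever",
    "context_recall": "retriever",
    "faithfulness": "generator",
    "answer_relevancy": "generator",
}


def diagnose(rows: List[Dict[str, float]],
             threshold: float = 0.7) -> Dict[str, List[Tuple[str, float]]]:
    """Average each metric across rows; group those below threshold by component."""
    findings: Dict[str, List[Tuple[str, float]]] = {}
    for metric, component in COMPONENT.items():
        values = [row[metric] for row in rows if metric in row]
        if not values:
            continue
        mean = sum(values) / len(values)
        if mean < threshold:
            findings.setdefault(component, []).append((metric, round(mean, 3)))
    return findings


# Invented scores, one dict per test question.
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.9, "context_precision": 0.8, "context_recall": 0.5},
    {"faithfulness": 0.8, "answer_relevancy": 0.9, "context_precision": 1.0, "context_recall": 0.5},
]
print(diagnose(rows))  # {'retriever': [('context_recall', 0.5)]}
```

With these made-up numbers only `context_recall` falls below the threshold, so the sketch points at the retriever as the first place to look.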