
### **Module 4: Measuring Performance with RAGAs**

**Objective:**
In the previous modules, we improved our RAG pipeline largely by intuition. Now we introduce a **quantitative and automated** way to measure its quality. The objective of this module is to build an evaluation pipeline with the widely used **RAGAs** framework and score our system on key metrics such as factual consistency and relevance.

**Core Concept: RAG Evaluation**
You can't improve what you can't measure. A RAG evaluation framework allows us to move beyond "it feels better" to a data-driven approach. It works by taking a small, curated set of questions and "golden" answers (ground truth) and using them to score our pipeline's performance. RAGAs cleverly uses powerful LLMs as judges to assess the quality of the retrieved context and the generated answer, providing us with a "report card" for our system.

### Learning Objectives

By the end of this module, you will be able to:

  * Understand the importance of quantitative evaluation for RAG systems.
  * Explain the core RAGAs metrics: **Faithfulness**, **Answer Relevancy**, **Context Precision**, and **Context Recall**.
  * Prepare a test dataset in the format required by RAGAs, including creating "ground truth" answers.
  * Execute an evaluation pipeline using RAGAs to score the advanced system we built in Module 3.
  * Analyze the RAGAs scores to identify the specific strengths and weaknesses of the RAG pipeline.

-----

#### **Step 1: Install Dependencies**

We will install all the libraries from our previous module, and add `ragas` and `datasets` to our environment.

In [None]:
!pip install -q ragas datasets
!pip install -q langchain langchain-community langchain-groq qdrant-client pypdf fastembed langchain_huggingface sentence_transformers
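
Because the install above is unpinned, it can help to record which versions were actually installed before running an evaluation. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the libraries this module relies on.
for name in ("ragas", "datasets", "langchain", "qdrant-client"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Keeping this output next to your scores makes it easier to explain differences if you rerun the notebook later with newer releases.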


-----

#### **Step 2: Setup (API Keys, and Full RAG Chain from Module 3)**

The cells below contain the complete, working setup from the end of Module 3. They prepare the exact RAG pipeline (`rag_chain`) that we are going to evaluate.

In [None]:
import os
from google.colab import userdata
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models
from langchain_huggingface import HuggingFaceEmbeddings
from fastembed import SparseTextEmbedding
from tqdm.auto import tqdm
from sentence_transformers.cross_encoder import CrossEncoder
from langchain_groq import ChatGroq
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
import numpy as np

In [None]:
# --- 1. Setup API Key ---
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

In [None]:
# --- 2. Load and Split Document ---
pdf_path = "./NVIDIA-Q1-FY26-Financial-Results.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
print(f"Document loaded and split into {len(docs)} chunks.")

Document loaded and split into 191 chunks.


In [None]:
# --- 3. Initialize Qdrant and Embeddings ---
client = QdrantClient(location=":memory:")
collection_name = "rag_foundations_m4"
client.recreate_collection(
    collection_name=collection_name,
    vectors_config={"dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
    sparse_vectors_config={"text-sparse": models.SparseVectorParams(index=models.SparseIndexParams(on_disk=False))}
)
dense_embed_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": True}
)
sparse_embed_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
cross_encoder = CrossEncoder('BAAI/bge-reranker-base')
print("Models and Qdrant collection initialized.")

Models and Qdrant collection initialized.


In [None]:
# --- 4. Embed and Upsert Full Document ---
points_to_upsert = []
for i, doc in enumerate(tqdm(docs, desc="Upserting documents")):
    dense_vec = dense_embed_model.embed_query(doc.page_content)
    sparse_vec = list(sparse_embed_model.embed([doc.page_content]))[0]
    sparse_vector = models.SparseVector(
        indices=sparse_vec.indices.tolist(),
        values=sparse_vec.values.tolist(),
    )
    points_to_upsert.append(
        models.PointStruct(
            id=i,
            payload={"text": doc.page_content, **doc.metadata},
            vector={"dense": dense_vec, "text-sparse": sparse_vector},
        )
    )
client.upsert(collection_name=collection_name, points=points_to_upsert, wait=True)
print(f"Upserted all {len(docs)} documents.")



Upserted all 191 documents.


In [None]:
# --- 5. Build the Full RAG Chain with Re-Ranking ---
llm = ChatGroq(temperature=0, model_name="meta-llama/llama-4-scout-17b-16e-instruct")

def rerank_and_retrieve(query: str):
    """
    Performs hybrid search (dense + sparse) and then re-ranks the results.
    This function will be called only ONCE per query by our efficient chain.
    """
    top_k_retrieval = 10
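    # The "query: " prefix is kept from Module 3 so the evaluated pipeline is unchanged.
    # Note: this prefix is the E5-style convention; to our knowledge bge-m3 does not require it.
    query_with_prefix = f"query: {query}"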
    dense_query_vec = dense_embed_model.embed_query(query_with_prefix)
    sparse_query_vec = list(sparse_embed_model.embed([query]))[0]

    # Perform dense and sparse search
    dense_results = client.search(
        collection_name=collection_name,
        query_vector=models.NamedVector(name="dense", vector=dense_query_vec),
        limit=top_k_retrieval,
        with_payload=True
    )
    sparse_results = client.search(
        collection_name=collection_name,
        query_vector=models.NamedSparseVector(
            name="text-sparse",
            vector=models.SparseVector(indices=sparse_query_vec.indices.tolist(), values=sparse_query_vec.values.tolist())
        ),
        limit=top_k_retrieval,
        with_payload=True
    )

    # Combine and de-duplicate results
    seen_ids = set()
    candidate_docs = []
    for result in dense_results + sparse_results:
        if result.id not in seen_ids:
            candidate_docs.append(result.payload['text'])
            seen_ids.add(result.id)

    # Re-rank the candidates
    rerank_pairs = [[query, doc] for doc in candidate_docs]
    rerank_scores = cross_encoder.predict(rerank_pairs)
    doc_with_scores = list(zip(candidate_docs, rerank_scores))
    sorted_docs = sorted(doc_with_scores, key=lambda x: x[1], reverse=True)

    # Select the top-k documents after re-ranking
    top_k_rerank = 3
    final_docs = [doc[0] for doc in sorted_docs[:top_k_rerank]]
    return final_docs

# Define the prompt template
prompt_template = """
Answer the question based only on the following context:

Context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(prompt_template)

# This chain is designed to return both the answer and the context documents.
retriever = RunnableLambda(rerank_and_retrieve)

rag_chain = (
    RunnablePassthrough.assign(context_docs=lambda x: retriever.invoke(x["question"]))
    .assign(
        answer=(
            RunnablePassthrough.assign(context=lambda x: "\n---\n".join(x["context_docs"]))
            | prompt
            | llm
            | StrOutputParser()
        )
    )
    | (lambda x: {"answer": x["answer"], "contexts": x["context_docs"]})
)

print("--- Efficient and Correct RAG chain created ---")

--- Efficient and Correct RAG chain created ---
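
To see the merge, de-duplicate, and re-rank flow of `rerank_and_retrieve` in isolation, here is a toy sketch that runs without Qdrant or any model download. The hit lists are made-up `(id, text)` pairs, and `word_overlap` is a hypothetical stand-in for the cross-encoder, not how `BAAI/bge-reranker-base` actually scores pairs.

```python
from typing import Callable, List, Tuple


def merge_and_rerank(
    dense_hits: List[Tuple[int, str]],
    sparse_hits: List[Tuple[int, str]],
    score_fn: Callable[[str, str], float],
    query: str,
    top_k: int = 3,
) -> List[str]:
    """Merge two hit lists, drop duplicate ids, then keep the top_k by score."""
    seen_ids = set()
    candidates = []
    for doc_id, text in dense_hits + sparse_hits:
        if doc_id not in seen_ids:
            seen_ids.add(doc_id)
            candidates.append(text)
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_k]


def word_overlap(query: str, text: str) -> float:
    """Hypothetical scorer: number of lowercase words shared by query and text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Toy hits: id 1 is returned by both searches and must appear only once.
dense_hits = [(1, "share repurchases summary"), (2, "unrelated revenue note")]
sparse_hits = [(1, "share repurchases summary"), (3, "repurchases authorization")]

top = merge_and_rerank(dense_hits, sparse_hits, word_overlap, "share repurchases", top_k=2)
print(top)
```

In the real chain the candidate pool holds up to 20 chunks (10 dense plus 10 sparse, minus duplicates), and only the 3 highest-scoring chunks reach the prompt.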


-----

#### **Step 3: Create the Evaluation Dataset**

To evaluate our system, we need a "gold standard" dataset. This involves writing our test questions and, crucially, providing a perfect, human-written "ground truth" answer for each one. This ground truth answer is what RAGAs will use as a benchmark.

In [None]:
from datasets import Dataset

# Define our test questions
questions = [
    "How much did NVIDIA spend on share repurchases in the first quarter of fiscal year 2026?",
    "What was the exact value for 'Tax withholding related to common stock from stock plans' for the period ending April 27, 2025?",
    "What specific action did the U.S. government take on April 9, 2025, that impacted H20 products?"
]

# Ground-truth answers as plain strings, in the same order as `questions`.
ground_truths = [
    "During the first quarter of fiscal year 2026, NVIDIA repurchased 126 million shares of its common stock for $14.5 billion.",
    "The exact value for tax withholding related to common stock from stock plans for the period ending April 27, 2025 (Q1 FY26) was $1,532 million.",
    "On April 9, 2025, the U.S. government informed NVIDIA that it requires a license for the export of its H20 integrated circuits to China.",
]

# Generate answers and retrieve contexts from our RAG pipeline
answers = []
contexts = []

for query in tqdm(questions, desc="Generating answers and contexts"):
    # Run the chain once; it returns both the answer and the retrieved contexts
    result = rag_chain.invoke({"question": query})
    answers.append(result["answer"])
    contexts.append(result["contexts"])

# Create the final dataset in the format required by RAGAs
ragas_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths
})

print("\nEvaluation dataset created successfully.")
print(ragas_dataset)


Evaluation dataset created successfully.
Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 3
})
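
Since the four columns are built from separate lists, a misaligned or empty entry is easy to miss. Below is a small sanity check one might run before calling RAGAs; the function `check_eval_columns` and the toy row are hypothetical, and the column names simply mirror the dictionary passed to `Dataset.from_dict` above.

```python
from typing import Dict


def check_eval_columns(columns: Dict[str, list]) -> int:
    """Validate that the four evaluation columns line up; return the row count."""
    required = ["question", "answer", "contexts", "ground_truth"]
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")

    # Each row's contexts must be a non-empty list of strings.
    for i, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"Row {i}: contexts must be a list of strings")
        if not ctx:
            raise ValueError(f"Row {i}: no contexts were retrieved")
    return lengths["question"]


# Toy row with placeholder text, only to demonstrate the check.
toy_columns = {
    "question": ["q1"],
    "answer": ["a1"],
    "contexts": [["chunk one", "chunk two"]],
    "ground_truth": ["gt1"],
}
n_rows = check_eval_columns(toy_columns)
print(f"{n_rows} row(s) look consistent")
```

With the real data you could pass the same dictionary you hand to `Dataset.from_dict`.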


-----

#### **Step 4: Configure and Run the RAGAs Evaluation**

Now we configure RAGAs. We need to tell it which LLM and which embedding model to use for its "judging" process. Then we pass it our dataset and the list of metrics we want to calculate.

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

# 1. Define the list of metrics we want to calculate.
# The metric objects are used as imported; the judge LLM and embeddings are supplied once, in evaluate().
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
]

# 2. Run the evaluation.
# We pass our LangChain LLM and embedding model objects directly to the 'evaluate' function.
# evaluate() is a regular synchronous call, so no asyncio wrapper is needed
# (asyncio.run() can also fail inside a notebook, where an event loop is already running).
print("Running RAGAs evaluation...")
result = evaluate(
    dataset=ragas_dataset,
    metrics=metrics,
    llm=llm,
    embeddings=dense_embed_model,
)
print("Evaluation complete.")

Running RAGAs evaluation...



Evaluation complete.


-----

#### **Step 5: Analyze the Results**

The evaluation returns a result object that holds the score for each metric on each question. We can convert it to a Pandas DataFrame for clear analysis.

In [None]:
# Display the results in a clean table
df = result.to_pandas()
df

| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | How much did NVIDIA spend on share repurchases... | [Capital Return to Shareholders\nWe repurchase... | According to the context, NVIDIA repurchased 1... | During the first quarter of fiscal year 2026, ... | 1.0 | 0.986721 | 1.0 | 1.0 |
| 1 | What was the exact value for 'Tax withholding ... | [Stock-based compensation — — 1,470 — — 1,470 ... | The exact value for 'Tax withholding related t... | The exact value for tax withholding related to... | 0.5 | 1.0 | 0.5 | 1.0 |
| 2 | What specific action did the U.S. government t... | [On April 9, 2025, we were informed by the USG... | On April 9, 2025, the US government informed t... | On April 9, 2025, the U.S. government informed... | 1.0 | 0.656539 | 0.833333 | 1.0 |
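
Per-row scores are easier to act on once they are summarised. The sketch below averages each metric and lists the rows that fall under a cut-off, using the scores copied from the results table above. The 0.8 threshold is an arbitrary assumption for illustration, not a RAGAs recommendation.

```python
from statistics import fmean

# Per-question scores copied from the results table above (rows 0, 1, 2).
scores = {
    "faithfulness": [1.0, 0.5, 1.0],
    "answer_relevancy": [0.986721, 1.0, 0.656539],
    "context_precision": [1.0, 0.5, 0.833333],
    "context_recall": [1.0, 1.0, 1.0],
}

THRESHOLD = 0.8  # hypothetical cut-off; choose one that suits your use case

# Average each metric across the three questions.
mean_scores = {metric: fmean(values) for metric, values in scores.items()}

# Collect every (row, metric, value) that falls below the threshold.
flagged = []
for metric, values in scores.items():
    for row, value in enumerate(values):
        if value < THRESHOLD:
            flagged.append((row, metric, value))

for metric, mean in mean_scores.items():
    print(f"{metric}: mean={mean:.3f}")
for row, metric, value in flagged:
    print(f"row {row}: {metric}={value}")
```

On the real DataFrame, selecting the four metric columns and calling `.mean()` should give the same averages.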


### **Module 4: Conclusion & Analysis**

After running the evaluation, you will have a "report card" for your RAG system.

**How to Interpret the Scores:**

  * **`faithfulness`:** Arguably the most important metric. It checks whether the claims made in the answer are supported by the retrieved context. A score of 1.0 means every claim is supported; a score of 0 means none are. Note that it measures consistency with the context, not with the real world.
  * **`answer_relevancy`:** This measures how well the answer addresses the actual question. It ignores factual accuracy and focuses only on whether the answer is "on-topic."
  * **`context_precision`:** This scores the retriever. It asks: "Of the context we retrieved, how much was actually useful, and were the useful chunks ranked first?" A high score means relevant chunks sit at the top and we are not passing much "noise" to the LLM.
  * **`context_recall`:** This also scores the retriever. It asks: "Did we retrieve everything needed to support the ground truth answer?" A high score means our retriever didn't miss any critical information.
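
To make the numbers less abstract, here is a simplified sketch of the arithmetic behind two of these metrics as they are commonly described: context precision as the mean of precision@k over the ranks that hold a relevant chunk, and faithfulness as the share of answer claims supported by the context. In RAGAs the relevance and support verdicts come from the judge LLM; here they are supplied by hand, and both function names are hypothetical.

```python
from typing import List


def context_precision_at_k(relevance: List[int]) -> float:
    """Mean of precision@k over the ranks k whose chunk is relevant (1) vs. not (0)."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


def supported_claim_ratio(supported: int, total: int) -> float:
    """Share of claims that are supported; 0.0 when there are no claims."""
    return supported / total if total else 0.0


# Hand-picked verdicts for three retrieved chunks, in ranked order.
print(context_precision_at_k([1, 1, 1]))  # all chunks relevant
print(context_precision_at_k([1, 0, 1]))  # relevant chunk pushed down to rank 3
print(context_precision_at_k([0, 1, 0]))  # only the second chunk is relevant
print(supported_claim_ratio(1, 2))        # one of two claims supported
```

The patterns `[1, 0, 1]` and `[0, 1, 0]` evaluate to roughly 0.833 and 0.5, which would be consistent with the `context_precision` values in the table above, although the judge's actual verdicts are not recorded in this notebook.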

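By analyzing these scores, you can back up claims about your RAG system's quality with numbers and diagnose where it needs improvement. For example, if `context_recall` is low, you need to improve your retriever. If `faithfulness` is low, you may need to improve your prompt or use a better generation model. In the table above, row 1 (the tax withholding question) scores 0.5 on both `faithfulness` and `context_precision`, which makes it the first case to investigate. Keep in mind that three questions are a very small sample: treat these scores as a smoke test, and grow the test set before drawing firm conclusions.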
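
The diagnosis rules above can be written down as a small lookup. This is only a sketch: the `diagnose` helper and the 0.8 threshold are hypothetical, the hints for `context_recall` and `faithfulness` restate the guidance above, and the hints for the other two metrics are our own assumptions. The example values are the rounded means of the three rows in the results table.

```python
from typing import Dict, List

# Hints keyed by metric. The first two follow the guidance in the text;
# the last two are assumptions added for illustration.
HINTS = {
    "context_recall": "improve the retriever (it is missing needed information)",
    "faithfulness": "improve the prompt or use a better generation model",
    "context_precision": "assumption: review re-ranking or lower top-k to cut noise",
    "answer_relevancy": "assumption: check that the prompt keeps answers on-topic",
}


def diagnose(mean_scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return one hint per metric whose mean score is below the threshold."""
    return [
        f"{metric} = {score:.3f}: {HINTS[metric]}"
        for metric, score in mean_scores.items()
        if metric in HINTS and score < threshold
    ]


# Rounded means of the three rows in the results table.
example_means = {
    "faithfulness": 0.833,
    "answer_relevancy": 0.881,
    "context_precision": 0.778,
    "context_recall": 1.0,
}

for line in diagnose(example_means):
    print(line)
```

Averages can hide a single bad row, so it is still worth reading the per-question scores alongside any summary like this.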