# Evaluating GraphRAG with Ragas and NVIDIA's Nemotron-4-340B-Reward Model

##### **Overview**
This notebook walks you through evaluating Retrieval-Augmented Generation (RAG) pipelines using two tools: **Ragas** and reward models, specifically NVIDIA's **Nemotron-4-340B-Reward** model. Traditional evaluation metrics often require extensive human annotations, but these tools offer efficient, reference-free evaluations that correlate strongly with human judgment.

##### **What is Ragas?**
Ragas (Retrieval Augmented Generation Assessment) is a framework designed for the reference-free evaluation of Retrieval-Augmented Generation (RAG) systems. RAG systems combine a retrieval component with a Large Language Model (LLM) to provide responses based on both external data and the model's internal knowledge. Evaluating such systems is challenging due to multiple factors, including the relevance of retrieved information and the faithfulness of the generated responses. Ragas addresses these challenges by offering a suite of metrics that assess various dimensions of RAG pipelines without relying on ground truth human annotations.

##### **What is the Nemotron-4-340B-Reward Model?**
The Nemotron-4-340B-Reward model by NVIDIA is a multi-dimensional reward model designed to evaluate AI-generated responses across several attributes. Built upon the Nemotron-4-340B-Base model, it adds a linear layer that outputs scores corresponding to specific attributes. This model is particularly useful in synthetic data generation pipelines and reinforcement learning from AI feedback.

The model scores each response on the following attributes, on a scale of 0 to 4 (higher is better):
1. Helpfulness: Measures how effectively the response addresses the prompt.
2. Correctness: Assesses the inclusion of all pertinent facts without inaccuracies.
3. Coherence: Evaluates the consistency and clarity of expression in the response.
4. Complexity: Determines the intellectual depth required to generate the response (for example, whether it demands deep domain expertise or can be produced with basic language competency).
5. Verbosity: Analyzes the level of detail provided relative to the requirements of the prompt.

##### **Key Differences Between Ragas and Reward Models**
While both Ragas and reward models like Nemotron-4-340B-Reward aim to evaluate LLM outputs, they differ in scope and methodology:
1. Evaluation Scope:
- Ragas: Focuses on assessing the entire RAG pipeline, including both the retrieval and generation components. It evaluates aspects such as context relevance, faithfulness, and answer quality.
- Reward Models: Concentrate on evaluating the generated responses themselves, scoring them based on predefined attributes like helpfulness and correctness.
2. Methodology:
- Ragas: Utilizes a suite of metrics to provide a holistic evaluation of RAG systems without relying on human-annotated data.
- Reward Models: Are fine-tuned to predict human preferences and return scalar scores for specific attributes of a response.

By integrating both Ragas and reward models into your evaluation process, you can gain comprehensive insights into the performance of your QA pipelines, from retrieval effectiveness to response quality.

##### **What You’ll Learn**

In this tutorial, we will walk you through:
1. Ragas Evaluation:
- Set up Ragas for evaluating a QA pipeline.
- Use an LLM served through NVIDIA AI Endpoints as the evaluation judge.
- Analyze evaluation metrics and interpret the results using Ragas.
2. Reward Model Evaluation:
- Implement the Nemotron-4-340B-Reward model to assess LLM responses.
- Interpret the reward scores to understand response quality.
- Incorporate reward-based feedback into an LLM evaluation pipeline.

Let’s get started! 🚀

##### 1️⃣ Environment Setup

🔧 Import Required Dependencies

In [None]:
import pandas as pd

import os
import getpass
import asyncio

from ragas import evaluate
from datasets import Dataset

from openai import OpenAI
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

🔑 Setting Up NVIDIA API Key

To access NVIDIA AI Endpoints, you need to provide a valid **NVIDIA API key**.

- If the key is not already set as an environment variable, you will be prompted to enter it.
- The key should start with `nvapi-`, ensuring it is correctly formatted.
- This step is essential for interacting with NVIDIA's LLM services.

Run the following cell to set up your API key:

In [None]:
# Ensure NVIDIA API key is set
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

##### 2️⃣ Load and Preprocess Evaluation Data

Before we can evaluate a **retrieval-augmented generation (RAG) system**, we need a structured dataset containing:
1. **User queries** (questions asked by users).
2. **Generated responses** (answers produced by the RAG system; here, the GraphRAG responses).
3. **Retrieved contexts** (documents fetched by the retriever).
4. **Reference answers** (ground-truth answers for comparison).

In the previous tutorials, the ground-truth (GT) question-answer pairs were generated synthetically with the Nemotron-4-340B synthetic data generation model.


In [None]:
eval_df = pd.read_csv('../data/evaluation_data.csv')
eval_df.head()
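
Before formatting the data, it can help to confirm that the CSV exposes the columns read in later cells (`question`, `answer`, `gt_context`, `gt_answer`). The following is a minimal sketch of such a check; the helper name `missing_columns` is hypothetical and is not part of pandas or Ragas.

```python
# Columns that later cells of this notebook read from eval_df.
REQUIRED_COLUMNS = ("question", "answer", "gt_context", "gt_answer")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list of names; with pandas, pass eval_df.columns instead.
print(missing_columns(["question", "answer", "gt_answer"]))  # ['gt_context']
```

An empty list means all expected columns are present, for example `assert not missing_columns(eval_df.columns)`.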

##### 3️⃣ Ragas Evaluation

📌 Format the evaluation dataset into Ragas's expected input structure

We format the evaluation dataset into a dictionary that matches Ragas's expected input structure, then convert it into a Hugging Face Dataset for efficient processing.

**Why use a Hugging Face Dataset?**
- Optimized for large-scale processing.
- Memory-efficient, allowing operations on large datasets without excessive RAM usage.
- Easily integrates with Ragas evaluation functions.

In [None]:
eval_data = {
    'user_input': eval_df['question'].tolist(),
    'response': eval_df['answer'].tolist(),
    'retrieved_contexts': eval_df['gt_context'].apply(lambda x: [x] if isinstance(x, str) else x).tolist(),
    'reference': eval_df['gt_answer'].tolist()
}
eval_dataset = Dataset.from_dict(eval_data)
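
Each `retrieved_contexts` entry must be a list of strings. The lambda above wraps a plain string in a list; the sketch below extends that idea under two assumptions that may not hold for your CSV: a cell can be empty (`None` or NaN), and a list of contexts may have been saved as its string representation (for example `"['a', 'b']"`). The helper `to_context_list` is hypothetical.

```python
import ast


def to_context_list(cell):
    """Normalize one CSV cell into a list of context strings."""
    if isinstance(cell, (list, tuple)):
        return [str(item) for item in cell]
    if not isinstance(cell, str):
        # Covers None and float NaN, which pandas uses for empty cells.
        return []
    text = cell.strip()
    if text.startswith("[") and text.endswith("]"):
        # Assumption: the cell may hold a stringified Python list.
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    # Fall back to treating the whole cell as a single context.
    return [cell]


print(to_context_list("a single passage"))  # ['a single passage']
print(to_context_list("['a', 'b']"))        # ['a', 'b']
```

If this fits your data, it could replace the lambda: `eval_df['gt_context'].apply(to_context_list).tolist()`.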

📌 Setting Up NVIDIA AI Endpoints for Ragas Evaluation

To evaluate **retrieval-augmented generation (RAG) pipelines** effectively, we need an **LLM** for scoring responses and an **embedding model** for similarity comparisons. In this section, we integrate **NVIDIA AI Endpoints** with Ragas to power our evaluation.

Why do we need these models?
- LLM (in this example, `meta/llama-3.3-70b-instruct`): Acts as a judge to evaluate responses generated by the RAG system.
- Embedding Model (in this example, `nvidia/nv-embedqa-e5-v5`): Computes semantic similarity between texts. Answer relevancy, for example, compares the original question with questions generated from the answer.

In [None]:
llm = ChatNVIDIA(
    model="meta/llama-3.3-70b-instruct",
    temperature=0.0,
    max_tokens=300,
)
embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", model_type="passage")

📌 Wrapping NVIDIA AI Models for Ragas

Ragas expects models in a specific format. To ensure compatibility, we wrap our NVIDIA LLM and embedding model using `LangchainLLMWrapper` and `LangchainEmbeddingsWrapper`.

In [None]:
nvpl_llm = LangchainLLMWrapper(langchain_llm=llm)
nvpl_embeddings = LangchainEmbeddingsWrapper(embeddings)

📌 View and Interpret Results from Ragas

A Ragas score is composed of the following:

<img src="../data/ragas.png" alt="Ragas Evaluation" width="600">

**Metrics explained**

1. Faithfulness: measures the factual consistency of the generated answer with the provided context. This is done in two steps. First, given a question and a generated answer, Ragas uses an LLM to extract the statements that the answer makes. This yields a list of statements whose validity has to be checked. Second, given that list and the retrieved context, Ragas uses an LLM to check whether each statement is supported by the context. The number of supported statements is divided by the total number of statements in the generated answer to obtain the score for a given example.

2. Answer Relevancy: measures how relevant and to the point the answer is to the question. For a given generated answer, Ragas uses an LLM to generate probable questions that the answer would respond to, then computes their similarity to the question actually asked.

3. Context Precision: measures how precisely the retrieved context provides information relevant to generating the answer. Given a question, a reference answer, and the retrieved contexts, Ragas uses an LLM to judge whether each retrieved context is relevant. The score rewards retrievals in which relevant contexts are ranked ahead of irrelevant ones.

4. Context Recall: measures the ability of the retriever to retrieve all the information needed to answer the question. Ragas calculates this by taking the provided reference (ground-truth) answer and using an LLM to check whether each statement in it can be found in the retrieved context. A statement that cannot be found indicates that the retriever failed to retrieve the information needed to support it.
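
Once the LLM judgments are available, the ratio-style metrics above reduce to simple arithmetic. The sketch below is not Ragas code: it assumes the judge has already produced boolean verdicts and only illustrates how such verdicts could be aggregated. `supported_ratio` mirrors the faithfulness and context recall descriptions (supported statements divided by total statements). `average_precision` shows one common rank-aware aggregation (the mean of precision@k at each relevant position); treat its correspondence to Ragas internals as an assumption. Both function names are hypothetical.

```python
def supported_ratio(verdicts):
    """Fraction of statements judged as supported (True)."""
    verdicts = list(verdicts)
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)


def average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Faithfulness-style example: 3 of 4 answer statements are supported.
print(supported_ratio([True, True, False, True]))  # 0.75

# The same two relevant chunks, ranked first versus ranked last.
print(average_precision([True, True, False]))  # 1.0
print(average_precision([False, True, True]))  # lower, because the relevant chunks rank later
```

The last two calls show why ranking matters: the set of retrieved chunks is identical, but placing the irrelevant chunk first lowers the score.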

In [None]:
ragas_evaluation = evaluate(eval_dataset, llm=nvpl_llm, embeddings=nvpl_embeddings,
                             metrics=[answer_relevancy, context_precision])

In [None]:
ragas_evaluation

##### 4️⃣ Reward Model Evaluation

📌 Setting Up NVIDIA AI Endpoints for Reward Evaluation

In [None]:
reward_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)

In [None]:
async def get_reward_scores(question, answer):
    try:
        # The OpenAI client call is blocking, so run it in a worker thread;
        # otherwise asyncio.gather would execute the requests one at a time.
        completion = await asyncio.to_thread(
            reward_client.chat.completions.create,
            model="nvidia/nemotron-4-340b-reward",
            messages=[
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer}
            ]
        )

        # Parse the comma-separated "attribute:score" string into a dict
        content = completion.choices[0].message[0].content
        res = content.split(",")
        content_dict = {item.split(":")[0].strip(): float(item.split(":")[1]) for item in res}
        #print(content_dict) #debug output
        return content_dict
    except Exception as e:
        # Return None so one failed request does not abort the whole batch
        print(f"Error processing {question}: {e}")
        return None
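
The parsing step in `get_reward_scores` assumes the reward model returns its scores as a comma-separated string of `name:value` pairs. The sketch below isolates that step so it can be tested without calling the API. The helper `parse_reward_scores` is hypothetical, and the sample string is made up for illustration; it is not real model output.

```python
def parse_reward_scores(content):
    """Parse 'name:value,name:value' into a dict of floats, skipping malformed pairs."""
    scores = {}
    for item in content.split(","):
        name, sep, value = item.partition(":")
        if not sep:
            continue  # no ':' separator in this fragment
        try:
            scores[name.strip().lower()] = float(value)
        except ValueError:
            continue  # value is not numeric
    return scores


# Illustrative input only; the numbers are invented.
sample = "helpfulness:3.1,correctness:3.0,coherence:3.8,complexity:1.2,verbosity:1.5"
print(parse_reward_scores(sample)["helpfulness"])  # 3.1
```

Skipping malformed fragments instead of raising keeps one unexpected response from discarding the scores that did parse.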

In [None]:
async def process_dataframe(df):
    tasks = [get_reward_scores(row["question"], row["answer"]) for _, row in df.iterrows()]
    results = await asyncio.gather(*tasks)  # Run all tasks concurrently

    # Define metrics
    metrics = ["Helpfulness", "Correctness", "Coherence", "Complexity", "Verbosity"]

    # Add results to DataFrame
    for metric in metrics:
        df[metric] = [res.get(metric.lower()) if res else None for res in results]

    # Calculate mean of each metric
    mean_scores = df[metrics].mean()
    print("Mean Scores:\n", mean_scores)
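
When the scoring coroutines genuinely run concurrently, `asyncio.gather` schedules all of them at once, which may exceed API rate limits on larger datasets. One option, sketched here with a hypothetical helper `gather_limited` and an arbitrary limit, is to cap the number of in-flight calls with `asyncio.Semaphore`. The `fake_score` coroutine is a stand-in for a real API call.

```python
import asyncio


async def gather_limited(coro_factories, limit=4):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run_one(f) for f in coro_factories))


async def fake_score(i):
    """Stand-in for an API call; sleeps briefly and returns a dummy score dict."""
    await asyncio.sleep(0.01)
    return {"helpfulness": float(i)}


async def demo():
    # Bind i at definition time so each factory scores its own item.
    factories = [lambda i=i: fake_score(i) for i in range(6)]
    return await gather_limited(factories, limit=2)


if __name__ == "__main__":
    # Standalone script entry point; inside a notebook, use `await demo()`.
    print(asyncio.run(demo()))
```

Passing factories rather than coroutine objects means no request is created until a semaphore slot is free.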

In [None]:
async def main():
    await process_dataframe(eval_df)

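# Jupyter already runs an event loop, so await the coroutine directly.
# (In a standalone script, call asyncio.run(main()) instead.)
await main()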

Given a conversation with multiple turns between a user and an assistant, the reward model rates the following attributes (typically between 0 and 4) for every assistant turn:

- Helpfulness: Overall helpfulness of the response to the prompt.
- Correctness: Inclusion of all pertinent facts without errors.
- Coherence: Consistency and clarity of expression.
- Complexity: Intellectual depth required to write the response (that is, whether the response can be written by anyone with basic language competency or requires deep domain expertise).
- Verbosity: Amount of detail included in the response, relative to what is asked for in the prompt.