<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/kirit-branch/docs/notebooks/data-designer/rag-examples/generate-rag-evaluation-dataset.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 NeMo Data Designer: Generate Diverse RAG Evaluations

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.

<br>

If this is your first time using Data Designer, we recommend starting with the [first notebook](../intro-tutorial/1-the-basics.ipynb) in this 101 series.

This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases. 

You'll learn how to create diverse question-answer pairs at scale, covering a variety of difficulty levels and reasoning types, including both answerable and unanswerable scenarios.

### What You'll Learn
- How to process and chunk source documents for RAG evaluation

- How to configure categorical distributions for controlled diversity

- How to generate high-quality Q&A pairs with structured output

- How to evaluate the quality of generated pairs with rubric-based scoring

- How to analyze and export the complete dataset

## 1. Setup and Installation

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.

If the installation worked, you should be able to make the following imports:

In [21]:
from getpass import getpass

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer (NDD) Client

- The NDD client is responsible for submitting generation requests to the Data Designer microservice.

In [22]:
ndd = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8000"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process. 

Refer to the [model configuration guide](https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/configure-models.html) for more information on configuring models.

#### Configuring models

You can either use [build.nvidia.com](https://build.nvidia.com/) endpoints or deploy a local NIM for data generation.  
This notebook demonstrates both approaches for different data generation use cases.

In [None]:
# build.nvidia.com model endpoint
endpoint = "https://integrate.api.nvidia.com/v1"
model_id = "mistralai/mistral-small-24b-instruct"

model_alias = "evaluation_model"

# You will need to enter your model provider API key to run this notebook.
api_key = getpass("Enter model provider API key: ")

if len(api_key) > 0:
    print("✅ API key received.")
else:
    print("❌ No API key provided. Please enter your model provider API key.")

✅ API key received.


##### Using a Locally Hosted NIM for Data Generation

1. Log in to [build.nvidia.com](https://build.nvidia.com).
2. Pull the container for your desired model and deploy the NIM container, for example [meta/llama-3_1-8b-instruct](https://build.nvidia.com/meta/llama-3_1-8b-instruct).
3. Provide its endpoint in the notebook for data generation.

**Note:**
- Since NeMo Microservices (NMS) run inside a Docker container, you must use the **host IP address** and correct port instead of `localhost`.  
- You may need to port-forward NIM to a different port because NMS uses port `8000`.
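As a rough illustration of the note above, the sketch below assembles a `/v1` endpoint URL from a host and port and rejects loopback names, which from inside a container refer to the container itself rather than the host machine. `build_nim_endpoint` is a hypothetical helper, not part of the NeMo SDK, and the IP address in the usage line is a placeholder.

```python
# Names that refer to the container itself when used from inside Docker.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def build_nim_endpoint(host: str, port: int) -> str:
    """Return an http://<host>:<port>/v1 URL, refusing loopback hosts.

    Hypothetical helper for illustration only.
    """
    if host.lower() in LOOPBACK_HOSTS:
        raise ValueError(
            f"{host!r} points at the container itself; use the host machine's IP address."
        )
    return f"http://{host}:{port}/v1"


# Placeholder address; substitute the IP of the machine running the NIM.
print(build_nim_endpoint("192.0.2.10", 8004))
```

Failing early on `localhost` makes the misconfiguration visible before a generation job is submitted.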

In [1]:
# Locally hosted NIM endpoint on port 8004; replace <host-ip-address> with your host's IP
byom_endpoint = "http://<host-ip-address>:8004/v1"
byom_model_id = "meta/llama-3.1-8b-instruct"

byom_model_alias = "review-generator"

In [None]:
# Define both models you want to use for data generation
model_configs_yaml = f"""\
model_configs:
  - alias: "{model_alias}"
    inference_parameters:
      max_tokens: 1024
      temperature: 0.5
      top_p: 1.0
    model:
      api_endpoint:
        api_key: "{api_key}"
        model_id: "{model_id}"
        url: "{endpoint}"
  - alias: "{byom_model_alias}"
    inference_parameters:
      max_tokens: 1024
      temperature: 0.5
      top_p: 1.0
    model:
      api_endpoint:
        model_id: "{byom_model_id}"
        url: "{byom_endpoint}"
"""

config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)

## 2. Configuration

Let's define our source documents and the total number of evaluation pairs we want to generate. You can replace the document list with your own PDFs, web pages, or other text sources.

In [13]:
# Define source documents and total number of evaluation pairs to generate
# You can replace this with your own documents
DOCUMENT_LIST = ["https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/rag_evals/databricks-state-of-data-ai-report.pdf"]

## 3. Document Processing

Now we'll create a Document Processor class that handles loading and chunking the source documents. 

This class uses LangChain's `RecursiveCharacterTextSplitter` for chunking and the `unstructured` library for robust document parsing.

In [None]:
import os
import tempfile
from typing import List, Union

from langchain.text_splitter import RecursiveCharacterTextSplitter
from smart_open import open  # drop-in replacement for open() that also reads remote URIs
from unstructured.partition.auto import partition


class DocumentProcessor:
    """Handles loading and chunking source documents for RAG evaluation."""

    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        """Initialize with configurable chunk size and overlap (both in characters)."""
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

    def parse_document(self, uri: str) -> str:
        """Parse a single document from a URI into raw text."""
        # Read the document (local or remote) into memory.
        with open(uri, "rb") as file:
            content = file.read()

        # Copy it to a local temp file, because partition() is given a filename.
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)

        try:
            # The temp file is closed here, so its content is fully written.
            elements = partition(temp_file.name)
        finally:
            # Remove the temp file even if parsing fails.
            os.unlink(temp_file.name)

        return "\n\n".join(str(element) for element in elements)

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into chunks for RAG evaluation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
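To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on paragraph and sentence separators); `chunk_with_overlap` is a hypothetical function that only shows how consecutive chunks share `chunk_overlap` characters.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in: ignores paragraph and sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a fact that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.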

## 4. Data Models

Let's define Pydantic models for structured output generation. These schemas will ensure our generated data has consistent structure and validation.

In [33]:
from pydantic import BaseModel, Field

class QAPair(BaseModel):
    question: str = Field(
        ..., description="A specific question related to the domain of the context"
    )
    answer: str = Field(
        ..., description="Either a context-supported answer or explanation of why the question cannot be answered"
    )
    reasoning: str = Field(
        ..., description="A clear and traceable explanation of the reasoning behind the answer"
    )
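The following sketch shows what the schema buys you: a raw JSON string either becomes a `QAPair` with all three fields or is rejected. The `parse_qa_pair` helper and the sample strings are invented for illustration, and the field descriptions are shortened here; the model is redefined so the block runs on its own.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class QAPair(BaseModel):
    question: str = Field(..., description="A specific question")
    answer: str = Field(..., description="An answer or an explanation of why none exists")
    reasoning: str = Field(..., description="The reasoning behind the answer")


def parse_qa_pair(raw: str):
    """Return a QAPair, or None if the JSON is malformed or fields are missing."""
    try:
        return QAPair(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None


# Invented sample strings, not real model output.
good = '{"question": "q", "answer": "a", "reasoning": "r"}'
bad = '{"question": "q"}'  # missing answer and reasoning

print(parse_qa_pair(good))
print(parse_qa_pair(bad))  # None
```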

## 5. Processing Documents and Setting Up Data Designer

Now we'll process our document chunks and set up the Data Designer with our seed dataset.

In [None]:
import pandas as pd

# Process document chunks
processor = DocumentProcessor(chunk_size=4192, chunk_overlap=10)
chunks = processor.process_documents(DOCUMENT_LIST)

# Create a seed DataFrame with the document chunks
seed_df = pd.DataFrame({"context": chunks})

# Save to CSV
seed_df.to_csv("document_chunks.csv", index=False)
print("Seed dataset ", seed_df.head())
print("Saved to document_chunks.csv")

In [34]:
config_builder.with_seed_dataset(
    repo_id="into-tutorials/seeding-with-a-dataset",
    filename="document_chunks.csv",
    # Local path to the CSV written in the previous cell; replace with your own path.
    dataset_path="/home/slikhite/Desktop/NVAIE/gretel-SDG/document_chunks.csv",
    sampling_strategy="shuffle",
    with_replacement=True,
    datastore={"endpoint": "http://localhost:3000/v1/hf"},
)

## 6. Adding Categorical Columns for Controlled Diversity

Now we'll add categorical columns to control the diversity of our RAG evaluation pairs. We'll define:

1. **Difficulty levels**: easy, medium, hard

2. **Reasoning types**: factual recall, inferential reasoning, etc.

3. **Question types**: answerable vs. unanswerable (with weighting)

In [35]:
config_builder.add_column(
    C.SamplerColumn(
        name="difficulty",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["easy", "medium", "hard"],
            description="The difficulty level of the question"
        )
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="reasoning_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "factual recall",
                "inferential reasoning",
                "comparative analysis",
                "procedural understanding",
                "cause and effect"
            ],
            description="The type of reasoning required to answer the question"
        )
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="question_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["answerable", "unanswerable"],
            # 10:1 ratio of answerable to unanswerable questions.
            weights=[10, 1],  
        )
    )
).validate()

[12:16:26] [INFO] ✅ Validation passed
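To see what the `[10, 1]` weights imply, the sketch below normalizes them into probabilities and draws a reproducible sample with the standard library. It assumes the weights are treated as relative frequencies, as the 10:1 comment in the cell suggests; it does not call Data Designer.

```python
import random
from collections import Counter

values = ["answerable", "unanswerable"]
weights = [10, 1]

# Normalize the relative weights into probabilities.
total = sum(weights)
probabilities = {value: weight / total for value, weight in zip(values, weights)}
print(probabilities)  # answerable is 10/11, unanswerable is 1/11

# Draw a reproducible sample to see the ratio empirically.
rng = random.Random(0)
counts = Counter(rng.choices(values, weights=weights, k=1100))
print(counts)
```

With only 20 records, roughly one or two unanswerable questions are expected, so increase `num_records` or the second weight if you need more coverage of that case.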


## 7. Adding LLM-Structured Column for Q&A Pair Generation

Now let's set up the core of our data generation: the Q&A pair column, which produces structured question-answer pairs based on our document context and control parameters. Any column in the seed data (`context` in our case) can be referenced in the prompt, as can the sampler columns defined above.

In [41]:
# Add Q&A pair generation column
config_builder.add_column(
    C.LLMStructuredColumn(
        name="qa_pair",
        system_prompt=( 
            "You are an expert at generating high-quality RAG evaluation pairs. "
            "You are very careful in assessing whether the question can be answered from the provided context. "
        ),
        prompt="""\
{{context}}

Generate a {{difficulty}} {{reasoning_type}} question-answer pair.
The question should be {{question_type}} using the provided context.

For answerable questions:
- Ensure the answer is fully supported by the context

For unanswerable questions:
- Keep the question topically relevant
- Make it clearly beyond the context's scope
""",
        output_format=QAPair,
        model_alias=byom_model_alias,
    )
).validate()

[12:20:10] [INFO] ✅ Validation passed
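The `{{...}}` placeholders in the prompt are filled per record from the seed and sampler columns. The sketch below imitates that substitution with a regular expression so you can preview a rendered prompt locally. It is a simplified illustration, not Data Designer's own template engine, and the `render` helper and sample record are hypothetical.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, record: dict) -> str:
    """Replace each {{name}} with record[name]; raises KeyError if a column is missing."""
    return PLACEHOLDER.sub(lambda match: str(record[match.group(1)]), template)


template = (
    "Generate a {{difficulty}} {{reasoning_type}} question-answer pair.\n"
    "The question should be {{question_type}} using the provided context."
)

# Hypothetical record combining sampled column values.
record = {
    "difficulty": "hard",
    "reasoning_type": "factual recall",
    "question_type": "answerable",
}

print(render(template, record))
```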


## 8. Adding Evaluation Metrics with Custom Rubrics

To assess the quality of our generated Q&A pairs, we'll add evaluation metrics using detailed rubrics for scoring. 

We use Data Designer's `LLMJudgeColumn` for this, defining a set of custom Rubrics designed for our task.

In [42]:
from nemo_microservices.beta.data_designer.config import params as P

context_relevance_rubric = P.Rubric(
    name="Context Relevance",
    description="Evaluates how relevant the answer is to the provided context",
    scoring={
        "5": "Perfect relevance to context with no extraneous information",
        "4": "Highly relevant with minor deviations from context",
        "3": "Moderately relevant but includes some unrelated information",
        "2": "Minimally relevant with significant departure from context",
        "1": "Almost entirely irrelevant to the provided context"
    }
)

answer_precision_rubric = P.Rubric(
    name="Answer Precision",
    description="Evaluates the accuracy and specificity of the answer",
    scoring={
        "5": "Extremely precise with exact, specific information",
        "4": "Very precise with minor imprecisions",
        "3": "Adequately precise but could be more specific",
        "2": "Imprecise with vague or ambiguous information",
        "1": "Completely imprecise or inaccurate"
    }
)

answer_completeness_rubric = P.Rubric(
    name="Answer Completeness",
    description="Evaluates how thoroughly the answer addresses all aspects of the question",
    scoring={
        "5": "Fully complete, addressing all aspects of the question",
        "4": "Mostly complete with minor omissions",
        "3": "Adequately complete but missing some details",
        "2": "Substantially incomplete, missing important aspects",
        "1": "Severely incomplete, barely addresses the question"
    }
)

hallucination_avoidance_rubric = P.Rubric(
    name="Hallucination Avoidance",
    description="Evaluates the absence of made-up or incorrect information",
    scoring={
        "5": "No hallucinations, all information is factual and verifiable",
        "4": "Minimal hallucinations that don't impact the core answer",
        "3": "Some hallucinations that partially affect the answer quality",
        "2": "Significant hallucinations that undermine the answer",
        "1": "Severe hallucinations making the answer entirely unreliable"
    }
)

EVAL_METRICS_PROMPT_TEMPLATE = """\
You are an expert evaluator of question-answer pairs. Analyze the following Q&A pair and evaluate it objectively.

For this {{difficulty}} {{reasoning_type}} Q&A pair:
{{qa_pair}}

Take a deep breath and carefully evaluate each criterion based on the provided rubrics, considering the difficulty level and reasoning type indicated.
"""

# Use a different model (the build.nvidia.com endpoint) for evaluation than for generation
config_builder.add_column(
    C.LLMJudgeColumn(
        name="eval_metrics",
        prompt=EVAL_METRICS_PROMPT_TEMPLATE,
        rubrics=[context_relevance_rubric, answer_precision_rubric, answer_completeness_rubric, hallucination_avoidance_rubric],
        model_alias=model_alias
    )
).validate()

[12:20:14] [INFO] ✅ Validation passed
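Once the judge column has run, the per-rubric results can be reduced to a single number per record. The sketch below assumes each rubric entry is a dict with a `score` key next to `reasoning`; the `score` key name is an assumption, so inspect your own output before relying on it. The example record is invented.

```python
from statistics import mean


def summarize_scores(eval_metrics: dict) -> dict:
    """Collect per-rubric scores and their mean from one eval_metrics record.

    Assumes each rubric maps to a dict with a "score" entry (an assumption).
    """
    scores = {}
    for rubric, result in eval_metrics.items():
        score = result.get("score")
        if score is not None:
            scores[rubric] = int(score)  # scores may arrive as "5" or 5
    return {"scores": scores, "mean": mean(scores.values()) if scores else None}


# Invented example record using the four rubric names defined above.
example = {
    "Context Relevance": {"reasoning": "...", "score": 5},
    "Answer Precision": {"reasoning": "...", "score": "4"},
    "Answer Completeness": {"reasoning": "...", "score": 3},
    "Hallucination Avoidance": {"reasoning": "...", "score": 5},
}

print(summarize_scores(example)["mean"])  # 4.25
```

A mean score is a convenient filter, for example to drop pairs below a threshold, but the individual rubric scores are more useful for diagnosing why a pair is weak.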


## 9. Preview Sample Records

Let's generate a preview to see what our data will look like before running the full generation.

In [None]:
preview = ndd.preview(config_builder, verbose_logging=True)

In [44]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset.head()

| | context | difficulty | reasoning_type | question_type | qa_pair | judged_by_llm | eval_metrics |
|---|---|---|---|---|---|---|---|
| 0 | shows that many data teams are choosing to bui... | hard | factual recall | answerable | {"question": "What is the name of the two smal... | True | {'Context Relevance': {'reasoning': 'The answe... |
| 1 | Sciences experiences significant fluctuations ... | easy | factual recall | unanswerable | {"question": "What is the approximate number o... | True | {'Context Relevance': {'reasoning': 'The answe... |
| 2 | process of experimental testing, trying out di... | hard | procedural understanding | answerable | {"question": "What percentage of companies are... | True | {'Context Relevance': {'reasoning': 'The answe... |
| 3 | shows that many data teams are choosing to bui... | easy | inferential reasoning | answerable | {"question": "What percentage of open source L... | True | {'Context Relevance': {'reasoning': 'The answe... |
| 4 | models and GenAI, John Snow Labs is instrument... | easy | comparative analysis | answerable | {"question": "What is the difference between S... | True | {'Context Relevance': {'reasoning': 'The answe... |


In [45]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()

## 11. Generate the Full Dataset

Now let's generate our full dataset of RAG evaluation pairs and export it to a JSONL file for use in evaluating RAG systems. Set `wait_until_done=True` if you want the call to block until the job completes.

In [48]:
# Let's add an evaluation report to the dataset
config_builder.with_evaluation_report()

# Generate the full dataset.
workflow_run = ndd.create(
    config_builder, num_records=20, wait_until_done=True
)

[12:24:17] [INFO] 🎨 Creating Data Designer generation job
[12:24:17] [INFO]   |-- job_id: 3ca1af735043492e97687b2c2e630eaa
[12:24:19] [INFO] 🎲 Sampling 20 records from input dataset *with replacement*
[12:24:19] [INFO] 🎲 Using numerical samplers to generate 20 records across 3 columns
[12:24:19] [INFO] (💾 + 💾) Concatenating 2 datasets
[12:24:19] [INFO] 📝 Preparing template to generate data column `qa_pair`
[12:24:19] [INFO]   |-- model_alias: review-generator-local
[12:24:19] [INFO] Model config being used for model alias 'review-generator-local': 
{
    "alias": "review-generator-local",
    "model": {
        "api_endpoint": {
            "url": "http://10.110.20.111:8004/v1",
            "model_id": "meta/llama-3.1-8b-instruct",
            "provider_type": "openai"
        }
    },
    "inference_parameters": {
        "temperature": 0.5,
        "top_p": 1.0,
        "max_tokens": 1024,
        "max_parallel_requests": 4
    },
    "is_reasoner": false
}
[12:24:19] [INFO] 🩺 Runnin

In [50]:
dataset = workflow_run.load_dataset()

print("\nGenerated dataset shape:", dataset.shape)

# Export the dataset to JSONL format.
dataset.to_json('rag_evals.jsonl', orient='records', lines=True)
print("\nDataset exported to rag_evals.jsonl")


Generated dataset shape: (20, 7)

Dataset exported to rag_evals.jsonl
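To check how well the generated records cover the categories configured earlier, a quick count per column is usually enough. The sketch below uses only the standard library; `load_jsonl` and `coverage` are hypothetical helpers, and the sample records are invented so the block runs without the exported file.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def load_jsonl(path: str) -> List[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def coverage(records: Iterable[dict], columns: List[str]) -> Dict[str, Counter]:
    """Count how often each value occurs in each of the given columns."""
    records = list(records)
    return {col: Counter(r.get(col) for r in records) for col in columns}


# Invented records; in practice use load_jsonl("rag_evals.jsonl").
sample = [
    {"difficulty": "easy", "question_type": "answerable"},
    {"difficulty": "hard", "question_type": "answerable"},
    {"difficulty": "hard", "question_type": "unanswerable"},
]

for column, counts in coverage(sample, ["difficulty", "question_type"]).items():
    print(column, dict(counts))
```

If a category is missing or badly underrepresented, adjust the sampler weights or raise `num_records` and regenerate.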


## 12. Using Your RAG Evaluation Dataset

Now that you've generated a diverse RAG evaluation dataset, here are some ways to use it:

1. **Benchmarking**: Test your RAG system against these evaluation pairs to measure performance

2. **Error Analysis**: Identify patterns in where your RAG system struggles

3. **Optimization**: Use insights to tune retrieval and generation parameters

4. **Regression Testing**: Track performance over time as you improve your system

5. **Model Comparison**: Compare different LLMs, retrievers, or RAG architectures

The JSONL file contains structured data with questions, ground truth answers, and quality metrics that you can use with most evaluation frameworks.
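As one possible starting point for benchmarking, the sketch below scores a RAG system's answers against the reference answers with token-level F1. Everything here is illustrative: `answer_fn` stands in for your own RAG pipeline, the records are assumed to be already flattened into dicts with `question` and `answer` keys, and token F1 is just one simple metric among many.

```python
from collections import Counter
from typing import Callable, Iterable


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (lowercased, whitespace-tokenized)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def benchmark(records: Iterable[dict], answer_fn: Callable[[str], str]) -> float:
    """Average token F1 of answer_fn over records with 'question' and 'answer' keys."""
    scores = [token_f1(answer_fn(r["question"]), r["answer"]) for r in records]
    return sum(scores) / len(scores) if scores else 0.0


# Invented record and a trivial stand-in for a real RAG system.
records = [{"question": "What is 2 + 2?", "answer": "four"}]
print(benchmark(records, answer_fn=lambda question: "four"))  # 1.0
```

For unanswerable questions, a lexical metric like this is a poor fit; checking whether the system declines to answer is usually more informative.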