# 🎨 NeMo Data Designer: Generate Diverse RAG Evaluations

#### 📚 What you'll learn

This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases.

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    ExpressionColumnConfig,
    InferenceParameters,
    LLMJudgeColumnConfig,
    LLMStructuredColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
    Score,
    UniformSamplerParams,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# The "/no_think" system prompt turns off reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

## 🌱 Loading Seed Data

- We'll use chunks of our source documents as the seed data.

- Each chunk becomes one seed record whose `context` field grounds the generated question-answer pairs.

<br>

> 🌱 **Why use a seed dataset?**
>
> - Seed datasets let you steer the generation process by providing context that is specific to your use case.
>
> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.
>
> - During generation, prompt templates can reference any of the seed dataset fields.

<br>

> 💡 **About datastores**
>
> - You can use seed datasets from _either_ the Hugging Face Hub or a locally deployed datastore.
>
> - By default, we use the local datastore deployed with the Data Designer microservice.
>
> - The datastore endpoint is specified in the deployment configuration.

👋 **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.
<br>

### ⚙️ Document Processing

Now we'll create a Document Processor class that handles loading and chunking the source documents.

This class uses langchain's `RecursiveCharacterTextSplitter` for chunking and `unstructured` for robust document parsing.


In [None]:
from typing import List, Union
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
import tempfile
import os

In [None]:
class DocumentProcessor:
    """Handles loading and chunking source documents for RAG evaluation."""

    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        """Initialize with configurable chunk size and overlap."""
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

    def parse_document(self, uri: str) -> str:
        """Parse a single document from URI into raw text."""
        with open(uri, "rb") as file:
            content = file.read()

        # Copy the document to a temporary file for partitioning.
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)

        try:
            elements = partition(temp_file.name)
        finally:
            # Remove the temporary file even if parsing fails.
            os.unlink(temp_file.name)

        return "\n\n".join(str(element) for element in elements)

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into chunks for RAG evaluation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
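
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks rather than at fixed offsets, and `sliding_window_chunks` is a hypothetical helper that is not part of langchain or this notebook.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    result = []
    start = 0
    while start < len(text):
        result.append(text[start : start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return result


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role `chunk_overlap=200` plays at a larger scale: context that straddles a boundary still appears intact in at least one chunk.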

### 🏗️ Data Models

- Let's define Pydantic models for structured output generation.

- These schemas will ensure our generated data has consistent structure and validation.


In [None]:
from pydantic import BaseModel, Field


class QAPair(BaseModel):
    question: str = Field(
        ..., description="A specific question related to the domain of the context"
    )
    answer: str = Field(
        ...,
        description="Either a context-supported answer or explanation of why the question cannot be answered",
    )
    reasoning: str = Field(
        ...,
        description="A clear and traceable explanation of the reasoning behind the answer",
    )
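
The kind of check a schema like `QAPair` enables can be sketched with the standard library alone. The example below is a stand-in for illustration, not how Pydantic or Data Designer validate output: `REQUIRED_FIELDS` and `validate_qa_pair` are hypothetical names, and the example JSON is made up.

```python
import json

# Field names mirror the QAPair model; this tuple is an illustrative stand-in.
REQUIRED_FIELDS = ("question", "answer", "reasoning")


def validate_qa_pair(raw_json: str) -> dict:
    """Parse a JSON string and check it has the three string fields of QAPair."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name in REQUIRED_FIELDS:
        if name not in data:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(data[name], str):
            raise ValueError(f"field {name!r} must be a string")
    return data


# Made-up model output, used only to exercise the check.
example_output = json.dumps(
    {
        "question": "What does the report measure?",
        "answer": "The context does not say.",
        "reasoning": "No measurement is described in the provided context.",
    }
)
validated = validate_qa_pair(example_output)
print(sorted(validated))
```

A response that omits `reasoning`, or returns a number where a string is expected, is rejected instead of silently entering the dataset.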

In [None]:
import pandas as pd

# Process document chunks
DOCUMENT_LIST = ["./data/databricks-state-of-data-ai-report.pdf"]

processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)
chunks = processor.process_documents(DOCUMENT_LIST)

# Create a seed DataFrame with the document chunks
seed_df = pd.DataFrame({"context": chunks})

os.makedirs("data", exist_ok=True)
seed_df.to_csv("data/document_chunks.csv", index=False)

seed_df.head()

In [None]:
dataset_reference = data_designer_client.upload_seed_dataset(
    repo_id="data-designer-demo/rag-evaluation-dataset",
    dataset=seed_df,
    datastore_settings={"endpoint": "http://localhost:3000/v1/hf"},
)

config_builder.with_seed_dataset(dataset_reference)

## 🎲 Adding Categorical Columns for Controlled Diversity

Now we'll add categorical columns to control the diversity of our RAG evaluation pairs. We'll define:

1. **Difficulty levels**: easy, medium, hard

2. **Reasoning types**: factual recall, inferential reasoning, etc.

3. **Question types**: answerable vs. unanswerable (with weighting)


In [None]:
# Configure categorical columns for controlled diversity
config_builder.add_column(
    SamplerColumnConfig(
        name="difficulty",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["easy", "medium", "hard"],
        ),
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="reasoning_type",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "factual recall",
                "inferential reasoning",
                "comparative analysis",
                "procedural understanding",
                "cause and effect",
            ],
        ),
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="question_type",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["answerable", "unanswerable"],
            # 10:1 ratio of answerable to unanswerable questions.
            weights=[10, 1],
        ),
    )
)
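
The effect of `weights=[10, 1]` can be sketched with the standard library. This illustrates weighted categorical sampling in general, not Data Designer's sampler implementation; `sample_categories` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def sample_categories(
    values: Sequence[str],
    weights: Optional[Sequence[float]] = None,
    n: int = 1,
    seed: Optional[int] = None,
) -> List[str]:
    """Draw n values, each chosen with probability proportional to its weight."""
    rng = random.Random(seed)
    return rng.choices(values, weights=weights, k=n)


draws = sample_categories(
    ["answerable", "unanswerable"], weights=[10, 1], n=1100, seed=0
)
question_type_counts = Counter(draws)

# Weights are relative: 10 and 1 normalize to 10/11 and 1/11.
expected_answerable_share = 10 / (10 + 1)
print(question_type_counts)
print(f"expected answerable share: {expected_answerable_share:.3f}")
```

With these weights, roughly 10 in every 11 records should be answerable. In this sketch, omitting `weights` gives every value the same probability.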

## 🦜 Adding LLM-Structured Column for Q&A Pair Generation

Now let's set up the core of our data generation: the Q&A pair column that will produce structured question-answer pairs based on our document context and control parameters.


In [None]:
# Add Q&A pair generation column
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="qa_pair",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "{{context}}\n"
            "\n"
            "Generate a {{difficulty}} {{reasoning_type}} question-answer pair.\n"
            "The question should be {{question_type}} using the provided context.\n"
            "\n"
            "For answerable questions:\n"
            "- Ensure the answer is fully supported by the context\n"
            "\n"
            "For unanswerable questions:\n"
            "- Keep the question topically relevant\n"
            "- Make it clearly beyond the context's scope\n"
        ),
        output_format=QAPair,
    )
)
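
To see what the model receives for a single record, the sketch below fills the `{{column}}` placeholders by hand. It uses a small regular expression rather than a real template engine, so it only handles plain `{{name}}` references. `render_prompt` and the sample record are hypothetical and are not part of the Data Designer SDK.

```python
import re

# Matches plain placeholders such as {{context}} or {{ difficulty }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Shortened copy of the prompt used for the qa_pair column.
PROMPT_TEMPLATE = (
    "{{context}}\n"
    "\n"
    "Generate a {{difficulty}} {{reasoning_type}} question-answer pair.\n"
    "The question should be {{question_type}} using the provided context.\n"
)


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{name}} placeholder with the matching value in record."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in record:
            raise KeyError(f"record has no column named {key!r}")
        return str(record[key])

    return PLACEHOLDER.sub(substitute, template)


# Made-up record: one seed chunk plus one draw from each sampler column.
sample_record = {
    "context": "Revenue grew 12% in 2023.",
    "difficulty": "easy",
    "reasoning_type": "factual recall",
    "question_type": "answerable",
}
rendered = render_prompt(PROMPT_TEMPLATE, sample_record)
print(rendered)
```

Rendering a prompt for a few hand-picked records is a cheap way to catch misspelled column names before submitting a generation job.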

## 🔍 Quality Assessment: LLM-as-a-Judge

When generating a synthetic dataset, we need a way to assess the quality of the generated data. We use the LLM-as-a-Judge strategy to do this.

To do so, we define the rubrics that the LLM should use to assess generation quality, along with a prompt that provides relevant instructions.


In [None]:
context_relevance_rubric = Score(
    name="Context Relevance",
    description="Evaluates how relevant the answer is to the provided context",
    options={
        "5": "Perfect relevance to context with no extraneous information",
        "4": "Highly relevant with minor deviations from context",
        "3": "Moderately relevant but includes some unrelated information",
        "2": "Minimally relevant with significant departure from context",
        "1": "Almost entirely irrelevant to the provided context",
    },
)

answer_precision_rubric = Score(
    name="Answer Precision",
    description="Evaluates the accuracy and specificity of the answer",
    options={
        "5": "Extremely precise with exact, specific information",
        "4": "Very precise with minor imprecisions",
        "3": "Adequately precise but could be more specific",
        "2": "Imprecise with vague or ambiguous information",
        "1": "Completely imprecise or inaccurate",
    },
)

answer_completeness_rubric = Score(
    name="Answer Completeness",
    description="Evaluates how thoroughly the answer addresses all aspects of the question",
    options={
        "5": "Fully complete, addressing all aspects of the question",
        "4": "Mostly complete with minor omissions",
        "3": "Adequately complete but missing some details",
        "2": "Substantially incomplete, missing important aspects",
        "1": "Severely incomplete, barely addresses the question",
    },
)

hallucination_avoidance_rubric = Score(
    name="Hallucination Avoidance",
    description="Evaluates the absence of made-up or incorrect information",
    options={
        "5": "No hallucinations, all information is factual and verifiable",
        "4": "Minimal hallucinations that don't impact the core answer",
        "3": "Some hallucinations that partially affect the answer quality",
        "2": "Significant hallucinations that undermine the answer",
        "1": "Severe hallucinations making the answer entirely unreliable",
    },
)

EVAL_METRICS_PROMPT_TEMPLATE = (
    "You are an expert evaluator of question-answer pairs. Analyze the following Q&A pair and evaluate it objectively.\n\n"
    "For this {{difficulty}} {{reasoning_type}} Q&A pair:\n"
    "{{qa_pair}}\n\n"
    "Take a deep breath and carefully evaluate each criterion based on the provided rubrics, considering the "
    "difficulty level and reasoning type indicated."
)

config_builder.add_column(
    LLMJudgeColumnConfig(
        name="eval_metrics",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=EVAL_METRICS_PROMPT_TEMPLATE,
        scores=[
            context_relevance_rubric,
            answer_precision_rubric,
            answer_completeness_rubric,
            hallucination_avoidance_rubric,
        ],
    )
)
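
Once judge scores exist, one possible next step is to keep only records that clear a threshold on every rubric. The sketch below assumes each record's `eval_metrics` is a mapping from rubric name to a dict with a `score` entry. That shape is an assumption for illustration, so inspect a preview record and adapt the accessors before relying on it. `passes_all_rubrics` is a hypothetical helper, and the two example records are made up.

```python
from typing import Any, Mapping


def passes_all_rubrics(
    eval_metrics: Mapping[str, Mapping[str, Any]], min_score: int = 4
) -> bool:
    """Return True when every rubric score is at least min_score."""
    # Scores are cast to int because the rubric options use string keys ("1" to "5").
    scores = [int(result["score"]) for result in eval_metrics.values()]
    return bool(scores) and min(scores) >= min_score


# Made-up judge outputs in the ASSUMED shape described above.
strong_record = {
    "Context Relevance": {"score": "5"},
    "Answer Precision": {"score": "4"},
    "Answer Completeness": {"score": "5"},
    "Hallucination Avoidance": {"score": "5"},
}
weak_record = {
    "Context Relevance": {"score": "5"},
    "Answer Precision": {"score": "2"},
    "Answer Completeness": {"score": "4"},
    "Hallucination Avoidance": {"score": "5"},
}

kept = [
    record
    for record in (strong_record, weak_record)
    if passes_all_rubrics(record, min_score=4)
]
print(len(kept))  # 1
```

Taking the minimum across rubrics is a strict choice: a single low score, such as imprecision in an otherwise relevant answer, is enough to drop the record.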

### 🔁 Iteration is key – preview the dataset!

1. Use the `preview` method to generate a sample of records quickly.

2. Inspect the results for quality and format issues.

3. Adjust column configurations, prompts, or parameters as needed.

4. Re-run the preview until satisfied.


In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder)

In [None]:
# Display a sample record from the preview.
preview.display_sample_record()

### 📊 Analyze the generated data

- Data Designer automatically generates a basic statistical analysis of the generated data.

- This analysis is available via the `analysis` property of generation result objects.


In [None]:
# Print the analysis as a table.
preview.analysis.to_report()

### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=20)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-rag-examples-generate-rag-generation-eval-dataset",
);