# 🎨 NeMo Data Designer: Product Information Dataset Generator with Q&A

#### 📚 What you'll learn

This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers.

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
from nemo_microservices.data_designer.essentials import (
    BernoulliSamplerParams,
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    ExpressionColumnConfig,
    InferenceParameters,
    LLMJudgeColumnConfig,
    LLMStructuredColumnConfig,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
    Score,
    UniformSamplerParams,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)
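
If your deployment is not on `localhost`, one option is to read the base URL from the environment instead of hard-coding it. The following is a minimal sketch: using `NEMO_MICROSERVICES_BASE_URL` as an environment variable name and the helper `resolve_base_url` are assumptions made for illustration, not part of the SDK.

```python
import os
from typing import Mapping


def resolve_base_url(env: Mapping[str, str], default: str = "http://localhost:8080") -> str:
    """Return the base URL from `env` if set and non-empty, else the default."""
    # NEMO_MICROSERVICES_BASE_URL as an environment variable is an assumption
    # of this sketch; the notebook itself hard-codes the value.
    return env.get("NEMO_MICROSERVICES_BASE_URL") or default


base_url = resolve_base_url(os.environ)
print(base_url)
```

The resulting `base_url` can then be passed to `NeMoDataDesignerClient(base_url=...)` exactly as in the cell above.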

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# The "/no_think" system prompt disables reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

## 🏗️ Defining Data Structures

Now we'll define the data model for our product information dataset. The evaluation rubrics are defined later, in the LLM-as-a-Judge section.


In [None]:
import string

from pydantic import BaseModel, Field


# Define product information structure
class ProductInfo(BaseModel):
    product_name: str = Field(
        ..., description="A realistic product name for the market."
    )
    key_features: list[str] = Field(
        ..., min_length=1, max_length=3, description="Key product features."
    )
    description: str = Field(
        ...,
        description="A short, engaging description of what the product does, highlighting a unique but believable feature.",
    )
    price_usd: float = Field(..., description="The stated price in USD.")

## 🎲 Adding Sampler Columns

- Sampler columns generate synthetic data without calling an LLM.

- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.


In [None]:
# Define product category options
config_builder.add_column(
    SamplerColumnConfig(
        name="category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home Appliances",
                "Groceries",
                "Toiletries",
                "Sports Equipment",
                "Toys",
                "Books",
                "Pet Supplies",
                "Tools & Home Improvement",
                "Beauty",
                "Health & Wellness",
                "Outdoor Gear",
                "Automotive",
                "Jewelry",
                "Watches",
                "Office Supplies",
                "Gifts",
                "Arts & Crafts",
                "Baby & Kids",
                "Music",
                "Video Games",
                "Movies",
                "Software",
                "Tech Devices",
            ]
        ),
    )
)

# Define price range to seed realistic product types
config_builder.add_column(
    SamplerColumnConfig(
        name="price_tens_of_dollars",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=1, high=200),
    )
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="product_price",
        expr="{{ ((price_tens_of_dollars * 10) - 0.01) | round(2) }}",
        dtype="float",
    )
)

# Generate first letter for product name to ensure diversity
config_builder.add_column(
    SamplerColumnConfig(
        name="first_letter",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=list(string.ascii_uppercase)),
    )
)

# Determine if this example will include hallucination
config_builder.add_column(
    SamplerColumnConfig(
        name="is_hallucination",
        sampler_type=SamplerType.BERNOULLI,
        params=BernoulliSamplerParams(p=0.5),
    )
)
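
To build intuition for what these seed columns contribute to each record, here is a rough local approximation using only the standard library. It is a sketch, not how Data Designer implements its samplers: the helper `sample_seed_row`, the shortened category list, and the assumption that the uniform sampler draws floats between `low` and `high` are all illustrative.

```python
import random
import string

CATEGORIES = ["Electronics", "Clothing", "Books"]  # shortened list for illustration


def sample_seed_row(rng: random.Random) -> dict:
    """Draw one row of seed values, mimicking the sampler columns locally."""
    price_tens = rng.uniform(1, 200)  # stands in for the UNIFORM sampler
    return {
        "category": rng.choice(CATEGORIES),  # stands in for the CATEGORY sampler
        "price_tens_of_dollars": price_tens,
        # Intended price computation: subtract first, then round the whole result.
        "product_price": round(price_tens * 10 - 0.01, 2),
        "first_letter": rng.choice(string.ascii_uppercase),
        "is_hallucination": int(rng.random() < 0.5),  # stands in for BERNOULLI, p=0.5
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [sample_seed_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because every value here is drawn independently of any LLM, the mix of categories, prices, and first letters in the final dataset is controlled by the sampler parameters rather than by the model's own preferences.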


## 🦜 LLM-generated columns

- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.

- As we see below, nested JSON fields can be accessed using dot notation.


In [None]:
# Generate product information
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="product_info",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "Generate a realistic product description for a product in the {{ category }} "
            "category that costs {{ product_price }}.\n"
            "The name of the product MUST start with the letter {{ first_letter }}.\n"
        ),
        output_format=ProductInfo,
    )
)

# Generate user questions about the product
config_builder.add_column(
    LLMTextColumnConfig(
        name="question",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=("Ask a question about the following product:\n\n {{ product_info }}"),
    )
)


# Generate answers to the questions
config_builder.add_column(
    LLMTextColumnConfig(
        name="answer",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "{%- if is_hallucination == 0 -%}\n"
            "<product_info>\n"
            "{{ product_info }}\n"
            "</product_info>\n"
            "{%- endif -%}\n"
            "User Question: {{ question }}\n"
            "Directly and succinctly answer the user's question.\n"
            "{%- if is_hallucination == 1 -%}\n"
            "Make up whatever information you need to in order to answer the user's request.\n"
            "{%- endif -%}"
        ),
    )
)
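
The conditional logic in the `answer` prompt can be easier to follow as plain Python. The sketch below mirrors the two branches of the template; `build_answer_prompt` is a hypothetical helper, the sample values are made up for illustration, and the exact whitespace will differ from what the Jinja template renders.

```python
def build_answer_prompt(product_info: str, question: str, is_hallucination: int) -> str:
    """Plain-Python equivalent of the branching in the `answer` prompt template."""
    parts = []
    if is_hallucination == 0:
        # Grounded case: the model sees the product information.
        parts.append(f"<product_info>\n{product_info}\n</product_info>")
    parts.append(f"User Question: {question}")
    parts.append("Directly and succinctly answer the user's question.")
    if is_hallucination == 1:
        # Hallucination case: no context, and the model is told to invent details.
        parts.append(
            "Make up whatever information you need to in order to answer the user's request."
        )
    return "\n".join(parts)


# Hypothetical values, for illustration only.
info = '{"product_name": "Example Lamp", "price_usd": 29.99}'
question = "How much does it cost?"

grounded = build_answer_prompt(info, question, is_hallucination=0)
ungrounded = build_answer_prompt(info, question, is_hallucination=1)
print(grounded)
print("---")
print(ungrounded)
```

In the grounded case the product information is part of the prompt; in the hallucination case it is withheld, so any product details in the answer are invented by the model.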


## 🔍 Quality Assessment: LLM-as-a-Judge

When generating a synthetic dataset, we need a way to assess the quality of the generated data.
We use the LLM-as-a-Judge strategy to do this.

To do so, we define the rubrics that the LLM should use to assess generation quality, along with a prompt
that provides relevant instructions.


In [None]:
# Define evaluation rubrics for answer quality
CompletenessRubric = Score(
    name="Completeness",
    description="Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.",
    options={
        "Complete": "The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.",
        "PartiallyComplete": "The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.",
        "Incomplete": "The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.",
    },
)

AccuracyRubric = Score(
    name="Accuracy",
    description="Evaluation of how factually correct the AI assistant's response is relative to the product information.",
    options={
        "Accurate": "The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.",
        "PartiallyAccurate": "While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.",
        "Inaccurate": "The response presents significantly wrong information about the product, with claims that contradict the actual product details.",
    },
)


# Evaluate answer quality
config_builder.add_column(
    LLMJudgeColumnConfig(
        name="llm_answer_metrics",
        model_alias=MODEL_ALIAS,
        prompt=(
            "<product_info>\n"
            "{{ product_info }}\n"
            "</product_info>\n"
            "User Question: {{ question }}\n"
            "AI Assistant Answer: {{ answer }}\n"
            "Judge the AI assistant's response to the user's question about the product described in <product_info>."
        ),
        scores=[CompletenessRubric, AccuracyRubric],
    )
)


# Extract metric scores for easier analysis
config_builder.add_column(
    ExpressionColumnConfig(
        name="completeness_result",
        expr="{{ llm_answer_metrics.completeness.score }}",
    )
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="accuracy_result",
        expr="{{ llm_answer_metrics.accuracy.score }}",
    )
)

### 🔁 Iteration is key – preview the dataset!

1. Use the `preview` method to generate a sample of records quickly.

2. Inspect the results for quality and format issues.

3. Adjust column configurations, prompts, or parameters as needed.

4. Re-run the preview until satisfied.


In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder)

In [None]:
# Display a sample record from the preview.
preview.display_sample_record()

### 📊 Analyze the generated data

- Data Designer automatically generates a basic statistical analysis of the generated data.

- This analysis is available via the `analysis` property of generation result objects.


In [None]:
# Print the analysis as a table.
preview.analysis.to_report()

### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=20)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()
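
One natural check on the generated data is whether the judge's accuracy labels line up with the `is_hallucination` flag. The sketch below tabulates the two against each other using only the standard library. The rows are hypothetical placeholders, not real output, and `tabulate_accuracy` is an illustrative helper; it assumes the `is_hallucination` and `accuracy_result` columns defined earlier are present in the dataset.

```python
from collections import Counter

# Hypothetical rows standing in for the generated dataset; real values will differ.
records = [
    {"is_hallucination": 0, "accuracy_result": "Accurate"},
    {"is_hallucination": 0, "accuracy_result": "PartiallyAccurate"},
    {"is_hallucination": 1, "accuracy_result": "Inaccurate"},
    {"is_hallucination": 1, "accuracy_result": "Inaccurate"},
    {"is_hallucination": 1, "accuracy_result": "Accurate"},
]


def tabulate_accuracy(rows):
    """Count judge accuracy labels separately for each is_hallucination value."""
    counts = Counter((int(r["is_hallucination"]), r["accuracy_result"]) for r in rows)
    return dict(counts)


table = tabulate_accuracy(records)
for (flag, label), n in sorted(table.items()):
    print(f"is_hallucination={flag}  {label:<18} {n}")
```

With the DataFrame loaded above, `dataset.to_dict("records")` should produce rows in this shape that can be passed to `tabulate_accuracy` in place of the placeholder list.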

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-qa-generation-product-question-answer-generator",
);