# 🎨 NeMo Data Designer: Product Information Dataset Generator with Q&A

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers. This dataset can be used for training and evaluating Q&A systems focused on product information.


#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.


In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.
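
To make the alias mechanism concrete, here is a minimal plain-Python sketch of the idea. It is not the library's implementation: `ModelEntry`, `MODEL_ENTRIES`, and `resolve_alias` are hypothetical names used only for illustration. Each column names a `model_alias`, and that alias is looked up among the model configs supplied at initialization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical stand-in for a model config; not a Data Designer class."""
    alias: str
    model: str


# The kind of list a builder receives at initialization
# (values taken from this notebook).
MODEL_ENTRIES = [
    ModelEntry(alias="nemotron-nano-9b-v2", model="nvidia/nvidia-nemotron-nano-9b-v2"),
]


def resolve_alias(alias: str, entries: list[ModelEntry]) -> ModelEntry:
    """Return the entry registered under `alias`, or raise if it is unknown."""
    by_alias = {entry.alias: entry for entry in entries}
    if alias not in by_alias:
        raise KeyError(f"Unknown model alias: {alias!r}")
    return by_alias[alias]


print(resolve_alias("nemotron-nano-9b-v2", MODEL_ENTRIES).model)
```

The practical consequence is that every `model_alias` used in a column later in the notebook must match an alias declared in `model_configs`.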


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)

## Defining Data Structures

Now we'll define the data models and evaluation rubrics for our product information dataset.

In [None]:
import string
from pydantic import BaseModel
from pydantic import Field

In [None]:
# Define product information structure
class ProductInfo(BaseModel):
    product_name: str = Field(..., description="A realistic product name for the market.")
    key_features: list[str] = Field(..., min_length=1, max_length=3, description="Key product features.")
    description: str = Field(..., description="A short, engaging description of what the product does, highlighting a unique but believable feature.")
    price_usd: float = Field(..., description="The stated price in USD.")
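
As a rough illustration of what this schema asks the model to produce, the sketch below checks a hand-written record against the same constraints using only the standard library. The record values are invented for illustration, and `check_product_info` is a hypothetical helper. In the notebook itself, the `ProductInfo` model expresses these constraints declaratively.

```python
import json

# Hand-written example record (hypothetical values, not model output).
raw = """
{
  "product_name": "Quartz Trail Lantern",
  "key_features": ["Rechargeable battery", "Water resistant"],
  "description": "A compact lantern that doubles as a power bank.",
  "price_usd": 49.99
}
"""


def check_product_info(record: dict) -> list[str]:
    """Return a list of constraint violations (empty if the record conforms)."""
    problems = []
    if not isinstance(record.get("product_name"), str):
        problems.append("product_name must be a string")
    features = record.get("key_features")
    if not (isinstance(features, list) and all(isinstance(f, str) for f in features)):
        problems.append("key_features must be a list of strings")
    elif not 1 <= len(features) <= 3:
        problems.append("key_features must contain 1 to 3 items")
    if not isinstance(record.get("description"), str):
        problems.append("description must be a string")
    if not isinstance(record.get("price_usd"), (int, float)):
        problems.append("price_usd must be a number")
    return problems


print(check_product_info(json.loads(raw)))
```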

In [None]:
# Define evaluation rubrics for answer quality
CompletenessRubric = P.Rubric(
    name="Completeness",
    description="Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.",
    scoring={
        "Complete": "The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.",
        "PartiallyComplete": "The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.",
        "Incomplete": "The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.",
    }
)

AccuracyRubric = P.Rubric(
    name="Accuracy",
    description="Evaluation of how factually correct the AI assistant's response is relative to the product information.",
    scoring={
        "Accurate": "The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.",
        "PartiallyAccurate": "While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.",
        "Inaccurate": "The response presents significantly wrong information about the product, with claims that contradict the actual product details.",
    }
)

## Data Generation Workflow

Now we'll configure the data generation workflow to create product information, questions, and answers.
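
Two of the columns in the next cell derive a price: an integer `price_tens_of_dollars` is sampled between 1 and 200, multiplied by 10, and reduced by one cent, so every price ends in .99. Below is a plain-Python sketch of that arithmetic. The function name `price_from_tens` is hypothetical; in the notebook the computation is done by the Jinja expression in the `product_price` column.

```python
def price_from_tens(price_tens_of_dollars: int) -> float:
    """Mirror the `product_price` expression: tens of dollars -> price ending in .99."""
    return round(price_tens_of_dollars * 10 - 0.01, 2)


for tens in (1, 25, 200):
    print(tens, price_from_tens(tens))
```

Sampling the price first and passing it into the prompt lets the price steer what kind of product the model describes within each category.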

In [None]:
# Define product category options
config_builder.add_column(
    name="category",
    type="category",
    params={"values": ['Electronics', 'Clothing', 'Home Appliances', 'Groceries', 'Toiletries',
                       'Sports Equipment', 'Toys', 'Books', 'Pet Supplies', 'Tools & Home Improvement',
                       'Beauty', 'Health & Wellness', 'Outdoor Gear', 'Automotive', 'Jewelry',
                       'Watches', 'Office Supplies', 'Gifts', 'Arts & Crafts', 'Baby & Kids',
                       'Music', 'Video Games', 'Movies', 'Software', 'Tech Devices']}
)

# Define price range to seed realistic product types
config_builder.add_column(
    name="price_tens_of_dollars",
    type="uniform",
    params={"low": 1, "high": 200},
    convert_to="int"
)

config_builder.add_column(
    name="product_price",
    type="expression",
    expr="{{ ((price_tens_of_dollars * 10) - 0.01) | round(2) }}",
    dtype="float"
)

# Generate first letter for product name to ensure diversity
config_builder.add_column(
    name="first_letter",
    type="category",
    params={"values": list(string.ascii_uppercase)}
)

# Generate product information
config_builder.add_column(
    name="product_info",
    type="llm-structured",
    model_alias=model_alias,
    prompt="""\
Generate a realistic product description for a product in the {{ category }} category that costs {{ product_price }}.
The name of the product MUST start with the letter {{ first_letter }}.\
""",
    output_format=ProductInfo
)

# Generate user questions about the product
config_builder.add_column(
    name="question",
    type='llm-text',
    model_alias=model_alias,
    prompt="Ask a question about the following product:\n\n {{ product_info }}",
)

# Determine if this example will include hallucination
config_builder.add_column(
    name="is_hallucination",
    type="bernoulli",
    params={"p": 0.5}
)

# Generate answers to the questions
config_builder.add_column(
    name="answer",
    type='llm-text',
    model_alias=model_alias,
    prompt="""\
{%- if is_hallucination == 0 -%}
<product_info>
{{ product_info }}
</product_info>

{% endif -%}
User Question: {{ question }}

Directly and succinctly answer the user's question.\
{%- if is_hallucination == 1 %} Make up whatever information you need to in order to answer the user's request.\
{%- endif -%}
"""
)

# Evaluate answer quality
config_builder.add_column(
    name="llm_answer_metrics",
    type="llm-judge",
    model_alias=model_alias,
    prompt="""\
<product_info>
{{ product_info }}
</product_info>

User Question: {{ question }}
AI Assistant Answer: {{ answer }}

Judge the AI assistant's response to the user's question about the product described in <product_info>.\
""",
    rubrics=[CompletenessRubric, AccuracyRubric]
)

# Extract metric scores for easier analysis
config_builder.add_column(
    name="completeness_result",
    type="expression",
    expr="{{ llm_answer_metrics.Completeness.score }}"
)

config_builder.add_column(
    name="accuracy_result",
    type="expression",
    expr="{{ llm_answer_metrics.Accuracy.score }}"
)
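
The `answer` prompt above branches on `is_hallucination`: when the flag is 0 the model sees the product information, and when it is 1 the product information is withheld and the model is told to improvise. The plain-Python sketch below mirrors that branching for illustration only. `build_answer_prompt` is a hypothetical helper, and the exact whitespace produced by the Jinja template may differ.

```python
def build_answer_prompt(product_info: str, question: str, is_hallucination: int) -> str:
    """Assemble the answer prompt the way the two template branches describe."""
    parts = []
    if is_hallucination == 0:
        # Grounded case: the model is shown the product details.
        parts.append(f"<product_info>\n{product_info}\n</product_info>\n")
    parts.append(f"User Question: {question}\n")
    instruction = "Directly and succinctly answer the user's question."
    if is_hallucination == 1:
        # Ungrounded case: no product details, and the model is told to improvise.
        instruction += (
            " Make up whatever information you need to in order to answer the user's request."
        )
    parts.append(instruction)
    return "\n".join(parts)


grounded = build_answer_prompt("{...product fields...}", "Is it waterproof?", 0)
ungrounded = build_answer_prompt("{...product fields...}", "Is it waterproof?", 1)
print(grounded)
print("---")
print(ungrounded)
```

Because the judge prompt always includes the product information, answers written without it can be scored against the real product details.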

## Generate the Preview

Let's examine a sample record to understand the generated data.

In [None]:
# Preview the generated data
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
preview.display_sample_record()

In [None]:
preview.dataset
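
One way to use the extracted score columns is to cross-tabulate `accuracy_result` against `is_hallucination`; answers generated without product information would be expected to score lower on accuracy. The sketch below uses hand-written rows (hypothetical values, not real output) as plain dictionaries. Converting `preview.dataset` into this shape, for example with a DataFrame `to_dict("records")` call, is an assumption about its type.

```python
from collections import Counter

# Hypothetical rows standing in for generated records.
rows = [
    {"is_hallucination": 0, "accuracy_result": "Accurate"},
    {"is_hallucination": 0, "accuracy_result": "PartiallyAccurate"},
    {"is_hallucination": 1, "accuracy_result": "Inaccurate"},
    {"is_hallucination": 1, "accuracy_result": "Inaccurate"},
]

# Count (flag, score) pairs to see how the judge's labels split by flag.
tally = Counter((row["is_hallucination"], row["accuracy_result"]) for row in rows)

for (flag, score), count in sorted(tally.items()):
    print(f"is_hallucination={flag} accuracy={score}: {count}")
```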

## Generating the Full Dataset

Now that we've verified our data model looks good, let's generate a full dataset.

In [None]:
# Run the job
job_results = data_designer_client.create(config_builder, num_records=1, wait_until_done=False)

job_results.wait_until_done()

In [None]:
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()