# DFW Blueprint Synthetic Data Generation with NeMo Data Designer

This notebook shows how to use [NeMo Data Designer](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html) to generate high-quality, domain-specific synthetic data at scale.

We break this process down into the following three steps:

1. Create a Data Designer client
2. Specify data schema and generation processes
3. Generate data

Note that this notebook uses a free, managed Data Designer service hosted on [build.nvidia.com](https://build.nvidia.com/nemo/data-designer). The managed service is useful for getting started quickly and generating small sample datasets. For real-world use cases, deploy your own Data Designer service by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/docker-compose.html).

In [None]:
# Ensure Data Designer is included in nemo-microservices
!uv pip install "nemo-microservices[data-designer]>=1.3.0"

## Create Data Designer Client

Our first step is to create a Data Designer client, which is used to submit data generation requests to the Data Designer microservice.

In [None]:
# Get managed Data Designer API key by clicking the "Get API Key" button on https://build.nvidia.com/nemo/data-designer
from getpass import getpass

api_key = getpass("Enter Data Designer API key: ")
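
If you run this notebook non-interactively, prompting with `getpass` may not be practical. The sketch below is one possible way to look for the key in an environment variable first and fall back to the prompt. The variable name `NVIDIA_API_KEY` and the helper `resolve_api_key` are hypothetical names chosen for illustration; they are not part of Data Designer.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="NVIDIA_API_KEY", environ=None, prompt_fn=getpass):
    """Return the API key from the environment, or prompt for it.

    `env_var` is a hypothetical variable name; use whatever your setup defines.
    `environ` and `prompt_fn` are injectable so the helper is easy to test.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to an interactive prompt when the variable is unset or empty.
    return prompt_fn("Enter Data Designer API key: ").strip()
```

With this helper, `api_key = resolve_api_key()` could replace the direct `getpass` call above.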

In [None]:
from nemo_microservices.data_designer.essentials import NeMoDataDesignerClient

data_designer_client = NeMoDataDesignerClient(
    base_url="https://ai.api.nvidia.com/v1/nemo/dd",   # Managed Data Designer service on build.nvidia.com
    default_headers={"Authorization": f"Bearer {api_key}"}
)

## Create Configuration Builder

Next, let's create a configuration builder, which lets us define the dataset schema and generation process.

We'll define both the [columns](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/index.html) that we want the generated dataset to have as well as any [models](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) used to generate those columns. 

In [None]:
from nemo_microservices.data_designer.essentials import DataDesignerConfigBuilder, ModelConfig, InferenceParameters

model_alias = "nemotron-nano-v2"
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model="nvidia/nvidia-nemotron-nano-9b-v2",
            inference_parameters=InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            )
        ),
    ]
)

In this example, we'll generate a dataset of customer queries for NVIDIA's gear store. Below we list the full set of available products.

In [None]:
PRODUCT_NAMES = [
    "GEFORCE NOW $50 MEMBERSHIP GIFT CARD",
    "NVIDIA® GEFORCE RTX™ 4090",
    "NVIDIA® GEFORCE RTX™ 4080",
    "NVIDIA® GEFORCE RTX™ 4070",
    "NVIDIA® GEFORCE™ RTX 4060TI 8GB",
    "NVIDIA® GEFORCE RTX™ 4070 SUPER",
    "NVIDIA® SHIELD™ REMOTE (2020)",
    "NVIDIA® SHIELD™ TV 2019",
    "NVIDIA® SHIELD™ TV PRO 2019",
    "JETSON NANO DEVELOPER KIT",
    "NVIDIA JETSON ORIN NANO DEVELOPER KIT",
    "UNISEX TMYBTMYS CUDA TEE",
    "MEN'S BEYOND YOGA CREW T-SHIRT - DARKEST NIGHT",
    "MEN'S BEYOND YOGA TAKE IT EASY SHORTS",
    "MEN'S BEYOND YOGA TAKE IT EASY PANT - DARKEST NIGHT",
    "MEN'S LOGO TEE",
    "NEXT LEVEL ECO UNISEX TEE UNISEX",
    "NVIDIA DUO-TONE LOGO UNISEX TEE",
    "UNISEX INSPIRE 365 TEE",
    "NORTH FACE MEN'S TREKKER JACKET",
    "MARINE LAYER CORBET FULL ZIP MEN'S",
    "MARINE LAYER CORBET FULL ZIP VEST MEN'S",
    "CHAMPION REVERSE WEAVE CREWNECK UNISEX",
    "FLUX BONDED JACKET 2.0 MEN'S LIGHT HEATHER",
    "FULL ZIP NEURAL HOODIE UNISEX",
    "LONG SLEEVE PERFORMANCE SHIRT MEN'S",
    "NIKE 2.0 POLO MEN'S",
    "COTTON TOUCH POLO MEN'S",
    "AUTONOMOUS TIME TRAVELER TEE UNISEX",
    "GEFORCE LOGO HIGH DENSITY TEE UNISEX",
    "GEFORCE TRIANGULATION TEE UNISEX",
    "GEFORCE ABSTRACTION TEE UNISEX",
    "WIREFRAME EYE GRAPHIC TEE UNISEX",
    "HEROES OF NVIDIA 3.0 UNISEX TEE",
    "I AM AI TEE MEN'S",
    "MEN'S LULULEMON ABC JOGGER",
    "NVIDIA TWILL HAT - SAGE",
    "12\" MARLED KNIT CUFF BEANIE",
    "NVIDIA GREEN LABEL BEANIE",
    "NVIDIA GEFORCE HAT",
    "NVIDIA CORDUROY HAT",
    "ANKLE SOCK UNISEX",
    "NVIDIA ATHLETIC CREW SOCKS",
    "NVIDIA LIGHTWEIGHT HOODIE",
    "YOUTH LOGO TEE",
    "PORTABLE FOLDING CHAIR",
    "TITLEIST® PRO V1® HALF DOZEN GOLF BALLS",
    "LAPEL PIN",
    "GEFORCE 40-SERIES ALUMINUM COASTER",
    "S'WELL 16OZ PET BOWL",
    "NVIDIA ICONOGRAPHY BANDANA",
    "CLUBMAN SUNGLASSES",
    "NVIDIA NIMBLE CHAMP PRO CHARGER",
    "NVIDIA LICENSE PLATE FRAME",
    "40 OZ. STANLEY QUENCHER TUMBLER",
    "20 OZ. ELEMENTAL WATER BOTTLE",
    "25 OZ. NVIDIA ICONOGRAPHY BOTTLE",
    "32 OZ. NVIDIA ICONOGRAPHY BOTTLE",
    "14 OZ. VISUAL PURR-CEPTION MUG",
    "14 OZ. A NEW BREED OF INNOVATION MUG",
    "14 OZ. I AM AI MUG",
    "14 OZ. NVIDIA LOGO MUG",
    "TIMBUK2 LAPTOP SLEEVE - 2 SIZES",
    "NVIDIA SMALL INFINITY MOUSEPAD",
    "NVIDIA FLOW LARGE MOUSEPAD",
    "NVIDIA ICONOGRAPHY KRAFT POCKET JOURNAL -3 PACK",
    "RULER 2.0",
    "RULER",
    "TROIKA CONSTRUCTION PEN",
    "PRODIR® PATTERN PEN",
    "INTENSITY CLIC GEL PEN",
    "COMPUTER CARE KIT",
    "TIMBUK2 VAPOR BACKPACK TOTE - GRAPHITE",
    "NVIDIA HEX SLING BAG - GRAY",
    "IGLOO COAST COOLER",
    "OGIO CATALYST DUFFEL - BLACK",
    "TIMBUK2 PARKSIDE BACKPACK 2.0",
    "VICTORINOX LAPTOP WOMEN'S TOTE",
    "TOPO ROVER TECH PACK",
    "TIMBUK2 COPILOT ROLLER LUGGAGE",
    "TOTE BAG",
    "OGIO® STRATAGEM BACKPACK",
    "GIFT BAG",
    "WOMEN'S BEYOND YOGA REFOCUS TANK",
    "WOMEN'S BEYOND YOGA MOVEMENT SKIRT",
    "WOMEN'S BEYOND YOGA IN-STRIDE PULLOVER - BLACK",
    "WOMEN'S V-NECK TEE",
    "MARINE LAYER CORBET FULL ZIP VEST WOMEN'S",
    "WOMEN'S BEYOND YOGA SPACEDYE RACERBACK CROPPED TANK",
    "WOMEN'S LOGO TEE",
    "WOMEN'S FLEECE CROPPED CREW",
    "NORTH FACE WOMEN'S TREKKER JACKET",
    "FLUX BONDED JACKET WOMEN'S LIGHT HEATHER",
    "NIKE 2.0 POLO WOMEN'S",
    "COTTON TOUCH POLO WOMEN'S",
    "LONG SLEEVE PERFORMANCE SHIRT WOMEN'S",
    "I AM AI TEE WOMEN'S",
    "WOMEN'S POCKET LEGGINGS",
]

Let's add our first column, `product_name`, to the configuration builder. We do this using a [sampling-based column](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/column-types/sampling-based-columns.html), which allows us to generate data through statistical sampling methods and distributions. In this case, we randomly sample from our list of known product names. 

In [None]:
from nemo_microservices.data_designer.essentials import SamplerColumnConfig, SamplerType, CategorySamplerParams

config_builder.add_column(
    SamplerColumnConfig(
        name="product_name",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=PRODUCT_NAMES,
        ),
    )
)
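
To build intuition for what a category sampler produces, here is a minimal plain-Python sketch. It assumes each value is drawn uniformly at random with replacement, which is an assumption about the sampler's behavior when no weights are given, not a description of the service's implementation. The function `sample_category` and the short product list are illustrative only.

```python
import random


def sample_category(values, num_records, seed=None):
    """Draw `num_records` values uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(num_records)]


# A short stand-in for PRODUCT_NAMES so the sketch is self-contained.
products = ["14 OZ. NVIDIA LOGO MUG", "TOTE BAG", "RULER"]
rows = sample_category(products, num_records=5, seed=0)
print(rows)
```

Because sampling is with replacement, the same product can appear in several records, which is what we want when generating many queries about a fixed catalog.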

Now that we have product names, let's add schema information about customer queries, including:

- The category of the customer query (e.g. asking for order status, initiating a product return, etc.)
- Sentiment of the query
- How long the customer query should be (including a minimum length)

In [None]:
from nemo_microservices.data_designer.essentials import GaussianSamplerParams

# Define type of customer query
config_builder.add_column(
    SamplerColumnConfig(
        name="query_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["ToReturnProcessing", "ToOrderStatusAssistant", "HandleOtherTalk", "ToProductQAAssistant"],
        ),
    )
)

# Define sentiment of the customer query
config_builder.add_column(
    SamplerColumnConfig(
        name="query_sentiment",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["frustrated", "negative", "positive", "confused", "neutral", "satisfied", "happy", "very happy"],
        ),
    )
)

# How long the customer query should be.
# This is sampled from a Gaussian distribution.
config_builder.add_column(
    SamplerColumnConfig(
        name="query_len",
        sampler_type=SamplerType.GAUSSIAN,
        params=GaussianSamplerParams(mean=35, stddev=18),
        convert_to="int"
    )
)

# Add a constraint on how short a customer query can be
config_builder.add_constraint(
    constraint_type="scalar_inequality",
    target_column="query_len",
    operator="ge",
    rhs=5,
)
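
The constraint matters because a Gaussian with mean 35 and standard deviation 18 places roughly 5% of its mass below 5, including some negative values. The sketch below illustrates the combined effect of the sampler, the integer conversion, and the constraint using rejection sampling. Both the rejection strategy and the use of rounding for the integer conversion are assumptions made for illustration; the service may enforce the constraint and convert values differently.

```python
import random
from statistics import NormalDist


def sample_query_len(rng, mean=35.0, stddev=18.0, minimum=5):
    """Draw an integer length from a Gaussian, redrawing until it is >= minimum."""
    while True:
        value = int(round(rng.gauss(mean, stddev)))
        if value >= minimum:
            return value


rng = random.Random(0)
lengths = [sample_query_len(rng) for _ in range(1000)]

# Share of unconstrained draws that would fall below the minimum of 5.
share_below_min = NormalDist(mu=35, sigma=18).cdf(5)
print(min(lengths), max(lengths), round(share_below_min, 3))
```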

Finally, let's add a column for the customer query itself. We'll use an [LLM-based column](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/column-types/llm-based-columns.html) to generate realistic customer query text. We can reference other columns (for example, `product_name`) in the model prompt and use [Jinja templating](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/using-jinja-templates.html) to add control flow logic.

In [None]:
from nemo_microservices.data_designer.essentials import LLMTextColumnConfig

# Generate the customer query text, branching on the sampled query category
config_builder.add_column(
    LLMTextColumnConfig(
        name="query",
        prompt="""\
{%- if query_category == "ToReturnProcessing" -%}
You are a customer of the NVIDIA Gear Store. Generate a customer query for customer care about processing a return for the product {{ product_name }}. \
{%- elif query_category == "ToOrderStatusAssistant" -%}
You are a customer of the NVIDIA Gear Store. Generate a customer query for customer care about asking for the status of an order for the product {{ product_name }}. \
{%- elif query_category == "ToProductQAAssistant" -%}
You are a customer of the NVIDIA Gear Store. Generate a customer query for customer care about asking for more details or questions about the product {{ product_name }}. \
{%- elif query_category == "HandleOtherTalk" -%}
Generate a realistic customer query directed to a customer care agent, unrelated to product orders (e.g., small talk, random topics). \
{%- endif -%}
- Query Sentiment: {{ query_sentiment }}. \
- Target length: about {{ query_len }} words. \
- Vary the topics, tone, and sentence structure. \
- If you are talking about a product, you may or may not name the product in its full name. \
- Do not always start with a greeting. \
- Occasionally use informal language, typos, or incomplete sentences. \
- Be creative and make queries sound like real people. \
- You must return a non-empty value. \
""",
        model_alias=model_alias,
        system_prompt="/no_think",
    )
)
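
To see how the sampled columns end up in the prompt, the sketch below mirrors the template's branching in plain Python. It is a simplified stand-in, not Jinja and not how the service renders prompts: it keeps only the first two bullet points, and the names `TASKS` and `render_prompt` are hypothetical.

```python
# Category-specific task text, mirroring the if/elif branches of the template.
TASKS = {
    "ToReturnProcessing": "processing a return for the product {product_name}",
    "ToOrderStatusAssistant": "asking for the status of an order for the product {product_name}",
    "ToProductQAAssistant": "asking for more details or questions about the product {product_name}",
}


def render_prompt(row):
    """Build a shortened version of the prompt for one sampled row."""
    category = row["query_category"]
    if category == "HandleOtherTalk":
        # This branch does not mention the product at all.
        opening = (
            "Generate a realistic customer query directed to a customer care agent, "
            "unrelated to product orders (e.g., small talk, random topics)."
        )
    else:
        task = TASKS[category].format(product_name=row["product_name"])
        opening = (
            "You are a customer of the NVIDIA Gear Store. "
            f"Generate a customer query for customer care about {task}."
        )
    return (
        f"{opening} - Query Sentiment: {row['query_sentiment']}. "
        f"- Target length: about {row['query_len']} words."
    )


row = {
    "product_name": "TOTE BAG",
    "query_category": "ToReturnProcessing",
    "query_sentiment": "frustrated",
    "query_len": 20,
}
print(render_prompt(row))
```

Note that for `HandleOtherTalk` rows the product name is sampled but never reaches the model, so those queries should not be expected to mention a product.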

## Generate Preview of Synthetic Data

Now that our configuration builder schema is set, let's generate a sample of our dataset to preview (defaults to 10 records).

In [None]:
preview = data_designer_client.preview(config_builder)

The data sample can be accessed as a pandas DataFrame using the `.dataset` attribute:

In [None]:
df = preview.dataset
df.head()
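
Beyond eyeballing the first rows, a few quick checks can help when iterating on the configuration. The sketch below assumes the preview contains the columns defined above and works on a list of row dictionaries, which you could obtain from the DataFrame with `df.to_dict("records")`. The helper `summarize` is hypothetical, and the rows shown are hand-written placeholders, not real service output.

```python
from collections import Counter


def summarize(records, min_len=5):
    """Count rows per category and flag rows that violate basic expectations."""
    by_category = Counter(r["query_category"] for r in records)
    too_short = sum(1 for r in records if r["query_len"] < min_len)
    empty_queries = sum(1 for r in records if not str(r["query"]).strip())
    return {
        "by_category": dict(by_category),
        "too_short": too_short,
        "empty_queries": empty_queries,
    }


# Hand-written placeholder rows standing in for df.to_dict("records").
records = [
    {"query_category": "ToReturnProcessing", "query_len": 20, "query": "placeholder text"},
    {"query_category": "HandleOtherTalk", "query_len": 7, "query": "placeholder text"},
    {"query_category": "ToReturnProcessing", "query_len": 41, "query": "   "},
]
report = summarize(records)
print(report)
```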

You can iterate on this process, refining the configuration builder schema to optimize your design before generating a full dataset.

When you're ready, use the Data Designer client's `.create(...)` method to [generate the full dataset](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/generating-data.html).

```python
job_result = data_designer_client.create(
    config_builder,
    num_records=1_000,
    wait_until_done=True,
)
df = job_result.load_dataset()
```

## Resources

- [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html)
- [NeMo Data Designer tutorials](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/tutorials/index.html)
- [Deploying Data Designer](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/docker-compose.html)