# LangGraph CLI Synthetic Dataset Generation

This notebook demonstrates how to use **NVIDIA NeMo Data Designer** to create a synthetic dataset for training an AI agent to translate natural language queries into structured CLI tool calls.

## What is NeMo Data Designer?

[NeMo Data Designer](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/index.html) is a powerful synthetic data generation engine that transforms data designs into high-quality datasets. It supports:

- **Sampling-based columns**: Generate values from statistical distributions (uniform, categorical, etc.)
- **LLM-based columns**: Use language models to generate realistic text or structured outputs
- **Expression columns**: Compute values based on other columns using Python expressions
- **Jinja templating**: Create dynamic prompts with conditional logic

## Use Case: LangGraph CLI Agent

We're generating training data for an agent that can interpret natural language requests like:
> "Create a new project using the react-agent template"

And convert them to structured tool calls:
```json
{"command": "new", "template": "react-agent", "path": null, ...}
```

This synthetic data will enable fine-tuning an LLM to perform accurate tool-calling for the LangGraph CLI.
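To make the target format concrete, here is a minimal sketch of how a downstream tool could turn such a tool call into a command line. The field names follow the `CLIToolCall` schema defined in Step 2. The flag names (`--template`, `--port`, `--no-browser`, `--watch`, `--tag`) and the positional arguments are assumptions about the LangGraph CLI and should be checked against `langgraph --help`. `to_argv` is a hypothetical helper, not part of any library.

```python
import shlex

def to_argv(call: dict) -> list:
    """Turn a CLIToolCall-style dict into an argument list (flag names are assumed)."""
    cmd = call["command"]
    argv = ["langgraph", cmd]
    if cmd == "new":
        if call.get("path"):
            argv.append(call["path"])  # assumed positional project path
        if call.get("template"):
            argv += ["--template", call["template"]]
    elif cmd in ("dev", "up"):
        if call.get("port") is not None:
            argv += ["--port", str(call["port"])]
        if cmd == "dev" and call.get("no_browser"):
            argv.append("--no-browser")
        if cmd == "up" and call.get("watch"):
            argv.append("--watch")
    elif cmd == "build":
        if call.get("tag"):
            argv += ["--tag", call["tag"]]
    elif cmd == "dockerfile":
        if call.get("output_path"):
            argv.append(call["output_path"])  # assumed positional save path
    else:
        raise ValueError(f"unknown command: {cmd!r}")
    return argv

call = {"command": "new", "template": "react-agent", "path": None}
print(shlex.join(to_argv(call)))
```

Fields left as `null` are simply skipped, which is why the training data asks the model to set only the fields relevant to each command.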


## Step 1: Connect to the Data Designer Service

The NeMo Microservices Python SDK provides a streamlined interface for interacting with Data Designer. The `NeMoDataDesignerClient` wrapper offers convenience methods like automatic dataset loading and `wait_until_done` functionality.

We're connecting to a local Data Designer instance running via Docker Compose (see `nemo-microservices-quickstart_v25.11/` for the deployment configuration). When running locally, no API key is required.

In [75]:
from nemo_microservices.data_designer.essentials import (
    DataDesignerConfigBuilder, LLMTextColumnConfig, SamplerColumnConfig, LLMStructuredColumnConfig, SamplerType,
    CategorySamplerParams, UniformSamplerParams,
    ModelConfig, InferenceParameters,
    NeMoDataDesignerClient
)

client = NeMoDataDesignerClient(base_url="http://localhost:8080")  # local service (no API key needed)


## Step 2: Design the Synthetic Data Schema

This is where we define the structure of our synthetic dataset using Data Designer's [column types](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/index.html). Our design uses three column types (sampler, LLM text, and LLM structured), supported by an output schema and a model configuration:

### 2.1 Output Schema (Pydantic Model)

First, we define a `CLIToolCall` Pydantic model that represents the structured output we want the LLM to generate. This enables **Structured Outputs**: generating complex nested data objects with specific schemas rather than free-form text.

### 2.2 Model Configuration

We configure the LLM that will power our data generation. We're using `nvidia/nvidia-nemotron-nano-9b-v2` via NVIDIA's build.nvidia.com API. The `ModelConfig` specifies:
- **alias**: A friendly name to reference this model in column definitions
- **provider**: `nvidiabuild` for NVIDIA-hosted models
- **inference_parameters**: Temperature, top_p, and max_tokens for generation control

### 2.3 Sampler Columns

[Sampling-based columns](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/column-types/sampling-based-columns/index.html) generate values from statistical distributions without LLM calls:

| Column | Sampler Type | Purpose |
|--------|--------------|---------|
| `command` | Category | Randomly select one of 5 CLI commands |
| `template` | Category | Template names for `new` command |
| `include_path` | Category (weighted) | Boolean with 25% chance of custom path |
| `port` | Uniform | Random port number between 3000 and 9000 |
| `no_browser` | Category (weighted) | Boolean with 20% chance of true |
| `watch` | Category (weighted) | Boolean with 33% chance of true |
| `image_tag` | Category | Docker image tag options |
| `dockerfile_path` | Category | Output path options |
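
The percentages in the table follow from normalizing the category weights used in the code cell below (`[1, 3]`, `[1, 4]`, `[1, 2]`). The sketch here reproduces that arithmetic in plain Python. It assumes the sampler treats weights as relative frequencies, which is what the table implies. `weights_to_probs` is a hypothetical helper and `random.choices` only stands in for the Data Designer sampler.

```python
import random
from collections import Counter

def weights_to_probs(values, weights):
    """Normalize relative weights into probabilities per value."""
    total = sum(weights)
    return {v: w / total for v, w in zip(values, weights)}

# Weights taken from the sampler definitions in this notebook.
for name, weights in [("include_path", [1, 3]), ("no_browser", [1, 4]), ("watch", [1, 2])]:
    p_true = weights_to_probs([True, False], weights)[True]
    print(f"{name}: P(True) = {p_true:.2f}")

# Empirical check with the standard library's weighted sampling.
rng = random.Random(0)
draws = rng.choices([True, False], weights=[1, 3], k=10_000)
share_true = Counter(draws)[True] / len(draws)
print(f"empirical share of True for weights [1, 3]: {share_true:.3f}")
```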

### 2.4 LLM-Based Columns with Jinja Templates

[LLM-based columns](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/column-types/llm-based-columns/index.html) use language models to generate content. We use [Jinja templating](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/define-your-data-columns/using-jinja-templates/index.html) to create dynamic prompts that reference sampled values:

- **`input` column** (`LLMTextColumnConfig`): Generates natural language user requests based on the sampled command type and parameters
- **`output` column** (`LLMStructuredColumnConfig`): Converts the natural language input into a structured `CLIToolCall` JSON object

The Jinja conditionals (`{% if command == 'new' %}...{% endif %}`) ensure the prompt is tailored to each command type, producing realistic and contextually appropriate training examples.
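
To see what the conditionals do for one sampled row, the sketch below mirrors the branching of the `input` prompt (defined in the next cell) in plain Python. It is an illustration only, not how Data Designer renders templates. `describe_request` and the example row are hypothetical.

```python
def describe_request(row: dict) -> str:
    """Plain-Python mirror of the {% if %} chain in the `input` prompt."""
    cmd = row["command"]
    if cmd == "new":
        text = f"The user wants to create a new project with the '{row['template']}' template."
        if row["include_path"]:
            text += " They want it in a custom directory."
    elif cmd == "dev":
        text = f"The user wants to start the dev server on port {row['port']}."
        if row["no_browser"]:
            text += " They don't want to auto-open a browser."
    elif cmd == "up":
        text = f"The user wants to launch the server container on port {row['port']}."
        if row["watch"]:
            text += " They want to watch for code changes."
    elif cmd == "build":
        text = f"The user wants to build a Docker image with tag '{row['image_tag']}'."
    elif cmd == "dockerfile":
        text = f"The user wants to generate a Dockerfile at '{row['dockerfile_path']}'."
    else:
        text = ""  # the Jinja chain has no else branch, so nothing is emitted
    return text

# A hypothetical sampled row: every sampler column has a value,
# but only the ones for the chosen command reach the prompt.
row = {
    "command": "dev", "template": "basic", "include_path": False,
    "port": 4321, "no_browser": True, "watch": False,
    "image_tag": "latest", "dockerfile_path": "Dockerfile",
}
print(describe_request(row))
```

Note that all eight sampler columns are populated for every record. The conditionals decide which of them are mentioned, so a `build` row still carries an unused `template` and `port`.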


In [None]:
from pydantic import BaseModel, Field
from typing import Optional

class CLIToolCall(BaseModel):
    command: str = Field(..., description="CLI command: new, dev, up, build, or dockerfile")
    template: Optional[str] = Field(None, description="Template name for 'new' command")
    path: Optional[str] = Field(None, description="Project path for 'new' command")
    port: Optional[int] = Field(None, description="Port for 'dev' or 'up' command")
    no_browser: Optional[bool] = Field(None, description="Skip browser for 'dev' command")
    watch: Optional[bool] = Field(None, description="Watch mode for 'up' command")
    tag: Optional[str] = Field(None, description="Image tag for 'build' command")
    output_path: Optional[str] = Field(None, description="Output path for 'dockerfile' command")

# Model config
model_configs = [
    ModelConfig(
        alias="command-generator",
        provider="nvidiabuild",
        model="nvidia/nvidia-nemotron-nano-9b-v2",
        inference_parameters=InferenceParameters(
            temperature=0.5,
            top_p=0.95,
            max_tokens=1000
        )
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

# Sampler columns
config_builder.add_column(
    SamplerColumnConfig(
        name="command",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["new", "dev", "up", "build", "dockerfile"])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="template",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["basic", "react-agent", "memory-agent", "retrieval-agent", "data-enrichment"])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="include_path",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=[True, False], weights=[1, 3])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="port",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=3000, high=9000),
        convert_to="int"
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="no_browser",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=[True, False], weights=[1, 4])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="watch",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=[True, False], weights=[1, 2])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="image_tag",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["myapp:latest", "latest", "langgraph-app:v1"])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="dockerfile_path",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["Dockerfile", "Dockerfile.custom", "docker/Dockerfile"])
    )
)

# Input column with Jinja conditionals
config_builder.add_column(
    LLMTextColumnConfig(
        name="input",
        model_alias="command-generator",
        prompt=(
            "Generate a natural user request for the LangGraph CLI.\n\n"
            "Command: {{ command }}\n\n"
            "{% if command == 'new' %}"
            "The user wants to create a new project with the '{{ template }}' template."
            "{% if include_path %} They want it in a custom directory.{% endif %}"
            "{% elif command == 'dev' %}"
            "The user wants to start the dev server on port {{ port }}."
            "{% if no_browser %} They don't want to auto-open a browser.{% endif %}"
            "{% elif command == 'up' %}"
            "The user wants to launch the server container on port {{ port }}."
            "{% if watch %} They want to watch for code changes.{% endif %}"
            "{% elif command == 'build' %}"
            "The user wants to build a Docker image with tag '{{ image_tag }}'."
            "{% elif command == 'dockerfile' %}"
            "The user wants to generate a Dockerfile at '{{ dockerfile_path }}'."
            "{% endif %}\n\n"
            "Write one natural, conversational sentence."
        ),
        system_prompt="Output only a single sentence. No explanation.",
    )
)

# Output column with structured output
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="output",
        prompt=(
            "Convert this user request to a LangGraph CLI tool-call.\n\n"
            "Command type: {{ command }}\n"
            "User request: {{ input }}\n\n"
            "{% if command == 'new' %}"
            "Set: template, and path if specified."
            "{% elif command == 'dev' %}"
            "Set: port, and no_browser if specified."
            "{% elif command == 'up' %}"
            "Set: port, and watch if specified."
            "{% elif command == 'build' %}"
            "Set: tag."
            "{% elif command == 'dockerfile' %}"
            "Set: output_path."
            "{% endif %}\n\n"
            "Only set fields relevant to the command. Leave others as null."
        ),
        # System prompt describes the structured (JSON) task, not the sentence-writing task
        system_prompt="Output ONLY a JSON object that matches the schema. No preamble, no explanation. /no_think",
        output_format=CLIToolCall,
        model_alias="command-generator",
    )
)

## Step 3: Preview Data Generation

Before generating a large dataset, we use the [preview](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/manage-jobs/preview-data-generation.html) feature to validate our configuration and inspect sample outputs. This follows the recommended [Data Generation Workflow](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/data-generation-workflow.html):

1. **Design phase** → Define columns and prompts
2. **Preview** → Generate small batches to validate quality
3. **Iterate** → Refine prompts and constraints
4. **Batch generation** → Create full dataset

The preview runs the full pipeline on a small sample (5 records here), returning:
- A Pandas DataFrame with all generated columns
- Token usage statistics
- Validation results

This lets us verify that:
- Natural language inputs sound realistic
- Structured outputs conform to our Pydantic schema
- The command-to-input-to-output pipeline produces coherent training pairs
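
The last two checks can be partly automated. The sketch below flags records whose structured output disagrees with the sampled command or sets fields that the `output` prompt did not ask for. It assumes each `output` cell is a plain dict, as the printed preview suggests. `RELEVANT` and `check_pair` are hypothetical names, not part of the SDK.

```python
# Fields each command is expected to set, taken from the `output` prompt.
RELEVANT = {
    "new": {"template", "path"},
    "dev": {"port", "no_browser"},
    "up": {"port", "watch"},
    "build": {"tag"},
    "dockerfile": {"output_path"},
}

def check_pair(sampled_command: str, output: dict) -> list:
    """Return a list of human-readable problems; an empty list means the pair looks consistent."""
    problems = []
    if output.get("command") != sampled_command:
        problems.append(
            f"command mismatch: sampled {sampled_command!r}, got {output.get('command')!r}"
        )
    allowed = RELEVANT.get(sampled_command, set())
    for field, value in output.items():
        if field != "command" and value is not None and field not in allowed:
            problems.append(f"unexpected field set: {field}={value!r}")
    return problems

print(check_pair("build", {"command": "build", "tag": "myapp:latest", "port": None}))
print(check_pair("build", {"command": "dev", "port": 8080}))
```

With the preview DataFrame, something like `df.apply(lambda r: check_pair(r["command"], r["output"]), axis=1)` would run the check per row, under the same assumption about the cell type.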


In [83]:
# Preview a small sample (5 records) before generating the full dataset
preview = client.preview(config_builder, num_records=5)  # run the full pipeline on 5 records
df = preview.dataset  # Pandas DataFrame of results
print(df[['input','output']].head(5))  # display sample pairs

[12:31:08] [INFO] ✅ Validation passed
[12:31:08] [INFO] 🚀 Starting preview generation
[12:31:08] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[12:31:08] [INFO] 🩺 Running health checks for models...
[12:31:08] [INFO]   |-- 👀 Checking 'nvidia/nvidia-nemotron-nano-9b-v2'...
[12:31:08] [INFO]   |-- ✅ Passed!
[12:31:08] [INFO] ⏳ Processing batch 1 of 1
[12:31:08] [INFO] 🎲 Preparing samplers to generate 5 records across 8 columns
[12:31:08] [INFO] 📝 Preparing llm-text column generation
[12:31:08] [INFO]   |-- column name: 'input'
[12:31:08] [INFO]   |-- model config:
{
    "alias": "command-generator",
    "model": "nvidia/nvidia-nemotron-nano-9b-v2",
    "inference_parameters": {
        "temperature": 0.5,
        "top_p": 0.95,
        "max_tokens": 1000,
        "max_parallel_requests": 4,
        "timeout": null,
        "extra_body": null
    },
    "provider": "nvidiabuild"
}
[12:31:13] [INFO] 🐙 Processing llm-text column 'input' with 4 co

                                               input  \
0  Can you help me build a Docker image for my La...   
1  Sure, you can ask, "Can you help me generate a...   
2  Could you help me start the server container o...   
3  Could you generate a Dockerfile for me and sav...   
4  Can you help me build a Docker image for my La...   

                                              output  
0  {'command': 'build', 'output_path': None, 'por...  
1  {'command': 'dockerfile', 'output_path': 'Dock...  
2  {'command': 'up', 'output_path': None, 'port':...  
3  {'command': 'dockerfile', 'output_path': 'dock...  
4  {'command': 'build', 'output_path': None, 'por...  
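
Row 1 of the preview above starts with `Sure, you can ask, "Can you help me generate a...`, meaning the model wrapped the request in commentary instead of writing it directly. During the iterate step, a simple heuristic can surface such rows before you tighten the prompt. The pattern list below is an assumption chosen for illustration, and `looks_like_meta` is a hypothetical helper.

```python
import re

# Assistant-style openers that signal commentary rather than a user request.
META_PREFIX = re.compile(
    r"^\s*(sure|certainly|of course|here('s| is)|you can (ask|say))\b",
    re.IGNORECASE,
)

def looks_like_meta(text: str) -> bool:
    """Heuristic: True if the text opens with assistant-style framing."""
    return bool(META_PREFIX.match(text))

samples = [
    'Sure, you can ask, "Can you help me generate a',   # visible prefix of preview row 1
    "Could you help me start the server container o",  # visible prefix of preview row 2
]
for s in samples:
    print(looks_like_meta(s), "|", s)
```

A filter like this only catches known openers. Treat it as a review aid rather than a quality guarantee.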


## Step 4: Create Batch Generation Job

Once we're satisfied with the preview results, we [create a batch generation job](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/manage-jobs/create-data-generation-job.html) to produce the full dataset. 

The `client.create()` method:
- Validates the configuration
- Creates an asynchronous job on the Data Designer server
- Returns a job handle with a unique `job_id`

For 1,000 records, the job processes data in batches with parallel LLM requests for efficiency. The logs show:
- Configuration validation status
- Job ID for tracking and retrieval
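
As a mental model for "batches with parallel LLM requests", the sketch below splits 1,000 records into batches and processes each batch with a small worker pool. The batch size of 500 and the 4 workers are read off the job logs later in this notebook (`batch 1 of 2`, `500 records`, `max_parallel_requests: 4`). Everything else, including `fake_llm_call`, is a hypothetical stand-in and not the service's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_sizes(num_records: int, batch_size: int) -> list:
    """Split a record count into full batches plus an optional remainder."""
    full, rest = divmod(num_records, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def fake_llm_call(i: int) -> str:
    # Stand-in for one model request; a real call would hit the inference endpoint.
    return f"record-{i}"

def run_batch(start: int, size: int, max_parallel_requests: int = 4) -> list:
    """Process one batch with a bounded number of concurrent workers."""
    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        # pool.map preserves input order even though calls run concurrently
        return list(pool.map(fake_llm_call, range(start, start + size)))

results = []
offset = 0
for size in batch_sizes(1000, 500):
    results.extend(run_batch(offset, size))
    offset += size

print(batch_sizes(1000, 500), len(results))
```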


In [85]:
job_result = client.create(
    config_builder,
    num_records=1000
)

[12:34:37] [INFO] 🎨 Creating Data Designer generation job
[12:34:37] [INFO] ✅ Validation passed
[12:34:37] [INFO]   |-- job_id: job-azvfpxaxx6m6v4vkvjxawf


### Wait for Job Completion

The `wait_until_done()` method polls the job status until generation is complete. You can also use the [job management API](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/manage-jobs/index.html) to:
- [Get job status](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/manage-jobs/get-job-status.html)
- [View job logs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/manage-jobs/get-job-logs.html)
- [Retrieve results](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/generate-data/manage-jobs/get-job-results.html)
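
Polling in general looks like the loop below. This is a generic sketch of the pattern, not the SDK's `wait_until_done()` implementation. The status strings, the `get_status` callable, and the timeout handling are all assumptions made for illustration.

```python
import time

def wait_until(get_status, poll_interval=1.0, timeout=600.0, sleep=time.sleep):
    """Call `get_status()` repeatedly until it returns a terminal state or time runs out."""
    terminal = {"completed", "error", "cancelled"}  # hypothetical status names
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} s")
        sleep(poll_interval)

# Demo with a scripted status sequence instead of a real job.
statuses = iter(["pending", "running", "running", "completed"])
print(wait_until(lambda: next(statuses), poll_interval=0.0))
```

A fixed interval keeps the example short. For long jobs, a growing interval would reduce the number of status requests.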


In [None]:
job_result.wait_until_done()

[12:37:39] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[12:37:39] [INFO] 🩺 Running health checks for models...
[12:37:39] [INFO]   |-- 👀 Checking 'nvidia/nvidia-nemotron-nano-9b-v2'...
[12:37:41] [INFO]   |-- ✅ Passed!
[12:37:41] [INFO] ⏳ Processing batch 1 of 2
[12:37:41] [INFO] 🎲 Preparing samplers to generate 500 records across 8 columns
[12:37:41] [INFO] 📝 Preparing llm-text column generation
[12:37:41] [INFO]   |-- column name: 'input'
[12:37:41] [INFO]   |-- model config:
{
    "alias": "command-generator",
    "model": "nvidia/nvidia-nemotron-nano-9b-v2",
    "inference_parameters": {
        "temperature": 0.5,
        "top_p": 0.95,
        "max_tokens": 1000,
        "max_parallel_requests": 4,
        "timeout": null,
        "extra_body": null
    },
    "provider": "nvidiabuild"
}
[12:37:41] [INFO] 🐙 Processing llm-text column 'input' with 4 concurrent workers
