# Building Multi-Step Tool-Calling Datasets with Data Designer

Generate synthetic training data for agentic reinforcement learning with NVIDIA **Data Designer** and **NeMo Gym** to improve multi-step tool-calling ability.

## Prerequisites

- **NVIDIA API Key** from [build.nvidia.com](https://build.nvidia.com) to access a remote LLM for generation. Alternatively, you may choose to use your own endpoint or deployment.
- **Python 3.11+**
- **Tool definition files** in the `tools/` directory (included in this repo)
- Packages: `data-designer`, `pydantic`, `pandas`

## Objectives

By the end of this notebook, you will:

- Load known **tool schemas** as the seed for generating agent queries and simulated trajectories
- Use **Data Designer** to generate realistic multi-step user queries
- Simulate **agent trajectories** (step-by-step tool-call solutions)
- Apply **dual-level LLM judge filtering** to ensure data quality
- Export training data in **NeMo Gym format** for rollout collection and RLVR training

> **Context Note:** The primary goal of this notebook is user query generation. The trajectory generation step serves as a sanity check that each generated query leads to a feasible path. In production RL training, rollout (oracle trajectory) traces are generated from the environment itself. You can find more information in the [NeMo Gym Rollout Collection](https://docs.nvidia.com/nemo/gym/latest/get-started/rollout-collection.html) documentation.

## Architecture Overview

```
 ┌───────────────────────────────────────────────────────────────┐
 │                  DATA GENERATION PIPELINE                     │
 ├───────────────────────────────────────────────────────────────┤
 │                                                               │
 │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    │
 │   │ Tool Schemas │───▶│  User Query  │───▶│  Trajectory  │    │
 │   │    (Seed)    │    │ Generation   │    │ Simulation   │    │
 │   └──────────────┘    └──────────────┘    └──────────────┘    │
 │                                                  │            │
 │                                                  ▼            │
 │                                           ┌──────────────┐    │
 │                                           │  LLM Judge   │    │
 │                                           │  (Quality)   │    │
 │                                           └──────────────┘    │
 │                                                  │            │
 │                                                  ▼            │
 │                                           ┌──────────────┐    │
 │                                           │  NeMo Gym    │    │
 │                                           │   Format     │    │
 │                                           └──────────────┘    │
 │                                                               │
 └───────────────────────────────────────────────────────────────┘
```

## Step 1: Install and Import Dependencies

In [1]:
%pip install -q data-designer pydantic pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import json
import random
from typing import List, Optional
from pydantic import BaseModel, Field
import pandas as pd

# Data Designer imports
from data_designer.essentials import (
    ChatCompletionInferenceParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMStructuredColumnConfig,
    LLMTextColumnConfig,
    LocalFileSeedSource,
    ModelConfig,
    SamplingStrategy,
    ModelProvider
)

## Context: What is the Workplace Assistant Environment?

[**Workplace Assistant**](https://docs.nvidia.com/nemo/gym/latest/tutorials/nemo-rl-grpo/about-workplace-assistant.html#) is a multi-step tool-using benchmark environment used in **NeMo Gym** for RL training. A model gets a natural language business request and must call tools in the right order with valid arguments (up to 6 steps).

At a high level:
- The model reads a user request (for example, scheduling meetings or updating CRM records)
- The model decides which tools to call and with what parameters
- The environment verifies correctness using **state matching** (final database state), not exact step matching
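
The difference between state matching and exact step matching can be illustrated with a toy example. This is a minimal sketch, not NeMo Gym's verifier: the two tools (`update_task`, `assign_task`) and the in-memory "database" are hypothetical and exist only to show that two different call orders can reach the same final state.

```python
import copy


def apply_call(state: dict, name: str, args: dict) -> None:
    """Apply one hypothetical tool call to an in-memory 'database' in place."""
    task = state["tasks"][args["task_id"]]
    if name == "update_task":        # hypothetical tool, for illustration only
        task["list_name"] = args["list_name"]
    elif name == "assign_task":      # hypothetical tool, for illustration only
        task["assignee"] = args["assignee"]
    else:
        raise ValueError(f"unknown tool: {name}")


def final_state(initial: dict, calls: list) -> dict:
    """Replay a sequence of (name, args) calls on a copy of the initial state."""
    state = copy.deepcopy(initial)
    for name, args in calls:
        apply_call(state, name, args)
    return state


initial = {"tasks": {"00000001": {"list_name": "Backlog", "assignee": "sarah"}}}

reference = [
    ("update_task", {"task_id": "00000001", "list_name": "In Progress"}),
    ("assign_task", {"task_id": "00000001", "assignee": "mike"}),
]
candidate = list(reversed(reference))  # same calls, different order

steps_match = candidate == reference
states_match = final_state(initial, candidate) == final_state(initial, reference)
print(f"exact step match: {steps_match}, state match: {states_match}")
```

Under exact step matching the candidate would be rejected. Under state matching it is accepted, because both sequences leave the database in the same final state.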

In this notebook, we focus on **data generation**: starting from known tool schemas, generating realistic user requests, and simulating feasible trajectories to produce NeMo Gym-compatible training data.

> **Note:** The official NeMo Gym [Workplace Assistant environment](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/workplace_assistant) is the training target. This notebook is an example synthetic data preparation stage that feeds that workflow.

## Step 2: Load Tool Definitions

This notebook starts from **established tool schemas** and uses them as context for data generation. These schemas describe the (possibly domain-specific) tools on which you want to improve model performance.

We use 27 tools across 6 tool groups:
- **Company Directory**: Look up employee email addresses
- **Email**: Send, search, reply, forward, delete emails
- **Calendar**: Create, search, update, delete events
- **Analytics**: Query website visitor data and create plots
- **Project Management**: Manage tasks across Kanban boards
- **CRM**: Manage customer records and sales pipeline

These tools are designed to require **multi-step reasoning**. For example:
- "Email John about the meeting" requires first looking up John's email, then sending
- "Reassign all of Sarah's leads to Mike" requires looking up emails, searching customers, then updating each one

> **Why this matters:** The schemas define valid arguments and values (for example, allowed board/list/status values). We use these constraints to generate realistic user queries and schema-compliant simulated trajectories.
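
As a rough illustration of how a schema constrains a tool call, the sketch below checks a candidate argument dictionary against a tool definition. It assumes the tool layout used in this notebook's JSON files (`parameters.properties` and `parameters.required`). The helper `check_arguments` is hypothetical and covers only argument names, required fields, and string types. It does not check restricted values, which this notebook leaves to the LLM judge.

```python
def check_arguments(tool: dict, arguments: dict) -> list:
    """Return a list of problems; an empty list means the call fits the schema."""
    params = tool["parameters"]
    properties = params.get("properties", {})
    problems = []
    for key, value in arguments.items():
        if key not in properties:
            problems.append(f"unknown argument: {key}")
        elif properties[key].get("type") == "string" and not isinstance(value, str):
            problems.append(f"argument {key} should be a string")
    for key in params.get("required", []):
        if key not in arguments:
            problems.append(f"missing required argument: {key}")
    return problems


# Tool definition in the same shape as the entries in tools/*.json
find_email_tool = {
    "type": "function",
    "name": "company_directory_find_email_address",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string",
                "description": "Name or partial name to search for in email addresses",
            }
        },
        "required": [],
        "additionalProperties": False,
    },
}

print(check_arguments(find_email_tool, {"name": "Raj"}))
print(check_arguments(find_email_tool, {"full_name": "Raj"}))
```

A deterministic check like this could complement the LLM judge for purely structural errors.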

In [3]:
# Load tool definitions from separate JSON files (one per database)
import os

TOOLS_DIR = 'tools'

# Load environment config
with open(os.path.join(TOOLS_DIR, 'environment.json'), 'r') as f:
    env_config = json.load(f)

SYSTEM_PROMPT = env_config['system_prompt']
MULTI_STEP_PATTERNS = env_config['common_multi_step_patterns']

# Load tools from each database file
DATABASE_FILES = [
    'company_directory.json',
    'email.json', 
    'calendar.json',
    'analytics.json',
    'project_management.json',
    'customer_relationship_manager.json'
]

TOOLS = []
DATABASES = {}
TOOL_CATEGORIES = {}

for db_file in DATABASE_FILES:
    with open(os.path.join(TOOLS_DIR, db_file), 'r') as f:
        db_config = json.load(f)
        
    db_name = db_config['database']
    DATABASES[db_name] = {
        'description': db_config['description'],
        'data_schema': db_config['data_schema']
    }
    
    # Add tools and track category
    db_tools = db_config['tools']
    TOOLS.extend(db_tools)
    TOOL_CATEGORIES[db_name] = [t['name'] for t in db_tools]

print(f"Loaded {len(TOOLS)} tools across {len(DATABASES)} databases")
print(f"\nDatabases:")
for db_name, db_info in DATABASES.items():
    tool_count = len(TOOL_CATEGORIES[db_name])
    print(f"  - {db_name}: {tool_count} tools")
    print(f"    {db_info['description']}")

Loaded 27 tools across 6 databases

Databases:
  - company_directory: 1 tools
    Employee directory for looking up email addresses by name.
  - email: 6 tools
    Email inbox and outbox for sending, receiving, and managing emails.
  - calendar: 5 tools
    Calendar for managing meetings and events.
  - analytics: 6 tools
    Website analytics data for tracking visitor behavior and engagement.
  - project_management: 5 tools
    Project management board for tracking tasks across teams.
  - customer_relationship_manager: 4 tools
    CRM for managing customer records and sales pipeline.


Display all loaded tools grouped by database. This summary shows each tool's name and description.

In [4]:
# Helper function to format tools for prompts
def format_tools_for_prompt(tools: List[dict], include_schemas: bool = False) -> str:
    """Format tool definitions into a readable string for LLM prompts."""
    lines = []
    for tool in tools:
        lines.append(f"- **{tool['name']}**: {tool['description']}")
        if include_schemas:
            params = tool['parameters']['properties']
            if params:
                lines.append(f"  Parameters: {list(params.keys())}")
    return "\n".join(lines)

# Display tool summary by category
for category, tool_names in TOOL_CATEGORIES.items():
    print(f"\n### {category.upper()} ({len(tool_names)} tools)")
    category_tools = [t for t in TOOLS if t['name'] in tool_names]
    print(format_tools_for_prompt(category_tools))


### COMPANY_DIRECTORY (1 tools)
- **company_directory_find_email_address**: Finds all email addresses containing the given name (case-insensitive search).

### EMAIL (6 tools)
- **email_get_email_information_by_id**: Retrieves specific details of an email by its ID.
- **email_search_emails**: Searches for emails matching the given query across subject, body, or sender fields. The function matches an email if all words in the query appear in any of these fields.
- **email_send_email**: Sends an email to the specified recipient.
- **email_delete_email**: Deletes an email by its ID.
- **email_forward_email**: Forwards an email to the specified recipient.
- **email_reply_email**: Replies to an email by its ID.

### CALENDAR (5 tools)
- **calendar_get_event_information_by_id**: Returns the event for a given ID.
- **calendar_search_events**: Returns the events for a given query with pagination support.
- **calendar_create_event**: Creates a new event.
- **calendar_delete_event**: Deletes an

## Step 3: Define Output Schemas

**Data Designer** uses **Pydantic** models to define structured output formats, ensuring the LLM generates data in a consistent, parseable format.

We define five schemas in three groups:
1. **ToolCall** / **AgentStep** / **AgentTrajectory**: Represent a multi-step tool-calling solution
2. **UserQueryJudgeScores**: Quality scores for generated user queries
3. **TrajectoryJudgeScores**: Quality scores for generated trajectories

In [5]:
class ToolCall(BaseModel):
    """A single tool invocation."""
    name: str = Field(..., description="The name of the tool to call (e.g., 'email_send_email')")
    arguments: str = Field(..., description="JSON string of the tool arguments")


class AgentStep(BaseModel):
    """A single step in the agent's reasoning trajectory."""
    step_number: int = Field(..., description="The step number (1-indexed)")
    thought: str = Field(
        ..., 
        description="The agent's reasoning about what to do next and why. Should explain the purpose of the tool call."
    )
    tool_call: ToolCall = Field(..., description="The tool to call in this step")
    expected_result: str = Field(
        ..., 
        description="What information or state change we expect from this tool call"
    )


class AgentTrajectory(BaseModel):
    """Complete trajectory for solving a multi-step task."""
    reasoning_trace: List[AgentStep] = Field(
        ..., 
        description="The sequence of steps to solve the task. Should be 1-6 steps."
    )
    final_answer: str = Field(
        ..., 
        description="A brief confirmation of what was accomplished"
    )


class UserQueryJudgeScores(BaseModel):
    """Quality scores for the generated user query."""
    feasibility: int = Field(
        ..., ge=1, le=5, 
        description="Is the request achievable with the available tools? (1=impossible, 5=fully achievable)"
    )
    schema_compliance: int = Field(
        ..., ge=1, le=5, 
        description="Does the request use valid values as defined in tool schemas (e.g., valid board names, list names, statuses)? (1=uses invalid values, 5=all values valid)"
    )
    naturalness: int = Field(
        ..., ge=1, le=5, 
        description="Does the request sound like a natural user query? (1=robotic/artificial, 5=very natural)"
    )
    is_valid: bool = Field(
        ..., 
        description="True if the query is valid and should be kept, False if it should be discarded"
    )
    issues: str = Field(
        ..., 
        description="List any issues found (invalid enum values, impossible requests, etc.) or 'None' if valid"
    )


class TrajectoryJudgeScores(BaseModel):
    """Quality scores for the generated trajectory."""
    tool_validity: int = Field(
        ..., ge=1, le=5, 
        description="Do all tool names exactly match the available tools? (1=invalid tool names, 5=all valid)"
    )
    argument_validity: int = Field(
        ..., ge=1, le=5, 
        description="Do all arguments use valid values as specified in tool descriptions? (1=invalid values, 5=all valid)"
    )
    completeness: int = Field(
        ..., ge=1, le=5, 
        description="Does the trajectory fully solve the user request? (1=incomplete, 5=fully complete)"
    )
    efficiency: int = Field(
        ..., ge=1, le=5, 
        description="Is the trajectory optimal without unnecessary steps? (1=very inefficient, 5=optimal)"
    )
    is_valid: bool = Field(
        ..., 
        description="True if the trajectory is valid and executable, False if it has errors"
    )
    issues: str = Field(
        ..., 
        description="List any issues found (invalid enum values, wrong tool names, missing steps, etc.) or 'None' if valid"
    )
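
Because `ToolCall.arguments` is a JSON string rather than a nested object, some structural properties can be checked without an LLM. The following is a minimal sketch under that assumption: it takes a plain dictionary shaped like `AgentTrajectory` and reports basic problems. The helper `trajectory_problems` and the example trajectories are hypothetical and not part of Data Designer.

```python
import json


def trajectory_problems(trajectory: dict, known_tools: set) -> list:
    """Return structural problems found in an AgentTrajectory-shaped dict."""
    problems = []
    steps = trajectory.get("reasoning_trace", [])
    if not 1 <= len(steps) <= 6:
        problems.append(f"expected 1-6 steps, got {len(steps)}")
    for position, step in enumerate(steps, start=1):
        if step["step_number"] != position:
            problems.append(f"step {position} is numbered {step['step_number']}")
        call = step["tool_call"]
        if call["name"] not in known_tools:
            problems.append(f"step {position} calls unknown tool {call['name']}")
        try:
            parsed = json.loads(call["arguments"])
        except json.JSONDecodeError:
            problems.append(f"step {position} arguments are not valid JSON")
        else:
            if not isinstance(parsed, dict):
                problems.append(f"step {position} arguments are not a JSON object")
    return problems


known = {"company_directory_find_email_address", "email_send_email"}

good = {
    "reasoning_trace": [
        {
            "step_number": 1,
            "thought": "Look up Raj's email address first.",
            "tool_call": {
                "name": "company_directory_find_email_address",
                "arguments": "{\"name\": \"Raj\"}",
            },
            "expected_result": "Raj's email address",
        }
    ],
    "final_answer": "Found Raj's email address.",
}

# Same trajectory with a wrong tool name and malformed JSON arguments
bad_step = dict(good["reasoning_trace"][0])
bad_step["tool_call"] = {"name": "email_send", "arguments": "{name: Raj}"}
bad = {"reasoning_trace": [bad_step], "final_answer": "Sent."}

print(trajectory_problems(good, known))
print(trajectory_problems(bad, known))
```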

## Step 4: Define Generation Prompts

Prompts are the heart of synthetic data generation. We define four prompts as **Jinja2 templates** (with `{{ variable }}` placeholders that Data Designer fills from seed columns and previously generated columns):

1. **User Query Generation**: Create realistic workplace requests
2. **Trajectory Simulation**: Generate the step-by-step tool-call solution
3. **User Query Judge**: Evaluate query feasibility and schema compliance
4. **Trajectory Judge**: Evaluate tool-call correctness and completeness
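
To make the placeholder mechanism concrete, the sketch below fills `{{ variable }}` placeholders from one seed row. It is a deliberately simplified stand-in written with the standard library, not Jinja2 and not Data Designer's rendering code. Real Jinja2 also supports control blocks such as the `{% raw %}` section used in the trajectory prompt. The helper `fill_template` is hypothetical.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_template(template: str, row: dict) -> str:
    """Replace each {{ name }} with row[name]; raises KeyError if a column is missing."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


template = (
    "**Selected Tool Category:** {{ category }}\n"
    "**Multi-Step Pattern to Use:** {{ pattern }}"
)
row = {
    "category": "email",
    "pattern": "lookup_then_send_email: Look up a person's email address, then send them an email",
}

filled = fill_template(template, row)
print(filled)
```

A missing column raises `KeyError`, which is a useful failure mode here: a prompt that silently keeps an unfilled placeholder would produce low-quality generations.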

### Key Principles
- **Specificity**: Tell the LLM exactly what format you want
- **Examples**: Show, don't tell; include concrete examples by complexity level
- **Constraints**: Define what NOT to do (avoid trivial tasks, don't skip steps)

In [6]:
# Prompt 1: Generate a realistic user query that may require one or more tool calls
USER_QUERY_GENERATION_PROMPT = """
You are creating training data for a workplace assistant AI agent.

**Your Task:** Generate a realistic user request that requires the agent to use one or more tools to complete.

**Available Tools (with full schemas):**
{{ tools_json }}

**Selected Tool Category:** {{ category }}

**Multi-Step Pattern to Use:** {{ pattern }}

**CRITICAL - Valid Values:**
Many tool parameters have RESTRICTED VALUES specified in their descriptions. You MUST only reference values that exist in the tool schemas. Pay close attention to phrases like "One of:" in parameter descriptions.

Common restrictions to follow:
- `list_name`: Only use 'Backlog', 'In Progress', 'In Review', or 'Completed' (NOT 'Prospects', 'Todo', 'Pipeline', etc.)
- `board`: Only use 'Back end', 'Front end', or 'Design' (NOT 'Sales', 'Marketing', 'Engineering', etc.)
- `status`: Only use 'Qualified', 'Won', 'Lost', 'Lead', or 'Proposal' (NOT 'Active', 'Prospect', 'Closed', etc.)
- `product_interest`: Only use 'Software', 'Hardware', 'Services', 'Consulting', or 'Training'

**Guidelines:**
1. The request should sound natural - like something a real employee would ask
2. It should require 1-6 tool calls to complete
3. Include specific details that make the task concrete (names, dates, subjects)
4. Don't mention tool names or technical details - speak like a normal user
5. The task MUST be achievable with the available tools using ONLY valid parameter values
6. When referencing boards, lists, statuses, etc., use EXACTLY the values allowed in the tool schemas

**Examples by Complexity:**

*Simple (1 step):*
- "Reply to Carlos's last email about the prototype with 'Thanks, I'll review it tomorrow'"
- "Change the name of my 3pm meeting to 'Risk Management Forum'"
- "How many website visitors did we have last week?"

*Medium (2-3 steps):*
- "Send an email to John about the quarterly review meeting tomorrow"
- "Schedule a 30-minute sync with Lisa tomorrow at 2pm"
- "Get the total visits and engaged users for November"

*Complex (4-6 steps):*
- "Raj is taking over all of Akira's leads that are interested in software. Can you reassign them in the CRM?"
- "Forward the last email from marketing about the Q4 report to everyone on the design team"
- "Move all of Sarah's overdue tasks on the Back end board to the Backlog"

**Output:** Return ONLY the user request as a single string. No quotes, no explanation.
"""

print("User Query Generation Prompt loaded")

User Query Generation Prompt loaded


In [7]:
# Prompt 2: Simulate the agent's trajectory for solving the task
TRAJECTORY_SIMULATION_PROMPT = """
You are simulating an expert workplace assistant agent solving a task step-by-step.

**User Request:**
{{ user_query }}

**System Context:**
{{ system_prompt }}

**Available Tools:**
{{ tools_json }}

**Your Task:** Generate a step-by-step trajectory showing how the agent would solve this request.

**Guidelines:**
1. **Think Step-by-Step**: Each step should have a clear thought explaining WHY we're calling this tool
2. **Use Real Tool Names**: The tool_call.name must exactly match one of the available tools
3. **Valid JSON Arguments**: The tool_call.arguments must be valid JSON matching the tool's parameter schema
4. **Realistic IDs**: When referencing IDs discovered in previous steps, use placeholder format like "00000001"
5. **Complete the Task**: The trajectory must fully solve the user's request
6. **1-6 Steps**: Use the minimum number of steps needed. Simple tasks may need only 1 step.

**Common Patterns:**
- Look up a person's email before sending them a message
- Search for records before updating/deleting them
- Get information from one database to use in another
- Some tasks can be completed in a single step (e.g., reply to an email, update an event)

**Example Step:**
{% raw %}
```json
{
  "step_number": 1,
  "thought": "The user wants to email Raj, but I need his email address first. I'll look it up in the company directory.",
  "tool_call": {
    "name": "company_directory_find_email_address",
    "arguments": "{\"name\": \"Raj\"}"
  },
  "expected_result": "Raj's email address (likely raj.patel@atlas.com)"
}
```
{% endraw %}

Output the complete AgentTrajectory with all steps needed to solve the task.
"""

print("Trajectory Simulation Prompt loaded")

Trajectory Simulation Prompt loaded


In [8]:
# Prompt 3a: Judge the quality of the generated USER QUERY
USER_QUERY_JUDGE_PROMPT = """
You are a quality assurance judge evaluating a synthetically generated user query for training an AI workplace assistant.

**Generated User Query:**
{{ user_query }}

**Available Tools (with full schemas):**
{{ tools_json }}

**Your Task:** Evaluate whether this user query is valid and achievable with the available tools.

**CRITICAL - Check for Schema Compliance:**
Many tools have RESTRICTED VALUES for certain fields. The user query must only reference values that are valid according to the tool schemas. For example:
- If a tool says `list_name` must be one of 'Backlog', 'In Progress', 'In Review', 'Completed' - the query cannot ask for a "Prospects" list
- If a tool says `board` must be one of 'Back end', 'Front end', 'Design' - the query cannot ask for a "Sales" board  
- If a tool says `status` must be one of 'Qualified', 'Won', 'Lost', 'Lead', 'Proposal' - the query cannot use other statuses

**Evaluation Criteria:**

1. **Feasibility (1-5)**: Can this request be fulfilled using the available tools?
   - Score 1 if the request requires tools/capabilities that don't exist
   - Score 5 if the request is fully achievable with available tools

2. **Schema Compliance (1-5)**: Does the request use valid values?
   - Score 1 if the query references invalid enum values (wrong board names, list names, statuses, etc.)
   - Score 3 if the query is ambiguous but could map to valid values
   - Score 5 if all referenced values exactly match valid options in tool schemas

3. **Naturalness (1-5)**: Does this sound like a real user request?
   - Score 1 if robotic or artificial sounding
   - Score 5 if very natural and realistic

**is_valid:** Set to False if feasibility < 3 OR schema_compliance < 3. These queries should be discarded.

**issues:** List specific problems found. Examples:
- "References 'Sales' board but valid boards are: 'Back end', 'Front end', 'Design'"
- "References 'Prospects' list but valid lists are: 'Backlog', 'In Progress', 'In Review', 'Completed'"
- "None" if no issues found

**Output:** Return UserQueryJudgeScores with all fields.
"""

# Prompt 3b: Judge the quality of the generated TRAJECTORY
TRAJECTORY_JUDGE_PROMPT = """
You are a quality assurance judge evaluating a generated trajectory (sequence of tool calls) for training an AI workplace assistant.

**User Request:**
{{ user_query }}

**Generated Trajectory:**
{{ trajectory }}

**Available Tools (with full schemas):**
{{ tools_json }}

**Your Task:** Evaluate whether this trajectory correctly solves the user request using valid tool calls.

**CRITICAL - Check for Argument Validity:**
Tool arguments must use EXACTLY the values allowed by the tool schemas. For example:
- `list_name` must be one of: 'Backlog', 'In Progress', 'In Review', 'Completed' (NOT 'Prospects', 'Todo', etc.)
- `board` must be one of: 'Back end', 'Front end', 'Design' (NOT 'Sales', 'Marketing', etc.)
- `status` must be one of: 'Qualified', 'Won', 'Lost', 'Lead', 'Proposal' (NOT 'Active', 'Prospect', etc.)
- `product_interest` must be one of: 'Software', 'Hardware', 'Services', 'Consulting', 'Training'

**Evaluation Criteria:**

1. **Tool Validity (1-5)**: Are all tool names correct?
   - Score 1 if any tool name doesn't match available tools
   - Score 5 if all tool names exactly match

2. **Argument Validity (1-5)**: Are all arguments schema-compliant?
   - Score 1 if any argument uses invalid enum values or wrong types
   - Score 3 if arguments are mostly valid but some are ambiguous
   - Score 5 if all arguments perfectly match the schema requirements

3. **Completeness (1-5)**: Does the trajectory fully solve the request?
   - Score 1 if major parts of the request are unaddressed
   - Score 5 if the trajectory completely fulfills the request

4. **Efficiency (1-5)**: Is the trajectory optimal?
   - Score 1 if there are many unnecessary steps
   - Score 5 if the trajectory is optimal with no wasted steps

**is_valid:** Set to False if tool_validity < 4 OR argument_validity < 4. These trajectories have errors and should be discarded.

**issues:** List specific problems found. Examples:
- "Step 2 uses list_name='Prospects' but valid values are: 'Backlog', 'In Progress', 'In Review', 'Completed'"
- "Step 1 calls 'email_send' but correct tool name is 'email_send_email'"
- "None" if no issues found

**Output:** Return TrajectoryJudgeScores with all fields.
"""

print("User Query Judge Prompt loaded")
print("Trajectory Judge Prompt loaded")

User Query Judge Prompt loaded
Trajectory Judge Prompt loaded
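
The judge prompts ask the LLM to set `is_valid` from score thresholds, but an LLM may not always apply its own rule consistently. One option is to re-apply the same thresholds deterministically when filtering. The sketch below assumes each record is a dictionary whose `user_query_judge` and `trajectory_judge` entries hold the fields of `UserQueryJudgeScores` and `TrajectoryJudgeScores`. The helper names and the sample scores are hypothetical.

```python
def query_passes(scores: dict) -> bool:
    """Mirror the query judge rule: invalid if feasibility < 3 or schema_compliance < 3."""
    return scores["feasibility"] >= 3 and scores["schema_compliance"] >= 3


def trajectory_passes(scores: dict) -> bool:
    """Mirror the trajectory judge rule: invalid if tool_validity < 4 or argument_validity < 4."""
    return scores["tool_validity"] >= 4 and scores["argument_validity"] >= 4


def keep_record(record: dict) -> bool:
    """Keep a record only if both judges accept it and the scores agree with the thresholds."""
    query = record["user_query_judge"]
    trajectory = record["trajectory_judge"]
    return bool(
        query["is_valid"]
        and query_passes(query)
        and trajectory["is_valid"]
        and trajectory_passes(trajectory)
    )


# Hypothetical judge outputs for two generated records
records = [
    {
        "user_query_judge": {"feasibility": 5, "schema_compliance": 5, "is_valid": True},
        "trajectory_judge": {"tool_validity": 5, "argument_validity": 4, "is_valid": True},
    },
    {
        # The judge said is_valid=True, but argument_validity is below the threshold
        "user_query_judge": {"feasibility": 4, "schema_compliance": 3, "is_valid": True},
        "trajectory_judge": {"tool_validity": 5, "argument_validity": 3, "is_valid": True},
    },
]

kept = [record for record in records if keep_record(record)]
print(f"kept {len(kept)} of {len(records)} records")
```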


## Step 5: Create Seed Data

**Data Designer** works by expanding seed data through LLM generation. Each seed row provides context variables that get substituted into the prompt templates:

- `category`: Which tool database to focus on (ensures diversity across domains)
- `pattern`: Which multi-step pattern to use (e.g., lookup-then-send, search-then-update)
- `tools_json`: Full tool schemas for the LLM to reference
- `system_prompt`: The system context for the workplace assistant

> **Pattern Engineering Note:** The multi-step patterns used as seeds in `create_seed_data()` are domain-informed. In practice, you can engineer these patterns from heuristics, from tool-call chains inferred from production traffic, or from other rule-based design choices. In this notebook, the common patterns are stored in `tools/environment.json`.
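
One way to infer tool-call chains from observed traffic is to count which tools tend to follow each other. The sketch below is a minimal illustration under that assumption. The traces are invented for the example, and `frequent_pairs` is a hypothetical helper, not part of Data Designer or NeMo Gym.

```python
from collections import Counter


def frequent_pairs(traces: list, min_count: int = 2) -> list:
    """Count consecutive (tool, next_tool) pairs and keep those seen at least min_count times."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Invented traces: each inner list is the ordered tool names of one logged request
traces = [
    ["company_directory_find_email_address", "email_send_email"],
    ["company_directory_find_email_address", "email_send_email"],
    ["email_search_emails", "email_reply_email"],
    ["email_search_emails", "company_directory_find_email_address", "email_forward_email"],
]

for (first, second), n in frequent_pairs(traces):
    print(f"{first} -> {second}: seen {n} times")
```

Pairs that recur often are candidates for named patterns such as `lookup_then_send_email`.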

In [9]:
def create_seed_data(num_seeds: int = 100) -> pd.DataFrame:
    """
    Create seed data for the Data Designer pipeline.
    
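    Each seed contains:
    - seed_id: Index of the seed row
    - category: Which tool category to focus on
    - pattern: Which multi-step pattern to use
    - tools_description: Formatted tool descriptions (category tools plus company_directory)
    - tools_json: Full tool schemas as JSON (category tools plus company_directory)
    - tools_summary: Name and description of every tool
    - system_prompt: The system context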
    """
    seeds = []
    
    categories = list(TOOL_CATEGORIES.keys())
    patterns = [
        f"{p['pattern']}: {p['description']}" for p in MULTI_STEP_PATTERNS
    ]
    
    for i in range(num_seeds):
        # Select category and pattern (ensuring diversity)
        category = categories[i % len(categories)]
        pattern = patterns[i % len(patterns)]
        
        # Get tools for this category (plus company_directory for lookups)
        relevant_tool_names = TOOL_CATEGORIES[category] + TOOL_CATEGORIES.get('company_directory', [])
        relevant_tools = [t for t in TOOLS if t['name'] in relevant_tool_names]
        
        seeds.append({
            'seed_id': i,
            'category': category,
            'pattern': pattern,
            'tools_description': format_tools_for_prompt(relevant_tools, include_schemas=True),
            'tools_json': json.dumps(relevant_tools, indent=2),
            'tools_summary': format_tools_for_prompt(TOOLS),  # All tools for judge
            'system_prompt': SYSTEM_PROMPT,
        })
    
    return pd.DataFrame(seeds)

# Create seed data
seed_df = create_seed_data(num_seeds=50)
print(f"Created {len(seed_df)} seeds")
print(f"\nSample seed:")
print(seed_df.iloc[0].to_dict())

Created 50 seeds

Sample seed:
{'seed_id': 0, 'category': 'company_directory', 'pattern': "lookup_then_send_email: Look up a person's email address, then send them an email", 'tools_description': "- **company_directory_find_email_address**: Finds all email addresses containing the given name (case-insensitive search).\n  Parameters: ['name']", 'tools_json': '[\n  {\n    "type": "function",\n    "name": "company_directory_find_email_address",\n    "description": "Finds all email addresses containing the given name (case-insensitive search).",\n    "database": "company_directory",\n    "operation_type": "read",\n    "parameters": {\n      "type": "object",\n      "properties": {\n        "name": {\n          "type": "string",\n          "description": "Name or partial name to search for in email addresses"\n        }\n      },\n      "required": [],\n      "additionalProperties": false\n    },\n    "strict": false\n  }\n]', 'tools_summary': '- **company_directory_find_email_address**: Fi

Save the seed data as a Parquet file for Data Designer to consume.

In [10]:
# Save seeds to parquet for Data Designer
seed_df.to_parquet('workplace_assistant_seeds.parquet', index=False)
print("Seeds saved to workplace_assistant_seeds.parquet")

Seeds saved to workplace_assistant_seeds.parquet


## Step 6: Configure the Data Designer Pipeline

Now we wire everything together into a **Data Designer** workflow:

1. **Load Seeds**: Provides category, pattern, and tools for each generation
2. **Generate User Query**: LLM creates a realistic workplace request
3. **Judge User Query**: LLM validates feasibility and schema compliance
4. **Simulate Trajectory**: LLM generates the step-by-step tool-call solution
5. **Judge Trajectory**: LLM validates tool names and argument correctness
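
The ordering of these stages follows from which columns each prompt references. The sketch below derives a valid generation order from the placeholders alone, using the standard-library `graphlib`. It illustrates the idea of dependency sorting and is not Data Designer's implementation. The abbreviated prompts contain only the placeholders that appear in this notebook's templates, and `generation_order` is a hypothetical helper.

```python
import re
from graphlib import TopologicalSorter

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(column_prompts: dict) -> list:
    """Order generated columns so each comes after the generated columns it references."""
    generated = set(column_prompts)
    graph = {
        name: {ref for ref in PLACEHOLDER.findall(prompt) if ref in generated}
        for name, prompt in column_prompts.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Abbreviated prompts: only the placeholders matter for ordering.
# Seed columns (tools_json, category, pattern, system_prompt) are ignored
# because they already exist before generation starts.
column_prompts = {
    "trajectory_judge": "{{ user_query }} {{ trajectory }} {{ tools_json }}",
    "trajectory": "{{ user_query }} {{ system_prompt }} {{ tools_json }}",
    "user_query_judge": "{{ user_query }} {{ tools_json }}",
    "user_query": "{{ tools_json }} {{ category }} {{ pattern }}",
}

order = generation_order(column_prompts)
print(order)
```

In this sketch `user_query` always comes first and `trajectory` always precedes `trajectory_judge`. The position of `user_query_judge` relative to `trajectory` is not fixed, because neither references the other.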

### Configuration

First, set up the **NVIDIA Inference API** provider and model. The API key is read from the `NVIDIA_API_KEY` environment variable; if the variable is unset or empty, the next cell prompts for it.

In [11]:
import getpass

if "NVIDIA_API_KEY" not in os.environ or not os.environ["NVIDIA_API_KEY"]:
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")

In [12]:
# Define custom provider pointing to NVIDIA Inference API
NVIDIA_INFERENCE_URL = "https://inference-api.nvidia.com/v1"

custom_providers = [
    ModelProvider(
        name="nvidia-inference",
        endpoint=NVIDIA_INFERENCE_URL,
        provider_type="openai",
        api_key=os.environ.get("NVIDIA_API_KEY", ""),
    ),
]

# Model name must match NVIDIA's model identifier
MODEL_ID = "nvidia/openai/gpt-oss-120b"
MODEL_ALIAS = "gpt-oss-120b"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider="nvidia-inference",
        inference_parameters=ChatCompletionInferenceParams(
            max_tokens=16384,
        ),
    )
]

# Initialize DataDesigner and config builder
data_designer = DataDesigner(model_providers=custom_providers)
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

Build the pipeline with four generation columns: user query, user query judge, trajectory, and trajectory judge.

In [13]:
def build_workplace_assistant_pipeline():
    """
    Build the complete Data Designer pipeline for generating 
    multi-step tool-calling training data.
    
    Pipeline stages:
    1. Generate user query
    2. Judge user query (filter invalid queries early)
    3. Generate trajectory 
    4. Judge trajectory (filter invalid trajectories)
    """
    
    # Initialize the config builder
    config_builder = DataDesignerConfigBuilder(model_configs=model_configs)
    
    # Load seed data
    seed_ref = LocalFileSeedSource(path='workplace_assistant_seeds.parquet')
    config_builder.with_seed_dataset(seed_ref, sampling_strategy=SamplingStrategy.SHUFFLE)
    
    # Column 1: Generate User Query
    # This creates a realistic workplace request based on the category and pattern
    config_builder.add_column(
        LLMTextColumnConfig(
            name="user_query",
            prompt=USER_QUERY_GENERATION_PROMPT,
            model_alias=MODEL_ALIAS,
        )
    )
    
    # Column 2: Judge User Query
    # Validates that the user query is feasible and uses valid enum values
    config_builder.add_column(
        LLMStructuredColumnConfig(
            name="user_query_judge",
            prompt=USER_QUERY_JUDGE_PROMPT,
            output_format=UserQueryJudgeScores,
            model_alias=MODEL_ALIAS,
        )
    )
    
    # Column 3: Simulate Agent Trajectory
    # This generates the step-by-step solution with tool calls
    config_builder.add_column(
        LLMStructuredColumnConfig(
            name="trajectory",
            prompt=TRAJECTORY_SIMULATION_PROMPT,
            output_format=AgentTrajectory,
            model_alias=MODEL_ALIAS,
        )
    )
    
    # Column 4: Judge Trajectory
    # Validates that the trajectory uses correct tool names and valid argument values
    config_builder.add_column(
        LLMStructuredColumnConfig(
            name="trajectory_judge",
            prompt=TRAJECTORY_JUDGE_PROMPT,
            output_format=TrajectoryJudgeScores,
            model_alias=MODEL_ALIAS,
        )
    )
    
    return config_builder

# Build the pipeline
pipeline = build_workplace_assistant_pipeline()
print("Pipeline configured with 4 generation columns:")
print("  1. user_query (text) - Generate realistic user request")
print("  2. user_query_judge (structured) - Validate query feasibility and schema compliance")
print("  3. trajectory (structured) - Generate step-by-step solution")
print("  4. trajectory_judge (structured) - Validate tool calls and argument values")

Pipeline configured with 4 generation columns:
  1. user_query (text) - Generate realistic user request
  2. user_query_judge (structured) - Validate query feasibility and schema compliance
  3. trajectory (structured) - Generate step-by-step solution
  4. trajectory_judge (structured) - Validate tool calls and argument values


Validate the pipeline configuration to catch any issues before generation.

In [14]:
data_designer.validate(pipeline)

[13:51:08] [INFO] ✅ Validation passed


Run a quick preview with 2 records to verify the pipeline produces well-formed outputs before scaling up.

In [15]:
preview = data_designer.preview(pipeline, num_records=2)

[13:51:12] [INFO] 📸 Preview generation in progress
[13:51:12] [INFO] ✅ Validation passed
[13:51:12] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[13:51:12] [INFO] 🩺 Running health checks for models...
[13:51:12] [INFO]   |-- 👀 Checking 'nvidia/openai/gpt-oss-120b' in provider named 'nvidia-inference' for model alias 'gpt-oss-120b'...
[13:51:14] [INFO]   |-- ✅ Passed!
[13:51:14] [INFO] 🌱 Sampling 2 records from seed dataset
[13:51:14] [INFO]   |-- seed dataset size: 50 records
[13:51:14] [INFO]   |-- sampling strategy: shuffle
[13:51:14] [INFO] llm-text model configuration for generating column 'user_query'
[13:51:14] [INFO]   |-- model: 'nvidia/openai/gpt-oss-120b'
[13:51:14] [INFO]   |-- model alias: 'gpt-oss-120b'
[13:51:14] [INFO]   |-- model provider: 'nvidia-inference'
[13:51:14] [INFO]   |-- inference parameters: generation_type=chat-completion, max_parallel_requests=4, max_tokens=16384
[13:51:14] [INFO] 🐙 Processing llm-text column 'user_
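
The log line about sorting column configs into a Directed Acyclic Graph reflects a simple constraint: a column can only be generated after the columns its prompt refers to. The sketch below illustrates that idea with the standard library. It is not Data Designer's implementation; the shortened prompt strings are hypothetical, and it assumes prompts reference other columns through Jinja-style `{{ name }}` placeholders.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical, shortened prompt templates (the real prompts are defined earlier
# in the notebook). Assumption: columns are referenced as {{ column_name }}.
PROMPTS = {
    "user_query": "Write a request for {{ category }} following {{ pattern }}.",
    "user_query_judge": "Evaluate this query: {{ user_query }}",
    "trajectory": "Solve {{ user_query }} using {{ tools_json }}.",
    "trajectory_judge": "Evaluate {{ trajectory }} for {{ user_query }}.",
}

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)")


def generation_order(prompts: dict[str, str]) -> list[str]:
    """Return generated columns in an order that respects prompt references."""
    # Map each column to the generated columns it depends on. Seed columns
    # such as `category` already exist, so they are left out of the graph.
    graph = {
        name: {ref for ref in PLACEHOLDER.findall(text) if ref in prompts}
        for name, text in prompts.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = generation_order(PROMPTS)
print(order)
```

Any order returned this way places `user_query` first and `trajectory` before `trajectory_judge`, which matches the four pipeline stages listed in the docstring.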

Inspect the preview dataset, then look at one generated user query alongside its simulated trajectory.

In [18]:
preview.dataset

Unnamed: 0,seed_id,category,pattern,tools_description,tools_json,tools_summary,system_prompt,user_query,user_query__reasoning_trace,user_query_judge,user_query_judge__reasoning_trace,trajectory,trajectory__reasoning_trace,trajectory_judge,trajectory_judge__reasoning_trace
0,47,customer_relationship_manager,search_then_batch_update_customers: Search for...,- **company_directory_find_email_address**: Fi...,"[\n {\n ""type"": ""function"",\n ""name"": ""...",- **company_directory_find_email_address**: Fi...,"Today's date is Thursday, 2026-01-29 and the c...",Please reassign all of my Qualified customers ...,We need to produce a realistic user request th...,"{'feasibility': 5, 'schema_compliance': 5, 'na...","We need to evaluate the user query: ""Please re...","{'reasoning_trace': [{'step_number': 1, 'thoug...",We need to produce a trajectory of steps to re...,"{'tool_validity': 5, 'argument_validity': 5, '...",We need to evaluate the generated trajectory.\...
1,32,calendar,get_email_info_then_forward: Get specific info...,- **company_directory_find_email_address**: Fi...,"[\n {\n ""type"": ""function"",\n ""name"": ""...",- **company_directory_find_email_address**: Fi...,"Today's date is Thursday, 2026-01-29 and the c...","Can you pull the details of the ""Project Kicko...",We need generate a realistic user request that...,"{'feasibility': 5, 'schema_compliance': 5, 'na...","We need to evaluate the user query: ""Can you p...","{'reasoning_trace': [{'step_number': 1, 'thoug...",We need to produce trajectory steps to accompl...,"{'tool_validity': 5, 'argument_validity': 5, '...",We need to evaluate the generated trajectory.\...


In [23]:
preview.dataset.user_query[0], preview.dataset.trajectory[0]

('Please reassign all of my Qualified customers who are interested in Consulting and whose last contact date is before 2024-01-01 to Maria Lopez (maria.lopez@company.com) and set their follow‑up date to 2024-03-01.',
 {'reasoning_trace': [{'step_number': 1,
    'thought': 'First I need to find all customers that match the criteria: status Qualified, product_interest Consulting, and last_contact_date before 2024-01-01. I will search the CRM with these filters.',
    'tool_call': {'name': 'customer_relationship_manager_search_customers',
     'arguments': '{"status": "Qualified", "product_interest": "Consulting", "last_contact_date_max": "2023-12-31", "page": 1, "page_size": 50}'},
    'expected_result': 'A list of customer records that satisfy the criteria, including their customer_id values (e.g., 11111111 and 22222222).'},
   {'step_number': 2,
    'thought': 'Reassign the first matching customer (ID 11111111) to Maria Lopez by updating the assigned_to_email field.',
    'tool_call'
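
In the output above, each step's `arguments` value is a JSON-encoded string rather than a dict, so it has to be decoded before individual argument values can be checked. A minimal sketch follows, using an abbreviated copy of step 1; the `decode_tool_calls` helper is hypothetical and not part of the notebook's utilities.

```python
import json

# First step of the trajectory shown above, abbreviated to the fields used here.
trajectory = {
    "reasoning_trace": [
        {
            "step_number": 1,
            "tool_call": {
                "name": "customer_relationship_manager_search_customers",
                "arguments": '{"status": "Qualified", "product_interest": "Consulting", '
                '"last_contact_date_max": "2023-12-31", "page": 1, "page_size": 50}',
            },
        }
    ]
}


def decode_tool_calls(traj: dict) -> list[tuple[str, dict]]:
    """Return (tool_name, parsed_arguments) for each step that has a tool call."""
    calls = []
    for step in traj["reasoning_trace"]:
        call = step.get("tool_call")
        if not call:
            # Defensive skip; whether steps without a tool call occur is an assumption.
            continue
        calls.append((call["name"], json.loads(call["arguments"])))
    return calls


calls = decode_tool_calls(trajectory)
for name, args in calls:
    print(name, args)
```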

## Step 7: Set Up Quality Filtering (Generic)

Before any downstream format conversion, we apply dual-level quality filtering to keep only high-quality examples.

This stage is generic to **Data Designer** workflows and not specific to NeMo Gym.

To keep this notebook clean, quality filtering helpers live in `utils/quality_filtering.py`.

In [36]:
from utils.quality_filtering import (
    FilterThresholds,
    filter_high_quality,
    print_quality_filtering_quickstart,
    show_rejection_reasons,
)

# This keeps notebook cells lightweight and points users to the utility API.
print_quality_filtering_quickstart()

# Optional: define reusable custom thresholds for stricter/looser filtering.
# custom_thresholds = FilterThresholds(min_query_schema_compliance=5)
# filtered_df = filter_high_quality(results_df, **custom_thresholds.to_kwargs(), verbose=True)

Quality filtering utility loaded.

Quickstart:
1) Inspect rejection reasons:
   show_rejection_reasons(results_df, num_examples=3)
2) Filter with default strict thresholds:
   filtered_df = filter_high_quality(results_df, verbose=True)
3) Customize thresholds (optional):
   thresholds = FilterThresholds(min_query_schema_compliance=5)
   filtered_df = filter_high_quality(results_df, **thresholds.to_kwargs())


## Step 8: Generate and Filter the Dataset

Run the full pipeline end-to-end: generate records and apply **dual-level quality filtering**.

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Generate   │───▶│ Stage 1:     │───▶│ Stage 2:     │───▶│   Filtered   │
│   Records    │    │ Query Judge  │    │ Traj Judge   │    │   Dataset    │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
```

**Utility location:** `utils/quality_filtering.py`

**Quick usage:**
- Run `show_rejection_reasons(results_df, num_examples=3)` to inspect failures
- Run `filter_high_quality(results_df, verbose=True)` to apply default strict filtering
- Optionally tune thresholds with `FilterThresholds(...).to_kwargs()`

**Why Dual-Level Filtering?**
- **Stage 1 (User Query)**: Catches queries that cannot be fulfilled with the available tools or that use invalid enum values.
- **Stage 2 (Trajectory)**: Catches tool-name and argument errors that slipped through, as well as trajectories that do not fully answer the query.

In [32]:
print("Generating 10 examples...")
results = data_designer.create(pipeline, num_records=10)

results_df = results.load_dataset()
print(f"\nGenerated {len(results_df)} records")
print("\nColumns:", list(results_df.columns))

[14:43:34] [INFO] 🎨 Creating Data Designer dataset
[14:43:34] [INFO] ✅ Validation passed
[14:43:34] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[14:43:34] [INFO] 📂 Dataset path '/Users/shashankv/Documents/Work/workplace-asst-sdg/DataDesigner/docs/colab_notebooks/workplace_assistant/artifacts/dataset' already exists. Dataset from this session
		     will be saved to '/Users/shashankv/Documents/Work/workplace-asst-sdg/DataDesigner/docs/colab_notebooks/workplace_assistant/artifacts/dataset_02-12-2026_144334' instead.
[14:43:34] [INFO] 🩺 Running health checks for models...
[14:43:34] [INFO]   |-- 👀 Checking 'nvidia/openai/gpt-oss-120b' in provider named 'nvidia-inference' for model alias 'gpt-oss-120b'...


Generating 10 examples...


[14:43:34] [INFO]   |-- ✅ Passed!
[14:43:34] [INFO] ⏳ Processing batch 1 of 1
[14:43:34] [INFO] 🌱 Sampling 10 records from seed dataset
[14:43:34] [INFO]   |-- seed dataset size: 50 records
[14:43:34] [INFO]   |-- sampling strategy: shuffle
[14:43:34] [INFO] llm-text model configuration for generating column 'user_query'
[14:43:34] [INFO]   |-- model: 'nvidia/openai/gpt-oss-120b'
[14:43:34] [INFO]   |-- model alias: 'gpt-oss-120b'
[14:43:34] [INFO]   |-- model provider: 'nvidia-inference'
[14:43:34] [INFO]   |-- inference parameters: generation_type=chat-completion, max_parallel_requests=4, max_tokens=16384
[14:43:34] [INFO] 🐙 Processing llm-text column 'user_query' with 4 concurrent workers
[14:43:47] [INFO] llm-structured model configuration for generating column 'user_query_judge'
[14:43:47] [INFO]   |-- model: 'nvidia/openai/gpt-oss-120b'
[14:43:47] [INFO]   |-- model alias: 'gpt-oss-120b'
[14:43:47] [INFO]   |-- model provider: 'nvidia-inference'
[14:43:47] [INFO]   |-- 


Generated 10 records

Columns: ['seed_id', 'category', 'pattern', 'tools_description', 'tools_json', 'tools_summary', 'system_prompt', 'user_query', 'user_query__reasoning_trace', 'user_query_judge', 'user_query_judge__reasoning_trace', 'trajectory', 'trajectory__reasoning_trace', 'trajectory_judge', 'trajectory_judge__reasoning_trace']
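
The judge columns hold one dict of scores per record, which is awkward to sort or aggregate directly. One way to inspect them is to flatten the dicts into ordinary columns with `pandas.json_normalize`. The sketch below uses two made-up rows; the keys follow the judge outputs shown in this notebook, but the values and the `query_` prefix are illustrative choices.

```python
import pandas as pd

# Two illustrative rows. The dict keys match the judge output in this notebook;
# the score values are invented for the example.
sample_df = pd.DataFrame(
    {
        "user_query": ["query A", "query B"],
        "user_query_judge": [
            {"feasibility": 5, "schema_compliance": 5, "naturalness": 4, "is_valid": True},
            {"feasibility": 1, "schema_compliance": 5, "naturalness": 4, "is_valid": False},
        ],
    }
)

# Expand each score dict into its own columns, prefixed to avoid name clashes.
scores = pd.json_normalize(sample_df["user_query_judge"].tolist()).add_prefix("query_")
flat_df = pd.concat([sample_df[["user_query"]], scores], axis=1)
print(flat_df)
```

The same pattern should apply to `trajectory_judge`, after which the usual `describe()` or `sort_values()` calls work on the score columns.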


Inspect rejection reasons at both judge levels to understand what kinds of errors the pipeline catches.

In [33]:
show_rejection_reasons(results_df, num_examples=3)


STAGE 1: USER QUERY ISSUES

[INVALID QUERIES] (1 total)

  [1] Query: Please find the most recent email from Acme Corporation about their interest in ...
      Feasibility: 1/5 | Schema: 5/5
      Issues: No available tool can search for or retrieve email messages, nor forward them; the request cannot be fulfilled with the provided functions.

STAGE 2: TRAJECTORY ISSUES

  No trajectory issues found!




Apply dual-level filtering with strict schema compliance requirements. Records must pass **both** the user query judge and the trajectory judge to be kept.

In [37]:
filtered_df = filter_high_quality(
    results_df,
    min_query_feasibility=3,
    min_query_schema_compliance=4,
    min_query_naturalness=3,
    min_trajectory_tool_validity=4,
    min_trajectory_argument_validity=4,
    min_trajectory_completeness=3,
    min_trajectory_efficiency=3,
    verbose=True,
)

DUAL-LEVEL QUALITY FILTERING RESULTS

Total records: 10

----------------------------------------------------------------------
STAGE 1: USER QUERY FILTERING
----------------------------------------------------------------------
  is_valid=True:                    9 / 10 ( 90.0%)
  feasibility >= 3:              9 / 10 ( 90.0%)
  schema_compliance >= 4:       10 / 10 (100.0%)
  naturalness >= 3:             10 / 10 (100.0%)
  ----------------------------------------------
  PASSED Stage 1:                   9 / 10 ( 90.0%)

----------------------------------------------------------------------
STAGE 2: TRAJECTORY FILTERING
----------------------------------------------------------------------
  is_valid=True:                   10 / 10 (100.0%)
  tool_validity >= 4:          10 / 10 (100.0%)
  argument_validity >= 4:      10 / 10 (100.0%)
  completeness >= 3:            4 / 10 ( 40.0%)
  efficiency >= 3:              9 / 10 ( 90.0%)
  ----------------------------------------------
  PAS
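
The report above suggests the shape of the filter: a record appears to be kept only when the judge marked it valid and every score meets its minimum, at both stages. A minimal pandas sketch of that logic follows. It is a hypothetical stand-in, not the code in `utils/quality_filtering.py`; the `passes` helper and the toy scores are invented for illustration, while the field names and minimums mirror the call above.

```python
import pandas as pd

# Minimums copied from the filter_high_quality call above.
QUERY_MIN = {"feasibility": 3, "schema_compliance": 4, "naturalness": 3}
TRAJECTORY_MIN = {
    "tool_validity": 4,
    "argument_validity": 4,
    "completeness": 3,
    "efficiency": 3,
}


def passes(scores: dict, minimums: dict[str, int]) -> bool:
    """True when the judge marked the record valid and every score meets its minimum."""
    return bool(scores.get("is_valid")) and all(
        scores.get(field, 0) >= minimum for field, minimum in minimums.items()
    )


good_query = {"feasibility": 5, "schema_compliance": 5, "naturalness": 5, "is_valid": True}
good_traj = {
    "tool_validity": 5,
    "argument_validity": 5,
    "completeness": 5,
    "efficiency": 5,
    "is_valid": True,
}

# Toy data: row 1 fails Stage 1 (infeasible query), row 2 fails Stage 2 (incomplete).
toy_df = pd.DataFrame(
    {
        "user_query_judge": [
            good_query,
            {**good_query, "feasibility": 1, "is_valid": False},
            good_query,
        ],
        "trajectory_judge": [good_traj, good_traj, {**good_traj, "completeness": 2}],
    }
)

stage1 = toy_df["user_query_judge"].apply(lambda s: passes(s, QUERY_MIN))
stage2 = toy_df["trajectory_judge"].apply(lambda s: passes(s, TRAJECTORY_MIN))
kept = toy_df[stage1 & stage2]
print(f"Stage 1: {stage1.sum()} / {len(toy_df)}, both stages: {len(kept)} / {len(toy_df)}")
```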

Re-run the rejection report on the filtered DataFrame to confirm that no flagged records remain.

In [38]:
show_rejection_reasons(filtered_df, num_examples=3)


STAGE 1: USER QUERY ISSUES

  No user query issues found!

STAGE 2: TRAJECTORY ISSUES

  No trajectory issues found!




## Step 9 (Optional): Convert to NeMo Gym Format and Save

If you plan to use this data with **NeMo Gym**, convert filtered records into NeMo Gym JSONL format and save them.

This conversion is NeMo Gym-specific and optional for generic Data Designer workflows.

In [39]:
from utils.convert_to_nemo_gym_format import (
    build_nemo_gym_converter,
    print_convert_to_nemo_gym_format_quickstart,
    save_for_nemo_gym,
)

# Build converter only when you need NeMo Gym output.
convert_to_nemo_gym_format = build_nemo_gym_converter(
    tools=TOOLS,
    system_prompt=SYSTEM_PROMPT,
    environment_name="workplace_assistant",
)
print_convert_to_nemo_gym_format_quickstart()

output_path = "workplace_assistant_train-gpt-oss.jsonl"
save_for_nemo_gym(filtered_df, output_path, convert_fn=convert_to_nemo_gym_format)

print("\nSample generated data (passed both quality stages):")
filtered_df.head()

convert_to_nemo_gym_format utility loaded.

Quickstart:
1) Build a converter tied to your tools + system prompt:
   convert_fn = build_nemo_gym_converter(tools=TOOLS, system_prompt=SYSTEM_PROMPT)
2) Save any filtered dataframe to JSONL:
   save_for_nemo_gym(filtered_df, 'workplace_assistant_train.jsonl', convert_fn)
Saved 4 examples to workplace_assistant_train-gpt-oss.jsonl

Sample generated data (passed both quality stages):


Unnamed: 0,seed_id,category,pattern,tools_description,tools_json,tools_summary,system_prompt,user_query,user_query__reasoning_trace,user_query_judge,user_query_judge__reasoning_trace,trajectory,trajectory__reasoning_trace,trajectory_judge,trajectory_judge__reasoning_trace
0,13,email,lookup_then_create_task: Look up a person's em...,- **company_directory_find_email_address**: Fi...,"[  {  ""type"": ""function"",  ""name"": ""com...",- **company_directory_find_email_address**: Fi...,"Today's date is Thursday, 2026-01-29 and the c...",Can you find Marco Alvarez’s email address and...,The user wants a realistic user request that r...,"{'feasibility': 5, 'is_valid': True, 'issues':...","We need to evaluate the user query: ""Can you f...","{'final_answer': ""Marco Alvarez's email was fo...",We need to produce a trajectory: find Marco Al...,"{'argument_validity': 5, 'completeness': 5, 'e...",We need to evaluate the generated trajectory. ...
1,40,project_management,compare_traffic_sources: Compare multiple traf...,- **company_directory_find_email_address**: Fi...,"[  {  ""type"": ""function"",  ""name"": ""com...",- **company_directory_find_email_address**: Fi...,"Today's date is Thursday, 2026-01-29 and the c...",I need to take over Jason Lee’s pending design...,We need to produce a realistic user request th...,"{'feasibility': 5, 'is_valid': True, 'issues':...","We need to evaluate the user query: ""I need to...","{'final_answer': ""All of Jason Lee's pending d...",We need to produce a trajectory of steps to mo...,"{'argument_validity': 5, 'completeness': 3, 'e...",We need to evaluate the trajectory. User requ...
2,9,analytics,multiple_analytics_queries: Query multiple ana...,- **company_directory_find_email_address**: Fi...,"[  {  ""type"": ""function"",  ""name"": ""com...",- **company_directory_find_email_address**: Fi...,"Today's date is Thursday, 2026-01-29 and the c...",Could you pull the total number of website vis...,We need to generate a realistic user request t...,"{'feasibility': 5, 'is_valid': True, 'issues':...","We need to evaluate query: ""Could you pull tot...","{'final_answer': 'Collected total visits, enga...",The user asks: pull total number of website vi...,"{'argument_validity': 5, 'completeness': 5, 'e...",We need to evaluate the trajectory. User requ...
3,1,email,lookup_then_forward_email: Search for an email...,- **company_directory_find_email_address**: Fi...,"[  {  ""type"": ""function"",  ""name"": ""com...",- **company_directory_find_email_address**: Fi...,"Today's date is Thursday, 2026-01-29 and the c...",Please locate the email Jane Patel sent on Mar...,We need to produce a realistic user request th...,"{'feasibility': 5, 'is_valid': True, 'issues':...","We need to evaluate the user query: ""Please lo...",{'final_answer': 'The email from Jane Patel da...,We need to produce a trajectory with steps to ...,"{'argument_validity': 5, 'completeness': 4, 'e...",We need to evaluate the trajectory. User requ...
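
Before handing the exported file to a training job, it can be worth confirming that every line parses as JSON and that the record count matches the "Saved 4 examples" message. The sketch below is a generic JSONL check, not a NeMo Gym utility; `count_jsonl_records` is a hypothetical helper, and it assumes each exported record is a JSON object. It runs on a throwaway demo file so it stays self-contained.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Parse every non-empty line as JSON and return the number of records."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            count += 1
    return count


# Demo on a temporary file with placeholder records. To check the real export,
# pass Path("workplace_assistant_train-gpt-oss.jsonl") instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    demo_count = count_jsonl_records(demo_path)

print(demo_count)
```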


## Summary

This notebook demonstrated how to build a complete synthetic data generation pipeline for multi-step tool-calling tasks using **Data Designer**. The pipeline generates user queries, simulates agent trajectories, and applies dual-level LLM judge filtering to produce high-quality training data.

## Next Steps

- **Scale up generation**: Increase `num_seeds` and `num_records` to produce larger training sets (1,000+ examples)
- **Customize for your domain**: Replace the Workplace Assistant tools with your own tool definitions
- **Add more multi-step patterns**: Define new patterns in `environment.json` to increase task diversity
- **Tune judge thresholds**: Inspect rejected examples with `show_rejection_reasons()` and adjust filtering thresholds
- **Train with NeMo Gym**: Use the exported JSONL file for GRPO training:

```bash
# Prepare data for NeMo Gym
ng_prepare_data "+config_paths=[workplace_assistant.yaml]" \
    +output_dirpath=data/workplace_assistant \
    +input_jsonl=workplace_assistant_train-gpt-oss.jsonl

# Run GRPO training
python run_grpo_nemo_gym.py \
    --config=grpo_workplace_assistant.yaml \
    ++data.train_jsonl_fpath=data/workplace_assistant/train.jsonl
```