# 🧠 NeMo Data Designer: Synthetic Reasoning Traces

#### 📚 What you'll learn

- This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks.

- Instead of creating multi-turn conversations, we will generate reasoning traces that can be used to train and \
  fine-tune language models with reinforcement learning and to elicit chain-of-thought reasoning.

- These synthetic reasoning traces can be used to enhance model performance in areas such as mathematics, coding, scientific \
  reasoning, and other domains that benefit from structured reasoning.

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMStructuredColumnConfig,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
    SubcategorySamplerParams,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# This sets reasoning to False for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

## 🎲 Adding Categorical Columns for Controlled Diversity

- Now we'll add categorical columns to control the diversity of our generated examples.

- Sampler columns offer non-LLM based generation of synthetic data.

- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.
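
Conceptually, a category sampler paired with a subcategory sampler draws a parent value first and then a child value conditioned on it. The plain-Python sketch below illustrates that idea only; it is not Data Designer's implementation. Uniform sampling and the names `THEMES_BY_DOMAIN`, `COMPLEXITIES`, and `sample_row` are assumptions made for this example.

```python
import random

# Hypothetical mapping that mirrors the shape of a category -> subcategory config.
THEMES_BY_DOMAIN = {
    "Family Dynamics": ["Parenting Dilemmas", "Sibling Rivalries"],
    "Workplace Challenges": ["Communication Breakdowns", "Leadership Dilemmas"],
}
COMPLEXITIES = ["Basic", "Intermediate", "Advanced"]


def sample_row(rng: random.Random) -> dict:
    """Draw a domain, a theme conditioned on that domain, and an independent complexity."""
    domain = rng.choice(list(THEMES_BY_DOMAIN))
    theme = rng.choice(THEMES_BY_DOMAIN[domain])
    complexity = rng.choice(COMPLEXITIES)
    return {"domain": domain, "theme": theme, "complexity": complexity}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [sample_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the theme is always drawn from the list belonging to the sampled domain, every row stays internally consistent while the combinations still vary from row to row.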


In [None]:
# Define a domain column that sets the context for empathic scenarios in everyday life.
config_builder.add_column(
    SamplerColumnConfig(
        name="domain",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Family Dynamics",
                "Workplace Challenges",
                "Friendship Moments",
                "Community Interactions",
                "Personal Well-being",
                "Unexpected Encounters",
            ]
        ),
    )
)

# Add theme subcategories for each domain
config_builder.add_column(
    SamplerColumnConfig(
        name="theme",
        sampler_type=SamplerType.SUBCATEGORY,
        params=SubcategorySamplerParams(
            category="domain",
            values={
                "Family Dynamics": ["Parenting Dilemmas", "Sibling Rivalries"],
                "Workplace Challenges": [
                    "Communication Breakdowns",
                    "Leadership Dilemmas",
                ],
                "Friendship Moments": [
                    "Support & Understanding",
                    "Misunderstandings & Reconciliations",
                ],
                "Community Interactions": [
                    "Neighborhood Support",
                    "Cultural Celebrations",
                ],
                "Personal Well-being": ["Mental Health", "Self-care & Reflection"],
                "Unexpected Encounters": [
                    "Serendipitous Meetings",
                    "Moments of Realization",
                ],
            },
        ),
    )
)

# Define a complexity column to guide the level of detail and challenge in the empathic scenarios.
config_builder.add_column(
    SamplerColumnConfig(
        name="complexity",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["Basic", "Intermediate", "Advanced"]),
    )
)

## 🦜 LLM-generated columns

- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.

- Nested JSON fields can be accessed using dot notation.

- These prompts instruct the LLM to produce the empathic reasoning trace and answer, following the specified format with `<think>` and `<answer>` tags.
  <br>
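
To make the templating idea concrete, here is a small stdlib-only stand-in for Jinja that substitutes `{{ column }}` and `{{ column.field }}` placeholders from a row. It is a deliberately simplified sketch: real Jinja supports filters, control flow, and much more, and the `render` helper and the sample `row` are hypothetical.

```python
import re

# Matches {{ name }} or {{ name.field }}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, row: dict) -> str:
    """Replace each placeholder with the matching (possibly nested) value from row."""

    def lookup(match: re.Match) -> str:
        value = row
        for key in match.group(1).split("."):
            value = value[key]  # walk nested dicts one key at a time
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


row = {
    "domain": "Friendship Moments",
    "initial_trace": {"answer": "Listen first, then offer help."},
}
prompt = render("Domain: {{domain}}\nAnswer: {{ initial_trace.answer }}", row)
print(prompt)
```

The dotted form only works when the referenced column holds structured data, which is the case for columns generated with a structured output format.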

### 🧠 Empathic Reasoning Trace Generation

This column is designed to generate clear, thoughtful reasoning traces that blend logical analysis with emotional insight for everyday situations \
where empathy is crucial. The generation prompt is tailored to:

- Produce a structured explanation that highlights both the practical reasoning and the emotional dynamics at play.

- Encourage a dual output: one part detailing the empathic thought process (enclosed within `<think>` tags) and another delivering a \
  compassionate final answer (enclosed within `<answer>` tags).

- Ensure that the generated content reflects deep understanding, compassion, and a balanced view of the challenges and emotions involved.
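
If a model returns the dual output described above as tagged text, a downstream consumer might split it as in the sketch below. The helper names `ParsedResponse`, `_extract`, and `parse_response`, as well as the sample text, are hypothetical and only illustrate the tag convention.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    thinking: Optional[str]
    answer: Optional[str]


def _extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> ParsedResponse:
    """Split a tagged response into its reasoning and its final answer."""
    return ParsedResponse(_extract("think", text), _extract("answer", text))


sample = (
    "<think>My friend sounds exhausted; advice can wait.</think>\n"
    "<answer>Start by listening, then ask how you can help.</answer>"
)
parsed = parse_response(sample)
print(parsed.thinking)
print(parsed.answer)
```

Returning `None` for a missing tag, rather than raising, makes it easy to count malformed responses when inspecting a batch.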


In [None]:
EMPATHIC_SYSTEM_PROMPT = (
    "You are an empathic reasoning agent. Your task is to generate realistic and compassionate reasoning traces for "
    "common day-to-day situations. \n"
    "Adopt a caring and supportive tone as you provide detailed insights into human experiences and emotions.\n"
    "- Focus on everyday scenarios where empathy, understanding, and emotional intelligence are key.\n"
    "- Consider various perspectives, emphasizing the emotional impact of actions and decisions.\n"
    "- Ensure your reasoning process is clear, structured, and heartfelt, reflecting deep care for the individuals involved.\n"
    "- Enclose your thoughtful reasoning process within <think>...</think> tags before providing the final JSON output.\n"
    # Keep the control token on its own line so it is not fused with the preceding sentence.
    "/no_think"
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="scenario",
        model_alias=MODEL_ALIAS,
        system_prompt=EMPATHIC_SYSTEM_PROMPT,
        prompt=(
            "Generate a clear and concise everyday scenario for the {{domain}} domain, theme {{theme}}, and "
            "complexity {{complexity}}, where empathy and understanding play a crucial role. Focus on a situation that "
            "highlights emotional challenges or opportunities for compassionate support, and include a specific "
            "question or request for help that clearly outlines a problem or challenge needing resolution.\n\n"
            "Guidelines:\n"
            "- Provide only the scenario statement without any additional metadata, solution steps, or internal "
            "commentary.\n"
            "- Use everyday language and incorporate realistic, practical context from an empathic perspective.\n"
            "- Ensure the scenario includes a clear follow-up question or request for assistance, making it "
            "apparent what the problem or challenge is.\n"
            "- Do not include any formatting tags or markers.\n\n"
            "Examples:\n"
            "1. 'Imagine a situation where a friend is visibly upset after a long, challenging day. What might be "
            "causing their distress, and how could you offer support?'\n"
            "2. 'Consider a moment at a family dinner where a subtle conflict arises between members. What could be "
            "the underlying issue, and how might empathy help mend the situation?'\n"
            "3. 'Picture a colleague receiving unexpected criticism during a meeting. What are the potential emotional "
            "impacts, and what supportive response could be helpful?'\n"
        ),
    )
)

### ⚡️ Empathic Reasoning Process Generation

- These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios.

- The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight.

- The prompts instruct the model to include its internal thought process within `<think>...</think>` tags before providing the JSON output.
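
For intuition, the sketch below shows how raw output of this shape (a `<think>` block followed by JSON) could be handled by hand: drop the thinking block, parse the JSON, and check that the steps are numbered consecutively. Data Designer's structured columns take care of parsing themselves, so this is purely illustrative. The `parse_trace` helper and the sample output are hypothetical; the field names follow the `ReasoningTrace` model in this section.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def parse_trace(raw: str) -> dict:
    """Drop any <think> block, parse the remaining JSON, and check step numbering."""
    payload = json.loads(THINK_BLOCK.sub("", raw).strip())
    steps = [step["step_number"] for step in payload["reasoning"]]
    if steps != list(range(1, len(steps) + 1)):
        raise ValueError(f"Steps are not numbered 1..n: {steps}")
    return payload


raw_output = """<think>Acknowledge feelings before suggesting fixes.</think>
{"reasoning": [
    {"step_number": 1, "content": "Notice that the colleague seems discouraged."},
    {"step_number": 2, "content": "Offer specific, private encouragement."}
 ],
 "answer": "Check in privately and highlight what went well."}"""

trace = parse_trace(raw_output)
print(trace["answer"])
```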


In [None]:
from typing import List
from pydantic import BaseModel, Field


class Thought(BaseModel):
    """A single step in the structured empathic reasoning process.
    This step captures an empathetic observation or insight that informs a thoughtful, compassionate approach to addressing everyday challenges.
    """

    step_number: int = Field(
        ..., ge=1, description="The order of the reasoning step, starting from 1."
    )
    content: str = Field(
        ...,
        min_length=5,
        description=(
            "A detailed explanation of this reasoning step, incorporating both logical analysis and emotional insight."
        ),
    )


class ReasoningTrace(BaseModel):
    """A structured empathic reasoning trace for addressing a scenario.
    This model records a step-by-step process that integrates logical analysis with emotional insight and empathy to arrive at a supportive final answer.
    """

    reasoning: List[Thought] = Field(
        ...,
        description="Step-by-step reasoning leading to the final answer, enriched with empathetic observations and practical insights.",
    )
    answer: str = Field(
        ...,
        description="The final answer derived from the empathic reasoning process, offering compassionate guidance or resolution.",
    )


class Evaluation(BaseModel):
    """Output format for evaluating an empathic reasoning answer.
    The evaluation assesses the response based on correctness, clarity, and completeness,
    with feedback that emphasizes compassionate insight, clarity, and a holistic understanding of the scenario.
    """

    correctness: float = Field(
        ..., description="Overall correctness rating of the answer (0 to 1)."
    )
    clarity: float = Field(
        ...,
        description="Clarity rating of the reasoning, including the integration of empathic explanations (0 to 1).",
    )
    completeness: float = Field(
        ...,
        description="Completeness rating of the reasoning, assessing whether all practical and emotional aspects were considered (0 to 1).",
    )
    feedback: str = Field(
        ...,
        description="Detailed feedback on the reasoning trace and answer, with suggestions for enhancing empathetic and real-world applicability.",
    )


class FinalEvaluation(Evaluation):
    """Extended evaluation model for final empathic reasoning traces.
    This model adds criteria to assess visual structure and conciseness,
    ensuring the final output is both clear and visually appealing.
    """

    structure: float = Field(
        ...,
        description="Rating of the visual structure and formatting (0 to 1), assessing if reasoning steps and final answer are clearly delineated.",
    )
    conciseness: float = Field(
        ...,
        description="Rating of the conciseness of the reasoning trace (0 to 1), ensuring that extraneous verbosity is minimized.",
    )

In [None]:
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="initial_trace",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "You are an empathic reasoning agent. Provide a detailed, step-by-step reasoning process that "
            "thoughtfully addresses the following scenario. \n"
            "Begin by outlining your internal thought process, focusing on both logical considerations and "
            "emotional insights, enclosed within <think>...</think> tags. \n"
            "Then, provide your final compassionate answer.\n\n"
            "Scenario: {{scenario}}\n\n"
            "Ensure that your response is structured and reflective of a supportive, empathetic approach."
        ),
        output_format=ReasoningTrace,
    )
)

config_builder.add_column(
    LLMStructuredColumnConfig(
        name="initial_trace_evaluation",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "<initial_trace>{{initial_trace}}</initial_trace>\n\n"
            "Now, analyze the provided empathic reasoning trace and final answer as if you were an insightful "
            "observer assessing both logical and compassionate approaches. \n"
            "Evaluate the response with a focus on emotional insight, clarity, and holistic consideration.\n\n"
            "Include your internal thought process within <think>...</think> tags before providing the JSON."
        ),
        output_format=Evaluation,
    )
)

### Final Empathic Reasoning Trace Generation and Evaluation

- These columns refine and evaluate the final empathic reasoning trace.

- The final trace is generated by reviewing the scenario, the initial empathic reasoning trace, and its evaluation.

- The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive.

- As before, the prompts instruct the model to include its internal thought process within `<think>...</think>` tags before providing the final JSON output.
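
Because each prompt references earlier columns, the columns form a dependency graph that constrains the order in which they can be generated. The sketch below derives one valid order with the standard library's `graphlib`. The dependency map is written out by hand from the column references in this notebook; whether Data Designer resolves the order this way internally is an assumption.

```python
from graphlib import TopologicalSorter

# Each column maps to the columns that its prompt or sampler config references.
column_dependencies = {
    "domain": set(),
    "theme": {"domain"},
    "complexity": set(),
    "scenario": {"domain", "theme", "complexity"},
    "initial_trace": {"scenario"},
    "initial_trace_evaluation": {"initial_trace"},
    "final_trace": {"scenario", "initial_trace", "initial_trace_evaluation"},
    "final_trace_evaluation": {"final_trace"},
}

# static_order() yields every column after all of the columns it depends on.
generation_order = list(TopologicalSorter(column_dependencies).static_order())
print(generation_order)
```

A cycle in the references, such as two columns whose prompts mention each other, would make `static_order()` raise `graphlib.CycleError`, which is a quick way to spot a template that cannot be satisfied.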


In [None]:
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="final_trace",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "Review the scenario, your initial empathic reasoning trace, and its evaluation:\n\n"
            "Scenario: {{scenario}}\n\n"
            "Initial Empathic Reasoning Trace:\n{{initial_trace}}\n\n"
            "Initial Trace Evaluation:\n{{initial_trace_evaluation}}\n\n"
            "From the perspective of an empathic reasoning agent, provide a refined final reasoning trace that "
            "addresses both the emotional and logical dimensions of the scenario. \n"
            "Your final trace should be visually structured as follows:\n"
            "1. Present a numbered list of concise reasoning steps. Each step should be clear and free of "
            "unnecessary verbosity.\n"
            "2. Include a clearly separated section for the final answer, prefixed with a header "
            "(e.g., 'Final Answer:').\n"
            "3. Use visual markers or markdown formatting to enhance readability.\n"
            "Avoid adding extraneous details—focus on clarity and conciseness.\n\n"
            "Also, include your internal thought process wrapped within <think>...</think> tags. "
            "Return only the final, visually structured reasoning trace."
        ),
        output_format=ReasoningTrace,
    )
)

config_builder.add_column(
    LLMStructuredColumnConfig(
        name="final_trace_evaluation",
        model_alias=MODEL_ALIAS,
        system_prompt=SYSTEM_PROMPT,
        prompt=(
            "<final_trace>{{final_trace}}</final_trace>\n\n"
            "Analyze the provided empathic reasoning trace and final answer from the viewpoint of an "
            "insightful observer. \n"
            "Evaluate the response focusing on correctness, clarity, and completeness, as well as its "
            "visual structure and conciseness. \n"
            "Assess whether the reasoning steps are clearly separated (e.g., numbered or bullet-pointed) and if "
            "the final answer is distinct and succinct.\n\n"
            "Include your internal thought process within <think>...</think> tags before providing the JSON."
        ),
        output_format=FinalEvaluation,
    )
)

### 🔁 Iteration is key – preview the dataset!

1. Use the `preview` method to generate a sample of records quickly.

2. Inspect the results for quality and format issues.

3. Adjust column configurations, prompts, or parameters as needed.

4. Re-run the preview until satisfied.
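
One way to make step 2 less manual is a small quality gate that flags records whose evaluation scores fall below a threshold. The sketch below works on plain dictionaries; the record shape, the `low_scoring` helper, and the 0.7 threshold are assumptions for illustration and say nothing about how the preview object exposes its records.

```python
from statistics import mean

SCORE_FIELDS = ("correctness", "clarity", "completeness")


def low_scoring(records: list, threshold: float = 0.7) -> list:
    """Return the indices of records whose mean evaluation score is below threshold."""
    flagged = []
    for index, record in enumerate(records):
        evaluation = record["final_trace_evaluation"]
        if mean(evaluation[field] for field in SCORE_FIELDS) < threshold:
            flagged.append(index)
    return flagged


# Hypothetical preview records, reduced to the evaluation column.
records = [
    {"final_trace_evaluation": {"correctness": 0.9, "clarity": 0.8, "completeness": 0.85}},
    {"final_trace_evaluation": {"correctness": 0.5, "clarity": 0.6, "completeness": 0.4}},
]
print(low_scoring(records))
```

Reading the flagged records alongside their `feedback` text is usually a faster guide to prompt changes than skimming the whole sample.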


In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder)

In [None]:
# Display a sample record from the preview.
preview.display_sample_record()

### 📊 Analyze the generated data

- Data Designer automatically generates a basic statistical analysis of the generated data.

- This analysis is available via the `analysis` property of generation result objects.


In [None]:
# Print the analysis as a table.
preview.analysis.to_report()

### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=2)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-reasoning-reasoning-traces",
);