# 🎨 NeMo Data Designer: Synthetic Reasoning Traces

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for instructions on how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks. Instead of creating multi-turn conversations, we generate reasoning traces that can be used to train and fine-tune language models, for example with reinforcement learning techniques, and to elicit chain-of-thought processing.

These synthetic reasoning traces can be used to enhance model performance in areas such as mathematics, coding, scientific reasoning, and other domains that benefit from structured reasoning.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.

In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance of Data Designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)

### 🌱 Adding Categorical Seed Columns

Define categorical seed columns that set the context for the generated empathic reasoning traces. For example, domain and theme determine the type of everyday scenario where empathy is crucial, while complexity guides the depth of emotional insight and detailed support.

In [None]:
# Define a domain column that sets the context for empathic scenarios in everyday life.
config_builder.add_column(
    name="domain",
    type="category",
    params={
        "values": [
            "Family Dynamics",
            "Workplace Challenges",
            "Friendship Moments",
            "Community Interactions",
            "Personal Well-being",
            "Unexpected Encounters"
        ]
    }
)

# Add theme subcategories for each domain
config_builder.add_column(
    name="theme",
    type="subcategory",
    params={
        "category": "domain",
        "values": {
            "Family Dynamics": [
                "Parenting Dilemmas",
                "Sibling Rivalries"
            ],
            "Workplace Challenges": [
                "Communication Breakdowns",
                "Leadership Dilemmas"
            ],
            "Friendship Moments": [
                "Support & Understanding",
                "Misunderstandings & Reconciliations"
            ],
            "Community Interactions": [
                "Neighborhood Support",
                "Cultural Celebrations"
            ],
            "Personal Well-being": [
                "Mental Health",
                "Self-care & Reflection"
            ],
            "Unexpected Encounters": [
                "Serendipitous Meetings",
                "Moments of Realization"
            ]
        }
    }
)

# Define a complexity column to guide the level of detail and challenge in the empathic scenarios.
config_builder.add_column(
    name="complexity",
    type="category",
    params={
        "values": ["Basic", "Intermediate", "Advanced"]
    }
)
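The relationship between the `domain` category column and the `theme` subcategory column can be pictured as a two-stage draw: pick a domain, then pick a theme from that domain's list. The sketch below is only an illustration using Python's standard library. It assumes uniform sampling and uses a shortened copy of the values above; it is not the sampler that Data Designer runs internally, and `sample_seed_row` is a hypothetical helper name.

```python
import random

# Shortened, illustrative copy of the seed values defined above.
DOMAIN_THEMES = {
    "Family Dynamics": ["Parenting Dilemmas", "Sibling Rivalries"],
    "Workplace Challenges": ["Communication Breakdowns", "Leadership Dilemmas"],
}
COMPLEXITIES = ["Basic", "Intermediate", "Advanced"]


def sample_seed_row(rng: random.Random) -> dict:
    """Draw one seed row: theme depends on domain, complexity is independent."""
    domain = rng.choice(list(DOMAIN_THEMES))
    theme = rng.choice(DOMAIN_THEMES[domain])  # restricted to the chosen domain
    complexity = rng.choice(COMPLEXITIES)
    return {"domain": domain, "theme": theme, "complexity": complexity}


rng = random.Random(0)  # fixed seed so repeated runs give the same rows
rows = [sample_seed_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Every sampled `theme` is guaranteed to belong to its row's `domain`, which is the property the subcategory column provides.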

### ✨ Adding Generated Data Columns

Define the columns that the model will generate. These prompts instruct the LLM to produce the actual empathic reasoning trace and answer, following the specified format with `<think>` and `<answer>` tags.

#### Empathic Reasoning Trace Generation

This column is designed to generate clear, thoughtful reasoning traces that blend logical analysis with emotional insight for everyday situations where empathy is crucial. The generation prompt is tailored to:
- Produce a structured explanation that highlights both the practical reasoning and the emotional dynamics at play.
- Encourage a dual output: one part detailing the empathic thought process (enclosed within `<think>` tags) and another delivering a compassionate final answer (enclosed within `<answer>` tags).
- Ensure that the generated content reflects deep understanding, compassion, and a balanced view of the challenges and emotions involved.

In [None]:
special_system_instructions = """
You are an empathic reasoning agent. Your task is to generate realistic and compassionate reasoning traces for common day-to-day situations. Adopt a caring and supportive tone as you provide detailed insights into human experiences and emotions.
- Focus on everyday scenarios where empathy, understanding, and emotional intelligence are key.
- Consider various perspectives, emphasizing the emotional impact of actions and decisions.
- Ensure your reasoning process is clear, structured, and heartfelt, reflecting deep care for the individuals involved.
- Enclose your thoughtful reasoning process within <think>...</think> tags before providing the final JSON output.
"""

config_builder.add_column(
    name="scenario",
    type="llm-text",
    model_alias=model_alias,
    system_prompt=special_system_instructions,
    prompt=(
        "Generate a clear and concise everyday scenario for the {{domain}} domain, theme {{theme}}, and complexity {{complexity}}, "
        "where empathy and understanding play a crucial role. Focus on a situation that highlights emotional challenges or opportunities for compassionate support, and include a specific question or request for help that clearly outlines a problem or challenge needing resolution.\n\n"
        "Guidelines:\n"
        "- Provide only the scenario statement without any additional metadata, solution steps, or internal commentary.\n"
        "- Use everyday language and incorporate realistic, practical context from an empathic perspective.\n"
        "- Ensure the scenario includes a clear follow-up question or request for assistance, making it apparent what the problem or challenge is.\n"
        "- Do not include any formatting tags or markers.\n\n"
        "Examples:\n"
        "1. 'Imagine a situation where a friend is visibly upset after a long, challenging day. What might be causing their distress, and how could you offer support?'\n"
        "2. 'Consider a moment at a family dinner where a subtle conflict arises between members. What could be the underlying issue, and how might empathy help mend the situation?'\n"
        "3. 'Picture a colleague receiving unexpected criticism during a meeting. What are the potential emotional impacts, and what supportive response could be helpful?'\n"
    )
)
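The `{{domain}}`, `{{theme}}`, and `{{complexity}}` placeholders in the prompt are filled with each row's seed values before the prompt reaches the model. As a rough picture of that substitution, here is a minimal stand-in written with a regular expression. It is an assumption-laden sketch, not the templating engine the service uses, and `render_prompt` is a hypothetical helper.

```python
import re


def render_prompt(template: str, row: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from row."""

    def substitute(match) -> str:
        key = match.group(1)
        if key not in row:
            # Fail loudly when a prompt references a column that does not exist.
            raise KeyError(f"No column named {key!r} for this row")
        return str(row[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "Generate a scenario for the {{domain}} domain, "
    "theme {{theme}}, and complexity {{complexity}}."
)
row = {
    "domain": "Friendship Moments",
    "theme": "Support & Understanding",
    "complexity": "Basic",
}
rendered = render_prompt(template, row)
print(rendered)
```

A placeholder that names a missing column raises `KeyError` here, which mirrors why columns referenced in a prompt need to be defined in the config.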

#### Empathic Reasoning Process Generation

These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios. The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight. The prompts instruct the model to include its internal thought process within `<think>...</think>` tags before providing the JSON output.

In [None]:
from typing import List
from pydantic import BaseModel, Field

class Thought(BaseModel):
    """A single step in the structured empathic reasoning process.
    This step captures an empathetic observation or insight that informs a thoughtful, compassionate approach to addressing everyday challenges.
    """
    step_number: int = Field(..., ge=1, description="The order of the reasoning step, starting from 1.")
    content: str = Field(..., min_length=5, description="A detailed explanation of this reasoning step, incorporating both logical analysis and emotional insight.")

class ReasoningTrace(BaseModel):
    """A structured empathic reasoning trace for addressing a scenario.
    This model records a step-by-step process that integrates logical analysis with emotional insight and empathy to arrive at a supportive final answer.
    """
    reasoning: List[Thought] = Field(..., description="Step-by-step reasoning leading to the final answer, enriched with empathetic observations and practical insights.")
    answer: str = Field(..., description="The final answer derived from the empathic reasoning process, offering compassionate guidance or resolution.")

class Evaluation(BaseModel):
    """Output format for evaluating an empathic reasoning answer.
    The evaluation assesses the response based on correctness, clarity, and completeness,
    with feedback that emphasizes compassionate insight, clarity, and a holistic understanding of the scenario.
    """
    correctness: float = Field(..., description="Overall correctness rating of the answer (0 to 1).")
    clarity: float = Field(..., description="Clarity rating of the reasoning, including the integration of empathic explanations (0 to 1).")
    completeness: float = Field(..., description="Completeness rating of the reasoning, assessing whether all practical and emotional aspects were considered (0 to 1).")
    feedback: str = Field(..., description="Detailed feedback on the reasoning trace and answer, with suggestions for enhancing empathetic and real-world applicability.")

class FinalEvaluation(Evaluation):
    """Extended evaluation model for final empathic reasoning traces.
    This model adds criteria to assess visual structure and conciseness,
    ensuring the final output is both clear and visually appealing.
    """
    structure: float = Field(..., description="Rating of the visual structure and formatting (0 to 1), assessing if reasoning steps and final answer are clearly delineated.")
    conciseness: float = Field(..., description="Rating of the conciseness of the reasoning trace (0 to 1), ensuring that extraneous verbosity is minimized.")

In [None]:
config_builder.add_column(
    name="initial_trace",
    type="llm-structured",
    model_alias=model_alias,
    prompt=(
        "You are an empathic reasoning agent. Provide a detailed, step-by-step reasoning process that thoughtfully addresses the following scenario. "
        "Begin by outlining your internal thought process, focusing on both logical considerations and emotional insights, enclosed within <think>...</think> tags. "
        "Then, provide your final compassionate answer.\n\n"
        "Scenario: {{scenario}}\n\n"
        "Ensure that your response is structured and reflective of a supportive, empathetic approach."
    ),
    output_format=ReasoningTrace
)

In [None]:
config_builder.add_column(
    name="initial_trace_evaluation",
    type="llm-structured",
    model_alias=model_alias,
    prompt=(
        "<initial_trace>{{initial_trace}}</initial_trace>\n\n"
        "Now, analyze the provided empathic reasoning trace and final answer as if you were an insightful observer assessing both logical and compassionate approaches. "
        "Evaluate the response with a focus on emotional insight, clarity, and holistic consideration.\n\n"
        "Include your internal thought process within <think>...</think> tags before providing the JSON."
    ),
    output_format=Evaluation
)
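The prompts above ask for a response shaped as a `<think>...</think>` block followed by JSON. To make that shape concrete, the sketch below separates the two parts of a hand-written example response using only the standard library. The sample text and the `split_think_and_json` helper are hypothetical, and the notebook does not show how the service itself processes raw responses.

```python
import json
import re

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_and_json(raw: str):
    """Return (thinking_text, parsed_json) from a raw model response."""
    match = THINK_PATTERN.search(raw)
    thinking = match.group(1).strip() if match else ""
    # Remove the first <think> block; what remains should be the JSON payload.
    remainder = THINK_PATTERN.sub("", raw, count=1).strip()
    return thinking, json.loads(remainder)


# Hand-written example response, not real model output.
raw_response = (
    "<think>The friend seems tired; acknowledge feelings first.</think>\n"
    '{"reasoning": [{"step_number": 1, '
    '"content": "Acknowledge how hard the day was."}], '
    '"answer": "Listen first, then offer practical help."}'
)

thinking, payload = split_think_and_json(raw_response)
print(thinking)
print(payload["answer"])
```

The parsed `payload` has the same keys as the `ReasoningTrace` model (`reasoning` and `answer`), so it could then be validated against that schema.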

#### Final Empathic Reasoning Trace Generation and Evaluation

These columns refine and evaluate the final empathic reasoning trace. The final trace is generated by reviewing the scenario, the initial empathic reasoning trace, and its evaluation. The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive. As before, the prompts ask the model to include its internal thought process within `<think>...</think>` tags before providing the final JSON output.

In [None]:
config_builder.add_column(
    name="final_trace",
    type="llm-structured",
    model_alias=model_alias,
    prompt=(
        "Review the scenario, your initial empathic reasoning trace, and its evaluation:\n\n"
        "Scenario: {{scenario}}\n\n"
        "Initial Empathic Reasoning Trace:\n{{initial_trace}}\n\n"
        "Initial Trace Evaluation:\n{{initial_trace_evaluation}}\n\n"
        "From the perspective of an empathic reasoning agent, provide a refined final reasoning trace that addresses both the emotional and logical dimensions of the scenario. "
        "Your final trace should be visually structured as follows:\n"
        "1. Present a numbered list of concise reasoning steps. Each step should be clear and free of unnecessary verbosity.\n"
        "2. Include a clearly separated section for the final answer, prefixed with a header (e.g., 'Final Answer:').\n"
        "3. Use visual markers or markdown formatting to enhance readability.\n"
        "Avoid adding extraneous details—focus on clarity and conciseness.\n\n"
        "Also, include your internal thought process wrapped within <think>...</think> tags. "
        "Return only the final, visually structured reasoning trace."
    ),
    output_format=ReasoningTrace
)

config_builder.add_column(
    name="final_trace_evaluation",
    type="llm-structured",
    model_alias=model_alias,
    prompt=(
        "<final_trace>{{final_trace}}</final_trace>\n\n"
        "Analyze the provided empathic reasoning trace and final answer from the viewpoint of an insightful observer. "
        "Evaluate the response focusing on correctness, clarity, and completeness, as well as its visual structure and conciseness. "
        "Assess whether the reasoning steps are clearly separated (e.g., numbered or bullet-pointed) and if the final answer is distinct and succinct.\n\n"
        "Include your internal thought process within <think>...</think> tags before providing the JSON."
    ),
    output_format=FinalEvaluation
)

## 👀 Generating a dataset preview

- Preview mode allows you to quickly iterate on your data design.

- Each preview generation call creates a sample for inspection, helping you verify prompts and instructions before running a larger batch job.

In [None]:
# Generate a preview
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
preview.display_sample_record()

## 🤔 Like what you see?

- Submit a batch workflow!

In [None]:
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

In [None]:
# Check to see if the Workflow is still active.
job_results.get_job_status()

In [None]:
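# Load the generated dataset. The job was submitted with wait_until_done=False,
# so re-run the status check above until the job has finished before loading.
dataset = job_results.load_dataset()
dataset.head()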
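Once the dataset is loaded, the scores in `final_trace_evaluation` can be used to keep only the stronger traces. The sketch below shows one possible filter over plain Python dictionaries. It assumes each record exposes the evaluation as a dict keyed by the `FinalEvaluation` field names; the loaded dataset may store it differently (for example as a JSON string), the sample records are made up, and the 0.8 threshold is arbitrary.

```python
SCORE_KEYS = ("correctness", "clarity", "completeness", "structure", "conciseness")


def mean_score(evaluation: dict) -> float:
    """Average the five 0-to-1 ratings of a final evaluation."""
    return sum(float(evaluation[key]) for key in SCORE_KEYS) / len(SCORE_KEYS)


def keep_high_quality(records: list, threshold: float = 0.8) -> list:
    """Keep records whose average final evaluation score meets the threshold."""
    return [
        record
        for record in records
        if mean_score(record["final_trace_evaluation"]) >= threshold
    ]


# Made-up records standing in for rows of the generated dataset.
records = [
    {
        "scenario": "A",
        "final_trace_evaluation": {
            "correctness": 0.9, "clarity": 0.9, "completeness": 0.8,
            "structure": 1.0, "conciseness": 0.9,
        },
    },
    {
        "scenario": "B",
        "final_trace_evaluation": {
            "correctness": 0.5, "clarity": 0.6, "completeness": 0.4,
            "structure": 0.7, "conciseness": 0.5,
        },
    },
]

kept = keep_high_quality(records)
print([record["scenario"] for record in kept])
```

Averaging is only one choice; requiring a minimum on each individual score would be a stricter alternative.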