# 🎨 NeMo Data Designer: Text-to-Python with Evolution

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) to learn how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples, with a focus on evolutionary improvements. We'll build a system that generates Python code based on natural language instructions, validates it, analyzes issues, and then improves the code based on feedback.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.


In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance of Data Designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)

## 🌱 Define Categorical Seed Columns

We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant code examples.

In [None]:
# Add industry sector categories
config_builder.add_column(
    name="industry_sector",
    type="category",
    params={
        "values": ["Healthcare", "Finance", "Technology"],
        "description": "The industry sector for the code example"
    }
)

# Add topic as a subcategory of industry_sector
config_builder.add_column(
    name="topic",
    type="subcategory",
    params={
        "category": "industry_sector",
        "values": {
            "Healthcare": [
                "Electronic Health Records (EHR) Systems",
                "Telemedicine Platforms",
                "AI-Powered Diagnostic Tools"
            ],
            "Finance": [
                "Fraud Detection Software",
                "Automated Trading Systems",
                "Personal Finance Apps"
            ],
            "Technology": [
                "Cloud Computing Platforms",
                "Artificial Intelligence and Machine Learning Platforms",
                "DevOps and CI/CD Tools"
            ]
        }
    }
)

# Add code complexity levels
config_builder.add_column(
    name="code_complexity",
    type="category",
    params={
        "values": ["Beginner", "Intermediate", "Advanced"],
        "description": "The complexity level of the code"
    }
)

# Add code_concept as a subcategory of code_complexity
config_builder.add_column(
    name="code_concept",
    type="subcategory",
    params={
        "category": "code_complexity",
        "values": {
            "Beginner": [
                "Variables",
                "Data Types",
                "Functions",
                "Loops",
                "Classes"
            ],
            "Intermediate": [
                "List Comprehensions",
                "Object-oriented programming",
                "Lambda Functions",
                "Web frameworks",
                "Pandas"
            ],
            "Advanced": [
                "Multithreading",
                "Context Managers",
                "Generators"
            ]
        }
    }
)

# Add instruction phrases
config_builder.add_column(
    name="instruction_phrase",
    type="category",
    params={
        "values": [
            "Write a function that",
            "Create a class that",
            "Implement a script",
            "Can you create a function",
            "Develop a module that"
        ],
        "description": "Starting phrase for the code instruction"
    }
)

config_builder.validate()
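
To make the category/subcategory relationship concrete, here is a minimal sketch of conditional sampling in plain Python. It is not Data Designer's sampler: the `TOPICS_BY_SECTOR` mapping and the `sample_seed` helper are hypothetical names, and uniform sampling is an assumption made only for illustration. The point is that a subcategory value is always drawn from the list that belongs to the parent category chosen for the same record.

```python
import random

# Hypothetical mapping that mirrors the `topic` subcategory column above.
TOPICS_BY_SECTOR = {
    "Healthcare": [
        "Electronic Health Records (EHR) Systems",
        "Telemedicine Platforms",
        "AI-Powered Diagnostic Tools",
    ],
    "Finance": [
        "Fraud Detection Software",
        "Automated Trading Systems",
        "Personal Finance Apps",
    ],
    "Technology": [
        "Cloud Computing Platforms",
        "Artificial Intelligence and Machine Learning Platforms",
        "DevOps and CI/CD Tools",
    ],
}


def sample_seed(rng: random.Random) -> dict[str, str]:
    """Draw a parent category first, then a subcategory conditioned on it."""
    sector = rng.choice(list(TOPICS_BY_SECTOR))
    # Assumption: uniform choice among the topics listed for that sector.
    topic = rng.choice(TOPICS_BY_SECTOR[sector])
    return {"industry_sector": sector, "topic": topic}


rng = random.Random(0)
for _ in range(3):
    print(sample_seed(rng))
```

Every sampled `topic` is consistent with its `industry_sector`, which is what keeps the generated instructions coherent.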

## ✨ Define Initial Code Generation

First, we'll set up the columns for generating the instruction and initial code implementation using the same approach as in the text-to-python notebook.

In [None]:
# Generate instruction for the code
config_builder.add_column(
    name="instruction",
    type="llm-text",
    model_alias=model_alias,
    system_prompt="You are an expert at generating clear and specific programming tasks.",
    prompt="""\
Generate an instruction to create Python code that solves a specific problem.
Each instruction should begin with one of the following phrases: {{instruction_phrase}}.

Important Guidelines:
* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.
* Code Complexity: Tailor the instruction to the {{code_complexity}} level. Utilize relevant {{code_concept}} where appropriate to match the complexity level.
* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.
* Response Formatting: Do not include any markers such as ### Response ### in the instruction.
"""
)

# Generate the initial Python code
config_builder.add_column(
    name="initial_code",
    type="llm-code",
    model_alias=model_alias,
    output_format="python",
    system_prompt="You are an expert Python programmer who writes clean, efficient, and well-documented code.",
    prompt="""\
Write Python code for the following instruction:
Instruction: {{instruction}}

Important Guidelines:
* Code Quality: Your code should be clean, complete, self-contained and accurate.
* Code Validity: Please ensure that your Python code is executable and does not contain any errors.
* Packages: Remember to import any necessary libraries, and to use all libraries you import.
* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.
"""
)

config_builder.validate()
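
The prompts above reference earlier columns through `{{ column_name }}` placeholders, so `instruction` depends on the seed columns and `initial_code` depends on `instruction`. The following sketch shows the idea of placeholder substitution with the standard library only. The `render` helper is hypothetical and handles plain placeholders only; the service's real template engine is not shown in this notebook and may behave differently.

```python
import re


def render(template: str, row: dict[str, str]) -> str:
    """Replace each {{ name }} placeholder with the matching value from row."""
    # Optional whitespace inside the braces is tolerated: {{name}} or {{ name }}.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(row[match.group(1)]),
        template,
    )


# Hypothetical record holding already-generated seed values.
row = {
    "instruction_phrase": "Write a function that",
    "industry_sector": "Finance",
    "topic": "Fraud Detection Software",
}
template = "Begin with: {{instruction_phrase}} Sector: {{ industry_sector }}, topic: {{topic}}."
print(render(template, row))
```

A placeholder that names a column missing from the record raises `KeyError` in this sketch, which is one reason a column should be defined before any prompt that references it.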

## 🔍 Code Validation and Analysis

Now we'll add validation for the initial code and generate analysis of any issues found.

In [None]:
# Validate the initial code
config_builder.add_column(
    name="code_validation",
    type="code-validation",
    model_alias=model_alias,
    code_lang="python",
    target_column="initial_code"
)

config_builder.validate()
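
The notebook does not spell out which checks the `code-validation` column runs. As a rough illustration of what validating generated Python can mean, the sketch below performs only a syntax check with the standard `ast` module. The `syntax_issues` helper is hypothetical, and the real column may well report much more, such as linter-style findings.

```python
import ast


def syntax_issues(code: str) -> list[str]:
    """Return a list of syntax problems; an empty list means the code parses."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


valid_code = "def add(a, b):\n    return a + b\n"
broken_code = "def add(a, b)\n    return a + b\n"  # missing colon

print(syntax_issues(valid_code))
print(syntax_issues(broken_code))
```

Parsing never executes the code, so a check like this is safe to run on untrusted model output.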

In [None]:
# Generate a detailed error analysis and improvement plan
config_builder.add_column(
    name="code_analysis",
    type="llm-text",
    model_alias=model_alias,
    prompt="""\
Analyze the following Python code and its validation results:

INSTRUCTION:
{{ instruction }}

INITIAL CODE:
{{ initial_code }}

VALIDATION RESULTS:
{{ code_validation }}

{% if not (code_validation == '[]') %}
Please provide:
1. A detailed analysis of each error or warning (categorize by type: convention, warning, error, refactor)
2. Specific recommendations that directly address each issue
3. A structured plan for implementing fixes while maintaining code functionality
4. Any PEP 8 style improvements that would improve code quality
{% else %}
The code passes all validation checks. Provide potential optimizations for:
1. Code readability
2. Performance improvements
3. Better adherence to Python best practices
4. Enhanced documentation
{% endif %}
"""
)


config_builder.validate()
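
The `{% if %}` block in the prompt switches between two sets of instructions depending on whether the validation result equals the string `'[]'`. The sketch below restates that branch in plain Python. The `analysis_mode` helper and its return labels are hypothetical, and it assumes, as the template does, that an empty validation result is rendered exactly as `[]`.

```python
def analysis_mode(code_validation: str) -> str:
    """Mirror the template's branch: '[]' means no findings were reported."""
    # Exact string comparison, as in the template; "[ ]" would not match.
    return "optimize" if code_validation == "[]" else "fix"


print(analysis_mode("[]"))
print(analysis_mode("[{'type': 'warning'}]"))
```

Because the comparison is on the rendered string, any other representation of an empty result would fall into the error-analysis branch.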

## 🔄 Code Evolution

Next, we'll create the improved version of the code based on the analysis and validation.

In [None]:
# Generate improved code based on feedback
config_builder.add_column(
    name="improved_code",
    type="llm-code",
    model_alias=model_alias,
    output_format="python",
    system_prompt="You are an expert Python programmer focused on writing production-quality code that adheres to best practices.",
    prompt="""\
Rewrite and improve the following Python code based on the analysis provided.

ORIGINAL INSTRUCTION:
{{instruction}}

INITIAL CODE:
{{initial_code}}

CODE ANALYSIS:
{{code_analysis}}

Your task is to create a revised version that:
1. Addresses all issues identified in the analysis
2. Follows PEP 8 style guidelines systematically
3. Eliminates common anti-patterns
4. Includes comprehensive docstrings for functions, classes, and modules
5. Uses type hints for function parameters and return values where appropriate
6. Implements proper error handling with specific exception types
7. Ensures all imports are properly organized and used

The goal is production-quality code that would pass a professional code review at a {{code_complexity}} level.
"""
)


In [None]:
# Validate the improved code
config_builder.add_column(
    name="improved_code_validation",
    type="code-validation",
    model_alias=model_alias,
    code_lang="python",
    target_column="improved_code"
)

## 📊 Evaluation

Finally, we'll add an evaluation that compares the initial and improved code.

In [None]:
from nemo_microservices.beta.data_designer.config.params.rubrics import PYTHON_RUBRICS

# Add judge evaluation
config_builder.add_column(
    name="code_judge_result",
    type="llm-judge",
    model_alias=model_alias,
    prompt=(
        "You are an expert in Python programming, with specialized knowledge in software engineering, "
        "data science, and algorithmic problem-solving. You think about potential flaws and errors "
        "in the code. You are a tough critic, but a fair one.\n\n"
        "Take a deep breath and use the Python Code Quality Rubric below to score the **Generated Python Code** "
        "based on the INSTRUCTIONS.\n\n"
        "#### INSTRUCTIONS\n"
        "The Generated Python Code should be a valid response to the Natural Language Prompt below.\n\n"
        "Natural Language Prompt:\n"
        "{{ instruction }}\n\n"
        "Generated Python Code:\n"
        "```python\n"
        "{{ improved_code }}\n"
        "```\n"
    ),
    rubrics=PYTHON_RUBRICS
)

## 👀 Generate Preview Dataset

Let's generate a preview to see how our evolved code examples look.

In [None]:
# Generate a preview
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
preview.display_sample_record()

## 🚀 Generate Full Dataset

If you're satisfied with the preview, you can generate a larger dataset using a batch workflow.

In [None]:
# Submit batch job
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

job_results.wait_until_done()

In [None]:
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()
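
To inspect what the evolution step actually changed, it can help to diff the two code versions of a record. The sketch below uses the standard `difflib` module. The `code_diff` helper and the sample strings are hypothetical; applying it to the generated data assumes the `initial_code` and `improved_code` columns hold plain strings.

```python
import difflib


def code_diff(initial: str, improved: str) -> str:
    """Return a unified diff between two versions of a code sample."""
    lines = difflib.unified_diff(
        initial.splitlines(),
        improved.splitlines(),
        fromfile="initial_code",
        tofile="improved_code",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical before/after pair standing in for one generated record.
initial = "def area(r):\n    return 3.14 * r * r\n"
improved = (
    "import math\n\n\n"
    "def area(radius: float) -> float:\n"
    '    """Return the area of a circle."""\n'
    "    return math.pi * radius ** 2\n"
)
print(code_diff(initial, improved))
```

If `dataset` is a pandas DataFrame, which the use of `shape` and `head()` suggests, a single record could be inspected with `code_diff(dataset.loc[0, "initial_code"], dataset.loc[0, "improved_code"])`.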