# 🎨 NeMo Data Designer: Text-to-Python

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Refer to the [intro tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples. We'll build a system that generates Python code based on natural language instructions, with varying complexity levels and industry focuses.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.

In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)
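
The `temperature` and `top_p` values above shape how the model picks each token. As a rough illustration, here is a toy sketch in plain Python. It is not how the model provider implements sampling, and the helper name `nucleus_distribution` and the example logits are made up for this sketch. Temperature rescales the logits before the softmax, and top-p keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`:

```python
import math


def nucleus_distribution(logits, temperature=0.6, top_p=0.95):
    """Return the renormalized token distribution after temperature and top-p filtering.

    `logits` maps token -> raw score. Hypothetical helper, for illustration only.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Numerically stable softmax.
    max_score = max(scaled.values())
    exps = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Top-p: keep the most likely tokens until their cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Made-up logits for four candidate tokens.
toy_logits = {"def": 3.0, "class": 2.0, "import": 1.0, "lambda": -2.0}
dist = nucleus_distribution(toy_logits, temperature=0.6, top_p=0.95)
print(dist)
```

With these toy logits, the settings used in this notebook leave only the two most likely tokens as candidates, while `temperature=1.0` would keep three. Lower values trade diversity for consistency.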

## 🌱 Define Categorical Seed Columns

We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant code examples.

In [None]:
# Add industry sector categories
config_builder.add_column(
    C.SamplerColumn(
        name="industry_sector",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Healthcare",
                "Finance",
                "Technology",
            ],
        ),
    ),
)

# Add topic as a subcategory of industry_sector
config_builder.add_column(
    C.SamplerColumn(
        name="topic",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="industry_sector",
            values={
                "Healthcare": [
                    "Electronic Health Records (EHR) Systems",
                    "Telemedicine Platforms",
                    "AI-Powered Diagnostic Tools",
                ],
                "Finance": [
                    "Fraud Detection Software",
                    "Automated Trading Systems",
                    "Personal Finance Apps",
                ],
                "Technology": [
                    "Cloud Computing Platforms",
                    "Artificial Intelligence and Machine Learning Platforms",
                    "DevOps and CI/CD Tools",
                ],
            },
        ),
    ),
)

# Add code complexity levels
config_builder.add_column(
    C.SamplerColumn(
        name="code_complexity",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Beginner",
                "Intermediate",
                "Advanced",
            ],
        ),
    ),
)

# Add code_concept as a subcategory of code_complexity
config_builder.add_column(
    C.SamplerColumn(
        name="code_concept",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="code_complexity",
            values={
                "Beginner": [
                    "Variables",
                    "Data Types",
                    "Functions",
                    "Loops",
                    "Classes",
                ],
                "Intermediate": [
                    "List Comprehensions",
                    "Object-oriented programming",
                    "Lambda Functions",
                    "Web frameworks",
                    "Pandas",
                ],
                "Advanced": [
                    "Multithreading",
                    "Context Managers",
                    "Generators",
                ],
            },
        ),
    ),
)

# Add instruction phrases
config_builder.add_column(
    C.SamplerColumn(
        name="instruction_phrase",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Write a function that",
                "Create a class that",
                "Implement a script",
                "Can you create a function",
                "Develop a module that",
            ],
        ),
    ),
)
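
Conceptually, a subcategory column is sampled conditionally on its parent category: the parent value is drawn first, and the subcategory is then drawn only from the values mapped to that parent. The sketch below mimics this in plain Python, assuming uniform sampling. The function `sample_seed_row` is hypothetical and not part of the Data Designer API, and the service's actual sampler may behave differently.

```python
import random

# Same sector -> topic mapping as the `topic` column above.
TOPICS_BY_SECTOR = {
    "Healthcare": [
        "Electronic Health Records (EHR) Systems",
        "Telemedicine Platforms",
        "AI-Powered Diagnostic Tools",
    ],
    "Finance": [
        "Fraud Detection Software",
        "Automated Trading Systems",
        "Personal Finance Apps",
    ],
    "Technology": [
        "Cloud Computing Platforms",
        "Artificial Intelligence and Machine Learning Platforms",
        "DevOps and CI/CD Tools",
    ],
}


def sample_seed_row(values_by_category, rng):
    """Pick a category uniformly, then a subcategory from that category's list."""
    category = rng.choice(sorted(values_by_category))
    subcategory = rng.choice(values_by_category[category])
    return category, subcategory


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [sample_seed_row(TOPICS_BY_SECTOR, rng) for _ in range(5)]
for sector, topic in rows:
    print(f"{sector}: {topic}")
```

Because the topic is always drawn from its sector's list, a row can never pair, say, "Finance" with "Telemedicine Platforms". The same holds for `code_concept` and `code_complexity`.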

## ✨ Define Generated Data Columns

Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation.

In [None]:
# Generate instruction for the code
config_builder.add_column(
    C.LLMTextColumn(
        name="instruction",
        model_alias=model_alias,
        system_prompt=(
            "You are an expert at generating clear and specific programming tasks."
        ),
        prompt=(
            "Generate an instruction to create Python code that solves a specific problem.\n"
            "Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\n\n"
            "Important Guidelines:\n"
            "* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.\n"
            "* Code Complexity: Tailor the instruction to the {{code_complexity}} level. Utilize relevant {{code_concept}} where appropriate to match the complexity level.\n"
            "* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.\n"
            "* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\n"
        ),
    )
)

# Generate the Python code
config_builder.add_column(
    C.LLMCodeColumn(
        name="code_implementation",
        model_alias=model_alias,
        output_format=P.CodeLang.PYTHON,
        system_prompt=(
            "You are an expert Python programmer who writes clean, efficient, and well-documented code."
        ),
        prompt=(
            "Write Python code for the following instruction:\n"
            "Instruction: {{instruction}}\n\n"
            "Important Guidelines:\n"
            "* Code Quality: Your code should be clean, complete, self-contained, and accurate.\n"
            "* Code Validity: Please ensure that your Python code is executable and does not contain any errors.\n"
            "* Packages: Remember to import any necessary libraries, and to use all libraries you import.\n"
            "* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.\n"
        ),
    )
)
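
The `{{column_name}}` placeholders in the prompts above refer to columns defined earlier, so each record's prompt is filled in with that record's sampled or generated values. The following sketch approximates this substitution with a regular expression. It does not reproduce Data Designer's actual template rendering; `render_prompt` and the sample record are hypothetical and exist only for illustration.

```python
import re

# Matches placeholders such as {{topic}} or {{ topic }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, record):
    """Replace each {{name}} placeholder with the matching value from `record`."""

    def substitute(match):
        name = match.group(1)
        if name not in record:
            # A placeholder must point at a column that already exists.
            raise KeyError(f"Prompt references unknown column: {name}")
        return str(record[name])

    return PLACEHOLDER.sub(substitute, template)


# Made-up record standing in for one row of sampled seed values.
record = {
    "instruction_phrase": "Write a function that",
    "industry_sector": "Finance",
    "topic": "Fraud Detection Software",
}
template = (
    "Each instruction should begin with: {{instruction_phrase}}.\n"
    "Ensure it pertains to the {{industry_sector}} sector and {{topic}} topic."
)
rendered = render_prompt(template, record)
print(rendered)
```

This also shows why column order matters: the `code_implementation` prompt uses `{{instruction}}`, so the `instruction` column has to be generated first.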

## 🔍 Add Validation and Evaluation

Let's add post-processing steps to validate the generated code and evaluate the text-to-Python conversion.

In [None]:
from nemo_microservices.beta.data_designer.config.params.rubrics import (
    PYTHON_RUBRICS,
    TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
)


config_builder.add_column(
    C.CodeValidationColumn(
        name="code_validity_result",
        model_alias=model_alias,
        code_lang=P.CodeLang.PYTHON,
        target_column="code_implementation",
    )
)

config_builder.add_column(
    C.LLMJudgeColumn(
        name="code_judge_result",
        model_alias=model_alias,
        prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
        rubrics=PYTHON_RUBRICS,
    )
)
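
To give a feel for what a code validity check can catch, here is a minimal standard-library sketch that only tests whether a snippet compiles. It is not the validator behind `CodeValidationColumn`, whose checks are not described in this notebook, and `check_python_syntax` is a hypothetical helper.

```python
def check_python_syntax(source):
    """Return (is_valid, message) based on whether `source` compiles.

    The code is compiled but never executed. Hypothetical helper for illustration.
    """
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon after the signature

print(check_python_syntax(good))
print(check_python_syntax(bad))
```

A syntax check like this says nothing about whether the code solves the task, which is the gap the LLM judge column is meant to cover.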

## 👀 Generate Preview Dataset

Let's generate a preview to see some data.

In [None]:
# Generate a preview
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
preview.display_sample_record()

## 🚀 Generate Full Dataset

If you're satisfied with the preview, you can generate a larger dataset using a batch workflow.

In [None]:
# Submit the batch job without blocking
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

# Block until the job completes
job_results.wait_until_done()

In [None]:
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()