# 🎨 NeMo Data Designer 101: Using Custom Model Configurations

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.

<br>

In this notebook, we will see how to create and use custom model configurations in Data Designer.

If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.


In [None]:
import os
from getpass import getpass

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to the [managed Data Designer service](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of Data Designer by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).
- If you have an instance of Data Designer running locally, you can connect to it as follows:

    ```python
    data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))
    ```


In [None]:
# If you are using the managed Data Designer service, provide its API key here.
api_key = getpass("Enter data designer API key: ")

if len(api_key) > 0:
    print("✅ API key received.")
else:
    print("❌ No API key provided. Please enter your Data Designer API key.")

In [None]:
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(
        base_url="https://ai.api.nvidia.com/v1/nemo/dd",
        # Pass the API key entered above as a bearer token.
        default_headers={"Authorization": f"Bearer {api_key}"},
    )
)

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# build.nvidia.com model endpoint
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias_static_temp = "nemotron-nano-v2_static_temp"
model_alias_variable_temp = "nemotron-nano-v2_variable_temp"

## ⚙️ Custom Model Configurations

- In the previous notebooks, we saw how to reference a model by its alias and pass static inference hyperparameters.

- In this notebook, we will see how to sample the temperature from a distribution instead. This results in greater diversity in the generated data, because a different temperature value is used each time the LLM is called.
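
To make the idea of a sampled temperature concrete, here is a minimal sketch in plain Python. It does not show how Data Designer samples internally; `sample_temperature` is a hypothetical helper, and the bounds simply mirror the `low=0.5`, `high=0.9` values used in this notebook's configuration.

```python
import random


def sample_temperature(low: float, high: float, rng: random.Random) -> float:
    """Draw one temperature uniformly from the interval [low, high]."""
    return rng.uniform(low, high)


# Seeded only so that this sketch is reproducible.
rng = random.Random(0)

# Static configuration: every call uses the same temperature.
static_temps = [0.0 for _ in range(5)]

# Variable configuration: a fresh draw for each call.
variable_temps = [sample_temperature(0.5, 0.9, rng) for _ in range(5)]

print("static:  ", static_temps)
print("variable:", [round(t, 3) for t in variable_temps])
```

Each simulated call in the variable case gets its own value between 0.5 and 0.9, which is the source of the extra diversity described above.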

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias_static_temp,
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.0,
                top_p=0.95,
                timeout=120
            ),
            is_reasoner=True
        ),
        P.ModelConfig(
            alias=model_alias_variable_temp,
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=P.UniformDistribution(
                    params=P.UniformDistributionParams(
                        low=0.5,
                        high=0.9
                    )
                ),
                top_p=0.95,
                timeout=120
            ),
            is_reasoner=True
        ),
    ]
)

## 🧑‍🎨 Generating our Data

- We follow a procedure similar to the one in the [basics tutorial](./1-the-basics.ipynb) to generate our product review dataset.

- The one difference is that we generate multiple samples of the LLM-generated columns to show how the outputs differ across temperature values.
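
The sampler columns in the next cell include a subcategory whose allowed values depend on the sampled category. The sketch below illustrates that dependency with the standard library only. It is an assumption about the behavior, not Data Designer's implementation; `SUBCATEGORIES` is a trimmed-down copy of the mapping and `sample_record` is a hypothetical helper.

```python
import random

# Trimmed-down, illustrative category -> subcategory mapping.
SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Books": ["Fiction", "Non-Fiction", "Classics"],
}


def sample_record(rng: random.Random) -> dict:
    """Sample a category first, then a subcategory valid for that category."""
    category = rng.choice(list(SUBCATEGORIES))
    subcategory = rng.choice(SUBCATEGORIES[category])
    return {"product_category": category, "product_subcategory": subcategory}


# Seeded only so that this sketch is reproducible.
rng = random.Random(42)
records = [sample_record(rng) for _ in range(4)]
for record in records:
    print(record)
```

Because the subcategory is drawn from the list keyed by the already-sampled category, a record can never pair, say, "Books" with "Laptops".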


In [None]:
config_builder.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="target_age_range",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["18-25", "25-35", "35-50", "50-65", "65+"]
        ),
    )
)

# Optionally validate that the columns are configured correctly.
config_builder.validate()


Next, let's add samplers to generate data related to the customer and their review.
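
One of these samplers, `review_style`, uses `weights=[1, 2, 2, 1]`. The sketch below assumes those weights act as relative frequencies, which is how Python's `random.choices` treats them; whether Data Designer normalizes weights in exactly the same way is an assumption here.

```python
import random
from collections import Counter

styles = ["rambling", "brief", "detailed", "structured with bullet points"]
weights = [1, 2, 2, 1]  # relative weights; they do not need to sum to 1

# Seeded only so that this sketch is reproducible.
rng = random.Random(7)
draws = rng.choices(styles, weights=weights, k=6000)

# Report the empirical share of each style.
counts = Counter(draws)
for style in styles:
    print(f"{style:32s} {counts[style] / len(draws):.2f}")
```

With these weights, "brief" and "detailed" should each appear about twice as often as "rambling" or "structured with bullet points".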


In [None]:
# This column will sample synthetic person data based on statistics from the US Census.
config_builder.add_column(
    C.SamplerColumn(
        name="customer",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(age_range=[18, 70]),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="number_of_stars",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=1, high=5),
        convert_to="int",
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="review_style",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

config_builder.validate()

## 🦜 LLM-generated columns

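- We generate the customer review three times to show how the output changes with temperature: once with the static-temperature model (`customer_review_base`) and twice with the variable-temperature model (`customer_review_set_2` and `customer_review_set_3`).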
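
The prompts in the next cell reference earlier columns through `{{ ... }}` placeholders, including nested fields such as `{{ customer.first_name }}`. As a rough illustration of how such a placeholder resolves against a record, here is a simplified stand-in written with the standard library. It is not the templating engine Data Designer uses; `render` and the sample record values are hypothetical.

```python
import re


def render(template: str, record: dict) -> str:
    """Replace each {{ dotted.name }} placeholder with a value from record."""

    def lookup(match: re.Match) -> str:
        value = record
        # Walk nested dictionaries one key at a time, e.g. customer -> first_name.
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Made-up record, standing in for one row of sampled columns.
record = {
    "customer": {"first_name": "Ada", "city": "Austin"},
    "product_name": "LumaDesk",
    "number_of_stars": 4,
}

prompt = render(
    "You are {{ customer.first_name }} from {{ customer.city }}. "
    "You gave {{ product_name }} {{ number_of_stars }} stars.",
    record,
)
print(prompt)
```

Since every review column receives the same rendered prompt for a given record, any differences between the three review columns come from the model's sampling rather than from the input.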

In [None]:
config_builder.add_column(
    C.LLMTextColumn(
        name="product_name",
        prompt=(
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        # This is optional, but it can be useful for controlling the behavior of the LLM. Do not include instructions
        # related to output formatting in the system prompt, as Data Designer handles this based on the column type.
        system_prompt=(
            "You are a helpful assistant that generates product names. You respond with only the product name, "
            "no other text. You do NOT add quotes around the product name. "
        ),
        model_alias=model_alias_static_temp,
    )
)

config_builder.add_column(
    C.LLMTextColumn(
        name="customer_review_base",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
        ),
        model_alias=model_alias_static_temp,
    )
)


config_builder.add_column(
    C.LLMTextColumn(
        name="customer_review_set_2",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
        ),
        model_alias=model_alias_variable_temp,
    )
)

config_builder.add_column(
    C.LLMTextColumn(
        name="customer_review_set_3",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
        ),
        model_alias=model_alias_variable_temp,
    )
)

config_builder.validate()

## 👀 Preview the dataset

- Use the `preview` method to generate a small number of records for inspection (3 in the cell below, set by `num_records`).


In [None]:
preview = data_designer_client.preview(config_builder, num_records=3, verbose_logging=True)

In [None]:
# Run this cell multiple times to cycle through the preview records.
preview.display_sample_record()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset