# 🎨 NeMo Data Designer 101: Seeding synthetic data generation with an external dataset

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.

<br>

In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.

If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.


If the installation worked, you should be able to make the following imports:


In [None]:
from getpass import getpass

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)

### ⚙️ Initialize the NeMo Data Designer (NDD) Client

- The NDD client is responsible for submitting generation requests to the Data Designer microservice.


In [None]:
ndd = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8000"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# build.nvidia.com model endpoint
endpoint = "https://integrate.api.nvidia.com/v1"
model_id = "mistralai/mistral-small-24b-instruct"

model_alias = "mistral-small"

# You will need to enter your model provider API key to run this notebook.
api_key = getpass("Enter model provider API key: ")

if len(api_key) > 0:
    print("✅ API key received.")
else:
    print("❌ No API key provided. Please enter your model provider API key.")
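If you would rather not type the key on every run, one option is to check an environment variable first and only fall back to the prompt when it is unset. The sketch below is a hypothetical helper, not part of Data Designer; the variable name `NVIDIA_API_KEY` is an assumption chosen for illustration, so substitute whatever name your environment uses.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="NVIDIA_API_KEY", environ=None, prompt=getpass):
    """Return the key from the environment, or prompt for it if unset.

    `env_var` is an illustrative name, not one the microservice requires.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive prompt (getpass by default).
        key = prompt("Enter model provider API key: ").strip()
    if not key:
        raise ValueError("No API key provided.")
    return key


# Non-interactive demonstration using a stand-in environment mapping.
demo_key = resolve_api_key(environ={"NVIDIA_API_KEY": "example-key"})
```

In the notebook, `api_key = resolve_api_key()` could stand in for the direct `getpass` call above.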

In [None]:
# You can also load the model configs from a YAML string or file.

model_configs_yaml = f"""\
model_configs:
  - alias: "{model_alias}"
    inference_parameters:
      max_tokens: 1024
      temperature: 0.5
      top_p: 1.0
    model:
      api_endpoint:
        api_key: "{api_key}"
        model_id: "{model_id}"
        url: "{endpoint}"
"""

config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)

## 🏥 Download a seed dataset

- For this notebook, we'll change gears and create a synthetic dataset of patient notes.

- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).


In [None]:
from datasets import load_dataset

df_seed = load_dataset("gretelai/symptom_to_diagnosis")["train"].to_pandas()

# Rename the columns to something more descriptive.
df_seed = df_seed.rename(
    columns={"output_text": "diagnosis", "input_text": "patient_summary"}
)

print(f"Number of records: {len(df_seed)}")

# Save the file so we can upload it to the microservice.
df_seed.to_csv("symptom_to_diagnosis.csv", index=False)

df_seed.head()

## 🎨 Designing our synthetic patient notes dataset

- We set the seed dataset using the `with_seed_dataset` method.

- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.

- We set `with_replacement=False`, so each seed record is used at most once. This caps the number of records we can generate at 853, the size of the seed dataset.


In [None]:
# The repo_id and filename arguments follow the Hugging Face Hub API format.
# Passing the dataset_path argument signals that we need to upload the dataset
# to the datastore. Note we need to pass in the datastore's endpoint, which
# must match the endpoint in the docker-compose file.
config_builder.with_seed_dataset(
    repo_id="into-tutorials/seeding-with-a-dataset",
    filename="symptom_to_diagnosis.csv",
    dataset_path="./symptom_to_diagnosis.csv",
    sampling_strategy="shuffle",
    with_replacement=False,
    datastore={"endpoint": "http://localhost:3000/v1/hf"},
)
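To see why sampling without replacement caps the output size, here is a small conceptual sketch using only the standard library. It illustrates the general idea of shuffled sampling without replacement and is not how the microservice implements it; `sample_shuffled` and the five stand-in rows are hypothetical.

```python
import random

# Stand-in for the seed dataset (the real one has 853 records).
seed_rows = [f"record-{i}" for i in range(5)]


def sample_shuffled(rows, num_records, rng):
    """Pick rows in random order without replacement (each row at most once)."""
    if num_records > len(rows):
        raise ValueError(
            f"Requested {num_records} records but only {len(rows)} seed rows exist."
        )
    # random.sample never repeats an element of the population.
    return rng.sample(rows, k=num_records)


rng = random.Random(0)
picked = sample_shuffled(seed_rows, 5, rng)
```

Requesting a sixth record from these five rows raises a `ValueError`, which mirrors the 853-record ceiling described above.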

In [None]:
# Since we often just want a few attributes from Person objects, we can use
# Data Designer's `with_person_samplers` method to create multiple person samplers
# at once and drop the person object columns from the final dataset.

# Empty dictionaries mean use default settings for the person samplers.
config_builder.with_person_samplers({"patient_sampler": {}, "doctor_sampler": {}})

In [None]:
# Here we demonstrate how you can add a column by calling `add_column` with the
# column name, column type, and any parameters for that column type. This is in
# contrast to using the column and parameter type objects, via `C` and `P`, as we
# did in the previous notebooks. Generally, we recommend using the concrete column
# and parameter type objects, but this is a convenient shorthand when you are
# familiar with the required arguments for each type.

config_builder.add_column(
    name="patient_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True},
)

config_builder.add_column(
    name="first_name",
    type="expression",
    expr="{{ patient_sampler.first_name }}",
)

config_builder.add_column(
    name="last_name",
    type="expression",
    expr="{{ patient_sampler.last_name }}",
)


config_builder.add_column(
    name="dob", type="expression", expr="{{ patient_sampler.birth_date }}"
)


config_builder.add_column(
    name="patient_email",
    type="expression",
    expr="{{ patient_sampler.email_address }}",
)


config_builder.add_column(
    name="symptom_onset_date",
    type="datetime",
    params={"start": "2024-01-01", "end": "2024-12-31"},
)

config_builder.add_column(
    name="date_of_visit",
    type="timedelta",
    params={"dt_min": 1, "dt_max": 30, "reference_column_name": "symptom_onset_date"},
)

config_builder.add_column(
    name="physician",
    type="expression",
    expr="Dr. {{ doctor_sampler.last_name }}",
)

# Note we have access to the seed data fields.
config_builder.add_column(
    name="physician_notes",
    prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},
who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.
The date of today's visit is {{ date_of_visit }}.

{{ patient_summary }}

Write careful notes about your visit with {{ first_name }},
as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.

Format the notes as a busy doctor might.
""",
    model_alias=model_alias,
)

config_builder.validate()
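Because prompts can reference seed fields, sampler columns, and other generated columns, a misspelled placeholder is an easy mistake to make. As a rough complement to `validate()`, the sketch below lists the top-level names a template refers to and compares them with the columns you expect to exist. The helper `referenced_columns` is hypothetical and uses a simple regular expression, not a full Jinja parser, so treat it as a quick sanity check only.

```python
import re

# Captures the first identifier inside each {{ ... }} placeholder,
# e.g. "doctor_sampler" from "{{ doctor_sampler.last_name }}".
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def referenced_columns(template):
    """Return the set of top-level names used in {{ ... }} placeholders."""
    return set(PLACEHOLDER.findall(template))


prompt = (
    "Visit with {{ first_name }} for {{ diagnosis }}, "
    "seen by Dr. {{ doctor_sampler.last_name }}."
)
# Columns assumed to be available: seed fields plus columns added above.
available = {"first_name", "diagnosis", "patient_summary", "doctor_sampler"}

missing = referenced_columns(prompt) - available
```

An empty `missing` set means every placeholder in the template maps to a known column name.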

## 👀 Preview the dataset

- Iteration is key to generating high-quality synthetic data.

- Use the `preview` method to generate 10 records for inspection.


In [None]:
preview = ndd.preview(config_builder, verbose_logging=True)

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset

In [None]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()

## 🧬 Generate your dataset

- Once you are happy with the preview, scale up to a larger dataset.

- The `create` method will submit your generation job to the microservice and return a results object.

- If you want to wait for the job to complete, set `wait_until_done=True`.


In [None]:
results = ndd.create(config_builder, num_records=20, wait_until_done=True)

In [None]:
# Load the dataset into a pandas DataFrame.
dataset = results.load_dataset()

dataset.head()
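Once the dataset is loaded, a quick consistency check can confirm that the `timedelta` column behaved as configured, with `date_of_visit` falling 1 to 30 days after `symptom_onset_date`. The sketch below runs on hand-written rows and assumes both columns are ISO-8601 date strings; the actual output format, and whether the upper bound is inclusive, are assumptions you should verify against your own results. `visit_gap_days` is a hypothetical helper.

```python
from datetime import datetime


def visit_gap_days(onset, visit):
    """Whole days between symptom onset and the visit (ISO-8601 strings)."""
    return (datetime.fromisoformat(visit) - datetime.fromisoformat(onset)).days


# Hand-written example rows standing in for generated records.
rows = [
    {"symptom_onset_date": "2024-03-01", "date_of_visit": "2024-03-15"},
    {"symptom_onset_date": "2024-12-31", "date_of_visit": "2025-01-05"},
]

gaps = [visit_gap_days(r["symptom_onset_date"], r["date_of_visit"]) for r in rows]

# Flag any gap outside the configured dt_min=1, dt_max=30 window.
out_of_range = [g for g in gaps if not 1 <= g <= 30]
```

With the notebook's DataFrame, `dataset.to_dict("records")` could supply `rows`. Note that a late-December onset can legitimately produce a visit date in 2025.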