# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook against a local deployment, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.

<br>

In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.

If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.


In [None]:
from getpass import getpass

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)

from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer (NDD) Client

- The NDD client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to the [managed Data Designer service](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own Data Designer instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).
- If you have a Data Designer instance running locally, you can connect to it as follows:

    ```python
    data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))
    ```


In [None]:
# If using the managed Data Designer service, provide the API key here.
api_key = getpass("Enter Data Designer API key: ")

if len(api_key) > 0:
    print("✅ API key received.")
else:
    print("❌ No API key provided. Please enter your Data Designer API key.")

In [None]:
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(
        base_url="https://ai.api.nvidia.com/v1/nemo/dd",
        default_headers={"Authorization": f"Bearer {api_key}"},  # auto-generated API key
    )
)

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


**Note**:
The managed NeMo Data Designer service also provides models for you to use. You can use these models by referencing the appropriate `model_alias` for each.

Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases.

In [None]:
model_alias = "nemotron-nano-v2"

In [None]:
config_builder = DataDesignerConfigBuilder()

## 🏥 Download a seed dataset

- For this notebook, we'll change gears and create a synthetic dataset of patient notes.

- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).

- In this dataset, the `input_text` field represents the `patient_summary` and the `output_text` field represents the `diagnosis`.

**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.
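
If your seed data is spread across several JSONL files, one way to consolidate them is sketched below. This is a minimal, standard-library-only sketch and not part of Data Designer; the function name `consolidate_jsonl` and the file paths are hypothetical.

```python
import json
from pathlib import Path


def consolidate_jsonl(input_paths, output_path):
    """Concatenate several JSONL files into one, skipping blank lines.

    Returns the number of records written.
    """
    count = 0
    with Path(output_path).open("w", encoding="utf-8") as out:
        for path in input_paths:
            with Path(path).open("r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    # Round-trip through json so a malformed row fails fast.
                    record = json.loads(line)
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

For example, `consolidate_jsonl(["part1.jsonl", "part2.jsonl"], "seed.jsonl")` would write every record from both (hypothetical) files into `seed.jsonl`, which can then serve as the single seed file.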


In [None]:
# The repo_id and filename arguments follow the Hugging Face Hub API format.
# No dataset_path is passed here, so nothing is uploaded: the seed file is read
# from the datastore endpoint given below (the Hugging Face Hub). Passing the
# dataset_path argument would instead signal that we need to upload the dataset
# to the datastore, whose endpoint must match the one in the docker-compose file.
config_builder.with_seed_dataset(
    repo_id="gretelai/symptom_to_diagnosis",
    filename="train.jsonl",
    sampling_strategy="shuffle",
    with_replacement=False,
    datastore={"endpoint": "https://huggingface.co"}
)

## 🎨 Designing our synthetic patient notes dataset

- We set the seed dataset using the `with_seed_dataset` method.

- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.

- We set `with_replacement=False`, which caps the number of generated records at 853, the number of records in the seed dataset.
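
To make the effect of these two settings concrete, here is a conceptual sketch of shuffle sampling with and without replacement. It is not Data Designer's implementation; the function `sample_seed_rows` is hypothetical, and raising an error when too many records are requested is an assumption made for illustration.

```python
import random


def sample_seed_rows(seed_rows, num_records, with_replacement=False, rng=None):
    """Illustrative shuffle-style sampling over seed rows (not the service's code)."""
    rng = rng or random.Random()
    if with_replacement:
        # Each draw is independent, so any number of records can be requested.
        return [rng.choice(seed_rows) for _ in range(num_records)]
    if num_records > len(seed_rows):
        # Assumption: exceeding the seed size is treated as an error.
        raise ValueError(
            f"Requested {num_records} records but only {len(seed_rows)} seed rows exist."
        )
    shuffled = list(seed_rows)
    rng.shuffle(shuffled)
    # Without replacement, every seed row is used at most once.
    return shuffled[:num_records]
```

With 853 seed rows, requesting 853 records without replacement uses each row exactly once, while requesting 854 is only possible when sampling with replacement.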


In [None]:
# Since we often just want a few attributes from Person objects, we can use
# Data Designer's `with_person_samplers` method to create multiple person samplers
# at once and drop the person object columns from the final dataset.

# Empty dictionaries mean use default settings for the person samplers.
config_builder.with_person_samplers({"patient_sampler": {}, "doctor_sampler": {}})

In [None]:
# Here we demonstrate how you can add a column by calling `add_column` with the
# column name, column type, and any parameters for that column type. This is in
# contrast to using the column and parameter type objects, via `C` and `P`, as we
# did in the previous notebooks. Generally, we recommend using the concrete column
# and parameter type objects, but this is a convenient shorthand when you are
# familiar with the required arguments for each type.

config_builder.add_column(
    name="patient_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True},
)

config_builder.add_column(
    name="first_name",
    type="expression",
    expr="{{ patient_sampler.first_name }}",
)

config_builder.add_column(
    name="last_name",
    type="expression",
    expr="{{ patient_sampler.last_name }}",
)


config_builder.add_column(
    name="dob", type="expression", expr="{{ patient_sampler.birth_date }}"
)


config_builder.add_column(
    name="patient_email",
    type="expression",
    expr="{{ patient_sampler.email_address }}",
)


config_builder.add_column(
    name="symptom_onset_date",
    type="datetime",
    params={"start": "2024-01-01", "end": "2024-12-31"},
)

config_builder.add_column(
    name="date_of_visit",
    type="timedelta",
    params={"dt_min": 1, "dt_max": 30, "reference_column_name": "symptom_onset_date"},
)

config_builder.add_column(
    name="physician",
    type="expression",
    expr="Dr. {{ doctor_sampler.last_name }}",
)

# Note we have access to the seed data fields.
config_builder.add_column(
    name="physician_notes",
    prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},
who has been struggling with symptoms from {{ output_text }} since {{ symptom_onset_date }}.
The date of today's visit is {{ date_of_visit }}.

{{ input_text }}

Write careful notes about your visit with {{ first_name }},
as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.

Format the notes as a busy doctor might.
""",
    model_alias=model_alias,
)

config_builder.validate()
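
The relationship between `symptom_onset_date` and `date_of_visit` can be sketched in plain Python. This is only an illustration of the intended semantics, not how Data Designer samples internally; it assumes that `dt_min` and `dt_max` are whole days with inclusive bounds, and the helper `sample_visit_dates` is hypothetical.

```python
import random
from datetime import date, timedelta


def sample_visit_dates(start, end, dt_min, dt_max, rng=None):
    """Pick an onset date in [start, end] and a visit date dt_min..dt_max days later."""
    rng = rng or random.Random()
    span_days = (end - start).days
    onset = start + timedelta(days=rng.randint(0, span_days))
    # Assumption: dt_min/dt_max are whole days and both bounds are inclusive.
    visit = onset + timedelta(days=rng.randint(dt_min, dt_max))
    return onset, visit


onset, visit = sample_visit_dates(date(2024, 1, 1), date(2024, 12, 31), 1, 30)
print(f"onset={onset.isoformat()} visit={visit.isoformat()}")
```

Under these assumptions the visit always falls after the onset, and an onset late in December 2024 can yield a visit date in January 2025.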

## 👀 Preview the dataset

- Iteration is key to generating high-quality synthetic data.

- Use the `preview` method to generate a small number of records for inspection (2 in the cell below).


In [None]:
preview = data_designer_client.preview(config_builder, num_records=2, verbose_logging=True)

In [None]:
# Run this cell multiple times to cycle through the preview records.
preview.display_sample_record()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset
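
Because the preview is a pandas DataFrame, its rows can be turned into plain dictionaries (for example with `preview.dataset.to_dict("records")`) and checked programmatically while iterating. The sketch below flags notes that are very short or never mention the patient's first name. The helper `find_suspect_notes` and the `min_chars` threshold are hypothetical; the column names match those defined in this notebook.

```python
def find_suspect_notes(records, min_chars=50):
    """Return (index, reason) pairs for records whose notes look off."""
    problems = []
    for i, rec in enumerate(records):
        notes = rec.get("physician_notes")
        first_name = rec.get("first_name")
        # Treat missing or non-string values (e.g. NaN) as empty text.
        notes = notes.strip() if isinstance(notes, str) else ""
        first_name = first_name.strip() if isinstance(first_name, str) else ""
        if len(notes) < min_chars:
            problems.append((i, "notes are very short"))
        elif first_name and first_name.lower() not in notes.lower():
            problems.append((i, "patient first name not mentioned"))
    return problems
```

An empty result does not guarantee quality, but a non-empty one points to records worth reading before scaling up generation.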

## ⏭️ Next Steps

Check out the following notebook to learn more about:

- [Using Custom Model Configs](./4-custom-model-configs.ipynb)
