# 🧑‍⚕️ NeMo Data Designer: Realistic Patient Data & Physician Notes

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to generate realistic patient data including physician notes. We'll leverage both structured data generation and LLM capabilities to create a comprehensive medical dataset.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.


In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The data designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of data designer. You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True,
        ),
    ]
)

## 📊 Loading Seed Data

We'll use the symptom-to-diagnosis dataset as our seed data. This dataset contains patient symptoms and corresponding diagnoses which will help generate realistic medical scenarios.

**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.

In [None]:
from datasets import load_dataset

# Let's use the symptom-to-diagnosis dataset to seed our workflow.
df_seed = load_dataset("gretelai/symptom_to_diagnosis")["train"].to_pandas()
df_seed = df_seed.rename(columns={"output_text": "diagnosis", "input_text": "patient_summary"})

print(f"Number of records: {len(df_seed)}")

df_seed.head()

In [None]:
import os

os.makedirs("./data", exist_ok=True)
df_seed.to_csv("./data/symptom_to_diagnosis.csv", index=False)

In [None]:
# We sample with with_replacement=True, so num_records may exceed the seed size.
# With with_replacement=False, the max num_records would be 853.
config_builder.with_seed_dataset(
    repo_id="advanced/healthcare-datasets",
    filename="symptom_to_diagnosis.csv",
    dataset_path="./data/symptom_to_diagnosis.csv",
    sampling_strategy="shuffle",  # alternative: "ordered"
    with_replacement=True,
    datastore={"endpoint": "http://localhost:3000/v1/hf"}
)
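
The effect of `with_replacement` on the maximum number of records can be sketched in plain Python. This is only an illustration of the sampling idea, not Data Designer's implementation: `sample_seed_rows` is a hypothetical helper, and the five-row list stands in for the real seed file.

```python
import random


def sample_seed_rows(seed_rows, num_records, with_replacement, rng):
    """Hypothetical helper: pick seed rows for num_records output records."""
    if with_replacement:
        # Rows may repeat, so num_records can exceed the seed size.
        return rng.choices(seed_rows, k=num_records)
    if num_records > len(seed_rows):
        raise ValueError("cannot draw more records than seed rows without replacement")
    # Each seed row is used at most once, in shuffled order.
    return rng.sample(seed_rows, k=num_records)


rng = random.Random(0)
seed_rows = [f"row-{i}" for i in range(5)]  # stand-in for the seed dataset

with_repl = sample_seed_rows(seed_rows, 8, with_replacement=True, rng=rng)
without_repl = sample_seed_rows(seed_rows, 5, with_replacement=False, rng=rng)
print(len(with_repl), len(without_repl))
```

In this sketch, asking for more rows than the seed contains only fails when sampling without replacement, which mirrors the cap mentioned in the comment above.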

In [None]:
# Create a couple random person samplers.
config_builder.with_person_samplers({"patient_sampler": {}, "doctor_sampler": {}})

## 🏗️ Defining Data Structure

Now we'll define the structure of our dataset by adding columns for patient information, dates, and medical details. We'll use:

- `uuid` for patient identification
- Patient personal information (`first_name`, `last_name`, `dob`, `patient_email`)
- Medical timeline information (`symptom_onset_date`, `date_of_visit`)
- Physician information (`physician`)
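
To make the `uuid` bullet concrete, here is a rough standard-library sketch of what an identifier with a prefix, a short form, and uppercase letters could look like. The helper `make_patient_id` is hypothetical, and the choice of keeping the first 8 hex characters for the short form is an assumption; the service may shorten identifiers differently.

```python
import uuid


def make_patient_id(prefix="PT-", short_form=True, uppercase=True):
    """Hypothetical stand-in for a uuid column with prefix/short_form/uppercase."""
    hex_id = uuid.uuid4().hex  # 32 lowercase hex characters
    if short_form:
        hex_id = hex_id[:8]  # assumption: "short" keeps the first 8 characters
    if uppercase:
        hex_id = hex_id.upper()
    return f"{prefix}{hex_id}"


patient_id = make_patient_id()
print(patient_id)
```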

In [None]:
config_builder.add_column(
    name="patient_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True},
)

config_builder.add_column(
    name="first_name",
    type="expression",
    expr="{{patient_sampler.first_name}}"
)

config_builder.add_column(
    name="last_name",
    type="expression",
    expr="{{patient_sampler.last_name}}"
)


config_builder.add_column(
    name="dob",
    type="expression",
    expr="{{patient_sampler.birth_date}}"
)


config_builder.add_column(
    name="patient_email",
    type="expression",
    expr="{{patient_sampler.email_address}}"
)


config_builder.add_column(
    name="symptom_onset_date",
    type="datetime",
    params={"start": "2024-01-01", "end": "2024-12-31"},
)

config_builder.add_column(
    name="date_of_visit",
    type="timedelta",
    params={
        "dt_min": 1,
        "dt_max": 30,
        "reference_column_name": "symptom_onset_date"
    },
)

config_builder.add_column(
    name="physician",
    type="expression",
    expr="Dr. {{doctor_sampler.first_name}} {{doctor_sampler.last_name}}",
)
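
The relationship between `symptom_onset_date` and `date_of_visit` can be pictured with a small standard-library sketch. It assumes that `dt_min` and `dt_max` are whole days and that both bounds are inclusive; neither detail is confirmed by this notebook, and `sample_onset_and_visit` is a hypothetical helper rather than part of the Data Designer API.

```python
import random
from datetime import date, timedelta


def sample_onset_and_visit(rng, start=date(2024, 1, 1), end=date(2024, 12, 31),
                           dt_min=1, dt_max=30):
    """Hypothetical helper: draw an onset date, then a visit date offset from it."""
    span_days = (end - start).days
    onset = start + timedelta(days=rng.randint(0, span_days))
    # Assumption: the offset is a whole number of days in [dt_min, dt_max].
    visit = onset + timedelta(days=rng.randint(dt_min, dt_max))
    return onset, visit


rng = random.Random(42)
pairs = [sample_onset_and_visit(rng) for _ in range(200)]
onset, visit = pairs[0]
print(onset.isoformat(), visit.isoformat(), (visit - onset).days)
```

Because the visit date is derived from the onset date, it always falls after it in this sketch, and it can land past the `end` bound that applies only to the onset date.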

### 📝 LLM-Generated Physician Notes

The final and most complex column uses an LLM to generate realistic physician notes. We provide:

- Context about the patient and their condition
- Patient summary from our seed data
- Clear formatting instructions

This will create detailed medical notes that reflect the patient's diagnosis and visit information. Note how we reference other columns in the prompt using Jinja templating syntax with double curly braces `{{column_name}}`.
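
As a rough illustration of the `{{column_name}}` substitution, the sketch below fills placeholders from a single record using only the standard library. It is a simplified stand-in, not the Jinja engine: `render_prompt` is a hypothetical helper, the record values are made up, and only plain names are handled (no dotted access, filters, or control flow).

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, record):
    """Replace each {{name}} with record[name]; raises KeyError if a column is missing."""
    return PLACEHOLDER.sub(lambda m: str(record[m.group(1)]), template)


# Made-up values standing in for one generated row.
record = {"first_name": "Ana", "diagnosis": "migraine", "physician": "Dr. Lee Park"}
template = "Write notes about {{first_name}}, seen for {{diagnosis}}, as {{ physician }}."
rendered = render_prompt(template, record)
print(rendered)
```

Raising on a missing column, rather than leaving the placeholder in place, makes a misspelled column name visible before any prompt is sent to a model.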

In [None]:
# Note we have access to the seed data fields.
config_builder.add_column(
    name="physician_notes",
    type="llm-text",
    model_alias=model_alias,
    prompt="""\
<context>
You are a primary-care physician who just had an appointment with {{first_name}} {{last_name}},
who has been struggling with symptoms from {{diagnosis}} since {{symptom_onset_date}}.
The date of today's visit is {{date_of_visit}}.
</context>

<patient_summary_of_symptoms>
{{patient_summary}}
</patient_summary_of_symptoms>

<task>
Write careful notes about your visit with {{first_name}},
as {{physician}}.

Format the notes as a busy doctor might.
</task>
"""
)

## 👀 Previewing the Dataset

Let's generate a preview to see how our data looks before creating the full dataset. This helps verify that our configuration is working as expected.

In [None]:
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
preview.display_sample_record()

In [None]:
preview.dataset
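
Once preview records are available, a few automated sanity checks can complement visual inspection. The sketch below works on plain dictionaries with made-up values; `check_record` is a hypothetical helper, and it assumes date values are strings that begin with an ISO `YYYY-MM-DD` date.

```python
from datetime import date


def check_record(record):
    """Return a list of problems found in one record (an empty list means it passed)."""
    problems = []
    if not str(record["patient_id"]).startswith("PT-"):
        problems.append("patient_id lacks PT- prefix")
    # Assumption: date fields start with an ISO date such as 2024-03-01.
    onset = date.fromisoformat(str(record["symptom_onset_date"])[:10])
    visit = date.fromisoformat(str(record["date_of_visit"])[:10])
    if visit <= onset:
        problems.append("date_of_visit is not after symptom_onset_date")
    if not str(record["physician_notes"]).strip():
        problems.append("physician_notes is empty")
    return problems


# Made-up records for illustration only.
good = {"patient_id": "PT-1A2B3C4D", "symptom_onset_date": "2024-03-01",
        "date_of_visit": "2024-03-10", "physician_notes": "Pt reports headache x9d."}
bad = {"patient_id": "1A2B3C4D", "symptom_onset_date": "2024-03-01",
       "date_of_visit": "2024-02-20", "physician_notes": "  "}

good_problems = check_record(good)
bad_problems = check_record(bad)
print(good_problems, bad_problems)
```

If `preview.dataset` is a pandas DataFrame, `preview.dataset.to_dict("records")` would yield dictionaries of this shape to loop over.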

## 🚀 Generating the Full Dataset

Now that we've verified our configuration works correctly, let's generate a larger dataset with 20 records. We'll wait for the job to complete so we can access the data immediately.

In [None]:
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)
job_results.wait_until_done()

In [None]:
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()

In [None]:
csv_filename = "./data/physician_notes_with_realistic_personal_details.csv"
dataset.to_csv(csv_filename, index=False)
print(f"Dataset with {len(dataset)} records saved to {csv_filename}")