# 🏥 NeMo Data Designer: Clinical Trials Dataset Generator

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for instructions on how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook creates a synthetic dataset of clinical trial records with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.

The dataset includes:
- Trial information and study design
- Participant demographics and health data (PII)
- Investigator and coordinator information (PII)
- Medical observations and notes with embedded PII
- Adverse event reports with varying severity

We'll use Data Designer to create this fully synthetic dataset from scratch.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.


In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The data designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of data designer. You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)

In [None]:
# Add person samplers for the different roles in the clinical trial (en_US locale)
config_builder.add_column(
    C.SamplerColumn(
        name="participant",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(locale="en_US"),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="investigator",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(locale="en_US"),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="coordinator",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(locale="en_US"),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="sponsor",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(locale="en_US"),
    )
)

## Creating Trial Information

Next, we'll create the basic trial information:
- Study ID (unique identifier)
- Trial phase and therapeutic area
- Study design details
- Start and end dates for the trial

In [None]:
# Study identifiers
config_builder.add_column(
    C.SamplerColumn(
        name="study_id",
        type=P.SamplerType.UUID,
        params=P.UUIDSamplerParams(prefix="CT-", short_form=True, uppercase=True)
    )
)

# Trial phase
config_builder.add_column(
    C.SamplerColumn(
        name="trial_phase",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Phase I", "Phase II", "Phase III", "Phase IV"],
            weights=[0.2, 0.3, 0.4, 0.1]
        )
    )
)

# Therapeutic area
config_builder.add_column(
    C.SamplerColumn(
        name="therapeutic_area",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Oncology", "Cardiology", "Neurology", "Immunology", "Infectious Disease"],
            weights=[0.3, 0.2, 0.2, 0.15, 0.15]
        )
    )
)

# Study design
config_builder.add_column(
    name="study_design",
    type="subcategory",
    params={
        "category": "trial_phase",
        "values": {
            "Phase I": ["Single Arm", "Dose Escalation", "First-in-Human", "Safety Assessment"],
            "Phase II": ["Randomized", "Double-Blind", "Proof of Concept", "Open-Label Extension"],
            "Phase III": ["Randomized Controlled", "Double-Blind Placebo-Controlled", "Multi-Center", "Pivotal"],
            "Phase IV": ["Post-Marketing Surveillance", "Real-World Evidence", "Long-Term Safety", "Expanded Access"]
        }
    }
)

# Trial dates
config_builder.add_column(
    name="trial_start_date",
    type="datetime",
    params={"start": "2022-01-01", "end": "2023-06-30"},
    convert_to="%Y-%m-%d"
)

config_builder.add_column(
    name="trial_end_date",
    type="datetime",
    params={"start": "2023-07-01", "end": "2024-12-31"},
    convert_to="%Y-%m-%d"
)
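
To make the sampling behavior of the `trial_phase` and `study_design` columns more concrete, here is a minimal standard-library sketch of weighted category sampling followed by a subcategory draw. It does not use Data Designer. The helper `sample_trial` is hypothetical, and drawing uniformly within a subcategory is an assumption about how the sampler behaves.

```python
import random

# Values and weights copied from the trial_phase column above.
PHASES = ["Phase I", "Phase II", "Phase III", "Phase IV"]
PHASE_WEIGHTS = [0.2, 0.3, 0.4, 0.1]

# Mapping copied from the study_design subcategory column above.
DESIGNS_BY_PHASE = {
    "Phase I": ["Single Arm", "Dose Escalation", "First-in-Human", "Safety Assessment"],
    "Phase II": ["Randomized", "Double-Blind", "Proof of Concept", "Open-Label Extension"],
    "Phase III": ["Randomized Controlled", "Double-Blind Placebo-Controlled", "Multi-Center", "Pivotal"],
    "Phase IV": ["Post-Marketing Surveillance", "Real-World Evidence", "Long-Term Safety", "Expanded Access"],
}


def sample_trial(rng: random.Random) -> dict:
    """Draw a phase by weight, then a design restricted to that phase."""
    phase = rng.choices(PHASES, weights=PHASE_WEIGHTS, k=1)[0]
    # Assumption: each design within a phase is equally likely.
    design = rng.choice(DESIGNS_BY_PHASE[phase])
    return {"trial_phase": phase, "study_design": design}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [sample_trial(rng) for _ in range(5)]
for record in records:
    print(record)
```

The key property is that `study_design` can never disagree with `trial_phase`, because the design is drawn only from the list keyed by the sampled phase.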

## Participant Information

Now we'll create fields for participant demographics and enrollment details:
- Participant ID and basic information
- Demographics (age, gender, etc.)
- Enrollment status and dates
- Randomization assignment

In [None]:
# Participant identifiers and information
config_builder.add_column(
    name="participant_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True}
)

config_builder.add_column(
    name="participant_first_name",
    type="expression",
    expr="{{participant.first_name}}"
)

config_builder.add_column(
    name="participant_last_name",
    type="expression",
    expr="{{participant.last_name}}"
)

config_builder.add_column(
    name="participant_birth_date",
    type="expression",
    expr="{{participant.birth_date}}"
)

config_builder.add_column(
    name="participant_email",
    type="expression",
    expr="{{participant.email_address}}"
)

# Enrollment information
config_builder.add_column(
    name="enrollment_date",
    type="timedelta",
    params={
        "dt_min": 0,
        "dt_max": 60,
        "reference_column_name": "trial_start_date",
        "unit": "D"
    },
    convert_to="%Y-%m-%d"
)

config_builder.add_column(
    name="participant_status",
    type="category",
    params={
        "values": ["Active", "Completed", "Withdrawn", "Lost to Follow-up"],
        "weights": [0.6, 0.2, 0.15, 0.05]
    }
)

config_builder.add_column(
    name="treatment_arm",
    type="category",
    params={
        "values": ["Treatment", "Placebo", "Standard of Care"],
        "weights": [0.5, 0.3, 0.2]
    }
)
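
The `enrollment_date` column is defined as an offset from `trial_start_date`. The sketch below shows one plausible reading of that definition using only the standard library. The function `sample_enrollment_date` is hypothetical, and treating both `dt_min` and `dt_max` as inclusive day counts is an assumption.

```python
import random
from datetime import date, timedelta


def sample_enrollment_date(
    trial_start: date,
    rng: random.Random,
    dt_min: int = 0,
    dt_max: int = 60,
) -> str:
    """Return trial_start plus a random whole number of days, as YYYY-MM-DD."""
    # Assumption: the offset is uniform and both bounds are inclusive.
    offset_days = rng.randint(dt_min, dt_max)
    return (trial_start + timedelta(days=offset_days)).strftime("%Y-%m-%d")


rng = random.Random(42)
start = date(2022, 1, 1)
enrolled = sample_enrollment_date(start, rng)
print(start.isoformat(), "->", enrolled)
```

Under this reading, a trial that starts on 2023-06-30 could enroll a participant as late as 2023-08-29, which is after the earliest possible `trial_end_date` of 2023-07-01. That is the case the `enrollment_date` versus `trial_end_date` constraint later in this notebook guards against.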

## Investigator and Staff Information

Here we'll add information about the trial staff:
- Investigator information (principal investigator)
- Study coordinator details
- Site information

In [None]:
# Investigator information
config_builder.add_column(
    name="investigator_first_name",
    type="expression",
    expr="{{investigator.first_name}}"
)

config_builder.add_column(
    name="investigator_last_name",
    type="expression",
    expr="{{investigator.last_name}}"
)

config_builder.add_column(
    name="investigator_id",
    type="uuid",
    params={"prefix": "INV-", "short_form": True, "uppercase": True}
)

# Study coordinator information
config_builder.add_column(
    name="coordinator_first_name",
    type="expression",
    expr="{{coordinator.first_name}}"
)

config_builder.add_column(
    name="coordinator_last_name",
    type="expression",
    expr="{{coordinator.last_name}}"
)

config_builder.add_column(
    name="coordinator_email",
    type="expression",
    expr="{{coordinator.email_address}}"
)

# Site information
config_builder.add_column(
    name="site_id",
    type="category",
    params={
        "values": ["SITE-001", "SITE-002", "SITE-003", "SITE-004", "SITE-005"]
    }
)

config_builder.add_column(
    name="site_location",
    type="category",
    params={
        "values": ["London", "Manchester", "Birmingham", "Edinburgh", "Cambridge"]
    }
)

# Study costs
config_builder.add_column(
    name="per_patient_cost",
    type="gaussian",
    params={"mean": 15000, "stddev": 5000, "min": 5000}
)

config_builder.add_column(
    name="participant_compensation",
    type="gaussian",
    params={"mean": 500, "stddev": 200, "min": 100}
)
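
The cost columns combine a Gaussian distribution with a `min` value. This notebook does not state how the service enforces that floor, so the sketch below is only one possible interpretation: redraw until the sample is at or above the minimum. Clipping values to the floor would be another option. The helper `sample_floored_gaussian` is hypothetical.

```python
import random


def sample_floored_gaussian(
    mean: float,
    stddev: float,
    minimum: float,
    rng: random.Random,
    max_tries: int = 1000,
) -> float:
    """Draw from a normal distribution, rejecting values below `minimum`."""
    for _ in range(max_tries):
        value = rng.gauss(mean, stddev)
        if value >= minimum:
            return value
    raise RuntimeError("no sample at or above the minimum within max_tries")


rng = random.Random(7)
# Parameters copied from the per_patient_cost column above.
costs = [sample_floored_gaussian(15000, 5000, 5000, rng) for _ in range(1000)]
print("lowest sampled cost is at or above the floor:", min(costs) >= 5000)
```

With a mean of 15000 and a standard deviation of 5000, the floor of 5000 sits two standard deviations below the mean, so only a small fraction of draws would be rejected.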

## Clinical Measurements and Outcomes

These columns will track the key clinical data collected during the trial:
- Vital signs and lab values
- Efficacy measurements 
- Dosing information

In [None]:
# Basic clinical measurements
config_builder.add_column(
    name="baseline_measurement",
    type="gaussian",
    params={"mean": 100, "stddev": 15},
    convert_to="float"
)

config_builder.add_column(
    name="final_measurement",
    type="gaussian",
    params={"mean": 85, "stddev": 20},
    convert_to="float"
)

# Calculate percent change
config_builder.add_column(
    name="percent_change",
    type="expression",
    expr="{{(final_measurement - baseline_measurement) / baseline_measurement * 100}}"
)

# Dosing information
config_builder.add_column(
    name="dose_level",
    type="category",
    params={
        "values": ["Low", "Medium", "High", "Placebo"],
        "weights": [0.3, 0.3, 0.2, 0.2]
    }
)

config_builder.add_column(
    name="dose_frequency",
    type="category",
    params={
        "values": ["Once daily", "Twice daily", "Weekly", "Biweekly"],
        "weights": [0.4, 0.3, 0.2, 0.1]
    }
)

# Protocol compliance
config_builder.add_column(
    name="compliance_rate",
    type="uniform",
    params={"low": 0.7, "high": 1.0}
)
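
As a worked example of the `percent_change` expression, the plain-Python function below applies the same formula to the mean values configured for the two measurement columns (baseline 100, final 85). The function name is introduced here only for illustration.

```python
def percent_change(baseline: float, final: float) -> float:
    """Same arithmetic as the percent_change expression column."""
    return (final - baseline) / baseline * 100


# Using the configured means of baseline_measurement and final_measurement.
typical = percent_change(100.0, 85.0)
print(round(typical, 2))
```

A typical record therefore shows a decrease of about 15%. The formula divides by the baseline, which is one reason the constraint requiring `baseline_measurement` to be greater than zero (added later in this notebook) matters.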

## Adverse Events Tracking

Here we'll capture adverse events that occur during the clinical trial:
- Adverse event presence and type
- Severity and relatedness to treatment
- Resolution status

In [None]:
# Adverse event flags and details
config_builder.add_column(
    name="has_adverse_event",
    type="bernoulli",
    params={"p": 0.3}
)

config_builder.add_column(
    name="adverse_event_type",
    type="category",
    params={
        "values": ["Headache", "Nausea", "Fatigue", "Rash", "Dizziness", "Pain at injection site", "Other"],
        "weights": [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]
    },
    conditional_params={"has_adverse_event == 0": {"values": ["None"]}}
)

config_builder.add_column(
    name="adverse_event_severity",
    type="category",
    params={"values": ["Mild", "Moderate", "Severe", "Life-threatening"]},
    conditional_params={"has_adverse_event == 0": {"values": ["NA"]}}
)

config_builder.add_column(
    name="adverse_event_relatedness",
    type="category",
    params={
        "values": ["Unrelated", "Possibly related", "Probably related", "Definitely related"],
        "weights": [0.2, 0.4, 0.3, 0.1]
    },
    conditional_params={"has_adverse_event == 0": {"values": ["NA"]}}
)

config_builder.add_column(
    name="adverse_event_resolved",
    type="category",
    params={"values": ["NA"]},
    conditional_params={"has_adverse_event == 1": {"values": ["Yes", "No"], "weights": [0.8, 0.2]}}
)
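
The `conditional_params` argument swaps in a different set of sampler parameters when a condition on another column holds. The sketch below mimics that behavior for `adverse_event_type` in plain Python. The function names are hypothetical, and the branching logic is an approximation of what the service does, not its implementation.

```python
import random

# Values and weights copied from the adverse_event_type column above.
EVENT_TYPES = ["Headache", "Nausea", "Fatigue", "Rash", "Dizziness", "Pain at injection site", "Other"]
EVENT_WEIGHTS = [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]


def sample_has_adverse_event(rng: random.Random, p: float = 0.3) -> int:
    """Bernoulli draw: 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def sample_adverse_event_type(has_adverse_event: int, rng: random.Random) -> str:
    """Use the conditional values when there is no adverse event."""
    if has_adverse_event == 0:
        # Mirrors conditional_params: {"has_adverse_event == 0": {"values": ["None"]}}
        return "None"
    return rng.choices(EVENT_TYPES, weights=EVENT_WEIGHTS, k=1)[0]


rng = random.Random(123)
for _ in range(5):
    flag = sample_has_adverse_event(rng)
    print(flag, sample_adverse_event_type(flag, rng))
```

The `adverse_event_resolved` column uses the same mechanism in the opposite direction: its default is `"NA"`, and the `"Yes"`/`"No"` values apply only when `has_adverse_event == 1`.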

## Narrative text fields with style variations

These fields will contain natural language text that incorporates PII elements.
We'll use style seed categories to ensure diversity in the writing styles:

1. Medical observations and notes
2. Adverse event descriptions  
3. Protocol deviation explanations

**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.

In [None]:
# Documentation style category
config_builder.add_column(
    name="documentation_style",
    type="category",
    params={
        "values": ["Formal and Technical", "Concise and Direct", "Detailed and Descriptive"],
        "weights": [0.4, 0.3, 0.3]
    }
)

# Medical observations - varies based on documentation style
config_builder.add_column(
    name="medical_observations",
    type="llm-text",
    model_alias=model_alias,
    prompt="""
    {% if documentation_style == "Formal and Technical" %}
    Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }}
    (ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).

    Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.
    Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a
    change of {{ percent_change }}%.

    Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate
    sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.
    {% elif documentation_style == "Concise and Direct" %}
    Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }}
    ({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.

    Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}.
    Change: {{ percent_change }}%.

    Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference
    Dr. {{ investigator_last_name }} briefly.
    {% else %}
    Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}
    enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).

    Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.
    Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),
    representing a {{ percent_change }}% change.

    Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective
    patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.
    {% endif %}
    """
)

# Adverse event descriptions - conditional on having an adverse event
config_builder.add_column(
    name="adverse_event_description",
    type="llm-text",
    model_alias=model_alias,
    prompt="""
    {% if has_adverse_event == 1 %}
    [INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. Use formal medical language. Do not include meta-commentary or explain what you're doing.]\
    {{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment.
    {% if adverse_event_resolved == "Yes" %}Resolved.{% else %}Ongoing.{% endif %}
    {% else %}
    [INSTRUCTIONS: Output only the exact text "No adverse events reported" without any additional commentary.]\
    No adverse events reported.\
    {% endif %}
    """
)

# Protocol deviation description (if compliance is low)
config_builder.add_column(
    name="protocol_deviation",
    type="llm-text",
    model_alias=model_alias,
    prompt="""
    {% if compliance_rate < 0.85 %}
    {% if documentation_style == "Formal and Technical" %}
    [FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like "it looks like" or "you've provided". Begin with the protocol deviation details. Use formal terminology.]

    PROTOCOL DEVIATION REPORT
    Study ID: {{ study_id }}
    Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})
    Compliance Rate: {{ compliance_rate }}

    [Continue with formal description of the deviation, impact on data integrity, and corrective actions. Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]
    {% elif documentation_style == "Concise and Direct" %}
    [FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]

    PROTOCOL DEVIATION - {{ participant_id }}
    • Compliance: {{ compliance_rate }}
    • Impact: [severity level]
    • Actions: [list actions]
    • Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}
    • PI: Dr. {{ investigator_last_name }}
    {% else %}
    [FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]

    During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} {{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.

    [Continue with narrative about circumstances, discovery, impact, and team response. Include references to {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]
    {% endif %}
    {% else %}
    [FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]

    PROTOCOL COMPLIANCE ASSESSMENT
    Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})
    Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.
    {% endif %}
    """
)
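
The `protocol_deviation` prompt nests two conditions: the compliance threshold and the documentation style. The plain-Python sketch below reproduces only that branching so it is easier to see which prompt variant a given row receives. The function and the short branch labels are hypothetical. The actual rendering is done by the template engine inside Data Designer.

```python
def protocol_prompt_branch(compliance_rate: float, documentation_style: str) -> str:
    """Return a label for the prompt branch the template above would select."""
    if compliance_rate < 0.85:
        if documentation_style == "Formal and Technical":
            return "formal deviation report"
        elif documentation_style == "Concise and Direct":
            return "bullet-point deviation note"
        # Any other style falls through to the narrative branch.
        return "narrative deviation description"
    return "compliance assessment (no deviation)"


for rate, style in [
    (0.72, "Formal and Technical"),
    (0.80, "Concise and Direct"),
    (0.84, "Detailed and Descriptive"),
    (0.93, "Formal and Technical"),
]:
    print(rate, "|", style, "->", protocol_prompt_branch(rate, style))
```

Because `compliance_rate` is sampled uniformly between 0.7 and 1.0, the 0.85 threshold sits at the midpoint of the range, so roughly half of the generated rows should take a deviation branch.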

## Adding Constraints

Finally, we'll add constraints to ensure our data is logically consistent:
- The trial end date must come after the trial start date
- The enrollment date must fall on or after the trial start date and before the trial end date
- Baseline and final measurements must be positive

In [None]:
# Ensure appropriate date sequence
config_builder.add_constraint(
    target_column="trial_end_date",
    type="column_inequality",
    params={"operator": "gt", "rhs": "trial_start_date"}
)

config_builder.add_constraint(
    target_column="enrollment_date",
    type="column_inequality",
    params={"operator": "ge", "rhs": "trial_start_date"}
)

config_builder.add_constraint(
    target_column="enrollment_date",
    type="column_inequality",
    params={"operator": "lt", "rhs": "trial_end_date"}
)

# Ensure reasonable clinical measurements
config_builder.add_constraint(
    target_column="baseline_measurement",
    type="scalar_inequality",
    params={"operator": "gt", "rhs": 0}
)

config_builder.add_constraint(
    target_column="final_measurement",
    type="scalar_inequality",
    params={"operator": "gt", "rhs": 0}
)
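
To illustrate what these constraints express, the sketch below evaluates the same five rules against individual records in plain Python. It only shows the predicate each constraint encodes. How the service enforces constraints during generation (for example by resampling) is not covered here, and the `satisfies` and `is_valid` helpers are hypothetical.

```python
import operator

OPERATORS = {"gt": operator.gt, "ge": operator.ge, "lt": operator.lt}

# Same targets, types, operators, and right-hand sides as the add_constraint calls above.
CONSTRAINTS = [
    {"target_column": "trial_end_date", "type": "column_inequality", "operator": "gt", "rhs": "trial_start_date"},
    {"target_column": "enrollment_date", "type": "column_inequality", "operator": "ge", "rhs": "trial_start_date"},
    {"target_column": "enrollment_date", "type": "column_inequality", "operator": "lt", "rhs": "trial_end_date"},
    {"target_column": "baseline_measurement", "type": "scalar_inequality", "operator": "gt", "rhs": 0},
    {"target_column": "final_measurement", "type": "scalar_inequality", "operator": "gt", "rhs": 0},
]


def satisfies(record: dict, constraint: dict) -> bool:
    """Check one constraint; column inequalities compare two fields of the record."""
    left = record[constraint["target_column"]]
    if constraint["type"] == "column_inequality":
        right = record[constraint["rhs"]]
    else:
        right = constraint["rhs"]
    return OPERATORS[constraint["operator"]](left, right)


def is_valid(record: dict) -> bool:
    return all(satisfies(record, c) for c in CONSTRAINTS)


# Dates in YYYY-MM-DD form compare correctly as plain strings.
good = {"trial_start_date": "2023-06-30", "trial_end_date": "2023-09-15",
        "enrollment_date": "2023-08-29", "baseline_measurement": 100.0, "final_measurement": 85.0}
bad = dict(good, trial_end_date="2023-07-01")  # enrollment now falls after the trial end

print(is_valid(good), is_valid(bad))
```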


## Preview and Generate Dataset

First, we'll preview a small sample to verify our configuration is working correctly.
Then we'll generate the full dataset with the desired number of records.

In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
# Display a sample record from the preview
preview.display_sample_record()

In [None]:
# Submit batch job
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

job_results.wait_until_done()


In [None]:
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()

In [None]:
# Save the dataset
import os
os.makedirs("data", exist_ok=True)

csv_filename = "./data/clinical-trial-data.csv"
dataset.to_csv(csv_filename, index=False)
print(f"Dataset with {len(dataset)} records saved to {csv_filename}")
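
Since the dataset is intended for testing data protection and anonymization techniques, it can help to know which structured PII values actually appear inside each narrative field. The sketch below is a simple substring check over one row. The helper `find_embedded_pii` and the sample row are hypothetical, and a real evaluation would likely need more robust matching.

```python
# Columns whose values are expected to show up inside the narrative text.
PII_COLUMNS = ["participant_first_name", "participant_last_name", "investigator_last_name"]


def find_embedded_pii(row: dict, text_column: str, pii_columns=PII_COLUMNS) -> list:
    """Return the PII column names whose values occur verbatim in the text column."""
    text = str(row.get(text_column, ""))
    return [col for col in pii_columns if row.get(col) and str(row[col]) in text]


# Hypothetical row for illustration only; not output from the generator.
sample_row = {
    "participant_first_name": "Jane",
    "participant_last_name": "Doe",
    "investigator_last_name": "Smith",
    "medical_observations": "Participant Jane Doe tolerated dosing well per Dr. Smith.",
}

print(find_embedded_pii(sample_row, "medical_observations"))
```

Assuming `dataset` is a pandas DataFrame, the same check could be applied to every generated row by iterating over `dataset.to_dict("records")`.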