# 🧾 NeMo Data Designer: Synthetic Insurance Claims Dataset Generator

#### 📚 What you'll learn

This notebook creates a synthetic dataset of insurance claims with realistic PII (Personally Identifiable Information)
for testing data protection and anonymization techniques.

The dataset includes:

- Policy and claim details
- Policyholder and claimant information (PII)
- Claim descriptions and adjuster notes with embedded PII
- Medical information for relevant claims

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
from nemo_microservices.data_designer.essentials import (
    BernoulliSamplerParams,
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    ExpressionColumnConfig,
    GaussianSamplerParams,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
    SubcategorySamplerParams,
    UUIDSamplerParams,
    UniformSamplerParams,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# The "/no_think" system prompt turns off reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

## 🎲 Creating Person Samplers

We'll create person samplers to generate consistent personal information for different roles in the insurance claims process:

- Policyholders (primary insurance customers)
- Claimants (who may be different from policyholders)
- Adjusters (insurance company employees who evaluate claims)
- Physicians (for medical-related claims)


In [None]:
# Create one person sampler per role. All four use the en_US locale.
config_builder.add_column(
    SamplerColumnConfig(
        name="policyholder",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(locale="en_US"),
    )
)
config_builder.add_column(
    SamplerColumnConfig(
        name="claimant",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(locale="en_US"),
    )
)
config_builder.add_column(
    SamplerColumnConfig(
        name="adjuster",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(locale="en_US"),
    )
)
config_builder.add_column(
    SamplerColumnConfig(
        name="physician",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(locale="en_US"),
    )
)

### Creating Policy Information

Next, we'll create the basic policy information:

- Policy number (unique identifier)
- Policy type (Auto, Home, Health, etc.)
- Coverage details (based on policy type)
- Policy start and end dates
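
To make the sampler semantics concrete, here is a minimal plain-Python sketch of a weighted category draw followed by a dependent subcategory draw. It is not Data Designer's implementation: `sample_policy` is a hypothetical helper, and picking the coverage type uniformly within the chosen policy type is an assumption.

```python
import random

# Values and weights copied from the policy_type column in this notebook.
POLICY_TYPES = ["Auto", "Home", "Health", "Life", "Travel"]
POLICY_WEIGHTS = [0.4, 0.3, 0.15, 0.1, 0.05]

# Mapping copied from the coverage_type column in this notebook.
COVERAGE_BY_POLICY = {
    "Auto": ["Liability", "Comprehensive", "Collision", "Uninsured Motorist"],
    "Home": ["Dwelling", "Personal Property", "Liability", "Natural Disaster"],
    "Health": ["Emergency Care", "Primary Care", "Specialist", "Prescription"],
    "Life": ["Term", "Whole Life", "Universal Life", "Variable Life"],
    "Travel": ["Trip Cancellation", "Medical Emergency", "Lost Baggage", "Flight Accident"],
}


def sample_policy(rng: random.Random) -> dict:
    """Draw a policy type by weight, then a coverage type valid for that policy type."""
    policy_type = rng.choices(POLICY_TYPES, weights=POLICY_WEIGHTS, k=1)[0]
    # Assumption: every coverage type within a policy type is equally likely.
    coverage_type = rng.choice(COVERAGE_BY_POLICY[policy_type])
    return {"policy_type": policy_type, "coverage_type": coverage_type}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [sample_policy(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the coverage type is looked up from the sampled policy type, a record can never pair, say, an Auto policy with a Prescription coverage.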


In [None]:
# Policy identifiers
config_builder.add_column(
    SamplerColumnConfig(
        name="policy_number",
        sampler_type=SamplerType.UUID,
        params=UUIDSamplerParams(prefix="POL-", short_form=True, uppercase=True),
    )
)

# Policy type
config_builder.add_column(
    SamplerColumnConfig(
        name="policy_type",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Auto", "Home", "Health", "Life", "Travel"],
            weights=[0.4, 0.3, 0.15, 0.1, 0.05],
        ),
    )
)

# Coverage types based on policy type
config_builder.add_column(
    SamplerColumnConfig(
        name="coverage_type",
        sampler_type=SamplerType.SUBCATEGORY,
        params=SubcategorySamplerParams(
            category="policy_type",
            values={
                "Auto": [
                    "Liability",
                    "Comprehensive",
                    "Collision",
                    "Uninsured Motorist",
                ],
                "Home": [
                    "Dwelling",
                    "Personal Property",
                    "Liability",
                    "Natural Disaster",
                ],
                "Health": [
                    "Emergency Care",
                    "Primary Care",
                    "Specialist",
                    "Prescription",
                ],
                "Life": ["Term", "Whole Life", "Universal Life", "Variable Life"],
                "Travel": [
                    "Trip Cancellation",
                    "Medical Emergency",
                    "Lost Baggage",
                    "Flight Accident",
                ],
            },
        ),
    )
)

# Policy dates
config_builder.add_column(
    SamplerColumnConfig(
        name="policy_start_date",
        sampler_type=SamplerType.DATETIME,
        params={"start": "2022-01-01", "end": "2023-06-30"},
        convert_to="%Y-%m-%d",
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="policy_end_date",
        sampler_type=SamplerType.DATETIME,
        params={"start": "2023-07-01", "end": "2024-12-31"},
        convert_to="%Y-%m-%d",
    )
)

### Policyholder Information (PII)

Now we'll add fields for the policyholder's personal information. This includes PII elements that would typically be
subject to privacy regulations:

- First and last name
- Birth date
- Contact information (email)

These fields use expressions to reference the person sampler we defined earlier.
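
As a rough illustration of how an expression such as `{{policyholder.first_name}}` pulls one attribute out of a nested person record, the sketch below resolves dotted placeholders against a dictionary. It is a simplified stand-in that handles only plain dotted names, not the template engine Data Designer actually uses. `resolve`, `render`, and the sample values are hypothetical; the attribute names are the ones referenced in this notebook.

```python
import re

# Matches placeholders such as {{policyholder.first_name}} (optional inner spaces).
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve(record: dict, dotted_name: str):
    """Walk a dotted path such as 'policyholder.first_name' through nested dicts."""
    value = record
    for key in dotted_name.split("."):
        value = value[key]
    return value


def render(expr: str, record: dict) -> str:
    """Replace each {{dotted.name}} placeholder with the matching record value."""
    return PLACEHOLDER.sub(lambda m: str(resolve(record, m.group(1))), expr)


# Hypothetical sampled person; real records come from the person sampler.
record = {
    "policyholder": {
        "first_name": "Jane",
        "last_name": "Doe",
        "birth_date": "1985-04-12",
        "email_address": "jane.doe@example.com",
    }
}

flat = {
    "policyholder_first_name": render("{{policyholder.first_name}}", record),
    "policyholder_email": render("{{policyholder.email_address}}", record),
}
print(flat)
```

Flattening the nested record into top-level columns this way is what lets later prompts refer to `{{policyholder_first_name}}` directly.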


In [None]:
# Policyholder personal information
config_builder.add_column(
    ExpressionColumnConfig(
        name="policyholder_first_name", expr="{{policyholder.first_name}}"
    )
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="policyholder_last_name", expr="{{policyholder.last_name}}"
    )
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="policyholder_birth_date", expr="{{policyholder.birth_date}}"
    )
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="policyholder_email", expr="{{policyholder.email_address}}"
    )
)

### Claim Information

Next, we'll create the core claim details:

- Claim ID (unique identifier)
- Dates (filing date, incident date)
- Claim status (in process, approved, denied, etc.)
- Financial information (amount claimed, amount approved)
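
The arithmetic behind the date and amount columns can be sketched in plain Python. This is an illustration under assumptions, not the service's code: `sample_claim` is a hypothetical helper, and treating both ends of the 1 to 30 day offset as inclusive is an assumption. The Gaussian and uniform parameters are the ones used in this notebook.

```python
import random
from datetime import date, timedelta
from statistics import NormalDist


def sample_claim(rng: random.Random, incident_date: date) -> dict:
    """Derive a filing date and claim amounts for one incident."""
    # filing_date: incident date plus a whole number of days (assumed inclusive bounds).
    filing_date = incident_date + timedelta(days=rng.randint(1, 30))
    claim_amount = rng.gauss(5000, 2000)
    approved_percentage = rng.uniform(0.0, 1.0)
    return {
        "incident_date": incident_date.strftime("%Y-%m-%d"),
        "filing_date": filing_date.strftime("%Y-%m-%d"),
        "claim_amount": claim_amount,
        "approved_percentage": approved_percentage,
        # Same formula as the approved_amount expression column.
        "approved_amount": claim_amount * approved_percentage,
    }


claim = sample_claim(random.Random(0), date(2023, 3, 15))
print(claim)

# Share of Normal(mean=5000, stddev=2000) draws that fall below zero.
negative_share = NormalDist(mu=5000, sigma=2000).cdf(0)
print(f"{negative_share:.4f}")
```

A normal distribution with these parameters puts roughly 0.6% of its mass below zero, so an occasional negative `claim_amount` is possible unless values are clamped or redrawn in post-processing.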


In [None]:
# Claim identifier
config_builder.add_column(
    SamplerColumnConfig(
        name="claim_id",
        sampler_type=SamplerType.UUID,
        params=UUIDSamplerParams(prefix="CLM-", short_form=True, uppercase=True),
    )
)

# Claim dates
config_builder.add_column(
    SamplerColumnConfig(
        name="incident_date",
        sampler_type=SamplerType.DATETIME,
        params={"start": "2023-01-01", "end": "2023-12-31"},
        convert_to="%Y-%m-%d",
    )
)

config_builder.add_column(
    name="filing_date",
    column_type="sampler",
    sampler_type="timedelta",
    params={
        "dt_min": 1,
        "dt_max": 30,
        "reference_column_name": "incident_date",
        "unit": "D",
    },
    convert_to="%Y-%m-%d",
)

# Claim status
config_builder.add_column(
    SamplerColumnConfig(
        name="claim_status",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Filed",
                "Under Review",
                "Additional Info Requested",
                "Approved",
                "Denied",
                "Appealed",
            ],
            weights=[0.15, 0.25, 0.15, 0.25, 0.15, 0.05],
        ),
    )
)

# Financial information
config_builder.add_column(
    SamplerColumnConfig(
        name="claim_amount",
        sampler_type=SamplerType.GAUSSIAN,
        params=GaussianSamplerParams(mean=5000, stddev=2000),
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="approved_percentage",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=0.0, high=1.0),
    )
)

# Calculate approved amount based on percentage
config_builder.add_column(
    ExpressionColumnConfig(
        name="approved_amount", expr="{{claim_amount * approved_percentage}}"
    )
)

### Claimant Information

In some cases, the claimant (the person filing the claim) may be different from the policyholder.
We'll create fields to capture claimant information and their relationship to the policyholder:

- Flag indicating if claimant is the policyholder
- Claimant personal details (when different from policyholder)
- Relationship to policyholder
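
In the configuration that follows, the policyholder flag and the relationship column are sampled independently, so a record can pair `is_claimant_policyholder = 1` with a relationship other than "Self". If downstream tests need the two to agree, one option is a small post-processing step on the generated rows. The sketch below is one possible policy, not part of Data Designer; `reconcile_claimant` and the sample rows are hypothetical.

```python
def reconcile_claimant(row: dict) -> dict:
    """Return a copy of the row whose relationship agrees with the policyholder flag."""
    fixed = dict(row)
    if fixed["is_claimant_policyholder"]:
        # The claimant is the policyholder, so reuse the policyholder's details.
        fixed["relationship_to_policyholder"] = "Self"
        for field in ("first_name", "last_name", "birth_date"):
            fixed[f"claimant_{field}"] = fixed[f"policyholder_{field}"]
    elif fixed["relationship_to_policyholder"] == "Self":
        # A different person cannot be "Self"; fall back to a neutral label.
        fixed["relationship_to_policyholder"] = "Other"
    return fixed


# Hypothetical rows showing the two inconsistent combinations.
base = {
    "policyholder_first_name": "Jane",
    "policyholder_last_name": "Doe",
    "policyholder_birth_date": "1985-04-12",
    "claimant_first_name": "John",
    "claimant_last_name": "Smith",
    "claimant_birth_date": "1990-01-01",
}
row_a = reconcile_claimant(
    {**base, "is_claimant_policyholder": 1, "relationship_to_policyholder": "Spouse"}
)
row_b = reconcile_claimant(
    {**base, "is_claimant_policyholder": 0, "relationship_to_policyholder": "Self"}
)
print(row_a["relationship_to_policyholder"], row_a["claimant_first_name"])
print(row_b["relationship_to_policyholder"], row_b["claimant_first_name"])
```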


In [None]:
# Determine if claimant is the policyholder
config_builder.add_column(
    SamplerColumnConfig(
        name="is_claimant_policyholder",
        sampler_type=SamplerType.BERNOULLI,
        params=BernoulliSamplerParams(p=0.7),
    )
)

# Claimant personal information (when different from policyholder)
config_builder.add_column(
    ExpressionColumnConfig(name="claimant_first_name", expr="{{claimant.first_name}}")
)

config_builder.add_column(
    ExpressionColumnConfig(name="claimant_last_name", expr="{{claimant.last_name}}")
)

config_builder.add_column(
    ExpressionColumnConfig(name="claimant_birth_date", expr="{{claimant.birth_date}}")
)

# Relationship to policyholder
config_builder.add_column(
    SamplerColumnConfig(
        name="relationship_to_policyholder",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Self", "Spouse", "Child", "Parent", "Sibling", "Other"]
        ),
    )
)

### Claim Adjuster Information

Insurance claims are typically handled by claim adjusters. We'll add information about
the adjuster assigned to each claim:

- Adjuster name
- Assignment date


In [None]:
# Adjuster information
config_builder.add_column(
    ExpressionColumnConfig(name="adjuster_first_name", expr="{{adjuster.first_name}}")
)

config_builder.add_column(
    ExpressionColumnConfig(name="adjuster_last_name", expr="{{adjuster.last_name}}")
)

# Adjuster assignment date
config_builder.add_column(
    name="adjuster_assignment_date",
    column_type="sampler",
    sampler_type="timedelta",
    params={
        "dt_min": 0,
        "dt_max": 5,
        "reference_column_name": "filing_date",
        "unit": "D",
    },
    convert_to="%Y-%m-%d",
)

### Medical Information

For health insurance claims and injury-related claims in other policy types,
we'll include medical information:

- Flag indicating if there's a medical component to the claim
- Medical claim details (when applicable)


In [None]:
# Is there a medical component to this claim?
config_builder.add_column(
    SamplerColumnConfig(
        name="has_medical_component",
        sampler_type=SamplerType.BERNOULLI,
        params=BernoulliSamplerParams(p=0.4),
    )
)

# Physician information using conditional logic.
# Rows without a medical component get the plain placeholder NA (no quote characters).
config_builder.add_column(
    ExpressionColumnConfig(
        name="physician_first_name",
        expr="{% if has_medical_component == 1 %}{{physician.first_name}}{% else %}NA{% endif %}",
    )
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="physician_last_name",
        expr="{% if has_medical_component == 1 %}{{physician.last_name}}{% else %}NA{% endif %}",
    )
)

### Free Text Fields with PII References

These fields will contain natural language text that incorporates PII elements from other fields.
This is particularly useful for testing PII detection and redaction within unstructured text:

1. Incident Description - The policyholder/claimant's account of what happened
2. Adjuster Notes - The insurance adjuster's professional documentation
3. Medical Notes - For claims with a medical component

The LLM will be prompted to include PII elements like names, dates, and contact information
within the narrative text.
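
Because the structured columns hold the same PII that the prompts ask the model to embed, they can double as ground truth when evaluating a detector on the free text. The sketch below looks for verbatim occurrences of structured values inside a text field. `find_embedded_pii` and the sample record are hypothetical, and exact string matching is a deliberate simplification.

```python
def find_embedded_pii(record: dict, text_field: str, pii_fields: list) -> list:
    """Locate verbatim occurrences of structured PII values inside a free-text field."""
    text = record[text_field]
    hits = []
    for field in pii_fields:
        value = str(record[field])
        if not value:
            continue  # an empty value would match everywhere
        start = text.find(value)
        while start != -1:
            hits.append({"field": field, "start": start, "end": start + len(value)})
            start = text.find(value, start + len(value))
    return hits


# Hypothetical generated record, shortened for illustration.
record = {
    "policyholder_first_name": "Jane",
    "policyholder_last_name": "Doe",
    "policyholder_email": "jane.doe@example.com",
    "incident_description": "My name is Jane Doe. You can reach me at jane.doe@example.com.",
}

hits = find_embedded_pii(
    record,
    "incident_description",
    ["policyholder_first_name", "policyholder_last_name", "policyholder_email"],
)
for hit in hits:
    print(hit["field"], record["incident_description"][hit["start"]:hit["end"]])
```

The returned character spans can serve as labels when scoring a redaction tool. The model may paraphrase or reformat a value (for example, writing a date in another style), so verbatim matching will likely miss some mentions.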


In [None]:
# Incident description from policyholder/claimant
config_builder.add_column(
    LLMTextColumnConfig(
        name="incident_description",
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
        prompt=(
            "Write a detailed description of an insurance incident for a {{policy_type}} insurance policy with "
            "{{coverage_type}} coverage.\n\n"
            "The policyholder is {{policyholder_first_name}} {{policyholder_last_name}} (email: {{policyholder_email}}).\n\n"
            "The incident occurred on {{incident_date}} and resulted in approximately ${{claim_amount}} in damages/expenses.\n\n"
            "Write this from the perspective of the person filing the claim. Include specific details that would be relevant "
            "to processing this type of claim. Make it detailed but realistic, as if written by someone describing an actual incident.\n\n"
            "Reference the policyholder's contact information at least once in the narrative.\n"
        ),
    )
)

# Adjuster notes
config_builder.add_column(
    LLMTextColumnConfig(
        name="adjuster_notes",
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
        prompt=(
            "Write detailed insurance adjuster notes for claim {{claim_id}}.\n\n"
            "POLICY INFORMATION:\n"
            "- Policy #: {{policy_number}}\n"
            "- Type: {{policy_type}}, {{coverage_type}} coverage\n"
            "- Policyholder: {{policyholder_first_name}} {{policyholder_last_name}}\n\n"
            "CLAIM DETAILS:\n"
            "- Incident Date: {{incident_date}}\n"
            "- Filing Date: {{filing_date}}\n"
            "- Claimed Amount: ${{claim_amount}}\n\n"
            "As adjuster {{adjuster_first_name}} {{adjuster_last_name}}, write professional notes documenting:\n"
            "1. Initial contact with the policyholder\n"
            "2. Assessment of the claim based on the incident description\n"
            "3. Coverage determination under the policy\n"
            "4. Recommended next steps\n\n"
            "Include at least one mention of contacting the policyholder using their full name and email ({{policyholder_email}}).\n"
            "Use a formal, professional tone typical of insurance documentation.\n"
        ),
    )
)

# Medical notes (for claims with medical component)
config_builder.add_column(
    LLMTextColumnConfig(
        name="medical_notes",
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
        prompt=(
            "{% if has_medical_component %}"
            "Write medical notes related to insurance claim {{ claim_id }}.\n\n"
            "Patient: {{policyholder_first_name}} {{policyholder_last_name}}, DOB: {{policyholder_birth_date}}\n\n"
            "As Dr. {{physician_first_name}} {{physician_last_name}}, document:\n\n"
            "1. Chief complaint\n"
            "2. Medical assessment\n"
            "3. Treatment recommendations\n"
            "4. Follow-up instructions\n\n"
            "Include appropriate medical terminology relevant to a {{policy_type}} insurance claim.\n"
            "If this is for a Health policy, focus on the {{coverage_type}} aspects.\n"
            "For other policy types, focus on injury assessment relevant to the incident.\n\n"
            "Use a professional medical documentation style that includes specific references "
            "to the patient by name and birth date.\n\n"
            "The language should be natural and different from one physician to the next.\n\n"
            "Vary the length of the response. Keep some notes brief and others more detailed.\n"
            "{% else -%}"
            "No medical claim"
            "{% endif -%}"
        ),
    )
)

### Adding Constraints

To ensure our data is logically consistent, we'll add some constraints:

- Incident date must be during the policy term
- Filing date must be after incident date
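
Stated as code, the constraints amount to three date comparisons per record. The following sketch checks them on rows whose dates are `%Y-%m-%d` strings, which is the format used by the date columns in this notebook. `violated_constraints` and the sample rows are hypothetical, and the sketch can double as a sanity check on generated output.

```python
from datetime import date


def violated_constraints(row: dict) -> list:
    """Return the names of the date constraints that a row breaks."""
    start = date.fromisoformat(row["policy_start_date"])
    end = date.fromisoformat(row["policy_end_date"])
    incident = date.fromisoformat(row["incident_date"])
    filing = date.fromisoformat(row["filing_date"])
    checks = {
        "incident_date >= policy_start_date": incident >= start,
        "incident_date <= policy_end_date": incident <= end,
        "filing_date > incident_date": filing > incident,
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical rows: the first is consistent, the second is not.
good_row = {
    "policy_start_date": "2022-05-01",
    "policy_end_date": "2024-04-30",
    "incident_date": "2023-03-15",
    "filing_date": "2023-03-20",
}
bad_row = {**good_row, "policy_end_date": "2023-01-31", "filing_date": "2023-03-15"}

print(violated_constraints(good_row))
print(violated_constraints(bad_row))
```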


In [None]:
# Ensure incident date falls within policy period
config_builder.add_constraint(
    target_column="incident_date",
    constraint_type="column_inequality",
    operator="ge",
    rhs="policy_start_date",
)

config_builder.add_constraint(
    target_column="incident_date",
    constraint_type="column_inequality",
    operator="le",
    rhs="policy_end_date",
)

# Ensure filing date is after incident date
config_builder.add_constraint(
    target_column="filing_date",
    constraint_type="column_inequality",
    operator="gt",
    rhs="incident_date",
)

### 🔁 Iteration is key – preview the dataset!

1. Use the `preview` method to generate a sample of records quickly.

2. Inspect the results for quality and format issues.

3. Adjust column configurations, prompts, or parameters as needed.

4. Re-run the preview until satisfied.


In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder)

In [None]:
# Display a sample record from the preview results.
preview.display_sample_record()

### 📊 Analyze the generated data

- Data Designer automatically generates a basic statistical analysis of the generated data.

- This analysis is available via the `analysis` property of generation result objects.


In [None]:
# Print the analysis as a table.
preview.analysis.to_report()

### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=20)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-healthcare-datasets-insurance-claims",
);