# 🎨 NeMo Data Designer: Synthetic Insurance Claims Dataset Generator

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook creates a synthetic dataset of insurance claims with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.

The dataset includes:
- Policy and claim details
- Policyholder and claimant information (PII)
- Claim descriptions and adjuster notes with embedded PII
- Medical information for relevant claims

We'll use NeMo Data Designer to create this fully synthetic dataset from scratch.


### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.


In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)

## Creating Person Samplers

We'll create person samplers to generate consistent personal information for different roles in the insurance claims process:
- Policyholders (primary insurance customers)
- Claimants (who may be different from policyholders)
- Adjusters (insurance company employees who evaluate claims)
- Physicians (for medical-related claims)

In [None]:
# Create person samplers for the different roles (all use the en_US locale)
config_builder.with_person_samplers({
    "policyholder": {"locale": "en_US"},
    "claimant": {"locale": "en_US"},
    "adjuster": {"locale": "en_US"},
    "physician": {"locale": "en_US"}
})

## Creating Policy Information

Next, we'll create the basic policy information:
- Policy number (unique identifier)
- Policy type (Auto, Home, Health, etc.)
- Coverage details (based on policy type)
- Policy start and end dates

In [None]:
# Policy identifiers
config_builder.add_column(
    name="policy_number",
    type="uuid",
    params={"prefix": "POL-", "short_form": True, "uppercase": True}
)

# Policy type
config_builder.add_column(
    name="policy_type",
    type="category",
    params={
        "values": ["Auto", "Home", "Health", "Life", "Travel"],
        "weights": [0.4, 0.3, 0.15, 0.1, 0.05]
    }
)

# Coverage types based on policy type
config_builder.add_column(
    name="coverage_type",
    type="subcategory",
    params={
        "category": "policy_type",
        "values": {
            "Auto": ["Liability", "Comprehensive", "Collision", "Uninsured Motorist"],
            "Home": ["Dwelling", "Personal Property", "Liability", "Natural Disaster"],
            "Health": ["Emergency Care", "Primary Care", "Specialist", "Prescription"],
            "Life": ["Term", "Whole Life", "Universal Life", "Variable Life"],
            "Travel": ["Trip Cancellation", "Medical Emergency", "Lost Baggage", "Flight Accident"]
        }
    }
)

# Policy dates
config_builder.add_column(
    name="policy_start_date",
    type="datetime",
    params={"start": "2022-01-01", "end": "2023-06-30"},
    convert_to="%Y-%m-%d"
)

config_builder.add_column(
    name="policy_end_date",
    type="datetime",
    params={"start": "2023-07-01", "end": "2024-12-31"},
    convert_to="%Y-%m-%d"
)
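
To make the sampler settings above concrete, here is a minimal, library-free sketch of what a weighted `category` column and a dependent `subcategory` column describe. It uses only Python's `random` module. The function name `sample_policy` is hypothetical, and drawing coverage types uniformly within a policy type is an assumption for illustration, not a description of how Data Designer samples internally.

```python
import random

POLICY_TYPES = ["Auto", "Home", "Health", "Life", "Travel"]
POLICY_WEIGHTS = [0.4, 0.3, 0.15, 0.1, 0.05]

COVERAGE_BY_POLICY = {
    "Auto": ["Liability", "Comprehensive", "Collision", "Uninsured Motorist"],
    "Home": ["Dwelling", "Personal Property", "Liability", "Natural Disaster"],
    "Health": ["Emergency Care", "Primary Care", "Specialist", "Prescription"],
    "Life": ["Term", "Whole Life", "Universal Life", "Variable Life"],
    "Travel": ["Trip Cancellation", "Medical Emergency", "Lost Baggage", "Flight Accident"],
}


def sample_policy(rng):
    """Draw a policy type by weight, then a coverage type valid for that type."""
    policy_type = rng.choices(POLICY_TYPES, weights=POLICY_WEIGHTS, k=1)[0]
    # Assumption: coverage types are equally likely within the chosen policy type.
    coverage_type = rng.choice(COVERAGE_BY_POLICY[policy_type])
    return policy_type, coverage_type


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_policy(rng) for _ in range(1000)]
auto_share = sum(1 for policy, _ in samples if policy == "Auto") / len(samples)
print(f"Share of Auto policies: {auto_share:.2f}")
```

With enough samples, the share of each policy type should approach its weight, and every coverage type should belong to its row's policy type.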

## Policyholder Information (PII)

Now we'll add fields for the policyholder's personal information. This includes PII elements that would typically be subject to privacy regulations:
- First and last name
- Birth date
- Contact information (email)

These fields use expressions to reference the person sampler we defined earlier.

In [None]:
# Policyholder personal information
config_builder.add_column(
    name="policyholder_first_name",
    type="expression",
    expr="{{policyholder.first_name}}"
)

config_builder.add_column(
    name="policyholder_last_name",
    type="expression",
    expr="{{policyholder.last_name}}"
)

config_builder.add_column(
    name="policyholder_birth_date",
    type="expression",
    expr="{{policyholder.birth_date}}"
)

config_builder.add_column(
    name="policyholder_email",
    type="expression",
    expr="{{policyholder.email_address}}"
)

## Claim Information

Next, we'll create the core claim details:
- Claim ID (unique identifier)
- Dates (filing date, incident date)
- Claim status (in process, approved, denied, etc.)
- Financial information (amount claimed, amount approved)

In [None]:
# Claim identifier
config_builder.add_column(
    name="claim_id",
    type="uuid",
    params={"prefix": "CLM-", "short_form": True, "uppercase": True}
)

# Claim dates
config_builder.add_column(
    name="incident_date",
    type="datetime",
    params={"start": "2023-01-01", "end": "2023-12-31"},
    convert_to="%Y-%m-%d"
)

config_builder.add_column(
    name="filing_date",
    type="timedelta",
    params={
        "dt_min": 1,
        "dt_max": 30,
        "reference_column_name": "incident_date",
        "unit": "D"
    },
    convert_to="%Y-%m-%d"
)

# Claim status
config_builder.add_column(
    name="claim_status",
    type="category",
    params={
        "values": ["Filed", "Under Review", "Additional Info Requested", "Approved", "Denied", "Appealed"],
        "weights": [0.15, 0.25, 0.15, 0.25, 0.15, 0.05]
    }
)

# Financial information
config_builder.add_column(
    name="claim_amount",
    type="gaussian",
    params={"mean": 5000, "stddev": 2000, "min": 500}
)

config_builder.add_column(
    name="approved_percentage",
    type="uniform",
    params={"low": 0.0, "high": 1.0}
)

# Calculate approved amount based on percentage
config_builder.add_column(
    name="approved_amount",
    type="expression",
    expr="{{claim_amount * approved_percentage}}"
)
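
The two derived columns above (`filing_date` and `approved_amount`) can be illustrated with plain Python. This is a sketch of the arithmetic only: the helper names are hypothetical, and treating both `dt_min` and `dt_max` as inclusive bounds is an assumption, since the service may handle the upper bound differently.

```python
import random
from datetime import date, timedelta


def sample_filing_date(incident_date, rng, dt_min=1, dt_max=30):
    """Return a date between dt_min and dt_max days after incident_date."""
    # Assumption: both bounds are inclusive.
    return incident_date + timedelta(days=rng.randint(dt_min, dt_max))


def approved_amount(claim_amount, approved_percentage):
    """Mirror the expression {{claim_amount * approved_percentage}}."""
    return claim_amount * approved_percentage


rng = random.Random(42)  # fixed seed so the sketch is repeatable
incident = date(2023, 3, 15)
filing = sample_filing_date(incident, rng)
print(incident.isoformat(), "->", filing.isoformat())
print(approved_amount(5000.0, 0.75))
```

Because the offset is at least one day, the sampled filing date always falls strictly after the incident date.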

## Claimant Information

In some cases, the claimant (person filing the claim) may be different from the policyholder. 
We'll create fields to capture claimant information and their relationship to the policyholder:
- Flag indicating if claimant is the policyholder
- Claimant personal details (when different from policyholder)
- Relationship to policyholder

In [None]:
# Determine if claimant is the policyholder
config_builder.add_column(
    name="is_claimant_policyholder",
    type="bernoulli",
    params={"p": 0.7}
)

# Claimant personal information (generated for every record; only meaningful when the claimant is not the policyholder)
config_builder.add_column(
    name="claimant_first_name",
    type="expression",
    expr="{{claimant.first_name}}"
)

config_builder.add_column(
    name="claimant_last_name",
    type="expression",
    expr="{{claimant.last_name}}"
)

config_builder.add_column(
    name="claimant_birth_date",
    type="expression",
    expr="{{claimant.birth_date}}"
)

# Relationship to policyholder
config_builder.add_column(
    name="relationship_to_policyholder",
    type="category",
    params={"values": ["Self", "Spouse", "Child", "Parent", "Sibling", "Other"]}
)
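
In the configuration above, `is_claimant_policyholder` and `relationship_to_policyholder` are sampled independently, so a record can flag the claimant as the policyholder while listing the relationship as "Spouse". One way to reconcile the two is a post-processing step on the generated records. The sketch below is only an illustration: the helper `reconcile_claimant` is hypothetical, records are treated as plain dictionaries, and copying the policyholder's details onto the claimant is one possible policy, not something the notebook prescribes.

```python
def reconcile_claimant(record):
    """Return a copy of the record whose claimant fields agree with the flag."""
    fixed = dict(record)
    if fixed["is_claimant_policyholder"]:
        # Same person: relationship is "Self" and claimant details match the policyholder.
        fixed["relationship_to_policyholder"] = "Self"
        for field in ("first_name", "last_name", "birth_date"):
            fixed[f"claimant_{field}"] = fixed[f"policyholder_{field}"]
    elif fixed["relationship_to_policyholder"] == "Self":
        # The flag says the claimant is someone else, so "Self" cannot be right.
        fixed["relationship_to_policyholder"] = "Other"
    return fixed


record = {
    "is_claimant_policyholder": 1,
    "relationship_to_policyholder": "Spouse",
    "policyholder_first_name": "Jane",
    "policyholder_last_name": "Doe",
    "policyholder_birth_date": "1980-04-02",
    "claimant_first_name": "Sam",
    "claimant_last_name": "Roe",
    "claimant_birth_date": "1975-09-30",
}
fixed = reconcile_claimant(record)
print(fixed["relationship_to_policyholder"], fixed["claimant_first_name"])
```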

## Claim Adjuster Information

Insurance claims are typically handled by claim adjusters. We'll add information about 
the adjuster assigned to each claim:
- Adjuster name
- Assignment date
- Contact information

In [None]:
# Adjuster information
config_builder.add_column(
    name="adjuster_first_name",
    type="expression",
    expr="{{adjuster.first_name}}"
)

config_builder.add_column(
    name="adjuster_last_name",
    type="expression",
    expr="{{adjuster.last_name}}"
)

# Adjuster assignment date
config_builder.add_column(
    name="adjuster_assignment_date",
    type="timedelta",
    params={
        "dt_min": 0,
        "dt_max": 5,
        "reference_column_name": "filing_date",
        "unit": "D"
    },
    convert_to="%Y-%m-%d"
)

## Medical Information

For health insurance claims and injury-related claims in other policy types, 
we'll include medical information:
- Flag indicating if there's a medical component to the claim
- Medical claim details (when applicable)

In [None]:
# Is there a medical component to this claim?
config_builder.add_column(
    name="has_medical_component",
    type="bernoulli",
    params={"p": 0.4}
)

# Physician information using conditional logic
config_builder.add_column(
    name="physician_first_name",
    type="expression",
    expr="{{physician.first_name}}",
    conditional_params={"has_medical_component == 0": {"expr": "'NA'"}}
)

config_builder.add_column(
    name="physician_last_name",
    type="expression",
    expr="{{physician.last_name}}",
    conditional_params={"has_medical_component == 0": {"expr": "'NA'"}}
)
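
The intent of the `conditional_params` entries above is that physician fields fall back to a placeholder when a claim has no medical component. A plain-Python sketch of that rule follows. The function name `resolve_physician_field` is hypothetical, and the assumption that the fallback renders as the bare string `NA` should be checked against a preview record.

```python
def resolve_physician_field(has_medical_component, sampled_value, fallback="NA"):
    """Return the sampled physician value, or the fallback for non-medical claims."""
    # Mirrors the condition "has_medical_component == 0" used in conditional_params.
    if has_medical_component == 0:
        return fallback
    return sampled_value


rows = [
    {"has_medical_component": 1, "physician_last_name": "Patel"},
    {"has_medical_component": 0, "physician_last_name": "Nguyen"},
]
resolved = [
    resolve_physician_field(row["has_medical_component"], row["physician_last_name"])
    for row in rows
]
print(resolved)  # ['Patel', 'NA']
```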

## Free Text Fields with PII References

These fields will contain natural language text that incorporates PII elements from other fields.
This is particularly useful for testing PII detection and redaction within unstructured text:

1. Incident Description - The policyholder/claimant's account of what happened
2. Adjuster Notes - The insurance adjuster's professional documentation
3. Medical Notes - For claims with a medical component

The LLM will be prompted to include PII elements like names, dates, and contact information
within the narrative text.

In [None]:
# Incident description from policyholder/claimant
config_builder.add_column(
    name="incident_description",
    type="llm-text",
    model_alias=model_alias,
    prompt="""
    Write a detailed description of an insurance incident for a {{policy_type}} insurance policy with {{coverage_type}} coverage.

    The policyholder is {{policyholder_first_name}} {{policyholder_last_name}} (email: {{policyholder_email}}).

    The incident occurred on {{incident_date}} and resulted in approximately ${{claim_amount}} in damages/expenses.

    Write this from the perspective of the person filing the claim. Include specific details that would be relevant
    to processing this type of claim. Make it detailed but realistic, as if written by someone describing an actual incident.

    Reference the policyholder's contact information at least once in the narrative.
    """
)

# Adjuster notes
config_builder.add_column(
    name="adjuster_notes",
    type="llm-text",
    model_alias=model_alias,
    prompt="""
    Write detailed insurance adjuster notes for claim {{claim_id}}.

    POLICY INFORMATION:
    - Policy #: {{policy_number}}
    - Type: {{policy_type}}, {{coverage_type}} coverage
    - Policyholder: {{policyholder_first_name}} {{policyholder_last_name}}

    CLAIM DETAILS:
    - Incident Date: {{incident_date}}
    - Filing Date: {{filing_date}}
    - Claimed Amount: ${{claim_amount}}

    As adjuster {{adjuster_first_name}} {{adjuster_last_name}}, write professional notes documenting:
    1. Initial contact with the policyholder
    2. Assessment of the claim based on the incident description
    3. Coverage determination under the policy
    4. Recommended next steps

    Include at least one mention of contacting the policyholder using their full name and email ({{policyholder_email}}).
    Use a formal, professional tone typical of insurance documentation.
    """
)

# Medical notes (for claims with medical component)
config_builder.add_column(
    name="medical_notes",
    type="llm-text",
    model_alias=model_alias,
    prompt="""
    {% if has_medical_component %}\
    Write medical notes related to insurance claim {{ claim_id }}.

    Patient: {{policyholder_first_name}} {{policyholder_last_name}}, DOB: {{policyholder_birth_date}}

    As Dr. {{physician_first_name}} {{physician_last_name}}, document:

    1. Chief complaint
    2. Medical assessment
    3. Treatment recommendations
    4. Follow-up instructions

    Include appropriate medical terminology relevant to a {{policy_type}} insurance claim.
    If this is for a Health policy, focus on the {{coverage_type}} aspects.
    For other policy types, focus on injury assessment relevant to the incident.

    Use a professional medical documentation style that includes specific references
    to the patient by name and birth date.\

    The language should be natural and different from one physician to the next.\

    Vary the length of the response. Keep some notes brief and others more detailed.\
    {% else -%}\
    Repeat the following: "No medical claim"\
    {% endif -%}\
    """
)
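
Because the prompts ask the LLM to embed specific PII values, it can be useful to check afterwards that a generated narrative actually contains them. The sketch below does a simple case-insensitive substring check. The helper `missing_pii` and the sample record are hypothetical, and a real narrative may paraphrase or reformat values (dates in particular), which this check would report as missing.

```python
def missing_pii(text, expected_values):
    """Return the expected PII strings that do not appear in text (case-insensitive)."""
    lowered = text.lower()
    return [value for value in expected_values if value.lower() not in lowered]


record = {
    "policyholder_first_name": "Jane",
    "policyholder_last_name": "Doe",
    "policyholder_email": "jane.doe@example.com",
    "incident_description": (
        "My name is Jane Doe and I am filing this claim after a collision. "
        "You can reach me at jane.doe@example.com for any follow-up."
    ),
}
expected = [
    record["policyholder_first_name"],
    record["policyholder_last_name"],
    record["policyholder_email"],
]
print(missing_pii(record["incident_description"], expected))  # []
```

An empty list means every expected value was found; any returned values are candidates for manual review or regeneration.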

## Adding Constraints

To ensure our data is logically consistent, we'll add some constraints:
- Incident date must be during the policy term
- Filing date must be after incident date

In [None]:
# Ensure incident date falls within policy period
config_builder.add_constraint(
    target_column="incident_date",
    type="column_inequality",
    params={"operator": "ge", "rhs": "policy_start_date"}
)

config_builder.add_constraint(
    target_column="incident_date",
    type="column_inequality",
    params={"operator": "le", "rhs": "policy_end_date"}
)

# Ensure filing date is after incident date
config_builder.add_constraint(
    target_column="filing_date",
    type="column_inequality",
    params={"operator": "gt", "rhs": "incident_date"}
)
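
The same three rules can also be verified on generated records as a sanity check. The sketch below assumes each record is a dictionary with dates formatted as `%Y-%m-%d` strings, matching the `convert_to` setting used earlier. The helper `constraint_violations` is hypothetical and is not part of Data Designer.

```python
from datetime import date


def constraint_violations(record):
    """Return a list of date rules that the record breaks (empty if none)."""
    start = date.fromisoformat(record["policy_start_date"])
    end = date.fromisoformat(record["policy_end_date"])
    incident = date.fromisoformat(record["incident_date"])
    filing = date.fromisoformat(record["filing_date"])

    problems = []
    if incident < start:
        problems.append("incident_date is before policy_start_date")
    if incident > end:
        problems.append("incident_date is after policy_end_date")
    if filing <= incident:
        problems.append("filing_date is not after incident_date")
    return problems


good = {
    "policy_start_date": "2022-05-01",
    "policy_end_date": "2024-04-30",
    "incident_date": "2023-06-10",
    "filing_date": "2023-06-18",
}
bad = dict(good, incident_date="2022-01-15", filing_date="2022-01-15")
print(constraint_violations(good))
print(constraint_violations(bad))
```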

## Preview and Generate Dataset

First, we'll preview a small sample to verify our configuration is working correctly.
Then we'll generate the full dataset with the desired number of records.

In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
# Display one sample record from the preview
preview.display_sample_record()

In [None]:
# Generate the full dataset
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)
job_results.wait_until_done()

In [None]:
# Display the first few rows of the generated dataset
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()

In [None]:
import os
os.makedirs("./data", exist_ok=True)

# Save the dataset to CSV
dataset.to_csv("./data/insurance_claims_with_pii.csv", index=False)
print(f"Dataset with {len(dataset)} records saved to ./data/insurance_claims_with_pii.csv")