# 🧑‍🤝‍🧑 NeMo Data Designer: Person Sampler Tutorial

### 📚 What you'll learn

In this notebook, we'll explore how you can generate realistic personal information for your synthetic datasets.

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).

<br>

## What is the Person Sampler?

The Person Sampler is a powerful feature in NeMo Data Designer that generates consistent, realistic person records with attributes like:

- Names (first, middle, last)
- Contact information (email, phone)
- Addresses (street, city, state, zip)
- Demographics (age, sex, ethnic background)
- IDs (SSN, UUID)
- And more!

These records are fully synthetic but maintain the statistical properties and formatting patterns of real personal data.


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# The "/no_think" system prompt turns off reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

### 1. Basic Person Sampling

Let's start with a simple example that generates person data with minimal settings: a locale and a sex filter.


In [None]:
# Add a simple person column, pinned to the en_US locale and restricted to sex="Male"
config_builder.add_column(
    SamplerColumnConfig(
        name="person",  # This creates a nested object with all person attributes
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(locale="en_US", sex="Male"),
    )
)
# Preview what the generated data looks like
preview = data_designer_client.preview(config_builder)
preview.dataset

### 2. Accessing Individual Person Attributes

The `person` column we created above is a nested object with many attributes. Let's create some columns to access specific attributes from this person object.
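
To make the dotted `{{ person.attribute }}` syntax concrete, the sketch below resolves such placeholders against a nested dictionary using only the standard library. It illustrates dotted attribute access and is not Data Designer's own template engine; `sample_row`, `PLACEHOLDER`, and `render_expr` are hypothetical names, and the record values are made up.

```python
import re

# Hypothetical sample record; field names follow the person attributes used in this notebook.
sample_row = {
    "person": {
        "first_name": "John",
        "last_name": "Smith",
        "email_address": "john.smith@example.com",
        "age": 42,
    }
}

# Matches placeholders such as {{ person.first_name }} and captures the dotted path.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_expr(expr, row):
    """Replace each {{ a.b }} placeholder with the value found by walking `row`."""

    def lookup(match):
        value = row
        for key in match.group(1).split("."):
            value = value[key]  # a KeyError here means the attribute does not exist
        return str(value)

    return PLACEHOLDER.sub(lookup, expr)


full_name = render_expr("{{ person.first_name }} {{ person.last_name }}", sample_row)
age = render_expr("{{ person.age }}", sample_row)
print(full_name)  # John Smith
print(age)  # 42
```

Note that this toy renderer always returns strings; how the service types the resulting expression columns is not covered here.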


In [None]:
# Add columns to extract specific attributes from the person object
config_builder.add_column(
    name="full_name",
    column_type="expression",
    expr="{{ person.first_name }} {{ person.last_name }}",
)

config_builder.add_column(
    name="email", column_type="expression", expr="{{ person.email_address }}"
)

config_builder.add_column(
    name="address",
    column_type="expression",
    expr="{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}",
)

config_builder.add_column(name="age", column_type="expression", expr="{{ person.age }}")

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[["full_name", "email", "address", "age"]]

### 3. Customizing Person Samplers

- Now let's customize the Person Sampler to generate specific types of profiles.

- For the `en_US` locale, the Person Sampler draws realistic details from a model trained on US Census data. For any other locale, person records are generated with Faker.
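
When you constrain a sampler with params such as `age_range` or `state`, it can be worth spot-checking generated records against those constraints. The sketch below does this on plain dictionaries. The helper `violations` and the sample records are hypothetical, and it assumes `age_range` is an inclusive `[min, max]` pair, which this notebook does not confirm.

```python
def violations(records, age_range=None, state=None):
    """Return the records that fall outside the requested constraints.

    Assumption: age_range is an inclusive [min, max] pair.
    """
    bad = []
    for rec in records:
        if age_range is not None and not (age_range[0] <= rec["age"] <= age_range[1]):
            bad.append(rec)
        elif state is not None and rec.get("state") != state:
            bad.append(rec)
    return bad


# Hypothetical records standing in for the values of a sampled person column.
employees = [
    {"first_name": "Ana", "age": 34, "state": "CA"},
    {"first_name": "Raj", "age": 70, "state": "CA"},
    {"first_name": "Lee", "age": 41, "state": "NV"},
]

out_of_spec = violations(employees, age_range=[22, 65], state="CA")
print([rec["first_name"] for rec in out_of_spec])  # ['Raj', 'Lee']
```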


In [None]:
# Reset our config builder for this example
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

# Create custom person samplers for different roles/demographics
config_builder.add_column(
    name="employee",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [22, 65], "state": "CA"},
)

config_builder.add_column(
    name="customer",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [18, 80]},
)

# Create a UK-based person
config_builder.add_column(
    name="uk_contact",
    column_type="sampler",
    sampler_type="person",
    params={
        "locale": "en_GB",  # UK locale
        "city": "London",
    },
)

# Add columns to extract and format information
config_builder.add_column(
    name="employee_info",
    column_type="expression",
    expr="{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}",
)

config_builder.add_column(
    name="customer_info",
    column_type="expression",
    expr="{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}",
)

config_builder.add_column(
    name="uk_contact_info",
    column_type="expression",
    expr="{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number }} - {{ uk_contact.city }}",
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[["employee_info", "customer_info", "uk_contact_info"]]

### 4. Available Person Attributes

The Person Sampler generates a rich set of attributes that you can use. Here's a reference list of some of the key attributes available:


| Attribute           | Description                        | Example                                |
| ------------------- | ---------------------------------- | -------------------------------------- |
| `first_name`        | Person's first name                | "John"                                 |
| `middle_name`       | Person's middle name (may be None) | "Robert"                               |
| `last_name`         | Person's last name                 | "Smith"                                |
| `sex`               | Person's sex                       | "Male"                                 |
| `age`               | Person's age in years              | 42                                     |
| `birth_date`        | Date of birth                      | "1980-05-15"                           |
| `email_address`     | Email address                      | "john.smith@example.com"               |
| `phone_number`      | Phone number                       | "+1 (555) 123-4567"                    |
| `street_number`     | Street number                      | "123"                                  |
| `street_name`       | Street name                        | "Main Street"                          |
| `unit`              | Apartment/unit number              | "Apt 4B"                               |
| `city`              | City name                          | "Chicago"                              |
| `state`             | State/province (locale dependent)  | "IL"                                   |
| `county`            | County (locale dependent)          | "Cook"                                 |
| `zipcode`           | Postal/ZIP code                    | "60601"                                |
| `country`           | Country name                       | "United States"                        |
| `ssn`               | Social Security Number (US locale) | "123-45-6789"                          |
| `occupation`        | Occupation                         | "Software Engineer"                    |
| `marital_status`    | Marital status                     | "Married"                              |
| `education_level`   | Education level                    | "Bachelor's Degree"                    |
| `ethnic_background` | Ethnic background                  | "Caucasian"                            |
| `uuid`              | Unique identifier                  | "550e8400-e29b-41d4-a716-446655440000" |
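
Because `middle_name` may be `None`, code that assembles a full name outside of Data Designer (for example, while post-processing a downloaded dataset) should skip missing parts. A minimal sketch, where `format_full_name` is a hypothetical helper and the records reuse the example values from the table:

```python
def format_full_name(person):
    """Join first, middle, and last name, skipping a missing or empty middle name."""
    parts = [
        person.get("first_name"),
        person.get("middle_name"),
        person.get("last_name"),
    ]
    return " ".join(part for part in parts if part)


# Hypothetical records built from the table's example values.
with_middle = {"first_name": "John", "middle_name": "Robert", "last_name": "Smith"}
without_middle = {"first_name": "John", "middle_name": None, "last_name": "Smith"}

print(format_full_name(with_middle))  # John Robert Smith
print(format_full_name(without_middle))  # John Smith
```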


### 5. Creating Multiple Person Samplers with One Method

For convenience, Data Designer provides a `with_person_samplers` method to create multiple person samplers at once. The cell below does not use it; it defines each sampler with a separate `add_column` call.
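
One way to cut repetition when defining several person samplers is to drive `add_column` from a mapping of column names to params. The sketch below only builds the keyword arguments, so it runs without a live service. `PERSON_SPECS` and `build_sampler_kwargs` are hypothetical names, and the keyword form of `add_column` is assumed to match the calls used elsewhere in this notebook.

```python
# Hypothetical spec: column name -> person sampler params (values mirror this notebook).
PERSON_SPECS = {
    "doctor": {"locale": "en_US", "age_range": [30, 70]},
    "patient": {"locale": "en_US", "age_range": [18, 90]},
    "nurse": {"locale": "en_US", "age_range": [25, 65], "sex": "Female"},
}


def build_sampler_kwargs(specs):
    """Return one add_column keyword dict per person sampler spec."""
    return [
        {
            "name": name,
            "column_type": "sampler",
            "sampler_type": "person",
            "params": dict(params),  # copy so later edits don't touch the spec
        }
        for name, params in specs.items()
    ]


sampler_kwargs = build_sampler_kwargs(PERSON_SPECS)
for kwargs in sampler_kwargs:
    # With a config builder in scope, this would be:
    # config_builder.add_column(**kwargs)
    print(kwargs["name"], kwargs["params"])
```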


In [None]:
# Reset our config builder for this example
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

# Create custom person samplers for different roles/demographics
config_builder.add_column(
    name="doctor",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [30, 70]},
)

config_builder.add_column(
    name="patient",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [18, 90]},
)

config_builder.add_column(
    name="nurse",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [25, 65], "sex": "Female"},
)

config_builder.add_column(
    name="international_doctor",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "fr_FR", "age_range": [35, 65]},
)

# Add columns to format information for each person type
config_builder.add_column(
    name="doctor_profile",
    column_type="expression",
    expr="Dr. {{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}",
)

config_builder.add_column(
    name="patient_profile",
    column_type="expression",
    expr="{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}",
)

config_builder.add_column(
    name="nurse_profile",
    column_type="expression",
    expr="Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}",
)

config_builder.add_column(
    name="international_doctor_profile",
    column_type="expression",
    expr="Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}",
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[
    [
        "doctor_profile",
        "patient_profile",
        "nurse_profile",
        "international_doctor_profile",
    ]
]

### 6. Using Person Data with LLM Generation

One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content.
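
Prompts reference other columns through `{{ ... }}` placeholders, so a typo in a column name is easy to miss. The sketch below extracts the top-level column names from a prompt and compares them with the columns you intend to define. It is a standalone check written for illustration, not part of the SDK; `referenced_columns` and `defined` are hypothetical names.

```python
import re

# Captures the top-level name in placeholders such as {{ patient.age }} or {{ doctor_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)(?:\.\w+)*\s*\}\}")


def referenced_columns(template):
    """Return the set of top-level column names used in {{ ... }} placeholders."""
    return {match.group(1) for match in PLACEHOLDER.finditer(template)}


prompt = (
    "Write a brief medical note from {{ doctor_name }} about patient {{ patient_name }}, "
    "a {{ patient.age }}-year-old {{ patient.sex }} with {{ medical_condition }}."
)

# Columns we plan to add to the config before the LLM column.
defined = {"patient", "doctor", "medical_condition", "patient_name", "doctor_name"}

missing = referenced_columns(prompt) - defined
print(sorted(referenced_columns(prompt)))
print(missing)  # set()
```

A non-empty `missing` set would point to a placeholder whose column has not been defined yet.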


In [None]:
# Reset our config builder for this example
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)


# Create person samplers for patients and doctors
config_builder.add_column(
    name="patient",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [18, 85]},
)

config_builder.add_column(
    name="doctor",
    column_type="sampler",
    sampler_type="person",
    params={"locale": "en_US", "age_range": [30, 70]},
)

# Add some medical condition sampling
config_builder.add_column(
    SamplerColumnConfig(
        name="medical_condition",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Hypertension",
                "Type 2 Diabetes",
                "Asthma",
                "Rheumatoid Arthritis",
                "Migraine",
                "Hypothyroidism",
            ]
        ),
    )
)

# Add basic info columns
config_builder.add_column(
    name="patient_name",
    column_type="expression",
    expr="{{ patient.first_name }} {{ patient.last_name }}",
)

config_builder.add_column(
    name="doctor_name",
    column_type="expression",
    expr="Dr. {{ doctor.first_name }} {{ doctor.last_name }}",
)

# Add an LLM-generated medical note
config_builder.add_column(
    LLMTextColumnConfig(
        name="medical_notes",
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
        prompt=(
            "Write a brief medical note from {{ doctor_name }} about patient {{ patient_name }}, "
            "a {{ patient.age }}-year-old {{ patient.sex }} with {{ medical_condition }}. \n"
            "Include relevant medical observations and recommendations. \n"
            "The patient lives in {{ patient.city }}, {{ patient.state }} and works as {{ patient.occupation }}. \n"
            "Keep the note professional, concise (3-4 sentences), and medically accurate.\n"
        ),
    )
)

# Add an LLM-generated patient message
config_builder.add_column(
    LLMTextColumnConfig(
        name="patient_message",
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
        prompt=(
            "Write a brief message (1-2 sentences) from {{ patient_name }} to {{ doctor_name }} "
            "about their {{ medical_condition }}. The message should reflect the patient's "
            "experience and concerns. The patient is {{ patient.age }} years old."
        ),
    )
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[
    [
        "patient_name",
        "doctor_name",
        "medical_condition",
        "medical_notes",
        "patient_message",
    ]
]

### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=20)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-person-sampler-tutorial",
);