# 🧑‍🤝‍🧑 NeMo Data Designer: Person Sampler Tutorial

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

In this notebook, we'll explore how you can generate realistic personal information for your synthetic datasets.


## What is the Person Sampler?

The Person Sampler is a powerful feature in Data Designer that generates consistent, realistic person records with attributes like:
- Names (first, middle, last)
- Contact information (email, phone)
- Addresses (street, city, state, zip)
- Demographics (age, sex, ethnic background)
- IDs (SSN, UUID)
- And more!

These records are fully synthetic but maintain the statistical properties and formatting patterns of real personal data.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.

In [1]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [5]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))
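
The base URL above is hard-coded for a local deployment. The sketch below shows one way to read it from an environment variable instead, so the same notebook can point at a different deployment. The variable name `NEMO_BASE_URL` and the helper `resolve_base_url` are hypothetical and not part of the NeMo SDK.

```python
import os

# Local docker compose deployment used in this notebook.
DEFAULT_BASE_URL = "http://localhost:8080"


def resolve_base_url(env=None):
    """Return the URL from NEMO_BASE_URL (hypothetical name), else the local default."""
    env = os.environ if env is None else env
    url = env.get("NEMO_BASE_URL", "").strip()
    # Drop a trailing slash so later path joins stay predictable.
    return url.rstrip("/") if url else DEFAULT_BASE_URL


base_url = resolve_base_url()
# Then, as in the cell above:
# data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=base_url))
```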

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# The model endpoint is configured at deployment time via the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)

## 1. Basic Person Sampling

Let's start with a simple example of generating person data. We restrict the sampler to the `en_US` locale and to male records, and leave every other setting at its default.

In [8]:
# Add a simple person column (en_US locale, male records; all other settings default)
config_builder.add_column(
    C.SamplerColumn(
        name="person",  # This creates a nested object with all person attributes
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(locale="en_US", sex="Male")
    )
)

# Preview what the generated data looks like
preview = data_designer_client.preview(config_builder)
preview.dataset

## 2. Accessing Individual Person Attributes

The `person` column we created above is a nested object with many attributes. Let's create some columns to access specific attributes from this person object.

In [10]:
# Add columns to extract specific attributes from the person object
config_builder.add_column(
    C.ExpressionColumn(
        name="full_name",
        expr="{{ person.first_name }} {{ person.last_name }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="email",
        expr="{{ person.email_address }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="address",
        expr="{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="age",
        expr="{{ person.age }}"
    )
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[['full_name', 'email', 'address', 'age']]
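
The expressions above look up attributes of the nested `person` object by dotted path. As a rough illustration of that lookup, here is a plain-Python sketch that uses a regular expression in place of the template engine. It is not how Data Designer renders expressions, the `render` helper is hypothetical, and the record values are illustrative only.

```python
import re

# Illustrative record only; real values come from the Person Sampler.
record = {
    "person": {
        "first_name": "John",
        "last_name": "Smith",
        "email_address": "john.smith@example.com",
        "age": 42,
    }
}

# Matches placeholders such as {{ person.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(expr, row):
    """Replace each {{ a.b }} placeholder with the value found at that dotted path."""

    def lookup(match):
        value = row
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError on a misspelled attribute
        return str(value)

    return PLACEHOLDER.sub(lookup, expr)


full_name = render("{{ person.first_name }} {{ person.last_name }}", record)
age = render("{{ person.age }}", record)
```

In this sketch every rendered value is a string, including `age`. Whether an expression column keeps the numeric type is not stated in this notebook, so inspect the preview's column types if that matters downstream.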

## 3. Customizing Person Generators

Now let's explore customizing the Person Sampler to generate specific types of profiles.

In [None]:
# Reset our config builder for this example
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.5,
                top_p=1.0,
            ),
        )
    ]
)

In [14]:
# Create custom person samplers for different roles/demographics
config_builder.add_column(
    C.SamplerColumn(
        name="employee",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_range=[22, 65],
            state="CA"
        )
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="customer",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_range=[18, 80]
        )
    )
)

# Create a UK-based person
config_builder.add_column(
    C.SamplerColumn(
        name="uk_contact",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_GB",  # UK locale
            city="London"
        )
    )
)

# Add columns to extract and format information
config_builder.add_column(
    C.ExpressionColumn(
        name="employee_info",
        expr="{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="customer_info",
        expr="{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="uk_contact_info",
        expr="{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number }} - {{ uk_contact.city }}"
    )
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[['employee_info', 'customer_info', 'uk_contact_info']]
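
After constraining a sampler, it can be worth checking that the preview respects the constraints. The sketch below does this over plain dictionaries standing in for the nested person objects. The `violations` helper is hypothetical, the sample records are made up, and treating both `age_range` bounds as inclusive is an assumption.

```python
def violations(people, age_range=None, state=None):
    """Return the records that fall outside the requested constraints."""
    bad = []
    for person in people:
        if age_range is not None:
            low, high = age_range
            # Assumption: both bounds are inclusive.
            if not (low <= person["age"] <= high):
                bad.append(person)
                continue
        if state is not None and person.get("state") != state:
            bad.append(person)
    return bad


# Illustrative records standing in for the `employee` column of a preview.
employees = [
    {"first_name": "Ana", "age": 34, "state": "CA"},
    {"first_name": "Ben", "age": 70, "state": "CA"},  # outside [22, 65]
    {"first_name": "Cy", "age": 40, "state": "NV"},  # wrong state
]

flagged = violations(employees, age_range=(22, 65), state="CA")
```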

## 4. Available Person Attributes

The Person Sampler generates a rich set of attributes that you can use. Here's a reference list of some of the key attributes available:

| Attribute | Description | Example |
|-----------|-------------|--------|
| `first_name` | Person's first name | "John" |
| `middle_name` | Person's middle name (may be None) | "Robert" |
| `last_name` | Person's last name | "Smith" |
| `sex` | Person's sex | "Male" |
| `age` | Person's age in years | 42 |
| `birth_date` | Date of birth | "1980-05-15" |
| `email_address` | Email address | "john.smith@example.com" |
| `phone_number` | Phone number | "+1 (555) 123-4567" |
| `street_number` | Street number | "123" |
| `street_name` | Street name | "Main Street" |
| `unit` | Apartment/unit number | "Apt 4B" |
| `city` | City name | "Chicago" |
| `state` | State/province (locale dependent) | "IL" |
| `county` | County (locale dependent) | "Cook" |
| `zipcode` | Postal/ZIP code | "60601" |
| `country` | Country name | "United States" |
| `ssn` | Social Security Number (US locale) | "123-45-6789" |
| `occupation` | Occupation | "Software Engineer" |
| `marital_status` | Marital status | "Married" |
| `education_level` | Education level | "Bachelor's Degree" |
| `ethnic_background` | Ethnic background | "Caucasian" |
| `uuid` | Unique identifier | "550e8400-e29b-41d4-a716-446655440000" |
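
Because `middle_name` may be `None`, code that assembles a display name should skip missing parts rather than print them. A minimal sketch, assuming each person record is available as a dictionary keyed by the attribute names in the table; `format_full_name` is a hypothetical helper.

```python
def format_full_name(person):
    """Join first, middle, and last name, skipping any part that is missing or empty."""
    parts = [
        person.get("first_name"),
        person.get("middle_name"),
        person.get("last_name"),
    ]
    return " ".join(part for part in parts if part)


with_middle = format_full_name(
    {"first_name": "John", "middle_name": "Robert", "last_name": "Smith"}
)
without_middle = format_full_name(
    {"first_name": "John", "middle_name": None, "last_name": "Smith"}
)
```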

## 5. Creating Multiple Person Samplers with One Method

For convenience, Data Designer provides a `with_person_samplers` method to create multiple person samplers at once.

In [None]:
# Reset our config builder for this example
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.5,
                top_p=1.0,
            ),
        )
    ]
)

In [17]:
# Create multiple person samplers at once
config_builder.with_person_samplers({
    "doctor": {"locale": "en_US", "age_range": [30, 70]},
    "patient": {"locale": "en_US", "age_range": [18, 90]},
    "nurse": {"locale": "en_US", "age_range": [25, 65], "sex": "Female"},
    "international_doctor": {"locale": "fr_FR", "age_range": [35, 65]}
})

# Add columns to format information for each person type
config_builder.add_column(
    C.ExpressionColumn(
        name="doctor_profile",
        expr="Dr. {{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="patient_profile",
        expr="{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="nurse_profile",
        expr="Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="international_doctor_profile",
        expr="Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}"
    )
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[['doctor_profile', 'patient_profile', 'nurse_profile', 'international_doctor_profile']]
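
When many roles share the same locale, the mapping passed to `with_person_samplers` can be built programmatically. The sketch below produces a dictionary in the same shape as the one used above and rejects inverted age ranges before anything is submitted. `build_sampler_specs` is a hypothetical helper, not part of the Data Designer API.

```python
def build_sampler_specs(age_ranges, locale="en_US", overrides=None):
    """Build a {name: params} mapping in the shape passed to with_person_samplers."""
    overrides = overrides or {}
    specs = {}
    for name, (low, high) in age_ranges.items():
        if low > high:
            raise ValueError(
                f"{name}: age_range lower bound {low} exceeds upper bound {high}"
            )
        spec = {"locale": locale, "age_range": [low, high]}
        # Per-role extras, such as a fixed sex, are layered on top.
        spec.update(overrides.get(name, {}))
        specs[name] = spec
    return specs


specs = build_sampler_specs(
    {"doctor": (30, 70), "patient": (18, 90), "nurse": (25, 65)},
    overrides={"nurse": {"sex": "Female"}},
)
# config_builder.with_person_samplers(specs)
```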

## 6. Using Person Data with LLM Generation

One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content.

In [None]:
# Reset our config builder for this example
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.5,
                top_p=1.0,
            ),
        )
    ]
)

In [21]:
# Create person samplers for patients and doctors
config_builder.with_person_samplers({
    "patient": {"locale": "en_US", "age_range": [18, 85]},
    "doctor": {"locale": "en_US", "age_range": [30, 70]}
})

# Sample a medical condition from a fixed list of categories
config_builder.add_column(
    C.SamplerColumn(
        name="medical_condition",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Hypertension",
                "Type 2 Diabetes",
                "Asthma",
                "Rheumatoid Arthritis",
                "Migraine",
                "Hypothyroidism"
            ]
        )
    )
)

# Add basic info columns
config_builder.add_column(
    C.ExpressionColumn(
        name="patient_name",
        expr="{{ patient.first_name }} {{ patient.last_name }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="doctor_name",
        expr="Dr. {{ doctor.first_name }} {{ doctor.last_name }}"
    )
)

# Add an LLM-generated medical note
config_builder.add_column(
    C.LLMTextColumn(
        name="medical_notes",
        model_alias=model_alias,
        prompt=(
            "Write a brief medical note from {{ doctor_name }} about patient {{ patient_name }}, "
            "a {{ patient.age }}-year-old {{ patient.sex }} with {{ medical_condition }}. "
            "Include relevant medical observations and recommendations. "
            "The patient lives in {{ patient.city }}, {{ patient.state }} and works as {{ patient.occupation }}. "
            "Keep the note professional, concise (3-4 sentences), and medically accurate."
        )
    )
)

# Add an LLM-generated patient message
config_builder.add_column(
    C.LLMTextColumn(
        name="patient_message",
        model_alias=model_alias,
        prompt=(
            "Write a brief message (1-2 sentences) from {{ patient_name }} to {{ doctor_name }} "
            "about their {{ medical_condition }}. The message should reflect the patient's "
            "experience and concerns. The patient is {{ patient.age }} years old."
        )
    )
)

# Preview the results
preview = data_designer_client.preview(config_builder)
preview.dataset[['patient_name', 'doctor_name', 'medical_condition', 'medical_notes', 'patient_message']]
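
The prompts above ask for a fixed number of sentences, but an LLM will not always comply. The sketch below is a rough post-hoc check that counts sentences with a regular expression. Both helpers are hypothetical and the sample note is made up.

```python
import re


def sentence_count(text):
    """Rough count: split after '.', '!' or '?' when followed by whitespace."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([piece for piece in pieces if piece])


def within_length(text, low, high):
    """True when the rough sentence count lies between low and high, inclusive."""
    return low <= sentence_count(text) <= high


# Illustrative text standing in for one generated `medical_notes` value.
note = (
    "Patient reports well-controlled asthma. Lung sounds are clear. "
    "Continue current inhaler. Follow up in six months."
)
note_ok = within_length(note, 3, 4)
```

This heuristic treats abbreviations such as "Dr." as sentence ends, so notes that mention the doctor by title will be over-counted. Treat the result as a flag for manual review rather than a hard filter.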

## 7. Generating and Saving the Final Dataset

Now that we've explored the Person Sampler capabilities, let's generate a complete dataset with the `create` method and load the results so they can be saved or analyzed further.

In [None]:
# Submit the generation job without blocking on it
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

# Block until the job has finished
job_results.wait_until_done()

In [None]:
dataset = job_results.load_dataset()
print(f"Generated dataset with {len(dataset)} records")
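
To persist the loaded dataset, one option is to write it out as JSON Lines. The sketch below uses only the standard library and operates on a list of dictionaries. It assumes `load_dataset()` returns a pandas DataFrame, in which case `dataset.to_dict(orient="records")` yields such a list. The `save_jsonl` helper, the file name, and the sample records are illustrative.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line; default=str covers dates and similar values."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, default=str, ensure_ascii=False) + "\n")
    return len(records)


# With a pandas DataFrame: records = dataset.to_dict(orient="records")
records = [
    {"patient_name": "John Smith", "medical_condition": "Asthma"},
    {"patient_name": "Ana Lopez", "medical_condition": "Migraine"},
]
output_path = os.path.join(tempfile.mkdtemp(), "person_sampler_dataset.jsonl")
written = save_jsonl(records, output_path)
```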

## Conclusion

In this tutorial, we've explored the Person Sampler functionality in Data Designer. We've learned how to:

1. Generate basic person records with realistic attributes
2. Customize person profiles by locale, age, sex, and location
3. Create multiple person samplers for different roles or demographics
4. Use person attributes in expressions and LLM prompts

The Person Sampler is an essential tool for creating realistic synthetic datasets for testing, development, and training applications that handle personal information.