# 🧾 NeMo Data Designer: W-2 Dataset Generator

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

In this notebook we demonstrate how you can combine numerical samplers, the person sampler and LLMs to create a synthetic dataset of W-2 forms (US Wage & Tax Statements).

### Generating realistic numerical values

We will generate numerical fields using statistics published by the IRS for the year 2021:

- https://www.irs.gov/pub/irs-pdf/p5385.pdf

### Generating realistic taxpayers

We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics for generated persons reflect real-world census data.



#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.


In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.9,
            ),
            is_reasoner=True
        ),
    ]
)

## Setting up taxpayer and employer sampling

In [None]:
# Create samplers for an American taxpayer (employee) and an employer.
config_builder.add_column(
    C.SamplerColumn(
        name="taxpayer",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_range=[18, 75]
        ),
    )
)

# While the employer isn't technically a "person", we'll use the person sampler for generating the employer address.
config_builder.add_column(
    C.SamplerColumn(
        name="employer",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
        ),
    )
)

## Defining the fields

We will focus on the following:
- Box 1 (Wages, tips, and other compensation)
- Box 2 (Federal income tax withheld)
- Box 3 (Social security wages)
- Box 4 (Social security tax withheld)
- Box 5 (Medicare wages and tips)
- Box 6 (Medicare tax withheld)
- Box 7 (Social security tips)
- Box a (Employee's social security number)
- Box c (Employer's name, address and zip code)
- Box e (Employee's first name, initial, and last name)
- Box f (Employee's address and zip code)

### Numerical fields

Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). We'll use the W-2 statistics from the IRS linked above to generate realistic samples.
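
As a rough illustration of the sampler parameters used in the next cell, the sketch below derives the non-zero probability `p` and the exponential `scale` for Box 1 from the IRS counts quoted in this notebook, then draws from an equivalent Bernoulli-exponential mixture using only the standard library. The helper `sample_bernoulli_expon` is hypothetical, and the sketch assumes the mixture returns zero with probability `1 - p`; it is not the Data Designer implementation.

```python
import random

# Figures quoted from the IRS statistics for Box 1.
total_forms = 277_981_454
nonzero_forms = 276_388_660
total_wages = 9_920_000_000 * 1000  # reported in thousands of dollars

p = nonzero_forms / total_forms      # share of forms with a non-zero Box 1
scale = total_wages / nonzero_forms  # mean of the non-zero values


def sample_bernoulli_expon(p: float, scale: float, rng: random.Random) -> int:
    """Return 0 with probability 1 - p, else an exponential draw with mean `scale`."""
    if rng.random() >= p:
        return 0
    return int(rng.expovariate(1.0 / scale))


rng = random.Random(0)
samples = [sample_bernoulli_expon(p, scale, rng) for _ in range(100_000)]
nonzero_count = sum(s > 0 for s in samples)
nonzero_share = nonzero_count / len(samples)
mean_nonzero = sum(samples) / max(1, nonzero_count)

print(f"p={p:.3f}, scale={scale:,.2f}")
print(f"sampled non-zero share={nonzero_share:.3f}, mean of non-zero={mean_nonzero:,.0f}")
```

The same two-step derivation (share of non-zero forms, then total divided by the non-zero count) applies to Box 7 with its own figures.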

In [None]:
### BOX 1 (TOTAL WAGES, TIPS, AND OTHER COMPENSATION) ###

# From Page 6 of the IRS Statistics, we know that 276,388,660 / 277,981,454 W-2 forms had a non-zero value for Box 1 (99.4%).
# From Page 8 of the IRS Statistics, we know that the sum of this field across all forms was 9,920,000,000*$1000 = $9,920,000,000,000.
# Since there were 276,388,660 non-zero Box 1 values, the average value of Box 1 was $9,920,000,000,000 / 276,388,660 = $35,891.49.
# We will use a Bernoulli-Exponential mixture distribution to sample values for this field.
config_builder.add_column(
    C.SamplerColumn(
        name="box_1_wages_tips_other_compensation",
        type=P.SamplerType.BERNOULLI_MIXTURE,
        params=P.BernoulliMixtureSamplerParams(
            p=0.994,
            dist_name="expon",
            dist_params={"scale": 35891.49}
        ),
        convert_to="int",
    )
)

### BOX 2 (FEDERAL INCOME TAX WITHHELD) ###

# Note: The calculations below are a simplification based on the assumption that this is an individual's only W-2.
# In practice, the taxable income is based on all wages for individuals with multiple W-2s.

# 2022 standard deduction
config_builder.add_column(
    C.ExpressionColumn(
        name="standard_deduction",
        expr="{% if taxpayer.marital_status == 'married_present' %}25900{% else %}12950{% endif %}",
        convert_to="float",
    ),
)

config_builder.add_column(
    C.ExpressionColumn(
        name="taxable_income",
        expr="{{ [0, box_1_wages_tips_other_compensation - standard_deduction]|max }}",
        convert_to="float",
    )
)

# We'll sum over the tax incurred at each 2022 tax bracket.
# For simplicity, we'll assume that the taxpayer is single here.
BRACKETS = [
    {"name": "bracket1", "rate": 0.10, "max": 10275, "min": 0},
    {"name": "bracket2", "rate": 0.12, "max": 41775, "min": 10275},
    {"name": "bracket3", "rate": 0.22, "max": 89075, "min": 41775},
    {"name": "bracket4", "rate": 0.24, "max": 170050, "min": 89075},
    {"name": "bracket5", "rate": 0.32, "max": 215950, "min": 170050},
    {"name": "bracket6", "rate": 0.35, "max": 539900, "min": 215950},
    {"name": "bracket7", "rate": 0.37, "max": 10000000000000, "min": 539900},
]
for bracket in BRACKETS:
    expression = f"{bracket['rate']}*([[taxable_income,{bracket['max']}]|min - {bracket['min']}, 0] | max)"
    config_builder.add_column(
        C.ExpressionColumn(
            name=bracket["name"],
            expr="{{ " + expression + " }}",
            convert_to="float",
        )
    )

# Sum the tax brackets to get the total withheld, on average
config_builder.add_column(
    C.ExpressionColumn(
        name="mean_tax_liability",
        expr="{{ bracket1 + bracket2 + bracket3 + bracket4 + bracket5 + bracket6 + bracket7 }}",
        convert_to="int",
    )
)

# Add some noise to get the actual withholding
config_builder.add_column(
    C.SamplerColumn(
        name="tax_liability_noise",
        type=P.SamplerType.GAUSSIAN,
        params=P.GaussianSamplerParams(mean=1, stddev=0.1),
    )
)
config_builder.add_column(
    C.ExpressionColumn(
        name="box_2_federal_income_tax_withheld",
        expr="{{ (mean_tax_liability * tax_liability_noise) | int }}",
    )
)

### BOX 3 (SOCIAL SECURITY WAGES) ###

# From Page 8 of the IRS Statistics, we know that social security wages are, on average, 8,150,000,000/9,920,000,000 ~= 82.16% of total wages.
# We'll sample a ratio from a normal distribution with mean 0.8216 and standard deviation 0.2.
config_builder.add_column(
    C.SamplerColumn(
        name="social_security_wages_ratio",
        type=P.SamplerType.GAUSSIAN,
        params=P.GaussianSamplerParams(mean=0.8216, stddev=0.2),
        convert_to="float",
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="box_3_social_security_wages",
        expr="{{ (box_1_wages_tips_other_compensation * social_security_wages_ratio) | int }}",
    )
)

### BOX 4 (SOCIAL SECURITY TAX WITHHELD) ###

# In 2022, Social Security tax was withheld at 6.2% of Social Security wages, up to the wage base of $147,000.
config_builder.add_column(
    C.ExpressionColumn(
        name="box_4_social_security_tax_withheld",
        expr="{{ (([box_3_social_security_wages, 147000]|min) * 0.062) | int }}",
    )
)

### BOX 5 (MEDICARE WAGES AND TIPS) ###

# From Page 8 of the IRS Statistics, we know that Medicare wages and tips are, on average, 10,300,000,000/9,920,000,000 ~= 103.8% of total wages.
config_builder.add_column(
    C.SamplerColumn(
        name="medicare_wages_and_tips_ratio",
        type=P.SamplerType.GAUSSIAN,
        params=P.GaussianSamplerParams(mean=1.038, stddev=0.2),
        convert_to="float",
    )
)

config_builder.add_column(
    C.ExpressionColumn(
        name="box_5_medicare_wages_and_tips",
        expr="{{ (box_1_wages_tips_other_compensation * medicare_wages_and_tips_ratio) | int }}",
    )
)

### BOX 6 (MEDICARE TAX WITHHELD) ###

# The standard employee Medicare tax rate in 2022 was 1.45% on all Medicare wages.
# The Additional Medicare Tax rate in 2022 was 0.9% on all Medicare wages in excess of $200,000.
config_builder.add_column(
    C.ExpressionColumn(
        name="box_6_medicare_tax_withheld",
        expr="{{ ((box_5_medicare_wages_and_tips * 0.0145) + (([box_5_medicare_wages_and_tips - 200000, 0]|max) * 0.009)) | int }}",
    )
)

### BOX 7 (SOCIAL SECURITY TIPS) ###

# From Page 6 of the IRS Statistics, we know that only 12,620,946 / 277,981,454 W-2 forms had a non-zero value for Box 7 (4.54%).
# From Page 8 of the IRS Statistics, we know that the sum of this field across all forms was 55,897,014*$1000 = $55,897,014,000.
# Since there were 12,620,946 non-zero Box 7 values, the average value of Box 7 was $55,897,014,000 / 12,620,946 = $4428.91.
# We will use a Bernoulli-Exponential mixture distribution to sample values for this field.
config_builder.add_column(
    C.SamplerColumn(
        name="box_7_social_security_tips",
        type=P.SamplerType.BERNOULLI_MIXTURE,
        params=P.BernoulliMixtureSamplerParams(
            p=0.0454,
            dist_name="expon",
            dist_params={"scale": 4428.91}
        ),
        convert_to="int",
    )
)
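
The Jinja expressions for Boxes 2, 4, and 6 above can be cross-checked in plain Python. The sketch below reproduces the same arithmetic (2022 single-filer brackets, the $147,000 Social Security wage base, and the 0.9% Additional Medicare Tax above $200,000) without the Gaussian noise or the `int` truncation. The function names are hypothetical and exist only for this illustration.

```python
# 2022 single-filer brackets as (rate, lower bound, upper bound).
BRACKETS_2022_SINGLE = [
    (0.10, 0, 10_275),
    (0.12, 10_275, 41_775),
    (0.22, 41_775, 89_075),
    (0.24, 89_075, 170_050),
    (0.32, 170_050, 215_950),
    (0.35, 215_950, 539_900),
    (0.37, 539_900, float("inf")),
]


def federal_income_tax(taxable_income: float) -> float:
    """Sum the tax owed in each bracket, mirroring the per-bracket columns."""
    return sum(
        rate * max(min(taxable_income, upper) - lower, 0)
        for rate, lower, upper in BRACKETS_2022_SINGLE
    )


def social_security_tax(ss_wages: float) -> float:
    """6.2% of Social Security wages, capped at the $147,000 wage base."""
    return min(ss_wages, 147_000) * 0.062


def medicare_tax(medicare_wages: float) -> float:
    """1.45% of all Medicare wages plus 0.9% of wages above $200,000."""
    return medicare_wages * 0.0145 + max(medicare_wages - 200_000, 0) * 0.009


wages = 60_000
taxable = max(0, wages - 12_950)  # single-filer standard deduction
print(round(federal_income_tax(taxable), 2))
print(round(social_security_tax(wages), 2))
print(round(medicare_tax(wages), 2))
```

Comparing these values against a few generated rows (before noise is applied) is a quick way to confirm that the bracket columns sum as intended.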

### Non-numerical fields

The remaining fields contain information about the employee (taxpayer) and the employer. We'll use the person sampler in combination with an LLM to generate values here.

In [None]:
### BOX A (EMPLOYEE'S SOCIAL SECURITY NUMBER) ###

# We can use the ssn field of the person sampler to generate a valid SSN for the employee.

config_builder.add_column(
    C.ExpressionColumn(
        name="box_a_employee_ssn",
        expr="{{ taxpayer.ssn }}",
    )
)

### BOX C (EMPLOYER'S NAME, ADDRESS AND ZIP CODE) ###

# We want to generate a realistic company name.
# We'll start by generating an industry category for the employer with the LLM.
config_builder.add_column(
    C.LLMTextColumn(
        name="employer_business",
        model_alias=model_alias,
        system_prompt=("You are assisting a user in generating synthetic W-2 forms. "
                       "You must generate a realistic industry category for the employer, "
                       "e.g. software, health insurance, shoe store, restaurant, plumbing."),
        prompt=("Generate the industry category for the employer. Ensure it is consistent with the employer location.\n"
                "City: {{ employer.city }}\nState: {{ employer.state }}"),
    )
)

# Next, we'll generate an actual name based on the type of business.
config_builder.add_column(
    C.LLMTextColumn(
        name="employer_name",
        model_alias=model_alias,
        prompt="Generate an original name for a {{ employer_business }} business in {{ employer.city }}.",
    )
)

# Finally, we'll combine the employer name with the address of the employer.
config_builder.add_column(
    C.ExpressionColumn(
        name="box_c_employer_name_address_zip",
        expr="{{ employer_name }}\n{{ employer.street_number }} {{ employer.street_name }}\n{{ employer.city }}, {{ employer.state }} {{ employer.postcode }}",
    )
)

### BOX E (EMPLOYEE'S FIRST NAME, INITIAL, AND LAST NAME) ###

# We can extract the first name, initial, and last name from the person sampler.

config_builder.add_column(
    C.ExpressionColumn(
        name="box_e_employee_first_name_initial_last_name",
        expr="{{ taxpayer.first_name }} {{ taxpayer.middle_name[:1] }} {{ taxpayer.last_name }}",
    )
)

### BOX F (EMPLOYEE'S ADDRESS AND ZIP CODE) ###

# Similarly, we can extract the employee's address and zip code from the person sampler.

config_builder.add_column(
    C.ExpressionColumn(
        name="box_f_employee_address_zip",
        expr="{{ taxpayer.street_number }} {{ taxpayer.street_name }}\n{{ taxpayer.city }}, {{ taxpayer.state }} {{ taxpayer.postcode }}",
    )
)
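
To make the template logic concrete, the sketch below renders the Box e and Box f strings in plain Python from a hand-written record. The field names follow the expressions above, but the record values and the `format_box_e` / `format_box_f` helpers are invented for illustration, and real person-sampler output may differ in shape. Unlike the Jinja expression, `format_box_e` drops the initial when no middle name is present, which avoids a double space.

```python
# Hypothetical taxpayer record; values are invented for this example.
taxpayer = {
    "first_name": "Jordan",
    "middle_name": "Avery",
    "last_name": "Smith",
    "street_number": "123",
    "street_name": "Main Street",
    "city": "Springfield",
    "state": "IL",
    "postcode": "62701",
}


def format_box_e(person: dict) -> str:
    """First name, middle initial, last name; skips the initial if it is missing."""
    initial = (person.get("middle_name") or "")[:1]
    parts = [person["first_name"], initial, person["last_name"]]
    return " ".join(part for part in parts if part)


def format_box_f(person: dict) -> str:
    """Street line followed by a 'City, ST ZIP' line."""
    return (
        f"{person['street_number']} {person['street_name']}\n"
        f"{person['city']}, {person['state']} {person['postcode']}"
    )


print(format_box_e(taxpayer))
print(format_box_f(taxpayer))
```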

## Preview the dataset

We'll define the actual columns we want to appear in the dataset and generate a small 10-row preview.

In [None]:
# These are the columns we want in the final dataset, after dropping latent variables.
FINAL_COLUMNS = [
    "box_1_wages_tips_other_compensation",
    "box_2_federal_income_tax_withheld",
    "box_3_social_security_wages",
    "box_4_social_security_tax_withheld",
    "box_5_medicare_wages_and_tips",
    "box_6_medicare_tax_withheld",
    "box_7_social_security_tips",
    "box_a_employee_ssn",
    "box_c_employer_name_address_zip",
    "box_e_employee_first_name_initial_last_name",
    "box_f_employee_address_zip",
]

# Preview the results
preview = data_designer_client.preview(config_builder, verbose_logging=True)
preview.dataset[FINAL_COLUMNS]
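
Before scaling up, it can be useful to check that preview rows satisfy the relationships built into the config. The sketch below is one hypothetical set of checks on a single row represented as a dict (assuming `preview.dataset` is a pandas DataFrame, `preview.dataset.to_dict("records")` would produce such rows). The `check_w2_row` helper, the SSN layout it expects, and the sample row are assumptions, not part of Data Designer.

```python
import re

SS_WAGE_BASE_2022 = 147_000
SS_RATE = 0.062


def check_w2_row(row: dict) -> list[str]:
    """Return a list of human-readable problems found in one W-2 row."""
    problems = []
    # Currency boxes should never be negative.
    for name in (
        "box_1_wages_tips_other_compensation",
        "box_2_federal_income_tax_withheld",
        "box_3_social_security_wages",
        "box_4_social_security_tax_withheld",
    ):
        if row[name] < 0:
            problems.append(f"{name} is negative")
    # Box 4 is capped by the wage base times the 6.2% rate (plus rounding slack).
    if row["box_4_social_security_tax_withheld"] > SS_WAGE_BASE_2022 * SS_RATE + 1:
        problems.append("box_4 exceeds the maximum Social Security tax")
    # SSNs are assumed to use the ###-##-#### layout.
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", str(row["box_a_employee_ssn"])):
        problems.append("box_a is not formatted as ###-##-####")
    return problems


# Invented row for illustration only.
sample_row = {
    "box_1_wages_tips_other_compensation": 52_000,
    "box_2_federal_income_tax_withheld": 4_500,
    "box_3_social_security_wages": 42_700,
    "box_4_social_security_tax_withheld": 2_647,
    "box_a_employee_ssn": "123-45-6789",
}
print(check_w2_row(sample_row))
```

Because the wage ratios are drawn from Gaussians, an occasional out-of-range row is possible, so checks like these are a cheap guard before launching a larger job.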

## Generating and Saving the Final Dataset

Once we're happy with the preview, we can generate a larger dataset.

In [None]:
# Generate a final dataset
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

job_results.wait_until_done()

In [None]:
# Load the dataset into a pandas DataFrame
dataset = job_results.load_dataset()

# Show the final dataset with only the W-2 relevant columns
final_dataset = dataset[FINAL_COLUMNS]

print(f"Generated dataset with {len(final_dataset)} records")


In [None]:
# Create data directory if it doesn't exist
import os
os.makedirs("./data", exist_ok=True)

# Save the dataset to CSV
csv_filename = "./data/synthetic-w2-dataset.csv"
final_dataset.to_csv(csv_filename, index=False)
print(f"Dataset saved to {csv_filename}")

# Show a sample of the final dataset
final_dataset.head()
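
One detail worth keeping in mind when consuming the saved file: the Box c and Box f values contain embedded newlines, so the CSV will hold quoted multi-line fields. The sketch below uses only the standard library and an invented record written to a temporary file. It suggests that a CSV-aware reader round-trips such fields, whereas counting raw lines would overstate the number of records.

```python
import csv
import os
import tempfile

# Invented record with a multi-line address, similar in shape to Box f.
record = {
    "box_a_employee_ssn": "123-45-6789",
    "box_f_employee_address_zip": "123 Main Street\nSpringfield, IL 62701",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(record))
        writer.writeheader()
        writer.writerow(record)

    # A CSV-aware reader keeps the embedded newline inside one field.
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    # Counting physical lines splits the single record across two lines.
    with open(path, newline="", encoding="utf-8") as handle:
        physical_lines = sum(1 for _ in handle)

print(len(rows), physical_lines)
```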