**LLM Enrichment using a Safe Synthetic Proxy**

This notebook demonstrates a privacy-first approach to enhancing sensitive datasets using Large Language Models (LLMs), without ever exposing the original data. We achieve this by introducing a synthetic proxy - a privacy-safe, high-fidelity synthetic version of your dataset. This proxy is shared with an LLM to generate enriched insights. These enriched insights are then learned by a generator, which is capable of transferring the enrichment logic back onto your real data without sharing it externally.

📋 Steps

1.   Create a synthetic proxy
2.   Enrich the proxy with an LLM
3.   Train a generator on the enriched proxy
4.   Apply the enrichment to the sensitive data




🔐 Key Benefits

* No data exposure: Original data stays secure.
* Enrichment at scale: LLMs enrich synthetic data; the generator brings that intelligence back.
* Reusable logic: Once trained, the generator acts as a secure enrichment adapter - no repeated LLM calls needed.



**Install Required Packages**

Install the Synthetic Data SDK and the `mostlyai-mock` library.

In [None]:
# install sdk
%pip install -U 'mostlyai[local]' mostlyai-mock

*Remember to restart your kernel after installing new packages.*

In [None]:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd

from mostlyai import mock

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
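
Hard-coding the key in a notebook cell makes it easy to commit by accident. The following is a minimal sketch of an alternative, assuming the key is either already exported in the environment or typed in interactively. `resolve_api_key` is a hypothetical helper written for this notebook, not part of any SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from `env`; prompt only if it is missing or still the placeholder."""
    key = env.get(name, "").strip()
    if not key or key == "YOUR_KEY_HERE":
        # Fall back to an interactive prompt that does not echo the input.
        key = prompt_fn(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is not set")
    return key
```

In the notebook this could replace the assignment above with `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.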

**Load Original Data**

Fetch a sample of the census dataset. It stands in for the sensitive, proprietary data that we want to enrich while keeping it private.

In [None]:
# load sample of original data
df_od = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz", nrows=2000)
print(df_od.head())

   age         workclass  fnlwgt  education  education_num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital_status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital_gain  capital_loss  hours_per_week native_country income  
0          2174             0              40  United-States  <=50K  
1             0             0             

**Initialize SDK**

The SDK will handle model training and synthetic data generation, while the `mostlyai-mock` library (imported above as `mock`) will provide the LLM enrichment capabilities.

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI(local=True)

**Train a Generator**

Train a generator on the sensitive original data.

In [None]:
# train generator on original data
g = mostly.train(data=df_od)
# g = mostly.generators.get('GENERATOR_ID')

**Generate Synthetic Data**

Create synthetic data that will act as a proxy for our sensitive original data. This synthetic data will be shared with the LLM and enriched.


In [None]:
# generate synthetic data as proxy
df_sd = mostly.generate(g, size=len(df_od)).data()


**Base configuration of the census table for the mock library**

Add the metadata description of the existing columns of the census table to support the enrichment process.

In [None]:
tables = {
    "census": {
        "prompt": "U.S. Census data with demographic and employment-related columns",
        "columns": {
            "age": {"prompt": "age in years (17-90)", "dtype": "integer"},
            "workclass": {
                "dtype": "category",
                "values": [
                    "?",
                    "Federal-gov",
                    "Local-gov",
                    "Never-worked",
                    "Private",
                    "Self-emp-inc",
                    "Self-emp-not-inc",
                    "State-gov",
                    "Without-pay",
                ],
            },
            "education": {
                "dtype": "category",
                "values": [
                    "10th",
                    "11th",
                    "12th",
                    "1st-4th",
                    "5th-6th",
                    "7th-8th",
                    "9th",
                    "Assoc-acdm",
                    "Assoc-voc",
                    "Bachelors",
                    "Doctorate",
                    "HS-grad",
                    "Masters",
                    "Preschool",
                    "Prof-school",
                    "Some-college",
                ],
            },
            "marital-status": {
                "dtype": "category",
                "values": [
                    "Divorced",
                    "Married-AF-spouse",
                    "Married-civ-spouse",
                    "Married-spouse-absent",
                    "Never-married",
                    "Separated",
                    "Widowed",
                ],
            },
            "occupation": {
                "dtype": "category",
                "values": [
                    "?",
                    "Adm-clerical",
                    "Armed-Forces",
                    "Craft-repair",
                    "Exec-managerial",
                    "Farming-fishing",
                    "Handlers-cleaners",
                    "Machine-op-inspct",
                    "Other-service",
                    "Priv-house-serv",
                    "Prof-specialty",
                    "Protective-serv",
                    "Sales",
                    "Tech-support",
                    "Transport-moving",
                ],
            },
            "relationship": {
                "dtype": "category",
                "values": ["Husband", "Not-in-family", "Other-relative", "Own-child", "Unmarried", "Wife"],
            },
            "race": {
                "dtype": "category",
                "values": ["Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"],
            },
            "sex": {"dtype": "category", "values": ["Female", "Male"]},
            "hours-per-week": {"prompt": "hours worked per week (1-99)", "dtype": "integer", "min": 1, "max": 99},
            "native-country": {
                "dtype": "category",
                "values": [
                    "?",
                    "Cambodia",
                    "Canada",
                    "China",
                    "Columbia",
                    "Cuba",
                    "Dominican-Republic",
                    "Ecuador",
                    "El-Salvador",
                    "England",
                    "France",
                    "Germany",
                    "Greece",
                    "Guatemala",
                    "Haiti",
                    "Holand-Netherlands",
                    "Honduras",
                    "Hong",
                    "Hungary",
                    "India",
                    "Iran",
                    "Ireland",
                    "Italy",
                    "Jamaica",
                    "Japan",
                    "Laos",
                    "Mexico",
                    "Nicaragua",
                    "Outlying-US(Guam-USVI-etc)",
                    "Peru",
                    "Philippines",
                    "Poland",
                    "Portugal",
                    "Puerto-Rico",
                    "Scotland",
                    "South",
                    "Taiwan",
                    "Thailand",
                    "Trinadad&Tobago",
                    "United-States",
                    "Vietnam",
                    "Yugoslavia",
                ],
            },
            "income": {"dtype": "category", "values": ["<=50K", ">50K"]},
        },
    }
}
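
The printed `df_od.head()` shows underscore names such as `marital_status`, while this configuration uses hyphens such as `marital-status`. This notebook does not confirm how the mock library treats configured columns that are absent from the existing data, so a quick comparison before enrichment may be worthwhile. The sketch below uses a hypothetical helper, `unmatched_columns`, and only the column names that appear in this notebook.

```python
def unmatched_columns(config_columns, data_columns):
    """Map each configured column missing from the data to a likely counterpart (or None)."""

    def normalize(name):
        # Treat hyphens and underscores as equivalent, ignoring case.
        return name.lower().replace("-", "_")

    by_normalized = {normalize(c): c for c in data_columns}
    present = set(data_columns)
    return {
        col: by_normalized.get(normalize(col))
        for col in config_columns
        if col not in present
    }


# Column names copied from the printed data and from the configuration above.
data_columns = ["age", "workclass", "marital_status", "hours_per_week", "native_country", "income"]
config_columns = ["age", "workclass", "marital-status", "hours-per-week", "native-country", "income"]

mismatches = unmatched_columns(config_columns, data_columns)
print(mismatches)
```

In the notebook, the same check could be run as `unmatched_columns(tables["census"]["columns"], df_sd.columns)`. If it reports mismatches, renaming the DataFrame columns with `df_sd.rename(columns={v: k for k, v in mismatches.items() if v})` is one way to align them.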

**Enrich Synthetic Data Proxy**

Use the LLM to enrich the synthetic data with two new columns: a specific job title and a work category. This is where we expose data to the LLM, but only the synthetic proxy data, not our sensitive original data.


In [None]:
# ──────────────────────────────────────────────────────────────
# 1.  House-keeping that only needs to run once
# ──────────────────────────────────────────────────────────────
batch_size = 100
n_rows = len(df_sd)

tables["census"]["columns"]["specific_job_title"] = {
    "prompt": (
        "Generate a realistic, specific job title for a person "
        "based on their occupation, education, and income level. "
        "The job title should be more specific than the general "
        "occupation category."
    ),
    "dtype": "string",
}

# Define the categories
categories = ["Manual Labor", "Service Work", "Professional", "Management", "Technical"]

tables["census"]["columns"]["work_category"] = {
    "prompt": """categorize the occupation into work category, considering the actual job duties and level.
                    Examples of correct categorizations:
                    - Handlers-cleaners → Manual Labor
                    - Machine-op-inspct → Manual Labor
                    - Craft-repair → Manual Labor
                    - Transport-moving → Manual Labor
                    - Farming-fishing → Manual Labor
                    - Exec-managerial → Management
                    - Prof-specialty → Professional
                    - Tech-support → Technical
                    - Sales → Service Work
                    - Other-service → Service Work

                    Categories and their meanings:
                    - Manual Labor: physical work, manufacturing, construction, cleaning, transportation, farming, machine operation, craft work, manual repairs, physical labor
                    - Service Work: customer service, retail, hospitality, food service, personal care, non-physical service roles
                    - Professional: doctors, lawyers, engineers, scientists, specialized knowledge workers
                    - Management: supervisors, executives, administrators, team leaders
                    - Technical: IT, technical support, specialized technical skills, maintenance""",
    "dtype": "category",
    "values": categories,
}


# ──────────────────────────────────────────────────────────────
# 2.  Helper that enriches **one** slice of rows
#     Executed in parallel by many threads.
# ──────────────────────────────────────────────────────────────
def enrich_slice(start: int, end: int):
    """Return the enriched batch and where it belongs in the final list."""
    batch_df = df_sd.iloc[start:end].reset_index(drop=True)

    enriched = mock.sample(
        tables=tables,
        existing_data={"census": batch_df},
        model="openai/gpt-4.1-nano",
    )
    print(f"Processed rows {start}–{end - 1}")
    return start, enriched  # keep the start index so we can restore order


# ──────────────────────────────────────────────────────────────
# 3.  Dispatch every slice to the thread pool
# ──────────────────────────────────────────────────────────────
slices = [(s, min(s + batch_size, n_rows)) for s in range(0, n_rows, batch_size)]

results_ordered = [None] * len(slices)  # preserve batch order

# Tune max_workers to match your rate-limit / CPU core budget
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(enrich_slice, s, e): idx for idx, (s, e) in enumerate(slices)}

    for fut in as_completed(futures):
        idx = futures[fut]
        try:
            _, enriched = fut.result()
            results_ordered[idx] = enriched
        except Exception as err:
            print(f"⚠️  Batch {idx} failed: {err}")

# ──────────────────────────────────────────────────────────────
# 4.  Combine the per-batch results in their original order.
#     pd.concat silently drops None entries, so a failed batch is
#     skipped here; check the warnings above before using the result.
# ──────────────────────────────────────────────────────────────
df_enriched = pd.concat(results_ordered, ignore_index=True)
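
The slicing and reordering logic can be checked in isolation, without calling an LLM. The sketch below assumes a stand-in worker, `fake_enrich`, in place of `mock.sample`. The helper names `make_slices` and `run_batches` are hypothetical. Results arrive in completion order but are written into their slot by slice index, and failed slice indices are returned rather than only printed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_slices(n_rows, batch_size):
    """Split range(n_rows) into consecutive half-open (start, end) slices."""
    return [(s, min(s + batch_size, n_rows)) for s in range(0, n_rows, batch_size)]


def fake_enrich(start, end):
    """Stand-in for the LLM call: returns one label per row in the slice."""
    return [f"row-{i}" for i in range(start, end)]


def run_batches(n_rows, batch_size, worker, max_workers=4):
    """Run `worker` on every slice in parallel; return ordered results and failed slice indices."""
    slices = make_slices(n_rows, batch_size)
    results = [None] * len(slices)
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, s, e): idx for idx, (s, e) in enumerate(slices)}
        for fut in as_completed(futures):
            idx = futures[fut]
            try:
                results[idx] = fut.result()
            except Exception:
                failed.append(idx)
    return results, sorted(failed)


results, failed = run_batches(n_rows=250, batch_size=100, worker=fake_enrich)
flat = [label for batch in results for label in batch]
```

Returning the failed indices makes it possible to retry only those slices instead of losing their rows when the batches are combined.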

In [None]:
df_enriched.head()

| | age | workclass | education | marital-status | occupation | relationship | race | sex | hours-per-week | native-country | income | specific_job_title | work_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | Self-emp-inc | Some-college | Divorced | Prof-specialty | Not-in-family | Other | Female | 50 | Mexico | <=50K | Marketing Specialist | Professional |
| 1 | 70 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 40 | United-States | >50K | Manufacturing Supervisor | Manual Labor |
| 2 | 29 | Private | Bachelors | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 40 | India | <=50K | IT Support Specialist | Technical |
| 3 | 45 | Local-gov | Masters | Married-civ-spouse | Prof-specialty | Husband | White | Male | 55 | United-States | >50K | Senior Data Scientist | Professional |
| 4 | 40 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 42 | Jamaica | <=50K | Janitorial Supervisor | Manual Labor |
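
The `work_category` prompt lists an explicit occupation-to-category mapping, so one way to spot-check the LLM output is to compare each row against it. The sketch below copies that mapping from the prompt and applies it to the five rows shown above. `find_violations` is a hypothetical helper, and occupations the prompt does not mention are only checked for a valid category.

```python
# Mapping copied from the examples in the work_category prompt.
EXPECTED = {
    "Handlers-cleaners": "Manual Labor",
    "Machine-op-inspct": "Manual Labor",
    "Craft-repair": "Manual Labor",
    "Transport-moving": "Manual Labor",
    "Farming-fishing": "Manual Labor",
    "Exec-managerial": "Management",
    "Prof-specialty": "Professional",
    "Tech-support": "Technical",
    "Sales": "Service Work",
    "Other-service": "Service Work",
}
CATEGORIES = {"Manual Labor", "Service Work", "Professional", "Management", "Technical"}


def find_violations(rows):
    """Return (row index, reason) for rows with an unknown or unexpected category."""
    problems = []
    for i, row in enumerate(rows):
        category = row["work_category"]
        expected = EXPECTED.get(row["occupation"])  # None if the prompt leaves it open
        if category not in CATEGORIES:
            problems.append((i, f"unknown category {category!r}"))
        elif expected is not None and category != expected:
            problems.append((i, f"expected {expected!r}, got {category!r}"))
    return problems


# The occupation / work_category pairs from the five rows displayed above.
rows = [
    {"occupation": "Prof-specialty", "work_category": "Professional"},
    {"occupation": "Machine-op-inspct", "work_category": "Manual Labor"},
    {"occupation": "Tech-support", "work_category": "Technical"},
    {"occupation": "Prof-specialty", "work_category": "Professional"},
    {"occupation": "Handlers-cleaners", "work_category": "Manual Labor"},
]
violations = find_violations(rows)
```

On the full table, the same check could be run as `find_violations(df_enriched.to_dict("records"))`.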


**Display Sample of Enriched Synthetic Data**

Show a random sample of the synthetic data, including the newly generated `work_category` and `specific_job_title` columns, to inspect the results of the LLM enrichment.

In [None]:
# display a random sample with the newly enriched columns
print(df_enriched.sample(n=10))

     age         workclass     education      marital-status  \
73    23           Private           9th       Never-married   
465   29       Federal-gov          12th       Never-married   
916   27      Self-emp-inc  Some-college  Married-civ-spouse   
902   40      Self-emp-inc     Doctorate  Married-civ-spouse   
920   42         State-gov          12th            Divorced   
178   40      Self-emp-inc     Doctorate  Married-civ-spouse   
349   37           Private          11th  Married-civ-spouse   
88    59         State-gov          12th             Widowed   
377   29      Self-emp-inc  Some-college           Separated   
134   41  Self-emp-not-inc       Masters  Married-civ-spouse   

            occupation    relationship                race     sex  \
73               Sales   Not-in-family  Asian-Pac-Islander  Female   
465  Handlers-cleaners       Own-child               Black    Male   
916     Prof-specialty         Husband               White    Male   
902      Other-

**Train a fresh Generator on the newly Enriched Synthetic Data**

Train a generator on the enriched synthetic data to encode the LLM intelligence into a reusable, privacy-safe enrichment model. This enables it to later apply the same intelligence to sensitive original data.

In [None]:
# train generator on enriched synthetic data
config = {
    "name": "Enriched Census",
    "tables": [
        {
            "name": "Census",
            "data": df_enriched,
            "tabularModelConfiguration": {
                "enableModelReport": False,  # skip the model report (it was failing for this table)
                "valueProtection": False,  # keep rare values such as specific job titles
                "maxTrainingTime": 2.0,  # minutes
            },
        }
    ],
}

g = mostly.train(config=config)


**Apply Enrichment to Original Data**

Now we use the generator trained on the enriched synthetic data to add the same new features to the original sensitive data. We do this by passing the original data as the seed to the generator: the seeded columns are kept fixed, and the generator fills in only the new columns, following the patterns it learned from the enriched proxy. Your sensitive data is never sent to the LLM, yet it gains the same enrichments.

In [None]:
# generate enriched original data using original data as seed
df_od_enriched = mostly.generate(g, seed=df_od).data()
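
If seeding works as described, the seeded columns should come back unchanged. The following sketch is one way to verify that on plain records. `changed_seed_rows` is a hypothetical helper, comparing values as strings to sidestep dtype differences is an assumption, and the sample rows are illustrative rather than actual output.

```python
def changed_seed_rows(seed_rows, enriched_rows):
    """Return indices of rows where any seeded column differs in the enriched output."""
    if len(seed_rows) != len(enriched_rows):
        raise ValueError("seed and enriched data differ in length")
    changed = []
    for i, (s_row, e_row) in enumerate(zip(seed_rows, enriched_rows)):
        shared = s_row.keys() & e_row.keys()
        if not shared:
            # Nothing to compare usually means the column names do not line up.
            raise ValueError(f"row {i}: no shared columns; check column names")
        # Compare as strings so that, for example, 40 and "40" count as equal.
        if any(str(s_row[c]) != str(e_row[c]) for c in shared):
            changed.append(i)
    return changed


# Illustrative rows only: seed values plus a category taken from the prompt mapping.
seed = [
    {"age": 50, "occupation": "Exec-managerial"},
    {"age": 38, "occupation": "Handlers-cleaners"},
]
enriched = [
    {"age": 50, "occupation": "Exec-managerial", "work_category": "Management"},
    {"age": 38, "occupation": "Handlers-cleaners", "work_category": "Manual Labor"},
]
changed = changed_seed_rows(seed, enriched)
```

In the notebook this could be called as `changed_seed_rows(df_od.to_dict("records"), df_od_enriched.to_dict("records"))`. An empty list means every shared column was preserved.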


**Display Sample of Enriched Original Data**

Show a random sample of the enriched original data, including the newly added `specific_job_title` and `work_category` columns, to see how the learned patterns were applied to your sensitive data.

In [None]:
# display a random sample with the newly enriched columns
print(df_od_enriched.sample(n=10))

     age         workclass     education      marital-status  \
630   47  Self-emp-not-inc  Some-college            Divorced   
283   60           Private     Bachelors       Never-married   
908   37  Self-emp-not-inc       HS-grad  Married-civ-spouse   
94    34         Local-gov     Bachelors  Married-civ-spouse   
75    27           Private       HS-grad       Never-married   
671   17                 ?          11th       Never-married   
757   33         Local-gov     Bachelors       Never-married   
223   54           Private       HS-grad  Married-civ-spouse   
427   23           Private       HS-grad       Never-married   
378   46  Self-emp-not-inc       Masters       Never-married   

            occupation   relationship   race     sex  hours-per-week  \
630      Other-service      Unmarried  Black  Female              40   
283     Prof-specialty  Not-in-family  White  Female              45   
908  Handlers-cleaners      Own-child  White    Male              45   
94     

**Conclusion**

This tutorial demonstrated how to securely enrich sensitive proprietary data by:
1. Creating a synthetic proxy
2. Enriching the proxy with an LLM
3. Training a generator on the enriched proxy
4. Applying the enrichment to the sensitive data

The sensitive data never leaves your secure environment, maintaining privacy while enabling LLM-based enrichment.