# 🎛️ NeMo Safe Synthesizer 101: The Basics

> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.


In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK. The notebook should take about 20 minutes to run.

After completing this notebook, you'll be able to:
- Use the NeMo Microservices SDK to interact with Safe Synthesizer
- Create novel synthetic data that follows the statistical properties of your input dataset
- Access an evaluation report on synthetic data quality and privacy


#### 💾 Install dependencies

**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.


In [None]:
import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder

import logging
logging.basicConfig(level=logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)

### ⚙️ Initialize the NeMo Safe Synthesizer Client

- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.
- `http://localhost:8080` is the default value for the client's `base_url` in the quickstart.
- If you use a managed or remote deployment, set the base URLs and tokens for that deployment instead.
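
For a remote deployment, one option is to read the base URL from the environment rather than editing the notebook. The sketch below assumes a hypothetical environment variable named `NEMO_BASE_URL` (this name is an illustration, not something the SDK defines) and falls back to the quickstart default.

```python
import os

QUICKSTART_BASE_URL = "http://localhost:8080"  # quickstart default from this notebook


def resolve_base_url(env=None, default=QUICKSTART_BASE_URL):
    """Return the base URL from NEMO_BASE_URL (hypothetical name), else the default."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default
    url = (env.get("NEMO_BASE_URL") or default).strip()
    # Drop a trailing slash so later path joins do not produce "//"
    return url.rstrip("/")


print(resolve_base_url({}))
print(resolve_base_url({"NEMO_BASE_URL": "https://nemo.example.com/"}))
```

The result could then be passed to the client as `NeMoMicroservices(base_url=resolve_base_url())`.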


In [None]:
client = NeMoMicroservices(
    base_url="http://localhost:8080",
)

NeMo DataStore is launched as one of the platform services, and we'll use it to manage storage. Set the following configuration:

In [None]:
datastore_config = {
    "endpoint": "http://localhost:3000/v1/hf",
    "token": "placeholder",
}

## 📥 Load input data

Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.

The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location.
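
To get a feel for what this kind of PII looks like in free text, the sketch below flags height and weight mentions with two simple regular expressions. The patterns are illustrative assumptions and are unrelated to how Safe Synthesizer detects or replaces PII; they only help you eyeball a few reviews.

```python
import re

# Illustrative patterns only: heights like 5'4" and weights like "125 lbs"
HEIGHT = re.compile(r"\b[4-6]\s*'\s*\d{1,2}\s*\"?")
WEIGHT = re.compile(r"\b\d{2,3}\s*(?:lbs?|pounds)\b", re.IGNORECASE)


def find_body_measurements(text):
    """Return height and weight mentions found in a review string."""
    return {
        "height": [m.group(0).strip() for m in HEIGHT.finditer(text)],
        "weight": [m.group(0) for m in WEIGHT.finditer(text)],
    }


sample = "I'm 5'4\" and 125 lbs, and the medium fits well."
print(find_body_measurements(sample))
```

After loading the data, the same helper could be applied to a handful of values from the `Review Text` column.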

In [None]:
# %uv pip install kagglehub scikit-learn tabulate

In [None]:
import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
raw_df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)
raw_df.head()

We create a holdout dataset that is used only to evaluate the downstream classifier. Safe Synthesizer never sees these rows.
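
As a rough check on the split sizes to expect, the arithmetic can be sketched as below. It assumes the test share is rounded up and the training set receives the remaining rows, which is how scikit-learn sizes a fractional `test_size` to the best of our understanding. The row count of 23,486 is an assumption about the sample dataset.

```python
import math


def split_sizes(n_rows, test_size):
    """Row counts for a fractional holdout: test rounds up, train gets the rest."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# 23,486 is an assumed row count for the sample dataset
print(split_sizes(23486, 0.2))  # → (18788, 4698)
```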

In [None]:
from sklearn.model_selection import train_test_split

df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)

print(f"Original df length: {len(raw_df)}")
print(f"Training df length: {len(df)}")
print(f"Testing df length:  {len(test_df)}")

## 🏗️ Create a Safe Synthesizer job

The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.
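
A fluent interface means each configuration method returns the builder itself, so calls can be chained into one expression. The toy class below is a hypothetical stand-in written only to illustrate that pattern. It is not the SDK's implementation, and its method bodies are invented for the example.

```python
class ToyJobBuilder:
    """Hypothetical stand-in used only to illustrate method chaining."""

    def __init__(self):
        self.config = {}

    def from_data_source(self, rows):
        self.config["n_input_rows"] = len(rows)
        return self  # returning self is what allows chaining

    def with_generate(self, num_records):
        self.config["num_records"] = num_records
        return self

    def create_job(self):
        # A real builder would submit to a service; this one returns a copy of the config
        return dict(self.config)


toy_job = (
    ToyJobBuilder()
    .from_data_source([{"Age": 33}, {"Age": 41}])
    .with_generate(num_records=5)
    .create_job()
)
print(toy_job)  # → {'n_input_rows': 2, 'num_records': 5}
```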

The following code creates and submits a job:
- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.
- `.from_data_source(df)`: set the input data source.
- `.with_datastore(datastore_config)`: configure model artifact storage.
- `.with_replace_pii()`: enable automatic replacement of PII.
- `.synthesize()`: train and generate synthetic data.
- `.with_generate(num_records=15000)`: set the number of synthetic records to generate.
- `.create_job()`: submit the job to the platform.


In [None]:
job = (
    SafeSynthesizerBuilder(client)
    .from_data_source(df)
    .with_datastore(datastore_config)
    .with_replace_pii()
    .synthesize()
    .with_generate(num_records=15000)
    .create_job()
)

print(f"job_id = {job.job_id}")
job.wait_for_completion()

print(f"Job finished with status {job.fetch_status()}")

In [None]:
# If your notebook shuts down, the job keeps running on the microservices platform.
# To reattach to it, uncomment the snippet below and replace "<job id>" with the
# job id printed by the previous cell.

# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob
# job = SafeSynthesizerJob(job_id="<job id>", client=client)

## 👀 View synthetic data

After the job completes, fetch the generated synthetic dataset.
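
Before using the fetched frame downstream, it can help to confirm that it carries the same columns as the input. The helper below is a minimal sketch working on plain lists of column names. The function name and the example column lists are illustrative.

```python
def compare_columns(real_columns, synthetic_columns):
    """Report columns missing from, or unexpectedly added to, the synthetic data."""
    real, synth = list(real_columns), list(synthetic_columns)
    return {
        "missing": [c for c in real if c not in synth],
        "extra": [c for c in synth if c not in real],
    }


# Example column lists chosen for illustration
real_cols = ["Age", "Rating", "Review Text", "Recommended IND"]
synth_cols = ["Age", "Rating", "Review Text"]
print(compare_columns(real_cols, synth_cols))
```

In this notebook the call would be `compare_columns(df.columns, synthetic_df.columns)`. Two empty lists indicate matching schemas.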

In [None]:
# Fetch the synthetic data created by the job
synthetic_df = job.fetch_data()
synthetic_df


## 📊 View evaluation report

An evaluation comparing the synthetic data to the input data is performed automatically. You can:

- **Inspect key scores**: overall synthetic data quality and privacy.
- **Download the full HTML report**: includes charts and detailed metrics.
- **Display the report inline**: useful when viewing in notebook environments.


In [None]:
# Print selected information from the job summary
summary = job.fetch_summary()
print(
    f"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}"
)
print(f"Data privacy score (0-10, higher is better): {summary.data_privacy_score}")


In [None]:
# Download the full evaluation report to your local machine
job.save_report("evaluation_report.html")

In [None]:
# Fetch and display the full evaluation report inline
# job.display_report_in_notebook()

## 🧪 Extrinsic Evaluation

This section covers **extrinsic evaluation**: the synthetic data is judged by how well a model trained on it performs on a real downstream task, here predicting `Recommended IND` for held-out real reviews. The comparison is a direct check of the synthetic data's utility.

- **Train Benchmark Model**: A model is trained on the **original training split** to establish a performance baseline.
- **Train Synthetic Model**: A second model with the same pipeline structure is trained on the **entire synthetic dataset**.
- **Compare Performance**: Both models are evaluated against the same **fixed holdout test set** (`X_test`, `y_test`).
- **Inspect Key Metrics**: The comparison focuses on **accuracy**, **ROC AUC**, and class-1 **precision** and **recall** to determine whether the synthetic model performs comparably to the benchmark.
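
To make the metrics concrete, the sketch below computes ROC AUC, precision, and recall from scratch on four toy predictions. It is a teaching aid under simple assumptions (binary labels 0/1, a 0.5 decision threshold), not a replacement for the scikit-learn functions used in the cells that follow.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # → [0, 0, 0, 1]

print(roc_auc(y_true, scores))           # → 0.75
print(precision_recall(y_true, y_pred))  # → (1.0, 0.5)
```

ROC AUC depends only on how the scores rank positives against negatives, while precision and recall depend on the chosen threshold, which is why the comparison reports both kinds of metric.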

In [None]:
# This script defines a scikit-learn pipeline for a classification task.
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from sklearn.base import clone

X_train = df.drop('Recommended IND', axis=1)
y_train = df['Recommended IND']

X_train['Review Text'] = X_train['Review Text'].fillna('')
X_train['Title'] = X_train['Title'].fillna('')

X_test = test_df.drop('Recommended IND', axis=1)
y_test = test_df['Recommended IND']

X_test['Review Text'] = X_test['Review Text'].fillna('')
X_test['Title'] = X_test['Title'].fillna('')

text_features = ['Review Text']
numerical_features = ['Age', 'Rating', 'Positive Feedback Count']
categorical_features = ['Division Name', 'Department Name', 'Class Name']

text_transformer = TfidfVectorizer(stop_words='english', max_features=5000)
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore') 

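preprocessor = ColumnTransformer(
    transformers=[
        # TfidfVectorizer expects a 1-D column of strings, so pass the column name, not a list
        ('text', text_transformer, text_features[0]),
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'  # columns not listed above (e.g. 'Title') are not used by the model
)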

model = LogisticRegression(solver='liblinear', random_state=42)

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])


In [None]:
# Train and evaluate a benchmark model pipeline, storing its performance metrics.
from sklearn.base import clone
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

# Clone so the shared pipeline template stays unfitted
original_pipeline = clone(full_pipeline)
print(f"\n--- Training Benchmark Model on Original Data ({len(X_train)} rows) ---")
original_pipeline.fit(X_train, y_train)

y_pred_original = original_pipeline.predict(X_test)
y_prob_original = original_pipeline.predict_proba(X_test)[:, 1]

results = {}
results['Original'] = {
    'Accuracy': accuracy_score(y_test, y_pred_original),
    'ROC AUC': roc_auc_score(y_test, y_prob_original),
    'Classification Report': classification_report(y_test, y_pred_original, output_dict=True)
}
print("Benchmark training and evaluation complete.")


In [None]:
# Train a new model pipeline on synthetic data and evaluate it against the test set.
from sklearn.base import clone
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

X_synthetic = synthetic_df.drop('Recommended IND', axis=1).fillna({'Review Text': '', 'Title': ''})
y_synthetic = synthetic_df['Recommended IND']

synthetic_pipeline = clone(full_pipeline) 

print("\n--- Training Model on Synthetic Data ---")
synthetic_pipeline.fit(X_synthetic, y_synthetic)

y_pred_synthetic = synthetic_pipeline.predict(X_test)
y_prob_synthetic = synthetic_pipeline.predict_proba(X_test)[:, 1]

results['Synthetic'] = {
    'Accuracy': accuracy_score(y_test, y_pred_synthetic),
    'ROC AUC': roc_auc_score(y_test, y_prob_synthetic),
    'Classification Report': classification_report(y_test, y_pred_synthetic, output_dict=True)
}
print("Synthetic training and evaluation complete.")


In [None]:
# Compare the performance of the original and synthetic models and print a summary.
import pandas as pd

print("\n" + "="*60)
print("             SIDE-BY-SIDE MODEL COMPARISON")
print(f"             (Tested on {len(test_df)}-Row Holdout Set)")
print("="*60)

summary_data = {
    'Model': ['Original (Benchmark)', 'Synthetic'],
    'Train Size': [len(X_train), len(X_synthetic)],
    'Accuracy': [results['Original']['Accuracy'], results['Synthetic']['Accuracy']],
    'ROC AUC Score': [results['Original']['ROC AUC'], results['Synthetic']['ROC AUC']],
    'Precision (Class 1)': [results['Original']['Classification Report']['1']['precision'], results['Synthetic']['Classification Report']['1']['precision']],
    'Recall (Class 1)': [results['Original']['Classification Report']['1']['recall'], results['Synthetic']['Classification Report']['1']['recall']],
}

summary_df = pd.DataFrame(summary_data).set_index('Model').T
summary_df.columns.name = 'Metric'

print(summary_df.to_markdown(floatfmt=".4f"))

print("\n" + "="*60)

print("Key Finding:")
# Report the measured gap instead of assuming any shortfall is small
auc_gap = results['Synthetic']['ROC AUC'] - results['Original']['ROC AUC']
if auc_gap >= 0:
    print("The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.")
else:
    print(f"The Synthetic Model's ROC AUC is {abs(auc_gap):.4f} lower than the Original Benchmark's.")


Your end result should look similar to this:

|                     |   Original (Benchmark) |   Synthetic |
|:--------------------|-----------------------:|------------:|
| Train Size          |                 18,788 |      15,000 |
| Accuracy            |                 0.9404 |      0.9278 |
| ROC AUC Score       |                 0.9782 |      0.9762 |
| Precision (Class 1) |                 0.9626 |      0.9423 |
| Recall (Class 1)    |                 0.9646 |      0.9714 |
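
To read the example table at a glance, the per-metric gaps can be worked out directly. The sketch below copies the values from the table and subtracts the benchmark score from the synthetic score. Your own run will produce different numbers.

```python
# Values copied from the example table above: (original, synthetic)
reported = {
    "Accuracy": (0.9404, 0.9278),
    "ROC AUC Score": (0.9782, 0.9762),
    "Precision (Class 1)": (0.9626, 0.9423),
    "Recall (Class 1)": (0.9646, 0.9714),
}

# Positive gap means the synthetic-trained model scored higher
gaps = {name: round(synth - orig, 4) for name, (orig, synth) in reported.items()}

for name, gap in gaps.items():
    print(f"{name:<20} {gap:+.4f}")
```

In this example the synthetic-trained model trails the benchmark by about 0.002 ROC AUC and 0.013 accuracy, and it has slightly higher class-1 recall.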

