# 🎛️ NeMo Safe Synthesizer 101: The Basics

> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.

<br> 

In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK. The notebook should take about 20 minutes to run.

After completing this notebook, you'll be able to:
- Use the NeMo Microservices SDK to interact with Safe Synthesizer
- Create novel synthetic data that follows the statistical properties of your input dataset
- Access an evaluation report on synthetic data quality and privacy


#### 💾 Install dependencies

**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.


In [1]:
import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder

import logging
logging.basicConfig(level=logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)



### ⚙️ Initialize the NeMo Safe Synthesizer Client

- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.
- `http://localhost:8080` is the default `base_url` for the client in the quickstart.
- If you use a managed or remote deployment, set the correct base URL and tokens for that deployment.


In [2]:
client = NeMoMicroservices(
    base_url="http://localhost:8080",
)
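
For a managed or remote deployment, hard-coding the URL in the notebook can be brittle. The following is a minimal sketch, assuming a hypothetical environment variable named `NMP_BASE_URL` (this name is chosen for illustration and is not defined by the SDK), that falls back to the quickstart default:

```python
import os

# Quickstart default used in this notebook.
DEFAULT_BASE_URL = "http://localhost:8080"


def resolve_base_url(env_var: str = "NMP_BASE_URL") -> str:
    """Return the base URL from the environment, or the quickstart default.

    NMP_BASE_URL is a hypothetical variable name chosen for this sketch.
    """
    value = os.environ.get(env_var, "").strip()
    # Drop any trailing slash so the value has a consistent form.
    return value.rstrip("/") if value else DEFAULT_BASE_URL


base_url = resolve_base_url()
print(base_url)
```

The result could then be passed to the client as `NeMoMicroservices(base_url=resolve_base_url())`.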

NeMo DataStore is launched as one of the platform services. We use it to manage storage for the job, so we configure its endpoint and token as follows:

In [3]:
datastore_config = {
    "endpoint": "http://localhost:3000/v1/hf",
    "token": "",
}

## 📥 Load input data

Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.

The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location.

In [4]:
%pip install kagglehub || uv pip install kagglehub

/home/shadeform/GenerativeAIExamples/nemo/NeMo-Safe-Synthesizer/.venv/bin/python: No module named pip
/bin/bash: line 1: uv: command not found
Note: you may need to restart the kernel to use updated packages.


In [5]:
import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.head()

Downloading from https://www.kaggle.com/api/v1/datasets/download/nicapotato/womens-ecommerce-clothing-reviews?dataset_version_number=1...


100%|██████████| 2.79M/2.79M [00:00<00:00, 70.2MB/s]

Extracting files...





|    | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---:|---:|---:|:---|:---|---:|---:|---:|:---|:---|:---|
| 0 | 767 | 33 |  | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1080 | 34 |  | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
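
Several `Title` values are missing in the preview above, so it can help to profile the columns before submitting a job. The sketch below uses a small toy frame with illustrative values rather than the real dataset; `profile_columns` is a hypothetical helper, not part of the SDK.

```python
import pandas as pd

# Toy frame standing in for the review data; values are illustrative only.
sample = pd.DataFrame(
    {
        "Age": [33, 34, 60],
        "Title": [None, None, "Some major design flaws"],
        "Review Text": ["Absolutely wonderful", "Love this dress!", None],
        "Recommended IND": [1, 1, 0],
    }
)


def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing": frame.isna().sum(),
            "unique": frame.nunique(dropna=True),
        }
    )


profile = profile_columns(sample)
print(profile)
```

Calling `profile_columns(df)` on the loaded dataset would give the same summary for the real columns.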


## 🏗️ Create a Safe Synthesizer job

The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.

The following code creates and submits a job:
- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.
- `.from_data_source(df)`: set the input data source.
- `.with_datastore(datastore_config)`: configure model artifact storage.
- `.with_replace_pii()`: enable automatic replacement of PII.
- `.synthesize()`: train and generate synthetic data.
- `.create_job()`: submit the job to the platform.


In [None]:
job = (
    SafeSynthesizerBuilder(client)
    .from_data_source(df)
    .with_datastore(datastore_config)
    .with_replace_pii()
    .synthesize()
    .create_job()
)

print(f"job_id = {job.job_id}")
job.wait_for_completion()

print(f"Job finished with status {job.fetch_status()}")
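
The chained calls above follow the fluent-builder pattern: each configuration method returns the builder itself, and a terminal call produces the result. The toy class below only illustrates that pattern. `ToyJobBuilder` is a hypothetical stand-in and says nothing about how `SafeSynthesizerBuilder` is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJobBuilder:
    """Hypothetical stand-in for illustration; not the SDK's builder."""

    steps: list = field(default_factory=list)

    def from_data_source(self, name: str) -> "ToyJobBuilder":
        self.steps.append(f"source={name}")
        return self  # returning self is what allows chaining

    def with_replace_pii(self) -> "ToyJobBuilder":
        self.steps.append("replace_pii")
        return self

    def synthesize(self) -> "ToyJobBuilder":
        self.steps.append("synthesize")
        return self

    def create_job(self) -> dict:
        # The terminal call consumes the accumulated configuration.
        return {"config": list(self.steps)}


toy_job = (
    ToyJobBuilder()
    .from_data_source("reviews")
    .with_replace_pii()
    .synthesize()
    .create_job()
)
print(toy_job)
```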

In [None]:
# If your notebook shuts down, the job keeps running on the microservices platform.
# To reattach to it, uncomment the snippet below and replace "<job id>" with the
# job ID printed by the previous cell.

# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob
# job = SafeSynthesizerJob(job_id="<job id>", client=client)

## 👀 View synthetic data

After the job completes, fetch the generated synthetic dataset.

In [7]:
# Fetch the synthetic data created by the job
synthetic_df = job.fetch_data()
synthetic_df


|    | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---:|---:|---:|:---|:---|---:|---:|---:|:---|:---|:---|
| 0 | 670 | 39 | Beautiful in person! | This top is just beautiful in person! the colo... | 5 | 1 | 0 | General | Tops | Knits |
| 1 | 186 | 49 | Pretty; not flattering | I love the color of this shirt, and it's very ... | 4 | 1 | 0 | General | Tops | Blouses |
| 2 | 382 | 46 | Fabulous wool crazy sweater | This sweater is gorgeous. the fabric goes wit... | 5 | 1 | 2 | General | Jackets | Jackets |
| 3 | 186 | 38 | Pretty fall top | I really liked this, all things considered. i ... | 4 | 1 | 5 | General | Tops | Blouses |
| 4 | 619 | 37 |  | I ordered the pink color. it's a very pretty t... | 5 | 1 | 1 | General | Tops | Knits |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 228 | 56 | White pajamas my favorite pair since when i wa... | Wish they had more colors. i ordered two pair ... | 5 | 1 | 0 | General Petite | Bottoms | Jackets |
| 996 | 247 | 55 | Great casual shirt for any occasion! | I love this shirt!! not only is it a great cas... | 5 | 1 | 2 | General | Tops | Knits |
| 997 | 902 | 64 |  |  | 5 | 1 | 0 | General Petite | Dresses | Dresses |
| 998 | 6610 | 47 | Amazing color and pattern | I love the fit of this skirt. i usually wear a... | 5 | 1 | 0 | General | Bottoms | Skirts |
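
Before opening the evaluation report, a quick sanity check is to compare simple distributions between the input and the synthetic data. The sketch below uses toy frames with illustrative values in place of `df` and `synthetic_df`; `compare_category_shares` is a hypothetical helper written for this example.

```python
import pandas as pd

# Toy stand-ins for `df` and `synthetic_df`; values are illustrative only.
original = pd.DataFrame(
    {
        "Rating": [5, 4, 5, 3],
        "Department Name": ["Tops", "Dresses", "Tops", "Bottoms"],
    }
)
synthetic = pd.DataFrame(
    {
        "Rating": [5, 5, 4, 4],
        "Department Name": ["Tops", "Tops", "Dresses", "Tops"],
    }
)


def compare_category_shares(real, synth, column):
    """Return each category's share in both frames and the absolute difference."""
    shares = pd.DataFrame(
        {
            "original": real[column].value_counts(normalize=True),
            "synthetic": synth[column].value_counts(normalize=True),
        }
    ).fillna(0.0)  # a category absent from one frame has a share of 0
    shares["abs_diff"] = (shares["original"] - shares["synthetic"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


shares = compare_category_shares(original, synthetic, "Department Name")
print(shares)
print("Mean rating:", original["Rating"].mean(), "vs", synthetic["Rating"].mean())
```

Large differences in a category's share are a cue to look more closely at the corresponding section of the evaluation report.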


## 📊 View evaluation report

An evaluation comparing the synthetic data to the input data is performed automatically. You can:

- **Inspect key scores**: overall synthetic data quality and privacy.
- **Download the full HTML report**: includes charts and detailed metrics.
- **Display the report inline**: useful when viewing in notebook environments.


In [8]:
# Print selected information from the job summary
summary = job.fetch_summary()
print(
    f"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}"
)
print(f"Data privacy score (0-10, higher is better): {summary.data_privacy_score}")


Synthetic data quality score (0-10, higher is better): 8.9
Data privacy score (0-10, higher is better): 8.5


In [10]:
# Download the full evaluation report to your local machine
job.save_report("evaluation_report.html")

In [11]:
# Fetch and display the full evaluation report inline
# job.display_report_in_notebook()

## 🧪 Extrinsic Evaluation

This section details the **extrinsic evaluation** process, where the quality of the synthetic data is assessed based on how well a model trained on it performs on a real-world task. This comparison is critical for validating the synthetic data's utility.

- **Train Benchmark Model**: A model is trained on a small, fixed subset of the **original data** to establish a performance baseline.
- **Train Synthetic Model**: A second model, using the same structure, is trained on the **entire synthetic dataset**.
- **Compare Performance**: Both models are evaluated against the same **fixed holdout test set** ($\mathbf{X_{test}, y_{test}}$).
- **Inspect Key Metrics**: The comparison reports **accuracy**, **ROC AUC**, and class-1 **precision** and **recall** to determine whether the synthetic model performs comparably to the benchmark.
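
ROC AUC, one of the metrics used in this comparison, can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on toy labels and scores. It is for intuition only; the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n^2) version is for
    illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: three of the four positive/negative pairs are ordered correctly.
auc = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```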

In [47]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from sklearn.base import clone

# --- 1. Define Features (X) and Target (y) ---

# Separate features (X) from the binary target variable (y).
X = df.drop('Recommended IND', axis=1)
y = df['Recommended IND']

# Fill any missing text values with an empty string to prevent TfidfVectorizer errors.
X['Review Text'] = X['Review Text'].fillna('')
X['Title'] = X['Title'].fillna('')

# --- 2. Fixed-Size Non-Overlapping Data Split (1000 Train, 500 Test) ---

# Sample 1000 random indices for the training set.
train_indices = X.sample(n=1000, random_state=1).index
X_train = X.loc[train_indices]
y_train = y.loc[train_indices]

# Create a temporary DataFrame containing only the remaining, unused rows.
remaining_X = X.drop(train_indices)

# Sample 500 random indices from the remaining data for the holdout test set.
test_indices = remaining_X.sample(n=500, random_state=1).index
X_test = X.loc[test_indices]
y_test = y.loc[test_indices]

# --- 3. Define Feature Types for Preprocessing ---

# List the columns for each type of transformation.
text_features = ['Review Text']
numerical_features = ['Age', 'Rating', 'Positive Feedback Count']
categorical_features = ['Division Name', 'Department Name', 'Class Name']

# --- 4. Define Preprocessing Steps ---

# Text: Convert text to numerical vectors (max 5000 terms).
text_transformer = TfidfVectorizer(stop_words='english', max_features=5000)
# Numerical: Standardize numerical features (mean=0, std=1).
numerical_transformer = StandardScaler()
# Categorical: One-hot encode categories, ignoring unseen categories in the test set.
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# --- 5. Create Column Transformer (The Preprocessor) ---

# Apply different transformations to the respective column groups.
preprocessor = ColumnTransformer(
    transformers=[
        # TfidfVectorizer only takes one column input.
        ('text', text_transformer, text_features[0]),
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    # Exclude non-feature columns (like 'Clothing ID' and 'Title') from the model input.
    remainder='drop'
)

# --- 6. Define the Classifier and Full Pipeline ---

# Choose the classification algorithm (Logistic Regression for recommendation prediction).
model = LogisticRegression(solver='liblinear', random_state=42)

# Build the final Pipeline: Preprocessing followed by Classification.
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])
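
The split in step 2 depends on the training and test rows not overlapping. A minimal sketch of the same sampling approach with an explicit check, using a scaled-down toy frame in place of `X` and assuming a unique index:

```python
import pandas as pd

# Toy frame standing in for X; sizes are scaled down for illustration.
toy_X = pd.DataFrame({"Age": range(20, 40)})

train_idx = toy_X.sample(n=10, random_state=1).index
# Sampling the test rows only from what remains guarantees no overlap.
test_idx = toy_X.drop(train_idx).sample(n=5, random_state=1).index

overlap = train_idx.intersection(test_idx)
assert overlap.empty, f"train/test share {len(overlap)} rows"
print(len(train_idx), len(test_idx), len(overlap))
```

The same check on the notebook's variables would be `train_indices.intersection(test_indices).empty`.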

In [49]:
# Assuming the full_pipeline structure (preprocessor + classifier) has been defined.

# --- 1. TRAIN BENCHMARK MODEL ON ORIGINAL DATA ---

# Clone the pipeline so the benchmark model is fitted on its own copy and
# `full_pipeline` stays unfitted for reuse.
original_pipeline = clone(full_pipeline)
print("\n--- Training Benchmark Model on Original Data (1000 rows) ---")
# Train the pipeline using the fixed, 1000-row original training set.
original_pipeline.fit(X_train, y_train)

# --- 2. EVALUATE BENCHMARK MODEL ON FIXED TEST SET ---

# Predict class labels on the 500-row fixed holdout test set.
y_pred_original = original_pipeline.predict(X_test)
# Predict probabilities and extract the probability for the positive class (Recommended=1), needed for ROC AUC.
y_prob_original = original_pipeline.predict_proba(X_test)[:, 1]

# --- 3. STORE BENCHMARK RESULTS ---

# Initialize a dictionary to store all model comparison results.
results = {}
results['Original'] = {
    # Calculate and store standard metrics using the fixed test set results.
    'Accuracy': accuracy_score(y_test, y_pred_original),
    'ROC AUC': roc_auc_score(y_test, y_prob_original),
    # Store the full classification report as a dictionary for detailed metric access (Precision, Recall, F1).
    'Classification Report': classification_report(y_test, y_pred_original, output_dict=True)
}
print("Benchmark training and evaluation complete.")


--- Training Benchmark Model on Original Data (1000 rows) ---
Benchmark training and evaluation complete.


In [50]:
# Prepare the synthetic data features and target
# Extract features (X_synthetic) and fill missing text values.
X_synthetic = synthetic_df.drop('Recommended IND', axis=1).fillna({'Review Text': '', 'Title': ''})
# Extract the target variable (y_synthetic).
y_synthetic = synthetic_df['Recommended IND']

# Clone the original pipeline structure (preprocessor and classifier) to ensure a clean, unfitted start.
synthetic_pipeline = clone(full_pipeline) 

# --- 1. TRAIN MODEL ON SYNTHETIC DATA ---
print("\n--- Training Model on Synthetic Data ---")
# Fit the pipeline using the entire synthetic dataset.
synthetic_pipeline.fit(X_synthetic, y_synthetic)

# --- 2. EVALUATE SYNTHETIC MODEL ON FIXED TEST SET ---

# Make predictions using the synthetic-trained model against the fixed 500-row holdout set (X_test).
y_pred_synthetic = synthetic_pipeline.predict(X_test)
# Extract the probability for the positive class (Recommended=1) for ROC AUC calculation.
y_prob_synthetic = synthetic_pipeline.predict_proba(X_test)[:, 1]

# --- 3. STORE SYNTHETIC RESULTS ---

# Store the performance metrics in the results dictionary for side-by-side comparison.
results['Synthetic'] = {
    'Accuracy': accuracy_score(y_test, y_pred_synthetic),
    'ROC AUC': roc_auc_score(y_test, y_prob_synthetic),
    # Store the detailed classification report as a dictionary.
    'Classification Report': classification_report(y_test, y_pred_synthetic, output_dict=True)
}
print("Synthetic training and evaluation complete.")


--- Training Model on Synthetic Data ---
Synthetic training and evaluation complete.


In [51]:
print("\n" + "="*60)
print("             SIDE-BY-SIDE MODEL COMPARISON")
print("             (Tested on 500-Row Holdout Set)")
print("="*60)

# --- 1. PREPARE COMPARISON DATA ---

# Compile key performance metrics from both the Original and Synthetic models into a single dictionary.
summary_data = {
    'Model': ['Original (Benchmark)', 'Synthetic'],
    # Record the size of the training data used for each model.
    'Train Size': [len(X_train), len(X_synthetic)],
    # Gather Accuracy and ROC AUC score for overall performance.
    'Accuracy': [results['Original']['Accuracy'], results['Synthetic']['Accuracy']],
    'ROC AUC Score': [results['Original']['ROC AUC'], results['Synthetic']['ROC AUC']],
    # Include key classification metrics (Precision and Recall) for the positive class (Recommended=1).
    'Precision (Class 1)': [results['Original']['Classification Report']['1']['precision'], results['Synthetic']['Classification Report']['1']['precision']],
    'Recall (Class 1)': [results['Original']['Classification Report']['1']['recall'], results['Synthetic']['Classification Report']['1']['recall']],
}

# --- 2. DISPLAY SUMMARY TABLE ---

# Convert the summary dictionary to a pandas DataFrame for structured display.
summary_df = pd.DataFrame(summary_data).set_index('Model').T
# Rename the columns axis of the transposed DataFrame (named 'Model' after the transpose) to 'Metric'.
summary_df.columns.name = 'Metric'

# Print the comparison table in Markdown format for clean terminal output, formatting metrics to 4 decimal places.
print(summary_df.to_markdown(floatfmt=".4f"))

print("\n" + "="*60)

# --- 3. INTERPRETATION ---

# Provide a simple, text-based conclusion based on the most robust metric (ROC AUC).
print("Key Finding:")
# Compare ROC AUC scores to determine if the synthetic data generalized as well as the original data.
if results['Synthetic']['ROC AUC'] >= results['Original']['ROC AUC']:
    print("The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.")
else:
    print("The Synthetic Model's ROC AUC is lower than the Original Benchmark's.")


             SIDE-BY-SIDE MODEL COMPARISON
             (Tested on 500-Row Holdout Set)
|                     |   Original (Benchmark) |   Synthetic |
|:--------------------|-----------------------:|------------:|
| Train Size          |              1000.0000 |   1000.0000 |
| Accuracy            |                 0.9480 |      0.9380 |
| ROC AUC Score       |                 0.9776 |      0.9821 |
| Precision (Class 1) |                 0.9640 |      0.9505 |
| Recall (Class 1)    |                 0.9734 |      0.9758 |

Key Finding:
The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.
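
The precision and recall rows in the table refer to class 1 (recommended). As a reminder of how they derive from prediction counts, here is a minimal sketch on toy labels. The values are illustrative and unrelated to the results above, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of the rows predicted positive, how many were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive rows, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 1, 0, 1]
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)
print(toy_precision, toy_recall)  # 0.75 0.75
```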
