From 188b812d8a7e037e23e23608ee3abcea077ae716 Mon Sep 17 00:00:00 2001 From: Matt Kornfield Date: Tue, 30 Sep 2025 09:52:13 -0700 Subject: [PATCH 1/8] Add NeMo Safe-Synthesizer Notebooks --- nemo/NeMo-Safe-Synthesizer/README.md | 32 ++ .../advanced/advanced_privacy.ipynb | 304 ++++++++++++++++++ .../advanced/replace_pii_only.ipynb | 250 ++++++++++++++ .../intro/safe_synthesizer_101.ipynb | 281 ++++++++++++++++ 4 files changed, 867 insertions(+) create mode 100644 nemo/NeMo-Safe-Synthesizer/README.md create mode 100644 nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb create mode 100644 nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb create mode 100644 nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb diff --git a/nemo/NeMo-Safe-Synthesizer/README.md b/nemo/NeMo-Safe-Synthesizer/README.md new file mode 100644 index 00000000..805d45b2 --- /dev/null +++ b/nemo/NeMo-Safe-Synthesizer/README.md @@ -0,0 +1,32 @@ +# NeMo Safe Synthesizer Example Notebooks + + +This directory contains the tutorial notebooks for getting started with NeMo Safe Synthesizer. + +## 📦 Set Up the Environment + +We will use the `uv` python management tool to set up our environment and install the necessary dependencies. If you don't have `uv` installed, you can follow the installation instructions from the [uv documentation](https://docs.astral.sh/uv/getting-started/installation/). + +Install the sdk as follows: + +```bash +uv venv +source .venv/bin/activate +uv pip install nemo-microservices[safe-synthesizer] +``` + + +Be sure to select this virtual environment as your kernel when running the notebooks. + +## 🚀 Deploying the NeMo Safe Synthesizer Microservice + +To run these notebooks, you'll need access to a deployment of the NeMo Safe Synthesizer microservice. You have two deployment options: + + +### 🐳 Deploy the NeMo Safe Synthesizer Microservice Locally + +Follow our quickstart guide to deploy the NeMo safe synthesizer microservice locally via Docker Compose. + +### 🚀 Deploy NeMo Microservices Platform with Helm + +Follow the helm installation guide to deploy the microservices platform. diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb new file mode 100644 index 00000000..e91dd517 --- /dev/null +++ b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb @@ -0,0 +1,304 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "630e3e17", + "metadata": {}, + "source": [ + "# 🔐 NeMo Safe Synthesizer: Advanced Privacy (Differential Privacy)\n", + "\n", + "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n", + "\n", + "
\n", + "\n", + "In this notebook, we create synthetic tabular data using the NeMo Microservices Python SDK with differential privacy enabled.\n", + "\n", + "After completing this notebook, you'll be able to:\n", + "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n", + "- **Enable differential privacy** to provide additional privacy protection\n", + "- **Access an evaluation report** on the quality and privacy of the synthetic data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a538526a", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "8be84f5d", + "metadata": {}, + "source": [ + "#### 💾 Install dependencies\n", + "\n", + "Ensure you have a NeMo Microservices Platform deployment available. If you're using a managed or remote deployment, have the correct base URLs and tokens ready." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f5d6f5a", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", + "\n", + "import logging\n", + "\n", + "logging.basicConfig(level=logging.WARNING)\n", + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "markdown", + "id": "7395f0c8", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n", + "\n", + "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n", + "- `http://localhost:8080` is the default URL for `base_url` in quickstart.\n", + "- If using a managed or remote deployment, ensure you use the correct base URLs and tokens." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c15ab93", + "metadata": {}, + "outputs": [], + "source": [ + "client = NeMoMicroservices(\n", + " base_url=\"http://localhost:8080\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8f1cfb12", + "metadata": {}, + "source": [ + "NeMo DataStore is launched as one of the services. We'll use it to manage storage, so set the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "426186a3", + "metadata": {}, + "outputs": [], + "source": [ + "datastore_config = {\n", + " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", + " \"token\": \"\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "2d66c819", + "metadata": {}, + "source": [ + "## 📥 Load input data\n", + "\n", + "Safe synthesizer learns the patterns and correlations of an input data set in order to produce synthetic data with similar properties. Use the sample dataset provided or change the following cell to try with your own data.\n", + "\n", + "The sample dataset is of a set of customer default payments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c989a42", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install ucimlrepo || uv pip install ucimlrepo" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7204f213", + "metadata": {}, + "outputs": [], + "source": [ + "from ucimlrepo import fetch_ucirepo \n", + " \n", + "# fetch dataset \n", + "default_of_credit_card_clients = fetch_ucirepo(id=350) \n", + "df = default_of_credit_card_clients.data.original\n", + " \n", + "\n", + "# Display the first few rows of the combined DataFrame\n", + "print(df.head()) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8ca3a11", + "metadata": {}, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "87d72c68", + "metadata": {}, + "source": [ + "## 🏗️ Create a Safe Synthesizer job\n", + "\n", + "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n", + "\n", + "This job will:\n", + "- Initialize the builder with the NeMo Microservices client.\n", + "- Use the loaded DataFrame as the input data source.\n", + "- Configure the job to use the specified datastore for model storage.\n", + "- Enable automatic replacement of personally identifiable information (PII).\n", + "- Enable differential privacy (DP) with a configurable epsilon.\n", + "- Use structured generation to enforce the schema during data generation.\n", + "- Submit the job to the microservices platform." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85d9de56", + "metadata": {}, + "outputs": [], + "source": [ + "job = (\n", + " SafeSynthesizerBuilder(client)\n", + " .from_data_source(df)\n", + " .with_datastore(datastore_config)\n", + " .with_replace_pii()\n", + " .with_differential_privacy(dp_enabled=True, epsilon=8.0)\n", + " .with_generate(use_structured_generation=True)\n", + " .create_job()\n", + ")\n", + "\n", + "print(f\"job_id = {job.job_id}\")\n", + "job.wait_for_completion()\n", + "\n", + "print(f\"Job finished with status {job.fetch_status()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2eacb2", + "metadata": {}, + "outputs": [], + "source": [ + "# If your notebook shuts down, it's okay, your job is still running on the microservices platform.\n", + "# You can get the same job object and interact with it again by uncommenting the following code\n", + "# snippet, and modifying it with the job id from the previous cell output.\n", + "\n", + "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n", + "# job = SafeSynthesizerJob(job_id=\"\", client=client)" + ] + }, + { + "cell_type": "markdown", + "id": "285d4a9d", + "metadata": {}, + "source": [ + "## 👀 View synthetic data\n", + "\n", + "After the job completes, fetch the generated synthetic dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f25574a", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch the synthetic data created by the job\n", + "synthetic_df = job.fetch_data()\n", + "synthetic_df\n" + ] + }, + { + "cell_type": "markdown", + "id": "472b4f38", + "metadata": {}, + "source": [ + "## 📊 View evaluation report\n", + "\n", + "An evaluation comparing the synthetic data to the input data is performed automatically.\n", + "\n", + "- Programmatically access key scores (quality and privacy).\n", + "- Download the full HTML report with charts and detailed metrics.\n", + "- Display the report inline below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b691127", + "metadata": {}, + "outputs": [], + "source": [ + "# Print selected information from the job summary\n", + "summary = job.fetch_summary()\n", + "print(\n", + " f\"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}\"\n", + ")\n", + "print(f\"Data privacy score (0-10, higher is better): {summary.data_privacy_score}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5b1030a", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the full evaluation report to your local machine\n", + "job.save_report(\"evaluation_report.html\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45f7e22b", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch and display the full evaluation report inline\n", + "job.display_report_in_notebook()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "kendrickb-notebooks", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb new file mode 100644 index 00000000..2a294fd3 --- /dev/null +++ b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb @@ -0,0 +1,250 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "630e3e17", + "metadata": {}, + "source": [ + "# 🔒 NeMo Safe Synthesizer: PII Replacement Only\n", + "\n", + "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n", + "\n", + "
\n", + "\n", + "In this notebook, we demonstrate how to use the NeMo Microservices Python SDK to replace PII in a tabular dataset.\n", + "\n", + "After completing this notebook, you'll be able to:\n", + "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n", + "- **Run a job to perform PII replacement only** (no novel data generation)\n" + ] + }, + { + "cell_type": "markdown", + "id": "8be84f5d", + "metadata": {}, + "source": [ + "#### 💾 Install dependencies\n", + "\n", + "Ensure you have a NeMo Microservices Platform deployment available. If you're using a managed or remote deployment, have the correct base URLs and tokens ready." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f5d6f5a", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", + "\n", + "import logging\n", + "logging.basicConfig(level=logging.WARNING)\n", + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "markdown", + "id": "53bb2807", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n", + "\n", + "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n", + "- `http://localhost:8080` is the default URL for `base_url` in quickstart.\n", + "- If using a managed or remote deployment, ensure you use the correct base URLs and tokens." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c15ab93", + "metadata": {}, + "outputs": [], + "source": [ + "client = NeMoMicroservices(\n", + " base_url=\"http://localhost:8080\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "3e1c5697", + "metadata": {}, + "source": [ + "NeMo DataStore is launched as one of the services. We'll use it to manage storage, so set the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "016213ab", + "metadata": {}, + "outputs": [], + "source": [ + "datastore_config = {\n", + " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", + " \"token\": \"\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "2d66c819", + "metadata": {}, + "source": [ + "## 📥 Load input data\n", + "\n", + "Safe Synthesizer processes your input dataset and returns the same rows with PII replaced. For this tutorial we load a small public sample dataset. Replace it with your own data if desired.\n", + "\n", + "The dolly dataset is an open source dataset of instruction-following records." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7204f213", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "df = pd.read_json(\n", + " \"hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl\",\n", + " lines=True,\n", + ")\n", + "print(df.head())" + ] + }, + { + "cell_type": "markdown", + "id": "87d72c68", + "metadata": {}, + "source": [ + "## 🏗️ Create a Safe Synthesizer job\n", + "\n", + "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n", + "\n", + "This job will:\n", + "- Initialize the builder with the NeMo Microservices client.\n", + "- Use the loaded DataFrame as the input data source.\n", + "- Configure the job to use the specified datastore for model storage.\n", + "- Enable automatic replacement of personally identifiable information (PII).\n", + "- Submit the job to the microservices platform." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85d9de56", + "metadata": {}, + "outputs": [], + "source": [ + "job = (\n", + " SafeSynthesizerBuilder(client)\n", + " .from_data_source(df)\n", + " .with_datastore(datastore_config)\n", + " .with_replace_pii()\n", + " .create_job()\n", + ")\n", + "\n", + "print(f\"job_id = {job.job_id}\")\n", + "job.wait_for_completion()\n", + "\n", + "print(f\"Job finished with status {job.fetch_status()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2eacb2", + "metadata": {}, + "outputs": [], + "source": [ + "# If your notebook shuts down, it's okay, your job is still running on the microservices platform.\n", + "# You can get the same job object and interact with it again by uncommenting the following code\n", + "# snippet, and modifying it with the job id from the previous cell output.\n", + "\n", + "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n", + "# job = SafeSynthesizerJob(job_id=\"\", client=client)" + ] + }, + { + "cell_type": "markdown", + "id": "285d4a9d", + "metadata": {}, + "source": [ + "## 👀 View output data\n", + "\n", + "After the job completes, fetch the output with PII replaced." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f25574a", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch the job output data with PII replaced\n", + "output_df = job.fetch_data()\n", + "output_df" + ] + }, + { + "cell_type": "markdown", + "id": "571efc39", + "metadata": {}, + "source": [ + "## 📊 View PII report\n", + "\n", + "A report summarizing the PII replacement is created automatically for every job.\n", + "\n", + "You can download the full HTML report or display it inline below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bba96175", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the full evaluation report to your local machine\n", + "job.save_report(\"evaluation_report.html\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45f7e22b", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch and display the full evaluation report inline\n", + "job.display_report_in_notebook()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "kendrickb-notebooks", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb new file mode 100644 index 00000000..173c95aa --- /dev/null +++ b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb @@ -0,0 +1,281 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "630e3e17", + "metadata": {}, + "source": [ + "# 🎛️ NeMo Safe Synthesizer 101: The Basics\n", + "\n", + "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n", + "\n", + "
\n", + "\n", + "In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK.\n", + "\n", + "After completing this notebook, you'll be able to:\n", + "- Use the NeMo Microservices SDK to interact with Safe Synthesizer\n", + "- Create novel synthetic data that follows the statistical properties of your input dataset\n", + "- Access an evaluation report on synthetic data quality and privacy\n" + ] + }, + { + "cell_type": "markdown", + "id": "8be84f5d", + "metadata": {}, + "source": [ + "#### 💾 Install dependencies\n", + "\n", + "**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f5d6f5a", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", + "\n", + "import logging\n", + "logging.basicConfig(level=logging.WARNING)\n", + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "markdown", + "id": "53bb2807", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n", + "\n", + "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n", + "- `http://localhost:8080` is the default url for the client's `base_url` in the quickstart.\n", + "- If using a managed or remote deployment, ensure correct base URLs and tokens.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c15ab93", + "metadata": {}, + "outputs": [], + "source": [ + "client = NeMoMicroservices(\n", + " base_url=\"http://localhost:8080\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "74d72ef7", + "metadata": {}, + "source": [ + "NeMo DataStore is launched as one of the services, and we'll use it to manage our storage. so we'll set the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab037a3a", + "metadata": {}, + "outputs": [], + "source": [ + "datastore_config = {\n", + " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", + " \"token\": \"\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "2d66c819", + "metadata": {}, + "source": [ + "## 📥 Load input data\n", + "\n", + "Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.\n", + "\n", + "The sample dataset used here is a set of women's clothing reviews." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "daa955b6", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install kagglehub || uv pip install kagglehub" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7204f213", + "metadata": {}, + "outputs": [], + "source": [ + "import kagglehub\n", + "import pandas as pd\n", + "\n", + "# Download latest version\n", + "path = kagglehub.dataset_download(\"nicapotato/womens-ecommerce-clothing-reviews\")\n", + "df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "87d72c68", + "metadata": {}, + "source": [ + "## 🏗️ Create a Safe Synthesizer job\n", + "\n", + "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n", + "\n", + "The following code creates and submits a job:\n", + "- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.\n", + "- `.from_data_source(df)`: set the input data source.\n", + "- `.with_datastore(datastore_config)`: configure model artifact storage.\n", + "- `.with_replace_pii()`: enable automatic replacement of PII.\n", + "- `.synthesize()`: train and generate synthetic data.\n", + "- `.create_job()`: submit the job to the platform.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85d9de56", + "metadata": {}, + "outputs": [], + "source": [ + "job = (\n", + " SafeSynthesizerBuilder(client)\n", + " .from_data_source(df)\n", + " .with_datastore(datastore_config)\n", + " .with_replace_pii()\n", + " .synthesize()\n", + " .create_job()\n", + ")\n", + "\n", + "print(f\"job_id = {job.job_id}\")\n", + "job.wait_for_completion()\n", + "\n", + "print(f\"Job finished with status {job.fetch_status()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2eacb2", + "metadata": {}, + "outputs": [], + "source": [ + "# If your notebook shuts down, it's okay, your job is still running on the microservices platform.\n", + "# You can get the same job object and interact with it again by uncommenting the following code\n", + "# snippet, and modifying it with the job id from the previous cell output.\n", + "\n", + "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n", + "# job = SafeSynthesizerJob(job_id=\"\", client=client)" + ] + }, + { + "cell_type": "markdown", + "id": "285d4a9d", + "metadata": {}, + "source": [ + "## 👀 View synthetic data\n", + "\n", + "After the job completes, fetch the generated synthetic dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f25574a", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch the synthetic data created by the job\n", + "synthetic_df = job.fetch_data()\n", + "synthetic_df\n" + ] + }, + { + "cell_type": "markdown", + "id": "2b25f152", + "metadata": {}, + "source": [ + "## 📊 View evaluation report\n", + "\n", + "An evaluation comparing the synthetic data to the input data is performed automatically. You can:\n", + "\n", + "- **Inspect key scores**: overall synthetic data quality and privacy.\n", + "- **Download the full HTML report**: includes charts and detailed metrics.\n", + "- **Display the report inline**: useful when viewing in notebook environments.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b691127", + "metadata": {}, + "outputs": [], + "source": [ + "# Print selected information from the job summary\n", + "summary = job.fetch_summary()\n", + "print(\n", + " f\"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}\"\n", + ")\n", + "print(f\"Data privacy score (0-10, higher is better): {summary.data_privacy_score}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39e62ea9", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the full evaluation report to your local machine\n", + "job.save_report(\"evaluation_report.html\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45f7e22b", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch and display the full evaluation report inline\n", + "job.display_report_in_notebook()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "kendrickb-notebooks", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 9af208c520b5beafa989ac0cfd7eee0c88dc051c Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Mon, 6 Oct 2025 16:20:36 -0500 Subject: [PATCH 2/8] Update safe_synthesizer_101.ipynb --- nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb index 173c95aa..2b953a90 100644 --- a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb @@ -99,7 +99,7 @@ "\n", "Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.\n", "\n", - "The sample dataset used here is a set of women's clothing reviews." + "The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location." ] }, { From acf40c4843653a5c1076881aaeee3a4150df2d94 Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Mon, 6 Oct 2025 16:28:24 -0500 Subject: [PATCH 3/8] Update replace_pii_only.ipynb --- nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb index 2a294fd3..30b3fe72 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb @@ -97,7 +97,7 @@ "\n", "Safe Synthesizer processes your input dataset and returns the same rows with PII replaced. For this tutorial we load a small public sample dataset. Replace it with your own data if desired.\n", "\n", - "The dolly dataset is an open source dataset of instruction-following records." + "The dolly dataset is an open source dataset of instruction-following records. Each record contains (1) a free text prompt that could be sent to an LLM, (2) a context descriptions to help the LLM determine the answer, (3) a response that could come from the LLM, and (4) the instruction category such as classification, open QA, closed QA, information extraction, and brainstroming." ] }, { From a11b8ac80d0d4ccb062f49b8d8e27e4be6e42d5d Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Mon, 6 Oct 2025 16:32:46 -0500 Subject: [PATCH 4/8] Update advanced_privacy.ipynb --- nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb index e91dd517..e702f49c 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb @@ -108,7 +108,7 @@ "\n", "Safe synthesizer learns the patterns and correlations of an input data set in order to produce synthetic data with similar properties. Use the sample dataset provided or change the following cell to try with your own data.\n", "\n", - "The sample dataset is of a set of customer default payments." + "The sample dataset is of a set of customer default payments. It includes columns of Personally Identifiable Information (PII) such as sex, education level, marriage status, and age. In addition, it contains several billing and payments accounts and a binary indicator of whether the next month's payment would default." ] }, { From ee5938a80ff8e51fe0bc61c0d08c22207d7f5018 Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Mon, 6 Oct 2025 16:35:30 -0500 Subject: [PATCH 5/8] Update replace_pii_only.ipynb --- nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb index 30b3fe72..0f2d2e30 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb @@ -97,7 +97,7 @@ "\n", "Safe Synthesizer processes your input dataset and returns the same rows with PII replaced. For this tutorial we load a small public sample dataset. Replace it with your own data if desired.\n", "\n", - "The dolly dataset is an open source dataset of instruction-following records. Each record contains (1) a free text prompt that could be sent to an LLM, (2) a context descriptions to help the LLM determine the answer, (3) a response that could come from the LLM, and (4) the instruction category such as classification, open QA, closed QA, information extraction, and brainstroming." + "The dolly dataset is an open source dataset of instruction-following records. Each record contains (1) a free text prompt that could be sent to an LLM, (2) a context descriptions to help the LLM determine the answer, (3) a response that could come from the LLM, and (4) the instruction category such as classification, open QA, closed QA, information extraction, and brainstorming. The text in each of the first three fields sometimes contains Personally Identifiable Information, such as names, birth dates, and locations." ] }, { From 53b1728f39f738622dbd9ce50726f97a5f894384 Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Tue, 7 Oct 2025 13:44:02 -0500 Subject: [PATCH 6/8] Add duration to safe_synthesizer_101.ipynb --- nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb index 2b953a90..e52612d1 100644 --- a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb @@ -11,7 +11,7 @@ "\n", "
\n", "\n", - "In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK.\n", + "In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK. The notebook should take about 20 minutes to run.\n", "\n", "After completing this notebook, you'll be able to:\n", "- Use the NeMo Microservices SDK to interact with Safe Synthesizer\n", From 5651ad36bdd272f57b6fb894f64f6d3e9a7708aa Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Tue, 7 Oct 2025 13:44:33 -0500 Subject: [PATCH 7/8] Add duration to replace_pii_only.ipynb --- nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb index 0f2d2e30..6025a901 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb @@ -11,7 +11,7 @@ "\n", "
\n", "\n", - "In this notebook, we demonstrate how to use the NeMo Microservices Python SDK to replace PII in a tabular dataset.\n", + "In this notebook, we demonstrate how to use the NeMo Microservices Python SDK to replace PII in a tabular dataset. The notebook should take about 15 minutes to run.\n", "\n", "After completing this notebook, you'll be able to:\n", "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n", From 7675a9f33e760fba5328a2e9d900b85c8e2759c5 Mon Sep 17 00:00:00 2001 From: alexahaushalter Date: Tue, 7 Oct 2025 13:45:03 -0500 Subject: [PATCH 8/8] Add duration to advanced_privacy.ipynb --- nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb index e702f49c..f8560a21 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb @@ -11,7 +11,7 @@ "\n", "
\n", "\n", - "In this notebook, we create synthetic tabular data using the NeMo Microservices Python SDK with differential privacy enabled.\n", + "In this notebook, we create synthetic tabular data using the NeMo Microservices Python SDK with differential privacy enabled. The notebook should take about 1.5 hours to run.\n", "\n", "After completing this notebook, you'll be able to:\n", "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n",