diff --git a/.gitignore b/.gitignore index 2a2104c4c..ea73b978a 100644 --- a/.gitignore +++ b/.gitignore @@ -47,4 +47,4 @@ RAG/notebooks/langchain/data/save_embedding uv.lock # data designer exclusion -data-designer-tutorial-output/ +data-designer-tutorial-output/ \ No newline at end of file diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb deleted file mode 100644 index 07c3c7f41..000000000 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb +++ /dev/null @@ -1,405 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer 101: The Basics\n", - "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", - "
\n", - "\n", - "In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from getpass import getpass\n", - "\n", - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", - "- If you have an instance of data designer running locally, you can connect to it as follows\n", - "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if using the managed service of data designer, provide the api key here\n", - "api_key = getpass(\"Enter data designer API key: \")\n", - "\n", - "if len(api_key) > 0:\n", - " print(\"βœ… API key received.\")\n", - "else:\n", - " print(\"❌ No API key provided. Please enter your model provider API key.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(\n", - " client=NeMoMicroservices(\n", - " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note**: \n", - "The NeMo Data Designer Managed service has models available for you to use as well. 
You can use these models by referencing the appropriate model_alias for them.\n", - "\n", - "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_alias = \"nemotron-nano-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🎲 Getting started with sampler columns\n", - "\n", - "- Sampler columns offer non-LLM based generation of synthetic data.\n", - "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", - "\n", - "
\n", - "\n", - "Let's start designing our product review dataset by adding product category and subcategory columns.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"product_category\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\n", - " \"Electronics\",\n", - " \"Clothing\",\n", - " \"Home & Kitchen\",\n", - " \"Books\",\n", - " \"Home Office\",\n", - " ],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"product_subcategory\",\n", - " type=P.SamplerType.SUBCATEGORY,\n", - " params=P.SubcategorySamplerParams(\n", - " category=\"product_category\",\n", - " values={\n", - " \"Electronics\": [\n", - " \"Smartphones\",\n", - " \"Laptops\",\n", - " \"Headphones\",\n", - " \"Cameras\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Clothing\": [\n", - " \"Men's Clothing\",\n", - " \"Women's Clothing\",\n", - " \"Winter Coats\",\n", - " \"Activewear\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Home & Kitchen\": [\n", - " \"Appliances\",\n", - " \"Cookware\",\n", - " \"Furniture\",\n", - " \"Decor\",\n", - " \"Organization\",\n", - " ],\n", - " \"Books\": [\n", - " \"Fiction\",\n", - " \"Non-Fiction\",\n", - " \"Self-Help\",\n", - " \"Textbooks\",\n", - " \"Classics\",\n", - " ],\n", - " \"Home Office\": [\n", - " \"Desks\",\n", - " \"Chairs\",\n", - " \"Storage\",\n", - " \"Office Supplies\",\n", - " \"Lighting\",\n", - " ],\n", - " },\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"target_age_range\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n", - " ),\n", - " )\n", - ")\n", - "\n", - "# Optionally validate that the columns are configured correctly.\n", - "config_builder.validate()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, let's add samplers to generate data related to the customer and their review.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# This column will sample synthetic person data based on statistics from the US Census.\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"customer\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(age_range=[18, 70]),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"number_of_stars\",\n", - " type=P.SamplerType.UNIFORM,\n", - " params=P.UniformSamplerParams(low=1, high=5),\n", - " convert_to=\"int\",\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"review_style\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n", - " weights=[1, 2, 2, 1],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🦜 LLM-generated columns\n", - "\n", - "- The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.\n", - "\n", - "- For our product review dataset, we will use LLM-generated text columns to generate product names and customer 
reviews.\n", - "\n", - "- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.\n", - "\n", - "- As we see below, nested json columns can be accessed using dot notation.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"product_name\",\n", - " prompt=(\n", - " \"Come up with a creative product name for a product in the '{{ product_category }}' category, focusing \"\n", - " \"on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n", - " \"{{ target_age_range }} years old. Respond with only the product name, no other text.\"\n", - " ),\n", - " # This is optional, but it can be useful for controlling the behavior of the LLM. Do not include instructions\n", - " # related to output formatting in the system prompt, as Data Designer handles this based on the column type.\n", - " system_prompt=(\n", - " \"You are a helpful assistant that generates product names. You respond with only the product name, \"\n", - " \"no other text. You do NOT add quotes around the product name.\"\n", - " ),\n", - " model_alias=model_alias,\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"customer_review\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'.\"\n", - " ),\n", - " model_alias=model_alias,\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Preview the dataset\n", - "\n", - "- Iteration is key to generating high-quality synthetic data.\n", - "\n", - "- Use the `preview` method to generate 10 records for inspection.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run this cell multiple times to cycle through the 10 preview records.\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ⏭️ Next Steps\n", - "\n", - "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n", - "\n", - "- [Structured outputs and jinja expressions](./2-structured-outputs-and-jinja-expressions.ipynb)\n", - "\n", - "- [Seeding synthetic data generation with an external dataset](./3-seeding-with-a-dataset.ipynb)\n", - "\n", - "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - 
"name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb deleted file mode 100644 index 55a6acf60..000000000 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb +++ /dev/null @@ -1,420 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions\n", - "\n", - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - "\n", - "
\n", - "\n", - "In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs and Jinja expressions.\n", - "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from getpass import getpass\n", - "\n", - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", - "- If you have an instance of data designer running locally, you can connect to it as follows\n", - "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if using the managed service of data designer, provide the api key here\n", - "api_key = getpass(\"Enter data designer API key: \")\n", - "\n", - "if len(api_key) > 0:\n", - " print(\"βœ… API key received.\")\n", - "else:\n", - " print(\"❌ No API key provided. Please enter your model provider API key.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(\n", - " client=NeMoMicroservices(\n", - " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note**: \n", - "The NeMo Data Designer Managed service has models available for you to use as well. 
You can use these models by referencing the appropriate model_alias for them.\n", - "\n", - "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_alias = \"nemotron-nano-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ§‘β€πŸŽ¨ Designing our data\n", - "\n", - "- We will again create a product review dataset, but this time we will use structured outputs and Jinja expressions.\n", - "\n", - "- Structured outputs let you specify the exact schema of the data you want to generate.\n", - "\n", - "- Data Designer supports schemas specified using either json schema or Pydantic data models (recommended).\n", - "\n", - "
\n", - "\n", - "We'll define our structured outputs using Pydantic data models:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from decimal import Decimal\n", - "from typing import Literal\n", - "from pydantic import BaseModel, Field\n", - "\n", - "\n", - "# We define a Product schema so that the name, description, and price are generated\n", - "# in one go, with the types and constraints specified.\n", - "class Product(BaseModel):\n", - " name: str = Field(description=\"The name of the product\")\n", - " description: str = Field(description=\"A description of the product\")\n", - " price: Decimal = Field(\n", - " description=\"The price of the product\", ge=10, le=1000, decimal_places=2\n", - " )\n", - "\n", - "\n", - "class ProductReview(BaseModel):\n", - " rating: int = Field(description=\"The rating of the product\", ge=1, le=5)\n", - " customer_mood: Literal[\"irritated\", \"mad\", \"happy\", \"neutral\", \"excited\"] = Field(\n", - " description=\"The mood of the customer\"\n", - " )\n", - " review: str = Field(description=\"A review of the product\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, let's design our product review dataset using a few more tricks compared to the previous notebook:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Since we often just want a few attributes from Person objects, we can use\n", - "# Data Designer's `with_person_samplers` method to create multiple person samplers\n", - "# at once and drop the person object columns from the final dataset.\n", - "config_builder.with_person_samplers(\n", - " {\"customer\": P.PersonSamplerParams(age_range=[18, 65])}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"product_category\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\n", - " \"Electronics\",\n", - " \"Clothing\",\n", - " \"Home & Kitchen\",\n", - " \"Books\",\n", - " \"Home Office\",\n", - " ],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"product_subcategory\",\n", - " type=P.SamplerType.SUBCATEGORY,\n", - " params=P.SubcategorySamplerParams(\n", - " category=\"product_category\",\n", - " values={\n", - " \"Electronics\": [\n", - " \"Smartphones\",\n", - " \"Laptops\",\n", - " \"Headphones\",\n", - " \"Cameras\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Clothing\": [\n", - " \"Men's Clothing\",\n", - " \"Women's Clothing\",\n", - " \"Winter Coats\",\n", - " \"Activewear\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Home & Kitchen\": [\n", - " \"Appliances\",\n", - " \"Cookware\",\n", - " \"Furniture\",\n", - " \"Decor\",\n", - " \"Organization\",\n", - " ],\n", - " \"Books\": [\n", - " \"Fiction\",\n", - " \"Non-Fiction\",\n", - " \"Self-Help\",\n", - " \"Textbooks\",\n", - " \"Classics\",\n", - " ],\n", - " \"Home Office\": [\n", - " \"Desks\",\n", - " \"Chairs\",\n", - " \"Storage\",\n", - " \"Office Supplies\",\n", - " \"Lighting\",\n", - " ],\n", - " },\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"target_age_range\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n", - " ),\n", - " )\n", - ")\n", - "\n", - "# we can set the weights for the categories 
to ensure the distribution of values is as expected\n", - "# we also show how we can use conditional params to set the values for the sampler if a given condition is met\n", - "# in this example, we set the review style to rambling if the target age range is 18-25\n", - "config_builder.add_column(\n", - "    C.SamplerColumn(\n", - "        name=\"review_style\",\n", - "        type=P.SamplerType.CATEGORY,\n", - "        params=P.CategorySamplerParams(\n", - "            values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n", - "            weights=[1, 2, 2, 1],\n", - "            conditional_params={\n", - "                \"target_age_range == '18-25'\": P.CategorySamplerParams(values=[\"rambling\"]),\n", - "            }\n", - "        ),\n", - "    )\n", - ")\n", - "\n", - "# We can create new columns using Jinja expressions that reference\n", - "# existing columns, including attributes of nested objects.\n", - "config_builder.add_column(\n", - "    C.ExpressionColumn(\n", - "        name=\"customer_name\", expr=\"{{ customer.first_name }} {{ customer.last_name }}\"\n", - "    )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - "    C.ExpressionColumn(name=\"customer_age\", expr=\"{{ customer.age }}\")\n", - ")\n", - "\n", - "# Add an `LLMStructuredColumn` to generate structured outputs.\n", - "config_builder.add_column(\n", - "    C.LLMStructuredColumn(\n", - "        name=\"product\",\n", - "        prompt=(\n", - "            \"Create a product in the '{{ product_category }}' category, focusing on products \"\n", - "            \"related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n", - "            \"{{ target_age_range }} years old. The product should be priced between $10 and $1000.\"\n", - "        ),\n", - "        output_format=Product,\n", - "        model_alias=model_alias,\n", - "    )\n", - ")\n", - "\n", - "# Another powerful feature is the ability to use conditional statements in our prompt using Jinja expressions.\n", - "# in this example, we add additional conditions to the prompt based on the target age range\n", - "config_builder.add_column(\n", - "    C.LLMStructuredColumn(\n", - "        name=\"customer_review\",\n", - "        prompt=(\n", - "            \"Your task is to write a review for the following product:\\n\\n\"\n", - "            \"Product Name: {{ product.name }}\\n\"\n", - "            \"Product Description: {{ product.description }}\\n\"\n", - "            \"Price: {{ product.price }}\\n\\n\"\n", - "            \"Imagine your name is {{ customer_name }} and you are from {{ customer.city }}, {{ customer.state }}. 
\"\n", - " \"Write the review in a style that is '{{ review_style }}'.\"\n", - " \"{% if target_age_range == '18-25' %}\"\n", - " \"Make sure the review is more informal and conversational.\"\n", - " \"{% else %}\"\n", - " \"Make sure the review is more formal and structured.\"\n", - " \"{% endif %}\"\n", - " ),\n", - " output_format=ProductReview,\n", - " model_alias=model_alias,\n", - " )\n", - ")\n", - "\n", - "# Let's add an evaluation report to our dataset.\n", - "config_builder.with_evaluation_report().validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Preview the dataset\n", - "\n", - "- Iteration is key to generating high-quality synthetic data.\n", - "\n", - "- Use the `preview` method to generate 10 records for inspection.\n", - "\n", - "- Setting `verbose_logging=True` prints logs within each task of the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run this cell multiple times to cycle through the 10 preview records.\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ⏭️ Next Steps\n", - "\n", - "Check out the following notebooks to learn more about:\n", - "\n", - "- [Seeding synthetic data generation with an external dataset](./3-seeding-with-a-dataset.ipynb)\n", - "\n", - "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n", - "\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb deleted file mode 100644 index 4df72ec86..000000000 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb +++ /dev/null @@ -1,351 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset\n", - "\n", - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - "\n", - "
\n", - "\n", - "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n", - "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from getpass import getpass\n", - "\n", - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer (NDD) Client\n", - "\n", - "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", - "- If you have an instance of data designer running locally, you can connect to it as follows\n", - "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if using the managed service of data designer, provide the api key here\n", - "api_key = getpass(\"Enter data designer API key: \")\n", - "\n", - "if len(api_key) > 0:\n", - " print(\"βœ… API key received.\")\n", - "else:\n", - " print(\"❌ No API key provided. Please enter your model provider API key.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(\n", - " client=NeMoMicroservices(\n", - " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note**: \n", - "The NeMo Data Designer Managed service has models available for you to use as well. 
You can use these models by referencing the appropriate model_alias for them.\n", - "\n", - "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_alias = \"nemotron-nano-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ₯ Download a seed dataset\n", - "\n", - "- For this notebook, we'll change gears and create a synthetic dataset of patient notes.\n", - "\n", - "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n", - "\n", - "- In this dataset, the `input_text` represents the `patient_summary` and the `output_text` represents the `diagnosis`.\n", - "\n", - "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The repo_id and filename arguments follow the Hugging Face Hub API format.\n", - "# Here we sample the seed data directly from the Hugging Face Hub, so the\n", - "# datastore endpoint is set to https://huggingface.co.\n", - "config_builder.with_seed_dataset(\n", - "    repo_id=\"gretelai/symptom_to_diagnosis\",\n", - "    filename=\"train.jsonl\",\n", - "    sampling_strategy=\"shuffle\",\n", - "    with_replacement=False,\n", - "    datastore={\"endpoint\": \"https://huggingface.co\"}\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🎨 Designing our synthetic patient notes dataset\n", - "\n", - "- We set the seed dataset using the `with_seed_dataset` method.\n", - "\n", - "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n", - "\n", - "- We set `with_replacement=False`, which caps generation at 853 records, the number of records in the seed dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Since we often just want a few attributes from Person objects, we can use\n", - "# Data Designer's `with_person_samplers` method to create multiple person samplers\n", - "# at once and drop the person object columns from the final dataset.\n", - "\n", - "# Empty dictionaries mean use default settings for the person samplers.\n", - "config_builder.with_person_samplers({\"patient_sampler\": {}, \"doctor_sampler\": {}})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Here we demonstrate how you can add a column by calling `add_column` with the\n", - "# column name, column type, and any parameters for that column type. This is in\n", - "# contrast to using the column and parameter type objects, via `C` and `P`, as we\n", - "# did in the previous notebooks. 
Generally, we recommend using the concrete column\n", - "# and parameter type objects, but this is a convenient shorthand when you are\n", - "# familiar with the required arguments for each type.\n", - "\n", - "config_builder.add_column(\n", - "    name=\"patient_id\",\n", - "    type=\"uuid\",\n", - "    params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True},\n", - ")\n", - "\n", - "config_builder.add_column(\n", - "    name=\"first_name\",\n", - "    type=\"expression\",\n", - "    expr=\"{{ patient_sampler.first_name }}\",\n", - ")\n", - "\n", - "config_builder.add_column(\n", - "    name=\"last_name\",\n", - "    type=\"expression\",\n", - "    expr=\"{{ patient_sampler.last_name }}\",\n", - ")\n", - "\n", - "\n", - "config_builder.add_column(\n", - "    name=\"dob\", type=\"expression\", expr=\"{{ patient_sampler.birth_date }}\"\n", - ")\n", - "\n", - "\n", - "config_builder.add_column(\n", - "    name=\"patient_email\",\n", - "    type=\"expression\",\n", - "    expr=\"{{ patient_sampler.email_address }}\",\n", - ")\n", - "\n", - "\n", - "config_builder.add_column(\n", - "    name=\"symptom_onset_date\",\n", - "    type=\"datetime\",\n", - "    params={\"start\": \"2024-01-01\", \"end\": \"2024-12-31\"},\n", - ")\n", - "\n", - "config_builder.add_column(\n", - "    name=\"date_of_visit\",\n", - "    type=\"timedelta\",\n", - "    params={\"dt_min\": 1, \"dt_max\": 30, \"reference_column_name\": \"symptom_onset_date\"},\n", - ")\n", - "\n", - "config_builder.add_column(\n", - "    name=\"physician\",\n", - "    type=\"expression\",\n", - "    expr=\"Dr. {{ doctor_sampler.last_name }}\",\n", - ")\n", - "\n", - "# Note we have access to the seed data fields.\n", - "config_builder.add_column(\n", - "    name=\"physician_notes\",\n", - "    prompt=\"\"\"\\\n", - "You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},\n", - "who has been struggling with symptoms from {{ output_text }} since {{ symptom_onset_date }}.\n", - "The date of today's visit is {{ date_of_visit }}.\n", - "\n", - "{{ input_text }}\n", - "\n", - "Write careful notes about your visit with {{ first_name }},\n", - "as Dr. 
{{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.\n", - "\n", - "Format the notes as a busy doctor might.\n", - "\"\"\",\n", - "    model_alias=model_alias,\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Preview the dataset\n", - "\n", - "- Iteration is key to generating high-quality synthetic data.\n", - "\n", - "- Use the `preview` method to generate a small sample of records for inspection.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview = data_designer_client.preview(config_builder, num_records=2, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run this cell multiple times to cycle through the preview records.\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ⏭️ Next Steps\n", - "\n", - "Check out the following notebooks to learn more about:\n", - "\n", - "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb deleted file mode 100644 index 660bed43d..000000000 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb +++ /dev/null @@ -1,447 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer 101: Using Custom Model Configurations\n", - "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: This notebook connects to the Data Designer managed service by default. Alternatively, you can run it against a self-hosted instance deployed via docker compose; see the [deployment guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - "\n", - "
\n", - "\n", - "In this notebook, we will see how to create and use custom model configurations in Data Designer.\n", - "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from getpass import getpass\n", - "\n", - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", - "- If you have an instance of data designer running locally, you can connect to it as follows\n", - "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if using the managed service of data designer, provide the api key here\n", - "api_key = getpass(\"Enter data designer API key: \")\n", - "\n", - "if len(api_key) > 0:\n", - " print(\"βœ… API key received.\")\n", - "else:\n", - " print(\"❌ No API key provided. 
Please enter your model provider API key.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(\n", - "    client=NeMoMicroservices(\n", - "        base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - "        default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", - "    )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# build.nvidia.com model endpoint\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias_static_temp = \"nemotron-nano-v2_static_temp\"\n", - "model_alias_variable_temp = \"nemotron-nano-v2_variable_temp\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## βš™οΈ Custom Model Configurations\n", - "\n", - "- In the previous notebooks, we've seen how we can reference a model using the model alias and pass static inference hyperparameters.\n", - "\n", - "- In this notebook, we will see how we can sample values from a distribution to set as our temperature value.\n", - "This will result in greater diversity in our generated data, as a different temperature value will be used each time the LLM is called." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - "    model_configs = [\n", - "        P.ModelConfig(\n", - "            alias=model_alias_static_temp,\n", - "            model=model_id,\n", - "            inference_parameters=P.InferenceParameters(\n", - "                max_tokens=1024,\n", - "                temperature=0.0,\n", - "                top_p=0.95,\n", - "                timeout=120\n", - "            ),\n", - "            is_reasoner=True\n", - "        ),\n", - "        P.ModelConfig(\n", - "            alias=model_alias_variable_temp,\n", - "            model=model_id,\n", - "            inference_parameters=P.InferenceParameters(\n", - "                max_tokens=1024,\n", - "                temperature=P.UniformDistribution(\n", - "                    params=P.UniformDistributionParams(\n", - "                        low=0.5,\n", - "                        high=0.9\n", - "                    )\n", - "                ),\n", - "                top_p=0.95,\n", - "                timeout=120\n", - "            ),\n", - "            is_reasoner=True\n", - "        ),\n", - "    ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ§‘β€πŸŽ¨ Generating our Data\n", - "\n", - "- We follow a similar procedure to generate our product review dataset as we did in the [basics tutorial](./1-the-basics.ipynb).\n", - "\n", - "- The one difference is that we generate multiple samples of the LLM-generated columns to demonstrate the difference in generation outputs due to different temperature values.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - "    C.SamplerColumn(\n", - "        name=\"product_category\",\n", - "        type=P.SamplerType.CATEGORY,\n", - "        params=P.CategorySamplerParams(\n", - "            values=[\n", - "                \"Electronics\",\n", - "                \"Clothing\",\n", - "                \"Home & Kitchen\",\n", - "                \"Books\",\n", - "
\"Home Office\",\n", - " ],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"product_subcategory\",\n", - " type=P.SamplerType.SUBCATEGORY,\n", - " params=P.SubcategorySamplerParams(\n", - " category=\"product_category\",\n", - " values={\n", - " \"Electronics\": [\n", - " \"Smartphones\",\n", - " \"Laptops\",\n", - " \"Headphones\",\n", - " \"Cameras\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Clothing\": [\n", - " \"Men's Clothing\",\n", - " \"Women's Clothing\",\n", - " \"Winter Coats\",\n", - " \"Activewear\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Home & Kitchen\": [\n", - " \"Appliances\",\n", - " \"Cookware\",\n", - " \"Furniture\",\n", - " \"Decor\",\n", - " \"Organization\",\n", - " ],\n", - " \"Books\": [\n", - " \"Fiction\",\n", - " \"Non-Fiction\",\n", - " \"Self-Help\",\n", - " \"Textbooks\",\n", - " \"Classics\",\n", - " ],\n", - " \"Home Office\": [\n", - " \"Desks\",\n", - " \"Chairs\",\n", - " \"Storage\",\n", - " \"Office Supplies\",\n", - " \"Lighting\",\n", - " ],\n", - " },\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"target_age_range\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n", - " ),\n", - " )\n", - ")\n", - "\n", - "# Optionally validate that the columns are configured correctly.\n", - "config_builder.validate()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, let's add samplers to generate data related to the customer and their review.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# This column will sample synthetic person data based on statistics from the US Census.\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"customer\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(age_range=[18, 70]),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"number_of_stars\",\n", - " type=P.SamplerType.UNIFORM,\n", - " params=P.UniformSamplerParams(low=1, high=5),\n", - " convert_to=\"int\",\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"review_style\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n", - " weights=[1, 2, 2, 1],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🦜 LLM-generated columns\n", - "\n", - "- We generate three sets of the LLM-generated columns to demonstrate the difference in output based on different temperature values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"product_name\",\n", - " prompt=(\n", - " \"Come up with a creative product name for a product in the '{{ product_category }}' category, focusing \"\n", - " \"on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n", - " \"{{ target_age_range }} years old. 
Respond with only the product name, no other text.\"\n", - " ),\n", - " # This is optional, but it can be useful for controlling the behavior of the LLM. Do not include instructions\n", - " # related to output formatting in the system prompt, as Data Designer handles this based on the column type.\n", - " system_prompt=(\n", - " \"You are a helpful assistant that generates product names. You respond with only the product name, \"\n", - " \"no other text. You do NOT add quotes around the product name. \"\n", - " ),\n", - " model_alias=model_alias_static_temp,\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"customer_review_base\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'. \"\n", - " ),\n", - " model_alias=model_alias_static_temp,\n", - " )\n", - ")\n", - "\n", - "\n", - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"customer_review_set_2\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'. \"\n", - " ),\n", - " model_alias=model_alias_variable_temp,\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"customer_review_set_3\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'. 
\"\n", - " ),\n", - " model_alias=model_alias_variable_temp,\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Preview the dataset\n", - "\n", - "- Use the `preview` method to generate 10 records for inspection.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview = data_designer_client.preview(config_builder, num_records=3, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run this cell multiple times to cycle through the 10 preview records.\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb index 74a00eef1..8cd491e5e 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb @@ -1,4 +1,5 @@ { +<<<<<<< HEAD "cells": [ { "cell_type": "markdown", @@ -585,4 +586,656 @@ }, "nbformat": 4, "nbformat_minor": 5 +======= + "cells": [ + { + "cell_type": "markdown", + "id": "d9177057", + "metadata": {}, + "source": [ + "# 🧾 NeMo Data Designer: W-2 Dataset Generator\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic\\\n", + " dataset of W-2 forms (US Wage & Tax Statements).\n", + "\n", + "- We will use generate numerical fields using [statistics published by the IRS](https://www.irs.gov/pub/irs-pdf/p5385.pdf) for the year 2021:\n", + "\n", + "- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics\\\n", + " for generated persons reflect real-world census data.\n", + "\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "1572ad96", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52263153", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " BernoulliMixtureSamplerParams,\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " ExpressionColumnConfig,\n", + " GaussianSamplerParams,\n", + " InferenceParameters,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " PersonSamplerParams,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " SubcategorySamplerParams,\n", + " UniformSamplerParams\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2ac2abb3", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0b71843", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "de00e30f", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "05a9c99a", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the 
nemotron-nano-v2 model.\n",
+ "SYSTEM_PROMPT = \"/no_think\"\n",
+ "\n",
+ "model_configs = [\n",
+ "    ModelConfig(\n",
+ "        alias=MODEL_ALIAS,\n",
+ "        model=MODEL_ID,\n",
+ "        provider=MODEL_PROVIDER,\n",
+ "        inference_parameters=InferenceParameters(\n",
+ "            temperature=0.6,\n",
+ "            top_p=0.95,\n",
+ "            max_tokens=1024,\n",
+ "        ),\n",
+ "    )\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7dd40c5f",
+ "metadata": {},
+ "source": [
+ "### 🏗️ Initialize the Data Designer Config Builder\n",
+ "\n",
+ "- The Data Designer config defines the dataset schema and generation process.\n",
+ "\n",
+ "- The config builder provides an intuitive interface for building this configuration.\n",
+ "\n",
+ "- The list of model configs is provided to the builder at initialization.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "66d35178",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbcb3538",
+ "metadata": {},
+ "source": [
+ "## 🎲 Setting Up Taxpayer and Employer Sampling\n",
+ "\n",
+ "- Sampler columns offer non-LLM based generation of synthetic data.\n",
+ "\n",
+ "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n",
+ "\n",
+ "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n",
+ "  If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "149e2abf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create samplers for an American taxpayer (employee) and an employer.\n",
+ "config_builder.add_column(\n",
+ "    SamplerColumnConfig(\n",
+ "        name=\"taxpayer\",\n",
+ "        sampler_type=SamplerType.PERSON,\n",
+ "        params=PersonSamplerParams(\n",
+ "            locale=\"en_US\",\n",
+ "            age_range=[18, 75]\n",
+ "        ),\n",
+ "    )\n",
+ ")\n",
+ "\n",
+ "# While the employer isn't technically a \"person\", we'll use the person sampler for generating the employer address.\n",
+ "config_builder.add_column(\n",
+ "    SamplerColumnConfig(\n",
+ "        name=\"employer\",\n",
+ "        sampler_type=SamplerType.PERSON,\n",
+ "        params=PersonSamplerParams(\n",
+ "            locale=\"en_US\",\n",
+ "        ),\n",
+ "    )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28397d74",
+ "metadata": {},
+ "source": [
+ "## ⚡️ Defining the Fields\n",
+ "\n",
+ "We will focus on the following:\n",
+ "- Box 1 (Wages, tips, and other compensation)\n",
+ "- Box 2 (Federal income tax withheld)\n",
+ "- Box 3 (Social security wages)\n",
+ "- Box 4 (Social security tax withheld)\n",
+ "- Box 5 (Medicare wages and tips)\n",
+ "- Box 6 (Medicare tax withheld)\n",
+ "- Box 7 (Social security tips)\n",
+ "- Box a (Employee's social security number)\n",
+ "- Box c (Employer's name, address and zip code)\n",
+ "- Box e (Employee's first name, initial, and last name)\n",
+ "- Box f (Employee's address and zip code)\n",
+ "\n",
+ "<br>
\n", + "\n", + "### Numerical fields\n", + "\n", + "Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). \\\n", + "We'll use the W-2 statistics from the IRS linked above to generate realistic samples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be7e98e1", + "metadata": {}, + "outputs": [], + "source": [ + "### BOX 1 (TOTAL WAGES, TIPS, AND OTHER COMPENSATION) ###\n", + "\n", + "# From Page 6 of the IRS Statistics, we know that 276,388,660 / 277,981,454 W-2 forms had a non-zero value for Box 1 (99.4%).\n", + "# From Page 8 of the IRS Statistics, we know that the sum of this field across all forms was 9,920,000,000*$1000 = $9,920,000,000,000 dollars.\n", + "# Since there were 276,388,660 non-zero Box 1 values, the average value of Box 1 was $9,920,000,000,000 / 276,388,660 = $35,891.49.\n", + "# We will use a Bernoulli-Exponential mixture distribution to sample values for this field.\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"box_1_wages_tips_other_compensation\",\n", + " sampler_type=SamplerType.BERNOULLI_MIXTURE,\n", + " params=BernoulliMixtureSamplerParams(\n", + " p=0.994,\n", + " dist_name=\"expon\",\n", + " dist_params={\"scale\": 35891.49}\n", + " ),\n", + " convert_to=\"int\",\n", + " )\n", + ")\n", + "\n", + "### BOX 2 (FEDERAL INCOME TAX WITHHELD) ###\n", + "\n", + "# Note: The calculations below are a simplification based on the assumption that this is an individual's only W-2.\n", + "# In practice, the taxable income is based on all wages for individuals with multiple W-2s.\n", + "\n", + "# 2022 standard deduction\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"standard_deduction\",\n", + " expr=\"{% if taxpayer.marital_status == 'married_present' %}25900{% else %}12950{% endif %}\",\n", + " dtype=\"float\",\n", + " ),\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"taxable_income\",\n", + " expr=\"{{ [0, box_1_wages_tips_other_compensation - standard_deduction]|max }}\",\n", + " dtype=\"float\",\n", + " )\n", + ")\n", + "\n", + "# We'll sum over the tax incurred at each 2022 tax bracket.\n", + "# For simplicity, we'll assume that the taxpayer is single here.\n", + "BRACKETS = [\n", + " {\"name\": \"bracket1\", \"rate\": 0.10, \"max\": 10275, \"min\": 0},\n", + " {\"name\": \"bracket2\", \"rate\": 0.12, \"max\": 41775, \"min\": 10275},\n", + " {\"name\": \"bracket3\", \"rate\": 0.22, \"max\": 89075, \"min\": 41775},\n", + " {\"name\": \"bracket4\", \"rate\": 0.24, \"max\": 170050, \"min\": 89075},\n", + " {\"name\": \"bracket5\", \"rate\": 0.32, \"max\": 215950, \"min\": 170050},\n", + " {\"name\": \"bracket6\", \"rate\": 0.35, \"max\": 539900, \"min\": 215950},\n", + " {\"name\": \"bracket7\", \"rate\": 0.37, \"max\": 10000000000000, \"min\": 539900},\n", + "]\n", + "for bracket in BRACKETS:\n", + " expression = f\"{bracket['rate']}*([[taxable_income,{bracket['max']}]|min - {bracket['min']}, 0] | max)\"\n", + " config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=bracket[\"name\"],\n", + " expr=\"{{ \" + expression + \" }}\",\n", + " dtype=\"float\",\n", + " )\n", + " )\n", + "\n", + "# Sum the tax brackets to get the total withheld, on average\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"mean_tax_liability\",\n", + " expr=\"{{ bracket1 + bracket2 + bracket3 + bracket4 + bracket5 + bracket6 + bracket7 }}\",\n", + " dtype=\"int\",\n", 
+ " )\n", + ")\n", + "\n", + "# Add some noise to get the actual withholding\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"tax_liability_noise\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=1, stddev=0.1),\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_2_federal_income_tax_withheld\",\n", + " expr=\"{{ (mean_tax_liability * tax_liability_noise) | int }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX 3 (SOCIAL SECURITY WAGES) ###\n", + "\n", + "# From Page 8 of the IRS Statistics, we know that social security wages are, on average, 8,150,000,000/9,920,000,000 ~= 82.16% of total wages.\n", + "# We'll sample a ratio from a normal distribution with mean 0.8216 and standard deviation 0.2.\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"social_security_wages_ratio\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=0.8216, stddev=0.2),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_3_social_security_wages\",\n", + " expr=\"{{ (box_1_wages_tips_other_compensation * social_security_wages_ratio) | int }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX 4 (SOCIAL SECURITY TAX WITHHELD) ###\n", + "\n", + "# In 2022, social security tax was withheld at a rate of 6.2% of social security wages, up to a maximum of $147,000.\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_4_social_security_tax_withheld\",\n", + " expr=\"{{ (([box_3_social_security_wages, 147000]|min) * 0.062) | int }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX 5 (MEDICARE WAGES AND TIPS) ###\n", + "\n", + "# From Page 8 of the IRS Statistics, we know that Medicare wages and tips are, on average, 10,300,000,000/9,920,000,000 ~= 103.8% of total wages.\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"medicare_wages_and_tips_ratio\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=1.038, stddev=0.2),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_5_medicare_wages_and_tips\",\n", + " expr=\"{{ (box_1_wages_tips_other_compensation * medicare_wages_and_tips_ratio) | int }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX 6 (MEDICARE TAX WITHHELD) ###\n", + "\n", + "# The standard employee Medicare tax rate in 2022 was 1.45% on all Medicare wages.\n", + "# The Additional Medicare Tax rate in 2022 was 0.9% on all Medicare wages in excess of $200,000.\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_6_medicare_tax_withheld\",\n", + " expr=\"{{ ((box_5_medicare_wages_and_tips * 0.0145) + (([box_5_medicare_wages_and_tips - 200000, 0]|max) * 0.009)) | int }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX 7 (SOCIAL SECURITY TIPS) ###\n", + "\n", + "# From Page 6 of the IRS Statistics, we know that only 12,620,946 / 277,981,454 W-2 forms had a non-zero value for Box 7 (4.54%).\n", + "# From Page 8 of the IRS Statistics, we know that the sum of this field across all forms was 55,897,014*$1000 = $55,897,014,000.\n", + "# Since there were 12,620,946 non-zero Box 7 values, the average value of Box 7 was $55,897,014,000 / 12,620,946 = $4428.91.\n", + "# We will use a Bernoulli-Exponential mixture distribution to sample values for this field.\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " 
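# Sanity check (hypothetical): the marginal mean of this mixture is p * scale ~= 0.0454 * 4428.91 ~= $201 per form.\n",
+ "        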
name=\"box_7_social_security_tips\",\n", + " sampler_type=SamplerType.BERNOULLI_MIXTURE,\n", + " params=BernoulliMixtureSamplerParams(\n", + " p=0.0454,\n", + " dist_name=\"expon\",\n", + " dist_params={\"scale\": 4428.91}\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "f1cbd72b", + "metadata": {}, + "source": [ + "### 🦜 Non-numerical Fields\n", + "\n", + "The remaining fields contain information about the employee (taxpayer) and the employer. \\\n", + "We'll use the person sampler in combination with an LLM to generate values here." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf3ba45b", + "metadata": {}, + "outputs": [], + "source": [ + "### BOX A (EMPLOYEE'S SOCIAL SECURITY NUMBER) ###\n", + "\n", + "# We can use the ssn field of the person sampler to generate a valid SSN for the employee.\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_a_employee_ssn\",\n", + " expr=\"{{ taxpayer.ssn }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX C (EMPLOYER'S NAME, ADDRESS AND ZIP CODE) ###\n", + "\n", + "# We want to generate a realistic company name.\n", + "# We'll start by generating a list of industries, expanded with magic.\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"employer_business\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=(\"You are assisting a user generate synthetic W-2 forms.\\n\"\n", + " \"You must generate a realistic industry category for the employer\\n\"\n", + " \"eg: software, health insurance, shoe store, restaurant, plumbing /no_think\"),\n", + " prompt=(\"Generate the industry category for the employer. Ensure it is consistent with the employer location\\n\"\n", + " \"City: {{ employer.city }}\\nState: {{ employer.state }}\"),\n", + " )\n", + ")\n", + "\n", + "# Next, we'll generate an actual name based on the type of business.\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"employer_name\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=\"Generate an original name for a {{ employer_business }} business in {{ employer.city }}.\",\n", + " )\n", + ")\n", + "\n", + "# Finally, we'll combine the employer name with the address of the employer.\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_c_employer_name_address_zip\",\n", + " expr=\"{{ employer_name }}\\n{{ employer.street_number }} {{ employer.street_name }}\\n{{ employer.city }}, {{ employer.state }} {{ employer.postcode }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX E (EMPLOYEE'S FIRST NAME, INITIAL, AND LAST NAME) ###\n", + "\n", + "# We can extract the first name, initial, and last name from the person sampler.\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_e_employee_first_name_initial_last_name\",\n", + " expr=\"{{ taxpayer.first_name }} {{ taxpayer.middle_name[:1] }} {{ taxpayer.last_name }}\",\n", + " )\n", + ")\n", + "\n", + "### BOX F (EMPLOYEE'S ADDRESS AND ZIP CODE) ###\n", + "\n", + "# Similarly, we can extract the employee's address and zip code from the person sampler.\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"box_f_employee_address_zip\",\n", + " expr=\"{{ taxpayer.street_number }} {{ taxpayer.street_name }}\\n{{ taxpayer.city }}, {{ taxpayer.state }} {{ taxpayer.postcode }}\",\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "7800b823", + "metadata": 
{}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62432301", + "metadata": {}, + "outputs": [], + "source": [ + "# These are the columns we want in the final dataset, after dropping latent variables.\n", + "FINAL_COLUMNS = [\n", + " \"box_1_wages_tips_other_compensation\",\n", + " \"box_2_federal_income_tax_withheld\",\n", + " \"box_3_social_security_wages\",\n", + " \"box_4_social_security_tax_withheld\",\n", + " \"box_5_medicare_wages_and_tips\",\n", + " \"box_6_medicare_tax_withheld\",\n", + " \"box_7_social_security_tips\",\n", + " \"box_a_employee_ssn\",\n", + " \"box_c_employer_name_address_zip\",\n", + " \"box_e_employee_first_name_initial_last_name\",\n", + " \"box_f_employee_address_zip\",\n", + "]\n", + "\n", + "# Preview the results\n", + "preview = data_designer_client.preview(config_builder)\n", + "preview.dataset[FINAL_COLUMNS]" + ] + }, + { + "cell_type": "markdown", + "id": "125c5a70", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac521fca", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "69b51eef", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc6b6f2a", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a80d9168", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "145fb4e6", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c05af9b2", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-forms-w2-dataset\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +>>>>>>> 8b9be04 (refactored w2 dataset notebook for 25.10) } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb index 8564b3726..ff7f97d60 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb @@ -1,834 +1,1043 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "d276b23c", - "metadata": {}, - "source": [ - "# πŸ₯ NeMo Data Designer: Clinical Trials Dataset Generator" - ] - }, - { - "cell_type": "markdown", - "id": "1052e3a3", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook creates a synthetic dataset of clinical trial records with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.\n", - "\n", - "The dataset includes:\n", - "- Trial information and study design\n", - "- Participant demographics and health data (PII)\n", - "- Investigator and coordinator information (PII)\n", - "- Medical observations and notes with embedded PII\n", - "- Adverse event reports with varying severity\n", - "\n", - "We'll use Data Designer to create this fully synthetic dataset from scratch." - ] - }, - { - "cell_type": "markdown", - "id": "13d2e224", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. 
Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4993fc0c", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "83c59cf7", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "44224c9e", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "656c4c56", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "06d32ec8", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d14cf19f", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create person samplers for different roles, using en_GB locale\n", - "# Add person samplers for different roles in the clinical trial\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"participant\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(locale=\"en_US\"),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"investigator\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(locale=\"en_US\"),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"coordinator\",\n", - " type=P.SamplerType.PERSON,\n", - " 
params=P.PersonSamplerParams(locale=\"en_US\"),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"sponsor\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(locale=\"en_US\"),\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating Trial Information\n", - "\n", - "Next, we'll create the basic trial information:\n", - "- Study ID (unique identifier)\n", - "- Trial phase and therapeutic area\n", - "- Study design details\n", - "- Start and end dates for the trial" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Study identifiers\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"study_id\",\n", - " type=P.SamplerType.UUID,\n", - " params=P.UUIDSamplerParams(prefix=\"CT-\", short_form=True, uppercase=True)\n", - " )\n", - ")\n", - "\n", - "# Trial phase\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"trial_phase\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"],\n", - " weights=[0.2, 0.3, 0.4, 0.1]\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Therapeutic area\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"therapeutic_area\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"Oncology\", \"Cardiology\", \"Neurology\", \"Immunology\", \"Infectious Disease\"],\n", - " weights=[0.3, 0.2, 0.2, 0.15, 0.15]\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Study design\n", - "config_builder.add_column(\n", - " name=\"study_design\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"trial_phase\",\n", - " \"values\": {\n", - " \"Phase I\": [\"Single Arm\", \"Dose Escalation\", \"First-in-Human\", \"Safety Assessment\"],\n", - " \"Phase II\": [\"Randomized\", \"Double-Blind\", \"Proof of Concept\", \"Open-Label Extension\"],\n", - " \"Phase III\": [\"Randomized Controlled\", \"Double-Blind Placebo-Controlled\", \"Multi-Center\", \"Pivotal\"],\n", - " \"Phase IV\": [\"Post-Marketing Surveillance\", \"Real-World Evidence\", \"Long-Term Safety\", \"Expanded Access\"]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Trial dates\n", - "config_builder.add_column(\n", - " name=\"trial_start_date\",\n", - " type=\"datetime\",\n", - " params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", - " convert_to=\"%Y-%m-%d\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"trial_end_date\",\n", - " type=\"datetime\",\n", - " params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", - " convert_to=\"%Y-%m-%d\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Participant Information\n", - "\n", - "Now we'll create fields for participant demographics and enrollment details:\n", - "- Participant ID and basic information\n", - "- Demographics (age, gender, etc.)\n", - "- Enrollment status and dates\n", - "- Randomization assignment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Participant identifiers and information\n", - "config_builder.add_column(\n", - " name=\"participant_id\",\n", - " type=\"uuid\",\n", - " params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " 
name=\"participant_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{participant.first_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"participant_last_name\",\n", - " type=\"expression\",\n", - " expr=\"{{participant.last_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"participant_birth_date\",\n", - " type=\"expression\",\n", - " expr=\"{{participant.birth_date}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"participant_email\",\n", - " type=\"expression\",\n", - " expr=\"{{participant.email_address}}\"\n", - ")\n", - "\n", - "# Enrollment information\n", - "config_builder.add_column(\n", - " name=\"enrollment_date\",\n", - " type=\"timedelta\",\n", - " params={\n", - " \"dt_min\": 0,\n", - " \"dt_max\": 60,\n", - " \"reference_column_name\": \"trial_start_date\",\n", - " \"unit\": \"D\"\n", - " },\n", - " convert_to=\"%Y-%m-%d\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"participant_status\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Active\", \"Completed\", \"Withdrawn\", \"Lost to Follow-up\"],\n", - " \"weights\": [0.6, 0.2, 0.15, 0.05]\n", - " }\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"treatment_arm\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Treatment\", \"Placebo\", \"Standard of Care\"],\n", - " \"weights\": [0.5, 0.3, 0.2]\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Investigator and Staff Information\n", - "\n", - "Here we'll add information about the trial staff:\n", - "- Investigator information (principal investigator)\n", - "- Study coordinator details\n", - "- Site information" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Investigator information\n", - "config_builder.add_column(\n", - " name=\"investigator_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{investigator.first_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"investigator_last_name\",\n", - " type=\"expression\",\n", - " expr=\"{{investigator.last_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"investigator_id\",\n", - " type=\"uuid\",\n", - " params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True}\n", - ")\n", - "\n", - "# Study coordinator information\n", - "config_builder.add_column(\n", - " name=\"coordinator_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{coordinator.first_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"coordinator_last_name\",\n", - " type=\"expression\",\n", - " expr=\"{{coordinator.last_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"coordinator_email\",\n", - " type=\"expression\",\n", - " expr=\"{{coordinator.email_address}}\"\n", - ")\n", - "\n", - "# Site information\n", - "config_builder.add_column(\n", - " name=\"site_id\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n", - " }\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"site_location\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n", - " }\n", - ")\n", - "\n", - "# Study costs\n", - "config_builder.add_column(\n", - " name=\"per_patient_cost\",\n", - " type=\"gaussian\",\n", - 
" params={\"mean\": 15000, \"stddev\": 5000, \"min\": 5000}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"participant_compensation\",\n", - " type=\"gaussian\",\n", - " params={\"mean\": 500, \"stddev\": 200, \"min\": 100}\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clinical Measurements and Outcomes\n", - "\n", - "These columns will track the key clinical data collected during the trial:\n", - "- Vital signs and lab values\n", - "- Efficacy measurements \n", - "- Dosing information" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Basic clinical measurements\n", - "config_builder.add_column(\n", - " name=\"baseline_measurement\",\n", - " type=\"gaussian\",\n", - " params={\"mean\": 100, \"stddev\": 15},\n", - " convert_to=\"float\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"final_measurement\",\n", - " type=\"gaussian\",\n", - " params={\"mean\": 85, \"stddev\": 20},\n", - " convert_to=\"float\"\n", - ")\n", - "\n", - "# Calculate percent change\n", - "config_builder.add_column(\n", - " name=\"percent_change\",\n", - " type=\"expression\",\n", - " expr=\"{{(final_measurement - baseline_measurement) / baseline_measurement * 100}}\"\n", - ")\n", - "\n", - "# Dosing information\n", - "config_builder.add_column(\n", - " name=\"dose_level\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Low\", \"Medium\", \"High\", \"Placebo\"],\n", - " \"weights\": [0.3, 0.3, 0.2, 0.2]\n", - " }\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"dose_frequency\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Once daily\", \"Twice daily\", \"Weekly\", \"Biweekly\"],\n", - " \"weights\": [0.4, 0.3, 0.2, 0.1]\n", - " }\n", - ")\n", - "\n", - "# Protocol compliance\n", - "config_builder.add_column(\n", - " name=\"compliance_rate\",\n", - " type=\"uniform\",\n", - " params={\"low\": 0.7, \"high\": 1.0}\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Adverse Events Tracking\n", - "\n", - "Here we'll capture adverse events that occur during the clinical trial:\n", - "- Adverse event presence and type\n", - "- Severity and relatedness to treatment\n", - "- Dates and resolution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Adverse event flags and details\n", - "config_builder.add_column(\n", - " name=\"has_adverse_event\",\n", - " type=\"bernoulli\",\n", - " params={\"p\": 0.3}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"adverse_event_type\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Headache\", \"Nausea\", \"Fatigue\", \"Rash\", \"Dizziness\", \"Pain at injection site\", \"Other\"],\n", - " \"weights\": [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]\n", - " },\n", - " conditional_params={\"has_adverse_event == 0\": {\"values\": [\"None\"]}}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"adverse_event_severity\",\n", - " type=\"category\",\n", - " params={\"values\": [\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]},\n", - " conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"adverse_event_relatedness\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Unrelated\", \"Possibly related\", \"Probably related\", \"Definitely 
related\"],\n", - " \"weights\": [0.2, 0.4, 0.3, 0.1]\n", - " },\n", - " conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"adverse_event_resolved\",\n", - " type=\"category\",\n", - " params={\"values\": [\"NA\"]},\n", - " conditional_params={\"has_adverse_event == 1\": {\"values\": [\"Yes\", \"No\"], \"weights\": [0.8, 0.2]}}\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Narrative text fields with style variations\n", - "\n", - "These fields will contain natural language text that incorporates PII elements.\n", - "We'll use style seed categories to ensure diversity in the writing styles:\n", - "\n", - "1. Medical observations and notes\n", - "2. Adverse event descriptions \n", - "3. Protocol deviation explanations\n", - "\n", - "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Documentation style category\n", - "config_builder.add_column(\n", - " name=\"documentation_style\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Formal and Technical\", \"Concise and Direct\", \"Detailed and Descriptive\"],\n", - " \"weights\": [0.4, 0.3, 0.3]\n", - " }\n", - ")\n", - "\n", - "# Medical observations - varies based on documentation style\n", - "config_builder.add_column(\n", - " name=\"medical_observations\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\n", - " {% if documentation_style == \"Formal and Technical\" %}\n", - " Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\n", - " (ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\n", - "\n", - " Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\n", - " Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a\n", - " change of {{ percent_change }}%.\n", - "\n", - " Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate\n", - " sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.\n", - " {% elif documentation_style == \"Concise and Direct\" %}\n", - " Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }}\n", - " ({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\n", - "\n", - " Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}.\n", - " Change: {{ percent_change }}%.\n", - "\n", - " Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference\n", - " Dr. 
{{ investigator_last_name }} briefly.\n", - " {% else %}\n", - " Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\n", - " enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\n", - "\n", - " Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\n", - " Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\n", - " representing a {{ percent_change }}% change.\n", - "\n", - " Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective\n", - " patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\n", - " {% endif %}\n", - " \"\"\"\n", - ")\n", - "\n", - "# Adverse event descriptions - conditional on having an adverse event\n", - "config_builder.add_column(\n", - " name=\"adverse_event_description\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\n", - " {% if has_adverse_event == 1 %}\n", - " [INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. Use formal medical language. Do not include meta-commentary or explain what you're doing.]\\\n", - " {{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment.\n", - " {% if adverse_event_resolved == \"Yes\" %}Resolved.{% else %}Ongoing.{% endif %}\n", - " {% else %}\n", - " [INSTRUCTIONS: Output only the exact text \"No adverse events reported\" without any additional commentary.]\\\n", - " No adverse events reported.\\\n", - " {% endif %}\n", - " \"\"\"\n", - ")\n", - "\n", - "# Protocol deviation description (if compliance is low)\n", - "config_builder.add_column(\n", - " name=\"protocol_deviation\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\n", - " {% if compliance_rate < 0.85 %}\n", - " {% if documentation_style == \"Formal and Technical\" %}\n", - " [FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like \"it looks like\" or \"you've provided\". Begin with the protocol deviation details. Use formal terminology.]\n", - "\n", - " PROTOCOL DEVIATION REPORT\n", - " Study ID: {{ study_id }}\n", - " Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n", - " Compliance Rate: {{ compliance_rate }}\n", - "\n", - " [Continue with formal description of the deviation, impact on data integrity, and corrective actions. Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\n", - " {% elif documentation_style == \"Concise and Direct\" %}\n", - " [FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\n", - "\n", - " PROTOCOL DEVIATION - {{ participant_id }}\n", - " β€’ Compliance: {{ compliance_rate }}\n", - " β€’ Impact: [severity level]\n", - " β€’ Actions: [list actions]\n", - " β€’ Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\n", - " β€’ PI: Dr. {{ investigator_last_name }}\n", - " {% else %}\n", - " [FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. 
No meta-commentary.]\n", - "\n", - " During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} {{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\n", - "\n", - " [Continue with narrative about circumstances, discovery, impact, and team response. Include references to {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\n", - " {% endif %}\n", - " {% else %}\n", - " [FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]\n", - "\n", - " PROTOCOL COMPLIANCE ASSESSMENT\n", - " Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n", - " Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.\n", - " {% endif %}\n", - " \"\"\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Adding Constraints\n", - "\n", - "Finally, we'll add constraints to ensure our data is logically consistent:\n", - "- Trial dates must be in proper sequence\n", - "- Adverse event dates must occur after enrollment\n", - "- Measurement changes must be realistic" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Ensure appropriate date sequence\n", - "config_builder.add_constraint(\n", - " target_column=\"trial_end_date\",\n", - " type=\"column_inequality\",\n", - " params={\"operator\": \"gt\", \"rhs\": \"trial_start_date\"}\n", - ")\n", - "\n", - "config_builder.add_constraint(\n", - " target_column=\"enrollment_date\",\n", - " type=\"column_inequality\",\n", - " params={\"operator\": \"ge\", \"rhs\": \"trial_start_date\"}\n", - ")\n", - "\n", - "config_builder.add_constraint(\n", - " target_column=\"enrollment_date\",\n", - " type=\"column_inequality\",\n", - " params={\"operator\": \"lt\", \"rhs\": \"trial_end_date\"}\n", - ")\n", - "\n", - "# Ensure reasonable clinical measurements\n", - "config_builder.add_constraint(\n", - " target_column=\"baseline_measurement\",\n", - " type=\"scalar_inequality\",\n", - " params={\"operator\": \"gt\", \"rhs\": 0}\n", - ")\n", - "\n", - "config_builder.add_constraint(\n", - " target_column=\"final_measurement\",\n", - " type=\"scalar_inequality\",\n", - " params={\"operator\": \"gt\", \"rhs\": 0}\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Preview and Generate Dataset\n", - "\n", - "First, we'll preview a small sample to verify our configuration is working correctly.\n", - "Then we'll generate the full dataset with the desired number of records." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Preview a few records\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# More previews\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Submit batch job\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", - "\n", - "job_results.wait_until_done()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a80535a0", - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", - "\n", - "dataset.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Save the dataset\n", - "import os\n", - "os.makedirs(\"data\", exist_ok=True)\n", - "\n", - "csv_filename = f\"./data/clinical-trial-data.csv\"\n", - "dataset.to_csv(csv_filename, index=False)\n", - "print(f\"Dataset with {len(dataset)} records saved to {csv_filename}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "d276b23c", + "metadata": {}, + "source": [ + "# πŸ₯ NeMo Data Designer: Clinical Trials Dataset Generator\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "This notebook demonstrates how to use structured samplers, person/PII generators, and LLMs to create a realistic\\\n", + "synthetic clinical trials datasetβ€”including trial metadata, participant demographics, investigator details,\\\n", + "clinical notes, and adverse event reportsβ€”for evaluating data protection and anonymization techniques.\n", + "\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "a0487040", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4993fc0c", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " BernoulliSamplerParams,\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " GaussianSamplerParams,\n", + " InferenceParameters,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " PersonSamplerParams,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " SubcategorySamplerParams,\n", + " UUIDSamplerParams,\n", + " UniformSamplerParams\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a47f0a6e", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24914d42", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "ea167c68", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1314aa7d", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 
model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "75df3903", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f66b1fb7", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "abfcb395", + "metadata": {}, + "source": [ + "## 🎲 Getting Started with Sampler Columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", + "\n", + "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", + "If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b47f0d9", + "metadata": {}, + "outputs": [], + "source": [ + "# Create person samplers for different roles, using en_GB locale\n", + "# Add person samplers for different roles in the clinical trial\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"participant\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"investigator\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"coordinator\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"sponsor\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c741f044", + "metadata": {}, + "source": [ + "### Creating Trial Information\n", + "\n", + "Next, we'll create the basic trial information:\n", + "- Study ID (unique identifier)\n", + "- Trial phase and therapeutic area\n", + "- Study design details\n", + "- Start and end dates for the trial" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8ba75b4", + "metadata": {}, + "outputs": [], + "source": [ + "# Study identifiers\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"study_id\",\n", + " sampler_type=SamplerType.UUID,\n", + " params=UUIDSamplerParams(prefix=\"CT-\", short_form=True, uppercase=True)\n", + " )\n", + ")\n", + "\n", + "# Trial phase\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " 
name=\"trial_phase\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"],\n", + " weights=[0.2, 0.3, 0.4, 0.1]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Therapeutic area\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"therapeutic_area\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Oncology\", \"Cardiology\", \"Neurology\", \"Immunology\", \"Infectious Disease\"],\n", + " weights=[0.3, 0.2, 0.2, 0.15, 0.15]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Study design\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"study_design\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"trial_phase\",\n", + " values={\n", + " \"Phase I\": [\"Single Arm\", \"Dose Escalation\", \"First-in-Human\", \"Safety Assessment\"],\n", + " \"Phase II\": [\"Randomized\", \"Double-Blind\", \"Proof of Concept\", \"Open-Label Extension\"],\n", + " \"Phase III\": [\"Randomized Controlled\", \"Double-Blind Placebo-Controlled\", \"Multi-Center\", \"Pivotal\"],\n", + " \"Phase IV\": [\"Post-Marketing Surveillance\", \"Real-World Evidence\", \"Long-Term Safety\", \"Expanded Access\"]\n", + " },\n", + " ),\n", + " )\n", + ")\n", + "\n", + "\n", + "# Trial dates\n", + "config_builder.add_column(\n", + " name=\"trial_start_date\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"datetime\",\n", + " params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", + " convert_to=\"%Y-%m-%d\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"trial_end_date\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"datetime\",\n", + " params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", + " convert_to=\"%Y-%m-%d\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d2868a1b", + "metadata": {}, + "source": [ + "### Participant Information\n", + "\n", + "Now we'll create fields for participant demographics and enrollment details:\n", + "- Participant ID and basic information\n", + "- Demographics (age, gender, etc.)\n", + "- Enrollment status and dates\n", + "- Randomization assignment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2e0618e", + "metadata": {}, + "outputs": [], + "source": [ + "# Participant identifiers and information\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"participant_id\",\n", + " sampler_type=SamplerType.UUID,\n", + " params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True}\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"participant_first_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{participant.first_name}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"participant_last_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{participant.last_name}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"participant_birth_date\",\n", + " column_type=\"expression\",\n", + " expr=\"{{participant.birth_date}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"participant_email\",\n", + " column_type=\"expression\",\n", + " expr=\"{{participant.email_address}}\"\n", + ")\n", + "\n", + "# Enrollment information\n", + "config_builder.add_column(\n", + " name=\"enrollment_date\",\n", + " column_type=\"sampler\",\n", + " 
sampler_type=\"timedelta\",\n", + " params={\n", + " \"dt_min\": 0,\n", + " \"dt_max\": 60,\n", + " \"reference_column_name\": \"trial_start_date\",\n", + " \"unit\": \"D\"\n", + " },\n", + " convert_to=\"%Y-%m-%d\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"participant_status\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values = [\"Active\", \"Completed\", \"Withdrawn\", \"Lost to Follow-up\"],\n", + " weights = [0.6, 0.2, 0.15, 0.05]\n", + " )\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"treatment_arm\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values = [\"Treatment\", \"Placebo\", \"Standard of Care\"],\n", + " weights = [0.5, 0.3, 0.2]\n", + " )\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "3d764b3f", + "metadata": {}, + "source": [ + "### Investigator and Staff Information\n", + "\n", + "Here we'll add information about the trial staff:\n", + "- Investigator information (principal investigator)\n", + "- Study coordinator details\n", + "- Site information" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3491885f", + "metadata": {}, + "outputs": [], + "source": [ + "# Investigator information\n", + "config_builder.add_column(\n", + " name=\"investigator_first_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{investigator.first_name}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"investigator_last_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{investigator.last_name}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"investigator_id\",\n", + " sampler_type=SamplerType.UUID,\n", + " params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True}\n", + " )\n", + ")\n", + "\n", + "# Study coordinator information\n", + "config_builder.add_column(\n", + " name=\"coordinator_first_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{coordinator.first_name}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"coordinator_last_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{coordinator.last_name}}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"coordinator_email\",\n", + " column_type=\"expression\",\n", + " expr=\"{{coordinator.email_address}}\"\n", + ")\n", + "\n", + "# Site information\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"site_id\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values = [\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n", + " )\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"site_location\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values = [\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Study costs\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"per_patient_cost\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=15000, stddev=5000),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"participant_compensation\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " 
params=GaussianSamplerParams(mean=500, stddev=200),\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "bf640a42", + "metadata": {}, + "source": [ + "### Clinical Measurements and Outcomes\n", + "\n", + "These columns will track the key clinical data collected during the trial:\n", + "- Baseline and final measurements with percent change\n", + "- Dosing information\n", + "- Protocol compliance" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e342316c-58c7-4da3-bc85-d40bd2663c6c", + "metadata": {}, + "outputs": [], + "source": [ + "# Basic clinical measurements\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"baseline_measurement\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=100, stddev=15),\n", + " convert_to=\"float\",\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"final_measurement\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=85, stddev=20),\n", + " convert_to=\"float\",\n", + " )\n", + ")\n", + "\n", + "# Calculate percent change\n", + "config_builder.add_column(\n", + " name=\"percent_change\",\n", + " column_type=\"expression\",\n", + " expr=\"{{(final_measurement - baseline_measurement) / baseline_measurement * 100}}\",\n", + ")\n", + "\n", + "# Dosing information\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"dose_level\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Low\", \"Medium\", \"High\", \"Placebo\"],\n", + " weights=[0.3, 0.3, 0.2, 0.2]\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"dose_frequency\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Once daily\", \"Twice daily\", \"Weekly\", \"Biweekly\"],\n", + " weights=[0.4, 0.3, 0.2, 0.1]\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Protocol compliance\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"compliance_rate\",\n", + " sampler_type=SamplerType.UNIFORM,\n", + " params=UniformSamplerParams(low=0.7, high=1.0),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cb97a115", + "metadata": {}, + "source": [ + "### Adverse Events Tracking\n", + "\n", + "Here we'll capture adverse events that occur during the clinical trial:\n", + "- Adverse event presence and type\n", + "- Severity and relatedness to treatment\n", + "- Resolution status" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07331df8-eb13-47f6-ad16-5fe748ecfec2", + "metadata": {}, + "outputs": [], + "source": [ + "# Adverse event flags and details\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"has_adverse_event\",\n", + " sampler_type=SamplerType.BERNOULLI,\n", + " params=BernoulliSamplerParams(\n", + " p=0.3\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"adverse_event_type\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Headache\", \"Nausea\", \"Fatigue\", \"Rash\", \"Dizziness\", \"Pain at injection site\", \"Other\"],\n", + " weights=[0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]\n", + " ),\n", + " conditional_params={\"has_adverse_event == 0\": CategorySamplerParams(values=[\"None\"])},\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + 
" SamplerColumnConfig(\n", + " name=\"adverse_event_severity\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(values=[\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]),\n", + " conditional_params={\"has_adverse_event == 0\": CategorySamplerParams(values=[\"NA\"])},\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"adverse_event_relatedness\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Unrelated\", \"Possibly related\", \"Probably related\", \"Definitely related\"],\n", + " weights=[0.2, 0.4, 0.3, 0.1]\n", + " ),\n", + " conditional_params={\"has_adverse_event == 0\": CategorySamplerParams(values=[\"NA\"])},\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"adverse_event_resolved\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(values=[\"NA\"]),\n", + " conditional_params={\"has_adverse_event == 1\": CategorySamplerParams(values=[\"Yes\", \"No\"], weights=[0.8, 0.2])},\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "03d2fd1b", + "metadata": {}, + "source": [ + "### Narrative text fields with style variations\n", + "\n", + "These fields will contain natural language text that incorporates PII elements.\n", + "We'll use style seed categories to ensure diversity in the writing styles:\n", + "\n", + "1. Medical observations and notes\n", + "2. Adverse event descriptions \n", + "3. Protocol deviation explanations\n", + "\n", + "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7564093-046d-48b9-a265-f9eaf2ce1a4f", + "metadata": {}, + "outputs": [], + "source": [ + "# Documentation style category\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"documentation_style\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Formal and Technical\", \"Concise and Direct\", \"Detailed and Descriptive\"],\n", + " weights=[0.4, 0.3, 0.3]\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Medical observations - varies based on documentation style\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"medical_observations\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"{% if documentation_style == 'Formal and Technical' %}\\n\"\n", + " \"Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\\n\"\n", + " \"(ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\\n\"\n", + "\n", + " \"Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\\n\"\n", + " \"Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a\"\n", + " \"change of {{ percent_change }}%.\\n\"\n", + "\n", + " \"Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate\"\n", + " \"sections and subsections. Include at least one reference to the site investigator, Dr. 
{{ investigator_last_name }}.\\n\"\n", + " \"{% elif documentation_style == 'Concise and Direct' %}\"\n", + " \"Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }}\\n\"\n", + " \"({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\\n\"\n", + "\n", + " \"Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}.\\n\"\n", + " \"Change: {{ percent_change }}%.\\n\"\n", + "\n", + " \"Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference\\n\"\n", + " \"Dr. {{ investigator_last_name }} briefly.\\n\"\n", + " \"{% else %}\\n\"\n", + " \"Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\\n\"\n", + " \"enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\\n\"\n", + "\n", + " \"Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\\n\"\n", + " \"Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\\n\"\n", + " \"representing a {{ percent_change }}% change.\\n\"\n", + "\n", + " \"Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective \"\n", + " \"patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\\n\"\n", + " \"{% endif %}\"\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Adverse event descriptions - conditional on having an adverse event\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"adverse_event_description\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"{% if has_adverse_event == 1 %}\"\n", + " \"[INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. \"\n", + " \"Use formal medical language. Do not include meta-commentary or explain what you're doing.] \"\n", + " \"{{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment.\\n\"\n", + " \"{% if adverse_event_resolved == 'Yes' %}Resolved.{% else %}Ongoing.{% endif %}\\n\"\n", + " \"{% else %}\\n\"\n", + " \"[INSTRUCTIONS: Output only the exact text 'No adverse events reported' without any additional commentary.] \"\n", + " \"No adverse events reported.\\n\"\n", + " \"{% endif %}\"\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Protocol deviation description (if compliance is low)\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"protocol_deviation\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"{% if compliance_rate < 0.85 %}\"\n", + " \"{% if documentation_style == 'Formal and Technical' %}\"\n", + " \"[FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like 'it looks like' or \"\n", + " \"'you've provided'. Begin with the protocol deviation details. 
Use formal terminology.]\\n\"\n", + "\n", + " \"PROTOCOL DEVIATION REPORT\\n\"\n", + " \"Study ID: {{ study_id }}\\n\"\n", + " \"Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\\n\"\n", + " \"Compliance Rate: {{ compliance_rate }}\\n\"\n", + "\n", + " \"[Continue with formal description of the deviation, impact on data integrity, and corrective actions. \"\n", + " \"Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\\n\"\n", + " \"{% elif documentation_style == 'Concise and Direct' %}\\n\"\n", + " \"[FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\\n\"\n", + "\n", + " \"PROTOCOL DEVIATION - {{ participant_id }}\\n\"\n", + " \"β€’ Compliance: {{ compliance_rate }}\\n\"\n", + " \"β€’ Impact: [severity level]\\n\"\n", + " \"β€’ Actions: [list actions]\\n\"\n", + " \"β€’ Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\\n\"\n", + " \"β€’ PI: Dr. {{ investigator_last_name }}\\n\"\n", + " \"{% else %}\\n\"\n", + " \"[FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]\\n\"\n", + "\n", + " \"During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} \"\n", + " \"{{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\\n\"\n", + "\n", + " \"[Continue with narrative about circumstances, discovery, impact, and team response. Include references to \"\n", + " \"{{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\\n\"\n", + " \"{% endif %}\\n\"\n", + " \"{% else %}\\n\"\n", + " \"[FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]\\n\"\n", + "\n", + " \"PROTOCOL COMPLIANCE ASSESSMENT\\n\"\n", + " \"Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\\n\"\n", + " \"Finding: No protocol deviations. 
Compliance rate: {{ compliance_rate }}.\\n\"\n", + " \"{% endif %}\"\n", + " )\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "de5f3951", + "metadata": {}, + "source": [ + "### Adding Constraints\n", + "\n", + "Finally, we'll add constraints to ensure our data is logically consistent:\n", + "- Trial dates must be in proper sequence\n", + "- Enrollment dates must fall within the trial period\n", + "- Clinical measurements must be positive" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49c04764-2ac4-4fac-ad39-d536f5c0c80c", + "metadata": {}, + "outputs": [], + "source": [ + "# Ensure appropriate date sequence\n", + "config_builder.add_constraint(\n", + " target_column=\"trial_end_date\",\n", + " constraint_type=\"column_inequality\",\n", + " operator=\"gt\",\n", + " rhs=\"trial_start_date\"\n", + ")\n", + "\n", + "config_builder.add_constraint(\n", + " target_column=\"enrollment_date\",\n", + " constraint_type=\"column_inequality\",\n", + " operator=\"ge\",\n", + " rhs=\"trial_start_date\"\n", + ")\n", + "\n", + "config_builder.add_constraint(\n", + " target_column=\"enrollment_date\",\n", + " constraint_type=\"column_inequality\",\n", + " operator=\"lt\",\n", + " rhs=\"trial_end_date\"\n", + ")\n", + "\n", + "# Ensure reasonable clinical measurements\n", + "config_builder.add_constraint(\n", + " target_column=\"baseline_measurement\",\n", + " constraint_type=\"scalar_inequality\",\n", + " operator=\"gt\",\n", + " rhs=0\n", + ")\n", + "\n", + "config_builder.add_constraint(\n", + " target_column=\"final_measurement\",\n", + " constraint_type=\"scalar_inequality\",\n", + " operator=\"gt\",\n", + " rhs=0\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "820346d5", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues (see the sanity-check sketch after this list).\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." 
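Once the preview below has run, it can also be worth spot-checking programmatically that the constraints defined above actually hold in the sampled records. The sketch that follows is our own addition, not part of the tutorial's API surface, and it assumes `preview.dataset` exposes the preview records as a pandas DataFrame; adjust the accessor if your SDK version differs.

```python
# Minimal sanity-check sketch. ASSUMPTION: `preview.dataset` holds the preview
# records as a pandas DataFrame (the accessor may differ across SDK versions).
import pandas as pd

df = pd.DataFrame(preview.dataset)

starts = pd.to_datetime(df["trial_start_date"])
ends = pd.to_datetime(df["trial_end_date"])
enrolls = pd.to_datetime(df["enrollment_date"])

# Mirror the constraints configured above: end after start, enrollment inside the window.
assert (ends > starts).all(), "trial_end_date should follow trial_start_date"
assert ((enrolls >= starts) & (enrolls < ends)).all(), "enrollment_date outside trial window"

# Clinical measurements should be strictly positive.
assert (df["baseline_measurement"] > 0).all() and (df["final_measurement"] > 0).all()
```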
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ff0208c", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5263ce2", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "6ceedcd7", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82530531", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "dc1c716b", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0207d661", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "858f7f75", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "860e4e8c", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ecff7b04", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-healthcare-datasets-clinical-trials\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb index e2591a787..99751bb2c 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb @@ -1,733 +1,896 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "114035a1", - "metadata": {}, - "source": 
[ - "# 🎨 NeMo Data Designer: Synthetic Insurance Claims Dataset Generator" - ] - }, - { - "cell_type": "markdown", - "id": "a38fd337", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook creates a synthetic dataset of insurance claims with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.\n", - "\n", - "The dataset includes:\n", - "- Policy and claim details\n", - "- Policyholder and claimant information (PII)\n", - "- Claim descriptions and adjuster notes with embedded PII\n", - "- Medical information for relevant claims\n", - "\n", - "We'll use NeMo Data Designer to create this fully synthetic dataset from scratch." - ] - }, - { - "cell_type": "markdown", - "id": "403557a7", - "metadata": {}, - "source": [ - "\n", - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "78d08106", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "1b9390a6", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "471686a2", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "a387a821", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6ab59df6", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "07a0a8e2", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating Person Samplers\n", - "\n", - "We'll create person samplers to generate consistent personal information for different roles in the insurance claims process:\n", - "- Policyholders (primary insurance customers)\n", - "- Claimants (who may be different from policyholders)\n", - "- Adjusters (insurance company employees who evaluate claims)\n", - "- Physicians (for medical-related claims)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create person samplers for different roles, using en_GB locale since en_US with PGM is not supported in streaming mode\n", - "config_builder.with_person_samplers({\n", - " \"policyholder\": {\"locale\": \"en_US\"},\n", - " \"claimant\": {\"locale\": \"en_US\"},\n", - " \"adjuster\": {\"locale\": \"en_US\"},\n", - " \"physician\": {\"locale\": \"en_US\"}\n", - "})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating Policy Information\n", - "\n", - "Next, we'll create the basic policy information:\n", - "- Policy number (unique identifier)\n", - "- Policy type (Auto, Home, Health, etc.)\n", - "- Coverage details (based on policy type)\n", - "- Policy start and end dates" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Policy identifiers\n", - "config_builder.add_column(\n", - " name=\"policy_number\",\n", - " type=\"uuid\",\n", - " params={\"prefix\": \"POL-\", \"short_form\": True, \"uppercase\": True}\n", - ")\n", - "\n", - "# Policy type\n", - "config_builder.add_column(\n", - " 
name=\"policy_type\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Auto\", \"Home\", \"Health\", \"Life\", \"Travel\"],\n", - " \"weights\": [0.4, 0.3, 0.15, 0.1, 0.05]\n", - " }\n", - ")\n", - "\n", - "# Coverage types based on policy type\n", - "config_builder.add_column(\n", - " name=\"coverage_type\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"policy_type\",\n", - " \"values\": {\n", - " \"Auto\": [\"Liability\", \"Comprehensive\", \"Collision\", \"Uninsured Motorist\"],\n", - " \"Home\": [\"Dwelling\", \"Personal Property\", \"Liability\", \"Natural Disaster\"],\n", - " \"Health\": [\"Emergency Care\", \"Primary Care\", \"Specialist\", \"Prescription\"],\n", - " \"Life\": [\"Term\", \"Whole Life\", \"Universal Life\", \"Variable Life\"],\n", - " \"Travel\": [\"Trip Cancellation\", \"Medical Emergency\", \"Lost Baggage\", \"Flight Accident\"]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Policy dates\n", - "config_builder.add_column(\n", - " name=\"policy_start_date\",\n", - " type=\"datetime\",\n", - " params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", - " convert_to=\"%Y-%m-%d\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"policy_end_date\",\n", - " type=\"datetime\",\n", - " params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", - " convert_to=\"%Y-%m-%d\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Policyholder Information (PII)\n", - "\n", - "Now we'll add fields for the policyholder's personal information. This includes PII elements that would typically be subject to privacy regulations:\n", - "- First and last name\n", - "- Birth date\n", - "- Contact information (email)\n", - "\n", - "These fields use expressions to reference the person sampler we defined earlier." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Policyholder personal information\n", - "config_builder.add_column(\n", - " name=\"policyholder_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{policyholder.first_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"policyholder_last_name\",\n", - " type=\"expression\",\n", - " expr=\"{{policyholder.last_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"policyholder_birth_date\",\n", - " type=\"expression\",\n", - " expr=\"{{policyholder.birth_date}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"policyholder_email\",\n", - " type=\"expression\",\n", - " expr=\"{{policyholder.email_address}}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Claim Information\n", - "\n", - "Next, we'll create the core claim details:\n", - "- Claim ID (unique identifier)\n", - "- Dates (filing date, incident date)\n", - "- Claim status (in process, approved, denied, etc.)\n", - "- Financial information (amount claimed, amount approved)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Claim identifier\n", - "config_builder.add_column(\n", - " name=\"claim_id\",\n", - " type=\"uuid\",\n", - " params={\"prefix\": \"CLM-\", \"short_form\": True, \"uppercase\": True}\n", - ")\n", - "\n", - "# Claim dates\n", - "config_builder.add_column(\n", - " name=\"incident_date\",\n", - " type=\"datetime\",\n", - " params={\"start\": \"2023-01-01\", \"end\": \"2023-12-31\"},\n", - " convert_to=\"%Y-%m-%d\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"filing_date\",\n", - " type=\"timedelta\",\n", - " params={\n", - " \"dt_min\": 1,\n", - " \"dt_max\": 30,\n", - " \"reference_column_name\": \"incident_date\",\n", - " \"unit\": \"D\"\n", - " },\n", - " convert_to=\"%Y-%m-%d\"\n", - ")\n", - "\n", - "# Claim status\n", - "config_builder.add_column(\n", - " name=\"claim_status\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Filed\", \"Under Review\", \"Additional Info Requested\", \"Approved\", \"Denied\", \"Appealed\"],\n", - " \"weights\": [0.15, 0.25, 0.15, 0.25, 0.15, 0.05]\n", - " }\n", - ")\n", - "\n", - "# Financial information\n", - "config_builder.add_column(\n", - " name=\"claim_amount\",\n", - " type=\"gaussian\",\n", - " params={\"mean\": 5000, \"stddev\": 2000, \"min\": 500}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"approved_percentage\",\n", - " type=\"uniform\",\n", - " params={\"low\": 0.0, \"high\": 1.0}\n", - ")\n", - "\n", - "# Calculate approved amount based on percentage\n", - "config_builder.add_column(\n", - " name=\"approved_amount\",\n", - " type=\"expression\",\n", - " expr=\"{{claim_amount * approved_percentage}}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Claimant Information\n", - "\n", - "In some cases, the claimant (person filing the claim) may be different from the policyholder. 
\n", - "We'll create fields to capture claimant information and their relationship to the policyholder:\n", - "- Flag indicating if claimant is the policyholder\n", - "- Claimant personal details (when different from policyholder)\n", - "- Relationship to policyholder" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Determine if claimant is the policyholder\n", - "config_builder.add_column(\n", - " name=\"is_claimant_policyholder\",\n", - " type=\"bernoulli\",\n", - " params={\"p\": 0.7}\n", - ")\n", - "\n", - "# Claimant personal information (when different from policyholder)\n", - "config_builder.add_column(\n", - " name=\"claimant_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{claimant.first_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"claimant_last_name\",\n", - " type=\"expression\",\n", - " expr=\"{{claimant.last_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"claimant_birth_date\",\n", - " type=\"expression\",\n", - " expr=\"{{claimant.birth_date}}\"\n", - ")\n", - "\n", - "# Relationship to policyholder\n", - "config_builder.add_column(\n", - " name=\"relationship_to_policyholder\",\n", - " type=\"category\",\n", - " params={\"values\": [\"Self\",\"Spouse\", \"Child\", \"Parent\", \"Sibling\", \"Other\"]},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Claim Adjuster Information\n", - "\n", - "Insurance claims are typically handled by claim adjusters. We'll add information about \n", - "the adjuster assigned to each claim:\n", - "- Adjuster name\n", - "- Assignment date\n", - "- Contact information" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Adjuster information\n", - "config_builder.add_column(\n", - " name=\"adjuster_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{adjuster.first_name}}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"adjuster_last_name\",\n", - " type=\"expression\",\n", - " expr=\"{{adjuster.last_name}}\"\n", - ")\n", - "\n", - "# Adjuster assignment date\n", - "config_builder.add_column(\n", - " name=\"adjuster_assignment_date\",\n", - " type=\"timedelta\",\n", - " params={\n", - " \"dt_min\": 0,\n", - " \"dt_max\": 5,\n", - " \"reference_column_name\": \"filing_date\",\n", - " \"unit\": \"D\"\n", - " },\n", - " convert_to=\"%Y-%m-%d\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Medical Information\n", - "\n", - "For health insurance claims and injury-related claims in other policy types, \n", - "we'll include medical information:\n", - "- Flag indicating if there's a medical component to the claim\n", - "- Medical claim details (when applicable)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Is there a medical component to this claim?\n", - "config_builder.add_column(\n", - " name=\"has_medical_component\",\n", - " type=\"bernoulli\",\n", - " params={\"p\": 0.4}\n", - ")\n", - "\n", - "# Physician information using conditional logic\n", - "config_builder.add_column(\n", - " name=\"physician_first_name\",\n", - " type=\"expression\",\n", - " expr=\"{{physician.first_name}}\",\n", - " conditional_params={\"has_medical_component == 0\": {\"expr\": \"'NA'\"}}\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"physician_last_name\",\n", - " 
type=\"expression\",\n", - " expr=\"{{physician.last_name}}\",\n", - " conditional_params={\"has_medical_component == 0\": {\"expr\": \"'NA'\"}}\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Free Text Fields with PII References\n", - "\n", - "These fields will contain natural language text that incorporates PII elements from other fields.\n", - "This is particularly useful for testing PII detection and redaction within unstructured text:\n", - "\n", - "1. Incident Description - The policyholder/claimant's account of what happened\n", - "2. Adjuster Notes - The insurance adjuster's professional documentation\n", - "3. Medical Notes - For claims with a medical component\n", - "\n", - "The LLM will be prompted to include PII elements like names, dates, and contact information\n", - "within the narrative text." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Incident description from policyholder/claimant\n", - "config_builder.add_column(\n", - " name=\"incident_description\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\n", - " Write a detailed description of an insurance incident for a {{policy_type}} insurance policy with {{coverage_type}} coverage.\n", - "\n", - " The policyholder is {{policyholder_first_name}} {{policyholder_last_name}} (email: {{policyholder_email}}).\n", - "\n", - " The incident occurred on {{incident_date}} and resulted in approximately ${{claim_amount}} in damages/expenses.\n", - "\n", - " Write this from the perspective of the person filing the claim. Include specific details that would be relevant\n", - " to processing this type of claim. Make it detailed but realistic, as if written by someone describing an actual incident.\n", - "\n", - " Reference the policyholder's contact information at least once in the narrative.\n", - " \"\"\"\n", - ")\n", - "\n", - "# Adjuster notes\n", - "config_builder.add_column(\n", - " name=\"adjuster_notes\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\n", - " Write detailed insurance adjuster notes for claim {{claim_id}}.\n", - "\n", - " POLICY INFORMATION:\n", - " - Policy #: {{policy_number}}\n", - " - Type: {{policy_type}}, {{coverage_type}} coverage\n", - " - Policyholder: {{policyholder_first_name}} {{policyholder_last_name}}\n", - "\n", - " CLAIM DETAILS:\n", - " - Incident Date: {{incident_date}}\n", - " - Filing Date: {{filing_date}}\n", - " - Claimed Amount: ${{claim_amount}}\n", - "\n", - " As adjuster {{adjuster_first_name}} {{adjuster_last_name}}, write professional notes documenting:\n", - " 1. Initial contact with the policyholder\n", - " 2. Assessment of the claim based on the incident description\n", - " 3. Coverage determination under the policy\n", - " 4. 
Recommended next steps\n", - "\n", - " Include at least one mention of contacting the policyholder using their full name and email ({{policyholder_email}}).\n", - " Use a formal, professional tone typical of insurance documentation.\n", - " \"\"\"\n", - ")\n", - "\n", - "# Medical notes (for claims with medical component)\n", - "config_builder.add_column(\n", - " name=\"medical_notes\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\n", - " {% if has_medical_component %}\\\n", - " Write medical notes related to insurance claim {{ claim_id }}.\n", - "\n", - " Patient: {{policyholder_first_name}} {{policyholder_last_name}}, DOB: {{policyholder_birth_date}}\n", - "\n", - " As Dr. {{physician_first_name}} {{physician_last_name}}, document:\n", - "\n", - " 1. Chief complaint\n", - " 2. Medical assessment\n", - " 3. Treatment recommendations\n", - " 4. Follow-up instructions\n", - "\n", - " Include appropriate medical terminology relevant to a {{policy_type}} insurance claim.\n", - " If this is for a Health policy, focus on the {{coverage_type}} aspects.\n", - " For other policy types, focus on injury assessment relevant to the incident.\n", - "\n", - " Use a professional medical documentation style that includes specific references\n", - " to the patient by name and birth date.\\\n", - "\n", - " The language should be natural and different from one physician to the next.\\\n", - "\n", - " Vary the length of the response. Keep some notes brief and others more detailed.\\\n", - " {% else -%}\\\n", - " Repeat the following: \"No medical claim\"\\\n", - " {% endif -%}\\\n", - " \"\"\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Adding Constraints\n", - "\n", - "To ensure our data is logically consistent, we'll add some constraints:\n", - "- Incident date must be during the policy term\n", - "- Filing date must be after incident date" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Ensure incident date falls within policy period\n", - "config_builder.add_constraint(\n", - " target_column=\"incident_date\",\n", - " type=\"column_inequality\",\n", - " params={\"operator\": \"ge\", \"rhs\": \"policy_start_date\"}\n", - ")\n", - "\n", - "config_builder.add_constraint(\n", - " target_column=\"incident_date\",\n", - " type=\"column_inequality\",\n", - " params={\"operator\": \"le\", \"rhs\": \"policy_end_date\"}\n", - ")\n", - "\n", - "# Ensure filing date is after incident date\n", - "config_builder.add_constraint(\n", - " target_column=\"filing_date\",\n", - " type=\"column_inequality\",\n", - " params={\"operator\": \"gt\", \"rhs\": \"incident_date\"}\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Preview and Generate Dataset\n", - "\n", - "First, we'll preview a small sample to verify our configuration is working correctly.\n", - "Then we'll generate the full dataset with the desired number of records." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Preview a few records\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# More previews\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate the full dataset\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", - "job_results.wait_until_done()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Display the first few rows of the generated dataset\n", - "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", - "\n", - "dataset.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "os.makedirs(\"./data\", exist_ok=True)\n", - "\n", - "# Save the dataset to CSV\n", - "dataset.to_csv(\"./data/insurance_claims_with_pii.csv\", index=False)\n", - "print(f\"Dataset with {len(dataset)} records saved to ./data/insurance_claims_with_pii.csv\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "114035a1", + "metadata": {}, + "source": [ + "# 🧾 NeMo Data Designer: Synthetic Insurance Claims Dataset Generator\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "This notebook creates a synthetic dataset of insurance claims with realistic PII (Personally Identifiable Information) \\\n", + "for testing data protection and anonymization techniques.\n", + "\n", + "The dataset includes:\n", + "- Policy and claim details\n", + "- Policyholder and claimant information (PII)\n", + "- Claim descriptions and adjuster notes with embedded PII\n", + "- Medical information for relevant claims\n", + "\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "948abb33", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78d08106", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " BernoulliSamplerParams,\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " ExpressionColumnConfig,\n", + " GaussianSamplerParams,\n", + " InferenceParameters,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " PersonSamplerParams,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " SubcategorySamplerParams,\n", + " UUIDSamplerParams,\n", + " UniformSamplerParams\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "37fb2747", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "471394ac", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "d802e2c7", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4b671a3", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False 
for the nemotron-nano-v2 model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "bdf25fb5", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77335c04", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "0d21b17a", + "metadata": {}, + "source": [ + "## 🎲 Creating Person Samplers\n", + "\n", + "We'll create person samplers to generate consistent personal information for different roles in the insurance claims process:\n", + "- Policyholders (primary insurance customers)\n", + "- Claimants (who may be different from policyholders)\n", + "- Adjusters (insurance company employees who evaluate claims)\n", + "- Physicians (for medical-related claims)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6843c16-5430-4228-a545-94cb0a22565c", + "metadata": {}, + "outputs": [], + "source": [ + "# Create person samplers for different roles using the en_US locale\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"policyholder\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"claimant\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"adjuster\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"physician\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\"),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "63af6848", + "metadata": {}, + "source": [ + "### Creating Policy Information\n", + "\n", + "Next, we'll create the basic policy information:\n", + "- Policy number (unique identifier)\n", + "- Policy type (Auto, Home, Health, etc.)\n", + "- Coverage details (based on policy type)\n", + "- Policy start and end dates" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "56a84bf3-609b-4611-bee4-fa8414ee7519", + "metadata": {}, + "outputs": [], + "source": [ + "# Policy identifiers\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"policy_number\",\n", + " sampler_type=SamplerType.UUID,\n", + " params=UUIDSamplerParams(prefix=\"POL-\", short_form=True, uppercase=True)\n", + " )\n", + ")\n", + "\n", + "# Policy type\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"policy_type\",\n", + " 
sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Auto\", \"Home\", \"Health\", \"Life\", \"Travel\"],\n", + " weights=[0.4, 0.3, 0.15, 0.1, 0.05]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Coverage types based on policy type\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"coverage_type\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"policy_type\",\n", + " values={\n", + " \"Auto\": [\"Liability\", \"Comprehensive\", \"Collision\", \"Uninsured Motorist\"],\n", + " \"Home\": [\"Dwelling\", \"Personal Property\", \"Liability\", \"Natural Disaster\"],\n", + " \"Health\": [\"Emergency Care\", \"Primary Care\", \"Specialist\", \"Prescription\"],\n", + " \"Life\": [\"Term\", \"Whole Life\", \"Universal Life\", \"Variable Life\"],\n", + " \"Travel\": [\"Trip Cancellation\", \"Medical Emergency\", \"Lost Baggage\", \"Flight Accident\"]\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Policy dates\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"policy_start_date\",\n", + " sampler_type=SamplerType.DATETIME,\n", + " params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", + " convert_to=\"%Y-%m-%d\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"policy_end_date\",\n", + " sampler_type=SamplerType.DATETIME,\n", + " params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", + " convert_to=\"%Y-%m-%d\"\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "41425390", + "metadata": {}, + "source": [ + "### Policyholder Information (PII)\n", + "\n", + "Now we'll add fields for the policyholder's personal information. This includes PII elements that would typically be \\\n", + "subject to privacy regulations:\n", + "- First and last name\n", + "- Birth date\n", + "- Contact information (email)\n", + "\n", + "These fields use expressions to reference the person sampler we defined earlier." 
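Because expression columns are plain Jinja templates, they can also combine several attributes from one person sampler into a single derived field. The sketch below is purely illustrative: the `policyholder_full_name` column is hypothetical and is not referenced elsewhere in this notebook.

```python
# Hypothetical illustration only: derive a full-name field from the same
# `policyholder` person sampler used by the columns above.
config_builder.add_column(
    ExpressionColumnConfig(
        name="policyholder_full_name",
        expr="{{policyholder.first_name}} {{policyholder.last_name}}"
    )
)
```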
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d706b819-e081-4626-ab49-b04332fbb3a3", + "metadata": {}, + "outputs": [], + "source": [ + "# Policyholder personal information\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"policyholder_first_name\",\n", + " expr=\"{{policyholder.first_name}}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"policyholder_last_name\",\n", + " expr=\"{{policyholder.last_name}}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"policyholder_birth_date\",\n", + " expr=\"{{policyholder.birth_date}}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"policyholder_email\",\n", + " expr=\"{{policyholder.email_address}}\"\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "f47b0e0f", + "metadata": {}, + "source": [ + "### Claim Information\n", + "\n", + "Next, we'll create the core claim details:\n", + "- Claim ID (unique identifier)\n", + "- Dates (filing date, incident date)\n", + "- Claim status (in process, approved, denied, etc.)\n", + "- Financial information (amount claimed, amount approved)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b1f6c37-6b05-4eca-8ef6-5c0d76d79ff1", + "metadata": {}, + "outputs": [], + "source": [ + "# Claim identifier\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"claim_id\",\n", + " sampler_type=SamplerType.UUID,\n", + " params=UUIDSamplerParams(prefix=\"CLM-\", short_form=True, uppercase=True)\n", + " )\n", + ")\n", + "\n", + "# Claim dates\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"incident_date\",\n", + " sampler_type=SamplerType.DATETIME,\n", + " params={\"start\": \"2023-01-01\", \"end\": \"2023-12-31\"},\n", + " convert_to=\"%Y-%m-%d\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"filing_date\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"timedelta\",\n", + " params={\n", + " \"dt_min\": 1,\n", + " \"dt_max\": 30,\n", + " \"reference_column_name\": \"incident_date\",\n", + " \"unit\": \"D\"\n", + " },\n", + " convert_to=\"%Y-%m-%d\"\n", + ")\n", + "\n", + "# Claim status\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"claim_status\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Filed\", \"Under Review\", \"Additional Info Requested\", \"Approved\", \"Denied\", \"Appealed\"],\n", + " weights=[0.15, 0.25, 0.15, 0.25, 0.15, 0.05]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Financial information\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"claim_amount\",\n", + " sampler_type=SamplerType.GAUSSIAN,\n", + " params=GaussianSamplerParams(mean=5000, stddev=2000)\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"approved_percentage\",\n", + " sampler_type=SamplerType.UNIFORM,\n", + " params=UniformSamplerParams(low=0.0, high=1.0)\n", + " )\n", + ")\n", + "\n", + "# Calculate approved amount based on percentage\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"approved_amount\",\n", + " expr=\"{{claim_amount * approved_percentage}}\"\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2b44e2ea", + "metadata": {}, + "source": [ + "### 
Claimant Information\n", + "\n", + "In some cases, the claimant (person filing the claim) may be different from the policyholder. \\\n", + "We'll create fields to capture claimant information and their relationship to the policyholder:\n", + "- Flag indicating if claimant is the policyholder\n", + "- Claimant personal details (when different from policyholder)\n", + "- Relationship to policyholder" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1dcb5e28-76a0-44cd-a1a8-0ecd807c093a", + "metadata": {}, + "outputs": [], + "source": [ + "# Determine if claimant is the policyholder\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"is_claimant_policyholder\",\n", + " sampler_type=SamplerType.BERNOULLI,\n", + " params=BernoulliSamplerParams(p=0.7)\n", + " )\n", + ")\n", + "\n", + "# Claimant personal information (when different from policyholder)\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"claimant_first_name\",\n", + " expr=\"{{claimant.first_name}}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"claimant_last_name\",\n", + " expr=\"{{claimant.last_name}}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"claimant_birth_date\",\n", + " expr=\"{{claimant.birth_date}}\"\n", + " )\n", + ")\n", + "\n", + "# Relationship to policyholder\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"relationship_to_policyholder\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(values=[\"Self\", \"Spouse\", \"Child\", \"Parent\", \"Sibling\", \"Other\"]),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b923e797", + "metadata": {}, + "source": [ + "### Claim Adjuster Information\n", + "\n", + "Insurance claims are typically handled by claim adjusters. 
We'll add information about \n", + "the adjuster assigned to each claim:\n", + "- Adjuster name\n", + "- Assignment date\n", + "- Contact information" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d598037-765f-4824-a93d-87735c23b1f1", + "metadata": {}, + "outputs": [], + "source": [ + "# Adjuster information\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"adjuster_first_name\",\n", + " expr=\"{{adjuster.first_name}}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"adjuster_last_name\",\n", + " expr=\"{{adjuster.last_name}}\"\n", + " )\n", + ")\n", + "\n", + "# Adjuster assignment date\n", + "config_builder.add_column(\n", + " name=\"adjuster_assignment_date\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"timedelta\",\n", + " params={\n", + " \"dt_min\": 0,\n", + " \"dt_max\": 5,\n", + " \"reference_column_name\": \"filing_date\",\n", + " \"unit\": \"D\"\n", + " },\n", + " convert_to=\"%Y-%m-%d\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cd06da0e", + "metadata": {}, + "source": [ + "### Medical Information\n", + "\n", + "For health insurance claims and injury-related claims in other policy types, \n", + "we'll include medical information:\n", + "- Flag indicating if there's a medical component to the claim\n", + "- Medical claim details (when applicable)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b840142-d8d8-4365-a8dc-6d10cd2d68d7", + "metadata": {}, + "outputs": [], + "source": [ + "# Is there a medical component to this claim?\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"has_medical_component\",\n", + " sampler_type=SamplerType.BERNOULLI,\n", + " params=BernoulliSamplerParams(p=0.4)\n", + " )\n", + ")\n", + "\n", + "# Physician information using conditional logic\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"physician_first_name\",\n", + " expr=\"{% if has_medical_component == 1 %}{{physician.first_name}}{% else %}NA{% endif %}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"physician_last_name\",\n", + " expr=\"{% if has_medical_component == 1 %}{{physician.last_name}}{% else %}NA{% endif %}\"\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "3b8ade9d", + "metadata": {}, + "source": [ + "### Free Text Fields with PII References\n", + "\n", + "These fields will contain natural language text that incorporates PII elements from other fields.\n", + "This is particularly useful for testing PII detection and redaction within unstructured text:\n", + "\n", + "1. Incident Description - The policyholder/claimant's account of what happened\n", + "2. Adjuster Notes - The insurance adjuster's professional documentation\n", + "3. Medical Notes - For claims with a medical component\n", + "\n", + "The LLM will be prompted to include PII elements like names, dates, and contact information\n", + "within the narrative text."
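Since these narratives are meant to exercise PII detection, it can help to spot-check that the PII requirements in the prompts actually held in the generated text. A minimal sketch of such a check (a hypothetical helper, not part of the notebook; it assumes each record is available as a dict or DataFrame row with the column names defined above):

```python
# Hypothetical spot check: confirm the free-text narrative embeds the
# policyholder email, as the incident_description prompt requests.
def mentions_email(record: dict) -> bool:
    return record["policyholder_email"].lower() in record["incident_description"].lower()

# Toy record standing in for one generated row:
sample = {
    "policyholder_email": "jane.doe@example.com",
    "incident_description": "I reported the damage the same day from jane.doe@example.com.",
}
print(mentions_email(sample))  # -> True
```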
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3276fe6e-8eae-4137-be14-589035f4e43a", + "metadata": {}, + "outputs": [], + "source": [ + "# Incident description from policyholder/claimant\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"incident_description\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"Write a detailed description of an insurance incident for a {{policy_type}} insurance policy with \"\n", + " \"{{coverage_type}} coverage.\\n\\n\"\n", + " \"The policyholder is {{policyholder_first_name}} {{policyholder_last_name}} (email: {{policyholder_email}}).\\n\\n\"\n", + " \"The incident occurred on {{incident_date}} and resulted in approximately ${{claim_amount}} in damages/expenses.\\n\\n\"\n", + " \"Write this from the perspective of the person filing the claim. Include specific details that would be relevant \"\n", + " \"to processing this type of claim. Make it detailed but realistic, as if written by someone describing an actual incident.\\n\\n\"\n", + " \"Reference the policyholder's contact information at least once in the narrative.\\n\"\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Adjuster notes\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"adjuster_notes\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"Write detailed insurance adjuster notes for claim {{claim_id}}.\\n\\n\"\n", + " \"POLICY INFORMATION:\\n\"\n", + " \"- Policy #: {{policy_number}}\\n\"\n", + " \"- Type: {{policy_type}}, {{coverage_type}} coverage\\n\"\n", + " \"- Policyholder: {{policyholder_first_name}} {{policyholder_last_name}}\\n\\n\"\n", + " \"CLAIM DETAILS:\\n\"\n", + " \"- Incident Date: {{incident_date}}\\n\"\n", + " \"- Filing Date: {{filing_date}}\\n\"\n", + " \"- Claimed Amount: ${{claim_amount}}\\n\\n\"\n", + " \"As adjuster {{adjuster_first_name}} {{adjuster_last_name}}, write professional notes documenting:\\n\"\n", + " \"1. Initial contact with the policyholder\\n\"\n", + " \"2. Assessment of the claim based on the incident description\\n\"\n", + " \"3. Coverage determination under the policy\\n\"\n", + " \"4. Recommended next steps\\n\\n\"\n", + " \"Include at least one mention of contacting the policyholder using their full name and email ({{policyholder_email}}).\\n\"\n", + " \"Use a formal, professional tone typical of insurance documentation.\\n\"\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Medical notes (for claims with medical component)\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"medical_notes\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"{% if has_medical_component %}\"\n", + " \"Write medical notes related to insurance claim {{ claim_id }}.\\n\\n\"\n", + " \"Patient: {{policyholder_first_name}} {{policyholder_last_name}}, DOB: {{policyholder_birth_date}}\\n\\n\"\n", + " \"As Dr. {{physician_first_name}} {{physician_last_name}}, document:\\n\\n\"\n", + " \"1. Chief complaint\\n\"\n", + " \"2. Medical assessment\\n\"\n", + " \"3. Treatment recommendations\\n\"\n", + " \"4. 
Follow-up instructions\\n\\n\"\n", + " \"Include appropriate medical terminology relevant to a {{policy_type}} insurance claim.\\n\"\n", + " \"If this is for a Health policy, focus on the {{coverage_type}} aspects.\\n\"\n", + " \"For other policy types, focus on injury assessment relevant to the incident.\\n\\n\"\n", + " \"Use a professional medical documentation style that includes specific references \"\n", + " \"to the patient by name and birth date.\\n\\n\"\n", + " \"The language should be natural and different from one physician to the next.\\n\\n\"\n", + " \"Vary the length of the response. Keep some notes brief and others more detailed.\\n\"\n", + " \"{% else -%}\"\n", + " \"No medical claim\"\n", + " \"{% endif -%}\"\n", + " ),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9140e015", + "metadata": {}, + "source": [ + "### Adding Constraints\n", + "\n", + "To ensure our data is logically consistent, we'll add some constraints:\n", + "- Incident date must be during the policy term\n", + "- Filing date must be after incident date" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9d6b187-de07-493e-b3ba-ed0052ab216d", + "metadata": {}, + "outputs": [], + "source": [ + "# Ensure incident date falls within policy period\n", + "config_builder.add_constraint(\n", + " target_column=\"incident_date\",\n", + " constraint_type=\"column_inequality\",\n", + " operator=\"ge\",\n", + " rhs=\"policy_start_date\"\n", + ")\n", + "\n", + "config_builder.add_constraint(\n", + " target_column=\"incident_date\",\n", + " constraint_type=\"column_inequality\",\n", + " operator=\"le\",\n", + " rhs=\"policy_end_date\"\n", + ")\n", + "\n", + "# Ensure filing date is after incident date\n", + "config_builder.add_constraint(\n", + " target_column=\"filing_date\",\n", + " constraint_type=\"column_inequality\",\n", + " operator=\"gt\",\n", + " rhs=\"incident_date\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "330e47c0", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." 
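Preview inspection can be partly automated. A small sanity-check sketch, assuming `preview.dataset` exposes the preview as a pandas DataFrame (as in earlier versions of this SDK) and using the column names defined above:

```python
import pandas as pd

def check_claims(df: pd.DataFrame) -> None:
    # The declared constraints should hold: filing strictly after incident.
    incident = pd.to_datetime(df["incident_date"])
    filing = pd.to_datetime(df["filing_date"])
    assert (filing > incident).all(), "filing_date must be after incident_date"
    # Gaussian(mean=5000, stddev=2000) leaves roughly 0.6% of draws below zero
    # (zero sits 2.5 standard deviations below the mean), so watch for them.
    print("negative claim amounts:", int((df["claim_amount"] < 0).sum()))

# check_claims(preview.dataset)
```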
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7ad52fb", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e248b8a", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "4f890e19", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb938dca", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "212c1110", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8725bc15", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af926788", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c18e0ab2", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f40f4fad", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-healthcare-datasets-insurance-claims\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb index 020d0f28e..a84738ba7 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb +++ 
b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb @@ -2,39 +2,45 @@ "cells": [ { "cell_type": "markdown", - "id": "a9883b84", "metadata": {}, "source": [ - "# πŸ§‘β€βš•οΈ NeMo Data Designer: Realistic Patient Data & Physician Notes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", + "# πŸ§‘β€βš•οΈ NeMo Data Designer: Realistic Patient Data & Physician Notes\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "This notebook demonstrates how to use NeMo Data Designer to generate realistic patient data including physician notes.\\\n", + " We'll leverage both structured data generation and LLM capabilities to create a comprehensive medical dataset.\n", + "\n", + "The dataset includes:\n", + "- Policy and claim details\n", + "- Policyholder and claimant information (PII)\n", + "- Claim descriptions and adjuster notes with embedded PII\n", + "- Medical information for relevant claims\n", + "\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook demonstrates how to use NeMo Data Designer to generate realistic patient data including physician notes. We'll leverage both structured data generation and LLM capabilities to create a comprehensive medical dataset." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" + "- The `essentials` module provides quick access to the most commonly used objects.\n" ] }, { @@ -43,13 +49,19 @@ "metadata": {}, "outputs": [], "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" + " InferenceParameters,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " PersonSamplerParams,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " SeedDatasetReference,\n", + " UUIDSamplerParams,\n", + ")" ] }, { @@ -58,8 +70,7 @@ "source": [ "### βš™οΈ Initialize the NeMo Data Designer Client\n", "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" ] }, { @@ -68,22 +79,24 @@ "metadata": {}, "outputs": [], "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "### πŸŽ›οΈ Define model configurations\n", "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- You must provide a list of model configs to the builder at initialization.\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] }, { @@ -92,9 +105,43 @@ "metadata": {}, "outputs": [], "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" ] }, { @@ -103,32 +150,42 @@ "metadata": {}, "outputs": [], "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " 
max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## πŸ“Š Loading Seed Data\n", + "## 🌱 Loading Seed Data\n", + "\n", + "- We'll use the symptom-to-diagnosis dataset as our seed data. \n", + "\n", + "- This dataset contains patient symptoms and corresponding diagnoses which will help generate realistic medical scenarios.\n", + "\n", + "
\n", + "\n", + "> 🌱 **Why use a seed dataset?**\n", + ">\n", + "> - Seed datasets let you steer the generation process by providing context that is specific to your use case.\n", + ">\n", + "> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.\n", + ">\n", + "> - During generation, prompt templates can reference any of the seed dataset fields.\n", + "\n", + "
\n", + "\n", + "> πŸ’‘ **About datastores**\n", + ">\n", + "> - You can use seed datasets from _either_ the Hugging Face Hub or a locally deployed datastore.\n", + ">\n", + "> - By default, we use the local datastore deployed with the Data Designer microservice.\n", + ">\n", + "> - The datastore endpoint is specified in the deployment configuration.\n", "\n", - "We'll use the symptom-to-diagnosis dataset as our seed data. This dataset contains patient symptoms and corresponding diagnoses which will help generate realistic medical scenarios.\n", "\n", - "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. " + "πŸ‘‹ **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as \\\n", + "seeds, it is recommended you consolidated these into a single file. " ] }, { @@ -139,7 +196,7 @@ "source": [ "from datasets import load_dataset\n", "\n", - "# Let's use the symptom-to-diagnosis dataset to seed our workflow.\n", + "# Let's use the symptom-to-diagnosis dataset to seed our workflow\n", "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n", "df_seed = df_seed.rename(columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"})\n", "\n", @@ -154,27 +211,53 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", + "# Upload the dataset to the local datastore if not already present.\n", + "# NDD suppoorts uploading pandas DataFrame or a CSV, Parquet, or JSON file to the datastore\n", + "seed_dataset_reference = data_designer_client.upload_seed_dataset(\n", + " dataset=df_seed,\n", + " repo_id=\"data-designer/demo\",\n", + " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", + ")\n", "\n", - "os.makedirs(\"./data\", exist_ok=True)\n", - "df_seed.to_csv(\"./data/symptom_to_diagnosis.csv\", index=False)" + "# Pass the reference to the config builder for use during generation.\n", + "config_builder.with_seed_dataset(seed_dataset_reference)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# We use with_replacement=False, so our max num_records is 853.\n", - "config_builder.with_seed_dataset(\n", - " repo_id=\"advanced/healthcare-datasets\",\n", - " filename=\"symptom_to_diagnosis.csv\",\n", - " dataset_path=\"./data/symptom_to_diagnosis.csv\",\n", - " sampling_strategy=\"shuffle\", # \"ordered\"\n", - " with_replacement=True,\n", - " datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"}\n", - ")" + "> πŸ’‘ **Tip**\n", + ">\n", + "> - If the dataset already exists in the datastore, you can create the seed dataset reference directly.\n", + ">\n", + "> - See the [seed dataset docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/seed-datasets.html) for more info.\n", + ">\n", + ">
\n", + ">\n", + "> For example:\n", + ">\n", + "> ```python\n", + "> from nemo_microservices.data_designer.essentials import SeedDatasetReference\n", + ">\n", + "> # Create reference to existing dataset in the datastore.\n", + "> seed_dataset_reference = SeedDatasetReference(\n", + "> dataset=\"data-designer/demo/gretelai_symptom_to_diagnosis.csv\",\n", + "> datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", + "> )\n", + "> ```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎲 Creating Person Samplers\n", + "\n", + "- We create persona samplers to simulate details about the patient and the doctor \n", + "\n", + "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", + "If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker\n" ] }, { @@ -184,7 +267,21 @@ "outputs": [], "source": [ "# Create a couple random person samplers.\n", - "config_builder.with_person_samplers({\"patient_sampler\": {}, \"doctor_sampler\": {}})" + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"patient_sampler\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"doctor_sampler\",\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(),\n", + " )\n", + ")" ] }, { @@ -208,47 +305,48 @@ "outputs": [], "source": [ "config_builder.add_column(\n", - " name=\"patient_id\",\n", - " type=\"uuid\",\n", - " params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True},\n", + " SamplerColumnConfig(\n", + " name=\"patient_id\",\n", + " sampler_type=SamplerType.UUID,\n", + " params=UUIDSamplerParams(prefix=\"PT-\", short_form=True, uppercase=True),\n", + " )\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"first_name\",\n", - " type=\"expression\",\n", + " column_type=\"expression\",\n", " expr=\"{{patient_sampler.first_name}}\"\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"last_name\",\n", - " type=\"expression\",\n", + " column_type=\"expression\",\n", " expr=\"{{patient_sampler.last_name}}\"\n", ")\n", "\n", - "\n", "config_builder.add_column(\n", " name=\"dob\",\n", - " type=\"expression\",\n", + " column_type=\"expression\",\n", " expr=\"{{patient_sampler.birth_date}}\"\n", ")\n", "\n", - "\n", "config_builder.add_column(\n", " name=\"patient_email\",\n", - " type=\"expression\",\n", + " column_type=\"expression\",\n", " expr=\"{{patient_sampler.email_address}}\"\n", ")\n", "\n", - "\n", "config_builder.add_column(\n", " name=\"symptom_onset_date\",\n", - " type=\"datetime\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"datetime\",\n", " params={\"start\": \"2024-01-01\", \"end\": \"2024-12-31\"},\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"date_of_visit\",\n", - " type=\"timedelta\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"timedelta\",\n", " params={\n", " \"dt_min\": 1,\n", " \"dt_max\": 30,\n", @@ -258,7 +356,7 @@ "\n", "config_builder.add_column(\n", " name=\"physician\",\n", - " type=\"expression\",\n", + " column_type=\"expression\",\n", " expr=\"Dr. 
{{doctor_sampler.first_name}} {{doctor_sampler.last_name}}\",\n", ")" ] @@ -267,7 +365,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### πŸ“ LLM-Generated Physician Notes\n", + "## 🦜 LLM-Generated Physician Notes\n", "\n", "The final and most complex column uses an LLM to generate realistic physician notes. We provide:\n", "\n", @@ -275,7 +373,7 @@ "- Patient summary from our seed data\n", "- Clear formatting instructions\n", "\n", - "This will create detailed medical notes that reflect the patient's diagnosis and visit information. Note how we reference other columns in the prompt using Jinja templating syntax with double curly braces `{{column_name}}`." + "This will create detailed medical notes that reflect the patient's diagnosis and visit information. " ] }, { @@ -284,39 +382,45 @@ "metadata": {}, "outputs": [], "source": [ - "# Note we have access to the seed data fields.\n", "config_builder.add_column(\n", - " name=\"physician_notes\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\\\n", - "\n", - "You are a primary-care physician who just had an appointment with {{first_name}} {{last_name}},\n", - "who has been struggling with symptoms from {{diagnosis}} since {{symptom_onset_date}}.\n", - "The date of today's visit is {{date_of_visit}}.\n", - "\n", - "\n", - "\n", - "{{patient_summary}}\n", - "\n", - "\n", - "\n", - "Write careful notes about your visit with {{first_name}},\n", - "as {{physician}}.\n", - "\n", - "Format the notes as a busy doctor might.\n", - "\n", - "\"\"\"\n", - " )" + " LLMTextColumnConfig(\n", + " name=\"physician_notes\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"\"\n", + " \"You are a primary-care physician who just had an appointment with {{first_name}} {{last_name}}, \"\n", + " \"who has been struggling with symptoms from {{diagnosis}} since {{symptom_onset_date}}.\\n\"\n", + " \"The date of today's visit is {{date_of_visit}}.\\n\"\n", + " \"\\n\"\n", + "\n", + " \"\\n\"\n", + " \"{{patient_summary}}\\n\"\n", + " \"\\n\"\n", + "\n", + " \"\\n\"\n", + " \"Write careful notes about your visit with {{first_name}}, as {{physician}}.\\n\"\n", + "\n", + " \"Format the notes as a busy doctor might.\\n\"\n", + " \"\"\n", + " )\n", + " )\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## πŸ‘€ Previewing the Dataset\n", + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", "\n", - "Let's generate a preview to see how our data looks before creating the full dataset. This helps verify that our configuration is working as expected." + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." 
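When iterating on prompt wording, it also helps to read a few whole notes alongside the seed fields they were conditioned on. A small helper sketch (assuming `preview.dataset` returns a pandas DataFrame, as it did in earlier versions of this SDK):

```python
def show_note(df, i: int = 0) -> None:
    # Print one generated note with the seed fields it should reflect.
    row = df.iloc[i]
    print(f"{row['physician']} on {row['date_of_visit']} (diagnosis: {row['diagnosis']})")
    print(row["physician_notes"])

# show_note(preview.dataset)
```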
] }, { @@ -325,7 +429,8 @@ "metadata": {}, "outputs": [], "source": [ - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" ] }, { @@ -334,25 +439,40 @@ "metadata": {}, "outputs": [], "source": [ + "# More previews\n", "preview.display_sample_record()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "preview.dataset" + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## πŸš€ Generating the Full Dataset\n", + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", "\n", - "Now that we've verified our configuration works correctly, let's generate a larger dataset with 100 records. We'll wait for the workflow to complete so we can access the data immediately." + "- Use the `create` method to submit larger Data Designer generation jobs.\n" ] }, { @@ -361,7 +481,9 @@ "metadata": {}, "outputs": [], "source": [ - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", "job_results.wait_until_done()" ] }, @@ -371,8 +493,8 @@ "metadata": {}, "outputs": [], "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", "\n", "dataset.head()" ] @@ -383,9 +505,25 @@ "metadata": {}, "outputs": [], "source": [ - "csv_filename = f\"./data/physician_notes_with_realistic_personal_details.csv\"\n", - "dataset.to_csv(csv_filename, index=False)\n", - "print(f\"Dataset with {len(dataset)} records saved to {csv_filename}\")" + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-healthcare-datasets-physician-notes\",\n", + ");" ] } ], @@ -409,5 +547,5 @@ } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb index 5e580e66a..1ded5f81d 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb @@ -2,135 +2,168 @@ "cells": [ { "cell_type": "markdown", + "id": "09c22178", "metadata": {}, "source": [ - "# 🎨 NeMo Data Designer: Synthetic Conversational 
Data with Person Details" - ] - }, - { - "cell_type": "markdown", - "id": "575c7340", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", + "# 🎨 NeMo Data Designer: Synthetic Conversational Data with Person Details\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "### πŸ“š What you'll learn\n", + "\n", + "- This notebook demonstrates how to use the NeMo Data Designer to build a synthetic data generation pipeline step-by-step.\n", + "\n", + "- We will create multi-turn user-assistant dialogues tailored for fine-tuning language models, enhanced with realistic person details. \n", + "\n", + "- These datasets could be used for developing and enhancing conversational AI applications, including customer \\\n", + "support chatbots, virtual assistants, and interactive learning systems.\n", + "\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" ] }, { "cell_type": "markdown", + "id": "c45075a8", "metadata": {}, "source": [ - "This notebook demonstrates how to use the NeMo Data Designer to build a synthetic data generation pipeline step-by-step. We will create multi-turn user-assistant dialogues tailored for fine-tuning language models, enhanced with realistic person details. These synthetic dialogues can then be used as domain-specific training data to improve model performance in targeted scenarios.\n", + "### πŸ“¦ Import the essentials\n", "\n", - "These datasets could be used for developing and enhancing conversational AI applications, including customer support chatbots, virtual assistants, and interactive learning systems." - ] - }, - { - "cell_type": "markdown", - "id": "6343c223", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment." 
+ "- The `essentials` module provides quick access to the most commonly used objects.\n" ] }, { "cell_type": "code", "execution_count": null, + "id": "5cd210d4", "metadata": {}, "outputs": [], "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" + " InferenceParameters,\n", + " LLMJudgeColumnConfig,\n", + " LLMStructuredColumnConfig,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " Score,\n", + " SubcategorySamplerParams\n", + ")" ] }, { "cell_type": "markdown", - "id": "7f7690c3", + "id": "4e67dd05", "metadata": {}, "source": [ "### βš™οΈ Initialize the NeMo Data Designer Client\n", "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "a1ea201f", + "id": "99d18ebe", "metadata": {}, "outputs": [], "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" ] }, { "cell_type": "markdown", - "id": "21f99c1b", + "id": "2830c677", "metadata": {}, "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "### πŸŽ›οΈ Define model configurations\n", "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- You must provide a list of model configs to the builder at initialization.\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "97b8e7b8", + "id": "d6f97be3", "metadata": {}, "outputs": [], "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" + "# This name is set in the microservice 
deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "5fbd0e22", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "43b6d7c2", + "id": "28c51801", "metadata": {}, "outputs": [], "source": [ - "\n", - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" ] }, { "cell_type": "markdown", + "id": "cd89a931", "metadata": {}, "source": [ "### Define Pydantic Models for Structured Outputs\n", @@ -141,6 +174,7 @@ { "cell_type": "code", "execution_count": null, + "id": "af0d9313", "metadata": {}, "outputs": [], "source": [ @@ -180,92 +214,109 @@ }, { "cell_type": "markdown", + "id": "37ff2717", "metadata": {}, "source": [ - "### 🌱 Adding Categorical Seed Columns\n", + "## 🎲 Adding Sampler Columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "Define categorical seed columns that set the context for the generated dialogues. Domain, topic, complexity, conversation length, and user mood will influence the generated conversations." + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." 
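The Pydantic models referenced above are unchanged in this diff, so their definitions are elided. For orientation, a structured-output model like `ChatConversation` typically looks roughly like the sketch below; the field names here are assumptions, not the notebook's actual definitions:

```python
from typing import Literal

from pydantic import BaseModel, Field

class ChatMessage(BaseModel):
    role: Literal["user", "assistant"]
    content: str = Field(..., description="Text of this turn.")

class ChatConversation(BaseModel):
    # Passed as output_format=ChatConversation so the LLM returns valid JSON.
    messages: list[ChatMessage] = Field(..., description="Alternating user/assistant turns.")
```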
] }, { "cell_type": "code", "execution_count": null, + "id": "addb7ef4-ef77-4428-8d3d-9b0ceef1ff34", "metadata": {}, "outputs": [], "source": [ "# Add domain column with subcategories for topics\n", "config_builder.add_column(\n", - " name=\"domain\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Tech Support\", \"Personal Finances\", \"Educational Guidance\"],\n", - " \"num_new_values_to_generate\": 5\n", - " }\n", + " SamplerColumnConfig(\n", + " name=\"domain\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Tech Support\", \"Personal Finances\", \"Educational Guidance\"]\n", + " )\n", + " )\n", ")\n", "\n", "# Add topic subcategory\n", "config_builder.add_column(\n", - " name=\"topic\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"domain\",\n", - " \"values\": {\n", - " \"Tech Support\": [\n", - " \"Troubleshooting a Laptop\",\n", - " \"Setting Up a Home Wi-Fi Network\",\n", - " \"Installing Software Updates\"\n", - " ],\n", - " \"Personal Finances\": [\n", - " \"Budgeting Advice\",\n", - " \"Understanding Taxes\",\n", - " \"Investment Strategies\"\n", - " ],\n", - " \"Educational Guidance\": [\n", - " \"Choosing a College Major\",\n", - " \"Effective Studying Techniques\",\n", - " \"Learning a New Language\"\n", - " ]\n", - " },\n", - " \"num_new_values_to_generate\": 2\n", - " }\n", + " SamplerColumnConfig(\n", + " name=\"topic\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"domain\",\n", + " values={\n", + " \"Tech Support\": [\n", + " \"Troubleshooting a Laptop\",\n", + " \"Setting Up a Home Wi-Fi Network\",\n", + " \"Installing Software Updates\",\n", + " ],\n", + " \"Personal Finances\": [\n", + " \"Budgeting Advice\",\n", + " \"Understanding Taxes\",\n", + " \"Investment Strategies\",\n", + " ],\n", + " \"Educational Guidance\": [\n", + " \"Choosing a College Major\",\n", + " \"Effective Studying Techniques\",\n", + " \"Learning a New Language\",\n", + " ],\n", + " },\n", + " )\n", + " )\n", ")\n", "\n", "# Add complexity column\n", "config_builder.add_column(\n", - " name=\"complexity\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Basic\", \"Intermediate\", \"Advanced\"]\n", - " }\n", + " SamplerColumnConfig(\n", + " name=\"complexity\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Basic\", \"Intermediate\", \"Advanced\"]\n", + " )\n", + " )\n", ")\n", "\n", "# Add conversation length column\n", "config_builder.add_column(\n", - " name=\"conversation_length\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [2, 4, 6, 8]\n", - " }\n", + " SamplerColumnConfig(\n", + " name=\"conversation_length\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[2, 4, 6, 8]\n", + " )\n", + " )\n", ")\n", "\n", "# Add user mood column\n", "config_builder.add_column(\n", - " name=\"user_mood\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"happy\", \"silly\", \"sarcastic\", \"combative\", \"disappointed\", \"toxic\"]\n", - " }\n", + " SamplerColumnConfig(\n", + " name=\"user_mood\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"happy\", \"silly\", \"sarcastic\", \"combative\", \"disappointed\", \"toxic\"]\n", + " )\n", + " )\n", ")" ] }, { "cell_type": "markdown", + "id": "2a34ba38", "metadata": {}, "source": [ - "### ✨ Adding 
Generated Data Columns\n", - "Now define the columns that the model will generate. These prompts instruct the LLM to produce the actual conversation: a system prompt to guide how the AI assistant engages in the conversation with the user, the conversation, and finally, we generate a toxicity_label to assess user toxicity over the entire conversation.\n", + "## 🦜 Adding LLM Generated columns\n", + "Now define the columns that the model will generate. These prompts instruct the LLM to produce the actual conversation: \n", + "- a system prompt to guide how the AI assistant engages in the conversation with the user, \n", + "- the conversation, and \n", + "- finally, we generate a toxicity_label to assess user toxicity over the entire conversation.\n", + "
\n", "\n", - "#### πŸ’¬πŸ€– AI Assistant system prompt and conversation\n", + "### πŸ’¬πŸ€– AI Assistant system prompt and conversation\n", "\n", "We generate a system prompt to base the AI assistant and then generate the entire conversation." ] @@ -273,171 +324,243 @@ { "cell_type": "code", "execution_count": null, + "id": "06515490-4422-4d6f-bc1d-68b304c44518", "metadata": {}, "outputs": [], "source": [ "# Generate assistant system prompt\n", "config_builder.add_column(\n", - " name=\"assistant_system_prompt\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " system_prompt=\"Keep this to a maximum of two sentences.\",\n", - " prompt=\"Write a reasonable system prompt for a helpful AI assistant with expertise in {{domain}} and {{topic}}. The AI assistant must not engage in harmful behaviors.\"\n", + " LLMTextColumnConfig(\n", + " name=\"assistant_system_prompt\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\"Write a reasonable system prompt for a helpful AI assistant with expertise in \"\n", + " \"{{domain}} and {{topic}}. The AI assistant must not engage in harmful behaviors.\"),\n", + " model_alias=MODEL_ALIAS,\n", + " )\n", ")\n", "\n", "# Generate the user's task\n", "config_builder.add_column(\n", - " name=\"user_task\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " system_prompt=\"The task should be clear, focused on a single goal, and at most two sentences. Focus only on the task and don't provide only the task information.\",\n", - " prompt=\"Define a simple task related to {{topic}} of {{complexity}} complexity for the user.\"\n", + " LLMTextColumnConfig(\n", + " name=\"user_task\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=\"Define a simple task related to {{topic}} of {{complexity}} complexity for the user.\",\n", + " model_alias=MODEL_ALIAS,\n", + " )\n", ")\n", "\n", - "\n", "# Generate the conversation\n", "config_builder.add_column(\n", - " name=\"conversation\",\n", - " type=\"llm-structured\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"\\n{{user_task}}\\n\\n\\n\"\n", - "\n", - " \"{{assistant_system_prompt}}\\n\\n\"\n", - "\n", - " \"Generate a conversation between a user and an AI assistant with about .\\n\"\n", - " \"User is asking the assistant for advice and is in a {{user_mood}} mood.\\n\"\n", - " \"The conversation must be {{conversation_length}} messages in length.\\n\"\n", - " \"The conversation must come to a natural end in {{conversation_length}} messages and if the assistant is unable \"\n", - " \"to solve the user's needs by then, they should offer to continue the conversation later or redirect them to additional resources.\\n\"\n", - " \"The conversation must be realistic and natural, incorporating details from both the user's and assistant's profiles.\\n\"\n", - " \"The AI assistant role has the system prompt defined in . 
All of its responses must conform to :\n", + " ),\n", + " output_format=ChatConversation,\n", + " model_alias=MODEL_ALIAS,\n", + " )\n", + ")\n" ] }, { "cell_type": "markdown", + "id": "e9ed0ed9", "metadata": {}, "source": [ - "#### πŸ”βš–οΈ Assessment Column: Toxicity Label\n", - "For the final step, we add a toxicity assessment column to classify the overall toxicity level of the user's messages across the entire conversation." + "## πŸ” LLM-as-a-Judge: Toxicity Assessment\n", + "\n", + "When generating our synthetic dataset, we need to determine the quality of the generated dialogues. \\\n", + "We use the LLM-as-a-Judge strategy to do this. \n", + "\n", + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", + "that provides relevant instructions. " ] }, { "cell_type": "code", "execution_count": null, + "id": "fb8d945b-a7cb-4add-abc0-dac10a335076", "metadata": {}, "outputs": [], "source": [ - "# # Toxicity Label for the entire conversation\n", + "# Toxicity Label for the entire conversation\n", "# Define Toxicity Rubric\n", - "toxicity_rubric = P.Rubric(\n", + "toxicity_rubric = Score(\n", " name=\"Toxicity\",\n", " description=\"Evaluation of harmful, harassing, or hateful language in conversations\",\n", - " scoring={\n", - " \"4\": \"No toxicity detected; conversation is respectful and appropriate.\",\n", - " \"3\": \"Mild toxicity; slightly rude or sarcastic but not hateful or harmful.\",\n", - " \"2\": \"Moderate toxicity; contains disrespectful or harassing language.\",\n", - " \"1\": \"Severe toxicity; contains overt hate, harassment, or harmful content.\",\n", - " \"0\": \"Extreme toxicity; contains dangerous, threatening, or abusive content.\"\n", - " }\n", + " options={\n", + " \"4\": \"No toxicity detected; conversation is respectful and appropriate.\",\n", + " \"3\": \"Mild toxicity; slightly rude or sarcastic but not hateful or harmful.\",\n", + " \"2\": \"Moderate toxicity; contains disrespectful or harassing language.\",\n", + " \"1\": \"Severe toxicity; contains overt hate, harassment, or harmful content.\",\n", + " \"0\": \"Extreme toxicity; contains dangerous, threatening, or abusive content.\"\n", + " }\n", ")\n", "\n", - "TOXICITY_PROMPT_TEMPLATE = \"\"\"\\\n", - "You are an expert in content moderation and digital communication ethics. You are evaluating the toxicity level in user messages.\n", - "Take a deep breath and carefully analyze the conversation provided. 
Use the rubric to score the level of toxicity in the USER'S messages only.\n", + "TOXICITY_PROMPT_TEMPLATE = (\n", + " \"You are an expert in content moderation and digital communication ethics. \"\n", + " \"You are evaluating the toxicity level in user messages.\\n\"\n", + " \"Take a deep breath and carefully analyze the conversation provided. \\n\"\n", + " \"Use the rubric to score the level of toxicity in the USER'S messages only.\\n\"\n", "\n", - "Consider:\n", - "- Focus primarily on harassment, hate, or harmful language\n", - "- Evaluate the full context of the conversation\n", - "- Be objective and consistent in your assessment\n", - "- If any single message exhibits toxicity, it should influence the final score\n", + " \"Consider:\\n\"\n", + " \"- Focus primarily on harassment, hate, or harmful language\\n\"\n", + " \"- Evaluate the full context of the conversation\\n\"\n", + " \"- Be objective and consistent in your assessment\\n\"\n", + " \"- If any single message exhibits toxicity, it should influence the final score\\n\"\n", "\n", - "## CONVERSATION\n", - "{{ conversation }}\n", - "\"\"\"\n", + " \"## CONVERSATION\\n\"\n", + " \"{{ conversation }}\"\n", + ")\n", "\n", "config_builder.add_column(\n", - " name=\"toxicity_evaluation\",\n", - " type='llm-judge',\n", - " model_alias=model_alias,\n", - " prompt=TOXICITY_PROMPT_TEMPLATE,\n", - " rubrics=[toxicity_rubric]\n", + " LLMJudgeColumnConfig(\n", + " name=\"toxicity_evaluation\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=TOXICITY_PROMPT_TEMPLATE,\n", + " scores=[toxicity_rubric],\n", + " model_alias=MODEL_ALIAS\n", + " )\n", ")" ] }, { "cell_type": "markdown", + "id": "90fc232e", "metadata": {}, "source": [ - "## πŸ‘€ Generating a dataset preview\n", + "### πŸ” Iteration is key – preview the dataset!\n", "\n", - "- Preview mode allows you to quickly iterate on your data design.\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", "\n", - "- Each preview generation call creates 10 records for inspection, helping you verify prompts and instructions before running a larger batch job." + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." ] }, { "cell_type": "code", "execution_count": null, + "id": "23a44878", "metadata": {}, "outputs": [], "source": [ - "# Generate a preview\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" ] }, { "cell_type": "code", "execution_count": null, + "id": "89c2eccf", "metadata": {}, "outputs": [], "source": [ + "# More previews\n", "preview.display_sample_record()" ] }, { "cell_type": "markdown", + "id": "68f65bf7", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec5eed01", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "7fc7075e", "metadata": {}, "source": [ - "## πŸ€” Like what you see?\n", + "### πŸ†™ Scale up!\n", "\n", - "Submit a batch workflow!" 
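Once a batch job completes, the structured `conversation` column can be flattened into a chat-style JSONL file for fine-tuning. A sketch under two assumptions: `dataset` is the DataFrame returned by `job_results.load_dataset()`, and each conversation cell parses to a dict with a `messages` list (matching the hypothetical model sketched earlier):

```python
import json

def to_jsonl(df, path: str = "conversations.jsonl") -> None:
    with open(path, "w") as f:
        for raw in df["conversation"]:
            # Cells may arrive as JSON strings or already-parsed dicts.
            convo = json.loads(raw) if isinstance(raw, str) else raw
            f.write(json.dumps({"messages": convo["messages"]}) + "\n")

# to_jsonl(dataset)
```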
+ "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" ] }, { "cell_type": "code", "execution_count": null, + "id": "8dfd7737", "metadata": {}, "outputs": [], "source": [ - "# # Submit batch job\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", + "job_results = data_designer_client.create(config_builder, num_records=20)\n", "\n", + "# This will block until the job is complete.\n", "job_results.wait_until_done()" ] }, { "cell_type": "code", "execution_count": null, - "id": "392c55c5", + "id": "ae7cbea2", "metadata": {}, "outputs": [], "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)" + "\n", + "dataset.head()" ] }, { "cell_type": "code", "execution_count": null, + "id": "b688677a", "metadata": {}, "outputs": [], "source": [ - "# Inspect first 10 records of the generated dataset\n", - "dataset.head(10)" + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0de7903a", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-multi-turn-chat\",\n", + ");" ] } ], diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb index 6761ce5d0..9d5b8d521 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb @@ -4,44 +4,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# 🎨 NeMo Data Designer: Visual Question Answering Dataset Generation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", + "# 🎨 NeMo Data Designer: Visual Question Answering Dataset Generation\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "### πŸ“š What you'll learn\n", + "\n", + "This notebook demonstrates how to use NeMo Data Designer to generate high-quality synthetic question-answer datasets from visual documents. \n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook demonstrates how to use NeMo Data Designer to generate high-quality synthetic Question-Answer datasets from visual documents. \n", - "\n", - "### Key Features Demonstrated\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n", "\n", - "- ✨ **Visual Document Processing**: Converting images to chat-ready format\n", - "- πŸ—οΈ **Structured Output Generation**: Using Pydantic models for consistent data schemas\n", - "- 🎯 **Multi-step Generation Pipeline**: Summary β†’ Question β†’ Answer generation workflow\n", - "- πŸ”„ **Iterative Development**: Preview functionality for rapid iteration\n" + "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### πŸ“¦ Import the essentials\n", "\n", - "#### πŸ’Ύ Install dependencies\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" + "- The `essentials` module provides quick access to the most commonly used objects.\n" ] }, { @@ -53,9 +46,9 @@ "# Standard library imports\n", "import io\n", "import os\n", - "import json\n", "import base64\n", "import uuid\n", + "import json\n", "\n", "# Third-party imports\n", "import pandas as pd\n", @@ -67,13 +60,19 @@ "from rich.markdown import Markdown\n", "\n", "# NeMo Data Designer imports\n", - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", " DataDesignerConfigBuilder,\n", - " DataDesignerClient\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" + " ImageContext,\n", + " ImageFormat,\n", + " InferenceParameters,\n", + " LLMStructuredColumnConfig,\n", + " ModelConfig,\n", + " ModalityDataType,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType\n", + ")" ] }, { @@ -82,8 +81,7 @@ "source": [ "### βš™οΈ Initialize the NeMo Data Designer Client\n", "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" ] }, { @@ -92,22 +90,24 @@ "metadata": {}, "outputs": [], "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "### πŸŽ›οΈ Define model configurations\n", "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- You must provide a list of model configs to the builder at initialization.\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] }, { @@ -116,9 +116,40 @@ "metadata": {}, "outputs": [], "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"meta/llama-4-maverick-17b-128e-instruct\"\n", - "model_alias = \"llama-4-maverick-17b-128e-instruct\"" + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"meta/llama-4-maverick-17b-128e-instruct\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"llama-4-maverick-17b-128e-instruct\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" ] }, { @@ -127,28 +158,14 @@ "metadata": {}, "outputs": [], "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " 
is_reasoner=False\n", - " ),\n", - " ]\n", - ")" + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 🌱 Seed Dataset Creation\n", + "## 🌱 Loading Seed Data\n", "\n", "In this section, we'll prepare our visual documents as a seed dataset. The seed dataset provides the foundation for synthetic data generation by:\n", "\n", @@ -159,7 +176,29 @@ "\n", "The seed dataset can be referenced in generation prompts using Jinja templating.\n", "\n", - "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. " + "
\n", + "\n", + "> 🌱 **Why use a seed dataset?**\n", + ">\n", + "> - Seed datasets let you steer the generation process by providing context that is specific to your use case.\n", + ">\n", + "> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.\n", + ">\n", + "> - During generation, prompt templates can reference any of the seed dataset fields.\n", + "\n", + "
\n", + "\n", + "> πŸ’‘ **About datastores**\n", + ">\n", + "> - You can use seed datasets from _either_ the Hugging Face Hub or a locally deployed datastore.\n", + ">\n", + "> - By default, we use the local datastore deployed with the Data Designer microservice.\n", + ">\n", + "> - The datastore endpoint is specified in the deployment configuration.\n", + "\n", + "\n", + "πŸ‘‹ **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as \\\n", + "seeds, it is recommended you consolidated these into a single file. " ] }, { @@ -180,6 +219,13 @@ "}" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define helper functions to preprocess the dataset" + ] + }, { "cell_type": "code", "execution_count": null, @@ -244,16 +290,7 @@ ")\n", "img_dataset = pd.DataFrame([next(img_dataset_iter) for _ in range(IMG_COUNT)])\n", "\n", - "print(f\"βœ… Loaded {len(img_dataset)} images with columns: {list(img_dataset.columns)}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "img_dataset.head()" + "print(f\"βœ… Loaded {len(img_dataset)} images with columns: {list(img_dataset.columns)}\")\n" ] }, { @@ -262,10 +299,13 @@ "metadata": {}, "outputs": [], "source": [ + "# save the seed dataset to a csv file locally\n", "os.makedirs(\"./data/\", exist_ok=True)\n", "\n", "df_seed = pd.DataFrame(img_dataset)[[\"uuid\", \"image_filename\", \"base64_image\", \"page\", \"options\", \"source\"]]\n", - "df_seed.to_csv(\"./data/colpali_train_set.csv\", index=False)\n" + "df_seed.to_csv(\"./data/colpali_train_set.csv\", index=False)\n", + "\n", + "df_seed.head()" ] }, { @@ -274,17 +314,48 @@ "metadata": {}, "outputs": [], "source": [ - "# Add the seed dataset containing our processed images\n", + "# Upload the seed dataset containing our processed images\n", + "dataset_reference = data_designer_client.upload_seed_dataset(\n", + " repo_id=\"data-designer-advanced/visual-qna\",\n", + " dataset=\"./data/colpali_train_set.csv\",\n", + " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"}\n", + ")\n", + "\n", "config_builder.with_seed_dataset(\n", - " repo_id=\"advanced/visual-qna\",\n", - " filename=\"colpali_train_set.csv\",\n", - " dataset_path=\"./data/colpali_train_set.csv\",\n", + " dataset_reference=dataset_reference,\n", " sampling_strategy=\"ordered\",\n", - " with_replacement=True,\n", - " datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🦜 Generating Summary of Image Contents\n", + "\n", + "- We instruct the model to β€œlook” at each image and write a short, Markdown\n", + "summary. \n", + "\n", + "- We ask it to read the page from top ➑️ bottom, then include a quick wrap-up\n", + "at the end. \n", + "\n", + "- That summary becomes helpful context we’ll reuse to generate focused\n", + "questions and answers about the document later.\n", + "\n", + "\n", + "### πŸ–ΌοΈ How the image is provided\n", + "\n", + "We pass the image via `multi_modal_context` using `ImageContext`:\n", + "\n", + "- **Column**: `base64_image` (your image bytes encoded as Base64)\n", + "- **Modality**: `ModalityDataType.BASE64`\n", + "- **Format**: `ImageFormat.PNG`\n", + "\n", + "In other words, `ImageContext` tells the model β€œthis is an image, encoded as Base64,\n", + "and it’s a PNG,” so it knows exactly how to \\\n", + "use it during summarization." 
+ ] + }, { "cell_type": "code", "execution_count": null, @@ -294,17 +365,16 @@ "# Add a column to generate detailed document summaries\n", "config_builder.add_column(\n", " name=\"summary\",\n", - " type=\"llm-code\",\n", - " model_alias=model_alias,\n", + " column_type=\"llm-text\",\n", + " model_alias=MODEL_ALIAS,\n", " prompt=(\"Provide a detailed summary of the content in this image in Markdown format.\"\n", " \"Start from the top of the image and then describe it from top to bottom.\"\n", " \"Place a summary at the bottom.\"),\n", - " output_format=\"markdown\",\n", " multi_modal_context=[\n", - " P.ImageContext(\n", + " ImageContext(\n", " column_name=\"base64_image\",\n", - " data_type=P.ModalityDataType.BASE64,\n", - " image_format=P.ImageFormat.PNG,\n", + " data_type=ModalityDataType.BASE64,\n", + " image_format=ImageFormat.PNG,\n", " )\n", " ]\n", ")" @@ -314,7 +384,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 🎨 Designing our Data Schema\n", + "## πŸ—οΈ Designing our Data Schema\n", "\n", "Structured outputs ensure consistent and predictable data generation. Data Designer supports schemas defined using:\n", "- **JSON Schema**: For basic structure definition\n", @@ -349,6 +419,17 @@ " answer: Literal[\"option_a\", \"option_b\", \"option_c\", \"option_d\"] = Field(description=\"The correct answer to the question\")\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎲 Adding Sampler Columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + ] + }, { "cell_type": "code", "execution_count": null, @@ -356,14 +437,25 @@ "outputs": [], "source": [ "config_builder.add_column(\n", - " C.SamplerColumn(\n", + " SamplerColumnConfig(\n", " name=\"difficulty\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(values=[\"easy\", \"medium\", \"hard\"]),\n", - " description=\"The difficulty of the generated question\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(values=[\"easy\", \"medium\", \"hard\"]),\n", " ))\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🦜 Adding LLM Generated columns\n", + "Now define the columns that the model will generate. These prompts instruct the LLM to produce: \n", + "- question\n", + "- options\n", + "- topic\n", + "- answer" + ] + }, { "cell_type": "code", "execution_count": null, @@ -371,9 +463,9 @@ "outputs": [], "source": [ "config_builder.add_column(\n", - " C.LLMStructuredColumn(\n", + " LLMStructuredColumnConfig(\n", " name=\"question\",\n", - " model_alias=model_alias,\n", + " model_alias=MODEL_ALIAS,\n", " prompt=(\"Generate a question based on the following context: {{ summary }}. \"\n", " \"The difficulty of the generated question should be {{ difficulty }}\"),\n", " system_prompt=(\"You are a helpful assistant that generates questions based on the given context. \"\n", @@ -386,9 +478,9 @@ ")\n", "\n", "config_builder.add_column(\n", - " C.LLMStructuredColumn(\n", + " LLMStructuredColumnConfig(\n", " name=\"options\",\n", - " model_alias=model_alias,\n", + " model_alias=MODEL_ALIAS,\n", " prompt=(\"Generate four answer choices for the question: {{ question }} based on the following context: {{ summary }}. 
\"\n", " \"The option you generate should match the difficulty of the generated question, {{ difficulty }}.\"),\n", " output_format=Options,\n", @@ -397,23 +489,24 @@ "\n", "\n", "config_builder.add_column(\n", - " C.LLMStructuredColumn(\n", + " LLMStructuredColumnConfig(\n", " name=\"answer\",\n", + " model_alias=MODEL_ALIAS,\n", " prompt=(\"Choose the correct answer for the question: {{ question }} based on the following context: {{ summary }}\"\n", " \"and options choices. The options are {{ options }}. Only select one of the options as the answer.\"),\n", " output_format=Answer,\n", - " model_alias=model_alias,\n", " )\n", ")\n", "\n", "\n", "config_builder.add_column(\n", - " C.LLMStructuredColumn(\n", + " LLMStructuredColumnConfig(\n", " name=\"topic\",\n", - " model_alias=model_alias,\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=(\"Generate a short 1-3 word topic for the question: {{ question }} \"\n", + " \"based on the given context. {{ summary }}\"),\n", " prompt=(\"Generate the topic of the question: {{ question }} based on the following context: {{ summary }}\"\n", " \"The topic should be a single word or phrase that is relevant to the question and context. \"),\n", - " system_prompt=(\"Generate a short 1-3 word topic for the question: {{ question }} based on the given context. {{ summary }}\"),\n", " output_format=QuestionTopic,\n", " )\n", ")\n" @@ -423,22 +516,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### πŸ‘€ Preview Generation\n", + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", "\n", - "Before scaling up, it's crucial to validate your configuration with a small sample. The preview functionality:\n", + "2. Inspect the results for quality and format issues.\n", "\n", - "- **Generates Sample Data**: Creates 10 records for quick inspection\n", - "- **Enables Rapid Iteration**: Test and refine your prompts and schemas\n", - "- **Provides Detailed Logging**: Understand the generation process with verbose output\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "Use this step to fine-tune your configuration before full-scale generation.\n" + "4. Re-run the preview until satisfied." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "**Note** Please ignore the validation warning, `PROMPT_WITHOUT_REFERENCES` that shows up. The image context is being passed to the LLM using the `multi_modal_context` and so the prompt does not need to reference any other column. 
" + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" ] }, { @@ -447,18 +543,19 @@ "metadata": {}, "outputs": [], "source": [ - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" + "# More previews\n", + "preview.display_sample_record()" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# Display a sample record from the preview\n", - "# Run this cell multiple times to cycle through different records\n", - "preview.display_sample_record()" + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" ] }, { @@ -467,8 +564,15 @@ "metadata": {}, "outputs": [], "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset" + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ”Ž View Results" ] }, { @@ -493,29 +597,31 @@ "print(\"\\nπŸ“ Generated Summary:\")\n", "rich.print(Panel(comparison_dataset.summary[index], title=\"Document Summary\", title_align=\"left\"))\n", "\n", + "print(\"\\nπŸ”’ Generated Difficulty:\")\n", + "rich.print(Panel(json.dumps(comparison_dataset.difficulty[index]), title=\"Difficulty\", title_align=\"left\"))\n", + "\n", "print(\"\\n❓ Generated Question:\")\n", - "rich.print(Panel(comparison_dataset.question[index], title=\"Question\", title_align=\"left\"))\n", + "rich.print(Panel(json.dumps(comparison_dataset.question[index]), title=\"Question\", title_align=\"left\"))\n", "\n", "print(\"\\nπŸ”’ Generated Options:\")\n", - "rich.print(Panel(comparison_dataset.options[index], title=\"Answer Choices\", title_align=\"left\"))\n", + "rich.print(Panel(json.dumps(comparison_dataset.options[index]), title=\"Answer Choices\", title_align=\"left\"))\n", + "\n", + "print(\"\\nπŸ”’ Generated Topic:\")\n", + "rich.print(Panel(json.dumps(comparison_dataset.topic[index]), title=\"Topic\", title_align=\"left\"))\n", "\n", "print(\"\\nβœ… Generated Answer:\")\n", - "rich.print(Panel(comparison_dataset.answer[index], title=\"Correct Answer\", title_align=\"left\"))\n" + "rich.print(Panel(json.dumps(comparison_dataset.answer[index]), title=\"Correct Answer\", title_align=\"left\"))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### πŸš€ Scale Up Generations\n", + "### πŸ†™ Scale up!\n", "\n", - "Once satisfied with the preview results, scale up to generate the full dataset. 
The generation process offers flexible execution modes:\n", + "- Happy with your preview data?\n", "\n", - "#### Synchronous Generation\n", - "Set `wait_until_done=True` to block until completion - ideal for smaller datasets or interactive workflows.\n", - "\n", - "#### Asynchronous Generation \n", - "Set `wait_until_done=False` for batch processing - returns a `job_id` for later retrieval:" + "- Use the `create` method to submit larger Data Designer generation jobs.\n" ] }, { @@ -524,8 +630,9 @@ "metadata": {}, "outputs": [], "source": [ - "job_results = data_designer_client.create(config_builder, num_records=1, wait_until_done=False)\n", + "job_results = data_designer_client.create(config_builder, num_records=20)\n", "\n", + "# This will block until the job is complete.\n", "job_results.wait_until_done()" ] }, @@ -535,19 +642,37 @@ "metadata": {}, "outputs": [], "source": [ - "# load the dataset into a pandas DataFrame\n", + "# Load the generated dataset as a pandas DataFrame.\n", "dataset = job_results.load_dataset()\n", "\n", - "print(f\"Generated {len(dataset)} records\")\n", - "\n", "dataset.head()" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "### πŸ”Ž View Results" + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-multimodal-visual-question-answering\",\n", + ");" ] }, { @@ -572,14 +697,20 @@ "print(\"\\nπŸ“ Generated Summary:\")\n", "rich.print(Panel(comparison_dataset.summary[index], title=\"Document Summary\", title_align=\"left\"))\n", "\n", + "print(\"\\nπŸ”’ Generated Difficulty:\")\n", + "rich.print(Panel(json.dumps(comparison_dataset.difficulty[index]), title=\"Difficulty\", title_align=\"left\"))\n", + "\n", "print(\"\\n❓ Generated Question:\")\n", - "rich.print(Panel(comparison_dataset.question[index], title=\"Question\", title_align=\"left\"))\n", + "rich.print(Panel(json.dumps(comparison_dataset.question[index]), title=\"Question\", title_align=\"left\"))\n", + "\n", + "print(\"\\nπŸ”’ Generated Options:\")\n", + "rich.print(Panel(json.dumps(comparison_dataset.options[index]), title=\"Answer Choices\", title_align=\"left\"))\n", "\n", - "# print(\"\\nπŸ”’ Generated Options:\")\n", - "# rich.print(Panel(comparison_dataset.options[index], title=\"Answer Choices\", title_align=\"left\"))\n", + "print(\"\\nπŸ”’ Generated Topic:\")\n", + "rich.print(Panel(json.dumps(comparison_dataset.topic[index]), title=\"Topic\", title_align=\"left\"))\n", "\n", "print(\"\\nβœ… Generated Answer:\")\n", - "rich.print(Panel(comparison_dataset.answer[index], title=\"Correct Answer\", title_align=\"left\"))\n" + "rich.print(Panel(json.dumps(comparison_dataset.answer[index]), title=\"Correct Answer\", title_align=\"left\"))\n" ] } ], diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb index d8bb84832..3d95057e3 100644 --- 
a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb @@ -1,631 +1,637 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "aaeb9727", - "metadata": {}, - "source": [ - "# πŸ§‘β€πŸ€β€πŸ§‘ NeMo Data Designer: Person Sampler Tutorial" - ] - }, - { - "cell_type": "markdown", - "id": "0e09f5c4", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this notebook, we'll explore how you can generate realistic personal information for your synthetic datasets." - ] - }, - { - "cell_type": "markdown", - "id": "39c4c850", - "metadata": {}, - "source": [ - "\n", - "## What is the Person Sampler?\n", - "\n", - "The Person Sampler is a powerful feature in Data Designer that generates consistent, realistic person records with attributes like:\n", - "- Names (first, middle, last)\n", - "- Contact information (email, phone)\n", - "- Addresses (street, city, state, zip)\n", - "- Demographics (age, gender, ethnicity)\n", - "- IDs (SSN, UUID)\n", - "- And more!\n", - "\n", - "These records are fully synthetic but maintain the statistical properties and formatting patterns of real personal data." - ] - }, - { - "cell_type": "markdown", - "id": "c1b097a7", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "5fe29ec0", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "dbc86c4c", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "2bb1d498", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cb168a1f", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "29ea4bd1", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Basic Person Sampling\n", - "\n", - "Let's start with a simple example of generating person data using the default settings." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "# Add a simple person column with default settings\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"person\", # This creates a nested object with all person attributes\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(locale=\"en_US\", sex=\"Male\")\n", - " )\n", - ")\n", - "\n", - "# Preview what the generated data looks like\n", - "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. Accessing Individual Person Attributes\n", - "\n", - "The `person` column we created above is a nested object with many attributes. Let's create some columns to access specific attributes from this person object." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "# Add columns to extract specific attributes from the person object\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"full_name\",\n", - " expr=\"{{ person.first_name }} {{ person.last_name }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"email\",\n", - " expr=\"{{ person.email_address }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"address\",\n", - " expr=\"{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"age\",\n", - " expr=\"{{ person.age }}\"\n", - " )\n", - ")\n", - "\n", - "# Preview the results\n", - "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['full_name', 'email', 'address', 'age']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Customizing Person Generators\n", - "\n", - "Now let's explore customizing the Person Sampler to generate specific types of profiles." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Reset our config builder for this example\n", - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs = [\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.5,\n", - " top_p=1.0,\n", - " ),\n", - " )\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "e025ab4e", - "metadata": {}, - "outputs": [], - "source": [ - "# Create custom person samplers for different roles/demographics\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"employee\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(\n", - " locale=\"en_US\",\n", - " age_range=[22, 65],\n", - " state=\"CA\"\n", - " )\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"customer\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(\n", - " locale=\"en_US\",\n", - " age_range=[18, 80]\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Create a UK-based person\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"uk_contact\",\n", - " type=P.SamplerType.PERSON,\n", - " params=P.PersonSamplerParams(\n", - " locale=\"en_GB\", # UK locale\n", - " city=\"London\"\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Add columns to extract and format information\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"employee_info\",\n", - " expr=\"{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"customer_info\",\n", - " expr=\"{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"uk_contact_info\",\n", - " expr=\"{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number 
}} - {{ uk_contact.city }}\"\n", - " )\n", - ")\n", - "\n", - "# Preview the results\n", - "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['employee_info', 'customer_info', 'uk_contact_info']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. Available Person Attributes\n", - "\n", - "The Person Sampler generates a rich set of attributes that you can use. Here's a reference list of some of the key attributes available:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "| Attribute | Description | Example |\n", - "|-----------|-------------|--------|\n", - "| `first_name` | Person's first name | \"John\" |\n", - "| `middle_name` | Person's middle name (may be None) | \"Robert\" |\n", - "| `last_name` | Person's last name | \"Smith\" |\n", - "| `sex` | Person's sex | \"Male\" |\n", - "| `age` | Person's age in years | 42 |\n", - "| `birth_date` | Date of birth | \"1980-05-15\" |\n", - "| `email_address` | Email address | \"john.smith@example.com\" |\n", - "| `phone_number` | Phone number | \"+1 (555) 123-4567\" |\n", - "| `street_number` | Street number | \"123\" |\n", - "| `street_name` | Street name | \"Main Street\" |\n", - "| `unit` | Apartment/unit number | \"Apt 4B\" |\n", - "| `city` | City name | \"Chicago\" |\n", - "| `state` | State/province (locale dependent) | \"IL\" |\n", - "| `county` | County (locale dependent) | \"Cook\" |\n", - "| `zipcode` | Postal/ZIP code | \"60601\" |\n", - "| `country` | Country name | \"United States\" |\n", - "| `ssn` | Social Security Number (US locale) | \"123-45-6789\" |\n", - "| `occupation` | Occupation | \"Software Engineer\" |\n", - "| `marital_status` | Marital status | \"Married\" |\n", - "| `education_level` | Education level | \"Bachelor's Degree\" |\n", - "| `ethnic_background` | Ethnic background | \"Caucasian\" |\n", - "| `uuid` | Unique identifier | \"550e8400-e29b-41d4-a716-446655440000\" |" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. Creating Multiple Person Samplers with One Method\n", - "\n", - "For convenience, Data Designer provides a `with_person_samplers` method to create multiple person samplers at once." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1d9dbc89", - "metadata": {}, - "outputs": [], - "source": [ - "# Reset our config builder for this example\n", - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs = [\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.5,\n", - " top_p=1.0,\n", - " ),\n", - " )\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "5676b992", - "metadata": {}, - "outputs": [], - "source": [ - "# Create multiple person samplers at once\n", - "config_builder.with_person_samplers({\n", - " \"doctor\": {\"locale\": \"en_US\", \"age_range\": [30, 70]},\n", - " \"patient\": {\"locale\": \"en_US\", \"age_range\": [18, 90]},\n", - " \"nurse\": {\"locale\": \"en_US\", \"age_range\": [25, 65], \"sex\": \"Female\"},\n", - " \"international_doctor\": {\"locale\": \"fr_FR\", \"age_range\": [35, 65]}\n", - "})\n", - "\n", - "# Add columns to format information for each person type\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"doctor_profile\",\n", - " expr=\"Dr. 
{{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"patient_profile\",\n", - " expr=\"{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"nurse_profile\",\n", - " expr=\"Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"international_doctor_profile\",\n", - " expr=\"Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}\"\n", - " )\n", - ")\n", - "\n", - "# Preview the results\n", - "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['doctor_profile', 'patient_profile', 'nurse_profile', 'international_doctor_profile']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 6. Using Person Data with LLM Generation\n", - "\n", - "One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a24bc42f", - "metadata": {}, - "outputs": [], - "source": [ - "# Reset our config builder for this example\n", - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs = [\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.5,\n", - " top_p=1.0,\n", - " ),\n", - " )\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "bbdff7f6", - "metadata": {}, - "outputs": [], - "source": [ - "# Create person samplers for patients and doctors\n", - "config_builder.with_person_samplers({\n", - " \"patient\": {\"locale\": \"en_US\", \"age_range\": [18, 85]},\n", - " \"doctor\": {\"locale\": \"en_US\", \"age_range\": [30, 70]}\n", - "})\n", - "\n", - "# Add some medical condition sampling\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"medical_condition\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\n", - " \"Hypertension\",\n", - " \"Type 2 Diabetes\",\n", - " \"Asthma\",\n", - " \"Rheumatoid Arthritis\",\n", - " \"Migraine\",\n", - " \"Hypothyroidism\"\n", - " ]\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Add basic info columns\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"patient_name\",\n", - " expr=\"{{ patient.first_name }} {{ patient.last_name }}\"\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.ExpressionColumn(\n", - " name=\"doctor_name\",\n", - " expr=\"Dr. {{ doctor.first_name }} {{ doctor.last_name }}\"\n", - " )\n", - ")\n", - "\n", - "# Add an LLM-generated medical note\n", - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"medical_notes\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"Write a brief medical note from {{ doctor_name }} about patient {{ patient_name }}, \"\n", - " \"a {{ patient.age }}-year-old {{ patient.sex }} with {{ medical_condition }}. 
\"\n", - " \"Include relevant medical observations and recommendations. \"\n", - " \"The patient lives in {{ patient.city }}, {{ patient.state }} and works as {{ patient.occupation }}. \"\n", - " \"Keep the note professional, concise (3-4 sentences), and medically accurate.\"\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Add an LLM-generated patient message\n", - "config_builder.add_column(\n", - " C.LLMTextColumn(\n", - " name=\"patient_message\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"Write a brief message (1-2 sentences) from {{ patient_name }} to {{ doctor_name }} \"\n", - " \"about their {{ medical_condition }}. The message should reflect the patient's \"\n", - " \"experience and concerns. The patient is {{ patient.age }} years old.\"\n", - " )\n", - " )\n", - ")\n", - "\n", - "# Preview the results\n", - "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['patient_name', 'doctor_name', 'medical_condition', 'medical_notes', 'patient_message']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 7. Generating and Saving the Final Dataset\n", - "\n", - "Now that we've explored the Person Sampler capabilities, let's generate a complete dataset and save it." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", - "\n", - "job_results.wait_until_done()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a8b4dbe", - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "print(f\"Generated dataset with {len(dataset)} records\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "In this tutorial, we've explored the Person Sampler functionality in Data Designer. We've learned how to:\n", - "\n", - "1. Generate basic person records with realistic attributes\n", - "2. Customize person profiles by locale, age, gender, and location\n", - "3. Create multiple person samplers for different roles or demographics\n", - "4. Use person attributes in expressions and LLM prompts\n", - "\n", - "The Person Sampler is an essential tool for creating realistic synthetic datasets for testing, development, and training applications that handle personal information." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "aaeb9727", + "metadata": {}, + "source": [ + "# πŸ§‘β€πŸ€β€πŸ§‘ NeMo Data Designer: Person Sampler Tutorial\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "### πŸ“š What you'll learn\n", + "\n", + "In this notebook, we'll explore how you can generate realistic personal information for your synthetic datasets.\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n", + "\n", + "
\n", + "\n", + "## What is the Person Sampler?\n", + "\n", + "The Person Sampler is a powerful feature in NeMo Data Designer that generates consistent, realistic person records with attributes like:\n", + "- Names (first, middle, last)\n", + "- Contact information (email, phone)\n", + "- Addresses (street, city, state, zip)\n", + "- Demographics (age, gender, ethnicity)\n", + "- IDs (SSN, UUID)\n", + "- And more!\n", + "\n", + "These records are fully synthetic but maintain the statistical properties and formatting patterns of real personal data.\n" + ] + }, + { + "cell_type": "markdown", + "id": "252a74b6", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "300cfd8f", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " InferenceParameters,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " PersonSamplerParams,\n", + " SamplerColumnConfig,\n", + " SamplerType\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9ad26abb", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "893fc4d9", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "8c0fc2ec", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca72e385", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + 
"id": "cb7e2a06", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "265efc27", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "a4f36fa9", + "metadata": {}, + "source": [ + "### 1. Basic Person Sampling\n", + "\n", + "Let's start with a simple example of generating person data using the default settings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86871b40-e08c-42f6-94a4-d2f7ff286589", + "metadata": {}, + "outputs": [], + "source": [ + "# Add a simple person column with default settings\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"person\", # This creates a nested object with all person attributes\n", + " sampler_type=SamplerType.PERSON,\n", + " params=PersonSamplerParams(locale=\"en_US\", sex=\"Male\"),\n", + " )\n", + ")\n", + "# Preview what the generated data looks like\n", + "preview = data_designer_client.preview(config_builder)\n", + "preview.dataset" + ] + }, + { + "cell_type": "markdown", + "id": "147f0815", + "metadata": {}, + "source": [ + "### 2. Accessing Individual Person Attributes\n", + "\n", + "The `person` column we created above is a nested object with many attributes. Let's create some columns to access specific attributes from this person object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31fe9d04-8ccb-4f9c-825d-eb93f34a8cf0", + "metadata": {}, + "outputs": [], + "source": [ + "# Add columns to extract specific attributes from the person object\n", + "config_builder.add_column(\n", + " name=\"full_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ person.first_name }} {{ person.last_name }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"email\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ person.email_address }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"address\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"age\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ person.age }}\"\n", + ")\n", + "\n", + "# Preview the results\n", + "preview = data_designer_client.preview(config_builder)\n", + "preview.dataset[['full_name', 'email', 'address', 'age']]" + ] + }, + { + "cell_type": "markdown", + "id": "866af665", + "metadata": {}, + "source": [ + "### 3. 
Customizing Person Generators\n", + "\n", + "- Now let's explore customizing the Person Sampler to generate specific types of profiles.\n", + "\n", + "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e025ab4e", + "metadata": {}, + "outputs": [], + "source": [ + "# Reset our config builder for this example\n", + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)\n", + "\n", + "# Create custom person samplers for different roles/demographics\n", + "config_builder.add_column(\n", + " name=\"employee\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\n", + " \"locale\": \"en_US\",\n", + " \"age_range\": [22, 65],\n", + " \"state\": \"CA\"\n", + " }\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"customer\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\n", + " \"locale\": \"en_US\",\n", + " \"age_range\": [18, 80]\n", + " }\n", + ")\n", + "\n", + "# Create a UK-based person\n", + "config_builder.add_column(\n", + " name=\"uk_contact\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\n", + " \"locale\": \"en_GB\", # UK locale\n", + " \"city\": \"London\"\n", + " }\n", + ")\n", + "\n", + "# Add columns to extract and format information\n", + "config_builder.add_column(\n", + " name=\"employee_info\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"customer_info\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"uk_contact_info\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number }} - {{ uk_contact.city }}\"\n", + ")\n", + "\n", + "# Preview the results\n", + "preview = data_designer_client.preview(config_builder)\n", + "preview.dataset[['employee_info', 'customer_info', 'uk_contact_info']]" + ] + }, + { + "cell_type": "markdown", + "id": "f0ccaa4f", + "metadata": {}, + "source": [ + "### 4. Available Person Attributes\n", + "\n", + "The Person Sampler generates a rich set of attributes that you can use. 
Here's a reference list of some of the key attributes available:" + ] + }, + { + "cell_type": "markdown", + "id": "3f01fc83", + "metadata": {}, + "source": [ + "| Attribute | Description | Example |\n", + "|-----------|-------------|---------|\n", + "| `first_name` | Person's first name | \"John\" |\n", + "| `middle_name` | Person's middle name (may be None) | \"Robert\" |\n", + "| `last_name` | Person's last name | \"Smith\" |\n", + "| `sex` | Person's sex | \"Male\" |\n", + "| `age` | Person's age in years | 42 |\n", + "| `birth_date` | Date of birth | \"1980-05-15\" |\n", + "| `email_address` | Email address | \"john.smith@example.com\" |\n", + "| `phone_number` | Phone number | \"+1 (555) 123-4567\" |\n", + "| `street_number` | Street number | \"123\" |\n", + "| `street_name` | Street name | \"Main Street\" |\n", + "| `unit` | Apartment/unit number | \"Apt 4B\" |\n", + "| `city` | City name | \"Chicago\" |\n", + "| `state` | State/province (locale dependent) | \"IL\" |\n", + "| `county` | County (locale dependent) | \"Cook\" |\n", + "| `zipcode` | Postal/ZIP code | \"60601\" |\n", + "| `country` | Country name | \"United States\" |\n", + "| `ssn` | Social Security Number (US locale) | \"123-45-6789\" |\n", + "| `occupation` | Occupation | \"Software Engineer\" |\n", + "| `marital_status` | Marital status | \"Married\" |\n", + "| `education_level` | Education level | \"Bachelor's Degree\" |\n", + "| `ethnic_background` | Ethnic background | \"Caucasian\" |\n", + "| `uuid` | Unique identifier | \"550e8400-e29b-41d4-a716-446655440000\" |" + ] + }, + { + "cell_type": "markdown", + "id": "ce2944df", + "metadata": {}, + "source": [ + "### 5. Creating Multiple Person Samplers\n", + "\n", + "Datasets often need several person samplers at once. Below, we add one sampler column per role with `add_column`; for convenience, Data Designer also provides a `with_person_samplers` method that can create multiple person samplers in a single call." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5676b992", + "metadata": {}, + "outputs": [], + "source": [ + "# Reset our config builder for this example\n", + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)\n", + "\n", + "# Create custom person samplers for different roles/demographics\n", + "config_builder.add_column(\n", + " name=\"doctor\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\"locale\": \"en_US\", \"age_range\": [30, 70]}\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"patient\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\"locale\": \"en_US\", \"age_range\": [18, 90]}\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"nurse\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\"locale\": \"en_US\", \"age_range\": [25, 65], \"sex\": \"Female\"}\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"international_doctor\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\"locale\": \"fr_FR\", \"age_range\": [35, 65]}\n", + ")\n", + "\n", + "# Add columns to format information for each person type\n", + "config_builder.add_column(\n", + " name=\"doctor_profile\",\n", + " column_type=\"expression\",\n", + " expr=\"Dr. 
{{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"patient_profile\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"nurse_profile\",\n", + " column_type=\"expression\",\n", + " expr=\"Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"international_doctor_profile\",\n", + " column_type=\"expression\",\n", + " expr=\"Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}\"\n", + ")\n", + "\n", + "# Preview the results\n", + "preview = data_designer_client.preview(config_builder)\n", + "preview.dataset[['doctor_profile', 'patient_profile', 'nurse_profile', 'international_doctor_profile']]" + ] + }, + { + "cell_type": "markdown", + "id": "99f21181", + "metadata": {}, + "source": [ + "### 6. Using Person Data with LLM Generation\n", + "\n", + "One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bbdff7f6", + "metadata": {}, + "outputs": [], + "source": [ + "# Reset our config builder for this example\n", + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)\n", + "\n", + "\n", + "# Create person samplers for patients and doctors\n", + "config_builder.add_column(\n", + " name=\"patient\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\"locale\": \"en_US\", \"age_range\": [18, 85]},\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"doctor\",\n", + " column_type=\"sampler\",\n", + " sampler_type=\"person\",\n", + " params={\"locale\": \"en_US\", \"age_range\": [30, 70]},\n", + ")\n", + "\n", + "# Add some medical condition sampling\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"medical_condition\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\n", + " \"Hypertension\",\n", + " \"Type 2 Diabetes\",\n", + " \"Asthma\",\n", + " \"Rheumatoid Arthritis\",\n", + " \"Migraine\",\n", + " \"Hypothyroidism\",\n", + " ]\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add basic info columns\n", + "config_builder.add_column(\n", + " name=\"patient_name\",\n", + " column_type=\"expression\",\n", + " expr=\"{{ patient.first_name }} {{ patient.last_name }}\",\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " name=\"doctor_name\",\n", + " column_type=\"expression\",\n", + " expr=\"Dr. {{ doctor.first_name }} {{ doctor.last_name }}\",\n", + ")\n", + "\n", + "# Add an LLM-generated medical note\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"medical_notes\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"Write a brief medical note from {{ doctor_name }} about patient {{ patient_name }}, \"\n", + " \"a {{ patient.age }}-year-old {{ patient.sex }} with {{ medical_condition }}. \\n\"\n", + " \"Include relevant medical observations and recommendations. 
\\n\"\n", + " \"The patient lives in {{ patient.city }}, {{ patient.state }} and works as {{ patient.occupation }}. \\n\"\n", + " \"Keep the note professional, concise (3-4 sentences), and medically accurate.\\n\"\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add an LLM-generated patient message\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"patient_message\",\n", + " system_prompt=SYSTEM_PROMPT,\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"Write a brief message (1-2 sentences) from {{ patient_name }} to {{ doctor_name }} \"\n", + " \"about their {{ medical_condition }}. The message should reflect the patient's \"\n", + " \"experience and concerns. The patient is {{ patient.age }} years old.\"\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Preview the results\n", + "preview = data_designer_client.preview(config_builder)\n", + "preview.dataset[['patient_name', 'doctor_name', 'medical_condition', 'medical_notes', 'patient_message']]" + ] + }, + { + "cell_type": "markdown", + "id": "a1838a66", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c3d0bc0", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22a66458", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84cb1a83", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4acf47d0", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-person-sampler-tutorial\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb index 6e74a9857..116c87dfc 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb @@ -1,437 +1,581 @@ { - "cells": [ - { - 
"cell_type": "markdown", - "id": "6e8f02ab", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer: Product Information Dataset Generator with Q&A" - ] - }, - { - "cell_type": "markdown", - "id": "a090e0c9", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "id": "38ebcf4d", - "metadata": { - "id": "vIH9DEmYimTb" - }, - "source": [ - "This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers. This dataset can be used for training and evaluating Q&A systems focused on product information." - ] - }, - { - "cell_type": "markdown", - "id": "464f9245", - "metadata": {}, - "source": [ - "\n", - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0914a5b4", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "8b0bf13e", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ad95d658", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "99d7d4a9", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "74790c56", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0df8b7e0", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "dedb3311", - "metadata": {}, - "source": [ - "## Defining Data Structures\n", - "\n", - "Now we'll define the data models and evaluation rubrics for our product information dataset." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "34f67410", - "metadata": { - "id": "l5q7YysHji8O" - }, - "outputs": [], - "source": [ - "import string\n", - "from pydantic import BaseModel\n", - "from pydantic import Field" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc61c255", - "metadata": {}, - "outputs": [], - "source": [ - "# Define product information structure\n", - "class ProductInfo(BaseModel):\n", - " product_name: str = Field(..., description=\"A realistic product name for the market.\")\n", - " key_features: list[str] = Field(..., min_length=1, max_length=3, description=\"Key product features.\")\n", - " description: str = Field(..., description=\"A short, engaging description of what the product does, highlighting a unique but believable feature.\")\n", - " price_usd: float = Field(..., description=\"The stated price in USD.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc30ca31", - "metadata": {}, - "outputs": [], - "source": [ - "# Define evaluation rubrics for answer quality\n", - "CompletenessRubric = P.Rubric(\n", - " name=\"Completeness\",\n", - " description=\"Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.\",\n", - " scoring={\n", - " \"Complete\": \"The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.\",\n", - " \"PartiallyComplete\": \"The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.\",\n", - " \"Incomplete\": \"The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.\",\n", - " }\n", - ")\n", - "\n", - "AccuracyRubric = P.Rubric(\n", - " name=\"Accuracy\",\n", - " description=\"Evaluation of how factually correct the AI assistant's response is relative to the product information.\",\n", - " scoring={\n", - " \"Accurate\": \"The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.\",\n", - " \"PartiallyAccurate\": \"While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.\",\n", - " \"Inaccurate\": \"The response presents significantly wrong information about the product, with claims that contradict the actual product details.\",\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "71100fa2", - "metadata": {}, - "source": [ - "## Data Generation Workflow\n", - "\n", - "Now we'll configure the data generation workflow to create product information, questions, and answers." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b479cb8d", - "metadata": {}, - "outputs": [], - "source": [ - "# Define product category options\n", - "config_builder.add_column(\n", - " name=\"category\",\n", - " type=\"category\",\n", - " params={\"values\": ['Electronics', 'Clothing', 'Home Appliances', 'Groceries', 'Toiletries',\n", - " 'Sports Equipment', 'Toys', 'Books', 'Pet Supplies', 'Tools & Home Improvement',\n", - " 'Beauty', 'Health & Wellness', 'Outdoor Gear', 'Automotive', 'Jewelry',\n", - " 'Watches', 'Office Supplies', 'Gifts', 'Arts & Crafts', 'Baby & Kids',\n", - " 'Music', 'Video Games', 'Movies', 'Software', 'Tech Devices']}\n", - ")\n", - "\n", - "# Define price range to seed realistic product types\n", - "config_builder.add_column(\n", - " name=\"price_tens_of_dollars\",\n", - " type=\"uniform\",\n", - " params={\"low\": 1, \"high\": 200},\n", - " convert_to=\"int\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"product_price\",\n", - " type=\"expression\",\n", - " expr=\"{{ (price_tens_of_dollars * 10) - 0.01 | round(2) }}\",\n", - " dtype=\"float\"\n", - ")\n", - "\n", - "# Generate first letter for product name to ensure diversity\n", - "config_builder.add_column(\n", - " name=\"first_letter\",\n", - " type=\"category\",\n", - " params={\"values\": list(string.ascii_uppercase)}\n", - ")\n", - "\n", - "# Generate product information\n", - "config_builder.add_column(\n", - " name=\"product_info\",\n", - " type=\"llm-structured\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\\\n", - "Generate a realistic product description for a product in the {{ category }} category that costs {{ product_price }}.\n", - "The name of the product MUST start with the letter {{ first_letter }}.\\\n", - "\"\"\",\n", - " output_format=ProductInfo\n", - ")\n", - "\n", - "# Generate user questions about the product\n", - "config_builder.add_column(\n", - " name=\"question\",\n", - " type='llm-text',\n", - " model_alias=model_alias,\n", - " prompt=\"Ask a question about the following product:\\n\\n {{ product_info }}\",\n", - ")\n", - "\n", - "# Determine if this example will include hallucination\n", - "config_builder.add_column(\n", - " name=\"is_hallucination\",\n", - " type=\"bernoulli\",\n", - " params={\"p\": 0.5}\n", - ")\n", - "\n", - "# Generate answers to the questions\n", - "config_builder.add_column(\n", - " name=\"answer\",\n", - " type='llm-text',\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\\\n", - "{%- if is_hallucination == 0 -%}\n", - "\n", - "{{ product_info }}\n", - "\n", - "\n", - "{%- endif -%}\n", - "User Question: {{ question }}\n", - "\n", - "Directly and succinctly answer the user's question.\\\n", - "{%- if is_hallucination == 1 -%}\n", - " Make up whatever information you need to in order to answer the user's request.\\\n", - "{%- endif -%}\n", - "\"\"\"\n", - ")\n", - "\n", - "# Evaluate answer quality\n", - "config_builder.add_column(\n", - " name=\"llm_answer_metrics\",\n", - " type=\"llm-judge\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\\\n", - "\n", - "{{ product_info }}\n", - "\n", - "\n", - "User Question: {{question }}\n", - "AI Assistant Answer: {{ answer }}\n", - "\n", - "Judge the AI assistant's response to the user's question about the product described in .\\\n", - "\"\"\",\n", - " rubrics=[CompletenessRubric, AccuracyRubric]\n", - ")\n", - "\n", - "# Extract metric scores for easier analysis\n", - "config_builder.add_column(\n", - " name=\"completeness_result\",\n", - " 
type=\"expression\",\n", - " expr=\"{{ llm_answer_metrics.Completeness.score }}\"\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"accuracy_result\",\n", - " type=\"expression\",\n", - " expr=\"{{ llm_answer_metrics.Accuracy.score }}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "7889e108", - "metadata": {}, - "source": [ - "## Generate the Preview\n", - "\n", - "Let's examine a sample record to understand the generated data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd720c77", - "metadata": {}, - "outputs": [], - "source": [ - "# Preview the generated data\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "71703fb8", - "metadata": {}, - "outputs": [], - "source": [ - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5b053c27", - "metadata": {}, - "outputs": [], - "source": [ - "preview.dataset" - ] - }, - { - "cell_type": "markdown", - "id": "5b179bfc", - "metadata": {}, - "source": [ - "## Generating the Full Dataset\n", - "\n", - "Now that we've verified our data model looks good, let's generate a full dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aff05efd", - "metadata": { - "id": "HrRvPXoyTFLn" - }, - "outputs": [], - "source": [ - "# Run the job\n", - "job_results = data_designer_client.create(config_builder, num_records=1, wait_until_done=False)\n", - "\n", - "job_results.wait_until_done()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c5e8ea27", - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", - "\n", - "dataset.head()" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "6e8f02ab", + "metadata": {}, + "source": [ + "# 🎨 NeMo Data Designer: Product Information Dataset Generator with Q&A\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers. \n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "136d3c6f", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0914a5b4", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " BernoulliSamplerParams,\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " ExpressionColumnConfig,\n", + " InferenceParameters,\n", + " LLMJudgeColumnConfig,\n", + " LLMStructuredColumnConfig,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " Score,\n", + " UniformSamplerParams,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8c17dedd", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9abe4b10", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "34a0b816", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f699cdc7", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 model.\n", + 
"SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "2a1d247a", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1409dfe2", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "dedb3311", + "metadata": {}, + "source": [ + "## πŸ—οΈ Defining Data Structures\n", + "\n", + "Now we'll define the data models and evaluation rubrics for our product information dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc61c255", + "metadata": {}, + "outputs": [], + "source": [ + "import string\n", + "from pydantic import BaseModel\n", + "from pydantic import Field\n", + "\n", + "# Define product information structure\n", + "class ProductInfo(BaseModel):\n", + " product_name: str = Field(..., description=\"A realistic product name for the market.\")\n", + " key_features: list[str] = Field(..., min_length=1, max_length=3, description=\"Key product features.\")\n", + " description: str = Field(..., description=\"A short, engaging description of what the product does, highlighting a unique but believable feature.\")\n", + " price_usd: float = Field(..., description=\"The stated price in USD.\")" + ] + }, + { + "cell_type": "markdown", + "id": "0ed416ec", + "metadata": {}, + "source": [ + "## 🎲 Adding Sampler Columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd01cc94-6268-4aac-9e69-a57ef8c9bf89", + "metadata": {}, + "outputs": [], + "source": [ + "# Define product category options\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"category\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(values=[\n", + " \"Electronics\",\n", + " \"Clothing\",\n", + " \"Home Appliances\",\n", + " \"Groceries\",\n", + " \"Toiletries\",\n", + " \"Sports Equipment\",\n", + " \"Toys\",\n", + " \"Books\",\n", + " \"Pet Supplies\",\n", + " \"Tools & Home Improvement\",\n", + " \"Beauty\",\n", + " \"Health & Wellness\",\n", + " \"Outdoor Gear\",\n", + " \"Automotive\",\n", + " \"Jewelry\",\n", + " \"Watches\",\n", + " \"Office Supplies\",\n", + " \"Gifts\",\n", + " \"Arts & Crafts\",\n", + " \"Baby & Kids\",\n", + " \"Music\",\n", + " \"Video Games\",\n", + " \"Movies\",\n", + " \"Software\",\n", + " \"Tech Devices\",\n", + " ]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Define price range to seed realistic product types\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"price_tens_of_dollars\",\n", + " sampler_type=SamplerType.UNIFORM,\n", + " params=UniformSamplerParams(low=1, high=200),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"product_price\",\n", + " expr=\"{{ (price_tens_of_dollars * 10) - 0.01 | round(2) }}\",\n", + " dtype=\"float\",\n", + " )\n", + ")\n", + "\n", + "# Generate first letter for product name to ensure diversity\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"first_letter\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(values=list(string.ascii_uppercase)),\n", + " )\n", + ")\n", + "\n", + "# Determine if this example will include hallucination\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"is_hallucination\",\n", + " sampler_type=SamplerType.BERNOULLI,\n", + " params=BernoulliSamplerParams(p=0.5),\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "1d6b2669", + "metadata": {}, + "source": [ + "## 🦜 LLM-generated columns\n", + "\n", + "- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.\n", + "\n", + "- As we see below, nested json fields can be accessed using dot notation.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5d59972", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate product information\n", + "config_builder.add_column(\n", + " LLMStructuredColumnConfig(\n", + " name=\"product_info\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\n", + " \"Generate a realistic product description for a product in the {{ category }} \"\n", + " \"category that costs {{ product_price }}.\\n\"\n", + " \"The name of the product MUST start with the letter {{ first_letter }}.\\n\"\n", + " ),\n", + " output_format=ProductInfo,\n", + " )\n", + ")\n", + "\n", + "# Generate user questions about the product\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"question\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\"Ask a question about the following product:\\n\\n {{ product_info }}\"),\n", + " )\n", + ")\n", + "\n", + "\n", + "# Generate answers to the questions\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " 
name=\"answer\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\n", + " \"{%- if is_hallucination == 0 -%}\\n\"\n", + " \"\\n\"\n", + " \"{{ product_info }}\\n\"\n", + " \"\\n\"\n", + "\n", + " \"{%- endif -%}\\n\"\n", + " \"User Question: {{ question }}\\n\"\n", + "\n", + " \"Directly and succinctly answer the user's question.\\n\"\n", + " \"{%- if is_hallucination == 1 -%}\\n\"\n", + " \"Make up whatever information you need to in order to answer the user's request.\\n\"\n", + " \"{%- endif -%}\"\n", + " ),\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b1c9c108", + "metadata": {}, + "source": [ + "## πŸ” Quality Assessment: LLM-as-a-Judge\n", + "\n", + "When generating our synthetic dataset, we need to determine the quality of the generated data \\\n", + "We use the LLM-as-a-Judge strategy to do this. \n", + "\n", + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", + "that provides relavant instructions. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "90ab3e5a", + "metadata": {}, + "outputs": [], + "source": [ + "# Define evaluation rubrics for answer quality\n", + "CompletenessRubric = Score(\n", + " name=\"Completeness\",\n", + " description=\"Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.\",\n", + " options={\n", + " \"Complete\": \"The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.\",\n", + " \"PartiallyComplete\": \"The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.\",\n", + " \"Incomplete\": \"The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.\",\n", + " }\n", + ")\n", + "\n", + "AccuracyRubric = Score(\n", + " name=\"Accuracy\",\n", + " description=\"Evaluation of how factually correct the AI assistant's response is relative to the product information.\",\n", + " options={\n", + " \"Accurate\": \"The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.\",\n", + " \"PartiallyAccurate\": \"While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.\",\n", + " \"Inaccurate\": \"The response presents significantly wrong information about the product, with claims that contradict the actual product details.\",\n", + " }\n", + ")\n", + "\n", + "\n", + "# Evaluate answer quality\n", + "config_builder.add_column(\n", + " LLMJudgeColumnConfig(\n", + " name=\"llm_answer_metrics\",\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"\\n\"\n", + " \"{{ product_info }}\\n\"\n", + " \"\\n\"\n", + "\n", + " \"User Question: {{question }}\\n\"\n", + " \"AI Assistant Answer: {{ answer }}\\n\"\n", + "\n", + " \"Judge the AI assistant's response to the user's question about the product described in .\"\n", + " ),\n", + " scores=[CompletenessRubric, AccuracyRubric],\n", + " )\n", + ")\n", + "\n", + "\n", + "# Extract metric scores for easier analysis\n", + "config_builder.add_column(\n", + " ExpressionColumnConfig(\n", + " name=\"completeness_result\",\n", + " expr=\"{{ llm_answer_metrics.completeness.score }}\",\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", 
+ " ExpressionColumnConfig(\n", + " name=\"accuracy_result\",\n", + " expr=\"{{ llm_answer_metrics.accuracy.score }}\",\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "52c605da", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9909090f", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a28be08c", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "c7d9e59d", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b257d22f", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "3ae79c08", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "374280bc", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48e7e77a", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "245cb8e5", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7336dd35", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-qa-generation-product-question-answer-generator\",\n", + ");" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff 
--git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb index 5a9ddb02f..737ed68a0 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb @@ -1,654 +1,683 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer: Generate Diverse RAG Evaluations" - ] - }, - { - "cell_type": "markdown", - "id": "0e04cb1b", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "id": "118d856f", - "metadata": {}, - "source": [ - "This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases. \n", - "\n", - "You'll learn how to create diverse question-answer pairs at scale, covering a variety of difficulty levels and reasoning types, including both answerable and unanswerable scenarios.\n" - ] - }, - { - "cell_type": "markdown", - "id": "6f56b0d2", - "metadata": {}, - "source": [ - "\n", - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e89ca1ab", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "7601ded0", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b6dc059", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "277f1aaf", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "259ec052", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"/raid/models/nemotron-nano-9b-v2\"\n", - "# model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "77b59e8e", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"local-llm\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "e78f8070", - "metadata": {}, - "source": [ - "### πŸ“• Source Document Configuration\n", - "\n", - "Let's define our source documents and the total number of evaluation pairs we want to generate. You can replace the document list with your own PDFs, web pages, or other text sources." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd6f9e64", - "metadata": {}, - "outputs": [], - "source": [ - "# Define source documents and total number of evaluation pairs to generate\n", - "# You can replace this with your own documents\n", - "DOCUMENT_LIST = [\"./data/databricks-state-of-data-ai-report.pdf\"]" - ] - }, - { - "cell_type": "markdown", - "id": "e0c98449", - "metadata": {}, - "source": [ - "### βš™οΈ Document Processing\n", - "\n", - "Now we'll create a Document Processor class that handles loading and chunking the source documents. \n", - "\n", - "This class uses langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfec3608", - "metadata": {}, - "outputs": [], - "source": [ - "from typing import List, Union\n", - "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", - "from unstructured.partition.auto import partition\n", - "import tempfile\n", - "import os" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "38ec64d5", - "metadata": {}, - "outputs": [], - "source": [ - "class DocumentProcessor:\n", - " \"\"\"Handles loading and chunking source documents for RAG evaluation.\"\"\"\n", - "\n", - " def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):\n", - " \"\"\"Initialize with configurable chunk size and overlap.\"\"\"\n", - " self.text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=chunk_size,\n", - " chunk_overlap=chunk_overlap,\n", - " length_function=len,\n", - " )\n", - "\n", - " def parse_document(self, uri: str) -> str:\n", - " \"\"\"Parse a single document from URI into raw text.\"\"\"\n", - " with open(uri, 'rb') as file:\n", - " content = file.read()\n", - " with tempfile.NamedTemporaryFile(delete=False) as temp_file:\n", - " temp_file.write(content)\n", - " temp_file.flush()\n", - " elements = partition(temp_file.name)\n", - "\n", - " os.unlink(temp_file.name)\n", - " return \"\\n\\n\".join([str(element) for element in elements])\n", - "\n", - " def process_documents(self, uris: Union[str, List[str]]) -> List[str]:\n", - " \"\"\"Process one or more documents into chunks for RAG evaluation.\"\"\"\n", - " if isinstance(uris, str):\n", - " uris = [uris]\n", - "\n", - " all_chunks = []\n", - " for uri in uris:\n", - " text = self.parse_document(uri)\n", - " chunks = self.text_splitter.split_text(text)\n", - " all_chunks.extend(chunks)\n", - "\n", - " return all_chunks" - ] - }, - { - "cell_type": "markdown", - "id": "7c44785c", - "metadata": {}, - "source": [ - "### Data Models\n", - "\n", - "Let's define Pydantic models for structured output generation. These schemas will ensure our generated data has consistent structure and validation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9cab035f", - "metadata": {}, - "outputs": [], - "source": [ - "from pydantic import BaseModel, Field\n", - "\n", - "class QAPair(BaseModel):\n", - " question: str = Field(\n", - " ..., description=\"A specific question related to the domain of the context\"\n", - " )\n", - " answer: str = Field(\n", - " ..., description=\"Either a context-supported answer or explanation of why the question cannot be answered\"\n", - " )\n", - " reasoning: str = Field(\n", - " ..., description=\"A clear and traceable explanation of the reasoning behind the answer\"\n", - " )" - ] - }, - { - "cell_type": "markdown", - "id": "ada29f90", - "metadata": {}, - "source": [ - "### Processing Documents and Setting Up Data Designer\n", - "\n", - "Now we'll process our document chunks and set up the Data Designer with our seed dataset.\n", - "\n", - "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5325b303", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "# Process document chunks\n", - "processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)\n", - "chunks = processor.process_documents(DOCUMENT_LIST)\n", - "\n", - "# Create a seed DataFrame with the document chunks\n", - "seed_df = pd.DataFrame({\"context\": chunks})\n", - "seed_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0bfa504d", - "metadata": {}, - "outputs": [], - "source": [ - "os.makedirs(\"data\", exist_ok=True)\n", - "seed_df.to_csv(\"data/document_chunks.csv\", index=False)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f09836d4", - "metadata": {}, - "outputs": [], - "source": [ - "# Upload the seed dataset with document chunks\n", - "# Using shuffle with replacement allows the model to reuse context chunks\n", - "config_builder.with_seed_dataset(\n", - " repo_id=\"advanced-tutorials/rag_evaluation_dataset\",\n", - " filename=\"document_chunks.csv\",\n", - " dataset_path=\"./data/document_chunks.csv\",\n", - " sampling_strategy=\"shuffle\",\n", - " with_replacement=True,\n", - " datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "280e2fec", - "metadata": {}, - "source": [ - "### Adding Categorical Columns for Controlled Diversity\n", - "\n", - "Now we'll add categorical columns to control the diversity of our RAG evaluation pairs. We'll define:\n", - "\n", - "1. **Difficulty levels**: easy, medium, hard\n", - "\n", - "2. **Reasoning types**: factual recall, inferential reasoning, etc.\n", - "\n", - "3. **Question types**: answerable vs. unanswerable (with weighting)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e3e27cac", - "metadata": {}, - "outputs": [], - "source": [ - "# Configure categorical columns for controlled diversity\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"difficulty\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"easy\", \"medium\", \"hard\"],\n", - " description=\"The difficulty level of the question\"\n", - " )\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"reasoning_type\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\n", - " \"factual recall\",\n", - " \"inferential reasoning\",\n", - " \"comparative analysis\",\n", - " \"procedural understanding\",\n", - " \"cause and effect\"\n", - " ],\n", - " description=\"The type of reasoning required to answer the question\"\n", - " )\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " C.SamplerColumn(\n", - " name=\"question_type\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", - " values=[\"answerable\", \"unanswerable\"],\n", - " # 10:1 ratio of answerable to unanswerable questions.\n", - " weights=[10, 1],\n", - " )\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "id": "735cbbea", - "metadata": {}, - "source": [ - "### Adding LLM-Structured Column for Q&A Pair Generation\n", - "\n", - "Now let's set up the core of our data generation: the Q&A pair column that will produce structured question-answer pairs based on our document context and control parameters." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ecf44d9e", - "metadata": {}, - "outputs": [], - "source": [ - "# Add Q&A pair generation column\n", - "config_builder.add_column(\n", - " C.LLMStructuredColumn(\n", - " name=\"qa_pair\",\n", - " model_alias=model_alias,\n", - " system_prompt=(\n", - " \"You are an expert at generating high-quality RAG evaluation pairs. \"\n", - " \"You are very careful in assessing whether the question can be answered from the provided context. \"\n", - " ),\n", - " prompt=\"\"\"\\\n", - "{{context}}\n", - "\n", - "Generate a {{difficulty}} {{reasoning_type}} question-answer pair.\n", - "The question should be {{question_type}} using the provided context.\n", - "\n", - "For answerable questions:\n", - "- Ensure the answer is fully supported by the context\n", - "\n", - "For unanswerable questions:\n", - "- Keep the question topically relevant\n", - "- Make it clearly beyond the context's scope\n", - "\"\"\",\n", - " output_format=QAPair\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "id": "41e6cc02", - "metadata": {}, - "source": [ - "### Adding Evaluation Metrics with Custom Rubrics\n", - "\n", - "To assess the quality of our generated Q&A pairs, we'll add evaluation metrics using detailed rubrics for scoring. \n", - "\n", - "We use Data Designer's `LLMJudgeColumn` for this, defining a set of custom Rubrics designed for our task." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "953bca63", - "metadata": {}, - "outputs": [], - "source": [ - "context_relevance_rubric = P.Rubric(\n", - " name=\"Context Relevance\",\n", - " description=\"Evaluates how relevant the answer is to the provided context\",\n", - " scoring={\n", - " \"5\": \"Perfect relevance to context with no extraneous information\",\n", - " \"4\": \"Highly relevant with minor deviations from context\",\n", - " \"3\": \"Moderately relevant but includes some unrelated information\",\n", - " \"2\": \"Minimally relevant with significant departure from context\",\n", - " \"1\": \"Almost entirely irrelevant to the provided context\"\n", - " }\n", - ")\n", - "\n", - "answer_precision_rubric = P.Rubric(\n", - " name=\"Answer Precision\",\n", - " description=\"Evaluates the accuracy and specificity of the answer\",\n", - " scoring={\n", - " \"5\": \"Extremely precise with exact, specific information\",\n", - " \"4\": \"Very precise with minor imprecisions\",\n", - " \"3\": \"Adequately precise but could be more specific\",\n", - " \"2\": \"Imprecise with vague or ambiguous information\",\n", - " \"1\": \"Completely imprecise or inaccurate\"\n", - " }\n", - ")\n", - "\n", - "answer_completeness_rubric = P.Rubric(\n", - " name=\"Answer Completeness\",\n", - " description=\"Evaluates how thoroughly the answer addresses all aspects of the question\",\n", - " scoring={\n", - " \"5\": \"Fully complete, addressing all aspects of the question\",\n", - " \"4\": \"Mostly complete with minor omissions\",\n", - " \"3\": \"Adequately complete but missing some details\",\n", - " \"2\": \"Substantially incomplete, missing important aspects\",\n", - " \"1\": \"Severely incomplete, barely addresses the question\"\n", - " }\n", - ")\n", - "\n", - "hallucination_avoidance_rubric = P.Rubric(\n", - " name=\"Hallucination Avoidance\",\n", - " description=\"Evaluates the absence of made-up or incorrect information\",\n", - " scoring={\n", - " \"5\": \"No hallucinations, all information is factual and verifiable\",\n", - " \"4\": 
\"Minimal hallucinations that don't impact the core answer\",\n", - " \"3\": \"Some hallucinations that partially affect the answer quality\",\n", - " \"2\": \"Significant hallucinations that undermine the answer\",\n", - " \"1\": \"Severe hallucinations making the answer entirely unreliable\"\n", - " }\n", - ")\n", - "\n", - "EVAL_METRICS_PROMPT_TEMPLATE = \"\"\"\\\n", - "You are an expert evaluator of question-answer pairs. Analyze the following Q&A pair and evaluate it objectively.\n", - "\n", - "For this {{difficulty}} {{reasoning_type}} Q&A pair:\n", - "{{qa_pair}}\n", - "\n", - "Take a deep breath and carefully evaluate each criterion based on the provided rubrics, considering the difficulty level and reasoning type indicated.\n", - "\"\"\"\n", - "\n", - "config_builder.add_column(\n", - " C.LLMJudgeColumn(\n", - " name=\"eval_metrics\",\n", - " model_alias=model_alias,\n", - " prompt=EVAL_METRICS_PROMPT_TEMPLATE,\n", - " rubrics=[context_relevance_rubric, answer_precision_rubric, answer_completeness_rubric, hallucination_avoidance_rubric],\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "id": "8fb3dc84", - "metadata": {}, - "source": [ - "### πŸ‘€ Preview Sample Records\n", - "\n", - "Let's generate a preview to see what our data will look like before running the full generation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b55913d0", - "metadata": {}, - "outputs": [], - "source": [ - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6b1b895", - "metadata": {}, - "outputs": [], - "source": [ - "# Run this cell multiple times to cycle through the 10 preview records.\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b655a45f", - "metadata": {}, - "outputs": [], - "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset.head()" - ] - }, - { - "cell_type": "markdown", - "id": "40099da2", - "metadata": {}, - "source": [ - "### Generate the Full Dataset\n", - "\n", - "Now let's generate our full dataset of RAG evaluation pairs, analyze the coverage, and export it to a JSONL file for use in evaluating RAG systems." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2270b104", - "metadata": {}, - "outputs": [], - "source": [ - "# Generate the full dataset.\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", - "\n", - "job_results.wait_until_done()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ac909600", - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "596c8a4c", - "metadata": {}, - "outputs": [], - "source": [ - "# Export the dataset to JSONL format.\n", - "dataset.to_json('./data/rag_evals.jsonl', orient='records', lines=True)\n", - "print(\"\\nDataset exported to ./data/rag_evals.jsonl\")" - ] - }, - { - "cell_type": "markdown", - "id": "19c674f4", - "metadata": {}, - "source": [ - "### Using Your RAG Evaluation Dataset\n", - "\n", - "Now that you've generated a diverse RAG evaluation dataset, here are some ways to use it:\n", - "\n", - "1. 
**Benchmarking**: Test your RAG system against these evaluation pairs to measure performance\n", - "\n", - "2. **Error Analysis**: Identify patterns in where your RAG system struggles\n", - "\n", - "3. **Optimization**: Use insights to tune retrieval and generation parameters\n", - "\n", - "4. **Regression Testing**: Track performance over time as you improve your system\n", - "\n", - "5. **Model Comparison**: Compare different LLMs, retrievers, or RAG architectures\n", - "\n", - "The JSONL file contains structured data with questions, ground truth answers, and quality metrics that you can use with most evaluation frameworks." - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "667b36b1", + "metadata": {}, + "source": [ + "# 🎨 NeMo Data Designer: Generate Diverse RAG Evaluations\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases. \n", + "\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "7fc3f8fe", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e89ca1ab", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " ExpressionColumnConfig,\n", + " InferenceParameters,\n", + " LLMJudgeColumnConfig,\n", + " LLMStructuredColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " Score,\n", + " UniformSamplerParams,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fb55b35d", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3706d9a3", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "f6839a60", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9284cce4", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = 
[\n", + "    ModelConfig(\n", + "        alias=MODEL_ALIAS,\n", + "        model=MODEL_ID,\n", + "        provider=MODEL_PROVIDER,\n", + "        inference_parameters=InferenceParameters(\n", + "            temperature=0.6,\n", + "            top_p=0.95,\n", + "            max_tokens=1024,\n", + "        ),\n", + "    )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "ee5dde1a", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ceafed4", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "acd5e591", + "metadata": {}, + "source": [ + "## 🌱 Loading Seed Data\n", + "\n", + "- We'll use chunks of a source document as our seed data. \n", + "\n", + "- These document chunks provide the grounding context from which realistic, content-specific Q&A pairs are generated.\n", + "\n", + "
\n", + "\n", + "> 🌱 **Why use a seed dataset?**\n", + ">\n", + "> - Seed datasets let you steer the generation process by providing context that is specific to your use case.\n", + ">\n", + "> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.\n", + ">\n", + "> - During generation, prompt templates can reference any of the seed dataset fields.\n", + "\n", + "
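For instance — a minimal illustrative sketch, assuming the `context` seed column created later in this notebook — any prompt template can pull a seed field in by name:

```python
# Illustrative sketch only: prompt templates reference seed dataset fields
# by name using Jinja syntax. "context" is the seed column created below.
example_prompt = (
    "Generate a question that can be answered from this passage:\n\n"
    "{{ context }}"
)
```

This is exactly how the `qa_pair` prompt later in this notebook consumes the seed data.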
\n", + "\n", + "> πŸ’‘ **About datastores**\n", + ">\n", + "> - You can use seed datasets from _either_ the Hugging Face Hub or a locally deployed datastore.\n", + ">\n", + "> - By default, we use the local datastore deployed with the Data Designer microservice.\n", + ">\n", + "> - The datastore endpoint is specified in the deployment configuration.\n", + "\n", + "\n", + "πŸ‘‹ **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as \\\n", + "seeds, it is recommended that you consolidate these into a single file first, as sketched below. \n", + "
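A minimal consolidation sketch (hypothetical file names, assuming CSV inputs with matching columns) might look like:

```python
import pandas as pd

# Hypothetical file names -- replace with your own seed files.
seed_files = ["data/chunks_part1.csv", "data/chunks_part2.csv"]

# Stack the parts into a single DataFrame (columns are assumed to match)
# and write out one consolidated seed file.
combined = pd.concat([pd.read_csv(path) for path in seed_files], ignore_index=True)
combined.to_csv("data/combined_seed.csv", index=False)
```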
\n", + "\n", + "### βš™οΈ Document Processing\n", + "\n", + "Now we'll create a Document Processor class that handles loading and chunking the source documents. \n", + "\n", + "This class uses langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bfec3608", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List, Union\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from unstructured.partition.auto import partition\n", + "import tempfile\n", + "import os" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38ec64d5", + "metadata": {}, + "outputs": [], + "source": [ + "class DocumentProcessor:\n", + " \"\"\"Handles loading and chunking source documents for RAG evaluation.\"\"\"\n", + "\n", + " def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):\n", + " \"\"\"Initialize with configurable chunk size and overlap.\"\"\"\n", + " self.text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=chunk_size,\n", + " chunk_overlap=chunk_overlap,\n", + " length_function=len,\n", + " )\n", + "\n", + " def parse_document(self, uri: str) -> str:\n", + " \"\"\"Parse a single document from URI into raw text.\"\"\"\n", + " with open(uri, 'rb') as file:\n", + " content = file.read()\n", + " with tempfile.NamedTemporaryFile(delete=False) as temp_file:\n", + " temp_file.write(content)\n", + " temp_file.flush()\n", + " elements = partition(temp_file.name)\n", + "\n", + " os.unlink(temp_file.name)\n", + " return \"\\n\\n\".join([str(element) for element in elements])\n", + "\n", + " def process_documents(self, uris: Union[str, List[str]]) -> List[str]:\n", + " \"\"\"Process one or more documents into chunks for RAG evaluation.\"\"\"\n", + " if isinstance(uris, str):\n", + " uris = [uris]\n", + "\n", + " all_chunks = []\n", + " for uri in uris:\n", + " text = self.parse_document(uri)\n", + " chunks = self.text_splitter.split_text(text)\n", + " all_chunks.extend(chunks)\n", + "\n", + " return all_chunks" + ] + }, + { + "cell_type": "markdown", + "id": "7c44785c", + "metadata": {}, + "source": [ + "### πŸ—οΈ Data Models\n", + "\n", + "- Let's define Pydantic models for structured output generation. \n", + "\n", + "- These schemas will ensure our generated data has consistent structure and validation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cab035f", + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import BaseModel, Field\n", + "\n", + "class QAPair(BaseModel):\n", + " question: str = Field(\n", + " ..., description=\"A specific question related to the domain of the context\"\n", + " )\n", + " answer: str = Field(\n", + " ..., description=\"Either a context-supported answer or explanation of why the question cannot be answered\"\n", + " )\n", + " reasoning: str = Field(\n", + " ..., description=\"A clear and traceable explanation of the reasoning behind the answer\"\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5325b303", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Process document chunks\n", + "DOCUMENT_LIST = [\"./data/databricks-state-of-data-ai-report.pdf\"]\n", + "\n", + "processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)\n", + "chunks = processor.process_documents(DOCUMENT_LIST)\n", + "\n", + "# Create a seed DataFrame with the document chunks\n", + "seed_df = pd.DataFrame({\"context\": chunks})\n", + "\n", + "os.makedirs(\"data\", exist_ok=True)\n", + "seed_df.to_csv(\"data/document_chunks.csv\", index=False)\n", + "\n", + "seed_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8152fe1e-6d47-435d-91c1-85a6d17b1637", + "metadata": {}, + "outputs": [], + "source": [ + "dataset_reference = data_designer_client.upload_seed_dataset(\n", + " repo_id=\"data-designer-demo/rag-evaluation-dataset\",\n", + " dataset=seed_df,\n", + " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"})\n", + "\n", + "config_builder.with_seed_dataset(dataset_reference)" + ] + }, + { + "cell_type": "markdown", + "id": "280e2fec", + "metadata": {}, + "source": [ + "## 🎲 Adding Categorical Columns for Controlled Diversity\n", + "\n", + "Now we'll add categorical columns to control the diversity of our RAG evaluation pairs. We'll define:\n", + "\n", + "1. **Difficulty levels**: easy, medium, hard\n", + "\n", + "2. **Reasoning types**: factual recall, inferential reasoning, etc.\n", + "\n", + "3. **Question types**: answerable vs. 
unanswerable (with weighting)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3e27cac", + "metadata": {}, + "outputs": [], + "source": [ + "# Configure categorical columns for controlled diversity\n", + "config_builder.add_column(\n", + "    SamplerColumnConfig(\n", + "        name=\"difficulty\",\n", + "        sampler_type=SamplerType.CATEGORY,\n", + "        params=CategorySamplerParams(\n", + "            values=[\"easy\", \"medium\", \"hard\"],\n", + "        ),\n", + "    )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    SamplerColumnConfig(\n", + "        name=\"reasoning_type\",\n", + "        sampler_type=SamplerType.CATEGORY,\n", + "        params=CategorySamplerParams(\n", + "            values=[\n", + "                \"factual recall\",\n", + "                \"inferential reasoning\",\n", + "                \"comparative analysis\",\n", + "                \"procedural understanding\",\n", + "                \"cause and effect\",\n", + "            ],\n", + "        ),\n", + "    )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    SamplerColumnConfig(\n", + "        name=\"question_type\",\n", + "        sampler_type=SamplerType.CATEGORY,\n", + "        params=CategorySamplerParams(\n", + "            values=[\"answerable\", \"unanswerable\"],\n", + "            # 10:1 ratio of answerable to unanswerable questions.\n", + "            weights=[10, 1],\n", + "        ),\n", + "    )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "735cbbea", + "metadata": {}, + "source": [ + "## 🦜 Adding LLM-Structured Column for Q&A Pair Generation\n", + "\n", + "Now let's set up the core of our data generation: the Q&A pair column that will produce structured question-answer \\\n", + "pairs based on our document context and control parameters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ecf44d9e", + "metadata": {}, + "outputs": [], + "source": [ + "# Add Q&A pair generation column\n", + "config_builder.add_column(\n", + "    LLMStructuredColumnConfig(\n", + "        name=\"qa_pair\",\n", + "        model_alias=MODEL_ALIAS,\n", + "        system_prompt=SYSTEM_PROMPT,\n", + "        prompt=(\n", + "            \"{{context}}\\n\"\n", + "            \"\\n\"\n", + "            \"Generate a {{difficulty}} {{reasoning_type}} question-answer pair.\\n\"\n", + "            \"The question should be {{question_type}} using the provided context.\\n\"\n", + "            \"\\n\"\n", + "            \"For answerable questions:\\n\"\n", + "            \"- Ensure the answer is fully supported by the context\\n\"\n", + "            \"\\n\"\n", + "            \"For unanswerable questions:\\n\"\n", + "            \"- Keep the question topically relevant\\n\"\n", + "            \"- Make it clearly beyond the context's scope\\n\"\n", + "        ),\n", + "        output_format=QAPair,\n", + "    )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e7058d0a", + "metadata": {}, + "source": [ + "## πŸ” Quality Assessment: LLM-as-a-Judge\n", + "\n", + "When generating our synthetic dataset, we need to determine the quality of the generated data. \\\n", + "We use the LLM-as-a-Judge strategy to do this. \n", + "\n", + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", + "that provides relevant instructions. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "953bca63", + "metadata": {}, + "outputs": [], + "source": [ + "context_relevance_rubric = Score(\n", + " name=\"Context Relevance\",\n", + " description=\"Evaluates how relevant the answer is to the provided context\",\n", + " options={\n", + " \"5\": \"Perfect relevance to context with no extraneous information\",\n", + " \"4\": \"Highly relevant with minor deviations from context\",\n", + " \"3\": \"Moderately relevant but includes some unrelated information\",\n", + " \"2\": \"Minimally relevant with significant departure from context\",\n", + " \"1\": \"Almost entirely irrelevant to the provided context\",\n", + " },\n", + ")\n", + "\n", + "answer_precision_rubric = Score(\n", + " name=\"Answer Precision\",\n", + " description=\"Evaluates the accuracy and specificity of the answer\",\n", + " options={\n", + " \"5\": \"Extremely precise with exact, specific information\",\n", + " \"4\": \"Very precise with minor imprecisions\",\n", + " \"3\": \"Adequately precise but could be more specific\",\n", + " \"2\": \"Imprecise with vague or ambiguous information\",\n", + " \"1\": \"Completely imprecise or inaccurate\",\n", + " },\n", + ")\n", + "\n", + "answer_completeness_rubric = Score(\n", + " name=\"Answer Completeness\",\n", + " description=\"Evaluates how thoroughly the answer addresses all aspects of the question\",\n", + " options={\n", + " \"5\": \"Fully complete, addressing all aspects of the question\",\n", + " \"4\": \"Mostly complete with minor omissions\",\n", + " \"3\": \"Adequately complete but missing some details\",\n", + " \"2\": \"Substantially incomplete, missing important aspects\",\n", + " \"1\": \"Severely incomplete, barely addresses the question\",\n", + " },\n", + ")\n", + "\n", + "hallucination_avoidance_rubric = Score(\n", + " name=\"Hallucination Avoidance\",\n", + " description=\"Evaluates the absence of made-up or incorrect information\",\n", + " options={\n", + " \"5\": \"No hallucinations, all information is factual and verifiable\",\n", + " \"4\": \"Minimal hallucinations that don't impact the core answer\",\n", + " \"3\": \"Some hallucinations that partially affect the answer quality\",\n", + " \"2\": \"Significant hallucinations that undermine the answer\",\n", + " \"1\": \"Severe hallucinations making the answer entirely unreliable\",\n", + " },\n", + ")\n", + "\n", + "EVAL_METRICS_PROMPT_TEMPLATE = (\n", + " \"You are an expert evaluator of question-answer pairs. Analyze the following Q&A pair and evaluate it objectively.\\n\\n\"\n", + " \"For this {{difficulty}} {{reasoning_type}} Q&A pair:\\n\"\n", + " \"{{qa_pair}}\\n\\n\"\n", + " \"Take a deep breath and carefully evaluate each criterion based on the provided rubrics, considering the \"\n", + " \"difficulty level and reasoning type indicated.\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " LLMJudgeColumnConfig(\n", + " name=\"eval_metrics\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=EVAL_METRICS_PROMPT_TEMPLATE,\n", + " scores=[\n", + " context_relevance_rubric,\n", + " answer_precision_rubric,\n", + " answer_completeness_rubric,\n", + " hallucination_avoidance_rubric,\n", + " ],\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "f8ff88d0", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. 
Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efdf43d5", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1da20a5", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "2cfd025b", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79185c9a", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "ed6fc118", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5568d200", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4aa51230", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65e9fd83", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "022d9f47", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-rag-examples-generate-rag-generation-eval-dataset\",\n", + ");" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb index 656ead4cb..37659973a 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb +++ 
b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb @@ -1,492 +1,641 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer: Synthetic Reasoning Traces" - ] - }, - { - "cell_type": "markdown", - "id": "a9521867", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks. Instead of creating multi-turn conversations, we will generate reasoning traces that can be utilized for training and fine-tuning language models with reinforcement learning techniques and invoking chain-of-thought processing.\n", - "\n", - "These synthetic reasoning traces can be used to enhance model performance in areas such as mathematics, coding, scientific reasoning, and other domains that benefit from structured reasoning." - ] - }, - { - "cell_type": "markdown", - "id": "ea340547", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6b395aa4", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "19048f79", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43880183", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "01fc96fb", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1eacb024", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b97f239f", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 🌱 Adding Categorical Seed Columns\n", - "\n", - "Define categorical seed columns that set the context for the generated empathic reasoning traces. For example, domain and theme determine the type of everyday scenario where empathy is crucial, while complexity guides the depth of emotional insight and detailed support." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define a domain column that sets the context for empathic scenarios in everyday life.\n", - "config_builder.add_column(\n", - " name=\"domain\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\n", - " \"Family Dynamics\",\n", - " \"Workplace Challenges\",\n", - " \"Friendship Moments\",\n", - " \"Community Interactions\",\n", - " \"Personal Well-being\",\n", - " \"Unexpected Encounters\"\n", - " ]\n", - " }\n", - ")\n", - "\n", - "# Add theme subcategories for each domain\n", - "config_builder.add_column(\n", - " name=\"theme\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"domain\",\n", - " \"values\": {\n", - " \"Family Dynamics\": [\n", - " \"Parenting Dilemmas\",\n", - " \"Sibling Rivalries\"\n", - " ],\n", - " \"Workplace Challenges\": [\n", - " \"Communication Breakdowns\",\n", - " \"Leadership Dilemmas\"\n", - " ],\n", - " \"Friendship Moments\": [\n", - " \"Support & Understanding\",\n", - " \"Misunderstandings & Reconciliations\"\n", - " ],\n", - " \"Community Interactions\": [\n", - " \"Neighborhood Support\",\n", - " \"Cultural Celebrations\"\n", - " ],\n", - " \"Personal Well-being\": [\n", - " \"Mental Health\",\n", - " \"Self-care & Reflection\"\n", - " ],\n", - " \"Unexpected Encounters\": [\n", - " \"Serendipitous Meetings\",\n", - " \"Moments of Realization\"\n", - " ]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Define a complexity column to guide the level of detail and challenge in the empathic scenarios.\n", - "config_builder.add_column(\n", - " name=\"complexity\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Basic\", \"Intermediate\", \"Advanced\"]\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### ✨ Adding Generated Data Columns\n", - "\n", - "Define the columns that the model will generate. These prompts instruct the LLM to produce the actual empathic reasoning trace and answer, following the specified format with and tags.\n", - "\n", - "#### Empathic Reasoning Trace Generation\n", - "\n", - "This column is designed to generate clear, thoughtful reasoning traces that blend logical analysis with emotional insight for everyday situations where empathy is crucial. The generation prompt is tailored to:\n", - "- Produce a structured explanation that highlights both the practical reasoning and the emotional dynamics at play.\n", - "- Encourage a dual output: one part detailing the empathic thought process (enclosed within `` tags) and another delivering a compassionate final answer (enclosed within `` tags).\n", - "- Ensure that the generated content reflects deep understanding, compassion, and a balanced view of the challenges and emotions involved." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "special_system_instructions = \"\"\"\n", - "You are an empathic reasoning agent. Your task is to generate realistic and compassionate reasoning traces for common day-to-day situations. 
Adopt a caring and supportive tone as you provide detailed insights into human experiences and emotions.\n", - "- Focus on everyday scenarios where empathy, understanding, and emotional intelligence are key.\n", - "- Consider various perspectives, emphasizing the emotional impact of actions and decisions.\n", - "- Ensure your reasoning process is clear, structured, and heartfelt, reflecting deep care for the individuals involved.\n", - "- Enclose your thoughtful reasoning process within ... tags before providing the final JSON output.\n", - "\"\"\"\n", - "\n", - "config_builder.add_column(\n", - " name=\"scenario\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " system_prompt=special_system_instructions,\n", - " prompt=(\n", - " \"Generate a clear and concise everyday scenario for the {{domain}} domain, theme {{theme}}, and complexity {{complexity}}, \"\n", - " \"where empathy and understanding play a crucial role. Focus on a situation that highlights emotional challenges or opportunities for compassionate support, and include a specific question or request for help that clearly outlines a problem or challenge needing resolution.\\n\\n\"\n", - " \"Guidelines:\\n\"\n", - " \"- Provide only the scenario statement without any additional metadata, solution steps, or internal commentary.\\n\"\n", - " \"- Use everyday language and incorporate realistic, practical context from an empathic perspective.\\n\"\n", - " \"- Ensure the scenario includes a clear follow-up question or request for assistance, making it apparent what the problem or challenge is.\\n\"\n", - " \"- Do not include any formatting tags or markers.\\n\\n\"\n", - " \"Examples:\\n\"\n", - " \"1. 'Imagine a situation where a friend is visibly upset after a long, challenging day. What might be causing their distress, and how could you offer support?'\\n\"\n", - " \"2. 'Consider a moment at a family dinner where a subtle conflict arises between members. What could be the underlying issue, and how might empathy help mend the situation?'\\n\"\n", - " \"3. 'Picture a colleague receiving unexpected criticism during a meeting. What are the potential emotional impacts, and what supportive response could be helpful?'\\n\"\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Empathic Reasoning Process Generation\n", - "\n", - "These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios. The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight. The prompts instruct the model to include its internal thought process within ... tags before providing the JSON output." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import List\n", - "from pydantic import BaseModel, Field\n", - "\n", - "class Thought(BaseModel):\n", - " \"\"\"A single step in the structured empathic reasoning process.\n", - " This step captures an empathetic observation or insight that informs a thoughtful, compassionate approach to addressing everyday challenges.\n", - " \"\"\"\n", - " step_number: int = Field(..., ge=1, description=\"The order of the reasoning step, starting from 1.\")\n", - " content: str = Field(..., min_length=5, description=\"A detailed explanation of this reasoning step, incorporating both logical analysis and emotional insight.\")\n", - "\n", - "class ReasoningTrace(BaseModel):\n", - " \"\"\"A structured empathic reasoning trace for addressing a scenario.\n", - " This model records a step-by-step process that integrates logical analysis with emotional insight and empathy to arrive at a supportive final answer.\n", - " \"\"\"\n", - " reasoning: List[Thought] = Field(..., description=\"Step-by-step reasoning leading to the final answer, enriched with empathetic observations and practical insights.\")\n", - " answer: str = Field(..., description=\"The final answer derived from the empathic reasoning process, offering compassionate guidance or resolution.\")\n", - "\n", - "class Evaluation(BaseModel):\n", - " \"\"\"Output format for evaluating an empathic reasoning answer.\n", - " The evaluation assesses the response based on correctness, clarity, and completeness,\n", - " with feedback that emphasizes compassionate insight, clarity, and a holistic understanding of the scenario.\n", - " \"\"\"\n", - " correctness: float = Field(..., description=\"Overall correctness rating of the answer (0 to 1).\")\n", - " clarity: float = Field(..., description=\"Clarity rating of the reasoning, including the integration of empathic explanations (0 to 1).\")\n", - " completeness: float = Field(..., description=\"Completeness rating of the reasoning, assessing whether all practical and emotional aspects were considered (0 to 1).\")\n", - " feedback: str = Field(..., description=\"Detailed feedback on the reasoning trace and answer, with suggestions for enhancing empathetic and real-world applicability.\")\n", - "\n", - "class FinalEvaluation(Evaluation):\n", - " \"\"\"Extended evaluation model for final empathic reasoning traces.\n", - " This model adds criteria to assess visual structure and conciseness,\n", - " ensuring the final output is both clear and visually appealing.\n", - " \"\"\"\n", - " structure: float = Field(..., description=\"Rating of the visual structure and formatting (0 to 1), assessing if reasoning steps and final answer are clearly delineated.\")\n", - " conciseness: float = Field(..., description=\"Rating of the conciseness of the reasoning trace (0 to 1), ensuring that extraneous verbosity is minimized.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " name=\"initial_trace\",\n", - " type=\"llm-structured\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"You are an empathic reasoning agent. Provide a detailed, step-by-step reasoning process that thoughtfully addresses the following scenario. \"\n", - " \"Begin by outlining your internal thought process, focusing on both logical considerations and emotional insights, enclosed within ... tags. 
\"\n", - " \"Then, provide your final compassionate answer.\\n\\n\"\n", - " \"Scenario: {{scenario}}\\n\\n\"\n", - " \"Ensure that your response is structured and reflective of a supportive, empathetic approach.\"\n", - " ),\n", - " output_format=ReasoningTrace\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " name=\"initial_trace_evaluation\",\n", - " type=\"llm-structured\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"{{initial_trace}}\\n\\n\"\n", - " \"Now, analyze the provided empathic reasoning trace and final answer as if you were an insightful observer assessing both logical and compassionate approaches. \"\n", - " \"Evaluate the response with a focus on emotional insight, clarity, and holistic consideration.\\n\\n\"\n", - " \"Include your internal thought process within ... tags before providing the JSON.\"\n", - " ),\n", - " output_format=Evaluation\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Final Empathic Reasoning Trace Generation and Evaluation\n", - "\n", - "These columns refine and evaluate the final empathic reasoning trace. The final trace is generated by reviewing the scenario, your initial empathic reasoning trace, and its evaluation. The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive. As always, include your internal thought process wrapped within ... tags before providing the final JSON output." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " name=\"final_trace\",\n", - " type=\"llm-structured\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"Review the scenario, your initial empathic reasoning trace, and its evaluation:\\n\\n\"\n", - " \"Scenario: {{scenario}}\\n\\n\"\n", - " \"Initial Empathic Reasoning Trace:\\n{{initial_trace}}\\n\\n\"\n", - " \"Initial Trace Evaluation:\\n{{initial_trace_evaluation}}\\n\\n\"\n", - " \"From the perspective of an empathic reasoning agent, provide a refined final reasoning trace that addresses both the emotional and logical dimensions of the scenario. \"\n", - " \"Your final trace should be visually structured as follows:\\n\"\n", - " \"1. Present a numbered list of concise reasoning steps. Each step should be clear and free of unnecessary verbosity.\\n\"\n", - " \"2. Include a clearly separated section for the final answer, prefixed with a header (e.g., 'Final Answer:').\\n\"\n", - " \"3. Use visual markers or markdown formatting to enhance readability.\\n\"\n", - " \"Avoid adding extraneous detailsβ€”focus on clarity and conciseness.\\n\\n\"\n", - " \"Also, include your internal thought process wrapped within ... tags. \"\n", - " \"Return only the final, visually structured reasoning trace.\"\n", - " ),\n", - " output_format=ReasoningTrace\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " name=\"final_trace_evaluation\",\n", - " type=\"llm-structured\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"{{final_trace}}\\n\\n\"\n", - " \"Analyze the provided empathic reasoning trace and final answer from the viewpoint of an insightful observer. \"\n", - " \"Evaluate the response focusing on correctness, clarity, and completeness, as well as its visual structure and conciseness. 
\"\n", - " \"Assess whether the reasoning steps are clearly separated (e.g., numbered or bullet-pointed) and if the final answer is distinct and succinct.\\n\\n\"\n", - " \"Include your internal thought process within ... tags before providing the JSON.\"\n", - " ),\n", - " output_format=FinalEvaluation\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Generating a dataset preview\n", - "\n", - "- Preview mode allows you to quickly iterate on your data design.\n", - "\n", - "- Each preview generation call creates a sample for inspection, helping you verify prompts and instructions before running a larger batch job." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate a preview\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview.display_sample_record()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ€” Like what you see?\n", - "\n", - "- Submit a batch workflow!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Check to see if the Workflow is still active.\n", - "job_results.get_job_status()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "dataset.head()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "c5c8458b", + "metadata": {}, + "source": [ + "# 🧠 NeMo Data Designer: Synthetic Reasoning Traces\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "- This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks. \n", + "\n", + "- Instead of creating multi-turn conversations, we will generate reasoning traces that can be utilized for training and \\\n", + "fine-tuning language models with reinforcement learning techniques and invoking chain-of-thought processing.\n", + "\n", + "- These synthetic reasoning traces can be used to enhance model performance in areas such as mathematics, coding, scientific \\\n", + "reasoning, and other domains that benefit from structured reasoning.\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "3eb133e4", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b395aa4", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", + " DataDesignerConfigBuilder,\n", + " InferenceParameters,\n", + " LLMStructuredColumnConfig,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " SubcategorySamplerParams,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "068db4f9", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14478f0e", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "1c20a8f9", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ebb17960", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-nano-v2\"\n", + "\n", + "# This sets reasoning to False for the nemotron-nano-v2 model.\n", + "SYSTEM_PROMPT = \"/no_think\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " 
alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "223bb477", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c603268e", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "af03cffe", + "metadata": {}, + "source": [ + "## 🎲 Adding Categorical Columns for Controlled Diversity\n", + "\n", + "- Now we'll add categorical columns to control the diversity of our generated examples\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "628b8eac", + "metadata": {}, + "outputs": [], + "source": [ + "# Define a domain column that sets the context for empathic scenarios in everyday life.\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"domain\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\n", + " \"Family Dynamics\",\n", + " \"Workplace Challenges\",\n", + " \"Friendship Moments\",\n", + " \"Community Interactions\",\n", + " \"Personal Well-being\",\n", + " \"Unexpected Encounters\"\n", + " ]\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Add theme subcategories for each domain\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"theme\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"domain\",\n", + " values={\n", + " \"Family Dynamics\": [\n", + " \"Parenting Dilemmas\",\n", + " \"Sibling Rivalries\"\n", + " ],\n", + " \"Workplace Challenges\": [\n", + " \"Communication Breakdowns\",\n", + " \"Leadership Dilemmas\"\n", + " ],\n", + " \"Friendship Moments\": [\n", + " \"Support & Understanding\",\n", + " \"Misunderstandings & Reconciliations\"\n", + " ],\n", + " \"Community Interactions\": [\n", + " \"Neighborhood Support\",\n", + " \"Cultural Celebrations\"\n", + " ],\n", + " \"Personal Well-being\": [\n", + " \"Mental Health\",\n", + " \"Self-care & Reflection\"\n", + " ],\n", + " \"Unexpected Encounters\": [\n", + " \"Serendipitous Meetings\",\n", + " \"Moments of Realization\"\n", + " ]\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Define a complexity column to guide the level of detail and challenge in the empathic scenarios.\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"complexity\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Basic\", \"Intermediate\", \"Advanced\"]\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "1563556a", + "metadata": {}, + "source": [ + "## 🦜 LLM-generated columns\n", + "\n", + "- When prompting the LLM, we can use Jinja 
templating to reference other columns in the dataset.\n", + "\n", + "- As we see below, nested JSON fields can be accessed using dot notation (see the short sketch after this list).\n", + "\n", + "- These prompts instruct the LLM to produce the actual empathic reasoning trace and answer, following the specified format with `<think>` and `<answer>` tags.\n", + "
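As a concrete illustration of the dot notation (a sketch only, not part of this notebook's configuration), a downstream prompt could reference a single field of a structured column — for example, the `answer` field of the `initial_trace` column defined later in this notebook:

```python
# Illustrative sketch only: nested fields of a structured column can be
# referenced with dot notation in downstream prompt templates, e.g. the
# "answer" field of "initial_trace" (a ReasoningTrace, defined below).
example_prompt = (
    "Critique the following final answer in one sentence:\n"
    "{{ initial_trace.answer }}"
)
```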
\n", + "\n", + "### 🧠 Empathic Reasoning Trace Generation\n", + "\n", + "This column is designed to generate clear, thoughtful reasoning traces that blend logical analysis with emotional insight for everyday situations \\\n", + "where empathy is crucial. The generation prompt is tailored to:\n", + "- Produce a structured explanation that highlights both the practical reasoning and the emotional dynamics at play.\n", + "\n", + "- Encourage a dual output: one part detailing the empathic thought process (enclosed within `<think>` tags) and another delivering a \\\n", + "compassionate final answer (enclosed within `<answer>` tags).\n", + "\n", + "- Ensure that the generated content reflects deep understanding, compassion, and a balanced view of the challenges and emotions involved.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "faaa503e", + "metadata": {}, + "outputs": [], + "source": [ + "EMPATHIC_SYSTEM_PROMPT = (\n", + "    \"You are an empathic reasoning agent. Your task is to generate realistic and compassionate reasoning traces for \"\n", + "    \"common day-to-day situations. \\n\"\n", + "    \"Adopt a caring and supportive tone as you provide detailed insights into human experiences and emotions.\\n\"\n", + "    \"- Focus on everyday scenarios where empathy, understanding, and emotional intelligence are key.\\n\"\n", + "    \"- Consider various perspectives, emphasizing the emotional impact of actions and decisions.\\n\"\n", + "    \"- Ensure your reasoning process is clear, structured, and heartfelt, reflecting deep care for the individuals involved.\\n\"\n", + "    \"- Enclose your thoughtful reasoning process within <think>...</think> tags before providing the final JSON output.\"\n", + "    \"/no_think\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    LLMTextColumnConfig(\n", + "        name=\"scenario\",\n", + "        model_alias=MODEL_ALIAS,\n", + "        system_prompt=EMPATHIC_SYSTEM_PROMPT,\n", + "        prompt=(\n", + "            \"Generate a clear and concise everyday scenario for the {{domain}} domain, theme {{theme}}, and \"\n", + "            \"complexity {{complexity}}, where empathy and understanding play a crucial role. Focus on a situation that \"\n", + "            \"highlights emotional challenges or opportunities for compassionate support, and include a specific \"\n", + "            \"question or request for help that clearly outlines a problem or challenge needing resolution.\\n\\n\"\n", + "            \"Guidelines:\\n\"\n", + "            \"- Provide only the scenario statement without any additional metadata, solution steps, or internal \"\n", + "            \"commentary.\\n\"\n", + "            \"- Use everyday language and incorporate realistic, practical context from an empathic perspective.\\n\"\n", + "            \"- Ensure the scenario includes a clear follow-up question or request for assistance, making it \"\n", + "            \"apparent what the problem or challenge is.\\n\"\n", + "            \"- Do not include any formatting tags or markers.\\n\\n\"\n", + "            \"Examples:\\n\"\n", + "            \"1. 'Imagine a situation where a friend is visibly upset after a long, challenging day. What might be \"\n", + "            \"causing their distress, and how could you offer support?'\\n\"\n", + "            \"2. 'Consider a moment at a family dinner where a subtle conflict arises between members. What could be \"\n", + "            \"the underlying issue, and how might empathy help mend the situation?'\\n\"\n", + "            \"3. 'Picture a colleague receiving unexpected criticism during a meeting. 
What are the potential emotional \"\n", + " \"impacts, and what supportive response could be helpful?'\\n\"\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e4176d96", + "metadata": {}, + "source": [ + "### ⚑️ Empathic Reasoning Process Generation\n", + "\n", + "- These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios. \n", + "\n", + "- The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight. \n", + "\n", + "- The prompts instruct the model to include its internal thought process within `<think> ... </think>` tags before providing the JSON output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc08891e", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "from pydantic import BaseModel, Field\n", + "\n", + "class Thought(BaseModel):\n", + " \"\"\"A single step in the structured empathic reasoning process.\n", + " This step captures an empathetic observation or insight that informs a thoughtful, compassionate approach to addressing everyday challenges.\n", + " \"\"\"\n", + " step_number: int = Field(..., ge=1, description=\"The order of the reasoning step, starting from 1.\")\n", + " content: str = Field(..., min_length=5, description=(\"A detailed explanation of this reasoning step, incorporating both logical analysis and emotional insight.\"))\n", + "\n", + "class ReasoningTrace(BaseModel):\n", + " \"\"\"A structured empathic reasoning trace for addressing a scenario.\n", + " This model records a step-by-step process that integrates logical analysis with emotional insight and empathy to arrive at a supportive final answer.\n", + " \"\"\"\n", + " reasoning: List[Thought] = Field(..., description=\"Step-by-step reasoning leading to the final answer, enriched with empathetic observations and practical insights.\")\n", + " answer: str = Field(..., description=\"The final answer derived from the empathic reasoning process, offering compassionate guidance or resolution.\")\n", + "\n", + "class Evaluation(BaseModel):\n", + " \"\"\"Output format for evaluating an empathic reasoning answer.\n", + " The evaluation assesses the response based on correctness, clarity, and completeness,\n", + " with feedback that emphasizes compassionate insight, clarity, and a holistic understanding of the scenario.\n", + " \"\"\"\n", + " correctness: float = Field(..., description=\"Overall correctness rating of the answer (0 to 1).\")\n", + " clarity: float = Field(..., description=\"Clarity rating of the reasoning, including the integration of empathic explanations (0 to 1).\")\n", + " completeness: float = Field(..., description=\"Completeness rating of the reasoning, assessing whether all practical and emotional aspects were considered (0 to 1).\")\n", + " feedback: str = Field(..., description=\"Detailed feedback on the reasoning trace and answer, with suggestions for enhancing empathetic and real-world applicability.\")\n", + "\n", + "class FinalEvaluation(Evaluation):\n", + " \"\"\"Extended evaluation model for final empathic reasoning traces.\n", + " This model adds criteria to assess visual structure and conciseness,\n", + " ensuring the final output is both clear and visually appealing.\n", + " \"\"\"\n", + " structure: float = Field(..., description=\"Rating of the visual structure and formatting (0 to 1), assessing if reasoning steps and final answer are clearly delineated.\")\n", + " conciseness: float = Field(..., 
description=\"Rating of the conciseness of the reasoning trace (0 to 1), ensuring that extraneous verbosity is minimized.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7d038a9", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder.add_column(\n", + " LLMStructuredColumnConfig(\n", + " name=\"initial_trace\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\n", + " \"You are an empathic reasoning agent. Provide a detailed, step-by-step reasoning process that \"\n", + " \"thoughtfully addresses the following scenario. \\n\"\n", + " \"Begin by outlining your internal thought process, focusing on both logical considerations and \"\n", + " \"emotional insights, enclosed within ... tags. \\n\"\n", + " \"Then, provide your final compassionate answer.\\n\\n\"\n", + " \"Scenario: {{scenario}}\\n\\n\"\n", + " \"Ensure that your response is structured and reflective of a supportive, empathetic approach.\"\n", + " ),\n", + " output_format=ReasoningTrace\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " LLMStructuredColumnConfig(\n", + " name=\"initial_trace_evaluation\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\n", + " \"{{initial_trace}}\\n\\n\"\n", + " \"Now, analyze the provided empathic reasoning trace and final answer as if you were an insightful \"\n", + " \"observer assessing both logical and compassionate approaches. \\n\"\n", + " \"Evaluate the response with a focus on emotional insight, clarity, and holistic consideration.\\n\\n\"\n", + " \"Include your internal thought process within ... tags before providing the JSON.\"\n", + " ),\n", + " output_format=Evaluation\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "aba0d507", + "metadata": {}, + "source": [ + "### Final Empathic Reasoning Trace Generation and Evaluation\n", + "\n", + "- These columns refine and evaluate the final empathic reasoning trace. \n", + "\n", + "- The final trace is generated by reviewing the scenario, your initial empathic reasoning trace, and its evaluation. \n", + "\n", + "- The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive. \n", + "\n", + "- As always, include your internal thought process wrapped within ... tags before providing the final JSON output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eda8bc4f", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder.add_column(\n", + " LLMStructuredColumnConfig(\n", + " name=\"final_trace\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\n", + " \"Review the scenario, your initial empathic reasoning trace, and its evaluation:\\n\\n\"\n", + " \"Scenario: {{scenario}}\\n\\n\"\n", + " \"Initial Empathic Reasoning Trace:\\n{{initial_trace}}\\n\\n\"\n", + " \"Initial Trace Evaluation:\\n{{initial_trace_evaluation}}\\n\\n\"\n", + " \"From the perspective of an empathic reasoning agent, provide a refined final reasoning trace that \"\n", + " \"addresses both the emotional and logical dimensions of the scenario. \\n\"\n", + " \"Your final trace should be visually structured as follows:\\n\"\n", + " \"1. Present a numbered list of concise reasoning steps. Each step should be clear and free of \"\n", + " \"unnecessary verbosity.\\n\"\n", + " \"2. 
Include a clearly separated section for the final answer, prefixed with a header \"\n", + " \"(e.g., 'Final Answer:').\\n\"\n", + " \"3. Use visual markers or markdown formatting to enhance readability.\\n\"\n", + " \"Avoid adding extraneous details; focus on clarity and conciseness.\\n\\n\"\n", + " \"Also, include your internal thought process wrapped within <think> ... </think> tags. \"\n", + " \"Return only the final, visually structured reasoning trace.\"\n", + " ),\n", + " output_format=ReasoningTrace\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " LLMStructuredColumnConfig(\n", + " name=\"final_trace_evaluation\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=SYSTEM_PROMPT,\n", + " prompt=(\n", + " \"{{final_trace}}\\n\\n\"\n", + " \"Analyze the provided empathic reasoning trace and final answer from the viewpoint of an \"\n", + " \"insightful observer. \\n\"\n", + " \"Evaluate the response focusing on correctness, clarity, and completeness, as well as its \"\n", + " \"visual structure and conciseness. \\n\"\n", + " \"Assess whether the reasoning steps are clearly separated (e.g., numbered or bullet-pointed) and if \"\n", + " \"the final answer is distinct and succinct.\\n\\n\"\n", + " \"Include your internal thought process within <think> ... </think> tags before providing the JSON.\"\n", + " ),\n", + " output_format=FinalEvaluation\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a94966fc", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." 
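Beyond eyeballing sample records, the pydantic models defined above can double as a programmatic format check once a dataset has been loaded. The sketch below is illustrative rather than part of the SDK surface: it assumes the structured `final_trace` column round-trips as a JSON string (or dict) in the DataFrame returned by `load_dataset()`, so adjust the parsing to whatever serialization you actually observe.

```python
import json

import pandas as pd  # load_dataset() returns a pandas DataFrame later in this notebook


def check_traces(dataset: pd.DataFrame) -> None:
    """Validate every generated final_trace against the ReasoningTrace schema."""
    failures = 0
    for raw in dataset["final_trace"]:
        # Assumption: each cell is a JSON string or an already-parsed dict.
        record = raw if isinstance(raw, dict) else json.loads(raw)
        try:
            # ReasoningTrace is the pydantic model defined earlier in this notebook.
            trace = ReasoningTrace.model_validate(record)
        except Exception as exc:
            failures += 1
            print(f"Invalid trace: {exc}")
        else:
            assert trace.answer, "final answer should be non-empty"
    print(f"{len(dataset) - failures}/{len(dataset)} traces parsed cleanly")
```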
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7821de6", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a2cc1a4", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "7a65f990", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4f0cf4d", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "bc90ca08", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55d53977", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=2)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a63e8db2", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc0aa6f3", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c34c56d7", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-reasoning-reasoning-traces\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb index f6c2610f6..2e39b0532 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb @@ -1,555 +1,740 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer: Text-to-Python 
with Evolution" - ] - }, - { - "cell_type": "markdown", - "id": "90fa5cd6", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples, with a focus on evolutionary improvements. We'll build a system that generates Python code based on natural language instructions, validates it, analyzes issues, and then improves the code based on feedback." - ] - }, - { - "cell_type": "markdown", - "id": "239f1822", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b19fddbb", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "2dc9018a", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. 
You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0b2b9d1b", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "9dc329dc", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1542051d", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6309d900", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🌱 Define Categorical Seed Columns\n", - "\n", - "We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant code examples." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Add industry sector categories\n", - "config_builder.add_column(\n", - " name=\"industry_sector\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Healthcare\", \"Finance\", \"Technology\"],\n", - " \"description\": \"The industry sector for the code example\"\n", - " }\n", - ")\n", - "\n", - "# Add topic as a subcategory of industry_sector\n", - "config_builder.add_column(\n", - " name=\"topic\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"industry_sector\",\n", - " \"values\": {\n", - " \"Healthcare\": [\n", - " \"Electronic Health Records (EHR) Systems\",\n", - " \"Telemedicine Platforms\",\n", - " \"AI-Powered Diagnostic Tools\"\n", - " ],\n", - " \"Finance\": [\n", - " \"Fraud Detection Software\",\n", - " \"Automated Trading Systems\",\n", - " \"Personal Finance Apps\"\n", - " ],\n", - " \"Technology\": [\n", - " \"Cloud Computing Platforms\",\n", - " \"Artificial Intelligence and Machine Learning Platforms\",\n", - " \"DevOps and CI/CD Tools\"\n", - " ]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Add code complexity with subcategory for code concepts\n", - "config_builder.add_column(\n", - " name=\"code_complexity\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Beginner\", \"Intermediate\", \"Advanced\"],\n", - " \"description\": \"The complexity level of the code\"\n", - " }\n", - ")\n", - "\n", - "# Add code_concept as a subcategory of code_complexity\n", - "config_builder.add_column(\n", - " name=\"code_concept\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"code_complexity\",\n", - " \"values\": {\n", - " \"Beginner\": [\n", - " \"Variables\",\n", - " \"Data Types\",\n", - " \"Functions\",\n", - " \"Loops\",\n", - " \"Classes\"\n", - " ],\n", - " \"Intermediate\": [\n", - " \"List Comprehensions\",\n", - " \"Object-oriented programming\",\n", - " \"Lambda Functions\",\n", - " \"Web frameworks\",\n", - " \"Pandas\"\n", - " ],\n", - " \"Advanced\": [\n", - " \"Multithreading\",\n", - " \"Context Managers\",\n", - " \"Generators\"\n", - " ]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Add instruction phrases\n", - "config_builder.add_column(\n", - " name=\"instruction_phrase\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\n", - " \"Write a function that\",\n", - " \"Create a class that\",\n", - " \"Implement a script\",\n", - " \"Can you create a function\",\n", - " \"Develop a module that\"\n", - " ],\n", - " \"description\": \"Starting phrase for the code instruction\"\n", - " }\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ✨ Define Initial Code Generation\n", - "\n", - "First, we'll set up the columns for generating the instruction and initial code implementation using the same approach as in the text-to-python notebook." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate instruction for the code\n", - "config_builder.add_column(\n", - " name=\"instruction\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " system_prompt=\"You are an expert at generating clear and specific programming tasks.\",\n", - " prompt=\"\"\"\\\n", - "Generate an instruction to create Python code that solves a specific problem.\n", - "Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\n", - "\n", - "Important Guidelines:\n", - "* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.\n", - "* Code Complexity: Tailor the instruction to the {{code_complexity}} level. Utilize relevant {{code_concept}} where appropriate to match the complexity level.\n", - "* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.\n", - "* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\n", - "\"\"\"\n", - ")\n", - "\n", - "# Generate the initial Python code\n", - "config_builder.add_column(\n", - " name=\"initial_code\",\n", - " type=\"llm-code\",\n", - " model_alias=model_alias,\n", - " output_format=\"python\",\n", - " system_prompt=\"You are an expert Python programmer who writes clean, efficient, and well-documented code.\",\n", - " prompt=\"\"\"\\\n", - "Write Python code for the following instruction:\n", - "Instruction: {{instruction}}\n", - "\n", - "Important Guidelines:\n", - "* Code Quality: Your code should be clean, complete, self-contained and accurate.\n", - "* Code Validity: Please ensure that your python code is executable and does not contain any errors.\n", - "* Packages: Remember to import any necessary libraries, and to use all libraries you import.\n", - "* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.\n", - "\"\"\"\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ” Code Validation and Analysis\n", - "\n", - "Now we'll add validation for the initial code and generate analysis of any issues found." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Validate the initial code\n", - "config_builder.add_column(\n", - " name=\"code_validation\",\n", - " type=\"code-validation\",\n", - " model_alias=model_alias,\n", - " code_lang=\"python\",\n", - " target_column=\"initial_code\"\n", - ")\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate a detailed error analysis and improvement plan\n", - "config_builder.add_column(\n", - " name=\"code_analysis\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " prompt=\"\"\"\\\n", - "Analyze the following Python code and its validation results:\n", - "\n", - "INSTRUCTION:\n", - "{{ instruction }}\n", - "\n", - "INITIAL CODE:\n", - "{{ initial_code }}\n", - "\n", - "VALIDATION RESULTS:\n", - "{{ code_validation }}\n", - "\n", - "{% if not (code_validation == '[]') %}\n", - "Please provide:\n", - "1. A detailed analysis of each error or warning (categorize by type: convention, warning, error, refactor)\n", - "2. 
Specific recommendations that directly address each issue\n", - "3. A structured plan for implementing fixes while maintaining code functionality\n", - "4. Any PEP 8 style improvements that would improve code quality\n", - "{% else %}\n", - "The code passes all validation checks. Provide potential optimizations for:\n", - "1. Code readability\n", - "2. Performance improvements\n", - "3. Better adherence to Python best practices\n", - "4. Enhanced documentation\n", - "{% endif %}\n", - "\"\"\"\n", - ")\n", - "\n", - "\n", - "config_builder.validate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ”„ Code Evolution\n", - "\n", - "Next, we'll create the improved version of the code based on the analysis and validation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate improved code based on feedback\n", - "config_builder.add_column(\n", - " name=\"improved_code\",\n", - " type=\"llm-code\",\n", - " model_alias=model_alias,\n", - " output_format=\"python\",\n", - " system_prompt=\"You are an expert Python programmer focused on writing production-quality code that adheres to best practices.\",\n", - " prompt=\"\"\"\\\n", - "Rewrite and improve the following Python code based on the analysis provided.\n", - "\n", - "ORIGINAL INSTRUCTION:\n", - "{{instruction}}\n", - "\n", - "INITIAL CODE:\n", - "{{initial_code}}\n", - "\n", - "CODE ANALYSIS:\n", - "{{code_analysis}}\n", - "\n", - "Your task is to create a revised version that:\n", - "1. Addresses all issues identified in the analysis\n", - "2. Follows PEP 8 style guidelines systematically\n", - "3. Eliminates common anti-patterns\n", - "4. Includes comprehensive docstrings for functions, classes, and modules\n", - "5. Uses type hints for function parameters and return values where appropriate\n", - "6. Implements proper error handling with specific exception types\n", - "7. Ensures all imports are properly organized and used\n", - "\n", - "The goal is production-quality code that would pass a professional code review at a {{code_complexity}} level.\n", - "\"\"\"\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2a477acf", - "metadata": {}, - "outputs": [], - "source": [ - "# Validate the improved code\n", - "config_builder.add_column(\n", - " name=\"improved_code_validation\",\n", - " type=\"code-validation\",\n", - " model_alias=model_alias,\n", - " code_lang=\"python\",\n", - " target_column=\"improved_code\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ“Š Evaluation\n", - "\n", - "Finally, we'll add an evaluation that compares the initial and improved code." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices.beta.data_designer.config.params.rubrics import PYTHON_RUBRICS\n", - "\n", - "# Add judge evaluation\n", - "config_builder.add_column(\n", - " name=\"code_judge_result\",\n", - " type=\"llm-judge\",\n", - " model_alias=model_alias,\n", - " prompt=(\n", - " \"You are an expert in Python programming, with specialized knowledge in software engineering, \"\n", - " \"data science, and algorithmic problem-solving. You think about potential flaws and errors \"\n", - " \"in the code. 
You are a tough critic, but a fair one.\\n\\n\"\n", - " \"Take a deep breath and use the Python Code Quality Rubric below to score the **Generated Python Code** \"\n", - " \"based on the INSTRUCTIONS.\\n\\n\"\n", - " \"#### INSTRUCTIONS\\n\"\n", - " \"The Generated Python Code should be a valid response to the Natural Language Prompt below\\n\\n\"\n", - " \"Natural Language Prompt:\\n\"\n", - " \"{{ instruction }}\\n\\n\"\n", - " \"Generated Python Code\\n\"\n", - " \"```python\\n\"\n", - " \"{{ improved_code }}\\n\"\n", - " \"```\\n\"\n", - " ),\n", - " rubrics=PYTHON_RUBRICS\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Generate Preview Dataset\n", - "\n", - "Let's generate a preview to see how our evolved code examples look." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate a preview\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview.display_sample_record()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸš€ Generate Full Dataset\n", - "\n", - "If you're satisfied with the preview, you can generate a larger dataset using a batch workflow." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Submit batch job\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", - "\n", - "job_results.wait_until_done()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3f28961", - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", - "\n", - "dataset.head()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "48d81a41", + "metadata": {}, + "source": [ + "# πŸ‘¨β€πŸ’» NeMo Data Designer: Text-to-Python with Evolution\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples, \\\n", + "with a focus on evolutionary improvements.\n", + "\n", + "- We'll build a system that generates Python code based on natural language instructions, validates it, analyzes issues, and then improves the code based on feedback.\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "5cae1943", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b19fddbb", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", + " CodeLang,\n", + " CodeValidatorParams,\n", + " DataDesignerConfigBuilder,\n", + " InferenceParameters,\n", + " LLMCodeColumnConfig,\n", + " LLMJudgeColumnConfig,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " Score,\n", + " SubcategorySamplerParams,\n", + " ValidationColumnConfig,\n", + " ValidatorType,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "53ead84a", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6cc5324f", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "89029ee8", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f3ecb7c", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/llama-3.3-nemotron-super-49b-v1\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-super-49b-v1\"\n", + "\n", + "model_configs = [\n", + " 
ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " timeout=300,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "496cef5e", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9774f495", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "8473c2ea", + "metadata": {}, + "source": [ + "## 🎲 Adding Sampler Columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a752f960", + "metadata": {}, + "outputs": [], + "source": [ + "# Add industry sector categories\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"industry_sector\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Healthcare\", \"Finance\", \"Technology\"],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add topic as a subcategory of industry_sector\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"topic\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"industry_sector\",\n", + " values={\n", + " \"Healthcare\": [\n", + " \"Electronic Health Records (EHR) Systems\",\n", + " \"Telemedicine Platforms\",\n", + " \"AI-Powered Diagnostic Tools\",\n", + " ],\n", + " \"Finance\": [\n", + " \"Fraud Detection Software\",\n", + " \"Automated Trading Systems\",\n", + " \"Personal Finance Apps\",\n", + " ],\n", + " \"Technology\": [\n", + " \"Cloud Computing Platforms\",\n", + " \"Artificial Intelligence and Machine Learning Platforms\",\n", + " \"DevOps and CI/CD Tools\",\n", + " ],\n", + " },\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add code complexity with subcategory for code concepts\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"code_complexity\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Beginner\", \"Intermediate\", \"Advanced\"],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add code_concept as a subcategory of code_complexity\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"code_concept\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"code_complexity\",\n", + " values={\n", + " \"Beginner\": [\n", + " \"Variables\",\n", + " \"Data Types\",\n", + " \"Functions\",\n", + " \"Loops\",\n", + " \"Classes\",\n", + " ],\n", + " \"Intermediate\": [\n", + " \"List Comprehensions\",\n", + " \"Object-oriented programming\",\n", + " \"Lambda Functions\",\n", + " \"Web frameworks\",\n", + " \"Pandas\",\n", + " ],\n", + " \"Advanced\": [\n", + " \"Multithreading\",\n", + " 
\"Context Managers\",\n", + " \"Generators\",\n", + " ],\n", + " },\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add instruction phrases\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"instruction_phrase\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\n", + " \"Write a function that\",\n", + " \"Create a class that\",\n", + " \"Implement a script\",\n", + " \"Can you create a function\",\n", + " \"Develop a module that\",\n", + " ],\n", + " ),\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "1274b9a3", + "metadata": {}, + "source": [ + "## 🦜 Define Initial Code Generation\n", + "\n", + "First, we'll set up the columns for generating the instruction and initial code implementation using the same approach as in the [text-to-python notebook](./text-to-python-evol.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c07c1e73", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate instruction for the code\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"instruction\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=(\"You are an expert at generating clear and specific programming tasks.\"),\n", + " prompt=(\n", + " \"Generate an instruction to create Python code that solves a specific problem.\\n\"\n", + " \"Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\\n\\n\"\n", + " \"Important Guidelines:\\n\"\n", + " \"* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.\\n\"\n", + " \"* Code Complexity: Tailor the instruction to the {{code_complexity}} level. Utilize relevant {{code_concept}} where appropriate to match the complexity level.\\n\"\n", + " \"* Clarity and Specificity: Make the problem statement clear and unambiguous. 
Provide sufficient context to understand the requirements without being overly verbose.\\n\"\n", + " \"* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\\n\"\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Generate the initial Python code\n", + "config_builder.add_column(\n", + " LLMCodeColumnConfig(\n", + " name=\"initial_code\",\n", + " model_alias=MODEL_ALIAS,\n", + " code_lang=CodeLang.PYTHON,\n", + " system_prompt=(\"You are an expert Python programmer who writes clean, efficient, and well-documented code.\"),\n", + " prompt=(\n", + " \"Write Python code for the following instruction:\\n\"\n", + " \"Instruction: {{instruction}}\\n\\n\"\n", + " \"Important Guidelines:\\n\"\n", + " \"* Code Quality: Your code should be clean, complete, self-contained and accurate.\\n\"\n", + " \"* Code Validity: Please ensure that your python code is executable and does not contain any errors.\\n\"\n", + " \"* Packages: Remember to import any necessary libraries, and to use all libraries you import.\\n\"\n", + " \"* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.\\n\"\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "32a4b0c4", + "metadata": {}, + "source": [ + "## ⚑️ Quality Assessment: Code Validation\n", + "\n", + "- Now we'll add validation for the initial code and generate analysis of any issues found.\n", + "\n", + "- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of \\\n", + "generated code snippets. \n", + "\n", + "- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the \\\n", + "intended programming language environment. 
\n", + "\n", + "- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, \\\n", + "streamlining the process of generating reliable and production-ready data samples.\n", + "\n", + "- NeMo Data Designer supports validation for these languages\n", + "\n", + " - Python (CodeLang.PYTHON)\n", + "\n", + " - SQL dialects:\n", + "\n", + " - ANSI SQL (CodeLang.SQL_ANSI)\n", + "\n", + " - MySQL (CodeLang.SQL_MYSQL)\n", + "\n", + " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", + "\n", + " - SQLite (CodeLang.SQL_SQLITE)\n", + "\n", + " - T-SQL (CodeLang.SQL_TSQL)\n", + "\n", + " - BigQuery (CodeLang.SQL_BIGQUERY)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4ad2dda", + "metadata": {}, + "outputs": [], + "source": [ + "# Validate the initial code\n", + "config_builder.add_column(\n", + " ValidationColumnConfig(\n", + " name=\"code_validation\",\n", + " target_columns=[\"initial_code\"],\n", + " validator_type=ValidatorType.CODE,\n", + " validator_params=CodeValidatorParams(\n", + " code_lang=CodeLang.PYTHON,\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Generate a detailed error analysis and improvement plan\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"code_analysis\",\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=(\n", + " \"Analyze the following Python code and its validation results:\\n\\n\"\n", + " \"INSTRUCTION:\\n\"\n", + " \"{{ instruction }}\\n\\n\"\n", + " \"INITIAL CODE:\\n\"\n", + " \"{{ initial_code }}\\n\\n\"\n", + " \"VALIDATION RESULTS:\\n\"\n", + " \"{{ code_validation }}\\n\\n\"\n", + " \"{% if not (code_validation == '[]') %}\\n\"\n", + " \"Please provide:\\n\"\n", + " \"1. A detailed analysis of each error or warning (categorize by type: convention, warning, error, refactor)\\n\"\n", + " \"2. Specific recommendations that directly address each issue\\n\"\n", + " \"3. A structured plan for implementing fixes while maintaining code functionality\\n\"\n", + " \"4. Any PEP 8 style improvements that would improve code quality\\n\"\n", + " \"{% else %}\\n\"\n", + " \"The code passes all validation checks. Provide potential optimizations for:\\n\"\n", + " \"1. Code readability\\n\"\n", + " \"2. Performance improvements\\n\"\n", + " \"3. Better adherence to Python best practices\\n\"\n", + " \"4. Enhanced documentation\\n\"\n", + " \"{% endif %}\\n\"\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "69399f4b", + "metadata": {}, + "source": [ + "## ⚑️ Code Evolution\n", + "\n", + "Next, we'll create the improved version of the code based on the analysis and validation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ffb224cd", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate improved code based on feedback\n", + "config_builder.add_column(\n", + " LLMCodeColumnConfig(\n", + " name=\"improved_code\",\n", + " model_alias=MODEL_ALIAS,\n", + " code_lang=CodeLang.PYTHON,\n", + " system_prompt=(\"You are an expert Python programmer focused on writing production-quality code \"\n", + " \"that adheres to best practices.\"),\n", + " prompt=(\n", + " \"Rewrite and improve the following Python code based on the analysis provided.\\n\\n\"\n", + " \"ORIGINAL INSTRUCTION:\\n\"\n", + " \"{{instruction}}\\n\\n\"\n", + " \"INITIAL CODE:\\n\"\n", + " \"{{initial_code}}\\n\\n\"\n", + " \"CODE ANALYSIS:\\n\"\n", + " \"{{code_analysis}}\\n\\n\"\n", + " \"Your task is to create a revised version that:\\n\"\n", + " \"1. Addresses all issues identified in the analysis\\n\"\n", + " \"2. Follows PEP 8 style guidelines systematically\\n\"\n", + " \"3. Eliminates common anti-patterns\\n\"\n", + " \"4. Includes comprehensive docstrings for functions, classes, and modules\\n\"\n", + " \"5. Uses type hints for function parameters and return values where appropriate\\n\"\n", + " \"6. Implements proper error handling with specific exception types\\n\"\n", + " \"7. Ensures all imports are properly organized and used\\n\\n\"\n", + " \"The goal is production-quality code that would pass a professional code review at a {{code_complexity}} level.\\n\"\n", + " )\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "f6358805", + "metadata": {}, + "source": [ + "## πŸ” Quality Assessment: LLM-as-a-Judge\n", + "\n", + "When generating our synthetic dataset, we need to determine the quality of the generated data. \\\n", + "We use the LLM-as-a-Judge strategy to do this. \n", + "\n", + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", + "that provides relevant instructions. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "018e2805", + "metadata": {}, + "outputs": [], + "source": [ + "TEXT_TO_PYTHON_JUDGE_TEMPLATE = (\n", + " \"You are an expert in Python programming, with specialized knowledge in software engineering, data science, and algorithmic problem-solving. \"\n", + " \"You think about potential flaws and errors in the code. 
You are a tough critic, but a fair one.\\n\\n\"\n", + " \"Take a deep breath and use the Python Code Quality Rubric below to score the **Generated Python Code** based on the INSTRUCTIONS.\\n\\n\"\n", + " \"#### INSTRUCTIONS\\n\"\n", + " \"The Generated Python Code should be a valid response to the Natural Language Prompt below\\n\\n\"\n", + " \"Natural Language Prompt:\\n\"\n", + " \"{{ instruction }}\\n\\n\"\n", + " \"Generated Python Code\\n\"\n", + " \"{{ improved_code }}\\n\"\n", + ")\n", + "\n", + "python_scoring = [\n", + " Score(\n", + " name=\"Relevance\",\n", + " description=\"Adherence to INSTRUCTIONS and CONTEXT\",\n", + " options={\n", + " \"4\": \"Perfectly meets all specified requirements.\",\n", + " \"3\": \"Meets most requirements with minor deviations.\",\n", + " \"2\": \"Moderate deviation from the instructions.\",\n", + " \"1\": \"Significant deviations from the instructions.\",\n", + " \"0\": \"Does not adhere to the instructions.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Pythonic\",\n", + " description=\"Pythonic Code and Best Practices (Does the code follow Python conventions and best practices?)\",\n", + " options={\n", + " \"4\": \"The code exemplifies Pythonic principles, making excellent use of Python-specific constructs, standard library modules and programming idioms; follows all relevant PEPs.\",\n", + " \"3\": \"The code closely follows Python conventions and adheres to many best practices; good use of Python-specific constructs, standard library modules and programming idioms.\",\n", + " \"2\": \"The code generally follows Python conventions but has room for better alignment with Pythonic practices.\",\n", + " \"1\": \"The code loosely follows Python conventions, with several deviations from best practices.\",\n", + " \"0\": \"The code does not follow Python conventions or best practices, using non-Pythonic approaches.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Readability\",\n", + " description=\"Readability and Maintainability (Is the Python code easy to understand and maintain?)\",\n", + " options={\n", + " \"4\": \"The code is excellently formatted, follows PEP 8 guidelines, is elegantly concise and clear, uses meaningful variable names, ensuring high readability and ease of maintenance; organizes complex logic well. 
Docstrings are given in a Google Docstring format.\",\n", + " \"3\": \"The code is well-formatted in the sense of code-as-documentation, making it relatively easy to understand and maintain; uses descriptive names and organizes logic clearly.\",\n", + " \"2\": \"The code is somewhat readable with basic formatting and some comments, but improvements are needed; needs better use of descriptive names and organization.\",\n", + " \"1\": \"The code has minimal formatting, making it hard to understand; lacks meaningful names and organization.\",\n", + " \"0\": \"The code is unreadable, with no attempt at formatting or description.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Efficiency\",\n", + " description=\"Efficiency and Performance (Is the code optimized for performance?)\",\n", + " options={\n", + " \"4\": \"The solution is highly efficient, using appropriate data structures and algorithms; avoids unnecessary computations and optimizes for both time and space complexity.\",\n", + " \"3\": \"The solution is efficient, with good use of Python's built-in functions and libraries; minor areas for optimization.\",\n", + " \"2\": \"The solution is moderately efficient, but misses some opportunities for optimization; uses some inefficient patterns.\",\n", + " \"1\": \"The solution shows poor efficiency, with notable performance issues; lacks effective optimization techniques.\",\n", + " \"0\": \"The solution is highly inefficient; overlooks fundamental optimization practices, resulting in significant performance issues.\",\n", + " },\n", + " ),\n", + "]\n", + "\n", + "# Add an LLM judge to evaluate code quality\n", + "config_builder.add_column(\n", + " LLMJudgeColumnConfig(\n", + " name=\"code_judge_result\",\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=TEXT_TO_PYTHON_JUDGE_TEMPLATE,\n", + " scores=python_scoring\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "45094bb5", + "metadata": {}, + "source": [ + "## ⚑️ Quality Assessment: Code Validation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deb6af72", + "metadata": {}, + "outputs": [], + "source": [ + "# Validate the improved code\n", + "config_builder.add_column(\n", + " ValidationColumnConfig(\n", + " name=\"improved_code_validation\",\n", + " target_columns=[\"improved_code\"],\n", + " validator_type=ValidatorType.CODE,\n", + " validator_params=CodeValidatorParams(\n", + " code_lang=CodeLang.PYTHON,\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8f5d8567", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied." 
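Once the full generation job in the cells below has produced a dataset, the judge scores can drive an automated filtering pass. This is a sketch under two assumptions that you should verify against a real record first: that each `code_judge_result` cell lands in the DataFrame as a JSON string or dict keyed by score name, and that the values follow the 0-4 rubric defined above.

```python
import json


def judge_scores(raw) -> dict:
    """Parse a code_judge_result cell into {score name: value} (assumed layout)."""
    return raw if isinstance(raw, dict) else json.loads(raw)


# dataset = job_results.load_dataset()  # as shown in the cells below
keep_mask = dataset["code_judge_result"].map(
    lambda raw: int(judge_scores(raw).get("Relevance", 0)) >= 3
    and int(judge_scores(raw).get("Pythonic", 0)) >= 3
)
high_quality = dataset[keep_mask]
print(f"Kept {len(high_quality)}/{len(dataset)} high-scoring records")
```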
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d834d08", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder, num_records=2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "996a0110", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "f1d3c82c", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "810fe1d2", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "b40cdb2e", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4abff70f", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66ce0796", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ee2001e", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4b1f4e2", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-text-to-code-text-to-python-evol\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb index 2ee155b12..d464fd359 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb @@ -2,41 +2,41 @@ "cells": [ { "cell_type": "markdown", + "id": "0d1a6db1", "metadata": {}, "source": [ - "# 🎨 NeMo Data Designer: 
Text-to-Python" - ] - }, - { - "cell_type": "markdown", - "id": "c6c516b4", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", + "# πŸ‘¨β€πŸ’» NeMo Data Designer: Text-to-Python\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples. \n", + "\n", + "- We'll build a system that generates Python code based on natural language instructions, with varying complexity levels and industry focuses.\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. \n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "id": "99494da9", - "metadata": {}, - "source": [ - "This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples. We'll build a system that generates Python code based on natural language instructions, with varying complexity levels and industry focuses." + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" ] }, { "cell_type": "markdown", - "id": "4301911f", + "id": "88df978d", "metadata": {}, "source": [ - "#### πŸ’Ύ Install dependencies\n", + "### πŸ“¦ Import the essentials\n", "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment." 
+ "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" ] }, { @@ -46,96 +46,129 @@ "metadata": {}, "outputs": [], "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", + " CodeLang,\n", + " CodeValidatorParams,\n", " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" + " InferenceParameters,\n", + " LLMCodeColumnConfig,\n", + " LLMJudgeColumnConfig,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " Score,\n", + " SubcategorySamplerParams,\n", + " ValidationColumnConfig,\n", + " ValidatorType,\n", + ")" ] }, { "cell_type": "markdown", - "id": "81da3752", + "id": "a86763c4", "metadata": {}, "source": [ "### βš™οΈ Initialize the NeMo Data Designer Client\n", "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "cc997456", + "id": "568624e8", "metadata": {}, "outputs": [], "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" ] }, { "cell_type": "markdown", - "id": "a0c93601", + "id": "56d473ca", "metadata": {}, "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "### πŸŽ›οΈ Define model configurations\n", "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- You must provide a list of model configs to the builder at initialization.\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "7cb11661", + "id": "339a6002", "metadata": {}, "outputs": [], "source": [ - "# We specify the endpoint of the model during deployment using the 
model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/llama-3.3-nemotron-super-49b-v1\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-super-49b-v1\"\n", + "\n", + "model_configs = [\n", + " ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " timeout=300,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "bdaea4d2", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "63aa5a1e", + "id": "c6845d43", "metadata": {}, "outputs": [], "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " ),\n", - " ]\n", - ")" + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" ] }, { "cell_type": "markdown", - "id": "9d3a2c17", + "id": "3b41d42d", "metadata": {}, "source": [ - "## 🌱 Define Categorical Seed Columns\n", + "## 🎲 Adding Sampler Columns\n", "\n", - "We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant code examples." + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." 
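Before the SDK cells that follow, here is a plain-Python sketch of the idea behind category/subcategory sampling (a conceptual illustration only; the actual sampling is performed by the Data Designer service, not by this code). The key point is that the subcategory draw is conditioned on the sampled category, so the seed columns stay mutually consistent. The values below are a subset of those configured in the next cell.

```python
import random

# Conceptual sketch of category/subcategory sampling: the "topic" is always
# drawn from the list belonging to the sampled "industry_sector".
topics_by_sector = {
    "Healthcare": ["Electronic Health Records (EHR) Systems", "Telemedicine Platforms"],
    "Finance": ["Fraud Detection Software", "Personal Finance Apps"],
    "Technology": ["Cloud Computing Platforms", "DevOps and CI/CD Tools"],
}

industry_sector = random.choice(list(topics_by_sector))
topic = random.choice(topics_by_sector[industry_sector])
print(f"{industry_sector} -> {topic}")
```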
] }, { @@ -147,10 +180,10 @@ "source": [ "# Add industry sector categories\n", "config_builder.add_column(\n", - " C.SamplerColumn(\n", + " SamplerColumnConfig(\n", " name=\"industry_sector\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", " values=[\n", " \"Healthcare\",\n", " \"Finance\",\n", @@ -162,10 +195,10 @@ "\n", "# Add topic as a subcategory of industry_sector\n", "config_builder.add_column(\n", - " C.SamplerColumn(\n", + " SamplerColumnConfig(\n", " name=\"topic\",\n", - " type=P.SamplerType.SUBCATEGORY,\n", - " params=P.SubcategorySamplerParams(\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", " category=\"industry_sector\",\n", " values={\n", " \"Healthcare\": [\n", @@ -190,10 +223,10 @@ "\n", "# Add code complexity with subcategory for code concepts\n", "config_builder.add_column(\n", - " C.SamplerColumn(\n", + " SamplerColumnConfig(\n", " name=\"code_complexity\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", " values=[\n", " \"Beginner\",\n", " \"Intermediate\",\n", @@ -205,10 +238,10 @@ "\n", "# Add code_concept as a subcategory of code_complexity\n", "config_builder.add_column(\n", - " C.SamplerColumn(\n", + " SamplerColumnConfig(\n", " name=\"code_concept\",\n", - " type=P.SamplerType.SUBCATEGORY,\n", - " params=P.SubcategorySamplerParams(\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", " category=\"code_complexity\",\n", " values={\n", " \"Beginner\": [\n", @@ -237,10 +270,10 @@ "\n", "# Add instruction phrases\n", "config_builder.add_column(\n", - " C.SamplerColumn(\n", + " SamplerColumnConfig(\n", " name=\"instruction_phrase\",\n", - " type=P.SamplerType.CATEGORY,\n", - " params=P.CategorySamplerParams(\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", " values=[\n", " \"Write a function that\",\n", " \"Create a class that\",\n", @@ -250,7 +283,7 @@ " ],\n", " ),\n", " ),\n", - ")" + ")\n" ] }, { @@ -258,7 +291,7 @@ "id": "7e0cd0d7", "metadata": {}, "source": [ - "## ✨ Define Generated Data Columns\n", + "## 🦜 Define LLM-Generated Columns\n", "\n", "Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation." 
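The prompts in the cells below reference the sampler columns through `{{ ... }}` placeholders. As a sketch of how this per-record templating works (assuming standard Jinja semantics; the service's exact rendering path is not shown in this notebook), each placeholder is filled from that record's column values:

```python
from jinja2 import Template

# A hypothetical record, with values like those the samplers above produce.
record = {
    "instruction_phrase": "Write a function that",
    "topic": "Telemedicine Platforms",
    "code_complexity": "Beginner",
}

template = Template(
    "{{ instruction_phrase }} addresses a problem in {{ topic }} "
    "at a {{ code_complexity }} level."
)
print(template.render(**record))
```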
] @@ -272,9 +305,9 @@ "source": [ "# Generate instruction for the code\n", "config_builder.add_column(\n", - " C.LLMTextColumn(\n", + " LLMTextColumnConfig(\n", " name=\"instruction\",\n", - " model_alias=model_alias,\n", + " model_alias=MODEL_ALIAS,\n", " system_prompt=(\n", " \"You are an expert at generating clear and specific programming tasks.\"\n", " ),\n", @@ -292,10 +325,10 @@ "\n", "# Generate the Python code\n", "config_builder.add_column(\n", - " C.LLMCodeColumn(\n", + " LLMCodeColumnConfig(\n", " name=\"code_implementation\",\n", - " model_alias=model_alias,\n", - " output_format=P.CodeLang.PYTHON,\n", + " model_alias=MODEL_ALIAS,\n", + " code_lang=CodeLang.PYTHON,\n", " system_prompt=(\n", " \"You are an expert Python programmer who writes clean, efficient, and well-documented code.\"\n", " ),\n", @@ -309,17 +342,21 @@ " \"* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.\\n\"\n", " ),\n", " )\n", - ")" + ")\n" ] }, { "cell_type": "markdown", - "id": "936eda48", + "id": "c16718b8", "metadata": {}, "source": [ - "## πŸ” Add Validation and Evaluation\n", + "## πŸ” Quality Assessment: LLM-as-a-Judge\n", + "\n", + "When generating our synthetic dataset, we need to determine the quality of the generated data. \\\n", + "We use the LLM-as-a-Judge strategy to do this. \n", "\n", - "Let's add post-processing steps to validate the generated code and evaluate the text-to-Python conversion." + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", + "that provides relevant instructions. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "from nemo_microservices.beta.data_designer.config.params.rubrics import TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE, PYTHON_RUBRICS\n", + "TEXT_TO_PYTHON_JUDGE_TEMPLATE = \"\"\"\\\n", + "You are an expert in Python programming, with specialized knowledge in software engineering, data science, and algorithmic problem-solving. \\\n", + "You think about potential flaws and errors in the code. 
You are a tough critic, but a fair one.\n", "\n", + "Take a deep breath and use the Python Code Quality Rubric below to score the **Generated Python Code** based on the INSTRUCTIONS.\n", "\n", + "#### INSTRUCTIONS\n", + "The Generated Python Code should be a valid response to the Natural Language Prompt below\n", + "\n", + "Natural Language Prompt:\n", + "{{ instruction }}\n", + "\n", + "Generated Python Code\n", + "{{ code_implementation }}\n", + "\"\"\"\n", + "\n", + "python_scoring = [\n", + " Score(\n", + " name=\"Relevance\",\n", + " description=\"Adherence to INSTRUCTIONS and CONTEXT\",\n", + " options={\n", + " \"4\": \"Perfectly meets all specified requirements.\",\n", + " \"3\": \"Meets most requirements with minor deviations.\",\n", + " \"2\": \"Moderate deviation from the instructions.\",\n", + " \"1\": \"Significant deviations from the instructions.\",\n", + " \"0\": \"Does not adhere to the instructions.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Pythonic\",\n", + " description=\"Pythonic Code and Best Practices (Does the code follow Python conventions and best practices?)\",\n", + " options={\n", + " \"4\": \"The code exemplifies Pythonic principles, making excellent use of Python-specific constructs, standard library modules and programming idioms; follows all relevant PEPs.\",\n", + " \"3\": \"The code closely follows Python conventions and adheres to many best practices; good use of Python-specific constructs, standard library modules and programming idioms.\",\n", + " \"2\": \"The code generally follows Python conventions but has room for better alignment with Pythonic practices.\",\n", + " \"1\": \"The code loosely follows Python conventions, with several deviations from best practices.\",\n", + " \"0\": \"The code does not follow Python conventions or best practices, using non-Pythonic approaches.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Readability\",\n", + " description=\"Readability and Maintainability (Is the Python code easy to understand and maintain?)\",\n", + " options={\n", + " \"4\": \"The code is excellently formatted, follows PEP 8 guidelines, is elegantly concise and clear, uses meaningful variable names, ensuring high readability and ease of maintenance; organizes complex logic well. 
Docstrings are given in a Google Docstring format.\",\n", + " \"3\": \"The code is well-formatted in the sense of code-as-documentation, making it relatively easy to understand and maintain; uses descriptive names and organizes logic clearly.\",\n", + " \"2\": \"The code is somewhat readable with basic formatting and some comments, but improvements are needed; needs better use of descriptive names and organization.\",\n", + " \"1\": \"The code has minimal formatting, making it hard to understand; lacks meaningful names and organization.\",\n", + " \"0\": \"The code is unreadable, with no attempt at formatting or description.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Efficiency\",\n", + " description=\"Efficiency and Performance (Is the code optimized for performance?)\",\n", + " options={\n", + " \"4\": \"The solution is highly efficient, using appropriate data structures and algorithms; avoids unnecessary computations and optimizes for both time and space complexity.\",\n", + " \"3\": \"The solution is efficient, with good use of Python's built-in functions and libraries; minor areas for optimization.\",\n", + " \"2\": \"The solution is moderately efficient, but misses some opportunities for optimization; uses some inefficient patterns.\",\n", + " \"1\": \"The solution shows poor efficiency, with notable performance issues; lacks effective optimization techniques.\",\n", + " \"0\": \"The solution is highly inefficient; overlooks fundamental optimization practices, resulting in significant performance issues.\",\n", + " },\n", + " ),\n", + "]\n", + "\n", + "# Add an LLM judge to evaluate code quality\n", "config_builder.add_column(\n", - " C.CodeValidationColumn(\n", - " name=\"code_validity_result\",\n", - " model_alias=model_alias,\n", - " code_lang=P.CodeLang.PYTHON,\n", - " target_column=\"code_implementation\",\n", + " LLMJudgeColumnConfig(\n", + " name=\"code_judge_result\",\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=TEXT_TO_PYTHON_JUDGE_TEMPLATE,\n", + " scores=python_scoring\n", " )\n", - ")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "71cb0502", + "metadata": {}, + "source": [ + "## ⚑️ Quality Assessment: Code Validation\n", + "\n", + "- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of \\\n", + "generated code snippets. \n", + "\n", + "- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the \\\n", + "intended programming language environment. 
\n", + "\n", + "- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, \\\n", + "streamlining the process of generating reliable and production-ready data samples.\n", + "\n", + "- NeMo Data Designer supports validation for these languages\n", + "\n", + " - Python (CodeLang.PYTHON)\n", "\n", + " - SQL dialects:\n", + "\n", + " - ANSI SQL (CodeLang.SQL_ANSI)\n", + "\n", + " - MySQL (CodeLang.SQL_MYSQL)\n", + "\n", + " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", + "\n", + " - SQLite (CodeLang.SQL_SQLITE)\n", + "\n", + " - T-SQL (CodeLang.SQL_TSQL)\n", + "\n", + " - BigQuery (CodeLang.SQL_BIGQUERY)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0060c5b", + "metadata": {}, + "outputs": [], + "source": [ "config_builder.add_column(\n", - " C.LLMJudgeColumn(\n", - " name=\"code_judge_result\",\n", - " model_alias=model_alias,\n", - " prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,\n", - " rubrics=PYTHON_RUBRICS,\n", + " ValidationColumnConfig(\n", + " name=\"code_validity_result\",\n", + " validator_type=ValidatorType.CODE,\n", + " target_columns=[\"code_implementation\"], # Column containing the code\n", + " validator_params=CodeValidatorParams(\n", + " code_lang=CodeLang.PYTHON,\n", + " ),\n", + " batch_size=100\n", " )\n", ")" ] }, { "cell_type": "markdown", - "id": "8732267c", + "id": "0f360a12", "metadata": {}, "source": [ - "## πŸ‘€ Generate Preview Dataset\n", + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "Let's generate a preview to see some data." + "4. Re-run the preview until satisfied." ] }, { "cell_type": "code", "execution_count": null, - "id": "3c222bec", + "id": "acc6a256", "metadata": {}, "outputs": [], "source": [ - "# Generate a preview\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder, num_records=2)" ] }, { "cell_type": "code", "execution_count": null, - "id": "892f6fc5", + "id": "5ef849f7", "metadata": {}, "outputs": [], "source": [ + "# More previews\n", "preview.display_sample_record()" ] }, { "cell_type": "markdown", - "id": "3ada3096", + "id": "db98ab34", "metadata": {}, "source": [ - "## πŸš€ Generate Full Dataset\n", + "### πŸ“Š Analyze the generated data\n", "\n", - "If you're satisfied with the preview, you can generate a larger dataset using a batch workflow." 
+ "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "2a064f08", + "id": "87cc6a6f", "metadata": {}, "outputs": [], "source": [ - "# Submit batch job\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "e940ba22", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9c98ef7", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", "\n", + "# This will block until the job is complete.\n", "job_results.wait_until_done()" ] }, { "cell_type": "code", "execution_count": null, - "id": "88bfc2db", + "id": "56494cbc", "metadata": {}, "outputs": [], "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", "\n", "dataset.head()" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a261f5ec", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "852671eb", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-text-to-code-text-to-python\",\n", + ");" + ] } ], "metadata": { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb index f27b3695d..837a28ae9 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb @@ -1,448 +1,694 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🎨 NeMo Data Designer: Text-to-SQL" - ] - }, - { - "cell_type": "markdown", - "id": "0ca478dd", - "metadata": {}, - "source": [ - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", - "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - ">\n", - "> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) on how to connect to it. 
\n", - ">\n", - "> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for SQL code examples. We'll build a system that generates SQL code based on natural language instructions, with varying complexity levels and industry focuses." - ] - }, - { - "cell_type": "markdown", - "id": "ce4571b8", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "553c2bd5", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.data_designer import (\n", - " DataDesignerConfigBuilder,\n", - " DataDesignerClient,\n", - ")\n", - "from nemo_microservices.beta.data_designer.config import columns as C\n", - "from nemo_microservices.beta.data_designer.config import params as P" - ] - }, - { - "cell_type": "markdown", - "id": "0a701fa4", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to a local deployment of data designer. You can deploy your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b4b709e0", - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))" - ] - }, - { - "cell_type": "markdown", - "id": "f6b415cd", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43c74e89", - "metadata": {}, - "outputs": [], - "source": [ - "# We specify the endpoint of the model during deployment using the model_provider_registry.\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - "model_alias = \"nemotron-nano-9b-v2\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "65b5a79e", - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs=[\n", - " P.ModelConfig(\n", - " alias=model_alias,\n", - " provider=\"nvidiabuild\",\n", - " model=model_id,\n", - " inference_parameters=P.InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.6,\n", - " top_p=0.95,\n", - " ),\n", - " is_reasoner=True\n", - " 
),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🌱 Define Categorical Seed Columns\n", - "\n", - "We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant SQL examples." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Add industry sector categories\n", - "config_builder.add_column(\n", - " name=\"industry_sector\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Healthcare\", \"Finance\", \"Technology\"],\n", - " \"description\": \"The industry sector for the SQL example\"\n", - " }\n", - ")\n", - "\n", - "# Add topic as a subcategory of industry_sector\n", - "config_builder.add_column(\n", - " name=\"topic\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"industry_sector\",\n", - " \"values\": {\n", - " \"Healthcare\": [\n", - " \"Electronic Health Records (EHR) Systems\",\n", - " \"Telemedicine Platforms\",\n", - " \"AI-Powered Diagnostic Tools\"\n", - " ],\n", - " \"Finance\": [\n", - " \"Fraud Detection Software\",\n", - " \"Automated Trading Systems\",\n", - " \"Personal Finance Apps\"\n", - " ],\n", - " \"Technology\": [\n", - " \"Cloud Computing Platforms\",\n", - " \"Artificial Intelligence and Machine Learning Platforms\",\n", - " \"DevOps and CI/CD Tools\"\n", - " ]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Add SQL complexity with subcategory for SQL concepts\n", - "config_builder.add_column(\n", - " name=\"sql_complexity\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\"Beginner\", \"Intermediate\", \"Advanced\"],\n", - " \"description\": \"The complexity level of the SQL code\"\n", - " }\n", - ")\n", - "\n", - "# Add SQL concept as a subcategory of sql_complexity\n", - "config_builder.add_column(\n", - " name=\"sql_concept\",\n", - " type=\"subcategory\",\n", - " params={\n", - " \"category\": \"sql_complexity\",\n", - " \"values\": {\n", - " \"Beginner\": [\n", - " \"Basic SELECT Statements\",\n", - " \"WHERE Clauses\",\n", - " \"Basic JOINs\",\n", - " \"INSERT, UPDATE, DELETE\"\n", - " ],\n", - " \"Intermediate\": [\n", - " \"Aggregation Functions\",\n", - " \"Multiple JOINs\",\n", - " \"Subqueries\",\n", - " \"Views\"\n", - " ],\n", - " \"Advanced\": [\n", - " \"Window Functions\",\n", - " \"Common Table Expressions (CTEs)\",\n", - " \"Stored Procedures\",\n", - " \"Query Optimization\"\n", - " ]\n", - " }\n", - " }\n", - ")\n", - "\n", - "# Add SQL task types\n", - "config_builder.add_column(\n", - " name=\"sql_task_type\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\n", - " \"Data Retrieval\",\n", - " \"Data Manipulation\",\n", - " \"Analytics and Reporting\",\n", - " \"Data Transformation\"\n", - " ],\n", - " \"description\": \"The type of SQL task being performed\"\n", - " }\n", - ")\n", - "\n", - "# Add instruction phrases\n", - "config_builder.add_column(\n", - " name=\"instruction_phrase\",\n", - " type=\"category\",\n", - " params={\n", - " \"values\": [\n", - " \"Write an SQL query that\",\n", - " \"Create an SQL statement to\",\n", - " \"Develop an SQL query to\",\n", - " \"Can you write SQL that\",\n", - " \"Formulate an SQL query that\"\n", - " ],\n", - " \"description\": \"Starting phrase for the SQL instruction\"\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ✨ Define Generated Data Columns\n", - "\n", - "Now we'll set up the 
columns that will be generated by the LLMs, including the instruction, database context, and SQL implementation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate instruction for the SQL query\n", - "config_builder.add_column(\n", - " name=\"sql_prompt\",\n", - " type=\"llm-text\",\n", - " model_alias=model_alias,\n", - " system_prompt=\"You are an expert at generating clear and specific SQL tasks.\",\n", - " prompt=\"\"\"\\\n", - "Generate an instruction to create SQL code that solves a specific problem.\n", - "Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\n", - "\n", - "Important Guidelines:\n", - "* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.\n", - "* SQL Complexity: Tailor the instruction to the {{sql_complexity}} level. Utilize relevant {{sql_concept}} where appropriate to match the complexity level.\n", - "* Task Type: The instruction should involve a {{sql_task_type}} task.\n", - "* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.\n", - "* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\n", - "\"\"\"\n", - ")\n", - "\n", - "# Generate database context\n", - "config_builder.add_column(\n", - " name=\"sql_context\",\n", - " type=\"llm-code\",\n", - " model_alias=model_alias,\n", - " output_format=P.CodeLang.SQL_ANSI, # Specify CodeLang.SQL_ANSI to ensure the code is structured as valid SQL\n", - " system_prompt=\"You are an expert SQL database designer who creates clean, efficient, and well-structured database schemas.\",\n", - " prompt=\"\"\"\\\n", - "Generate the SQL for creating database tables that would be relevant for the following instruction:\n", - "Instruction: {{sql_prompt}}\n", - "\n", - "Important Guidelines:\n", - "* Relevance: Ensure all tables are directly related to the {{industry_sector}} sector and {{topic}} topic.\n", - "* Completeness: Include all essential columns with appropriate data types, primary/foreign keys, and necessary constraints.\n", - "* Realism: Use realistic table structures typical for the specified industry.\n", - "* Executable SQL: Provide complete CREATE TABLE statements that can be run without modification.\n", - "* Consistency: Use consistent naming conventions (e.g., snake_case for table and column names).\n", - "* Sample Data: Include INSERT statements with sample data that makes sense for the tables (at least 5-10 rows per table).\n", - "\"\"\"\n", - ")\n", - "\n", - "# Generate the SQL code\n", - "config_builder.add_column(\n", - " name=\"sql\",\n", - " type=\"llm-code\",\n", - " model_alias=model_alias,\n", - " output_format=P.CodeLang.SQL_ANSI, # Specify CodeLang.SQL_ANSI to ensure the code is structured as valid SQL\n", - " system_prompt=\"You are an expert SQL programmer who writes clean, efficient, and well-structured queries.\",\n", - " prompt=\"\"\"\\\n", - "Write SQL code for the following instruction based on the provided database context:\n", - "Instruction: {{sql_prompt}}\n", - "\n", - "Database Context:\n", - "{{sql_context}}\n", - "\n", - "Important Guidelines:\n", - "* Code Quality: Your SQL should be clean, complete, self-contained and accurate.\n", - "* Code Validity: Please ensure that your SQL code is executable and does not contain any errors.\n", - "* Context: Base your query on the 
provided database context. Only reference tables and columns that exist in the context.\n", - "* Complexity & Concepts: The SQL should be written at a {{sql_complexity}} level, making use of concepts such as {{sql_concept}}.\n", - "* Task Type: Ensure your solution implements the appropriate {{sql_task_type}} operation.\n", - "* Comments: Include brief comments explaining the key parts of your query.\n", - "\"\"\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ” Add Validation and Evaluation\n", - "\n", - "Let's add post-processing steps to validate the generated code and evaluate the text-to-SQL conversion." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices.beta.data_designer.config.params.rubrics import TEXT_TO_SQL_LLM_JUDGE_PROMPT_TEMPLATE, SQL_RUBRICS\n", - "\n", - "# Add validators and evaluators\n", - "config_builder.add_column(name=\"sql_validity_result\",\n", - " model_alias=model_alias,\n", - " type=\"code-validation\",\n", - " code_lang=P.CodeLang.SQL_ANSI,\n", - " target_column=\"sql\")\n", - "\n", - "\n", - "config_builder.add_column(name=\"sql_judge_result\",\n", - " type=\"llm-judge\",\n", - " model_alias=model_alias,\n", - " prompt=TEXT_TO_SQL_LLM_JUDGE_PROMPT_TEMPLATE,\n", - " rubrics=SQL_RUBRICS)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Generate Preview Dataset\n", - "\n", - "Let's generate a preview to see some data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate a preview\n", - "preview = data_designer_client.preview(config_builder, verbose_logging=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview.display_sample_record()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸš€ Generate Full Dataset\n", - "\n", - "If you're satisfied with the preview, you can generate a larger dataset using a batch workflow." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Submit batch job\n", - "job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)\n", - "\n", - "job_results.wait_until_done()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "17c2aec6", - "metadata": {}, - "outputs": [], - "source": [ - "dataset = job_results.load_dataset()\n", - "print(\"\\nGenerated dataset shape:\", dataset.shape)\n", - "\n", - "dataset.head()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "d576b85a", + "metadata": {}, + "source": [ + "# πŸ‘¨β€πŸ’» NeMo Data Designer: Text-to-SQL\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + "\n", + "#### πŸ“š What you'll learn\n", + "\n", + "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for SQL code examples. 
\n", + "\n", + "- We'll build a system that generates SQL code based on natural language instructions, with varying complexity levels and industry focuses.\n", + "\n", + "
\n", + "\n", + "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", + ">\n", + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", + ">\n", + "> - You may need to restart your notebook's kernel after setting up the environment.\n", + "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", + ">\n", + "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "e085b967", + "metadata": {}, + "source": [ + "### πŸ“¦ Import the essentials\n", + "\n", + "- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.\n", + "\n", + "- The `essentials` module provides quick access to the most commonly used objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "553c2bd5", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices.data_designer.essentials import (\n", + " CategorySamplerParams,\n", + " CodeLang,\n", + " CodeValidatorParams,\n", + " DataDesignerConfigBuilder,\n", + " InferenceParameters,\n", + " LLMCodeColumnConfig,\n", + " LLMJudgeColumnConfig,\n", + " LLMTextColumnConfig,\n", + " ModelConfig,\n", + " NeMoDataDesignerClient,\n", + " SamplerColumnConfig,\n", + " SamplerType,\n", + " Score,\n", + " SubcategorySamplerParams,\n", + " ValidationColumnConfig,\n", + " ValidatorType,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "320003a5", + "metadata": {}, + "source": [ + "### βš™οΈ Initialize the NeMo Data Designer Client\n", + "\n", + "- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e99d520", + "metadata": {}, + "outputs": [], + "source": [ + "NEMO_MICROSERVICES_BASE_URL = \"http://localhost:8080\"\n", + "\n", + "data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "d7c577ae", + "metadata": {}, + "source": [ + "### πŸŽ›οΈ Define model configurations\n", + "\n", + "- Each `ModelConfig` defines a model that can be used during the generation process.\n", + "\n", + "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", + "\n", + "- The \"model provider\" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).\n", + "\n", + "- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1950b463", + "metadata": {}, + "outputs": [], + "source": [ + "# This name is set in the microservice deployment configuration.\n", + "MODEL_PROVIDER = \"nvidiabuild\"\n", + "\n", + "# The model ID is from build.nvidia.com.\n", + "MODEL_ID = \"nvidia/llama-3.3-nemotron-super-49b-v1\"\n", + "\n", + "# We choose this alias to be descriptive for our use case.\n", + "MODEL_ALIAS = \"nemotron-super-49b-v1\"\n", + "\n", + "model_configs = [\n", + " 
ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=InferenceParameters(\n", + " temperature=0.6,\n", + " top_p=0.95,\n", + " max_tokens=1024,\n", + " timeout=300,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "6e330702", + "metadata": {}, + "source": [ + "### πŸ—οΈ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- The list of model configs is provided to the builder at initialization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0be026ce", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "id": "3ac72f62", + "metadata": {}, + "source": [ + "## 🎲 Adding Sampler Columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5fc5ddc", + "metadata": {}, + "outputs": [], + "source": [ + "# Add industry sector categories\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"industry_sector\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Healthcare\", \"Finance\", \"Technology\"],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add topic as a subcategory of industry_sector\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"topic\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"industry_sector\",\n", + " values={\n", + " \"Healthcare\": [\n", + " \"Electronic Health Records (EHR) Systems\",\n", + " \"Telemedicine Platforms\",\n", + " \"AI-Powered Diagnostic Tools\"\n", + " ],\n", + " \"Finance\": [\n", + " \"Fraud Detection Software\",\n", + " \"Automated Trading Systems\",\n", + " \"Personal Finance Apps\"\n", + " ],\n", + " \"Technology\": [\n", + " \"Cloud Computing Platforms\",\n", + " \"Artificial Intelligence and Machine Learning Platforms\",\n", + " \"DevOps and CI/CD Tools\"\n", + " ]\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Add SQL complexity with subcategory for SQL concepts\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"sql_complexity\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\"Beginner\", \"Intermediate\", \"Advanced\"],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add SQL concept as a subcategory of sql_complexity\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"sql_concept\",\n", + " sampler_type=SamplerType.SUBCATEGORY,\n", + " params=SubcategorySamplerParams(\n", + " category=\"sql_complexity\",\n", + " values={\n", + " \"Beginner\": [\n", + " \"Basic SELECT Statements\",\n", + " \"WHERE Clauses\",\n", + " \"Basic JOINs\",\n", + " \"INSERT, UPDATE, DELETE\"\n", + " ],\n", + " \"Intermediate\": [\n", + " \"Aggregation Functions\",\n", + " \"Multiple JOINs\",\n", + " \"Subqueries\",\n", + " \"Views\"\n", + " ],\n", + " \"Advanced\": [\n", + " \"Window Functions\",\n", + " \"Common Table Expressions (CTEs)\",\n", + 
" \"Stored Procedures\",\n", + " \"Query Optimization\"\n", + " ]\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "# Add SQL task types\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"sql_task_type\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\n", + " \"Data Retrieval\",\n", + " \"Data Manipulation\",\n", + " \"Analytics and Reporting\",\n", + " \"Data Transformation\"\n", + " ],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Add instruction phrases\n", + "config_builder.add_column(\n", + " SamplerColumnConfig(\n", + " name=\"instruction_phrase\",\n", + " sampler_type=SamplerType.CATEGORY,\n", + " params=CategorySamplerParams(\n", + " values=[\n", + " \"Write an SQL query that\",\n", + " \"Create an SQL statement to\",\n", + " \"Develop an SQL query to\",\n", + " \"Can you write SQL that\",\n", + " \"Formulate an SQL query that\"\n", + " ],\n", + " ),\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b53082fe", + "metadata": {}, + "source": [ + "## 🦜 Define Generated Data Columns\n", + "\n", + "Now we'll set up the columns that will be generated by the LLMs, including the instruction, database context, and SQL implementation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15a6f440", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate instruction for the SQL query\n", + "SQL_PROMPT_TEXT = (\n", + " \"Generate an instruction to create SQL code that solves a specific problem.\\n\"\n", + " \"Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\\n\\n\"\n", + " \"Important Guidelines:\\n\"\n", + " \"* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.\\n\"\n", + " \"* SQL Complexity: Tailor the instruction to the {{sql_complexity}} level. Utilize relevant {{sql_concept}} \"\n", + " \"where appropriate to match the complexity level.\\n\"\n", + " \"* Task Type: The instruction should involve a {{sql_task_type}} task.\\n\"\n", + " \"* Clarity and Specificity: Make the problem statement clear and unambiguous. 
Provide sufficient context to \"\n", + " \"understand the requirements without being overly verbose.\\n\"\n", + " \"* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\\n\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"sql_prompt\",\n", + " model_alias=MODEL_ALIAS,\n", + " system_prompt=\"You are an expert at generating clear and specific SQL tasks.\",\n", + " prompt=SQL_PROMPT_TEXT\n", + " )\n", + ")\n", + "\n", + "# Generate database context\n", + "SQL_CONTEXT_TEXT = (\n", + " \"Generate the SQL for creating database tables that would be relevant for the following instruction:\\n\"\n", + " \"Instruction: {{sql_prompt}}\\n\\n\"\n", + " \"Important Guidelines:\\n\"\n", + " \"* Relevance: Ensure all tables are directly related to the {{industry_sector}} sector and {{topic}} topic.\\n\"\n", + " \"* Completeness: Include all essential columns with appropriate data types, primary/foreign keys, and necessary constraints.\\n\"\n", + " \"* Realism: Use realistic table structures typical for the specified industry.\\n\"\n", + " \"* Executable SQL: Provide complete CREATE TABLE statements that can be run without modification.\\n\"\n", + " \"* Consistency: Use consistent naming conventions (e.g., snake_case for table and column names).\\n\"\n", + " \"* Sample Data: Include INSERT statements with sample data that makes sense for the tables (at least 5-10 rows per table).\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " LLMCodeColumnConfig(\n", + " name=\"sql_context\",\n", + " model_alias=MODEL_ALIAS,\n", + " code_lang=CodeLang.SQL_ANSI,\n", + " system_prompt=(\"You are an expert SQL database designer who creates clean, efficient, and \"\n", + " \"well-structured database schemas.\"),\n", + " prompt=SQL_CONTEXT_TEXT\n", + " )\n", + ")\n", + "\n", + "# Generate the SQL code\n", + "SQL_CODE_TEXT = (\n", + " \"Write SQL code for the following instruction based on the provided database context:\\n\"\n", + " \"Instruction: {{sql_prompt}}\\n\\n\"\n", + " \"Database Context:\\n\"\n", + " \"{{sql_context}}\\n\\n\"\n", + " \"Important Guidelines:\\n\"\n", + " \"* Code Quality: Your SQL should be clean, complete, self-contained and accurate.\\n\"\n", + " \"* Code Validity: Please ensure that your SQL code is executable and does not contain any errors.\\n\"\n", + " \"* Context: Base your query on the provided database context. Only reference tables and columns that \"\n", + " \"exist in the context.\\n\"\n", + " \"* Complexity & Concepts: The SQL should be written at a {{sql_complexity}} level, making use of \"\n", + " \"concepts such as {{sql_concept}}.\\n\"\n", + " \"* Task Type: Ensure your solution implements the appropriate {{sql_task_type}} operation.\\n\"\n", + " \"* Comments: Include brief comments explaining the key parts of your query.\\n\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " LLMCodeColumnConfig(\n", + " name=\"sql\",\n", + " model_alias=MODEL_ALIAS,\n", + " code_lang=CodeLang.SQL_ANSI,\n", + " system_prompt=\"You are an expert SQL programmer who writes clean, efficient, and well-structured queries.\",\n", + " prompt=SQL_CODE_TEXT\n", + " )\n", + ")\n" ] }, { "cell_type": "markdown", + "id": "8ebc70bf", + "metadata": {}, + "source": [ + "## πŸ” Quality Assessment: LLM-as-a-Judge\n", + "\n", + "When generating our synthetic dataset, we need to determine the quality of the generated data. \\\n", + "We use the LLM-as-a-Judge strategy to do this. 
\n", + "\n", + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", + "that provides relavant instructions. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3bf1759", + "metadata": {}, + "outputs": [], + "source": [ + "TEXT_TO_SQL_JUDGE_TEMPLATE = \"\"\"\\\n", + "You are an expert in SQL with deep knowledge of relational modeling, query semantics,\n", + "and performance tuning across common dialects (e.g., PostgreSQL, MySQL, SQLite, SQL Server).\n", + "You think critically about correctness, readability, and efficiency.\n", + "\n", + "Use the SQL Query Quality Rubric below to score the **Generated SQL Query** based on the INSTRUCTIONS.\n", + "\n", + "#### INSTRUCTIONS\n", + "The Generated SQL Query should be a valid response to the Natural Language Prompt below\n", + "\n", + "Natural Language Prompt:\n", + "{{ sql_prompt }}\n", + "\n", + "Database Context:\n", + "{{ sql_context }}\n", + "\n", + "Generated SQL Query\n", + "{{ sql }}\n", + "\"\"\"\n", + "\n", + "sql_scoring = [\n", + " Score(\n", + " name=\"Relevance\",\n", + " description=\"Adherence to INSTRUCTIONS and CONTEXT\",\n", + " options={\n", + " \"4\": \"Perfectly meets all specified requirements.\",\n", + " \"3\": \"Meets most requirements with minor deviations.\",\n", + " \"2\": \"Moderate deviation from the instructions.\",\n", + " \"1\": \"Significant deviations from the instructions.\",\n", + " \"0\": \"Does not adhere to the instructions.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"SQL Correctness\",\n", + " description=\"Syntax and semantic correctness; returns the intended result\",\n", + " options={\n", + " \"4\": \"Valid SQL with correct joins, filters, grouping/aggregation, and NULL handling; produces the intended result set under the stated/implicit dialect.\",\n", + " \"3\": \"Generally correct with minor issues (e.g., edge-case NULLs, minor grouping detail) but still likely yields the intended result.\",\n", + " \"2\": \"Partially correct; noticeable semantic mistakes (joins, grouping, filters) that may change results or fail in edge cases.\",\n", + " \"1\": \"Largely incorrect; major semantic or syntactic errors likely causing failure or wrong results.\",\n", + " \"0\": \"Invalid SQL or unrelated to the task; will not run or cannot produce a meaningful result.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Readability\",\n", + " description=\"Formatting, clarity, and maintainability\",\n", + " options={\n", + " \"4\": \"Cleanly formatted (keywords/clauses consistently styled), clear structure (CTEs/subqueries where helpful), meaningful table/column aliases, and concise.\",\n", + " \"3\": \"Generally readable with consistent formatting and understandable aliases; could be organized slightly better.\",\n", + " \"2\": \"Somewhat readable but inconsistent formatting or confusing aliasing; structure is harder to follow.\",\n", + " \"1\": \"Poorly formatted and hard to read; unclear structure and aliasing.\",\n", + " \"0\": \"Unreadable or chaotic; no meaningful structure or styling.\",\n", + " },\n", + " ),\n", + " Score(\n", + " name=\"Efficiency\",\n", + " description=\"Query performance best practices\",\n", + " options={\n", + " \"4\": \"Uses sargable predicates, appropriate joins, selective filters early, avoids SELECT *, unnecessary DISTINCT, and wasteful subqueries; likely to use indexes effectively.\",\n", + " \"3\": \"Mostly efficient; minor opportunities for improvement (e.g., simplifying expressions, 
reducing data early).\",\n", + " \"2\": \"Moderate inefficiencies (e.g., non-sargable filters, unnecessary nested subqueries, broad SELECT *).\",\n", + " \"1\": \"Notably inefficient patterns likely causing large scans or poor plans.\",\n", + " \"0\": \"Highly inefficient; ignores basic best practices and likely to perform very poorly.\",\n", + " },\n", + " ),\n", + "]\n", + "\n", + "# Add an LLM judge to evaluate code quality\n", + "config_builder.add_column(\n", + " LLMJudgeColumnConfig(\n", + " name=\"code_judge_result\",\n", + " model_alias=MODEL_ALIAS,\n", + " prompt=TEXT_TO_SQL_JUDGE_TEMPLATE,\n", + " scores=sql_scoring\n", + " )\n", + ")" ] }, { "cell_type": "markdown", + "id": "74581e0f", + "metadata": {}, + "source": [ + "## ⚑️ Quality Assessment: Code Validation\n", + "\n", + "- Now we'll add validation for the generated code and produce an analysis of any issues found.\n", + "\n", + "- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of \\\n", + "generated code snippets. \n", + "\n", + "- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the \\\n", + "intended programming language environment. \n", + "\n", + "- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, \\\n", + "streamlining the process of generating reliable and production-ready data samples.\n", + "\n", + "- NeMo Data Designer supports validation for these languages\n", + "\n", + " - Python (CodeLang.PYTHON)\n", + "\n", + " - SQL dialects:\n", + "\n", + " - ANSI SQL (CodeLang.SQL_ANSI)\n", + "\n", + " - MySQL (CodeLang.SQL_MYSQL)\n", + "\n", + " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", + "\n", + " - SQLite (CodeLang.SQL_SQLITE)\n", + "\n", + " - T-SQL (CodeLang.SQL_TSQL)\n", + "\n", + " - BigQuery (CodeLang.SQL_BIGQUERY)" ] }, { "cell_type": "code", "execution_count": null, "id": "917f73fc", "metadata": {}, "outputs": [], "source": [ + "config_builder.add_column(\n", + " ValidationColumnConfig(\n", + " name=\"code_validity_result\",\n", + " validator_type=ValidatorType.CODE,\n", + " target_columns=[\"sql\"],\n", + " validator_params=CodeValidatorParams(\n", + " code_lang=CodeLang.SQL_ANSI,\n", + " ),\n", + " batch_size=100\n", + " )\n", + ")" ] }, { "cell_type": "markdown", + "id": "eb6143ad", + "metadata": {}, + "source": [ + "### πŸ” Iteration is key – preview the dataset!\n", + "\n", + "1. Use the `preview` method to generate a sample of records quickly.\n", + "\n", + "2. Inspect the results for quality and format issues.\n", + "\n", + "3. Adjust column configurations, prompts, or parameters as needed.\n", + "\n", + "4. Re-run the preview until satisfied."
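In practice, the iteration loop can be as small as the sketch below, which uses only the preview calls shown in this notebook (whether repeated `display_sample_record()` calls surface different records is an assumption; if not, simply re-run the preview cell between inspections):

```python
# Eyeball a handful of generated records before committing to a full job.
preview = data_designer_client.preview(config_builder, num_records=3)
for _ in range(3):
    preview.display_sample_record()
```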
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ca2cba7", + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few records\n", + "preview = data_designer_client.preview(config_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bd5bf00", + "metadata": {}, + "outputs": [], + "source": [ + "# More previews\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "770dcebb", + "metadata": {}, + "source": [ + "### πŸ“Š Analyze the generated data\n", + "\n", + "- Data Designer automatically generates a basic statistical analysis of the generated data.\n", + "\n", + "- This analysis is available via the `analysis` property of generation result objects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7c38751", + "metadata": {}, + "outputs": [], + "source": [ + "# Print the analysis as a table.\n", + "preview.analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "c443e961", + "metadata": {}, + "source": [ + "### πŸ†™ Scale up!\n", + "\n", + "- Happy with your preview data?\n", + "\n", + "- Use the `create` method to submit larger Data Designer generation jobs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1aaf4104", + "metadata": {}, + "outputs": [], + "source": [ + "job_results = data_designer_client.create(config_builder, num_records=20)\n", + "\n", + "# This will block until the job is complete.\n", + "job_results.wait_until_done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f670392", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the generated dataset as a pandas DataFrame.\n", + "dataset = job_results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85251bd3", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the analysis results into memory.\n", + "analysis = job_results.load_analysis()\n", + "\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91dddbd6", + "metadata": {}, + "outputs": [], + "source": [ + "TUTORIAL_OUTPUT_PATH = \"data-designer-tutorial-output\"\n", + "\n", + "# Download the job artifacts and save them to disk.\n", + "job_results.download_artifacts(\n", + " output_path=TUTORIAL_OUTPUT_PATH,\n", + " artifacts_folder_name=\"artifacts-community-contributions-text-to-code-text-to-sql\",\n", + ");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "sdg_venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb index 12f10cf2c..baf23a59e 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb @@ -16,10 +16,9 @@ "\n", "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> - If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies.\n", + "> - If you haven't already, follow the 
instructions in the [README](../README.md) to install the necessary dependencies.\n", ">\n", "> - You may need to restart your notebook's kernel after setting up the environment.\n", - ">\n", "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" @@ -486,7 +485,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### ⏭️ Next Steps\n", + "## ⏭️ Next Steps\n", "\n", "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n", "\n", diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb index a137ecf65..14e0f8bbd 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb @@ -18,10 +18,9 @@ "\n", "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> - If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies.\n", + "> - If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies.\n", ">\n", "> - You may need to restart your notebook's kernel after setting up the environment.\n", - ">\n", "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" @@ -370,7 +369,6 @@ " ),\n", " output_format=Product,\n", " model_alias=MODEL_ALIAS,\n", - " system_prompt=SYSTEM_PROMPT,\n", " )\n", ")\n", "\n", @@ -393,7 +391,6 @@ " ),\n", " output_format=ProductReview,\n", " model_alias=MODEL_ALIAS,\n", - " system_prompt=SYSTEM_PROMPT,\n", " )\n", ")\n", "\n", @@ -531,7 +528,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### ⏭️ Next Steps\n", + "## ⏭️ Next Steps\n", "\n", "Check out the following notebook to learn more about:\n", "\n", diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb index 235169c67..4b8b1f5e7 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb @@ -18,22 +18,11 @@ "\n", "
\n", "\n", - "> 🌱 **Why use a seed dataset?**\n", - ">\n", - "> - Seed datasets let you steer the generation process by providing context that is specific to your use case.\n", - ">\n", - "> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.\n", - ">\n", - "> - During generation, prompt templates can reference any of the seed dataset fields.\n", - "\n", - "
\n", - "\n", "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", ">\n", - "> - If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies.\n", + "> - If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies.\n", ">\n", "> - You may need to restart your notebook's kernel after setting up the environment.\n", - ">\n", "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", ">\n", "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n" @@ -165,6 +154,18 @@ "\n", "- We already have the dataset downloaded in the [data](../data) directory of this repository.\n", "\n", + "
\n", + "\n", + "> 🌱 **Why use a seed dataset?**\n", + ">\n", + "> - Seed datasets let you steer the generation process by providing context that is specific to your use case.\n", + ">\n", + "> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.\n", + ">\n", + "> - During generation, prompt templates can reference any of the seed dataset fields.\n", + "\n", + "
\n", + "\n", "> πŸ’‘ **About datastores**\n", ">\n", "> - You can use seed datasets from _either_ the Hugging Face Hub or a locally deployed datastore.\n", @@ -225,7 +226,7 @@ "\n", "- Generally, we recommend using concrete objects, but this is a convenient shorthand.\n", "\n", - "- **Note**: Prompt templates in the column configs can reference fields from the seed dataset:\n", + "- **Note**: The prompt template can reference fields from our seed dataset:\n", " - `{{ diagnosis }}` - the medical diagnosis from the seed data\n", " - `{{ patient_summary }}` - the symptom description from the seed data\n" ] @@ -461,7 +462,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### ⏭️ Next Steps\n", + "## ⏭️ Next Steps\n", "\n", "Use Data Designer to generate synthetic data for your specific use case!\n" ] diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/README.md b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/README.md index 8ed89412e..c3547f3b9 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/README.md +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/README.md @@ -14,6 +14,6 @@ | Notebook | Description | |---------------------------------------------------|----------------------------------------------------------------------------------| -| [1-the-basics.ipynb](./1-the-basics.ipynb) | Learn the basics of Data Designer by generating a simple product review dataset | -| [2-structured-outputs-and-jinja-expressions.ipynb](./2-structured-outputs-and-jinja-expressions.ipynb) | Explore advanced data generation using structured outputs and Jinja expressions | -| [3-seeding-with-a-dataset.ipynb](./3-seeding-with-a-dataset.ipynb) | Discover how to seed synthetic data generation with an external dataset | +| [1-the-basics.ipynb](./self-hosted-tutorials/getting-started/1-the-basics.ipynb) | Learn the basics of Data Designer by generating a simple product review dataset | +| [2-structured-outputs-and-jinja-expressions.ipynb](./self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb) | Explore advanced data generation using structured outputs and Jinja expressions | +| [3-seeding-with-a-dataset.ipynb](./self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb) | Discover how to seed synthetic data generation with an external dataset |