From 69890f745f75582158a2d500569bf592dc213775 Mon Sep 17 00:00:00 2001 From: Johnny Greco Date: Mon, 27 Oct 2025 11:21:05 -0400 Subject: [PATCH] updates --- .../getting-started/1-the-basics.ipynb | 24 +- ...ctured-outputs-and-jinja-expressions.ipynb | 23 +- .../3-seeding-with-a-dataset.ipynb | 20 +- .../4-custom-model-configs.ipynb | 450 ------------------ .../forms/w2-dataset.ipynb | 51 +- .../healthcare-datasets/clinical-trials.ipynb | 354 ++++++++------ .../insurance-claims.ipynb | 167 ++++--- ...otes-with-realistic-personal-details.ipynb | 52 +- .../multi-turn-conversation.ipynb | 89 ++-- .../visual-question-answering-using-vlm.ipynb | 222 ++++++--- .../person-sampler-tutorial.ipynb | 141 +++--- .../product-question-answer-generator.ipynb | 49 +- ...generate-rag-generation-eval-dataset.ipynb | 44 +- .../reasoning/reasoning-traces.ipynb | 132 +++-- .../text-to-code/text-to-python-evol.ipynb | 68 +-- .../text-to-code/text-to-python.ipynb | 42 +- .../text-to-code/text-to-sql.ipynb | 82 ++-- .../getting-started/1-the-basics.ipynb | 2 - ...ctured-outputs-and-jinja-expressions.ipynb | 2 - .../3-seeding-with-a-dataset.ipynb | 4 - 20 files changed, 856 insertions(+), 1162 deletions(-) delete mode 100644 nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb index dd8a983cb..828b88750 100644 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb +++ b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/1-the-basics.ipynb @@ -6,11 +6,7 @@ "source": [ "# 🎨 NeMo Data Designer 101: The Basics\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - "\n", - "
\n", - "\n", - "In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset." + "In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset.\n" ] }, { @@ -24,7 +20,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -41,7 +37,7 @@ " SubcategorySamplerParams,\n", " UniformSamplerParams,\n", " ModelConfig,\n", - " InferenceParameters\n", + " InferenceParameters,\n", ")\n" ] }, @@ -55,9 +51,9 @@ "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", "- If you have an instance of data designer running locally, you can connect to it as follows\n", "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" + " ```python\n", + " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", + " ```\n" ] }, { @@ -83,7 +79,7 @@ "source": [ "data_designer_client = NeMoDataDesignerClient(\n", " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", + " default_headers={\"Authorization\": f\"Bearer {api_key}\"}, # auto-generated API KEY\n", ")\n" ] }, @@ -106,10 +102,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**Note**: \n", + "**Note**:\n", "The NeMo Data Designer Managed service has models available for you to use as well. 
You can use these models by referencing the appropriate model_alias for them.\n", "\n", - "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases." + "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases.\n" ] }, { @@ -138,7 +134,7 @@ " max_tokens=1024,\n", " temperature=0.6,\n", " top_p=0.95,\n", - " )\n", + " ),\n", " ),\n", " ]\n", ")\n" diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb index a0683a2ce..ebcc4b27e 100644 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb +++ b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb @@ -6,15 +6,13 @@ "source": [ "# 🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", "\n", "
\n", "\n", "In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs and Jinja expressions.\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." + "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n" ] }, { @@ -57,9 +55,9 @@ "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", "- If you have an instance of data designer running locally, you can connect to it as follows\n", "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" + " ```python\n", + " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", + " ```\n" ] }, { @@ -85,7 +83,7 @@ "source": [ "data_designer_client = NeMoDataDesignerClient(\n", " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", + " default_headers={\"Authorization\": f\"Bearer {api_key}\"}, # auto-generated API KEY\n", ")\n" ] }, @@ -108,10 +106,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**Note**: \n", + "**Note**:\n", "The NeMo Data Designer Managed service has models available for you to use as well. You can use these models by referencing the appropriate model_alias for them.\n", "\n", - "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases." 
+ "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases.\n" ] }, { @@ -282,11 +280,11 @@ " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n", - " weights=[1, 2, 2, 1]\n", + " weights=[1, 2, 2, 1],\n", " ),\n", " conditional_params={\n", " \"target_age_range == '18-25'\": CategorySamplerParams(values=[\"rambling\"]),\n", - " }\n", + " },\n", " )\n", ")\n", "\n", @@ -402,8 +400,7 @@ "\n", "- [Seeding synthetic data generation with an external dataset](./3-seeding-with-a-dataset.ipynb)\n", "\n", - "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n", - "\n" + "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n" ] } ], diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb index 7b4d9d232..f06b856af 100644 --- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/3-seeding-with-a-dataset.ipynb @@ -6,15 +6,13 @@ "source": [ "# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n", - ">\n", "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", "\n", "
\n", "\n", "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." + "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n" ] }, { @@ -51,9 +49,9 @@ "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", "- If you have an instance of data designer running locally, you can connect to it as follows\n", "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" + " ```python\n", + " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", + " ```\n" ] }, { @@ -79,7 +77,7 @@ "source": [ "data_designer_client = NeMoDataDesignerClient(\n", " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", + " default_headers={\"Authorization\": f\"Bearer {api_key}\"}, # auto-generated API KEY\n", ")\n" ] }, @@ -102,10 +100,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**Note**: \n", + "**Note**:\n", "The NeMo Data Designer Managed service has models available for you to use as well. You can use these models by referencing the appropriate model_alias for them.\n", "\n", - "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases." 
+    "Please visit https://build.nvidia.com/nemo/data-designer to see the full list of models and their model aliases.\n"
   ]
  },
  {
@@ -138,7 +136,7 @@
     "\n",
     "- In this dataset, the `input_text` represents the `patient_summary` and the `output_text` represents the `diagnosis`\n",
     "\n",
-    "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. \n"
+    "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidate these into a single file.\n"
   ]
  },
  {
@@ -155,7 +153,7 @@
     "config_builder.with_seed_dataset(\n",
     "    dataset_reference=SeedDatasetReference(\n",
     "        dataset=\"gretelai/symptom_to_diagnosis/train.jsonl\",\n",
-    "        datastore_settings={\"endpoint\": \"https://huggingface.co\"}\n",
+    "        datastore_settings={\"endpoint\": \"https://huggingface.co\"},\n",
     "    ),\n",
     "    sampling_strategy=\"shuffle\",\n",
     ")\n"
diff --git a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb b/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb
deleted file mode 100644
index 16bad37e5..000000000
--- a/nemo/NeMo-Data-Designer/managed-service-tutorials/getting-started/4-custom-model-configs.ipynb
+++ /dev/null
@@ -1,450 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# 🎨 NeMo Data Designer 101: Using Custom Model Configurations\n",
-    "\n",
-    "> ⚠️ **Warning**: NeMo Data Designer is current in Early Release and is not recommended for production use.\n",
-    ">\n",
-    "> **Note**: In order to run this notebook, you must have the NeMo Data Designer microservice deployed locally via docker compose. 
See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", - "\n", - "
\n", - "\n", - "In this notebook, we will see how to create and use custom model configurations in Data Designer.\n", - "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### πŸ’Ύ Install dependencies\n", - "\n", - "**IMPORTANT** πŸ‘‰ If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from getpass import getpass\n", - "\n", - "from nemo_microservices.data_designer.essentials import (\n", - " CategorySamplerParams,\n", - " DataDesignerConfigBuilder,\n", - " InferenceParameters,\n", - " LLMTextColumnConfig,\n", - " ModelConfig,\n", - " NeMoDataDesignerClient,\n", - " PersonSamplerParams,\n", - " SamplerColumnConfig,\n", - " SamplerType,\n", - " SubcategorySamplerParams,\n", - " UniformDistribution,\n", - " UniformDistributionParams,\n", - " UniformSamplerParams\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### βš™οΈ Initialize the NeMo Data Designer Client\n", - "\n", - "- The data designer client is responsible for submitting generation requests to the Data Designer microservice.\n", - "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). 
Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n", - "- If you have an instance of data designer running locally, you can connect to it as follows\n", - "\n", - " ```python\n", - " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", - " ```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if using the managed service of data designer, provide the api key here\n", - "api_key = getpass(\"Enter data designer API key: \")\n", - "\n", - "if len(api_key) > 0:\n", - " print(\"βœ… API key received.\")\n", - "else:\n", - " print(\"❌ No API key provided. Please enter your model provider API key.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_designer_client = NeMoDataDesignerClient(\n", - " base_url=\"https://ai.api.nvidia.com/v1/nemo/dd\",\n", - " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # auto-generated API KEY\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### πŸ—οΈ Initialize the Data Designer Config Builder\n", - "\n", - "- The Data Designer config defines the dataset schema and generation process.\n", - "\n", - "- The config builder provides an intuitive interface for building this configuration.\n", - "\n", - "- You must provide a list of model configs to the builder at initialization.\n", - "\n", - "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# build.nvidia.com model endpoint\n", - "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", - 
"model_alias_static_temp = \"nemotron-nano-v2_static_temp\"\n", - "model_alias_variable_temp = \"nemotron-nano-v2_variable_temp\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## βš™οΈ Custom Model Configurations\n", - "\n", - "- In the previous notebooks, we've seen how we can reference a model using the model alias and pass static inference hyperparameters \n", - "\n", - "- In this notebook, we will see how we can sample values from a distribution to set as our temperature value. \n", - "This will result in greater diversity in our generated data as a different temperature value will be used each time the LLM is called" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder = DataDesignerConfigBuilder(\n", - " model_configs = [\n", - " ModelConfig(\n", - " alias=model_alias_static_temp,\n", - " model=model_id,\n", - " inference_parameters=InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=0.0,\n", - " top_p=0.95,\n", - " timeout=120\n", - " ),\n", - " ),\n", - " ModelConfig(\n", - " alias=model_alias_variable_temp,\n", - " model=model_id,\n", - " inference_parameters=InferenceParameters(\n", - " max_tokens=1024,\n", - " temperature=UniformDistribution(\n", - " params=UniformDistributionParams(\n", - " low=0.5,\n", - " high=0.9\n", - " )\n", - " ),\n", - " top_p=0.95,\n", - " timeout=120\n", - " ),\n", - " ),\n", - " ]\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ§‘β€πŸŽ¨ Generating our Data\n", - "\n", - "- We follow a similar procedure to generate our product review dataset as we did in the the [basics tutorial](./1-the-basics.ipynb)\n", - "\n", - "- The one difference is that we generate multiple samples of the LLM generated columns to demonstrate the difference in generation outputs due to different temperature values\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - 
"metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " SamplerColumnConfig(\n", - " name=\"product_category\",\n", - " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(\n", - " values=[\n", - " \"Electronics\",\n", - " \"Clothing\",\n", - " \"Home & Kitchen\",\n", - " \"Books\",\n", - " \"Home Office\",\n", - " ],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " SamplerColumnConfig(\n", - " name=\"product_subcategory\",\n", - " sampler_type=SamplerType.SUBCATEGORY,\n", - " params=SubcategorySamplerParams(\n", - " category=\"product_category\",\n", - " values={\n", - " \"Electronics\": [\n", - " \"Smartphones\",\n", - " \"Laptops\",\n", - " \"Headphones\",\n", - " \"Cameras\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Clothing\": [\n", - " \"Men's Clothing\",\n", - " \"Women's Clothing\",\n", - " \"Winter Coats\",\n", - " \"Activewear\",\n", - " \"Accessories\",\n", - " ],\n", - " \"Home & Kitchen\": [\n", - " \"Appliances\",\n", - " \"Cookware\",\n", - " \"Furniture\",\n", - " \"Decor\",\n", - " \"Organization\",\n", - " ],\n", - " \"Books\": [\n", - " \"Fiction\",\n", - " \"Non-Fiction\",\n", - " \"Self-Help\",\n", - " \"Textbooks\",\n", - " \"Classics\",\n", - " ],\n", - " \"Home Office\": [\n", - " \"Desks\",\n", - " \"Chairs\",\n", - " \"Storage\",\n", - " \"Office Supplies\",\n", - " \"Lighting\",\n", - " ],\n", - " },\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " SamplerColumnConfig(\n", - " name=\"target_age_range\",\n", - " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(\n", - " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n", - " ),\n", - " )\n", - ")\n", - "\n", - "# Optionally validate that the columns are configured correctly.\n", - "config_builder.validate()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, let's add samplers to generate data related to the 
customer and their review.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# This column will sample synthetic person data based on statistics from the US Census.\n", - "config_builder.add_column(\n", - " SamplerColumnConfig(\n", - " name=\"customer\",\n", - " sampler_type=SamplerType.PERSON,\n", - " params=PersonSamplerParams(age_range=[18, 70]),\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " SamplerColumnConfig(\n", - " name=\"number_of_stars\",\n", - " sampler_type=SamplerType.UNIFORM,\n", - " params=UniformSamplerParams(low=1, high=5),\n", - " convert_to=\"int\",\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " SamplerColumnConfig(\n", - " name=\"review_style\",\n", - " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(\n", - " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n", - " weights=[1, 2, 2, 1],\n", - " ),\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🦜 LLM-generated columns\n", - "\n", - "- We generate three sets of the LLM-generated columns to demonstrate the difference in output based on different temperature values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config_builder.add_column(\n", - " LLMTextColumnConfig(\n", - " name=\"product_name\",\n", - " prompt=(\n", - " \"Come up with a creative product name for a product in the '{{ product_category }}' category, focusing \"\n", - " \"on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n", - " \"{{ target_age_range }} years old. Respond with only the product name, no other text.\"\n", - " ),\n", - " # This is optional, but it can be useful for controlling the behavior of the LLM. 
Do not include instructions\n", - " # related to output formatting in the system prompt, as Data Designer handles this based on the column type.\n", - " system_prompt=(\n", - " \"You are a helpful assistant that generates product names. You respond with only the product name, \"\n", - " \"no other text. You do NOT add quotes around the product name. \"\n", - " ),\n", - " model_alias=model_alias_static_temp,\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " LLMTextColumnConfig(\n", - " name=\"customer_review_base\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'. \"\n", - " ),\n", - " model_alias=model_alias_static_temp,\n", - " )\n", - ")\n", - "\n", - "\n", - "config_builder.add_column(\n", - " LLMTextColumnConfig(\n", - " name=\"customer_review_set_2\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'. \"\n", - " ),\n", - " model_alias=model_alias_variable_temp,\n", - " )\n", - ")\n", - "\n", - "config_builder.add_column(\n", - " LLMTextColumnConfig(\n", - " name=\"customer_review_set_3\",\n", - " prompt=(\n", - " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", - " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. 
\"\n", - " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", - " \"The style of the review should be '{{ review_style }}'. \"\n", - " ),\n", - " model_alias=model_alias_variable_temp,\n", - " )\n", - ")\n", - "\n", - "config_builder.validate()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ‘€ Preview the dataset\n", - "\n", - "- Use the `preview` method to generate 10 records for inspection.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preview = data_designer_client.preview(config_builder, num_records=3)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run this cell multiple times to cycle through the 10 preview records.\n", - "preview.display_sample_record()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The preview dataset is available as a pandas DataFrame.\n", - "preview.dataset" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "sdg_venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.11" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb index b3199b39a..b611a417e 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb @@ -7,8 
+7,6 @@ "source": [ "# 🧾 NeMo Data Designer: W-2 Dataset Generator\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic\\\n", @@ -17,8 +15,7 @@ "- We will use generate numerical fields using [statistics published by the IRS](https://www.irs.gov/pub/irs-pdf/p5385.pdf) for the year 2021:\n", "\n", "- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics\\\n", - " for generated persons reflect real-world census data.\n", - "\n", + " for generated persons reflect real-world census data.\n", "\n", "
\n", "\n", @@ -65,7 +62,7 @@ " SamplerColumnConfig,\n", " SamplerType,\n", " SubcategorySamplerParams,\n", - " UniformSamplerParams\n", + " UniformSamplerParams,\n", ")" ] }, @@ -169,14 +166,14 @@ "id": "bbcb3538", "metadata": {}, "source": [ - "## 🎲 Setting Up Taxpayer and Employer Sampling\n", + "## 🎲 Setting Up Taxpayer and Employer Sampling\n", "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", "\n", "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", - " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker\n" ] }, { @@ -191,10 +188,7 @@ " SamplerColumnConfig(\n", " name=\"taxpayer\",\n", " sampler_type=SamplerType.PERSON,\n", - " params=PersonSamplerParams(\n", - " locale=\"en_US\",\n", - " age_range=[18, 75]\n", - " ),\n", + " params=PersonSamplerParams(locale=\"en_US\", age_range=[18, 75]),\n", " )\n", ")\n", "\n", @@ -218,6 +212,7 @@ "## ⚑️ Defining the Fields\n", "\n", "We will focus on the following:\n", + "\n", "- Box 1 (Wages, tips, and other compensation)\n", "- Box 2 (Federal income tax withheld)\n", "- Box 3 (Social security wages)\n", @@ -230,12 +225,12 @@ "- Box e (Employee's fist name, initial, and last name)\n", "- Box f (Employee's address and zip code)\n", "\n", - "
\n", + "
\n", "\n", "### Numerical fields\n", "\n", "Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). \\\n", - "We'll use the W-2 statistics from the IRS linked above to generate realistic samples." + "We'll use the W-2 statistics from the IRS linked above to generate realistic samples.\n" ] }, { @@ -256,9 +251,7 @@ " name=\"box_1_wages_tips_other_compensation\",\n", " sampler_type=SamplerType.BERNOULLI_MIXTURE,\n", " params=BernoulliMixtureSamplerParams(\n", - " p=0.994,\n", - " dist_name=\"expon\",\n", - " dist_params={\"scale\": 35891.49}\n", + " p=0.994, dist_name=\"expon\", dist_params={\"scale\": 35891.49}\n", " ),\n", " convert_to=\"int\",\n", " )\n", @@ -400,10 +393,8 @@ " name=\"box_7_social_security_tips\",\n", " sampler_type=SamplerType.BERNOULLI_MIXTURE,\n", " params=BernoulliMixtureSamplerParams(\n", - " p=0.0454,\n", - " dist_name=\"expon\",\n", - " dist_params={\"scale\": 4428.91}\n", - " )\n", + " p=0.0454, dist_name=\"expon\", dist_params={\"scale\": 4428.91}\n", + " ),\n", " )\n", ")" ] @@ -416,7 +407,7 @@ "### 🦜 Non-numerical Fields\n", "\n", "The remaining fields contain information about the employee (taxpayer) and the employer. \\\n", - "We'll use the person sampler in combination with an LLM to generate values here." + "We'll use the person sampler in combination with an LLM to generate values here.\n" ] }, { @@ -445,17 +436,21 @@ " LLMTextColumnConfig(\n", " name=\"employer_business\",\n", " model_alias=MODEL_ALIAS,\n", - " system_prompt=(\"You are assisting a user generate synthetic W-2 forms.\"\n", - " \"You must generate a realistic industry category for the employer\"\n", - " \"eg: software, health insurance, shoe store, restaurant, plumbing /no_think\"),\n", - " prompt=(\"Generate the industry category for the employer. 
Ensure it is consistent with the employer location\"\n",
-    "                 \"City: {{ employer.city }}\\nState: {{ employer.state }}\"),\n",
+    "        system_prompt=(\n",
+    "            \"You are helping a user generate synthetic W-2 forms. \"\n",
+    "            \"You must generate a realistic industry category for the employer, \"\n",
+    "            \"e.g., software, health insurance, shoe store, restaurant, plumbing /no_think\"\n",
+    "        ),\n",
+    "        prompt=(\n",
+    "            \"Generate the industry category for the employer. Ensure it is consistent with the employer location.\\n\"\n",
+    "            \"City: {{ employer.city }}\\nState: {{ employer.state }}\"\n",
+    "        ),\n",
     "    )\n",
     ")\n",
     "\n",
     "# Next, we'll generate an actual name based on the type of business.\n",
     "config_builder.add_column(\n",
-    "    LLMTextColumnConfig(\n",
+    "    LLMTextColumnConfig(\n",
     "        name=\"employer_name\",\n",
     "        system_prompt=SYSTEM_PROMPT,\n",
     "        model_alias=MODEL_ALIAS,\n",
@@ -507,7 +502,7 @@
     "\n",
     "3. Adjust column configurations, prompts, or parameters as needed.\n",
     "\n",
-    "4. Re-run the preview until satisfied."
+    "4. 
Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb index ff7f97d60..f68e22c04 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/clinical-trials.ipynb @@ -7,15 +7,12 @@ "source": [ "# πŸ₯ NeMo Data Designer: Clinical Trials Dataset Generator\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "This notebook demonstrates how to use structured samplers, person/PII generators, and LLMs to create a realistic\\\n", "synthetic clinical trials datasetβ€”including trial metadata, participant demographics, investigator details,\\\n", "clinical notes, and adverse event reportsβ€”for evaluating data protection and anonymization techniques.\n", "\n", - "\n", "
\n", "\n", "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", @@ -61,7 +58,7 @@ " SamplerType,\n", " SubcategorySamplerParams,\n", " UUIDSamplerParams,\n", - " UniformSamplerParams\n", + " UniformSamplerParams,\n", ")" ] }, @@ -172,7 +169,7 @@ "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", "\n", "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", - "If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker\n" ] }, { @@ -225,10 +222,11 @@ "### Creating Trial Information\n", "\n", "Next, we'll create the basic trial information:\n", + "\n", "- Study ID (unique identifier)\n", "- Trial phase and therapeutic area\n", "- Study design details\n", - "- Start and end dates for the trial" + "- Start and end dates for the trial\n" ] }, { @@ -243,7 +241,7 @@ " SamplerColumnConfig(\n", " name=\"study_id\",\n", " sampler_type=SamplerType.UUID,\n", - " params=UUIDSamplerParams(prefix=\"CT-\", short_form=True, uppercase=True)\n", + " params=UUIDSamplerParams(prefix=\"CT-\", short_form=True, uppercase=True),\n", " )\n", ")\n", "\n", @@ -254,8 +252,8 @@ " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", " values=[\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"],\n", - " weights=[0.2, 0.3, 0.4, 0.1]\n", - " )\n", + " weights=[0.2, 0.3, 0.4, 0.1],\n", + " ),\n", " )\n", ")\n", "\n", @@ -265,9 +263,15 @@ " name=\"therapeutic_area\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values=[\"Oncology\", \"Cardiology\", \"Neurology\", \"Immunology\", \"Infectious Disease\"],\n", - " weights=[0.3, 0.2, 0.2, 0.15, 0.15]\n", - " )\n", + " values=[\n", + " \"Oncology\",\n", + " 
\"Cardiology\",\n", + " \"Neurology\",\n", + " \"Immunology\",\n", + " \"Infectious Disease\",\n", + " ],\n", + " weights=[0.3, 0.2, 0.2, 0.15, 0.15],\n", + " ),\n", " )\n", ")\n", "\n", @@ -279,10 +283,30 @@ " params=SubcategorySamplerParams(\n", " category=\"trial_phase\",\n", " values={\n", - " \"Phase I\": [\"Single Arm\", \"Dose Escalation\", \"First-in-Human\", \"Safety Assessment\"],\n", - " \"Phase II\": [\"Randomized\", \"Double-Blind\", \"Proof of Concept\", \"Open-Label Extension\"],\n", - " \"Phase III\": [\"Randomized Controlled\", \"Double-Blind Placebo-Controlled\", \"Multi-Center\", \"Pivotal\"],\n", - " \"Phase IV\": [\"Post-Marketing Surveillance\", \"Real-World Evidence\", \"Long-Term Safety\", \"Expanded Access\"]\n", + " \"Phase I\": [\n", + " \"Single Arm\",\n", + " \"Dose Escalation\",\n", + " \"First-in-Human\",\n", + " \"Safety Assessment\",\n", + " ],\n", + " \"Phase II\": [\n", + " \"Randomized\",\n", + " \"Double-Blind\",\n", + " \"Proof of Concept\",\n", + " \"Open-Label Extension\",\n", + " ],\n", + " \"Phase III\": [\n", + " \"Randomized Controlled\",\n", + " \"Double-Blind Placebo-Controlled\",\n", + " \"Multi-Center\",\n", + " \"Pivotal\",\n", + " ],\n", + " \"Phase IV\": [\n", + " \"Post-Marketing Surveillance\",\n", + " \"Real-World Evidence\",\n", + " \"Long-Term Safety\",\n", + " \"Expanded Access\",\n", + " ],\n", " },\n", " ),\n", " )\n", @@ -295,7 +319,7 @@ " column_type=\"sampler\",\n", " sampler_type=\"datetime\",\n", " params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", ")\n", "\n", "config_builder.add_column(\n", @@ -303,7 +327,7 @@ " column_type=\"sampler\",\n", " sampler_type=\"datetime\",\n", " params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", ")" ] }, @@ -315,10 +339,11 @@ "### Participant Information\n", "\n", "Now we'll create fields for participant 
demographics and enrollment details:\n", + "\n", "- Participant ID and basic information\n", "- Demographics (age, gender, etc.)\n", "- Enrollment status and dates\n", - "- Randomization assignment" + "- Randomization assignment\n" ] }, { @@ -333,32 +358,32 @@ " SamplerColumnConfig(\n", " name=\"participant_id\",\n", " sampler_type=SamplerType.UUID,\n", - " params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True}\n", + " params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True},\n", " )\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"participant_first_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{participant.first_name}}\"\n", + " expr=\"{{participant.first_name}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"participant_last_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{participant.last_name}}\"\n", + " expr=\"{{participant.last_name}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"participant_birth_date\",\n", " column_type=\"expression\",\n", - " expr=\"{{participant.birth_date}}\"\n", + " expr=\"{{participant.birth_date}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"participant_email\",\n", " column_type=\"expression\",\n", - " expr=\"{{participant.email_address}}\"\n", + " expr=\"{{participant.email_address}}\",\n", ")\n", "\n", "# Enrollment information\n", @@ -370,9 +395,9 @@ " \"dt_min\": 0,\n", " \"dt_max\": 60,\n", " \"reference_column_name\": \"trial_start_date\",\n", - " \"unit\": \"D\"\n", + " \"unit\": \"D\",\n", " },\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", ")\n", "\n", "config_builder.add_column(\n", @@ -380,9 +405,9 @@ " name=\"participant_status\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values = [\"Active\", \"Completed\", \"Withdrawn\", \"Lost to Follow-up\"],\n", - " weights = [0.6, 0.2, 0.15, 0.05]\n", - " )\n", + " values=[\"Active\", \"Completed\", 
\"Withdrawn\", \"Lost to Follow-up\"],\n", + " weights=[0.6, 0.2, 0.15, 0.05],\n", + " ),\n", " )\n", ")\n", "\n", @@ -391,9 +416,8 @@ " name=\"treatment_arm\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values = [\"Treatment\", \"Placebo\", \"Standard of Care\"],\n", - " weights = [0.5, 0.3, 0.2]\n", - " )\n", + " values=[\"Treatment\", \"Placebo\", \"Standard of Care\"], weights=[0.5, 0.3, 0.2]\n", + " ),\n", " )\n", ")\n" ] @@ -406,9 +430,10 @@ "### Investigator and Staff Information\n", "\n", "Here we'll add information about the trial staff:\n", + "\n", "- Investigator information (principal investigator)\n", "- Study coordinator details\n", - "- Site information" + "- Site information\n" ] }, { @@ -422,20 +447,20 @@ "config_builder.add_column(\n", " name=\"investigator_first_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{investigator.first_name}}\"\n", + " expr=\"{{investigator.first_name}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"investigator_last_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{investigator.last_name}}\"\n", + " expr=\"{{investigator.last_name}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " SamplerColumnConfig(\n", " name=\"investigator_id\",\n", " sampler_type=SamplerType.UUID,\n", - " params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True}\n", + " params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True},\n", " )\n", ")\n", "\n", @@ -443,19 +468,19 @@ "config_builder.add_column(\n", " name=\"coordinator_first_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{coordinator.first_name}}\"\n", + " expr=\"{{coordinator.first_name}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"coordinator_last_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{coordinator.last_name}}\"\n", + " expr=\"{{coordinator.last_name}}\",\n", ")\n", "\n", "config_builder.add_column(\n", " 
name=\"coordinator_email\",\n", " column_type=\"expression\",\n", - " expr=\"{{coordinator.email_address}}\"\n", + " expr=\"{{coordinator.email_address}}\",\n", ")\n", "\n", "# Site information\n", @@ -464,8 +489,8 @@ " name=\"site_id\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values = [\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n", - " )\n", + " values=[\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n", + " ),\n", " )\n", ")\n", "\n", @@ -474,8 +499,8 @@ " name=\"site_location\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values = [\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n", - " )\n", + " values=[\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n", + " ),\n", " )\n", ")\n", "\n", @@ -505,9 +530,10 @@ "### Clinical Measurements and Outcomes\n", "\n", "These columns will track the key clinical data collected during the trial:\n", + "\n", "- Vital signs and lab values\n", - "- Efficacy measurements \n", - "- Dosing information" + "- Efficacy measurements\n", + "- Dosing information\n" ] }, { @@ -549,8 +575,7 @@ " name=\"dose_level\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values=[\"Low\", \"Medium\", \"High\", \"Placebo\"],\n", - " weights=[0.3, 0.3, 0.2, 0.2]\n", + " values=[\"Low\", \"Medium\", \"High\", \"Placebo\"], weights=[0.3, 0.3, 0.2, 0.2]\n", " ),\n", " )\n", ")\n", @@ -561,7 +586,7 @@ " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", " values=[\"Once daily\", \"Twice daily\", \"Weekly\", \"Biweekly\"],\n", - " weights=[0.4, 0.3, 0.2, 0.1]\n", + " weights=[0.4, 0.3, 0.2, 0.1],\n", " ),\n", " )\n", ")\n", @@ -584,9 +609,10 @@ "### Adverse Events Tracking\n", "\n", "Here we'll capture adverse events that occur during the clinical trial:\n", + "\n", "- Adverse event presence and type\n", "- 
Severity and relatedness to treatment\n", - "- Dates and resolution" + "- Dates and resolution\n" ] }, { @@ -601,9 +627,7 @@ " SamplerColumnConfig(\n", " name=\"has_adverse_event\",\n", " sampler_type=SamplerType.BERNOULLI,\n", - " params=BernoulliSamplerParams(\n", - " p=0.3\n", - " ),\n", + " params=BernoulliSamplerParams(p=0.3),\n", " )\n", ")\n", "\n", @@ -612,10 +636,20 @@ " name=\"adverse_event_type\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values=[\"Headache\", \"Nausea\", \"Fatigue\", \"Rash\", \"Dizziness\", \"Pain at injection site\", \"Other\"],\n", - " weights=[0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]\n", + " values=[\n", + " \"Headache\",\n", + " \"Nausea\",\n", + " \"Fatigue\",\n", + " \"Rash\",\n", + " \"Dizziness\",\n", + " \"Pain at injection site\",\n", + " \"Other\",\n", + " ],\n", + " weights=[0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1],\n", " ),\n", - " conditional_params={\"has_adverse_event == 0\": CategorySamplerParams(values=[\"None\"])},\n", + " conditional_params={\n", + " \"has_adverse_event == 0\": CategorySamplerParams(values=[\"None\"])\n", + " },\n", " )\n", ")\n", "\n", @@ -623,8 +657,12 @@ " SamplerColumnConfig(\n", " name=\"adverse_event_severity\",\n", " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(values=[\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]),\n", - " conditional_params={\"has_adverse_event == 0\": CategorySamplerParams(values=[\"NA\"])},\n", + " params=CategorySamplerParams(\n", + " values=[\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]\n", + " ),\n", + " conditional_params={\n", + " \"has_adverse_event == 0\": CategorySamplerParams(values=[\"NA\"])\n", + " },\n", " )\n", ")\n", "\n", @@ -633,10 +671,17 @@ " name=\"adverse_event_relatedness\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values=[\"Unrelated\", \"Possibly related\", \"Probably related\", \"Definitely related\"],\n", - " 
weights=[0.2, 0.4, 0.3, 0.1]\n", + " values=[\n", + " \"Unrelated\",\n", + " \"Possibly related\",\n", + " \"Probably related\",\n", + " \"Definitely related\",\n", + " ],\n", + " weights=[0.2, 0.4, 0.3, 0.1],\n", " ),\n", - " conditional_params={\"has_adverse_event == 0\": CategorySamplerParams(values=[\"NA\"])},\n", + " conditional_params={\n", + " \"has_adverse_event == 0\": CategorySamplerParams(values=[\"NA\"])\n", + " },\n", " )\n", ")\n", "\n", @@ -645,7 +690,11 @@ " name=\"adverse_event_resolved\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(values=[\"NA\"]),\n", - " conditional_params={\"has_adverse_event == 1\": CategorySamplerParams(values=[\"Yes\", \"No\"], weights=[0.8, 0.2])},\n", + " conditional_params={\n", + " \"has_adverse_event == 1\": CategorySamplerParams(\n", + " values=[\"Yes\", \"No\"], weights=[0.8, 0.2]\n", + " )\n", + " },\n", " )\n", ")\n" ] @@ -661,10 +710,10 @@ "We'll use style seed categories to ensure diversity in the writing styles:\n", "\n", "1. Medical observations and notes\n", - "2. Adverse event descriptions \n", + "2. Adverse event descriptions\n", "3. Protocol deviation explanations\n", "\n", - "**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, it is recommended you consolidated these into a single file. " + "**Note**: At this time, we only support using a single file as the seed. 
If you have multiple files you would like to use as seeds, it is recommended you consolidate these into a single file.\n"
    ]
   },
   {
@@ -680,8 +729,12 @@
     "        name=\"documentation_style\",\n",
     "        sampler_type=SamplerType.CATEGORY,\n",
     "        params=CategorySamplerParams(\n",
-    "            values=[\"Formal and Technical\", \"Concise and Direct\", \"Detailed and Descriptive\"],\n",
-    "            weights=[0.4, 0.3, 0.3]\n",
+    "            values=[\n",
+    "                \"Formal and Technical\",\n",
+    "                \"Concise and Direct\",\n",
+    "                \"Detailed and Descriptive\",\n",
+    "            ],\n",
+    "            weights=[0.4, 0.3, 0.3],\n",
     "        ),\n",
     "    )\n",
     ")\n",
@@ -693,37 +746,31 @@
     "        system_prompt=SYSTEM_PROMPT,\n",
     "        model_alias=MODEL_ALIAS,\n",
     "        prompt=(\n",
-    "        \"{% if documentation_style == 'Formal and Technical' %}\\n\"\n",
-    "        \"Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\\n\"\n",
-    "        \"(ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\\n\"\n",
-    "\n",
-    "        \"Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\\n\"\n",
-    "        \"Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a\"\n",
-    "        \"change of {{ percent_change }}%.\\n\"\n",
-    "\n",
-    "        \"Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate\"\n",
-    "        \"sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.\\n\"\n",
-    "        \"{% elif documentation_style == 'Concise and Direct' %}\"\n",
-    "        \"Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }}\\n\"\n",
-    "        \"({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\\n\"\n",
-    "\n",
-    "        \"Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. 
Final: {{ final_measurement }}.\\n\"\n",
-    "        \"Change: {{ percent_change }}%.\\n\"\n",
-    "\n",
-    "        \"Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference\\n\"\n",
-    "        \"Dr. {{ investigator_last_name }} briefly.\\n\"\n",
-    "        \"{% else %}\\n\"\n",
-    "        \"Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\\n\"\n",
-    "        \"enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\\n\"\n",
-    "\n",
-    "        \"Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\\n\"\n",
-    "        \"Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\\n\"\n",
-    "        \"representing a {{ percent_change }}% change.\\n\"\n",
-    "\n",
-    "        \"Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective \"\n",
-    "        \"patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\\n\"\n",
-    "        \"{% endif %}\"\n",
-    "        )\n",
+    "            \"{% if documentation_style == 'Formal and Technical' %}\\n\"\n",
+    "            \"Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\\n\"\n",
+    "            \"(ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\\n\"\n",
+    "            \"Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\\n\"\n",
+    "            \"Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a \"\n",
+    "            \"change of {{ percent_change }}%.\\n\"\n",
+    "            \"Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate \"\n",
+    "            \"sections and subsections. 
Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.\\n\"\n", + " \"{% elif documentation_style == 'Concise and Direct' %}\"\n", + " \"Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }}\\n\"\n", + " \"({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\\n\"\n", + " \"Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}.\\n\"\n", + " \"Change: {{ percent_change }}%.\\n\"\n", + " \"Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference\\n\"\n", + " \"Dr. {{ investigator_last_name }} briefly.\\n\"\n", + " \"{% else %}\\n\"\n", + " \"Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\\n\"\n", + " \"enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\\n\"\n", + " \"Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\\n\"\n", + " \"Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\\n\"\n", + " \"representing a {{ percent_change }}% change.\\n\"\n", + " \"Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective \"\n", + " \"patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\\n\"\n", + " \"{% endif %}\"\n", + " ),\n", " )\n", ")\n", "\n", @@ -734,16 +781,16 @@ " system_prompt=SYSTEM_PROMPT,\n", " model_alias=MODEL_ALIAS,\n", " prompt=(\n", - " \"{% if has_adverse_event == 1 %}\"\n", - " \"[INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. \"\n", - " \"Use formal medical language. 
Do not include meta-commentary or explain what you're doing.] \"\n", - " \"{{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment.\\n\"\n", - " \"{% if adverse_event_resolved == 'Yes' %}Resolved.{% else %}Ongoing.{% endif %}\\n\"\n", - " \"{% else %}\\n\"\n", - " \"[INSTRUCTIONS: Output only the exact text 'No adverse events reported' without any additional commentary.] \"\n", - " \"No adverse events reported.\\n\"\n", - " \"{% endif %}\"\n", - " )\n", + " \"{% if has_adverse_event == 1 %}\"\n", + " \"[INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. \"\n", + " \"Use formal medical language. Do not include meta-commentary or explain what you're doing.] \"\n", + " \"{{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment.\\n\"\n", + " \"{% if adverse_event_resolved == 'Yes' %}Resolved.{% else %}Ongoing.{% endif %}\\n\"\n", + " \"{% else %}\\n\"\n", + " \"[INSTRUCTIONS: Output only the exact text 'No adverse events reported' without any additional commentary.] \"\n", + " \"No adverse events reported.\\n\"\n", + " \"{% endif %}\"\n", + " ),\n", " )\n", ")\n", "\n", @@ -754,44 +801,38 @@ " system_prompt=SYSTEM_PROMPT,\n", " model_alias=MODEL_ALIAS,\n", " prompt=(\n", - " \"{% if compliance_rate < 0.85 %}\"\n", - " \"{% if documentation_style == 'Formal and Technical' %}\"\n", - " \"[FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like 'it looks like' or \"\n", - " \"'you've provided'. Begin with the protocol deviation details. 
Use formal terminology.]\\n\"\n", - "\n", - " \"PROTOCOL DEVIATION REPORT\\n\"\n", - " \"Study ID: {{ study_id }}\\n\"\n", - " \"Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\\n\"\n", - " \"Compliance Rate: {{ compliance_rate }}\\n\"\n", - "\n", - " \"[Continue with formal description of the deviation, impact on data integrity, and corrective actions. \"\n", - " \"Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\\n\"\n", - " \"{% elif documentation_style == 'Concise and Direct' %}\\n\"\n", - " \"[FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\\n\"\n", - "\n", - " \"PROTOCOL DEVIATION - {{ participant_id }}\\n\"\n", - " \"β€’ Compliance: {{ compliance_rate }}\\n\"\n", - " \"β€’ Impact: [severity level]\\n\"\n", - " \"β€’ Actions: [list actions]\\n\"\n", - " \"β€’ Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\\n\"\n", - " \"β€’ PI: Dr. {{ investigator_last_name }}\\n\"\n", - " \"{% else %}\\n\"\n", - " \"[FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]\\n\"\n", - "\n", - " \"During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} \"\n", - " \"{{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\\n\"\n", - "\n", - " \"[Continue with narrative about circumstances, discovery, impact, and team response. Include references to \"\n", - " \"{{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\\n\"\n", - " \"{% endif %}]\\n\"\n", - " \"{% else %}\\n\"\n", - " \"[FORMAT INSTRUCTIONS: Write a simple direct statement. 
No meta-commentary or explanation.]\\n\"\n", - "\n", - " \"PROTOCOL COMPLIANCE ASSESSMENT\\n\"\n", - " \"Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\\n\"\n", - " \"Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.\\n\"\n", - " \"{% endif %}\"\n", - " )\n", + " \"{% if compliance_rate < 0.85 %}\"\n", + " \"{% if documentation_style == 'Formal and Technical' %}\"\n", + " \"[FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like 'it looks like' or \"\n", + " \"'you've provided'. Begin with the protocol deviation details. Use formal terminology.]\\n\"\n", + " \"PROTOCOL DEVIATION REPORT\\n\"\n", + " \"Study ID: {{ study_id }}\\n\"\n", + " \"Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\\n\"\n", + " \"Compliance Rate: {{ compliance_rate }}\\n\"\n", + " \"[Continue with formal description of the deviation, impact on data integrity, and corrective actions. \"\n", + " \"Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\\n\"\n", + " \"{% elif documentation_style == 'Concise and Direct' %}\\n\"\n", + " \"[FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\\n\"\n", + " \"PROTOCOL DEVIATION - {{ participant_id }}\\n\"\n", + " \"β€’ Compliance: {{ compliance_rate }}\\n\"\n", + " \"β€’ Impact: [severity level]\\n\"\n", + " \"β€’ Actions: [list actions]\\n\"\n", + " \"β€’ Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\\n\"\n", + " \"β€’ PI: Dr. {{ investigator_last_name }}\\n\"\n", + " \"{% else %}\\n\"\n", + " \"[FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. 
No meta-commentary.]\\n\"\n", + " \"During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} \"\n", + " \"{{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\\n\"\n", + " \"[Continue with narrative about circumstances, discovery, impact, and team response. Include references to \"\n", + " \"{{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\\n\"\n", + " \"{% endif %}]\\n\"\n", + " \"{% else %}\\n\"\n", + " \"[FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]\\n\"\n", + " \"PROTOCOL COMPLIANCE ASSESSMENT\\n\"\n", + " \"Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\\n\"\n", + " \"Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.\\n\"\n", + " \"{% endif %}\"\n", + " ),\n", " )\n", ")\n" ] @@ -804,9 +845,10 @@ "### Adding Constraints\n", "\n", "Finally, we'll add constraints to ensure our data is logically consistent:\n", + "\n", "- Trial dates must be in proper sequence\n", "- Adverse event dates must occur after enrollment\n", - "- Measurement changes must be realistic" + "- Measurement changes must be realistic\n" ] }, { @@ -821,21 +863,21 @@ " target_column=\"trial_end_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"gt\",\n", - " rhs=\"trial_start_date\"\n", + " rhs=\"trial_start_date\",\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"enrollment_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"ge\",\n", - " rhs=\"trial_start_date\"\n", + " rhs=\"trial_start_date\",\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"enrollment_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"lt\",\n", - " rhs=\"trial_end_date\"\n", + " rhs=\"trial_end_date\",\n", ")\n", "\n", "# 
Ensure reasonable clinical measurements\n", @@ -843,35 +885,35 @@ " target_column=\"baseline_measurement\",\n", " constraint_type=\"scalar_inequality\",\n", " operator=\"gt\",\n", - " rhs=0\n", + " rhs=0,\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"final_measurement\",\n", " constraint_type=\"scalar_inequality\",\n", " operator=\"gt\",\n", - " rhs=0\n", + " rhs=0,\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"trial_end_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"gt\",\n", - " rhs=\"trial_start_date\"\n", + " rhs=\"trial_start_date\",\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"enrollment_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"ge\",\n", - " rhs=\"trial_start_date\"\n", + " rhs=\"trial_start_date\",\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"enrollment_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"lt\",\n", - " rhs=\"trial_end_date\"\n", + " rhs=\"trial_end_date\",\n", ")\n", "\n", "# Ensure reasonable clinical measurements\n", @@ -879,14 +921,14 @@ " target_column=\"baseline_measurement\",\n", " constraint_type=\"scalar_inequality\",\n", " operator=\"gt\",\n", - " rhs=0\n", + " rhs=0,\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"final_measurement\",\n", " constraint_type=\"scalar_inequality\",\n", " operator=\"gt\",\n", - " rhs=0\n", + " rhs=0,\n", ")" ] }, @@ -903,7 +945,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. 
Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb index 99751bb2c..0afa3e69c 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/insurance-claims.ipynb @@ -7,20 +7,18 @@ "source": [ "# 🧾 NeMo Data Designer: Synthetic Insurance Claims Dataset Generator\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "This notebook creates a synthetic dataset of insurance claims with realistic PII (Personally Identifiable Information) \\\n", "for testing data protection and anonymization techniques.\n", "\n", "The dataset includes:\n", + "\n", "- Policy and claim details\n", "- Policyholder and claimant information (PII)\n", "- Claim descriptions and adjuster notes with embedded PII\n", "- Medical information for relevant claims\n", "\n", - "\n", "
\n", "\n", "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", @@ -67,7 +65,7 @@ " SamplerType,\n", " SubcategorySamplerParams,\n", " UUIDSamplerParams,\n", - " UniformSamplerParams\n", + " UniformSamplerParams,\n", ")" ] }, @@ -174,10 +172,11 @@ "## 🎲 Creating Person Samplers\n", "\n", "We'll create person samplers to generate consistent personal information for different roles in the insurance claims process:\n", + "\n", "- Policyholders (primary insurance customers)\n", "- Claimants (who may be different from policyholders)\n", "- Adjusters (insurance company employees who evaluate claims)\n", - "- Physicians (for medical-related claims)" + "- Physicians (for medical-related claims)\n" ] }, { @@ -226,10 +225,11 @@ "### Creating Policy Information\n", "\n", "Next, we'll create the basic policy information:\n", + "\n", "- Policy number (unique identifier)\n", "- Policy type (Auto, Home, Health, etc.)\n", "- Coverage details (based on policy type)\n", - "- Policy start and end dates" + "- Policy start and end dates\n" ] }, { @@ -244,7 +244,7 @@ " SamplerColumnConfig(\n", " name=\"policy_number\",\n", " sampler_type=SamplerType.UUID,\n", - " params=UUIDSamplerParams(prefix=\"POL-\", short_form=True, uppercase=True)\n", + " params=UUIDSamplerParams(prefix=\"POL-\", short_form=True, uppercase=True),\n", " )\n", ")\n", "\n", @@ -255,8 +255,8 @@ " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", " values=[\"Auto\", \"Home\", \"Health\", \"Life\", \"Travel\"],\n", - " weights=[0.4, 0.3, 0.15, 0.1, 0.05]\n", - " )\n", + " weights=[0.4, 0.3, 0.15, 0.1, 0.05],\n", + " ),\n", " )\n", ")\n", "\n", @@ -268,13 +268,33 @@ " params=SubcategorySamplerParams(\n", " category=\"policy_type\",\n", " values={\n", - " \"Auto\": [\"Liability\", \"Comprehensive\", \"Collision\", \"Uninsured Motorist\"],\n", - " \"Home\": [\"Dwelling\", \"Personal Property\", \"Liability\", \"Natural Disaster\"],\n", - " \"Health\": [\"Emergency Care\", \"Primary Care\", \"Specialist\", 
\"Prescription\"],\n", + " \"Auto\": [\n", + " \"Liability\",\n", + " \"Comprehensive\",\n", + " \"Collision\",\n", + " \"Uninsured Motorist\",\n", + " ],\n", + " \"Home\": [\n", + " \"Dwelling\",\n", + " \"Personal Property\",\n", + " \"Liability\",\n", + " \"Natural Disaster\",\n", + " ],\n", + " \"Health\": [\n", + " \"Emergency Care\",\n", + " \"Primary Care\",\n", + " \"Specialist\",\n", + " \"Prescription\",\n", + " ],\n", " \"Life\": [\"Term\", \"Whole Life\", \"Universal Life\", \"Variable Life\"],\n", - " \"Travel\": [\"Trip Cancellation\", \"Medical Emergency\", \"Lost Baggage\", \"Flight Accident\"]\n", - " }\n", - " )\n", + " \"Travel\": [\n", + " \"Trip Cancellation\",\n", + " \"Medical Emergency\",\n", + " \"Lost Baggage\",\n", + " \"Flight Accident\",\n", + " ],\n", + " },\n", + " ),\n", " )\n", ")\n", "\n", @@ -284,7 +304,7 @@ " name=\"policy_start_date\",\n", " sampler_type=SamplerType.DATETIME,\n", " params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", " )\n", ")\n", "\n", @@ -293,7 +313,7 @@ " name=\"policy_end_date\",\n", " sampler_type=SamplerType.DATETIME,\n", " params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", " )\n", ")" ] @@ -307,11 +327,12 @@ "\n", "Now we'll add fields for the policyholder's personal information. This includes PII elements that would typically be \\\n", "subject to privacy regulations:\n", + "\n", "- First and last name\n", "- Birth date\n", "- Contact information (email)\n", "\n", - "These fields use expressions to reference the person sampler we defined earlier." 
+ "These fields use expressions to reference the person sampler we defined earlier.\n" ] }, { @@ -324,29 +345,25 @@ "# Policyholder personal information\n", "config_builder.add_column(\n", " ExpressionColumnConfig(\n", - " name=\"policyholder_first_name\",\n", - " expr=\"{{policyholder.first_name}}\"\n", + " name=\"policyholder_first_name\", expr=\"{{policyholder.first_name}}\"\n", " )\n", ")\n", "\n", "config_builder.add_column(\n", " ExpressionColumnConfig(\n", - " name=\"policyholder_last_name\",\n", - " expr=\"{{policyholder.last_name}}\"\n", + " name=\"policyholder_last_name\", expr=\"{{policyholder.last_name}}\"\n", " )\n", ")\n", "\n", "config_builder.add_column(\n", " ExpressionColumnConfig(\n", - " name=\"policyholder_birth_date\",\n", - " expr=\"{{policyholder.birth_date}}\"\n", + " name=\"policyholder_birth_date\", expr=\"{{policyholder.birth_date}}\"\n", " )\n", ")\n", "\n", "config_builder.add_column(\n", " ExpressionColumnConfig(\n", - " name=\"policyholder_email\",\n", - " expr=\"{{policyholder.email_address}}\"\n", + " name=\"policyholder_email\", expr=\"{{policyholder.email_address}}\"\n", " )\n", ")" ] @@ -359,10 +376,11 @@ "### Claim Information\n", "\n", "Next, we'll create the core claim details:\n", + "\n", "- Claim ID (unique identifier)\n", "- Dates (filing date, incident date)\n", "- Claim status (in process, approved, denied, etc.)\n", - "- Financial information (amount claimed, amount approved)" + "- Financial information (amount claimed, amount approved)\n" ] }, { @@ -377,7 +395,7 @@ " SamplerColumnConfig(\n", " name=\"claim_id\",\n", " sampler_type=SamplerType.UUID,\n", - " params=UUIDSamplerParams(prefix=\"CLM-\", short_form=True, uppercase=True)\n", + " params=UUIDSamplerParams(prefix=\"CLM-\", short_form=True, uppercase=True),\n", " )\n", ")\n", "\n", @@ -387,7 +405,7 @@ " name=\"incident_date\",\n", " sampler_type=SamplerType.DATETIME,\n", " params={\"start\": \"2023-01-01\", \"end\": \"2023-12-31\"},\n", - " 
convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", " )\n", ")\n", "\n", @@ -399,9 +417,9 @@ " \"dt_min\": 1,\n", " \"dt_max\": 30,\n", " \"reference_column_name\": \"incident_date\",\n", - " \"unit\": \"D\"\n", + " \"unit\": \"D\",\n", " },\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", ")\n", "\n", "# Claim status\n", @@ -410,9 +428,16 @@ " name=\"claim_status\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", - " values=[\"Filed\", \"Under Review\", \"Additional Info Requested\", \"Approved\", \"Denied\", \"Appealed\"],\n", - " weights=[0.15, 0.25, 0.15, 0.25, 0.15, 0.05]\n", - " )\n", + " values=[\n", + " \"Filed\",\n", + " \"Under Review\",\n", + " \"Additional Info Requested\",\n", + " \"Approved\",\n", + " \"Denied\",\n", + " \"Appealed\",\n", + " ],\n", + " weights=[0.15, 0.25, 0.15, 0.25, 0.15, 0.05],\n", + " ),\n", " )\n", ")\n", "\n", @@ -421,7 +446,7 @@ " SamplerColumnConfig(\n", " name=\"claim_amount\",\n", " sampler_type=SamplerType.GAUSSIAN,\n", - " params=GaussianSamplerParams(mean=5000, stddev=2000)\n", + " params=GaussianSamplerParams(mean=5000, stddev=2000),\n", " )\n", ")\n", "\n", @@ -429,15 +454,14 @@ " SamplerColumnConfig(\n", " name=\"approved_percentage\",\n", " sampler_type=SamplerType.UNIFORM,\n", - " params=UniformSamplerParams(low=0.0, high=1.0)\n", + " params=UniformSamplerParams(low=0.0, high=1.0),\n", " )\n", ")\n", "\n", "# Calculate approved amount based on percentage\n", "config_builder.add_column(\n", " ExpressionColumnConfig(\n", - " name=\"approved_amount\",\n", - " expr=\"{{claim_amount * approved_percentage}}\"\n", + " name=\"approved_amount\", expr=\"{{claim_amount * approved_percentage}}\"\n", " )\n", ")" ] @@ -451,9 +475,10 @@ "\n", "In some cases, the claimant (person filing the claim) may be different from the policyholder. 
\\\n", "We'll create fields to capture claimant information and their relationship to the policyholder:\n", + "\n", "- Flag indicating if claimant is the policyholder\n", "- Claimant personal details (when different from policyholder)\n", - "- Relationship to policyholder" + "- Relationship to policyholder\n" ] }, { @@ -468,30 +493,21 @@ " SamplerColumnConfig(\n", " name=\"is_claimant_policyholder\",\n", " sampler_type=SamplerType.BERNOULLI,\n", - " params=BernoulliSamplerParams(p=0.7)\n", + " params=BernoulliSamplerParams(p=0.7),\n", " )\n", ")\n", "\n", "# Claimant personal information (when different from policyholder)\n", "config_builder.add_column(\n", - " ExpressionColumnConfig(\n", - " name=\"claimant_first_name\",\n", - " expr=\"{{claimant.first_name}}\"\n", - " )\n", + " ExpressionColumnConfig(name=\"claimant_first_name\", expr=\"{{claimant.first_name}}\")\n", ")\n", "\n", "config_builder.add_column(\n", - " ExpressionColumnConfig(\n", - " name=\"claimant_last_name\",\n", - " expr=\"{{claimant.last_name}}\"\n", - " )\n", + " ExpressionColumnConfig(name=\"claimant_last_name\", expr=\"{{claimant.last_name}}\")\n", ")\n", "\n", "config_builder.add_column(\n", - " ExpressionColumnConfig(\n", - " name=\"claimant_birth_date\",\n", - " expr=\"{{claimant.birth_date}}\"\n", - " )\n", + " ExpressionColumnConfig(name=\"claimant_birth_date\", expr=\"{{claimant.birth_date}}\")\n", ")\n", "\n", "# Relationship to policyholder\n", @@ -499,7 +515,9 @@ " SamplerColumnConfig(\n", " name=\"relationship_to_policyholder\",\n", " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(values=[\"Self\", \"Spouse\", \"Child\", \"Parent\", \"Sibling\", \"Other\"]),\n", + " params=CategorySamplerParams(\n", + " values=[\"Self\", \"Spouse\", \"Child\", \"Parent\", \"Sibling\", \"Other\"]\n", + " ),\n", " )\n", ")" ] @@ -511,11 +529,12 @@ "source": [ "### Claim Adjuster Information\n", "\n", - "Insurance claims are typically handled by claim adjusters. 
We'll add information about \n", + "Insurance claims are typically handled by claim adjusters. We'll add information about\n", "the adjuster assigned to each claim:\n", + "\n", "- Adjuster name\n", "- Assignment date\n", - "- Contact information" + "- Contact information\n" ] }, { @@ -527,17 +546,11 @@ "source": [ "# Adjuster information\n", "config_builder.add_column(\n", - " ExpressionColumnConfig(\n", - " name=\"adjuster_first_name\",\n", - " expr=\"{{adjuster.first_name}}\"\n", - " )\n", + " ExpressionColumnConfig(name=\"adjuster_first_name\", expr=\"{{adjuster.first_name}}\")\n", ")\n", "\n", "config_builder.add_column(\n", - " ExpressionColumnConfig(\n", - " name=\"adjuster_last_name\",\n", - " expr=\"{{adjuster.last_name}}\"\n", - " )\n", + " ExpressionColumnConfig(name=\"adjuster_last_name\", expr=\"{{adjuster.last_name}}\")\n", ")\n", "\n", "# Adjuster assignment date\n", @@ -549,9 +562,9 @@ " \"dt_min\": 0,\n", " \"dt_max\": 5,\n", " \"reference_column_name\": \"filing_date\",\n", - " \"unit\": \"D\"\n", + " \"unit\": \"D\",\n", " },\n", - " convert_to=\"%Y-%m-%d\"\n", + " convert_to=\"%Y-%m-%d\",\n", ")" ] }, @@ -562,10 +575,11 @@ "source": [ "### Medical Information\n", "\n", - "For health insurance claims and injury-related claims in other policy types, \n", + "For health insurance claims and injury-related claims in other policy types,\n", "we'll include medical information:\n", + "\n", "- Flag indicating if there's a medical component to the claim\n", - "- Medical claim details (when applicable)" + "- Medical claim details (when applicable)\n" ] }, { @@ -580,7 +594,7 @@ " SamplerColumnConfig(\n", " name=\"has_medical_component\",\n", " sampler_type=SamplerType.BERNOULLI,\n", - " params=BernoulliSamplerParams(p=0.4)\n", + " params=BernoulliSamplerParams(p=0.4),\n", " )\n", ")\n", "\n", @@ -588,14 +602,14 @@ "config_builder.add_column(\n", " ExpressionColumnConfig(\n", " name=\"physician_first_name\",\n", - " expr=\"{% if has_medical_component == 1 
%}{{physician.first_name}}{% else %}'NA'{% endif %}\"\n", + " expr=\"{% if has_medical_component == 1 %}{{physician.first_name}}{% else %}'NA'{% endif %}\",\n", " )\n", ")\n", "\n", "config_builder.add_column(\n", " ExpressionColumnConfig(\n", " name=\"physician_last_name\",\n", - " expr=\"{% if has_medical_component == 1 %}{{physician.last_name}}{% else %}'NA'{% endif %}\"\n", + " expr=\"{% if has_medical_component == 1 %}{{physician.last_name}}{% else %}'NA'{% endif %}\",\n", " )\n", ")" ] @@ -615,7 +629,7 @@ "3. Medical Notes - For claims with a medical component\n", "\n", "The LLM will be prompted to include PII elements like names, dates, and contact information\n", - "within the narrative text." + "within the narrative text.\n" ] }, { @@ -708,8 +722,9 @@ "### Adding Constraints\n", "\n", "To ensure our data is logically consistent, we'll add some constraints:\n", + "\n", "- Incident date must be during the policy term\n", - "- Filing date must be after incident date" + "- Filing date must be after incident date\n" ] }, { @@ -724,14 +739,14 @@ " target_column=\"incident_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"ge\",\n", - " rhs=\"policy_start_date\"\n", + " rhs=\"policy_start_date\",\n", ")\n", "\n", "config_builder.add_constraint(\n", " target_column=\"incident_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"le\",\n", - " rhs=\"policy_end_date\"\n", + " rhs=\"policy_end_date\",\n", ")\n", "\n", "# Ensure filing date is after incident date\n", @@ -739,7 +754,7 @@ " target_column=\"filing_date\",\n", " constraint_type=\"column_inequality\",\n", " operator=\"gt\",\n", - " rhs=\"incident_date\"\n", + " rhs=\"incident_date\",\n", ")" ] }, @@ -756,7 +771,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. 
Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb index a84738ba7..1a743d3f5 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/healthcare-datasets/physician-notes-with-realistic-personal-details.ipynb @@ -6,20 +6,18 @@ "source": [ "# πŸ§‘β€βš•οΈ NeMo Data Designer: Realistic Patient Data & Physician Notes\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "This notebook demonstrates how to use NeMo Data Designer to generate realistic patient data including physician notes.\\\n", " We'll leverage both structured data generation and LLM capabilities to create a comprehensive medical dataset.\n", "\n", "The dataset includes:\n", + "\n", "- Policy and claim details\n", "- Policyholder and claimant information (PII)\n", "- Claim descriptions and adjuster notes with embedded PII\n", "- Medical information for relevant claims\n", "\n", - "\n", "
\n", "\n", "> πŸ‘‹ **IMPORTANT** – Environment Setup\n", @@ -159,11 +157,11 @@ "source": [ "## 🌱 Loading Seed Data\n", "\n", - "- We'll use the symptom-to-diagnosis dataset as our seed data. \n", + "- We'll use the symptom-to-diagnosis dataset as our seed data.\n", "\n", "- This dataset contains patient symptoms and corresponding diagnoses which will help generate realistic medical scenarios.\n", "\n", - "
\n", + "
\n", "\n", "> 🌱 **Why use a seed dataset?**\n", ">\n", @@ -183,9 +181,8 @@ ">\n", "> - The datastore endpoint is specified in the deployment configuration.\n", "\n", - "\n", "πŸ‘‹ **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as \\\n", - "seeds, it is recommended you consolidated these into a single file. " + "seeds, it is recommended you consolidated these into a single file.\n" ] }, { @@ -198,7 +195,9 @@ "\n", "# Let's use the symptom-to-diagnosis dataset to seed our workflow\n", "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n", - "df_seed = df_seed.rename(columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"})\n", + "df_seed = df_seed.rename(\n", + " columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"}\n", + ")\n", "\n", "print(f\"Number of records: {len(df_seed)}\")\n", "\n", @@ -245,7 +244,7 @@ "> dataset=\"data-designer/demo/gretelai_symptom_to_diagnosis.csv\",\n", "> datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", "> )\n", - "> ```" + "> ```\n" ] }, { @@ -254,10 +253,10 @@ "source": [ "## 🎲 Creating Person Samplers\n", "\n", - "- We create persona samplers to simulate details about the patient and the doctor \n", + "- We create persona samplers to simulate details about the patient and the doctor\n", "\n", "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", - "If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker\n" + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker\n" ] }, { @@ -295,7 +294,7 @@ "- `uuid` for patient identification\n", "- Patient personal information (`first_name`, `last_name`, `dob`, `patient_email`)\n", "- Medical timeline information 
(`symptom_onset_date`, `date_of_visit`)\n", - "- Physician information (`physician`)" + "- Physician information (`physician`)\n" ] }, { @@ -313,27 +312,21 @@ ")\n", "\n", "config_builder.add_column(\n", - " name=\"first_name\",\n", - " column_type=\"expression\",\n", - " expr=\"{{patient_sampler.first_name}}\"\n", + " name=\"first_name\", column_type=\"expression\", expr=\"{{patient_sampler.first_name}}\"\n", ")\n", "\n", "config_builder.add_column(\n", - " name=\"last_name\",\n", - " column_type=\"expression\",\n", - " expr=\"{{patient_sampler.last_name}}\"\n", + " name=\"last_name\", column_type=\"expression\", expr=\"{{patient_sampler.last_name}}\"\n", ")\n", "\n", "config_builder.add_column(\n", - " name=\"dob\",\n", - " column_type=\"expression\",\n", - " expr=\"{{patient_sampler.birth_date}}\"\n", + " name=\"dob\", column_type=\"expression\", expr=\"{{patient_sampler.birth_date}}\"\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"patient_email\",\n", " column_type=\"expression\",\n", - " expr=\"{{patient_sampler.email_address}}\"\n", + " expr=\"{{patient_sampler.email_address}}\",\n", ")\n", "\n", "config_builder.add_column(\n", @@ -347,11 +340,7 @@ " name=\"date_of_visit\",\n", " column_type=\"sampler\",\n", " sampler_type=\"timedelta\",\n", - " params={\n", - " \"dt_min\": 1,\n", - " \"dt_max\": 30,\n", - " \"reference_column_name\": \"symptom_onset_date\"\n", - " },\n", + " params={\"dt_min\": 1, \"dt_max\": 30, \"reference_column_name\": \"symptom_onset_date\"},\n", ")\n", "\n", "config_builder.add_column(\n", @@ -373,7 +362,7 @@ "- Patient summary from our seed data\n", "- Clear formatting instructions\n", "\n", - "This will create detailed medical notes that reflect the patient's diagnosis and visit information. 
" + "This will create detailed medical notes that reflect the patient's diagnosis and visit information.\n" ] }, { @@ -393,17 +382,14 @@ " \"who has been struggling with symptoms from {{diagnosis}} since {{symptom_onset_date}}.\\n\"\n", " \"The date of today's visit is {{date_of_visit}}.\\n\"\n", " \"\\n\"\n", - "\n", " \"\\n\"\n", " \"{{patient_summary}}\\n\"\n", " \"\\n\"\n", - "\n", " \"\\n\"\n", " \"Write careful notes about your visit with {{first_name}}, as {{physician}}.\\n\"\n", - "\n", " \"Format the notes as a busy doctor might.\\n\"\n", " \"\"\n", - " )\n", + " ),\n", " )\n", ")" ] @@ -420,7 +406,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb index 1ded5f81d..b3076c9e6 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multi-turn-chat/multi-turn-conversation.ipynb @@ -7,17 +7,14 @@ "source": [ "# 🎨 NeMo Data Designer: Synthetic Conversational Data with Person Details\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "### πŸ“š What you'll learn\n", "\n", "- This notebook demonstrates how to use the NeMo Data Designer to build a synthetic data generation pipeline step-by-step.\n", "\n", - "- We will create multi-turn user-assistant dialogues tailored for fine-tuning language models, enhanced with realistic person details. 
\n", + "- We will create multi-turn user-assistant dialogues tailored for fine-tuning language models, enhanced with realistic person details.\n", "\n", "- These datasets could be used for developing and enhancing conversational AI applications, including customer \\\n", - "support chatbots, virtual assistants, and interactive learning systems.\n", - "\n", + " support chatbots, virtual assistants, and interactive learning systems.\n", "\n", "
\n", "\n", @@ -62,7 +59,7 @@ " SamplerColumnConfig,\n", " SamplerType,\n", " Score,\n", - " SubcategorySamplerParams\n", + " SubcategorySamplerParams,\n", ")" ] }, @@ -168,7 +165,7 @@ "source": [ "### Define Pydantic Models for Structured Outputs\n", "\n", - "You can use Pydantic to define a structure for the messages that are produced by Data Designer" + "You can use Pydantic to define a structure for the messages that are produced by Data Designer\n" ] }, { @@ -181,9 +178,13 @@ "from typing import Literal\n", "from pydantic import BaseModel, Field\n", "\n", + "\n", "class Message(BaseModel):\n", " \"\"\"A single message turn in the conversation.\"\"\"\n", - " role: Literal[\"user\", \"assistant\"] = Field(..., description=\"Which role is writing the message.\")\n", + "\n", + " role: Literal[\"user\", \"assistant\"] = Field(\n", + " ..., description=\"Which role is writing the message.\"\n", + " )\n", " content: str = Field(..., description=\"Message contents.\")\n", "\n", "\n", @@ -196,7 +197,10 @@ " * Message content can be long or short.\n", " * All assistant messages are faithful responses and must be answered fully.\n", " \"\"\"\n", - " conversation: list[Message] = Field(..., description=\"List of all messages in the conversation.\")\n", + "\n", + " conversation: list[Message] = Field(\n", + " ..., description=\"List of all messages in the conversation.\"\n", + " )\n", "\n", "\n", "class UserToxicityScore(BaseModel):\n", @@ -208,8 +212,11 @@ " Moderate: Some disrespectful or harassing language.\n", " Severe: Overt hate, harassment, or harmful content.\n", " \"\"\"\n", + "\n", " reasons: list[str] = Field(..., description=\"Reasoning for user toxicity score.\")\n", - " score: Literal[\"None\", \"Mild\", \"Moderate\", \"Severe\"] = Field(..., description=\"Level of toxicity observed in the user role responses.\")" + " score: Literal[\"None\", \"Mild\", \"Moderate\", \"Severe\"] = Field(\n", + " ..., description=\"Level of toxicity observed in the user role 
responses.\"\n", + " )" ] }, { @@ -221,7 +228,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -238,7 +245,7 @@ " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", " values=[\"Tech Support\", \"Personal Finances\", \"Educational Guidance\"]\n", - " )\n", + " ),\n", " )\n", ")\n", "\n", @@ -266,7 +273,7 @@ " \"Learning a New Language\",\n", " ],\n", " },\n", - " )\n", + " ),\n", " )\n", ")\n", "\n", @@ -275,9 +282,7 @@ " SamplerColumnConfig(\n", " name=\"complexity\",\n", " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(\n", - " values=[\"Basic\", \"Intermediate\", \"Advanced\"]\n", - " )\n", + " params=CategorySamplerParams(values=[\"Basic\", \"Intermediate\", \"Advanced\"]),\n", " )\n", ")\n", "\n", @@ -286,9 +291,7 @@ " SamplerColumnConfig(\n", " name=\"conversation_length\",\n", " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(\n", - " values=[2, 4, 6, 8]\n", - " )\n", + " params=CategorySamplerParams(values=[2, 4, 6, 8]),\n", " )\n", ")\n", "\n", @@ -299,7 +302,7 @@ " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(\n", " values=[\"happy\", \"silly\", \"sarcastic\", \"combative\", \"disappointed\", \"toxic\"]\n", - " )\n", + " ),\n", " )\n", ")" ] @@ -310,15 +313,17 @@ "metadata": {}, "source": [ "## 🦜 Adding LLM Generated columns\n", - "Now define the columns that the model will generate. These prompts instruct the LLM to produce the actual conversation: \n", - "- a system prompt to guide how the AI assistant engages in the conversation with the user, \n", - "- the conversation, and \n", + "\n", + "Now define the columns that the model will generate. 
These prompts instruct the LLM to produce the actual conversation:\n", + "\n", + "- a system prompt to guide how the AI assistant engages in the conversation with the user,\n", + "- the conversation, and\n", "- finally, we generate a toxicity_label to assess user toxicity over the entire conversation.\n", - "
\n", + "
\n",
 "\n",
 "### πŸ’¬πŸ€– AI Assistant system prompt and conversation\n",
 "\n",
-    "We generate a system prompt to base the AI assistant and then generate the entire conversation."
+    "We generate a system prompt to ground the AI assistant and then generate the entire conversation.\n"
 ]
 },
@@ -333,8 +338,10 @@
 "    LLMTextColumnConfig(\n",
 "        name=\"assistant_system_prompt\",\n",
 "        system_prompt=SYSTEM_PROMPT,\n",
-    "        prompt=(\"Write a reasonable system prompt for a helpful AI assistant with expertise in \"\n",
-    "                \"{{domain}} and {{topic}}. The AI assistant must not engage in harmful behaviors.\"),\n",
+    "        prompt=(\n",
+    "            \"Write a reasonable system prompt for a helpful AI assistant with expertise in \"\n",
+    "            \"{{domain}} and {{topic}}. The AI assistant must not engage in harmful behaviors.\"\n",
+    "        ),\n",
 "        model_alias=MODEL_ALIAS,\n",
 "    )\n",
 ")\n",
@@ -379,10 +386,10 @@
 "## πŸ” LLM-as-a-Judge: Toxicity Assessment\n",
 "\n",
 "When generating our synthetic dataset, we need to determine the quality of the generated dialogs. \\\n",
-    "We use the LLM-as-a-Judge strategy to do this. \n",
+    "We use the LLM-as-a-Judge strategy to do this.\n",
 "\n",
-    "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n",
-    "that provides relavant instructions. 
"
+    "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt\n",
+    "that provides relevant instructions.\n"
 ]
 },
@@ -397,15 +404,13 @@
 "# Define toxicity score rubric\n",
 "toxicity_rubric = Score(\n",
 "    name=\"Toxicity\",\n",
 "    description=\"Evaluation of harmful, harassing, or hateful language in conversations\",\n",
-    "    options=\n",
-    "        {\n",
-    "        \"4\": \"No toxicity detected; conversation is respectful and appropriate.\",\n",
-    "        \"3\": \"Mild toxicity; slightly rude or sarcastic but not hateful or harmful.\",\n",
-    "        \"2\": \"Moderate toxicity; contains disrespectful or harassing language.\",\n",
-    "        \"1\": \"Severe toxicity; contains overt hate, harassment, or harmful content.\",\n",
-    "        \"0\": \"Extreme toxicity; contains dangerous, threatening, or abusive content.\"\n",
-    "        }\n",
-    "\n",
+    "    options={\n",
+    "        \"4\": \"No toxicity detected; conversation is respectful and appropriate.\",\n",
+    "        \"3\": \"Mild toxicity; slightly rude or sarcastic but not hateful or harmful.\",\n",
+    "        \"2\": \"Moderate toxicity; contains disrespectful or harassing language.\",\n",
+    "        \"1\": \"Severe toxicity; contains overt hate, harassment, or harmful content.\",\n",
+    "        \"0\": \"Extreme toxicity; contains dangerous, threatening, or abusive content.\",\n",
+    "    },\n",
 ")\n",
 "\n",
 "TOXICITY_PROMPT_TEMPLATE = (\n",
 "    \"## OVERVIEW\\n\"\n",
 "    \"You are evaluating the toxicity level in user messages.\\n\"\n",
 "    \"Take a deep breath and carefully analyze the conversation provided. 
\\n\"\n", " \"Use the rubric to score the level of toxicity in the USER'S messages only.\\n\"\n", - "\n", " \"Consider:\\n\"\n", " \"- Focus primarily on harassment, hate, or harmful language\\n\"\n", " \"- Evaluate the full context of the conversation\\n\"\n", " \"- Be objective and consistent in your assessment\\n\"\n", " \"- If any single message exhibits toxicity, it should influence the final score\\n\"\n", - "\n", " \"## CONVERSATION\\n\"\n", " \"{{ conversation }}\"\n", ")\n", @@ -430,7 +433,7 @@ " system_prompt=SYSTEM_PROMPT,\n", " prompt=TOXICITY_PROMPT_TEMPLATE,\n", " scores=[toxicity_rubric],\n", - " model_alias=MODEL_ALIAS\n", + " model_alias=MODEL_ALIAS,\n", " )\n", ")" ] @@ -448,7 +451,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb index 9d5b8d521..56ea1a5f2 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/multimodal/visual-question-answering-using-vlm.ipynb @@ -6,11 +6,9 @@ "source": [ "# 🎨 NeMo Data Designer: Visual Question Answering Dataset Generation\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "### πŸ“š What you'll learn\n", "\n", - "This notebook demonstrates how to use NeMo Data Designer to generate high-quality synthetic question-answer datasets from visual documents. 
\n", + "This notebook demonstrates how to use NeMo Data Designer to generate high-quality synthetic question-answer datasets from visual documents.\n", "\n", "
\n", "\n", @@ -23,7 +21,7 @@ ">\n", "> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n", "\n", - "
" + "
\n" ] }, { @@ -71,7 +69,7 @@ " ModalityDataType,\n", " NeMoDataDesignerClient,\n", " SamplerColumnConfig,\n", - " SamplerType\n", + " SamplerType,\n", ")" ] }, @@ -120,7 +118,7 @@ "MODEL_PROVIDER = \"nvidiabuild\"\n", "\n", "# The model ID is from build.nvidia.com.\n", - "MODEL_ID = \"meta/llama-4-maverick-17b-128e-instruct\"\n", + "MODEL_ID = \"meta/llama-4-maverick-17b-128e-instruct\"\n", "\n", "# We choose this alias to be descriptive for our use case.\n", "MODEL_ALIAS = \"llama-4-maverick-17b-128e-instruct\"\n", @@ -170,13 +168,13 @@ "In this section, we'll prepare our visual documents as a seed dataset. The seed dataset provides the foundation for synthetic data generation by:\n", "\n", "- **Loading Visual Documents**: We use the ColPali dataset containing document images\n", - "- **Image Processing**: Convert images to base64 format for model consumption \n", + "- **Image Processing**: Convert images to base64 format for model consumption\n", "- **Metadata Extraction**: Preserve relevant document information\n", "- **Sampling Strategy**: Configure how the seed data is utilized during generation\n", "\n", "The seed dataset can be referenced in generation prompts using Jinja templating.\n", "\n", - "
\n", + "
\n", "\n", "> 🌱 **Why use a seed dataset?**\n", ">\n", @@ -196,9 +194,8 @@ ">\n", "> - The datastore endpoint is specified in the deployment configuration.\n", "\n", - "\n", "πŸ‘‹ **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as \\\n", - "seeds, it is recommended you consolidated these into a single file. " + "seeds, it is recommended you consolidated these into a single file.\n" ] }, { @@ -215,7 +212,7 @@ "img_dataset_cfg = {\n", " \"path\": \"vidore/colpali_train_set\",\n", " \"split\": \"train\",\n", - " \"streaming\": True\n", + " \"streaming\": True,\n", "}" ] }, @@ -223,7 +220,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Define helper functions to preprocess the dataset" + "Define helper functions to preprocess the dataset\n" ] }, { @@ -247,6 +244,7 @@ " width = int(original_width * (height / original_height))\n", " return image.resize((width, height))\n", "\n", + "\n", "def convert_image_to_chat_format(record, height: int) -> dict:\n", " \"\"\"\n", " Convert PIL image to base64 format for chat template usage.\n", @@ -269,10 +267,7 @@ " base64_string = base64_encoded_data.decode(\"utf-8\")\n", "\n", " # Return updated record\n", - " return record | {\n", - " \"base64_image\": base64_string,\n", - " \"uuid\": str(uuid.uuid4())\n", - " }" + " return record | {\"base64_image\": base64_string, \"uuid\": str(uuid.uuid4())}" ] }, { @@ -285,8 +280,9 @@ "print(\"πŸ“₯ Loading and processing document images...\")\n", "\n", "img_dataset_iter = iter(\n", - " load_dataset(**img_dataset_cfg)\n", - " .map(convert_image_to_chat_format, fn_kwargs={\"height\": BASE64_IMAGE_HEIGHT})\n", + " load_dataset(**img_dataset_cfg).map(\n", + " convert_image_to_chat_format, fn_kwargs={\"height\": BASE64_IMAGE_HEIGHT}\n", + " )\n", ")\n", "img_dataset = pd.DataFrame([next(img_dataset_iter) for _ in range(IMG_COUNT)])\n", "\n", @@ -302,7 +298,9 @@ "# save the seed dataset to a csv file locally\n", 
"os.makedirs(\"./data/\", exist_ok=True)\n", "\n", - "df_seed = pd.DataFrame(img_dataset)[[\"uuid\", \"image_filename\", \"base64_image\", \"page\", \"options\", \"source\"]]\n", + "df_seed = pd.DataFrame(img_dataset)[\n", + " [\"uuid\", \"image_filename\", \"base64_image\", \"page\", \"options\", \"source\"]\n", + "]\n", "df_seed.to_csv(\"./data/colpali_train_set.csv\", index=False)\n", "\n", "df_seed.head()" @@ -318,7 +316,7 @@ "dataset_reference = data_designer_client.upload_seed_dataset(\n", " repo_id=\"data-designer-advanced/visual-qna\",\n", " dataset=\"./data/colpali_train_set.csv\",\n", - " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"}\n", + " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", ")\n", "\n", "config_builder.with_seed_dataset(\n", @@ -334,14 +332,13 @@ "## 🦜 Generating Summary of Image Contents\n", "\n", "- We instruct the model to β€œlook” at each image and write a short, Markdown\n", - "summary. \n", + " summary.\n", "\n", "- We ask it to read the page from top ➑️ bottom, then include a quick wrap-up\n", - "at the end. \n", + " at the end.\n", "\n", "- That summary becomes helpful context we’ll reuse to generate focused\n", - "questions and answers about the document later.\n", - "\n", + " questions and answers about the document later.\n", "\n", "### πŸ–ΌοΈ How the image is provided\n", "\n", @@ -353,7 +350,7 @@ "\n", "In other words, `ImageContext` tells the model β€œthis is an image, encoded as Base64,\n", "and it’s a PNG,” so it knows exactly how to \\\n", - "use it during summarization." 
+ "use it during summarization.\n" ] }, { @@ -367,16 +364,18 @@ " name=\"summary\",\n", " column_type=\"llm-text\",\n", " model_alias=MODEL_ALIAS,\n", - " prompt=(\"Provide a detailed summary of the content in this image in Markdown format.\"\n", - " \"Start from the top of the image and then describe it from top to bottom.\"\n", - " \"Place a summary at the bottom.\"),\n", + " prompt=(\n", + " \"Provide a detailed summary of the content in this image in Markdown format.\"\n", + " \"Start from the top of the image and then describe it from top to bottom.\"\n", + " \"Place a summary at the bottom.\"\n", + " ),\n", " multi_modal_context=[\n", " ImageContext(\n", " column_name=\"base64_image\",\n", " data_type=ModalityDataType.BASE64,\n", " image_format=ImageFormat.PNG,\n", " )\n", - " ]\n", + " ],\n", ")" ] }, @@ -387,6 +386,7 @@ "## πŸ—οΈ Designing our Data Schema\n", "\n", "Structured outputs ensure consistent and predictable data generation. Data Designer supports schemas defined using:\n", + "\n", "- **JSON Schema**: For basic structure definition\n", "- **Pydantic Models**: For advanced validation and type safety (recommended)\n", "\n", @@ -401,22 +401,31 @@ "source": [ "class Question(BaseModel):\n", " \"\"\"Schema for generated questions\"\"\"\n", + "\n", " question: str = Field(description=\"The question to be generated\")\n", "\n", + "\n", "class QuestionTopic(BaseModel):\n", " \"\"\"Schema for question topics\"\"\"\n", + "\n", " topic: str = Field(description=\"The topic/category of the question\")\n", "\n", + "\n", "class Options(BaseModel):\n", " \"\"\"Schema for multiple choice options\"\"\"\n", + "\n", " option_a: str = Field(description=\"The first answer choice\")\n", " option_b: str = Field(description=\"The second answer choice\")\n", " option_c: str = Field(description=\"The third answer choice\")\n", " option_d: str = Field(description=\"The fourth answer choice\")\n", "\n", + "\n", "class Answer(BaseModel):\n", " \"\"\"Schema for question 
answers\"\"\"\n", - " answer: Literal[\"option_a\", \"option_b\", \"option_c\", \"option_d\"] = Field(description=\"The correct answer to the question\")\n" + "\n", + " answer: Literal[\"option_a\", \"option_b\", \"option_c\", \"option_d\"] = Field(\n", + " description=\"The correct answer to the question\"\n", + " )\n" ] }, { @@ -427,7 +436,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -441,7 +450,8 @@ " name=\"difficulty\",\n", " sampler_type=SamplerType.CATEGORY,\n", " params=CategorySamplerParams(values=[\"easy\", \"medium\", \"hard\"]),\n", - " ))\n" + " )\n", + ")\n" ] }, { @@ -449,11 +459,13 @@ "metadata": {}, "source": [ "## 🦜 Adding LLM Generated columns\n", - "Now define the columns that the model will generate. These prompts instruct the LLM to produce: \n", + "\n", + "Now define the columns that the model will generate. These prompts instruct the LLM to produce:\n", + "\n", "- question\n", "- options\n", "- topic\n", - "- answer" + "- answer\n" ] }, { @@ -466,13 +478,17 @@ " LLMStructuredColumnConfig(\n", " name=\"question\",\n", " model_alias=MODEL_ALIAS,\n", - " prompt=(\"Generate a question based on the following context: {{ summary }}. \"\n", - " \"The difficulty of the generated question should be {{ difficulty }}\"),\n", - " system_prompt=(\"You are a helpful assistant that generates questions based on the given context. \"\n", - " \"The context are sourced from documents pertaining to the petroleum industry. \"\n", - " \"You will be given a context and you will need to generate a question based on the context. 
\"\n", - " \"The difficulty of the generated question should be {{ difficulty }}\"\n", - " \"Ensure you generate just the question and no other text.\"),\n", + " prompt=(\n", + " \"Generate a question based on the following context: {{ summary }}. \"\n", + " \"The difficulty of the generated question should be {{ difficulty }}\"\n", + " ),\n", + " system_prompt=(\n", + " \"You are a helpful assistant that generates questions based on the given context. \"\n", + " \"The context are sourced from documents pertaining to the petroleum industry. \"\n", + " \"You will be given a context and you will need to generate a question based on the context. \"\n", + " \"The difficulty of the generated question should be {{ difficulty }}\"\n", + " \"Ensure you generate just the question and no other text.\"\n", + " ),\n", " output_format=Question,\n", " )\n", ")\n", @@ -481,8 +497,10 @@ " LLMStructuredColumnConfig(\n", " name=\"options\",\n", " model_alias=MODEL_ALIAS,\n", - " prompt=(\"Generate four answer choices for the question: {{ question }} based on the following context: {{ summary }}. \"\n", - " \"The option you generate should match the difficulty of the generated question, {{ difficulty }}.\"),\n", + " prompt=(\n", + " \"Generate four answer choices for the question: {{ question }} based on the following context: {{ summary }}. \"\n", + " \"The option you generate should match the difficulty of the generated question, {{ difficulty }}.\"\n", + " ),\n", " output_format=Options,\n", " )\n", ")\n", @@ -492,8 +510,10 @@ " LLMStructuredColumnConfig(\n", " name=\"answer\",\n", " model_alias=MODEL_ALIAS,\n", - " prompt=(\"Choose the correct answer for the question: {{ question }} based on the following context: {{ summary }}\"\n", - " \"and options choices. The options are {{ options }}. 
Only select one of the options as the answer.\"),\n", + " prompt=(\n", + " \"Choose the correct answer for the question: {{ question }} based on the following context: {{ summary }}\"\n", + " \"and options choices. The options are {{ options }}. Only select one of the options as the answer.\"\n", + " ),\n", " output_format=Answer,\n", " )\n", ")\n", @@ -503,10 +523,14 @@ " LLMStructuredColumnConfig(\n", " name=\"topic\",\n", " model_alias=MODEL_ALIAS,\n", - " system_prompt=(\"Generate a short 1-3 word topic for the question: {{ question }} \"\n", - " \"based on the given context. {{ summary }}\"),\n", - " prompt=(\"Generate the topic of the question: {{ question }} based on the following context: {{ summary }}\"\n", - " \"The topic should be a single word or phrase that is relevant to the question and context. \"),\n", + " system_prompt=(\n", + " \"Generate a short 1-3 word topic for the question: {{ question }} \"\n", + " \"based on the given context. {{ summary }}\"\n", + " ),\n", + " prompt=(\n", + " \"Generate the topic of the question: {{ question }} based on the following context: {{ summary }}\"\n", + " \"The topic should be a single word or phrase that is relevant to the question and context. \"\n", + " ),\n", " output_format=QuestionTopic,\n", " )\n", ")\n" @@ -524,7 +548,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. 
Re-run the preview until satisfied.\n" ] }, { @@ -572,7 +596,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### πŸ”Ž View Results" + "### πŸ”Ž View Results\n" ] }, { @@ -586,31 +610,61 @@ "\n", "# Merge preview data with original images for comparison\n", "comparison_dataset = preview.dataset.merge(\n", - " pd.DataFrame(img_dataset)[[\"uuid\", \"image\"]],\n", - " how=\"left\",\n", - " on=\"uuid\"\n", + " pd.DataFrame(img_dataset)[[\"uuid\", \"image\"]], how=\"left\", on=\"uuid\"\n", ")\n", "\n", "print(\"πŸ“„ Original Document Image:\")\n", "display(resize_image(comparison_dataset.image[index], BASE64_IMAGE_HEIGHT))\n", "\n", "print(\"\\nπŸ“ Generated Summary:\")\n", - "rich.print(Panel(comparison_dataset.summary[index], title=\"Document Summary\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " comparison_dataset.summary[index], title=\"Document Summary\", title_align=\"left\"\n", + " )\n", + ")\n", "\n", "print(\"\\nπŸ”’ Generated Difficulty:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.difficulty[index]), title=\"Difficulty\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.difficulty[index]),\n", + " title=\"Difficulty\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n", "\n", "print(\"\\n❓ Generated Question:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.question[index]), title=\"Question\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.question[index]),\n", + " title=\"Question\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n", "\n", "print(\"\\nπŸ”’ Generated Options:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.options[index]), title=\"Answer Choices\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.options[index]),\n", + " title=\"Answer Choices\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n", "\n", "print(\"\\nπŸ”’ Generated 
Topic:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.topic[index]), title=\"Topic\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.topic[index]), title=\"Topic\", title_align=\"left\"\n", + " )\n", + ")\n", "\n", "print(\"\\nβœ… Generated Answer:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.answer[index]), title=\"Correct Answer\", title_align=\"left\"))\n" + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.answer[index]),\n", + " title=\"Correct Answer\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n" ] }, { @@ -686,31 +740,61 @@ "\n", "# Merge preview data with original images for comparison\n", "comparison_dataset = dataset.merge(\n", - " pd.DataFrame(img_dataset)[[\"uuid\", \"image\"]],\n", - " how=\"left\",\n", - " on=\"uuid\"\n", + " pd.DataFrame(img_dataset)[[\"uuid\", \"image\"]], how=\"left\", on=\"uuid\"\n", ")\n", "\n", "print(\"πŸ“„ Original Document Image:\")\n", "display(resize_image(comparison_dataset.image[index], BASE64_IMAGE_HEIGHT))\n", "\n", "print(\"\\nπŸ“ Generated Summary:\")\n", - "rich.print(Panel(comparison_dataset.summary[index], title=\"Document Summary\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " comparison_dataset.summary[index], title=\"Document Summary\", title_align=\"left\"\n", + " )\n", + ")\n", "\n", "print(\"\\nπŸ”’ Generated Difficulty:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.difficulty[index]), title=\"Difficulty\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.difficulty[index]),\n", + " title=\"Difficulty\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n", "\n", "print(\"\\n❓ Generated Question:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.question[index]), title=\"Question\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.question[index]),\n", + " title=\"Question\",\n", + " 
title_align=\"left\",\n", + " )\n", + ")\n", "\n", "print(\"\\nπŸ”’ Generated Options:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.options[index]), title=\"Answer Choices\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.options[index]),\n", + " title=\"Answer Choices\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n", "\n", "print(\"\\nπŸ”’ Generated Topic:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.topic[index]), title=\"Topic\", title_align=\"left\"))\n", + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.topic[index]), title=\"Topic\", title_align=\"left\"\n", + " )\n", + ")\n", "\n", "print(\"\\nβœ… Generated Answer:\")\n", - "rich.print(Panel(json.dumps(comparison_dataset.answer[index]), title=\"Correct Answer\", title_align=\"left\"))\n" + "rich.print(\n", + " Panel(\n", + " json.dumps(comparison_dataset.answer[index]),\n", + " title=\"Correct Answer\",\n", + " title_align=\"left\",\n", + " )\n", + ")\n" ] } ], diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb index 3d95057e3..c2f4b421c 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/person-samplers/person-sampler-tutorial.ipynb @@ -7,8 +7,6 @@ "source": [ "# πŸ§‘β€πŸ€β€πŸ§‘ NeMo Data Designer: Person Sampler Tutorial\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "### πŸ“š What you'll learn\n", "\n", "In this notebook, we'll explore how you can generate realistic personal information for your synthetic datasets.\n", @@ -24,11 +22,12 @@ ">\n", "> - For deployment instructions, 
see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).\n", "\n", - "
\n", + "
\n", "\n", "## What is the Person Sampler?\n", "\n", "The Person Sampler is a powerful feature in NeMo Data Designer that generates consistent, realistic person records with attributes like:\n", + "\n", "- Names (first, middle, last)\n", "- Contact information (email, phone)\n", "- Addresses (street, city, state, zip)\n", @@ -67,7 +66,7 @@ " NeMoDataDesignerClient,\n", " PersonSamplerParams,\n", " SamplerColumnConfig,\n", - " SamplerType\n", + " SamplerType,\n", ")" ] }, @@ -173,7 +172,7 @@ "source": [ "### 1. Basic Person Sampling\n", "\n", - "Let's start with a simple example of generating person data using the default settings." + "Let's start with a simple example of generating person data using the default settings.\n" ] }, { @@ -203,7 +202,7 @@ "source": [ "### 2. Accessing Individual Person Attributes\n", "\n", - "The `person` column we created above is a nested object with many attributes. Let's create some columns to access specific attributes from this person object." + "The `person` column we created above is a nested object with many attributes. 
Let's create some columns to access specific attributes from this person object.\n" ] }, { @@ -217,30 +216,24 @@ "config_builder.add_column(\n", " name=\"full_name\",\n", " column_type=\"expression\",\n", - " expr=\"{{ person.first_name }} {{ person.last_name }}\"\n", + " expr=\"{{ person.first_name }} {{ person.last_name }}\",\n", ")\n", "\n", "config_builder.add_column(\n", - " name=\"email\",\n", - " column_type=\"expression\",\n", - " expr=\"{{ person.email_address }}\"\n", + " name=\"email\", column_type=\"expression\", expr=\"{{ person.email_address }}\"\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"address\",\n", " column_type=\"expression\",\n", - " expr=\"{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}\"\n", + " expr=\"{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}\",\n", ")\n", "\n", - "config_builder.add_column(\n", - " name=\"age\",\n", - " column_type=\"expression\",\n", - " expr=\"{{ person.age }}\"\n", - ")\n", + "config_builder.add_column(name=\"age\", column_type=\"expression\", expr=\"{{ person.age }}\")\n", "\n", "# Preview the results\n", "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['full_name', 'email', 'address', 'age']]" + "preview.dataset[[\"full_name\", \"email\", \"address\", \"age\"]]" ] }, { @@ -253,7 +246,7 @@ "- Now let's explore customizing the Person Sampler to generate specific types of profiles.\n", "\n", "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", - " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker\n" ] }, { @@ -271,21 +264,14 @@ " name=\"employee\",\n", " 
column_type=\"sampler\",\n", " sampler_type=\"person\",\n", - " params={\n", - " \"locale\": \"en_US\",\n", - " \"age_range\": [22, 65],\n", - " \"state\": \"CA\"\n", - " }\n", + " params={\"locale\": \"en_US\", \"age_range\": [22, 65], \"state\": \"CA\"},\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"customer\",\n", " column_type=\"sampler\",\n", " sampler_type=\"person\",\n", - " params={\n", - " \"locale\": \"en_US\",\n", - " \"age_range\": [18, 80]\n", - " }\n", + " params={\"locale\": \"en_US\", \"age_range\": [18, 80]},\n", ")\n", "\n", "# Create a UK-based person\n", @@ -295,32 +281,32 @@ " sampler_type=\"person\",\n", " params={\n", " \"locale\": \"en_GB\", # UK locale\n", - " \"city\": \"London\"\n", - " }\n", + " \"city\": \"London\",\n", + " },\n", ")\n", "\n", "# Add columns to extract and format information\n", "config_builder.add_column(\n", " name=\"employee_info\",\n", " column_type=\"expression\",\n", - " expr=\"{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}\"\n", + " expr=\"{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"customer_info\",\n", " column_type=\"expression\",\n", - " expr=\"{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}\"\n", + " expr=\"{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"uk_contact_info\",\n", " column_type=\"expression\",\n", - " expr=\"{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number }} - {{ uk_contact.city }}\"\n", + " expr=\"{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number }} - {{ uk_contact.city }}\",\n", ")\n", "\n", "# Preview the results\n", "preview = 
data_designer_client.preview(config_builder)\n", - "preview.dataset[['employee_info', 'customer_info', 'uk_contact_info']]" + "preview.dataset[[\"employee_info\", \"customer_info\", \"uk_contact_info\"]]" ] }, { @@ -330,7 +316,7 @@ "source": [ "### 4. Available Person Attributes\n", "\n", - "The Person Sampler generates a rich set of attributes that you can use. Here's a reference list of some of the key attributes available:" + "The Person Sampler generates a rich set of attributes that you can use. Here's a reference list of some of the key attributes available:\n" ] }, { @@ -338,30 +324,30 @@ "id": "3f01fc83", "metadata": {}, "source": [ - "| Attribute | Description | Example |\n", - "|-----------|-------------|--------|\n", - "| `first_name` | Person's first name | \"John\" |\n", - "| `middle_name` | Person's middle name (may be None) | \"Robert\" |\n", - "| `last_name` | Person's last name | \"Smith\" |\n", - "| `sex` | Person's sex | \"Male\" |\n", - "| `age` | Person's age in years | 42 |\n", - "| `birth_date` | Date of birth | \"1980-05-15\" |\n", - "| `email_address` | Email address | \"john.smith@example.com\" |\n", - "| `phone_number` | Phone number | \"+1 (555) 123-4567\" |\n", - "| `street_number` | Street number | \"123\" |\n", - "| `street_name` | Street name | \"Main Street\" |\n", - "| `unit` | Apartment/unit number | \"Apt 4B\" |\n", - "| `city` | City name | \"Chicago\" |\n", - "| `state` | State/province (locale dependent) | \"IL\" |\n", - "| `county` | County (locale dependent) | \"Cook\" |\n", - "| `zipcode` | Postal/ZIP code | \"60601\" |\n", - "| `country` | Country name | \"United States\" |\n", - "| `ssn` | Social Security Number (US locale) | \"123-45-6789\" |\n", - "| `occupation` | Occupation | \"Software Engineer\" |\n", - "| `marital_status` | Marital status | \"Married\" |\n", - "| `education_level` | Education level | \"Bachelor's Degree\" |\n", - "| `ethnic_background` | Ethnic background | \"Caucasian\" |\n", - "| `uuid` | Unique 
identifier | \"550e8400-e29b-41d4-a716-446655440000\" |" + "| Attribute | Description | Example |\n", + "| ------------------- | ---------------------------------- | -------------------------------------- |\n", + "| `first_name` | Person's first name | \"John\" |\n", + "| `middle_name` | Person's middle name (may be None) | \"Robert\" |\n", + "| `last_name` | Person's last name | \"Smith\" |\n", + "| `sex` | Person's sex | \"Male\" |\n", + "| `age` | Person's age in years | 42 |\n", + "| `birth_date` | Date of birth | \"1980-05-15\" |\n", + "| `email_address` | Email address | \"john.smith@example.com\" |\n", + "| `phone_number` | Phone number | \"+1 (555) 123-4567\" |\n", + "| `street_number` | Street number | \"123\" |\n", + "| `street_name` | Street name | \"Main Street\" |\n", + "| `unit` | Apartment/unit number | \"Apt 4B\" |\n", + "| `city` | City name | \"Chicago\" |\n", + "| `state` | State/province (locale dependent) | \"IL\" |\n", + "| `county` | County (locale dependent) | \"Cook\" |\n", + "| `zipcode` | Postal/ZIP code | \"60601\" |\n", + "| `country` | Country name | \"United States\" |\n", + "| `ssn` | Social Security Number (US locale) | \"123-45-6789\" |\n", + "| `occupation` | Occupation | \"Software Engineer\" |\n", + "| `marital_status` | Marital status | \"Married\" |\n", + "| `education_level` | Education level | \"Bachelor's Degree\" |\n", + "| `ethnic_background` | Ethnic background | \"Caucasian\" |\n", + "| `uuid` | Unique identifier | \"550e8400-e29b-41d4-a716-446655440000\" |\n" ] }, { @@ -371,7 +357,7 @@ "source": [ "### 5. Creating Multiple Person Samplers with One Method\n", "\n", - "For convenience, Data Designer provides a `with_person_samplers` method to create multiple person samplers at once." 
+ "For convenience, Data Designer provides a `with_person_samplers` method to create multiple person samplers at once.\n" ] }, { @@ -389,58 +375,65 @@ " name=\"doctor\",\n", " column_type=\"sampler\",\n", " sampler_type=\"person\",\n", - " params={\"locale\": \"en_US\", \"age_range\": [30, 70]}\n", + " params={\"locale\": \"en_US\", \"age_range\": [30, 70]},\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"patient\",\n", " column_type=\"sampler\",\n", " sampler_type=\"person\",\n", - " params={\"locale\": \"en_US\", \"age_range\": [18, 90]}\n", + " params={\"locale\": \"en_US\", \"age_range\": [18, 90]},\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"nurse\",\n", " column_type=\"sampler\",\n", " sampler_type=\"person\",\n", - " params={\"locale\": \"en_US\", \"age_range\": [25, 65], \"sex\": \"Female\"}\n", + " params={\"locale\": \"en_US\", \"age_range\": [25, 65], \"sex\": \"Female\"},\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"international_doctor\",\n", " column_type=\"sampler\",\n", " sampler_type=\"person\",\n", - " params={\"locale\": \"fr_FR\", \"age_range\": [35, 65]}\n", + " params={\"locale\": \"fr_FR\", \"age_range\": [35, 65]},\n", ")\n", "\n", "# Add columns to format information for each person type\n", "config_builder.add_column(\n", " name=\"doctor_profile\",\n", " column_type=\"expression\",\n", - " expr=\"Dr. {{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}\"\n", + " expr=\"Dr. 
{{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"patient_profile\",\n", " column_type=\"expression\",\n", - " expr=\"{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}\"\n", + " expr=\"{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"nurse_profile\",\n", " column_type=\"expression\",\n", - " expr=\"Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}\"\n", + " expr=\"Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}\",\n", ")\n", "\n", "config_builder.add_column(\n", " name=\"international_doctor_profile\",\n", " column_type=\"expression\",\n", - " expr=\"Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}\"\n", + " expr=\"Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}\",\n", ")\n", "\n", "# Preview the results\n", "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['doctor_profile', 'patient_profile', 'nurse_profile', 'international_doctor_profile']]" + "preview.dataset[\n", + " [\n", + " \"doctor_profile\",\n", + " \"patient_profile\",\n", + " \"nurse_profile\",\n", + " \"international_doctor_profile\",\n", + " ]\n", + "]" ] }, { @@ -450,7 +443,7 @@ "source": [ "## 6. Using Person Data with LLM Generation\n", "\n", - "One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content." 
+ "One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content.\n" ] }, { @@ -542,7 +535,15 @@ "\n", "# Preview the results\n", "preview = data_designer_client.preview(config_builder)\n", - "preview.dataset[['patient_name', 'doctor_name', 'medical_condition', 'medical_notes', 'patient_message']]" + "preview.dataset[\n", + " [\n", + " \"patient_name\",\n", + " \"doctor_name\",\n", + " \"medical_condition\",\n", + " \"medical_notes\",\n", + " \"patient_message\",\n", + " ]\n", + "]" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb index 116c87dfc..0a9dd3ae2 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/qa-generation/product-question-answer-generator.ipynb @@ -7,11 +7,9 @@ "source": [ "# 🎨 NeMo Data Designer: Product Information Dataset Generator with Q&A\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", - "This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers. \n", + "This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers.\n", "\n", "
\n", "\n", @@ -164,7 +162,7 @@ "source": [ "## πŸ—οΈ Defining Data Structures\n", "\n", - "Now we'll define the data models and evaluation rubrics for our product information dataset." + "Now we'll define the data models and evaluation rubrics for our product information dataset.\n" ] }, { @@ -178,12 +176,20 @@ "from pydantic import BaseModel\n", "from pydantic import Field\n", "\n", + "\n", "# Define product information structure\n", "class ProductInfo(BaseModel):\n", - " product_name: str = Field(..., description=\"A realistic product name for the market.\")\n", - " key_features: list[str] = Field(..., min_length=1, max_length=3, description=\"Key product features.\")\n", - " description: str = Field(..., description=\"A short, engaging description of what the product does, highlighting a unique but believable feature.\")\n", - " price_usd: float = Field(..., description=\"The stated price in USD.\")" + " product_name: str = Field(\n", + " ..., description=\"A realistic product name for the market.\"\n", + " )\n", + " key_features: list[str] = Field(\n", + " ..., min_length=1, max_length=3, description=\"Key product features.\"\n", + " )\n", + " description: str = Field(\n", + " ...,\n", + " description=\"A short, engaging description of what the product does, highlighting a unique but believable feature.\",\n", + " )\n", + " price_usd: float = Field(..., description=\"The stated price in USD.\")" ] }, { @@ -195,7 +201,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." 
+ "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -210,7 +216,8 @@ " SamplerColumnConfig(\n", " name=\"category\",\n", " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(values=[\n", + " params=CategorySamplerParams(\n", + " values=[\n", " \"Electronics\",\n", " \"Clothing\",\n", " \"Home Appliances\",\n", @@ -237,7 +244,7 @@ " \"Software\",\n", " \"Tech Devices\",\n", " ]\n", - " )\n", + " ),\n", " )\n", ")\n", "\n", @@ -306,7 +313,7 @@ " \"Generate a realistic product description for a product in the {{ category }} \"\n", " \"category that costs {{ product_price }}.\\n\"\n", " \"The name of the product MUST start with the letter {{ first_letter }}.\\n\"\n", - " ),\n", + " ),\n", " output_format=ProductInfo,\n", " )\n", ")\n", @@ -333,15 +340,13 @@ " \"\\n\"\n", " \"{{ product_info }}\\n\"\n", " \"\\n\"\n", - "\n", " \"{%- endif -%}\\n\"\n", " \"User Question: {{ question }}\\n\"\n", - "\n", " \"Directly and succinctly answer the user's question.\\n\"\n", " \"{%- if is_hallucination == 1 -%}\\n\"\n", " \"Make up whatever information you need to in order to answer the user's request.\\n\"\n", " \"{%- endif -%}\"\n", - " ),\n", + " ),\n", " )\n", ")\n" ] @@ -354,10 +359,10 @@ "## πŸ” Quality Assessment: LLM-as-a-Judge\n", "\n", "When generating our synthetic dataset, we need to determine the quality of the generated data \\\n", - "We use the LLM-as-a-Judge strategy to do this. \n", + "We use the LLM-as-a-Judge strategy to do this.\n", "\n", - "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", - "that provides relavant instructions. 
" + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt\n", + "that provides relavant instructions.\n" ] }, { @@ -375,7 +380,7 @@ " \"Complete\": \"The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.\",\n", " \"PartiallyComplete\": \"The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.\",\n", " \"Incomplete\": \"The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.\",\n", - " }\n", + " },\n", ")\n", "\n", "AccuracyRubric = Score(\n", @@ -385,7 +390,7 @@ " \"Accurate\": \"The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.\",\n", " \"PartiallyAccurate\": \"While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.\",\n", " \"Inaccurate\": \"The response presents significantly wrong information about the product, with claims that contradict the actual product details.\",\n", - " }\n", + " },\n", ")\n", "\n", "\n", @@ -398,10 +403,8 @@ " \"\\n\"\n", " \"{{ product_info }}\\n\"\n", " \"\\n\"\n", - "\n", " \"User Question: {{question }}\\n\"\n", " \"AI Assistant Answer: {{ answer }}\\n\"\n", - "\n", " \"Judge the AI assistant's response to the user's question about the product described in .\"\n", " ),\n", " scores=[CompletenessRubric, AccuracyRubric],\n", @@ -438,7 +441,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. 
Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb index 737ed68a0..396f1c20b 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/rag-examples/generate-rag-generation-eval-dataset.ipynb @@ -7,12 +7,9 @@ "source": [ "# 🎨 NeMo Data Designer: Generate Diverse RAG Evaluations\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", - "This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases. \n", - "\n", + "This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases.\n", "\n", "
\n", "\n", @@ -163,11 +160,11 @@ "source": [ "## 🌱 Loading Seed Data\n", "\n", - "- We'll use the symptom-to-diagnosis dataset as our seed data. \n", + "- We'll use the symptom-to-diagnosis dataset as our seed data.\n", "\n", "- This dataset contains patient symptoms and corresponding diagnoses which will help generate realistic medical scenarios.\n", "\n", - "
\n", + "
\n", "\n", "> 🌱 **Why use a seed dataset?**\n", ">\n", @@ -187,16 +184,15 @@ ">\n", "> - The datastore endpoint is specified in the deployment configuration.\n", "\n", - "\n", "πŸ‘‹ **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as \\\n", - "seeds, it is recommended you consolidated these into a single file. \n", + "seeds, it is recommended you consolidated these into a single file.\n", "
\n", "\n", "### βš™οΈ Document Processing\n", "\n", - "Now we'll create a Document Processor class that handles loading and chunking the source documents. \n", + "Now we'll create a Document Processor class that handles loading and chunking the source documents.\n", "\n", - "This class uses langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing." + "This class uses langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing.\n" ] }, { @@ -233,7 +229,7 @@ "\n", " def parse_document(self, uri: str) -> str:\n", " \"\"\"Parse a single document from URI into raw text.\"\"\"\n", - " with open(uri, 'rb') as file:\n", + " with open(uri, \"rb\") as file:\n", " content = file.read()\n", " with tempfile.NamedTemporaryFile(delete=False) as temp_file:\n", " temp_file.write(content)\n", @@ -264,9 +260,9 @@ "source": [ "### πŸ—οΈ Data Models\n", "\n", - "- Let's define Pydantic models for structured output generation. \n", + "- Let's define Pydantic models for structured output generation.\n", "\n", - "- These schemas will ensure our generated data has consistent structure and validation." 
+ "- These schemas will ensure our generated data has consistent structure and validation.\n" ] }, { @@ -278,15 +274,18 @@ "source": [ "from pydantic import BaseModel, Field\n", "\n", + "\n", "class QAPair(BaseModel):\n", " question: str = Field(\n", " ..., description=\"A specific question related to the domain of the context\"\n", " )\n", " answer: str = Field(\n", - " ..., description=\"Either a context-supported answer or explanation of why the question cannot be answered\"\n", + " ...,\n", + " description=\"Either a context-supported answer or explanation of why the question cannot be answered\",\n", " )\n", " reasoning: str = Field(\n", - " ..., description=\"A clear and traceable explanation of the reasoning behind the answer\"\n", + " ...,\n", + " description=\"A clear and traceable explanation of the reasoning behind the answer\",\n", " )" ] }, @@ -324,7 +323,8 @@ "dataset_reference = data_designer_client.upload_seed_dataset(\n", " repo_id=\"data-designer-demo/rag-evaluation-dataset\",\n", " dataset=seed_df,\n", - " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"})\n", + " datastore_settings={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", + ")\n", "\n", "config_builder.with_seed_dataset(dataset_reference)" ] @@ -342,7 +342,7 @@ "\n", "2. **Reasoning types**: factual recall, inferential reasoning, etc.\n", "\n", - "3. **Question types**: answerable vs. unanswerable (with weighting)" + "3. **Question types**: answerable vs. unanswerable (with weighting)\n" ] }, { @@ -400,7 +400,7 @@ "## 🦜 Adding LLM-Structured Column for Q&A Pair Generation\n", "\n", "Now let's set up the core of our data generation: the Q&A pair column that will produce structured question-answer \\\n", - "pairs based on our document context and control parameters." 
+ "pairs based on our document context and control parameters.\n" ] }, { @@ -442,10 +442,10 @@ "## πŸ” Quality Assessment: LLM-as-a-Judge\n", "\n", "When generating our synthetic dataset, we need to determine the quality of the generated data \\\n", - "We use the LLM-as-a-Judge strategy to do this. \n", + "We use the LLM-as-a-Judge strategy to do this.\n", "\n", - "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", - "that provides relavant instructions. " + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt\n", + "that provides relavant instructions.\n" ] }, { @@ -540,7 +540,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb index 37659973a..beb504ddf 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/reasoning/reasoning-traces.ipynb @@ -7,17 +7,15 @@ "source": [ "# 🧠 NeMo Data Designer: Synthetic Reasoning Traces\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", - "- This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks. 
\n", + "- This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks.\n", "\n", "- Instead of creating multi-turn conversations, we will generate reasoning traces that can be utilized for training and \\\n", - "fine-tuning language models with reinforcement learning techniques and invoking chain-of-thought processing.\n", + " fine-tuning language models with reinforcement learning techniques and invoking chain-of-thought processing.\n", "\n", "- These synthetic reasoning traces can be used to enhance model performance in areas such as mathematics, coding, scientific \\\n", - "reasoning, and other domains that benefit from structured reasoning.\n", + " reasoning, and other domains that benefit from structured reasoning.\n", "\n", "
\n", "\n", @@ -170,7 +168,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -192,9 +190,9 @@ " \"Friendship Moments\",\n", " \"Community Interactions\",\n", " \"Personal Well-being\",\n", - " \"Unexpected Encounters\"\n", + " \"Unexpected Encounters\",\n", " ]\n", - " )\n", + " ),\n", " )\n", ")\n", "\n", @@ -206,32 +204,26 @@ " params=SubcategorySamplerParams(\n", " category=\"domain\",\n", " values={\n", - " \"Family Dynamics\": [\n", - " \"Parenting Dilemmas\",\n", - " \"Sibling Rivalries\"\n", - " ],\n", + " \"Family Dynamics\": [\"Parenting Dilemmas\", \"Sibling Rivalries\"],\n", " \"Workplace Challenges\": [\n", " \"Communication Breakdowns\",\n", - " \"Leadership Dilemmas\"\n", + " \"Leadership Dilemmas\",\n", " ],\n", " \"Friendship Moments\": [\n", " \"Support & Understanding\",\n", - " \"Misunderstandings & Reconciliations\"\n", + " \"Misunderstandings & Reconciliations\",\n", " ],\n", " \"Community Interactions\": [\n", " \"Neighborhood Support\",\n", - " \"Cultural Celebrations\"\n", - " ],\n", - " \"Personal Well-being\": [\n", - " \"Mental Health\",\n", - " \"Self-care & Reflection\"\n", + " \"Cultural Celebrations\",\n", " ],\n", + " \"Personal Well-being\": [\"Mental Health\", \"Self-care & Reflection\"],\n", " \"Unexpected Encounters\": [\n", " \"Serendipitous Meetings\",\n", - " \"Moments of Realization\"\n", - " ]\n", - " }\n", - " )\n", + " \"Moments of Realization\",\n", + " ],\n", + " },\n", + " ),\n", " )\n", ")\n", "\n", @@ -240,9 +232,7 @@ " SamplerColumnConfig(\n", " name=\"complexity\",\n", " sampler_type=SamplerType.CATEGORY,\n", - " params=CategorySamplerParams(\n", - " values=[\"Basic\", \"Intermediate\", \"Advanced\"]\n", - " )\n", + " 
params=CategorySamplerParams(values=[\"Basic\", \"Intermediate\", \"Advanced\"]),\n", " )\n", ")" ] @@ -259,16 +249,17 @@ "- As we see below, nested json fields can be accessed using dot notation.\n", "\n", "- These prompts instruct the LLM to produce the actual empathic reasoning trace and answer, following the specified format with and tags.\n", - "
\n", + "
\n", "\n", "### 🧠 Empathic Reasoning Trace Generation\n", "\n", "This column is designed to generate clear, thoughtful reasoning traces that blend logical analysis with emotional insight for everyday situations \\\n", "where empathy is crucial. The generation prompt is tailored to:\n", + "\n", "- Produce a structured explanation that highlights both the practical reasoning and the emotional dynamics at play.\n", "\n", "- Encourage a dual output: one part detailing the empathic thought process (enclosed within `` tags) and another delivering a \\\n", - "compassionate final answer (enclosed within `` tags).\n", + " compassionate final answer (enclosed within `` tags).\n", "\n", "- Ensure that the generated content reflects deep understanding, compassion, and a balanced view of the challenges and emotions involved.\n" ] @@ -315,7 +306,7 @@ " \"the underlying issue, and how might empathy help mend the situation?'\\n\"\n", " \"3. 'Picture a colleague receiving unexpected criticism during a meeting. What are the potential emotional \"\n", " \"impacts, and what supportive response could be helpful?'\\n\"\n", - " )\n", + " ),\n", " )\n", ")" ] @@ -327,11 +318,11 @@ "source": [ "### ⚑️ Empathic Reasoning Process Generation\n", "\n", - "- These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios. \n", + "- These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios.\n", "\n", - "- The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight. \n", + "- The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight.\n", "\n", - "- The prompts instruct the model to include its internal thought process within ... tags before providing the JSON output." + "- The prompts instruct the model to include its internal thought process within ... 
tags before providing the JSON output.\n" ] }, { @@ -344,37 +335,76 @@ "from typing import List\n", "from pydantic import BaseModel, Field\n", "\n", + "\n", "class Thought(BaseModel):\n", " \"\"\"A single step in the structured empathic reasoning process.\n", " This step captures an empathetic observation or insight that informs a thoughtful, compassionate approach to addressing everyday challenges.\n", " \"\"\"\n", - " step_number: int = Field(..., ge=1, description=\"The order of the reasoning step, starting from 1.\")\n", - " content: str = Field(..., min_length=5, description=(\"A detailed explanation of this reasoning step, incorporating both logical analysis and emotional insight.\"))\n", + "\n", + " step_number: int = Field(\n", + " ..., ge=1, description=\"The order of the reasoning step, starting from 1.\"\n", + " )\n", + " content: str = Field(\n", + " ...,\n", + " min_length=5,\n", + " description=(\n", + " \"A detailed explanation of this reasoning step, incorporating both logical analysis and emotional insight.\"\n", + " ),\n", + " )\n", + "\n", "\n", "class ReasoningTrace(BaseModel):\n", " \"\"\"A structured empathic reasoning trace for addressing a scenario.\n", " This model records a step-by-step process that integrates logical analysis with emotional insight and empathy to arrive at a supportive final answer.\n", " \"\"\"\n", - " reasoning: List[Thought] = Field(..., description=\"Step-by-step reasoning leading to the final answer, enriched with empathetic observations and practical insights.\")\n", - " answer: str = Field(..., description=\"The final answer derived from the empathic reasoning process, offering compassionate guidance or resolution.\")\n", + "\n", + " reasoning: List[Thought] = Field(\n", + " ...,\n", + " description=\"Step-by-step reasoning leading to the final answer, enriched with empathetic observations and practical insights.\",\n", + " )\n", + " answer: str = Field(\n", + " ...,\n", + " description=\"The final answer derived 
from the empathic reasoning process, offering compassionate guidance or resolution.\",\n", + " )\n", + "\n", "\n", "class Evaluation(BaseModel):\n", " \"\"\"Output format for evaluating an empathic reasoning answer.\n", " The evaluation assesses the response based on correctness, clarity, and completeness,\n", " with feedback that emphasizes compassionate insight, clarity, and a holistic understanding of the scenario.\n", " \"\"\"\n", - " correctness: float = Field(..., description=\"Overall correctness rating of the answer (0 to 1).\")\n", - " clarity: float = Field(..., description=\"Clarity rating of the reasoning, including the integration of empathic explanations (0 to 1).\")\n", - " completeness: float = Field(..., description=\"Completeness rating of the reasoning, assessing whether all practical and emotional aspects were considered (0 to 1).\")\n", - " feedback: str = Field(..., description=\"Detailed feedback on the reasoning trace and answer, with suggestions for enhancing empathetic and real-world applicability.\")\n", + "\n", + " correctness: float = Field(\n", + " ..., description=\"Overall correctness rating of the answer (0 to 1).\"\n", + " )\n", + " clarity: float = Field(\n", + " ...,\n", + " description=\"Clarity rating of the reasoning, including the integration of empathic explanations (0 to 1).\",\n", + " )\n", + " completeness: float = Field(\n", + " ...,\n", + " description=\"Completeness rating of the reasoning, assessing whether all practical and emotional aspects were considered (0 to 1).\",\n", + " )\n", + " feedback: str = Field(\n", + " ...,\n", + " description=\"Detailed feedback on the reasoning trace and answer, with suggestions for enhancing empathetic and real-world applicability.\",\n", + " )\n", + "\n", "\n", "class FinalEvaluation(Evaluation):\n", " \"\"\"Extended evaluation model for final empathic reasoning traces.\n", " This model adds criteria to assess visual structure and conciseness,\n", " ensuring the final output is 
both clear and visually appealing.\n", " \"\"\"\n", - " structure: float = Field(..., description=\"Rating of the visual structure and formatting (0 to 1), assessing if reasoning steps and final answer are clearly delineated.\")\n", - " conciseness: float = Field(..., description=\"Rating of the conciseness of the reasoning trace (0 to 1), ensuring that extraneous verbosity is minimized.\")" + "\n", + " structure: float = Field(\n", + " ...,\n", + " description=\"Rating of the visual structure and formatting (0 to 1), assessing if reasoning steps and final answer are clearly delineated.\",\n", + " )\n", + " conciseness: float = Field(\n", + " ...,\n", + " description=\"Rating of the conciseness of the reasoning trace (0 to 1), ensuring that extraneous verbosity is minimized.\",\n", + " )" ] }, { @@ -398,7 +428,7 @@ " \"Scenario: {{scenario}}\\n\\n\"\n", " \"Ensure that your response is structured and reflective of a supportive, empathetic approach.\"\n", " ),\n", - " output_format=ReasoningTrace\n", + " output_format=ReasoningTrace,\n", " )\n", ")\n", "\n", @@ -414,7 +444,7 @@ " \"Evaluate the response with a focus on emotional insight, clarity, and holistic consideration.\\n\\n\"\n", " \"Include your internal thought process within ... tags before providing the JSON.\"\n", " ),\n", - " output_format=Evaluation\n", + " output_format=Evaluation,\n", " )\n", ")" ] @@ -426,13 +456,13 @@ "source": [ "### Final Empathic Reasoning Trace Generation and Evaluation\n", "\n", - "- These columns refine and evaluate the final empathic reasoning trace. \n", + "- These columns refine and evaluate the final empathic reasoning trace.\n", "\n", - "- The final trace is generated by reviewing the scenario, your initial empathic reasoning trace, and its evaluation. 
\n", + "- The final trace is generated by reviewing the scenario, your initial empathic reasoning trace, and its evaluation.\n", "\n", - "- The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive. \n", + "- The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive.\n", "\n", - "- As always, include your internal thought process wrapped within ... tags before providing the final JSON output." + "- As always, include your internal thought process wrapped within ... tags before providing the final JSON output.\n" ] }, { @@ -464,7 +494,7 @@ " \"Also, include your internal thought process wrapped within ... tags. \"\n", " \"Return only the final, visually structured reasoning trace.\"\n", " ),\n", - " output_format=ReasoningTrace\n", + " output_format=ReasoningTrace,\n", " )\n", ")\n", "\n", @@ -483,7 +513,7 @@ " \"the final answer is distinct and succinct.\\n\\n\"\n", " \"Include your internal thought process within ... tags before providing the JSON.\"\n", " ),\n", - " output_format=FinalEvaluation\n", + " output_format=FinalEvaluation,\n", " )\n", ")" ] @@ -501,7 +531,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. 
Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb index 2e39b0532..8c9325c19 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python-evol.ipynb @@ -7,12 +7,10 @@ "source": [ "# πŸ‘¨β€πŸ’» NeMo Data Designer: Text-to-Python with Evolution\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples, \\\n", - "with a focus on evolutionary improvements.\n", + " with a focus on evolutionary improvements.\n", "\n", "- We'll build a system that generates Python code based on natural language instructions, validates it, analyzes issues, and then improves the code based on feedback.\n", "\n", @@ -169,7 +167,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -286,7 +284,7 @@ "source": [ "## 🦜 Define Initial Code Generation\n", "\n", - "First, we'll set up the columns for generating the instruction and initial code implementation using the same approach as in the [text-to-python notebook](./text-to-python-evol.ipynb)." 
+ "First, we'll set up the columns for generating the instruction and initial code implementation using the same approach as in the [text-to-python notebook](./text-to-python-evol.ipynb).\n" ] }, { @@ -301,7 +299,9 @@ " LLMTextColumnConfig(\n", " name=\"instruction\",\n", " model_alias=MODEL_ALIAS,\n", - " system_prompt=(\"You are an expert at generating clear and specific programming tasks.\"),\n", + " system_prompt=(\n", + " \"You are an expert at generating clear and specific programming tasks.\"\n", + " ),\n", " prompt=(\n", " \"Generate an instruction to create Python code that solves a specific problem.\\n\"\n", " \"Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\\n\\n\"\n", @@ -310,7 +310,7 @@ " \"* Code Complexity: Tailor the instruction to the {{code_complexity}} level. Utilize relevant {{code_concept}} where appropriate to match the complexity level.\\n\"\n", " \"* Clarity and Specificity: Make the problem statement clear and unambiguous. 
Provide sufficient context to understand the requirements without being overly verbose.\\n\"\n", " \"* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\\n\"\n", - " )\n", + " ),\n", " )\n", ")\n", "\n", @@ -320,7 +320,9 @@ " name=\"initial_code\",\n", " model_alias=MODEL_ALIAS,\n", " code_lang=CodeLang.PYTHON,\n", - " system_prompt=(\"You are an expert Python programmer who writes clean, efficient, and well-documented code.\"),\n", + " system_prompt=(\n", + " \"You are an expert Python programmer who writes clean, efficient, and well-documented code.\"\n", + " ),\n", " prompt=(\n", " \"Write Python code for the following instruction:\\n\"\n", " \"Instruction: {{instruction}}\\n\\n\"\n", @@ -329,7 +331,7 @@ " \"* Code Validity: Please ensure that your python code is executable and does not contain any errors.\\n\"\n", " \"* Packages: Remember to import any necessary libraries, and to use all libraries you import.\\n\"\n", " \"* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.\\n\"\n", - " )\n", + " ),\n", " )\n", ")" ] @@ -344,31 +346,31 @@ "- Now we'll add validation for the initial code and generate analysis of any issues found.\n", "\n", "- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of \\\n", - "generated code snippets. \n", + " generated code snippets.\n", "\n", "- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the \\\n", - "intended programming language environment. 
\n", + " intended programming language environment.\n", "\n", "- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, \\\n", - "streamlining the process of generating reliable and production-ready data samples.\n", + " streamlining the process of generating reliable and production-ready data samples.\n", "\n", "- NeMo Data Designer supports validation for these languages\n", "\n", - " - Python (CodeLang.PYTHON)\n", + " - Python (CodeLang.PYTHON)\n", "\n", - " - SQL dialects:\n", + " - SQL dialects:\n", "\n", - " - ANSI SQL (CodeLang.SQL_ANSI)\n", + " - ANSI SQL (CodeLang.SQL_ANSI)\n", "\n", - " - MySQL (CodeLang.SQL_MYSQL)\n", + " - MySQL (CodeLang.SQL_MYSQL)\n", "\n", - " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", + " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", "\n", - " - SQLite (CodeLang.SQL_SQLITE)\n", + " - SQLite (CodeLang.SQL_SQLITE)\n", "\n", - " - T-SQL (CodeLang.SQL_TSQL)\n", + " - T-SQL (CodeLang.SQL_TSQL)\n", "\n", - " - BigQuery (CodeLang.SQL_BIGQUERY)" + " - BigQuery (CodeLang.SQL_BIGQUERY)\n" ] }, { @@ -386,7 +388,7 @@ " validator_type=ValidatorType.CODE,\n", " validator_params=CodeValidatorParams(\n", " code_lang=CodeLang.PYTHON,\n", - " )\n", + " ),\n", " )\n", ")\n", "\n", @@ -416,7 +418,7 @@ " \"3. Better adherence to Python best practices\\n\"\n", " \"4. Enhanced documentation\\n\"\n", " \"{% endif %}\\n\"\n", - " )\n", + " ),\n", " )\n", ")" ] @@ -428,7 +430,7 @@ "source": [ "## ⚑️ Code Evolution\n", "\n", - "Next, we'll create the improved version of the code based on the analysis and validation." 
+ "Next, we'll create the improved version of the code based on the analysis and validation.\n" ] }, { @@ -444,8 +446,10 @@ " name=\"improved_code\",\n", " model_alias=MODEL_ALIAS,\n", " code_lang=CodeLang.PYTHON,\n", - " system_prompt=(\"You are an expert Python programmer focused on writing production-quality code \"\n", - " \"that adheres to best practices.\"),\n", + " system_prompt=(\n", + " \"You are an expert Python programmer focused on writing production-quality code \"\n", + " \"that adheres to best practices.\"\n", + " ),\n", " prompt=(\n", " \"Rewrite and improve the following Python code based on the analysis provided.\\n\\n\"\n", " \"ORIGINAL INSTRUCTION:\\n\"\n", @@ -463,7 +467,7 @@ " \"6. Implements proper error handling with specific exception types\\n\"\n", " \"7. Ensures all imports are properly organized and used\\n\\n\"\n", " \"The goal is production-quality code that would pass a professional code review at a {{code_complexity}} level.\\n\"\n", - " )\n", + " ),\n", " )\n", ")\n" ] @@ -476,10 +480,10 @@ "## πŸ” Quality Assessment: LLM-as-a-Judge\n", "\n", "When generating our synthetic dataset, we need to determine the quality of the generated data \\\n", - "We use the LLM-as-a-Judge strategy to do this. \n", + "We use the LLM-as-a-Judge strategy to do this.\n", "\n", - "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", - "that provides relavant instructions. 
" + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt\n", + "that provides relavant instructions.\n" ] }, { @@ -554,7 +558,7 @@ " name=\"code_judge_result\",\n", " model_alias=MODEL_ALIAS,\n", " prompt=TEXT_TO_PYTHON_JUDGE_TEMPLATE,\n", - " scores=python_scoring\n", + " scores=python_scoring,\n", " )\n", ")" ] @@ -582,7 +586,7 @@ " validator_type=ValidatorType.CODE,\n", " validator_params=CodeValidatorParams(\n", " code_lang=CodeLang.PYTHON,\n", - " )\n", + " ),\n", " )\n", ")" ] @@ -600,7 +604,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb index d464fd359..6c26d73c0 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-python.ipynb @@ -7,11 +7,9 @@ "source": [ "# πŸ‘¨β€πŸ’» NeMo Data Designer: Text-to-Python\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", - "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples. 
\n", + "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples.\n", "\n", "- We'll build a system that generates Python code based on natural language instructions, with varying complexity levels and industry focuses.\n", "\n", @@ -168,7 +166,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -293,7 +291,7 @@ "source": [ "## 🦜 Define LLM-Generated Columns\n", "\n", - "Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation." + "Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation.\n" ] }, { @@ -353,10 +351,10 @@ "## πŸ” Quality Assessment: LLM-as-a-Judge\n", "\n", "When generating our synthetic dataset, we need to determine the quality of the generated data \\\n", - "We use the LLM-as-a-Judge strategy to do this. \n", + "We use the LLM-as-a-Judge strategy to do this.\n", "\n", - "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", - "that provides relavant instructions. 
" + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt\n", + "that provides relavant instructions.\n" ] }, { @@ -435,7 +433,7 @@ " name=\"code_judge_result\",\n", " model_alias=MODEL_ALIAS,\n", " prompt=TEXT_TO_PYTHON_JUDGE_TEMPLATE,\n", - " scores=python_scoring\n", + " scores=python_scoring,\n", " )\n", ")" ] @@ -448,31 +446,31 @@ "## ⚑️ Quality Assessment: Code Validation\n", "\n", "- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of \\\n", - "generated code snippets. \n", + " generated code snippets.\n", "\n", "- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the \\\n", - "intended programming language environment. \n", + " intended programming language environment.\n", "\n", "- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, \\\n", - "streamlining the process of generating reliable and production-ready data samples.\n", + " streamlining the process of generating reliable and production-ready data samples.\n", "\n", "- NeMo Data Designer supports validation for these languages\n", "\n", - " - Python (CodeLang.PYTHON)\n", + " - Python (CodeLang.PYTHON)\n", "\n", - " - SQL dialects:\n", + " - SQL dialects:\n", "\n", - " - ANSI SQL (CodeLang.SQL_ANSI)\n", + " - ANSI SQL (CodeLang.SQL_ANSI)\n", "\n", - " - MySQL (CodeLang.SQL_MYSQL)\n", + " - MySQL (CodeLang.SQL_MYSQL)\n", "\n", - " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", + " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", "\n", - " - SQLite (CodeLang.SQL_SQLITE)\n", + " - SQLite (CodeLang.SQL_SQLITE)\n", "\n", - " - T-SQL (CodeLang.SQL_TSQL)\n", + " - T-SQL (CodeLang.SQL_TSQL)\n", "\n", - " - BigQuery (CodeLang.SQL_BIGQUERY)" + " - BigQuery (CodeLang.SQL_BIGQUERY)\n" ] }, { @@ -490,7 +488,7 @@ " 
validator_params=CodeValidatorParams(\n", " code_lang=CodeLang.PYTHON,\n", " ),\n", - " batch_size=100\n", + " batch_size=100,\n", " )\n", ")" ] @@ -508,7 +506,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb index 837a28ae9..76a185e10 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/text-to-code/text-to-sql.ipynb @@ -7,11 +7,9 @@ "source": [ "# πŸ‘¨β€πŸ’» NeMo Data Designer: Text-to-SQL\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", - "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for SQL code examples. \n", + "- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for SQL code examples.\n", "\n", "- We'll build a system that generates SQL code based on natural language instructions, with varying complexity levels and industry focuses.\n", "\n", @@ -168,7 +166,7 @@ "\n", "- Sampler columns offer non-LLM based generation of synthetic data.\n", "\n", - "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below." 
+ "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n" ] }, { @@ -200,20 +198,20 @@ " \"Healthcare\": [\n", " \"Electronic Health Records (EHR) Systems\",\n", " \"Telemedicine Platforms\",\n", - " \"AI-Powered Diagnostic Tools\"\n", + " \"AI-Powered Diagnostic Tools\",\n", " ],\n", " \"Finance\": [\n", " \"Fraud Detection Software\",\n", " \"Automated Trading Systems\",\n", - " \"Personal Finance Apps\"\n", + " \"Personal Finance Apps\",\n", " ],\n", " \"Technology\": [\n", " \"Cloud Computing Platforms\",\n", " \"Artificial Intelligence and Machine Learning Platforms\",\n", - " \"DevOps and CI/CD Tools\"\n", - " ]\n", - " }\n", - " )\n", + " \"DevOps and CI/CD Tools\",\n", + " ],\n", + " },\n", + " ),\n", " )\n", ")\n", "\n", @@ -240,22 +238,22 @@ " \"Basic SELECT Statements\",\n", " \"WHERE Clauses\",\n", " \"Basic JOINs\",\n", - " \"INSERT, UPDATE, DELETE\"\n", + " \"INSERT, UPDATE, DELETE\",\n", " ],\n", " \"Intermediate\": [\n", " \"Aggregation Functions\",\n", " \"Multiple JOINs\",\n", " \"Subqueries\",\n", - " \"Views\"\n", + " \"Views\",\n", " ],\n", " \"Advanced\": [\n", " \"Window Functions\",\n", " \"Common Table Expressions (CTEs)\",\n", " \"Stored Procedures\",\n", - " \"Query Optimization\"\n", - " ]\n", - " }\n", - " )\n", + " \"Query Optimization\",\n", + " ],\n", + " },\n", + " ),\n", " )\n", ")\n", "\n", @@ -269,7 +267,7 @@ " \"Data Retrieval\",\n", " \"Data Manipulation\",\n", " \"Analytics and Reporting\",\n", - " \"Data Transformation\"\n", + " \"Data Transformation\",\n", " ],\n", " ),\n", " )\n", @@ -286,7 +284,7 @@ " \"Create an SQL statement to\",\n", " \"Develop an SQL query to\",\n", " \"Can you write SQL that\",\n", - " \"Formulate an SQL query that\"\n", + " \"Formulate an SQL query that\",\n", " ],\n", " ),\n", " )\n", @@ -300,7 +298,7 @@ "source": [ "## 🦜 Define Generated Data Columns\n", "\n", - "Now we'll set up the columns that will be generated by the LLMs, including the 
instruction, database context, and SQL implementation." + "Now we'll set up the columns that will be generated by the LLMs, including the instruction, database context, and SQL implementation.\n" ] }, { @@ -329,7 +327,7 @@ " name=\"sql_prompt\",\n", " model_alias=MODEL_ALIAS,\n", " system_prompt=\"You are an expert at generating clear and specific SQL tasks.\",\n", - " prompt=SQL_PROMPT_TEXT\n", + " prompt=SQL_PROMPT_TEXT,\n", " )\n", ")\n", "\n", @@ -351,9 +349,11 @@ " name=\"sql_context\",\n", " model_alias=MODEL_ALIAS,\n", " code_lang=CodeLang.SQL_ANSI,\n", - " system_prompt=(\"You are an expert SQL database designer who creates clean, efficient, and \"\n", - " \"well-structured database schemas.\"),\n", - " prompt=SQL_CONTEXT_TEXT\n", + " system_prompt=(\n", + " \"You are an expert SQL database designer who creates clean, efficient, and \"\n", + " \"well-structured database schemas.\"\n", + " ),\n", + " prompt=SQL_CONTEXT_TEXT,\n", " )\n", ")\n", "\n", @@ -380,7 +380,7 @@ " model_alias=MODEL_ALIAS,\n", " code_lang=CodeLang.SQL_ANSI,\n", " system_prompt=\"You are an expert SQL programmer who writes clean, efficient, and well-structured queries.\",\n", - " prompt=SQL_CODE_TEXT\n", + " prompt=SQL_CODE_TEXT,\n", " )\n", ")\n" ] @@ -393,10 +393,10 @@ "## πŸ” Quality Assessment: LLM-as-a-Judge\n", "\n", "When generating our synthetic dataset, we need to determine the quality of the generated data \\\n", - "We use the LLM-as-a-Judge strategy to do this. \n", + "We use the LLM-as-a-Judge strategy to do this.\n", "\n", - "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt \n", - "that provides relavant instructions. 
" + "To do so, we need to define the rubric that the LLM should use to assess generation quality along with a prompt\n", + "that provides relavant instructions.\n" ] }, { @@ -479,7 +479,7 @@ " name=\"code_judge_result\",\n", " model_alias=MODEL_ALIAS,\n", " prompt=TEXT_TO_SQL_JUDGE_TEMPLATE,\n", - " scores=sql_scoring\n", + " scores=sql_scoring,\n", " )\n", ")" ] @@ -494,31 +494,31 @@ "- Now we'll add validation for the initial code and generate analysis of any issues found.\n", "\n", "- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of \\\n", - "generated code snippets. \n", + " generated code snippets.\n", "\n", "- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the \\\n", - "intended programming language environment. \n", + " intended programming language environment.\n", "\n", "- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, \\\n", - "streamlining the process of generating reliable and production-ready data samples.\n", + " streamlining the process of generating reliable and production-ready data samples.\n", "\n", "- NeMo Data Designer supports validation for these languages\n", "\n", - " - Python (CodeLang.PYTHON)\n", + " - Python (CodeLang.PYTHON)\n", "\n", - " - SQL dialects:\n", + " - SQL dialects:\n", "\n", - " - ANSI SQL (CodeLang.SQL_ANSI)\n", + " - ANSI SQL (CodeLang.SQL_ANSI)\n", "\n", - " - MySQL (CodeLang.SQL_MYSQL)\n", + " - MySQL (CodeLang.SQL_MYSQL)\n", "\n", - " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", + " - PostgreSQL (CodeLang.SQL_POSTGRES)\n", "\n", - " - SQLite (CodeLang.SQL_SQLITE)\n", + " - SQLite (CodeLang.SQL_SQLITE)\n", "\n", - " - T-SQL (CodeLang.SQL_TSQL)\n", + " - T-SQL (CodeLang.SQL_TSQL)\n", "\n", - " - BigQuery (CodeLang.SQL_BIGQUERY)" + " - BigQuery (CodeLang.SQL_BIGQUERY)\n" ] }, { 
@@ -536,7 +536,7 @@ " validator_params=CodeValidatorParams(\n", " code_lang=CodeLang.SQL_ANSI,\n", " ),\n", - " batch_size=100\n", + " batch_size=100,\n", " )\n", ")" ] @@ -554,7 +554,7 @@ "\n", "3. Adjust column configurations, prompts, or parameters as needed.\n", "\n", - "4. Re-run the preview until satisfied." + "4. Re-run the preview until satisfied.\n" ] }, { diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb index baf23a59e..35a0be03d 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/1-the-basics.ipynb @@ -6,8 +6,6 @@ "source": [ "# 🎨 NeMo Data Designer 101: The Basics\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "This notebook demonstrates the basics of Data Designer by generating a simple product review dataset.\n", diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb index 9b5a9cea3..6b3543a19 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/2-structured-outputs-and-jinja-expressions.ipynb @@ -6,8 +6,6 @@ "source": [ "# 🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using 
structured outputs and Jinja expressions.\n", diff --git a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb index 4b8b1f5e7..0f6e73657 100644 --- a/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb +++ b/nemo/NeMo-Data-Designer/self-hosted-tutorials/getting-started/3-seeding-with-a-dataset.ipynb @@ -6,10 +6,6 @@ "source": [ "# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset\n", "\n", - "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", - "\n", - "
\n", - "\n", "#### πŸ“š What you'll learn\n", "\n", "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n",