# 🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.

<br>

In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs and Jinja expressions.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.


If the installation worked, you should be able to make the following imports:


In [None]:
from getpass import getpass

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

## 🧑‍🎨 Designing our data

- We will again create a product review dataset, but this time we will use structured outputs and Jinja expressions.

- Structured outputs let you specify the exact schema of the data you want to generate.

- Data Designer supports schemas specified using either JSON Schema or Pydantic data models (recommended).

<br>

We'll define our structured outputs using Pydantic data models:


In [None]:
from decimal import Decimal
from typing import Literal
from pydantic import BaseModel, Field


# We define a Product schema so that the name, description, and price are generated
# in one go, with the types and constraints specified.
class Product(BaseModel):
    name: str = Field(description="The name of the product")
    description: str = Field(description="A description of the product")
    price: Decimal = Field(
        description="The price of the product", ge=10, le=1000, decimal_places=2
    )


class ProductReview(BaseModel):
    rating: int = Field(description="The rating of the product", ge=1, le=5)
    customer_mood: Literal["irritated", "mad", "happy", "neutral", "excited"] = Field(
        description="The mood of the customer"
    )
    review: str = Field(description="A review of the product")
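
To make the schema idea concrete, here is a minimal sketch, assuming Pydantic v2, of how `Field` constraints surface in the JSON Schema a model exports and how out-of-range values are rejected at validation time. `ExampleReview` and `is_valid` are hypothetical names used only for illustration; how Data Designer itself consumes the schema is not shown here.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


# Hypothetical model for illustration only; it mirrors the shape of ProductReview.
class ExampleReview(BaseModel):
    rating: int = Field(description="The rating of the product", ge=1, le=5)
    mood: Literal["happy", "neutral", "mad"] = Field(description="The customer's mood")


# Export the JSON Schema that describes the model.
schema = ExampleReview.model_json_schema()
rating_schema = schema["properties"]["rating"]
mood_schema = schema["properties"]["mood"]

# `ge`/`le` appear as `minimum`/`maximum`; the Literal appears as an `enum`.
print(rating_schema)
print(mood_schema)


def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies the ExampleReview schema."""
    try:
        ExampleReview.model_validate(payload)
        return True
    except ValidationError:
        return False


print(is_valid({"rating": 4, "mood": "happy"}))
print(is_valid({"rating": 6, "mood": "happy"}))
```

The same constraints that guide generation can therefore be used to check records after the fact.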

### ⚙️ Initialize the NeMo Data Designer (NDD) Client

- The NDD client is responsible for submitting generation requests to the Data Designer microservice.


In [None]:
ndd = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8000"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# build.nvidia.com model endpoint
endpoint = "https://integrate.api.nvidia.com/v1"
model_id = "mistralai/mistral-small-24b-instruct"

model_alias = "mistral-small"

# You will need to enter your model provider API key to run this notebook.
api_key = getpass("Enter model provider API key: ")

if len(api_key) > 0:
    print("✅ API key received.")
else:
    print("❌ No API key provided. Please enter your model provider API key.")

In [None]:
model_configs = [
    P.ModelConfig(
        alias=model_alias,
        inference_parameters=P.InferenceParameters(
            max_tokens=1024,
            temperature=0.5,
            top_p=1.0,
        ),
        model=P.Model(
            api_endpoint=P.ApiEndpoint(
                api_key=api_key,
                model_id=model_id,
                url=endpoint,
            ),
        ),
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

Next, let's design our product review dataset using a few more tricks compared to the previous notebook:


In [None]:
# Since we often just want a few attributes from Person objects, we can use
# Data Designer's `with_person_samplers` method to create multiple person samplers
# at once and drop the person object columns from the final dataset.
config_builder.with_person_samplers(
    {"customer": P.PersonSamplerParams(age_range=[18, 65])}
)

config_builder.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="target_age_range",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["18-25", "25-35", "35-50", "50-65", "65+"]
        ),
    )
)

config_builder.add_column(
    C.SamplerColumn(
        name="review_style",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

# We can create new columns using Jinja expressions that reference
# existing columns, including attributes of nested objects.
config_builder.add_column(
    C.ExpressionColumn(
        name="customer_name", expr="{{ customer.first_name }} {{ customer.last_name }}"
    )
)

config_builder.add_column(
    C.ExpressionColumn(name="customer_age", expr="{{ customer.age }}")
)

# Add an `LLMStructuredColumn` column to generate structured outputs.
config_builder.add_column(
    C.LLMStructuredColumn(
        name="product",
        prompt=(
            "Create a product in the '{{ product_category }}' category, focusing on products "
            "related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. The product should be priced between $10 and $1000."
        ),
        output_format=Product,
        model_alias=model_alias,
    )
)

config_builder.add_column(
    C.LLMStructuredColumn(
        name="customer_review",
        prompt=(
            "Your task is to write a review for the following product:\n\n"
            "Product Name: {{ product.name }}\n"
            "Product Description: {{ product.description }}\n"
            "Price: {{ product.price }}\n\n"
            "Imagine your name is {{ customer_name }} and you are from {{ customer.city }}, {{ customer.state }}. "
            "Write the review in a style that is '{{ review_style }}'."
        ),
        output_format=ProductReview,
        model_alias=model_alias,
    )
)

# Let's add an evaluation report to our dataset.
config_builder.with_evaluation_report().validate()
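
The expression columns above use standard Jinja syntax. As a rough sketch of how such templates resolve against a record, the following renders two of them with the `jinja2` library directly. The record values are made up, and this is not Data Designer's internal implementation; it only illustrates dotted access into a nested object.

```python
from jinja2 import Template

# Hypothetical record; in Data Designer the `customer` object comes from the person sampler.
record = {
    "customer": {"first_name": "Ada", "last_name": "Lovelace", "age": 36},
}

# Same expressions as the `customer_name` and `customer_age` columns.
name_template = Template("{{ customer.first_name }} {{ customer.last_name }}")
age_template = Template("{{ customer.age }}")

# Dotted access falls back to dictionary lookup, so nested fields resolve naturally.
customer_name = name_template.render(**record)
customer_age = age_template.render(**record)

print(customer_name)
print(repr(customer_age))
```

Plain Jinja rendering always returns a string, so `customer_age` in this sketch is text rather than an integer. Whether Data Designer casts the rendered value is not covered in this notebook.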

## 👀 Preview the dataset

- Iteration is key to generating high-quality synthetic data.

- Use the `preview` method to generate 10 records for inspection.

- Setting `verbose_logging=True` prints logs within each task of the generation process.


In [None]:
preview = ndd.preview(config_builder, verbose_logging=True)

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset

In [None]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()
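
When inspecting the preview, it helps to know roughly what category shares to expect. The sketch below assumes the `weights` given to the `review_style` sampler are relative weights that get normalized to probabilities; that behavior is an assumption and is not confirmed by this notebook. It uses only the standard library to compute the expected shares and to simulate a 10-record draw.

```python
import random
from collections import Counter

# Values and weights copied from the `review_style` column definition.
styles = ["rambling", "brief", "detailed", "structured with bullet points"]
weights = [1, 2, 2, 1]

# Normalize the relative weights into expected probabilities (assumed behavior).
total = sum(weights)
expected = {style: weight / total for style, weight in zip(styles, weights)}

# Simulate a small sample the size of a preview; the seed is arbitrary.
rng = random.Random(0)
sample_size = 10
draws = rng.choices(styles, weights=weights, k=sample_size)
observed = Counter(draws)

for style in styles:
    share = observed[style] / sample_size
    print(f"{style}: expected {expected[style]:.2f}, observed {share:.2f}")
```

With only 10 records, observed shares can differ noticeably from the expected ones, so a skewed preview does not by itself indicate a misconfigured sampler.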

## 🧬 Generate your dataset

- Once you are happy with the preview, scale up to a larger dataset.

- The `create` method will submit your generation job to the microservice and return a results object.

- To block until the job completes, set `wait_until_done=True`.


In [None]:
results = ndd.create(config_builder, num_records=20, wait_until_done=True)

In [None]:
# Load the dataset into a pandas DataFrame.
dataset = results.load_dataset()

dataset.head()

### 🔎 View the evaluation report

- The evaluation report is generated in HTML format and can be viewed in a browser.


In [None]:
import webbrowser
from pathlib import Path

eval_report_path = Path(
    "./2-structured-outputs-and-jinja-expressions-eval-report.html"
).resolve()

results.download_evaluation_report(eval_report_path)

# `as_uri()` builds a well-formed file:// URL from the absolute path on any platform.
webbrowser.open_new_tab(eval_report_path.as_uri());