# 🎨 NeMo Data Designer: Visual Question Answering Dataset Generation

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) for how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to generate high-quality synthetic Question-Answer datasets from visual documents. 

### Key Features Demonstrated

- ✨ **Visual Document Processing**: Converting images to chat-ready format
- 🏗️ **Structured Output Generation**: Using Pydantic models for consistent data schemas
- 🎯 **Multi-step Generation Pipeline**: Summary → Question → Answer generation workflow
- 🔄 **Iterative Development**: Preview functionality for rapid iteration



#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.


In [None]:
# Standard library imports
import base64
import io
import json
import os
import uuid
from typing import Literal

# Third-party imports
import pandas as pd
import rich
from datasets import load_dataset
from pydantic import BaseModel, Field
from rich.markdown import Markdown
from rich.panel import Panel

# NeMo Data Designer imports
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "meta/llama-4-maverick-17b-128e-instruct"
model_alias = "llama-4-maverick-17b-128e-instruct"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=False
        ),
    ]
)

### 🌱 Seed Dataset Creation

In this section, we'll prepare our visual documents as a seed dataset. The seed dataset provides the foundation for synthetic data generation by:

- **Loading Visual Documents**: We use the ColPali dataset containing document images
- **Image Processing**: Convert images to base64 format for model consumption  
- **Metadata Extraction**: Preserve relevant document information
- **Sampling Strategy**: Configure how the seed data is utilized during generation
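
The image processing step can be illustrated with the standard library alone. The sketch below shows the aspect-ratio arithmetic used when resizing to a fixed height and the size overhead of base64 encoding. The helper names (`scaled_width`, `to_base64_string`) and the sample dimensions are assumptions made for illustration, and the byte string is a stand-in for real PNG data.

```python
import base64
import math


def scaled_width(original_width: int, original_height: int, target_height: int) -> int:
    """Width that preserves the aspect ratio at the target height (truncated to an int)."""
    return int(original_width * (target_height / original_height))


def to_base64_string(raw_bytes: bytes) -> str:
    """Encode raw bytes as an ASCII-safe base64 string."""
    return base64.b64encode(raw_bytes).decode("utf-8")


# Stand-in bytes; in the notebook these would be the PNG bytes of a resized image.
sample_bytes = bytes(range(256)) * 4  # 1024 bytes
encoded = to_base64_string(sample_bytes)

# Base64 emits 4 characters per 3 input bytes (with padding), so the string is about 33% larger.
expected_length = 4 * math.ceil(len(sample_bytes) / 3)
round_trip_ok = base64.b64decode(encoded) == sample_bytes

print(scaled_width(1700, 2200, 512), len(encoded), expected_length, round_trip_ok)
```

Because every encoded image grows by roughly a third, the target height has a direct effect on the size of the seed file and on the payload sent to the model.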

The seed dataset can be referenced in generation prompts using Jinja templating.
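
To make the templating idea concrete, here is a minimal stand-in that substitutes `{{ column }}` placeholders with values from one seed record. It uses a regular expression rather than the Jinja engine, so it covers only plain variable references. The function `render_prompt` and the sample record values are hypothetical.

```python
import re

# Matches Jinja-style placeholders such as {{ page }} or {{image_filename}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{ column }} placeholder with the value from one seed record."""
    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in record:
            raise KeyError(f"Prompt references unknown column: {column}")
        return str(record[column])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical seed record; the column names mirror the seed CSV written in this notebook.
seed_record = {"image_filename": "example.png", "page": 3, "source": "example-source"}
template = "Summarize page {{ page }} of {{ image_filename }}."

rendered = render_prompt(template, seed_record)
print(rendered)
```

Raising on an unknown column is a deliberate choice in this sketch: a misspelled placeholder fails loudly instead of silently producing an empty prompt fragment.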

**Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.

In [None]:
# Dataset processing configuration
IMG_COUNT = 512  # Number of images to process
BASE64_IMAGE_HEIGHT = 512  # Standardized height for model input

# Load ColPali dataset for visual documents
img_dataset_cfg = {
    "path": "vidore/colpali_train_set",
    "split": "train",
    "streaming": True
}

In [None]:
def resize_image(image, height: int):
    """
    Resize image while maintaining aspect ratio.

    Args:
        image: PIL Image object
        height: Target height in pixels

    Returns:
        Resized PIL Image object
    """
    original_width, original_height = image.size
    width = int(original_width * (height / original_height))
    return image.resize((width, height))

def convert_image_to_chat_format(record, height: int) -> dict:
    """
    Convert PIL image to base64 format for chat template usage.

    Args:
        record: Dataset record containing image and metadata
        height: Target height for image resizing

    Returns:
        Updated record with base64_image and uuid fields
    """
    # Resize image for consistent processing
    image = resize_image(record["image"], height)

    # Convert to base64 string
    img_buffer = io.BytesIO()
    image.save(img_buffer, format="PNG")
    byte_data = img_buffer.getvalue()
    base64_encoded_data = base64.b64encode(byte_data)
    base64_string = base64_encoded_data.decode("utf-8")

    # Return updated record
    return record | {
        "base64_image": base64_string,
        "uuid": str(uuid.uuid4())
    }

In [None]:
# Load and process the visual document dataset
print("📥 Loading and processing document images...")

img_dataset_iter = iter(
    load_dataset(**img_dataset_cfg)
    .map(convert_image_to_chat_format, fn_kwargs={"height": BASE64_IMAGE_HEIGHT})
)
img_dataset = pd.DataFrame([next(img_dataset_iter) for _ in range(IMG_COUNT)])

print(f"✅ Loaded {len(img_dataset)} images with columns: {list(img_dataset.columns)}")

In [None]:
img_dataset.head()

In [None]:
os.makedirs("./data/", exist_ok=True)

df_seed = pd.DataFrame(img_dataset)[["uuid", "image_filename", "base64_image", "page", "options", "source"]]
df_seed.to_csv("./data/colpali_train_set.csv", index=False)


In [None]:
# Add the seed dataset containing our processed images
config_builder.with_seed_dataset(
    repo_id="advanced/visual-qna",
    filename="colpali_train_set.csv",
    dataset_path="./data/colpali_train_set.csv",
    sampling_strategy="ordered",
    with_replacement=True,
    datastore={"endpoint": "http://localhost:3000/v1/hf"},
)

In [None]:
# Add a column to generate detailed document summaries
config_builder.add_column(
    name="summary",
    type="llm-code",
    model_alias=model_alias,
    prompt=("Provide a detailed summary of the content in this image in Markdown format. "
            "Start from the top of the image and then describe it from top to bottom. "
            "Place a summary at the bottom."),
    output_format="markdown",
    multi_modal_context=[
        P.ImageContext(
            column_name="base64_image",
            data_type=P.ModalityDataType.BASE64,
            image_format=P.ImageFormat.PNG,
        )
    ]
)

### 🎨 Designing our Data Schema

Structured outputs ensure consistent and predictable data generation. Data Designer supports schemas defined using:
- **JSON Schema**: For basic structure definition
- **Pydantic Models**: For advanced validation and type safety (recommended)
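
For comparison, the JSON Schema route might look like the following for a multiple-choice answer. This is a hand-written sketch, not output produced by Pydantic or Data Designer, and the `conforms` helper is a hypothetical, deliberately minimal checker that handles only `required` keys and `enum` values.

```python
import json

# Hand-written JSON Schema for a multiple-choice answer (an illustration only).
answer_schema = {
    "type": "object",
    "properties": {
        "answer": {
            "type": "string",
            "enum": ["option_a", "option_b", "option_c", "option_d"],
            "description": "The correct answer to the question",
        }
    },
    "required": ["answer"],
}


def conforms(payload: str, schema: dict) -> bool:
    """Minimal check covering only `required` keys and `enum` values."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    for key, rules in schema["properties"].items():
        if key in data and "enum" in rules and data[key] not in rules["enum"]:
            return False
    return True


print(conforms('{"answer": "option_b"}', answer_schema))
print(conforms('{"answer": "B"}', answer_schema))
print(conforms('{"choice": "option_b"}', answer_schema))
```

A Pydantic model expresses the same constraint with a `Literal` type and performs this validation for you, which is one reason it is the recommended option.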

We'll use Pydantic models to define our Question-Answer schema:


In [None]:
class Question(BaseModel):
    """Schema for generated questions"""
    question: str = Field(description="The question to be generated")

class QuestionTopic(BaseModel):
    """Schema for question topics"""
    topic: str = Field(description="The topic/category of the question")

class Options(BaseModel):
    """Schema for multiple choice options"""
    option_a: str = Field(description="The first answer choice")
    option_b: str = Field(description="The second answer choice")
    option_c: str = Field(description="The third answer choice")
    option_d: str = Field(description="The fourth answer choice")

class Answer(BaseModel):
    """Schema for question answers"""
    answer: Literal["option_a", "option_b", "option_c", "option_d"] = Field(description="The correct answer to the question")


In [None]:
config_builder.add_column(
    C.SamplerColumn(
        name="difficulty",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["easy", "medium", "hard"]),
        description="The difficulty of the generated question",
    ))


In [None]:
config_builder.add_column(
    C.LLMStructuredColumn(
        name="question",
        model_alias=model_alias,
        prompt=("Generate a question based on the following context: {{ summary }}. "
        "The difficulty of the generated question should be {{ difficulty }}."),
        system_prompt=("You are a helpful assistant that generates questions based on the given context. "
        "The context is sourced from documents pertaining to the petroleum industry. "
        "You will be given a context and you will need to generate a question based on the context. "
        "The difficulty of the generated question should be {{ difficulty }}. "
        "Ensure you generate just the question and no other text."),
        output_format=Question,
    )
)

config_builder.add_column(
    C.LLMStructuredColumn(
        name="options",
        model_alias=model_alias,
        prompt=("Generate four answer choices for the question: {{ question }} based on the following context: {{ summary }}. "
        "The options you generate should match the difficulty of the generated question, {{ difficulty }}."),
        output_format=Options,
    )
)


config_builder.add_column(
    C.LLMStructuredColumn(
        name="answer",
        prompt=("Choose the correct answer for the question: {{ question }} based on the following context: {{ summary }} "
                "and option choices. The options are {{ options }}. Only select one of the options as the answer."),
        output_format=Answer,
        model_alias=model_alias,
    )
)


config_builder.add_column(
    C.LLMStructuredColumn(
        name="topic",
        model_alias=model_alias,
        prompt=("Generate the topic of the question: {{ question }} based on the following context: {{ summary }}. "
        "The topic should be a single word or phrase that is relevant to the question and context."),
        system_prompt=("Generate a short 1-3 word topic for the question: {{ question }} based on the given context. {{ summary }}"),
        output_format=QuestionTopic,
    )
)


### 👀 Preview Generation

Before scaling up, it's crucial to validate your configuration with a small sample. The preview functionality:

- **Generates Sample Data**: Creates 10 records for quick inspection
- **Enables Rapid Iteration**: Test and refine your prompts and schemas
- **Provides Detailed Logging**: Understand the generation process with verbose output

Use this step to fine-tune your configuration before full-scale generation.


**Note**: You can ignore the `PROMPT_WITHOUT_REFERENCES` validation warning. The image is passed to the LLM through `multi_modal_context`, so the summary prompt does not need to reference any other column.
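
When inspecting preview output, a few automated checks can complement manual review. The sketch below flags records whose answer is not one of the four option keys, or whose options are empty or duplicated. It assumes each structured field arrives either as a dict or as a JSON string. That shape is an assumption, so adjust `as_dict` to match what your preview actually returns. The function names and sample records are hypothetical.

```python
import json

VALID_KEYS = ("option_a", "option_b", "option_c", "option_d")


def as_dict(value) -> dict:
    """Accept either a dict or a JSON string and return a dict (assumed shapes)."""
    return json.loads(value) if isinstance(value, str) else dict(value)


def find_problems(record: dict) -> list[str]:
    """Return readable problems for one generated record; an empty list means it passed."""
    problems = []
    options = as_dict(record["options"])
    answer = as_dict(record["answer"]).get("answer")
    missing = [key for key in VALID_KEYS if not options.get(key)]
    if missing:
        problems.append(f"missing or empty options: {missing}")
    if answer not in VALID_KEYS:
        problems.append(f"answer is not a valid option key: {answer!r}")
    if len(set(options.values())) < len(options):
        problems.append("duplicate option text")
    return problems


# Hypothetical records for illustration only.
good = {
    "options": {"option_a": "10%", "option_b": "20%", "option_c": "30%", "option_d": "40%"},
    "answer": '{"answer": "option_c"}',
}
bad = {
    "options": {"option_a": "10%", "option_b": "10%", "option_c": "", "option_d": "40%"},
    "answer": {"answer": "C"},
}

print(find_problems(good))
print(find_problems(bad))
```

Checks like these only catch structural problems. Whether the selected answer is actually correct for the document still requires comparing it against the source image.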

In [None]:
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
# Display a sample record from the preview
# Run this cell multiple times to cycle through different records
preview.display_sample_record()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset

In [None]:
# Compare original document with generated outputs
index = 0  # Change this to view different examples

# Merge preview data with original images for comparison
comparison_dataset = preview.dataset.merge(
    pd.DataFrame(img_dataset)[["uuid", "image"]],
    how="left",
    on="uuid"
)

print("📄 Original Document Image:")
display(resize_image(comparison_dataset.image[index], BASE64_IMAGE_HEIGHT))

print("\n📝 Generated Summary:")
rich.print(Panel(comparison_dataset.summary[index], title="Document Summary", title_align="left"))

print("\n❓ Generated Question:")
rich.print(Panel(comparison_dataset.question[index], title="Question", title_align="left"))

print("\n🔢 Generated Options:")
rich.print(Panel(comparison_dataset.options[index], title="Answer Choices", title_align="left"))

print("\n✅ Generated Answer:")
rich.print(Panel(comparison_dataset.answer[index], title="Correct Answer", title_align="left"))


### 🚀 Scale Up Generations

Once satisfied with the preview results, scale up to generate the full dataset. The generation process offers flexible execution modes:

#### Synchronous Generation
Set `wait_until_done=True` to block until completion - ideal for smaller datasets or interactive workflows.

#### Asynchronous Generation  
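Set `wait_until_done=False` for batch processing. The call returns a `job_id` for later retrieval. The cell below submits the job this way and then calls `wait_until_done()` on the returned object to block until the job finishes.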
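
Conceptually, waiting on an asynchronous job amounts to polling its status until it completes or a timeout expires. The sketch below shows that generic pattern with a stand-in job class. `FakeJob`, `status`, and `wait_for` are hypothetical names and are not part of the Data Designer client API.

```python
import time


class FakeJob:
    """Stand-in for a remote job handle; reports 'done' after a fixed number of status checks."""

    def __init__(self, checks_until_done: int):
        self._remaining = checks_until_done

    def status(self) -> str:
        self._remaining -= 1
        return "done" if self._remaining <= 0 else "running"


def wait_for(job: FakeJob, poll_seconds: float = 0.01, timeout_seconds: float = 5.0) -> int:
    """Poll until the job reports 'done'; return the number of status checks made."""
    deadline = time.monotonic() + timeout_seconds
    checks = 0
    while True:
        checks += 1
        if job.status() == "done":
            return checks
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        time.sleep(poll_seconds)


checks_made = wait_for(FakeJob(checks_until_done=3))
print(checks_made)
```

In this notebook you do not need to write such a loop yourself, since the object returned by `create` exposes `wait_until_done()`.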

In [None]:
job_results = data_designer_client.create(config_builder, num_records=1, wait_until_done=False)

job_results.wait_until_done()

In [None]:
# load the dataset into a pandas DataFrame
dataset = job_results.load_dataset()

print(f"Generated {len(dataset)} records")

dataset.head()

### 🔎 View Results

In [None]:
# Compare original document with generated outputs
index = 0  # Change this to view different examples

# Merge generated data with original images for comparison
comparison_dataset = dataset.merge(
    pd.DataFrame(img_dataset)[["uuid", "image"]],
    how="left",
    on="uuid"
)

print("📄 Original Document Image:")
display(resize_image(comparison_dataset.image[index], BASE64_IMAGE_HEIGHT))

print("\n📝 Generated Summary:")
rich.print(Panel(comparison_dataset.summary[index], title="Document Summary", title_align="left"))

print("\n❓ Generated Question:")
rich.print(Panel(comparison_dataset.question[index], title="Question", title_align="left"))

# print("\n🔢 Generated Options:")
# rich.print(Panel(comparison_dataset.options[index], title="Answer Choices", title_align="left"))

print("\n✅ Generated Answer:")
rich.print(Panel(comparison_dataset.answer[index], title="Correct Answer", title_align="left"))
