# 🎨 NeMo Data Designer: Visual Question Answering Dataset Generation

### 📚 What you'll learn

This notebook demonstrates how to use NeMo Data Designer to generate high-quality synthetic question-answer datasets from visual documents.

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).

<br>


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
# Standard library imports
import base64
import io
import json
import os
import uuid
from typing import Literal

# Third-party imports
import pandas as pd
import rich
from datasets import load_dataset
from pydantic import BaseModel, Field
from rich.panel import Panel
from rich.markdown import Markdown

# NeMo Data Designer imports
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    ImageContext,
    ImageFormat,
    InferenceParameters,
    LLMStructuredColumnConfig,
    ModelConfig,
    ModalityDataType,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "meta/llama-4-maverick-17b-128e-instruct"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "llama-4-maverick-17b-128e-instruct"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

## 🌱 Loading Seed Data

In this section, we'll prepare our visual documents as a seed dataset. The seed dataset provides the foundation for synthetic data generation by:

- **Loading Visual Documents**: We use the ColPali dataset containing document images
- **Image Processing**: Convert images to base64 format for model consumption
- **Metadata Extraction**: Preserve relevant document information
- **Sampling Strategy**: Configure how the seed data is utilized during generation

The seed dataset can be referenced in generation prompts using Jinja templating.
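
To make the templating idea concrete, here is a minimal sketch of how a `{{ field }}` placeholder might be filled from one seed record. It uses a small regular-expression substitution as a stand-in for the real Jinja engine, and the `render_prompt` helper and the sample record are hypothetical. Data Designer performs the actual rendering for you.

```python
import re

# Matches simple placeholders such as {{ source }} (no filters or expressions).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{ name }} with the matching field from a seed record."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in record:
            raise KeyError(f"Seed record has no field named {field!r}")
        return str(record[field])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical seed record using two of the columns kept in this notebook.
seed_record = {"source": "example-source", "page": 3}
prompt = render_prompt("Describe page {{ page }} of {{ source }}.", seed_record)
print(prompt)
```

Raising an error for an unknown field is a design choice for this sketch: a misspelled column name fails loudly instead of silently producing an incomplete prompt.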

<br>

> 🌱 **Why use a seed dataset?**
>
> - Seed datasets let you steer the generation process by providing context that is specific to your use case.
>
> - Seed datasets are also an excellent way to inject real-world diversity into your synthetic data.
>
> - During generation, prompt templates can reference any of the seed dataset fields.

<br>

> 💡 **About datastores**
>
> - You can use seed datasets from _either_ the Hugging Face Hub or a locally deployed datastore.
>
> - By default, we use the local datastore deployed with the Data Designer microservice.
>
> - The datastore endpoint is specified in the deployment configuration.

👋 **Note**: At this time, we only support using a single file as the seed. If you have multiple files you would like to use as seeds, we recommend consolidating them into a single file.
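
One possible way to consolidate several CSV files into a single seed file is sketched below, using only the standard library. The `consolidate_csv_files` helper and the file names are hypothetical, and the sketch assumes every input file has exactly the same header.

```python
import csv
import tempfile
from pathlib import Path


def consolidate_csv_files(input_paths: list[Path], output_path: Path) -> int:
    """Concatenate CSV files with identical headers; return the row count."""
    header = None
    rows_written = 0
    with output_path.open("w", newline="", encoding="utf-8") as out_file:
        writer = None
        for path in input_paths:
            with path.open(newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file)
                if header is None:
                    # The first file defines the header for the merged output.
                    header = reader.fieldnames
                    writer = csv.DictWriter(out_file, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demonstration with two small, made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    (tmp / "part_1.csv").write_text("uuid,source\na,doc-1\n", encoding="utf-8")
    (tmp / "part_2.csv").write_text("uuid,source\nb,doc-2\n", encoding="utf-8")
    merged_path = tmp / "seed.csv"
    total_rows = consolidate_csv_files(
        [tmp / "part_1.csv", tmp / "part_2.csv"], merged_path
    )
    merged_text = merged_path.read_text(encoding="utf-8")

print(total_rows)
```

If your files contain very large cells, such as Base64-encoded images, you may need to raise the parser's limit with `csv.field_size_limit` before reading them.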


In [None]:
# Dataset processing configuration
IMG_COUNT = 512  # Number of images to process
BASE64_IMAGE_HEIGHT = 512  # Standardized height for model input

# Configuration for loading the ColPali dataset of visual documents
img_dataset_cfg = {
    "path": "vidore/colpali_train_set",
    "split": "train",
    "streaming": True,
}

Define helper functions to preprocess the dataset:


In [None]:
def resize_image(image, height: int):
    """
    Resize image while maintaining aspect ratio.

    Args:
        image: PIL Image object
        height: Target height in pixels

    Returns:
        Resized PIL Image object
    """
    original_width, original_height = image.size
    width = int(original_width * (height / original_height))
    return image.resize((width, height))


def convert_image_to_chat_format(record, height: int) -> dict:
    """
    Convert PIL image to base64 format for chat template usage.

    Args:
        record: Dataset record containing image and metadata
        height: Target height for image resizing

    Returns:
        Updated record with base64_image and uuid fields
    """
    # Resize image for consistent processing
    image = resize_image(record["image"], height)

    # Convert to base64 string
    img_buffer = io.BytesIO()
    image.save(img_buffer, format="PNG")
    byte_data = img_buffer.getvalue()
    base64_encoded_data = base64.b64encode(byte_data)
    base64_string = base64_encoded_data.decode("utf-8")

    # Return updated record
    return record | {"base64_image": base64_string, "uuid": str(uuid.uuid4())}

In [None]:
# Load and process the visual document dataset
print("📥 Loading and processing document images...")

img_dataset_iter = iter(
    load_dataset(**img_dataset_cfg).map(
        convert_image_to_chat_format, fn_kwargs={"height": BASE64_IMAGE_HEIGHT}
    )
)
img_dataset = pd.DataFrame([next(img_dataset_iter) for _ in range(IMG_COUNT)])

print(f"✅ Loaded {len(img_dataset)} images with columns: {list(img_dataset.columns)}")


In [None]:
# Save the seed dataset to a local CSV file
os.makedirs("./data/", exist_ok=True)

df_seed = pd.DataFrame(img_dataset)[
    ["uuid", "image_filename", "base64_image", "page", "options", "source"]
]
df_seed.to_csv("./data/colpali_train_set.csv", index=False)

df_seed.head()

In [None]:
# Upload the seed dataset containing our processed images
dataset_reference = data_designer_client.upload_seed_dataset(
    repo_id="data-designer-advanced/visual-qna",
    dataset="./data/colpali_train_set.csv",
    datastore_settings={"endpoint": "http://localhost:3000/v1/hf"},
)

config_builder.with_seed_dataset(
    dataset_reference=dataset_reference,
    sampling_strategy="ordered",
)

## 🦜 Generating Summary of Image Contents

- We instruct the model to “look” at each image and write a detailed
  Markdown summary.

- We ask it to read the page from top ➡️ bottom, then include a quick wrap-up
  at the end.

- That summary becomes helpful context we’ll reuse to generate focused
  questions and answers about the document later.

### 🖼️ How the image is provided

We pass the image via `multi_modal_context` using `ImageContext`:

- **Column**: `base64_image` (your image bytes encoded as Base64)
- **Modality**: `ModalityDataType.BASE64`
- **Format**: `ImageFormat.PNG`

In other words, `ImageContext` tells the model “this is an image, encoded as Base64,
and it’s a PNG,” so it knows exactly how to use it during summarization.
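
Because the declared format must match the actual bytes, it can help to sanity-check a seed value before generation. The sketch below is one way to do that with the standard library: it decodes a Base64 string and looks for the fixed 8-byte PNG signature. The `looks_like_base64_png` helper and the sample payloads are hypothetical and are not part of Data Designer.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_base64_png(value: str) -> bool:
    """Return True if value is valid Base64 that decodes to PNG-signed bytes."""
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        decoded = base64.b64decode(value, validate=True)
    except (ValueError, TypeError):
        return False
    return decoded.startswith(PNG_SIGNATURE)


# Made-up payloads: only the first carries the PNG signature.
png_like = base64.b64encode(PNG_SIGNATURE + b"rest-of-file").decode("utf-8")
jpeg_like = base64.b64encode(b"\xff\xd8\xff" + b"rest-of-file").decode("utf-8")

print(looks_like_base64_png(png_like))       # True
print(looks_like_base64_png(jpeg_like))      # False
print(looks_like_base64_png("not base64!"))  # False
```

This check only inspects the signature; it does not confirm that the rest of the payload is a complete, decodable image.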


In [None]:
# Add a column to generate detailed document summaries
config_builder.add_column(
    name="summary",
    column_type="llm-text",
    model_alias=MODEL_ALIAS,
    prompt=(
        "Provide a detailed summary of the content in this image in Markdown format. "
        "Start from the top of the image and then describe it from top to bottom. "
        "Place a summary at the bottom."
    ),
    multi_modal_context=[
        ImageContext(
            column_name="base64_image",
            data_type=ModalityDataType.BASE64,
            image_format=ImageFormat.PNG,
        )
    ],
)

## 🏗️ Designing our Data Schema

Structured outputs ensure consistent and predictable data generation. Data Designer supports schemas defined using:

- **JSON Schema**: For basic structure definition
- **Pydantic Models**: For advanced validation and type safety (recommended)
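
For comparison, a hand-written JSON Schema for a single multiple-choice answer field might look like the sketch below. The `is_valid_answer` helper is a toy check that covers only the `required` and `enum` rules, not a full JSON Schema validator, and how a raw schema is passed to Data Designer is not shown here.

```python
import json

# Hand-written JSON Schema describing an object with one constrained field.
answer_schema = {
    "type": "object",
    "properties": {
        "answer": {
            "type": "string",
            "enum": ["option_a", "option_b", "option_c", "option_d"],
            "description": "The correct answer to the question",
        }
    },
    "required": ["answer"],
}


def is_valid_answer(candidate: dict) -> bool:
    """Toy check covering only the 'required' and 'enum' rules above."""
    if not all(key in candidate for key in answer_schema["required"]):
        return False
    return candidate["answer"] in answer_schema["properties"]["answer"]["enum"]


print(json.dumps(answer_schema, indent=2))
print(is_valid_answer({"answer": "option_b"}))  # True
print(is_valid_answer({"answer": "option_e"}))  # False
```

A Pydantic model expresses the same constraint in a few lines and adds parsing and type checking, which is why it is the recommended route.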

We'll use Pydantic models to define our Question-Answer schema:


In [None]:
class Question(BaseModel):
    """Schema for generated questions"""

    question: str = Field(description="The question to be generated")


class QuestionTopic(BaseModel):
    """Schema for question topics"""

    topic: str = Field(description="The topic/category of the question")


class Options(BaseModel):
    """Schema for multiple choice options"""

    option_a: str = Field(description="The first answer choice")
    option_b: str = Field(description="The second answer choice")
    option_c: str = Field(description="The third answer choice")
    option_d: str = Field(description="The fourth answer choice")


class Answer(BaseModel):
    """Schema for question answers"""

    answer: Literal["option_a", "option_b", "option_c", "option_d"] = Field(
        description="The correct answer to the question"
    )


## 🎲 Adding Sampler Columns

- Sampler columns offer non-LLM based generation of synthetic data.

- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.
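
Conceptually, a category sampler draws one value per record from a fixed list without calling a model. The standard-library sketch below illustrates the idea under the assumption of uniform sampling; it is not Data Designer's implementation, and the real sampler's options are not shown.

```python
import random
from collections import Counter

difficulty_levels = ["easy", "medium", "hard"]

# A seeded generator keeps the illustration reproducible.
rng = random.Random(0)
samples = [rng.choice(difficulty_levels) for _ in range(300)]

# Count how often each category was drawn.
counts = Counter(samples)
print(counts)
```

Because each record receives its own sampled value, prompts that reference the column vary from record to record, which is how a sampler steers diversity.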


In [None]:
config_builder.add_column(
    SamplerColumnConfig(
        name="difficulty",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["easy", "medium", "hard"]),
    )
)


## 🦜 Adding LLM-Generated Columns

Now define the columns that the model will generate. These prompts instruct the LLM to produce:

- question
- options
- topic
- answer
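
Each of these prompts refers to other columns through `{{ ... }}` placeholders, so the columns form a dependency chain. The sketch below derives one valid generation order from abbreviated versions of the prompts using the standard-library `graphlib` module. It only illustrates the dependencies implied by the templates; how Data Designer resolves the order internally is not described here.

```python
import re
from graphlib import TopologicalSorter

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Abbreviated prompts: only the placeholders matter for this illustration.
column_prompts = {
    "question": "Context: {{ summary }}. Difficulty: {{ difficulty }}",
    "options": "Question: {{ question }}. Context: {{ summary }}. Difficulty: {{ difficulty }}",
    "answer": "Question: {{ question }}. Context: {{ summary }}. Options: {{ options }}",
    "topic": "Question: {{ question }}. Context: {{ summary }}",
}

# Map each column to the set of columns its prompt references.
dependencies = {
    name: set(PLACEHOLDER.findall(prompt)) for name, prompt in column_prompts.items()
}

# static_order() yields every column after all of the columns it depends on.
generation_order = list(TopologicalSorter(dependencies).static_order())
print(generation_order)
```

Several orders are valid (for example, `topic` may come before or after `options`), so the exact printed order can differ while still respecting every dependency.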


In [None]:
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="question",
        model_alias=MODEL_ALIAS,
        prompt=(
            "Generate a question based on the following context: {{ summary }}. "
            "The difficulty of the generated question should be {{ difficulty }}."
        ),
        system_prompt=(
            "You are a helpful assistant that generates questions based on the given context. "
            "The context is sourced from documents pertaining to the petroleum industry. "
            "You will be given a context and you will need to generate a question based on the context. "
            "The difficulty of the generated question should be {{ difficulty }}. "
            "Ensure you generate just the question and no other text."
        ),
        output_format=Question,
    )
)

config_builder.add_column(
    LLMStructuredColumnConfig(
        name="options",
        model_alias=MODEL_ALIAS,
        prompt=(
            "Generate four answer choices for the question: {{ question }} based on the following context: {{ summary }}. "
            "The options you generate should match the difficulty of the generated question, {{ difficulty }}."
        ),
        output_format=Options,
    )
)


config_builder.add_column(
    LLMStructuredColumnConfig(
        name="answer",
        model_alias=MODEL_ALIAS,
        prompt=(
            "Choose the correct answer for the question: {{ question }} based on the following context: {{ summary }} "
            "and option choices. The options are {{ options }}. Only select one of the options as the answer."
        ),
        output_format=Answer,
    )
)


config_builder.add_column(
    LLMStructuredColumnConfig(
        name="topic",
        model_alias=MODEL_ALIAS,
        system_prompt=(
            "Generate a short 1-3 word topic for the question: {{ question }} "
            "based on the given context: {{ summary }}"
        ),
        prompt=(
            "Generate the topic of the question: {{ question }} based on the following context: {{ summary }}. "
            "The topic should be a single word or phrase that is relevant to the question and context."
        ),
        output_format=QuestionTopic,
    )
)


### 🔁 Iteration is key – preview the dataset!

1. Use the `preview` method to generate a sample of records quickly.

2. Inspect the results for quality and format issues.

3. Adjust column configurations, prompts, or parameters as needed.

4. Re-run the preview until satisfied.
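
For the inspection step, a few automated checks can complement manual review. The sketch below flags common problems in a single multiple-choice record. It assumes each structured column arrives as a dictionary keyed by the field names of its schema, which you should verify against your own preview output; the `find_record_issues` helper and both sample records are hypothetical.

```python
VALID_ANSWERS = {"option_a", "option_b", "option_c", "option_d"}


def find_record_issues(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    issues = []
    question = record.get("question", {}).get("question", "")
    if not question.strip():
        issues.append("question is empty")
    options = record.get("options", {})
    missing = sorted(VALID_ANSWERS - set(options))
    if missing:
        issues.append(f"missing options: {', '.join(missing)}")
    if len(set(options.values())) < len(options):
        issues.append("duplicate option text")
    if record.get("answer", {}).get("answer") not in VALID_ANSWERS:
        issues.append("answer is not one of the option keys")
    return issues


# Made-up records: one well-formed, one with several problems.
good_record = {
    "question": {"question": "What is shown?"},
    "options": {"option_a": "A", "option_b": "B", "option_c": "C", "option_d": "D"},
    "answer": {"answer": "option_c"},
}
bad_record = {
    "question": {"question": " "},
    "options": {"option_a": "A", "option_b": "A"},
    "answer": {"answer": "option_e"},
}

print(find_record_issues(good_record))
print(find_record_issues(bad_record))
```

Checks like these catch format problems only; whether the chosen answer is actually correct for the document still needs human or model-based review.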


In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder)

In [None]:
# Display a sample record from the preview
preview.display_sample_record()

### 📊 Analyze the generated data

- Data Designer automatically generates a basic statistical analysis of the generated data.

- This analysis is available via the `analysis` property of generation result objects.


In [None]:
# Print the analysis as a table.
preview.analysis.to_report()

### 🔎 View Results


In [None]:
# Compare original document with generated outputs
index = 0  # Change this to view different examples

# Merge preview data with original images for comparison
comparison_dataset = preview.dataset.merge(
    pd.DataFrame(img_dataset)[["uuid", "image"]], how="left", on="uuid"
)

print("📄 Original Document Image:")
display(resize_image(comparison_dataset.image[index], BASE64_IMAGE_HEIGHT))

print("\n📝 Generated Summary:")
rich.print(
    Panel(
        comparison_dataset.summary[index], title="Document Summary", title_align="left"
    )
)

print("\n🔢 Generated Difficulty:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.difficulty[index]),
        title="Difficulty",
        title_align="left",
    )
)

print("\n❓ Generated Question:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.question[index]),
        title="Question",
        title_align="left",
    )
)

print("\n🔢 Generated Options:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.options[index]),
        title="Answer Choices",
        title_align="left",
    )
)

print("\n🔢 Generated Topic:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.topic[index]), title="Topic", title_align="left"
    )
)

print("\n✅ Generated Answer:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.answer[index]),
        title="Correct Answer",
        title_align="left",
    )
)


### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=20)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-multimodal-visual-question-answering",
);

In [None]:
# Compare original document with generated outputs
index = 0  # Change this to view different examples

# Merge the generated dataset with the original images for comparison
comparison_dataset = dataset.merge(
    pd.DataFrame(img_dataset)[["uuid", "image"]], how="left", on="uuid"
)

print("📄 Original Document Image:")
display(resize_image(comparison_dataset.image[index], BASE64_IMAGE_HEIGHT))

print("\n📝 Generated Summary:")
rich.print(
    Panel(
        comparison_dataset.summary[index], title="Document Summary", title_align="left"
    )
)

print("\n🔢 Generated Difficulty:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.difficulty[index]),
        title="Difficulty",
        title_align="left",
    )
)

print("\n❓ Generated Question:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.question[index]),
        title="Question",
        title_align="left",
    )
)

print("\n🔢 Generated Options:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.options[index]),
        title="Answer Choices",
        title_align="left",
    )
)

print("\n🔢 Generated Topic:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.topic[index]), title="Topic", title_align="left"
    )
)

print("\n✅ Generated Answer:")
rich.print(
    Panel(
        json.dumps(comparison_dataset.answer[index]),
        title="Correct Answer",
        title_align="left",
    )
)
