# Multimodal Structured Outputs with Daft, Gemma-3n, and vLLM

*An end-to-end example of **Multimodal Structured Outputs** with Daft's high performance data engine.*


<a target="_blank" href="https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/tutorials/structured_outputs/mm_structured_outputs.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Introduction

**Structured Outputs** refers to a family of features that enable LLMs to respond in a constrained format. While LLMs continue to improve and demonstrate emergent abilities, their dynamic nature makes them difficult to integrate with traditional software systems. Almost all emerging AI use cases, from agents to synthetic data and knowledge extraction, leverage structured outputs, whether to execute tool calls or to adhere to Pydantic models.

Structured Outputs comprise five strategies, each defining the desired output type:

- Basic Python Types: `int`, `float`, `bool`...
- Multiple Choices: using `Literal` or `Enum`
- JSON Schemas: using Pydantic models or dataclasses
- Regex
- Context-free Grammars
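
As a rough illustration, the sketch below expresses three of these strategies (multiple choice, JSON Schema, and regex) using only the standard library. The names `Sentiment`, `Choice`, `CHOICE_SCHEMA`, `ANSWER_PATTERN`, and `validate_choice` are hypothetical and are not part of Daft, vLLM, or the OpenAI client. The schema is hand-written to resemble what a Pydantic model with a single enum field would describe.

```python
import json
import re
from enum import Enum
from typing import Literal, get_args

# Multiple choice: a closed set of allowed labels.
Sentiment = Literal["positive", "negative"]


class Choice(str, Enum):
    A = "A"
    B = "B"
    C = "C"
    D = "D"


# JSON Schema: an object with a single enum-constrained field.
CHOICE_SCHEMA = {
    "type": "object",
    "properties": {"choice": {"type": "string", "enum": [c.value for c in Choice]}},
    "required": ["choice"],
}

# Regex: exactly one letter from A to D.
ANSWER_PATTERN = re.compile(r"[A-D]")


def validate_choice(raw: str) -> Choice:
    """Check a JSON response against the allowed choices on the client side."""
    payload = json.loads(raw)
    return Choice(payload["choice"])  # raises ValueError for values outside the enum


print(get_args(Sentiment))
print(validate_choice('{"choice": "C"}').value)
print(bool(ANSWER_PATTERN.fullmatch("B")), bool(ANSWER_PATTERN.fullmatch("AB")))
```

Constraining generation on the server and validating on the client are complementary: the first shapes what the model can emit, the second guards the code that consumes it.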

While there are a tremendous number of examples in pure Python, few tutorials demonstrate structured outputs in a distributed batch processing context. Even fewer, if any, demonstrate how to run batch structured outputs with multimodal data. Here we will take things one step further by using your own OpenAI-compatible server.

This is a real use case that enterprise teams face when working with massive amounts of internal or private data. These teams face significant hurdles with traditional tooling, especially for cutting-edge use cases like batch tool calls for background agents or reinforcement learning with verifiable rewards.

Daft's unified multimodal data processing engine is purpose built to support workloads like this and is rapidly becoming the default engine of choice for teams deploying frontier AI solutions in production.

In this notebook, we will use Daft to evaluate the image understanding accuracy of gemma-3n-e4b-it on the AI2D dataset. By the end, you will be ready to implement your own distributed batch structured outputs pipeline with a copy-paste script you can use in your own environment.

NOTE: This notebook contains an advanced path where you can use vLLM as your inference solution. In this case, Google Colab's A100 GPU instance is recommended. Additionally, to access [google/gemma-3n-e4b-it](https://huggingface.co/google/gemma-3n-E4B-it) you will need to accept Google's usage policy and authenticate with Hugging Face.

### Table of Contents

1. [Setup](#1-setup) 
2. [Choose an Inference Solution](#2-choose-an-inference-solution)
3. [Sanity Check OpenAI Client Requests](#3-sanity-check-openai-client-requests)
4. [Dataset Preprocessing](#4-dataset-preprocessing)
5. [Multimodal Structured Outputs](#5-multimodal-inference-with-structured-outputs)
6. Post Processing and Analysis 
7. Evaluation 
8. Conclusion



## 1. Setup 

### Install Dependencies 

In [None]:
!uv pip install  "daft[huggingface]>=0.6.1"

### Configure Parameters

In [None]:
# Model and Dataset
MODEL_ID = "google/gemma-3-4b-it"  # OpenRouter free version -> google/gemma-3-4b-it:free
DATASET_URI = "hf://datasets/HuggingFaceM4/the_cauldron/ai2d/train-00000-of-00001-2ce340398c113b79.parquet"
IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"

# Inference Parameters
ROW_LIMIT = 100
TEMPERATURE = 0.1
CONCURRENCY = 4
BATCH_SIZE = 32

## 2. Choose an Inference Solution

### Option 1: Connect to an Inference Provider

* Both OpenRouter and LM Studio support google/gemma-3n-e4b-it.
* Export `OPENAI_API_KEY` and `OPENAI_BASE_URL` for your chosen inference provider.

### Option 2: Launch vLLM OpenAI Compatible Server (Advanced)

#### Install vLLM

After you install vLLM, you will be prompted to restart the session. Restart it, then proceed to the next step.

In [None]:
!pip install -q vllm

#### Log in to Hugging Face to access google/gemma-3n-e4b-it

The [google/gemma-3n-e4b-it repository](https://huggingface.co/google/gemma-3n-E4B-it) is publicly accessible, but you will need to log in to Hugging Face and accept Google's conditions to access its files and content. Requests are processed immediately.

In [None]:
!hf auth login

#### Launch vLLM OpenAI Compatible Server

Run the following vLLM CLI command in your terminal.

If you are in Google Colab, you can open a terminal by clicking the terminal icon in the bottom left of the UI.

```bash
 python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-3n-e4b-it \
  --enable-chunked-prefill \
  --guided-decoding-backend guidance \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000
```

* This config is optimized for Google Colab's A100 instance and gemma-3n-e4b-it. 
* For vLLM online serving, set `api_key = "none"` and `base_url = "http://0.0.0.0:8000/v1"`
* Server readiness may take ~7–8 minutes. `guided_choice` requires guided decoding to be enabled.
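
Since startup can take several minutes, it may help to poll the server before sending requests. The sketch below is one possible approach using only the standard library. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the example assumes the server exposes the OpenAI-compatible `/v1/models` route on `localhost:8000`, matching the command above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage against a local server (assumed address):
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))
```

Passing the probe as a callable keeps the waiting logic independent of the HTTP details, so it can be exercised without a running server.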

## 3. Sanity Check OpenAI Client Requests

### Export your API key and base URL environment variables

In [None]:
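# `!export` runs in a throwaway subshell, so the variables never reach the notebook kernel; `%env` sets them in-process.
# Replace the placeholder with your own key, and never commit a real key to a notebook.
%env OPENAI_API_KEY=sk-or-v1-...
%env OPENAI_BASE_URL=https://openrouter.ai/api/v1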

In [None]:
import os

# OpenAI Client Environment Variables
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "sk-or-v1-...")
OPENAI_BASE_URL = os.environ.get("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")

In [None]:
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

# Test Client connects to Server
result = client.models.list()

In [None]:
# Test Simple Text Completion
chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "How many strawberries are in the word r?"}],
    model=MODEL_ID,
)

result = chat_completion.choices[0].message.content
print(result)

In [None]:
# Test Structured Output
completion = client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "Classify this sentiment: Daft is wicked fast!"}],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)

In [None]:
# Test Image Understanding

completion = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        },
    ],
)
print(completion.choices[0].message.content)

### Test Combining Image Inputs with Structured Output

Experimenting with the prompt and the structured output settings shows how each affects the results.

Try commenting out the `response_format` argument or uncommenting the third text prompt to see how the results change.

vLLM also supports a simpler usage pattern, `extra_body={"guided_choice": ["A", "B", "C", "D"]}`, but for compatibility with OpenRouter we use the Pydantic JSON Schema approach.

In [None]:
from enum import Enum

from pydantic import BaseModel, Field


# Define a Pydantic Model for the Choice Response (overkill)
class Choices(str, Enum):
    A = "A"
    B = "B"
    C = "C"
    D = "D"


class ChoiceResponse(BaseModel):
    choice: Choices = Field(..., description="Provide the letter of the correct choice with no other text.")


# Test Image Understanding
completion = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                {
                    "type": "text",
                    "text": "Which insect is portrayed in the image: A. Ladybug, B. Beetle, C. Bee, D. Wasp ",
                },
                # {"type": "text", "text": "Answer with only the letter from the multiple choice. "}  # Try uncommenting me
            ],
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "math-response",
            "schema": ChoiceResponse.model_json_schema(),
        },
    },
)
response = completion.choices[0].message.content
print(response)

In [None]:
# Pydantic Validation
choice_obj = ChoiceResponse.model_validate_json(response)
print(choice_obj)

## 4. Dataset Preprocessing

[HuggingFaceM4/the_cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron/viewer?views%5B%5D=ai2d) is a massive collection of 50 vision-language datasets that were used for the fine-tuning of the vision-language model Idefics2. We will use the AI2D subset to develop and test our pipeline. 

We can begin by reading directly from Hugging Face datasets, using the `hf://` prefix in the URL string.

In [None]:
import daft

# There are a total of 2,434 images in this dataset, at a size of ~ 500 MB
# DATASET_URI ="hf://datasets/HuggingFaceM4/the_cauldron/ai2d/train-00000-of-00001-2ce340398c113b79.parquet"
df_raw = daft.read_parquet(DATASET_URI).limit(ROW_LIMIT).collect()
df_raw.show(3)

Taking a look at the schema, we can see the familiar nested messages datatype from chat completions inside the `texts` column.


In [None]:
print(df_raw.schema())

Let's decode the image bytes to preview the images, and add one more column with the base64 encoding.

Note: You can click on any cell to preview its contents.

In [None]:
from daft import col

df_img = df_raw.explode(col("images")).with_columns(
    {
        "image": col("images").struct.get("bytes").image.decode(),
        "image_base64": col("images").struct.get("bytes").encode("base64"),
    }
)
df_img.show(3)
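
The `image_base64` column is later embedded in a data URL of the form `data:image/png;base64,...` when building chat messages. The sketch below shows that step outside of Daft, using only the standard library. The helper `to_data_url` is hypothetical, and the magic-byte check is an added assumption for cases where a dataset mixes PNG and JPEG images; whether AI2D does so is not verified here.

```python
import base64

# Leading "magic" bytes for two common image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def to_data_url(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 data URL with a matching MIME type."""
    if image_bytes.startswith(PNG_MAGIC):
        mime = "image/png"
    elif image_bytes.startswith(JPEG_MAGIC):
        mime = "image/jpeg"
    else:
        raise ValueError("unsupported or unrecognized image format")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Not a real image, only enough bytes to exercise the header check.
fake_png = PNG_MAGIC + b"\x00" * 4
print(to_data_url(fake_png))
```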

#### Preprocessing the 'texts' column to extract Question, Choices, and Answer Columns

Copying an entry from the `texts` column yields a list of dicts, similar to OpenAI chat messages, of the form:

```python
[{
    "user": """Question:
            
        From the above food web diagram, what cause kingfisher to increase

        Choices:
            A. decrease in fish
            B. decrease in water boatman
            C. increase in fish
            D. increase in algae

        Answer with the letter.""",

    "assistant": "Answer: C",
    "source": "AI2D",
}, ...]
```

In [None]:
# Explode the List of Dicts inside "texts" to extract "user" and "assistant" messages
df_text = df_img.explode(col("texts")).collect()

# Extract User and Assistant Messages
df_text = df_text.with_columns(
    {"user": df_text["texts"].struct.get("user"), "assistant": df_text["texts"].struct.get("assistant")}
).collect()
df_text.show(3)

We can also go above and beyond and parse each text input into individual question, choices, and answer columns.

In [None]:
# Parse the "user" and "assistant" messages into question, choices, and answer
df_prepped = df_text.with_columns(
    {
        "question": col("user")
        .str.extract(r"(?s)Question:\s*(.*?)\s*Choices:")
        .str.replace("Choices:", "")
        .str.replace("Question:", ""),
        "choices_string": col("user")
        .str.extract(r"(?s)Choices:\s*(.*?)\s*Answer?\.?")
        .str.replace("Choices:\n", "")
        .str.replace("Answer", ""),
        "answer": col("assistant").str.extract(r"Answer:\s*(.*)$").str.replace("Answer:", ""),
    }
).collect()

df_prepped.show(3)
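
To see what these patterns extract, the same parsing can be checked on a single string with Python's `re` module. This is only a sketch: the helper `parse_example` is hypothetical, the sample text is the entry shown earlier with its indentation removed, and the code reads capture group 1 directly rather than taking the whole match and stripping the labels as the Daft expressions do.

```python
import re

USER_MESSAGE = """Question:
From the above food web diagram, what cause kingfisher to increase
Choices:
A. decrease in fish
B. decrease in water boatman
C. increase in fish
D. increase in algae
Answer with the letter."""
ASSISTANT_MESSAGE = "Answer: C"


def parse_example(user: str, assistant: str) -> dict[str, str]:
    """Split one question/answer pair into question, choices, and answer."""
    question = re.search(r"(?s)Question:\s*(.*?)\s*Choices:", user)
    choices = re.search(r"(?s)Choices:\s*(.*?)\s*Answer", user)
    answer = re.search(r"Answer:\s*(.*)$", assistant)
    if not (question and choices and answer):
        raise ValueError("message does not follow the expected template")
    return {
        "question": question.group(1),
        "choices_string": choices.group(1),
        "answer": answer.group(1).strip(),
    }


parsed = parse_example(USER_MESSAGE, ASSISTANT_MESSAGE)
print(parsed["question"])
print(parsed["choices_string"])
print(parsed["answer"])
```

Raising on a template mismatch makes malformed rows visible early instead of letting empty strings flow into the prompt.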

## 5. Multimodal Inference with Structured Outputs

Now we will move on to scaling our OpenAI client calls with Daft UDFs, exploring three methods of implementing structured outputs on images:
1. Naive Row-Wise UDF
2. Naive Async Batch UDF
3. Production Batch UDF

### Minimal Row-Wise UDF

In [None]:
@daft.func()
def struct_output_rowwise(
    model_id: str, text_col: str, image_col: str, model_json_schema: dict | None = None
) -> str:
    # Synchronous client: each row issues one blocking request
    client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

    content = [{"type": "text", "text": text_col}]

    if image_col:
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_col}"},
            }
        )

    # Request JSON-schema output only when a schema is supplied
    response_format = None
    if model_json_schema:
        response_format = {
            "type": "json_schema",
            "json_schema": {
                "name": "math-response",
                "schema": model_json_schema,
            },
        }

    result = client.chat.completions.create(
        messages=[{"role": "user", "content": content}],
        model=model_id,
        response_format=response_format,
    )
    return result.choices[0].message.content

In [None]:
import time

from daft import col
from daft.functions import format

# Run the Rowwise UDF
start = time.time()
df_rowwise_udf = (
    # Inference
    df_prepped.with_column(
        "result",
        struct_output_rowwise(
            model_id=MODEL_ID,
            text_col=format("{} \n {}", col("question"), col("choices_string")),
            image_col=col("image_base64"),
            model_json_schema=ChoiceResponse.model_json_schema(),
        ),
    )
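    # Postprocessing: the JSON-schema response looks like {"choice": "C"}, so extract the letter before comparing
    .with_column("result", col("result").str.extract(r'"choice"\s*:\s*"([A-D])"', 1))
    .with_column("is_correct", col("result").str.lstrip().str.rstrip() == col("answer").str.lstrip().str.rstrip())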
    .limit(ROW_LIMIT)
    .collect()
)
end = time.time()
print(
    f"Row-wise UDF - Processed {df_rowwise_udf.count_rows()} rows in {end-start} seconds, {df_rowwise_udf.count_rows()/(end-start)} rows/s"
)

Write down each of your runs here:
- Row-wise UDF - Processed ...

### Minimal Async Batch UDF

In [None]:
import asyncio

from openai import AsyncOpenAI

from daft import DataType as dt

client = AsyncOpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)


@daft.udf(return_dtype=dt.string())
def struct_output_batch(
    model_id: str,
    text_col: daft.Series,
    image_col: daft.Series,
    model_json_schema: dict | None = None,
    extra_body: dict | None = None,
) -> list[str]:
    # Nested Async Function
    async def generate(model_id: str, text: str, image: str) -> str:
        content = [{"type": "text", "text": text}]

        # Argument Handling
        if image:
            content.append(
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image}"},
                }
            )

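        # Request JSON-schema output only when a schema is supplied;
        # without this default, response_format is unbound when only extra_body is used
        response_format = None
        if model_json_schema:
            response_format = {
                "type": "json_schema",
                "json_schema": {
                    "name": "math-response",
                    "schema": model_json_schema,
                },
            }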

        # Inference
        result = await client.chat.completions.create(
            messages=[{"role": "user", "content": content}],
            model=model_id,
            response_format=response_format,
            extra_body=extra_body,
        )
        return result.choices[0].message.content

    # Input Handling
    texts = text_col.to_pylist()
    images = image_col.to_pylist()

    # Async
    async def gather_completions() -> list[str]:
        tasks = [generate(model_id, t, i) for t, i in zip(texts, images)]
        return await asyncio.gather(*tasks)

    return asyncio.run(gather_completions())

In [None]:
# 2. Run the Batch UDF
start = time.time()
df_batch_udf = (
    df_prepped.with_column(
        "result",
        struct_output_batch(
            model_id=MODEL_ID,
            text_col=format("{} \n {}", col("question"), col("choices_string")),  # Prompt Template
            image_col=col("image_base64"),
            extra_body={"guided_choice": ["A", "B", "C", "D"]},
        ),
    )
    .with_column("is_correct", col("result").str.lstrip().str.rstrip() == col("answer").str.lstrip().str.rstrip())
    .limit(ROW_LIMIT)
    .collect()
)
end = time.time()
print(
    f"Batch UDF - Processed {df_batch_udf.count_rows()} rows in {end-start} seconds, {df_batch_udf.count_rows()/(end-start)} rows/s"
)

Write down each of your runs here:
- Batch UDF - Processed ...

#### Challenge
Before you move on to the Production UDF, try increasing `ROW_LIMIT` to 500, 1000, and 2000 rows.
- What happens if you try to run the full dataset (7462 rows)?
- How does the row processing rate change when you increase `ROW_LIMIT`?
- Do you run into any issues?


### Production UDF

Here is what a production version of our minimal user defined functions looks like.

In [None]:
BATCH_SIZE = 32
CONCURRENCY = 4
MAX_CONN = 32
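
One way a limit such as `MAX_CONN` might be applied is to cap the number of in-flight requests with an `asyncio.Semaphore`. The sketch below is an assumption about how such a cap could work and is not how the UDF in this notebook uses the value. `bounded_gather`, `demo`, and `fake_request` are hypothetical names, and `fake_request` stands in for an awaited chat-completion call.

```python
import asyncio

MAX_CONN = 32  # same name as the parameter above; its use here is an assumption


async def bounded_gather(coro_fns, max_conn: int = MAX_CONN) -> list:
    """Run zero-argument coroutine functions with at most `max_conn` in flight."""
    semaphore = asyncio.Semaphore(max_conn)

    async def run_one(fn):
        async with semaphore:
            return await fn()

    return await asyncio.gather(*(run_one(fn) for fn in coro_fns))


async def demo(num_requests: int = 10, max_conn: int = 3) -> tuple[list[int], int]:
    in_flight = 0
    peak = 0

    async def fake_request(i: int) -> int:
        # Hypothetical stand-in for client.chat.completions.create(...)
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i

    fns = [lambda i=i: fake_request(i) for i in range(num_requests)]
    results = await bounded_gather(fns, max_conn=max_conn)
    return results, peak


# In a notebook cell with a running event loop, use `await demo()` instead.
results, peak = asyncio.run(demo())
print(results, peak)
```

`asyncio.gather` preserves input order, so results still line up with the rows of the batch even though requests finish at different times.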

In [None]:
from typing import Any


@daft.udf(return_dtype=daft.DataType.string(), concurrency=CONCURRENCY, batch_size=BATCH_SIZE)
class StructuredOutputsProdUDF:
    def __init__(self, base_url: str, api_key: str):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)

        # Handle Event Loop Exhaustion
        try:
            self.loop = asyncio.get_running_loop()
        except RuntimeError:
            self.loop = asyncio.new_event_loop()
            asyncio.set_event_loop(self.loop)

    def __call__(
        self,
        model_id: str,
        text_col: daft.Series,
        image_col: daft.Series,
        sampling_params: dict[str, Any] | None = None,
        model_json_schema: dict | None = None,
        extra_body: dict[str, Any] | None = None,
    ) -> list[str]:
        # Argument Handling
        if model_json_schema:
            response_format = {
                "type": "json_schema",
                "json_schema": {
                    "name": "math-response",
                    "schema": model_json_schema,
                },
            }
        else:
            response_format = None

        # Nested Async Function
        async def generate(text: str, image: str) -> str:
            content = []
            if image:
                content.append(
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image}"},
                    }
                )

            if text:
                content.append({"type": "text", "text": text})

            result = await self.client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": content,  # Dataset prefers image first
                    }
                ],
                model=model_id,
                response_format=response_format,
                extra_body=extra_body,
                **(sampling_params or {}),  # tolerate sampling_params=None
            )
            return result.choices[0].message.content

        async def gather_completions(texts, images) -> list[str]:
            tasks = [generate(t, i) for t, i in zip(texts, images)]
            return await asyncio.gather(*tasks)

        texts = text_col.to_pylist()
        images = image_col.to_pylist()

        return self.loop.run_until_complete(gather_completions(texts, images))
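
The UDF above creates its event loop once in `__init__` and drives every batch through `run_until_complete`, presumably to avoid creating and closing a fresh loop per batch as `asyncio.run` does. A minimal, Daft-free sketch of that pattern follows. `BatchRunner` and `_generate` are hypothetical, with `_generate` standing in for the awaited OpenAI request.

```python
import asyncio


class BatchRunner:
    """Keeps one event loop alive across batches, mirroring the UDF's __init__/__call__ split."""

    def __init__(self) -> None:
        # Created once per instance, like the client and loop in the UDF above.
        self.loop = asyncio.new_event_loop()

    async def _generate(self, text: str) -> str:
        # Hypothetical stand-in for an awaited chat-completion request.
        await asyncio.sleep(0)
        return text.upper()

    async def _gather(self, texts: list[str]) -> list[str]:
        return await asyncio.gather(*(self._generate(t) for t in texts))

    def __call__(self, texts: list[str]) -> list[str]:
        # Each batch is driven to completion on the same loop.
        return self.loop.run_until_complete(self._gather(texts))

    def close(self) -> None:
        self.loop.close()


runner = BatchRunner()
first = runner(["a", "b"])
second = runner(["c"])
runner.close()
print(first, second)
```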

In [None]:
# 3. Production UDF
start = time.time()
df_prod_udf = (
    df_prepped.with_column(
        "result",
        StructuredOutputsProdUDF.with_init_args(
            base_url=OPENAI_BASE_URL,
            api_key=OPENAI_API_KEY,
        ).with_concurrency(CONCURRENCY)(
            model_id=MODEL_ID,
            text_col=format("{} \n {}", col("question"), col("choices_string")),  # Prompt Template
            image_col=col("image_base64"),
            extra_body={"guided_choice": ["A", "B", "C", "D"]},
        ),
    )
    .with_column("is_correct", col("result").str.lstrip().str.rstrip() == col("answer").str.lstrip().str.rstrip())
    .limit(ROW_LIMIT)
    .collect()
)
end = time.time()
print(
    f"Prod UDF - Processed {df_prod_udf.count_rows()} rows in {end-start} seconds, {df_prod_udf.count_rows()/(end-start)} rows/s"
)

___
## Analysis
Here we evaluate Gemma 3's image understanding by comparing the structured output responses to the ground-truth answers.

In [None]:
pass_fail_rate = df_prod_udf.where(col("is_correct")).count_rows() / df_prod_udf.count_rows()
print(f"Pass/Fail Rate: {pass_fail_rate}")

In [None]:
# How does this compare without images?
# Here we will use Daft's native inference function llm_generate
from daft.functions import llm_generate

start = time.time()
df_prod_no_img = (
    df_prepped.with_column(
        "result",
        llm_generate(
            input_column=format("{} \n {}", col("question"), col("choices_string")),  # Prompt Template
            model=MODEL_ID,
            extra_body={"guided_choice": ["A", "B", "C", "D"]},
            api_key=OPENAI_API_KEY,
            base_url=OPENAI_BASE_URL,
            provider="openai",
        ),
    )
    .with_column("is_correct", col("result").str.lstrip().str.rstrip() == col("answer").str.lstrip().str.rstrip())
    .collect()
)
end = time.time()
print(
    f"llm_generate - Processed {df_prod_no_img.count_rows()} rows in {end-start} seconds,  {df_prod_no_img.count_rows()/(end-start)} rows/s"
)

In [None]:
pass_fail_rate_no_img = df_prod_no_img.where(col("is_correct")).count_rows() / df_prod_no_img.count_rows()
print(f"Pass/Fail Rate (no images): {pass_fail_rate_no_img}")
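
The pass rate computed above is the fraction of rows where the stripped result equals the stripped answer. The sketch below restates that calculation in plain Python so the with-image and without-image runs can be compared against a guessing baseline. The function `accuracy` is hypothetical, the sample values are made up for illustration, and the 25% baseline assumes every question has exactly four choices.

```python
from typing import Optional


def accuracy(results: list[Optional[str]], answers: list[str]) -> float:
    """Fraction of rows whose stripped result equals the stripped answer."""
    if len(results) != len(answers):
        raise ValueError("results and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        1 for r, a in zip(results, answers) if r is not None and r.strip() == a.strip()
    )
    return correct / len(answers)


# Made-up values for illustration only, not measurements from this notebook.
with_images = accuracy(["C", " B", "A", "D"], ["C", "B", "D", "D"])
without_images = accuracy(["C", "A", "A", None], ["C", "B", "D", "D"])
random_baseline = 1 / 4  # assumes four answer choices per question

print(with_images, without_images, random_baseline)
```

A text-only score close to the baseline would suggest the questions really do depend on the image, while a much higher score would hint that some answers can be guessed from the text alone.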

---
## Putting Everything Together: Evaluating Gemma Across the AI2D Dataset
Now that we have walked through implementing this image understanding evaluation pipeline end to end, let's put it all together so we can take full advantage of lazy evaluation and leave room for future extension and reuse.

In [None]:
import asyncio
import time
from typing import Any

from openai import AsyncOpenAI

import daft
from daft import col
from daft.functions import format


@daft.udf(return_dtype=daft.DataType.string(), concurrency=4)
class StructuredOutputsProdUDF:
    def __init__(self, base_url: str, api_key: str):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        try:
            self.loop = asyncio.get_running_loop()
        except RuntimeError:
            self.loop = asyncio.new_event_loop()
            asyncio.set_event_loop(self.loop)

    def __call__(
        self,
        model_id: str,
        text_col: daft.Series,
        image_col: daft.Series,
        sampling_params: dict[str, Any] | None = None,
        model_json_schema: dict | None = None,
        extra_body: dict[str, Any] | None = None,
    ):
        # Argument Handling
        if model_json_schema:
            response_format = {
                "type": "json_schema",
                "json_schema": {
                    "name": "math-response",
                    "schema": model_json_schema,
                },
            }
        else:
            response_format = None

        async def generate(text: str, image: str) -> str:
            content = []
            if image:
                content.append(
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image}"},
                    }
                )
            if text:
                content.append({"type": "text", "text": text})

            result = await self.client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": content,  # Dataset prefers image first
                    }
                ],
                model=model_id,
                response_format=response_format,
                extra_body=extra_body,
                **(sampling_params or {}),  # tolerate sampling_params=None
            )
            return result.choices[0].message.content

        async def gather_completions(texts, images) -> list[str]:
            tasks = [generate(t, i) for t, i in zip(texts, images)]
            return await asyncio.gather(*tasks)

        texts = text_col.to_pylist()
        images = image_col.to_pylist()

        return self.loop.run_until_complete(gather_completions(texts, images))


class TheCauldronImageUnderstandingEvaluationPipeline:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key

    def __call__(
        self,
        model_id: str,
        dataset_uri: str,
        sampling_params: dict[str, Any] | None = None,
        concurrency: int = 4,
        row_limit: int | None = None,
        is_eager: bool = False,
    ) -> daft.DataFrame:
        """Executes dataset loading, preprocessing, inference, and post-processing.

        Evaluation must be run separately since it requires materialization.
        """
        if is_eager:
            # Load Dataset and Materialize
            df = self.load_dataset(dataset_uri)
            df = df.limit(row_limit) if row_limit else df
            df = self._log_processing_time(df)

            # Preprocess
            df = self.preprocess(df)
            df = self._log_processing_time(df)

            # Perform Inference
            df = self.infer(df, model_id, sampling_params, concurrency)
            df = self._log_processing_time(df)

            # Post-Process
            df = self.postprocess(df)
            df = self._log_processing_time(df)
        else:
            df = self.load_dataset(dataset_uri)
            df = self.preprocess(df)
            df = self.infer(df, model_id, sampling_params, concurrency)
            df = self.postprocess(df)
            df = df.limit(row_limit) if row_limit else df

        return df

    @staticmethod
    def _log_processing_time(df: daft.DataFrame):
        start = time.time()
        df_materialized = df.collect()
        end = time.time()
        num_rows = df_materialized.count_rows()
        print(f"Processed {num_rows} rows in {end-start} sec, {num_rows/(end-start)} rows/s")
        return df_materialized

    def load_dataset(self, uri: str) -> daft.DataFrame:
        return daft.read_parquet(uri)

    def preprocess(self, df: daft.DataFrame) -> daft.DataFrame:
        # Convert png image byte string to base64
        df = df.explode(col("images")).with_column(
            "image_base64",
            df["images"].struct.get("bytes").encode("base64"),
        )

        # Explode Lists of User Prompts and Assistant Answer Pairs
        df = df.explode(col("texts")).with_columns(
            {"user": df["texts"].struct.get("user"), "assistant": df["texts"].struct.get("assistant")}
        )

        # Parse the Question/Answer Strings
        df = df.with_columns(
            {
                "question": df["user"]
                .str.extract(r"(?s)Question:\s*(.*?)\s*Choices:")
                .str.replace("Choices:", "")
                .str.replace("Question:", ""),
                "choices_string": df["user"]
                .str.extract(r"(?s)Choices:\s*(.*?)\s*Answer?\.?")
                .str.replace("Choices:\n", "")
                .str.replace("Answer", ""),
                "answer": df["assistant"].str.extract(r"Answer:\s*(.*)$").str.replace("Answer:", ""),
            }
        )
        return df

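    def infer(
        self,
        df: daft.DataFrame,
        model_id: str = "google/gemma-3n-e4b-it",
        sampling_params: dict[str, Any] | None = None,
        concurrency: int = 4,
        extra_body: dict[str, Any] | None = None,
    ) -> daft.DataFrame:
        # Avoid mutable default arguments and tolerate None passed by the caller
        sampling_params = sampling_params or {"temperature": 0.0}
        extra_body = extra_body or {"guided_choice": ["A", "B", "C", "D"]}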
        return df.with_column(
            "result",
            StructuredOutputsProdUDF.with_init_args(
                base_url=self.base_url,
                api_key=self.api_key,
            ).with_concurrency(concurrency)(
                model_id=model_id,
                text_col=format("{} \n {}", col("question"), col("choices_string")),  # Prompt Template
                image_col=col("image_base64"),
                sampling_params=sampling_params,
                extra_body=extra_body,
            ),
        )

    def postprocess(self, df: daft.DataFrame) -> daft.DataFrame:
        df = df.with_column(
            "is_correct", col("result").str.lstrip().str.rstrip() == col("answer").str.lstrip().str.rstrip()
        )
        return df

    def evaluate(self, df: daft.DataFrame) -> float:
        pass_fail_rate = df.where(col("is_correct")).count_rows() / df.count_rows()
        return pass_fail_rate

In [None]:
# Our entire pipeline collapses into three statements
dataset_uri = "hf://datasets/HuggingFaceM4/the_cauldron/ai2d/train-00000-of-00001-2ce340398c113b79.parquet"
pipeline = TheCauldronImageUnderstandingEvaluationPipeline(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)
# dataset_uri and row_limit are arguments of __call__, not of the constructor
df = pipeline(
    model_id=MODEL_ID,
    dataset_uri=dataset_uri,
    sampling_params={"temperature": 0.1},
    row_limit=ROW_LIMIT,
    is_eager=True,
)

In [None]:
# Materialize if not eager
df_mat = df.collect()

In [None]:
# Print the Pass/Fail Rate
print(f"Pass/Fail Rate: {pipeline.evaluate(df_mat)}")

---
## Conclusion

In this notebook we explored how to evaluate Gemma 3's image understanding using a subset of Hugging Face's TheCauldron dataset. The AI2D subset we used is just one of 50 vision-language datasets in the collection, totaling millions of rows, that can be used for evaluating or training vision-language models. You can also use this pipeline to evaluate model performance across sampling parameters or model variants. Please note that not all Gemma 3 series models support image inputs, and using datasets outside of TheCauldron would require different preprocessing stages.

A natural next step would be to parallelize this pipeline across multiple datasets and multiple GPUs. In this scenario, I recommend switching Daft's execution context to Ray, a distributed compute framework.

```bash
pip install "daft[huggingface,ray]"
```

You can set Daft's execution context to Ray by installing the `ray` optional dependency as shown above and running the following at the top of your script.

```python
import daft

daft.set_runner_ray()
```

Simply run your pipeline across each dataset URI and collect the results. Daft will orchestrate Ray in the background for you.