# Evaluating Image Understanding at Scale with Structured Outputs and LLM-as-a-Judge Feedback

*An end-to-end example of **Multimodal Structured Outputs** with Daft, vLLM, and Qwen3-VL-8B-Instruct*

<a target="_blank" href="https://colab.research.google.com/github/everettVT/daft-examples-1/blob/main/notebooks/mm_structured_outputs.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Introduction

**Structured Outputs** refers to a family of features that enable language models to respond in a constrained format. While language models continue to improve and demonstrate emergent abilities, their unpredictable nature makes them difficult to integrate with traditional software systems. Most real-world AI use cases leverage structured outputs to some extent, whether to execute tool calls or to adhere to Pydantic models.

Since structured outputs are a core primitive of almost every LM workload, here are a few canonical references that are worth saving:

- [**Getting Structured LLM Output** from dottxt founders Will Kurt & Cameron Pfiffer (DeepLearning.ai Course)](https://learn.deeplearning.ai/courses/getting-structured-llm-output/information)
- [**Coding for Structured Generation with LLMs** by Will Kurt](https://blog.dottxt.ai/coding-for-structured-generation.html)
- [**Structured Decoding in vLLM: A Gentle Introduction** by Aaron Pham](https://www.bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction#user-content-fn-7)

### What are Structured Outputs?

The underlying technology that makes structured outputs possible is called guided decoding. Guided decoding operates on the model's logits: it adjusts the probabilities of the next possible tokens to enforce constraints or guide the generation process. This can be done through various methods, such as applying a logit bias to penalize or promote specific tokens, filtering invalid tokens based on rules like a Finite State Machine (FSM), or using more advanced techniques that interact with the model's internal probability distribution.
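As a rough illustration of the token-filtering idea, the sketch below masks every token outside an allowed set before normalizing. The vocabulary and logit values are made up for the example, and real inference engines apply the mask to the full vocabulary tensor at every decoding step rather than to a small dictionary.

```python
import math


def masked_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Return next-token probabilities with every disallowed token forced to zero."""
    if not allowed & logits.keys():
        raise ValueError("at least one allowed token must be in the vocabulary")
    # Disallowed tokens get a logit of -inf, which exp() maps to exactly 0.
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical logits: without a constraint the model would start with "The".
logits = {"The": 3.2, "A": 1.1, "B": 2.4, "C": 0.3, "D": -0.5}
probs = masked_softmax(logits, allowed={"A", "B", "C", "D"})
best = max(probs, key=probs.get)
print(best, probs["The"])
```

With the mask in place, the highest-scoring token overall can no longer be sampled, and the remaining probability mass is shared among the four valid letters.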

Structured Outputs consists of 5 strategies that define the desired output type:

- Basic Python Types: `int`, `float`, `bool`...
- Multiple Choices: using `Literal` or `Enum`
- JSON Schemas: using Pydantic models or dataclasses
- Regex
- Context-free Grammars
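To make a few of these strategies concrete, the sketch below expresses the same "answer with one letter" constraint three ways: as a `Literal`, as a regex, and as a hand-written JSON schema. The checker functions are hypothetical helpers that validate text after the fact; guided decoding enforces the same constraints during generation, so this is only an illustration of what each constraint accepts.

```python
import json
import re
from typing import Literal, get_args

Choice = Literal["A", "B", "C", "D"]   # multiple choice
CHOICE_REGEX = re.compile(r"[ABCD]")   # regex
CHOICE_SCHEMA = {                      # JSON schema, written by hand for this example
    "type": "object",
    "properties": {"choice": {"type": "string", "enum": ["A", "B", "C", "D"]}},
    "required": ["choice"],
}


def satisfies_literal(text: str) -> bool:
    return text in get_args(Choice)


def satisfies_regex(text: str) -> bool:
    return CHOICE_REGEX.fullmatch(text) is not None


def satisfies_schema(text: str) -> bool:
    """Minimal check for this one schema, not a general JSON Schema validator."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    allowed = CHOICE_SCHEMA["properties"]["choice"]["enum"]
    return set(payload) == {"choice"} and payload["choice"] in allowed


print(satisfies_literal("C"), satisfies_regex("Answer: B"), satisfies_schema('{"choice": "D"}'))
```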

While there are a tremendous number of examples in pure Python, few tutorials demonstrate structured outputs within a large-scale processing context. Even fewer, if any, demonstrate how to run batch structured outputs with multimodal data on your own inference server. Here, we will walk you through the entire process, running your own OpenAI-compatible server with [vLLM](https://docs.vllm.ai/en/v0.6.3.post1/serving/openai_compatible_server.html).

### What is an LLM-as-a-Judge?

LLM-as-a-Judge is a framework where a language model evaluates the outputs of other AI systems—providing qualitative assessments beyond simple accuracy metrics. Rather than relying solely on human evaluation (expensive, slow, inconsistent) or traditional metrics like BLEU/ROUGE (surface-level, miss semantic nuance), LLM judges can assess relevance, coherence, factual accuracy, and instruction adherence at scale.

The approach was formalized in the paper that introduced MT-Bench and Chatbot Arena, which showed that strong LLM judges such as GPT-4 can reach over 80% agreement with human preferences, the same level of agreement found between humans themselves.
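Agreement here is simply the fraction of examples on which the judge and the human reach the same verdict. A minimal sketch, using made-up verdicts purely for illustration (they are not data from the paper):

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the judge and the human gave the same verdict."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up pairwise verdicts: which response won, or "tie".
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]
rate = agreement_rate(judge, human)
print(rate)
```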

In this notebook, we extend LLM-as-a-Judge to the multimodal domain, using a vision-language model to analyze why our image understanding pipeline succeeded or failed on specific examples.

See also:

- **Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena**, Zheng et al. (NeurIPS 2023): the foundational academic paper that introduced the LLM-as-a-Judge framework, the MT-Bench benchmark, and Chatbot Arena. It shows that GPT-4 achieves over 80% agreement with humans.
- **Using an LLM-as-a-Judge**, Cloud Security Alliance: a practical overview covering the scalability benefits, inference-time safety, and cost-effectiveness of LLM judges.
- **LLM-as-a-Judge**, Wikipedia.

### Why This Matters

Large-scale multimodal structured output generation is a real-world use case that every enterprise team faces when attempting to work with massive amounts of internal or private data. These teams face significant hurdles with traditional tooling, especially for cutting-edge use cases like batch tool calls for background agents or reinforcement learning with verifiable rewards.

Daft's unified multimodal data processing engine is purpose built to support workloads like this and is rapidly becoming the default engine of choice for teams deploying frontier AI solutions in production.

In this notebook, we will leverage Daft to evaluate the image understanding accuracy of [Qwen3-VL-8B](https://github.com/QwenLM/Qwen3-VL) using Hugging Face's [the_cauldron dataset](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron). By the end of this notebook, you will be ready to implement your own distributed batch structured outputs pipeline with a copy-paste script you can use in your own environment.

> NOTE:
  This notebook contains an advanced path where you can use vLLM as your inference solution. In this case, Google Colab's A100 GPU instance is recommended. Additionally, to access [qwen/qwen3-vl-8b-instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) you may need to accept the model's usage terms and authenticate with Hugging Face. If you do not have access to a GPU, you can use OpenRouter instead.

### Table of Contents

1. [Setup](#1-setup)
2. [Inference Configuration and Setup](#2-inference-configuration-and-setup)
3. [Sanity Checks](#3-sanity-check-openai-client-requests) 
4. [Data Loading and Preprocessing](#4-data-loading-and-preprocessing)
5. [Multimodal Structured Outputs with `prompt`](#5-multimodal-structured-outputs-with-the-prompt-function)
6. [Ablation Study](#6-analysis)
7. [Accuracy Analysis and Comparison](#7-analyzing-the-results)
8. [LLM-as-a-Judge Evaluation](#8-llm-as-a-judge-evaluation)
9. [Putting it all together](#9-putting-it-all-together)
10. [Conclusion](#conclusion)



## 1. Setup

### Install Dependencies

In [78]:
!uv pip install -q "daft>=0.6.14" openai numpy pillow ipykernel ipywidgets python-dotenv pydantic

### Configure Parameters

In [None]:
MODEL_ID = "qwen/qwen3-vl-8b-instruct"        # vLLM & OpenRouter
DATASET_URI = "HuggingFaceM4/the_cauldron"

## 2. Inference Configuration and Setup

**Option 1: Use OpenRouter (provider)**

[OpenRouter](https://openrouter.ai/models?max_price=0.5&order=top-weekly) has model endpoints for [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). That means if you don't have access to an A100 GPU or a PRO Google Colab subscription, you can still walk through this notebook without spinning up a production vLLM server.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv() # load your environment variables from .env file

OPENAI_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENAI_BASE_URL = "https://openrouter.ai/api/v1/"

**Option 2: Launch vLLM OpenAI Compatible Server** (Advanced)

Begin by installing vLLM. After the install finishes, you will be prompted to restart the session. Once it has restarted, proceed to the next step.

In [51]:
OPENAI_API_KEY = "none"
OPENAI_BASE_URL = "http://0.0.0.0:8000/v1"
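Both options set the same two variables, so a small helper can pick between them. The sketch below is one hypothetical way to do that: `resolve_endpoint` is not part of the notebook, and it assumes the presence of `OPENROUTER_API_KEY` is what signals that OpenRouter should be used.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return (api_key, base_url), preferring OpenRouter when a key is available."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY")
    if key:
        return key, "https://openrouter.ai/api/v1/"
    # No key found: fall back to the local vLLM server settings used in this notebook.
    return "none", "http://0.0.0.0:8000/v1"


OPENAI_API_KEY, OPENAI_BASE_URL = resolve_endpoint()
print(OPENAI_BASE_URL)
```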

In [None]:
!pip install -q vllm

**Launch vLLM OpenAI Compatible Server**

Run the following vLLM CLI command in your terminal. If you are in Google Colab, you can open a terminal by clicking the terminal icon in the bottom left of the UI.

```bash
python -m vllm.entrypoints.openai.api_server \
  --model qwen/qwen3-vl-8b-instruct \
  --enable-chunked-prefill \
  --guided-decoding-backend guidance \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000
```

* This config is optimized for Google Colab's A100 instance and Qwen3-VL-8B-Instruct.
* For vLLM online serving, set `api_key = "none"` and `base_url = "http://0.0.0.0:8000/v1"`.
* Server readiness may take ~7–8 minutes. Guided decoding must be enabled.
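Because startup can take several minutes, it can be convenient to poll the server rather than guess when it is ready. The sketch below is a hypothetical helper, not part of the notebook: it assumes the server answers `GET {base_url}/models` (the same route `client.models.list()` calls) once it is up, and the timeout and interval defaults are arbitrary.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection refused, and timeouts
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example usage (commented out so the cell does not block without a server):
# ready = wait_until_ready(lambda: models_endpoint_is_up("http://0.0.0.0:8000/v1"))
```

Passing the probe in as a callable keeps the polling loop independent of the HTTP details, which also makes it easy to exercise without a running server.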

## 3. Sanity Check OpenAI Client Requests

Configuring an inference server for a new model can be a long and painful process.

When configuring vLLM for multimodal structured outputs, it helps to have a series of small tests on hand. Image inputs and guided decoding are not standard options, and tuning a particular model to specific hardware takes multiple iterations to get right. Along the way, we need to make sure our inference server handles every type of request we expect to support.

In [68]:
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

# Test Client connects to Server
result = client.models.list()
print(result)

SyncPage[Model](data=[Model(id='anthropic/claude-opus-4.5', created=1764010580, object=None, owned_by=None, canonical_slug='anthropic/claude-4.5-opus-20251124', hugging_face_id='', name='Anthropic: Claude Opus 4.5', description='Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon computer use. It offers strong multimodal capabilities, competitive performance across real-world coding and reasoning benchmarks, and improved robustness to prompt injection. The model is designed to operate efficiently across varied effort levels, enabling developers to trade off speed, depth, and token usage depending on task requirements. It comes with a new parameter to control token efficiency, which can be accessed using the OpenRouter Verbosity parameter with low, medium, or high.\n\nOpus 4.5 supports advanced tool use, extended context management, and coordinated multi-agent setups, making it well-suited for autonomous

In [69]:
# Test Simple Text Completion
chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "How many strawberries are in the word r?"}],
    model=MODEL_ID,
)

result = chat_completion.choices[0].message.content
print(result)

Actually, there are **zero strawberries** in the word **"r"** — because "r" is just a single letter, and strawberries are fruits, not letters.

But if you're asking this in a playful or riddle-like way — perhaps referencing the **letter "r"** sounding like **"strawberry"**? — then that’s a fun pun!

Let’s break it down:

- The letter **"r"** has **no "s"**, **no "t"**, **no "r"** (wait, it *does* have an "r" — but that’s itself), **no "a"**, **no "w"**, **no "b"**, **no "e"**, **no "r"** again, **no "y"** — so no letters that spell "strawberry".

So, **literally**:  
> **0 strawberries**

**Playfully / as a riddle**:  
> Maybe **1 strawberry** — because the letter "r" *sounds* like "strawberry" — but that’s a stretch!

✅ Final answer: **0 strawberries** — unless you're playing word games!

🍓 Bonus: If you're thinking of the word **"strawberry"** and asking how many **"r"s** are in it — that’s **2**! (one at the start of "strawberry" — wait, no: "strawberry" has **2 r’s** — yes, "strawb

### Test Combining Image Inputs with Structured Output

We can experiment with the prompt and the output schema to understand how each affects the results.

Try commenting out the `response_format` argument or the extra text prompt to see how the results change.

vLLM also supports a simpler usage pattern of `extra_body={"guided_choice": ["A", "B", "C", "D"]}`, but for compatibility with OpenRouter we use the Pydantic JSON Schema approach.
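The request in the next cell reads its image from an `IMAGE_URL` variable, which has to be defined before the cell runs. A public `https://` URL works; if the image only exists locally, one common alternative is a base64 data URL. The helper below is a sketch with a hypothetical name (`to_data_url`), and the eight bytes used in the example are just the PNG file signature, not a viewable image.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL usable in an `image_url` content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes for illustration; in practice, read a real file instead:
# image_bytes = open("my_image.png", "rb").read()
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
example_url = to_data_url(PNG_SIGNATURE)
print(example_url)
```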

In [None]:
from pydantic import BaseModel, Field
from typing import Literal
import time

# Define a pydantic model and try adding/removing fields
class ChoiceResponse(BaseModel):
    choice: str = Field(..., description="Provide the letter of the correct choice with no other text ie: F")
    #tags: list[str] = Field(..., description="list up to 5 tags related to the image.")
    #description: str = Field(..., description="Provide a one sentence description of the image.")
    

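# Test Image Understanding
# NOTE: IMAGE_URL is not defined earlier in this notebook. Set it to the URL of the image you want to test before running this cell.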
start = time.time()
completion = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                {
                    "type": "text",
                    "text": "Which insect is portrayed in the image: A. Ladybug, B. Beetle, C. Bee, D. Wasp ",
                }, 
                # {"type": "text", "text": "Answer with only the letter from the multiple choice. "} 
            ],
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ChoiceResponse",
            "schema": ChoiceResponse.model_json_schema(),
        },
    },
)
response = completion.choices[0].message.content
print(f"Processed in {time.time() - start} seconds")
print(response)

{
  "choice": "C"
  }


In [19]:
# Pydantic validation to convert the response back to a ChoiceResponse object
choice_obj = ChoiceResponse.model_validate_json(response)
print(choice_obj)

choice='C' tags=['insect', 'bee', 'flower', 'garden']


## 4. Data Loading and Preprocessing

[HuggingFaceM4/the_cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron/viewer?views%5B%5D=ai2d) is a massive collection of 50 vision-language datasets spanning millions of rows across:

1. General visual question answering
2. OCR document understanding & text transcription
3. Chart/figure understanding
4. Table understanding
5. Reasoning, logic, maths
6. Textbook/academic questions
7. Differences between 2 images
8. Screenshot to code

This dataset is a great resource for evaluating the image understanding capabilities of a vision language model, as it gives us a wide range of tasks and image compositions to test on. Its size alone makes it particularly useful for training and validation.

For now, we will begin with the AI2D subset, which pairs diagrams with multiple-choice questions.

In [75]:
import daft

df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d").collect()
df_raw.show(3)

"images List[Struct[bytes: Binary, path: String]]","texts List[Struct[user: String, assistant: String, source: String]]"
"[{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }]","[{user: Question: What do respiration and combustion give out Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat Answer with the letter., assistant: Answer: B, source: AI2D, }]"
"[{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }]","[{user: Question: From the given food web, name any two herbivores? Choices: A. coyote, bobcat B. dingo, jack rabbit C. dingo, bobcat D. roadrunner&jack rabbit Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: In the given food web, which are the organism that only eaten roadrunner? Choices: A. dingo, jack rabbit B. coyote, bobcat C. dingo, bobcat D. snake, jack rabbit Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Name a herbivore from the given food web? Choices: A. cactus B. kangaroo rat C. snake D. bobcat Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Name a producer from the given food web? Choices: A. bobcat B. snake C. road runner D. barrel cactus Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: Name an omnivore from the given food web? Choices: A. dingo B. bobcat C. cactus D. kangaroo Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: What is a predator of the roadrunner? Choices: A. kangaroo B. coyote C. dingo D. cactus Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: What will happen if kangroo rats goes extinct? Choices: A. Cactus count will decrease B. Dessert plants growth will decresse C. Snake population will increase D. Snake population will decrease Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: What would be most affected if the cactus all died? Choices: A. coyote B. dingo C. kangaroo rat D. snake Answer with the letter., assistant: Answer: C, source: AI2D, }, {user: Question: Which among the below is a producer in the food chain diagram shown? Choices: A. Kangaroo rat B. Roadrunner C. Dessert grass D. Snake Answer with the letter., assistant: Answer: C, source: AI2D, }, {user: Question: Which is a producer? Choices: A. Coyote B. Desert Grass C. 
Kangaroo D. Dingo Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Who would suffer without kangaroo rats? Choices: A. Desert Grass B. Snake C. Cactus D. Roadrunner Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: desert grasses are known as Choices: A. consumer B. herbivores C. omnivores D. producer Answer with the letter., assistant: Answer: D, source: AI2D, }]"
"[{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }]","[{user: Question: Anatomy One of a series of long curved bones occurring in 12 pairs in humans is called. Choices: A. diaphram B. lung C. none D. ribs Answer with the letter., assistant: Answer: D, source: AI2D, }]"


### Investigating the Schema

Looking at the schema, we can see that the `texts` column holds a nested list of messages, similar to the `messages` format used in chat completions.


In [21]:
print(df_raw.schema())

╭─────────────┬───────────────────────────────────────────────────────────────╮
│ column_name ┆ type                                                          │
╞═════════════╪═══════════════════════════════════════════════════════════════╡
│ images      ┆ List[Struct[bytes: Binary, path: String]]                     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ texts       ┆ List[Struct[user: String, assistant: String, source: String]] │
╰─────────────┴───────────────────────────────────────────────────────────────╯



### Decoding Images into Daft's Image Type

Daft provides a simple way to decode images into its internal Image type. This allows you to use Daft's powerful image processing capabilities to preprocess your images before sending them to a model.

Note: You can click on any cell to preview its contents.

In [92]:
from daft import col

# Explode the list of images (usually just one image anyways)
df_img = df_raw.explode(col("images"))

# Decode the image into daft.DataType.image() 
df_img = df_img.with_column("image_decoded", col("images").struct.get("bytes").decode_image())
df_img.show(3)

"images Struct[bytes: Binary, path: String]","texts List[Struct[user: String, assistant: String, source: String]]",image_decoded Image[MIXED]
"{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }","[{user: Question: What do respiration and combustion give out Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat Answer with the letter., assistant: Answer: B, source: AI2D, }]",
"{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }","[{user: Question: From the given food web, name any two herbivores? Choices: A. coyote, bobcat B. dingo, jack rabbit C. dingo, bobcat D. roadrunner&jack rabbit Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: In the given food web, which are the organism that only eaten roadrunner? Choices: A. dingo, jack rabbit B. coyote, bobcat C. dingo, bobcat D. snake, jack rabbit Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Name a herbivore from the given food web? Choices: A. cactus B. kangaroo rat C. snake D. bobcat Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Name a producer from the given food web? Choices: A. bobcat B. snake C. road runner D. barrel cactus Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: Name an omnivore from the given food web? Choices: A. dingo B. bobcat C. cactus D. kangaroo Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: What is a predator of the roadrunner? Choices: A. kangaroo B. coyote C. dingo D. cactus Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: What will happen if kangroo rats goes extinct? Choices: A. Cactus count will decrease B. Dessert plants growth will decresse C. Snake population will increase D. Snake population will decrease Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: What would be most affected if the cactus all died? Choices: A. coyote B. dingo C. kangaroo rat D. snake Answer with the letter., assistant: Answer: C, source: AI2D, }, {user: Question: Which among the below is a producer in the food chain diagram shown? Choices: A. Kangaroo rat B. Roadrunner C. Dessert grass D. Snake Answer with the letter., assistant: Answer: C, source: AI2D, }, {user: Question: Which is a producer? Choices: A. Coyote B. Desert Grass C. Kangaroo D. 
Dingo Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Who would suffer without kangaroo rats? Choices: A. Desert Grass B. Snake C. Cactus D. Roadrunner Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: desert grasses are known as Choices: A. consumer B. herbivores C. omnivores D. producer Answer with the letter., assistant: Answer: D, source: AI2D, }]",
"{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }","[{user: Question: Anatomy One of a series of long curved bones occurring in 12 pairs in humans is called. Choices: A. diaphram B. lung C. none D. ribs Answer with the letter., assistant: Answer: D, source: AI2D, }]",




### Preprocessing the 'texts' column to extract Question, Choices, and Answer Columns

Copying an entry from the `texts` column yields a list of dicts, similar to an OpenAI messages list, of the form:

```text
[{
    "user": """Question:
            
        From the above food web diagram, what cause kingfisher to increase

        Choices:
            A. decrease in fish
            B. decrease in water boatman
            C. increase in fish
            D. increase in algae

        Answer with the letter.""",

    "assistant": "Answer: C",
    "source": "AI2D",
}, ...]
```
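The next cell flattens this nested structure with `explode` and `unnest`. As a mental model, the plain-Python sketch below performs the same reshaping on a list of dicts. It is a simplification written for this explanation, not how Daft implements these operations, and the row contents are placeholders.

```python
def explode(rows: list[dict], column: str) -> list[dict]:
    """Emit one output row per element of the list stored in `column`."""
    return [{**row, column: item} for row in rows for item in row[column]]


def unnest(rows: list[dict], column: str) -> list[dict]:
    """Promote the keys of the dict stored in `column` to top-level columns."""
    out = []
    for row in rows:
        rest = {key: value for key, value in row.items() if key != column}
        out.append({**row[column], **rest})
    return out


# One image with two question/answer pairs (placeholder values).
rows = [
    {
        "image_decoded": "<image 0>",
        "texts": [
            {"user": "Q1", "assistant": "Answer: B", "source": "AI2D"},
            {"user": "Q2", "assistant": "Answer: D", "source": "AI2D"},
        ],
    }
]
flat = unnest(explode(rows, "texts"), "texts")
for row in flat:
    print(row)
```

Each question ends up on its own row, and the image is repeated alongside every question that refers to it.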

In [25]:
from daft.functions import unnest

# Explode the List of Dicts inside "texts" and unnest the resulting Struct into dedicated "user", "assistant", and "source" columns
df_text = df_img.explode(col("texts")).select(unnest(col("texts")), "image_decoded")

df_text.show(3)

user String,assistant String,source String,image_decoded Image[MIXED]
Question: What do respiration and combustion give out Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat Answer with the letter.,Answer: B,AI2D,
"Question: From the given food web, name any two herbivores? Choices: A. coyote, bobcat B. dingo, jack rabbit C. dingo, bobcat D. roadrunner&jack rabbit Answer with the letter.",Answer: D,AI2D,
"Question: In the given food web, which are the organism that only eaten roadrunner? Choices: A. dingo, jack rabbit B. coyote, bobcat C. dingo, bobcat D. snake, jack rabbit Answer with the letter.",Answer: B,AI2D,




### Parsing Text with Regular Expressions

While not strictly necessary, we can also leverage Daft's built-in string expressions to parse each text input into individual question, choices, and answer columns.  
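To see what the extraction is meant to capture, here is the same idea applied to a single string with Python's `re` module. This is a sketch under stated assumptions: the patterns use capture groups and a literal "Answer with the letter." suffix, which differs slightly from the Daft expressions in the next cell, and the newline layout of the sample string is assumed.

```python
import re

QUESTION_RE = re.compile(r"(?s)Question:\s*(.*?)\s*Choices:")
CHOICES_RE = re.compile(r"(?s)Choices:\s*(.*?)\s*Answer with the letter\.")
ANSWER_RE = re.compile(r"Answer:\s*([A-Z])")


def parse_example(user: str, assistant: str) -> dict:
    """Split one question/answer pair into question, choices, and answer letter."""
    question = QUESTION_RE.search(user)
    choices = CHOICES_RE.search(user)
    answer = ANSWER_RE.search(assistant)
    return {
        "question": question.group(1) if question else None,
        "choices_string": choices.group(1) if choices else None,
        "answer": answer.group(1) if answer else None,
    }


user = (
    "Question:\nWhat do respiration and combustion give out\n"
    "Choices:\nA. Oxygen\nB. Carbon dioxide\nC. Nitrogen\nD. Heat\n"
    "Answer with the letter."
)
parsed = parse_example(user, "Answer: B")
print(parsed)
```

Using capture groups means the extracted values carry no leading or trailing whitespace, which avoids surprises when comparing the answer against the model's output later.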

In [None]:
# Parse the "user" and "assistant" messages into question, choices, and answer columns
df_prepped = df_text.with_columns(
    {
        "question": col("user").regexp_extract(r"(?s)Question:\s*(.*?)\s*Choices:").regexp_replace("Choices:", "").regexp_replace("Question:", ""),
        "choices_string": col("user").regexp_extract(r"(?s)Choices:\s*(.*?)\s*Answer?\.?").regexp_replace("Choices:\n", "").regexp_replace("Answer", ""),
        # Remove the "Answer:" prefix and the whitespace after it so only the letter remains
        "answer": col("assistant").regexp_replace(r"Answer:\s*", ""),
    }
).collect()

df_prepped.show(3)

user String,assistant String,source String,image_decoded Image[MIXED],question String,choices_string String,answer String
Question: What do respiration and combustion give out Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat Answer with the letter.,Answer: B,AI2D,,What do respiration and combustion give out,A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat,B
"Question: From the given food web, name any two herbivores? Choices: A. coyote, bobcat B. dingo, jack rabbit C. dingo, bobcat D. roadrunner&jack rabbit Answer with the letter.",Answer: D,AI2D,,"From the given food web, name any two herbivores?","A. coyote, bobcat B. dingo, jack rabbit C. dingo, bobcat D. roadrunner&jack rabbit",D
"Question: In the given food web, which are the organism that only eaten roadrunner? Choices: A. dingo, jack rabbit B. coyote, bobcat C. dingo, bobcat D. snake, jack rabbit Answer with the letter.",Answer: B,AI2D,,"In the given food web, which are the organism that only eaten roadrunner?","A. dingo, jack rabbit B. coyote, bobcat C. dingo, bobcat D. snake, jack rabbit",B


## 5. Multimodal Structured Outputs with the `prompt` function

Now we will move on to scaling our OpenAI client calls with Daft's new `prompt` function. Using a similar syntax to OpenAI client calls adapted for dataframes, we can quickly scale our structured output requests across the dataset. 
 
For more info see the [API docs](https://docs.daft.ai/en/stable/api/functions/prompt/#daft.functions.prompt), [User Guide](https://docs.daft.ai/en/stable/ai-functions/prompt/), & [Usage Patterns](https://github.com/Eventual-Inc/daft-examples/tree/main/usage_patterns/prompt).

In [29]:
# First, we need to set the OpenAI provider with the api_key and custom base_url that we set earlier.
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

In [102]:
from daft import col
from daft.functions import format, prompt
import time

# Let's keep our initial request count low to avoid spending too much $$$
LIMIT = 100 

# We will also redefine our Pydantic Model to prioritize only the choice:
class ChoiceResponse(BaseModel):
    choice: str = Field(..., description="Provide the letter of the correct choice with no other text ie: F")
    

start = time.time()
df = df_prepped.with_column(
    "result",
    prompt(
        messages = [col("image_decoded"), col("user")],
        system_message = "Observe the attached image and respond to the multiple choice question with just the letter corresponding to the correct answer.",
        model = MODEL_ID,
        provider = "openai",
        use_chat_completions = True,
        return_format = ChoiceResponse,
    )
).limit(LIMIT).collect()
end = time.time()
print(f"Processed {df.count_rows()} rows in {end-start} seconds")

Processed 100 rows in 10.178888320922852 seconds


In [None]:
df.show()

user String,assistant String,source String,image_decoded Image[MIXED],question String,choices_string String,answer String,result Struct[choice: String]
"Question: Using the food web above, what would happen to the other organisms if the number of mouse were decreased? Choices: A. Rabbits will increase B. Grass will decrease C. Owls will decrease D. None of above Answer with the letter.",Answer: C,AI2D,,"Using the food web above, what would happen to the other organisms if the number of mouse were decreased?",A. Rabbits will increase B. Grass will decrease C. Owls will decrease D. None of above,C,"{choice: C, }"
Question: Which of the following terms applies to the fox? Choices: A. Decomposer B. Carnivore C. Herbivore D. Producer Answer with the letter.,Answer: B,AI2D,,Which of the following terms applies to the fox?,A. Decomposer B. Carnivore C. Herbivore D. Producer,B,"{choice: B, }"
"Question: Given the food web listed and your knowledge of science, name a predator in this food web. Choices: A. Acorns B. Mouse C. Owl D. Rabbit Answer with the letter.",Answer: C,AI2D,,"Given the food web listed and your knowledge of science, name a predator in this food web.",A. Acorns B. Mouse C. Owl D. Rabbit,C,"{choice: C, }"
Question: What is formed by the deposition of either the weathered remains of other rocks? Choices: A. Lava B. Rock C. Sedimentary Rock D. Igneous Rock Answer with the letter.,Answer: C,AI2D,,What is formed by the deposition of either the weathered remains of other rocks?,A. Lava B. Rock C. Sedimentary Rock D. Igneous Rock,C,"{choice: C, }"
Question: What is a predator of the roadrunner? Choices: A. kangaroo B. coyote C. dingo D. cactus Answer with the letter.,Answer: B,AI2D,,What is a predator of the roadrunner?,A. kangaroo B. coyote C. dingo D. cactus,B,"{choice: B, }"
"Question: If the Termites in the community below were destroyed, which population would be most directly affected? Choices: A. Dingoes B. Mice C. Northern brown bandicoot D. none of above Answer with the letter.",Answer: A,AI2D,,"If the Termites in the community below were destroyed, which population would be most directly affected?",A. Dingoes B. Mice C. Northern brown bandicoot D. none of above,A,"{choice: C, }"
Question: Which flower is the longest? Choices: A. 8 B. 9 C. 7 D. 5 Answer with the letter.,Answer: D,AI2D,,Which flower is the longest?,A. 8 B. 9 C. 7 D. 5,D,"{choice: D, }"
Question: desert grasses are known as Choices: A. consumer B. herbivores C. omnivores D. producer Answer with the letter.,Answer: D,AI2D,,desert grasses are known as,A. consumer B. herbivores C. omnivores D. producer,D,"{choice: D, }"


In [None]:
df = df.with_column("is_correct", col("result")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip())  # strip whitespace on both sides

pass_fail_rate = df.where(col("is_correct")).count_rows() / df.count_rows()
print(f"Pass/Fail Rate: {pass_fail_rate}")
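The rate computed above is the fraction of rows where the predicted letter equals the reference answer. The plain-Python sketch below shows the same calculation and why whitespace matters: a reference such as `" B"` never equals a prediction of `"B"` under exact string comparison. The sample values are made up for illustration.

```python
def accuracy(predictions: list[str], answers: list[str], strip: bool = True) -> float:
    """Fraction of predictions equal to the reference answer."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need two non-empty lists of equal length")
    clean = (lambda s: s.strip()) if strip else (lambda s: s)
    hits = sum(clean(p) == clean(a) for p, a in zip(predictions, answers))
    return hits / len(answers)


# Made-up values: the references keep the space left over from "Answer: B".
predictions = ["C", "B ", "D", "A"]
answers = [" C", " B", " D", " B"]
stripped = accuracy(predictions, answers)
exact = accuracy(predictions, answers, strip=False)
print(stripped, exact)
```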

## 6. Analysis

Let's investigate a few of the failures, using Daft's dataframe visualizations to inspect the results. You can click on any cell to preview its contents.

In [105]:
# Let's investigate some of the failures
df.where(~col("is_correct")).select("user", "image_decoded", "answer", "result").show(8)

user String,image_decoded Image[MIXED],answer String,result Struct[choice: String]
Question: What will happen if kangroo rats goes extinct? Choices: A. Cactus count will decrease B. Dessert plants growth will decresse C. Snake population will increase D. Snake population will decrease Answer with the letter.,,D,"{choice: C, }"
"Question: Please refer to the food web provided when answering, if the web does not answer question then please use your knowledge of biology to answer question. If there were to be a decrease in available algae, which animal would have the MOST decrease in available energy? A) Sand Shrimp B) Mud Crab C) Mussels D) Anchovy Choices: A. Mud Crab B. Sand Shrimp C. Anchovy D. Mussels Answer with the letter.",,B,"{choice: C, }"
Question: What is directly above the epidermis? Choices: A. Cortex B. Phloem C. Pericycle D. Root Hair Answer with the letter.,,D,"{choice: A, }"
"Question: From the above food chain diagram, which species eat insects Choices: A. mammals B. reptiles C. plants D. decomposer Answer with the letter.",,B,"{choice: A, }"
"Question: If the Termites in the community below were destroyed, which population would be most directly affected? Choices: A. Dingoes B. Mice C. Northern brown bandicoot D. none of above Answer with the letter.",,A,"{choice: C, }"
Question: How many different types of leaves were shown above in the diagram? Choices: A. 7 B. 6 C. 9 D. 8 Answer with the letter.,,A,"{choice: B, }"


Moving beyond simple accuracy, we can learn a lot more about how strong our model's image understanding is by comparing our results with and without the image input. 

In [111]:
# How do these results compare without images?
start = time.time()
df = df.with_column(
    "result_no_image",
    prompt(
        col("user"),
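        # NOTE: no image is sent in this call, yet the system message still mentions one;
        # the consolidated pipeline later in this notebook uses an image-free prompt here.
        system_message = "Observe the attached image and respond to the multiple choice question with just the letter corresponding to the correct answer.",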
        model = MODEL_ID,
        provider = "openai",
        use_chat_completions=True,
        return_format = ChoiceResponse,
    )
).with_column("is_correct_no_image", col("result_no_image")["choice"] == col("answer").lstrip().rstrip()).limit(LIMIT).collect()
end = time.time()

print(f"Processed {df.count_rows()} rows in {end-start} seconds")

Processed 100 rows in 10.338624954223633 seconds


In [113]:
pass_fail_rate_no_image = df.where(col("is_correct_no_image")).count_rows() / df.count_rows()
print(f"Pass/Fail Rate: \n With Image: {pass_fail_rate} \n Without Image: {pass_fail_rate_no_image} ")

Pass/Fail Rate: 
 With Image: 0.95 
 Without Image: 0.84 


Now that we have pass/fail rates for the runs with and without images, how does the accuracy compare?

Given your results, is it clear whether the image helped or hurt the model's performance?

Let's dig into the details to better understand the model's performance.

In [115]:
# Let's assign an id to each row and investigate the results.
df = df.with_column("id", daft.functions.monotonically_increasing_id())
df.select("id", "user", "image_decoded", "answer", "result", "result_no_image", "is_correct", "is_correct_no_image").show()

id UInt64,user String,image_decoded Image[MIXED],answer String,result Struct[choice: String],result_no_image Struct[choice: String],is_correct Bool,is_correct_no_image Bool
0,Question: Which of these is the lowest in the food chain in this diagram? Choices: A. crocodiles B. mice and rats C. spear grass D. termites Answer with the letter.,,C,"{choice: C, }","{choice: C, }",True,True
1,Question: What do mice and rats feed on? Choices: A. spear grass B. eucalyptus C. termites D. none of the above Answer with the letter.,,A,"{choice: A, }","{choice: C, }",True,False
2,Question: Name a producer. Choices: A. Rainforest plant B. mammal C. insect D. amphibian Answer with the letter.,,A,"{choice: A, }","{choice: A, }",True,True
3,Question: Which is the largest planet? Choices: A. Uranus B. Jupiter C. Mars D. Venus Answer with the letter.,,B,"{choice: B, }","{choice: B, }",True,True
4,"Question: Based on the food web, a decrease in the squirrel population will result in a decrease in the available energy to the Choices: A. Snake B. Deer C. Owl D. Rabbit Answer with the letter.",,C,"{choice: C, }","{choice: C, }",True,True
5,"Question: If there were a sudden decrease in the number of mice and rats, which would be most affected? Choices: A. Gould's goanna B. Northern brown bandicoot C. Termites D. none of above Answer with the letter.",,A,"{choice: A, }","{choice: A, }",True,True
6,Question: Which type of rock is formed by the weathered remains of rocks? Choices: A. Sedimentary Rocks B. Igneous Rocks C. Metamorphic Rocks D. Prehistoric Rocks Answer with the letter.,,A,"{choice: A, }","{choice: A, }",True,True
7,Question: Which solar body is portrayed in this diagram? Choices: A. Moon B. Black Hole C. Sun D. Nebula Answer with the letter.,,C,"{choice: C, }","{choice: C, }",True,True


In [None]:
# Where did the model have a different answer with and without an image input? 
df_img_diff = df.where(col("result")["choice"] != col("result_no_image")["choice"]).collect()
df_img_diff.count_rows() 

13

In [None]:
# What were the differences in correctness? 
df_correct_diff = df.where(col("is_correct") != col("is_correct_no_image"))

In [131]:
# What is the combination of both of these? 
from daft.functions import when

df_classified = df.with_column(
    "classification",
    when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
    .when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
    .when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
    .otherwise("Both Incorrect")
)

# View the counts for each quadrant
df_classified.groupby("classification").count().select("classification", col("id").alias("count")).show()


classification String,count UInt64
Image Hurt,1
Both Incorrect,5
Image Helped,11
Both Correct,83
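
The four quadrants relate directly to the two accuracy numbers: with-image accuracy is the share of "Both Correct" plus "Image Helped" rows, and without-image accuracy is the share of "Both Correct" plus "Image Hurt" rows. Here is a minimal plain-Python sketch of the same classification logic. The `classify` helper and the sample rows are made up for illustration and are not part of the notebook or its results.

```python
from collections import Counter

def classify(is_correct: bool, is_correct_no_image: bool) -> str:
    """Mirror the when/otherwise chain for a single row."""
    if is_correct and is_correct_no_image:
        return "Both Correct"
    if is_correct:
        return "Image Helped"
    if is_correct_no_image:
        return "Image Hurt"
    return "Both Incorrect"

# Hypothetical (is_correct, is_correct_no_image) pairs, not real results
rows = [(True, True), (True, False), (True, False), (False, True), (False, False)]
counts = Counter(classify(with_img, no_img) for with_img, no_img in rows)

total = len(rows)
# Accuracy with the image: every row the image run got right
acc_with_image = (counts["Both Correct"] + counts["Image Helped"]) / total
# Accuracy without the image: every row the text-only run got right
acc_no_image = (counts["Both Correct"] + counts["Image Hurt"]) / total

print(dict(counts))
print(acc_with_image, acc_no_image)
```

Checking that the quadrant counts add up to the reported accuracies is a quick sanity test that both numbers come from the same run.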


Now that we have a better idea of the distribution of our results, let's investigate specific cases where the image helped or hurt the model's performance.

In [1]:
# Inspect specific cases where the image helped
df_classified.where(col("classification") == "Image Helped").select("user", "image_decoded", "answer", col("result")["choice"], col("result_no_image")["choice"]).show(5)



In [None]:
# Inspect specific cases where including the image hurt the model's performance
df_classified.where(col("classification") == "Image Hurt").select("user", "image_decoded", "answer", "result", "result_no_image").show(5)

user String,image_decoded Image[MIXED],answer String,result Struct[choice: String],result_no_image Struct[choice: String]
"Question: Please refer to the food web provided when answering, if the web does not answer question then please use your knowledge of biology to answer question. If there were to be a decrease in available algae, which animal would have the MOST decrease in available energy? A) Sand Shrimp B) Mud Crab C) Mussels D) Anchovy Choices: A. Mud Crab B. Sand Shrimp C. Anchovy D. Mussels Answer with the letter.",,B,"{choice: C, }","{choice: C, }"
Question: What will happen if kangroo rats goes extinct? Choices: A. Cactus count will decrease B. Dessert plants growth will decresse C. Snake population will increase D. Snake population will decrease Answer with the letter.,,D,"{choice: C, }","{choice: C, }"
Question: How many different types of leaves were shown above in the diagram? Choices: A. 7 B. 6 C. 9 D. 8 Answer with the letter.,,A,"{choice: B, }","{choice: B, }"
"Question: If the Termites in the community below were destroyed, which population would be most directly affected? Choices: A. Dingoes B. Mice C. Northern brown bandicoot D. none of above Answer with the letter.",,A,"{choice: C, }","{choice: B, }"
Question: What is directly above the epidermis? Choices: A. Cortex B. Phloem C. Pericycle D. Root Hair Answer with the letter.,,D,"{choice: A, }","{choice: A, }"


## LLM-as-a-Judge Evaluation 

Now that we've performed our image input ablation study, let's leverage our LLM to help us understand each failure. With LLMs, we can understand WHY our model failed in cases where both answers were incorrect or the image ended up hurting results. This is a powerful technique to understand the root cause of why our model failed and can be used to improve the richness of our evaluations.

An LLM-as-a-judge can also be integrated as a feedback loop to improve prompts or other hyperparameters over time. When this system is combined with full experiment tracking, you can use it to automatically improve your LLM inference workloads.

In [138]:

# Have an LLM judge run a post-mortem
judge_template = format(
    "Referencing the attached image, hypothesize why the model under evaluation chose <choice_with_image>{}</choice_with_image> and <choice_no_image>{}</choice_no_image> to the <question>{}</question> where the correct answer is supposed to be <correct_answer>{}</correct_answer>.",
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("user"),
    col("answer")
)

judge_system_prompt = """
You are an impartial judge reviewing the results of a visual question and answer benchmark of a vision language model. 
Focusing on the discrepancy between the model's answer with and without the image contrasted against the correct answer, inspect the attached image and provide high-signal feedback why the model chose the answer it did. 
Do not propose improvements.  
Your hypothesis should be grounded in evaluating image understanding, improving the model's ability to reason about the image, and not the model's ability to reason about the text.
Specifically, your feedback should only improve accuracy scores when the image is attached, and should not improve scores when the image is not attached.
This is a safe space for you to express your thoughts and insights without fear of spec-gaming. 
"""

class JudgeResponse(BaseModel):
    reasoning: str = Field(..., description="Provide the reasoning for why the model chose the answer it did.")
    hypothesis: str = Field(..., description="Provide the hypothesis for why the model's choices diverged from the correct answer.")
    attribution: str = Field(..., description="Concisely attribute a specific aspect of the image or question that may have led to the model's choices diverging from the correct answer.")
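
To see what the judge actually receives, it can help to render the template for one row by hand. The sketch below applies Python's built-in `str.format` to the same template text with hypothetical values. In the notebook the substitution is performed per row by Daft's `format` expression, and the image is attached as a separate message part.

```python
TEMPLATE = (
    "Referencing the attached image, hypothesize why the model under evaluation chose "
    "<choice_with_image>{}</choice_with_image> and <choice_no_image>{}</choice_no_image> "
    "to the <question>{}</question> where the correct answer is supposed to be "
    "<correct_answer>{}</correct_answer>."
)

# Hypothetical row values, for illustration only
row = {
    "choice_with_image": "C",
    "choice_no_image": "B",
    "question": "Question: Name a producer. Choices: A. plant B. mammal C. insect Answer with the letter.",
    "answer": "A",
}

# Positional placeholders are filled in the same order as the column arguments
rendered = TEMPLATE.format(
    row["choice_with_image"],
    row["choice_no_image"],
    row["question"],
    row["answer"],
)
print(rendered)
```

Wrapping each value in its own tag keeps the two model choices, the question, and the reference answer easy for the judge to tell apart.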


In [143]:
from daft import lit
# Filter for the Failures
df_failures = df_classified.where((col("classification") == "Image Hurt") | (col("classification") == "Both Incorrect"))

# Run the Evaluation
df_judge = (
    df_failures
    .with_column(
        "judge",
        prompt(
            messages = [col("image_decoded"), judge_template],
            system_message = judge_system_prompt,
            model = MODEL_ID,
            use_chat_completions = True,
            return_format = JudgeResponse,
        )
    )
).collect()

In [None]:
# Let's inspect the judge's reasoning:
df_judge.select("user", "result", "answer", "image_decoded", unnest(col("judge"))).show()

user String,result Struct[choice: String],image_decoded Image[MIXED],reasoning String,hypothesis String,attribution String
Question: How many different types of leaves were shown above in the diagram? Choices: A. 7 B. 6 C. 9 D. 8 Answer with the letter.,"{choice: B, }",,"The diagram shows seven distinct leaf shapes labeled a through g. The model incorrectly chose 'B' (6) instead of the correct answer 'A' (7), suggesting it either overlooked one of the leaves or misclassified one of the shapes as non-leaf. This indicates a failure in accurately enumerating the visual elements in the image.","The model failed to correctly identify or count one of the leaf shapes, leading to an undercount of the total number of leaves.","The model miscounted the number of leaves in the diagram, possibly confusing the shape labeled 'c' with a leaf or misidentifying one of the leaves as a duplicate."
"Question: From the above food chain diagram, which species eat insects Choices: A. mammals B. reptiles C. plants D. decomposer Answer with the letter.","{choice: A, }",,"The diagram shows arrows pointing from 'Insects' to 'Mammals', 'Birds', and 'Reptiles', indicating that these species consume insects. The model chose 'A. mammals' because it correctly identified mammals as a consumer of insects, but failed to select 'B. reptiles' as the correct answer, which is also shown in the diagram. The model may have been misled by the prominence or layout of 'Mammals' in the diagram, or it may have misinterpreted the diagram's directional flow, leading to an incomplete or incorrect selection.","The model likely misread the directional arrows in the food web, confusing which species consume insects versus which are consumed by insects.","The arrows in the diagram indicate the flow of energy, and the model may have misinterpreted the directionality or failed to correctly map the predator-prey relationships."
Question: What will happen if kangroo rats goes extinct? Choices: A. Cactus count will decrease B. Dessert plants growth will decresse C. Snake population will increase D. Snake population will decrease Answer with the letter.,"{choice: C, }",,"The model likely chose C (Snake population will increase) because it misinterpreted the predator-prey relationship shown in the food web. The image clearly depicts an arrow pointing from the kangaroo rat to the snake, indicating that snakes prey on kangaroo rats. If kangaroo rats go extinct, the snake population should decrease due to loss of a food source, not increase. The model may have incorrectly assumed the arrow represents a mutualistic or non-consumptive relationship, or it may have misread the directionality of the arrow. Without the image, the model might have guessed based on general knowledge, but with the image, it should have been able to directly infer the causal relationship from the diagram. The error stems from a failure to correctly interpret the directional arrows in the food web as representing predation (snake → kangaroo rat means snake eats kangaroo rat), leading to the incorrect conclusion that snake populations would increase.","The model misread the food web diagram, interpreting the arrow from kangaroo rat to snake as indicating a beneficial or non-harmful relationship, rather than predation, leading it to incorrectly predict that snake populations would increase if kangaroo rats went extinct.","The model's answer is incorrect because it fails to correctly interpret the directional arrows in the food web diagram as representing predation, leading to a flawed causal inference."
"Question: If the Termites in the community below were destroyed, which population would be most directly affected? Choices: A. Dingoes B. Mice C. Northern brown bandicoot D. none of above Answer with the letter.","{choice: C, }",,"The model likely chose C (Northern brown bandicoot) because it visually perceived a direct arrow pointing from 'Termites' to 'Northern brown bandicoot' in the diagram. The model may have misinterpreted the diagram’s structure, assuming that any direct arrow implies a stronger or more direct dependency than other arrows. In reality, the diagram shows that termites are preyed upon by multiple species including dingoes, crocodiles, frill-necked lizards, gould’s goanna, blue-winged kookaburra, and mice and rats. The Northern brown bandicoot is also preyed upon by many of these same predators. However, the diagram does not indicate that termites are the *primary* or *most direct* food source for the Northern brown bandicoot. The model’s choice reflects a failure to understand that the diagram represents predator-prey relationships, and that the *most directly affected* population would be the one that consumes termites as a primary food source — which is not explicitly shown for the Northern brown bandicoot. The model’s choice of B (Mice) without the image suggests it may have been relying on text-based reasoning or a flawed understanding of the diagram’s structure. The correct answer, A (Dingoes), is not chosen because the model failed to recognize that dingoes are at the top of the food chain and consume multiple prey species, including termites, making them indirectly affected — but the question asks for the population *most directly affected*, which would be the predators that rely heavily on termites. 
The model’s answer is incorrect because it misreads the diagram’s structure and fails to understand the hierarchy of dependencies.","The model chose C because it misread the diagram, interpreting any direct arrow from termites to a predator as indicating the most direct dependency, without considering the broader food web. The model chose B without the image because it likely misinterpreted the diagram’s structure or relied on text-based reasoning, failing to recognize that mice are also preyed upon by multiple predators and are not the most directly affected population when termites are destroyed.","The model’s answer is incorrect because it misreads the diagram’s structure and fails to understand the hierarchy of dependencies. The correct answer is A (Dingoes) because they are at the top of the food chain and consume multiple prey species, including termites, making them indirectly affected — but the question asks for the population *most directly affected*, which would be the predators that rely heavily on termites. The model’s answer is incorrect because it misreads the diagram’s structure and fails to understand the hierarchy of dependencies."
Question: What is directly above the epidermis? Choices: A. Cortex B. Phloem C. Pericycle D. Root Hair Answer with the letter.,"{choice: A, }",,"The diagram shows the root hair extending from the epidermis, which is directly above the epidermis. The model likely misread the diagram, interpreting the cortex as being directly above the epidermis instead of recognizing the root hair's position. This indicates a failure to correctly interpret the spatial arrangement shown in the image.","The model misinterpreted the spatial relationship between the epidermis and the structures labeled in the diagram, possibly confusing the position of the root hair with the cortical layer.","The image clearly labels the root hair extending from the epidermis, indicating its position directly above it."
"Question: Please refer to the food web provided when answering, if the web does not answer question then please use your knowledge of biology to answer question. If there were to be a decrease in available algae, which animal would have the MOST decrease in available energy? A) Sand Shrimp B) Mud Crab C) Mussels D) Anchovy Choices: A. Mud Crab B. Sand Shrimp C. Anchovy D. Mussels Answer with the letter.","{choice: C, }",,"The model likely chose <choice_with_image>C</choice_with_image> and <choice_no_image>C</choice_no_image> because it misinterpreted the food web's structure, specifically the direct dependency of Anchovy on Algae. While Anchovy does consume Algae (as shown by the arrow from Algae to Anchovy), it is not the most dependent organism in the web. Sand Shrimp, which also feeds directly on Algae, is a more direct and primary consumer than Anchovy, which is a secondary consumer (also feeding on Mysid shrimp). The model may have incorrectly assumed that since Anchovy is a larger or more prominent fish in the diagram, it would be the most affected by a reduction in algae, ignoring the fact that Sand Shrimp is a primary consumer with no other food source shown in the diagram. The correct answer is B (Sand Shrimp) because it is the organism that directly relies on algae for energy and has no alternative food source shown in the food web, making it the most vulnerable to a decrease in algae availability.","The model misidentified Anchovy as the most affected organism due to its position in the food web or its size, rather than recognizing that Sand Shrimp is a direct primary consumer with no alternative food source shown.","The model's answer is incorrect because it failed to correctly interpret the food web's energy flow, particularly the direct dependency of Sand Shrimp on algae, which makes it the most affected by a decrease in algae."


Keep in mind that a large and varied test set is critical to ensuring that your vision language model isn't gaming the evaluation by simply memorizing the answers. It is best practice to keep training and evaluation data in separate datasets, commonly called a train/test split.
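
One common way to keep such a split stable across runs is to assign each example to a partition by hashing a stable identifier. This is a minimal sketch that assumes each row has a unique id. The `assign_split` helper and the 80/20 ratio are illustrative choices, not something this notebook prescribes.

```python
import hashlib

def assign_split(example_id, test_fraction: float = 0.2) -> str:
    """Deterministically map an id to 'train' or 'test' using a hash bucket."""
    digest = hashlib.sha256(str(example_id).encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a roughly uniform value in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"

# Hypothetical ids 0..999; the same id always lands in the same partition
splits = {i: assign_split(i) for i in range(1000)}
test_share = sum(s == "test" for s in splits.values()) / len(splits)
print(f"test share: {test_share:.2f}")
```

Because the assignment depends only on the id, adding new rows later never moves existing rows between partitions.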

## Putting Everything Together

Now that we have walked through implementing this image understanding evaluation interactively, let's combine all of our code into a single pipeline so we can take full advantage of lazy evaluation and leave room for future extension and reuse.

In [None]:
from daft import col  # needed by JUDGE_TEMPLATE below
from daft.functions import format  # note: shadows Python's built-in format()
from pydantic import BaseModel, Field
import os

OPENAI_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENAI_BASE_URL = "https://openrouter.ai/api/v1/"
MODEL_ID = "qwen/qwen3-vl-8b-instruct"        # vLLM & OpenRouter
DATASET_URI = "HuggingFaceM4/the_cauldron"
SUBSET = "ai2d"
LIMIT = 100 
BASE_PROMPT = "Respond to the multiple choice question with just the letter corresponding to the correct answer."
IMAGE_PROMPT = "Reference the attached image"
KWARGS = {
    "temperature": 0.1,
}

class ChoiceResponse(BaseModel):
    choice: str = Field(..., description="Provide the letter of the correct choice with no other text ie: F")

JUDGE_TEMPLATE = format(
    "Referencing the attached image, hypothesize why the model under evaluation chose <choice_with_image>{}</choice_with_image> and <choice_no_image>{}</choice_no_image> to the <question>{}</question> where the correct answer is supposed to be <correct_answer>{}</correct_answer>.",
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("user"),
    col("answer")
)

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing the results of a visual question and answer benchmark of a vision language model. 
Focusing on the discrepancy between the model's answer with and without the image contrasted against the correct answer, inspect the attached image and provide high-signal feedback why the model chose the answer it did. 
Do not propose improvements.  
Your hypothesis should be grounded in evaluating image understanding, improving the model's ability to reason about the image, and not the model's ability to reason about the text.
Specifically, your feedback should only improve accuracy scores when the image is attached, and should not improve scores when the image is not attached.
This is a safe space for you to express your thoughts and insights without fear of spec-gaming. 
"""

class JudgeResponse(BaseModel):
    reasoning: str = Field(..., description="Provide the reasoning for why the model chose the answer it did.")
    hypothesis: str = Field(..., description="Provide the hypothesis for why the model's choices diverged from the correct answer.")
    attribution: str = Field(..., description="Attribute a specific aspect of the image or question that may have led to the model's choices diverging from the correct answer.")

We can break our pipeline into separate lazy stages for reusability.

In [None]:
import daft
from daft import col, lit
from daft.functions import prompt, when, format, monotonically_increasing_id, unnest

# Read the Dataset (build the path from the constants defined above)
df_raw = daft.read_huggingface(f"{DATASET_URI}/{SUBSET}")

# Preprocess the dataset
df_prep = (
    df_raw
    # Prepare Images
    .explode("images")
    .with_column("image_decoded", col("images").struct.get("bytes").decode_image())
    # Prepare Text
    .explode("texts")
    .select(unnest(col("texts")), "image_decoded")
    # Extract Answer Letter
    .with_column("answer", col("assistant").regexp_replace("Answer: ", ""))
)


df_run = (
    df_prep
    # Run Structured Output on the images + text
    .with_column(
        "result",
        prompt(
            messages = [col("image_decoded"), col("user")],
            system_message = IMAGE_PROMPT + " " + BASE_PROMPT,
            model = MODEL_ID,
            use_chat_completions = True,
            return_format = ChoiceResponse,
            **KWARGS,
        ) 
    )
    # Evaluate Correctness
    .with_column(
        "correct",
        col("result")["choice"] == col("answer"),
    )
    # Run Structured Output on the text only 
    .with_column(
        "result_no_image",
        prompt(
            messages = col("user"),
            system_message = BASE_PROMPT,
            model = MODEL_ID,
            provider = "openai",
            use_chat_completions = True,
            return_format = ChoiceResponse,
        )
    )
    # Evaluate Correctness
    .with_column(
        "correct_no_image",
        col("result_no_image")["choice"] == col("answer"),
    )
)

# Analyze the results
df_analysis = (
    df_run
    .with_column("id", monotonically_increasing_id())
    .with_column(
        "classification",
        # Column names must match the "correct" / "correct_no_image" columns created in df_run
        when((col("correct") == True) & (col("correct_no_image") == True), "Both Correct")
        .when((col("correct") == True) & (col("correct_no_image") == False), "Image Helped")
        .when((col("correct") == False) & (col("correct_no_image") == True), "Image Hurt")
        .otherwise("Both Incorrect")
    )
)


# Grab the rows where the image hurt or both incorrect
# Parenthesize each comparison: `|` binds more tightly than `==` in Python
df_failures = df_analysis.where((col("classification") == lit("Image Hurt")) | (col("classification") == lit("Both Incorrect")))

# Run LLM-as-a-Judge
df_judge = (
    df_failures
    .with_column(
        "judge",
        prompt(
            messages = [col("image_decoded"), JUDGE_TEMPLATE],
            system_message = JUDGE_SYSTEM_PROMPT,
            model = MODEL_ID,
            provider = "openai",
            use_chat_completions = True,
            return_format = JudgeResponse,
        )
    )
)

## Executing the Pipeline

Since our pipeline is lazy, we can break down our execution as needed. In this scenario, we will want to materialize `df_run` and execute our `df_analysis` and `df_judge` in a separate step to minimize recomputation.
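
The recomputation concern can be illustrated with a plain-Python analogy. This is not Daft code and makes no claim about Daft's internals. `Lazy` and `expensive` are hypothetical stand-ins. The sketch shows a general pattern in lazy pipelines: a downstream stage built on an unmaterialized upstream stage keeps that upstream recipe, so materializing the upstream afterwards only helps if the downstream stage is rebuilt on the materialized result.

```python
calls = {"expensive": 0}

def expensive(x):
    """Stand-in for a costly step such as a model call."""
    calls["expensive"] += 1
    return x * 2

class Lazy:
    """Tiny lazy pipeline: stores a recipe and computes only on collect()."""
    def __init__(self, compute):
        self._compute = compute

    def map(self, fn):
        # The new stage remembers *this* stage's recipe
        return Lazy(lambda: [fn(v) for v in self._compute()])

    def collect(self):
        data = self._compute()
        return Lazy(lambda: data)  # materialized: later stages reuse `data`

source = Lazy(lambda: [1, 2, 3])
run = source.map(expensive)          # lazy stage
analysis = run.map(lambda v: v + 1)  # built on the lazy run

run = run.collect()                  # 3 expensive calls
analysis.collect()                   # 3 more: still uses the old recipe
calls_before_rebuild = calls["expensive"]

analysis = run.map(lambda v: v + 1)  # rebuilt on the materialized run
result = analysis.collect()          # no additional expensive calls
print(calls_before_rebuild, calls["expensive"])
```

If your engine behaves the same way, rebuilding downstream stages from the collected DataFrame (for example, by wrapping each stage in a function that takes a DataFrame) avoids paying for the model calls twice.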

In [None]:
# Execute the Run Step
df_run = df_run.limit(LIMIT).collect()
df_run.show()

In [None]:
# Execute the Analysis Step
df_analysis = df_analysis.limit(LIMIT).collect()

In [None]:
# Execute the Judge Step
df_judge = df_judge.limit(LIMIT).collect()

### Showing Results 

In [None]:
# Show Counts of the quadrant 
df_counts = df_analysis.groupby("classification").count().select("classification", col("id").alias("count"))
df_counts.show()  # show() only displays; keep the DataFrame in df_counts

In [None]:
# Show which ids are in each quadrant 
df_ids = df_analysis.groupby("classification").agg_list("id").select("classification", col("id").alias("ids"))
df_ids.show()  # show() only displays; keep the DataFrame in df_ids

In [None]:
df_judge.select("user", "result", "answer", "image_decoded", unnest(col("judge"))).show()

### Persisting Results

Finally we can persist our results to a table for future analysis. 

In [None]:
df_run.write_parquet(".data/the_cauldron_image_ablation_study.parquet")

In [None]:
df_analysis.write_parquet(".data/the_cauldron_image_ablation_study_analysis.parquet")

In [None]:
df_judge.write_parquet(".data/the_cauldron_image_ablation_study_judge.parquet")

## Conclusion

In this notebook we explored how to evaluate Qwen3-VL's image understanding using a subset of HuggingFace's *The Cauldron* dataset. The AI2D subset we used is just one of 50 vision-language datasets in the collection, totaling millions of rows, that can be used for evaluating or training vision language models. You can also use this pipeline to evaluate model performance across sampling parameters or model variants. Please note that not all Qwen3 models support image inputs, and datasets outside of *The Cauldron* would require different preprocessing stages.

A natural next step would be to parallelize this pipeline across multiple datasets and multiple GPUs. At that scale, relying on a managed provider like OpenRouter isn't feasible. With [Daft Cloud](https://daft.ai/cloud) you can run Qwen3-VL on as much data as you want with no rate limits or GPU configuration headaches.

### Next Steps
