# Evaluating Image Understanding at Scale with Structured Outputs

*An end-to-end example of **Multimodal Structured Outputs** with Daft and Qwen3-VL-8B*. 

## Introduction

In this notebook, we'll evaluate [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)'s image understanding using a multiple choice subset of HuggingFace's [The Cauldron dataset](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), a massive collection of 50 vision-language datasets. 

Our pipeline will:

1. Run structured output inference on image+text prompts
2. Conduct an **ablation study** (with vs. without images) to isolate image understanding
3. Classify results into diagnostic quadrants
4. Use **VLM-as-a-Judge** to explain failures

The steps we'll take in this notebook are a simplified version of the [production-ready pipeline](https://github.com/Eventual-Inc/daft-examples/blob/main/use_cases/image_understanding_eval/eval_image_understanding.py) used to evaluate Qwen3-VL-4B on 20k rows. Check out the [blog post](https://www.daft.ai/blog/multimodal-structured-outputs-evaluating-vlm-image-understanding-at-scale) for the full results and implementation. 

### Table of Contents

1. [Setup](#1-setup)
2. [Data Loading](#2-data-loading)
3. [Preprocessing](#3-preprocessing)
4. [Structured Outputs with `prompt`](#4-structured-outputs-with-prompt)
5. [Ablation Study](#5-ablation-study)
6. [LLM-as-a-Judge](#6-llm-as-a-judge)
7. [Scale with Daft Cloud](#7-scale-with-daft-cloud)
8. [Conclusion](#8-conclusion)

## Notebook vs. Production Pipeline (How this maps)

This notebook is the **interactive companion** to the production evaluation pipeline in [`use_cases/image_understanding_eval/eval_image_understanding.py`](../use_cases/image_understanding_eval/eval_image_understanding.py) and the methodology described in the blog post: [Multimodal Structured Outputs: Evaluating VLM Image Understanding at Scale](https://www.daft.ai/blog/multimodal-structured-outputs-evaluating-vlm-image-understanding-at-scale).

The notebook keeps `LIMIT` small so you can inspect examples, but the stages are the same:

| Notebook section | Pipeline function | Purpose |
|---|---|---|
| Preprocessing | `preprocess()` | Extract `answer` from Cauldron text format and track config |
| Structured Outputs (with image) | `run_inference(with_image=True)` | Predict multiple-choice letter with the image attached |
| Ablation (no image) | `run_inference(with_image=False)` | Predict the same question *without* the image |
| Quadrant classification | `classify_quadrants()` | Bucket behavior into Both Correct / Image Helped / Image Hurt / Both Incorrect |
| LLM-as-a-Judge | `run_judge()` | Diagnose failure modes on the "Image Hurt" + "Both Incorrect" subsets |



## 1. Setup


In [None]:
%pip install -q daft openai numpy pillow python-dotenv ipykernel ipywidgets pydantic

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

# Configuration
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
LIMIT = 50  # Keep low for interactive demo

# HuggingFace Inference Provider (hosted Qwen3-VL endpoints)
OPENAI_API_KEY = os.getenv("HF_TOKEN")
OPENAI_BASE_URL = "https://router.huggingface.co/v1"

In [None]:
import daft

# Set the OpenAI-compatible provider
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

## 2. Data Loading

[The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) is a massive collection of 50 vision-language datasets spanning:
- Visual question answering
- OCR & document understanding
- Chart/figure understanding
- Reasoning & math
- And more...

We'll start with the **AI2D** subset: science diagrams with multiple-choice questions.


In [14]:
df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d").limit(LIMIT).collect()
df_raw.show(3)

"images List[Struct[bytes: Binary, path: String]]","texts List[Struct[user: String, assistant: String, source: String]]"
"[{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }]","[{user: Question: What do respiration and combustion give out Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat Answer with the letter., assistant: Answer: B, source: AI2D, }]"
"[{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }]","[{user: Question: From the given food web, name any two herbivores? Choices: A. coyote, bobcat B. dingo, jack rabbit C. dingo, bobcat D. roadrunner&jack rabbit Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: In the given food web, which are the organism that only eaten roadrunner? Choices: A. dingo, jack rabbit B. coyote, bobcat C. dingo, bobcat D. snake, jack rabbit Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Name a herbivore from the given food web? Choices: A. cactus B. kangaroo rat C. snake D. bobcat Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Name a producer from the given food web? Choices: A. bobcat B. snake C. road runner D. barrel cactus Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: Name an omnivore from the given food web? Choices: A. dingo B. bobcat C. cactus D. kangaroo Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: What is a predator of the roadrunner? Choices: A. kangaroo B. coyote C. dingo D. cactus Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: What will happen if kangroo rats goes extinct? Choices: A. Cactus count will decrease B. Dessert plants growth will decresse C. Snake population will increase D. Snake population will decrease Answer with the letter., assistant: Answer: D, source: AI2D, }, {user: Question: What would be most affected if the cactus all died? Choices: A. coyote B. dingo C. kangaroo rat D. snake Answer with the letter., assistant: Answer: C, source: AI2D, }, {user: Question: Which among the below is a producer in the food chain diagram shown? Choices: A. Kangaroo rat B. Roadrunner C. Dessert grass D. Snake Answer with the letter., assistant: Answer: C, source: AI2D, }, {user: Question: Which is a producer? Choices: A. Coyote B. Desert Grass C. 
Kangaroo D. Dingo Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: Who would suffer without kangaroo rats? Choices: A. Desert Grass B. Snake C. Cactus D. Roadrunner Answer with the letter., assistant: Answer: B, source: AI2D, }, {user: Question: desert grasses are known as Choices: A. consumer B. herbivores C. omnivores D. producer Answer with the letter., assistant: Answer: D, source: AI2D, }]"
"[{bytes: b""\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD""..., path: None, }]","[{user: Question: Anatomy One of a series of long curved bones occurring in 12 pairs in humans is called. Choices: A. diaphram B. lung C. none D. ribs Answer with the letter., assistant: Answer: D, source: AI2D, }]"


## 3. Preprocessing

We need to:
1. Decode images into Daft's Image type
2. Extract the question, choices, and correct answer from the text


In [None]:
from daft import col
from daft.functions import unnest

# Decode images
df_img = df_raw.explode(col("images"))
df_img = df_img.with_column("image", col("images")["bytes"].decode_image())

# Extract text fields (user question, assistant answer)
df_text = df_img.explode(col("texts")).select(unnest(col("texts")), "image")

# Parse the answer letter from "Answer: C" format
df_prep = df_text.with_column(
    "answer", 
    col("assistant").regexp_replace("Answer: ", "").lstrip().rstrip()
).collect()

df_prep.show(3)
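
The cell above only extracts the answer letter. As a minimal sketch of how the question and the lettered choices could also be pulled out of the `user` string, here is a plain-Python parser. It assumes every prompt follows the `Question: ... Choices: A. ... Answer with the letter.` layout seen in the AI2D samples; `parse_example` is a hypothetical helper, not part of Daft or the production pipeline.

```python
import re

# Assumed layout, based on the AI2D samples shown in the data loading output:
#   "Question: <text> Choices: A. <text> B. <text> ... Answer with the letter."
PROMPT_RE = re.compile(
    r"Question:\s*(?P<question>.*?)\s*Choices:\s*(?P<choices>.*?)\s*Answer with the letter\.",
    re.DOTALL,
)
# A choice starts with "<LETTER>. " and runs until the next "<LETTER>. " or the end.
# Limitation: choice text that itself contains " X. " would be split incorrectly.
CHOICE_RE = re.compile(r"([A-Z])\.\s+(.*?)(?=\s+[A-Z]\.\s+|$)", re.DOTALL)


def parse_example(user: str, assistant: str) -> dict:
    """Hypothetical helper: split one prompt into question, choices, and answer."""
    match = PROMPT_RE.search(user)
    if match is None:
        raise ValueError("prompt does not follow the assumed AI2D layout")
    choices = {
        letter: text.strip()
        for letter, text in CHOICE_RE.findall(match["choices"])
    }
    # Same idea as the regexp_replace above: drop the "Answer:" prefix and trim.
    answer = assistant.replace("Answer:", "", 1).strip()
    return {"question": match["question"], "choices": choices, "answer": answer}


example = parse_example(
    "Question: What do respiration and combustion give out "
    "Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat "
    "Answer with the letter.",
    "Answer: B",
)
print(example)
```

Having the choices as a dictionary makes it easy to check that a predicted letter is one of the options actually offered.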

## 4. Structured Outputs with `prompt`

Daft's `prompt` function scales OpenAI-compatible calls across dataframes. We'll use a Pydantic model to enforce structured output.

For more info: [API docs](https://docs.daft.ai/en/stable/api/functions/prompt/) | [User Guide](https://docs.daft.ai/en/stable/ai-functions/prompt/)


In [None]:
from daft.functions import prompt
from pydantic import BaseModel, Field
import time

# Deterministic inference params (matches the production pipeline defaults)
PARAMS = {"temperature": 0.0, "max_tokens": 2}

class ChoiceResponse(BaseModel):
    """Structured output for multiple choice answers."""
    choice: str = Field(
        ..., description="The letter of the correct choice (e.g., A, B, C, D)"
    )

start = time.time()
df_results = df_prep.with_column(
    "result",
    prompt(
        messages=[col("image"), col("user")],
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
        **PARAMS,
    )
).limit(LIMIT).collect()
elapsed = time.time() - start

print(f"Processed {df_results.count_rows()} rows in {elapsed:.1f} seconds")

In [None]:
# Evaluate correctness
df_eval = df_results.with_column(
    "is_correct", 
    col("result")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
)

accuracy = df_eval.where(col("is_correct")).count_rows() / df_eval.count_rows()
print(f"Accuracy (with image): {accuracy:.1%}")
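
The comparison above is an exact string match after trimming whitespace, so a prediction such as `b.` would be counted as incorrect. If that turns out to matter on your data, a more forgiving normalization step could look like the sketch below. It uses only the standard library, assumes four-option questions (A to D, as in the AI2D samples), and `extract_choice` is a hypothetical helper rather than anything provided by Daft or Pydantic.

```python
import json

# Assumption: questions have four options, as in the AI2D samples shown earlier.
VALID_LETTERS = {"A", "B", "C", "D"}


def extract_choice(raw_response: str):
    """Return a normalized answer letter, or None if the response is unusable."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # truncated or malformed JSON
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object with a "choice" field
    choice = payload.get("choice")
    if not isinstance(choice, str):
        return None
    # Tolerate surrounding whitespace, a trailing period, and lowercase letters.
    letter = choice.strip().rstrip(".").upper()
    return letter if letter in VALID_LETTERS else None


print(extract_choice('{"choice": " b."}'))
print(extract_choice('{"choice": "Carbon dioxide"}'))
print(extract_choice('{"choice": "B"'))
```

Returning `None` for unusable responses keeps malformed outputs separate from wrong answers, which is useful when deciding whether a low score reflects the model's reasoning or its formatting.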


In [None]:
# Let's look at some results
df_eval.select("user", "image", "answer", col("result")["choice"].alias("predicted"), "is_correct").show(5)


## 5. Ablation Study

A simple accuracy score tells us *how often* the model is correct, but not *why*. To understand the contribution of image understanding, we'll conduct an **ablation study**: running the same prompts without images.

This lets us classify each example into four quadrants:

| Quadrant | With Image | Without Image | Interpretation |
|----------|------------|---------------|----------------|
| **Both Correct** | ✓ | ✓ | Question may be solvable from text alone |
| **Image Helped** | ✓ | ✗ | True image understanding |
| **Image Hurt** | ✗ | ✓ | Visual confusion |
| **Both Incorrect** | ✗ | ✗ | Hard question or model limitation |
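
The quadrant logic in the table can be sketched in plain Python before applying it to a dataframe. The outcomes below are made up for illustration. The sketch also checks a useful identity: the accuracy delta between the two runs equals the net share of examples where the image helped minus where it hurt, because the "Both Correct" and "Both Incorrect" rows cancel out.

```python
import math
from collections import Counter


def classify(correct_with_image: bool, correct_without_image: bool) -> str:
    """Map a pair of correctness flags to one of the four quadrants."""
    if correct_with_image and correct_without_image:
        return "Both Correct"
    if correct_with_image:
        return "Image Helped"
    if correct_without_image:
        return "Image Hurt"
    return "Both Incorrect"


# Made-up outcomes: (correct with image, correct without image)
outcomes = [
    (True, True),
    (True, False),
    (True, False),
    (False, True),
    (False, False),
]

counts = Counter(classify(w, wo) for w, wo in outcomes)
total = len(outcomes)

acc_with = sum(w for w, _ in outcomes) / total
acc_without = sum(wo for _, wo in outcomes) / total
delta = acc_with - acc_without

# Net effect of the image: (helped - hurt) / total
net = (counts["Image Helped"] - counts["Image Hurt"]) / total

print(counts)
print(f"delta={delta:+.1%}, net image effect={net:+.1%}")
```

A small delta can therefore hide a large amount of churn: many "Image Helped" cases offset by many "Image Hurt" cases, which is why the quadrant counts are more informative than the two accuracies alone.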


In [None]:
# Run without images
SYSTEM_PROMPT_NO_IMAGE = "Respond to the multiple choice question with just the letter corresponding to the correct answer."

start = time.time()
df_ablation = df_eval.with_column(
    "result_no_image",
    prompt(
        messages=col("user"),
        system_message=SYSTEM_PROMPT_NO_IMAGE,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
        **PARAMS,
    )
).with_column(
    "is_correct_no_image",
    col("result_no_image")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
).collect()
elapsed = time.time() - start

print(f"Processed {df_ablation.count_rows()} rows in {elapsed:.1f} seconds")


In [None]:
# Compare accuracy
accuracy_no_image = df_ablation.where(col("is_correct_no_image")).count_rows() / df_ablation.count_rows()

print(f"Accuracy with image:    {accuracy:.1%}")
print(f"Accuracy without image: {accuracy_no_image:.1%}")
print(f"Delta:                  {accuracy - accuracy_no_image:+.1%}")


In [None]:
from daft.functions import when, monotonically_increasing_id

# Classify into quadrants
df_classified = df_ablation.with_column(
    "id", monotonically_increasing_id()
).with_column(
    "quadrant",
    when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
    .when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
    .when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
    .otherwise("Both Incorrect")
)

# Show distribution
df_classified.groupby("quadrant").count().select("quadrant", col("id").alias("count")).show()


In [None]:
# Inspect cases where the image helped
df_classified.where(col("quadrant") == "Image Helped").select(
    "user", "image", "answer", 
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image")
).show(3)


In [None]:
# Inspect cases where the image hurt
df_classified.where(col("quadrant") == "Image Hurt").select(
    "user", "image", "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image")
).show(3)


In [None]:
# Show breakdown by quadrant with percentages
total_count = df_classified.count_rows()

# Use a separate name so the inference results in `df_results` are not overwritten
df_quadrants = df_classified.groupby("quadrant").count().select(
    "quadrant",
    col("id").alias("count")
).with_column(
    "percentage",
    (col("count") / daft.lit(total_count) * 100)
).collect()

df_quadrants.show()

## 6. LLM-as-a-Judge

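We can use an LLM (here, the same VLM) as a judge to explain *why* the model failed. The ground-truth answers already tell us which predictions are wrong, so the judge's job is diagnostic: given the image, the question, both predictions, and the correct answer, it describes the likely failure mode and attributes it to the question or to image understanding.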

In [None]:

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing the results of a textbook academic questions multiple choice benchmark.
Inspect the attached image and provide high-signal feedback on why the model chose its answer.
First, reason about the model's answer with the image and the model's answer without the image.
Second, develop a hypothesis for why the model made the choice it did. 
Third, attribute the failure to a 'question' issue or an 'image' understanding issue.
Finally, assign whether the model's answer with the image is correct and whether the model's answer without the image is correct.
"""


class JudgeResponse(BaseModel):
    """Structured diagnostic feedback from the VLM judge."""

    reasoning: str = Field(
        ..., description="Why did the model choose the answer it did?"
    )
    hypothesis: str = Field(
        ..., description="What caused the divergence from the correct answer?"
    )
    attribution: str = Field(
        ...,
        description="Was this a 'question' issue or an 'image' understanding issue or 'other'?",
    )

In [None]:
from daft.functions import format

# Build a judge prompt 
judge_template = format(
    """Given the image attached and the multiple choice question of <question>{}</question>,
The model chose the following prediction <model_answer>{}</model_answer> and without the image, the model chose the following prediction <no_image_model_answer>{}</no_image_model_answer>, but the correct answer is <correct_answer>{}</correct_answer>.

Provide diagnostic feedback.
""",
    col("user"),
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("answer"),
)

# Run judge on the same failure quadrants as the production pipeline
df_failures = df_classified.where(
    (col("quadrant") == "Image Hurt") | (col("quadrant") == "Both Incorrect")
)

# Judge needs more tokens than the multiple-choice inference passes.
JUDGE_PARAMS = {"temperature": 0.0, "max_tokens": 512}

df_judged = df_failures.with_column(
    "judge_response",
    prompt(
        messages=[col("image"), judge_template],
        system_message=JUDGE_SYSTEM_PROMPT,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=JudgeResponse,
        **JUDGE_PARAMS,
    ),
).collect()

print(f"Judged {df_judged.count_rows()} failure rows")
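
Before spending tokens on the judge, it can help to preview what a single rendered prompt looks like. The sketch below reproduces the template wording as a plain Python string and fills it with `str.format`, on the assumption that Daft's `format` fills `{}` placeholders positionally in the same way. The question comes from the AI2D sample shown earlier; the two predictions are made up to mimic an "Image Hurt" case, and `render_judge_prompt` is a hypothetical helper.

```python
# Same wording as the judge template above, written as a plain Python string.
JUDGE_TEMPLATE = (
    "Given the image attached and the multiple choice question of "
    "<question>{}</question>,\n"
    "The model chose the following prediction <model_answer>{}</model_answer> "
    "and without the image, the model chose the following prediction "
    "<no_image_model_answer>{}</no_image_model_answer>, "
    "but the correct answer is <correct_answer>{}</correct_answer>.\n\n"
    "Provide diagnostic feedback.\n"
)


def render_judge_prompt(question: str, with_image: str, without_image: str, correct: str) -> str:
    """Hypothetical helper: fill the template for one row, positionally."""
    return JUDGE_TEMPLATE.format(question, with_image, without_image, correct)


# Made-up predictions: wrong with the image ("D"), right without it ("B").
preview = render_judge_prompt(
    "Question: What do respiration and combustion give out "
    "Choices: A. Oxygen B. Carbon dioxide C. Nitrogen D. Heat "
    "Answer with the letter.",
    "D",
    "B",
    "B",
)
print(preview)
```

Reading one rendered prompt end to end is a quick way to catch swapped arguments, such as passing the no-image prediction where the with-image prediction belongs.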


### Interpreting Judge Feedback

The Judge is most useful on:
- **Image Hurt**: the model was correct *without* the image but incorrect *with* the image (the image introduced confusion).
- **Both Incorrect**: the model missed in both conditions (hard question, ambiguity, or capability gap).

Use the judge's `attribution` signal to quickly separate **question issues** (ambiguous prompt/choices) from **image understanding issues** (missed labels, small text, visual ambiguity). For more on these failure modes and what we observed at scale, see the accompanying blog post: [Multimodal Structured Outputs: Evaluating VLM Image Understanding at Scale](https://www.daft.ai/blog/multimodal-structured-outputs-evaluating-vlm-image-understanding-at-scale).
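
Because `attribution` is declared as a free-text `str`, the judge may phrase the same label in different ways. A minimal sketch of how those values could be bucketed and counted is shown below. The sample strings are invented for illustration, and the keyword matching in `normalize_attribution` (a hypothetical helper) is a simple heuristic, not the production pipeline's logic.

```python
from collections import Counter


def normalize_attribution(raw: str) -> str:
    """Bucket a free-text attribution into 'image', 'question', or 'other'."""
    text = raw.strip().lower()
    mentions_image = "image" in text
    mentions_question = "question" in text
    if mentions_image and not mentions_question:
        return "image"
    if mentions_question and not mentions_image:
        return "question"
    # Ambiguous (mentions both) or unrecognized values fall through to 'other'.
    return "other"


# Invented examples of what a judge might return in the `attribution` field.
judge_attributions = [
    "image",
    "Image understanding issue",
    "'question'",
    "question issue",
    "Both the question and the image",
    "other",
]

tally = Counter(normalize_attribution(a) for a in judge_attributions)
print(tally)
```

If the free-text variation becomes a nuisance, constraining the field in the response schema to a fixed set of labels is another option worth considering.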



In [None]:
from daft.functions import unnest

# Inspect a few judged failures (interactive)
df_judged.select(
    "quadrant",
    "user",
    "image",
    "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image"),
    unnest(col("judge_response")),
).show(3)



In [None]:
# Sanity checks: did we run every stage?

# Accuracies
print(f"Accuracy (with image):    {accuracy:.1%}")
print(f"Accuracy (without image): {accuracy_no_image:.1%}")
print(f"Delta:                   {accuracy - accuracy_no_image:+.1%}")

# Quadrant distribution
df_classified.groupby("quadrant").count().show()

# Judge coverage
print(f"Judge rows: {df_judged.count_rows()}")



## 7. Scale with Daft Cloud

**Everything above runs locally on 50 rows.**

The Cauldron, however, contains **millions of rows across 50 subsets**. To run this evaluation at that scale with consistent performance, you can run the same pipeline on [Daft Cloud](https://daft.ai/cloud). A Python script version of this notebook is available in the [daft-examples](https://github.com/Eventual-Inc/daft-examples) repo, in the [use_cases/image_understanding_eval](https://github.com/Eventual-Inc/daft-examples/tree/main/use_cases/image_understanding_eval) directory.

👉 [**Sign up for early access**](https://daft.ai/cloud) | [**Book a demo**](https://www.daft.ai/demo)

## 8. Conclusion

In this notebook, we built a small pipeline to evaluate Qwen3-VL's image understanding:

1. **Structured Outputs**: Used Pydantic models to enforce consistent responses
2. **Ablation Study**: Isolated image understanding from general reasoning
3. **Quadrant Analysis**: Classified results into actionable categories
4. **LLM-as-a-Judge**: Diagnosed failures on the most informative subsets ("Image Hurt" + "Both Incorrect")

---

### Resources

- [Daft Documentation](https://docs.daft.ai)
- [Daft Cloud](https://daft.ai/cloud)
- [The Cauldron Dataset](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)

**Canonical References:**
- [Getting Structured LLM Output (DeepLearning.ai)](https://learn.deeplearning.ai/courses/getting-structured-llm-output/information)
- [Judging LLM-as-a-Judge (NeurIPS 2023)](https://arxiv.org/abs/2306.05685)


