# Evaluating Image Understanding at Scale with Structured Outputs and LLM-as-a-Judge Feedback

*An end-to-end example of **Multimodal Structured Outputs** with Daft, vLLM, and Qwen3-VL-8B-Instruct*

<a target="_blank" href="https://colab.research.google.com/github/everettVT/daft-examples-1/blob/mm-structured-outptus/notebooks/mm_structured_outputs.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Introduction

**Structured Outputs** refers to a family of features that enables language models to respond in a constrained format. While language models continue to improve and demonstrate emergent abilities, their unpredictable nature makes them difficult to integrate with traditional software systems. Most real-world AI use cases leverage structured outputs to some extent, whether to execute tool calls or to adhere to Pydantic models.

Structured outputs are a core primitive of almost every LM workload. Here are a few canonical references that are worth saving:

- [**Getting Structured LLM Output** from dottxt founders Will Kurt & Cameron Pfiffer (DeepLearning.ai Course)](https://learn.deeplearning.ai/courses/getting-structured-llm-output/information)
- [**Coding for Structured Generation with LLMs** by Will Kurt](https://blog.dottxt.ai/coding-for-structured-generation.html)
- [**Structured Decoding in vLLM: A Gentle Introduction** by Aaron Pham](https://www.bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction#user-content-fn-7)

### What is Structured Outputs?

The underlying technology that makes structured outputs possible is called guided decoding. Guided decoding works on the model's logits: it adjusts the probabilities of the next possible tokens to enforce constraints or guide the generation process. This can be done by applying a logit bias to penalize or promote specific tokens, by filtering invalid tokens according to rules such as a Finite State Machine (FSM), or by using more advanced techniques that interact with the model's internal probability distribution.
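
To make the token-filtering idea concrete, here is a minimal pure-Python sketch. The four-token vocabulary, the scores, and the `allowed` set are invented for illustration, and `constrained_argmax` is a hypothetical helper, not part of vLLM or Daft. It shows only the masking step: disallowed tokens get a score of negative infinity, so they can never be selected.

```python
import math


def constrained_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token after masking out disallowed tokens."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    return max(masked, key=masked.get)


# Hypothetical next-token scores: left alone, the model would start a sentence.
logits = {"The": 3.2, "A": 1.1, "B": 2.4, "C": 0.7}

# A multiple-choice constraint only permits answer letters.
allowed = {"A", "B", "C", "D"}

unconstrained = max(logits, key=logits.get)
constrained = constrained_argmax(logits, allowed)
print(unconstrained, constrained)
```

A real engine repeats this step for every generated token, with the allowed set updated by the grammar or state machine as the output grows.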

Structured Outputs consists of 5 strategies that define the desired output type:

- Basic Python Types: `int`, `float`, `bool`...
- Multiple Choices: using `Literal` or `Enum`
- JSON Schemas: using Pydantic models or dataclasses
- Regex
- Context-free Grammars
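
As a rough illustration, the same multiple-choice constraint can be written in three of these styles using only the Python standard library. The names `Choice`, `CHOICE_PATTERN`, `CHOICE_SCHEMA`, and `is_valid_choice` are hypothetical. Note that this sketch validates a response after the fact, whereas guided decoding enforces the constraint during generation.

```python
import json
import re
from typing import Literal, get_args

# Multiple choice: a closed set of allowed values.
Choice = Literal["A", "B", "C", "D"]

# Regex: the same constraint expressed as a pattern.
CHOICE_PATTERN = re.compile(r"[A-D]")

# JSON schema: an object with a single constrained field.
CHOICE_SCHEMA = {
    "type": "object",
    "properties": {"choice": {"type": "string", "enum": list(get_args(Choice))}},
    "required": ["choice"],
}


def is_valid_choice(raw: str) -> bool:
    """Check a raw model response against the JSON-schema-style constraint."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    allowed = CHOICE_SCHEMA["properties"]["choice"]["enum"]
    return isinstance(payload, dict) and payload.get("choice") in allowed


print(is_valid_choice('{"choice": "C"}'))
print(is_valid_choice("The answer is C"))
```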

While there are a tremendous number of examples in pure Python, few tutorials demonstrate structured outputs in a large-scale processing context. Even fewer, if any, show how to run batch structured outputs with multimodal data on your own inference server. Here, we will walk you through the entire process against an OpenAI-compatible endpoint, such as a server you host yourself with [vLLM](https://docs.vllm.ai/en/v0.6.3.post1/serving/openai_compatible_server.html).

### What is an LLM-as-a-Judge?

**LLM-as-a-Judge** is a framework where a language model evaluates the outputs of other AI systems—providing qualitative assessments beyond simple accuracy metrics. Rather than relying solely on human evaluation (expensive, slow, inconsistent) or traditional metrics like BLEU/ROUGE (surface-level, miss semantic nuance), LLM judges can assess relevance, coherence, factual accuracy, and instruction adherence at scale.

The approach was formalized in the [seminal paper introducing MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685), which demonstrated that strong LLMs like GPT-4 can achieve ~80% agreement with human preferences—comparable to inter-annotator agreement between humans themselves.

Three common evaluation methods exist within the LLM-as-a-Judge framework:

1. **Pairwise Comparison**: The judge evaluates two responses and determines which is superior
2. **Single Answer Grading**: The judge assigns a score to a single response based on predefined criteria
3. **Reference-Guided Grading**: The judge compares a response against a known correct answer

In practice, these methods can be extended beyond simple grading to provide **diagnostic feedback**—analyzing *why* a model succeeded or failed, not just *whether* it did. This paradigm also extends naturally to vision-language models (**VLM-as-a-Judge**), enabling evaluation of multimodal outputs involving both text and images. In this notebook, we combine Reference-Guided Grading with diagnostic failure attribution to analyze our image understanding pipeline.

Some core canonical references for LLM-as-a-Judge include:

- [**Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena** by Zheng et al. (NeurIPS 2023)](https://arxiv.org/abs/2306.05685)
- [**Using an LLM-as-a-Judge** from the Cloud Security Alliance](https://cloudsecurityalliance.org/articles/using-an-llm-as-a-judge)
- [**LLM-as-a-Judge** on Wikipedia](https://en.wikipedia.org/wiki/LLM-as-a-Judge)

### Why This Matters

Large-scale multimodal structured outputs are a real-world use case that enterprise teams face when working with massive amounts of internal or private data. These teams run into significant hurdles with traditional tooling, especially for cutting-edge use cases like batch tool calls for background agents or reinforcement learning with verifiable rewards.

Daft's unified multimodal data processing engine is purpose built to support workloads like this and is rapidly becoming the default engine of choice for teams deploying frontier AI solutions in production.

In this notebook, we will leverage Daft to evaluate the image understanding accuracy of [Qwen3-VL-8B](https://github.com/QwenLM/Qwen3-VL) using Hugging Face's [the_cauldron dataset](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron). By the end of this notebook, you will be ready to implement your own distributed batch structured outputs pipeline with a copy-paste script you can use in your own environment.

---

### Table of Contents

1. [Setup](#1-setup)
2. [Data Loading and Preprocessing](#2-data-loading-and-preprocessing)
3. [Multimodal Structured Outputs with `prompt`](#3-multimodal-structured-outputs-with-the-prompt-function)
4. [Analysis: Image Input Ablation Study](#6-analysis-image-input-ablation-study)
5. [LLM-as-a-Judge Evaluation](#llm-as-a-judge-evaluation)
6. [Putting everything together](#putting-everything-together)
7. [Executing the Pipeline](#executing-the-pipeline)
8. [Conclusion](#conclusion)

## 1. Setup

### Install Dependencies

In [None]:
!pip install -q "daft>=0.6.14" openai pydantic python-dotenv numpy pillow ipykernel ipywidgets

### Configure Parameters

In [None]:
MODEL_ID = "qwen/qwen3-vl-8b-instruct"        # vLLM & OpenRouter
DATASET_URI = "HuggingFaceM4/the_cauldron"

### Inference Configuration and Setup

**Using an OpenAI compatible Provider (OpenRouter)**

[OpenRouter](https://openrouter.ai/models?max_price=0.5&order=top-weekly) has model endpoints for [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).

In [None]:
import os
from dotenv import load_dotenv

load_dotenv() # load your environment variables from .env file

OPENAI_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENAI_BASE_URL = "https://openrouter.ai/api/v1/"

## 2. Data Loading and Preprocessing

[HuggingFaceM4/the_cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron/viewer?views%5B%5D=ai2d) is a massive collection of 50 vision-language datasets spanning millions of rows across:

1. General visual question answering
2. OCR document understanding & text transcription
3. Chart/figure understanding
4. Table understanding
5. Reasoning, logic, maths
6. Textbook/academic questions
7. Differences between 2 images
8. Screenshot to code

This dataset is a great resource for evaluating the image understanding capabilities of a vision language model, as it gives us a wide range of tasks and image compositions to test on. Its size alone makes it particularly useful for training and validation.

For now, we will begin with the general visual Q&A subset, AI2D.

In [None]:
import daft

df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d").collect()
df_raw.show(3)

### Decoding Images into Daft's Image Type

Daft provides a simple way to decode images into its internal Image type. This allows you to use Daft's powerful image processing capabilities to preprocess your images before sending them to a model.

Note: You can click on any cell to preview its contents.

In [None]:
from daft import col

# Explode the list of images (usually just one image anyways)
df_img = df_raw.explode(col("images"))

# Decode the image into daft.DataType.image()
df_img = df_img.with_column("image_decoded", col("images")["bytes"].decode_image())
df_img.show(3)

### Preprocessing the 'texts' column to extract Question, Choices, and Answer Columns

Copying an entry from the `texts` column yields a list of chat-style message dicts of the form:

```text
[{
    "user": """Question:
            
        From the above food web diagram, what cause kingfisher to increase

        Choices:
            A. decrease in fish
            B. decrease in water boatman
            C. increase in fish
            D. increase in algae

        Answer with the letter.""",

    "assistant": "Answer: C",
    "source": "AI2D",
}, ...]
```

In [None]:
from daft.functions import unnest

# Explode the List of Dicts inside "texts" and unnest the resulting Struct into dedicated "user", "assistant", and "source" columns
df_text = df_img.explode(col("texts")).select(unnest(col("texts")), "image_decoded")

df_text.show(3)
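
For intuition, the explode and unnest steps can be mimicked in plain Python. This is only a sketch of the row-level effect, not Daft's implementation; the sample `row` and the `explode_and_unnest` helper are made up for illustration.

```python
# A hypothetical raw row: one image and two question/answer entries.
row = {
    "images": ["<image bytes>"],
    "texts": [
        {"user": "Question: ...", "assistant": "Answer: C", "source": "AI2D"},
        {"user": "Question: ...", "assistant": "Answer: A", "source": "AI2D"},
    ],
}


def explode_and_unnest(row: dict) -> list[dict]:
    """Emit one flat row per (image, text entry) pair, promoting dict keys to columns."""
    return [
        {**entry, "image": image}
        for image in row["images"]
        for entry in row["texts"]
    ]


flat_rows = explode_and_unnest(row)
for flat in flat_rows:
    print(sorted(flat))
```

Each entry in `texts` becomes its own row, and the `user`, `assistant`, and `source` keys become top-level columns next to the image.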

### Parsing Text with Regular Expressions

We can also leverage Daft's built-in [regular expression functions](https://docs.daft.ai/en/stable/api/functions/regexp_extract/#daft.functions.regexp_extract) to strip the `Answer:` prefix from each assistant message, leaving only the answer.

In [None]:
# Parsing "assistant" message to extract the answer
df_prepped = df_text.with_column("answer", col("assistant").regexp_replace("Answer:", "")).collect()
df_prepped.show(3)
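
One detail worth noting: removing only the `Answer:` prefix leaves the space that followed it, which is why the correctness check later strips whitespace on both sides. The standard-library sketch below shows the difference; `extract_answer` is a hypothetical helper, not a Daft function.

```python
import re


def extract_answer(assistant_message: str) -> str:
    """Return the text after the 'Answer:' prefix with surrounding whitespace removed."""
    return re.sub(r"^\s*Answer:\s*", "", assistant_message).strip()


# Removing only the literal prefix keeps the separating space.
prefix_only = "Answer: C".replace("Answer:", "")
cleaned = extract_answer("Answer: C")
print(repr(prefix_only), repr(cleaned))
```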

## 3. Multimodal Structured Outputs with the `prompt` function

Now we will move on to scaling our OpenAI client calls with Daft's new `prompt` function. Using a similar syntax to OpenAI client calls adapted for dataframes, we can quickly scale our structured output requests across the dataset.

For more info see the [API docs](https://docs.daft.ai/en/stable/api/functions/prompt/#daft.functions.prompt), [User Guide](https://docs.daft.ai/en/stable/ai-functions/prompt/), & [Usage Patterns](https://github.com/Eventual-Inc/daft-examples/tree/main/usage_patterns/prompt).

In [None]:
# First, we need to set the OpenAI Provider with our api_key and custom base_url that we set earlier.
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

In [None]:
from daft import col
from daft.functions import prompt
from pydantic import BaseModel, Field
import time

# Let's keep our initial requests low to avoid spending too much $$$
LIMIT = 50

# Define a Pydantic model that captures only the choice:
class ChoiceResponse(BaseModel):
    choice: str

start = time.time()
df = df_prepped.with_column(
    "result",
    prompt(
        messages = [col("image_decoded"), col("user")],
        system_message = "Observe the attached image and respond to the multiple choice question with just the letter corresponding to the correct answer.",
        model = MODEL_ID,
        use_chat_completions = True,
        return_format=ChoiceResponse,
    )
).limit(LIMIT).collect()
end = time.time()
print(f"Processed {df.count_rows()} rows in {end-start} seconds")

In [None]:
df.show()

In [None]:
from daft import DataType
import json

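# Parse a raw JSON string response into a struct with a single "choice" field
@daft.func(return_dtype=DataType.struct({"choice": DataType.string()}))
def normalize_vllm_output(result: str):
    return json.loads(result)

# In this notebook, only the local vLLM endpoint needs this normalization step
if OPENAI_BASE_URL == "http://0.0.0.0:8000/v1":
    df = df.with_column("result", normalize_vllm_output(col("result"))).collect()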

df.show()
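
If you switch between providers often, a provider-agnostic normalizer may be easier to reason about than a URL check. The plain-Python sketch below accepts either a JSON string or an already-parsed mapping; `normalize_choice` is a hypothetical helper and would still need to be wrapped as a Daft UDF to run on a DataFrame column.

```python
import json
from typing import Union


def normalize_choice(result: Union[str, dict]) -> dict:
    """Accept a JSON string or a parsed mapping and return a clean {'choice': ...} dict."""
    if isinstance(result, str):
        result = json.loads(result)
    return {"choice": str(result["choice"]).strip()}


print(normalize_choice('{"choice": " C "}'))
print(normalize_choice({"choice": "B"}))
```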

Now that we've run our structured output request, we can compare the model's answer to the correct answer to calculate a pass/fail rate.

In [None]:
df = df.with_column("is_correct", col("result")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()) # strip whitespace

pass_fail_rate = df.where(col("is_correct")).count_rows() / df.count_rows()
print(f"Pass/Fail Rate: {pass_fail_rate}")
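
The number printed above is the fraction of rows answered correctly. A minimal plain-Python equivalent, using made-up predictions and answers, looks like this (`pass_rate` is a hypothetical helper):

```python
def pass_rate(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that match the reference answer, ignoring whitespace."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute a pass rate over zero rows")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical model outputs and reference answers (note the stray spaces).
predictions = ["C", " A", "B", "D"]
answers = [" C", " B", " B", " D"]
rate = pass_rate(predictions, answers)
print(f"Pass rate: {rate:.2f}")
```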

## 6. Analysis: Image Input Ablation Study

A simple accuracy score tells us *how often* the model is correct, but not *why*. To understand the contribution of image understanding to our model's performance, we'll conduct an **ablation study**—systematically removing the image input and comparing results.

Our analysis proceeds in three stages:

1. **Baseline Comparison**: Run the same prompts with and without images to establish accuracy deltas
2. **Quadrant Classification**: Categorize each example into one of four outcomes:
   - **Both Correct**: Model succeeds regardless of image (may indicate text-only solvable questions)
   - **Image Helped**: Model only succeeds when given the image (true image understanding)
   - **Image Hurt**: Model succeeds without the image but fails with it (potential visual confusion)
   - **Both Incorrect**: Model fails regardless (harder questions or model limitations)
3. **Case Inspection**: Examine specific examples from each quadrant to build intuition

This methodology isolates the model's image understanding capability from its general reasoning ability, giving us actionable signal about where the model excels and struggles.
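
The quadrant logic itself is small enough to sketch in plain Python before applying it to a DataFrame. The `classify` helper and the sample outcomes below are illustrative only.

```python
from collections import Counter


def classify(correct_with_image: bool, correct_without_image: bool) -> str:
    """Map the two correctness flags onto one of the four ablation quadrants."""
    if correct_with_image and correct_without_image:
        return "Both Correct"
    if correct_with_image:
        return "Image Helped"
    if correct_without_image:
        return "Image Hurt"
    return "Both Incorrect"


# Hypothetical (with image, without image) outcomes for five questions.
outcomes = [(True, True), (True, False), (False, True), (False, False), (True, False)]
counts = Counter(classify(with_img, without_img) for with_img, without_img in outcomes)
print(counts)
```

Rows in the "Image Helped" quadrant are the strongest evidence of genuine image understanding, since the text alone was not enough to answer them.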

In [None]:
# First, let's investigate some of the failures
df.where(col("is_correct") == False).select("user", "image_decoded", "answer", "result").show(8)

These failures could have been caused by a variety of factors, but manually reviewing them one by one is labor-intensive and time-consuming.

Moving beyond simple accuracy, we can learn a lot more about how strong our model's image understanding is by comparing our results with and without the image input.

In [None]:
# How do these results compare without images?
start = time.time()
df = df.with_column(
    "result_no_image",
    prompt(
        col("user"),
        system_message = "Respond to the multiple choice question with just the letter corresponding to the correct answer.",
        model = MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,  # same structured output as the image run
    )
).limit(LIMIT).collect()

# As before, only the local vLLM endpoint needs the JSON normalization step
if OPENAI_BASE_URL == "http://0.0.0.0:8000/v1":
    df = df.with_column("result_no_image", normalize_vllm_output(col("result_no_image"))).collect()

df = df.with_column(
    "is_correct_no_image",
    col("result_no_image")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip(),  # strip whitespace
)
end = time.time()

print(f"Processed {df.count_rows()} rows in {end-start} seconds")

In [None]:
pass_fail_rate_no_image = df.where(col("is_correct_no_image")).count_rows() / df.count_rows()
print(f"Pass/Fail Rate: \n With Image: {pass_fail_rate} \n Without Image: {pass_fail_rate_no_image} ")

- Now that we have pass rates for our tests with and without images, how did the accuracy compare?
- Given your results, is it clear whether the image helped or hurt the model's performance?
- Let's dive into the details to gain a better understanding of our model's performance.

In [None]:
# Let's assign an id to each row and investigate the results.
df = df.with_column("id", daft.functions.monotonically_increasing_id())
df.select("id", "user", "image_decoded", "answer", "result", "result_no_image", "is_correct", "is_correct_no_image").show()

In [None]:
# Where did the model have a different answer with and without an image input?
df_img_diff = df.where(col("result")["choice"] != col("result_no_image")["choice"]).collect()
df_img_diff.count_rows()

In [None]:
# What were the differences in correctness?
df_correct_diff = df.where(col("is_correct") != col("is_correct_no_image")).collect()
df_correct_diff.count_rows()

In [None]:
# What is the combination of both of these?
from daft.functions import when

df_classified = df.with_column(
    "classification",
    when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
    .when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
    .when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
    .otherwise("Both Incorrect")
)

# View the counts for each quadrant
df_classified.groupby("classification").count().select("classification", col("id").alias("count")).show()


Now that we have a better idea of the distribution of our results, let's investigate specific cases where the image helped or hurt the model's performance.

In [None]:
# Inspect specific cases where the image helped
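df_classified.where(col("classification") == "Image Helped").select("user", "image_decoded", "answer", col("result")["choice"].alias("choice_with_image"), col("result_no_image")["choice"].alias("choice_no_image")).show(5)  # alias so the two choice columns have distinct names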

In [None]:
# Inspect specific cases where including the image hurt the model's performance
df_classified.where(col("classification") == "Image Hurt").select("user", "image_decoded", "answer", "result", "result_no_image").show(5)

## LLM-as-a-Judge Evaluation

Now that we've identified *where* our model failed through ablation, let's use **VLM-as-a-Judge** to understand *why*. As introduced earlier, we're combining **Reference-Guided Grading** (comparing against the known correct answer) with **diagnostic failure attribution** (analyzing the root cause of errors).

For each failure case, our judge will inspect the image and provide:
- **Reasoning**: Why the model chose its answer
- **Hypothesis**: What caused the divergence from the correct answer  
- **Attribution**: Whether the failure stems from the question or the image

This diagnostic feedback can be integrated into experiment tracking systems to systematically improve prompts and model configurations over time.

In [None]:

from daft.functions import format  # Daft's column-aware format, not the Python builtin

# Have an LLM judge run a post-mortem
judge_template = format(
    "Referencing the attached image, hypothesize why the model under evaluation chose <choice_with_image>{}</choice_with_image> and <choice_no_image>{}</choice_no_image> to the <question>{}</question> where the correct answer is supposed to be <correct_answer>{}</correct_answer>.",
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("user"),
    col("answer")
)

judge_system_prompt = """
You are an impartial judge reviewing the results of a visual question and answer benchmark of a vision language model.
Focusing on the discrepancy between the model's answer with and without the image contrasted against the correct answer, inspect the attached image and provide high-signal feedback why the model chose the answer it did.
Do not propose improvements.
Your hypothesis should be grounded in evaluating image understanding, improving the model's ability to reason about the image, and not the model's ability to reason about the text.
Specifically, your feedback should only improve accuracy scores when the image is attached, and should not improve scores when the image is not attached.
This is a safe space for you to express your thoughts and insights without fear of spec-gaming.
"""

class JudgeResponse(BaseModel):
    reasoning: str = Field(..., description="Provide the reasoning for why the model chose the answer it did.")
    hypothesis: str = Field(..., description="Provide the hypothesis for why the model's choices diverged from the correct answer.")
    attribution: str = Field(..., description="Concisely attribute a specific aspect of the image or question that may have led to the model's choices diverging from the correct answer.")
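
To see what the judge actually receives, it can help to render the same template for a single row with Python's built-in `str.format`. This is only an illustration: the Daft `format` function above builds the string per row from columns, and the sample values and abbreviated question text below are hypothetical.

```python
# Same template text as above, rendered for one hypothetical failure case.
JUDGE_TEMPLATE_TEXT = (
    "Referencing the attached image, hypothesize why the model under evaluation chose "
    "<choice_with_image>{}</choice_with_image> and "
    "<choice_no_image>{}</choice_no_image> to the <question>{}</question> "
    "where the correct answer is supposed to be <correct_answer>{}</correct_answer>."
)

rendered = JUDGE_TEMPLATE_TEXT.format(
    "A",  # model's choice with the image
    "D",  # model's choice without the image
    "From the above food web diagram, what cause kingfisher to increase ...",
    "C",  # reference answer
)
print(rendered)
```

The XML-style tags keep the four values clearly delimited, so the judge can tell the model's two answers apart from the reference answer.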

In [None]:
from daft import lit
# Filter for the Failures
df_failures = df_classified.where((col("classification") == "Image Hurt") | (col("classification") == "Both Incorrect"))

# Run the Evaluation
df_judge = (
    df_failures
    .with_column(
        "judge",
        prompt(
            messages = [col("image_decoded"), judge_template],
            system_message = judge_system_prompt,
            model = MODEL_ID,
            use_chat_completions = True,
            return_format = JudgeResponse,
        )
    )
).collect()

In [None]:
# Let's review the judge's outputs:
df_judge.select("user", "result", "answer", "image_decoded", unnest(col("judge"))).show()

Keep in mind that having a large number of tests is critical to ensuring that your vision language model isn't gaming the system by simply memorizing the answers. It is always considered best practice to split your training and validation data into separate datasets, commonly called a test/train split.

## Putting everything together

Now that we have walked through implementing this image understanding evaluation interactively, let's combine all of our code into a single pipeline so we can take full advantage of lazy evaluation and leave room for future extension and reuse.

In [None]:
from daft import col  # needed by JUDGE_TEMPLATE below
from daft.functions import format
from pydantic import BaseModel, Field
import os

OPENAI_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENAI_BASE_URL = "https://openrouter.ai/api/v1/"
MODEL_ID = "qwen/qwen3-vl-8b-instruct"        # vLLM & OpenRouter
DATASET_URI = "HuggingFaceM4/the_cauldron"
SUBSET = "ai2d"
LIMIT = 100
BASE_PROMPT = "Respond to the multiple choice question with just the letter corresponding to the correct answer."
IMAGE_PROMPT = "Reference the attached image."
KWARGS = {
    "temperature": 0.1,
}

class ChoiceResponse(BaseModel):
    choice: str = Field(..., description="Provide the letter of the correct choice with no other text ie: F")

JUDGE_TEMPLATE = format(
    "Referencing the attached image, hypothesize why the model under evaluation chose <choice_with_image>{}</choice_with_image> and <choice_no_image>{}</choice_no_image> to the <question>{}</question> where the correct answer is supposed to be <correct_answer>{}</correct_answer>.",
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("user"),
    col("answer")
)

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing the results of a visual question and answer benchmark of a vision language model.
Focusing on the discrepancy between the model's answer with and without the image contrasted against the correct answer, inspect the attached image and provide high-signal feedback why the model chose the answer it did.
Do not propose improvements.
Your hypothesis should be grounded in evaluating image understanding, improving the model's ability to reason about the image, and not the model's ability to reason about the text.
Specifically, your feedback should only improve accuracy scores when the image is attached, and should not improve scores when the image is not attached.
This is a safe space for you to express your thoughts and insights without fear of spec-gaming.
"""

class JudgeResponse(BaseModel):
    reasoning: str = Field(..., description="Provide the reasoning for why the model chose the answer it did.")
    hypothesis: str = Field(..., description="Provide the hypothesis for why the model's choices diverged from the correct answer.")
    attribution: str = Field(..., description="Attribute a specific aspect of the image or question that may have led to the model's choices diverging from the correct answer.")

We can break each of our steps down into separate lazy DataFrames for reusability.

In [None]:
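import daft
from daft import col, lit
from daft.functions import prompt, when, format, monotonically_increasing_id, unnest

# Register the OpenAI-compatible provider so the script also runs outside the notebook session
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)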

# Read the Dataset
df_raw = daft.read_huggingface(f"HuggingFaceM4/the_cauldron/{SUBSET}")

# Preprocess the dataset
df_prep = (
    df_raw
    # Prepare Images
    .explode("images")
    .with_column("image_decoded", col("images").struct.get("bytes").decode_image())
    # Prepare Text
    .explode("texts")
    .select(unnest(col("texts")), "image_decoded")
    # Extract Answer Letter
    .with_column("answer", col("assistant").regexp_replace("Answer: ", ""))
)


df_run = (
    df_prep
    # Run Structured Output on the images + text
    .with_column(
        "result",
        prompt(
            messages = [col("image_decoded"), col("user")],
            system_message = IMAGE_PROMPT + " " + BASE_PROMPT,
            model = MODEL_ID,
            use_chat_completions = True,
            return_format = ChoiceResponse,
            **KWARGS,
        )
    )
    # Evaluate Correctness
    .with_column(
        "is_correct",
        col("result")["choice"] == col("answer"),
    )
    # Run Structured Output on the text only
    .with_column(
        "result_no_image",
        prompt(
            messages = col("user"),
            system_message = BASE_PROMPT,
            model = MODEL_ID,
            provider = "openai",
            use_chat_completions = True,
            return_format = ChoiceResponse,
            **KWARGS,  # same sampling parameters as the image run, so only the image input varies
        )
    )
    # Evaluate Correctness
    .with_column(
        "is_correct_no_image",
        col("result_no_image")["choice"] == col("answer"),
    )
)

# Analyze the results
df_analysis = (
    df_run
    .with_column("id", monotonically_increasing_id())
    .with_column(
        "classification",
        when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
        .when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
        .when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
        .otherwise("Both Incorrect")
    )
)


# Grab the rows where the image hurt or both incorrect
df_failures = df_analysis.where((col("classification") == lit("Image Hurt")) | (col("classification") == lit("Both Incorrect")))

# Run LLM-as-a-Judge
df_judge = (
    df_failures
    .with_column(
        "judge",
        prompt(
            messages = [col("image_decoded"), JUDGE_TEMPLATE],
            system_message = JUDGE_SYSTEM_PROMPT,
            model = MODEL_ID,
            provider = "openai",
            use_chat_completions = True,
            return_format = JudgeResponse,
        )
    )
)

## Executing the Pipeline

Since our pipeline is lazy, we can break down our execution as needed. In this scenario, we will materialize `df_run` and execute `df_analysis` and `df_judge` in a separate step to minimize recomputation.

In [None]:
# Execute the Run Step
df_run = df_run.limit(LIMIT).collect()
df_run.show()

In [None]:
# Execute the Analysis Step
df_analysis = df_analysis.limit(LIMIT).collect()

In [None]:
# Execute the Judge Step
df_judge = df_judge.limit(LIMIT).collect()
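
One caveat about lazy plans in general: a stage that was defined on top of a lazy parent usually re-runs the parent's work when it is collected, unless it is rebuilt on the materialized result. Whether this applies to the cells above depends on how your Daft version caches collected results, so treat it as something to verify. The toy `LazyStage` class below is invented for illustration and is not Daft's API; it only counts how often an expensive upstream step executes.

```python
class LazyStage:
    """Toy lazy computation that counts how often its own function runs."""

    def __init__(self, fn, parent=None):
        self.fn = fn
        self.parent = parent
        self.runs = 0

    def collect(self):
        # Collecting a stage first collects everything upstream of it.
        upstream = self.parent.collect() if self.parent else None
        self.runs += 1
        return self.fn(upstream)


def materialized(values):
    """Wrap already-computed values so downstream stages can reuse them."""
    return LazyStage(lambda _: values)


# `run` stands in for the expensive prompt calls.
run = LazyStage(lambda _: [True, False, True])
analysis = LazyStage(lambda rows: [not r for r in rows], parent=run)

run_result = run.collect()      # the expensive step executes once
analysis.collect()              # and again, because `analysis` points at the lazy stage
recomputed_runs = run.runs

# Rebuilding the downstream stage on the materialized result avoids the rerun.
analysis_cached = LazyStage(lambda rows: [not r for r in rows], parent=materialized(run_result))
analysis_cached.collect()
runs_after_caching = run.runs
print(recomputed_runs, runs_after_caching)
```

In a real pipeline, the equivalent move is to define the downstream steps on the collected DataFrame rather than on the original lazy plan, which also keeps the judge from seeing different model outputs than the analysis step.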

### Showing Results

In [None]:
# Show counts for each quadrant (show() returns None, so there is nothing to assign)
df_analysis.groupby("classification").count().select("classification", col("id").alias("count")).show()

In [None]:
# Show which ids are in each quadrant (show() returns None, so there is nothing to assign)
df_analysis.groupby("classification").agg_list("id").select("classification", col("id").alias("ids")).show()

In [None]:
df_judge.select("user", "result", "answer", "image_decoded", unnest(col("judge"))).show()

### Persisting Results

Finally we can persist our results to a table for future analysis.

In [None]:
df_run.write_parquet(".data/the_cauldron_image_ablation_study.parquet")

In [None]:
df_analysis.write_parquet(".data/the_cauldron_image_ablation_study_analysis.parquet")

In [None]:
df_judge.write_parquet(".data/the_cauldron_image_ablation_study_judge.parquet")

## Conclusion

In this notebook we explored how to evaluate Qwen3-VL's image understanding using a subset of Hugging Face's The Cauldron dataset. The AI2D subset we used is just one of 50 vision-language datasets, totaling millions of rows, that can be used to evaluate or train vision language models. You can also use this pipeline to compare model performance across sampling parameters or model variants. Note that only the vision-language (VL) variants of Qwen3 accept image inputs, and datasets outside of *The Cauldron* would require different preprocessing stages.

A natural next step would be to parallelize this pipeline across multiple datasets and multiple GPUs. At that scale, a managed provider like OpenRouter may no longer be practical. With [Daft Cloud](https://daft.ai/cloud) you can run Qwen3-VL on as much data as you want with no rate limits or GPU configuration headaches.

### Next Steps


**1. Multi-Dataset Evaluation**

Extend evaluation across all 50 subsets of The Cauldron to build a comprehensive benchmark:

```python
subsets = ["ai2d", "chartqa", "docvqa", "infographicvqa", ...]
for subset in subsets:
    df = pipeline(f"HuggingFaceM4/the_cauldron/{subset}")
    df.write_parquet(f"results/{subset}.parquet")
```

**2. Distributed Execution with Ray**

For large-scale runs, switch to Daft's distributed runner, [Flotilla](https://www.daft.ai/blog/introducing-flotilla-simplifying-multimodal-data-processing-at-scale), for distributed compute, or to Daft Cloud for a fully managed service.

```python
import daft
daft.set_runner_ray()  # Daft orchestrates Ray automatically
```

**3. Experiment Tracking Integration**

Wire the judge feedback into MLflow, Weights & Biases, or similar tools to track how prompt/parameter changes affect the accuracy quadrants over time.

**4. Reinforcement Learning with Verifiable Rewards (RLVR)**

The pipeline we built produces exactly what's needed for RLVR training loops:
- **Verifiable rewards**: The `is_correct` column provides a binary reward signal—no human labeling required
- **Diagnostic signal**: The judge's `attribution` field ("question" vs "image") can inform reward shaping, penalizing failures caused by poor image understanding more heavily
- **Scalable generation**: Daft can generate millions of (prompt, response, reward) tuples for frameworks like [TRL](https://huggingface.co/docs/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), or [veRL](https://github.com/volcengine/verl)
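
As a rough sketch of the reward-shaping idea, the function below turns a correctness flag and an attribution label into a scalar reward. It assumes the judge's free-text `attribution` has already been reduced to a label such as `"image"` or `"question"`. The `shaped_reward` name and the `0.5` penalty are arbitrary choices for illustration, not values from any RL framework.

```python
from typing import Optional


def shaped_reward(
    is_correct: bool,
    attribution: Optional[str] = None,
    image_penalty: float = 0.5,
) -> float:
    """Binary reward with an extra penalty for failures attributed to the image."""
    if is_correct:
        return 1.0
    if attribution == "image":
        return -image_penalty
    return 0.0


# Hypothetical rows: (is_correct, attribution label derived from the judge)
rows = [(True, None), (False, "image"), (False, "question")]
rewards = [shaped_reward(correct, label) for correct, label in rows]
print(rewards)
```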

