MM-CondChain: A Programmatically Verified Benchmark for
Visually Grounded Deep Compositional Reasoning
Haozhan Shen1,2,
Shilin Yan1†,
Hongwei Xue1‡,
Shuaiqi Lu1,
Xiaojun Tang1,
Guannan Zhang1,
Tiancheng Zhao3‡,
Jianwei Yin2
† Project Leader ‡ Corresponding Author
1Accio Team, Alibaba Group 2Zhejiang University 3ZJU-BJ
2026.03.13: We release MM-CondChain, the first benchmark for visually grounded deep compositional reasoning in MLLMs.
We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).
Key features of MM-CondChain:
- Multi-layer compositional reasoning: Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence.
- Programmatic verifiability: We propose an agentic synthesis pipeline built on a Verifiable Programmatic Intermediate Representation (VPIR), which ensures that each condition is mechanically verifiable.
- Paired hard negatives: The Composer automatically produces paired True-path and False-path instances that differ by exactly one flipped predicate.
- Three visual domains: Natural images, data charts, and GUI trajectories.
- Deterministic evaluation: All instances are formulated as multiple-choice questions with deterministic answers, enabling reproducible evaluation without LLM-as-judge.
Experiments on a range of MLLMs show that even the strongest model, Gemini-3-Pro, attains only 53.33 average Path F1, indicating that deep compositional reasoning remains a fundamental challenge.
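Because every instance has a single deterministic answer, scoring can be done by plain string comparison. The sketch below shows exact-match accuracy over option labels such as `F1` and `A1`. It does not compute Path F1, whose definition is not given here. The instance ids, the `B2` label, and the function names are hypothetical.

```python
def normalize(label):
    """Canonicalize an option label, e.g. ' a1 ' becomes 'A1'."""
    return label.strip().upper()


def exact_match_accuracy(predictions, gold):
    """Fraction of gold instances whose predicted option equals the gold option.

    Both arguments map an instance id to an option label. A missing
    prediction counts as wrong.
    """
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(
        1
        for key, answer in gold.items()
        if normalize(predictions.get(key, "")) == normalize(answer)
    )
    return correct / len(gold)


# Hypothetical ids and predictions, for illustration only.
gold = {"natural_001/true_path": "F1", "natural_001/false_path": "A1"}
predictions = {"natural_001/true_path": "f1", "natural_001/false_path": "B2"}
accuracy = exact_match_accuracy(predictions, gold)
print(accuracy)  # 0.5
```

No judge model is involved, so repeated runs over the same predictions give the same score.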
| Domain | Images/Trajectories | Samples |
|---|---|---|
| Natural | 398 | 796 |
| Chart | 200 | 400 |
| GUI | 377 (3,421 frames) | 754 |
| Total | 975 | 1,950 |
Each image/trajectory yields one conditional chain, compiled into a paired True-path and False-path instance.
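The pairing can be checked against the table above: each domain should have exactly twice as many samples as images or trajectories. A small sketch, with the counts copied from the table and a hypothetical helper name:

```python
# Per-domain counts copied from the table above: (images or trajectories, samples).
STATS = {
    "Natural": (398, 796),
    "Chart": (200, 400),
    "GUI": (377, 754),
}


def check_pairing(stats):
    """Check the two-samples-per-source pairing and return the totals."""
    total_sources = 0
    total_samples = 0
    for domain, (sources, samples) in stats.items():
        # One chain per source, compiled into one True-path and one False-path instance.
        assert samples == 2 * sources, domain
        total_sources += sources
        total_samples += samples
    return total_sources, total_samples


totals = check_pairing(STATS)
print(totals)  # (975, 1950)
```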
MM-CondChain/
├── README.md
├── data/
│   ├── natural.jsonl
│   ├── chart.jsonl
│   └── gui.jsonl
└── images/
    ├── natural/
    │   └── *.jpg
    ├── chart/
    │   └── *.png
    └── gui/
        └── <trajectory_id>/
            └── <trajectory_id>_*.png
Each JSONL file contains samples with the following fields:
{
  "id": "natural_001",
  "domain": "natural",
  "image": "images/natural/sa_24810.jpg",
  "true_path": {
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "F1"
  },
  "false_path": {
    "diverge_node": "qa_1",
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "A1"
  }
}
Note on image paths:
- For the Natural and Chart domains, `image` is a single image path (e.g., `images/natural/sa_24810.jpg`).
- For the GUI domain, `image` is a trajectory folder path (e.g., `images/gui/GENERAL-9532638838594693992`). To load GUI images, list all PNG files in the folder sorted by filename.
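The fields and the image path note above suggest a loader along the following lines. This is a minimal sketch, not an official loader: the function names and the composite instance ids are hypothetical, and `root` is assumed to be the local `MM-CondChain/` directory. A folder is treated as a GUI trajectory, and anything else as a single image.

```python
import json
from pathlib import Path


def load_records(jsonl_path):
    """Read one JSON record per non-empty line of a JSONL file."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def resolve_images(record, root):
    """Return the ordered list of image files that belong to a record."""
    path = Path(root) / record["image"]
    if path.is_dir():
        # GUI: `image` is a trajectory folder; frames are PNGs sorted by filename.
        return sorted(path.glob("*.png"), key=lambda p: p.name)
    # Natural and Chart: `image` is a single file.
    return [path]


def expand_paths(record):
    """Split one record into its True-path and False-path instances."""
    instances = []
    for key in ("true_path", "false_path"):
        branch = record[key]
        instances.append({
            "id": f'{record["id"]}/{key}',  # hypothetical composite id
            "instruction": branch["full_instruction"],
            "answer": branch["correct_answer"],
        })
    return instances
```

Sorting is lexicographic on the filename, as the note above says. Whether frame indices are zero-padded is not stated here, so it is worth checking the resulting frame order on a real trajectory.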
| Model | Natural F1 | Chart F1 | GUI F1 | Avg F1 |
|---|---|---|---|---|
| Gemini-3-Pro | 55.91 | 66.04 | 38.05 | 53.33 |
| GPT-5-0807 | 47.51 | 65.44 | 38.06 | 50.34 |
| Gemini-3-Flash | 47.19 | 61.96 | 35.78 | 48.31 |
| Qwen3-VL-235B-Thinking | 49.31 | 59.96 | 31.23 | 46.83 |
| Qwen3.5-397B-A17B | 38.97 | 58.55 | 40.19 | 45.90 |
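The Avg F1 column appears to be the unweighted mean of the three domain scores. That reading is inferred from the numbers in the table, not stated in the text. The sketch below checks it to within rounding; the helper name is hypothetical.

```python
# Rows copied from the table above: (Natural F1, Chart F1, GUI F1, reported Avg F1).
ROWS = {
    "Gemini-3-Pro": (55.91, 66.04, 38.05, 53.33),
    "GPT-5-0807": (47.51, 65.44, 38.06, 50.34),
    "Gemini-3-Flash": (47.19, 61.96, 35.78, 48.31),
    "Qwen3-VL-235B-Thinking": (49.31, 59.96, 31.23, 46.83),
    "Qwen3.5-397B-A17B": (38.97, 58.55, 40.19, 45.90),
}


def macro_average(natural, chart, gui):
    """Unweighted mean of the three per-domain scores."""
    return (natural + chart + gui) / 3


# A row matches if the mean agrees with the reported value to two decimals.
mismatches = [
    model
    for model, (natural, chart, gui, reported) in ROWS.items()
    if abs(macro_average(natural, chart, gui) - reported) > 0.005
]
print(mismatches)  # []
```

Because the three domains differ in size, a sample-weighted average would generally give different numbers.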
If you find MM-CondChain helpful for your research, please consider citing our work:
@article{shen2025mmcondchain,
  title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
  author={Haozhan Shen and Shilin Yan and Hongwei Xue and Shuaiqi Lu and Xiaojun Tang and Guannan Zhang and Tiancheng Zhao and Jianwei Yin},
  year={2026},
  eprint={xxxx.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/xxxx.xxxxx},
}
This dataset is released under the Apache 2.0 License.

