MM-CondChain: A Programmatically Verified Benchmark for
Visually Grounded Deep Compositional Reasoning

Haozhan Shen¹,², Shilin Yan¹†, Hongwei Xue¹‡, Shuaiqi Lu¹, Xiaojun Tang¹,
Guannan Zhang¹, Tiancheng Zhao³‡, Jianwei Yin²

†Project Leader ‡Corresponding Author

¹Accio Team, Alibaba Group ²Zhejiang University ³ZJU-BJ

[🏠 Project Page] [📖 arXiv Paper] [💻 GitHub] [🏆 Leaderboard]

🔥 News

2026.03.13 🌟 We release MM-CondChain, the first benchmark for visually grounded deep compositional reasoning in MLLMs.

👀 MM-CondChain Overview

We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).

Key features of MM-CondChain:

Multi-layer compositional reasoning: Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence.
Programmatic verifiability: We propose an agentic synthesis pipeline built on a Verifiable Programmatic Intermediate Representation (VPIR), which ensures that each condition is mechanically verifiable.
Paired hard negatives: The Composer automatically produces paired True-path and False-path instances that differ by exactly one flipped predicate.
Three visual domains: Natural images, data charts, and GUI trajectories.
Deterministic evaluation: All instances are formulated as multiple-choice questions with deterministic answers, enabling reproducible evaluation without LLM-as-judge.
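Because every instance has a single deterministic answer label, scoring can be reduced to string comparison. The sketch below is a minimal illustration of that idea, not the benchmark's official scorer: the function name `exact_match_accuracy`, the `(id, path)` key layout, and the whitespace/case normalization are all assumptions, and the result is plain exact-match accuracy rather than the Path F1 reported in the results. The labels `F1` and `A1` are taken from the sample record shown under Dataset Structure.

```python
def exact_match_accuracy(predictions: dict, gold: dict) -> float:
    """Fraction of gold instances whose predicted label equals the gold label.

    Both dicts map a hypothetical (sample_id, path_name) key to an answer label.
    A missing prediction counts as wrong.
    """
    if not gold:
        raise ValueError("gold must contain at least one instance")
    correct = 0
    for key, answer in gold.items():
        predicted = predictions.get(key, "")
        # Assumed normalization: ignore surrounding whitespace and letter case.
        if predicted.strip().upper() == answer.strip().upper():
            correct += 1
    return correct / len(gold)


gold = {
    ("natural_001", "true_path"): "F1",
    ("natural_001", "false_path"): "A1",
}
# A model that answers "F1" on both paths misses the hard negative.
predictions = {
    ("natural_001", "true_path"): "F1",
    ("natural_001", "false_path"): "f1 ",
}
accuracy = exact_match_accuracy(predictions, gold)
```

In this toy case the model gets the True-path instance right and the False-path instance wrong, which is the failure mode the paired hard negatives are meant to expose.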

Experiments on a range of MLLMs show that even the strongest model, Gemini-3-Pro, attains an average Path F1 of only 53.33, confirming that deep compositional reasoning remains a fundamental challenge.

📊 Dataset Statistics

Domain	Images/Trajectories	Samples
Natural	398	796
Chart	200	400
GUI	377 (3,421 frames)	754
Total	975	1,950

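Each image/trajectory yields one conditional chain, which is compiled into a pair of instances: one True-path and one False-path. The sample count in each domain is therefore twice the number of images/trajectories.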
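The relationship between sources and samples can be checked directly against the table. The following is a small sanity check using only the numbers listed above; the variable and function names are illustrative.

```python
# Counts copied from the statistics table.
SOURCES = {"Natural": 398, "Chart": 200, "GUI": 377}
SAMPLES = {"Natural": 796, "Chart": 400, "GUI": 754}


def expected_samples(num_sources: int) -> int:
    # One chain per image/trajectory, compiled into a True-path
    # and a False-path instance.
    return 2 * num_sources


for domain, num_sources in SOURCES.items():
    assert expected_samples(num_sources) == SAMPLES[domain], domain

total_sources = sum(SOURCES.values())
total_samples = sum(SAMPLES.values())
```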

📁 Dataset Structure

MM-CondChain/
├── README.md
├── data/
│   ├── natural.jsonl
│   ├── chart.jsonl
│   └── gui.jsonl
└── images/
    ├── natural/
    │   └── *.jpg
    ├── chart/
    │   └── *.png
    └── gui/
        └── <trajectory_id>/
            └── <trajectory_id>_*.png

Each JSONL file contains samples with the following fields:

{
  "id": "natural_001",
  "domain": "natural",
  "image": "images/natural/sa_24810.jpg",
  "true_path": {
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "F1"
  },
  "false_path": {
    "diverge_node": "qa_1",
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "A1"
  }
}
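As a rough sketch of how such records might be consumed, the snippet below parses JSONL lines and flattens each record into its True-path and False-path instances. The helper name `iter_instances` and the flattened output keys are assumptions made for illustration; only the input field names come from the example above. The truncated strings (`...`) are kept exactly as shown.

```python
import json
from typing import Iterable, Iterator


def iter_instances(jsonl_lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat dict per path (true_path, false_path) of each record."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for path_name in ("true_path", "false_path"):
            path = record[path_name]
            yield {
                "id": record["id"],
                "domain": record["domain"],
                "image": record["image"],
                "path": path_name,
                "instruction": path["full_instruction"],
                "answer": path["correct_answer"],
                # Only present on false_path in the example record.
                "diverge_node": path.get("diverge_node"),
            }


# The example record from above, serialized as a single JSONL line.
example_line = json.dumps({
    "id": "natural_001",
    "domain": "natural",
    "image": "images/natural/sa_24810.jpg",
    "true_path": {
        "full_instruction": "If the fisherman wearing a baseball cap is ...",
        "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
        "correct_answer": "F1",
    },
    "false_path": {
        "diverge_node": "qa_1",
        "full_instruction": "If the fisherman wearing a baseball cap is ...",
        "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
        "correct_answer": "A1",
    },
})
instances = list(iter_instances([example_line]))
```

For a real file, an open file handle can be passed in place of the list, for example `iter_instances(open("data/natural.jsonl", encoding="utf-8"))`.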

Note on image paths:

For Natural and Chart domains, image is a single image path (e.g., images/natural/sa_24810.jpg).
For GUI domain, image is a trajectory folder path (e.g., images/gui/GENERAL-9532638838594693992). To load GUI images, list all PNG files in the folder sorted by filename.
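The two cases can be handled by one small helper. This is a sketch under the assumption that the dataset root contains the `images/` folder as laid out above; the function name `resolve_image_paths` is hypothetical. It returns a one-element list for Natural and Chart samples and the sorted list of PNG frames for GUI samples.

```python
from pathlib import Path
from typing import List


def resolve_image_paths(root: str, domain: str, image: str) -> List[Path]:
    """Return the image file(s) for a sample, relative to the dataset root."""
    base = Path(root) / image
    if domain == "gui":
        # GUI samples point at a trajectory folder: collect every PNG,
        # sorted by filename as described in the note above.
        return sorted(base.glob("*.png"), key=lambda p: p.name)
    # Natural and Chart samples point at a single image file.
    return [base]
```

Note that sorting by filename is lexicographic. If frame indices in the filenames are not zero-padded, this order may differ from numeric order, so it is worth checking a trajectory folder before relying on it.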

📈 Experimental Results

Model	Natural F1	Chart F1	GUI F1	Avg F1
Gemini-3-Pro	55.91	66.04	38.05	53.33
GPT-5-0807	47.51	65.44	38.06	50.34
Gemini-3-Flash	47.19	61.96	35.78	48.31
Qwen3-VL-235B-Thinking	49.31	59.96	31.23	46.83
Qwen3.5-397B-A17B	38.97	58.55	40.19	45.90
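The Avg F1 column appears to be the unweighted mean of the three domain scores. This is an observation from the numbers in the table, not something stated elsewhere in this README, and the check below simply verifies it row by row.

```python
# (Natural F1, Chart F1, GUI F1, Avg F1) copied from the results table.
RESULTS = {
    "Gemini-3-Pro": (55.91, 66.04, 38.05, 53.33),
    "GPT-5-0807": (47.51, 65.44, 38.06, 50.34),
    "Gemini-3-Flash": (47.19, 61.96, 35.78, 48.31),
    "Qwen3-VL-235B-Thinking": (49.31, 59.96, 31.23, 46.83),
    "Qwen3.5-397B-A17B": (38.97, 58.55, 40.19, 45.90),
}


def unweighted_mean(natural: float, chart: float, gui: float) -> float:
    return (natural + chart + gui) / 3


# A row is consistent if the reported average matches the mean
# to within rounding at two decimal places.
consistent = {
    model: abs(unweighted_mean(n, c, g) - avg) <= 0.005
    for model, (n, c, g, avg) in RESULTS.items()
}
best_model = max(RESULTS, key=lambda m: RESULTS[m][3])
```

Because the mean is unweighted, the GUI domain (754 samples) and the Chart domain (400 samples) contribute equally to the average despite their different sizes.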

📖 Citation

If you find MM-CondChain helpful for your research, please consider citing our work:

@article{shen2025mmcondchain,
    title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
    author={Haozhan Shen and Shilin Yan and Hongwei Xue and Shuaiqi Lu and Xiaojun Tang and Guannan Zhang and Tiancheng Zhao and Jianwei Yin},
    year={2026},
    eprint={xxxx.xxxxx},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/xxxx.xxxxx}, 
}

📜 License

This dataset is released under the Apache 2.0 License.
