
MM-CondChain: A Programmatically Verified Benchmark for
Visually Grounded Deep Compositional Reasoning

Haozhan Shen1,2, Shilin Yan1†, Hongwei Xue1‡, Shuaiqi Lu1, Xiaojun Tang1,
Guannan Zhang1, Tiancheng Zhao3‡, Jianwei Yin2

†Project Leader ‡Corresponding Author

1Accio Team, Alibaba Group 2Zhejiang University 3ZJU-BJ


🔥 News

  • 2026.03.13 🌟 We release MM-CondChain, the first benchmark for visually grounded deep compositional reasoning in MLLMs.

👀 MM-CondChain Overview

We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).

Key features of MM-CondChain:

  • Multi-layer compositional reasoning: Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence.
  • Programmatic verifiability: We propose a VPIR-based (Verifiable Programmatic Intermediate Representation) agentic synthesis pipeline that ensures each condition is mechanically verifiable.
  • Paired hard negatives: The Composer automatically produces paired True-path and False-path instances that differ by exactly one flipped predicate.
  • Three visual domains: Natural images, data charts, and GUI trajectories.
  • Deterministic evaluation: All instances are formulated as multiple-choice questions with deterministic answers, enabling reproducible evaluation without LLM-as-judge.

Experiments on a range of MLLMs show that even the strongest model attains an average Path F1 of only 53.33, confirming that deep compositional reasoning remains a fundamental challenge.
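Because every instance is a multiple-choice question with a single deterministic answer, scoring an individual instance reduces to exact label matching. A minimal sketch (field names follow the JSONL schema below; `accuracy` here is plain per-instance accuracy, not the paper's Path F1 metric):

```python
def score_instance(predicted: str, correct_answer: str) -> bool:
    """True iff the predicted option label (e.g. "F1") matches exactly."""
    return predicted.strip() == correct_answer.strip()


def accuracy(predictions, samples, path="true_path"):
    """Fraction of instances answered correctly on the given path.

    `predictions` is a list of option labels, aligned with `samples`,
    each sample being one parsed JSONL record.
    """
    correct = sum(
        score_instance(pred, sample[path]["correct_answer"])
        for pred, sample in zip(predictions, samples)
    )
    return correct / len(samples)
```

Since answers are fixed option labels, no LLM-as-judge is involved and runs are exactly reproducible.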

📊 Dataset Statistics

| Domain  | Images/Trajectories | Samples |
|---------|---------------------|---------|
| Natural | 398                 | 796     |
| Chart   | 200                 | 400     |
| GUI     | 377 (3,421 frames)  | 754     |
| Total   | 975                 | 1,950   |

Each image/trajectory yields one conditional chain, compiled into a paired True-path and False-path instance.

πŸ“ Dataset Structure

MM-CondChain/
β”œβ”€β”€ README.md
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ natural.jsonl
β”‚   β”œβ”€β”€ chart.jsonl
β”‚   └── gui.jsonl
└── images/
    β”œβ”€β”€ natural/
    β”‚   └── *.jpg
    β”œβ”€β”€ chart/
    β”‚   └── *.png
    └── gui/
        └── <trajectory_id>/
            └── <trajectory_id>_*.png

Each JSONL file contains samples with the following fields:

{
  "id": "natural_001",
  "domain": "natural",
  "image": "images/natural/sa_24810.jpg",
  "true_path": {
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "F1"
  },
  "false_path": {
    "diverge_node": "qa_1",
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "A1"
  }
}

Note on image paths:

  • For Natural and Chart domains, image is a single image path (e.g., images/natural/sa_24810.jpg).
  • For GUI domain, image is a trajectory folder path (e.g., images/gui/GENERAL-9532638838594693992). To load GUI images, list all PNG files in the folder sorted by filename.
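The loading convention above can be sketched in Python. Only the `domain` and `image` fields come from the schema; the `root` argument and function names are illustrative:

```python
import json
from pathlib import Path


def load_samples(jsonl_path):
    """Load one domain's samples from a JSONL file (one JSON object per line)."""
    with open(jsonl_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def resolve_images(sample, root="."):
    """Return the list of image paths for a sample.

    Natural/Chart: `image` is a single file, so a one-element list.
    GUI: `image` is a trajectory folder; list its PNG frames sorted
    by filename to recover the frame order.
    """
    path = Path(root) / sample["image"]
    if sample["domain"] == "gui":
        return sorted(path.glob("*.png"))
    return [path]
```

For a GUI sample this yields the full ordered frame sequence, while Natural and Chart samples resolve to their single image.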

📈 Experimental Results

| Model                  | Natural F1 | Chart F1 | GUI F1 | Avg F1 |
|------------------------|------------|----------|--------|--------|
| Gemini-3-Pro           | 55.91      | 66.04    | 38.05  | 53.33  |
| GPT-5-0807             | 47.51      | 65.44    | 38.06  | 50.34  |
| Gemini-3-Flash         | 47.19      | 61.96    | 35.78  | 48.31  |
| Qwen3-VL-235B-Thinking | 49.31      | 59.96    | 31.23  | 46.83  |
| Qwen3.5-397B-A17B      | 38.97      | 58.55    | 40.19  | 45.90  |

📖 Citation

If you find MM-CondChain helpful for your research, please consider citing our work:

@article{shen2025mmcondchain,
    title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
    author={Haozhan Shen and Shilin Yan and Hongwei Xue and Shuaiqi Lu and Xiaojun Tang and Guannan Zhang and Tiancheng Zhao and Jianwei Yin},
    year={2026},
    eprint={xxxx.xxxxx},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/xxxx.xxxxx}, 
}

📜 License

This dataset is released under the Apache 2.0 License.
