This repository contains the evaluation code for COHERENCE.
In recent years, Multimodal Large Language Models (MLLMs) have made strong progress on many multimodal benchmarks. However, most existing benchmarks mainly evaluate single-image understanding, multi-image comparison, or general multimodal question answering. In real-world settings such as document reading, information is often presented as interleaved image-text context. This requires models not only to understand each individual image, but also to perform fine-grained image-text alignment and identify accurate correspondences between textual and visual content across the context. In addition, models must integrate evidence across paragraphs and modalities for reasoning. Although this capability is important for practical applications, systematic benchmarks for quantifying fine-grained understanding of interleaved image-text contexts remain limited.

To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers four representative domains and contains 6,161 high-quality questions. We also provide a six-type error analysis protocol for fine-grained attribution of failures in interleaved image-text understanding.
Repository layout:

```
.
├── code/
│   ├── main_experiment/
│   │   ├── evaluate_arrangement_vllm.py
│   │   ├── evaluate_arrangement_api.py
│   │   ├── run_main_vllm.sh
│   │   ├── run_main_api.sh
│   │   ├── error_analysis.py
│   │   ├── metrics.py
│   │   ├── stats_accuracy_by_domain.py
│   │   └── stats_accuracy_by_difficulty.py
│   └── ablation_experiment/
│       ├── evaluate_arrangement_ablation_vllm.py
│       ├── run_ablation_text_only.sh
│       └── run_ablation_image_only.sh
└── datasets/                    # created after download
```
Install the dependencies:

```
pip install -U vllm transformers pillow tqdm openai
# Optional for some Qwen-VL setups
pip install -U qwen-vl-utils
```

Use one of the two routes below:

- Route A: API evaluation (`openai` SDK)
- Route B: local vLLM evaluation (`vllm`)
For the API route:

```
python -m venv .venv-api
source .venv-api/bin/activate
pip install -U openai transformers pillow tqdm
```

For the vLLM route:

```
python -m venv .venv-vllm
source .venv-vllm/bin/activate
pip install -U vllm transformers pillow tqdm
# Optional for some Qwen-VL setups
pip install -U qwen-vl-utils
```

Download the dataset:

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download BingliW/COHERENCE \
  --repo-type dataset \
  --local-dir datasets
```

The current defaults are already aligned with the above download command.
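If you want to confirm the download landed where the scripts expect, a quick check like the one below can help. This is only a sketch: it assumes the dataset ships jsonl files somewhere under `datasets/`, which is not spelled out above.

```python
from pathlib import Path

# Sketch of a post-download sanity check. The internal layout of datasets/
# is an assumption here; adjust the glob if your layout differs.
root = Path("datasets")
jsonls = sorted(root.rglob("*.jsonl"))
print(f"found {len(jsonls)} jsonl files under {root}/")
for path in jsonls[:5]:
    print("  ", path)
```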
Before running, configure:

- API route: set `API_BASE`, `API_KEY`, `API_MODEL` (see the sketch after this list).
- vLLM route / ablation: set `MODEL_PATH` in the corresponding `.sh` script.
- Optional, only if you use a custom data layout: adjust `JSONL_DIR` and/or `IMAGES_ROOT`.
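For orientation, here is a minimal sketch of how an API-route script might consume these variables. It is an assumed pattern, not the repository's verbatim code; `evaluate_arrangement_api.py` is authoritative.

```python
import os
from openai import OpenAI

# Assumed pattern: the API route reads endpoint, key, and model name from
# the environment variables documented above.
client = OpenAI(
    base_url=os.environ["API_BASE"],  # e.g. https://your-api-base/v1
    api_key=os.environ["API_KEY"],
)
model = os.environ["API_MODEL"]

# Minimal smoke test that the endpoint and model name are usable.
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```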
Run the API evaluation:

```
export API_BASE="https://your-api-base/v1"
export API_KEY="your_api_key"
export API_MODEL="your_model_name"
bash code/main_experiment/run_main_api.sh code/main_experiment/results
```

Run the vLLM evaluation:

```
# Optional: export TP_SIZE=1 (or another value)
bash code/main_experiment/run_main_vllm.sh code/main_experiment/results
```

After running, predictions are written to `code/main_experiment/results/...`, and each jsonl has a matching `.summary.json`.
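Because each predictions file gets a sibling `.summary.json`, collecting results programmatically is straightforward. The summary's internal schema is not documented here, so this sketch just prints whatever it finds:

```python
import json
from pathlib import Path

results_dir = Path("code/main_experiment/results")

# Every predictions jsonl has a matching <output>.summary.json next to it.
# The summary schema is not specified in this README, so dump it verbatim.
for summary_path in sorted(results_dir.rglob("*.summary.json")):
    with open(summary_path) as f:
        summary = json.load(f)
    print(summary_path)
    print(json.dumps(summary, indent=2)[:500])  # truncate long summaries
```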
For the ablation experiments:

```
bash code/ablation_experiment/run_ablation_text_only.sh code/results/ablation_text_only
bash code/ablation_experiment/run_ablation_image_only.sh code/results/ablation_image_only
```

For each output jsonl, the scripts also write `<output>.summary.json`.
Typical record fields:

- `dataset_type`, `data_id`, `url_id`, `title`
- `answer`, `prediction`
- `exact_correct`
- `partial_score` (with metric `kendall_tau_0_1`; see the sketch after this list)
- `raw_input`, `model_input`, `raw_output`
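The name `kendall_tau_0_1` suggests Kendall's tau linearly rescaled from [-1, 1] to [0, 1]. That reading is an assumption (the repository's `metrics.py` is authoritative); under it, a partial score for a predicted arrangement could be computed like this:

```python
def kendall_tau_0_1(answer, prediction):
    """Assumed metric: Kendall's tau between two orderings of the same
    items, linearly rescaled from [-1, 1] to [0, 1]."""
    gold_rank = {item: i for i, item in enumerate(answer)}
    ranks = [gold_rank[item] for item in prediction]
    n = len(ranks)
    concordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if ranks[i] < ranks[j]
    )
    total = n * (n - 1) // 2
    tau = (2 * concordant - total) / total  # (concordant - discordant) / total
    return (tau + 1) / 2

# A perfect arrangement scores 1.0; a fully reversed one scores 0.0.
print(kendall_tau_0_1(["A", "B", "C", "D"], ["A", "B", "C", "D"]))  # 1.0
print(kendall_tau_0_1(["A", "B", "C", "D"], ["D", "C", "B", "A"]))  # 0.0
```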
For error analysis:

```
python code/main_experiment/error_analysis.py --help
```

Notes:

- Evaluation supports resume mode if output files already exist (a sketch of this pattern follows the list).
- Current vLLM scripts expose `tensor_parallel_size` only.
- `stats_accuracy_by_domain.py` and `stats_accuracy_by_difficulty.py` import `accuracy_table_common`, which is not included in this snapshot.
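Resume mode presumably works by skipping records whose `data_id` is already present in the output jsonl. The helper below illustrates that assumed pattern; the path is hypothetical:

```python
import json
from pathlib import Path

def load_done_ids(output_path):
    """Collect data_ids already written to an output jsonl (assumed resume
    pattern: previously finished records are skipped on the next run)."""
    done = set()
    path = Path(output_path)
    if path.exists():
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["data_id"])
    return done

# Hypothetical output path, for illustration only.
done = load_done_ids("code/main_experiment/results/example.jsonl")
print(f"{len(done)} records already evaluated; they would be skipped")
```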
If you use COHERENCE, please cite:
```
@misc{wang2026coherencebenchmarkingfinegrainedimagetext,
  title={COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts},
  author={Bingli Wang and Huanze Tang and Haijun Lv and Zhishan Lin and Lixin Gu and Lei Feng and Qipeng Guo and Kai Chen},
  year={2026},
  eprint={2604.27389},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.27389},
}
```
This repository is licensed under the ODC Attribution License (ODC-By) v1.0. See LICENSE.