COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

This repository contains the evaluation code for COHERENCE.

Hugging Face Dataset Paper GitHub Evaluation Code

Introduction

In recent years, Multimodal Large Language Models (MLLMs) have made strong progress on many multimodal benchmarks. However, most existing benchmarks mainly evaluate single-image understanding, multi-image comparison, or general multimodal question answering. In real-world settings such as document reading, information is often presented as interleaved image-text context. This requires models not only to understand each individual image, but also to perform fine-grained image-text alignment and identify accurate correspondences between textual and visual content across the context.

In addition, models must integrate evidence across paragraphs and modalities to reason over the material. Although this capability is important for practical applications, systematic benchmarks for quantifying fine-grained understanding in interleaved image-text contexts are still limited.

To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers four representative domains and contains 6,161 high-quality questions. We also provide a six-type error-analysis protocol for fine-grained attribution of failures in interleaved image-text understanding.

Repository Structure

.
├── code/
│   ├── main_experiment/
│   │   ├── evaluate_arrangement_vllm.py
│   │   ├── evaluate_arrangement_api.py
│   │   ├── run_main_vllm.sh
│   │   ├── run_main_api.sh
│   │   ├── error_analysis.py
│   │   ├── metrics.py
│   │   ├── stats_accuracy_by_domain.py
│   │   └── stats_accuracy_by_difficulty.py
│   └── ablation_experiment/
│       ├── evaluate_arrangement_ablation_vllm.py
│       ├── run_ablation_text_only.sh
│       └── run_ablation_image_only.sh
└── datasets/  # created after download

Installation

pip install -U vllm transformers pillow tqdm openai
# Optional for some Qwen-VL setups
pip install -U qwen-vl-utils

Step-by-Step Run Guide

Use one of the two routes below:

  • Route A: API evaluation (openai SDK)
  • Route B: local vLLM evaluation (vllm)

Step 1. Create environment

For API route:

python -m venv .venv-api
source .venv-api/bin/activate
pip install -U openai transformers pillow tqdm

For vLLM route:

python -m venv .venv-vllm
source .venv-vllm/bin/activate
pip install -U vllm transformers pillow tqdm
# Optional for some Qwen-VL setups
pip install -U qwen-vl-utils

Step 2. Download dataset

pip install -U "huggingface_hub[cli]"
huggingface-cli download BingliW/COHERENCE \
  --repo-type dataset \
  --local-dir datasets

The default paths in the evaluation scripts already match the layout produced by the download command above, so no further path changes are needed for the standard setup.

Step 3. Minimal config

  • API route: set API_BASE, API_KEY, API_MODEL.
  • vLLM route / ablation: set MODEL_PATH in the corresponding .sh script.
  • Optional only if you use a custom data layout: adjust JSONL_DIR and/or IMAGES_ROOT.
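
If you want to double-check these settings before launching a run, a small pre-flight script along the lines below can help. It is only an illustration and not part of the repository: it assumes the API variables are exported as environment variables (as in Step 4A) and that the dataset lives under datasets/ as produced in Step 2.

import os
from pathlib import Path

# Illustrative pre-flight check (not part of the repository).
# API-route settings are read from the environment, as in Step 4A.
missing = [name for name in ("API_BASE", "API_KEY", "API_MODEL") if not os.environ.get(name)]
if missing:
    print(f"API route not configured yet (missing: {', '.join(missing)})")
else:
    print("API route configured")

# The dataset is expected under datasets/ after Step 2; the exact file layout
# inside it is whatever the Hugging Face download produced.
data_dir = Path("datasets")
jsonl_files = sorted(data_dir.rglob("*.jsonl")) if data_dir.is_dir() else []
print(f"Found {len(jsonl_files)} JSONL file(s) under {data_dir}/")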

Step 4A. Run API evaluation

export API_BASE="https://your-api-base/v1"
export API_KEY="your_api_key"
export API_MODEL="your_model_name"

bash code/main_experiment/run_main_api.sh code/main_experiment/results
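
For reference, a single request on the API route looks roughly like the sketch below. This is a simplified illustration of what evaluate_arrangement_api.py presumably does, not a copy of it; the prompt text and image path are placeholders.

import base64
import os
from openai import OpenAI

# Illustrative only: one chat request using the same environment variables
# as run_main_api.sh. The prompt and image path below are placeholders.
client = OpenAI(base_url=os.environ["API_BASE"], api_key=os.environ["API_KEY"])

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model=os.environ["API_MODEL"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which paragraph does this image belong to?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)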

Step 4B. Run vLLM evaluation

# Optional: export TP_SIZE=1 (or another value)
bash code/main_experiment/run_main_vllm.sh code/main_experiment/results
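
Before launching the full run, you can sanity-check that vLLM loads your model with the intended tensor-parallel size. The snippet below is a minimal, text-only smoke test and not part of the repository; here MODEL_PATH is read from the environment for convenience, whereas in the actual run it is set inside the .sh script (Step 3).

import os
from vllm import LLM, SamplingParams

# Minimal smoke test (illustrative, text-only): confirms the model loads with
# the chosen tensor-parallel size before running the full multimodal evaluation.
llm = LLM(
    model=os.environ.get("MODEL_PATH", "/path/to/your/model"),
    tensor_parallel_size=int(os.environ.get("TP_SIZE", "1")),
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)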

After running, predictions are written to code/main_experiment/results/... and each prediction JSONL file has a matching .summary.json.

Ablation

Text-only

bash code/ablation_experiment/run_ablation_text_only.sh code/results/ablation_text_only

Image-only

bash code/ablation_experiment/run_ablation_image_only.sh code/results/ablation_image_only

Output Format

For each output JSONL file, the scripts also write:

  • <output>.summary.json

Typical record fields:

  • dataset_type, data_id, url_id, title
  • answer, prediction
  • exact_correct
  • partial_score with metric kendall_tau_0_1
  • raw_input, model_input, raw_output
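
The per-record fields are enough to recompute the headline numbers yourself. The sketch below aggregates exact-match accuracy and the mean partial score from one predictions JSONL; the file path is a placeholder, and it assumes exact_correct is a boolean and partial_score is (or contains) a value in [0, 1], i.e., Kendall's tau rescaled from [-1, 1].

import json
from pathlib import Path

# Illustrative aggregation over one predictions file; adjust the path to your run.
# Field handling is hedged: partial_score may be a number or a small dict,
# depending on how the metric is serialized.
path = Path("code/main_experiment/results/example_predictions.jsonl")

records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
if not records:
    raise SystemExit(f"No records found in {path}")

exact = sum(1 for r in records if r.get("exact_correct")) / len(records)

def partial(record):
    value = record.get("partial_score")
    if isinstance(value, dict):
        value = value.get("kendall_tau_0_1", 0.0)
    return float(value or 0.0)

mean_partial = sum(partial(r) for r in records) / len(records)
print(f"exact accuracy: {exact:.3f}  mean partial score: {mean_partial:.3f}")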

Error Analysis

python code/main_experiment/error_analysis.py --help

Notes

  • Evaluation supports resume mode if output files already exist.
  • Current vLLM scripts expose tensor_parallel_size only.
  • stats_accuracy_by_domain.py and stats_accuracy_by_difficulty.py import accuracy_table_common, which is not included in this snapshot.

Citation

If you use COHERENCE, please cite:

@misc{wang2026coherencebenchmarkingfinegrainedimagetext,
      title={COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts}, 
      author={Bingli Wang and Huanze Tang and Haijun Lv and Zhishan Lin and Lixin Gu and Lei Feng and Qipeng Guo and Kai Chen},
      year={2026},
      eprint={2604.27389},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.27389}, 
}

License

This repository is licensed under the ODC Attribution License (ODC-By) v1.0. See LICENSE.
