
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

📖 arXiv Paper | 🤗 Paper | 🤗 IV-Bench Dataset

IV-Bench is a benchmark for evaluating the image-grounded video perception and reasoning capabilities of multimodal large language models (MLLMs). It pairs 967 videos with 2,585 image-text queries whose images are collected from external sources, each requiring both video and image context for an accurate answer.

👀 Introduction to IV-Bench

IV-Bench is the first comprehensive benchmark for evaluating image-grounded video perception and reasoning. It comprises 967 videos paired with 2,585 meticulously annotated image-text queries, where the images, collected from external sources rather than extracted from the videos themselves, provide the essential context required to answer the queries accurately. The dataset spans 5 major categories and 13 distinct tasks (7 perception and 6 reasoning), ensuring substantial diversity across scenarios and task types.

Features

  • Image–Text Queries: Multiple queries per video, each pairing an externally sourced image with a question to provide essential contextual cues.

  • Five Diverse Categories: Videos (≥ 5 min) span Knowledge, Film & TV, Sports, Artistic Performances, and Life Records for broad coverage.

  • Thirteen Evaluation Tasks: A mix of perception and reasoning tasks designed to rigorously test multimodal understanding.

🎞️ Representative examples from IV-Bench

Each IV-Bench sample consists of a video paired with an image-text query. The correct answer is marked in green, with relevant video frames also highlighted in green.

🆚 Comparison with other video benchmarks

Unlike other video benchmarks, whose queries are text-only or answerable without an image, IV-Bench is the first manually annotated benchmark explicitly designed to evaluate image-grounded video understanding. Two rigorous rounds of quality checks ensure that the image is essential for correctly answering every query.

🛠️ How to use IV-Bench

1. Installation

We provide the complete environment configuration used to evaluate the models in the paper. For detailed installation instructions and dependency settings, please refer to installation.md.
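As a quick sketch, setup could look like the following; the environment name image_video matches the evaluation step below, while the Python version and the requirements file name are assumptions, so treat installation.md as authoritative:

# Create and activate the evaluation environment used in the steps below
conda create -n image_video python=3.10 -y
conda activate image_video
# Install dependencies; the requirements file name here is an assumption, see installation.md
pip install -r requirements.txt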

2. Download dataset

2.1 Download test data from Hugging Face

Download all data except the videos from the 🤗 IV-Bench Dataset on Hugging Face (linked above).
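For example, with the huggingface_hub CLI installed, a command along these lines should fetch the annotation files; the dataset repository ID below is an assumption based on the GitHub organization, so verify it against the dataset link above:

# Repository ID is assumed; check the 🤗 IV-Bench Dataset link above for the exact ID
huggingface-cli download multimodal-art-projection/IV-Bench \
    --repo-type dataset --local-dir ./IV-Bench-data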

2.2 Video download

Download the videos using the provided script download_video.sh.
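For example, run it from the repository root; any arguments the script may take, such as an output directory, are described in the script itself:

bash download_video.sh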

3. Model Evaluation

# Activate the environment set up during installation
conda activate image_video

# Run IV-Bench inference; --video_dir/--image_dir point to the downloaded videos and
# query images, --has_image controls whether the query image is provided, and
# --nframes sets the number of video frames sampled per query.
python inference_ivbench.py \
    --model_name="$MODEL_NAME" \
    --question_file="$QUESTION_FILE" \
    --model_path="$MODEL_PATH" \
    --video_dir="$VIDEO_DIR" \
    --image_dir="$IMAGE_DIR" \
    --has_image="$HAS_IMAGE" \
    --nframes="$NFRAMES" \
    --output_file="$OUTPUT_FILE"

An example for evaluating InternVL-2.5 is provided in internvl2_5.sh.
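For illustration only, a hypothetical configuration of the variables above might look like this; every name and path is a placeholder rather than the setting used in the paper, so see internvl2_5.sh for the actual values:

MODEL_NAME="InternVL2_5"                        # placeholder model identifier
MODEL_PATH="./checkpoints/InternVL2_5"          # placeholder checkpoint path
QUESTION_FILE="./data/ivbench_questions.json"   # placeholder question file
VIDEO_DIR="./data/videos"                       # directory with the downloaded videos
IMAGE_DIR="./data/images"                       # directory with the query images
HAS_IMAGE="true"                                # include the query image
NFRAMES=32                                      # placeholder number of sampled frames
OUTPUT_FILE="./results/internvl2_5.jsonl"       # placeholder output path

These variables then feed the python inference_ivbench.py command shown above.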

📊 Results

Main Results

Ablation Study

Reference

@misc{ma2025ivbenchbenchmarkimagegroundedvideo,
      title={IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs}, 
      author={David Ma and Yuanxing Zhang and Jincheng Ren and Jarvis Guo and Yifan Yao and Zhenlin Wei and Zhenzhu Yang and Zhongyuan Peng and Boyu Feng and Jun Ma and Xiao Gu and Zhoufutu Wen and King Zhu and Yancheng He and Meng Cao and Shiwen Ni and Jiaheng Liu and Wenhao Huang and Ge Zhang and Xiaojie Jin},
      year={2025},
      eprint={2504.15415},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.15415}, 
}
