A benchmark designed to expose fundamental temporal-integration failures in modern multimodal models, revealing their limitations in spatiotemporally consistent visual reasoning.
First, clone the repository:

```bash
git clone https://github.com/ContinuousPerceptionResearch/cp-bench.git
```

We use uv to manage Python dependencies. After installing uv, run the following commands to set up the environment:
```bash
uv sync
source .venv/bin/activate
MAX_JOBS=4 uv pip install flash-attn --no-build-isolation
```

To generate CP-Bench data, run the following command from the `data_generation` directory:
```bash
uv run bash render_videos.sh
```

Rendering a single 10-second video may take several minutes, depending on your GPU. To speed up the process, you can split the configuration files in the `configs` directory and run them in parallel.
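One way to fan out the work is with `xargs -P`, sketched below. The placeholder `echo` stands in for the real render call, and the idea that each job takes a single config path is an assumption, not something the repo guarantees:

```bash
# Sketch: run one job per config file, up to 4 concurrently.
# Replace 'echo' with the actual render invocation for your setup.
mkdir -p configs
for i in 1 2 3 4; do : > "configs/scene_$i.yaml"; done   # placeholder configs

ls configs/*.yaml | xargs -P4 -I{} echo "rendering {}"
```

With `-I{}`, each config path becomes its own invocation, and `-P4` keeps four of them running at once.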
The data configurations are easy to modify, allowing you to generate variations with different object counts, colors, shapes, textures, and more.
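For illustration, a scene configuration might look something like the following. The field names here are hypothetical (check the actual files in `configs` for the real schema); the shape and material values follow CLEVR's conventions:

```yaml
num_objects: 5
shapes: [cube, sphere, cylinder]
colors: [red, blue, green]
materials: [rubber, metal]
video_length_seconds: 10
```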
Download the pre-rendered dataset from Hugging Face by running:

```bash
bash prepare_data.sh
```

Launch fine-tuning with:
```bash
uv run bash finetune.sh
```

The fine-tuning pipeline is built on the SFT Trainer from TRL. You can customize the training process by adjusting the relevant arguments as needed.
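If `finetune.sh` forwards its arguments to the trainer, standard Hugging Face `TrainingArguments` flags should apply. A hypothetical sketch (whether the script accepts these overrides is an assumption; inspect `finetune.sh` for the flags it actually passes through):

```bash
uv run bash finetune.sh \
    --learning_rate 1e-5 \
    --per_device_train_batch_size 2 \
    --num_train_epochs 3
```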
This repo is built upon TRL and CLEVR. We sincerely thank the developers of these projects.
If you find CP-Bench useful, please consider citing our work:
```bibtex
@article{cpbench2025,
  title={Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models},
  author={Zeyu Wang and Zhenzhen Weng and Serena Yeung-Levy},
  journal={arXiv preprint arXiv:2408.07867},
  year={2025}
}
```