Harold H. Chen1,2*, Disen Lan3*, Wen-Jie Shu2*, Qingyang Liu4, Zihan Wang1, Sirui Chen1, Wenkai Cheng1, Kanghao Chen1,2, Hongfei Zhang1, Zixin Zhang1,2, Rongjin Guo5,
Yu Cheng6†, Ying-Cong Chen1,2†
*Equal Contribution; †Corresponding Author
1HKUST(GZ), 2HKUST, 3FDU, 4SJTU, 5CityUHK, 6CUHK
- [11/2025] 🔥 We release TiViBench, a hierarchical benchmark tailored to visual reasoning in I2V generation models!
- Release Paper.
- Release data and eval code.
- Release VideoTPO inference code.
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
Pass@1 performance overview on TiViBench for 3 commercial models and 4 open-source models.
- Clone this repository and navigate to the source folder

cd TiViBench

- Build Environment
echo "Creating conda environment"
conda create -n TiViBench python=3.10
conda activate TiViBench
echo "Installing dependencies"
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install opencv-python pytesseract scikit-image pillow
pip install dds-cloudapi-sdk==0.5.3 # DINO-X for Eval

The inference prompts can be found in the ~/eval_cache/**_prompt.json files:
├─SS_prompt.json # Structural Reasoning & Search
├─SV_prompt.json # Spatial & Visual Pattern Reasoning
├─SL_prompt.json # Symbolic & Logical Reasoning
├─AT_prompt.json # Action Planning & Task Execution
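A minimal sketch of loading the prompts for one dimension is shown below. It assumes each `*_prompt.json` maps a sample ID to its text prompt; adjust the parsing if your copy of the files uses a different schema.

```python
import json
from pathlib import Path

def load_prompts(cache_dir: str = "eval_cache", dimension: str = "SS") -> dict:
    """Load the inference prompts for one dimension (AT, SL, SS, or SV)."""
    prompt_file = Path(cache_dir) / f"{dimension}_prompt.json"
    with open(prompt_file, "r", encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    prompts = load_prompts(dimension="SS")
    # Assumed schema: sample ID -> prompt string; adapt if your copy differs.
    for sample_id, prompt in list(prompts.items())[:3]:
        print(sample_id, "->", prompt)
```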
You can access our image suite on [Google Drive].
Automatic Download
pip install gdown
python scripts/image_suit_download.py
Data Format
The image suite is organized under ~/images/ in the following format:
├─AT # Action Planning & Task Execution
├─SL # Symbolic & Logical Reasoning
├─SS # Structural Reasoning & Search
├─SV # Spatial & Visual Pattern Reasoning
├──easy_graph_001.png
├──easy_graph_002.png
......
The default size of all images is 1280x720. We provide adaptive cropping of the images to fit your video model's input resolution.
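If you prefer to handle the cropping yourself, below is a minimal sketch of one straightforward approach (center-crop to the target aspect ratio, then resize). The target resolution and paths are illustrative assumptions, not the repository's official cropping code.

```python
from PIL import Image

def adapt_image(src_path: str, dst_path: str, target_w: int = 1024, target_h: int = 576) -> None:
    """Center-crop to the target aspect ratio, then resize. Illustrative only."""
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    target_ratio = target_w / target_h
    if w / h > target_ratio:
        # Source is wider than the target: crop the width.
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        box = (left, 0, left + new_w, h)
    else:
        # Source is taller than the target: crop the height.
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        box = (0, top, w, top + new_h)
    img.crop(box).resize((target_w, target_h), Image.LANCZOS).save(dst_path)

# Example (paths are illustrative):
# adapt_image("images/SV/easy_graph_001.png", "inputs/easy_graph_001.png")
```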
For each image-prompt pair, sample 5 videos with 5 fixed random seeds to ensure the evaluation results are reproducible. To facilitate subsequent evaluation, we strongly recommend that you organize your generation results in the following format:
├─AT_easy_game_001
├──AT_easy_game_001-0.mp4
├──AT_easy_game_001-1.mp4
├──AT_easy_game_001-2.mp4
├──AT_easy_game_001-3.mp4
├──AT_easy_game_001-4.mp4
├─AT_easy_game_002
......
├─SV_medium_graph_050
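A minimal sketch of a sampling loop that writes results in the layout above is given here. `generate_video` is a hypothetical placeholder for your I2V model's inference call, the image-path naming is assumed from the image suite layout, and the five seeds are only an example (any fixed seeds work as long as they are kept consistent across all samples).

```python
import os

SEEDS = [0, 1, 2, 3, 4]  # example fixed seeds; keep them identical across all samples

def generate_video(image: str, prompt: str, seed: int, output: str) -> None:
    # Hypothetical placeholder: replace with your I2V model's inference call.
    raise NotImplementedError("Replace with your I2V model's inference call.")

def sample_all(prompts: dict, images_dir: str = "images", out_dir: str = "results") -> None:
    """prompts maps sample IDs (e.g., 'AT_easy_game_001') to their text prompts."""
    for sample_id, prompt in prompts.items():
        dimension = sample_id.split("_")[0]                    # AT / SL / SS / SV
        image_name = sample_id[len(dimension) + 1:] + ".png"   # assumed naming, matching the image suite above
        image_path = os.path.join(images_dir, dimension, image_name)
        sample_dir = os.path.join(out_dir, sample_id)
        os.makedirs(sample_dir, exist_ok=True)
        for i, seed in enumerate(SEEDS):
            out_path = os.path.join(sample_dir, f"{sample_id}-{i}.mp4")
            generate_video(image=image_path, prompt=prompt, seed=seed, output=out_path)
```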
Please download the [data] required for evaluations:
python scripts/eval_suit_download.py
and put them in the folder ./eval_cache:
├─AT
├─SL
├──easy_{type}_001
├───end.png
.....
├─SS
├─SV
Dimension-by-Dimension
To perform evaluation on one dimension:
python evaluate.py --base_path $VIDEO_FOLDER --dimension $DIMENSION
- Dimensions: AT, SL, SS, and SV.
- The evaluation results will be saved in ./evaluation_results.
- Please specify your DINO-X and Gemini API keys in ./metrics/dinox.py and ./metrics/gemini.py.
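If you want to run the four dimensions one by one from a single script (for example, to restart a failed dimension), a simple wrapper around the documented CLI could look like the sketch below; the results folder path is an example.

```python
import subprocess
import sys

VIDEO_FOLDER = "results/my_model"  # example path to your generated videos

# Invoke evaluate.py once per dimension, using the flags documented above.
for dimension in ["AT", "SL", "SS", "SV"]:
    subprocess.run(
        [sys.executable, "evaluate.py", "--base_path", VIDEO_FOLDER, "--dimension", dimension],
        check=True,
    )
```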
All Four Dimensions
We also provide an overall evaluation across all four dimensions; just run:
python evaluate.py --base_path $VIDEO_FOLDER
Only Pass@1
python evaluate.py --base_path $VIDEO_FOLDER --metric 'pass@1'
...
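For background, pass@k over n sampled videos is commonly estimated with the unbiased estimator of Chen et al. (2021); a small helper is sketched below. evaluate.py may compute pass@1 differently, so treat this only as a reference for interpreting the numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per prompt, c of them judged correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 videos sampled per prompt, 2 judged correct -> pass@1 = 0.4 (i.e., c / n when k = 1).
print(pass_at_k(n=5, c=2, k=1))
```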
Please consider citing our paper if our benchmark or test-time strategy is useful to you:
@article{chen2025tivibench,
title={TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models},
author={Chen, Harold Haodong and Lan, Disen and Shu, Wen-Jie and Liu, Qingyang and Wang, Zihan and Chen, Sirui and Cheng, Wenkai and Chen, Kanghao and Zhang, Hongfei and Zhang, Zixin and Guo, Rongjin and Cheng, Yu and Chen, Ying-Cong},
journal={arXiv preprint arXiv:2511.13704},
year={2025}
}

For any questions, feel free to email haroldchen328@gmail.com or disenlan1002@gmail.com.


