TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Harold H. Chen1,2*, Disen Lan3*, Wen-Jie Shu2*, Qingyang Liu4, Zihan Wang1, Sirui Chen1, Wenkai Cheng1, Kanghao Chen1,2, Hongfei Zhang1, Zixin Zhang1,2, Rongjin Guo5,
Yu Cheng6†, Ying-Cong Chen1,2†


*Equal Contribution; †Corresponding Author
1HKUST(GZ), 2HKUST, 3FDU, 4SJTU, 5CityUHK, 6CUHK

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

Project Page

Table of Contents

  • 📌 News
  • 🧰 TODO
  • 🌟 Overview
  • 📈 Evaluation Results
  • 🚀 Installation
  • 📍 Inference Suite
  • 🚩 Evaluation Suite
  • 🚁 VideoTPO
  • 📝 Citation
  • 📪 Contact

📌 News

  • [11/2025] 🔥 We release TiViBench, a hierarchical benchmark tailored to visual reasoning in I2V generation models!

🧰 TODO

  • Release Paper.
  • Release data and eval code.
  • Release VideoTPO inference code.

🌟 Overview

The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

📈 Evaluation Results

Pass@1 performance overview on TiViBench for 3 commercial models and 4 open-source models:

🚀 Installation

  1. Clone this repository and navigate to the source folder
git clone https://github.com/EnVision-Research/TiViBench.git
cd TiViBench
  2. Build the environment
echo "Creating conda environment"
conda create -n TiViBench python=3.10
conda activate TiViBench

echo "Installing dependencies"
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install opencv-python pytesseract scikit-image pillow
pip install dds-cloudapi-sdk==0.5.3 # DINO-X for Eval
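Optionally, sanity-check the environment before moving on (a verification step we suggest, not part of the repo):

# Verify that PyTorch 2.4.0 with CUDA support is visible inside the env.
import torch

print(torch.__version__)          # expected: 2.4.0
print(torch.cuda.is_available())  # expected: True on a CUDA 12.4 machine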

📍 Inference Suite

Prompt Suite

The inference prompts can be found in the ~/eval_cache/**_prompt.json files (a loading sketch follows the list below):

├─SS_prompt.json # Structural Reasoning & Search
├─SV_prompt.json # Spatial & Visual Pattern Reasoning
├─SL_prompt.json # Symbolic & Logical Reasoning
├─AT_prompt.json # Action Planning & Task Execution
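The prompt files can be loaded with standard JSON tooling. The exact schema is not shown in this README, so the mapping below (sample id to prompt string) is an assumption:

# Load one prompt file; we ASSUME a {sample_id: prompt} mapping.
import json
from pathlib import Path

with Path("eval_cache/SS_prompt.json").open() as f:  # Structural Reasoning & Search
    prompts = json.load(f)

for sample_id, prompt in list(prompts.items())[:3]:  # peek at the first few entries
    print(sample_id, "->", prompt)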

Image Suite

You can access our image suite on [Google Drive].

Automatic Download

pip install gdown
python scripts/image_suit_download.py

Data Format

The image suite is organized under ~/images/ in the following format:

├─AT # Action Planning & Task Execution
├─SL # Symbolic & Logical Reasoning
├─SS # Structural Reasoning & Search
├─SV # Spatial & Visual Pattern Reasoning
├──easy_graph_001.png
├──easy_graph_002.png
......

The default size of all images is 1280×720. We provide adaptive cropping of the images to fit your video model's input resolution.
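In case your model expects a different resolution, a minimal center-crop sketch is below; crop_to_aspect is a hypothetical helper for illustration, not the repo's own cropping code:

# Hypothetical helper: center-crop a 1280x720 benchmark image to the
# aspect ratio your video model expects, then resize to the target size.
from PIL import Image

def crop_to_aspect(path: str, target_w: int, target_h: int) -> Image.Image:
    img = Image.open(path)
    src_w, src_h = img.size                # 1280x720 by default
    target_ratio = target_w / target_h
    if src_w / src_h > target_ratio:       # source too wide: trim left/right
        new_w = int(src_h * target_ratio)
        left = (src_w - new_w) // 2
        img = img.crop((left, 0, left + new_w, src_h))
    else:                                  # source too tall: trim top/bottom
        new_h = int(src_w / target_ratio)
        top = (src_h - new_h) // 2
        img = img.crop((0, top, src_w, top + new_h))
    return img.resize((target_w, target_h))

# e.g. for a model that takes 1024x576 inputs:
crop_to_aspect("images/SV/easy_graph_001.png", 1024, 576).save("cropped.png")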

Inference Details

For each image-prompt pair, sample 5 videos with 5 fixed random seeds so that the evaluation results are reproducible. To facilitate subsequent evaluation, we strongly recommend organizing your generation results in the following format (a naming sketch follows the layout below):

├─AT_easy_game_001
├──AT_easy_game_001-0.mp4
├──AT_easy_game_001-1.mp4
├──AT_easy_game_001-2.mp4
├──AT_easy_game_001-3.mp4
├──AT_easy_game_001-4.mp4
├─AT_easy_game_002
......
├─SV_medium_graph_050
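If you script your generation loop, a minimal naming sketch is below; generate_video is a hypothetical placeholder for your own model call, and the five seed values are illustrative (any five fixed seeds work, as long as they stay constant across runs):

# Organize generated videos into the recommended layout.
# NOTE: generate_video is a hypothetical placeholder for your model's API.
from pathlib import Path

SEEDS = [0, 1, 2, 3, 4]  # illustrative fixed seeds; keep them constant across runs

def save_samples(sample_id: str, out_root: str = "results") -> None:
    sample_dir = Path(out_root) / sample_id      # e.g. results/AT_easy_game_001/
    sample_dir.mkdir(parents=True, exist_ok=True)
    for i, seed in enumerate(SEEDS):
        out_path = sample_dir / f"{sample_id}-{i}.mp4"
        # video = generate_video(image_path, prompt, seed=seed)  # your model here
        # video.save(out_path)
        print("would write", out_path, "with seed", seed)

save_samples("AT_easy_game_001")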

🚩 Evaluation Suite

Data Preparation

Please download the [data] required for evaluations:

python scripts/eval_suit_download.py

and put them in the folder ./eval_cache:

├─AT
├─SL
├──easy_{type}_001
├───end.png
.....
├─SS
├─SV
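Before running the evaluation, you can quickly confirm the layout (a minimal check we suggest, not part of the repo):

# Confirm the four dimension folders exist under ./eval_cache.
from pathlib import Path

missing = [d for d in ("AT", "SL", "SS", "SV") if not (Path("eval_cache") / d).is_dir()]
print("eval_cache OK" if not missing else f"missing folders: {missing}")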

Evaluation

Dimension-by-Dimension

To perform evaluation on one dimension:

python evaluate.py --base_path $VIDEO_FOLDER --dimension $DIMENSION
  • Dimensions: AT, SL, SS, and SV.
  • The evaluation result will be saved in ./evaluation_results.
  • Please specify the DINO-X and Gemini API keys in ./metrics/dinox.py and ./metrics/gemini.py.

All Four Dimensions

We also provide an overall evaluation across all four dimensions; just run:

python evaluate.py --base_path $VIDEO_FOLDER 

Only Pass@1

To compute only the Pass@1 metric, run:

python evaluate.py --base_path $VIDEO_FOLDER --metric 'pass@1'

🚁 VideoTPO

...

📝 Citation

Please consider citing our paper if our benchmark or test-time strategy is useful:

@article{chen2025tivibench,
  title={TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models},
  author={Chen, Harold Haodong and Lan, Disen and Shu, Wen-Jie and Liu, Qingyang and Wang, Zihan and Chen, Sirui and Cheng, Wenkai and Chen, Kanghao and Zhang, Hongfei and Zhang, Zixin and Guo, Rongjin and Cheng, Yu and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2511.13704},
  year={2025}
}

📪 Contact

For any questions, feel free to email haroldchen328@gmail.com or disenlan1002@gmail.com.
