- (2025.06.29): ✨Our paper has been accepted to ICCV 2025❗️
- (2024.12.15): ✨Code has been released❗️
DyTo is a Dynamic Token merging framework for zero-shot video understanding that optimizes token efficiency while preserving scene details through hierarchical frame selection and bipartite token merging.
Our paper: Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
- CUDA 11.7
- Python 3.10.12+
- PyTorch 2.1.0+
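The minimum versions above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper name `version_ge` is hypothetical, and it assumes GNU `sort -V` is available and that `python` points at the interpreter you plan to use.

```shell
# version_ge A B: succeeds when version A >= version B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the active interpreter against the minimum Python version listed above.
if command -v python >/dev/null 2>&1; then
  py_version="$(python -c 'import platform; print(platform.python_version())')"
  if version_ge "$py_version" "3.10.12"; then
    echo "Python $py_version: OK"
  else
    echo "Python $py_version: older than 3.10.12"
  fi

  # Only check PyTorch if it is already importable.
  if python -c 'import torch' >/dev/null 2>&1; then
    torch_version="$(python -c 'import torch; print(torch.__version__)')"
    # Drop a local build suffix such as "+cu118" before comparing.
    if version_ge "${torch_version%%+*}" "2.1.0"; then
      echo "PyTorch $torch_version: OK"
    else
      echo "PyTorch $torch_version: older than 2.1.0"
    fi
  fi
fi
```

The CUDA toolkit version is not checked here; `nvcc --version` or `python -c 'import torch; print(torch.version.cuda)'` can be used to inspect it by hand.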
- Environment Setup
# Create and activate conda environment
conda create -n dyto python=3.10
conda activate dyto
# Install dependencies
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
apt-get update
apt-get install git-lfs
git-lfs install
- API Configuration
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
export OPENAI_ORG=$YOUR_OPENAI_ORG # Optional
- Model Download
# Get LLaVA-NeXT weights
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b
The QA files for most datasets can be downloaded from here. For the VideoMME dataset, please download the QA files from here.
Prepare the QA files for each dataset you want to use. An example QA file is provided in the playground/gt_qa_files/ folder.
python scripts/data/prepare_${DATASET}_qa_file.py --qa_file $PATH_TO_CSV_FILE
- Download directly from dataset providers:
Key parameters in yaml config:
- SCRIPT: Task selection
- DATA_DIR & CONV_MODE: Data paths and prompts
- NUM_FRAMES: Frame sampling count
- TEMPORAL_AGGREGATION: Dynamic Token Merge pathway settings
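Before launching a run, it can help to confirm what the key parameters are set to in the config you are about to pass. A minimal sketch, assuming the parameter names appear literally in the YAML file; the helper `show_key_params` is hypothetical and not part of the repository.

```shell
# show_key_params FILE: print (with line numbers) the lines of a YAML config
# that mention any of the key parameters listed above.
show_key_params() {
  grep -nE '(SCRIPT|DATA_DIR|CONV_MODE|NUM_FRAMES|TEMPORAL_AGGREGATION)' "$1"
}

# $PATH_TO_CONFIG_FILE is the same placeholder that is passed to run_inference.py.
if [ -f "${PATH_TO_CONFIG_FILE:-}" ]; then
  show_key_params "$PATH_TO_CONFIG_FILE"
fi
```

This only inspects the file; it does not validate the values or the config layout.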
cd DYTO
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
python run_demo.py \
--video_path $PATH_TO_VIDEO \
--model_path $PATH_TO_YOUR_MODEL \
--question "Describe this video in details"
outputs/
├── artifacts/ # Inference outputs
├── eval_save_dir/ # GPT-3.5-turbo intermediate results
└── logs/ # Evaluation results
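After a run, the layout above can be verified quickly from the shell. A minimal sketch, assuming the `outputs/` directory is created relative to where the scripts were launched; the helper `check_outputs` is hypothetical and not part of the repository.

```shell
# check_outputs ROOT: report which of the expected subdirectories exist under ROOT
# (defaults to ./outputs). Returns non-zero if any of them is missing.
check_outputs() {
  root="${1:-outputs}"
  missing=0
  for sub in artifacts eval_save_dir logs; do
    if [ -d "$root/$sub" ]; then
      echo "found:   $root/$sub"
    else
      echo "missing: $root/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   check_outputs outputs || echo "some output directories are missing"
```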
If you are using the data/code/model provided here in a publication, please cite our paper:
@InProceedings{Zhang_2025_ICCV,
author = {Zhang, Yiming and Zhao, Zhuokai and Chen, Zhaorun and Ding, Zenghui and Yang, Xianjun and Sun, Yining},
title = {Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {22046-22055}
}