🎥 [ICCV2025] DyTo

A Training-Free Method for Zero-Shot Video Understanding

📣 News

  • (2025.06.29): ✨Our paper is accepted to ICCV2025❗️
  • (2024.12.15): ✨Code has been released❗️

📖 Overview

DyTo is a Dynamic Token merging framework for zero-shot video understanding that optimizes token efficiency while preserving scene details through hierarchical frame selection and bipartite token merging.

Our paper: Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
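The bipartite token merging idea can be illustrated with a minimal, ToMe-style sketch. This is not DyTo's actual implementation: the function name, the alternating A/B split, and the pairwise-averaging rule are all illustrative assumptions, shown only to make the mechanism concrete.

```python
import numpy as np

def bipartite_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Illustrative bipartite soft matching: split tokens into two
    alternating sets A and B, pick the r A-tokens with the strongest
    cosine-similarity match in B, and average each matched pair."""
    a, b = tokens[0::2], tokens[1::2]
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                          # (|A|, |B|) cosine similarities
    best_b = sim.argmax(axis=1)              # best partner in B per A-token
    top_a = set(np.argsort(-sim.max(axis=1))[:r].tolist())
    out, used_b = [], set()
    for i in range(len(a)):
        if i in top_a and int(best_b[i]) not in used_b:
            out.append((a[i] + b[best_b[i]]) / 2)   # merge the matched pair
            used_b.add(int(best_b[i]))
        else:
            out.append(a[i])                 # keep unmerged A-tokens as-is
    out += [b[j] for j in range(len(b)) if j not in used_b]
    return np.stack(out)
```

With r merges and no partner collisions, N tokens shrink to N - r; DyTo pairs this kind of merging with hierarchical frame selection so the shortened token sequence still covers the scene.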

🚀 Quick Start

Environment

  • CUDA 11.7
  • Python 3.10.12+
  • PyTorch 2.1.0+

Setup Guide

  1. Environment Setup
# Create and activate conda environment
conda create -n dyto python=3.10
conda activate dyto

# Install dependencies
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir

apt-get update
apt-get install git-lfs
git lfs install
  2. API Configuration
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
export OPENAI_ORG=$YOUR_OPENAI_ORG  # Optional
  3. Model Download
# Get LLaVA-NeXT weights
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b

📊 Data Setup

Ground Truth QA Files

The QA files for most datasets can be downloaded from here. For the Video-MME dataset, please download the QA files from here.

Prepare the QA files for the datasets you want to use; an example QA file is provided in the playground/gt_qa_files/ folder.

python scripts/data/prepare_${DATASET}_qa_file.py --qa_file $PATH_TO_CSV_FILE

Video Datasets

⚙️ Configuration

Key parameters in yaml config:

  • SCRIPT: Task selection
  • DATA_DIR & CONV_MODE: Data paths and prompts
  • NUM_FRAMES: Frame sampling count
  • TEMPORAL_AGGREGATION: Dynamic Token Merge pathway settings
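A hypothetical config sketch using the key names above. The values are illustrative assumptions, not defaults from the repository; consult the config files shipped with the code for real settings.

```yaml
# Illustrative only -- values are assumptions, not repo defaults.
SCRIPT: videoqa                 # task selection
DATA_DIR: /path/to/videos       # data path
CONV_MODE: vicuna_v1            # prompt template
NUM_FRAMES: 32                  # frame sampling count
TEMPORAL_AGGREGATION: true      # dynamic token merge pathway settings
```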

🔄 Running the Model

Evaluation

cd DYTO
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE

Demo

python run_demo.py \
    --video_path $PATH_TO_VIDEO \
    --model_path $PATH_TO_YOUR_MODEL \
    --question "Describe this video in detail"

📂 Output Structure

outputs/
├── artifacts/      # Inference outputs
├── eval_save_dir/  # GPT-3.5-turbo intermediate results
└── logs/          # Evaluation results

📚 Citation

If you use the data, code, or model provided here in a publication, please cite our paper:

@InProceedings{Zhang_2025_ICCV,
    author    = {Zhang, Yiming and Zhao, Zhuokai and Chen, Zhaorun and Ding, Zenghui and Yang, Xianjun and Sun, Yining},
    title     = {Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {22046-22055}
}
