Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

Official PyTorch implementation of our ICCV 2023 paper, Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation.

Overview

To optimize the speed-accuracy trade-off, we propose the Semantic-aware Temporal Accumulation (STA) score to prune spatio-temporal tokens integrally. The STA score considers two critical factors: temporal redundancy and semantic importance. The former characterizes a region as either a new occurrence or an already-seen entity by aggregating token-to-token similarity across consecutive frames, while the latter evaluates each token by its contribution to the overall prediction. As a result, tokens with higher STA scores carry more temporal redundancy and less semantic content, and are therefore pruned. Based on the STA score, we progressively prune tokens without introducing any additional parameters or requiring further re-training.
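To make the two factors concrete, here is a toy sketch in plain Python. It is not the scoring used in this repository: the cosine-similarity redundancy measure, the "redundancy minus importance" combination, the hand-set importance values, and every function name below are assumptions made for illustration only. The actual STA definition is given in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_scores(frames, importance):
    """frames[t][i] is the feature of token i in frame t; importance[t][i] is in [0, 1].

    Returns a list of (score, t, i). A higher score means the token is
    more redundant with the previous frame and less important.
    """
    scored = []
    for t, tokens in enumerate(frames):
        for i, tok in enumerate(tokens):
            # Tokens in the first frame have nothing to be redundant with.
            redundancy = 0.0 if t == 0 else max(cosine(tok, prev) for prev in frames[t - 1])
            scored.append((redundancy - importance[t][i], t, i))
    return scored

def prune(frames, importance, r):
    """Return the (t, i) positions of the r highest-scoring tokens, i.e. the ones to drop."""
    ranked = sorted(toy_scores(frames, importance), reverse=True)
    return sorted((t, i) for _, t, i in ranked[:r])

# Hypothetical two-frame clip with two 2-D tokens per frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0
    [[1.0, 0.0], [1.0, 1.0]],  # frame 1: token 0 repeats frame 0's token 0
]
importance = [[0.5, 0.5], [0.1, 0.9]]
print(prune(frames, importance, r=1))  # [(1, 0)]
```

The repeated, low-importance token in frame 1 receives the highest toy score and is the one dropped, which mirrors the intuition that redundant and semantically weak tokens are pruned first.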

[Project Page] [arXiv] [PDF]

Usage

Requirements

torch == 2.0.1
timm == 0.4.12
tensorboardX == 2.6.2.2
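
Before running anything, it can help to confirm that the installed packages match the pins above. The following is a minimal sketch using the standard-library `importlib.metadata`; the `check_pins` helper and its `lookup` parameter are hypothetical names introduced here, not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the Requirements section.
PINNED = {"torch": "2.0.1", "timm": "0.4.12", "tensorboardX": "2.6.2.2"}

def check_pins(pinned, lookup=version):
    """Return {package: (expected, found)} for every package that is missing or mismatched."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu118" that some torch wheels carry.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems

if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the expected version.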

Data preparation

Download the Kinetics-400 dataset from the OpenDataLab website.
Download the Something-Something V2 dataset from the OpenDataLab website.

Off-the-shelf Model

For off-the-shelf ViT checkpoints, please refer to the VideoMAE model zoo [link].

Downstream Evaluation for Action Recognition

To perform downstream evaluation for action recognition, run the following bash script:

set -x
export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1 

python3 -u run_inference.py \
    --model vit_large_patch16_224 \
    --data_path ${DATA_PATH} \
    --ckpt_path ${CKPT_PATH} \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 32 \
    --dist_eval \
    --r 64

Here, --r sets the number of tokens pruned in the first stage, and --model selects the model to load, as defined in model_vit.py. Replace ${DATA_PATH}, ${CKPT_PATH}, and ${OUTPUT_DIR} with the actual paths to your dataset, model checkpoint, and output log directory, respectively.
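
If you prefer to launch the evaluation from Python, for example to sweep over several values of r, the command above can be assembled programmatically. This is a sketch only: `build_command` and `missing_paths` are hypothetical helpers, the paths in the usage example are placeholders, and the flags are exactly those shown in the script above.

```python
import os
import shlex

def build_command(data_path, ckpt_path, output_dir, r=64, batch_size=32,
                  model="vit_large_patch16_224"):
    """Assemble the run_inference.py argument list shown above."""
    return [
        "python3", "-u", "run_inference.py",
        "--model", model,
        "--data_path", data_path,
        "--ckpt_path", ckpt_path,
        "--log_dir", output_dir,
        "--output_dir", output_dir,
        "--batch_size", str(batch_size),
        "--dist_eval",
        "--r", str(r),
    ]

def missing_paths(*paths):
    """Return the subset of paths that do not exist, so typos fail early."""
    return [p for p in paths if not os.path.exists(p)]

if __name__ == "__main__":
    # Placeholder paths; substitute your own.
    cmd = build_command("/path/to/data", "/path/to/ckpt.pth", "/path/to/output", r=64)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `missing_paths` returns an empty list for your dataset and checkpoint paths.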

To run the FLOPs and throughput calculation code at the end of model_vit.py, execute the following command in your terminal:

python3 model_vit.py

Please ensure that a CUDA-capable GPU is available before running this command.
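
A quick way to check for a usable GPU beforehand is PyTorch's `torch.cuda.is_available()`. The wrapper below is a small sketch (the `cuda_available` name is an assumption of this example) that also handles the case where PyTorch is not installed.

```python
def cuda_available():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no CUDA environment to use.
        return False
    return bool(torch.cuda.is_available())

if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```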

Visualization

We visualize the GradNorm distribution for ViT-Large on the Kinetics-400 validation set. The heatmap reveals sparse patterns across the board, indicating that most tokens do not contribute significantly to the final prediction. STA retains almost all areas of high-activation GradNorm.

We also visualize the proposed STA strategy, with discarded tokens masked out by white boxes. STA not only retains informative tokens but also preserves diverse regions for improved video reasoning.

Acknowledgement

Our code is based on the implementation of VideoMAE. We sincerely thank the authors for their great work.

Citation

If our code is helpful to your work, please consider citing:

@inproceedings{ding2023prune,
  title={Prune spatio-temporal tokens by semantic-aware temporal accumulation},
  author={Ding, Shuangrui and Zhao, Peisen and Zhang, Xiaopeng and Qian, Rui and Xiong, Hongkai and Tian, Qi},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={16945--16956},
  year={2023}
}
