Commit 7e99aec: update readme
kxqt committed Sep 2, 2023 (1 parent: f681751)
Showing 5 changed files with 28 additions and 103 deletions.
README.md: 121 changes (23 additions & 98 deletions)
@@ -1,52 +1,30 @@
# Video Swin Transformer
# Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-swin-transformer/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=video-swin-transformer)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-swin-transformer/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=video-swin-transformer)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-swin-transformer/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=video-swin-transformer)

By [Ze Liu](https://github.com/zeliu98/)\*, [Jia Ning](https://github.com/hust-nj)\*, [Yue Cao](http://yue-cao.me), [Yixuan Wei](https://github.com/weiyx16), [Zheng Zhang](https://stupidzz.github.io/), [Stephen Lin](https://scholar.google.com/citations?user=c3PYmxUAAAAJ&hl=en) and [Han Hu](https://ancientmooner.github.io/).

This repo is the official implementation of ["Video Swin Transformer"](https://arxiv.org/abs/2106.13230). It is based on [mmaction2](https://github.com/open-mmlab/mmaction2).

## Updates

***06/25/2021*** Initial commits

## Introduction

**Video Swin Transformer** is initially described in ["Video Swin Transformer"](https://arxiv.org/abs/2106.13230), which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (`84.9` top-1 accuracy on Kinetics-400 and `86.1` top-1 accuracy on Kinetics-600 with `~20x` less pre-training data and `~3x` smaller model size) and temporal modeling (`69.6` top-1 accuracy on Something-Something v2).

This is the official implementation of the paper "[Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning](https://arxiv.org/abs/2210.01035)" on [Video Swin Transformer](https://arxiv.org/abs/2106.13230).

![teaser](figures/teaser.png)
![framework](figures/Hourglass_swin_framework.png)
![framework](figures/TokenClusterReconstruct_Details.png)
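
As the figures suggest, the speed-up comes from a non-parametric token clustering layer inserted part-way through the backbone, so that the heavier later blocks run on a coarser token grid, together with a token reconstruction layer that maps the coarse outputs back onto the full-resolution grid for the recognition head. Below is a minimal, illustrative PyTorch sketch of that idea on a 2D token grid; the video models actually use a 3D t x h x w grid and a locally windowed clustering, this is not the repository's implementation, and every name in it is made up for illustration.

```python
import torch
import torch.nn.functional as F

def cluster_tokens(x, fine_hw, coarse_hw, n_iters=5, temperature=0.01):
    """Soft-cluster fine-grid tokens into a coarser token grid.

    x: (B, H*W, C) tokens laid out on an (H, W) grid.
    Returns (B, h*w, C) clustered tokens on the coarse (h, w) grid.
    """
    B, N, C = x.shape
    H, W = fine_hw
    h, w = coarse_hw
    # Initialise cluster centers by average-pooling the token grid down to (h, w).
    grid = x.transpose(1, 2).reshape(B, C, H, W)
    centers = F.adaptive_avg_pool2d(grid, (h, w)).flatten(2).transpose(1, 2)
    for _ in range(n_iters):
        # Soft-assign every fine token to every center, then recompute the centers.
        sim = torch.einsum("bnc,bmc->bnm",
                           F.normalize(x, dim=-1), F.normalize(centers, dim=-1))
        assign = (sim / temperature).softmax(dim=-1)          # (B, N, h*w)
        centers = torch.einsum("bnm,bnc->bmc", assign, x)
        centers = centers / assign.sum(dim=1).unsqueeze(-1).clamp(min=1e-6)
    return centers

def reconstruct_tokens(fine_ref, coarse_out, k=25, temperature=0.01):
    """Map processed coarse tokens back onto the fine grid.

    fine_ref:   (B, N, C) tokens taken before clustering (queries).
    coarse_out: (B, M, C) tokens after the remaining transformer blocks.
    Each fine token is rebuilt as a similarity-weighted mix of its k nearest coarse tokens.
    """
    B, N, C = fine_ref.shape
    sim = torch.einsum("bnc,bmc->bnm",
                       F.normalize(fine_ref, dim=-1), F.normalize(coarse_out, dim=-1))
    topk, idx = sim.topk(k, dim=-1)                           # (B, N, k)
    weight = (topk / temperature).softmax(dim=-1)
    gathered = torch.gather(
        coarse_out.unsqueeze(1).expand(B, N, -1, -1), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, C))              # (B, N, k, C)
    return (weight.unsqueeze(-1) * gathered).sum(dim=2)       # (B, N, C)

# Usage sketch: cluster a 56x56 grid of 96-d tokens down to 28x28, run the
# remaining (heavy) blocks on the coarse tokens, then reconstruct.
tokens = torch.randn(1, 56 * 56, 96)
coarse = cluster_tokens(tokens, (56, 56), (28, 28))
# coarse = heavy_blocks(coarse)  # hypothetical later backbone stages
dense = reconstruct_tokens(tokens, coarse)
print(coarse.shape, dense.shape)  # (1, 784, 96) and (1, 3136, 96)
```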

## Results and Models
## Results

### Kinetics 400

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1K | 30ep | 224 | 78.8 | 93.6 | 28M | 87.9G | [config](configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_tiny_patch244_window877_kinetics400_1k.pth)/[baidu](https://pan.baidu.com/s/1mIqRzk8RILeRsP2KB5T6fg) |
| Swin-S | ImageNet-1K | 30ep | 224 | 80.6 | 94.5 | 50M | 165.9G | [config](configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_small_patch244_window877_kinetics400_1k.pth)/[baidu](https://pan.baidu.com/s/1imq7LFNtSu3VkcRjd04D4Q) |
| Swin-B | ImageNet-1K | 30ep | 224 | 80.6 | 94.6 | 88M | 281.6G | [config](configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics400_1k.pth)/[baidu](https://pan.baidu.com/s/1bD2lxGxqIV7xECr1n2slng) |
| Swin-B | ImageNet-22K | 30ep | 224 | 82.7 | 95.5 | 88M | 281.6G | [config](configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics400_22k.pth)/[baidu](https://pan.baidu.com/s/1CcCNzJAIud4niNPcREbDbQ) |
| Method | $\alpha$ | t $\times$ h $\times$ w | GFLOPs | FPS | Acc@1 | Acc@5 | config |
| ------------- | -------- | ------------------------- | ------ | ---- | ----- | ----- | ------------------------------------------------------------ |
| Swin-L | - | 8 $\times$ 12 $\times$ 12 | 2107 | 1.10 | 84.7 | 96.6 | [config](configs/recognition/swin/swin_large_384_patch244_window81212_kinetics400_22k.py) |
| Swin-L + Ours | 10 | 8 $\times$ 6 $\times$ 6 | 1662 | 1.66 | 84.0 | 96.3 | [config](configs/recognition/swin/hourglass_swin_large_384_patch244_window81212_kinetics400_22k.py) |

### Kinetics 600

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | ImageNet-22K | 30ep | 224 | 84.0 | 96.5 | 88M | 281.6G | [config](configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics600_22k.pth)/[baidu](https://pan.baidu.com/s/1ZMeW6ylELTje-o3MiaZ-MQ) |
| Method | $\alpha$ | t $\times$ h $\times$ w | GFLOPs | FPS | Acc@1 | Acc@5 | config |
| ------------- | -------- | ------------------------- | ------ | ---- | ----- | ----- | ------------------------------------------------------------ |
| Swin-L | - | 8 $\times$ 12 $\times$ 12 | 2107 | 1.10 | 86.1 | 97.3 | [config](configs/recognition/swin/swin_large_384_patch244_window81212_kinetics600_22k.py) |
| Swin-L + Ours | 10 | 8 $\times$ 6 $\times$ 6 | 1824 | 1.53 | 85.6 | 97.1 | [config](configs/recognition/swin/hourglass_swin_large_384_patch244_window81212_kinetics600_22k.py) |

### Something-Something V2

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | Kinetics 400 | 60ep | 224 | 69.6 | 92.7 | 89M | 320.6G | [config](configs/recognition/swin/swin_base_patch244_window1677_sthv2.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window1677_sthv2.pth)/[baidu](https://pan.baidu.com/s/18MOGf6L3LeUjrLoQEeA52Q) |

**Notes**:

- **Pre-trained image models can be downloaded from [Swin Transformer for ImageNet Classification](https://github.com/microsoft/Swin-Transformer)**.
- The pre-trained model for SSv2 can be downloaded from [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window1677_kinetics400_22k.pth)/[baidu](https://pan.baidu.com/s/1ZnJuX7-x2BflDKHpuvdLUg).
- Access code for baidu is `swin`.

## Usage

@@ -72,60 +50,17 @@
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
```

### Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:
```
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
```
For example, to train a `Swin-T` model on the Kinetics-400 dataset with 8 GPUs, run:
```
bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>
```

To train a video recognizer with pre-trained video models (for the Something-Something V2 dataset), run:
```
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
```
For example, to train a `Swin-B` model on the SSv2 dataset with 8 GPUs, run:
```
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>
```

**Note:** `use_checkpoint` is used to save GPU memory. Please refer to [this page](https://pytorch.org/docs/stable/checkpoint.html) for more details.
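
For intuition, activation checkpointing drops the intermediate activations inside the wrapped module during the forward pass and recomputes them during the backward pass, trading extra compute for lower GPU memory. A small self-contained PyTorch sketch (not this repository's code) is:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy transformer-style block, used only to demonstrate checkpointing."""
    def __init__(self, dim=96):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.mlp(x)

block = Block()
x = torch.randn(2, 56 * 56, 96, requires_grad=True)
# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x)
y.sum().backward()
```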


### Apex (optional):
We use apex for mixed precision training by default. To install apex, use our provided docker or run:
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
If you would like to disable apex, comment out the following code block in the [configuration files](configs/recognition/swin):
```
# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
```

## Citation
If you find our work useful in your research, please cite:
If you find this project useful in your research, please consider citing:

```BibTex
@article{liang2022expediting,
  author = {Liang, Weicong and Yuan, Yuhui and Ding, Henghui and Luo, Xiao and Lin, Weihong and Jia, Ding and Zhang, Zheng and Zhang, Chao and Hu, Han},
  title = {Expediting large-scale vision transformer for dense prediction without fine-tuning},
  journal = {arXiv preprint arXiv:2210.01035},
  year = {2022},
}
```

```
@article{liu2021video,
@@ -142,13 +77,3 @@
year={2021}
}
```

## Other Links

> **Image Classification**: See [Swin Transformer for Image Classification](https://github.com/microsoft/Swin-Transformer).
> **Object Detection**: See [Swin Transformer for Object Detection](https://github.com/SwinTransformer/Swin-Transformer-Object-Detection).
> **Semantic Segmentation**: See [Swin Transformer for Semantic Segmentation](https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation).
> **Self-Supervised Learning**: See [MoBY with Swin Transformer](https://github.com/SwinTransformer/Transformer-SSL).

@@ -7,9 +7,9 @@
         patch_size=(2,4,4),
         window_size=(8,12,12),
         drop_path_rate=0.5,
-        clustering_location=12,
+        clustering_location=10,
         token_clustering_cfg=dict(
-            clustering_shape=(8, 8, 8),
+            clustering_shape=(8, 6, 6),
             n_iters=5,
             temperature=0.01,
             window_size=5,

@@ -9,14 +9,14 @@
         drop_path_rate=0.5,
         clustering_location=12,
         token_clustering_cfg=dict(
-            clustering_shape=(8, 8, 8),
+            clustering_shape=(8, 6, 6),
             n_iters=5,
-            temperature=0.01,
+            temperature=0.02,
             window_size=5,
         ),
         token_reconstruction_cfg=dict(
             k=25,
-            temperature=0.01,
+            temperature=0.02,
         ),
     ),
     test_cfg=dict(max_testing_views=1))
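
Taken together, the two config hunks above only adjust the hourglass hyper-parameters: where in the backbone the tokens are clustered, the target t x h x w token grid, and the softmax temperatures used for clustering and reconstruction. For orientation, these options fit together roughly as sketched below with the Kinetics-400 values; only the fields visible in the diff are taken from the repository, while the surrounding structure and the comments are assumptions.

```python
# Rough sketch of the hourglass-related backbone options (Kinetics-400 values).
# The comments are interpretations, not the repository's documentation.
model = dict(
    backbone=dict(
        patch_size=(2, 4, 4),
        window_size=(8, 12, 12),
        drop_path_rate=0.5,
        clustering_location=10,          # backbone block after which tokens are clustered
        token_clustering_cfg=dict(
            clustering_shape=(8, 6, 6),  # target t x h x w token grid after clustering
            n_iters=5,                   # clustering iterations
            temperature=0.01,            # softmax temperature for the soft assignments
            window_size=5,               # local window used during clustering
        ),
        token_reconstruction_cfg=dict(
            k=25,                        # neighbours used to rebuild each fine token
            temperature=0.01,
        ),
    ),
    test_cfg=dict(max_testing_views=1),
)
```
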
Binary file added figures/Hourglass_swin_framework.png
Binary file added figures/TokenClusterReconstruct_Details.png
