Video classification with UniFormer

We currently release the code and models for:

Kinetics-400
Kinetics-600
Something-Something V1
Something-Something V2

Update

05/21/2022

Lightweight models are released, which surpass X3D and MoViNet.

01/13/2022

Pretrained models on Kinetics-400, Kinetics-600, Something-Something V1&V2 d

Model Zoo

The following models and logs can be downloaded from Google Drive: total_models, total_logs.

We also release the models on Baidu Cloud: total_models (gphp), total_logs (q5bw).

Note

The config.yaml files in our exp directory are NOT the exact training configs actually used, since some hyperparameters are overridden in run.sh or test.sh.
All the models are pretrained on ImageNet-1K without Token Labeling and Layer Scale. You can find those pretrained models in image_classification. The reason can be found in issue #12.
#Frame = #input_frame x #crop x #clip
#input_frame is the number of frames fed to the model per inference.
#crop is the number of spatial crops (e.g., 3 for left/right/center).
#clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices).
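
As a quick check of the notation above, the following minimal sketch parses a #Frame entry such as `16x1x4` and returns the total number of frames processed per video. The names `FrameSpec`, `parse_frame_spec`, and `total_frames` are hypothetical helpers introduced for illustration; they are not part of this repository.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    input_frames: int  # frames fed to the model per inference
    crops: int         # spatial crops per clip
    clips: int         # temporal clips per video


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a '#Frame' entry written as '#input_frame x #crop x #clip'."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected three 'x'-separated integers, got {spec!r}")
    return FrameSpec(*(int(p) for p in parts))


def total_frames(spec: str) -> int:
    """Total frames processed per video across all crops and clips."""
    s = parse_frame_spec(spec)
    return s.input_frames * s.crops * s.clips


print(total_frames("16x1x4"))  # 16 frames x 1 crop x 4 clips = 64
print(total_frames("16x3x5"))  # 16 frames x 3 crops x 5 clips = 240
```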

Kinetics-400

Model	#Frame	Resolution	FLOPs	Top1	Model	Log	Shell
UniFormer-XXS	4x1x1	128	1.0G	63.2	google	google	run.sh/config
UniFormer-XXS	4x1x1	160	1.6G	65.8	google	google	run.sh/config
UniFormer-XXS	8x1x1	128	2.0G	68.3	google	google	run.sh/config
UniFormer-XXS	8x1x1	160	3.3G	71.4	google	google	run.sh/config
UniFormer-XXS	16x1x1	128	4.2G	73.3	google	google	run.sh/config
UniFormer-XXS	16x1x1	160	6.9G	75.1	google	google	run.sh/config
UniFormer-XXS	32x1x1	160	15.4G	77.9	google	google	run.sh/config
UniFormer-XS	32x1x1	192	34.2G	78.6	google	google	run.sh/config

We adopt a sparse sampling method for the lightweight models. To avoid NaN loss, we use the following techniques:

Disable mixed precision training.
Use weaker data augmentation.
Add Layer Scale.
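
Sparse sampling is commonly implemented by splitting a video into equal segments and taking one frame per segment. The sketch below illustrates that general idea with a deterministic segment-center variant. `sparse_sample_indices` is a hypothetical helper, and the sampling code actually used in this repository may differ in detail.

```python
from typing import List


def sparse_sample_indices(num_video_frames: int, num_samples: int) -> List[int]:
    """Pick the center frame index of each of `num_samples` equal segments."""
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("both arguments must be positive")
    segment = num_video_frames / num_samples
    # The center of segment i is (i + 0.5) * segment; clamp to the last valid index.
    return [
        min(int((i + 0.5) * segment), num_video_frames - 1)
        for i in range(num_samples)
    ]


# A 300-frame video sampled with 4 frames, as in the 4x1x1 setting.
print(sparse_sample_indices(300, 4))  # [37, 112, 187, 262]
```

Because the indices span the whole video, a single clip already covers the full duration, which is consistent with the lightweight models being tested with one clip.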

Model	#Frame	Sampling Stride	FLOPs	Top1	Model	Log	Shell
UniFormer-S	8x1x4	8	70G	78.4	google	google	run.sh/config
UniFormer-S	16x1x4	4	167G	80.8	google	google	run.sh/config
UniFormer-S	16x1x4	8	167G	80.8	google	google	run.sh/config
UniFormer-S	32x1x4	4	438G	82.0	-	google	run.sh/config
UniFormer-B	8x1x4	8	161G	79.8	google	google	run.sh/config
UniFormer-B	16x1x4	4	387G	82.0	google	google	run.sh/config
UniFormer-B	16x1x4	8	387G	81.7	google	google	run.sh/config
UniFormer-B	32x1x4	4	1036G	82.9	google	google	run.sh/config

Kinetics-600

Model	#Frame	Sampling Stride	FLOPs	Top1	Model	Log	Shell
UniFormer-S	16x1x4	4	167G	82.8	google	google	run.sh/config
UniFormer-S	16x1x4	8	167G	82.7	google	google	run.sh/config
UniFormer-B	16x1x4	4	387G	84.0	google	google	run.sh/config
UniFormer-B	16x1x4	8	387G	83.4	google	google	run.sh/config
UniFormer-B	32x1x4	4	1036G	84.5*	google	google	run.sh/config

* Since Kinetics-600 takes too long to train on a single node (>1 month with 8 A100 GPUs), we provide a model trained on multiple nodes (around 2 weeks with 32 V100 GPUs), but the result is lower due to a lack of hyperparameter tuning.

For multi-node training, please install submitit or follow the training scripts in our UniFormerV2.

Something-Something V1

Model	Pretrain	#Frame	FLOPs	Top1	Model	Log	Shell
UniFormer-S	K400	16x3x1	125G	57.2	google	google	run.sh/config
UniFormer-S	K600	16x3x1	125G	57.6	google	google	run.sh/config
UniFormer-S	K400	32x3x1	329G	58.8	google	google	run.sh/config
UniFormer-S	K600	32x3x1	329G	59.9	google	google	run.sh/config
UniFormer-B	K400	16x3x1	290G	59.1	google	google	run.sh/config
UniFormer-B	K600	16x3x1	290G	58.8	google	google	run.sh/config
UniFormer-B	K400	32x3x1	777G	60.9	google	google	run.sh/config
UniFormer-B	K600	32x3x1	777G	61.0	google	google	run.sh/config

Something-Something V2

Model	Pretrain	#Frame	FLOPs	Top1	Model	Log	Shell
UniFormer-S	K400	16x3x1	125G	67.7	google	google	run.sh/config
UniFormer-S	K600	16x3x1	125G	69.4	google	google	run.sh/config
UniFormer-S	K400	32x3x1	329G	69.0	google	google	run.sh/config
UniFormer-S	K600	32x3x1	329G	70.4	google	google	run.sh/config
UniFormer-B	K400	16x3x1	290G	70.4	google	google	run.sh/config
UniFormer-B	K600	16x3x1	290G	70.2	google	google	run.sh/config
UniFormer-B	K400	32x3x1	777G	71.1	google	google	run.sh/config
UniFormer-B	K600	32x3x1	777G	71.2	google	google	run.sh/config

UCF101

Model	#Frame	Sampling Stride	FLOPs	Top1	Model	Log	Shell
UniFormer-S	16x3x5	4	625G	98.3	google	google	run.sh/config

HMDB51

Model	#Frame	Sampling Stride	FLOPs	Top1	Model	Log	Shell
UniFormer-S	16x3x5	4	625G	77.5	google	google	run.sh/config

Usage

Installation

Please follow the installation instructions in INSTALL.md. You may follow the instructions in DATASET.md to prepare the datasets.

Training

Download the pretrained models in our repository.
Simply run the training scripts in exp as follows:
```
bash ./exp/uniformer_s8x8_k400/run.sh
```

[Note]:

Due to some bugs in the SlowFast repository, the program will terminate during the final testing.
During training, we follow the SlowFast repository and randomly crop videos for validation. For accurate testing, please follow our testing scripts.
For more config details, you can read the comments in slowfast/config/defaults.py.

To avoid running out of memory, you can use torch.utils.checkpoint (set in config.yaml or run.sh):

MODEL.USE_CHECKPOINT True # whether to use checkpointing
MODEL.CHECKPOINT_NUM [0, 0, 4, 0] # number of checkpointed blocks in each stage
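
One plausible reading of `MODEL.CHECKPOINT_NUM` is that entry `k` gives the number of leading blocks in stage `k` that are wrapped with `torch.utils.checkpoint`. The sketch below makes that assumption explicit without depending on PyTorch. The helper `checkpoint_plan` and the stage depths `[3, 4, 8, 3]` are illustrative assumptions, not values taken from this document.

```python
from typing import List


def checkpoint_plan(depths: List[int], checkpoint_num: List[int]) -> List[List[bool]]:
    """For each stage, mark which blocks would use activation checkpointing.

    Assumption: block i of stage k is checkpointed when i < checkpoint_num[k].
    """
    if len(depths) != len(checkpoint_num):
        raise ValueError("depths and checkpoint_num need one entry per stage")
    return [
        [i < n for i in range(depth)]
        for depth, n in zip(depths, checkpoint_num)
    ]


# Hypothetical stage depths; CHECKPOINT_NUM as in the example above.
plan = checkpoint_plan([3, 4, 8, 3], [0, 0, 4, 0])
for stage, flags in enumerate(plan):
    print(f"stage {stage}: {sum(flags)}/{len(flags)} blocks checkpointed")
```

Under this reading, raising an entry trades extra recomputation in the backward pass for lower activation memory in that stage.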

Testing

We provide a testing example as follows:

bash ./exp/uniformer_s8x8_k400/test.sh

Specifically, create a new config for testing and run the multi-crop/multi-clip test:

Copy the training config file config.yaml to create a new testing config test.yaml.

Change the hyperparameters of data (in test.yaml or test.sh):

DATA:
  TRAIN_JITTER_SCALES: [224, 224]
  TEST_CROP_SIZE: 224

Set the number of crops and clips (in test.yaml or test.sh):

Multi-clip testing for Kinetics
```
TEST.NUM_ENSEMBLE_VIEWS 4
TEST.NUM_SPATIAL_CROPS 1
```
Multi-crop testing for Something-Something
```
TEST.NUM_ENSEMBLE_VIEWS 1
TEST.NUM_SPATIAL_CROPS 3
```
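
Multi-clip and multi-crop testing produce several predictions per video, which are then combined into one. The sketch below shows one simple way to do that: sum the per-view class scores and take the argmax. `ensemble_views` is a hypothetical helper and the scores are made up for illustration; the aggregation actually used by the testing scripts is not stated here.

```python
from typing import List, Sequence


def ensemble_views(view_scores: Sequence[Sequence[float]]) -> int:
    """Sum class scores over all views and return the predicted class index."""
    if not view_scores:
        raise ValueError("need at least one view")
    num_classes = len(view_scores[0])
    if any(len(v) != num_classes for v in view_scores):
        raise ValueError("all views must score the same number of classes")
    totals: List[float] = [
        sum(v[c] for v in view_scores) for c in range(num_classes)
    ]
    return max(range(num_classes), key=totals.__getitem__)


# 4 temporal clips x 1 spatial crop = 4 views over 3 classes (made-up scores).
views = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
]
print(ensemble_views(views))  # 1
```

Note that the first and last views alone would predict class 0; combining views is what makes the multi-view result more stable than a single random crop.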

You can also set the checkpoint path via:

TEST.CHECKPOINT_FILE_PATH your_model_path

Cite Uniformer

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{li2022uniformer,
      title={Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning}, 
      author={Kunchang Li and Yali Wang and Peng Gao and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao},
      year={2022},
      eprint={2201.04676},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This repository is built on the SlowFast repository.
