[WIP] Release code of MixFormer (CVPR2022, Oral) #1820

Open

wants to merge 9 commits into base: develop
95 changes: 95 additions & 0 deletions docs/en/models/MixFormer_en.md
@@ -0,0 +1,95 @@
# MixFormer
---
## Catalogue

- [1. Introduction](#1)
- [2. Main Results](#2)
- [2.1 Results on ImageNet-1K](#2.1)
- [2.2 Results on MS COCO with Mask R-CNN](#2.2)
- [2.3 Results on ADE20K with UperNet](#2.3)
- [2.4 Results on MS COCO for Keypoint Detection](#2.4)
- [2.5 Results on LVIS 1.0 with Mask R-CNN](#2.5)
- [3. Reference](#3)

<a name='1'></a>
## 1. Introduction

MixFormer is an efficient, general-purpose hybrid vision transformer built around two main designs: (1) combining local-window self-attention and depth-wise convolution in a parallel design, and (2) bi-directional interactions across the two branches that provide complementary clues in the channel and spatial dimensions. Together, these designs achieve efficient feature mixing across windows and dimensions. MixFormer outperforms other vision transformer variants on image classification and five dense prediction tasks.

> [**MixFormer: Mixing Features across Windows and Dimensions**](https://arxiv.org/abs/2204.02557)<br>
> Qiang Chen, Qiman Wu, Jian Wang, Qinghao Hu, Tao Hu, Errui Ding, Jian Cheng, Jingdong Wang<br>
> CVPR2022, **Oral** presentation

![image](../../images/MixFormer/MixingBlock.png)
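
The figure above shows the Mixing Block. For intuition only, below is a heavily simplified Paddle sketch of the parallel design with bi-directional interactions. Every name is illustrative, and a 1x1 convolution stands in for the real local-window self-attention, so this is a sketch of the idea, not the implementation added by this PR.

```python
import paddle
import paddle.nn as nn

class MixingBlockSketch(nn.Layer):
    """Illustrative only: parallel branches plus bi-directional interactions.
    A 1x1 conv replaces the actual windowed multi-head self-attention."""

    def __init__(self, dim):
        super().__init__()
        self.attn_branch = nn.Conv2D(dim, dim, 1)  # placeholder for local-window self-attention
        self.conv_branch = nn.Conv2D(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv branch
        # channel interaction: conv features gate the attention branch channel-wise
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2D(1), nn.Conv2D(dim, dim, 1), nn.Sigmoid())
        # spatial interaction: attention features gate the conv branch per pixel
        self.spatial_gate = nn.Sequential(nn.Conv2D(dim, 1, 1), nn.Sigmoid())
        self.proj = nn.Conv2D(2 * dim, dim, 1)  # fuse the two branches

    def forward(self, x):  # x: [N, C, H, W]
        conv_out = self.conv_branch(x)
        # complementary clue in the channel dimension, conv -> attention
        attn_out = self.attn_branch(x * self.channel_gate(conv_out))
        # complementary clue in the spatial dimension, attention -> conv
        conv_out = conv_out * self.spatial_gate(attn_out)
        return self.proj(paddle.concat([attn_out, conv_out], axis=1))

block = MixingBlockSketch(32)
y = block(paddle.randn([1, 32, 56, 56]))
print(y.shape)  # [1, 32, 56, 56]
```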

<a name='2'></a>
## 2. Main Results

<a name='2.1'></a>
### 2.1 Results on ImageNet-1K
We report the top-1 accuracy and FLOPs of MixFormer on ImageNet-1K. Unlike other vision transformer variants, which only show promising results at fairly large FLOPs (e.g., 4.5G), MixFormer achieves favorable results even **at small model sizes (FLOPs < 1G)**, which is nontrivial.

| Models | Top1 | FLOPs (G) |
|:--:|:--:|:--:|
| MixFormer-B0 | 0.765 | 0.4 |
| MixFormer-B1 | 0.789 | 0.7 |
| MixFormer-B2 | 0.800 | 0.9 |
| MixFormer-B3 | 0.817 | 1.9 |
| MixFormer-B4 | 0.830 | 3.6 |
| MixFormer-B5 | 0.835 | 6.8 |
| MixFormer-B6 | 0.838 | 12.7 |
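
To sanity-check the FLOPs column, Paddle's built-in profiler can be run against the backbone this PR registers. This is a hedged sketch: the reported count depends on which ops the profiler includes, so it may not match the table exactly.

```python
import paddle
from ppcls.arch.backbone import MixFormer_B0  # registered later in this PR

model = MixFormer_B0()
# paddle.flops profiles a dummy forward pass at the given input size
flops = paddle.flops(model, [1, 3, 224, 224], print_detail=False)
print(f"FLOPs: {flops / 1e9:.2f} G")  # the table reports ~0.4 G for MixFormer-B0
```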

<a name='2.2'></a>
### 2.2 Results on MS COCO with Mask R-CNN
All models are trained with a multi-scale 3x schedule.

| Backbone | Params (M) | FLOPs (G) | schedule | mAP_box| mAP_mask |
|:--:|:--:|:--:|:--:|:--:| :--:|
| Swin-T | 48 | 264 | 3x | 46.0 | 41.6 |
| MixFormer-B1 | 26 | 183 | 3x | 43.9 | 40.0 |
| MixFormer-B2 | 28 | 187 | 3x | 45.1 | 40.8 |
| MixFormer-B3 | 35 | 207 | 3x | 46.2 | 41.9 |
| MixFormer-B4 | 53 | 243 | 3x | **47.6** | **43.0** |

<a name='2.3'></a>
### 2.3 Results on ADE20K with UperNet

| Backbone | Params (M) | FLOPs (G) | iterations | mIoU_ss | mIoU_ms |
|:--:|:--:|:--:|:--:|:--:| :--:|
| Swin-T | 60 | 945 | 160k | 44.5 | 45.8 |
| MixFormer-B1 | 35 | 854 | 160k | 42.0 | 43.5 |
| MixFormer-B2 | 37 | 859 | 160k | 43.1 | 43.9 |
| MixFormer-B3 | 44 | 880 | 160k | 44.5 | 45.5 |
| MixFormer-B4 | 63 | 918 | 160k | **46.8** | **48.0** |

<a name='2.4'></a>
### 2.4 Results on MS COCO for Keypoint Detection

| Backbone | mAP | mAP_50 | mAP_75 |
|:--:|:--:|:--:|:--:|
| ResNet50 | 71.8 | 89.8 | 79.5 |
| Swin-T | 74.2 | 92.5 | 82.5 |
| HRFormer-S | 74.5 | 92.3 | 82.1 |
| MixFormer-B4 | **75.3** | **93.5** | **83.5** |

<a name='2.5'></a>
### 2.5 Results on LVIS 1.0 with Mask R-CNN

| Backbone | mAP_mask | mAP_mask_50 | mAP_mask_75 |
|:--:|:--:|:--:|:--:|
| ResNet50 | 21.7 | 34.3 | 23.0 |
| Swin-T | 27.6 | 43.0 | 29.3 |
| MixFormer-B4 | **28.6** | **43.4** | **30.5** |

<a name="3"></a>
## 3. Reference

If you find MixFormer helpful, please consider citing:
```
@inproceedings{chen2022mixformer,
title={MixFormer: Mixing Features across Windows and Dimensions},
author={Chen, Qiang and Wu, Qiman and Wang, Jian and Hu, Qinghao and Hu, Tao and Ding, Errui and Cheng, Jian and Wang, Jingdong},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
year={2022}
}
```
1 change: 1 addition & 0 deletions docs/en/models/index.rst
@@ -28,3 +28,4 @@ models
MixNet_en.md
Twins_en.md
PVTV2_en.md
MixFormer_en.md
Binary file added docs/images/MixFormer/MixingBlock.png
98 changes: 98 additions & 0 deletions docs/zh_CN/models/ImageNet1k/MixFormer.md
@@ -0,0 +1,98 @@
# MixFormer
---
## Catalogue

- [1. Introduction](#1)
- [2. Main Results](#2)
- [2.1 Image Classification](#2.1)
- [2.2 Object Detection and Instance Segmentation](#2.2)
- [2.3 Semantic Segmentation](#2.3)
- [2.4 Human Keypoint Detection](#2.4)
- [2.5 Long-tailed Instance Segmentation](#2.5)
- [3. Reference](#3)

<a name='1'></a>
## 1. Introduction

MixFormer is an efficient, general-purpose vision transformer backbone. It features two novel designs: (1) a parallel-branch design that combines local-window self-attention with depth-wise convolution, addressing the limited receptive field of local-window self-attention; (2) bi-directional interaction modules between the parallel branches, so that the two branches exchange complementary information along both the channel and spatial dimensions, strengthening the overall modeling capacity. Together, these designs allow MixFormer to mix feature information across different local windows and dimensions, yielding better results than other vision transformer backbones on image classification and five important downstream tasks.

> [**MixFormer: Mixing Features across Windows and Dimensions**](https://arxiv.org/abs/2204.02557)<br>
> Qiang Chen, Qiman Wu, Jian Wang, Qinghao Hu, Tao Hu, Errui Ding, Jian Cheng, Jingdong Wang<br>
> CVPR2022, **Oral** presentation

![image](../../../images/MixFormer/MixingBlock.png)

<a name='2'></a>
## 2. Main Results

<a name='2.1'></a>
### 2.1 Image Classification
We report the accuracy of MixFormer on ImageNet-1K. Unlike other vision transformers, which are typically effective only at larger model sizes (e.g., 4.5G FLOPs), MixFormer performs well even with small models.

| Models | Top1 | FLOPs (G) |
|:--:|:--:|:--:|
| MixFormer-B0 | 0.765 | 0.4 |
| MixFormer-B1 | 0.789 | 0.7 |
| MixFormer-B2 | 0.800 | 0.9 |
| MixFormer-B3 | 0.817 | 1.9 |
| MixFormer-B4 | 0.830 | 3.6 |
| MixFormer-B5 | 0.835 | 6.8 |
| MixFormer-B6 | 0.838 | 12.7 |


<a name='2.2'></a>
### 2.2 Object Detection and Instance Segmentation
All results in the table below use Mask R-CNN as the base model and are trained with a multi-scale 3x schedule.

| Backbone | Params (M) | FLOPs (G) | schedule | mAP_box| mAP_mask |
|:--:|:--:|:--:|:--:|:--:| :--:|
| Swin-T | 48 | 264 | 3x | 46.0 | 41.6 |
| MixFormer-B1 | 26 | 183 | 3x | 43.9 | 40.0 |
| MixFormer-B2 | 28 | 187 | 3x | 45.1 | 40.8 |
| MixFormer-B3 | 35 | 207 | 3x | 46.2 | 41.9 |
| MixFormer-B4 | 53 | 243 | 3x | **47.6** | **43.0** |

<a name='2.3'></a>
### 2.3 Semantic Segmentation
All results in the table below use UperNet as the base model.

| Backbone | Params (M) | FLOPs (G) | iterations | mIoU_ss | mIoU_ms |
|:--:|:--:|:--:|:--:|:--:| :--:|
| Swin-T | 60 | 945 | 160k | 44.5 | 45.8 |
| MixFormer-B1 | 35 | 854 | 160k | 42.0 | 43.5 |
| MixFormer-B2 | 37 | 859 | 160k | 43.1 | 43.9 |
| MixFormer-B3 | 44 | 880 | 160k | 44.5 | 45.5 |
| MixFormer-B4 | 63 | 918 | 160k | **46.8** | **48.0** |

<a name='2.4'></a>
### 2.4 Human Keypoint Detection

| Backbone | mAP | mAP_50 | mAP_75 |
|:--:|:--:|:--:|:--:|
| ResNet50 | 71.8 | 89.8 | 79.5 |
| Swin-T | 74.2 | 92.5 | 82.5 |
| HRFormer-S | 74.5 | 92.3 | 82.1 |
| MixFormer-B4 | **75.3** | **93.5** | **83.5** |

<a name='2.5'></a>
### 2.5 Long-tailed Instance Segmentation

| Backbone | mAP_mask | mAP_mask_50 | mAP_mask_75 |
|:--:|:--:|:--:|:--:|
| ResNet50 | 21.7 | 34.3 | 23.0 |
| Swin-T | 27.6 | 43.0 | 29.3 |
| MixFormer-B4 | **28.6** | **43.4** | **30.5** |

<a name="3"></a>
## 3. Reference

If you find MixFormer helpful, please consider citing:
```
@inproceedings{chen2022mixformer,
title={MixFormer: Mixing Features across Windows and Dimensions},
author={Chen, Qiang and Wu, Qiman and Wang, Jian and Hu, Qinghao and Hu, Tao and Ding, Errui and Cheng, Jian and Wang, Jingdong},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
year={2022}
}
```
1 change: 1 addition & 0 deletions ppcls/arch/backbone/__init__.py
@@ -65,6 +65,7 @@
from .model_zoo.cspnet import CSPDarkNet53
from .model_zoo.pvt_v2 import PVT_V2_B0, PVT_V2_B1, PVT_V2_B2_Linear, PVT_V2_B2, PVT_V2_B3, PVT_V2_B4, PVT_V2_B5
from .model_zoo.mobilevit import MobileViT_XXS, MobileViT_XS, MobileViT_S
from .model_zoo.mixformer import MixFormer_B0, MixFormer_B1, MixFormer_B2, MixFormer_B3, MixFormer_B4, MixFormer_B5, MixFormer_B6
from .model_zoo.repvgg import RepVGG_A0, RepVGG_A1, RepVGG_A2, RepVGG_B0, RepVGG_B1, RepVGG_B2, RepVGG_B1g2, RepVGG_B1g4, RepVGG_B2g4, RepVGG_B3g4
from .model_zoo.van import VAN_tiny
from .model_zoo.peleenet import PeleeNet
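
With this registration in place, the new variants should be importable like any other PaddleClas backbone. A hypothetical usage sketch follows; the `class_num` argument is an assumption based on the convention of other PaddleClas backbones.

```python
import paddle
from ppcls.arch.backbone import MixFormer_B0

# build the smallest variant and run a dummy forward pass
model = MixFormer_B0(class_num=1000)  # class_num assumed, following other PaddleClas backbones
x = paddle.randn([1, 3, 224, 224])
logits = model(x)
print(logits.shape)  # expected: [1, 1000]
```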