QingbeiGuo/HybridFormer

Introduction

HybridFormer is composed of two main building blocks: a local convolution block (LCB) and a global transformer block (GTB). The proposed HybridFormer integrates the merits of improved convolution and self-attention to balance redundancy and dependency for effective and efficient representation learning. We evaluate HybridFormer through extensive experiments, demonstrating that it achieves state-of-the-art (SOTA) performance on numerous vision tasks, including image classification, object detection, and semantic segmentation.

Without any extra training data or labels, our HybridFormer achieves a top-1 accuracy of 84.6% on the ImageNet-1K image classification task with ∼11G FLOPs. With only ImageNet-1K pre-training, on downstream tasks it obtains 48.9 box AP and 44.1 mask AP on the COCO object detection task, +1.5 box AP and +1.0 mask AP higher than UniFormer with ∼16% fewer parameters, and 48.9 mIoU on the ADE20K semantic segmentation task, about +1 mIoU higher than UniFormer with ∼25% fewer FLOPs.
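Since the code has not been released yet, the following is only a minimal PyTorch-style sketch of the LCB + GTB composition described above. The module names (LocalConvBlock, GlobalTransformerBlock), the depthwise-conv/MLP and MHSA/MLP layouts, and the stage arrangement are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the local-conv + global-transformer pattern described above.
# All module names, layer choices, and hyperparameters are assumptions;
# the official HybridFormer code has not been released.
import torch
import torch.nn as nn


class LocalConvBlock(nn.Module):
    """LCB sketch: depthwise convolution for local, redundancy-aware mixing."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):                         # x: (B, C, H, W)
        x = x + self.dwconv(x)
        x = x + self.mlp(self.norm(x))
        return x


class GlobalTransformerBlock(nn.Module):
    """GTB sketch: multi-head self-attention for long-range dependency."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        q = self.norm1(t)
        t = t + self.attn(q, q, q, need_weights=False)[0]
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(B, C, H, W)


# Stacking a local block before a global block inside one stage, as the
# high-level description suggests; the actual stage layout may differ.
stage = nn.Sequential(LocalConvBlock(64), GlobalTransformerBlock(64))
out = stage(torch.randn(1, 64, 56, 56))
print(out.shape)                                  # torch.Size([1, 64, 56, 56])
```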

General Framework

Main results on Image Classification

We perform image classification experiments on the ImageNet-1K dataset. We compare our HybridFormer with recent state-of-the-art methods, as shown in the table, where models are grouped by computational cost (FLOPs).

Model #Params(M) FLOPs(G) Top-1(%) Reference
LIT-S 27 4.1 81.5 AAAI22
CrossFormer-S 30.7 4.9 82.5 ICLR22
iFormer-S 20 4.8 83.4 NeurIPS22
CETNet-T 23 4.3 82.7 ECCV22
DaViT-Tiny 28.3 4.5 82.8 ECCV22
ScalableViT-S 32 4.2 83.1 ECCV22
MixFormer-B4 35 3.6 83.0 CVPR22
DAT-T 29 4.6 82.0 CVPR22
MViTv2-T 24 4.7 82.3 CVPR22
NAT-T 28 4.3 83.2 CVPR23
UniFormer-S 22 3.6 82.9 TPAMI23
HybridFormer-S (ours) 21.6 4.3 83.4
RegionViT-M 41.2 7.4 83.1 ICLR22
CETNet-S 34 6.8 83.4 ECCV22
MOAT-0 27.8 5.7 83.3 ICLR23
MViTv2-S 35 7.0 83.6 CVPR22
NAT-S 51 7.8 83.7 CVPR23
PaCa-Small 22.0 5.5 83.1 CVPR23
InternImage-T 30 5.0 83.5 CVPR23
HybridFormer-B (ours) 29.9 6.2 83.8
LIT-M 48 8.6 83.0 AAAI22
CrossFormer-B 52.0 9.2 83.4 ICLR22
DaViT-Small 49.7 8.8 84.2 ECCV22
ScalableViT-B 81 8.6 84.1 ECCV22
DAT-S 50 9.0 83.7 CVPR22
MOAT-1 41.6 9.1 84.2 ICLR23
PaCa-Base 46.9 9.5 84.0 CVPR23
InternImage-S 50 8.0 84.2 CVPR23
UniFormer-B 50 8.3 83.9 TPAMI23
HybridFormer-L (ours) 38.3 8.0 84.2
LIT-B 86 15.0 83.4 AAAI22
RegionViT-B 72.7 13.0 83.2 ICLR22
CrossFormer-L 92.0 16.1 84.0 ICLR22
CETNet-B 75 15.1 83.8 ECCV22
DaViT-Base 87.9 15.5 84.6 ECCV22
ScalableViT-L 104 14.7 84.4 ECCV22
MViTv2-B 52 10.2 84.4 CVPR22
DAT-B 88 15.8 84.0 CVPR22
NAT-B 90 13.7 84.3 CVPR23
HybridFormer-H (ours) 55.2 11.6 84.6

Main results on Object Detection and Instance Segmentation

We evaluate the proposed models on object detection and instance segmentation on the COCO 2017 dataset. The table reports the comparison results for both tasks using the Mask R-CNN framework with the 1× training schedule.

Mask R-CNN 1× schedule
Method #Params(M) FLOPs(G) APb APb50 APb75 APm APm50 APm75
DAT-T 48 272 44.4 67.6 48.5 40.4 64.2 43.1
PVT-M 64 302 42.0 64.4 45.6 39.0 61.6 42.1
CETNet-T 43 261 45.5 67.7 50.0 40.7 64.4 43.7
ScalableViT-S 46 256 45.8 67.6 50.0 41.7 64.7 44.8
CrossFormer-S 50 301 45.4 68.0 49.7 41.4 64.8 44.6
MixFormer-B3 35 207 42.8 64.5 46.7 39.3 61.8 42.2
PaCa-Small 42 296 46.4 68.7 50.9 41.8 65.5 45.0
UniFormer-Sh14 41 269 45.6 68.1 49.7 41.6 64.8 45.0
HybridFormer-S (ours) 41 309 46.5 68.8 50.8 42.4 65.5 45.7
DAT-S 69 378 47.1 69.9 51.5 42.5 66.7 45.4
PVT-L 81 364 42.9 65.0 46.6 39.5 61.9 42.5
CETNet-S 53 315 46.6 68.7 51.4 41.6 65.4 44.8
ScalableViT-B 95 349 46.8 68.7 51.5 42.5 65.8 45.9
CrossFormer-B 72 408 47.2 69.9 51.8 42.7 66.6 46.2
MixFormer-B4 53 243 45.1 67.1 49.2 41.2 64.3 44.1
PaCa-Base 67 373 48.0 69.7 52.1 42.9 66.6 45.6
UniFormer-Bh14 69 399 47.4 69.7 52.1 43.1 66.0 46.5
HybridFormer-L (ours) 58 433 48.9 71.0 53.7 44.1 68.0 47.8

Main results on Semantic Segmentation

We also evaluate our models on semantic segmentation on the ADE20K dataset. The table reports the results of different backbones under the Semantic FPN framework.

Semantic FPN
Method #Params(M) FLOPs(G) mIoU(%)
DAT-T 32 198 42.6
PVT-M 48 219 41.6
ScalableViT-S 30 174 44.9
CrossFormer-S 34 221 46.0
DAT-S 53 320 46.1
UniFormer-S 25 247 46.6
HybridFormer-S (ours) 25 229 47.6
DAT-S 53 320 46.1
PVT-L 65 283 42.1
ScalableViT-B 79 270 48.4
CrossFormer-B 56 331 47.7
DAT-B 92 481 47.0
UniFormer-B 54 471 48.0
HybridFormer-L (ours) 42 358 48.9

Visualization

Visualization of Attention Map

To understand how CHSA works, we visualize attention maps at the last block of the 3rd stage and compare them before and after the across-head interaction, as shown in Fig. 3.

Class Activation Mapping

To further demonstrate the effectiveness of our approach, we apply Grad-CAM to visualize discriminative regions generated by ResNet50, ConvNeXt-B, MViTv2-B, UniFormer-B, and our HybridFormer-B, as shown in Fig. 4.
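Grad-CAM itself does not depend on the unreleased HybridFormer code, so below is a self-contained sketch using torchvision's ResNet50 (one of the compared backbones). The hook placement on `layer4` is specific to ResNet50; for another model, the hooks would need to target its last feature block. This is an illustrative sketch, not the exact script used for Fig. 4.

```python
# Self-contained Grad-CAM sketch on torchvision's ResNet50.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
target_layer = model.layer4                       # last convolutional stage

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                   # stand-in for a preprocessed image
logits = model(x)
logits[0, logits.argmax()].backward()             # gradient of the top-1 class score

# Grad-CAM: weight each feature map by its spatially pooled gradient, then ReLU.
weights = grads["a"].mean(dim=(2, 3), keepdim=True)        # (1, C, 1, 1)
cam = F.relu((weights * feats["a"]).sum(dim=1))            # (1, h, w)
cam = F.interpolate(cam[None], size=x.shape[-2:],
                    mode="bilinear", align_corners=False)[0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1] for overlay
```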

Qualitative Visualization

In Fig. 5, we also conduct qualitative visualization on validation datasets for downstream tasks, including object detection and semantic segmentation.

Authorship

This project is contributed by Qingbei Guo.

Note: Code coming soon

Citation
