HybridFormer is composed of two main building blocks: a local convolution block (LCB) and a global transformer block (GTB). The proposed HybridFormer integrates the merits of improved convolution and self-attention to balance redundancy and dependency for effective and efficient representation learning. We evaluate HybridFormer through extensive experiments, demonstrating that it achieves state-of-the-art (SOTA) performance on numerous vision tasks, including image classification, object detection, and semantic segmentation.
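As a rough illustration of the LCB/GTB split, the numpy sketch below interleaves a depthwise-style local convolution over tokens (local redundancy) with a global self-attention step (long-range dependency). This is a minimal toy sketch, not the actual HybridFormer implementation; all function names, shapes, and the fixed kernel are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_conv_block(x, kernel):
    # Toy LCB: depthwise 1D convolution over the token axis, so each token
    # mixes only with a small neighborhood (local redundancy).
    n, _ = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(n):
        out[i] = sum(kernel[j] * xp[i + j] for j in range(k))
    return out

def global_transformer_block(x, wq, wk, wv):
    # Toy GTB: vanilla global self-attention, so every token attends to
    # every other token (long-range dependency).
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))            # 16 tokens, 8 channels
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))

y = local_conv_block(tokens, kernel=[0.25, 0.5, 0.25])  # LCB: local mixing
z = global_transformer_block(y, wq, wk, wv)             # GTB: global mixing
print(z.shape)  # (16, 8)
```

Stacking the two blocks in this order (local first, then global) mirrors the common hybrid design where cheap convolution filters redundancy before attention models dependency.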
Without any extra training data or labels, our HybridFormer achieves a top-1 accuracy of 84.6% on the ImageNet-1K image classification task with ∼11G FLOPs. With only ImageNet-1K pre-training, on downstream tasks it obtains 48.9 box AP and 44.1 mask AP on the COCO object detection task (+1.5 box AP and +1.0 mask AP over UniFormer with ∼16% fewer parameters), and 48.9 mIoU on the ADE20K semantic segmentation task (about +1 mIoU over UniFormer with ∼25% fewer FLOPs).
We perform image classification experiments on the ImageNet-1K dataset. We compare our HybridFormer with recent state-of-the-art methods, as shown in the table below, where models are grouped by computational cost (FLOPs).
Model | #Params(M) | FLOPs(G) | Top-1(%) | Reference |
---|---|---|---|---|
LIT-S | 27 | 4.1 | 81.5 | AAAI22 |
CrossFormer-S | 30.7 | 4.9 | 82.5 | ICLR22 |
iFormer-S | 20 | 4.8 | 83.4 | NeurIPS22 |
CETNet-T | 23 | 4.3 | 82.7 | ECCV22 |
DaViT-Tiny | 28.3 | 4.5 | 82.8 | ECCV22 |
ScalableViT-S | 32 | 4.2 | 83.1 | ECCV22 |
MixFormer-B4 | 35 | 3.6 | 83.0 | CVPR22 |
DAT-T | 29 | 4.6 | 82.0 | CVPR22 |
MViTv2-T | 24 | 4.7 | 82.3 | CVPR22 |
NAT-T | 28 | 4.3 | 83.2 | CVPR23 |
UniFormer-S | 22 | 3.6 | 82.9 | TPAMI23 |
HybridFormer-S (ours) | 21.6 | 4.3 | 83.4 | |
RegionViT-M | 41.2 | 7.4 | 83.1 | ICLR22 |
CETNet-S | 34 | 6.8 | 83.4 | ECCV22 |
MOAT-0 | 27.8 | 5.7 | 83.3 | ICLR23 |
MViTv2-S | 35 | 7.0 | 83.6 | CVPR22 |
NAT-S | 51 | 7.8 | 83.7 | CVPR23 |
PaCa-Small | 22.0 | 5.5 | 83.1 | CVPR23 |
InternImage-T | 30 | 5.0 | 83.5 | CVPR23 |
HybridFormer-B (ours) | 29.9 | 6.2 | 83.8 | |
LIT-M | 48 | 8.6 | 83.0 | AAAI22 |
CrossFormer-B | 52.0 | 9.2 | 83.4 | ICLR22 |
DaViT-Small | 49.7 | 8.8 | 84.2 | ECCV22 |
ScalableViT-B | 81 | 8.6 | 84.1 | ECCV22 |
DAT-S | 50 | 9.0 | 83.7 | CVPR22 |
MOAT-1 | 41.6 | 9.1 | 84.2 | ICLR23 |
PaCa-Base | 46.9 | 9.5 | 84.0 | CVPR23 |
InternImage-S | 50 | 8.0 | 84.2 | CVPR23 |
UniFormer-B | 50 | 8.3 | 83.9 | TPAMI23 |
HybridFormer-L (ours) | 38.3 | 8.0 | 84.2 | |
LIT-B | 86 | 15.0 | 83.4 | AAAI22 |
RegionViT-B | 72.7 | 13.0 | 83.2 | ICLR22 |
CrossFormer-L | 92.0 | 16.1 | 84.0 | ICLR22 |
CETNet-B | 75 | 15.1 | 83.8 | ECCV22 |
DaViT-Base | 87.9 | 15.5 | 84.6 | ECCV22 |
ScalableViT-L | 104 | 14.7 | 84.4 | ECCV22 |
MViTv2-B | 52 | 10.2 | 84.4 | CVPR22 |
DAT-B | 88 | 15.8 | 84.0 | CVPR22 |
NAT-B | 90 | 13.7 | 84.3 | CVPR23 |
HybridFormer-H (ours) | 55.2 | 11.6 | 84.6 | |
We evaluate the proposed models on object detection and instance segmentation with the COCO2017 dataset. We report comparison results for both tasks in the table below.
All detectors use Mask R-CNN with the 1× training schedule; AP^b and AP^m denote box AP and mask AP, respectively.
Method | #Params(M) | FLOPs(G) | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 |
---|---|---|---|---|---|---|---|---|
DAT-T | 48 | 272 | 44.4 | 67.6 | 48.5 | 40.4 | 64.2 | 43.1 |
PVT-M | 64 | 302 | 42.0 | 64.4 | 45.6 | 39.0 | 61.6 | 42.1 |
CETNet-T | 43 | 261 | 45.5 | 67.7 | 50.0 | 40.7 | 64.4 | 43.7 |
ScalableViT-S | 46 | 256 | 45.8 | 67.6 | 50.0 | 41.7 | 64.7 | 44.8 |
CrossFormer-S | 50 | 301 | 45.4 | 68.0 | 49.7 | 41.4 | 64.8 | 44.6 |
MixFormer-B3 | 35 | 207 | 42.8 | 64.5 | 46.7 | 39.3 | 61.8 | 42.2 |
PaCa-Small | 42 | 296 | 46.4 | 68.7 | 50.9 | 41.8 | 65.5 | 45.0 |
UniFormer-Sh14 | 41 | 269 | 45.6 | 68.1 | 49.7 | 41.6 | 64.8 | 45.0 |
HybridFormer-S (ours) | 41 | 309 | 46.5 | 68.8 | 50.8 | 42.4 | 65.5 | 45.7 |
DAT-S | 69 | 378 | 47.1 | 69.9 | 51.5 | 42.5 | 66.7 | 45.4 |
PVT-L | 81 | 364 | 42.9 | 65.0 | 46.6 | 39.5 | 61.9 | 42.5 |
CETNet-S | 53 | 315 | 46.6 | 68.7 | 51.4 | 41.6 | 65.4 | 44.8 |
ScalableViT-B | 95 | 349 | 46.8 | 68.7 | 51.5 | 42.5 | 65.8 | 45.9 |
CrossFormer-B | 72 | 408 | 47.2 | 69.9 | 51.8 | 42.7 | 66.6 | 46.2 |
MixFormer-B4 | 53 | 243 | 45.1 | 67.1 | 49.2 | 41.2 | 64.3 | 44.1 |
PaCa-Base | 67 | 373 | 48.0 | 69.7 | 52.1 | 42.9 | 66.6 | 45.6 |
UniFormer-Bh14 | 69 | 399 | 47.4 | 69.7 | 52.1 | 43.1 | 66.0 | 46.5 |
HybridFormer-L (ours) | 58 | 433 | 48.9 | 71.0 | 53.7 | 44.1 | 68.0 | 47.8 |
We also evaluate our models on semantic segmentation with the ADE20K dataset. The table below reports the results.
All models use the Semantic FPN framework.
Method | #Params(M) | FLOPs(G) | mIoU(%) |
---|---|---|---|
DAT-T | 32 | 198 | 42.6 |
PVT-M | 48 | 219 | 41.6 |
ScalableViT-S | 30 | 174 | 44.9 |
CrossFormer-S | 34 | 221 | 46.0 |
UniFormer-S | 25 | 247 | 46.6 |
HybridFormer-S (ours) | 25 | 229 | 47.6 |
DAT-S | 53 | 320 | 46.1 |
PVT-L | 65 | 283 | 42.1 |
ScalableViT-B | 79 | 270 | 48.4 |
CrossFormer-B | 56 | 331 | 47.7 |
DAT-B | 92 | 481 | 47.0 |
UniFormer-B | 54 | 471 | 48.0 |
HybridFormer-L (ours) | 42 | 358 | 48.9 |
To understand how CHSA works, we visualize attention maps at the last block of the 3rd stage and compare them before and after the across-head interaction, as shown in Fig. 3.
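The numpy sketch below illustrates one plausible form of across-head interaction: per-head attention maps are computed as usual, then mixed along the head axis with a (heads × heads) projection so each head can borrow context from the others. The exact CHSA formulation is not specified here, so the mixing matrix and all function names are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_maps(x, wq, wk, num_heads):
    # Per-head attention maps, shape (heads, tokens, tokens).
    n, c = x.shape
    d = c // num_heads
    q = (x @ wq).reshape(n, num_heads, d).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, num_heads, d).transpose(1, 0, 2)
    return softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))

def across_head_interaction(attn, mix):
    # Mix attention maps across the head axis: mixed[g] = sum_h mix[g, h] * attn[h].
    # With row-stochastic `mix`, each mixed map stays a valid attention map.
    return np.einsum("gh,hij->gij", mix, attn)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))                       # 6 tokens, 8 channels
wq = rng.standard_normal((8, 8)) * 0.1
wk = rng.standard_normal((8, 8)) * 0.1
attn = multi_head_attention_maps(x, wq, wk, num_heads=4)
mix = softmax(rng.standard_normal((4, 4)))            # stand-in for a learned mixing
mixed = across_head_interaction(attn, mix)
print(mixed.shape)  # (4, 6, 6)
```

Because each mixed map is a convex combination of row-stochastic attention maps, its rows still sum to one, so the interaction redistributes attention across heads without breaking normalization.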
To further demonstrate the effectiveness of our approach, we apply Grad-CAM to visualize discriminative regions generated by ResNet50, ConvNeXt-B, MViTv2-B, UniFormer-B, and our HybridFormer-B, as shown in Fig. 4.
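For reference, Grad-CAM weights each feature map of a chosen layer by its spatially averaged gradient with respect to the class score, sums over channels, applies ReLU, and normalizes. The numpy sketch below shows that computation on stand-in activations and gradients (in practice these come from a backward pass through the network); the shapes and variable names are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    # Grad-CAM: per-channel weights are the spatial mean of the gradients;
    # the heatmap is the ReLU of the weighted channel sum, scaled to [0, 1].
    weights = gradients.mean(axis=(1, 2))                       # (C,)
    cam = np.tensordot(weights, activations, axes=([0], [0]))   # (H, W)
    cam = np.maximum(cam, 0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 7, 7))   # last-stage feature maps (C, H, W)
grads = rng.standard_normal((64, 7, 7))  # d(class score) / d(activations)
cam = grad_cam(acts, grads)
print(cam.shape)  # (7, 7)
```

The resulting low-resolution heatmap is upsampled to the input size and overlaid on the image to highlight the regions most responsible for the prediction.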
In Fig. 5, we also conduct qualitative visualization on validation datasets for downstream tasks, including object detection and semantic segmentation.
This project is contributed by Qingbei Guo.
Note: Code coming soon