HybridFormer is composed of two main building blocks: a local convolution block (LCB) and a global transformer block (GTB). The proposed HybridFormer integrates the merits of improved convolution and self-attention to balance redundancy and dependency for effective and efficient representation learning. We evaluate HybridFormer through extensive experiments, demonstrating that it achieves state-of-the-art (SOTA) performance on numerous vision tasks, including image classification, object detection, and semantic segmentation.
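As a rough illustration of the LCB/GTB split, the numpy sketch below interleaves a depthwise-style local convolution over tokens (local redundancy) with a global self-attention step (long-range dependency). This is a minimal toy sketch, not the actual HybridFormer implementation; all function names, shapes, and the fixed kernel are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_conv_block(x, kernel):
    # Toy LCB: depthwise 1D convolution over the token axis, so each token
    # mixes only with a small neighborhood (local redundancy).
    n, _ = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(n):
        out[i] = sum(kernel[j] * xp[i + j] for j in range(k))
    return out

def global_transformer_block(x, wq, wk, wv):
    # Toy GTB: vanilla global self-attention, so every token attends to
    # every other token (long-range dependency).
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))            # 16 tokens, 8 channels
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))

y = local_conv_block(tokens, kernel=[0.25, 0.5, 0.25])  # LCB: local mixing
z = global_transformer_block(y, wq, wk, wv)             # GTB: global mixing
print(z.shape)  # (16, 8)
```

Stacking the two blocks in this order (local first, then global) mirrors the common hybrid design where cheap convolution filters redundancy before attention models dependency.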
Without any extra training data or labels, our HybridFormer achieves a top-1 accuracy of 84.6% on the ImageNet-1K image classification task with ∼11G FLOPs. With only ImageNet-1K pre-training, on downstream tasks it obtains 48.9 box AP and 44.1 mask AP on the COCO object detection task (+1.5 box AP and +1.0 mask AP over UniFormer with ∼16% fewer parameters), and 48.9 mIoU on the ADE20K semantic segmentation task (about +1 mIoU over UniFormer with ∼25% fewer FLOPs).
We perform image classification experiments on the ImageNet-1K dataset. We compare our HybridFormer with recent state-of-the-art methods, as shown in the table below, where models are grouped by computational cost (FLOPs).
Model | #Params(M) | FLOPs(G) | Top-1(%) | Reference |
---|---|---|---|---|
LIT-S | 27 | 4.1 | 81.5 | AAAI22 |
CrossFormer-S | 30.7 | 4.9 | 82.5 | ICLR22 |
iFormer-S | 20 | 4.8 | 83.4 | NeurIPS22 |
CETNet-T | 23 | 4.3 | 82.7 | ECCV22 |
DaViT-Tiny | 28.3 | 4.5 | 82.8 | ECCV22 |
ScalableViT-S | 32 | 4.2 | 83.1 | ECCV22 |
MixFormer-B4 | 35 | 3.6 | 83.0 | CVPR22 |
DAT-T | 29 | 4.6 | 82.0 | CVPR22 |
MViTv2-T | 24 | 4.7 | 82.3 | CVPR22 |
NAT-T | 28 | 4.3 | 83.2 | CVPR23 |
UniFormer-S | 22 | 3.6 | 82.9 | TPAMI23 |
HybridFormer-S (ours) | 21.6 | 4.3 | 83.4 | |
RegionViT-M | 41.2 | 7.4 | 83.1 | ICLR22 |
CETNet-S | 34 | 6.8 | 83.4 | ECCV22 |
MOAT-0 | 27.8 | 5.7 | 83.3 | ICLR23 |
MViTv2-S | 35 | 7.0 | 83.6 | CVPR22 |
NAT-S | 51 | 7.8 | 83.7 | CVPR23 |
PaCa-Small | 22.0 | 5.5 | 83.1 | CVPR23 |
InternImage-T | 30 | 5.0 | 83.5 | CVPR23 |
HybridFormer-B (ours) | 29.9 | 6.2 | 83.8 | |
LIT-M | 48 | 8.6 | 83.0 | AAAI22 |
CrossFormer-B | 52.0 | 9.2 | 83.4 | ICLR22 |
DaViT-Small | 49.7 | 8.8 | 84.2 | ECCV22 |
ScalableViT-B | 81 | 8.6 | 84.1 | ECCV22 |
DAT-S | 50 | 9.0 | 83.7 | CVPR22 |
MOAT-1 | 41.6 | 9.1 | 84.2 | ICLR23 |
PaCa-Base | 46.9 | 9.5 | 84.0 | CVPR23 |
InternImage-S | 50 | 8.0 | 84.2 | CVPR23 |
UniFormer-B | 50 | 8.3 | 83.9 | TPAMI23 |
HybridFormer-L (ours) | 38.3 | 8.0 | 84.2 | |
LIT-B | 86 | 15.0 | 83.4 | AAAI22 |
RegionViT-B | 72.7 | 13.0 | 83.2 | ICLR22 |
CrossFormer-L | 92.0 | 16.1 | 84.0 | ICLR22 |
CETNet-B | 75 | 15.1 | 83.8 | ECCV22 |
DaViT-Base | 87.9 | 15.5 | 84.6 | ECCV22 |
ScalableViT-L | 104 | 14.7 | 84.4 | ECCV22 |
MViTv2-B | 52 | 10.2 | 84.4 | CVPR22 |
DAT-B | 88 | 15.8 | 84.0 | CVPR22 |
NAT-B | 90 | 13.7 | 84.3 | CVPR23 |
HybridFormer-H (ours) | 55.2 | 11.6 | 84.6 | |
We evaluate the proposed models on object detection and instance segmentation with the COCO2017 dataset. We report comparison results for both tasks in the table below.
All detectors use Mask R-CNN with the 1× training schedule; AP^b and AP^m denote box AP and mask AP, respectively.
Method | #Params(M) | FLOPs(G) | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 |
---|---|---|---|---|---|---|---|---|
DAT-T | 48 | 272 | 44.4 | 67.6 | 48.5 | 40.4 | 64.2 | 43.1 |
PVT-M | 64 | 302 | 42.0 | 64.4 | 45.6 | 39.0 | 61.6 | 42.1 |
CETNet-T | 43 | 261 | 45.5 | 67.7 | 50.0 | 40.7 | 64.4 | 43.7 |
ScalableViT-S | 46 | 256 | 45.8 | 67.6 | 50.0 | 41.7 | 64.7 | 44.8 |
CrossFormer-S | 50 | 301 | 45.4 | 68.0 | 49.7 | 41.4 | 64.8 | 44.6 |
MixFormer-B3 | 35 | 207 | 42.8 | 64.5 | 46.7 | 39.3 | 61.8 | 42.2 |
PaCa-Small | 42 | 296 | 46.4 | 68.7 | 50.9 | 41.8 | 65.5 | 45.0 |
UniFormer-Sh14 | 41 | 269 | 45.6 | 68.1 | 49.7 | 41.6 | 64.8 | 45.0 |
HybridFormer-S (ours) | 41 | 309 | 46.5 | 68.8 | 50.8 | 42.4 | 65.5 | 45.7 |
DAT-S | 69 | 378 | 47.1 | 69.9 | 51.5 | 42.5 | 66.7 | 45.4 |
PVT-L | 81 | 364 | 42.9 | 65.0 | 46.6 | 39.5 | 61.9 | 42.5 |
CETNet-S | 53 | 315 | 46.6 | 68.7 | 51.4 | 41.6 | 65.4 | 44.8 |
ScalableViT-B | 95 | 349 | 46.8 | 68.7 | 51.5 | 42.5 | 65.8 | 45.9 |
CrossFormer-B | 72 | 408 | 47.2 | 69.9 | 51.8 | 42.7 | 66.6 | 46.2 |
MixFormer-B4 | 53 | 243 | 45.1 | 67.1 | 49.2 | 41.2 | 64.3 | 44.1 |
PaCa-Base | 67 | 373 | 48.0 | 69.7 | 52.1 | 42.9 | 66.6 | 45.6 |
UniFormer-Bh14 | 69 | 399 | 47.4 | 69.7 | 52.1 | 43.1 | 66.0 | 46.5 |
HybridFormer-L (ours) | 58 | 433 | 48.9 | 71.0 | 53.7 | 44.1 | 68.0 | 47.8 |
We also evaluate our models on semantic segmentation with the ADE20K dataset. The table below reports the results.
All models use the Semantic FPN framework.
Method | #Params(M) | FLOPs(G) | mIoU(%) |
---|---|---|---|
DAT-T | 32 | 198 | 42.6 |
PVT-M | 48 | 219 | 41.6 |
ScalableViT-S | 30 | 174 | 44.9 |
CrossFormer-S | 34 | 221 | 46.0 |
UniFormer-S | 25 | 247 | 46.6 |
HybridFormer-S (ours) | 25 | 229 | 47.6 |
DAT-S | 53 | 320 | 46.1 |
PVT-L | 65 | 283 | 42.1 |
ScalableViT-B | 79 | 270 | 48.4 |
CrossFormer-B | 56 | 331 | 47.7 |
DAT-B | 92 | 481 | 47.0 |
UniFormer-B | 54 | 471 | 48.0 |
HybridFormer-L (ours) | 42 | 358 | 48.9 |
To understand how CHSA works, we visualize attention maps at the last block of the 3rd stage and compare them before and after the across-head interaction, as shown in Fig. 3.
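The numpy sketch below illustrates one plausible form of across-head interaction: per-head attention maps are computed as usual, then mixed along the head axis with a (heads × heads) projection so each head can borrow context from the others. The exact CHSA formulation is not specified here, so the mixing matrix and all function names are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_maps(x, wq, wk, num_heads):
    # Per-head attention maps, shape (heads, tokens, tokens).
    n, c = x.shape
    d = c // num_heads
    q = (x @ wq).reshape(n, num_heads, d).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, num_heads, d).transpose(1, 0, 2)
    return softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))

def across_head_interaction(attn, mix):
    # Mix attention maps across the head axis: mixed[g] = sum_h mix[g, h] * attn[h].
    # With row-stochastic `mix`, each mixed map stays a valid attention map.
    return np.einsum("gh,hij->gij", mix, attn)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))                       # 6 tokens, 8 channels
wq = rng.standard_normal((8, 8)) * 0.1
wk = rng.standard_normal((8, 8)) * 0.1
attn = multi_head_attention_maps(x, wq, wk, num_heads=4)
mix = softmax(rng.standard_normal((4, 4)))            # stand-in for a learned mixing
mixed = across_head_interaction(attn, mix)
print(mixed.shape)  # (4, 6, 6)
```

Because each mixed map is a convex combination of row-stochastic attention maps, its rows still sum to one, so the interaction redistributes attention across heads without breaking normalization.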
To further demonstrate the effectiveness of our approach, we apply Grad-CAM to visualize discriminative regions generated by ResNet50, ConvNeXt-B, MViTv2-B, UniFormer-B, and our HybridFormer-B, as shown in Fig. 4.
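For reference, Grad-CAM weights each feature map of a chosen layer by its spatially averaged gradient with respect to the class score, sums over channels, applies ReLU, and normalizes. The numpy sketch below shows that computation on stand-in activations and gradients (in practice these come from a backward pass through the network); the shapes and variable names are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    # Grad-CAM: per-channel weights are the spatial mean of the gradients;
    # the heatmap is the ReLU of the weighted channel sum, scaled to [0, 1].
    weights = gradients.mean(axis=(1, 2))                       # (C,)
    cam = np.tensordot(weights, activations, axes=([0], [0]))   # (H, W)
    cam = np.maximum(cam, 0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 7, 7))   # last-stage feature maps (C, H, W)
grads = rng.standard_normal((64, 7, 7))  # d(class score) / d(activations)
cam = grad_cam(acts, grads)
print(cam.shape)  # (7, 7)
```

The resulting low-resolution heatmap is upsampled to the input size and overlaid on the image to highlight the regions most responsible for the prediction.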
In Fig. 5, we also conduct qualitative visualization on validation datasets for downstream tasks, including object detection and semantic segmentation.
This project is contributed by Qingbei Guo.
Note: Code coming soon