We provide a collection of models trained with semantic softmax on the ImageNet-21K-P dataset. All results are reported at an input resolution of 224.
For a fair comparison between the models, we also provide several throughput metrics.
| Backbone | ImageNet-21K-P semantic top-1 accuracy [%] | ImageNet-1K top-1 accuracy [%] | Maximal batch size | Maximal training speed (img/sec) | Maximal inference speed (img/sec) |
|---|---|---|---|---|---|
| MobilenetV3_large_100 | 73.1 | 78.0 | 488 | 1210 | 5980 |
| OFA_flops_595m_s | 75.0 | 81.0 | 288 | 500 | 3240 |
| ResNet50 | 75.6 | 82.0 | 320 | 720 | 2760 |
| Mixer-B-16 | 76.3 | 82.3 | 160 | 420 | 1420 |
| TResNet-M | 76.4 | 83.1 | 520 | 670 | 2970 |
| TResNet-L (V2) | 76.7 | 83.9 | 240 | 300 | 1460 |
| ViT-B-16 | 77.6 | 84.4 | 160 | 340 | 1140 |
To initialize the different models and properly load the weights, use this file.
Use the following model names (`--model_name`): `tresnet_m`, `tresnet_l`, `ofa_flops_595m_s`, `resnet50`, `vit_base_patch16_224`, `mobilenetv3_large_100`.
Notes
- Maximal training and inference speeds were measured on an NVIDIA V100 GPU, at 90% of the maximal batch size.
- The ViT model benefits greatly from O2 mixed-precision training and inference; O1 mixed-precision (torch.autocast) speeds are lower.
- We are still optimising the ViT hyperparameters for ImageNet-1K training, so its accuracy will likely improve in the future.
- Our ofa_flops_595m model is slightly different from the original model: we converted all hard-sigmoids to regular sigmoids, since they are faster on both CPU and GPU and give better scores. Hence we renamed the model 'ofa_flops_595m_s'.
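The hard-sigmoid-to-sigmoid conversion in the last note can be sketched as a recursive module swap. This is a hypothetical illustration of the idea, not the repo's actual conversion code; `replace_hard_sigmoids` is a name we introduce here:

```python
import torch.nn as nn

def replace_hard_sigmoids(module: nn.Module) -> None:
    """Recursively swap every nn.Hardsigmoid in a model for nn.Sigmoid,
    mirroring the conversion described for ofa_flops_595m_s."""
    for name, child in module.named_children():
        if isinstance(child, nn.Hardsigmoid):
            setattr(module, name, nn.Sigmoid())
        else:
            replace_hard_sigmoids(child)

# Minimal usage example on a toy model containing a hard-sigmoid.
toy = nn.Sequential(nn.Linear(4, 4), nn.Hardsigmoid())
replace_hard_sigmoids(toy)
# toy[1] is now nn.Sigmoid()
```

Because `setattr` replaces the child in place, the swap works for arbitrarily nested submodules (e.g. squeeze-and-excitation blocks) without rebuilding the model.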