- Applicable to following tasks:
- Scene Parsing
- Human Parsing
- Face Parsing
- 20+ Datasets
- 10+ SOTA Backbones
- 10+ SOTA Semantic Segmentation Models
- PyTorch, ONNX, TFLite and OpenVINO Inference
Supported Backbones:
- ResNet (CVPR 2016)
- ResNetD (ArXiv 2018)
- MobileNetV2 (CVPR 2018)
- MobileNetV3 (ICCV 2019)
- PVTv2 (ArXiv 2021)
- ResT (ArXiv 2021)
- MicroNet (ICCV 2021)
Supported Heads/Methods:
- FCN (CVPR 2015)
- UPerNet (ECCV 2018)
- BiSeNetv1 (ECCV 2018)
- FPN (CVPR 2019)
- SFNet (ECCV 2020)
- SegFormer (ArXiv 2021)
- FaPN (ICCV 2021)
- CondNet (IEEE SPL 2021)
Supported Standalone Models:
Supported Modules:
ADE20K-val (Scene Parsing)
Method | Backbone | mIoU (%) | Params (M) |
GFLOPs (512x512) |
Weights |
---|---|---|---|---|---|
SegFormer | MiT-B1 | 43.1 | 14 | 16 | pt |
MiT-B2 | 47.5 | 28 | 62 | pt | |
MiT-B3 | 50.0 | 47 | 79 | pt |
CityScapes-val (Scene Parsing)
Method | Backbone | mIoU (%) | Params (M) | GFLOPs | Img Size | Weights |
---|---|---|---|---|---|---|
SegFormer | MiT-B0 | 78.1 | 4 | 126 | 1024x1024 | N/A |
MiT-B1 | 80.0 | 14 | 244 | 1024x1024 | N/A | |
FaPN | ResNet-50 | 80.0 | 33 | - | 512x1024 | N/A |
SFNet | ResNetD-18 | 79.0 | 13 | - | 1024x1024 | N/A |
FCHarDNet | HarDNet-70 | 77.7 | 4 | 35 | 1024x1024 | pt |
DDRNet | DDRNet-23slim | 77.8 | 6 | 36 | 1024x2048 | pt |
HELEN-val (Face Parsing)
Method | Backbone | mIoU (%) | Params (M) |
GFLOPs (512x512) |
FPS (GTX1660ti) |
Weights |
---|---|---|---|---|---|---|
BiSeNetv1 | MobileNetV2-1.0 | 58.22 | 5 | 5 | 160 | pt |
BiSeNetv1 | ResNet-18 | 58.50 | 14 | 13 | 263 | pt |
BiSeNetv2 | - | 58.58 | 18 | 15 | 195 | pt |
FCHarDNet | HarDNet-70 | 59.38 | 4 | 4 | 130 | pt |
DDRNet | DDRNet-23slim | 61.11 | 6 | 5 | 180 | pt|tflite(fp32)|tflite(fp16)|tflite(int8) |
SegFormer | MiT-B0 | 59.31 | 4 | 8 | 75 | pt |
SFNet | ResNetD-18 | 61.00 | 14 | 31 | 56 | pt |
Backbones
Model | Variants | ImageNet-1k Top-1 Acc (%) | Params (M) | GFLOPs | Weights |
---|---|---|---|---|---|
MicroNet | M1|M2|M3 | 51.4| 59.4| 62.5 |
1| 2| 3 |
6M| 12M| 21M |
download |
MobileNetV2 | 1.0 | 71.9 | 3 | 300M | download |
MobileNetV3 | S|L | 67.7| 74.0 |
3| 5 |
56M| 219M |
S|L |
ResNet | 18|50|101 | 69.8| 76.1| 77.4 |
12| 25| 44 |
2| 4| 8 |
download |
ResNetD | 18|50|101 | - | 12| 25| 44 |
2| 4| 8 |
download |
MiT | B1|B2|B3 | - | 14| 25| 45 |
2| 4| 8 |
download |
PVTv2 | B1|B2|B4 | 78.7| 82.0| 83.6 |
14| 25| 63 |
2| 4| 10 |
download |
ResT | S|B|L | 79.6| 81.6| 83.6 |
14| 30| 52 |
2| 4| 8 |
download |
Notes: Download backbones' weights for HarDNet-70 and DDRNet-23slim.
Dataset | Type | Categories | Train Images |
Val Images |
Test Images |
Image Size (HxW) |
---|---|---|---|---|---|---|
COCO-Stuff | General Scene Parsing | 171 | 118,000 | 5,000 | 20,000 | - |
ADE20K | General Scene Parsing | 150 | 20,210 | 2,000 | 3,352 | - |
PASCALContext | General Scene Parsing | 59 | 4,996 | 5,104 | 9,637 | - |
SUN RGB-D | Indoor Scene Parsing | 37 | 2,666 | 2,619 | 5,050+labels | - |
Mapillary Vistas | Street Scene Parsing | 65 | 18,000 | 2,000 | 5,000 | 1080x1920 |
CityScapes | Street Scene Parsing | 19 | 2,975 | 500 | 1,525+labels | 1024x2048 |
CamVid | Street Scene Parsing | 11 | 367 | 101 | 233+labels | 720x960 |
MHPv2 | Multi-Human Parsing | 59 | 15,403 | 5,000 | 5,000 | - |
MHPv1 | Multi-Human Parsing | 19 | 3,000 | 1,000 | 980+labels | - |
LIP | Multi-Human Parsing | 20 | 30,462 | 10,000 | - | - |
CCIHP | Multi-Human Parsing | 22 | 28,280 | 5,000 | 5,000 | - |
CIHP | Multi-Human Parsing | 20 | 28,280 | 5,000 | 5,000 | - |
ATR | Single-Human Parsing | 18 | 16,000 | 700 | 1,000+labels | - |
HELEN | Face Parsing | 11 | 2,000 | 230 | 100+labels | - |
LaPa | Face Parsing | 11 | 18,176 | 2,000 | 2,000+labels | - |
iBugMask | Face Parsing | 11 | 21,866 | - | 1,000+labels | - |
CelebAMaskHQ | Face Parsing | 19 | 24,183 | 2,993 | 2,824+labels | 512x512 |
FaceSynthetics | Face Parsing (Synthetic) | 19 | 100,000 | 1,000 | 100+labels | 512x512 |
SUIM | Underwater Imagery | 8 | 1,525 | - | 110+labels | - |
Check DATASETS to find more segmentation datasets.
Datasets Structure (click to expand)
Datasets should have the following structure:
data
|__ ADEChallenge
|__ ADEChallengeData2016
|__ images
|__ training
|__ validation
|__ annotations
|__ training
|__ validation
|__ CityScapes
|__ leftImg8bit
|__ train
|__ val
|__ test
|__ gtFine
|__ train
|__ val
|__ test
|__ CamVid
|__ train
|__ val
|__ test
|__ train_labels
|__ val_labels
|__ test_labels
|__ VOCdevkit
|__ VOC2010
|__ JPEGImages
|__ SegmentationClassContext
|__ ImageSets
|__ SegmentationContext
|__ train.txt
|__ val.txt
|__ COCO
|__ images
|__ train2017
|__ val2017
|__ labels
|__ train2017
|__ val2017
|__ MHPv1
|__ images
|__ annotations
|__ train_list.txt
|__ test_list.txt
|__ MHPv2
|__ train
|__ images
|__ parsing_annos
|__ val
|__ images
|__ parsing_annos
|__ LIP
|__ LIP
|__ TrainVal_images
|__ train_images
|__ val_images
|__ TrainVal_parsing_annotations
|__ train_segmentations
|__ val_segmentations
|__ CIHP/CCIHP
|__ instance-leve_human_parsing
|__ Training
|__ Images
|__ Category_ids
|__ Validation
|__ Images
|__ Category_ids
|__ ATR
|__ humanparsing
|__ JPEGImages
|__ SegmentationClassAug
|__ SUIM
|__ train_val
|__ images
|__ masks
|__ TEST
|__ images
|__ masks
|__ SunRGBD
|__ SUNRGBD
|__ kv1/kv2/realsense/xtion
|__ SUNRGBDtoolbox
|__ traintestSUNRGBD
|__ allsplit.mat
|__ Mapillary
|__ training
|__ images
|__ labels
|__ validation
|__ images
|__ labels
|__ SmithCVPR2013_dataset_resized (HELEN)
|__ images
|__ labels
|__ exemplars.txt
|__ testing.txt
|__ tuning.txt
|__ CelebAMask-HQ
|__ CelebA-HQ-img
|__ CelebAMask-HQ-mask-anno
|__ CelebA-HQ-to-CelebA-mapping.txt
|__ LaPa
|__ train
|__ images
|__ labels
|__ val
|__ images
|__ labels
|__ test
|__ images
|__ labels
|__ ibugmask_release
|__ train
|__ test
|__ FaceSynthetics
|__ dataset_100000
|__ dataset_1000
|__ dataset_100
Note: For PASCALContext, download the annotations from here and put it in VOC2010.
Note: For CelebAMask-HQ, run the preprocess script.
python3 scripts/preprocess_celebamaskhq.py --root <DATASET-ROOT-DIR>
.
Augmentations (click to expand)
Check out the notebook here to test the augmentation effects.
Pixel-level Transforms:
- ColorJitter (Brightness, Contrast, Saturation, Hue)
- Gamma, Sharpness, AutoContrast, Equalize, Posterize
- GaussianBlur, Grayscale
Spatial-level Transforms:
- Affine, RandomRotation
- HorizontalFlip, VerticalFlip
- CenterCrop, RandomCrop
- Pad, ResizePad, Resize
- RandomResizedCrop
Requirements
- python >= 3.6
- torch >= 1.8.1
- torchvision >= 0.9.1
Other requirements can be installed with pip install -r requirements.txt
.
Configuration (click to expand)
Create a configuration file in configs
. Sample configuration for ADE20K dataset can be found here. Then edit the fields you think if it is needed. This configuration file is needed for all of training, evaluation and prediction scripts.
Training (click to expand)
To train with a single GPU:
$ python tools/train.py --cfg configs/CONFIG_FILE.yaml
To train with multiple gpus, set DDP
field in config file to true
and run as follows:
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/<CONFIG_FILE_NAME>.yaml
Evaluation (click to expand)
Make sure to set MODEL_PATH
of the configuration file to your trained model directory.
$ python tools/val.py --cfg configs/<CONFIG_FILE_NAME>.yaml
To evaluate with multi-scale and flip, change ENABLE
field in MSF
to true
and run the same command as above.
Inference
To make an inference, edit the parameters of the config file from below.
- Change
MODEL
>>NAME
andVARIANT
to your desired pretrained model. - Change
DATASET
>>NAME
to the dataset name depending on the pretrained model. - Set
TEST
>>MODEL_PATH
to pretrained weights of the testing model. - Change
TEST
>>FILE
to the file or image folder path you want to test. - Testing results will be saved in
SAVE_DIR
.
## example using ade20k pretrained models
$ python tools/infer.py --cfg configs/ade20k.yaml
Example test results:
Convert to other Frameworks (ONNX, CoreML, OpenVINO, TFLite)
To convert to ONNX and CoreML, run:
$ python tools/export.py --cfg configs/<CONFIG_FILE_NAME>.yaml
To convert to OpenVINO and TFLite, see torch_optimize.
Inference (ONNX, OpenVINO, TFLite)
## ONNX Inference
$ python scripts/onnx_infer.py --model <ONNX_MODEL_PATH> --img-path <TEST_IMAGE_PATH>
## OpenVINO Inference
$ python scripts/openvino_infer.py --model <OpenVINO_MODEL_PATH> --img-path <TEST_IMAGE_PATH>
## TFLite Inference
$ python scripts/tflite_infer.py --model <TFLite_MODEL_PATH> --img-path <TEST_IMAGE_PATH>
References (click to expand)
Citations (click to expand)
@article{xie2021segformer,
title={SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
author={Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
journal={arXiv preprint arXiv:2105.15203},
year={2021}
}
@misc{xiao2018unified,
title={Unified Perceptual Parsing for Scene Understanding},
author={Tete Xiao and Yingcheng Liu and Bolei Zhou and Yuning Jiang and Jian Sun},
year={2018},
eprint={1807.10221},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{hong2021deep,
title={Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes},
author={Hong, Yuanduo and Pan, Huihui and Sun, Weichao and Jia, Yisong},
journal={arXiv preprint arXiv:2101.06085},
year={2021}
}
@misc{zhang2021rest,
title={ResT: An Efficient Transformer for Visual Recognition},
author={Qinglong Zhang and Yubin Yang},
year={2021},
eprint={2105.13677},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{huang2021fapn,
title={FaPN: Feature-aligned Pyramid Network for Dense Image Prediction},
author={Shihua Huang and Zhichao Lu and Ran Cheng and Cheng He},
year={2021},
eprint={2108.07058},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{wang2021pvtv2,
title={PVTv2: Improved Baselines with Pyramid Vision Transformer},
author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},
year={2021},
eprint={2106.13797},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{Liu2021PSA,
title={Polarized Self-Attention: Towards High-quality Pixel-wise Regression},
author={Huajun Liu and Fuqiang Liu and Xinyi Fan and Dong Huang},
journal={Arxiv Pre-Print arXiv:2107.00782 },
year={2021}
}
@misc{chao2019hardnet,
title={HarDNet: A Low Memory Traffic Network},
author={Ping Chao and Chao-Yang Kao and Yu-Shan Ruan and Chien-Hsiang Huang and Youn-Long Lin},
year={2019},
eprint={1909.00948},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{sfnet,
title={Semantic Flow for Fast and Accurate Scene Parsing},
author={Li, Xiangtai and You, Ansheng and Zhu, Zhen and Zhao, Houlong and Yang, Maoke and Yang, Kuiyuan and Tong, Yunhai},
booktitle={ECCV},
year={2020}
}
@article{Li2020SRNet,
title={Towards Efficient Scene Understanding via Squeeze Reasoning},
author={Xiangtai Li and Xia Li and Ansheng You and Li Zhang and Guang-Liang Cheng and Kuiyuan Yang and Y. Tong and Zhouchen Lin},
journal={ArXiv},
year={2020},
volume={abs/2011.03308}
}
@ARTICLE{Yucondnet21,
author={Yu, Changqian and Shao, Yuanjie and Gao, Changxin and Sang, Nong},
journal={IEEE Signal Processing Letters},
title={CondNet: Conditional Classifier for Scene Segmentation},
year={2021},
volume={28},
number={},
pages={758-762},
doi={10.1109/LSP.2021.3070472}
}