Revisiting Label Smoothing & Knowledge Distillation Compatibility: What was Missing?

Keshigeyan Chandrasegaran /  Ngoc‑Trung Tran /  Yunqing Zhao /  Ngai‑Man Cheung
Singapore University of Technology and Design (SUTD)
ICML 2022 
Project | ICML Paper | Pre-trained Models

Abstract

This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019); Shen et al. (2021). Critically, there is no effort to understand and resolve these contradictory findings, leaving the primal question — to smooth or not to smooth a teacher network? — unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students.

A rule of thumb for practitioners. We suggest using an LS-trained teacher with a low-temperature transfer (i.e., T = 1) to obtain high-performance students.
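
As a concrete illustration, here is a minimal PyTorch sketch (not the repository's training code; `ls_cross_entropy` and `kd_loss` are illustrative names) of the label-smoothing loss used to train the teacher and the standard temperature-scaled KD objective for the student, with T kept at 1 per the rule of thumb. Loss-weighting details vary per experiment; consult the per-task READMEs for exact hyper-parameters.

```python
# Minimal sketch, not the repository's exact implementation: the label-smoothing
# loss used to train the teacher and the standard temperature-scaled KD objective
# for the student. Function names (ls_cross_entropy, kd_loss) are illustrative.
import torch
import torch.nn.functional as F


def ls_cross_entropy(logits, targets, alpha=0.1):
    """Cross-entropy against smoothed labels: every class receives alpha/K mass,
    and the true class receives (1 - alpha) + alpha/K."""
    log_probs = F.log_softmax(logits, dim=1)
    n_classes = logits.size(1)
    smooth = torch.full_like(log_probs, alpha / n_classes)
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - alpha + alpha / n_classes)
    return -(smooth * log_probs).sum(dim=1).mean()


def kd_loss(student_logits, teacher_logits, targets, T=1.0, lam=0.5):
    """Hard-label cross-entropy plus T^2-scaled KL to the teacher's soft targets.
    Keep T = 1 when the teacher was trained with label smoothing."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - lam) * ce + lam * kl
```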

About the code

This codebase is written in PyTorch. It is clearly documented, with bash entry points exposing all required arguments and hyper-parameters. We also provide Docker container details to run our code.

✅ PyTorch

✅ NVIDIA DALI

✅ Multi-GPU / Mixed-Precision training (see the sketch after this list)

✅ DockerFile
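
The sketch below is hypothetical and not the repository's launch scripts: it shows how a mixed-precision KD training step might look with `torch.cuda.amp`, reusing the illustrative `kd_loss` from the earlier sketch. Multi-GPU training would additionally wrap the student in `torch.nn.parallel.DistributedDataParallel`.

```python
# Hypothetical sketch of a mixed-precision KD training step using torch.cuda.amp.
# It reuses the illustrative kd_loss() defined above; the repository's actual
# training scripts live under src/image_classification/.
import torch
from torch.cuda.amp import GradScaler, autocast


def train_one_epoch(student, teacher, loader, optimizer, device, T=1.0, lam=0.5):
    scaler = GradScaler()                 # dynamic loss scaling for fp16 stability
    student.train()
    teacher.eval()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with autocast():                  # run forward passes in mixed precision
            with torch.no_grad():         # teacher only provides soft targets
                teacher_logits = teacher(images)
            student_logits = student(images)
            loss = kd_loss(student_logits, teacher_logits, targets, T=T, lam=lam)
        scaler.scale(loss).backward()     # scale the loss before backward
        scaler.step(optimizer)            # unscale gradients, then optimizer step
        scaler.update()
```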

Running the code

ImageNet-1K LS / KD experiments : Clear steps on how to run and reproduce our results for ImageNet-1K LS and KD (Table 2, B.3) are provided in src/image_classification/README.md. We support Multi-GPU training and mixed-precision training. We use the NVIDIA DALI library for training student networks.

Machine Translation experiments : Clear steps on how to run and reproduce our results for machine translation LS and KD (Table 5, B.2) are provided in src/neural_machine_translation/README.md. We use [1], following the exact procedure of [2].

CUB200-2011 experiments : Clear steps on how to run and reproduce our results for fine-grained image classification (CUB200) LS and KD (Table 2, B.1) are provided in src/image_classification/README.md. We support Multi-GPU training and mixed-precision training.

Compact Student Distillation : Clear steps on how to run and reproduce our results for Compact Student distillation LS and KD (Table 4, B.3) are provided in src/image_classification/README.md. We support Multi-GPU training and mixed-precision training.

Penultimate Layer Visualization : Pseudocode for the penultimate-layer visualization algorithm is provided in src/visualization/visualization_algorithm.png. Refer to src/visualization/alpha-LS-KD_imagenet_centroids.py for the penultimate-layer visualization code to reproduce all visualizations in the main paper and Supplementary (Figures 1, A.1, A.2). The code is clearly documented. A generic sketch of the underlying projection technique is given below.
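
For orientation only, this is a generic sketch of the penultimate-layer projection popularized by Muller et al. (2019): choose three classes, form an orthonormal basis of the plane through their penultimate-layer class centroids, and project each sample's penultimate activation onto that plane. The repository's exact algorithm is the one in visualization_algorithm.png and the script above; this sketch may differ in details.

```python
# Generic sketch of penultimate-layer visualization in the style of
# Muller et al. (2019). The repository's exact algorithm is documented in
# src/visualization/visualization_algorithm.png; this sketch may differ in details.
import torch


def plane_basis(c0, c1, c2):
    """Orthonormal basis of the plane through three class centroids (Gram-Schmidt)."""
    u = c1 - c0
    u = u / u.norm()
    v = c2 - c0
    v = v - (v @ u) * u                  # remove the component along u
    v = v / v.norm()
    return u, v


def project_penultimate(feats, labels, classes):
    """Project penultimate features of three chosen classes onto the plane
    spanned by their class centroids; returns 2-D coordinates and labels."""
    centroids = [feats[labels == c].mean(dim=0) for c in classes]
    u, v = plane_basis(*centroids)
    mask = (labels == classes[0]) | (labels == classes[1]) | (labels == classes[2])
    x = feats[mask] - centroids[0]       # use the first centroid as the origin
    coords = torch.stack([x @ u, x @ v], dim=1)
    return coords, labels[mask]
```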

ImageNet-1K KD Results

ResNet-50 → ResNet-18 KD

| Model | T | $\alpha=$ 0.0 (top-1 / top-5) | $\alpha=$ 0.1 (top-1 / top-5) |
| --- | --- | --- | --- |
| Teacher : ResNet-50 | - | 76.132 / 92.862 | 76.200 / 93.082 |
| Student : ResNet-18 | 1 | 71.488 / 90.272 | 71.666 / 90.364 |
| Student : ResNet-18 | 2 | 71.360 / 90.362 | 68.860 / 89.352 |
| Student : ResNet-18 | 3 | 69.674 / 89.698 | 67.752 / 88.932 |
| Student : ResNet-18 | 64 | 66.194 / 88.706 | 64.362 / 87.698 |

ResNet-50 → ResNet-50 KD

| Model | T | $\alpha=$ 0.0 (top-1 / top-5) | $\alpha=$ 0.1 (top-1 / top-5) |
| --- | --- | --- | --- |
| Teacher : ResNet-50 | - | 76.132 / 92.862 | 76.200 / 93.082 |
| Student : ResNet-50 | 1 | 76.328 / 92.996 | 76.896 / 93.236 |
| Student : ResNet-50 | 2 | 76.180 / 93.072 | 76.110 / 93.138 |
| Student : ResNet-50 | 3 | 75.488 / 92.670 | 75.790 / 93.006 |
| Student : ResNet-50 | 64 | 74.278 / 92.410 | 74.566 / 92.596 |

Results produced with the NVIDIA PyTorch Docker container 20.12-py3 + PyTorch LTS 1.8.2 + CUDA 11.1.

Pretrained Models

All pretrained image classification, fine-grained image classification, neural machine translation and compact student distillation models are available here.

Citation

@InProceedings{pmlr-v162-chandrasegaran22a,
    author    = {Chandrasegaran, Keshigeyan and Tran, Ngoc-Trung and Zhao, Yunqing and Cheung, Ngai-Man},
    title     = {Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?},
    booktitle = {Proceedings of the 39th International Conference on Machine Learning},
    pages     = {2890-2916},
    year      = {2022},
    editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
    volume    = {162},
    series    = {Proceedings of Machine Learning Research},
    month     = {17-23 Jul},
    publisher = {PMLR},
}

Acknowledgements

We gratefully acknowledge the following works and libraries:

Special thanks to Lingeng Foo and Timothy Liu for valuable discussion.

References

[1] Tan, Xu, et al. "Multilingual Neural Machine Translation with Knowledge Distillation." International Conference on Learning Representations. 2019.

[2] Shen, Zhiqiang, et al. "Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study." International Conference on Learning Representations. 2021.
