# Knowledge Distillation (KD)

## Definition and Core Idea

**Knowledge Distillation (KD)** is a training paradigm where a smaller or cheaper **student model** learns to mimic a stronger **teacher model** (often a large model or an ensemble).  
The central objective is to **transfer knowledge** efficiently from teacher to student—maintaining high accuracy while reducing computational cost.

The foundational idea emerged from **model compression** (Buciluă, Caruana, Niculescu-Mizil, 2006) and was formalized by **Hinton et al. (2015)** through the use of **soft targets** and a **temperature-scaled softmax**.

---

## Why Soft Targets?

Hard one-hot labels contain no information about class similarity.  
Soft targets (from a high-temperature softmax) encode **“dark knowledge”**—latent information about how the teacher perceives similarities among classes.  
This provides richer gradients, smoother optimization, and better generalization.
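
As a small illustration of the effect described above, the sketch below applies a temperature-scaled softmax to made-up teacher logits. The class names and logit values are assumptions chosen for illustration, not taken from a real model, and only the Python standard library is used.

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, truck).
teacher_logits = [9.0, 6.0, 1.0]

sharp = softmax_t(teacher_logits, T=1.0)  # sharply peaked on "cat"
soft = softmax_t(teacher_logits, T=4.0)   # "dog" receives visibly more mass than "truck"

for name, p_sharp, p_soft in zip(("cat", "dog", "truck"), sharp, soft):
    print(f"{name:5s}  T=1: {p_sharp:.4f}   T=4: {p_soft:.4f}")
```

At the higher temperature the ordering of the non-target classes (dog above truck) becomes a usable training signal instead of being squashed towards zero.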

---

## Main Families of Knowledge Distillation

### 1. Response / Logit-Based KD
Match the teacher’s **logits** or **probabilities**:
$$
L_{\text{KD}} = T^2 \cdot \text{KL}\!\left( p_T^{(T)} \parallel p_S^{(T)} \right)
$$
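where the temperature-softened probability of class \(i\) is computed from a model's logits \(z\) (the subscripts \(T\) and \(S\) on \(p\) denote teacher and student, while the superscript \((T)\) denotes the temperature):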
$$
p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} .
$$  
This is the **canonical baseline**, widely used in CV and NLP.
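
A minimal sketch of this loss term for a single example, written in plain Python so that it has no framework dependency. The function names `log_softmax_t` and `kd_term` are hypothetical, and a real training loop would use a tensor library and average over a batch.

```python
import math

def log_softmax_t(logits, T):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def kd_term(teacher_logits, student_logits, T=4.0):
    """T^2 * KL(p_teacher^(T) || p_student^(T)) for one example."""
    log_p_t = log_softmax_t(teacher_logits, T)
    log_p_s = log_softmax_t(student_logits, T)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return T * T * kl

# Made-up logits: the student disagrees with the teacher, so the term is positive.
loss = kd_term([9.0, 6.0, 1.0], [5.0, 5.5, 2.0], T=4.0)
```

Because softmax is invariant to adding a constant to all logits, the term is zero whenever student and teacher logits differ only by such a shift.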

---

### 2. Feature-Based KD
Align intermediate **hidden activations** between teacher and student:
$$
L_{\text{feat}} = \sum_{k=1}^{K} \| h_T^{(k)} - h_S^{(k)} \|_2^2
$$
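Used when mid-level representations are crucial or when architectures differ; if the matched layers have different widths, a learned projection (as in FitNets) maps student features into the teacher's space before the distance is taken.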
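
The feature loss can be sketched as follows, assuming each layer's activation has been flattened to a plain list of floats. The helper `project` is a hypothetical stand-in for a learned linear adapter; here its weights are fixed by hand purely for illustration.

```python
def project(weight, h):
    """Apply a linear map to h; `weight` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, h)) for row in weight]

def feature_kd_loss(teacher_feats, student_feats):
    """Sum over matched layers of the squared L2 distance between features."""
    if len(teacher_feats) != len(student_feats):
        raise ValueError("need the same number of matched layers")
    total = 0.0
    for h_t, h_s in zip(teacher_feats, student_feats):
        if len(h_t) != len(h_s):
            raise ValueError("feature widths differ; project the student first")
        total += sum((a - b) ** 2 for a, b in zip(h_t, h_s))
    return total

# One matched layer: the teacher is 3-wide, the student 2-wide.
teacher_layer = [1.0, 0.0, 1.0]
student_layer = [1.0, 1.0]
adapter = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # hand-picked 3x2 weights

loss = feature_kd_loss([teacher_layer], [project(adapter, student_layer)])
```

In practice the adapter weights would be trained jointly with the student and discarded after distillation.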

---

### 3. Relation-Based KD
Match **relations** among samples or channels (e.g., pairwise distances or angles):
$$
L_{\text{rel}} = \sum_{i,j} \big( d_T(x_i, x_j) - d_S(x_i, x_j) \big)^2 .
$$
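
A minimal sketch of the pairwise-distance loss above, taking \( d \) to be the Euclidean distance between embeddings (an assumption; published methods such as RKD add normalisation and a robust loss). Note that teacher and student embeddings may have different dimensionality, since only distances are compared.

```python
import math

def pairwise_distances(embeddings):
    """Matrix of Euclidean distances between all pairs of embeddings."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def relation_kd_loss(teacher_emb, student_emb):
    """Sum over all pairs (i, j) of the squared difference of distances."""
    d_t = pairwise_distances(teacher_emb)
    d_s = pairwise_distances(student_emb)
    n = len(d_t)
    return sum((d_t[i][j] - d_s[i][j]) ** 2
               for i in range(n) for j in range(n))

# Teacher embeds in 3-D, student in 2-D; both place the two samples 5 apart.
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
student_emb = [[1.0, 1.0], [4.0, 5.0]]
loss = relation_kd_loss(teacher_emb, student_emb)
```

The student is free to translate or rotate its embedding space without penalty, which is what distinguishes this family from direct feature matching.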

---

### 4. Attention-Based KD
Force the student to reproduce the teacher’s **attention maps**:
$$
L_{\text{attn}} = \| A_T - A_S \|_2^2 .
$$
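
For convolutional features, one common way to obtain \( A \) is the activation-based map of Zagoruyko & Komodakis: sum the squared activations over channels and L2-normalise the result. The sketch below assumes that construction and represents a \( C \times H \times W \) activation as nested Python lists; for Transformers, \( A \) would instead be the self-attention matrices.

```python
import math

def attention_map(feature_map):
    """Collapse a C x H x W activation into a flat, L2-normalised H*W map."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    flat = [sum(feature_map[c][i][j] ** 2 for c in range(channels))
            for i in range(height) for j in range(width)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # guard against all-zero maps
    return [v / norm for v in flat]

def attention_kd_loss(teacher_fm, student_fm):
    """Squared L2 distance between the normalised attention maps."""
    a_t = attention_map(teacher_fm)
    a_s = attention_map(student_fm)
    return sum((x - y) ** 2 for x, y in zip(a_t, a_s))

# Teacher has 2 channels, student has 1; both attend to the top-left position.
teacher_fm = [[[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]
student_fm = [[[3.0, 0.0], [0.0, 0.0]]]
loss = attention_kd_loss(teacher_fm, student_fm)
```

Because channels are summed out, only the spatial sizes need to agree, not the channel counts.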

---

### 5. Information-Theoretic KD
Maximize the **mutual information** between teacher and student representations:
$$
\max I(h_T; h_S) .
$$
Variational Information Distillation (VID) maximizes a variational lower bound on this quantity, since the mutual information itself is intractable to compute.
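
A short derivation shows how such a variational bound arises. The distribution \( q \), its mean \( \mu \), and its per-dimension scale \( \sigma_n \) are notation introduced for this sketch, with \( q \) assumed Gaussian:

$$
\begin{aligned}
I(h_T; h_S) &= H(h_T) - H(h_T \mid h_S) \\
            &= H(h_T) + \mathbb{E}\big[ \log p(h_T \mid h_S) \big] \\
            &\ge H(h_T) + \mathbb{E}\big[ \log q(h_T \mid h_S) \big] ,
\end{aligned}
$$

where the inequality holds because \( \mathbb{E}[\log p - \log q] \) is an expected KL divergence and therefore non-negative. Since \( H(h_T) \) does not depend on the student, maximizing the bound amounts to minimizing

$$
-\log q(h_T \mid h_S) = \sum_n \left[ \log \sigma_n + \frac{\big( h_{T,n} - \mu_n(h_S) \big)^2}{2 \sigma_n^2} \right] + \text{const} ,
$$

which reduces to a weighted squared-error feature loss when the \( \sigma_n \) are held fixed.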

---

### 6. Sequence-Level KD (Seq2Seq)
Distill **entire output sequences** instead of token-wise targets.  
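In NMT the student is trained on the teacher's decoded outputs (typically from beam search), which gives it a simpler target distribution; the distilled student often performs well with greedy decoding, reducing or removing the need for beam search at inference.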
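
The data-preparation step can be sketched as below. `teacher_decode` is a hypothetical callable standing in for the teacher's decoder, and the toy implementation here only uppercases its input so that the example runs without a model.

```python
def build_seq_kd_dataset(sources, teacher_decode):
    """Pair each source sequence with the teacher's decoded output sequence."""
    return [(src, teacher_decode(src)) for src in sources]

def toy_teacher_decode(src):
    """Toy stand-in for a real teacher's decoding routine (assumption)."""
    return src.upper()

distilled = build_seq_kd_dataset(["guten tag", "danke"], toy_teacher_decode)
# The student is then trained with ordinary cross-entropy on `distilled`,
# i.e. on the teacher's outputs rather than on the reference translations.
```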

---

### 7. Self-Distillation / Online & Mutual
The teacher is the model itself (an earlier copy) or a peer trained alongside it:
- *Born-Again Networks*: each generation serves as a teacher for the next.
- *Deep Mutual Learning*: multiple students learn cooperatively.
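
For the mutual-learning case, a minimal sketch of the per-example losses for two peers is given below, assuming the formulation in which each network adds a KL term towards the other's prediction to its own cross-entropy. Function names are hypothetical and no temperature is applied.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two probability vectors with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mutual_losses(logits_a, logits_b, label):
    """Return (loss for network A, loss for network B) on one example."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = -math.log(p_a[label]) + kl(p_b, p_a)  # A is pulled towards B
    loss_b = -math.log(p_b[label]) + kl(p_a, p_b)  # B is pulled towards A
    return loss_a, loss_b

loss_a, loss_b = mutual_losses([2.0, 0.0], [1.0, 0.5], label=0)
```

When computing gradients for one network, the peer's probabilities are normally treated as constants.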

---

### 8. Data-Free / Privacy-Preserving KD
Distill knowledge **without access to original data**, using synthetic samples or inversion (e.g., *Zero-Shot KD*, *DeepInversion*).  
Useful for **federated** or **privacy-sensitive** contexts.

---

### 9. Diffusion-Model Distillation
Compress diffusion sampling or enable controllable generation:
- **Progressive Distillation:** halve steps repeatedly (e.g., 8192 → 4).
- **Consistency Models:** single-step generation distilled from diffusion backbones.
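- **Score-Distillation Sampling (SDS):**
  $$ \nabla_\theta L_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big( \hat{\epsilon}_T(x_t; t) - \epsilon \big)\,\frac{\partial x}{\partial \theta} \right] $$
  where \( x = g(\theta) \) is the generated (e.g., rendered) sample, \( x_t \) is its noised version at timestep \( t \), \( \epsilon \) is the injected noise, \( \hat{\epsilon}_T \) is the frozen teacher's noise prediction, and \( w(t) \) is a timestep weighting.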
- **Adversarial Diffusion Distillation (ADD):** combines diffusion and GAN objectives.
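
To make the progressive-distillation arithmetic concrete, the sketch below lists the sampler step counts visited when each round halves the previous count. The function name is hypothetical; only the 8192 → 4 example comes from the text above.

```python
def halving_schedule(start_steps, final_steps):
    """Step counts visited when every distillation round halves the sampler."""
    if final_steps < 1 or start_steps < final_steps:
        raise ValueError("need start_steps >= final_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > final_steps:
        # Never drop below the requested final step count.
        schedule.append(max(schedule[-1] // 2, final_steps))
    return schedule

schedule = halving_schedule(8192, 4)
rounds = len(schedule) - 1  # number of teacher -> student distillation rounds
print(schedule, rounds)
```

Each entry after the first corresponds to a student that becomes the teacher for the next round.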

---

### 10. LLM-Focused KD
Distillation for **Transformers and LLMs**:
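- *DistilBERT*, *TinyBERT*, *MiniLM(v2)*: variously combine logit, hidden-state, and attention-relation objectives.
- Reported savings are large: DistilBERT is about 40 % smaller and 60 % faster than BERT-base while retaining about 97 % of its language-understanding performance, and the 4-layer TinyBERT is reported as roughly 7.5× smaller and 9.4× faster with strong GLUE performance.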

---

## Applications and Goals

1. **Compression & Acceleration:** shrink model size and latency while preserving accuracy.  
2. **Better Training Signals:** improve generalization in low-data or noisy-label scenarios.  
3. **Heterogeneous / Federated Learning:** share only predictions, not parameters.  
4. **Diffusion & 3D:** compress multi-step sampling into one or few steps.

---

## How KD Works (Conceptual Model)

**Soft-target alignment:**
$$
L = (1 - \alpha)\,\text{CE}(y, p_S)
    + \alpha\,T^2\,\text{KL}\!\left( p_T^{(T)} \parallel p_S^{(T)} \right)
$$
- Higher \( T \) reveals inter-class structure.
- \( \alpha \) balances hard vs. soft supervision.
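
The \( T^2 \) factor can be motivated by looking at the gradient of the soft term with respect to a student logit \( z_{S,i} \) (the class index \( i \) and the class count \( N \) are notation added for this sketch):

$$
\frac{\partial}{\partial z_{S,i}} \text{KL}\!\left( p_T^{(T)} \parallel p_S^{(T)} \right)
  = \frac{1}{T} \left( p_{S,i}^{(T)} - p_{T,i}^{(T)} \right) .
$$

For large \( T \) and zero-mean logits, \( p_i^{(T)} \approx \frac{1}{N}\left( 1 + z_i / T \right) \), so

$$
\frac{\partial}{\partial z_{S,i}} \text{KL}\!\left( p_T^{(T)} \parallel p_S^{(T)} \right)
  \approx \frac{1}{N T^2} \left( z_{S,i} - z_{T,i} \right) .
$$

The soft-target gradient therefore shrinks as \( 1/T^2 \); multiplying by \( T^2 \) keeps its magnitude comparable to the hard-label term when \( T \) is changed.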

**Feature/attention/relational losses** regularize the internal geometry of representations, enhancing robustness to architecture mismatch.

**Information-theoretic view:** maximize \( I(h_T; h_S) \).  
**Sequence-level view:** approximate teacher decoding distributions to simplify generation.

---

## Practical Recipes

**Baseline Logit KD**
$$
L = \text{CE}(y, p_S) + \lambda\,T^2\,\text{KL}\!\left( p_T^{(T)} \parallel p_S^{(T)} \right)
$$
Typical ranges: \( T \in [2,8] \), \( \lambda \in [0.5,2] \).

**Intermediate Guidance (Feature KD):**  
Match \( K \) layers using MSE or cosine similarity (“Patient KD”).

**Attention/Relation Losses:**  
Add AT/RKD/CRD terms when architectures differ.

**For LLMs:**  
Combine LM loss + KD loss + attention/hidden-state alignment at both pre-training and task stages.

**For Diffusion Models:**  
Use progressive or consistency distillation; SDS for cross-space transfer (e.g., NeRF).

**For Federated Settings:**  
FedMD: share logits on public data; fit local students to ensemble consensus.
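
The consensus step can be sketched as a plain average of the class scores that each client reports on the shared public examples. This is a simplified reading of the scheme, and the nested-list layout `client_scores[c][n][k]` (client, public example, class) is an assumption made for illustration.

```python
def consensus_scores(client_scores):
    """Average per-class scores across clients for every public example."""
    num_clients = len(client_scores)
    num_examples = len(client_scores[0])
    num_classes = len(client_scores[0][0])
    return [
        [sum(client_scores[c][n][k] for c in range(num_clients)) / num_clients
         for k in range(num_classes)]
        for n in range(num_examples)
    ]

# Two clients, one public example, two classes (made-up scores).
client_a = [[1.0, 3.0]]
client_b = [[3.0, 1.0]]
consensus = consensus_scores([client_a, client_b])
```

Each client would then fit its local model to `consensus` on the public data before continuing to train on its private data; only scores, never parameters or private samples, leave the client.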

---

## Tuning Tips & Pitfalls

- **Temperature \(T\)** and **weight \(\lambda\)** are crucial.  
  Too low \(T\): little dark knowledge; too high \(T\): overly flat distributions.
- **Teacher quality vs. student capacity:** small students may underfit; feature-based KD helps.
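- **Tokenizer mismatch (LLMs):** logit KD assumes a shared vocabulary; when tokenizers differ, rely on attention/hidden-state KD (with token positions aligned) or on sequence-level KD over generated text.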
- **Data-free KD:** quality of synthetic data matters; expect some accuracy gap.
- **Seq-level KD:** balance with a small hard-label loss to avoid over-imitation.
- **Diffusion KD:** ensure correct supervision of score dynamics; beware gradient bias in SDS.

---

## When to Use KD

Use KD when:
- You need **smaller/faster/cheaper** models.
- You are deploying on **edge devices**.
- You want to **consolidate ensembles**.
- You are aligning **different architectures or modalities**.
- You aim to **accelerate diffusion or generative sampling**.

Avoid or combine with alternatives when:
- Bottleneck is I/O or CPU rather than FLOPs.
- Teacher is not substantially better than student.
- Pruning or quantization alone achieves the same gains.

---

## Minimal Starter KD Recipe

**Setup**
- Teacher: larger, well-trained model.  
- Student: thinner or shallower model from the same family.

**Loss**
$$
L = \text{CE}(y, p_S)
  + \lambda\,T^2\,\text{KL}\!\left( p_T^{(T)} \parallel p_S^{(T)} \right)
  + \beta \sum_{k=1}^{K} \| h_T^{(k)} - h_S^{(k)} \|_2^2
$$

**Typical Hyperparameters**
- \( T \in [2,6] \)
- \( \lambda \in [0.5,2] \)
- \( \beta \in [0.1,0.5] \) (weight of the feature term)
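
The three-term starter loss can be sketched for a single example as follows. Defaults are picked from inside the ranges above; the parameter names `kd_weight` (for \( \lambda \)) and `feat_weight` (for the feature-term weight) are hypothetical, and features are assumed to be already projected to matching widths.

```python
import math

def log_softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def starter_kd_loss(label, student_logits, teacher_logits,
                    student_feats, teacher_feats,
                    T=4.0, kd_weight=1.0, feat_weight=0.1):
    """CE on the hard label + weighted T^2 * KL + weighted feature MSE sum."""
    ce = -log_softmax(student_logits)[label]
    log_p_t = log_softmax(teacher_logits, T)
    log_p_s = log_softmax(student_logits, T)
    kd = T * T * sum(math.exp(lt) * (lt - ls)
                     for lt, ls in zip(log_p_t, log_p_s))
    feat = sum(sum((a - b) ** 2 for a, b in zip(h_t, h_s))
               for h_t, h_s in zip(teacher_feats, student_feats))
    return ce + kd_weight * kd + feat_weight * feat

# Made-up example: 2 classes, one matched 2-wide layer.
loss = starter_kd_loss(0, [2.0, 0.0], [3.0, -1.0],
                       student_feats=[[0.5, 1.0]], teacher_feats=[[1.0, 2.0]])
```

In a real setup the teacher's logits and features are computed without gradient tracking, and the loss is averaged over a mini-batch.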

**Evaluation**
- Report accuracy vs. compute (latency, memory, energy).  
- For diffusion: report FID / IS vs. number of sampling steps.

---


# Foundational and Recent Papers on Knowledge Distillation

## Root / Foundational Papers

**Distilling the Knowledge in a Neural Network**  
*(Hinton, Vinyals & Dean, 2015)*  
This seminal work introduced **knowledge distillation (KD)** — a process where a smaller *student* model learns from the softened output probabilities of a larger *teacher* model using a **temperature-scaled softmax**. This allows the student to capture “dark knowledge” about inter-class similarities that the teacher has learned.  
Source: *arXiv*

---

**Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks**  
*(Papernot et al., 2015)*  
This paper introduced **defensive distillation**, showing that the distillation process can enhance **adversarial robustness** by smoothing the model’s decision surface, thus reducing sensitivity to small input perturbations.  
Source: *arXiv*

---

**On the Efficacy of Knowledge Distillation**  
*(Cho & Hariharan, ICCV 2019)*  
An empirical study exploring *when and why* distillation succeeds or fails. The results show that teacher-student **capacity mismatch** and **architectural differences** significantly affect distillation quality and generalization.  
Source: *openaccess.thecvf.com*

---

**Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning**  
*(Allen-Zhu & Li, 2020)*  
A theoretical exploration connecting ensembles, distillation, and self-distillation. The paper provides a formal understanding of how ensemble knowledge can be compressed into a single model and explains the benefits of iterative self-distillation.  
Source: *arXiv*

---

**Like What You Like: Knowledge Distill via Neuron Selectivity Transfer**  
*(Huang & Wang, 2017)*  
Proposes **feature-based distillation** using *neuron selectivity distributions* and **maximum mean discrepancy (MMD)** to align internal feature representations between teacher and student.  
Source: *openreview.net*

---

## Recent / Newer Papers

**What Knowledge Gets Distilled in Knowledge Distillation?**  
*(NeurIPS 2023)*  
Investigates *what specific knowledge* is actually transferred during distillation — analyzing whether the student learns structure, label relationships, or other latent representations beyond mere output matching.  
Source: *proceedings.neurips.cc*

---

**A Comprehensive Survey on Knowledge Distillation**  
*(Mansourian et al., 2025)*  
A comprehensive modern review of KD covering **transformers, large language models (LLMs), and diffusion models**, categorizing methods and analyzing theoretical and practical advances.  
Source: *arXiv*

---

**Knowledge Distillation Meets Self-Supervision**  
*(ECCV 2020)*  
Integrates **self-supervised learning (SSL)** signals with distillation, enabling students to benefit from unlabeled data and enhancing generalization, especially under limited or noisy labels.  
Source: *ecva.net*

---

**Knowledge Distillation Meets Open-Set Semi-Supervised Learning**  
*(2024)*  
Extends KD to **open-set semi-supervised** settings, demonstrating that distillation can guide students in domains with partially labeled or unknown-class data.  
Source: *SpringerLink*

---

**A Coded Knowledge Distillation Framework for Image Classification**  
*(2024)*  
Introduces a **coded representation** approach for KD, encoding teacher knowledge into compact transferable forms to improve student learning in image classification.  
Source: *ScienceDirect*

---

## Additional Notes & Thematic Insights

- **Surveys** such as *Knowledge Distillation: A Survey* (Gou et al., 2020) provide taxonomies of KD types (response-based, feature-based, relation-based) and discuss architecture compatibility.  
  Source: *arXiv*

- **Broader perspective:** KD extends beyond compression — it is also a framework for **knowledge adaptation**, **domain transfer**, and **cross-modal learning**.

- **Forms of transferred knowledge:**
  - **Logit-based:** Teacher soft outputs.
  - **Feature-based:** Intermediate activations or attention maps.
  - **Relation-based:** Structural or pairwise sample relationships.

- **Practical considerations:**
  - Teacher–student capacity ratio.
  - Architecture mismatch.
  - Data size and availability.
  - Distillation temperature and loss weighting.
  - Whether distillation improves **generalization** or merely **memorization**.

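- **Empirical challenges** noted by *Cho & Hariharan (2019)* emphasize that KD is not universally beneficial: a more accurate teacher does not always yield a better student, and effectiveness depends on the *teacher–student capacity gap* and on *training dynamics* (for example, stopping the teacher's training early can help).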

---

## Core Distillation Equation

Let the teacher produce logits \( z_T \) and the student produce logits \( z_S \). The **softened softmax** with temperature \( T \) is:

$$
p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

The **distillation loss** is typically a weighted combination of the Kullback–Leibler divergence between teacher and student outputs and the standard cross-entropy with ground truth labels:

$$
L = (1 - \alpha) \cdot \text{CE}(y, p_S) + \alpha \cdot T^2 \cdot \text{KL}(p_T^{(T)} \parallel p_S^{(T)})
$$

where:
- \( \alpha \) balances between hard and soft targets,  
- \( T \) controls the smoothness of distributions,  
- \( p_T^{(T)} \) and \( p_S^{(T)} \) denote teacher and student soft probabilities.

This formulation captures the essence of how **dark knowledge** (latent inter-class structure) is transferred from teacher to student.


# Related Works Connected to *“Distilling the Knowledge in a Neural Network”* (Hinton, Vinyals & Dean, 2015)

| **Category** | **Author(s)** | **Year** | **Title** | **Venue / Source** |
|---------------|----------------|-----------|------------|--------------------|
| **Origin Paper** | Geoffrey E. Hinton, Oriol Vinyals, Jeff Dean | 2015 | *Distilling the Knowledge in a Neural Network* | arXiv.org |
| **Prior Works** | Cristian Buciluă, R. Caruana, Alexandru Niculescu-Mizil | 2006 | *Model Compression* | KDD ’06 |
|  | Jimmy Ba, R. Caruana | 2013 | *Do Deep Nets Really Need to be Deep?* | NIPS |
|  | A. Krizhevsky | 2009 | *Learning Multiple Layers of Features from Tiny Images* | University of Toronto Technical Report |
|  | Yann LeCun, L. Bottou, Yoshua Bengio, P. Haffner | 1998 | *Gradient-based Learning Applied to Document Recognition* | Proceedings of the IEEE |
|  | Jia Deng, Wei Dong, R. Socher, Li-Jia Li, K. Li, Li Fei-Fei | 2009 | *ImageNet: A Large-Scale Hierarchical Image Database* | CVPR |
| **Derivative Works** | Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, C. Gatta, Yoshua Bengio | 2014 | *FitNets: Hints for Thin Deep Nets* | ICLR |
|  | Sergey Zagoruyko, N. Komodakis | 2016 | *Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer* | ICLR |
|  | Yonglong Tian, Dilip Krishnan, Phillip Isola | 2019 | *Contrastive Representation Distillation* | ICLR |
|  | Shan You, Chang Xu, Chao Xu, D. Tao | 2017 | *Learning from Multiple Teacher Networks* | KDD |
|  | Guobin Chen, Wongun Choi, Xiang Yu, T. Han, Manmohan Chandraker | 2017 | *Learning Efficient Object Detection Models with Knowledge Distillation* | NIPS |
|  | Baoyun Peng, Xiao Jin, Jiaheng Liu, Shunfeng Zhou, Yichao Wu, Yu Liu, Dongsheng Li, Zhaoning Zhang | 2019 | *Correlation Congruence for Knowledge Distillation* | ICCV |
|  | Ying Zhang, T. Xiang, Timothy M. Hospedales, Huchuan Lu | 2017 | *Deep Mutual Learning* | CVPR |
|  | Jianping Gou, B. Yu, S. Maybank, D. Tao | 2020 | *Knowledge Distillation: A Survey* | IJCV |
|  | Sungsoo Ahn, S. Hu, Andreas C. Damianou, Neil D. Lawrence, Zhenwen Dai | 2019 | *Variational Information Distillation for Knowledge Transfer* | CVPR |
|  | Byeongho Heo, Minsik Lee, Sangdoo Yun, J. Choi | 2018 | *Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons* | AAAI |
|  | Jangho Kim, Seonguk Park, Nojun Kwak | 2018 | *Paraphrasing Complex Network: Network Compression via Factor Transfer* | NIPS |
|  | Suraj Srinivas, R. Venkatesh Babu | 2015 | *Data-Free Parameter Pruning for Deep Neural Networks* | BMVC |
|  | Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, H. Ghasemzadeh | 2019 | *Improved Knowledge Distillation via Teacher Assistant* | AAAI |
|  | Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, L. Itti, Anima Anandkumar | 2018 | *Born Again Neural Networks* | ICML |
|  | Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy | 2020 | *Knowledge Distillation Meets Self-Supervision* | ECCV |
|  | Xu Lan, Xiatian Zhu, S. Gong | 2018 | *Knowledge Distillation by On-the-Fly Native Ensemble* | NIPS |
|  | Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, Jiajun Liang | 2022 | *Decoupled Knowledge Distillation* | CVPR |
|  | Hao Li, Asim Kadav, Igor Durdanovic, H. Samet, H. Graf | 2016 | *Pruning Filters for Efficient ConvNets* | ICLR |
|  | Pengguang Chen, Shu Liu, Hengshuang Zhao, Jiaya Jia | 2021 | *Distilling Knowledge via Knowledge Review* | CVPR |
|  | Quanquan Li, Sheng Jin, Junjie Yan | 2017 | *Mimicking Very Efficient Network for Object Detection* | CVPR |
|  | Lin Wang, Kuk-Jin Yoon | 2020 | *Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks* | IEEE TPAMI |
|  | Xiao Jin, Baoyun Peng, Yichao Wu, Yu Liu, Jiaheng Liu, Ding Liang, Junjie Yan, Xiaolin Hu | 2019 | *Knowledge Distillation via Route Constrained Optimization* | ICCV |
|  | Li Liu, Qingle Huang, Sihao Lin, Hongwei Xie, Bing Wang, Xiaojun Chang, Xiao-Xue Liang | 2021 | *Exploring Inter-Channel Correlation for Diversity-Preserved Knowledge Distillation* | ICCV |
|  | Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, M. Andreetto, Hartwig Adam | 2017 | *MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications* | arXiv |
|  | Tao Wang, Li Yuan, Xiaopeng Zhang, Jiashi Feng | 2019 | *Distilling Object Detectors with Fine-Grained Feature Imitation* | CVPR |
|  | Jang Hyun Cho, B. Hariharan | 2019 | *On the Efficacy of Knowledge Distillation* | ICCV |
|  | Nikolaos Passalis, A. Tefas | 2018 | *Learning Deep Representations with Probabilistic Knowledge Transfer* | ECCV |
|  | Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, Chun Chen | 2020 | *Cross-Layer Distillation with Semantic Calibration* | AAAI |
|  | Dmytro Mishkin, Jiri Matas | 2015 | *All You Need Is a Good Init* | arXiv |


# Additional Related Works of *“Distilling the Knowledge in a Neural Network”* (Hinton, Vinyals & Dean, 2015)

| **Category** | **Author(s)** | **Year** | **Title** | **Venue / Source** |
|---------------|----------------|-----------|------------|--------------------|
| **Derivative Works** | Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf | 2019 | *DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter* | arXiv |
|  | Yihui He, Xiangyu Zhang, Jian Sun | 2017 | *Channel Pruning for Accelerating Very Deep Neural Networks* | ICCV |
|  | Hao Zhou, J. Álvarez, F. Porikli | 2016 | *Less Is More: Towards Compact CNNs* | ECCV |
|  | Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, Changshui Zhang | 2017 | *Learning Efficient Convolutional Networks through Network Slimming* | ICCV |
