Knowledge Distillation (KD) is a powerful technique for enhancing the performance of compact student models. Among its variants, Correlation Congruence Knowledge Distillation (CCKD) has demonstrated excellent generalization in practice by aligning the Gram matrices of feature maps between teacher and student networks. However, the theoretical principles underlying its success have remained largely underexplored.
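For orientation, the correlation-congruence term can be sketched as a Frobenius-norm match between the instance-wise Gram matrices of teacher and student features within a batch. The following form is an illustrative sketch with assumed notation ($F_t, F_s \in \mathbb{R}^{b \times d}$ are the batch feature matrices of teacher and student, $b$ the batch size, $\lambda \ge 0$ a balancing weight), not necessarily the exact formulation analyzed below:

$$\mathcal{L}_{\mathrm{CCKD}} \;=\; \mathcal{L}_{\mathrm{KD}} \;+\; \lambda\,\mathcal{L}_{\mathrm{CC}}, \qquad \mathcal{L}_{\mathrm{CC}} \;=\; \frac{1}{b^{2}}\,\bigl\lVert F_t F_t^{\top} - F_s F_s^{\top} \bigr\rVert_F^{2}.$$

Minimizing $\mathcal{L}_{\mathrm{CC}}$ aligns the pairwise inner-product (correlation) structure of the two feature spaces rather than the individual feature vectors.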
This paper aims to provide a rigorous theoretical justification for the generalization benefits of CCKD. Our analysis is conducted within the Neural Tangent Kernel (NTK) framework, a cornerstone of modern deep learning theory. Our key contributions are as follows:
- A Novel Geometric Condition: We introduce a new, interpretable geometric assumption, which we term "Hessian-Transformed Geometric Alignment." This condition establishes a link between the direction of CCKD's feature alignment and the curvature (i.e., flatness) of the standard distillation loss landscape.
- Proof of Implicit Regularization: Under this geometric condition, we employ a stability analysis of the optimization dynamics to formally prove that CCKD guides the student solution towards a smaller norm in the Reproducing Kernel Hilbert Space (RKHS). This reveals the role of CCKD as an effective implicit regularizer.
- Derivation of a Generalization Bound: Building on this norm-reduction effect, we derive a comparative upper bound on the generalization error. The bound shows that, with a proper hyperparameter setting, the generalization bound for CCKD can be strictly tighter than that for standard KD. We further prove that, for a sufficiently large sample size, there must exist a non-empty interval of the hyperparameter $\lambda$ that guarantees this improvement (a schematic version of the underlying norm-to-bound mechanism is sketched after this list).
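To illustrate why a smaller RKHS norm tightens generalization in the NTK setting, consider the standard kernel-class bound below. This is only a generic sketch under common assumptions (an $L$-Lipschitz loss bounded in $[0,1]$, the NTK $\Theta(x,x') = \langle \nabla_\theta f(x;\theta_0), \nabla_\theta f(x';\theta_0)\rangle$ taken at initialization, and $\Theta(X,X)$ its Gram matrix on the $n$ training inputs); the comparative bound derived in this paper and its constants may differ. With probability at least $1-\delta$,

$$R(f) \;\le\; \widehat{R}(f) \;+\; \frac{2L\,\lVert f\rVert_{\mathcal{H}}}{n}\sqrt{\operatorname{tr}\bigl(\Theta(X,X)\bigr)} \;+\; 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$

The complexity term scales linearly with $\lVert f\rVert_{\mathcal{H}}$, so the norm reduction established in the second contribution translates directly into a tighter bound for CCKD than for standard KD.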
In summary, this work is the first to theoretically establish a clear connection between CCKD's feature matching, the geometric curvature of the loss landscape, model complexity (RKHS norm), and final generalization performance. It offers a new theoretical perspective for understanding and applying feature-based distillation methods.