# Knowledge Distillation

Transfer knowledge from a large teacher model to a smaller student model.

## Logit-Level Distillation

Classic knowledge distillation using softened output logits.

```python
from flashoptim import FlashOptim, KnowledgeDistiller, Trainer

teacher = FlashOptim("pretrained/teacher_large.pth")
student = FlashOptim("pretrained/student_small.pth")

distiller = KnowledgeDistiller(
    temperature=4.0,
    alpha=0.7,  # Weight of distillation loss vs task loss
)

trainer = Trainer(distiller=distiller, epochs=100)
trained_student = trainer.train(teacher=teacher, student=student, data="data/train/")
```

## Feature-Level Distillation

Match intermediate feature representations between teacher and student.

```python
from flashoptim import FeatureDistiller

distiller = FeatureDistiller(
    teacher_layers=["backbone.layer3", "backbone.layer4"],
    student_layers=["backbone.layer2", "backbone.layer3"],
    loss_type="mse",
)
```

## Self-Distillation

Distill knowledge within the same model across layers or augmentations.

```python
from flashoptim.distillation import SelfDistiller

distiller = SelfDistiller(
    method="layer",  # "layer" or "augmentation"
    teacher_layer="backbone.layer4",
    student_layer="backbone.layer2",
)
```

## Configuration

```yaml
distillation:
  method: knowledge
  temperature: 4.0
  alpha: 0.7
  teacher_path: pretrained/teacher_large.pth
  student_path: pretrained/student_small.pth
```

## CLI Usage

```bash
flashoptim distill --config configs/flashoptim_distill_det.yaml
```

## Tips

- Higher temperature (3-10) softens probabilities more, transferring "dark knowledge"
- Alpha balances task loss vs distillation loss (0.5-0.9 typical)
- Feature distillation helps when teacher and student have different architectures
- Combine with pruning or quantization for maximum compression
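
To make the roles of `temperature` and `alpha` concrete, here is a minimal plain-Python sketch of the commonly used logit-distillation loss: a temperature-softened KL term between teacher and student, blended with the ordinary cross-entropy on the true label. The helper names `softmax` and `distillation_loss` are illustrative and not part of `flashoptim`. The `temperature ** 2` scaling and the `(1 - alpha)` weight on the task loss follow the widely used formulation; whether `KnowledgeDistiller` computes exactly this is an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.7):
    """Blend a softened KL term with hard-label cross-entropy (illustrative helper).

    alpha weights the distillation term; (1 - alpha) weights the task term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Multiplying by T^2 keeps gradient magnitudes comparable across temperatures.
    soft_loss = (temperature ** 2) * kl

    # Standard cross-entropy against the ground-truth class at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[label])

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    teacher_logits = [8.0, 2.0, 0.5]
    student_logits = [3.0, 1.0, 0.2]

    # A higher temperature spreads probability mass onto the non-target classes.
    print(softmax(teacher_logits, temperature=1.0))
    print(softmax(teacher_logits, temperature=4.0))
    print(distillation_loss(student_logits, teacher_logits, label=0))
```

Printing the teacher distribution at both temperatures shows why a higher value exposes more of the relative ranking among the non-target classes, which is the "dark knowledge" the student learns from.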