Skip to content

Distillation

Gaurav14cs17 edited this page Jun 21, 2026 · 1 revision

Knowledge Distillation

Transfer knowledge from a large teacher model to a smaller student model.

Logit-Level Distillation

Classic knowledge distillation using softened output logits.

from flashoptim import FlashOptim, KnowledgeDistiller, Trainer

teacher = FlashOptim("pretrained/teacher_large.pth")
student = FlashOptim("pretrained/student_small.pth")

distiller = KnowledgeDistiller(
    temperature=4.0,
    alpha=0.7,  # Weight of distillation loss vs task loss
)
trainer = Trainer(distiller=distiller, epochs=100)
trained_student = trainer.train(teacher=teacher, student=student, data="data/train/")

Feature-Level Distillation

Match intermediate feature representations between teacher and student.

from flashoptim import FeatureDistiller

distiller = FeatureDistiller(
    teacher_layers=["backbone.layer3", "backbone.layer4"],
    student_layers=["backbone.layer2", "backbone.layer3"],
    loss_type="mse",
)

Self-Distillation

Distill knowledge within the same model across layers or augmentations.

from flashoptim.distillation import SelfDistiller

distiller = SelfDistiller(
    method="layer",  # "layer" or "augmentation"
    teacher_layer="backbone.layer4",
    student_layer="backbone.layer2",
)

Configuration

distillation:
  method: knowledge
  temperature: 4.0
  alpha: 0.7
  teacher_path: pretrained/teacher_large.pth
  student_path: pretrained/student_small.pth

CLI Usage

flashoptim distill --config configs/flashoptim_distill_det.yaml

Tips

  • Higher temperature (3-10) softens probabilities more, transferring "dark knowledge"
  • Alpha balances task loss vs distillation loss (0.5-0.9 typical)
  • Feature distillation helps when teacher and student have different architectures
  • Combine with pruning or quantization for maximum compression

Clone this wiki locally