-
Notifications
You must be signed in to change notification settings - Fork 0
Distillation
Gaurav14cs17 edited this page Jun 21, 2026
·
1 revision
Transfer knowledge from a large teacher model to a smaller student model.
Classic knowledge distillation using softened output logits.
from flashoptim import FlashOptim, KnowledgeDistiller, Trainer
teacher = FlashOptim("pretrained/teacher_large.pth")
student = FlashOptim("pretrained/student_small.pth")
distiller = KnowledgeDistiller(
temperature=4.0,
alpha=0.7, # Weight of distillation loss vs task loss
)
trainer = Trainer(distiller=distiller, epochs=100)
trained_student = trainer.train(teacher=teacher, student=student, data="data/train/")Match intermediate feature representations between teacher and student.
from flashoptim import FeatureDistiller
distiller = FeatureDistiller(
teacher_layers=["backbone.layer3", "backbone.layer4"],
student_layers=["backbone.layer2", "backbone.layer3"],
loss_type="mse",
)Distill knowledge within the same model across layers or augmentations.
from flashoptim.distillation import SelfDistiller
distiller = SelfDistiller(
method="layer", # "layer" or "augmentation"
teacher_layer="backbone.layer4",
student_layer="backbone.layer2",
)distillation:
method: knowledge
temperature: 4.0
alpha: 0.7
teacher_path: pretrained/teacher_large.pth
student_path: pretrained/student_small.pthflashoptim distill --config configs/flashoptim_distill_det.yaml- Higher temperature (3-10) softens probabilities more, transferring "dark knowledge"
- Alpha balances task loss vs distillation loss (0.5-0.9 typical)
- Feature distillation helps when teacher and student have different architectures
- Combine with pruning or quantization for maximum compression
FlashOptim — Model optimization toolkit | PyPI | MIT License