
Conversation

@RaymondLi0 (Contributor) commented Nov 14, 2025

✨ Description

Closes #385 (Activation/feature-level distillation).

Sanity checks:

  • Loading the student and teacher from the same pretrained model gives 0 distillation loss ✔️, although the loss then increases to a small value during training instead of staying at 0 (a minimal sketch of this check follows the plot below).
  • Distilling from scratch with the same architecture does not lead to 0 loss (orange curve).
  • Distilling from the pretrained model, but with a sliding window, leads to a low loss, which gets lower with a larger window (green and purple curves).
[Plot: lm-loss curves for the sanity-check runs above]
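For reference, a minimal sketch of the logit-distillation check described above, assuming PyTorch-style `(batch, seq, vocab)` logits; the function name, temperature handling, and reduction are illustrative and not necessarily what this PR implements:

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary, averaged over tokens."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) so "batchmean"
    # averages over all token positions.
    return (
        F.kl_div(s.flatten(0, -2), t.flatten(0, -2), log_target=True, reduction="batchmean")
        * temperature**2
    )

# Sanity check: identical student and teacher logits give (numerically) zero loss.
logits = torch.randn(2, 16, 1024)
assert logit_distillation_loss(logits, logits.clone()).abs().item() < 1e-6
```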
|                   | lm-loss | Logit distillation | Logit + Activation distillation |
|-------------------|---------|--------------------|---------------------------------|
| Tokens/s/GPU      | 3500    | 2900               | 2800                            |
| max_reserved (GB) | 44      | 77                 | 78                              |

One caveat: the distillation runs seem to experience memory spikes at specific points in training, and the actual memory usage was lower most of the time:

[Plot: GPU memory usage over training, showing occasional spikes]
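Similarly, a hedged sketch of the activation (feature-level) part compared in the table above, assuming per-layer lists of hidden states; the optional projection for mismatched hidden sizes and the loss weights in the comment are assumptions, not this PR's exact API:

```python
import torch
import torch.nn.functional as F

def activation_distillation_loss(student_hidden, teacher_hidden, projections=None):
    """MSE between student and teacher hidden states for the distilled layers.

    Both arguments are lists of (batch, seq, hidden) tensors, one per layer.
    `projections` (optional, one module per layer) maps student features to the
    teacher's hidden size when the two architectures differ.
    """
    loss = 0.0
    for i, (s, t) in enumerate(zip(student_hidden, teacher_hidden)):
        if projections is not None:
            s = projections[i](s)
        # Teacher activations are treated as constants (no gradient through the teacher).
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_hidden)

# A typical combined objective (weights are illustrative):
# total_loss = lm_loss + logit_weight * logit_loss + activation_weight * activation_loss
```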

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:

@tscholak (Collaborator) commented:

Great progress! Did you freeze everything except the randomly initialized mixers?

