
Conversation

@RaymondLi0 (Contributor) commented Nov 14, 2025

✨ Description

Closes #385 (Activation/feature-level distillation).

Sanity checks:

  • Loading the student and teacher from the same pretrained model gives 0 distillation loss ✔️, although the loss then increases to a small value during training instead of staying at 0 (a minimal sketch of this check follows the plot below).
  • Distilling from scratch with the same architecture does not lead to 0 loss (orange curve).
  • Distilling from the pretrained model, but with a sliding window, leads to a low loss, which gets lower with a larger window (green and purple curves).
[Plot: lm-loss curves for the sanity-check runs above]
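For reference, a minimal sketch of the logit-distillation check described above, assuming PyTorch-style `(batch, seq, vocab)` logits; the function name, temperature handling, and reduction are illustrative and not necessarily what this PR implements:

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary, averaged over tokens."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) so "batchmean"
    # averages over all token positions.
    return (
        F.kl_div(s.flatten(0, -2), t.flatten(0, -2), log_target=True, reduction="batchmean")
        * temperature**2
    )

# Sanity check: identical student and teacher logits give (numerically) zero loss.
logits = torch.randn(2, 16, 1024)
assert logit_distillation_loss(logits, logits.clone()).abs().item() < 1e-6
```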
|                   | lm-loss | Logit distillation | Logit + Activation distillation |
|-------------------|---------|--------------------|---------------------------------|
| Tokens/s/GPU      | 3500    | 2900               | 2800                            |
| max_reserved (GB) | 44      | 77                 | 78                              |

One caveat: the distillation runs seem to experience memory spikes at specific points in training, and the actual memory usage was lower most of the time:

[Plot: GPU memory usage over training, showing occasional spikes]
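Similarly, a hedged sketch of the activation (feature-level) part compared in the table above, assuming per-layer lists of hidden states; the optional projection for mismatched hidden sizes and the loss weights in the comment are assumptions, not this PR's exact API:

```python
import torch
import torch.nn.functional as F

def activation_distillation_loss(student_hidden, teacher_hidden, projections=None):
    """MSE between student and teacher hidden states for the distilled layers.

    Both arguments are lists of (batch, seq, hidden) tensors, one per layer.
    `projections` (optional, one module per layer) maps student features to the
    teacher's hidden size when the two architectures differ.
    """
    loss = 0.0
    for i, (s, t) in enumerate(zip(student_hidden, teacher_hidden)):
        if projections is not None:
            s = projections[i](s)
        # Teacher activations are treated as constants (no gradient through the teacher).
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_hidden)

# A typical combined objective (weights are illustrative):
# total_loss = lm_loss + logit_weight * logit_loss + activation_weight * activation_loss
```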

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:

@tscholak (Collaborator) commented:

Great progress! Did you freeze everything except the randomly initialized mixers?

