Experimental: Trainer with Separate Teacher & Student #352
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #352      +/-   ##
==========================================
- Coverage   73.79%   73.37%   -0.42%
==========================================
  Files         171      180       +9
  Lines       17591    17937     +346
==========================================
+ Hits        12981    13162     +181
- Misses       4610     4775     +165
```
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
What does this PR do?
Type of change: New Feature
Overview:
Previous related discussion:
This experimental PR explores resolving distillation inefficiency with a trainer that uses decoupled student and teacher placement and an NCCL communication backend via torch.distributed. It has a few benefits compared to the previously used hf.trainer + FSDP, including support for torch.compile. We observed a 1.5x speedup when student and teacher speeds match each other. See the section below for profiling results.
Some additional benefits include:
torch.

Some drawbacks include:
Usage
To launch training using the new trainer:
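The exact launch command is omitted above. As a rough sketch only, a trainer built on torch.distributed is typically launched with torchrun; the script name and flags below are placeholders, not this PR's actual CLI:

```bash
# Hypothetical launch; the real entrypoint and flags may differ.
torchrun --nproc_per_node=8 train.py \
    --teacher-ranks 4 \
    --student-ranks 4
```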
Testing
All tests were done on the eagle3 online-training workload, on an 8xH100 (NVLink) machine on a CoreWeave cluster:
Training Speed
We use a 4TP+4DDP setting with the new trainer:

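As a concrete illustration of a 4TP+4DDP layout on 8 GPUs, one plausible rank assignment is sketched below: four ranks host the tensor-parallel teacher and four host data-parallel student replicas, with each student rank paired to a teacher rank for point-to-point logit transfer. This mapping is an assumption for illustration, not necessarily how the trainer assigns ranks.

```python
# Hypothetical rank-to-role mapping for a 4TP+4DDP setting on 8 GPUs.
# The actual trainer may partition ranks differently.

def assign_roles(world_size: int = 8, teacher_tp: int = 4):
    teacher_ranks = list(range(teacher_tp))                # tensor-parallel teacher
    student_ranks = list(range(teacher_tp, world_size))    # data-parallel student
    # Pair each student replica with a teacher rank so teacher logits can be
    # sent point-to-point (e.g., over NCCL) without an all-to-all.
    peers = {s: teacher_ranks[i % len(teacher_ranks)]
             for i, s in enumerate(student_ranks)}
    return teacher_ranks, student_ranks, peers

teacher, student, peers = assign_roles()
print(teacher)  # [0, 1, 2, 3]
print(student)  # [4, 5, 6, 7]
print(peers)    # {4: 0, 5: 1, 6: 2, 7: 3}
```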
Comments
Training speed is min(teacher, student), bottlenecked by the slower side.

Mem Efficiency (max training length)
Comments:
Before your PR is "Ready for review"
Additional Information