DeepEP, torch.compile and Fix Megatron Training Bug #646
Nice, training works for me! My AI had these comments:
Summary: 1.15x faster Megatron training and it actually trains now.
DeepEP
DeepEP allows for faster expert parallel (EP) communication. EP comm and the pre/post-processing work surrounding it take roughly as much time as the actual expert MLP computation (on Qwen 3 30B A3B at least), so improvements here matter. DeepEP gives us a ~1.05x speedup. The gap between DeepEP and Megatron's default dispatcher may grow in multi-node settings.
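As a back-of-envelope check of why the comm path matters (illustrative arithmetic, not measured data): if EP comm plus its pre/post-processing takes about as long as the expert MLP compute, a modest end-to-end speedup implies a noticeably larger speedup on the comm path alone.

```python
# Back-of-envelope: if EP comm (+ pre/post-processing) takes roughly as long
# as the expert MLP compute (comm_frac = 0.5 of step time), how much faster
# must the comm path be to yield a 1.05x end-to-end speedup?
# Illustrative arithmetic only, not measured data.

def comm_speedup_needed(overall_speedup: float, comm_frac: float = 0.5) -> float:
    """Required speedup on the comm fraction, given the overall speedup."""
    compute_frac = 1.0 - comm_frac
    new_total = 1.0 / overall_speedup      # normalized new step time
    new_comm = new_total - compute_frac    # time left for the comm path
    return comm_frac / new_comm

s = comm_speedup_needed(1.05)
print(f"comm path must be ~{s:.2f}x faster")  # ~1.11x
```

So a ~1.05x overall gain corresponds to DeepEP making the comm half of the step roughly 1.1x faster under this assumption.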
torch.compile
We add torch.compile to the model layers and disable some regions that are not compatible with it. This gives a ~1.10x speedup on top of DeepEP. I did not test max-autotune or CUDA graphs here, just basic compilation.
Megatron Training
We noticed that Megatron failed to train in the simple yes-no-maybe example. The cause was the parameter offload: Megatron expects param data tensors to stay constant, and because offload/reload created new tensors, Megatron lost track of them for updates. We switch to Megatron's own offload API, which handles this properly.
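A minimal stdlib sketch of the failure mode, with plain Python lists standing in for param tensors: an optimizer that holds references to parameter storage silently stops training the live model if offload/reload swaps in new objects instead of writing back into the original storage.

```python
# Sketch of the offload bug: the optimizer keeps *references* to parameter
# storage, so replacing that storage with a new object on reload means the
# optimizer updates a stale copy while the model uses the new one.

class TinyOptimizer:
    def __init__(self, params):
        self.params = params  # holds references to the original storage

    def step(self):
        for p in self.params:
            for i in range(len(p)):
                p[i] -= 0.1   # in-place update on whatever it references

param = [1.0, 2.0]
opt = TinyOptimizer([param])

# Buggy offload/reload: allocates NEW storage with the same values.
reloaded = list(param)

opt.step()
print(reloaded)  # [1.0, 2.0] -- the "live" params never got the update
print(param)     # only the stale copy was updated

# A correct reload writes values back into the ORIGINAL storage (e.g.
# reloaded[:] = saved_values), preserving identity -- which is what using
# Megatron's own offload API achieves for real param tensors.
```

The same identity issue applies to real tensors: `param.data = new_tensor` breaks the link the optimizer relies on, while an in-place `copy_` does not.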
We also remove the optimizer offload, since the optimizer is loaded from disk at the start of each job anyway.
Megatron Provider Options
We expose environment variables for controlling Megatron parallelism. We will refactor the configuration system at some point so these can be set more naturally, but this is the minimal control plane for now.
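A minimal sketch of this kind of control plane. The variable names below (`TENSOR_PARALLEL_SIZE`, etc.) are illustrative placeholders, not necessarily the names this PR exposes.

```python
# Hypothetical env-var control plane for Megatron parallelism settings.
# Variable names here are placeholders for illustration.
import os

def parallelism_from_env() -> dict:
    def get_int(name: str, default: int) -> int:
        return int(os.environ.get(name, default))

    return {
        "tensor_model_parallel_size": get_int("TENSOR_PARALLEL_SIZE", 1),
        "pipeline_model_parallel_size": get_int("PIPELINE_PARALLEL_SIZE", 1),
        "expert_model_parallel_size": get_int("EXPERT_PARALLEL_SIZE", 1),
    }

os.environ["EXPERT_PARALLEL_SIZE"] = "8"
print(parallelism_from_env())
```

Unset variables fall back to sane defaults, so existing jobs run unchanged until a variable is explicitly set.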