Describe the bug
TransformerEngine PR-2411 introduced optimizations to improve NVFP4 computation performance. Previously, an image built from this PR delivered significant performance gains during pre-training of the Moonlight-16B model. The PR was merged into the TE main branch two weeks ago. However, when an image is built from the current main branch and the same NVFP4 training configuration is run, the per-GPU throughput drops dramatically and becomes highly unstable.
Steps/Code to reproduce bug
To reproduce the issue, perform the following comparative experiments:
Step 1: Build two Slurm images
Refer to the script at link to build two Slurm images: one based on PR-2411 and the other on TE main (commit c988548). A quick check to confirm which TE build each image actually contains is sketched below.
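Not part of the original report, but a useful sanity check before comparing the two images: confirm inside each container which TransformerEngine build is installed. This is a minimal sketch; the source-tree path is an assumption and depends on the image layout.

```python
# Hypothetical sanity check: run inside each container to confirm
# which TransformerEngine build it actually contains.
import subprocess
import transformer_engine as te

# The package version string; source builds from main often embed a
# dev/commit suffix (assumption: depends on how TE was built).
print("TE version:", te.__version__)

# If the TE source tree is present in the image, read the exact commit
# from git directly (the path is an assumption; adjust to the image).
result = subprocess.run(
    ["git", "-C", "/opt/TransformerEngine", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True,
)
print("TE commit:", result.stdout.strip())
```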
Step 2: Launch a training job
Refer to the script at link to launch a training job. It does not have to be the Moonlight model; other models are likely to hit the same performance issue. A minimal standalone NVFP4 micro-benchmark is sketched after this step.
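To localize the regression without a full Moonlight run, a standalone micro-benchmark of a single TE layer under the NVFP4 recipe can be timed in both images. This is a rough sketch, not the reporter's script: the recipe class name NVFP4BlockScaling is an assumption (substitute whatever recipe the training config uses), and the shapes are placeholders.

```python
import time
import torch
import transformer_engine.pytorch as te_pt
from transformer_engine.common.recipe import NVFP4BlockScaling  # assumption: recipe class name

# Placeholder shapes; pick sizes representative of the Moonlight config.
layer = te_pt.Linear(8192, 8192, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 2048, 8192, device="cuda", dtype=torch.bfloat16, requires_grad=True)
recipe = NVFP4BlockScaling()

def step():
    # Forward + backward through one quantized GEMM layer.
    with te_pt.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = layer(x)
    y.sum().backward()

# Warm up, then time a fixed number of iterations in each image and compare.
for _ in range(10):
    step()
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(50):
    step()
torch.cuda.synchronize()
print(f"mean step time: {(time.perf_counter() - t0) / 50 * 1e3:.2f} ms")
```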
Expected behavior
Training performance with these two images is expected to be similar, but in practice the gap is huge (570 TFLOPS vs. 240-400 TFLOPS); see log1 and log2. This makes the NVFP4 training solution unusable. A short script for extracting and comparing the per-GPU throughput from the two logs is sketched below.
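Not from the report, just a sketch for quantifying the gap and the instability: pull the per-step throughput values out of the two logs and compare mean and spread. The log-line pattern and file names are assumptions based on Megatron-style output; adjust the regex to the actual log format.

```python
import re
import statistics

# Assumption: Megatron-style log lines such as
#   "... throughput per GPU (TFLOP/s/GPU): 570.3 ..."
PATTERN = re.compile(r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9.]+)")

def tflops(path):
    # Collect every per-step throughput value found in the log file.
    with open(path) as f:
        return [float(m.group(1)) for m in PATTERN.finditer(f.read())]

for name in ("log1.txt", "log2.txt"):  # hypothetical names for the two attached logs
    vals = tflops(name)
    print(f"{name}: n={len(vals)} mean={statistics.mean(vals):.1f} "
          f"min={min(vals):.1f} max={max(vals):.1f} "
          f"stdev={statistics.pstdev(vals):.1f}")
```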
Environment overview (please complete the following information)