
NVFP4 performance regression in TE main branch #2558

@cael-ling

Description


Describe the bug

TransformerEngine PR-2411 introduced optimizations to improve NVFP4 computation performance. An image built directly from this PR previously delivered significant performance gains during pre-training of the Moonlight-16B model. The PR was merged into the TE main branch two weeks ago. However, with an image built from the main branch and the same NVFP4 training configuration, per-GPU throughput drops dramatically and becomes highly unstable.

Steps/Code to reproduce bug

To reproduce the issue, perform the following comparative experiments:

Step 1: Build two Slurm images.
Refer to the script at link to build two Slurm images: one based on PR-2411, and the other based on TE main (commit c988548).

Step 2: Launch a training job.
Refer to the script at link to launch a training job. It does not have to be the Moonlight model; other models are likely to show similar performance issues.

Expected behavior

Training performance with these two images is expected to be similar, but in practice the gap is large (570 TFLOPS vs. 240-400 TFLOPS; see log1 and log2), which makes the NVFP4 training solution unusable.
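To make the "unstable throughput" claim concrete, the per-iteration TFLOPS reported in the two logs can be summarized with a mean and a coefficient of variation. This is a minimal sketch: the log lines and the `TFLOP/s/GPU` field name below are hypothetical stand-ins for whatever format the actual training logs use, so the regex would need adjusting to match log1/log2.

```python
import re
import statistics

# Hypothetical excerpt of per-iteration training log lines; the real format
# in log1/log2 may differ -- adjust the regex to match it.
LOG = """\
iteration 10/1000 | throughput per GPU (TFLOP/s/GPU): 571.2 |
iteration 20/1000 | throughput per GPU (TFLOP/s/GPU): 568.9 |
iteration 30/1000 | throughput per GPU (TFLOP/s/GPU): 244.1 |
iteration 40/1000 | throughput per GPU (TFLOP/s/GPU): 402.7 |
"""

def throughput_stats(log_text: str):
    """Parse per-iteration TFLOPS and return (mean, stdev, coefficient of variation)."""
    vals = [float(v) for v in re.findall(r"TFLOP/s/GPU\): ([0-9.]+)", log_text)]
    mean = statistics.fmean(vals)
    stdev = statistics.stdev(vals) if len(vals) > 1 else 0.0
    return mean, stdev, stdev / mean

mean, stdev, cv = throughput_stats(LOG)
print(f"mean={mean:.1f} TFLOPS, stdev={stdev:.1f}, cv={cv:.2f}")
```

A high coefficient of variation (here the sampled values swing between the ~570 TFLOPS seen with the PR-2411 image and the 240-400 TFLOPS range seen with main) is a simple way to report the instability alongside the raw logs.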

Environment overview (please complete the following information)

  • Environment location: Slurm cluster
  • Method of Transformer Engine install: please refer to link
  • Docker pull & docker run commands used: please refer to link


Labels: bug (Something isn't working)
