Describe the bug
TransformerEngine PR-2411 introduced optimizations to improve NVFP4 computation performance. Previously, an image built from this PR delivered significant performance gains during pre-training of the Moonlight-16B model. The PR was merged into the TE main branch two weeks ago. However, when an image is built from the current main branch and the same NVFP4 training configuration is run, the per-GPU throughput drops dramatically and becomes highly unstable.
Steps/Code to reproduce bug
To reproduce the issue, perform the following comparative experiments:
Step 1: Build two Slurm images
Refer to the script at link to build two Slurm images: one based on PR-2411 and the other on TE main (commit c988548). A quick check to confirm which TE build each image actually contains is sketched below.
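Not part of the original report, but a useful sanity check before comparing the two images: confirm inside each container which TransformerEngine build is installed. This is a minimal sketch; the source-tree path is an assumption and depends on the image layout.

```python
# Hypothetical sanity check: run inside each container to confirm
# which TransformerEngine build it actually contains.
import subprocess
import transformer_engine as te

# The package version string; source builds from main often embed a
# dev/commit suffix (assumption: depends on how TE was built).
print("TE version:", te.__version__)

# If the TE source tree is present in the image, read the exact commit
# from git directly (the path is an assumption; adjust to the image).
result = subprocess.run(
    ["git", "-C", "/opt/TransformerEngine", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True,
)
print("TE commit:", result.stdout.strip())
```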
Step 2: Launch a training job
Refer to the script at link to launch a training job. It does not have to be the Moonlight model; other models are likely to hit the same performance issue. A minimal standalone NVFP4 micro-benchmark is sketched after this step.
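To localize the regression without a full Moonlight run, a standalone micro-benchmark of a single TE layer under the NVFP4 recipe can be timed in both images. This is a rough sketch, not the reporter's script: the recipe class name NVFP4BlockScaling is an assumption (substitute whatever recipe the training config uses), and the shapes are placeholders.

```python
import time
import torch
import transformer_engine.pytorch as te_pt
from transformer_engine.common.recipe import NVFP4BlockScaling  # assumption: recipe class name

# Placeholder shapes; pick sizes representative of the Moonlight config.
layer = te_pt.Linear(8192, 8192, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 2048, 8192, device="cuda", dtype=torch.bfloat16, requires_grad=True)
recipe = NVFP4BlockScaling()

def step():
    # Forward + backward through one quantized GEMM layer.
    with te_pt.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = layer(x)
    y.sum().backward()

# Warm up, then time a fixed number of iterations in each image and compare.
for _ in range(10):
    step()
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(50):
    step()
torch.cuda.synchronize()
print(f"mean step time: {(time.perf_counter() - t0) / 50 * 1e3:.2f} ms")
```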
Expected behavior
Training performance with these two images is expected to be similar, but in practice the gap is huge (570 TFLOPS vs. 240-400 TFLOPS); see log1 and log2. This makes the NVFP4 training solution unusable. A short script for extracting and comparing the per-GPU throughput from the two logs is sketched below.
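Not from the report, just a sketch for quantifying the gap and the instability: pull the per-step throughput values out of the two logs and compare mean and spread. The log-line pattern and file names are assumptions based on Megatron-style output; adjust the regex to the actual log format.

```python
import re
import statistics

# Assumption: Megatron-style log lines such as
#   "... throughput per GPU (TFLOP/s/GPU): 570.3 ..."
PATTERN = re.compile(r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9.]+)")

def tflops(path):
    # Collect every per-step throughput value found in the log file.
    with open(path) as f:
        return [float(m.group(1)) for m in PATTERN.finditer(f.read())]

for name in ("log1.txt", "log2.txt"):  # hypothetical names for the two attached logs
    vals = tflops(name)
    print(f"{name}: n={len(vals)} mean={statistics.mean(vals):.1f} "
          f"min={min(vals):.1f} max={max(vals):.1f} "
          f"stdev={statistics.pstdev(vals):.1f}")
```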
Environment overview (please complete the following information)