Ensure weight transpose is valid for Hopper FP8 training by guyueh1 · Pull Request #1596 · NVIDIA/TransformerEngine

guyueh1 · 2025-03-20T18:32:08Z

Description

Currently, in Hopper FP8 training, when the weight is a Float8Tensor and weight._transpose is invalidated by the optimizer step, nothing recreates the weight transpose, so Dgrad errors out. This PR updates the weight usage and ensures the columnwise data is present before the weight is saved for bprop.
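
To make the failure mode concrete, here is a minimal pure-Python sketch of a weight with a cached transpose. It is an illustration under stated assumptions, not TransformerEngine code: `ToyFp8Weight`, `optimizer_step`, and `dgrad` are hypothetical names, plain lists stand in for FP8 data, and `update_usage(columnwise_usage=True)` only echoes the wording of this PR rather than reproducing the library's implementation.

```python
class ToyFp8Weight:
    """Hypothetical stand-in for an FP8 weight that caches its transpose."""

    def __init__(self, rows):
        self.data = [list(r) for r in rows]  # row-wise data, read by fprop
        self.transpose = None                # column-wise data, read by dgrad

    def update_usage(self, columnwise_usage=False):
        # Materialize the transpose only if it is requested and missing.
        if columnwise_usage and self.transpose is None:
            self.transpose = [list(col) for col in zip(*self.data)]

    def optimizer_step(self, delta):
        # Changing the values makes a cached transpose stale, so drop it.
        self.data = [[v + delta for v in row] for row in self.data]
        self.transpose = None


def dgrad(grad_out, weight):
    """Compute dX = dY @ W, reading W through its cached transpose."""
    if weight.transpose is None:
        raise RuntimeError("weight transpose is missing")
    # weight.transpose[j] is column j of W, so dX[i][j] = dot(dY[i], W[:, j]).
    return [
        [sum(g * w for g, w in zip(row, col)) for col in weight.transpose]
        for row in grad_out
    ]


w = ToyFp8Weight([[1.0, 2.0], [3.0, 4.0]])
w.update_usage(columnwise_usage=True)
first_step = dgrad([[1.0, 0.0]], w)

w.optimizer_step(1.0)  # invalidates the cached transpose
try:
    dgrad([[1.0, 0.0]], w)
    failed_without_fix = False
except RuntimeError:
    failed_without_fix = True

# The idea of the fix: request column-wise usage before the weight is
# saved for the backward pass, so the transpose is rebuilt when needed.
w.update_usage(columnwise_usage=True)
second_step = dgrad([[1.0, 0.0]], w)
```

In this toy model the second backward pass only succeeds because column-wise usage is requested again after the optimizer step, which is the behavior the description above attributes to the fix.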

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 · 2025-03-20T18:50:26Z

@ksivaman @ptrendx please review. This fixes the Hopper FP8 error: at the beginning of a new step, the weight transpose cache is invalid, so the backprop errors out. I am specifying that the weight has column-wise usage to make sure it creates the transpose if it's on Hopper.

timmoon10

LGTM

timmoon10 · 2025-03-22T00:50:01Z

/te-ci pytorch L1

guyueh1 · 2025-03-24T16:04:38Z

@timmoon10 I looked at one of the failed L0 tests (unittest H100 1GPU) and it's failing for a module I didn't change (page attention). Are the L0 unit tests expected to pass?

* Update usage of weightmat before saving for backward

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for layernorm mlp

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

---------

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

guyueh1 and others added 2 commits March 20, 2025 11:00

Update usage of weightmat before saving for backward

b2c6dd0

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

0f883b0

for more information, see https://pre-commit.ci

Fix for layernorm mlp

945246f

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

timmoon10 approved these changes Mar 22, 2025

Merge branch 'main' into fix_hopper_fp8_tranpose_missing

0bc8368

timmoon10 mentioned this pull request Mar 22, 2025

Fix mxfp8 columnwise data missing #1593

Merged

13 tasks

timmoon10 added the 2.2.0 and bug (Something isn't working) labels Mar 24, 2025

timmoon10 merged commit 1321b9b into NVIDIA:main Mar 24, 2025

sudhu2k mentioned this pull request Aug 20, 2025

Ensure weight transpose is valid for FP8 training ROCm/TransformerEngine#276

Merged

13 tasks

hungryGeek16 mentioned this pull request May 31, 2026

fix unfused padding causal sdpa #3063

Open

timmoon10 merged 4 commits into NVIDIA:main from guyueh1:fix_hopper_fp8_tranpose_missing