
Ensure weight transpose is valid for Hopper FP8 training #1596

Merged: timmoon10 merged 4 commits into NVIDIA:main from guyueh1:fix_hopper_fp8_tranpose_missing on Mar 24, 2025


guyueh1 (Contributor) commented Mar 20, 2025

Description

Currently, in Hopper FP8 training, when the weight is a Float8Tensor and weight._transpose has been invalidated by the optimizer step, nothing recreates the weight transpose, so Dgrad errors out. This PR updates the weight usage and ensures that the columnwise data is present before the weight is saved for bprop.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

guyueh1 and others added 2 commits March 20, 2025 11:00
guyueh1 (Contributor Author) commented Mar 20, 2025

@ksivaman @ptrendx please review. This fixes the Hopper FP8 error: at the beginning of a new step, the weight transpose cache is invalid, so backprop errors out. I am specifying column-wise usage for the weight to make sure the transpose is created when running on Hopper.
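To make the failure mode concrete, here is a minimal pure-Python sketch of the behavior described above. It is a toy model only: `ToyFP8Weight`, `ensure_columnwise`, `optimizer_step`, `forward`, and `dgrad` are hypothetical names invented for illustration and are not Transformer Engine APIs. The assumption is simply that the weight caches a transposed (column-wise) copy, that the optimizer step invalidates that cache, and that the backward pass needs it.

```python
class ToyFP8Weight:
    """Toy stand-in for a weight that caches its transpose (hypothetical)."""

    def __init__(self, rows):
        # Row-wise data, shape (out_features, in_features).
        self.rowwise = [list(r) for r in rows]
        # Column-wise (transposed) copy, built lazily.
        self.columnwise = None

    def ensure_columnwise(self):
        # Rebuild the transpose only if the cache is missing.
        if self.columnwise is None:
            self.columnwise = [list(c) for c in zip(*self.rowwise)]

    def optimizer_step(self, lr, grad):
        # Updating the weight makes any cached transpose stale.
        self.rowwise = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(self.rowwise, grad)
        ]
        self.columnwise = None


def forward(weight, x, ensure_columnwise=True):
    # y = W @ x uses only the row-wise data.
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weight.rowwise]
    if ensure_columnwise:
        # The idea of the fix: make the transpose valid before the
        # weight is saved for the backward pass.
        weight.ensure_columnwise()
    return y, weight  # second value plays the role of "saved for backward"


def dgrad(saved_weight, grad_out):
    # dx = W^T @ grad_out needs the column-wise data.
    if saved_weight.columnwise is None:
        raise RuntimeError("weight transpose missing in backward")
    return [
        sum(w * g for w, g in zip(col, grad_out))
        for col in saved_weight.columnwise
    ]
```

In this toy, calling `forward(..., ensure_columnwise=False)` after `optimizer_step` reproduces the error in `dgrad`, while the default path rebuilds the transpose first and the backward pass succeeds.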

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
timmoon10 (Member) left a comment

LGTM

timmoon10 (Member) commented

/te-ci pytorch L1

@timmoon10 timmoon10 mentioned this pull request Mar 22, 2025
13 tasks
guyueh1 (Contributor Author) commented Mar 24, 2025

@timmoon10 I looked at one of the failed L0 tests (unittest H100 1GPU), and it is failing in a module I didn't change (page attention). Are the L0 unit tests expected to pass?

@timmoon10 timmoon10 added the 2.2.0 and bug (Something isn't working) labels Mar 24, 2025
@timmoon10 timmoon10 merged commit 1321b9b into NVIDIA:main Mar 24, 2025
KshitijLakhani pushed a commit that referenced this pull request Mar 24, 2025
* Update usage of weightmat before saving for backward

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for layernorm mlp

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

---------

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
lhb8125 pushed a commit to lhb8125/TransformerEngine that referenced this pull request Apr 8, 2025
