Support FP8 primary weight in FSDP training by shjwudp · Pull Request #1630 · NVIDIA/TransformerEngine

shjwudp · 2025-04-01T09:09:45Z

Description

This PR modifies the cast_master_weights_to_fp8 function used for FP8 primary weights so that FP8 primary weights can be used in FSDP training.

In FSDP training, the local model weight may be incomplete: model_weight._data may be a DTensor (FSDP2) or may have been resized for parameter sharding. As a result, the address of the actual model weight shard cannot be obtained by slicing, as in model_weight._data.view(-1)[start_offset:end_offset]. This PR extends cast_master_weights_to_fp8 to accept the sharded model weight directly as input, which covers this FSDP-specific case.
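The difference between the two addressing modes can be illustrated with a toy, pure-Python sketch. It is not the TransformerEngine implementation: the function names (toy_quantize, cast_master_shard), the parameters full_weight and shard_weight, and the int8 rounding that stands in for the FP8 cast are all hypothetical and chosen only for illustration.

```python
from array import array


def toy_quantize(values, scale):
    """Stand-in for an FP8 cast (assumption): scale, round, clamp to int8 range."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def cast_master_shard(master_shard, start_offset, full_weight=None,
                      shard_weight=None, scale=1.0):
    """Write a quantized master-weight shard into low-precision storage.

    full_weight:  flat buffer holding the whole parameter; the destination
                  is located by slicing [start_offset:end_offset].
    shard_weight: flat buffer holding only this rank's shard (the FSDP-like
                  case); it is written directly and no offset is needed.
    """
    quantized = array("b", toy_quantize(master_shard, scale))

    if shard_weight is not None:
        # Shard passed in directly: sizes must match, offsets are irrelevant.
        if len(shard_weight) != len(quantized):
            raise ValueError("shard buffer size does not match master shard")
        shard_weight[:] = quantized
        return shard_weight

    if full_weight is None:
        raise ValueError("either full_weight or shard_weight is required")

    # Slice-based addressing: only valid if the buffer is the full parameter.
    end_offset = start_offset + len(quantized)
    if end_offset > len(full_weight):
        raise ValueError("offsets fall outside the local buffer")
    full_weight[start_offset:end_offset] = quantized
    return full_weight


# Rank 1 of 2 owns elements 4..8 of an 8-element parameter.
master = [0.5, -0.25, 1.0, 0.75]

full = array("b", [0] * 8)          # whole parameter available locally
cast_master_shard(master, 4, full_weight=full, scale=4.0)

local = array("b", [0] * 4)         # only the local shard is materialized
cast_master_shard(master, 4, shard_weight=local, scale=4.0)
```

When the local buffer holds only the shard, the slice-based path would need offsets 4..8 in a 4-element buffer and is rejected, while passing the shard buffer directly works. This mirrors, in simplified form, why the PR lets the caller hand over the sharded model weight.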

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: jianbinc <shjwudp@gmail.com>

ksivaman · 2025-04-03T01:47:02Z

/te-ci pytorch L0 L1

Support fp8 primary weight in fsdp training

Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

shjwudp force-pushed the fp8_primary_weight_support_for_fsdp branch from 422f432 to bc4b9e9 Compare April 2, 2025 11:25

shjwudp closed this Apr 2, 2025

shjwudp reopened this Apr 2, 2025

Support fp8 primary weight in fsdp training

d14c1f0

Signed-off-by: jianbinc <shjwudp@gmail.com>

shjwudp force-pushed the fp8_primary_weight_support_for_fsdp branch from bc4b9e9 to d14c1f0 Compare April 2, 2025 11:32

shjwudp changed the title ~~Support FP8 primary weight with FSDP~~ Support FP8 primary weight in FSDP training Apr 2, 2025

Merge branch 'main' into fp8_primary_weight_support_for_fsdp

8734e8b

Merge branch 'main' into fp8_primary_weight_support_for_fsdp

90ab070

ksivaman approved these changes Apr 7, 2025

View reviewed changes

ksivaman merged commit c84d170 into NVIDIA:main Apr 7, 2025
2 checks passed

ksivaman merged 3 commits into NVIDIA:main from shjwudp:fp8_primary_weight_support_for_fsdp
