
Fix MXFP8-training related issue #1832

Closed

WanZzzzzz wants to merge 3 commits into NVIDIA:main from WanZzzzzz:fix-colusage-on-main

Conversation

@WanZzzzzz
Contributor

Description

Fix two bugs in MXFP8 training.

(1) The following error occurred when training resumed after validation, because the column-wise data had been cleared by update_usage() during validation:

0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/tensor/_internal/mxfp8_tensor_base.py", line 192, in update_usage
0: [rank0]:     raise RuntimeError(
0: [rank0]: RuntimeError: Requested column-wise usage, but MXFP8Tensor is missing column-scaled FP8 data
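The failure mode in (1) can be sketched with a toy model of the buffer bookkeeping (illustrative names only, not Transformer Engine's actual classes): once a forward-only pass drops the column-wise buffer to save memory, a later step that requests column-wise usage has nothing to work with.

```python
class ToyMXTensor:
    # Toy stand-in for an FP8 tensor that keeps separate row-wise and
    # column-wise quantized buffers; loosely mirrors the situation in
    # MXFP8TensorBase but is NOT its real implementation.
    def __init__(self):
        self._rowwise_data = object()
        self._columnwise_data = object()

    def update_usage(self, rowwise_usage=True, columnwise_usage=True):
        # Dropping an unneeded buffer frees memory...
        if not columnwise_usage:
            self._columnwise_data = None
        if not rowwise_usage:
            self._rowwise_data = None
        # ...but a later request for that usage cannot be honored,
        # because the quantized data is gone.
        if columnwise_usage and self._columnwise_data is None:
            raise RuntimeError(
                "Requested column-wise usage, but tensor is missing "
                "column-scaled FP8 data"
            )

w = ToyMXTensor()
w.update_usage(columnwise_usage=False)   # validation: forward pass only
try:
    w.update_usage(columnwise_usage=True)  # training resumes after validation
except RuntimeError as exc:
    print(exc)  # mirrors the RuntimeError in the traceback above
```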

(2) The following error occurred after training completed, while copying a tensor to the CPU:

1: [rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/tensor/mxfp8_tensor.py", line 394, in _set_data
1: [rank1]:     new_device = tensor.device if tensor.is_cuda else self.device
1: [rank1]:                                   ^^^^^^^^^^^^^^
1: [rank1]: AttributeError: 'MXFP8TensorBase' object has no attribute 'is_cuda'
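Bug (2) is a plain missing-attribute problem: MXFP8TensorBase exposes a device attribute but not torch.Tensor's is_cuda convenience property. A minimal sketch of the failure and one possible fix, using a fake object instead of the real classes (all names here are illustrative):

```python
from types import SimpleNamespace

class FakeBaseTensor:
    """Mimics a tensor base class that has .device but not .is_cuda."""
    def __init__(self, device_type):
        self.device = SimpleNamespace(type=device_type)

def pick_device_buggy(tensor, fallback):
    # Mirrors the failing line: relies on .is_cuda, which the base
    # class does not define, so this raises AttributeError.
    return tensor.device if tensor.is_cuda else fallback

def pick_device_fixed(tensor, fallback):
    # One possible fix: derive the CUDA check from device.type, which
    # both torch.Tensor and the fake base class expose.
    return tensor.device if tensor.device.type == "cuda" else fallback

t = FakeBaseTensor("cpu")
try:
    pick_device_buggy(t, "fallback-device")
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError, as in the traceback above
print(pick_device_fixed(t, "fallback-device"))  # fallback-device
```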

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Changes introduced in this PR:

  • Fix the RuntimeError raised when training resumes after validation (column-wise MXFP8 data cleared by update_usage() during validation).
  • Fix the AttributeError ('MXFP8TensorBase' object has no attribute 'is_cuda') raised when copying a tensor to the CPU after training.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Comment thread on transformer_engine/pytorch/tensor/mxfp8_tensor.py (outdated)
Comment on lines 1245 to 1248
if update_workspace and quantizer is not None and quantizer.columnwise_usage:
    tensor.update_usage(
        rowwise_usage=quantizer.rowwise_usage,
        columnwise_usage=quantizer.columnwise_usage,
Collaborator

@timmoon10 May 29, 2025


This change makes the update_usage logic redundant, so we might as well remove it:

Suggested change
if update_workspace and quantizer is not None and quantizer.columnwise_usage:
    tensor.update_usage(
        rowwise_usage=quantizer.rowwise_usage,
        columnwise_usage=quantizer.columnwise_usage,

That leads us to a more fundamental challenge: inference and validation have different data requirements, and we don't really expose a good way to distinguish between them. If you're doing MXFP8 inference, you don't want MXFP8 column-wise data since that's wasted memory. If you're loading a BF16 checkpoint, doing MXFP8 validation, and then doing MXFP8 training, you need to cast to MXFP8 row-wise and column-wise data when loading the checkpoint. Removing this update_usage logic has the effect of "however you've initialized the weights, you're stuck with those buffers forever". This is unfortunate for inference because the default behavior is to initialize weights for training, i.e. with both row-wise and column-wise data (#1827 does add an option to initialize for inference).

I think removing this logic does make sense eventually, especially after Mcore adds support for #1827, but for now I'm wary of breaking users doing inference. For now, is it possible to fix the bug by configuring distopt to avoid overlapping the param AG with validation?
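The memory argument above can be made concrete with a back-of-envelope sketch. This assumes the MX convention of 32-element blocks with a 1-byte scale per block, and ignores padding and alignment, so the numbers are rough:

```python
# Rough MXFP8 footprint: one FP8 byte per element plus one scale byte
# per 32-element block.  Row-wise and column-wise copies are separate
# buffers, so keeping both roughly doubles the quantized footprint.
def mxfp8_bytes(m, n):
    data = m * n            # 1 byte per FP8 element
    scales = (m * n) // 32  # 1 scale byte per 32-element block
    return data + scales

m, n = 4096, 4096
rowwise_only = mxfp8_bytes(m, n)       # inference: row-wise data suffices
both_usages = 2 * mxfp8_bytes(m, n)    # training: row-wise + column-wise
print(rowwise_only, both_usages)
```

This is why initializing weights for training (both usages) wastes memory in a pure-inference setting, which is the motivation for the inference-initialization option mentioned in #1827.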

Contributor Author


Thanks! Why do you think this bug has to do with overlapping the param AG with validation? I thought that as long as we resume training after validation, this error would occur.

Collaborator


Ah, you're right. I'll incorporate this logic in #1827.

@timmoon10
Collaborator

I've included these fixes in #1827.

@timmoon10 closed this May 29, 2025