
Fix MXFP8-training related issue #1832

Closed

WanZzzzzz wants to merge 3 commits into NVIDIA:main from WanZzzzzz:fix-colusage-on-main

Conversation

@WanZzzzzz
Contributor

Description

Fix two bugs in MXFP8 training.

(1) The following error occurred when training resumed after validation, because the column-wise data had been cleared by update_usage() during validation:

0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/tensor/_internal/mxfp8_tensor_base.py", line 192, in update_usage
0: [rank0]:     raise RuntimeError(
0: [rank0]: RuntimeError: Requested column-wise usage, but MXFP8Tensor is missing column-scaled FP8 data
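The failure mode in (1) can be sketched with a toy model of the buffer bookkeeping (illustrative names only, not Transformer Engine's actual classes): once a forward-only pass drops the column-wise buffer to save memory, a later step that requests column-wise usage has nothing to work with.

```python
class ToyMXTensor:
    # Toy stand-in for an FP8 tensor that keeps separate row-wise and
    # column-wise quantized buffers; loosely mirrors the situation in
    # MXFP8TensorBase but is NOT its real implementation.
    def __init__(self):
        self._rowwise_data = object()
        self._columnwise_data = object()

    def update_usage(self, rowwise_usage=True, columnwise_usage=True):
        # Dropping an unneeded buffer frees memory...
        if not columnwise_usage:
            self._columnwise_data = None
        if not rowwise_usage:
            self._rowwise_data = None
        # ...but a later request for that usage cannot be honored,
        # because the quantized data is gone.
        if columnwise_usage and self._columnwise_data is None:
            raise RuntimeError(
                "Requested column-wise usage, but tensor is missing "
                "column-scaled FP8 data"
            )

w = ToyMXTensor()
w.update_usage(columnwise_usage=False)   # validation: forward pass only
try:
    w.update_usage(columnwise_usage=True)  # training resumes after validation
except RuntimeError as exc:
    print(exc)  # mirrors the RuntimeError in the traceback above
```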

(2) The following error occurred after training completed, while copying a tensor to the CPU:

1: [rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/tensor/mxfp8_tensor.py", line 394, in _set_data
1: [rank1]:     new_device = tensor.device if tensor.is_cuda else self.device
1: [rank1]:                                   ^^^^^^^^^^^^^^
1: [rank1]: AttributeError: 'MXFP8TensorBase' object has no attribute 'is_cuda'
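Bug (2) is a plain missing-attribute problem: MXFP8TensorBase exposes a device attribute but not torch.Tensor's is_cuda convenience property. A minimal sketch of the failure and one possible fix, using a fake object instead of the real classes (all names here are illustrative):

```python
from types import SimpleNamespace

class FakeBaseTensor:
    """Mimics a tensor base class that has .device but not .is_cuda."""
    def __init__(self, device_type):
        self.device = SimpleNamespace(type=device_type)

def pick_device_buggy(tensor, fallback):
    # Mirrors the failing line: relies on .is_cuda, which the base
    # class does not define, so this raises AttributeError.
    return tensor.device if tensor.is_cuda else fallback

def pick_device_fixed(tensor, fallback):
    # One possible fix: derive the CUDA check from device.type, which
    # both torch.Tensor and the fake base class expose.
    return tensor.device if tensor.device.type == "cuda" else fallback

t = FakeBaseTensor("cpu")
try:
    pick_device_buggy(t, "fallback-device")
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError, as in the traceback above
print(pick_device_fixed(t, "fallback-device"))  # fallback-device
```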

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Changes introduced in this PR:

  • Fix the RuntimeError raised when training resumes after validation (column-wise MXFP8 data cleared by update_usage() during validation).
  • Fix the AttributeError ('MXFP8TensorBase' object has no attribute 'is_cuda') raised when copying a tensor to the CPU after training.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Comment thread on transformer_engine/pytorch/tensor/mxfp8_tensor.py (outdated)
Comment on lines 1245 to 1248
if update_workspace and quantizer is not None and quantizer.columnwise_usage:
    tensor.update_usage(
        rowwise_usage=quantizer.rowwise_usage,
        columnwise_usage=quantizer.columnwise_usage,
Collaborator

@timmoon10 May 29, 2025


This change makes the update_usage logic redundant, so we might as well remove it:

Suggested change
if update_workspace and quantizer is not None and quantizer.columnwise_usage:
    tensor.update_usage(
        rowwise_usage=quantizer.rowwise_usage,
        columnwise_usage=quantizer.columnwise_usage,

That leads us to a more fundamental challenge: inference and validation have different data requirements, and we don't really expose a good way to distinguish between them. If you're doing MXFP8 inference, you don't want MXFP8 column-wise data since that's wasted memory. If you're loading a BF16 checkpoint, doing MXFP8 validation, and then doing MXFP8 training, you need to cast to MXFP8 row-wise and column-wise data when loading the checkpoint. Removing this update_usage logic has the effect of "however you've initialized the weights, you're stuck with those buffers forever". This is unfortunate for inference because the default behavior is to initialize weights for training, i.e. with both row-wise and column-wise data (#1827 does add an option to initialize for inference).

I think removing this logic does make sense eventually, especially after Mcore adds support for #1827, but for now I'm wary of breaking users doing inference. For now, is it possible to fix the bug by configuring distopt to avoid overlapping the param AG with validation?
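The memory argument above can be made concrete with a back-of-envelope sketch. This assumes the MX convention of 32-element blocks with a 1-byte scale per block, and ignores padding and alignment, so the numbers are rough:

```python
# Rough MXFP8 footprint: one FP8 byte per element plus one scale byte
# per 32-element block.  Row-wise and column-wise copies are separate
# buffers, so keeping both roughly doubles the quantized footprint.
def mxfp8_bytes(m, n):
    data = m * n            # 1 byte per FP8 element
    scales = (m * n) // 32  # 1 scale byte per 32-element block
    return data + scales

m, n = 4096, 4096
rowwise_only = mxfp8_bytes(m, n)       # inference: row-wise data suffices
both_usages = 2 * mxfp8_bytes(m, n)    # training: row-wise + column-wise
print(rowwise_only, both_usages)
```

This is why initializing weights for training (both usages) wastes memory in a pure-inference setting, which is the motivation for the inference-initialization option mentioned in #1827.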

Contributor Author


Thanks! Why do you think this bug has to do with overlapping the param AG with validation? I thought that as long as we resume training after validation, this error would occur.

Collaborator


Ah, you're right. I'll incorporate this logic in #1827.

@timmoon10
Collaborator

I've included these fixes in #1827.

@timmoon10 closed this May 29, 2025