
Conversation

@sudhu2k sudhu2k commented Sep 4, 2025

Description

When keep_fp8_weight_transpose_cache is False, the expectation is that no transpose is created in the forward pass; the transpose is computed in the backward pass instead. ad76b62#diff-ba97b0d1ae75d17a678bc38b4fa69ffec1e0ea007657a28d65565ee2cff35b95
The commit above introduced a check that creates the transpose whenever the input's requires_grad is True, but the transpose should not be created when keep_fp8_weight_transpose_cache is False. Without this guard, the code updates the transpose while its data pointer is uninitialized, which fails with:

RuntimeError: /workspace/TransformerEngine/transformer_engine/common/transpose/transpose.hip:206 in function transpose: Assertion failed: output.data.dptr != nullptr. Output is not allocated.

Fixes #13552
https://ontrack-internal.amd.com/browse/SWDEV-553639

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Added keep_fp8_weight_transpose_cache checks while updating transpose
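A minimal sketch of the guarded update, assuming a hypothetical helper update_transpose_cache() in place of the exact call sites in linear.py:

    # Sketch only: update_transpose_cache() is a hypothetical stand-in for
    # the real transpose-update call in the forward pass.
    if weight.requires_grad and keep_fp8_weight_transpose_cache:
        # Safe to update: the cached transpose buffer is kept alive
        # across iterations.
        update_transpose_cache(weight_fp8)
    # With keep_fp8_weight_transpose_cache=False, the backward pass
    # recomputes the transpose itself, so the forward pass must not touch
    # the (already freed) cached buffer.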

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@sudhu2k sudhu2k marked this pull request as ready for review September 4, 2025 15:06

sudhu2k commented Sep 4, 2025

Hi @ipanfilo, @wangye805, @wenchenvincent
I've added a unit test for the fix and extended it to cover LayerNormLinear and LayerNormMLP along with Linear.
The issue happened because the backward pass clears the transpose when keep_fp8_weight_transpose_cache is False, and in the next forward pass the code tried to create the transpose in that already-cleared memory:
https://github.com/ROCm/TransformerEngine/blob/66340ddbd8917ddf5f471de61301fc979027179b/transformer_engine/pytorch/module/linear.py#L563C40-L564
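For context, a hedged sketch of the failing sequence; clear_tensor_data and the _transpose attribute are used here illustratively, not as exact quotes of the linked code:

    # Iteration 1, backward pass: the cache is not kept, so the cached
    # transpose buffer is freed (see the linked lines in linear.py).
    if not keep_fp8_weight_transpose_cache:
        clear_tensor_data(weight_fp8._transpose)  # data pointer now null
    # Iteration 2, forward pass: before this fix, requires_grad alone
    # triggered an update of the now-freed transpose buffer, hitting:
    #   Assertion failed: output.data.dptr != nullptr. Output is not allocated.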

@sudhu2k sudhu2k self-assigned this Sep 5, 2025
@wangye805 wangye805 left a comment


Please also add a section of documentation for keep_fp8_weight_transpose_cache, either in the source code before its definition or in our README.

@sudhu2k sudhu2k force-pushed the 13552-fix-fp8-transpose-cache branch from 34cc7fb to 38956fd September 8, 2025 19:19
@sudhu2k sudhu2k requested a review from wangye805 September 8, 2025 19:26
@sudhu2k sudhu2k force-pushed the 13552-fix-fp8-transpose-cache branch from 6a0e621 to 5a35e71 September 8, 2025 22:40
@sudhu2k sudhu2k requested a review from wangye805 September 8, 2025 22:41
@sudhu2k sudhu2k requested a review from ipanfilo September 9, 2025 07:09
Micky774 commented Sep 9, 2025

Several meaningful test failures on CI, all stemming from

        if keep_fp8_weight_transpose_cache:
>           assert not transpose_is_empty_or_none, "Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled."
E           AssertionError: Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled.

# This file was modified for portability to AMDGPU
# Copyright (c) 2024-2025, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
A collaborator left a comment

Keep it

ipanfilo commented Sep 9, 2025

Several meaningful test failures on CI, all stemming from

        if keep_fp8_weight_transpose_cache:
>           assert not transpose_is_empty_or_none, "Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled."
E           AssertionError: Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled.

It might be OK to assert that _transpose is not valid when the keep flag is False, but there may be other reasons for _transpose to be invalid when the flag is True (the default behaviour).
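A hedged sketch of the one-directional check being suggested; transpose_is_empty_or_none is the name from the failing test above, and the surrounding code is illustrative:

    if not keep_fp8_weight_transpose_cache:
        # With the cache disabled, the forward pass must not leave a
        # transpose behind, so this direction is always safe to assert.
        assert transpose_is_empty_or_none
    # No assert for the True (default) case: the transpose may be
    # legitimately absent there, e.g. before the cache is first filled.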

@sudhu2k sudhu2k requested a review from ipanfilo September 9, 2025 19:06
fc2_out = torch.empty(dim_size, dtype=activation_dtype, device=device)

# FC2 GEMM

A collaborator left a comment

nit: remove this to reduce unnecessary changes

@sudhu2k sudhu2k merged commit 9b912e9 into dev Sep 10, 2025
sudhu2k added a commit that referenced this pull request Nov 5, 2025
Added keep_fp8_weight_transpose_cache checks while updating transpose in fwd pass (#298)

* Added keep_fp8_weight_transpose_cache checks while updating transpose

* Added unittest for the fix

* Added comment for the unit test

* Fixed comment

* Reverted test for single iteration, added assert statements to check for transpose cache, Modified docstring

* Fixed test_numerics spacing

* Added HIP Guards

* Addressed PR Comments, and moved assertion statements under fp8 check

* Reverting assertion to fix the dev ticket

* Removed spacing

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>
@sudhu2k sudhu2k mentioned this pull request Nov 5, 2025
sudhu2k added a commit that referenced this pull request Nov 10, 2025
* Ensure weight transpose is valid for FP8 training (#1596) (#276)

* Update usage of weightmat before saving for backward

* Added keep_fp8_weight_transpose_cache checks while updating transpose in fwd pass (#298)

* Added keep_fp8_weight_transpose_cache checks while updating transpose

* Added unittest for the fix

* Added comment for the unit test

* Fixed comment

* Reverted test for single iteration, added assert statements to check for transpose cache, Modified docstring

* Fixed test_numerics spacing

* Added HIP Guards

* Addressed PR Comments, and moved assertion statements under fp8 check

* Reverting assertion to fix the dev ticket

* Removed spacing

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>

* Bug fix for get_fp8_metas

* Added keep_fp8_transpose_cache fix for base.py

* added _fp8_metas check for None

* Added comment

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>
