
Conversation

@sudhu2k sudhu2k commented Sep 4, 2025

Description

When keep_fp8_weight_transpose_cache is False, the expectation is that no transpose is created in the forward pass; the transpose is computed in the backward pass instead. ad76b62#diff-ba97b0d1ae75d17a678bc38b4fa69ffec1e0ea007657a28d65565ee2cff35b95
The commit above introduced a check that creates the transpose whenever the input's requires_grad is True, but the transpose should not be created when keep_fp8_weight_transpose_cache is False. Without this guard, the code updates the transpose while its data pointer is uninitialized, which fails with:

RuntimeError: /workspace/TransformerEngine/transformer_engine/common/transpose/transpose.hip:206 in function transpose: Assertion failed: output.data.dptr != nullptr. Output is not allocated.

Fixes #13552
https://ontrack-internal.amd.com/browse/SWDEV-553639

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Added keep_fp8_weight_transpose_cache checks while updating transpose
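A minimal sketch of the guarded update, assuming a hypothetical helper update_transpose_cache() in place of the exact call sites in linear.py:

    # Sketch only: update_transpose_cache() is a hypothetical stand-in for
    # the real transpose-update call in the forward pass.
    if weight.requires_grad and keep_fp8_weight_transpose_cache:
        # Safe to update: the cached transpose buffer is kept alive
        # across iterations.
        update_transpose_cache(weight_fp8)
    # With keep_fp8_weight_transpose_cache=False, the backward pass
    # recomputes the transpose itself, so the forward pass must not touch
    # the (already freed) cached buffer.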

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@sudhu2k sudhu2k marked this pull request as ready for review September 4, 2025 15:06

sudhu2k commented Sep 4, 2025

Hi @ipanfilo, @wangye805, @wenchenvincent
I've added a unit test for the fix and extended it to cover LayerNormLinear and LayerNormMLP along with Linear.
The issue happened because the backward pass clears the transpose when keep_fp8_weight_transpose_cache is False, and in the next forward pass the code tried to create the transpose in that already-cleared memory:
https://github.com/ROCm/TransformerEngine/blob/66340ddbd8917ddf5f471de61301fc979027179b/transformer_engine/pytorch/module/linear.py#L563C40-L564
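For context, a hedged sketch of the failing sequence; clear_tensor_data and the _transpose attribute are used here illustratively, not as exact quotes of the linked code:

    # Iteration 1, backward pass: the cache is not kept, so the cached
    # transpose buffer is freed (see the linked lines in linear.py).
    if not keep_fp8_weight_transpose_cache:
        clear_tensor_data(weight_fp8._transpose)  # data pointer now null
    # Iteration 2, forward pass: before this fix, requires_grad alone
    # triggered an update of the now-freed transpose buffer, hitting:
    #   Assertion failed: output.data.dptr != nullptr. Output is not allocated.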

@sudhu2k sudhu2k self-assigned this Sep 5, 2025
@wangye805 wangye805 left a comment


Please also add a section of documentation for keep_fp8_weight_transpose_cache, either in the source code before its definition or in our README.

@sudhu2k sudhu2k force-pushed the 13552-fix-fp8-transpose-cache branch from 34cc7fb to 38956fd September 8, 2025 19:19
@sudhu2k sudhu2k requested a review from wangye805 September 8, 2025 19:26
@sudhu2k sudhu2k force-pushed the 13552-fix-fp8-transpose-cache branch from 6a0e621 to 5a35e71 September 8, 2025 22:40
@sudhu2k sudhu2k requested a review from wangye805 September 8, 2025 22:41
@sudhu2k sudhu2k requested a review from ipanfilo September 9, 2025 07:09
Micky774 commented Sep 9, 2025

Several meaningful test failures on CI, all stemming from

        if keep_fp8_weight_transpose_cache:
>           assert not transpose_is_empty_or_none, "Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled."
E           AssertionError: Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled.

# This file was modified for portability to AMDGPU
# Copyright (c) 2024-2025, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
A collaborator left a comment

Keep it

ipanfilo commented Sep 9, 2025

Several meaningful test failures on CI, all stemming from

        if keep_fp8_weight_transpose_cache:
>           assert not transpose_is_empty_or_none, "Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled."
E           AssertionError: Expected _transpose to be a valid, non-empty tensor when transpose cache is enabled.

It might be OK to assert that _transpose is not valid when the keep flag is False, but there may be other reasons for _transpose to be invalid when the flag is True (the default behaviour).
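A hedged sketch of the one-directional check being suggested; transpose_is_empty_or_none is the name from the failing test above, and the surrounding code is illustrative:

    if not keep_fp8_weight_transpose_cache:
        # With the cache disabled, the forward pass must not leave a
        # transpose behind, so this direction is always safe to assert.
        assert transpose_is_empty_or_none
    # No assert for the True (default) case: the transpose may be
    # legitimately absent there, e.g. before the cache is first filled.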

@sudhu2k sudhu2k requested a review from ipanfilo September 9, 2025 19:06
fc2_out = torch.empty(dim_size, dtype=activation_dtype, device=device)

# FC2 GEMM

A collaborator left a comment

nit: remove this to reduce unnecessary changes

@sudhu2k sudhu2k merged commit 9b912e9 into dev Sep 10, 2025
sudhu2k added a commit that referenced this pull request Nov 5, 2025
Added keep_fp8_weight_transpose_cache checks while updating transpose in fwd pass (#298)

* Added keep_fp8_weight_transpose_cache checks while updating transpose

* Added unittest for the fix

* Added comment for the unit test

* Fixed comment

* Reverted test for single iteration, added assert statements to check for transpose cache, Modified docstring

* Fixed test_numerics spacing

* Added HIP Guards

* Addressed PR Comments, and moved assertion statements under fp8 check

* Reverting assertion to fix the dev ticket

* Removed spacing

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>
@sudhu2k sudhu2k mentioned this pull request Nov 5, 2025
sudhu2k added a commit that referenced this pull request Nov 10, 2025
* Ensure weight transpose is valid for FP8 training (#1596) (#276)

* Update usage of weightmat before saving for backward

* Added keep_fp8_weight_transpose_cache checks while updating transpose in fwd pass (#298)

* Added keep_fp8_weight_transpose_cache checks while updating transpose

* Added unittest for the fix

* Added comment for the unit test

* Fixed comment

* Reverted test for single iteration, added assert statements to check for transpose cache, Modified docstring

* Fixed test_numerics spacing

* Added HIP Guards

* Addressed PR Comments, and moved assertion statements under fp8 check

* Reverting assertion to fix the dev ticket

* Removed spacing

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>

* Bug fix for get_fp8_metas

* Added keep_fp8_transpose_cache fix for base.py

* added _fp8_metas check for None

* Added comment

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>
