Conversation

@matthiasdiener (Contributor) commented Nov 12, 2025

Description

Implements a two-stage HIP kernel for the amax operation as an alternative to the original implementation, which uses atomic reductions. The two-stage kernel becomes the default implementation; users can set export NVTE_USE_ATOMIC_AMAX=1 to fall back to the atomic amax kernel.

Fixes https://github.com/ROCm/frameworks-internal/issues/14303.

See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.
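
For background, a minimal two-stage amax reduction in HIP looks roughly like the sketch below (illustrative only; kernel names, launch configuration, and dtype handling differ from the PR's templated kernel): stage 1 writes one partial maximum per block into a workspace, and stage 2 reduces those partials into the final amax, avoiding atomic updates to a single location.

#include <hip/hip_runtime.h>

constexpr int kThreads = 256;  // the PR's kernels use amax_kernel_threads = 512

// Stage 1: each block reduces its grid-stride chunk to one partial maximum.
__global__ void amax_stage1(const float *input, float *block_amax, size_t n) {
  __shared__ float smem[kThreads];
  float m = 0.f;
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x)
    m = fmaxf(m, fabsf(input[i]));
  smem[threadIdx.x] = m;
  __syncthreads();
  for (int s = kThreads / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) block_amax[blockIdx.x] = smem[0];
}

// Stage 2: a single block reduces the per-block partials into the final amax.
__global__ void amax_stage2(const float *block_amax, float *amax, int num_blocks) {
  __shared__ float smem[kThreads];
  float m = 0.f;
  for (int i = threadIdx.x; i < num_blocks; i += kThreads) m = fmaxf(m, block_amax[i]);
  smem[threadIdx.x] = m;
  __syncthreads();
  for (int s = kThreads / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) *amax = smem[0];
}

Launched as amax_stage1<<<num_blocks, kThreads, 0, stream>>>(...) followed by amax_stage2<<<1, kThreads, 0, stream>>>(...). The atomic variant instead has every block update a single global amax with an atomic max, which is what NVTE_USE_ATOMIC_AMAX=1 selects.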

TODO:

  • Fix other call sites of nvte_compute_amax
  • Address FIXMEs in the code

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Nov 14, 2025
@matthiasdiener matthiasdiener marked this pull request as ready for review November 24, 2025 16:47
auto [te_output_act, out_act] =
    my_quantizer_none->create_tensor(input_shape, GetTransformerEngineDType(fake_tensor_type));

// Workspace for nvte_compute_amax_with_workspace
Collaborator:

Can it be encapsulated inside nvte_compute_amax()? Moreover, if the atomic path is selected, there is no need to allocate the workspace.

Contributor Author:

> Can it be encapsulated inside nvte_compute_amax()?

Unfortunately, I did not manage to encapsulate this inside nvte_compute_amax. The main issue is that the workspace memory would need to be allocated/deallocated inside that function, which appears to be fragile under CUDA graph capture, leading to random crashes when capture is enabled.

> Moreover, if the atomic path is selected, there is no need to allocate the workspace.

Good catch, thanks. 16d3bf9 should address this.
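
For reference, a sketch of the caller-side pattern this leads to (illustrative; te_ws and makeTransformerEngineTensor stand in for the extension's actual variables/helpers, and the real call sites are in the PR):

// The workspace is created by the caller with the PyTorch allocator, so
// nvte_compute_amax_with_workspace itself never allocates or frees memory
// and stays safe under CUDA graph capture.
size_t max_blocks = std::min(DIVUP(N, static_cast<size_t>(amax_kernel_threads)), max_blocks_hw);
auto ws = at::empty({static_cast<long>(max_blocks)}, at::CUDA(at::kFloat));
auto te_ws = makeTransformerEngineTensor(ws);  // assumed extension helper

NVTE_SCOPED_GIL_RELEASE({
  nvte_compute_amax_with_workspace(te_input.data(), te_output.data(), te_ws.data(),
                                   at::cuda::getCurrentCUDAStream());
});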

Collaborator @ipanfilo commented Nov 25, 2025:

> Unfortunately, I did not manage to encapsulate this inside nvte_compute_amax. The main issue is that the workspace memory would need to be allocated/deallocated inside that function, which appears to be fragile under CUDA graph capture, leading to random crashes when capture is enabled.

Is it because the memory allocation/freeing should happen outside of NVTE_SCOPED_GIL_RELEASE()? If so, the repeated code can still be encapsulated in a separate function inside the PyTorch extension.
The common TE code, on the other hand, should not have to worry about the environment variable and should choose the code path based on workspace presence. It actually does not need use_block_amax either, because block_amax itself may or may not be nullptr.
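
A minimal sketch of the dispatch being suggested (hypothetical names; only the nullptr/capacity check comes from the PR): the common TE code never reads the environment variable and keys purely off whether a workspace was provided.

#include <cstddef>
#include <cuda_runtime.h>  // stream type as spelled in the TE headers

// Hypothetical launchers standing in for the PR's real kernel launches.
void launch_two_stage_amax(const float *input, float *amax, float *block_amax,
                           std::size_t n, cudaStream_t stream);
void launch_atomic_amax(const float *input, float *amax, std::size_t n, cudaStream_t stream);

void compute_amax_dispatch(const float *input, float *amax, float *block_amax,
                           std::size_t block_capacity, std::size_t num_blocks,
                           std::size_t n, cudaStream_t stream) {
  if (block_amax != nullptr && block_capacity >= num_blocks) {
    launch_two_stage_amax(input, amax, block_amax, n, stream);  // workspace provided
  } else {
    launch_atomic_amax(input, amax, n, stream);  // fall back to the atomic kernel
  }
}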

Contributor Author:

I think f933ef3 covers what we discussed today? cc @wangye805 @Micky774

ef532b1 removes the use_block_amax parameter.

*/
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

void nvte_compute_amax_with_workspace(const NVTETensor input_, const NVTETensor output_, const NVTETensor workspace_, cudaStream_t stream);
Collaborator:

output and workspace should be writable (drop the const)?

Contributor Author @matthiasdiener commented Nov 25, 2025:

Upstream also uses const in the actual function implementation for the output (as well as naming the arguments input_ etc.):

https://github.com/NVIDIA/TransformerEngine/blob/b3c25057405fc35d10be8109b635696e341ccf86/transformer_engine/common/recipe/current_scaling.cu#L180

I think the const only applies to the NVTETensor handle itself, not to the data it points to (which is what gets modified). I'm not sure which way is better.
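
A small illustration of that point, assuming NVTETensor is an opaque pointer typedef as in TE's C API (the exact definition lives in the TE headers):

typedef void *NVTETensor;  // opaque handle (assumed definition)

void takes_const_handle(const NVTETensor output) {
  // `const NVTETensor` is `void *const`: the handle cannot be reseated here,
  // but the tensor data it refers to can still be modified by the implementation.
  (void)output;
}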

Contributor Author:

const dropped in c7d44a7

*/
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

void nvte_compute_amax_with_workspace(const NVTETensor input_, const NVTETensor output_, const NVTETensor workspace_, cudaStream_t stream);
Collaborator:

By the way, let's align our naming style with NV upstream:
input_ --> input,
output_ --> output,
workspace_ --> workspace

Contributor Author:

See above regarding the naming. I'm happy to change it, but I'm not sure what the best way is; the naming seems somewhat inconsistent either way.

Contributor Author:

Naming aligned in c7d44a7

#endif //__HIP_PLATFORM_AMD__

constexpr int amax_kernel_threads = 512;

Collaborator:

Let's guard our ROCm-specific code changes with the __HIP_PLATFORM_AMD__ macro.

Contributor Author:

Done in c7d44a7

  template <int nvec, bool aligned, typename InputType>
  __launch_bounds__(amax_kernel_threads) __global__
- void amax_kernel(const InputType *input, float *amax, const size_t N,
+ void amax_kernel(const InputType *input, float *amax, float* __restrict__ block_amax, const size_t N,
Collaborator:

Guard the API change so the NV upstream flow remains unchanged.

Contributor Author:

Done in c7d44a7
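
The resulting pattern looks roughly like this (simplified signature, not the PR's exact one; the real change is in c7d44a7): the extra block_amax parameter only exists in the ROCm build, so the upstream declaration is untouched elsewhere.

template <int nvec, bool aligned, typename InputType>
__launch_bounds__(amax_kernel_threads) __global__
void amax_kernel(const InputType *input, float *amax,
#ifdef __HIP_PLATFORM_AMD__
                 float *__restrict__ block_amax,  // ROCm-only two-stage workspace
#endif
                 const size_t N /* , remaining upstream parameters elided */);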

size_t max_blocks = std::min(DIVUP(N, static_cast<size_t>(amax_kernel_threads)), max_blocks_hw);

// Allocate FP32 workspace for block-wise amax
auto ws = at::empty({static_cast<long>(max_blocks)}, at::CUDA(at::kFloat));
Collaborator:

Do we need to cast max_blocks to long, given that the maximum number of blocks is 65535?

Contributor Author:

Removed the cast in 8eda427.


constexpr int amax_kernel_threads = 512;

inline bool nvte_use_atomic_amax() {
Collaborator:

Do we need to cache the env evaluation? These host-side operations are usually cheap and run ahead of the GPU kernels in end-to-end training.

Collaborator:

This method is not needed here; it can be moved to the PyTorch extension.

Contributor Author:

Removed the caching and moved the check to the PyTorch extension in eba552e. I did not notice a performance difference.
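
For reference, an uncached version of that check can be as simple as the following sketch (the exact parsing in eba552e may differ; NVTE_USE_ATOMIC_AMAX=1 selects the original atomic kernel):

#include <cstdlib>
#include <cstring>

inline bool nvte_use_atomic_amax() {
  // Re-read the environment on every call instead of caching the result.
  const char *env = std::getenv("NVTE_USE_ATOMIC_AMAX");
  return env != nullptr && std::strcmp(env, "1") == 0;
}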

* \param[in] stream CUDA stream used for the operation.
*/
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

Collaborator:

Let's add a brief doc comment, just like the one for nvte_compute_amax above.
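
Something along these lines, mirroring the nvte_compute_amax doc above (a sketch of the requested comment; the actual wording landed in c7d44a7):

/*! \brief Compute the absolute maximum of the input tensor using a
 *         caller-provided workspace for per-block partial maxima.
 *
 *  \param[in]     input      Input tensor.
 *  \param[in,out] output     Output tensor whose amax value is updated.
 *  \param[in]     workspace  FP32 workspace with one element per block.
 *  \param[in]     stream     CUDA stream used for the operation.
 */
void nvte_compute_amax_with_workspace(const NVTETensor input, NVTETensor output,
                                      NVTETensor workspace, cudaStream_t stream);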

Contributor Author:

Done in c7d44a7

// use te_output_act as input to the compute amax and find the amax of activated tensor
nvte_compute_amax(te_output_act.data(), te_output.data(), at::cuda::getCurrentCUDAStream());
});
if (nvte_use_atomic_amax()) {
Collaborator:

Guard our ROCm-specific behavior.

Contributor Author:

Done in c7d44a7

nvte_compute_amax(input_tensor.data(), out_tensor.data(), at::cuda::getCurrentCUDAStream());
});

if (nvte_use_atomic_amax()) {
Collaborator:

Guard the ROCm-specific code changes.

Contributor Author:

Done in c7d44a7

NVTE_SCOPED_GIL_RELEASE({
nvte_compute_amax(te_input.data(), te_output.data(), at::cuda::getCurrentCUDAStream());
});

Collaborator:

Same here, needs guarding

Contributor Author:

Done in c7d44a7


constexpr int amax_kernel_threads = 512;

inline bool nvte_use_atomic_amax() {
Collaborator:

This method is not needed here; it can be moved to the PyTorch extension.

const bool UseBlockAmax =
(block_amax != nullptr) &&
(block_capacity >= num_blocks) &&
!nvte_use_atomic_amax();
Collaborator:

block_amax is expected to be nullptr if nvte_use_atomic_amax() is true, so this check is redundant.

Contributor Author:

Changed the logic in eba552e.

@matthiasdiener (Contributor Author) commented:

See #384 for the GH Actions CI.
