Conversation

@matthiasdiener (Contributor) commented Nov 12, 2025

Description

Implements a two-stage HIP kernel for the amax operation as an alternative to the original implementation, which uses atomic reductions. The two-stage kernel becomes the default implementation; users can set export NVTE_USE_ATOMIC_AMAX=1 to fall back to the atomic amax kernel.

Fixes https://github.com/ROCm/frameworks-internal/issues/14303.

See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.
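
For background, a minimal two-stage amax reduction in HIP looks roughly like the sketch below (illustrative only; kernel names, launch configuration, and dtype handling differ from the PR's templated kernel): stage 1 writes one partial maximum per block into a workspace, and stage 2 reduces those partials into the final amax, avoiding atomic updates to a single location.

#include <hip/hip_runtime.h>

constexpr int kThreads = 256;  // the PR's kernels use amax_kernel_threads = 512

// Stage 1: each block reduces its grid-stride chunk to one partial maximum.
__global__ void amax_stage1(const float *input, float *block_amax, size_t n) {
  __shared__ float smem[kThreads];
  float m = 0.f;
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x)
    m = fmaxf(m, fabsf(input[i]));
  smem[threadIdx.x] = m;
  __syncthreads();
  for (int s = kThreads / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) block_amax[blockIdx.x] = smem[0];
}

// Stage 2: a single block reduces the per-block partials into the final amax.
__global__ void amax_stage2(const float *block_amax, float *amax, int num_blocks) {
  __shared__ float smem[kThreads];
  float m = 0.f;
  for (int i = threadIdx.x; i < num_blocks; i += kThreads) m = fmaxf(m, block_amax[i]);
  smem[threadIdx.x] = m;
  __syncthreads();
  for (int s = kThreads / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) *amax = smem[0];
}

Launched as amax_stage1<<<num_blocks, kThreads, 0, stream>>>(...) followed by amax_stage2<<<1, kThreads, 0, stream>>>(...). The atomic variant instead has every block update a single global amax with an atomic max, which is what NVTE_USE_ATOMIC_AMAX=1 selects.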

TODO:

  • Fix other call sites of nvte_compute_amax
  • Address FIXMEs in the code

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Nov 14, 2025
@matthiasdiener matthiasdiener marked this pull request as ready for review November 24, 2025 16:47
auto [te_output_act, out_act] =
    my_quantizer_none->create_tensor(input_shape, GetTransformerEngineDType(fake_tensor_type));

// Workspace for nvte_compute_amax_with_workspace
Collaborator:

Can it be encapsulated inside nvte_compute_amax()? Moreover, if the atomic path is selected, there is no need to allocate the workspace.

Contributor Author:

> Can it be encapsulated inside nvte_compute_amax()?

Unfortunately, I did not manage to encapsulate this inside nvte_compute_amax. The main issue is that the workspace memory would need to be allocated/deallocated inside that function, which appears to be fragile under CUDA graph capture, leading to random crashes when capture is enabled.

> Moreover, if the atomic path is selected, there is no need to allocate the workspace.

Good catch, thanks. 16d3bf9 should address this.
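
For reference, a sketch of the caller-side pattern this leads to (illustrative; te_ws and makeTransformerEngineTensor stand in for the extension's actual variables/helpers, and the real call sites are in the PR):

// The workspace is created by the caller with the PyTorch allocator, so
// nvte_compute_amax_with_workspace itself never allocates or frees memory
// and stays safe under CUDA graph capture.
size_t max_blocks = std::min(DIVUP(N, static_cast<size_t>(amax_kernel_threads)), max_blocks_hw);
auto ws = at::empty({static_cast<long>(max_blocks)}, at::CUDA(at::kFloat));
auto te_ws = makeTransformerEngineTensor(ws);  // assumed extension helper

NVTE_SCOPED_GIL_RELEASE({
  nvte_compute_amax_with_workspace(te_input.data(), te_output.data(), te_ws.data(),
                                   at::cuda::getCurrentCUDAStream());
});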

Collaborator @ipanfilo commented Nov 25, 2025:

> Unfortunately, I did not manage to encapsulate this inside nvte_compute_amax. The main issue is that the workspace memory would need to be allocated/deallocated inside that function, which appears to be fragile under CUDA graph capture, leading to random crashes when capture is enabled.

Is it because the memory allocation/freeing should happen outside of NVTE_SCOPED_GIL_RELEASE()? If so, the repeated code can still be encapsulated in a separate function inside the PyTorch extension.
The common TE code, on the other hand, should not have to worry about the environment variable and should choose the code path based on workspace presence. It actually does not need use_block_amax either, because block_amax itself may or may not be nullptr.
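
A minimal sketch of the dispatch being suggested (hypothetical names; only the nullptr/capacity check comes from the PR): the common TE code never reads the environment variable and keys purely off whether a workspace was provided.

#include <cstddef>
#include <cuda_runtime.h>  // stream type as spelled in the TE headers

// Hypothetical launchers standing in for the PR's real kernel launches.
void launch_two_stage_amax(const float *input, float *amax, float *block_amax,
                           std::size_t n, cudaStream_t stream);
void launch_atomic_amax(const float *input, float *amax, std::size_t n, cudaStream_t stream);

void compute_amax_dispatch(const float *input, float *amax, float *block_amax,
                           std::size_t block_capacity, std::size_t num_blocks,
                           std::size_t n, cudaStream_t stream) {
  if (block_amax != nullptr && block_capacity >= num_blocks) {
    launch_two_stage_amax(input, amax, block_amax, n, stream);  // workspace provided
  } else {
    launch_atomic_amax(input, amax, n, stream);  // fall back to the atomic kernel
  }
}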

Contributor Author:

I think f933ef3 covers what we discussed today? cc @wangye805 @Micky774

ef532b1 removes the use_block_amax parameter.

*/
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

void nvte_compute_amax_with_workspace(const NVTETensor input_, const NVTETensor output_, const NVTETensor workspace_, cudaStream_t stream);
Collaborator:

output and workspace should be writable (drop the const)?

Contributor Author @matthiasdiener commented Nov 25, 2025:

Upstream also uses const in the actual function implementation for the output (as well as naming the arguments input_ etc.):

https://github.com/NVIDIA/TransformerEngine/blob/b3c25057405fc35d10be8109b635696e341ccf86/transformer_engine/common/recipe/current_scaling.cu#L180

I think the const only applies to the NVTETensor handle itself, not to the data it points to (which is what gets modified). I'm not sure which way is better.
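
A small illustration of that point, assuming NVTETensor is an opaque pointer typedef as in TE's C API (the exact definition lives in the TE headers):

typedef void *NVTETensor;  // opaque handle (assumed definition)

void takes_const_handle(const NVTETensor output) {
  // `const NVTETensor` is `void *const`: the handle cannot be reseated here,
  // but the tensor data it refers to can still be modified by the implementation.
  (void)output;
}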

Contributor Author:

const dropped in c7d44a7

*/
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

void nvte_compute_amax_with_workspace(const NVTETensor input_, const NVTETensor output_, const NVTETensor workspace_, cudaStream_t stream);
Collaborator:

By the way, let's align our naming style with NV upstream:
input_ --> input,
output_ --> output,
workspace_ --> workspace

Contributor Author:

See above regarding the naming. I'm happy to change it, but I'm not sure what the best way is; the naming seems somewhat inconsistent either way.

Contributor Author:

Naming aligned in c7d44a7

#endif //__HIP_PLATFORM_AMD__

constexpr int amax_kernel_threads = 512;

Collaborator:

Let's guard our ROCm-specific code changes with the __HIP_PLATFORM_AMD__ macro.

Contributor Author:

Done in c7d44a7

  template <int nvec, bool aligned, typename InputType>
  __launch_bounds__(amax_kernel_threads) __global__
- void amax_kernel(const InputType *input, float *amax, const size_t N,
+ void amax_kernel(const InputType *input, float *amax, float* __restrict__ block_amax, const size_t N,
Collaborator:

Guard the API change so the NV upstream flow remains unchanged.

Contributor Author:

Done in c7d44a7
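
The resulting pattern looks roughly like this (simplified signature, not the PR's exact one; the real change is in c7d44a7): the extra block_amax parameter only exists in the ROCm build, so the upstream declaration is untouched elsewhere.

template <int nvec, bool aligned, typename InputType>
__launch_bounds__(amax_kernel_threads) __global__
void amax_kernel(const InputType *input, float *amax,
#ifdef __HIP_PLATFORM_AMD__
                 float *__restrict__ block_amax,  // ROCm-only two-stage workspace
#endif
                 const size_t N /* , remaining upstream parameters elided */);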

size_t max_blocks = std::min(DIVUP(N, static_cast<size_t>(amax_kernel_threads)), max_blocks_hw);

// Allocate FP32 workspace for block-wise amax
auto ws = at::empty({static_cast<long>(max_blocks)}, at::CUDA(at::kFloat));
Collaborator:

Do we need to cast max_blocks to long, given that the maximum number of blocks is 65535?

Contributor Author:

Removed the cast in 8eda427.


constexpr int amax_kernel_threads = 512;

inline bool nvte_use_atomic_amax() {
Collaborator:

Do we need to cache the env evaluation? These host-side operations are usually cheap and run ahead of the GPU kernels in end-to-end training.

Collaborator:

This method is not needed here; it can be moved to the PyTorch extension.

Contributor Author:

Removed the caching and moved the check to the PyTorch extension in eba552e. I did not notice a performance difference.
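
For reference, an uncached version of that check can be as simple as the following sketch (the exact parsing in eba552e may differ; NVTE_USE_ATOMIC_AMAX=1 selects the original atomic kernel):

#include <cstdlib>
#include <cstring>

inline bool nvte_use_atomic_amax() {
  // Re-read the environment on every call instead of caching the result.
  const char *env = std::getenv("NVTE_USE_ATOMIC_AMAX");
  return env != nullptr && std::strcmp(env, "1") == 0;
}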

* \param[in] stream CUDA stream used for the operation.
*/
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

Collaborator:

Let's add a brief doc comment, just like the one for nvte_compute_amax above.
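
Something along these lines, mirroring the nvte_compute_amax doc above (a sketch of the requested comment; the actual wording landed in c7d44a7):

/*! \brief Compute the absolute maximum of the input tensor using a
 *         caller-provided workspace for per-block partial maxima.
 *
 *  \param[in]     input      Input tensor.
 *  \param[in,out] output     Output tensor whose amax value is updated.
 *  \param[in]     workspace  FP32 workspace with one element per block.
 *  \param[in]     stream     CUDA stream used for the operation.
 */
void nvte_compute_amax_with_workspace(const NVTETensor input, NVTETensor output,
                                      NVTETensor workspace, cudaStream_t stream);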

Contributor Author:

Done in c7d44a7

// use te_output_act as input to the compute amax and find the amax of activated tensor
nvte_compute_amax(te_output_act.data(), te_output.data(), at::cuda::getCurrentCUDAStream());
});
if (nvte_use_atomic_amax()) {
Collaborator:

Guard our ROCm-specific behavior.

Contributor Author:

Done in c7d44a7

nvte_compute_amax(input_tensor.data(), out_tensor.data(), at::cuda::getCurrentCUDAStream());
});

if (nvte_use_atomic_amax()) {
Collaborator:

Guard the ROCm-specific code changes.

Contributor Author:

Done in c7d44a7

NVTE_SCOPED_GIL_RELEASE({
nvte_compute_amax(te_input.data(), te_output.data(), at::cuda::getCurrentCUDAStream());
});

Collaborator:

Same here, needs guarding

Contributor Author:

Done in c7d44a7


constexpr int amax_kernel_threads = 512;

inline bool nvte_use_atomic_amax() {
Collaborator:

This method is not needed here; it can be moved to the PyTorch extension.

const bool UseBlockAmax =
(block_amax != nullptr) &&
(block_capacity >= num_blocks) &&
!nvte_use_atomic_amax();
Collaborator:

block_amax is expected to be nullptr if nvte_use_atomic_amax() is true, so this check is redundant.

Contributor Author:

Changed the logic in eba552e.

@matthiasdiener (Contributor Author) commented:

See #384 for the GH Actions CI.
