Current scaling: two-stage amax kernel #369
base: dev
Conversation
auto [te_output_act, out_act] =
    my_quantizer_none->create_tensor(input_shape, GetTransformerEngineDType(fake_tensor_type));

// Workspace for nvte_compute_amax_with_workspace
Can it be encapsulated inside nvte_compute_amax()? Moreover, if the atomic path is selected, there is no need to allocate the workspace.
Can it be encapsulated inside nvte_compute_amax()?
Unfortunately, I did not manage to encapsulate this inside nvte_compute_amax. The main issue is that there would be a need to allocate/deallocate memory for the workspace in that function, which appears to be fragile when running with CUDA graph capture, leading to random crashes in that function when capture is enabled.
Moreover, if the atomic path is selected, there is no need to allocate the workspace.
Good catch, thanks. 16d3bf9 should address this.
Unfortunately, I did not manage to encapsulate this inside nvte_compute_amax. The main issue is that there would be a need to allocate/deallocate memory for the workspace in that function, which appears to be fragile when running with CUDA graph capture, leading to random crashes in that function when capture is enabled.
Is it because memory allocation/freeing should happen outside of NVTE_SCOPED_GIL_RELEASE()? If so, the repeated code can still be encapsulated in a separate function inside the PyTorch extension.
The common TE code, on the other hand, should not worry about the environment variable and should choose the code path based on workspace presence. It does not actually need use_block_amax either, because block_amax itself can be nullptr or not.
I think f933ef3 covers what we discussed today? cc @wangye805 @Micky774
ef532b1 removes the use_block_amax parameter.
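A minimal sketch of the kind of helper discussed here, assuming it lives in the PyTorch extension (the function name and defaults are illustrative, not the PR's actual code): the workspace is allocated via at::empty, as in the diff below, outside of NVTE_SCOPED_GIL_RELEASE, and only when the two-stage path will actually run; the common TE code can then pick its path purely from workspace presence.

```cpp
#include <ATen/ATen.h>
#include <algorithm>

// Hypothetical helper in the PyTorch extension (name and defaults are assumptions).
at::Tensor maybe_allocate_amax_workspace(size_t num_elements, bool use_atomic_amax,
                                         size_t threads_per_block = 512,
                                         size_t max_blocks_hw = 65535) {
  if (use_atomic_amax) {
    // Atomic path: no workspace needed; an undefined tensor signals "no workspace".
    return at::Tensor();
  }
  // One FP32 element per launched block for the partial (block-wise) amax values.
  const size_t max_blocks =
      std::min((num_elements + threads_per_block - 1) / threads_per_block, max_blocks_hw);
  return at::empty({static_cast<int64_t>(max_blocks)}, at::CUDA(at::kFloat));
}
```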
 */
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

void nvte_compute_amax_with_workspace(const NVTETensor input_, const NVTETensor output_, const NVTETensor workspace_, cudaStream_t stream);
output and workspace should be writable (drop the const)?
Upstream also uses const in the actual function implementation for the output (as well as naming the arguments input_ etc.).
I think the const only applies to the pointer itself, not to the contents it points to (which is what gets modified). Not sure which way is better.
const dropped in c7d44a7
 */
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);

void nvte_compute_amax_with_workspace(const NVTETensor input_, const NVTETensor output_, const NVTETensor workspace_, cudaStream_t stream);
By the way, let's align our naming style with NV upstream:
input_ --> input,
output_ --> output,
workspace_ --> workspace
See above regarding the naming. I'm happy to change it, but not sure what the best way is. It seems to be somewhat inconsistent either way.
Naming aligned in c7d44a7
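Taken together with the const change above, the declaration presumably ends up as follows (shown only to make the two changes concrete; the actual diff is in c7d44a7):

```cpp
void nvte_compute_amax_with_workspace(const NVTETensor input, NVTETensor output,
                                      NVTETensor workspace, cudaStream_t stream);
```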
#endif  // __HIP_PLATFORM_AMD__

constexpr int amax_kernel_threads = 512;
Let's guard our ROCm-specific code changes with the __HIP_PLATFORM_AMD__ macro.
Done in c7d44a7
template <int nvec, bool aligned, typename InputType>
__launch_bounds__(amax_kernel_threads) __global__
-    void amax_kernel(const InputType *input, float *amax, const size_t N,
+    void amax_kernel(const InputType *input, float *amax, float* __restrict__ block_amax, const size_t N,
Guard the API change so the NV upstream flow remains unchanged.
Done in c7d44a7
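For illustration, the guard pattern could look roughly like the following (declarations only; the remaining kernel parameters are elided and the body is omitted, so this is not the PR's exact code):

```cpp
constexpr int amax_kernel_threads = 512;

#ifdef __HIP_PLATFORM_AMD__
// ROCm build: the two-stage path receives an extra per-block output buffer.
template <int nvec, bool aligned, typename InputType>
__launch_bounds__(amax_kernel_threads) __global__
void amax_kernel(const InputType *input, float *amax,
                 float *__restrict__ block_amax, const size_t N /* , ... */);
#else
// Upstream signature, unchanged.
template <int nvec, bool aligned, typename InputType>
__launch_bounds__(amax_kernel_threads) __global__
void amax_kernel(const InputType *input, float *amax, const size_t N /* , ... */);
#endif
```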
size_t max_blocks = std::min(DIVUP(N, static_cast<size_t>(amax_kernel_threads)), max_blocks_hw);

// Allocate FP32 workspace for block-wise amax
auto ws = at::empty({static_cast<long>(max_blocks)}, at::CUDA(at::kFloat));
Do we need to cast max_blocks to long when the maximum number of blocks is 65535?
Removed the cast in 8eda427.
constexpr int amax_kernel_threads = 512;

inline bool nvte_use_atomic_amax() {
Do we need to cache the environment-variable evaluation? Usually these host-side operations are cheap and run ahead of the GPU kernels in end-to-end training.
This method is not needed here; it can be moved to the PyTorch extension.
Removed the caching and moved the check to the PyTorch extension in eba552e. I did not notice a performance difference.
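A sketch of what the uncached check might look like on the extension side (the helper name is an assumption, and the exact accepted values of the variable may differ):

```cpp
#include <cstdlib>
#include <cstring>

// Re-read NVTE_USE_ATOMIC_AMAX on every call; the host-side lookup is negligible
// compared to the subsequent kernel launch, so no caching is required.
inline bool use_atomic_amax() {
  const char *env = std::getenv("NVTE_USE_ATOMIC_AMAX");
  return env != nullptr && std::strcmp(env, "1") == 0;
}
```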
 * \param[in] stream CUDA stream used for the operation.
 */
void nvte_compute_amax(const NVTETensor input, NVTETensor output, cudaStream_t stream);
Let's add a brief doc comment, just like the one for nvte_compute_amax above.
Done in c7d44a7
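One possible wording, mirroring the doc block of nvte_compute_amax above (the parameter descriptions are assumptions, not necessarily the PR's final text):

```cpp
/*! \brief Compute the absolute maximum of the input using a two-stage reduction.
 *
 *  Stage one writes one partial amax per thread block into the workspace; stage two
 *  reduces the workspace into the output's amax, avoiding global atomics.
 *
 *  \param[in]     input      Input tensor.
 *  \param[in,out] output     Quantized tensor whose amax is updated.
 *  \param[in]     workspace  FP32 tensor with one element per launched block.
 *  \param[in]     stream     CUDA stream used for the operation.
 */
void nvte_compute_amax_with_workspace(const NVTETensor input, NVTETensor output,
                                      NVTETensor workspace, cudaStream_t stream);
```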
// use te_output_act as input to the compute amax and find the amax of activated tensor
nvte_compute_amax(te_output_act.data(), te_output.data(), at::cuda::getCurrentCUDAStream());
});
if (nvte_use_atomic_amax()) {
Guard our ROCm-specific behavior.
Done in c7d44a7
nvte_compute_amax(input_tensor.data(), out_tensor.data(), at::cuda::getCurrentCUDAStream());
});

if (nvte_use_atomic_amax()) {
Guard the ROCm-specific code changes.
Done in c7d44a7
NVTE_SCOPED_GIL_RELEASE({
  nvte_compute_amax(te_input.data(), te_output.data(), at::cuda::getCurrentCUDAStream());
});
Same here, needs guarding
Done in c7d44a7
constexpr int amax_kernel_threads = 512;

inline bool nvte_use_atomic_amax() {
This method is not needed here; it can be moved to the PyTorch extension.
const bool UseBlockAmax =
    (block_amax != nullptr) &&
    (block_capacity >= num_blocks) &&
    !nvte_use_atomic_amax();
block_amax is expected to be nullptr if nvte_use_atomic_amax() is true, so the extra check is redundant.
Changed the logic in eba552e.
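With the environment check moved out to the extension, the kernel-side selection presumably reduces to workspace presence and capacity, along the lines of this hypothetical helper:

```cpp
// Sketch only: a null block_amax (no workspace passed) means "use the atomic path".
inline bool use_block_amax_path(const float *block_amax, size_t block_capacity,
                                size_t num_blocks) {
  return block_amax != nullptr && block_capacity >= num_blocks;
}
```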
See #384 for the GH actions CI. |
Description
Implements a two-stage HIP kernel for the amax operation as an alternative to the original implementation, which uses atomic reductions, and makes the two-stage kernel the default. Users can set export NVTE_USE_ATOMIC_AMAX=1 to fall back to the atomic amax kernel.
Fixes https://github.com/ROCm/frameworks-internal/issues/14303.
See https://github.com/ROCm/frameworks-internal/issues/14303#issuecomment-3554900809 for a performance analysis.
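For readers unfamiliar with the approach, the sketch below illustrates the general two-stage idea in isolation (simplified: FP32 input only, fixed 256-thread blocks, no vectorization; it is not the PR's kernel): stage one writes one partial amax per block into the workspace, and stage two reduces that workspace with a single block, so no global atomics are needed.

```cpp
#include <cuda_runtime.h>
#include <cmath>

// Stage 1: each block reduces its grid-stride slice to one partial amax.
__global__ void amax_stage1(const float *__restrict__ input, float *__restrict__ block_amax,
                            size_t n) {
  __shared__ float smem[256];
  float local = 0.f;
  for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
       i += static_cast<size_t>(gridDim.x) * blockDim.x) {
    local = fmaxf(local, fabsf(input[i]));
  }
  smem[threadIdx.x] = local;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) block_amax[blockIdx.x] = smem[0];
}

// Stage 2: a single block reduces the per-block partials into the final amax.
__global__ void amax_stage2(const float *__restrict__ block_amax, float *__restrict__ amax,
                            int num_blocks) {
  __shared__ float smem[256];
  float local = 0.f;
  for (int i = threadIdx.x; i < num_blocks; i += blockDim.x) {
    local = fmaxf(local, block_amax[i]);
  }
  smem[threadIdx.x] = local;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) *amax = smem[0];
}
```

The two kernels would be launched back-to-back on the same stream, e.g. amax_stage1<<<num_blocks, 256, 0, stream>>>(input, ws, n) followed by amax_stage2<<<1, 256, 0, stream>>>(ws, amax, num_blocks).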
TODO:
- nvte_compute_amax
- FIXMEs in the code