
Optimize pow function in the focal loss #1979

Closed

JohnNikolay84 wants to merge 310 commits into NVIDIA:master from ROCm:focal_loss_pow_optim_prod

Conversation


@JohnNikolay84 JohnNikolay84 commented Jan 21, 2026

The pow implementation is very expensive on the AMD MI355X. This commit replaces it with the mathematically equivalent exp(y*log(x)) for x > 0. A precision loss of 1-2 ULP is possible, but the improvement is solid: the runtime of focal_loss_forward_cuda_kernel drops by more than 50%.


lcskrishna and others added 30 commits August 18, 2020 11:53
* enable deprecated fused adam optimizer

* enable deprecated fused lamb

* enable xentropy extension

* add warpsize 32 for nv and 64 for amd

* update compiler arguments

* update the syncwarp conditions

* update syncwarp condition
* fix warp size in WARP_SHFL* in layernorm

* enable fused_layer_norm tests on ROCm
Hipify revamp changes for apex extensions on ROCm.
Fix reduce_block_into_lanes for multi_tensor_l2norm for ROCm
Conflicts:
csrc/multi_tensor_apply.cuh
setup.py
tests/L0/run_optimizers/test_adagrad.py
tests/L0/run_optimizers/test_fused_optimizer.py
tests/L0/run_optimizers/test_lamb.py
Mostly whitespace or formatting issues addressed.
Diff with upstream is reduced; ROCm changes are more clear.
use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
- incorrect use of __shfl_down
- fix warp size assumptions
- update unit tests to exit on failure
Revert "pass all TensorListMetadata as pointer to pinned host memory (#13)"
amd-sriram and others added 28 commits April 26, 2025 10:44
…port different grad (#207)

* Fix `DistributedFusedAdam` for grad dtype != param dtype (NVIDIA#1893)

* Pipeline `reduce-scatter` and `all-reduce`. (NVIDIA#1895)

---------

Co-authored-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Wil Kong <alpha0422@gmail.com>
The error:
File "/tmp/easy_install-_pfhn8pn/matplotlib-3.5.1/.eggs/setuptools_scm-8.3.1-py3.12.egg/setuptools_scm/_integration/pyproject_reading.py", line 36, in read_pyproject
    section = defn.get(tool, {})[tool_name]
              ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'setuptools_scm'

Solution: https://github.com/matplotlib/matplotlib/blob/v3.8.x/pyproject.toml#L22 matplotlib 3.8 is the first version whose pyproject.toml has this tool.setuptools_scm section. The newer setuptools_scm expects this structure in the packages it installs, and matplotlib 3.5.1 doesn't satisfy it. The fix is to change the requirement to matplotlib>=3.8.
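For reference, the section that the newer setuptools_scm looks for, and that matplotlib only ships from 3.8 onward, looks roughly like this (abbreviated; see the linked pyproject.toml for the full contents):

```toml
# matplotlib >= 3.8 declares this section; newer setuptools_scm
# raises KeyError: 'setuptools_scm' when it is missing
[tool.setuptools_scm]
```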
…reductions code inside the cuWelfordMuSigma2() function in the layer norm kernel assumes a warp size of 32, so a condition was added for ROCm to support the GPU warp size (based on earlier apex code). For ROCm, the thread size is also adjusted, based on earlier apex code. (#215)
* Fix fused_dense_gelu_dense, change the names of the parameters so that they can be accessed by the test appropriately

* Update the absolute tolerances in test_mlp from 0 and 1e-7 to 1e-5

* Deactivate the amp state handle for optimization levels other than O0. This helps the UT pass after this change.

* Update condition for deactivating amp state handle from opt level equal to 1 to opt level not equal to 0

* Update torch set default dtype method to remove warning

* Update the method to create overflow buffer for amp optimizer

* Update the method to create overflow buffer for amp optimizer

* Update the method to create overflow buffer for amp optimizer

* reset the default device to cpu so that the generator uses cuda, as the run_amp tests set it to cuda
In ROCm 7.0, the warpSize variable is no longer constexpr.
This commit replaces the variable use with the correct values
based on the architecture we're running on.
Do not use warpSize as a constexpr in nhwc_batch_norm_kernel.h
* Added aiter support in fused_rope.py for all 4 variants. Updated fused rope test, reduced tolerances according to unit test in aiter repo.

* Add aiter as a submodule and install it if it is rocm. Switch on aiter backend if it is rocm and aiter is installed

* add pandas to the requirements so that aiter can be used without numpy error - ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

* Replace the ROCM_HOME condition with IS_ROCM_PYTORCH for installing aiter, and use pip install -e . instead of python setup.py develop for installing aiter.

* Create apex and aiter subclasses for the four variants of FusedRoPEFunc and select apex or aiter subclass based on AITER_ROPE_BACKEND value. The user can specify the environment variable USE_ROCM_AITER_ROPE_BACKEND to select between aiter and apex backends for fused rope.

* If the AITER backend is selected, use lowered precision in the unit test otherwise use the original precision 1e-3

* warn user about the lower precision when using aiter backend for fused rope

* Update fused_rope.py

remove spaces

* simplify the switch between aiter and apex subclasses

* install aiter without editable mode
* add test to extract extensions from setup.py and check that they can be imported

* moved test outside tests/L0
* made a flag to switch on/off aiter compile using --aiter when installing apex

* Added information on building AITER during installation in readme
* replace c10_warp_size in fused rope

* replace c10_warp_size in fused softmax

* replace c10_warp_size in group batch norm

* replace c10_warp_size in multiheadattention

* replace c10_warp_size in transducer

* replace c10_warp_size in xentropy

* replace c10_warp_size in sync batch normalization

* replace c10_warp_size in group batch norm

* replace warp_size in multihead attention
This commit changes it to a mathematically equivalent
exp(y*log(x)) for x > 0.
However 1-2 ULP prec loss might be possible.
@amd-sriram amd-sriram deleted the focal_loss_pow_optim_prod branch January 22, 2026 09:41