Conversation

@xinyazhang commented Jul 7, 2025

This also enables gfx950 unit tests for ROCM >= 6.5.

rocm-repo-management-api bot commented Jul 7, 2025

Jenkins build for d4ca84094906731b10ccfdfacdaa5a8f319848cf commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang force-pushed the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch 2 times, most recently from d4ca840 to f7fffcb on July 7, 2025 at 22:29
rocm-repo-management-api bot commented Jul 7, 2025

Jenkins build for 9ee293482d0827c87561b23bcdd41c310d4ccca6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang changed the title from "Xinyazhang/rocm7.0torch2.4 backport aotriton0.9" to "[release/2.4] Backport AOTriton 0.9.2b to support gfx950" Jul 7, 2025
@xinyazhang changed the title from "[release/2.4] Backport AOTriton 0.9.2b to support gfx950" to "[release/2.4] Backport AOTriton 0.9.2b to support gfx950 and ROCM 7.0" Jul 7, 2025
@xinyazhang marked this pull request as ready for review July 7, 2025 22:53
xinyazhang (Author) commented:
This removal lets a non-root user build editable PyTorch. Not strictly necessary, but neat to have.

rocm-repo-management-api bot commented Jul 7, 2025

Jenkins build for 5171177128902f894069e25542e81a473f46fce5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

A collaborator commented:

Where does USE_ROCM_ATTENTION get defined?

xinyazhang (Author) replied:

Fixed; this was legacy code from the CK integration patch.

@jithunnair-amd (Collaborator) left a comment:

@xinyazhang The magnitude of changes to port 0.9.2b support to PyTorch 2.4 is substantial. I can see the benefit of not having to support more than one AOTriton version across multiple PyTorch versions in case more bc-breaking ROCm 7.0 changes necessitate generating a new tarball.

However, given the magnitude of changes, I'd request separating the non-AOTriton-related ROCm 7.0 compatibility changes into a different PR.

rocm-repo-management-api bot commented Jul 8, 2025

Jenkins build for 478abd95c7f94a68043748e61c774716c284733f commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang force-pushed the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch from 478abd9 to 77c5ab6 on July 8, 2025 at 15:38
@xinyazhang (Author) commented Jul 8, 2025

> @xinyazhang The magnitude of changes to port 0.9.2b support to PyTorch 2.4 is substantial. I can see the benefit of not having to support more than one AOTriton version across multiple PyTorch versions in case more bc-breaking ROCm 7.0 changes necessitate generating a new tarball.
>
> However, given the magnitude of changes, I'd request separating the non-AOTriton-related ROCm 7.0 compatibility changes into a different PR.

Done. Build fix PR: #2325

rocm-repo-management-api bot commented Jul 8, 2025

Jenkins build for 77c5ab6c7f2d1e2d1a736e8d7f1150aa2ddf7505 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang requested a review from jithunnair-amd July 8, 2025 19:17
@pruthvistony (Collaborator) commented:

Waiting for MI300 testing.

rocm-repo-management-api bot commented:

Jenkins build for 77c5ab6c7f2d1e2d1a736e8d7f1150aa2ddf7505 commit is in progress
Links: Blue Ocean view / Build artifacts

vickytsang and others added 7 commits July 15, 2025 16:04
…g AOTriton from source (pytorch#139432)

Pull Request resolved: pytorch#139432
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Vicky Tsang <vtsang@amd.com>
…te aotriton_version.txt (pytorch#137443)

We no longer need `install_aotriton.sh` and `aotriton_version.txt`, since `aotriton.cmake` now installs the best-matching binary release package as the default option when building PyTorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.

Also, with the changes in this PR, the CMake build process handles copying the AOTriton `.so` and images directory from `torch/lib` to the installation path.
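
A minimal sketch of how these three options might be exercised when building PyTorch for ROCm from source; the env var names come from the description above, while the build command and install path are illustrative:

```bash
# Option 1 (default): let aotriton.cmake detect and install a compatible binary release.
USE_ROCM=1 python setup.py develop

# Option 2: point the build at a pre-installed AOTriton (path is illustrative).
AOTRITON_INSTALLED_PREFIX=/opt/aotriton USE_ROCM=1 python setup.py develop

# Option 3: force AOTriton to be built from source instead.
AOTRITON_INSTALL_FROM_SOURCE=1 USE_ROCM=1 python setup.py develop
```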

Pull Request resolved: pytorch#137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
We received reports that the AOTriton kernels mishandle the bias pointer, causing NaNs when fine-tuning the llama3.2-11b vision model. This PR fixes the problem.

Note: AOTriton 0.8.1b adds head dimension 512 support, which increases the binary size, but this support is considered experimental and will not be enabled right now.

Pull Request resolved: pytorch#145508
Approved by: https://github.com/jeffdaily
This backports the following commit:

[ROCm] Bump AOTriton to 0.9.2b (pytorch#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimized support for non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two.
* `is_causal=True` cases are now supported with a persistent dynamic algorithm, which requires an atomic tensor but load-balances between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to lower kernel-invocation overhead.
* Support for gfx1201 (RX 9070 XT). Needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` (see the sketch after this list)
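
A hedged sketch of checking the bundled AOTriton version and opting into the experimental gfx1201 support at runtime; the `readelf` invocation and env var come from the notes above, while the library path and the tiny SDPA call are illustrative:

```bash
# Identify the AOTriton shared library version (library path is illustrative).
readelf -p .comment /path/to/torch/lib/libaotriton_v2.so

# Opt into the experimental gfx1201 support for a quick SDPA smoke test.
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python -c "
import torch
q = torch.randn(1, 8, 128, 64, device='cuda', dtype=torch.float16)
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print(out.shape)
"
```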

Pull Request resolved: pytorch#148433
Approved by: https://github.com/jeffdaily
#2105)

Also fixes the URL problem, where the release page does not always match the version string in the file name.
xinyazhang and others added 16 commits July 15, 2025 16:04
This reverts commit 2211aace36d46e98a0081e2ea91ef8c16818157c.
…pytorch#136627)

This change fixes the RUNPATH of installed C++ tests so that the dynamic loader can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```
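
After the fix, one way to sanity-check an installed test binary is to inspect its dynamic section; a minimal sketch, where the expected `$ORIGIN`-relative entry is an assumption about how the RUNPATH is encoded:

```bash
# Inspect the RUNPATH/RPATH of an installed test binary.
readelf -d venv/lib/python3.10/site-packages/torch/bin/test_lazy | grep -E 'RUNPATH|RPATH'
# A working install should show an entry that resolves to the directory
# containing libtorch.so, e.g. something like $ORIGIN/../lib (illustrative).
```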

Pull Request resolved: pytorch#136627
Approved by: https://github.com/malfet
This reverts commit b7eaa03dc469a73e4fe10f93fa779180c96c763e.
…ibraries (pytorch#136627)"

This reverts commit 84040be83ee4f8850e2384065415c1f8c8e997a5.
@xinyazhang (Author) commented Jul 15, 2025

@pruthvistony Tested on MI300X, but saw multiple failures with 0.9.2b (recall that release/2.4 has more UTs than release/2.5).

A subset of the observed errors:

FAILED [0.0271s] test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_head_dim_kernel1_cuda - AssertionError: RuntimeError not raised by <lambda>
FAILED [1.1922s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0214s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0077s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0075s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0093s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0092s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0094s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0088s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.9083s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0429s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0444s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0229s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0543s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0352s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
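
For reference, a single failure from the list above can be re-run in isolation; a minimal sketch, assuming a ROCm build of this branch on an MI300X node:

```bash
python -m pytest -v \
  "test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16"
```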

Trying 0.10b to see if there are failures.

@xinyazhang changed the title from "[release/2.4] Backport AOTriton 0.9.2b to support gfx950 and ROCM 7.0" to "[release/2.4] Backport AOTriton 0.10b to support gfx950 and ROCM 7.0" Jul 16, 2025
@xinyazhang force-pushed the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch from 77c5ab6 to 714e850 on July 16, 2025 at 01:17
rocm-repo-management-api bot commented Jul 16, 2025

Jenkins build for c3834b3e0166735c256bf998fee0979f8eee8ff4 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang requested a review from pruthvistony July 17, 2025 00:54
rocm-repo-management-api bot commented Jul 17, 2025

Jenkins build for c3834b3e0166735c256bf998fee0979f8eee8ff4 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jithunnair-amd merged commit 8ecb643 into release/2.4 Jul 17, 2025
1 of 12 checks passed
@jithunnair-amd deleted the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch July 17, 2025 16:26
alugorey pushed a commit to alugorey/pytorch that referenced this pull request Jul 17, 2025
…OCm#2318)

This also enables gfx950 unit tests for ROCM >= 6.5.

---------

Co-authored-by: Vicky Tsang <vtsang@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Mwiza Kunda <mwizak@graphcore.ai>
Co-authored-by: cyyever <cyyever@outlook.com>
@xinyazhang restored the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch July 23, 2025 00:54