Conversation

@xinyazhang commented Jul 7, 2025

This also enables gfx950 unit tests for ROCM >= 6.5.

rocm-repo-management-api bot commented Jul 7, 2025

Jenkins build for d4ca84094906731b10ccfdfacdaa5a8f319848cf commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang force-pushed the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch 2 times, most recently from d4ca840 to f7fffcb on July 7, 2025 at 22:29
rocm-repo-management-api bot commented Jul 7, 2025

Jenkins build for 9ee293482d0827c87561b23bcdd41c310d4ccca6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang changed the title from "Xinyazhang/rocm7.0torch2.4 backport aotriton0.9" to "[release/2.4] Backport AOTriton 0.9.2b to support gfx950" Jul 7, 2025
@xinyazhang changed the title from "[release/2.4] Backport AOTriton 0.9.2b to support gfx950" to "[release/2.4] Backport AOTriton 0.9.2b to support gfx950 and ROCM 7.0" Jul 7, 2025
@xinyazhang marked this pull request as ready for review July 7, 2025 22:53
xinyazhang (Author) commented:
This removal lets a non-root user build editable PyTorch. Not strictly necessary, but neat to have.

rocm-repo-management-api bot commented Jul 7, 2025

Jenkins build for 5171177128902f894069e25542e81a473f46fce5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

A collaborator commented:

Where does USE_ROCM_ATTENTION get defined?

xinyazhang (Author) replied:

Fixed; this was legacy code from the CK integration patch.

@jithunnair-amd (Collaborator) left a comment:

@xinyazhang The magnitude of changes to port 0.9.2b support to PyTorch 2.4 is substantial. I can see the benefit of not having to support more than one AOTriton version across multiple PyTorch versions in case more bc-breaking ROCm 7.0 changes necessitate generating a new tarball.

However, given the magnitude of changes, I'd request separating the non-AOTriton-related ROCm 7.0 compatibility changes into a different PR.

rocm-repo-management-api bot commented Jul 8, 2025

Jenkins build for 478abd95c7f94a68043748e61c774716c284733f commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang force-pushed the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch from 478abd9 to 77c5ab6 on July 8, 2025 at 15:38
@xinyazhang (Author) commented Jul 8, 2025

> @xinyazhang The magnitude of changes to port 0.9.2b support to PyTorch 2.4 is substantial. I can see the benefit of not having to support more than one AOTriton version across multiple PyTorch versions in case more bc-breaking ROCm 7.0 changes necessitate generating a new tarball.
>
> However, given the magnitude of changes, I'd request separating the non-AOTriton-related ROCm 7.0 compatibility changes into a different PR.

Done. Build fix PR: #2325

rocm-repo-management-api bot commented Jul 8, 2025

Jenkins build for 77c5ab6c7f2d1e2d1a736e8d7f1150aa2ddf7505 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang requested a review from jithunnair-amd July 8, 2025 19:17
@pruthvistony (Collaborator) commented:

Waiting for MI300 testing.

rocm-repo-management-api bot commented:

Jenkins build for 77c5ab6c7f2d1e2d1a736e8d7f1150aa2ddf7505 commit is in progress
Links: Blue Ocean view / Build artifacts

vickytsang and others added 7 commits July 15, 2025 16:04
…g AOTriton from source (pytorch#139432)

Pull Request resolved: pytorch#139432
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Vicky Tsang <vtsang@amd.com>
…te aotriton_version.txt (pytorch#137443)

We no longer need `install_aotriton.sh` and `aotriton_version.txt`, since `aotriton.cmake` now installs the best-matching binary release package as the default option when building PyTorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.

Also, with the changes in this PR, the CMake build process handles copying the AOTriton `.so` and images directory from `torch/lib` to the installation path.
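
A minimal sketch of how these three options might be exercised when building PyTorch for ROCm from source; the env var names come from the description above, while the build command and install path are illustrative:

```bash
# Option 1 (default): let aotriton.cmake detect and install a compatible binary release.
USE_ROCM=1 python setup.py develop

# Option 2: point the build at a pre-installed AOTriton (path is illustrative).
AOTRITON_INSTALLED_PREFIX=/opt/aotriton USE_ROCM=1 python setup.py develop

# Option 3: force AOTriton to be built from source instead.
AOTRITON_INSTALL_FROM_SOURCE=1 USE_ROCM=1 python setup.py develop
```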

Pull Request resolved: pytorch#137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
We received reports that the AOTriton kernels mishandle the bias pointer, causing NaNs when fine-tuning the llama3.2-11b vision model. This PR fixes the problem.

Note: AOTriton 0.8.1b adds head dimension 512 support, which increases the binary size, but this support is considered experimental and will not be enabled right now.

Pull Request resolved: pytorch#145508
Approved by: https://github.com/jeffdaily
This backports the following commit:

[ROCm] Bump AOTriton to 0.9.2b (pytorch#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimized support for non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two.
* `is_causal=True` cases are now supported with a persistent dynamic algorithm, which requires an atomic tensor but load-balances between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to lower kernel-invocation overhead.
* Support for gfx1201 (RX 9070 XT). Needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` (see the sketch after this list)
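
A hedged sketch of checking the bundled AOTriton version and opting into the experimental gfx1201 support at runtime; the `readelf` invocation and env var come from the notes above, while the library path and the tiny SDPA call are illustrative:

```bash
# Identify the AOTriton shared library version (library path is illustrative).
readelf -p .comment /path/to/torch/lib/libaotriton_v2.so

# Opt into the experimental gfx1201 support for a quick SDPA smoke test.
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python -c "
import torch
q = torch.randn(1, 8, 128, 64, device='cuda', dtype=torch.float16)
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print(out.shape)
"
```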

Pull Request resolved: pytorch#148433
Approved by: https://github.com/jeffdaily
#2105)

Also fixes the URL problem, where the release page does not always match the version string in the file name.
xinyazhang and others added 16 commits July 15, 2025 16:04
This reverts commit 2211aace36d46e98a0081e2ea91ef8c16818157c.
…pytorch#136627)

This change fixes the RUNPATH of installed C++ tests so that the dynamic loader can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```
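
After the fix, one way to sanity-check an installed test binary is to inspect its dynamic section; a minimal sketch, where the expected `$ORIGIN`-relative entry is an assumption about how the RUNPATH is encoded:

```bash
# Inspect the RUNPATH/RPATH of an installed test binary.
readelf -d venv/lib/python3.10/site-packages/torch/bin/test_lazy | grep -E 'RUNPATH|RPATH'
# A working install should show an entry that resolves to the directory
# containing libtorch.so, e.g. something like $ORIGIN/../lib (illustrative).
```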

Pull Request resolved: pytorch#136627
Approved by: https://github.com/malfet
This reverts commit b7eaa03dc469a73e4fe10f93fa779180c96c763e.
…ibraries (pytorch#136627)"

This reverts commit 84040be83ee4f8850e2384065415c1f8c8e997a5.
@xinyazhang (Author) commented Jul 15, 2025

@pruthvistony Tested on MI300X, but saw multiple failures with 0.9.2b (recall that release/2.4 has more UTs than release/2.5).

A subset of the observed errors:

FAILED [0.0271s] test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_head_dim_kernel1_cuda - AssertionError: RuntimeError not raised by <lambda>
FAILED [1.1922s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0214s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0077s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0075s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0093s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0092s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0094s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.0088s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16 - AssertionError: Tensor-likes are not close!
FAILED [0.9083s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0429s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0444s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0229s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0543s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
FAILED [0.0352s] test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
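
For reference, a single failure from the list above can be re-run in isolation; a minimal sketch, assuming a ROCm build of this branch on an MI300X node:

```bash
python -m pytest -v \
  "test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16"
```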

Trying 0.10b to see if there are failures.

@xinyazhang changed the title from "[release/2.4] Backport AOTriton 0.9.2b to support gfx950 and ROCM 7.0" to "[release/2.4] Backport AOTriton 0.10b to support gfx950 and ROCM 7.0" Jul 16, 2025
@xinyazhang force-pushed the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch from 77c5ab6 to 714e850 on July 16, 2025 at 01:17
rocm-repo-management-api bot commented Jul 16, 2025

Jenkins build for c3834b3e0166735c256bf998fee0979f8eee8ff4 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@xinyazhang requested a review from pruthvistony July 17, 2025 00:54
rocm-repo-management-api bot commented Jul 17, 2025

Jenkins build for c3834b3e0166735c256bf998fee0979f8eee8ff4 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jithunnair-amd merged commit 8ecb643 into release/2.4 Jul 17, 2025
1 of 12 checks passed
@jithunnair-amd deleted the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch July 17, 2025 16:26
alugorey pushed a commit to alugorey/pytorch that referenced this pull request Jul 17, 2025
…OCm#2318)

This also enables gfx950 unit tests for ROCM >= 6.5.

---------

Co-authored-by: Vicky Tsang <vtsang@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Mwiza Kunda <mwizak@graphcore.ai>
Co-authored-by: cyyever <cyyever@outlook.com>
@xinyazhang restored the xinyazhang/rocm7.0torch2.4-backport_aotriton0.9 branch July 23, 2025 00:54