[ROCM] Backport AOTriton 0.11b to release/2.8 and other fixes #2686
Conversation
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:
* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
- AITER ASM kernels deliver over 500 TFLOPS of training performance. See
[AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
details.
* Now returns a natural-log-based `logsumexp` tensor, matching CUDA's behavior
- PR pytorch#156903 is also reverted in this PR since it is no longer needed.
* Enables `CausalVariant.LOWER_RIGHT`
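For context, here is a minimal sketch (not code from this PR) of how the now-enabled lower-right causal variant can be exercised through the public `torch.nn.attention.bias` API; the tensor shapes and dtype below are illustrative assumptions.

```python
# Illustrative only: exercises the lower-right causal variant via the public
# helper torch.nn.attention.bias.causal_lower_right. Shapes and dtype are
# arbitrary examples, not values taken from this PR.
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right

device, dtype = "cuda", torch.float16  # e.g. a gfx942 GPU under ROCm
batch, heads, seq_q, seq_kv, head_dim = 2, 8, 128, 256, 64

q = torch.randn(batch, heads, seq_q, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq_kv, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq_kv, head_dim, device=device, dtype=dtype)

# Lower-right alignment pins the causal diagonal to the bottom-right corner,
# which only differs from the upper-left variant when seq_q != seq_kv.
mask = causal_lower_right(seq_q, seq_kv)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```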
The build system changes drastically along with the new packaging scheme of
AOTriton 0.11:
* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
`PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
matches the ROCm version in the build environment. For PyTorch builds with
ROCm versions not listed in the file, the build process builds the AOTriton
runtime (without GPU images) from source
- This avoids any further ABI breaks like the ROCm 6.4 -> 7.0 one
- Recursive git clone is disabled since building the AOTriton runtime does not
require submodules.
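To make the selective-download behavior concrete, here is a hedged sketch of a from-source build driver; the architecture list and build command are assumptions about a typical ROCm build and are not mandated by this PR.

```python
# Hypothetical build driver: restricting PYTORCH_ROCM_ARCH means
# aotriton.cmake only downloads the GPU image packs for the listed targets.
# The env values and build command are typical examples, not part of this PR.
import os
import subprocess

env = os.environ.copy()
env["USE_ROCM"] = "1"
env["PYTORCH_ROCM_ARCH"] = "gfx942;gfx950"  # fetch only these image packs

# Standard from-source PyTorch build; run from the pytorch checkout root.
subprocess.run(["python", "setup.py", "develop"], env=env, check=True)
```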
Bug fixes:
* Fix a kernel bug introduced when implementing SWA (sliding window attention)
Known Problems:
* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
due to accuracy issues. Triton compiler fixes are needed to restore full
support.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCm 7.0.
This issue is under investigation.
Pull Request resolved: pytorch#161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Cherry-pick pytorch#162330: Enables flash attention and/or memory-efficient attention on Windows with `scaled_dot_product_attention` via AOTriton. Already tested to be working on Windows with TheRock. To enable, simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
Pull Request resolved: pytorch#162330
Approved by: https://github.com/jeffdaily
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
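As a quick sanity check after such a build, one might restrict SDPA dispatch to the newly enabled backends. This is a hedged sketch, not test code from the PR; the input shapes are illustrative.

```python
# Hedged sketch: verify the flash / memory-efficient SDPA backends built with
# USE_FLASH_ATTENTION=1 and USE_MEM_EFF_ATTENTION=1 are actually selectable.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION):
    # sdpa_kernel restricts dispatch to the given backend; a RuntimeError here
    # means the backend was not compiled in or the inputs are unsupported.
    with sdpa_kernel(backend):
        out = F.scaled_dot_product_attention(q, k, v)
    print(backend, "OK", tuple(out.shape))
```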
Cherry-pick pytorch#163373: Early assignment of `__AOTRITON_LIB` breaks the usage of the environment variable `$AOTRITON_INSTALLED_PREFIX`.
Pull Request resolved: pytorch#163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
Cherry-pick pytorch#163745:
* Efficient Attention on ROCm requires the last dimension of input tensors to be aligned to 16 bytes.
- Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this requirement.
* Fix `atomic_counter` handling in the varlen FA API
* Unskips a few unit tests.
Fixes pytorch#157120 Fixes pytorch#157121 Fixes pytorch#157122 Fixes pytorch#157167 Fixes pytorch#155217 Fixes pytorch#157043 Fixes pytorch#157060
Pull Request resolved: pytorch#163745
Approved by: https://github.com/jeffdaily
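To make the alignment requirement concrete: with fp16 inputs, a 16-byte-aligned last dimension means the head dimension must be a multiple of 8. Below is a hedged sketch of caller-side padding; the helper name is hypothetical and not part of this PR.

```python
# Hypothetical helper (not from this PR): pad the head dimension so the last
# dim of q/k/v is 16-byte aligned (a multiple of 8 for fp16), then slice the
# output back. Zero columns add nothing to q @ k^T, and the explicit scale
# keeps the softmax temperature that of the original head_dim.
import math
import torch
import torch.nn.functional as F

def sdpa_with_aligned_head_dim(q, k, v, multiple=8):
    head_dim = q.shape[-1]
    pad = (-head_dim) % multiple
    if pad == 0:
        return F.scaled_dot_product_attention(q, k, v)
    q, k, v = (F.pad(t, (0, pad)) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(head_dim))
    return out[..., :head_dim]
```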
Jenkins build for c497508b3a83a1a6293f7c5c842614eec613f2f7 commit finished as FAILURE
@xinyazhang Are these changes already present in the release/2.9 branch?
Yes, AOTriton 0.11 is already included in release/2.9.
Testing results on MI350:




Fixes: pytorch#163958
Cherry-pick pytorch#161754
Cherry-pick pytorch#162330
Cherry-pick pytorch#163373
Cherry-pick pytorch#163745
Note that TF32 support is still being plagued by `HIPBLASLT_ALLOW_TF32`, which
should be handled in a separate PR due to its complexity.