@xinyazhang xinyazhang commented Sep 29, 2025

Fixes: pytorch#163958

Cherry-pick pytorch#161754
Cherry-pick pytorch#162330
Cherry-pick pytorch#163373
Cherry-pick pytorch#163745

Note: TF32 support is still affected by `HIPBLASLT_ALLOW_TF32`. Because of its complexity, that issue should be handled in a separate PR.

xinyazhang and others added 4 commits September 29, 2025 20:41
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500 TFLOPS training performance. See
    [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
    details.
* Now returns a natural-base `logsumexp` tensor, matching CUDA's behavior
  - PR pytorch#156903 is reverted in this PR as well since it is no longer needed.
* Enables `CausalVariant.LOWER_RIGHT`
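
To make the last two items concrete, here is a minimal pure-Python sketch (it does not call PyTorch or AOTriton). It shows what a natural-base `logsumexp` is, with a base-2 variant included only for contrast, and how a lower-right-aligned causal mask differs from the usual upper-left one when the query and key lengths differ. All function names are hypothetical, and the base-2 convention is an assumption used for illustration, not a statement about earlier AOTriton releases.

```python
import math


def logsumexp_natural(scores):
    """Natural-base logsumexp: ln(sum(exp(s))), computed stably."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def logsumexp_base2(scores):
    """Base-2 variant, shown only for contrast: log2(sum(exp(s)))."""
    return logsumexp_natural(scores) / math.log(2.0)


def causal_mask(seq_q, seq_kv, lower_right):
    """Boolean mask; True means query i may attend to key j.

    Upper-left alignment anchors the diagonal at (0, 0).
    Lower-right alignment shifts it by (seq_kv - seq_q) so that the
    last query row can see every key.
    """
    offset = seq_kv - seq_q if lower_right else 0
    return [[j <= i + offset for j in range(seq_kv)] for i in range(seq_q)]


if __name__ == "__main__":
    print(logsumexp_natural([0.0, 0.0]), logsumexp_base2([0.0, 0.0]))
    for row in causal_mask(2, 4, lower_right=True):
        print(row)
```

The two mask alignments coincide when `seq_q == seq_kv`, so the difference only matters for inputs such as decoding with a cached key/value sequence.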

The build system changes substantially to follow the new packaging scheme of
AOTriton 0.11:

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
  `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
  matches the ROCm version in the build environment. For PyTorch builds with ROCm
  versions not listed in the file, the build process builds the AOTriton
  runtime from source, without GPU images
  - This avoids further ABI breaks like the one between ROCm 6.4 and 7.0
  - Recursive git clone is disabled since building the AOTriton runtime does not
    require submodules.
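
The selection logic described in this list can be sketched roughly as follows. This is a Python illustration of the decision flow only, not the contents of `aotriton.cmake`: the set of ROCm versions with a pre-compiled runtime, the `images-<arch>` pack naming, and the function names are all hypothetical. The sketch also assumes that image packs are still downloaded when the runtime is built from source.

```python
# Hypothetical: ROCm versions for which a pre-compiled runtime is listed.
PREBUILT_RUNTIME_ROCM = {"6.4", "7.0"}


def parse_rocm_arch(value):
    """Split a semicolon-separated PYTORCH_ROCM_ARCH string into targets."""
    targets = []
    for arch in value.split(";"):
        arch = arch.strip()
        if arch and arch not in targets:  # drop blanks and duplicates
            targets.append(arch)
    return targets


def plan_aotriton(rocm_version, rocm_arch):
    """Decide how to obtain the runtime and which image packs to fetch."""
    # Exact match only: an unlisted ROCm version falls back to a source build.
    runtime = "prebuilt" if rocm_version in PREBUILT_RUNTIME_ROCM else "source"
    # Hypothetical naming scheme: one image pack per requested target.
    packs = ["images-" + arch for arch in parse_rocm_arch(rocm_arch)]
    return {"runtime": runtime, "image_packs": packs}


if __name__ == "__main__":
    print(plan_aotriton("7.0", "gfx942;gfx950"))
    print(plan_aotriton("6.3", "gfx90a"))
```

Requiring an exact version match, rather than picking the nearest listed version, is what keeps a runtime built against one ROCm ABI from being linked into a build that uses another.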

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
  due to accuracy issues. Triton compiler fixes are needed to restore the
  support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
  This issue is under investigation.

Pull Request resolved: pytorch#161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
…indows. (pytorch#162330)

Enables flash attention and/or memory-efficient attention on Windows with `scaled_dot_product_attention` via AOTriton.
This has already been tested and works on Windows with TheRock.

Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
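
As a small illustration of the step above, the following sketch prepares an environment with both variables set, which can then be handed to whatever command builds PyTorch in a given setup. The helper name `attention_build_env` is hypothetical and the build command itself is deliberately left out.

```python
import os


def attention_build_env(base_env=None):
    """Return a copy of the environment with both attention backends enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["USE_FLASH_ATTENTION"] = "1"
    env["USE_MEM_EFF_ATTENTION"] = "1"
    return env


if __name__ == "__main__":
    env = attention_build_env()
    print(env["USE_FLASH_ATTENTION"], env["USE_MEM_EFF_ATTENTION"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that the variables apply to the build process without modifying the current shell.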

Pull Request resolved: pytorch#162330
Approved by: https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
…3373)

Early assignment of `__AOTRITON_LIB` breaks the use of the environment variable `$AOTRITON_INSTALLED_PREFIX`.

Pull Request resolved: pytorch#163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
* Efficient Attention on ROCm requires the last dimension of input tensors to be aligned to 16 bytes.
  - Unlike Flash Attention (FA), memory-efficient attention (ME) does not pad input tensors in `scaled_dot_product_attention`, hence this requirement.
* Fix `atomic_counter` handling in the varlen FA API
* Unskip a few unit tests.
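
The alignment requirement in the first item can be read as simple arithmetic on the head dimension. The sketch below assumes the rule means that the last dimension's size in bytes (`head_dim * element_size`) must be a multiple of 16; the actual check in PyTorch may be phrased differently, and the function names here are hypothetical.

```python
import math

ALIGN_BYTES = 16


def last_dim_is_aligned(head_dim, element_size):
    """True if head_dim elements of element_size bytes span a multiple of 16 bytes."""
    return (head_dim * element_size) % ALIGN_BYTES == 0


def padded_head_dim(head_dim, element_size):
    """Smallest head_dim' >= head_dim whose byte size is a multiple of 16."""
    # Number of elements that make up one aligned block for this element size.
    elems_per_block = ALIGN_BYTES // math.gcd(ALIGN_BYTES, element_size)
    # Round head_dim up to the next multiple of elems_per_block.
    return -(-head_dim // elems_per_block) * elems_per_block


if __name__ == "__main__":
    # 2-byte elements (e.g. fp16/bf16): head_dim must be a multiple of 8.
    print(last_dim_is_aligned(64, 2), last_dim_is_aligned(36, 2))
    print(padded_head_dim(36, 2))
```

Under this reading, callers that want the memory-efficient backend with an unaligned head dimension would need to pad the inputs themselves before calling `scaled_dot_product_attention`.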

Fixes pytorch#157120
Fixes pytorch#157121
Fixes pytorch#157122
Fixes pytorch#157167
Fixes pytorch#155217
Fixes pytorch#157043
Fixes pytorch#157060

Pull Request resolved: pytorch#163745
Approved by: https://github.com/jeffdaily
@xinyazhang xinyazhang changed the title from "[ROCM] Bump AOTriton to 0.11b and other fixes" to "[ROCM] Backport AOTriton 0.11b to release/2.8 and other fixes" Sep 29, 2025

rocm-repo-management-api bot commented Sep 29, 2025

Jenkins build for c497508b3a83a1a6293f7c5c842614eec613f2f7 commit finished as FAILURE

@slojosic-amd

@xinyazhang Are these changes already present in release/2.9 branch?

@xinyazhang

> @xinyazhang Are these changes already present in release/2.9 branch?

Yes, AOTriton 0.11 is already included in release/2.9.

@xinyazhang

Testing results on MI350 [screenshots omitted]:

* Unrelated errors from test/test_nn.py
* Unrelated errors from test/test_flop_counter.py
* Known TF32-related errors. Need to cherry-pick pytorch#162998

@pruthvistony pruthvistony merged commit cbd27ae into release/2.8 Oct 7, 2025
1 of 3 checks passed
@pruthvistony pruthvistony deleted the xinyazhang/2.8-aotriton_0.11 branch October 7, 2025 17:03