[release/2.9] Cherrypick aotriton build fixes and Windows support #2712
Conversation
…indows. (pytorch#162330)
Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton. Already tested to be working on Windows with TheRock. To enable, set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
Pull Request resolved: pytorch#162330
Approved by: https://github.com/jeffdaily
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
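For reference, a minimal runtime check of whether these backends actually dispatch (an illustrative sketch, not part of the change; the tensor shapes and the `sdpa_kernel` probe are just one way to test this on a ROCm build):

```python
# Probe whether flash / memory-efficient SDPA dispatch on this build.
# Shapes and dtype are illustrative; sdpa_kernel pins a single backend so a
# missing kernel surfaces as a RuntimeError instead of silently falling back.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION):
    try:
        with sdpa_kernel(backend):
            torch.nn.functional.scaled_dot_product_attention(q, k, v)
        print(f"{backend.name}: dispatched")
    except RuntimeError as err:
        print(f"{backend.name}: unavailable ({err})")
```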
We can also test this more exhaustively via TheRock tomorrow.
Jenkins build for f77d860bdeec894b8a7886025d72ed21ebe2f562 commit finished as FAILURE
@ScottTodd you can check my branch: https://github.com/ROCm/pytorch/tree/release/2.9-rocm7.x-gfx115x I think you should cherry-pick these 3 commits also:
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`.
Fixes pytorch#157094, Fixes pytorch#157093, Fixes pytorch#157092, Fixes pytorch#157091, Fixes pytorch#157064, Fixes pytorch#157063, Fixes pytorch#157062, Fixes pytorch#157061, Fixes pytorch#157042, Fixes pytorch#157041, Fixes pytorch#157039, Fixes pytorch#157004
Pull Request resolved: pytorch#162998
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
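As a hedged aside (not code from the fix itself): as I understand it, `HIPBLASLT_ALLOW_TF32` governs whether hipBLASLt may use TF32-like compute on ROCm, alongside PyTorch's Python-level TF32 switch:

```python
# Illustration only: PyTorch's Python-level TF32 switch for matmuls. On ROCm
# builds the HIPBLASLT_ALLOW_TF32 environment variable (mentioned above)
# additionally affects whether hipBLASLt may use TF32-like compute.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # opt in to TF32 matmuls
print("allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
```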
…3373)
Early assignment of `__AOTRITON_LIB` breaks the usage of the environment variable `$AOTRITON_INSTALLED_PREFIX`.
Pull Request resolved: pytorch#163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
## Major Changes
* Efficient Attention on ROCM requires the last dimensions of input tensors to be aligned to 16 bytes. Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this is required.
* Fix `atomic_counter` handling in varlen FA API
* Unskips a few unit tests.

Fixes pytorch#157120, Fixes pytorch#157121, Fixes pytorch#157122, Fixes pytorch#157167, Fixes pytorch#155217, Fixes pytorch#157043, Fixes pytorch#157060
Pull Request resolved: pytorch#163745
Approved by: https://github.com/jeffdaily
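To illustrate the alignment constraint from the first bullet (a hypothetical user-side workaround, not code from this PR): with fp16 inputs, a 16-byte alignment of the last dimension means the head dimension should be a multiple of 8 elements, so one option is to pad before SDPA and slice afterwards:

```python
# Hypothetical helper (not from this PR): pad head_dim so the last dimension
# is 16-byte aligned, keep the original softmax scale, then slice the padding
# back off the output. Zero-padding q/k/v leaves the attention result unchanged.
import math
import torch
import torch.nn.functional as F

def sdpa_padded(q, k, v, align_bytes=16):
    d = q.shape[-1]
    mult = align_bytes // q.element_size()   # 8 elements for fp16/bf16
    pad = (-d) % mult
    if pad == 0:
        return F.scaled_dot_product_attention(q, k, v)
    q, k, v = (F.pad(t, (0, pad)) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(d))
    return out[..., :d]
```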
Thanks. My local Windows builds succeed both with just the one cherry-pick and with the additional three cherry-picks you suggest. I can add those additional cherry-picks to this PR if we want to "rebase and merge" them all in a batch, or I can send them individually.
The Jenkins build seemed to fail with a triton-related error. Is that related to this change or not? I think it isn't, since triton =/= aotriton.
hmm it could be aotriton related. I see it in https://github.com/ROCm/aotriton/blob/main/dockerfile/input/install.sh#L14
EDIT: not sure where …
I pushed those other cherry-picks. I expect we'll see Jenkins job results in ~50 minutes?
I don't believe the Jenkins CI here builds for Windows.
Jenkins build for 7286cf8a19fba6420029944ae0c35eb576ed650f commit finished as FAILURE
FYI I've pushed these changes to a new `release/2.9_rocm7.9` branch.
The more recent Jenkins build failed with the same error. I'm not sure what to do about that.
Argh, my builds had aotriton disabled due to how the PyTorch build caches config variables. Double checking with a clean build now.
My local builds seemed to succeed with this branch and aotriton actually enabled (visible in build logs + files present in the .whls). However, I'm seeing the same performance via comfyui with and without aotriton on gfx1100, even with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 and python D:\projects\ComfyUI\main.py --use-split-cross-attention. I see about 12.6it/s for image generation tasks while a month ago I reported 20.0it/s with aotriton 🤔
Logs before updating comfyui itself to latest had this:
D:\projects\ComfyUI\comfy\ops.py:47: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:800.)
return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)
Those logs are not present after updating comfyui to latest.
The latest torch + rocm wheels from https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages get me about 14it/s.
rocm==7.10.0a20251015
rocm-sdk-core==7.10.0a20251015
rocm-sdk-libraries-gfx110X-dgpu==7.10.0a20251015
torch==2.10.0a0+rocm7.10.0a20251015
torchaudio==2.8.0a0+rocm7.10.0a20251015
torchsde==0.2.6
torchvision==0.25.0a0+rocm7.10.0a20251015
Not sure where the diffs are coming from. Could be:
- Missing more changes on 2.9 that are present on 2.10a
- My system is under more load now (could also test with older releases)
- Aotriton is not actually enabled / in use? (a quick build-config check is sketched below)
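One rough way to rule out that last bullet (hedged: the exact strings in the build summary vary between PyTorch builds) is to grep the compiled-in config for the attention-related options:

```python
# Rough check: look for attention/ROCm options in the compiled-in build summary.
# The exact keys present vary between PyTorch builds.
import torch

cfg = torch.__config__.show()
for key in ("USE_FLASH_ATTENTION", "USE_MEM_EFF_ATTENTION", "USE_ROCM"):
    hits = [line.strip() for line in cfg.splitlines() if key in line]
    print(key, "->", hits or "not listed")
```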
Ah I missed the part where you already had TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 set.
> Those logs are not present after updating comfyui to latest.
The latest one on main disables MIOpen itself, but aotriton should still be running I think.
Weird...
- My locally built .whl files have `torch/lib/aotriton_v2.dll`
- I do not see that DLL in `site-packages/torch/lib/` after installing the locally built .whl files
- I do see that DLL after installing our nightly built .whl files (from torch 2.10a / nightly / main)
- The script for installing the locally built wheels shows missing aotriton:
(3.12.venv) λ python D:\scratch\python\validate_torch_vroom.py
Benchmarking Scaled Dot-Product Attention (Flash) in FP16 ...
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:800.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Memory efficient kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:938.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Flash attention kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:940.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Torch was not compiled with flash attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:749.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: cuDNN attention kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:942.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Torch was not compiled with cuDNN attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:683.)
  out = scaled_dot_product_attention(q, k, v)
Traceback (most recent call last):
  File "D:\scratch\python\validate_torch_vroom.py", line 215, in <module>
    sdpa_time, sdpa_mem, sdpa_gflops = measure_op(run_sdpa, warmup=3, total_runs=10)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\scratch\python\validate_torch_vroom.py", line 34, in measure_op
    t_ms, peak_mb, gf_s = op_func()
                          ^^^^^^^^^
  File "D:\scratch\python\validate_torch_vroom.py", line 72, in run_sdpa
    out = scaled_dot_product_attention(q, k, v)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.
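A quick check of whether the aotriton runtime actually shipped with the installed package (a small sketch; the file name pattern follows the `aotriton_v2.dll` mentioned above):

```python
# List any aotriton runtime libraries bundled inside the installed torch package.
import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
found = [f for f in os.listdir(lib_dir) if "aotriton" in f.lower()]
print(lib_dir, "->", found or "no aotriton libraries found")
```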
@ScottTodd the aotriton_v2.dll file is copied over from <torch_src>/torch/lib which could be a remnant of previous builds. It's likely that it got copied over even though torch was built without aotriton.
🤦 I built torch-2.9.0 after we changed the version but installed my prior build of torch-2.9.0a0...
Okay, aotriton is there with my local build from this PR (or the release/2.9_rocm7.9 branch)
17 it/s with
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python D:\projects\ComfyUI\main.py --use-pytorch-cross-attention
14.5 it/s with
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=0
python D:\projects\ComfyUI\main.py
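For completeness, a stripped-down SDPA timing loop along the lines of this comparison (hedged: the shapes and iteration count are illustrative, each "it" here is a single SDPA call rather than a ComfyUI step, and `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL` should be toggled in the shell before launching, as above):

```python
# Minimal SDPA throughput sketch; shapes are illustrative, not the ComfyUI workload.
import time
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

for _ in range(3):                      # warmup
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

iters = 50
start = time.perf_counter()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()
print(f"{iters / (time.perf_counter() - start):.1f} it/s")
```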
@ScottTodd @jammm maybe this change is missing: pytorch#165538
> @ScottTodd @jammm maybe this change is missing: pytorch#165538
Could be useful. In my build logs I see this though, showing that it isn't strictly required here yet:
-- Cannot find AOTriton runtime for ROCM 7.1. Build runtime from source
(the top level version should technically be 7.9 I think, but it is specified in multiple subprojects with different values)
Try running this script: https://gist.githubusercontent.com/scottt/fb45ba422f9f133223ebb281fca8dc5d/raw/26cb846bf293d75a0c769638c14a976ecc8d663a/validate_torch_vroom.py
If it works, it means aotriton is enabled. If not, it means it's not built in yet. Also, I heard gfx1100 was added back to the experimental pool, so you might have to turn on that experimental env var (not on PC yet so can't paste it here, will do soon).
@ScottTodd flag for ComfyUI should be: `--use-pytorch-cross-attention`
Can someone advise on the Jenkins failures on Linux? We're going to trigger a Windows release build including these cherry-picks and I'd like them landed on the common `release/2.9` branch.
it seems like it's trying to …
## Motivation
Fixes #1677, filling in the latest support matrix for supported PyTorch versions for our nightly release builds. I also took the opportunity to clarify and refresh the documentation.
## Technical Details
For now this uses `release/2.9_rocm7.9`. Depends on ROCm/pytorch#2712 to use `release/2.9`.
## Test Plan
Trigger test release builds using https://github.com/ROCm/TheRock/actions/workflows/release_windows_pytorch_wheels.yml
- [ ] Test `release/2.9` once that PR is merged
- [x] Test `release/2.9_rocm7.9`: https://github.com/ROCm/TheRock/actions/runs/18598633844
## Test Results
Test release builds completed and sanity checks passed.
The reason for the …
Merging as it seems all questions have been resolved...

Overview
This cherry-picks a few changes to the release/2.9 branch:
Notes
A `-DHIP_PLATFORM=amd` configure line is required for building on Linux via TheRock, at least until "[Issue]: hip-config.cmake in _rocm_sdk_devel has HIP_INSTALLS_HIPCC set to OFF" (TheRock#1402) is resolved. Note that implicit detection of the "HIP platform" is not recommended, per https://github.com/ROCm/rocm-systems/blob/c8ecf77a94e2d9afe48dae7d9e549937abe25777/projects/clr/CMakeLists.txt#L62-L63
Testing
I have not tested on Linux from this release branch, but we have been building from source nightly on Windows in TheRock using this code since it landed upstream. We are also seeing Windows build failures from this release branch without these changes (e.g. https://github.com/ROCm/TheRock/actions/runs/18512997354/job/52757715762), which suggests this cherry-pick will help.
I have tested with local Windows builds.