[release/2.9] Cherrypick aotriton build fixes and Windows support #2712
Conversation
…indows. (pytorch#162330)
Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton. Already tested to be working on Windows with TheRock. To enable, set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
Pull Request resolved: pytorch#162330
Approved by: https://github.com/jeffdaily
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
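For reference, a minimal runtime check of whether these backends actually dispatch (an illustrative sketch, not part of the change; the tensor shapes and the `sdpa_kernel` probe are just one way to test this on a ROCm build):

```python
# Probe whether flash / memory-efficient SDPA dispatch on this build.
# Shapes and dtype are illustrative; sdpa_kernel pins a single backend so a
# missing kernel surfaces as a RuntimeError instead of silently falling back.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION):
    try:
        with sdpa_kernel(backend):
            torch.nn.functional.scaled_dot_product_attention(q, k, v)
        print(f"{backend.name}: dispatched")
    except RuntimeError as err:
        print(f"{backend.name}: unavailable ({err})")
```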
We can also test this more exhaustively via TheRock tomorrow.
Jenkins build for f77d860bdeec894b8a7886025d72ed21ebe2f562 commit finished as FAILURE
@ScottTodd you can check my branch: https://github.com/ROCm/pytorch/tree/release/2.9-rocm7.x-gfx115x I think you should cherry-pick these 3 commits also:
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`.
Fixes pytorch#157094, Fixes pytorch#157093, Fixes pytorch#157092, Fixes pytorch#157091, Fixes pytorch#157064, Fixes pytorch#157063, Fixes pytorch#157062, Fixes pytorch#157061, Fixes pytorch#157042, Fixes pytorch#157041, Fixes pytorch#157039, Fixes pytorch#157004
Pull Request resolved: pytorch#162998
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
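As a hedged aside (not code from the fix itself): as I understand it, `HIPBLASLT_ALLOW_TF32` governs whether hipBLASLt may use TF32-like compute on ROCm, alongside PyTorch's Python-level TF32 switch:

```python
# Illustration only: PyTorch's Python-level TF32 switch for matmuls. On ROCm
# builds the HIPBLASLT_ALLOW_TF32 environment variable (mentioned above)
# additionally affects whether hipBLASLt may use TF32-like compute.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # opt in to TF32 matmuls
print("allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
```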
…3373)
Early assignment of `__AOTRITON_LIB` breaks the usage of the environment variable `$AOTRITON_INSTALLED_PREFIX`.
Pull Request resolved: pytorch#163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
## Major Changes
* Efficient Attention on ROCM requires the last dimensions of input tensors to be aligned to 16 bytes. Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this is required.
* Fix `atomic_counter` handling in varlen FA API
* Unskips a few unit tests.

Fixes pytorch#157120, Fixes pytorch#157121, Fixes pytorch#157122, Fixes pytorch#157167, Fixes pytorch#155217, Fixes pytorch#157043, Fixes pytorch#157060
Pull Request resolved: pytorch#163745
Approved by: https://github.com/jeffdaily
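To illustrate the alignment constraint from the first bullet (a hypothetical user-side workaround, not code from this PR): with fp16 inputs, a 16-byte alignment of the last dimension means the head dimension should be a multiple of 8 elements, so one option is to pad before SDPA and slice afterwards:

```python
# Hypothetical helper (not from this PR): pad head_dim so the last dimension
# is 16-byte aligned, keep the original softmax scale, then slice the padding
# back off the output. Zero-padding q/k/v leaves the attention result unchanged.
import math
import torch
import torch.nn.functional as F

def sdpa_padded(q, k, v, align_bytes=16):
    d = q.shape[-1]
    mult = align_bytes // q.element_size()   # 8 elements for fp16/bf16
    pad = (-d) % mult
    if pad == 0:
        return F.scaled_dot_product_attention(q, k, v)
    q, k, v = (F.pad(t, (0, pad)) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(d))
    return out[..., :d]
```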
Thanks. My local Windows builds succeed both with just the one cherry-pick and with the additional three cherry-picks you suggest. I can add those additional cherry-picks to this PR if we want to "rebase and merge" them all in a batch, or I can send them individually.
The Jenkins build seemed to fail with a triton-related error. Is that related to this change or not? I think it isn't, since triton =/= aotriton.
hmm it could be aotriton related. I see it in https://github.com/ROCm/aotriton/blob/main/dockerfile/input/install.sh#L14
EDIT: not sure where …
I pushed those other cherry-picks. I expect we'll see Jenkins job results in ~50 minutes?
I don't believe the Jenkins CI here builds for Windows.
Jenkins build for 7286cf8a19fba6420029944ae0c35eb576ed650f commit finished as FAILURE
FYI I've pushed these changes to a new `release/2.9_rocm7.9` branch.
The more recent Jenkins build failed with the same error. I'm not sure what to do about that.
Argh, my builds had aotriton disabled due to how the PyTorch build caches config variables. Double checking with a clean build now.
My local builds seemed to succeed with this branch and aotriton actually enabled (visible in build logs + files present in the .whls). However, I'm seeing the same performance via comfyui with and without aotriton on gfx1100, even with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 and python D:\projects\ComfyUI\main.py --use-split-cross-attention. I see about 12.6it/s for image generation tasks while a month ago I reported 20.0it/s with aotriton 🤔
Logs before updating comfyui itself to latest had this:
D:\projects\ComfyUI\comfy\ops.py:47: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:800.)
return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)
Those logs are not present after updating comfyui to latest.
The latest torch + rocm wheels from https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages get me about 14it/s.
rocm==7.10.0a20251015
rocm-sdk-core==7.10.0a20251015
rocm-sdk-libraries-gfx110X-dgpu==7.10.0a20251015
torch==2.10.0a0+rocm7.10.0a20251015
torchaudio==2.8.0a0+rocm7.10.0a20251015
torchsde==0.2.6
torchvision==0.25.0a0+rocm7.10.0a20251015
Not sure where the diffs are coming from. Could be:
- Missing more changes on 2.9 that are present on 2.10a
- My system is under more load now (could also test with older releases)
- Aotriton is not actually enabled / in use? (a quick build-config check is sketched below)
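One rough way to rule out that last bullet (hedged: the exact strings in the build summary vary between PyTorch builds) is to grep the compiled-in config for the attention-related options:

```python
# Rough check: look for attention/ROCm options in the compiled-in build summary.
# The exact keys present vary between PyTorch builds.
import torch

cfg = torch.__config__.show()
for key in ("USE_FLASH_ATTENTION", "USE_MEM_EFF_ATTENTION", "USE_ROCM"):
    hits = [line.strip() for line in cfg.splitlines() if key in line]
    print(key, "->", hits or "not listed")
```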
Ah I missed the part where you already had TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 set.
> Those logs are not present after updating comfyui to latest.
The latest one on main disables MIOpen itself, but aotriton should still be running I think.
Weird...
- My locally built .whl files have `torch/lib/aotriton_v2.dll`
- I do not see that DLL in `site-packages/torch/lib/` after installing the locally built .whl files
- I do see that DLL after installing our nightly built .whl files (from torch 2.10a / nightly / main)
- The script for installing the locally built wheels shows missing aotriton:
(3.12.venv) λ python D:\scratch\python\validate_torch_vroom.py
Benchmarking Scaled Dot-Product Attention (Flash) in FP16 ...
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:800.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Memory efficient kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:938.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Flash attention kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:940.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Torch was not compiled with flash attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:749.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: cuDNN attention kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:942.)
  out = scaled_dot_product_attention(q, k, v)
D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Torch was not compiled with cuDNN attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:683.)
  out = scaled_dot_product_attention(q, k, v)
Traceback (most recent call last):
  File "D:\scratch\python\validate_torch_vroom.py", line 215, in <module>
    sdpa_time, sdpa_mem, sdpa_gflops = measure_op(run_sdpa, warmup=3, total_runs=10)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\scratch\python\validate_torch_vroom.py", line 34, in measure_op
    t_ms, peak_mb, gf_s = op_func()
                          ^^^^^^^^^
  File "D:\scratch\python\validate_torch_vroom.py", line 72, in run_sdpa
    out = scaled_dot_product_attention(q, k, v)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.
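A quick check of whether the aotriton runtime actually shipped with the installed package (a small sketch; the file name pattern follows the `aotriton_v2.dll` mentioned above):

```python
# List any aotriton runtime libraries bundled inside the installed torch package.
import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
found = [f for f in os.listdir(lib_dir) if "aotriton" in f.lower()]
print(lib_dir, "->", found or "no aotriton libraries found")
```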
@ScottTodd the aotriton_v2.dll file is copied over from <torch_src>/torch/lib which could be a remnant of previous builds. It's likely that it got copied over even though torch was built without aotriton.
🤦 I built torch-2.9.0 after we changed the version but installed my prior build of torch-2.9.0a0...
Okay, aotriton is there with my local build from this PR (or the release/2.9_rocm7.9 branch)
17 it/s with
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python D:\projects\ComfyUI\main.py --use-pytorch-cross-attention
14.5 it/s with
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=0
python D:\projects\ComfyUI\main.py
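For completeness, a stripped-down SDPA timing loop along the lines of this comparison (hedged: the shapes and iteration count are illustrative, each "it" here is a single SDPA call rather than a ComfyUI step, and `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL` should be toggled in the shell before launching, as above):

```python
# Minimal SDPA throughput sketch; shapes are illustrative, not the ComfyUI workload.
import time
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

for _ in range(3):                      # warmup
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

iters = 50
start = time.perf_counter()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()
print(f"{iters / (time.perf_counter() - start):.1f} it/s")
```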
@ScottTodd @jammm maybe this change is missing: pytorch#165538
> @ScottTodd @jammm maybe this change is missing: pytorch#165538
Could be useful. In my build logs I see this though, showing that it isn't strictly required here yet:
-- Cannot find AOTriton runtime for ROCM 7.1. Build runtime from source
(the top level version should technically be 7.9 I think, but it is specified in multiple subprojects with different values)
Try running this script: https://gist.githubusercontent.com/scottt/fb45ba422f9f133223ebb281fca8dc5d/raw/26cb846bf293d75a0c769638c14a976ecc8d663a/validate_torch_vroom.py
If it works, it means aotriton is enabled. If not, it means it's not built in yet. Also, I heard gfx1100 was added back to the experimental pool, so you might have to turn on that experimental env var (not on PC yet so can't paste it here, will do soon).
@ScottTodd flag for ComfyUI should be: `--use-pytorch-cross-attention`
Can someone advise on the Jenkins failures on Linux? We're going to trigger a Windows release build including these cherry-picks and I'd like them landed on the common `release/2.9` branch.
it seems like it's trying to …
## Motivation
Fixes #1677, filling in the latest support matrix for supported PyTorch versions for our nightly release builds. I also took the opportunity to clarify and refresh the documentation.
## Technical Details
For now this uses `release/2.9_rocm7.9`. Depends on ROCm/pytorch#2712 to use `release/2.9`.
## Test Plan
Trigger test release builds using https://github.com/ROCm/TheRock/actions/workflows/release_windows_pytorch_wheels.yml
- [ ] Test `release/2.9` once that PR is merged
- [x] Test `release/2.9_rocm7.9`: https://github.com/ROCm/TheRock/actions/runs/18598633844
## Test Results
Test release builds completed and sanity checks passed.
The reason for the …
Merging as it seems all questions have been resolved...

Overview
This cherry-picks a few changes to the release/2.9 branch:
Notes
A `-DHIP_PLATFORM=amd` configure line is required for building on Linux via TheRock, at least until "[Issue]: hip-config.cmake in _rocm_sdk_devel has HIP_INSTALLS_HIPCC set to OFF" (TheRock#1402) is resolved. Note that implicit detection of the "HIP platform" is not recommended, per https://github.com/ROCm/rocm-systems/blob/c8ecf77a94e2d9afe48dae7d9e549937abe25777/projects/clr/CMakeLists.txt#L62-L63
Testing
I have not tested on Linux from this release branch, but we have been building from source nightly on Windows in TheRock using this code since it landed upstream. We are also seeing Windows build failures from this release branch without these changes (e.g. https://github.com/ROCm/TheRock/actions/runs/18512997354/job/52757715762), which suggests this cherry-pick will help.
I have tested with local Windows builds.