
Conversation

@ScottTodd
Member

@ScottTodd ScottTodd commented Oct 15, 2025

Overview

This cherry-picks a few changes to the release/2.9 branch:

Notes

Testing

I have not tested on Linux from this release branch, but we have been building from source nightly on Windows in TheRock with this code since it landed upstream. We are also seeing build failures on this release branch that suggest this cherry-pick will help, e.g. https://github.com/ROCm/TheRock/actions/runs/18512997354/job/52757715762 on Windows.

I have tested with local Windows builds.

…indows. (pytorch#162330)

Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.

Steps to enable: set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
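
As a hedged illustration of how these flags feed into a build (the checkout path and the plain wheel-build invocation are assumptions, not the exact commands from build_prod_wheels.py):

```python
# Minimal sketch: drive a PyTorch wheel build with the attention flags set.
# The checkout path is hypothetical; the build command is the stock
# setup.py wheel build, not TheRock's exact invocation.
import os
import subprocess

env = dict(
    os.environ,
    USE_FLASH_ATTENTION="1",    # compile the flash attention path (aotriton)
    USE_MEM_EFF_ATTENTION="1",  # compile the memory-efficient attention path
)
subprocess.run(
    ["python", "setup.py", "bdist_wheel"],
    cwd=r"D:\src\pytorch",  # hypothetical source checkout
    env=env,
    check=True,
)
```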

Pull Request resolved: pytorch#162330
Approved by: https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
@ScottTodd
Member Author

I have not tested on Linux or Windows from this release branch, but we have been building from source nightly on Windows in TheRock with this code since it landed upstream. We are also seeing build failures on this release branch that suggest this cherry-pick will help, e.g. https://github.com/ROCm/TheRock/actions/runs/18512997354/job/52757715762 on Windows.

We can also test this more exhaustively via TheRock tomorrow.

@rocm-repo-management-api

rocm-repo-management-api bot commented Oct 15, 2025

Jenkins build for f77d860bdeec894b8a7886025d72ed21ebe2f562 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@slojosic-amd

@ScottTodd you can check my branch: https://github.com/ROCm/pytorch/tree/release/2.9-rocm7.x-gfx115x

I think you should cherry-pick these 3 commits as well:
(screenshot of the three commits)
because these changes are the same as for @xinyazhang's PR: #2686 + pytorch#162998, according to Xinya's comment: #2686 (comment)

xinyazhang and others added 3 commits October 15, 2025 12:43
A few unit-test failures are caused by `HIPBLASLT_ALLOW_TF32`.
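
A hedged sketch of the failure mode (the test body and tolerances are illustrative, not taken from the affected tests): a `HIPBLASLT_ALLOW_TF32` value inherited from the environment can loosen matmul numerics, so a test that wants full fp32 accumulation has to pin the variable.

```python
# Illustrative only: isolate HIPBLASLT_ALLOW_TF32 for the duration of a test
# so a value leaking in from the environment cannot change matmul precision.
# Whether hipBLASLt re-reads the variable mid-process is backend-specific,
# so clear it before any torch work on the GPU.
import os
import unittest.mock
import torch

def test_matmul_fp32_accumulation():
    with unittest.mock.patch.dict(os.environ):        # snapshot + restore env
        os.environ.pop("HIPBLASLT_ALLOW_TF32", None)  # force full precision
        a = torch.randn(512, 512, device="cuda")  # "cuda" maps to HIP on ROCm
        b = torch.randn(512, 512, device="cuda")
        ref = (a.double() @ b.double()).float()    # fp64 reference result
        # Tight fp32 tolerances; a TF32 matmul path would typically miss these.
        torch.testing.assert_close(a @ b, ref, rtol=2e-5, atol=2e-5)
```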

Fixes pytorch#157094
Fixes pytorch#157093
Fixes pytorch#157092
Fixes pytorch#157091
Fixes pytorch#157064
Fixes pytorch#157063
Fixes pytorch#157062
Fixes pytorch#157061
Fixes pytorch#157042
Fixes pytorch#157041
Fixes pytorch#157039
Fixes pytorch#157004

Pull Request resolved: pytorch#162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…3373)

Early assignment of `__AOTRITON_LIB` breaks the usage of the environment variable `$AOTRITON_INSTALLED_PREFIX`.

Pull Request resolved: pytorch#163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
## Major Changes

* Efficient Attention on ROCm requires the last dimension of input tensors to be aligned to 16 bytes (see the sketch after this list).
  - Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this requirement.
* Fixes `atomic_counter` handling in the varlen FA API.
* Unskips a few unit tests.
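
As a worked illustration of the alignment rule above: at 2 bytes per fp16/bf16 element, a 16-byte-aligned last dimension means head_dim must be a multiple of 8. The padding helper below is hypothetical; per the note above, the ME backend rejects unaligned inputs rather than padding them.

```python
# Hedged sketch: 16 bytes / 2 bytes per fp16 element = last dim % 8 == 0.
# pad_head_dim is an illustrative caller-side workaround, not PyTorch code.
import torch
import torch.nn.functional as F

def pad_head_dim(t: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    pad = (-t.shape[-1]) % multiple          # elements needed to reach a multiple
    return F.pad(t, (0, pad)) if pad else t  # zero-pad the last dimension

q = torch.randn(1, 8, 128, 60, dtype=torch.float16)  # 60 * 2 B: not aligned
q_aligned = pad_head_dim(q)                          # 64 * 2 B: aligned
assert q_aligned.shape[-1] == 64
```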

Fixes pytorch#157120
Fixes pytorch#157121
Fixes pytorch#157122
Fixes pytorch#157167
Fixes pytorch#155217
Fixes pytorch#157043
Fixes pytorch#157060

Pull Request resolved: pytorch#163745
Approved by: https://github.com/jeffdaily
@ScottTodd
Member Author

I think you should cherry-pick these 3 commits as well:

Thanks. My local Windows builds succeed both with just the one cherry-pick and with the additional three cherry-picks you suggest. I can add those additional cherry-picks to this PR if we want to "rebase and merge" them all in a batch, or I can send them individually.

The Jenkins build seemed to fail with

[2025-10-15T03:25:57.300Z] #57 23.83 fatal: reference is not a tree: d08c31a24d622b4bf767a6645135b7b3d0d886f4

[2025-10-15T03:25:57.300Z] #57 ERROR: process "/bin/sh -c if [ -n \"${TRITON}\" ]; then bash ./install_triton.sh; fi" did not complete successfully: exit code: 128

Is that related to this change or not? I think it isn't, since triton =/= aotriton.

@jammm

jammm commented Oct 15, 2025

The Jenkins build seemed to fail with

[2025-10-15T03:25:57.300Z] #57 23.83 fatal: reference is not a tree: d08c31a24d622b4bf767a6645135b7b3d0d886f4

[2025-10-15T03:25:57.300Z] #57 ERROR: process "/bin/sh -c if [ -n \"${TRITON}\" ]; then bash ./install_triton.sh; fi" did not complete successfully: exit code: 128

Is that related to this change or not? I think it isn't, since triton =/= aotriton.

Hmm, it could be aotriton-related. I see it in https://github.com/ROCm/aotriton/blob/main/dockerfile/input/install.sh#L14
The no-image mode should be enabled for Windows. Is the CI building on Windows?

EDIT: not sure where if [ -n \"${TRITON}\" ] comes from, but the install_triton.sh script was referenced.

@ScottTodd
Member Author

I pushed those other cherry-picks. I expect we'll see Jenkins job results in ~50 minutes?

The no-image mode should be enabled for Windows. Is the CI building on Windows?

I don't believe the Jenkins CI here builds for Windows.

@rocm-repo-management-api

rocm-repo-management-api bot commented Oct 15, 2025

Jenkins build for 7286cf8a19fba6420029944ae0c35eb576ed650f commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@ScottTodd
Member Author

FYI I've pushed these changes to a new release/2.9_rocm7.9 branch so we can trigger builds from it in https://github.com/ROCm/TheRock. I'd still like to merge here if the Jenkins build passes and reviewers approve.

@ScottTodd
Member Author

The Jenkins build seemed to fail with

[2025-10-15T03:25:57.300Z] #57 23.83 fatal: reference is not a tree: d08c31a24d622b4bf767a6645135b7b3d0d886f4

[2025-10-15T03:25:57.300Z] #57 ERROR: process "/bin/sh -c if [ -n \"${TRITON}\" ]; then bash ./install_triton.sh; fi" did not complete successfully: exit code: 128

Is that related to this change or not? I think it isn't, since triton =/= aotriton.

Hmm, it could be aotriton-related. I see it in https://github.com/ROCm/aotriton/blob/main/dockerfile/input/install.sh#L14 The no-image mode should be enabled for Windows. Is the CI building on Windows?

EDIT: not sure where if [ -n \"${TRITON}\" ] comes from, but the install_triton.sh script was referenced.

The more recent Jenkins build failed with the same error. I'm not sure what to do about that.

My local Windows builds succeed both with just the one cherry-pick and with the additional three cherry-picks you suggest.

Argh, my builds had aotriton disabled due to how the PyTorch build caches config variables. Double-checking with a clean build now.

Member Author

My local builds seemed to succeed with this branch and aotriton actually enabled (visible in build logs + files present in the .whls). However, I'm seeing the same performance via ComfyUI with and without aotriton on gfx1100, even with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 and python D:\projects\ComfyUI\main.py --use-split-cross-attention. I see about 12.6 it/s for image generation tasks, while a month ago I reported 20.0 it/s with aotriton 🤔

Logs before updating ComfyUI itself to latest had this:

D:\projects\ComfyUI\comfy\ops.py:47: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:800.)
  return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)

Those logs are not present after updating ComfyUI to latest.

The latest torch + rocm wheels from https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages get me about 14 it/s.

rocm==7.10.0a20251015
rocm-sdk-core==7.10.0a20251015
rocm-sdk-libraries-gfx110X-dgpu==7.10.0a20251015
torch==2.10.0a0+rocm7.10.0a20251015
torchaudio==2.8.0a0+rocm7.10.0a20251015
torchsde==0.2.6
torchvision==0.25.0a0+rocm7.10.0a20251015

Not sure where the diffs are coming from. Could be:

  • Missing more changes on 2.9 that are present on 2.10a
  • My system is under more load now (could also test with older releases)
  • Aotriton is not actually enabled / in use? (see the sketch after this list)
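
For the last bullet, a hedged runtime check using public torch APIs (nothing here is specific to this PR): forcing the mem-efficient backend fails loudly with "RuntimeError: No available kernel" when the wheel was built without it, instead of silently falling back.

```python
# Sketch: confirm the memory-efficient SDPA backend is compiled in and usable.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

print("HIP runtime:", torch.version.hip)  # non-None on a ROCm build
print("mem-efficient toggle:", torch.backends.cuda.mem_efficient_sdp_enabled())

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
# Restrict dispatch to the mem-efficient backend only; raises RuntimeError
# ("No available kernel") if torch was built without it.
with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION]):
    out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print("mem-efficient attention ran:", tuple(out.shape))
```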

@jammm jammm Oct 16, 2025

Ah, I missed the part where you already had TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 set.

Those logs are not present after updating ComfyUI to latest.

The latest one on main disables MIOpen itself, but aotriton should still be running, I think.

Member Author

Weird...

  • My locally built .whl files have torch/lib/aotriton_v2.dll
  • I do not see that DLL in site-packages/torch/lib/ after installing the locally built .whl files
  • I do see that DLL after installing our nightly built .whl files (from torch 2.10a / nightly / main)
  • The validation script run after installing the locally built wheels shows aotriton missing:
    (3.12.venv) λ python D:\scratch\python\validate_torch_vroom.py
    Benchmarking Scaled Dot-Product Attention (Flash) in FP16 ...
    D:\scratch\python\validate_torch_vroom.py:72: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:800.)
      out = scaled_dot_product_attention(q, k, v)
    D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Memory efficient kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:938.)
      out = scaled_dot_product_attention(q, k, v)
    D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Flash attention kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:940.)
      out = scaled_dot_product_attention(q, k, v)
    D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Torch was not compiled with flash attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:749.)
      out = scaled_dot_product_attention(q, k, v)
    D:\scratch\python\validate_torch_vroom.py:72: UserWarning: cuDNN attention kernel not used because: (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:942.)
      out = scaled_dot_product_attention(q, k, v)
    D:\scratch\python\validate_torch_vroom.py:72: UserWarning: Torch was not compiled with cuDNN attention. (Triggered internally at D:/b/pytorch_2_9/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:683.)
      out = scaled_dot_product_attention(q, k, v)
    Traceback (most recent call last):
      File "D:\scratch\python\validate_torch_vroom.py", line 215, in <module>
        sdpa_time, sdpa_mem, sdpa_gflops = measure_op(run_sdpa, warmup=3, total_runs=10)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "D:\scratch\python\validate_torch_vroom.py", line 34, in measure_op
        t_ms, peak_mb, gf_s = op_func()
                              ^^^^^^^^^
      File "D:\scratch\python\validate_torch_vroom.py", line 72, in run_sdpa
        out = scaled_dot_product_attention(q, k, v)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: No available kernel. Aborting execution.
    


@ScottTodd the aotriton_v2.dll file is copied over from <torch_src>/torch/lib, which could be a remnant of previous builds. It's likely that it got copied over even though torch was built without aotriton.

Member Author

@ScottTodd ScottTodd Oct 16, 2025

🤦 I built torch-2.9.0 after we changed the version but installed my prior build of torch-2.9.0a0...

Okay, aotriton is there with my local build from this PR (or the release/2.9_rocm7.9 branch).

17 it/s with

set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python D:\projects\ComfyUI\main.py --use-pytorch-cross-attention

14.5 it/s with

set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=0
python D:\projects\ComfyUI\main.py


@ScottTodd @jammm maybe this change is missing: pytorch#165538

Member Author

@ScottTodd @jammm maybe this change is missing: pytorch#165538

Could be useful. In my build logs I see this though, showing that it isn't strictly required here yet:

-- Cannot find AOTriton runtime for ROCM 7.1.       Build runtime from source

(the top-level version should technically be 7.9, I think, but it is specified in multiple subprojects with different values)

@jammm

jammm commented Oct 16, 2025 via email

@slojosic-amd

@ScottTodd the flag for ComfyUI should be:
--use-pytorch-cross-attention instead of --use-split-cross-attention

@ScottTodd
Member Author

Can someone advise on the Jenkins failures on Linux? We're going to trigger a Windows release build including these cherry-picks, and I'd like them landed on the common release/2.9 branch if possible, instead of us needing to use the release/2.9_rocm7.9 branch.

@jammm

jammm commented Oct 16, 2025

Can someone advise on the Jenkins failures on Linux? We're going to trigger a Windows release build including these cherry-picks, and I'd like them landed on the common release/2.9 branch if possible, instead of us needing to use the release/2.9_rocm7.9 branch.

It seems like it's trying to git checkout d08c31a24d622b4bf767a6645135b7b3d0d886f4 from https://github.com/ROCm/triton, but it errors out with fatal: reference is not a tree: d08c31a24d622b4bf767a6645135b7b3d0d886f4. The commit does appear in the repository (ROCm/triton@d08c31a); it's just not associated with any branch for some reason, which could explain why git couldn't find it in its local clone.

ScottTodd added a commit to ROCm/TheRock that referenced this pull request Oct 17, 2025
)

## Motivation

Fixes #1677, filling in the latest matrix of supported PyTorch versions for our nightly release builds. I also took the opportunity to clarify and refresh the documentation.

## Technical Details

For now this uses `release/2.9_rocm7.9`. Depends on
ROCm/pytorch#2712 to use `release/2.9`.

## Test Plan

Trigger test release builds using
https://github.com/ROCm/TheRock/actions/workflows/release_windows_pytorch_wheels.yml

- [ ] Test `release/2.9` once that PR is merged
- [x] Test `release/2.9_rocm7.9`:
https://github.com/ROCm/TheRock/actions/runs/18598633844

## Test Results

Test release builds completed and sanity checks passed.
@jithunnair-amd
Collaborator

Can someone advise on the Jenkins failures on Linux? We're going to trigger a Windows release build including these cherry-picks, and I'd like them landed on the common release/2.9 branch if possible, instead of us needing to use the release/2.9_rocm7.9 branch.

It seems like it's trying to git checkout d08c31a24d622b4bf767a6645135b7b3d0d886f4 from https://github.com/ROCm/triton, but it errors out with fatal: reference is not a tree: d08c31a24d622b4bf767a6645135b7b3d0d886f4. The commit does appear in the repository (ROCm/triton@d08c31a); it's just not associated with any branch for some reason, which could explain why git couldn't find it in its local clone.

The reason for the fatal: reference is not a tree error is that the PR branch is using commit hash d08c31a24d622b4bf767a6645135b7b3d0d886f4 in triton.txt, but that commit was updated in release/2.9 via #2727. I'm not sure why the other commit went missing, but I'd consider this a non-blocker for this PR, as it doesn't have anything to do with the direct triton build (not counting the triton used for the aotriton build).

@jithunnair-amd
Collaborator

Merging as it seems all questions have been resolved...

@jithunnair-amd jithunnair-amd merged commit caf1b51 into ROCm:release/2.9 Oct 22, 2025
1 of 3 checks passed