
Conversation


@jataylo commented Aug 8, 2025

Bump Triton to the pytorch/rocm7.1_internal_testing branch for gfx950-related improvements: https://github.com/ROCm/triton/tree/pytorch/rocm7.1_internal_testing


rocm-repo-management-api bot commented Aug 8, 2025

Jenkins build for 9b631efb86e8f2e0d5209a1c7faa27e59cb2a968 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

jataylo and others added 6 commits August 10, 2025 18:56
…#2421)

Relands ROCm#2416 with caching fix

Upstream equivalent pytorch#159146

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit f0aebdc)
ROCm#2442)

ROCm#2421 didn't bring in all required
changes to reland ROCm#2416

(cherry picked from commit 19431ba)
…im removal (ROCm#2417)

We noticed that persistent reduction kernels can perform extremely poorly:
https://ontrack-internal.amd.com/browse/SWDEV-539215

The root cause is that, for kernels with certain size restrictions, the
"no_x_dim" mode is enabled, which bakes a static XBLOCK=1 into the kernel,
so tuning is not optimal. Removing this mode and enabling autotuning yields
2x performance, which shows that new heuristics are needed.

We will bring this into 2.7 for a perf uplift. Discussion with upstream on
removing no_x_dim is ongoing; they are in agreement provided there is no
perf regression. The draft PR shows no perf loss on ROCm for any inductor
benchmark: pytorch#159048

Also removing tests that are no longer relevant.

(cherry picked from commit 6c845c6)
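For illustration, here is a minimal hand-written sketch of what removing no_x_dim enables. This is not Inductor's generated code; the kernel name, shapes, and config values are made up. With no_x_dim the kernel is emitted with XBLOCK hard-coded to 1, whereas exposing XBLOCK as a tunable constexpr lets the autotuner search over it:

```python
# Minimal sketch, assuming a simple row-sum persistent reduction; NOT
# Inductor's actual codegen. With "no_x_dim", XBLOCK is pinned to 1;
# making it a tl.constexpr tuned by @triton.autotune is the idea behind
# the change. Config values below are illustrative.
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"XBLOCK": 1}),  # the old hard-coded no_x_dim choice
        triton.Config({"XBLOCK": 4}),
        triton.Config({"XBLOCK": 8}),
    ],
    key=["xnumel", "rnumel"],
)
@triton.jit
def persistent_row_sum(in_ptr, out_ptr, xnumel, rnumel,
                       XBLOCK: tl.constexpr, RBLOCK: tl.constexpr):
    # Persistent reduction: each program reduces XBLOCK whole rows in a
    # single pass, so RBLOCK must be a power of two >= rnumel.
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)
    rindex = tl.arange(0, RBLOCK)
    xmask = xindex < xnumel
    mask = xmask[:, None] & (rindex[None, :] < rnumel)
    vals = tl.load(in_ptr + xindex[:, None] * rnumel + rindex[None, :],
                   mask=mask, other=0.0)
    tl.store(out_ptr + xindex, tl.sum(vals, axis=1), mask=xmask)
```

Launched with grid = lambda meta: (triton.cdiv(xnumel, meta["XBLOCK"]),). When XBLOCK is pinned to 1 the tuner has nothing to explore, which is why dropping no_x_dim and autotuning can recover the 2x.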
Adds initial autotuning for foreach kernel support, required for
https://ontrack-internal.amd.com/browse/SWDEV-539076

This gives a 4x improvement for some kernels.

Before:
triton_for_fused_18.kd | 4.986 ms | 4.986 ms | 2.493 ms | 2
triton_for_fused_6.kd  | 0.098 ms | 0.098 ms | 0.049 ms | 2
triton_for_fused_7.kd  | 0.036 ms | 0.036 ms | 0.018 ms | 2

After:
triton_for_fused_18.kd | 1.273 ms | 1.273 ms | 0.636 ms | 2
triton_for_fused_6.kd  | 0.044 ms | 0.044 ms | 0.022 ms | 2
triton_for_fused_7.kd  | 0.024 ms | 0.024 ms | 0.012 ms | 2

(cherry picked from commit f07b7f7)
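For context, a hedged sketch of the kind of foreach workload that Inductor lowers to horizontally fused triton_for_fused_* kernels like the ones measured above; the sizes, list length, and op choice are illustrative, not taken from this PR:

```python
# Hedged sketch: a foreach-style tensor-list update of the kind Inductor
# compiles into fused "for_fused" Triton kernels. With foreach autotuning
# enabled, the launch config of the emitted kernel is searched rather
# than fixed. All shapes here are made up.
import torch

params = [torch.randn(1024, 1024, device="cuda") for _ in range(16)]
grads = [torch.randn_like(p) for p in params]

@torch.compile
def foreach_step(params, grads, lr: float):
    # One fused in-place update across the whole tensor list.
    torch._foreach_add_(params, grads, alpha=-lr)
    return params

foreach_step(params, grads, 1e-3)
```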

rocm-repo-management-api bot commented Aug 11, 2025

Jenkins build for 38188d2f012030fd73176d701a60c93bde1a8921 commit finished as NOT_BUILT
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Aug 11, 2025

Jenkins build for 38188d2f012030fd73176d701a60c93bde1a8921 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Aug 11, 2025

Jenkins build for 4afc25a5eb6f4600a121ed3af806f3713340c046 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jataylo changed the title from "[Draft] [WIP] Triton bump in 7.1_internal_testing" to "Triton bump in 7.1_internal_testing" Aug 11, 2025
@jataylo jataylo marked this pull request as ready for review August 11, 2025 22:48

@pruthvistony (Collaborator) left a comment


We are trying this new approach of updating Triton more frequently than upstream in the internal_testing branch.
Together with the upstream folks (Meta and OpenAI) we tried this for a few months (Oct 2024 to Mar 2025) and stopped because of the many failures and the amount of work involved; however, we are now implementing it internally only.

The downside of this approach is that we will have a few UTs failing (about 5% currently), and chasing these failures will be a continuous task as we keep moving with the commit bumps.

@pruthvistony pruthvistony merged commit 23c0876 into ROCm:rocm7.1_internal_testing Aug 12, 2025
0 of 2 checks passed
ethanwee1 added a commit that referenced this pull request Aug 12, 2025
