Triton bump in 7.1_internal_testing #2479
Conversation
(cherry picked from commit 7bcbafe)
Jenkins build for 9b631efb86e8f2e0d5209a1c7faa27e59cb2a968 commit finished as FAILURE
…#2421)

Relands ROCm#2416 with caching fix. Upstream equivalent: pytorch#159146.

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit f0aebdc)
…im removal (ROCm#2417)

We noticed persistent reduction kernels can perform extremely poorly: https://ontrack-internal.amd.com/browse/SWDEV-539215. The root cause is that under certain size restrictions the "no_x_dim" mode is enabled for these kernels, which embeds a static XBLOCK=1 into the kernel, so tuning is not optimal. Removing this mode and enabling autotuning achieves 2x performance, proving that new heuristics are needed. We will bring this into 2.7 for a perf uplift; discussion is ongoing with upstream on removing no_x_dim, and they are in agreement provided there is no perf regression. The draft PR shows no perf loss on ROCm for any inductor benchmark: pytorch#159048.

Removing tests because they are no longer relevant.

(cherry picked from commit 6c845c6)
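A minimal Python sketch of the effect described above (hypothetical helper, not the actual Inductor code; `candidate_configs` and `max_xblock` are illustrative names): with "no_x_dim", a static XBLOCK=1 is baked into the kernel, so the autotuner has exactly one candidate; removing the mode lets XBLOCK vary over a tuning sweep.

```python
def candidate_configs(no_x_dim: bool, max_xblock: int = 64):
    """Return the XBLOCK values the autotuner may try (illustrative only)."""
    if no_x_dim:
        return [1]  # static XBLOCK=1 embedded in the kernel: nothing to tune
    # without no_x_dim, sweep powers of two up to max_xblock
    configs = []
    xblock = 1
    while xblock <= max_xblock:
        configs.append(xblock)
        xblock *= 2
    return configs

print(candidate_configs(True))   # [1]
print(candidate_configs(False))  # [1, 2, 4, 8, 16, 32, 64]
```

The point is only the size of the search space: one fixed config can never beat a tuned one, which is why removing the mode and autotuning recovers the 2x.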
Adds initial autotuning for foreach support, required for https://ontrack-internal.amd.com/browse/SWDEV-539076. 4x improvement for some kernels.

Before:

| triton_for_fused_18.kd | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
| triton_for_fused_6.kd | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
| triton_for_fused_7.kd | 0.036 ms | 0.036 ms | 0.018 ms | 2 |

After:

| triton_for_fused_18.kd | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
| triton_for_fused_6.kd | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
| triton_for_fused_7.kd | 0.024 ms | 0.024 ms | 0.012 ms | 2 |

(cherry picked from commit f07b7f7)
Jenkins build for 38188d2f012030fd73176d701a60c93bde1a8921 commit finished as NOT_BUILT

Jenkins build for 38188d2f012030fd73176d701a60c93bde1a8921 commit finished as FAILURE

Jenkins build for 4afc25a5eb6f4600a121ed3af806f3713340c046 commit finished as FAILURE
pruthvistony left a comment
We are trying this new approach of updating Triton more frequently than upstream in the internal_testing branch.
Together with the upstream folks (Meta and OpenAI) we tried this for a few months (Oct 2024 to Mar 2025) and stopped due to the many failures and the amount of work involved; however, we are now implementing this internally only.
The downside of this approach is that a few UTs (about 5% currently) will be failing, and chasing these failures is a continuous task as we move with the commit dump.
Merged commit 23c0876 into ROCm:rocm7.1_internal_testing
This reverts commit 23c0876.
Bump Triton to the pytorch/rocm7.1_internal_testing branch for gfx950-related improvements: https://github.com/ROCm/triton/tree/pytorch/rocm7.1_internal_testing