@naromero77amd naromero77amd commented Nov 15, 2025

These are backports based on the upstream PRs listed below. Commits were cherry-picked where possible.

pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs that were not upstreamed but are in this backport to 2.9:
#2723

Did not backport filter functions such as _maybe_filter_configs_for_tma_restrictions
https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614

jataylo and others added 19 commits November 14, 2025 22:52
(cherry picked from commit 5d4455f)
(cherry picked from commit d3d77f5)
(cherry picked from commit 2fc7525)
(cherry picked from commit 528cf02)
(cherry picked from commit d5c71f0)
(cherry picked from commit 11e1dfc)
(cherry picked from commit 262a33e)
(cherry picked from commit 0cf1c89)
(cherry picked from commit 9f19754)
(cherry picked from commit dee2fdf)
removed the (erroneous?) check that disables autotuning for pointwise
kernels

(cherry picked from commit e3b8e25)
(cherry picked from commit 10af207)
(cherry picked from commit b9e0182)
Added two nice grid configs for the 2d pointwise kernel cases for WRT5
workload.
Confirmed that they were picked up when using max autotune.

(cherry picked from commit f1eac49)
(cherry picked from commit 2e79001)
(cherry picked from commit 04aa3e4)
This config improves the performance of a 1D pointwise kernel by 20% as
measured on MI350.

(cherry picked from commit a7bac0a)
(cherry picked from commit 0bdb796)
(cherry picked from commit af5f678)
(cherry picked from commit 16e8266)
(cherry picked from commit 8bd33f9)
(cherry picked from commit dfc1579)
(cherry picked from commit 8f60456)
(cherry picked from commit 666e81b)
(cherry picked from commit f6aaaf8)
(cherry picked from commit f97c7a9)
(cherry picked from commit db49466)
(cherry picked from commit 6e9b4ee)
(cherry picked from commit c36d85f)
(cherry picked from commit 0c52d01)
(cherry picked from commit 83e453f)
(cherry picked from commit dd990a3)
(cherry picked from commit 0de435f)
(cherry picked from commit 9534cbd)
(cherry picked from commit 189481e)
(cherry picked from commit 7eeb1ba)
(cherry picked from commit eea659c)
Reorganized slightly the adding of hard-coded autotuning configs.
Fixed wrt1 configs.
Added wrt2 & 3 configs.

(cherry picked from commit e3e9a17)
(cherry picked from commit 6534df0)

rocm-repo-management-api bot commented Nov 15, 2025

Jenkins build for 7850a9c97813ff2687769efd9a6c4ff5ff749187 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Nov 15, 2025

Jenkins build for dbdb5542c2ae0f09415495c33bfd7d5d0f77bc53 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@naromero77amd naromero77amd changed the title from "[release/2.7][ROCm][inductor] Inductor heuristic upstream backports" to "[NO CP][release/2.7][ROCm][inductor] Inductor heuristic upstream backports" Nov 20, 2025
Added a check that includes autotune configs for 2D POI only if their
size is big enough.

(cherry picked from commit a2b0fd7)
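
A size gate of the kind this commit message describes might look roughly like the sketch below. This is an illustration only, not the code from the patch: the function name, the `min_numel` threshold, and the config tuples are all hypothetical, and the sketch assumes "size" means the kernel's x and y element counts.

```python
def maybe_add_2d_configs(base_configs, extra_configs, xnumel, ynumel, min_numel=1024):
    """Append extra (XBLOCK, YBLOCK) configs only when both dimensions are large.

    Every name here and the min_numel default are hypothetical; the real
    check in the patch may use a different criterion.
    """
    configs = list(base_configs)
    if xnumel >= min_numel and ynumel >= min_numel:
        for candidate in extra_configs:
            # Skip duplicates so the same config is not benchmarked twice.
            if candidate not in configs:
                configs.append(candidate)
    return configs


base = [(32, 32), (64, 32)]
extra = [(64, 32), (64, 256)]
small = maybe_add_2d_configs(base, extra, xnumel=128, ynumel=128)
large = maybe_add_2d_configs(base, extra, xnumel=4096, ynumel=4096)
```

With these made-up inputs, the small kernel keeps only the base configs, while the large kernel also gets the one extra config that is not already present.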

rocm-repo-management-api bot commented Nov 20, 2025

Jenkins build for d235a1504f6702249dd72deef1a8f68ce991320a commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Nov 20, 2025

Jenkins build for 627a5718c93f8c54fca6787f3167b2b454717226 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Nov 21, 2025

Jenkins build for b1cdd5584626c1f0c2c6bad6b58272da6901e619 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Nov 21, 2025

Jenkins build for b1cdd5584626c1f0c2c6bad6b58272da6901e619 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Nov 21, 2025

Jenkins build for d356b844b19b6dfb588b2f5815ebbefca0bba579 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

naromero77amd commented Nov 22, 2025

Tested with TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 to confirm we are getting the extra configs (note that some of them are getting filtered/scaled out as expected).

For triton_red_fused_sum_view_22.py:

V1122 00:32:56.142000 102705 torch/_inductor/runtime/triton_heuristics.py:256] CachingAutotuner gets 12 configs for triton_red_fused_sum_view_22
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1, R0_BLOCK: 128, num_warps: 1, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, R0_BLOCK: 8, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 4, R0_BLOCK: 128, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, R0_BLOCK: 64, num_warps: 8, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 8, R0_BLOCK: 128, num_warps: 8, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, R0_BLOCK: 4, num_warps: 8, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1024, R0_BLOCK: 8, waves_per_eu: 2, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 512, R0_BLOCK: 8, waves_per_eu: 1, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, R0_BLOCK: 4, waves_per_eu: 1, num_warps: 2, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1, R0_BLOCK: 128, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, R0_BLOCK: 128, waves_per_eu: 1, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 2, R0_BLOCK: 128, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:32:56.143000 102705 torch/_inductor/runtime/triton_heuristics.py:271] Triton cache dir: /tmp/torchinductor_root/triton/0

triton_poi_fused_threshold_backward_36 (1D)

V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1024, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 512, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 8192, waves_per_eu: 2, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 4096, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 2048, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, waves_per_eu: 1, num_warps: 2, num_ctas: 1, num_stages: 2, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 00:45:32.154000 110142 torch/_inductor/runtime/triton_heuristics.py:271] Triton cache dir: /tmp/torchinductor_root/triton/0

triton_poi_fused_slice_13 (2D)

V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:256] CachingAutotuner gets 12 configs for triton_poi_fused_slice_31
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 32, YBLOCK: 32, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, YBLOCK: 32, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, YBLOCK: 64, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, YBLOCK: 32, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 16, YBLOCK: 256, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, YBLOCK: 16, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.629000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 32, YBLOCK: 512, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.630000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, YBLOCK: 8, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.630000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1, YBLOCK: 1024, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.630000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 32, YBLOCK: 128, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.630000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, YBLOCK: 32, num_warps: 8, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.630000 126822 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, YBLOCK: 256, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:10:22.630000 126822 torch/_inductor/runtime/triton_heuristics.py:271] Triton cache dir: /tmp/torchinductor_root/triton/0

triton_poi_fused__to_copy_index_add_new_zeros_4 (contains the atomic add config)

V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:256] CachingAutotuner gets 7 configs for triton_poi_fused__to_copy_index_add_new_zeros_4
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1024, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 512, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 8192, waves_per_eu: 2, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 4096, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 2048, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, waves_per_eu: 1, num_warps: 2, num_ctas: 1, num_stages: 2, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, num_warps: 1, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:03:16.416000 121806 torch/_inductor/runtime/triton_heuristics.py:271] Triton cache dir: /tmp/torchinductor_root/triton/0

triton_per_fused_sum_view_23

V1122 01:06:49.104000 124286 torch/_inductor/runtime/triton_heuristics.py:256] CachingAutotuner gets 8 configs for triton_per_fused_sum_view_23
V1122 01:06:49.104000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 1, num_warps: 1, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 4, num_warps: 2, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 8, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 16, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 32, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 128, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 256, num_warps: 4, num_ctas: 1, num_stages: 1, num_buffers_warp_spec: 0, num_consumer_groups: 0, reg_dec_producer: 0, reg_inc_consumer: 0, maxnreg: None
V1122 01:06:49.105000 124286 torch/_inductor/runtime/triton_heuristics.py:271] Triton cache dir: /tmp/torchinductor_root/triton/0
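
Logs in this format can be checked mechanically, for example to confirm that the announced config count matches the number of config lines, or to find which configs carry `waves_per_eu`. The following is a minimal sketch, assuming only the line layout visible in the excerpts above; `parse_autotune_log`, `SAMPLE`, and the kernel name `example_kernel` are hypothetical, and the sample lines are shortened.

```python
import re

# Header line: "CachingAutotuner gets N configs for <kernel>"
HEADER_RE = re.compile(r"CachingAutotuner gets (\d+) configs for (\S+)")
# Config line: the "key: value" list after the "triton_heuristics.py:<n>] " prefix
CONFIG_RE = re.compile(r"triton_heuristics\.py:\d+\] (XBLOCK: .*)$")


def parse_autotune_log(text):
    """Return {kernel_name: (announced_count, [config_dict, ...])}."""
    result = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = header.group(2)
            result[current] = (int(header.group(1)), [])
            continue
        config = CONFIG_RE.search(line)
        if config and current is not None:
            # "XBLOCK: 64, num_warps: 1" -> {"XBLOCK": "64", "num_warps": "1"}
            pairs = [item.split(": ", 1) for item in config.group(1).split(", ")]
            result[current][1].append({key: value for key, value in pairs})
    return result


# Shortened, made-up sample in the same layout as the logs above.
SAMPLE = """\
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:256] CachingAutotuner gets 2 configs for example_kernel
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 8192, waves_per_eu: 2, num_warps: 4, num_stages: 1
V1122 01:03:16.415000 121806 torch/_inductor/runtime/triton_heuristics.py:262] XBLOCK: 64, num_warps: 1, num_stages: 1
V1122 01:03:16.416000 121806 torch/_inductor/runtime/triton_heuristics.py:271] Triton cache dir: /tmp/torchinductor_root/triton/0
"""

parsed = parse_autotune_log(SAMPLE)
announced, configs = parsed["example_kernel"]
rocm_tuned = [c for c in configs if "waves_per_eu" in c]
```

Values are kept as strings so that entries such as `maxnreg: None` need no special handling. Config lines that appear before any header line are skipped, since there is no kernel name to attach them to.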

@naromero77amd
Author

Ran linter several times to clean the file up.

@naromero77amd naromero77amd force-pushed the release_/2.7_triton_heuristic_backports branch from a5d6423 to badfab0 Compare November 22, 2025 02:08

rocm-repo-management-api bot commented Nov 22, 2025

Jenkins build for badfab0d09d48b0a580339e5119455ce0f30fcc7 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

naromero77amd commented Nov 24, 2025

Ran the following test suites with and without these changes:
PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py -v --keep-going -i inductor/test_torchinductor.py inductor/test_max_autotune.py

No new regressions reported:

< ================== 74 passed, 29 skipped in 439.48s (0:07:19) ==================
> ================== 74 passed, 29 skipped in 464.02s (0:07:44) ==================
< ===================== 2 passed, 1708 deselected in 22.16s ======================
> ===================== 2 passed, 1708 deselected in 21.07s ======================
< ========= 1631 passed, 77 skipped, 2 deselected in 2681.23s (0:44:41) ==========
> ========= 1631 passed, 77 skipped, 2 deselected in 2749.65s (0:45:49) ==========
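
The comparison above can be automated by extracting only the outcome counts from each pytest summary line and ignoring the timings, which differ from run to run. This is a small sketch; `summary_counts` and `split_diff` are hypothetical helper names, and it makes no assumption about which side of the diff is the baseline.

```python
import re

# Matches e.g. "74 passed", "29 skipped", "2 deselected" in a pytest summary line.
COUNT_RE = re.compile(r"(\d+) (passed|failed|skipped|deselected|errors?|xfailed|xpassed)")


def summary_counts(line):
    """Extract outcome counts from a pytest summary line, ignoring timing."""
    return {name: int(num) for num, name in COUNT_RE.findall(line)}


def split_diff(diff_text):
    """Split diff-style output into counts for '<' lines and '>' lines."""
    left, right = [], []
    for line in diff_text.splitlines():
        if line.startswith("< "):
            left.append(summary_counts(line))
        elif line.startswith("> "):
            right.append(summary_counts(line))
    return left, right


DIFF = """\
< ================== 74 passed, 29 skipped in 439.48s (0:07:19) ==================
> ================== 74 passed, 29 skipped in 464.02s (0:07:44) ==================
< ===================== 2 passed, 1708 deselected in 22.16s ======================
> ===================== 2 passed, 1708 deselected in 21.07s ======================
< ========= 1631 passed, 77 skipped, 2 deselected in 2681.23s (0:44:41) ==========
> ========= 1631 passed, 77 skipped, 2 deselected in 2749.65s (0:45:49) ==========
"""

left, right = split_diff(DIFF)
no_count_changes = left == right
```

Equal counts on both sides indicate that only the run times changed between the two runs.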

@naromero77amd naromero77amd marked this pull request as ready for review November 24, 2025 15:47
Collaborator

@jataylo jataylo left a comment

LGTM

@pruthvistony pruthvistony merged commit 7de1214 into release/2.7 Nov 26, 2025
0 of 2 checks passed
@pruthvistony pruthvistony deleted the release_/2.7_triton_heuristic_backports branch November 26, 2025 17:10

5 participants