
Conversation

@tvukovic-amd

This PR applies a cherry-picked commit from a merged PR. Since there is no prebuilt aotriton runtime available for Windows, we need this PR so that the aotriton runtime is always built from source in the Windows case.

ezyang and others added 30 commits September 5, 2025 20:15
…rch#159889)

This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed enabled. We only need to stub functions that may not be defined when a particular backend is not enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: pytorch#159889
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#160449
Adding a test that is closer to a real use case. Thanks @mlazos for fixing a few issues so this test works for most cases.

We still have to skip the AOTI and dynamic case due to accuracy issues.

Pull Request resolved: pytorch#160782
Approved by: https://github.com/mlazos
This PR fuses RoPE from two kernels into one kernel.

Shape:
```
q: [B, Hq, S, D]
k: [B, Hkv, S, D]
```

`Hq=32, Hkv=8, D=128` following Llama3 setting.
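
A rough sketch of the computation being fused, as a hedged illustration (the interleaved even/odd rotation convention here is an assumption; the PR's kernel details may differ):

```Py
import torch

B, Hq, Hkv, S, D = 2, 32, 8, 1024, 128
q = torch.randn(B, Hq, S, D)
k = torch.randn(B, Hkv, S, D)

# One rotation frequency per pair of channels, one angle per position.
inv_freq = 1.0 / (10000 ** (torch.arange(0, D, 2).float() / D))
angles = torch.arange(S).float()[:, None] * inv_freq[None, :]  # [S, D/2]
cos, sin = angles.cos(), angles.sin()

def rope(x):
    # Rotate each even/odd channel pair by the position-dependent angle.
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Previously q and k were rotated by two separate kernels; the PR
# generates a single fused kernel covering both.
q_rot, k_rot = rope(q), rope(k)
```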

<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" />

Pull Request resolved: pytorch#161420
Approved by: https://github.com/shunting314
…orch#156633)

- Enable communication of tensors with the Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it (a usage sketch follows below).
- Move the function that checks whether a reduce op supports the Complex datatype out of ProcessGroupNCCL.cpp into a new file so it can be shared with ProcessGroupGloo.

Fixes pytorch#156632
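
A minimal single-process sketch of what this enables (the setup details are illustrative):

```Py
import torch
import torch.distributed as dist

# Single-rank gloo setup, just for illustration.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)

t = torch.randn(4, dtype=torch.complex64)
# Previously complex tensors were rejected by ProcessGroupGloo; now this
# works the same way ProcessGroupNCCL handles it.
dist.all_reduce(t, op=dist.ReduceOp.SUM)
```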

Pull Request resolved: pytorch#156633
Approved by: https://github.com/d4l3k
As those weren't really pins to begin with, and requirements.txt
already has those
Pull Request resolved: pytorch#162266
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: pytorch#162263, pytorch#162264
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange, at::arange
generates a regular tensor, and DTensor rightfully errors on mixed
DTensor-regular Tensor operations.

This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.

In order to accomplish this I move the DTensor implicit replication flag
to C++ (most batching rules are in C++).
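
A minimal single-process sketch of the failure mode and the flag, using the public DTensor API (the PR itself toggles the flag inside the C++ batching rule, not in user code):

```Py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor
from torch.distributed.tensor.experimental import implicit_replication

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29502",
                        rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))
weight = distribute_tensor(torch.randn(10, 4), mesh, [Replicate()])

# Mixing a regular tensor (like the batching rule's at::arange output)
# with a DTensor normally errors; under implicit replication the regular
# tensor is treated as replicated instead.
with implicit_replication():
    out = weight + torch.arange(4).float()
```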

Test Plan:
- new test

Pull Request resolved: pytorch#162117
Approved by: https://github.com/bdhirsh
… SafeTensors file (pytorch#162214)

Summary: The current dequantization implementation assumes that the weight and scale tensors are in the same SafeTensors file. This diff fixes the issue to support the case where they are in different files.

Test Plan:
buck test fbcode//caffe2/test/distributed/checkpoint\:test_quantized_hf_storage

Buck UI: https://www.internalfb.com/buck2/532bf151-bb40-41fd-b080-ff898675afe2
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15199648851011082

Rollback Plan:

Differential Revision: D81718598

Pull Request resolved: pytorch#162214
Approved by: https://github.com/wwwjn
If the user doesn't provide their own guard filter fn, we should filter global guards by default.

pytest test/dynamo/test_aot_compile.py

Pull Request resolved: pytorch#162090
Approved by: https://github.com/zhxchen17
And raise an error when building for an unsupported version
Pull Request resolved: pytorch#162265
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: pytorch#162297
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

  out_only = flex_attention(query, key, value, score_mod)
  out_max, aux_max = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=AuxRequest(max_scores=True),
  )
  out_both, aux_both = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=AuxRequest(lse=True, max_scores=True),
  )
```

Returns the max post-mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: any new return value needs a new kwarg gating whether it gets returned, and we would have to support all 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added a kwarg-only argument for now.

Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.

We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.
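
A rough sketch of that `ExtraReturns` idea (the name and fields here are hypothetical, not part of the flex_attention API):

```Py
from typing import NamedTuple, Optional
from torch import Tensor

class ExtraReturns(NamedTuple):
    # New auxiliary outputs can grow here without new top-level kwargs.
    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None
```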

### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs plus a reduction, we would either need to save another `max_location` from the forward, or find the max_score and apply it only to the first occurrence when there are multiple equal scores (need to check whether that's what we define for the vanilla max op in torch).

For now no grad, we can re-visit if needed.

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return Nones or optional tensors. If return_max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.
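
A hedged sketch of that size-zero placeholder idea (see also the codegen diff later in this stack, where a kernel argument becomes a `(0,)`-shaped buffer when its output is not requested):

```Py
import torch

def alloc_max_scores(q: torch.Tensor, want_max: bool) -> torch.Tensor:
    # When max_scores is requested, allocate the real [B, Hq, M] buffer.
    B, Hq, M, _ = q.shape
    if want_max:
        return q.new_empty((B, Hq, M), dtype=torch.float32)
    # Otherwise hand the kernel a zero-numel buffer instead of None, so
    # the hot path never branches on an optional tensor.
    return q.new_empty((0,), dtype=torch.float32)
```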

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: pytorch#161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ pytorch#161730
*  pytorch#161667

```Py
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf1 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32)
            # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
            stream0 = get_raw_stream(0)
            triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0)
            del arg0_1
            del arg1_1
            del arg2_1
            del arg3_1
            del arg4_1
            del arg5_1
            del arg6_1
            del buf0
            del buf1
        return (buf2, )
```

Vs

```Py
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf1 = empty_strided_cuda((0, ), (1, ), torch.float32)
            buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32)
            # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
            stream0 = get_raw_stream(0)
            triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0)
            del arg0_1
            del arg1_1
            del arg2_1
            del arg3_1
            del arg4_1
            del arg5_1
            del arg6_1
            del buf0
            del buf1
        return (buf2, )
```
<img width="428" height="145" alt="Screenshot 2025-08-28 at 12 37 11 PM" src="https://github.com/user-attachments/assets/240a7bca-97e1-40c4-bf93-f075fdc1a40d" />

Pull Request resolved: pytorch#161730
Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng
ghstack dependencies: pytorch#161667
…orch#161670)

Currently, user-defined classes inside a compiled frame will cause the whole
frame to be skipped by dynamo. This change defers the Unsupported exception
until the `__build_class__` builtin is actually called, which allows a graph break
to be inserted. Fixes pytorch#161562
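
A minimal repro sketch of the behavior being fixed (an assumed shape of the test, not the PR's actual test case):

```Py
import torch

@torch.compile
def f(x):
    class Point:  # user-defined class created inside the compiled frame
        def __init__(self, v):
            self.v = v
    return Point(x).v + 1

# Before this change, dynamo skipped compiling f entirely; after it, the
# Unsupported error surfaces at the __build_class__ call, so dynamo can
# insert a graph break there and still compile the surrounding code.
print(f(torch.ones(3)))
```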

Pull Request resolved: pytorch#161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
Summary: original PR - pytorch#161798

Test Plan:
ci

Rollback Plan:

Differential Revision: D81724234

Pull Request resolved: pytorch#162217
Approved by: https://github.com/SherlockNoMad
Fixes pytorch#162274

The test was removed on the vLLM side.

Pull Request resolved: pytorch#162306
Approved by: https://github.com/malfet
…torch#162300)

As titled, this PR ensures peak memory is estimated only when buffer reuse is enabled. Without this config, some nodes' successor nodes are eliminated from memory estimation after inductor bucketing, which can cause errors.

The original codegen peak memory estimation code is from this PR: pytorch#159530
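
A schematic sketch of the gating described above (`allow_buffer_reuse` is the inductor flag for buffer reuse; the wrapper function here is a hypothetical placeholder, not the PR's actual code):

```Py
import torch._inductor.config as inductor_config

def maybe_estimate_peak_memory(graph) -> None:
    if not inductor_config.allow_buffer_reuse:
        # Successor information can be incomplete after inductor
        # bucketing when reuse is off, so skip the estimate.
        return
    ...  # run the codegen peak-memory estimation from pytorch#159530
```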

Pull Request resolved: pytorch#162300
Approved by: https://github.com/eellison, https://github.com/v0i0
Fixes some `d_qk` != `d_v` cases on Hopper that are broken by cuDNN 9.11-9.12

Pull Request resolved: pytorch#162268
Approved by: https://github.com/drisspg, https://github.com/Skylion007
📚 Doc update

Adding a description of the torchgen folder to the code structure guide.

Pull Request resolved: pytorch#162261
Approved by: https://github.com/ezyang
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#159006
Approved by: https://github.com/SherlockNoMad
This PR adds an interface to allow users to specify a custom cudagraph wrapper. User example: [vllm](vllm-project/vllm#24281)

Pull Request resolved: pytorch#162207
Approved by: https://github.com/zou3519, https://github.com/eellison, https://github.com/ProExpertProg
Fix the `DeviceMesh._flatten` docstring example of use. Alternative fix would be to replace `mesh_3d["dp", "cp"]` with `mesh_3d["cp", "tp"]`.

(I verified the fix using the `gloo` backend)
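
A hedged sketch of the usage pattern in question (run with 8 ranks, e.g. under torchrun; the dims are illustrative):

```Py
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cpu", (2, 2, 2),
                           mesh_dim_names=("dp", "cp", "tp"))
# Slice two adjacent dims, then merge them into one flattened dim.
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten("dp_cp")
```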

Pull Request resolved: pytorch#162277
Approved by: https://github.com/ezyang
Fix part of pytorch#148404

APIs involved are as follows:

- cross_entropy_loss
- hardsigmoid_
- hardswish
- hardswish_
- huber_loss
Pull Request resolved: pytorch#162148
Approved by: https://github.com/FFFrog, https://github.com/ezyang
rocm-mici and others added 25 commits October 10, 2025 14:55
…_rcpf(x) instead of 1.f/x (#1800)

Cherry-pick of #1688

Co-authored-by: Michael Halkenhäuser <michaelhalk@web.de>
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
(cherry picked from commit f8544af)
(cherry picked from commit ed48754)
(cherry picked from commit d62a39e)
(cherry picked from commit b26ddb8)
Related to
c7a1e32
Fixes https://ontrack-internal.amd.com/browse/SWDEV-537835

Not a Navi-specific failure:
```
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1412, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1671, in test_cuda_tensor_pow_scalar_tensor
    self._test_pow(base, exp)
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1482, in _test_pow
    self.assertEqual(actual, expected)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4052, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```

Using .to(actual) without specifying dtype/device assumes actual is a
tensor or tensor-like, which may fail silently or promote. Fixed by
explicitly matching dtype and device, following
pytorch#107302
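
A hedged illustration of the pattern described (the tensors are stand-ins for the test's `actual`/`expected`):

```Py
import torch

actual = torch.tensor(2.0, dtype=torch.float32)
expected = torch.tensor(2.0, dtype=torch.float64)

# Before: assumes `actual` is tensor-like and can promote silently.
before = expected.to(actual)
# After: match dtype and device explicitly.
after = expected.to(dtype=actual.dtype, device=actual.device)
```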
Fix:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# TEST_CONFIG=default HIP_VISIBLE_DEVICES=0 PYTORCH_TEST_WITH_ROCM=1 python test/test_binary_ufuncs.py TestBinaryUfuncsCUDA.test_cuda_tensor_pow_scalar_tensor_cuda
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/hypothesis/entry_points.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

Running tests...
----------------------------------------------------------------------
.
----------------------------------------------------------------------
Ran 1 test in 0.141s

OK

Generating XML reports...
root@ubb4-rack-22:/var/lib/jenkins/pytorch# pip list | grep numpy
numpy                   2.1.2

```

(cherry picked from commit a4d60fa)
(cherry picked from commit 9f11871)
This PR fixes the unit test:

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.

This error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved() when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward. To avoid this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA state, as
sketched below.
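
A hedged sketch of the cleanup plus the size arithmetic from the traceback (the 0.5 fraction is an assumption for illustration):

```Py
import torch

# Reset allocator state so a previous test's peak doesn't leak into
# max_memory_reserved() and overflow the size computation below.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

total = torch.cuda.get_device_properties(0).total_memory
application = int(total * 0.5) - torch.cuda.max_memory_reserved()
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
```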

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…g torch and numpy tensors (#2362)

Cherry-pick of #2340

Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
(cherry picked from commit 22c98ea)
(cherry picked from commit 2d72fcd)
pip installed requirements.txt and .ci/docker/requirements-ci.txt

Local validation: `Successfully installed jinja2-3.1.6 lintrunner-0.12.7
mypy-1.14.0 onnxscript-0.2.2 sympy-1.13.3 tlparse-0.3.30
z3-solver-4.12.6.0`

(cherry picked from commit 30508ff)
(cherry picked from commit 22d02e8)
Adds initial autotuning for foreach support required for
https://ontrack-internal.amd.com/browse/SWDEV-539076

4x improvement for some kernels

Before:
```
triton_for_fused_18.kd | 4.986 ms | 4.986 ms | 2.493 ms | 2
triton_for_fused_6.kd  | 0.098 ms | 0.098 ms | 0.049 ms | 2
triton_for_fused_7.kd  | 0.036 ms | 0.036 ms | 0.018 ms | 2
```

After:
```
triton_for_fused_18.kd | 1.273 ms | 1.273 ms | 0.636 ms | 2
triton_for_fused_6.kd  | 0.044 ms | 0.044 ms | 0.022 ms | 2
triton_for_fused_7.kd  | 0.024 ms | 0.024 ms | 0.012 ms | 2
```

(cherry picked from commit f07b7f7)
(cherry picked from commit ed0d0a7)
Relands #2416 with caching fix

Upstream equivalent pytorch#159146

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit f0aebdc)
(cherry picked from commit 9c429dd)
… Fix warps runtime part 2 (#2455)

Cherry-pick of #2442

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
(cherry picked from commit 77a6760)
…ersistent reduction and no_x_dim removal (#2454)

Cherry-pick of #2417
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
(cherry picked from commit eb47158)
Perf improvement for triton tanh

(cherry picked from commit 4febbd8)
… rocm version (#2529)

Cherry-pick of #2518

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit c03be63)
Fixes SWDEV-543698
(https://ontrack-internal.amd.com/browse/SWDEV-543698)

Cherry-picked from #2502

This PR fixes the errors like below:
```
[rank3]: RuntimeError: The following operation failed in the TorchScript interpreter.
[rank3]: Traceback of TorchScript (most recent call last):
[rank3]: RuntimeError: /tmp/comgr-28f951/input/CompileSourceACC062:67:7: error: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
[rank3]:    67 |       uint32_t int32;
[rank3]:       |       ^~~~~~~~
[rank3]:       |       __hip_internal::uint32_t
```
Earlier, uint32_t was defined in the std namespace in the HIP headers. It
has now moved to the __hip_internal namespace; this change was made in
ROCm 7.0.

(cherry picked from commit b2fb688)
…2598)

Cherry-pick of #2597

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
(cherry picked from commit 9ea02c4)
The original PR (#2417) had incorrect indentation. Updated the PR so that
autotune will always add tiny configs; otherwise only the hinted configs
are used.

Tested locally on test_torchinductor:
```
Ran 894 tests in 952.242s
FAILED (failures=1, skipped=28)
```

And completed autotune runs for microbench models:
```
Microbenchmark for network : resnet152
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.09107530117034912
Throughput [img/sec] : 702.7152167226226
```

(cherry picked from commit db3ba66)
cherry-pick of
8d42697

(cherry picked from commit 0b82d9a)
cherry-pick of pytorch#163869

(cherry picked from commit dfd386f)
[AUTOGENERATED] release/2.9_IFU_2025-10-14
Cherry-pick of #2693

Co-authored-by: Gheorghe-Teodor Bercea <gt.bercea@gmail.com>
Cherry-pick of #2710

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>