Conversation

@pragupta pragupta commented Nov 4, 2025

rocm_base: 56002f4
upstream_main: 24db5c4

eellison and others added 30 commits October 30, 2025 16:54
Make the bucketer use multi-dtype bucketing for all-gathers.

Pull Request resolved: pytorch#166527
Approved by: https://github.com/IvanKobzarev, https://github.com/ezyang
Bucket all-reduce in the bucketer, thanks to @IvanKobzarev's earlier PR.

Pull Request resolved: pytorch#166528
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#166527
…orch#164518)"

This reverts commit 311ea0d.

Reverted pytorch#164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](pytorch#164518 (comment)))
… func `_init_one_process_group` (pytorch#166614)

To make the pg cache change easier and to modularize the code, we isolate the logic of process group creation into a separate function named `_init_one_process_group`.

Pull Request resolved: pytorch#166614
Approved by: https://github.com/lw
Fixes pytorch#165177

When converting guards to sources, if we are unable to get the expected symbol from symbol_to_source, try to get it from var_to_sources.

I was unable to make a simpler repro than what was described in the issue (which relies on llama3 - so inappropriate for a unit test).

Pull Request resolved: pytorch#165723
Approved by: https://github.com/bobrenjc93
Summary:
After a precision study, we concluded it is OK to use ACL's exp function in f32 erf().
We can keep erf inlined this way.

Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiling with clang-19:

Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85730452

Pull Request resolved: pytorch#166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
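For reference, a minimal way to time f32 erf on a 1M-element tensor. This is an illustrative micro-benchmark, not the operator_benchmark harness from the test plan above, so absolute numbers will differ from the ones quoted:
```python
import time
import torch

x = torch.randn(1_000_000, dtype=torch.float32)

for _ in range(3):          # warm-up
    torch.erf(x)

iters = 100
t0 = time.perf_counter()
for _ in range(iters):
    torch.erf(x)
elapsed_us = (time.perf_counter() - t0) / iters * 1e6
print(f"f32 erf: {elapsed_us:.3f}us")
```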
This reverts commit 9051940.

Reverted pytorch#166541 on behalf of https://github.com/atalman due to breaks internal test ([comment](pytorch#166541 (comment)))
…rd error (pytorch#166367)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#166367
Approved by: https://github.com/bdhirsh
…#165139)

As the title stated.

- AOTI_TORCH_CHECK depends on TORCH_CHECK_MSG, which is located in c10/util/Exception.h and may break BC
- AOTI_TORCH_CHECK is not used everywhere
- STD_TORCH_CHECK has ABI check tests.
Pull Request resolved: pytorch#165139
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
As the title stated.

**Changes**:
- torch/csrc/inductor(Part 2)

Pull Request resolved: pytorch#165119
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#165139
It appears to be unused based on `cd torch; rg has_symints`.

Pull Request resolved: pytorch#163667
Approved by: https://github.com/xmfan, https://github.com/azahed98, https://github.com/albanD
ghstack dependencies: pytorch#162990
Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions.

Pull Request resolved: pytorch#166624
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
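A small sketch of the recommended pattern (the helper name is illustrative, not from the PR):
```python
import torch

def is_torch_add(fn) -> bool:
    # Identity comparison against the torch function object; unlike `==`,
    # `is` cannot trigger arbitrary __eq__ overloads.
    return fn is torch.add

print(is_torch_add(torch.add))  # True
print(is_torch_add(torch.sub))  # False
```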
In some cases the warning from pytorch#147744 still gets emitted because [atexit hooks aren't called](python/cpython#114279).

Even in those cases, if the atexit hooks _were_ called you could end up with issues due to the directory being deleted in one process, but still being used elsewhere.

It's better all round to load these modules entirely in-memory.

Pull Request resolved: pytorch#157713
Approved by: https://github.com/xush6528
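A generic sketch of loading a module entirely in memory (not the exact change in this PR): no on-disk temp directory is created, so there is nothing for atexit to clean up and nothing another process can delete out from under you.
```python
import types

# Compile and execute the module source directly into a fresh module object.
source = "def answer():\n    return 42\n"
module = types.ModuleType("in_memory_mod")
exec(compile(source, "<in-memory>", "exec"), module.__dict__)
print(module.answer())  # 42
```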
After pytorch#163049, this PR fixes the type annotations to match the actual implementation for ProcessGroupXCCL::Options.
Pull Request resolved: pytorch#166418
Approved by: https://github.com/guangyey, https://github.com/ezyang
…h#165511)

# Motivation
Aligned with other backends, this PR introduces a new API, torch.xpu.get_per_process_memory_fraction, to allow users to retrieve the allowed memory fraction for a single process.

Pull Request resolved: pytorch#165511
Approved by: https://github.com/EikanWang, https://github.com/ezyang
ghstack dependencies: pytorch#165508, pytorch#165509, pytorch#165510
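A usage sketch for the new getter. The setter shown here is assumed to come from the earlier PRs in this ghstack (mirroring the CUDA counterpart); only the getter is introduced above.
```python
import torch

if torch.xpu.is_available():
    # Assumed to exist from the preceding stack entries.
    torch.xpu.set_per_process_memory_fraction(0.5)
    print(torch.xpu.get_per_process_memory_fraction())  # 0.5
```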
Fixes pytorch#166630

Pull Request resolved: pytorch#166631
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Fixes the diskspace cutoff message to report that the machine does not have `difference = 100 - diskspace_cutoff_int` space available.

Pull Request resolved: pytorch#166645
Approved by: https://github.com/jeffdaily
…serts (pytorch#166171)

Summary:
Using `CUDA_KERNEL_ASSERT_PRINTF` inside kernels allows us to log invalid values to the console (which can in turn be used to surface hopefully clearer error messages).

This does have an impact on the number of registers needed for the values being logged (I confirmed via diffing PTX that there is no other impact relative to using `__assert_fail`).

To avoid causing perf bottlenecks, this change adds a compile-time switch to enable more verbose errors in some of the common kernels that cause DSAs. There is also a Buck config that can be used to configure this switch more conveniently.

## Alternatives considered
I considered making the behavior of `CUDA_KERNEL_ASSERT_PRINTF` controllable via a compile-time macro instead of writing another wrapper for it but there are kernels where the extra register pressure is not as severe and in those cases, having more useful error messages by default is pretty useful.

Test Plan:
## Simple Python Driver:
```
# scatter_errors.py
import torch

def main() -> None:
    a = torch.rand(128, device="cuda:0")
    idx = torch.randint(0, 128, (100,), device="cuda:0")
    idx[0] = 9999  # out-of-bounds index to trigger the device-side assert
    b = torch.scatter(a, 0, idx, 555.0)
    print(b)

if __name__ == "__main__":
    main()
```

When running normally via:
```
$ buck2 run @//mode/opt  :scatter_errors
```
we see the following DSA message:
```
fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```

Running via:
```
$  buck2 run @//mode/opt -c fbcode.c10_enable_verbose_assert=1 :scatter_errors
```
however produces:
```
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0]: Assertion failed: `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"`: Expected 0 <= idx_dim < index_size (128), but got idx_dim = 9999
```

Differential Revision: D85185987

Pull Request resolved: pytorch#166171
Approved by: https://github.com/ngimel
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keeps the imports so that existing users of CP won't be affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with a 3000+ LoC change, and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.

Pull Request resolved: pytorch#166456
Approved by: https://github.com/Skylion007
Summary:

Blockwise 1x128 and 128x128 scaling is only available on CUDA >= 12.9
and only on Hopper GPUs. Attempting to run on B200 would give a
hard-to-debug `CUBLAS_STATUS_NOT_SUPPORTED`.

Add a more helpful `NotImplementedError` to catch this case.

Also more explicitly disable ROCm builds for relevant methods, based on
lack of support per [hipBLASlt
docs](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/reference/datatypes.html#_CPPv4N28hipblasLtMatmulMatrixScale_t40HIPBLASLT_MATMUL_MATRIX_SCALE_VEC128_32FE).

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#166639
Approved by: https://github.com/drisspg
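A hedged sketch of the guard described above. The real check lives in the scaled-mm dispatch path; equating "Hopper" with compute capability 9.x and the helper name are assumptions here.
```python
import torch

def check_blockwise_scaling_supported() -> None:
    if not torch.cuda.is_available():
        raise NotImplementedError("Blockwise 1x128 / 128x128 scaling requires CUDA")
    major, _minor = torch.cuda.get_device_capability()
    if major != 9:
        raise NotImplementedError(
            "Blockwise 1x128 and 128x128 scaling is only supported on Hopper GPUs "
            "with CUDA >= 12.9"
        )
```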
This PR removes unused loop variables in tests.

Pull Request resolved: pytorch#166509
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
…ch#166603)

Looking for feedback on this approach.
Received user reports of spurious pyrefly errors from users using hg instead of git. I think this is because, when using a venv and git, `make setup-env` installs requirements and pulls a nightly torch wheel, which pyrefly needs in order to type check properly.

Initial documentation for `make setup-env` I found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch

Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner # confirm error message / warning appears
>>> General linter failure:
  Warning (PYREFLY) nightly-wheel-not-run
    pytorch-nightly.pth not found. You may need to run make setup-env or make
    setup-env-conda to install nightly binaries and type stubs.
```
Pull Request resolved: pytorch#166603
Approved by: https://github.com/aorenste
)

Replace assert statements with explicit if/raise patterns in:

- torch/ao/quantization/fx/* (177 errors)

Partially fixes pytorch#164878

Pull Request resolved: pytorch#165420
Approved by: https://github.com/RohitRathore1, https://github.com/fffrog, https://github.com/albanD
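The pattern applied across torch/ao/quantization/fx in this PR, with illustrative names (not the actual upstream diff). Unlike `assert`, the explicit raise survives `python -O`:
```python
def get_qconfig(node_name: str, qconfig_map: dict):
    qconfig = qconfig_map.get(node_name)
    if qconfig is None:
        raise AssertionError(f"missing qconfig for node {node_name!r}")
    return qconfig
```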
This config improves performance by 250% on some kernels that contain `t1.atomic_add(...)`. Again, we conditionalize for ROCm/HIP, so there is no impact on NV.

Pull Request resolved: pytorch#166470
Approved by: https://github.com/PaulZhang12, https://github.com/mlazos, https://github.com/eellison, https://github.com/jansel
…h#166397)

According to the next commit to AOTriton:
ROCm/aotriton@8625c4f

These changes were missed in the 0.11b release:
pytorch#161754

Pull Request resolved: pytorch#166397
Approved by: https://github.com/jeffdaily
Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder but keeps the imports so that existing users of CP won't be
affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch
infra cannot accept a PR with a 3000+ LoC change, and git cannot recognize
that _context_parallel/_attention.py is moved from _attention.py because
we want to keep BC.

This is the second PR.

Pull Request resolved: pytorch#166501
Approved by: https://github.com/Skylion007
ghstack dependencies: pytorch#166456
Stonepia and others added 25 commits November 4, 2025 02:14
This PR enables the formerly skipped XPU tests in `test_fxir_backend.py`. No additional changes are needed for the features.

Pull Request resolved: pytorch#166493
Approved by: https://github.com/angelayi, https://github.com/EikanWang
This ensures that we graph break at the right time, leading to the right
stack trace.

Pull Request resolved: pytorch#166909
Approved by: https://github.com/tugsbayasgalan
…#165686)

Wraps the header in a C file and compiles it using a C compiler, which supports designated initializers.

Fix issue pytorch#160647

Pull Request resolved: pytorch#165686
Approved by: https://github.com/williamwen42
In pytorch#166440, I didn't realize you could turn on anomaly mode while disabling NaN checks for these stacks. This adds them to `debug_mode.operators[*].fwd_stack_trace`.

Pull Request resolved: pytorch#166842
Approved by: https://github.com/yushangdi, https://github.com/mikaylagawarecki
# Motivation
This PR intends to add the `ExpandableSegment` struct, which helps support the expandable segment feature. I split it into a separate PR to facilitate code review.

Pull Request resolved: pytorch#166299
Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui
…6292)

# Motivation
This PR intends to add expandable segment feature support on XPU. This will help
- Reduce memory fragmentation;
- Gradually map physical pages into virtual address space as needed.

# Additional Context
The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor sizes, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail.
The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead.
The potential drawbacks are
- Virtual memory overhead;
- Potential page mapping overhead;
- Increased complexity.

Pull Request resolved: pytorch#166292
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: pytorch#166299
# Motivation
This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices.
It supports two categories of allocations:
- Regular allocations;
- Expandable segment allocations.

# Additional Context
The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported.

Pull Request resolved: pytorch#166424
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: pytorch#166299, pytorch#166292
… to trace through typing.get_type_hints (pytorch#166757)

This is covered by `test_get_type_hints` in test/dynamo/test_repros.py

Pull Request resolved: pytorch#166757
Approved by: https://github.com/Lucaskabela
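A small sketch of the behavior being tested: a compiled function can now call typing.get_type_hints without dynamo graph-breaking on it (backend="eager" keeps the example lightweight; the function names are illustrative).
```python
import typing
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    return x + 1

@torch.compile(backend="eager")
def g(x):
    # Dynamo traces through typing.get_type_hints instead of graph-breaking.
    hints = typing.get_type_hints(f)
    return x * 2 if hints["return"] is torch.Tensor else x

print(g(torch.randn(3)))
```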
This PR continues applying the ruff UP035 rule to inductor code. The UP035 rule targets Python 3.10+ syntax and library locations.

Pull Request resolved: pytorch#166225
Approved by: https://github.com/aorenste
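An illustrative before/after of the kind of rewrite UP035 drives (deprecated typing imports replaced with collections.abc / builtin generics); the function is made up for the example.
```python
# Before (flagged by UP035):
#     from typing import Callable, List
#     def run_all(fns: List[Callable[[], None]]) -> None: ...

# After the rewrite:
from collections.abc import Callable

def run_all(fns: list[Callable[[], None]]) -> None:
    for fn in fns:
        fn()
```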
…#166839)

Fixes pytorch#166838
1. Register the `convolution_overrideable` key for the flop_counter. CUDA relies on keys like `cudnn_convolution`; devices like `XPU` fall back to `convolution_overrideable`. Without the correct registration, the flop_counter silently returns 0 for XPU in line:
https://github.com/pytorch/pytorch/blob/e1d011d6eb571cd98ec7c7ed8e8b518a5463ec97/torch/_inductor/analysis/profile_analysis.py#L178-L179

2. Enable the tests for XPU in `test_analysis.py`.

Pull Request resolved: pytorch#166839
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jansel
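As a quick sanity check of convolution flop counting in general (the PR itself fixes the registration used by the inductor profile analysis, so this is only an adjacent illustration):
```python
import torch
from torch.utils.flop_counter import FlopCounterMode

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 32, 32)
with FlopCounterMode(display=False) as counter:
    conv(x)
# Non-zero once the conv flop formula is registered for the dispatched key.
print(counter.get_total_flops())
```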
Make sure you're on cutlass 4.2.0+

Test Plan:
Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16  --num-inputs 1 --metrics tflops,accuracy`

Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`

Differential Revision: D82010227

Pull Request resolved: pytorch#165036
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg, https://github.com/mlazos
This reverts commit c21868b.

Reverted pytorch#162275 on behalf of https://github.com/izaitsevfb due to breaking test_rms_norm_bwd_float32_split_reductions_True_shape2 ([comment](pytorch#162275 (comment)))
…#166830)

# Motivation

I believe this is a bug - here's why:
In [dnnl_common_types.h](https://github.com/uxlfoundation/oneDNN/blob/98132c49080c19caffee1f7ba65be83339cba985/include/oneapi/dnnl/dnnl_common_types.h#L116-L125), `md_padded_dims` is defined as a pointer to an `int64_t[12]` array;
We can confirm this from the implementation in [memory_desc.cpp](https://github.com/uxlfoundation/oneDNN/blob/98132c49080c19caffee1f7ba65be83339cba985/src/common/memory_desc.cpp#L746-L748) where the member indeed points to an internal array.

# Solution

Therefore, when accessing `md_padded_dims`, we should first dereference the pointer and then use it with an index - directly using it without dereferencing would corrupt memory.

Pull Request resolved: pytorch#166830
Approved by: https://github.com/EikanWang
…peError (pytorch#165264)

Fixes pytorch#164684

### Description

Symbolic tracing fails during multiplication between a `SymBool` and a `Tensor`. This scenario is triggered when `.item()` is called on a 0-dim boolean tensor within a `torch.compile` region. In compile mode, this yields a `SymBool`, and the subsequent `SymBool * FakeTensor` operation is unsupported, leading to a `TypeError` or a data-dependent `UserError`.

### Solution

This PR addresses the issue at the type-conversion level, as suggested by reviewers.

The root cause of the TypeError is that torch.sym_float() (which is called by _maybe_convert_to_dtype during type promotion for aten.mul) lacks a conversion path for SymBool and incorrectly falls back to builtins.float(SymBool).

This fix addresses this by implementing the __sym_float__(self) method within the SymBool class (defined in torch/__init__.py).

The torch.sym_float(a) utility function is already designed to check for hasattr(a, "__sym_float__") before falling back to builtins.float(). By adding this method, SymBool instances now correctly advertise their ability to be cast to SymFloat. The new method implementation leverages self.node.sym_float() to correctly convert the symbolic boolean value to its symbolic float representation (0.0 or 1.0), resolving the TypeError at its source.

This approach is more fundamental than modifying a specific operation in builtin.py and ensures SymBool can be correctly promoted to SymFloat in any operation, while still preserving its boolean nature for control flow operations like guard_or_false (which is verified by a new test case).

### Verification

1.  **Bug Reproduced**: The initial `UserError: Could not guard on data-dependent expression` was successfully reproduced with the script from the issue. As shown below
<img width="1369" height="945" alt="Screenshot 2025-10-13 at 10 29 05" src="https://github.com/user-attachments/assets/8daa4555-3347-4af5-906a-02150b8df9d1" />

2.  **Fix Validated**: After applying the code changes, the same script now runs to completion, printing `✅ eager success` and `✅ compile success`. As shown below
<img width="1228" height="82" alt="Screenshot 2025-10-13 at 10 29 21" src="https://github.com/user-attachments/assets/94c4f143-b898-4dda-9bff-0ad5450a30fa" />

3. Added a new test class DynamoOpPromotionTests to test/dynamo/test_misc.py with three new test cases:
    1. test_symbool_tensor_mul_does_not_fail: Verifies that the original bug report code (with .item() + *) no longer raises an error when compiled.
    2. test_symbool_guard_or_false: Verifies that this fix does not cause a regression for guard_or_false(SymBool) (the concern raised by reviewers).
    3. test_symbool_tensor_mul: Verifies the behavior of Tensor(bool) * Tensor(float) (without .item()) for completeness.

All new tests were added and pass locally.

Pull Request resolved: pytorch#165264
Approved by: https://github.com/laithsakka, https://github.com/Lucaskabela
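A minimal repro in the spirit of pytorch#164684 and the tests above (not the exact test code): with `__sym_float__` in place, the SymBool produced by `.item()` promotes cleanly to SymFloat.
```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(backend="eager")
def f(flag_t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    flag = flag_t.item()   # yields a SymBool under compile
    return flag * x        # previously raised TypeError / data-dependent UserError

print(f(torch.tensor(True), torch.randn(4)))
```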
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#166867
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#166868
# Motivation
This PR aims to reuse some unit tests to validate the expandable segment feature.

# Additional Context
Currently, the failure is related to the internal tracker `GSD-11403`; we should get the fix when upgrading the driver to `ci-neo-master-034630` or greater.
TODO: add conv and gemm tests to this test case when upgrading the driver.

Pull Request resolved: pytorch#166495
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: pytorch#166299, pytorch#166292, pytorch#166424
Summary:

cuBlasLt enforces size/stride requirements for 1x128 and 128x128 blockwise scaling
kernels, some of which weren't being handled, causing silent incorrect
answers especially for 128x128 scaling cases.

For deepseek-style scaling, with `A: MxN`, `B: KxN`, cuBlasLt enforces ([docs](https://docs.nvidia.com/cuda/cublas/#scaling-factors-layouts)) the following:

```Py
L = K // 128
L4 = round_up(L, 4)

1x128 x 128x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [L4, N // 128], stride: [1, L4]

128x128 x 1x128:
* A_scale: [L4, M // 128], stride: [1, L4]
* B_scale: [N, K // 128], stride: [1, N]

1x128 x 1x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [N, K // 128], stride: [1, N]
```

Notable here is the `L4` term, which means we must round up to the nearest multiple of 4 blocks
in the `K` dimension. This wasn't enforced previously, causing silent wrong answers
when `(K // 128) % 4 != 0`.

Subscribers: @vkuzo
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#166752
Approved by: https://github.com/drisspg, https://github.com/vkuzo
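As a concrete illustration of the `L4` rounding and the required scale layouts above, a minimal sketch with made-up sizes (not taken from the PR): `K // 128 == 3`, so `L4` rounds up to 4.
```python
import torch

def round_up(x: int, multiple: int) -> int:
    return (x + multiple - 1) // multiple * multiple

M, N, K = 256, 512, 384
L4 = round_up(K // 128, 4)

# 1x128 (A) x 128x128 (B): column-major scale tensors matching the shapes/strides above.
A_scale = torch.empty(K // 128, M, dtype=torch.float32).t()   # [M, K//128], stride [1, M]
B_scale = torch.empty(N // 128, L4, dtype=torch.float32).t()  # [L4, N//128], stride [1, L4]
print(A_scale.shape, A_scale.stride())  # torch.Size([256, 3]) (1, 256)
print(B_scale.shape, B_scale.stride())  # torch.Size([4, 4]) (1, 4)
```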
Forward fix for failing Docker release builds
Related to: pytorch#166897

Nightly Docker build failure https://github.com/pytorch/pytorch/actions/runs/18900508440/job/53946606434
Due to missing base image:
```
ERROR: failed to build: failed to solve: docker.io/nvidia/cuda:13.0.2-devel-ubuntu22.04: not found
```

Pull Request resolved: pytorch#166904
Approved by: https://github.com/tinglvv, https://github.com/malfet
# Conflicts:
#	.ci/docker/requirements-ci.txt
@pragupta pragupta requested a review from jeffdaily as a code owner November 4, 2025 15:37
@pragupta pragupta merged commit 3d74218 into develop Nov 4, 2025
11 checks passed
@pragupta pragupta deleted the develop_IFU_20251104 branch November 4, 2025 15:41