forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251124 #2827
Merged
Conversation
…7661)" This reverts commit 1b43d6c. Reverted pytorch#167661 on behalf of https://github.com/yangw-dev due to break internal tests and build, please reach out meta fellas to have fix it and reland again, error examplke: hip/KernelUtils.cuh:74:5: error: no matching function for call to 'unsafeAtomicAdd' ([comment](pytorch#167661 (comment)))
Summary: The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative. The decorator will emit a FutureWarning at runtime, and the docstring now includes a .. deprecated:: directive for documentation visibility. Test Plan: Manual verification that the decorator is properly applied and the deprecation message is informative. Differential Revision: D87272399 Pull Request resolved: pytorch#168036 Approved by: https://github.com/valentinandrei
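For reference, a minimal sketch of the recommended replacement workflow, assuming a CUDA build (the snapshot filename is illustrative):

```python
import torch

# Start recording allocator events with stack traces.
torch.cuda.memory._record_memory_history(max_entries=100_000)

x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump a snapshot (viewable at pytorch.org/memory_viz), then stop recording.
torch.cuda.memory._export_memory_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```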
This PR introduces a `Tensor` subclass which represents a complex tensor in terms of two real ones. Ops are decomposed as individual ops on the real and imaginary parts. It is compatible with `torch.compile`, so long as the real ops used are also compatible. Autograd "works", but is WIP due to different edge-case behaviour. Pull Request resolved: pytorch#167621 Approved by: https://github.com/ezyang
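To illustrate the decomposition idea only (this is not the PR's subclass code), here is a complex multiply expressed as individual ops on the real and imaginary parts:

```python
import torch

def complex_mul(re1, im1, re2, im2):
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    return re1 * re2 - im1 * im2, re1 * im2 + im1 * re2

re, im = torch.randn(3), torch.randn(3)
out_re, out_im = complex_mul(re, im, re, im)
ref = torch.complex(re, im) * torch.complex(re, im)
assert torch.allclose(out_re, ref.real) and torch.allclose(out_im, ref.imag)
```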
Repetition of pytorch#155708. This has been broken for a while, and the ET pin in PyTorch is so old that `torch==2.10.0.dev20250915` could no longer be found in the nightly indices. Pull Request resolved: pytorch#168090 Approved by: https://github.com/atalman, https://github.com/yangw-dev
This PR enables special matmuls on Thor devices. This includes row-wise scaled matmul on `fp8` and group gemm on `bfloat16`. Pull Request resolved: pytorch#164836 Approved by: https://github.com/ngimel
…orch#167395) This adds a debug HTTP server for debugging stuck or slow jobs. It runs the WorkerServer on every worker and then launches a separate flask process on rank 0 that users connect to for debugging. This can easily be improved to trigger profilers as well as to visualize the data much better. Initial handlers:

* pytorch profiler
* FlightRecorder data
* Python stacks

```python
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server
enable_debug_server()
```

Test plan:

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
```

(Three screenshots of the debug UI omitted.)

Pull Request resolved: pytorch#167395 Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/atalman
…ytorch#167079) Summary: As title. Knowing the size of the leaked tensor is useful; it lets us focus on the largest leaks. Differential Revision: D86218574 Pull Request resolved: pytorch#167079 Approved by: https://github.com/kausv
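As a small aside, sizing a tensor for such a report can be done like this (an illustration, not the PR's code):

```python
import torch

t = torch.randn(1024, 1024)
print(t.numel() * t.element_size())  # 4194304 bytes of element data
print(t.untyped_storage().nbytes())  # bytes actually held by the backing storage
```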
…torch#161703) This is another PR porting distributed tensor tests to Intel GPU; the companion PR is pytorch#161604. We enable Intel GPU with the following methods, trying our best to keep the original code style:

- Use torch.accelerator for generic GPU support
- Skip a case if it runs on XPU and has known issues

Pull Request resolved: pytorch#161703 Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/albanD
Pull Request resolved: pytorch#167852 Approved by: https://github.com/fmassa
The all-gather bucketing already went partway toward fusing dtype casts into the bucket. We do this by allocating the group bucket buffer, then viewing each slice of it as the destination dtype. We then foreach_copy_ into the allocated buffer, with each collective copying into its destination dtype (a sketch follows below). This logic was causing an issue in a later part of the stack while not fully firing, so we might as well fix it. Note: custom ops don't yet support list[dtype], so I worked around it with list[int]; this will be fixed in a follow-up. Pull Request resolved: pytorch#167853 Approved by: https://github.com/ruisizhang123 ghstack dependencies: pytorch#167852
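A rough sketch of the copy-with-cast step, with illustrative shapes and dtypes (the PR's actual buffers come from collective bucketing):

```python
import torch

# Two fp32 shards copied into one bf16 bucket buffer with a single foreach
# call; copy_ semantics cast each source to its destination slice's dtype.
srcs = [torch.randn(4), torch.randn(2)]
bucket = torch.empty(6, dtype=torch.bfloat16)
dsts = [bucket[0:4], bucket[4:6]]
torch._foreach_copy_(dsts, srcs)
print(bucket.dtype, bucket.shape)
```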
The bucketing dtype fusing was causing nodes which had dependencies to be erased. Transfer those deps over to the new nodes, and also add an assertion that none of our deps are erased to catch this type of error in the future. Pull Request resolved: pytorch#167863 Approved by: https://github.com/fmassa ghstack dependencies: pytorch#167852, pytorch#167853
Since the currently intended workflow on the new MI3xx CI capacity is [trunk-rocm-mi300.yml](https://github.com/pytorch/pytorch/blob/d91269e8ce309437c1f849b5ab3362d69b178ef4/.github/workflows/trunk-rocm-mi300.yml#L54), which only needs the jammy images, we limit the builds to those images to optimize Docker caching times. Pull Request resolved: pytorch#168088 Approved by: https://github.com/jeffdaily
For GPU: it was previously reported that only a single sample could be tested with the huber_loss functional. The current snapshot of the code does not appear to suffer from the numerical issues reported before. For CPU: while testing the GPU path, it was discovered that the Half implementation appears to be numerically unstable. This commit resolves the CPU issue by upcasting Half to float for the computation. Pull Request resolved: pytorch#166952 Approved by: https://github.com/benjaminglass1, https://github.com/isuruf
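A minimal sketch of the upcast pattern at the user level (the actual change is in the kernel):

```python
import torch
import torch.nn.functional as F

x = torch.randn(16, dtype=torch.half)
y = torch.randn(16, dtype=torch.half)

# Compute in float for numerical stability, then cast the result back to half.
loss = F.huber_loss(x.float(), y.float()).to(torch.half)
print(loss)
```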
…h/csrc/Exceptions.h (pytorch#168056) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This:

```cpp
try {
  ...
} catch (exception& e) {
  // no use of e
}
```

should instead be written as

```cpp
} catch (exception&) {
```

If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: dtolnay Differential Revision: D87273132 Pull Request resolved: pytorch#168056 Approved by: https://github.com/malfet, https://github.com/Skylion007
Summary: If the Tensor has a PyObject, its use count will now be two instead of one. Test Plan: `buck test -j 18 fbcode//mode/dev-nosan fbcode//caffe2/test:torch` Differential Revision: D87297965 Pull Request resolved: pytorch#168060 Approved by: https://github.com/albanD, https://github.com/Skylion007
As the compiler has not been supported for the last 3 years and all manylinux2_28 builds should have at least gcc-11. Prep change for the C++20 standard migration. Pull Request resolved: pytorch#167933 Approved by: https://github.com/yangw-dev, https://github.com/atalman ghstack dependencies: pytorch#168090
Pull Request resolved: pytorch#166903 Approved by: https://github.com/malfet
…rch#168104) We only want to cache the latest CI docker image for `main` and `release` branches in cases where multiple `docker-builds` workflow runs get triggered in quick succession. This is because the latest run will overwrite the cached images anyway: we do not maintain a cached image per SHA, only one per branch (to minimize cache size and docker load times at runner bringup). Also removing `workflow_dispatch` as a trigger since it won't work (it needs artifacts from a `docker-builds` run). Pull Request resolved: pytorch#168104 Approved by: https://github.com/jeffdaily
Fixes pytorch#167905. The typo correction below has been made. Existing comment: `// List of Any can contains heterogenous types` Suggested comment: `// List of Any can contains heterogeneous types` Pull Request resolved: pytorch#167907 Approved by: https://github.com/albanD
Unclear which PR in the ghstack caused the ROCm failure. Stack was (oldest at bottom):

- pytorch#167962
- pytorch#167804
- pytorch#167803
- pytorch#167802
- pytorch#168025

Fixes the following test:

```
PYTORCH_TEST_WITH_ROCM=1 python test/cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility.py FunctionVersionCompatibilityTest.test_mv_tensor_accessor_cuda_works_with_2_9
```

Pull Request resolved: pytorch#168087 Approved by: https://github.com/jeffdaily, https://github.com/janeyx99 Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Fixes a false negative (the illusion that all B200 periodic nvshmem-triton tests passed). Pull Request resolved: pytorch#167760 Approved by: https://github.com/ngimel
# Motivation
This is definitely a bug: we were attempting to release cached memory back to the system without proper **synchronization**. Callers must ensure that all accesses to memory blocks allocated by SYCL APIs have completed before invoking `sycl::free`.
For a simple example, in the following code:
```python
pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device='xpu')
    sum = input.sum()
del pool
print(sum)
```
`sum` may exhibit undefined behavior because `input.sum()` might not have finished executing before `del pool` triggers `input`'s memory release.
With this fix, we ensure that all kernels on the associated streams complete before the memory pool is destroyed, guaranteeing that `sum` holds the correct value.
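Put differently, before the fix the user could only avoid the race by synchronizing manually; a hedged sketch of that workaround:

```python
import torch

pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device="xpu")
    sum_ = input.sum()
torch.xpu.synchronize()  # ensure input.sum() finished before the pool frees its blocks
del pool
print(sum_)
```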
# Solution
Because `c10::xpu::syncStreamsOnDevice` has host overhead, we use a boolean flag `streams_synced` to ensure it is called only once.
Pull Request resolved: pytorch#168074
Approved by: https://github.com/EikanWang
…rch#167923) As in the title. The my_shape test is added to reproduce https://github.com/pytorch/audio/actions/runs/19395471276/job/55494871226. Pull Request resolved: pytorch#167923 Approved by: https://github.com/janeyx99, https://github.com/mikaylagawarecki
…6833) The implementation plan of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] pytorch#166831
- [ ] ->pytorch#166833
- [ ] pytorch#166843

Pull Request resolved: pytorch#166833 Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key. Pull Request resolved: pytorch#168063 Approved by: https://github.com/yushangdi
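A loose sketch of the keying idea with hypothetical names (the real change lives in inductor's wrapper codegen):

```python
cache: dict[tuple[int, str], str] = {}

def lower_cached(graph, name: str) -> str:
    # Keying on id(graph) prevents entries for one GraphLowering from being
    # shared with, or keeping alive, another GraphLowering's state.
    key = (id(graph), name)
    if key not in cache:
        cache[key] = f"lowered::{name}"  # stand-in for real codegen output
    return cache[key]
```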
Pull Request resolved: pytorch#167769 Approved by: https://github.com/ngimel, https://github.com/leofang
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#168111 Approved by: https://github.com/ezyang
…torch#166273) Partially vibe-coded with ClaudeCode, and changes the following ops (summary also created by Claude):

- **Activation operations**: Added checks rejecting Long, Complex, and Bool types for operations like log_softmax, log_sigmoid, mish, softplus, and silu, as MPS doesn't support exponent operations on these types
- **Linear algebra operations**: Restricted linalg_lu_factor, linalg_solve, and linalg_solve_triangular to Float type only (previously only checked for complex types)
- **Pooling operations**: Added checks to reject Complex types for avg_pool2d and max_pool2d operations
- **Loss functions**: Added type checks for nll_loss (Complex), huber_loss (Long, Complex), and grid_sampler_2d (Complex)
- **Reduction operations**:
  - Fixed NANSUM to handle integral types correctly (they can't contain NaN, so it just performs a regular sum)
  - Added a Long type check for std/var operations
- **Other operations**:
  - softmax: now explicitly requires floating-point types
  - bincount: rejects Bool type to prevent crashes

All checks use `TORCH_CHECK_NOT_IMPLEMENTED`. Pull Request resolved: pytorch#166273 Approved by: https://github.com/manuelcandales
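A small check of the user-visible effect, assuming an MPS device is available; the exact exception type is an assumption based on `TORCH_CHECK_NOT_IMPLEMENTED`:

```python
import torch

if torch.backends.mps.is_available():
    x = torch.ones(4, dtype=torch.long, device="mps")
    try:
        torch.nn.functional.softplus(x)  # Long is now rejected up front
    except (NotImplementedError, RuntimeError) as e:
        print("rejected:", e)
```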
Summary: Shrink binary size to reduce relocation overflows. The most important change is to split `intrusive_ptr::reset_()` into two functions and mark the bigger one as `C10_NOINLINE`. Differential Revision: D87308588 Pull Request resolved: pytorch#168080 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet, https://github.com/ezyang
# Motivation
Thanks to @KarhouTam for finding the issue mentioned in pytorch#167172. This PR aims to improve the build logic for Kineto activities.
# Additional Context
Fixes pytorch#167172
Pull Request resolved: pytorch#167204 Approved by: https://github.com/EikanWang, https://github.com/ezyang
The test is skipped under a condition that also needs to be checked here; otherwise, if the condition is not met and the test still executes, it fails because the exit code is -6 rather than zero. Fixes pytorch#154441 @kwen2501 Pull Request resolved: pytorch#167971 Approved by: https://github.com/kwen2501
This is a ~20x speedup for benchmarks/dynamo/microbenchmarks/optree_tree_map.py Pull Request resolved: pytorch#168342 Approved by: https://github.com/anijain2305
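For context, the benchmarked operation maps a function over a nested pytree; a minimal optree example:

```python
import optree

tree = {"a": [1, 2], "b": (3, {"c": 4})}
print(optree.tree_map(lambda x: x * 2, tree))
# {'a': [2, 4], 'b': (6, {'c': 8})}
```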
`tuple[int]` -> `tuple[int, ...]` (one element -> arbitrarily many). Like a shape annotation: `shape: tuple[int, ...]  # [B, Hq, M, Hkv, N, D]`. Inspired by pytorch#168320. Pull Request resolved: pytorch#168892 Approved by: https://github.com/cyyever, https://github.com/Skylion007
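A two-line illustration of the difference:

```python
shape_one: tuple[int] = (2,)            # exactly one int
shape_any: tuple[int, ...] = (2, 3, 4)  # any number of ints, like a shape
```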
…ch/_inductor/codegen/cuda (pytorch#160685)" This reverts commit 7556637. Reverted pytorch#160685 on behalf of https://github.com/yangw-dev due to failed internal tests: test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name], KeyError: 'cuda.cutlass_dir'. Diff: D87660662 ([comment](pytorch#160174 (comment)))
…pytorch#160174)" This reverts commit 008ac43. Reverted pytorch#160174 on behalf of https://github.com/yangw-dev due to failed internal tests: test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name], KeyError: 'cuda.cutlass_dir'. Diff: D87660662 ([comment](pytorch#160174 (comment)))
…ytorch#167456)" This reverts commit 4ee6b3d. Reverted pytorch#167456 on behalf of https://github.com/yangw-dev due to a failed internal test (Diff D87660150), error: ModuleNotFoundError: No module named 'extension_backends' ([comment](pytorch#167456 (comment)))
…13 RecursionError problems (pytorch#167888)" This reverts commit 24e1958. Reverted pytorch#167888 on behalf of https://github.com/yangw-dev due to a failed internal test ("Tracing payload for Mock should not be called: pt2_compile_chromium_events; Fatal Python error: Segmentation fault"); please remerge after fixing it ([comment](pytorch#167888 (comment)))
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#168315 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#168365 Approved by: https://github.com/anijain2305
…68893) Pull Request resolved: pytorch#168893 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168365
…ytorch#168221) To avoid circular import issues: - utils.py used to include registration functions which import/depend on DTensor.sharding_propagator - I plan to use other utils from utils.py inside sharding_propagator.py Pull Request resolved: pytorch#168221 Approved by: https://github.com/albanD
This was a workaround for gcc-8 on ARM, introduced by pytorch#44199, which is no longer relevant as CentOS-8 is past its EOL. I was reminded of it while looking at pytorch#168907. Pull Request resolved: pytorch#168909 Approved by: https://github.com/Skylion007, https://github.com/nimeduhansaka
This PR adds comprehensive benchmarks for PyTorch optimizers to measure optimizer.step() performance across different parameter configurations (a sketch of the approach follows below).

### Optimizers benchmarked:
- AdamW
- Adam
- SGD (with momentum=0.9)
- RMSprop
- Adagrad

### Test configurations:
- num_params: [1, 10, 100]
- param_size: [100K, 1M, 10M]

Pull Request resolved: pytorch#168101 Approved by: https://github.com/slayton58
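A rough sketch of how such a step benchmark can be written; the harness below is an assumption, not the PR's code:

```python
import time
import torch

def bench_step(opt_cls, num_params=10, param_size=1_000_000, iters=50, **kwargs):
    params = [torch.randn(param_size, requires_grad=True) for _ in range(num_params)]
    for p in params:
        p.grad = torch.randn_like(p)
    opt = opt_cls(params, **kwargs)
    opt.step()  # warm-up; also materializes optimizer state
    start = time.perf_counter()
    for _ in range(iters):
        opt.step()
    return (time.perf_counter() - start) / iters

print(bench_step(torch.optim.SGD, lr=0.01, momentum=0.9))
```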
…orch#168369) Previously, these error messages would get truncated when they were hit on device 0, because the device index is a "char" (actually an int8_t), so device 0 becomes byte 0, which is interpreted as the null terminator of a C string. Essentially, it is the same issue as pytorch#123984. There is something strange in the TORCH_CHECK_WITH macro that is causing this; I don't feel like figuring out those obscure macro details right now, though. Pull Request resolved: pytorch#168369 Approved by: https://github.com/eqy
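A toy illustration of the mechanism, with Python standing in for the C++ formatting (chr(0) is the terminator a C-string consumer stops at):

```python
device = 0  # DeviceIndex is an int8_t, so formatting it as a char yields byte 0
msg = "CUDA error on device " + chr(device) + ": out of memory"
print(msg.split("\x00")[0])  # a C-string consumer sees only "CUDA error on device "
```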
…h#167713) Fix an API typo in the CUDA graph tutorial; the API given there is wrong. Pull Request resolved: pytorch#167713 Approved by: https://github.com/jerryzh168
## Remove deprecated `split_cat_fx_passes`

First of a couple of small PRs that remove deprecated and unused code. Remove the deprecated `split_cat_fx_passes` configuration variable from the inductor config and clean up associated test patches.

### Changes
- Remove `split_cat_fx_passes` from `torch/_inductor/config.py`
- Remove `@patch.object(config, "split_cat_fx_passes", False)` decorators from tests in `test/inductor/test_perf.py`

Pull Request resolved: pytorch#167738 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/cyyever
This PR removes unnecessary thrust::tie. Pull Request resolved: pytorch#168943 Approved by: https://github.com/ngimel
This PR removes unnecessary uses of thrust::tuple before moving to CCCL. Pull Request resolved: pytorch#168936 Approved by: https://github.com/ngimel
… func (pytorch#167723) Fixes pytorch#167197. The inductor backend was trying to convert the float infinity value to an integer in pow lowering (possibly for indexing, iteration counts, or type conversions). Python/C++ cannot convert float('inf') to an integer, which caused the overflow error; a minimal reproduction follows below. Pull Request resolved: pytorch#167723 Approved by: https://github.com/shunting314
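The failure mode reduces to:

```python
try:
    int(float("inf"))
except OverflowError as e:
    print(e)  # cannot convert float infinity to integer
```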
Pull Request resolved: pytorch#168394 Approved by: https://github.com/jansel
Pull Request resolved: pytorch#166436 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
This reverts commit 1328a02. Reverted pytorch#168122 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#168122 (comment)))
Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as NOT_BUILT
Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as SUCCESS
Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as ABORTED
rocm_base: 5ca076d
upstream_main: 654c5fb