@pragupta pragupta commented Nov 24, 2025

rocm_base: 5ca076d
upstream_main: 654c5fb

pytorchmergebot and others added 30 commits November 18, 2025 17:20
…7661)"

This reverts commit 1b43d6c.

Reverted pytorch#167661 on behalf of https://github.com/yangw-dev due to breaking internal tests and builds; please reach out to Meta folks to have it fixed and reland again. Error example: hip/KernelUtils.cuh:74:5: error: no matching function for call to 'unsafeAtomicAdd' ([comment](pytorch#167661 (comment)))
Summary: The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative. The decorator will emit a FutureWarning at runtime, and the docstring now includes a .. deprecated:: directive for documentation visibility.
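For reference, a minimal sketch of the recommended snapshot workflow mentioned above (file name and argument values are illustrative, not prescriptive):

```python
import torch

# Start recording allocator events (the newer memory snapshot API).
torch.cuda.memory._record_memory_history(max_entries=100000)

x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump the recorded history; the file can be inspected at pytorch.org/memory_viz.
torch.cuda.memory._export_memory_snapshot("snapshot.pickle")

# Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)
```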

Test Plan: Manual verification that the decorator is properly applied and the deprecation message is informative.

Differential Revision: D87272399

Pull Request resolved: pytorch#168036
Approved by: https://github.com/valentinandrei
This PR introduces a `Tensor` subclass which represents a complex tensor in terms of two real ones. Ops are decomposed as individual ops  on the real and imaginary parts.

It is compatible with `torch.compile`, so long as the real ops used are also compatible. Autograd "works", but is WIP due to different edge-case behaviour.
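As a toy illustration of the decomposition idea (not the subclass itself), complex multiplication can be expressed purely in terms of the real and imaginary parts:

```python
import torch

def complex_mul(ar, ai, br, bi):
    # (ar + i*ai) * (br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br)
    return ar * br - ai * bi, ar * bi + ai * br

a = torch.randn(3, dtype=torch.cfloat)
b = torch.randn(3, dtype=torch.cfloat)
cr, ci = complex_mul(a.real, a.imag, b.real, b.imag)
torch.testing.assert_close(torch.complex(cr, ci), a * b)
```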
Pull Request resolved: pytorch#167621
Approved by: https://github.com/ezyang
Repetition of pytorch#155708
This has been broken for a while, and the ET pin in PyTorch is so old that `torch==2.10.0.dev20250915` can no longer be found in the nightly indices.
Pull Request resolved: pytorch#168090
Approved by: https://github.com/atalman, https://github.com/yangw-dev
This PR enables special matmuls on Thor devices. This includes row-wise scaled matmul on `fp8` and group gemm on `bfloat16`.
Pull Request resolved: pytorch#164836
Approved by: https://github.com/ngimel
…orch#167395)

This adds a debug HTTP server for debugging stuck or slow jobs. It runs the WorkerServer on every worker and then launches a separate Flask process on rank 0 that users connect to for debugging.

This can easily be improved to trigger profilers as well as visualize the data much better.

Initial handlers:
* pytorch profiler
* FlightRecorder data
* Python stacks

```
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server

enable_debug_server()
```

Test plan:

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
```

<img width="2000" height="1045" alt="20251117_16h58m18s_grim" src="https://github.com/user-attachments/assets/82305b75-227c-4412-a481-00b622db6bd1" />
<img width="2172" height="1624" alt="20251117_16h58m11s_grim" src="https://github.com/user-attachments/assets/def9841c-c7e6-483a-81c3-cf0c56f6bad8" />
<img width="1985" height="1635" alt="20251117_16h58m03s_grim" src="https://github.com/user-attachments/assets/04fcf148-df58-41b4-8754-8706ee0d1de6" />

Pull Request resolved: pytorch#167395
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/atalman
…ytorch#167079)

Summary:
As title.

Knowing the size of the leaked tensor is useful; it allows us to focus on the largest leaks.

Differential Revision: D86218574

Pull Request resolved: pytorch#167079
Approved by: https://github.com/kausv
…torch#161703)

This is another PR to port distributed tensor tests to Intel GPU; the other PR is pytorch#161604.
We enable Intel GPU with the following methods while trying to keep the original code style:

Use torch.accelerator for generic GPU support
Skip cases with known issues when running on XPU

Pull Request resolved: pytorch#161703
Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/albanD
The all-gather bucketing already went part of the way toward fusing dtype casts into the bucket. We do this by allocating the group bucket buffer, then viewing each slice of it as the destination dtype. We then foreach_copy_ into the allocated buffer, with each collective copying into its destination dtype.

This logic was causing an issue in a later part of the stack but was not fully firing, so we might as well fix it.

Note: custom ops don't yet support list[dtype], so I worked around it with list[int]; this will be fixed in a follow-up.
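A standalone sketch of that copy-with-cast pattern, with assumed shapes and dtypes rather than the PR's actual bucketing code:

```python
import torch

# Two "collectives" with different source and destination dtypes.
srcs = [torch.randn(4), torch.randn(8, dtype=torch.float64)]
dst_dtypes = [torch.bfloat16, torch.float32]

nbytes = [s.numel() * d.itemsize for s, d in zip(srcs, dst_dtypes)]
bucket = torch.empty(sum(nbytes), dtype=torch.uint8)  # one flat group buffer

views, offset = [], 0
for s, d, n in zip(srcs, dst_dtypes, nbytes):
    # View this byte slice of the bucket as the destination dtype.
    views.append(bucket[offset:offset + n].view(d).view(s.shape))
    offset += n

# Each copy casts its source into the dtype of the corresponding view.
torch._foreach_copy_(views, srcs)
```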

Pull Request resolved: pytorch#167853
Approved by: https://github.com/ruisizhang123
ghstack dependencies: pytorch#167852
The bucketing dtype fusion was causing nodes that had dependencies to be erased. Transfer those deps over to the new nodes, and also add an assertion that none of our deps are erased to catch this type of error in the future.

Pull Request resolved: pytorch#167863
Approved by: https://github.com/fmassa
ghstack dependencies: pytorch#167852, pytorch#167853
Since the currently intended workflow on the new MI3xx CI capacity is [trunk-rocm-mi300.yml](https://github.com/pytorch/pytorch/blob/d91269e8ce309437c1f849b5ab3362d69b178ef4/.github/workflows/trunk-rocm-mi300.yml#L54), which only needs the jammy images, we limit caching to those images to optimize docker caching times.

Pull Request resolved: pytorch#168088
Approved by: https://github.com/jeffdaily
For GPU: it was previously reported that only a single sample could be tested with the huber_loss functional. The current snapshot of the code does not appear to suffer from the numerical issues reported before.

For CPU: while testing GPU, it was discovered that huber_loss with Half appears to be numerically unstable. This commit resolves the CPU issue by upcasting Half to float for the computation.

Pull Request resolved: pytorch#166952
Approved by: https://github.com/benjaminglass1, https://github.com/isuruf
…h/csrc/Exceptions.h (pytorch#168056)

Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D87273132

Pull Request resolved: pytorch#168056
Approved by: https://github.com/malfet, https://github.com/Skylion007
Summary: If the Tensor has a PyObject, its use count will now be two instead of one.

Test Plan: `buck test -j 18 fbcode//mode/dev-nosan fbcode//caffe2/test:torch`

Differential Revision: D87297965

Pull Request resolved: pytorch#168060
Approved by: https://github.com/albanD, https://github.com/Skylion007
As the compiler has not been supported for the last 3 years and all manylinux2_28 builds should have at least gcc-11.

Prep change for C++20 standard migration
Pull Request resolved: pytorch#167933
Approved by: https://github.com/yangw-dev, https://github.com/atalman
ghstack dependencies: pytorch#168090
…rch#168104)

We only want to cache the latest CI docker image for `main` and `release` branches in cases where multiple `docker-builds` workflow runs get triggered in quick succession. This is because the latest run will overwrite the cached images anyway, since we do not maintain a cached image per SHA; instead it's only one per branch (to minimize cache size and docker load times at runner bringup).

Also removing `workflow_dispatch` as a trigger since it won't work (needs artifacts from `docker-builds` run)

Pull Request resolved: pytorch#168104
Approved by: https://github.com/jeffdaily
Fixes pytorch#167905

Below typo correction has been done.

Existing comment:
// List of Any can contains heterogenous types

Suggested comment:
// List of Any can contains heterogeneous types
Pull Request resolved: pytorch#167907
Approved by: https://github.com/albanD
Unclear which PR in the ghstack caused the ROCm failure. Stack was (oldest at bottom):
 - pytorch#167962
 - pytorch#167804
 - pytorch#167803
 - pytorch#167802
 - pytorch#168025

Fixes the following test:

PYTORCH_TEST_WITH_ROCM=1 python test/cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility.py FunctionVersionCompatibilityTest.test_mv_tensor_accessor_cuda_works_with_2_9

Pull Request resolved: pytorch#168087
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Fixes false negative (illusion):  "all B200 periodic nvshmem-triton tests passed"

Pull Request resolved: pytorch#167760
Approved by: https://github.com/ngimel
# Motivation
This is definitely a bug: we were attempting to release cached memory back to the system without proper **synchronization**. Callers must ensure that all accesses to memory blocks allocated by SYCL APIs have completed before invoking `sycl::free`.

For a simple example, in the following code:
```python
pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device='xpu')
sum = input.sum()
del pool
print(sum)
```
`sum` may exhibit undefined behavior because `input.sum()` might not have finished executing before `del pool` triggers `input`'s memory release.

With this fix, we ensure that all kernels on the associated streams complete before the memory pool is destroyed, guaranteeing that `sum` holds the correct value.

# Solution
Because `c10::xpu::syncStreamsOnDevice` has host overhead, we use a boolean flag `streams_synced` to ensure it is called only once.
Pull Request resolved: pytorch#168074
Approved by: https://github.com/EikanWang
…6833)

The implementation plan of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] pytorch#166831
- [ ] ->pytorch#166833
- [ ] pytorch#166843
Pull Request resolved: pytorch#166833
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
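A hypothetical illustration of that keying idea (not the actual inductor code): keying the cache on `id(graph)` means the cache never stores a reference to the GraphLowering object or its constant tensors.

```python
# Cache keyed on (object id, name); FakeGraphLowering is a made-up stand-in.
_cache: dict[tuple[int, str], str] = {}

class FakeGraphLowering:
    def __init__(self, constants):
        self.constants = constants  # constant tensors we do not want kept alive

def cached_wrapper_name(graph: FakeGraphLowering, name: str) -> str:
    key = (id(graph), name)  # integer id: no reference to `graph` is stored
    if key not in _cache:
        _cache[key] = f"wrapper::{name}"
    return _cache[key]
```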

Pull Request resolved: pytorch#168063
Approved by: https://github.com/yushangdi
Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#168111
Approved by: https://github.com/ezyang
…torch#166273)

Partially vibe-coded with ClaudeCode; changes the following ops (summary also created by Claude):
- **Activation operations**: Added checks rejecting Long, Complex, and Bool types for operations like log_softmax, log_sigmoid, mish, softplus, and silu, as MPS doesn't support exponent operations on these types

- **Linear algebra operations**: Restricted linalg_lu_factor, linalg_solve, and linalg_solve_triangular to Float type only (previously only checked for complex types)

- **Pooling operations**: Added checks to reject Complex types for avg_pool2d and max_pool2d operations

- **Loss functions**: Added type checks for nll_loss (Complex), huber_loss (Long, Complex), and grid_sampler_2d (Complex)

- **Reduction operations**:
  - Fixed NANSUM to handle integral types correctly (integral tensors can't contain NaN, so it just performs a regular sum; see the snippet after this list)
  - Added Long type check for std/var operations

- **Other operations**:
  - softmax: Now explicitly requires floating point types
  - bincount: Rejects Bool type to prevent crashes

All checks use `TORCH_CHECK_NOT_IMPLEMENTED`
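A quick check of the NANSUM note from the reduction bullet above (illustrated here on CPU, but the property is general): integral tensors cannot hold NaN, so nansum reduces to a plain sum.

```python
import torch

x = torch.tensor([1, 2, 3], dtype=torch.long)
assert torch.equal(torch.nansum(x), torch.sum(x))  # no NaNs possible in integer dtypes
```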
Pull Request resolved: pytorch#166273
Approved by: https://github.com/manuelcandales
Summary: Shrink binary size to reduce relocation overflows. The most important change is to split `intrusive_ptr::reset_()` into two functions and mark the bigger one as `C10_NOINLINE`.

Differential Revision: D87308588

Pull Request resolved: pytorch#168080
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet, https://github.com/ezyang
# Motivation
Thanks to @KarhouTam for finding the issue mentioned in pytorch#167172
This PR aims to improve the build logic for Kineto activities.

# Additional Context
Fix pytorch#167172

Pull Request resolved: pytorch#167204
Approved by: https://github.com/EikanWang, https://github.com/ezyang
Flamefire and others added 23 commits November 22, 2025 14:57
The test is skipped on a condition that also needs to be applied here; otherwise, when the condition is not met and the test executes, it fails because the exit code is -6 rather than zero.

Fixes pytorch#154441

@kwen2501

Pull Request resolved: pytorch#167971
Approved by: https://github.com/kwen2501
This is a ~20x speedup for benchmarks/dynamo/microbenchmarks/optree_tree_map.py

Pull Request resolved: pytorch#168342
Approved by: https://github.com/anijain2305
tuple[int] -> tuple[int, ...]

i.e. one element -> arbitrarily many.

For example, for a shape: shape: tuple[int, ...]  # [B, Hq, M, Hkv, N, D]

Inspired by pytorch#168320
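In other words (a small annotation example, not code from the PR):

```python
shape_one: tuple[int] = (3,)                      # exactly one int
shape: tuple[int, ...] = (2, 8, 128, 8, 128, 64)  # any number of ints, e.g. [B, Hq, M, Hkv, N, D]
```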

Pull Request resolved: pytorch#168892
Approved by: https://github.com/cyyever, https://github.com/Skylion007
…ch/_inductor/codegen/cuda (pytorch#160685)"

This reverts commit 7556637.

Reverted pytorch#160685 on behalf of https://github.com/yangw-dev due to failed internal tests test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name] KeyError: 'cuda.cutlass_dir' Diff: D87660662 ([comment](pytorch#160174 (comment)))
…pytorch#160174)"

This reverts commit 008ac43.

Reverted pytorch#160174 on behalf of https://github.com/yangw-dev due to failed internal tests test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name] KeyError: 'cuda.cutlass_dir' Diff: D87660662 ([comment](pytorch#160174 (comment)))
…ytorch#167456)"

This reverts commit 4ee6b3d.

Reverted pytorch#167456 on behalf of https://github.com/yangw-dev due to failed internal test Diff D87660150, error: ModuleNotFoundError: No module named 'extension_backends' ([comment](pytorch#167456 (comment)))
…13 RecursionError problems (pytorch#167888)"

This reverts commit 24e1958.

Reverted pytorch#167888 on behalf of https://github.com/yangw-dev due to failed internal test: Tracing payload for Mock should not be called: pt2_compile_chromium_events Fatal Python error: Segmentation fault; please remerge after fixing it ([comment](pytorch#167888 (comment)))
…ytorch#168221)

To avoid circular import issues:
- utils.py used to include registration functions which import/depend on
  DTensor.sharding_propagator
- I plan to use other utils from utils.py inside sharding_propagator.py
Pull Request resolved: pytorch#168221
Approved by: https://github.com/albanD
This was a workaround for gcc-8 on ARM, introduced by pytorch#44199, which is no longer relevant as CentOS-8 is past its EOL.

Was reminded about it while looking at pytorch#168907

Pull Request resolved: pytorch#168909
Approved by: https://github.com/Skylion007, https://github.com/nimeduhansaka
This PR adds comprehensive benchmarks for PyTorch optimizers to measure optimizer.step() performance across different parameter configurations.

### Optimizers benchmarked:
  - AdamW
  - Adam
  - SGD (with momentum=0.9)
  - RMSprop
  - Adagrad

### Test configurations:
- num_params: [1, 10, 100]
- param_size: [100K, 1M, 10M]
Pull Request resolved: pytorch#168101
Approved by: https://github.com/slayton58
…orch#168369)

Previously, these error messages would get truncated when they were hit on device 0 because the device index is stored in a "char" (actually an int8_t), so a device index of 0 is interpreted as the null byte that terminates the string. Essentially, it is the same issue as pytorch#123984.

There's something strange in the TORCH_CHECK_WITH macro that is causing this. I don't feel like figuring out those obscure macro details right now, though.
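As a rough Python analogy (the actual code path is C++): a C-string consumer stops at the first zero byte, so a raw device index of 0 embedded as a char cuts the message short.

```python
import ctypes

msg = b"CUDA error on device " + bytes([0]) + b": out of memory"
print(ctypes.c_char_p(msg).value)  # b'CUDA error on device ' -- truncated at the 0 byte
```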

Pull Request resolved: pytorch#168369
Approved by: https://github.com/eqy
…h#167713)

Fix an API typo in the CUDA graph tutorial.

The API given in the CUDA graph tutorial was wrong.

Pull Request resolved: pytorch#167713
Approved by: https://github.com/jerryzh168
## Remove deprecated `split_cat_fx_passes`

First of a couple of small PRs that remove deprecated and unused code.

Remove the deprecated `split_cat_fx_passes` configuration variable from inductor config and clean up associated test patches.

### Changes
- Remove `split_cat_fx_passes` from `torch/_inductor/config.py`
- Remove `@patch.object(config, "split_cat_fx_passes", False)` decorators from tests in `test/inductor/test_perf.py`

Pull Request resolved: pytorch#167738
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/cyyever
This PR removes unnecessary thrust::tie.

Pull Request resolved: pytorch#168943
Approved by: https://github.com/ngimel
This PR removes unnecessary uses of thrust::tuple before moving to CCCL.
Pull Request resolved: pytorch#168936
Approved by: https://github.com/ngimel
… func (pytorch#167723)

Fixes pytorch#167197

The inductor backend is trying to convert the float infinity value to an integer in pow lowering (possibly for indexing, iteration counts, or type conversions). Python/C++ cannot convert float('inf') to an integer, causing the overflow error.
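The failure mode in isolation:

```python
try:
    int(float("inf"))
except OverflowError as e:
    print(e)  # cannot convert float infinity to integer
```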

Pull Request resolved: pytorch#167723
Approved by: https://github.com/shunting314
This reverts commit 1328a02.

Reverted pytorch#168122 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#168122 (comment)))
rocm-repo-management-api bot commented Nov 24, 2025

Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as NOT_BUILT
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Nov 24, 2025

Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as SUCCESS
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as ABORTED
Links: Blue Ocean view / Build artifacts

@pragupta pragupta merged commit f742da3 into develop Nov 25, 2025
43 of 44 checks passed
@pragupta pragupta deleted the develop_IFU_20251124 branch November 25, 2025 01:34