Conversation

@pragupta pragupta commented Nov 4, 2025

rocm_base: 56002f4
upstream_main: 24db5c4

eellison and others added 30 commits October 30, 2025 16:54
Make the bucketer use multi-dtype bucketing for all-gathers.

Pull Request resolved: pytorch#166527
Approved by: https://github.com/IvanKobzarev, https://github.com/ezyang
Bucket all-reduce in the bucketer, thanks to @IvanKobzarev's earlier PR.

Pull Request resolved: pytorch#166528
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#166527
…orch#164518)"

This reverts commit 311ea0d.

Reverted pytorch#164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](pytorch#164518 (comment)))
… func `_init_one_process_group` (pytorch#166614)

To make the pg cache change easier and to modularize the code, we isolate the logic of process group creation into a separate function named `_init_one_process_group`.

Pull Request resolved: pytorch#166614
Approved by: https://github.com/lw
Fixes pytorch#165177

When converting guards to sources, if we are unable to get the expected symbol from symbol_to_source, try to get it from var_to_sources.

I was unable to make a simpler repro than what was described in the issue (which relies on llama3 - so inappropriate for a unit test).

Pull Request resolved: pytorch#165723
Approved by: https://github.com/bobrenjc93
Summary:
After a precision study, we concluded it is OK to use ACL's exp function in f32 erf().
We can keep erf inlined this way.

Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiling with clang-19:

Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85730452

Pull Request resolved: pytorch#166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
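For reference, a minimal way to time f32 erf on a 1M-element tensor. This is an illustrative micro-benchmark, not the operator_benchmark harness from the test plan above, so absolute numbers will differ from the ones quoted:
```python
import time
import torch

x = torch.randn(1_000_000, dtype=torch.float32)

for _ in range(3):          # warm-up
    torch.erf(x)

iters = 100
t0 = time.perf_counter()
for _ in range(iters):
    torch.erf(x)
elapsed_us = (time.perf_counter() - t0) / iters * 1e6
print(f"f32 erf: {elapsed_us:.3f}us")
```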
This reverts commit 9051940.

Reverted pytorch#166541 on behalf of https://github.com/atalman due to breaks internal test ([comment](pytorch#166541 (comment)))
…rd error (pytorch#166367)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#166367
Approved by: https://github.com/bdhirsh
…#165139)

As the title stated.

- AOTI_TORCH_CHECK depends on TORCH_CHECK_MSG, which is located in c10/util/Exception.h and may break BC
- AOTI_TORCH_CHECK is not used everywhere
- STD_TORCH_CHECK has ABI check tests.
Pull Request resolved: pytorch#165139
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
As the title stated.

**Changes**:
- torch/csrc/inductor(Part 2)

Pull Request resolved: pytorch#165119
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#165139
It appears to be unused based on `cd torch; rg has_symints`.

Pull Request resolved: pytorch#163667
Approved by: https://github.com/xmfan, https://github.com/azahed98, https://github.com/albanD
ghstack dependencies: pytorch#162990
Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions.

Pull Request resolved: pytorch#166624
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
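A small sketch of the recommended pattern (the helper name is illustrative, not from the PR):
```python
import torch

def is_torch_add(fn) -> bool:
    # Identity comparison against the torch function object; unlike `==`,
    # `is` cannot trigger arbitrary __eq__ overloads.
    return fn is torch.add

print(is_torch_add(torch.add))  # True
print(is_torch_add(torch.sub))  # False
```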
In some cases the warning from pytorch#147744 still gets emitted because [atexit hooks aren't called](python/cpython#114279).

Even in those cases, if the atexit hooks _were_ called you could end up with issues due to the directory being deleted in one process, but still being used elsewhere.

It's better all round to load these modules entirely in-memory.

Pull Request resolved: pytorch#157713
Approved by: https://github.com/xush6528
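A generic sketch of loading a module entirely in memory (not the exact change in this PR): no on-disk temp directory is created, so there is nothing for atexit to clean up and nothing another process can delete out from under you.
```python
import types

# Compile and execute the module source directly into a fresh module object.
source = "def answer():\n    return 42\n"
module = types.ModuleType("in_memory_mod")
exec(compile(source, "<in-memory>", "exec"), module.__dict__)
print(module.answer())  # 42
```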
After pytorch#163049, this PR fixes the type annotations to match the actual implementation for ProcessGroupXCCL::Options.
Pull Request resolved: pytorch#166418
Approved by: https://github.com/guangyey, https://github.com/ezyang
…h#165511)

# Motivation
Aligned with other backends, this PR introduces a new API, torch.xpu.get_per_process_memory_fraction, to allow users to retrieve the allowed memory fraction for a single process.

Pull Request resolved: pytorch#165511
Approved by: https://github.com/EikanWang, https://github.com/ezyang
ghstack dependencies: pytorch#165508, pytorch#165509, pytorch#165510
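A usage sketch for the new getter. The setter shown here is assumed to come from the earlier PRs in this ghstack (mirroring the CUDA counterpart); only the getter is introduced above.
```python
import torch

if torch.xpu.is_available():
    # Assumed to exist from the preceding stack entries.
    torch.xpu.set_per_process_memory_fraction(0.5)
    print(torch.xpu.get_per_process_memory_fraction())  # 0.5
```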
Fixes pytorch#166630

Pull Request resolved: pytorch#166631
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Fixes the diskspace cutoff message to report that the machine does not have `difference = 100 - diskspace_cutoff_int` space available.

Pull Request resolved: pytorch#166645
Approved by: https://github.com/jeffdaily
…serts (pytorch#166171)

Summary:
Using `CUDA_KERNEL_ASSERT_PRINTF` inside kernels allows us to log invalid values to the console (which can in turn be used to surface hopefully clearer error messages).

This does have an impact on the number of registers needed for the values being logged (I confirmed via diffing PTX that there is no other impact relative to using `__assert_fail`).

To avoid causing perf bottlenecks, this change adds a compile-time switch to enable more verbose errors in some of the common kernels that cause DSAs. There is also a Buck config that can be used to configure this switch more conveniently.

## Alternatives considered
I considered making the behavior of `CUDA_KERNEL_ASSERT_PRINTF` controllable via a compile-time macro instead of writing another wrapper for it but there are kernels where the extra register pressure is not as severe and in those cases, having more useful error messages by default is pretty useful.

Test Plan:
## Simple Python Driver:
```
# scatter_errors.py
import torch

def main() -> None:
    a = torch.rand(128, device="cuda:0")
    idx = torch.randint(0, 128, (100,), device="cuda:0")
    idx[0] = 9999  # out-of-bounds index to trigger the device-side assert
    b = torch.scatter(a, 0, idx, 555.0)
    print(b)

if __name__ == "__main__":
    main()
```

When running normally via:
```
$ buck2 run @//mode/opt  :scatter_errors
```
we see the following DSA message:
```
fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```

Running via:
```
$  buck2 run @//mode/opt -c fbcode.c10_enable_verbose_assert=1 :scatter_errors
```
however produces:
```
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0]: Assertion failed: `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"`: Expected 0 <= idx_dim < index_size (128), but got idx_dim = 9999
```

Differential Revision: D85185987

Pull Request resolved: pytorch#166171
Approved by: https://github.com/ngimel
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keeps the imports so that existing users of CP won't be affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with a 3000+ LoC change, and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.

Pull Request resolved: pytorch#166456
Approved by: https://github.com/Skylion007
Summary:

Blockwise 1x128 and 128x128 scaling is only available on CUDA >= 12.9
and only on Hopper GPUs. Attempting to run on B200 would give a
hard-to-debug `CUBLAS_STATUS_NOT_SUPPORTED`.

Add a more helpful `NotImplementedError` to catch this case.

Also more explicitly disable ROCm builds for relevant methods, based on
lack of support per [hipBLASlt
docs](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/reference/datatypes.html#_CPPv4N28hipblasLtMatmulMatrixScale_t40HIPBLASLT_MATMUL_MATRIX_SCALE_VEC128_32FE).

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#166639
Approved by: https://github.com/drisspg
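A hedged sketch of the guard described above. The real check lives in the scaled-mm dispatch path; equating "Hopper" with compute capability 9.x and the helper name are assumptions here.
```python
import torch

def check_blockwise_scaling_supported() -> None:
    if not torch.cuda.is_available():
        raise NotImplementedError("Blockwise 1x128 / 128x128 scaling requires CUDA")
    major, _minor = torch.cuda.get_device_capability()
    if major != 9:
        raise NotImplementedError(
            "Blockwise 1x128 and 128x128 scaling is only supported on Hopper GPUs "
            "with CUDA >= 12.9"
        )
```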
This PR removes unused loop variables in tests.

Pull Request resolved: pytorch#166509
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
…ch#166603)

Looking for feedback on this approach.
Received user reports of spurious pyrefly errors from users using hg instead of git. I think this is because, when using a venv and git, `make setup-env` installs requirements and pulls a nightly torch wheel, which pyrefly needs in order to type check properly.

Initial documentation for `make setup-env` I found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch

Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner # confirm error message / warning appears
>>> General linter failure:
  Warning (PYREFLY) nightly-wheel-not-run
    pytorch-nightly.pth not found. You may need to run make setup-env or make
    setup-env-conda to install nightly binaries and type stubs.
```
Pull Request resolved: pytorch#166603
Approved by: https://github.com/aorenste
)

Replace assert statements with explicit if/raise patterns in:

- torch/ao/quantization/fx/* (177 errors)

Partially fixes pytorch#164878

Pull Request resolved: pytorch#165420
Approved by: https://github.com/RohitRathore1, https://github.com/fffrog, https://github.com/albanD
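The pattern applied across torch/ao/quantization/fx in this PR, with illustrative names (not the actual upstream diff). Unlike `assert`, the explicit raise survives `python -O`:
```python
def get_qconfig(node_name: str, qconfig_map: dict):
    qconfig = qconfig_map.get(node_name)
    if qconfig is None:
        raise AssertionError(f"missing qconfig for node {node_name!r}")
    return qconfig
```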
This config improves performance by 250% on some kernels that contain `t1.atomic_add(...)`. Again, we conditionalize for ROCm/HIP, so there is no impact on NV.

Pull Request resolved: pytorch#166470
Approved by: https://github.com/PaulZhang12, https://github.com/mlazos, https://github.com/eellison, https://github.com/jansel
…h#166397)

According to the next commit to AOTriton:
ROCm/aotriton@8625c4f

These changes were missed in the 0.11b release:
pytorch#161754

Pull Request resolved: pytorch#166397
Approved by: https://github.com/jeffdaily
Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder but keeps the imports so that existing users of CP won't be
affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch
infra cannot accept a PR with a 3000+ LoC change, and git cannot recognize
that _context_parallel/_attention.py is moved from _attention.py because
we want to keep BC.

This is the second PR.

Pull Request resolved: pytorch#166501
Approved by: https://github.com/Skylion007
ghstack dependencies: pytorch#166456
Stonepia and others added 25 commits November 4, 2025 02:14
This PR enables the formerly skipped XPU tests in `test_fxir_backend.py`. No additional changes are needed for the features.

Pull Request resolved: pytorch#166493
Approved by: https://github.com/angelayi, https://github.com/EikanWang
This ensures that we graph break at the right time, leading to the right
stack trace.

Pull Request resolved: pytorch#166909
Approved by: https://github.com/tugsbayasgalan
…#165686)

Wraps the header in a C file and compiles it using a C compiler, which supports designated initializers.

Fix issue pytorch#160647

Pull Request resolved: pytorch#165686
Approved by: https://github.com/williamwen42
In pytorch#166440, I didn't realize you could turn on anomaly mode while disabling NaN checks for these stacks. This adds them to `debug_mode.operators[*].fwd_stack_trace`.

Pull Request resolved: pytorch#166842
Approved by: https://github.com/yushangdi, https://github.com/mikaylagawarecki
# Motivation
This PR intends to add the `ExpandableSegment` struct, which helps support the expandable segment feature. I split it into a separate PR to facilitate code review.

Pull Request resolved: pytorch#166299
Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui
…6292)

# Motivation
This PR intends to add expandable segment feature support on XPU. This will help
- Reduce memory fragmentation;
- Gradually map physical pages into virtual address space as needed.

# Additional Context
The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor sizes, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail.
The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead.
The potential drawbacks are
- Virtual memory overhead;
- Potential page mapping overhead;
- Increased complexity.

Pull Request resolved: pytorch#166292
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: pytorch#166299
# Motivation
This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices.
It supports two categories of allocations:
- Regular allocations;
- Expandable segment allocations.

# Additional Context
The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported.

Pull Request resolved: pytorch#166424
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: pytorch#166299, pytorch#166292
… to trace through typing.get_type_hints (pytorch#166757)

This is covered by `test_get_type_hints` in test/dynamo/test_repros.py

Pull Request resolved: pytorch#166757
Approved by: https://github.com/Lucaskabela
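A small sketch of the behavior being tested: a compiled function can now call typing.get_type_hints without dynamo graph-breaking on it (backend="eager" keeps the example lightweight; the function names are illustrative).
```python
import typing
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    return x + 1

@torch.compile(backend="eager")
def g(x):
    # Dynamo traces through typing.get_type_hints instead of graph-breaking.
    hints = typing.get_type_hints(f)
    return x * 2 if hints["return"] is torch.Tensor else x

print(g(torch.randn(3)))
```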
This PR continues applying the ruff UP035 rule to inductor code. The UP035 rule targets Python 3.10+ syntax and library locations.

Pull Request resolved: pytorch#166225
Approved by: https://github.com/aorenste
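An illustrative before/after of the kind of rewrite UP035 drives (deprecated typing imports replaced with collections.abc / builtin generics); the function is made up for the example.
```python
# Before (flagged by UP035):
#     from typing import Callable, List
#     def run_all(fns: List[Callable[[], None]]) -> None: ...

# After the rewrite:
from collections.abc import Callable

def run_all(fns: list[Callable[[], None]]) -> None:
    for fn in fns:
        fn()
```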
…#166839)

Fixes pytorch#166838
1. Register the `convolution_overrideable` key for the flop_counter. CUDA relies on keys like `cudnn_convolution`; devices like `XPU` fall back to `convolution_overrideable`. Without the correct registration, the flop_counter silently returns 0 for XPU in line:
https://github.com/pytorch/pytorch/blob/e1d011d6eb571cd98ec7c7ed8e8b518a5463ec97/torch/_inductor/analysis/profile_analysis.py#L178-L179

2. Enable the tests for XPU in `test_analysis.py`.

Pull Request resolved: pytorch#166839
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jansel
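As a quick sanity check of convolution flop counting in general (the PR itself fixes the registration used by the inductor profile analysis, so this is only an adjacent illustration):
```python
import torch
from torch.utils.flop_counter import FlopCounterMode

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 32, 32)
with FlopCounterMode(display=False) as counter:
    conv(x)
# Non-zero once the conv flop formula is registered for the dispatched key.
print(counter.get_total_flops())
```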
Make sure you're on cutlass 4.2.0+

Test Plan:
Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16  --num-inputs 1 --metrics tflops,accuracy`

Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`

Differential Revision: D82010227

Pull Request resolved: pytorch#165036
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg, https://github.com/mlazos
This reverts commit c21868b.

Reverted pytorch#162275 on behalf of https://github.com/izaitsevfb due to breaking test_rms_norm_bwd_float32_split_reductions_True_shape2 ([comment](pytorch#162275 (comment)))
…#166830)

# Motivation

I believe this is a bug - here's why:
In [dnnl_common_types.h](https://github.com/uxlfoundation/oneDNN/blob/98132c49080c19caffee1f7ba65be83339cba985/include/oneapi/dnnl/dnnl_common_types.h#L116-L125), `md_padded_dims` is defined as a pointer to an `int64_t[12]` array;
We can confirm this from the implementation in [memory_desc.cpp](https://github.com/uxlfoundation/oneDNN/blob/98132c49080c19caffee1f7ba65be83339cba985/src/common/memory_desc.cpp#L746-L748) where the member indeed points to an internal array.

# Solution

Therefore, when accessing `md_padded_dims`, we should first dereference the pointer and then use it with an index - directly using it without dereferencing would corrupt memory.

Pull Request resolved: pytorch#166830
Approved by: https://github.com/EikanWang
…peError (pytorch#165264)

Fixes pytorch#164684

### Description

Symbolic tracing fails during multiplication between a `SymBool` and a `Tensor`. This scenario is triggered when `.item()` is called on a 0-dim boolean tensor within a `torch.compile` region. In compile mode, this yields a `SymBool`, and the subsequent `SymBool * FakeTensor` operation is unsupported, leading to a `TypeError` or a data-dependent `UserError`.

### Solution

This PR addresses the issue at the type-conversion level, as suggested by reviewers.

The root cause of the TypeError is that torch.sym_float() (which is called by _maybe_convert_to_dtype during type promotion for aten.mul) lacks a conversion path for SymBool and incorrectly falls back to builtins.float(SymBool).

This fix addresses this by implementing the __sym_float__(self) method within the SymBool class (defined in torch/__init__.py).

The torch.sym_float(a) utility function is already designed to check for hasattr(a, "__sym_float__") before falling back to builtins.float(). By adding this method, SymBool instances now correctly advertise their ability to be cast to SymFloat. The new method implementation leverages self.node.sym_float() to correctly convert the symbolic boolean value to its symbolic float representation (0.0 or 1.0), resolving the TypeError at its source.

This approach is more fundamental than modifying a specific operation in builtin.py and ensures SymBool can be correctly promoted to SymFloat in any operation, while still preserving its boolean nature for control flow operations like guard_or_false (which is verified by a new test case).

### Verification

1.  **Bug Reproduced**: The initial `UserError: Could not guard on data-dependent expression` was successfully reproduced with the script from the issue. As shown below
<img width="1369" height="945" alt="Screenshot 2025-10-13 at 10 29 05" src="https://github.com/user-attachments/assets/8daa4555-3347-4af5-906a-02150b8df9d1" />

2.  **Fix Validated**: After applying the code changes, the same script now runs to completion, printing `✅ eager success` and `✅ compile success`. As shown below
<img width="1228" height="82" alt="Screenshot 2025-10-13 at 10 29 21" src="https://github.com/user-attachments/assets/94c4f143-b898-4dda-9bff-0ad5450a30fa" />

3. Added a new test class DynamoOpPromotionTests to test/dynamo/test_misc.py with three new test cases:
    1. test_symbool_tensor_mul_does_not_fail: Verifies that the original bug report code (with .item() + *) no longer raises an error when compiled.
    2. test_symbool_guard_or_false: Verifies that this fix does not cause a regression for guard_or_false(SymBool) (the concern raised by reviewers).
    3. test_symbool_tensor_mul: Verifies the behavior of Tensor(bool) * Tensor(float) (without .item()) for completeness.

All new tests were added and pass locally.

Pull Request resolved: pytorch#165264
Approved by: https://github.com/laithsakka, https://github.com/Lucaskabela
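A minimal repro in the spirit of pytorch#164684 and the tests above (not the exact test code): with `__sym_float__` in place, the SymBool produced by `.item()` promotes cleanly to SymFloat.
```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(backend="eager")
def f(flag_t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    flag = flag_t.item()   # yields a SymBool under compile
    return flag * x        # previously raised TypeError / data-dependent UserError

print(f(torch.tensor(True), torch.randn(4)))
```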
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#166867
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#166868
# Motivation
This PR aims to reuse some unit tests to validate the expandable segment feature.

# Additional Context
Currently, the failure is related to the internal tracker `GSD-11403`; we should get the fix when upgrading the driver to `ci-neo-master-034630` or greater.
TODO: add conv and gemm tests to this test case when upgrading the driver.

Pull Request resolved: pytorch#166495
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: pytorch#166299, pytorch#166292, pytorch#166424
Summary:

cuBlasLt enforces size/stride requirements for 1x128 and 128x128 blockwise scaling
kernels, some of which weren't being handled, causing silent incorrect
answers especially for 128x128 scaling cases.

For deepseek-style scaling, with `A: MxN`, `B: KxN`, cuBlasLt enforces ([docs](https://docs.nvidia.com/cuda/cublas/#scaling-factors-layouts)) the following:

```Py
L = K // 128
L4 = round_up(L, 4)

1x128 x 128x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [L4, N // 128], stride: [1, L4]

128x128 x 1x128:
* A_scale: [L4, M // 128], stride: [1, L4]
* B_scale: [N, K // 128], stride: [1, N]

1x128 x 1x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [N, K // 128], stride: [1, N]
```

Notable here is the `L4` term, which means we must round up to the nearest multiple of 4 blocks
in the `K` dimension. This wasn't enforced previously, causing silent wrong answers
when `(K // 128) % 4 != 0`.

Subscribers: @vkuzo
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#166752
Approved by: https://github.com/drisspg, https://github.com/vkuzo
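As a concrete illustration of the `L4` rounding and the required scale layouts above, a minimal sketch with made-up sizes (not taken from the PR): `K // 128 == 3`, so `L4` rounds up to 4.
```python
import torch

def round_up(x: int, multiple: int) -> int:
    return (x + multiple - 1) // multiple * multiple

M, N, K = 256, 512, 384
L4 = round_up(K // 128, 4)

# 1x128 (A) x 128x128 (B): column-major scale tensors matching the shapes/strides above.
A_scale = torch.empty(K // 128, M, dtype=torch.float32).t()   # [M, K//128], stride [1, M]
B_scale = torch.empty(N // 128, L4, dtype=torch.float32).t()  # [L4, N//128], stride [1, L4]
print(A_scale.shape, A_scale.stride())  # torch.Size([256, 3]) (1, 256)
print(B_scale.shape, B_scale.stride())  # torch.Size([4, 4]) (1, 4)
```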
Forward fix for failing Docker release builds
Related to: pytorch#166897

Nightly Docker build failure https://github.com/pytorch/pytorch/actions/runs/18900508440/job/53946606434
Due to missing base image:
```
ERROR: failed to build: failed to solve: docker.io/nvidia/cuda:13.0.2-devel-ubuntu22.04: not found
```

Pull Request resolved: pytorch#166904
Approved by: https://github.com/tinglvv, https://github.com/malfet
# Conflicts:
#	.ci/docker/requirements-ci.txt
@pragupta pragupta requested a review from jeffdaily as a code owner November 4, 2025 15:37
@pragupta pragupta merged commit 3d74218 into develop Nov 4, 2025
11 checks passed
@pragupta pragupta deleted the develop_IFU_20251104 branch November 4, 2025 15:41