
[AUTOGENERATED] develop_IFU_20260211#2969

Merged
pragupta merged 1017 commits into develop from develop_IFU_20260211
Feb 12, 2026

@pragupta
Collaborator

rocm_base: fe101ec

justinchuby and others added 30 commits February 4, 2026 22:23
Implements ONNX export for `torch.ops.higher_order.invoke_subgraph`, which is created by `torch.compiler.nested_compile_region`.

Actual function preservation needs update in onnxscript optimizer and version converter to prevent inlining.

## Example

```python
import torch


class Model(torch.nn.Module):
    def forward(self, x, y):
        def inner_fn(a, b):
            return torch.mul(a, b) + a

        # Function preserved as separate entity in ONNX graph, not inlined (when onnxscript is updated)
        return torch.compiler.nested_compile_region(inner_fn)(x, y)


# Example inputs (shapes chosen for illustration; not specified in the original)
x = torch.randn(2, 3)
y = torch.randn(2, 3)
onnx_program = torch.onnx.export(Model(), (x, y), dynamo=True)
```

Replaces pytorch#172715
Fixes pytorch#172459

Pull Request resolved: pytorch#174283
Approved by: https://github.com/titaiwangms

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
On riscv64, installing lintrunner 0.12.7 from sdist fails because its build dependency maturin<0.13 cannot be installed:

pip install "maturin>=0.12,<0.13" fails with:
BackendUnavailable: Cannot import 'setuptools.build_meta'

This can be reproduced on x86 also.

Upgrading to maturin >= 1.0 (as done in lintrunner 0.12.11) resolves the issue.

Pull Request resolved: pytorch#173658
Approved by: https://github.com/malfet
…173558)

**Context**
Previously, list / dict comprehensions were treated as a function call and would add a new frame to the Python stack. As a result, if there was a graph break in the comprehension, Dynamo would only skip tracing the comprehension code. In Python 3.12, comprehensions are inlined into their surrounding function, so when we graph break, the entire function is skipped.

This PR handles list / dict comprehensions in Dynamo by only skipping tracing for the bytecode related to the comprehension.

References
- PEP709: https://peps.python.org/pep-0709/

**Solution**
1. The ops BUILD_LIST and BUILD_MAP are always at the beginning of list and dict comprehensions, respectively. When processing these ops in Dynamo, we check whether the preceding instructions indicate a comprehension. This is done by `_is_comprehension_start`, which dynamically retrieves the bytecode prefix for a comprehension using `get_comprehension_bytecode_prefix`. `get_comprehension_bytecode_prefix` builds a dummy list comprehension and gets the associated instruction opnames.
2. If we identify that we are in a comprehension, we check that we can speculate and that we are not in a nested comprehension. If these checks pass and speculation has not failed, we set a checkpoint via speculation.
3. If a graph break is triggered, we handle it in the normal way by restarting tracing. Once we reach the checkpoint set in BUILD_LIST / BUILD_MAP, we handle the graph break in `_handle_comprehension_graph_break`.
4. At a high level, this function compiles the graph up to the comprehension, adds the comprehension bytecode to be run eagerly, generates code to load any locals created in the comprehension, and creates a resume function for code after the comprehension.
5. Handling the comprehension graph break involves analysis of the bytecode to determine the instruction that ends the comprehension bytecode, the result variable (if there is one), whether the result should stay on the stack, what happens to the result, iterator variables that need to be restored, other locals produced / modified in the comprehension, and vars read from the outer scope. To help with this, we create the dataclass `ComprehensionAnalysis`, which is returned by `_analyze_comprehension`. This function also dynamically retrieves example bytecode sequences during analysis, so that it is resilient to bytecode changes across Python versions.
6. Finally, we resume tracing as usual.
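
The prefix lookup in step 1 can be illustrated with the standard `dis` module. This is only a sketch, not Dynamo's `get_comprehension_bytecode_prefix`: the helper and function names below are hypothetical. It shows that a dummy comprehension yields opnames to match against. Nested code objects are walked as well, so the sketch also works on Python versions before 3.12, where comprehensions were compiled as separate functions.

```python
import dis
import types


def collect_opnames(code: types.CodeType) -> list:
    """Return opnames of a code object, followed by those of any nested code objects."""
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            # Before PEP 709 the comprehension body lives in a nested code object.
            names.extend(collect_opnames(const))
    return names


def dummy_list_comp(items):
    return [x for x in items]


def dummy_dict_comp(items):
    return {x: x for x in items}


list_ops = collect_opnames(dummy_list_comp.__code__)
dict_ops = collect_opnames(dummy_dict_comp.__code__)

# On Python 3.12+, the opnames that precede BUILD_LIST in the function's own
# bytecode are the kind of prefix a detector could compare against.
print(list_ops)
print(dict_ops)
```

Deriving the opnames from a live dummy function, instead of hard-coding them, is what keeps such a check tolerant of bytecode changes between Python versions.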

**Edge Cases Handled**
1. Multiple comprehensions with a graph break in only one
2. Multiple comprehensions with graph breaks in all
3. Comprehension that calls a function that produces a graph break
4. Nested comprehensions with graph breaks
5. Comprehensions with multiple iterators
6. Comprehensions discarded without usage (as opposed to being assigned to a variable)
7. Comprehensions that are used in an expression before being stored to a variable
8. Comprehensions that are directly returned
9. One or more walrus operators (creating side effects) in a comprehension
10. Side effects nested in comprehensions
11. Comprehensions that mutate or read outer variables
12. Comprehensions that mutate or read global variables
13. Comprehensions that modify closure variables
14. List and dict comprehensions together

**Edge Cases Unimplemented**
1. Comprehension graph break in resume function with captured variables (e.g. test_torch.py::TestTorchDeviceTypeCPU::test_cauchy_kstest_cpu_bfloat16)
2. Comprehension with captured tensor not in local slot (e.g. test_autograd.py::TestAutograd::test_pickle)

**Test Cases**
New test cases are added in test_comprehensions.py. These cases test for the production of the correct number of graphs and the correct number of specific operators in each graph.

**Misc Notes**
1. One extension of this system is to skip tracing for arbitrary sequences of bytecode such as in loops, try blocks, generic context managers, etc. This code is currently highly specific to comprehensions and would need significant refactoring for this purpose.

**Next Steps**
1. Add support for torch._dynamo.config.nested_graph_breaks=True. In the current implementation, we fall back to skipping the entire frame when nested_graph_breaks=True. As a follow-up, we would like to support this functionality.
2. Add support for set comprehensions. We currently only support list and dict comprehensions.

Fixes pytorch#171822

Pull Request resolved: pytorch#173558
Approved by: https://github.com/williamwen42
Also updated test logic, as OpSchema was replaced by OpSignature for onnx functions.

Required for pytorch#165083

Pull Request resolved: pytorch#173828
Approved by: https://github.com/titaiwangms, https://github.com/malfet
…172160)

The `same_meta` function was missing checks for `is_conj()` and
`is_neg()` tensor flags. This caused `remove_noop_ops` to incorrectly
remove `clone` operations that were resolving conjugation (from
`resolve_conj()`).

When complex convolution is compiled, the C++ implementation calls
`resolve_conj()` before `view_as_real()`. The `resolve_conj()` traces
to a `clone` operation. Without the conjugate bit check, this clone
was being removed as a "no-op", causing `view_as_real` to be called
on a still-conjugated tensor, which fails with:
"view_as_real doesn't work on unresolved conjugated tensors"

Added regression tests:
- test_complex_real_imag_conj: tests real/imag extraction from conj tensors
- test_complex_conv2d_conj: tests complex convolution with conj inputs
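
Why the missing flag check matters can be sketched without PyTorch. The `TensorMeta` dataclass and both comparison functions below are hypothetical stand-ins, not Inductor's `same_meta`. They only illustrate that a metadata comparison which ignores the conjugate and negative bits treats a conj-resolving `clone` as a no-op.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical stand-in for the metadata a no-op check compares."""
    shape: tuple
    stride: tuple
    dtype: str
    is_conj: bool = False
    is_neg: bool = False


def same_meta_without_flags(a: TensorMeta, b: TensorMeta) -> bool:
    # Mirrors the buggy behaviour described above: the lazy bits are ignored.
    return (a.shape, a.stride, a.dtype) == (b.shape, b.stride, b.dtype)


def same_meta_with_flags(a: TensorMeta, b: TensorMeta) -> bool:
    # Also requires the conjugate and negative bits to agree.
    return (
        same_meta_without_flags(a, b)
        and a.is_conj == b.is_conj
        and a.is_neg == b.is_neg
    )


# Input to the clone is still lazily conjugated; its output has the bit resolved.
conj_input = TensorMeta((2, 2), (2, 1), "complex64", is_conj=True)
resolved_output = TensorMeta((2, 2), (2, 1), "complex64", is_conj=False)

print(same_meta_without_flags(conj_input, resolved_output))
print(same_meta_with_flags(conj_input, resolved_output))
```

With the flags included, the clone is no longer considered removable, so the downstream `view_as_real` sees a resolved tensor.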

Fixes pytorch#171665

Pull Request resolved: pytorch#172160
Approved by: https://github.com/eellison
Fixes pytorch#134173

NOTE: Uncommenting the following
https://github.com/pytorch/pytorch/blob/d8039170f00cf084e4af91f1db84497bfccdf149/test/inductor/test_compiled_autograd.py#L5215
https://github.com/pytorch/pytorch/blob/d8039170f00cf084e4af91f1db84497bfccdf149/test/inductor/test_compiled_autograd.py#L5313

and running `python test/inductor/test_compiled_autograd.py TestAutogradWithCompiledAutograd.test_graph_save_on_cpu` fails for a different reason
```
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
  Explanation: Dynamo developers have intentionally marked that the function `save_on_cpu.__init__.<locals>.unpack_from_cpu` in file `/opt/pytorch/pytorch/torch/autograd/graph.py` should not be traced.
  Hint: Avoid calling the function `save_on_cpu.__init__.<locals>.unpack_from_cpu`.
  Hint: Apply `@torch._dynamo.dont_skip_tracing` to the function `save_on_cpu.__init__.<locals>.unpack_from_cpu` to force tracing into the function. More graph breaks may occur as a result of attempting to trace into the function.
  Hint: Please file an issue to PyTorch.
```

Pull Request resolved: pytorch#172578
Approved by: https://github.com/ezyang
Changes:

- Add `launch_pdl: True` to combo kernel `triton_meta` when PDL is enabled
- Fix missing `shape=()` parameter in `_handle_pdl_after_load()`
- Add tests for PDL + combo kernel integration

See the example kernel: https://gist.github.com/eellison/50fea54d1096b0ece3c97f6e8ee02d5b
Written with Claude.

Pull Request resolved: pytorch#174232
Approved by: https://github.com/karthickai, https://github.com/v0i0
# Motivation
Move EmptyTensor to PyTorch for better maintenance.

# Additional Context
The pin commit intel/torch-xpu-ops@83c9813 is from a viable strict [branch](https://github.com/intel/torch-xpu-ops/commits/viable/strict/).
The flow is to first land this PR, then land intel/torch-xpu-ops#2836, and finally update the pin commit from the main branch.

Pull Request resolved: pytorch#174194
Approved by: https://github.com/EikanWang
lintrunner provides official riscv64 wheels starting with 0.13.0, so it can be safely enabled on riscv64.
Pull Request resolved: pytorch#173993
Approved by: https://github.com/Skylion007, https://github.com/cyyever
… for external template buffers (pytorch#174148)

Design doc: pytorch/helion#1346.

Add two extension points for external template buffers (e.g. Helion kernel):
- `codegen_template_override()` in SIMDKernel - allows custom template code generation
- `emit_kernel_override()` in Kernel - allows custom kernel emission to wrapper

These hooks enable external template buffers to integrate with Inductor's template fusion without modifying core Inductor code.

After this PR, we will add Helion dynamo variable and HOP handling in pytorch/helion#1351.

Pull Request resolved: pytorch#174148
Approved by: https://github.com/jansel
…#174077)

Addresses the TODO in `test_local_tensor.py` by adding view ops testing for LocalTensor
Pull Request resolved: pytorch#174077
Approved by: https://github.com/dzmitry-huba
Fixes pytorch#166387

PyTorch has moved to a [new API](https://docs.pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices) for TF32, and many libraries, such as transformers, have started using it.

Inductor still reads the old `allow_tf32` flag, so when the new API has been used to set TF32 and the old API is then read, we see an error like:
ERROR: PyTorch is checking whether allow_tf32_new is enabled for cuBlas matmul,Current status indicate that you have used mix of the legacy and new APIs to set the TF32 status for cublas matmul. We suggest only using the new API to set the TF32 flag. See also: https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices

This PR addresses that by using the new API in Inductor.

So far I have only changed the code paths hit by the PyTorch issue linked above and by transformers. The traceback is in this [comment](huggingface/transformers#42371 (comment)) on huggingface/transformers#42371.

Can I start changing more uses of `allow_tf32` in Inductor?

Pull Request resolved: pytorch#173731
Approved by: https://github.com/jansel, https://github.com/isuruf

Co-authored-by: Isuru Fernando <isuruf@gmail.com>
This fixes pytorch#173879 by using the proposed formula and indicating the shape of the result.

Examples showing that the shape indication is correct:
```python
>>> import torch
>>> a = torch.randn(2, 3, 4, 5, 6)
>>> b = torch.randn(5, 6, 7)
>>> torch.tensordot(a, b, dims=2).shape
torch.Size([2, 3, 4, 7])
>>> a.shape[:-2]
torch.Size([2, 3, 4])
>>> b.shape[2:]
torch.Size([7])

>>> a = torch.randn(2, 3, 4)
>>> b = torch.randn(2, 3, 4, 5, 6)
>>> torch.tensordot(a, b, dims=3).shape
torch.Size([5, 6])
>>> a.shape[:-3]
torch.Size([])
>>> b.shape[3:]
torch.Size([5, 6])
```
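
The shape rule these examples demonstrate, `a.shape[:-dims] + b.shape[dims:]` for an integer `dims`, can also be written as a small pure-Python helper. The function name is hypothetical and the helper is only a sketch of the rule, not part of the PyTorch API.

```python
def tensordot_result_shape(a_shape, b_shape, dims):
    """Result shape of tensordot(a, b, dims) for an integer dims (sketch)."""
    a_shape, b_shape = tuple(a_shape), tuple(b_shape)
    if dims < 0 or dims > min(len(a_shape), len(b_shape)):
        raise ValueError("dims must be between 0 and the smaller tensor rank")
    split = len(a_shape) - dims  # avoids the a_shape[:-0] pitfall when dims == 0
    if a_shape[split:] != b_shape[:dims]:
        raise ValueError("contracted dimensions must match")
    # Keep the leading dims of a and the trailing dims of b.
    return a_shape[:split] + b_shape[dims:]


print(tensordot_result_shape((2, 3, 4, 5, 6), (5, 6, 7), 2))
print(tensordot_result_shape((2, 3, 4), (2, 3, 4, 5, 6), 3))
```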
Pull Request resolved: pytorch#173893
Approved by: https://github.com/mikaylagawarecki
Use torch._check instead of direct comparison in squareCheckInputs to
defer validation to runtime for unbacked symbolic dimensions. Also use
sym_min/sym_max in linalg_lu_factor_ex_meta and make_contiguous_strides_for
to handle symbolic dimensions properly.

This enables the following 18 ops to work with unbacked symbolic dimensions:
- cholesky_inverse
- linalg.cholesky, linalg.cholesky_ex
- linalg.det, linalg.slogdet
- linalg.eig, linalg.eigh, linalg.eigvals, linalg.eigvalsh
- linalg.inv, linalg.inv_ex
- linalg.ldl_factor, linalg.ldl_factor_ex
- linalg.lu_factor, linalg.lu_factor_ex
- lu, triangular_solve
- matrix_exp

Pull Request resolved: pytorch#173399
Approved by: https://github.com/aorenste
…ests (pytorch#171625)

Otherwise some paddings seem to fail pattern-match

Pull Request resolved: pytorch#171625
Approved by: https://github.com/ngimel, https://github.com/eellison
… inputs (pytorch#174334)

When caching an AOTAutograd entry for a model where an output is a view
of an input with dynamic shapes, pickle fails because view_meta_sequence
contains SymInt references that create a chain to unpicklable objects
(WeakValueDictionary).

The fix clears view_meta_sequence in make_runtime_safe() when it has
symbolic inputs. This is safe because gen_alias_from_base() already
skips view replay for symbolic inputs and falls back to as_strided().
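
The failure mode and the fix can be reproduced in miniature with the standard library. The dictionary layout and the `make_runtime_safe` function below are hypothetical stand-ins for the AOTAutograd cache entry. The only behaviour relied on is that a `WeakValueDictionary` cannot be pickled.

```python
import pickle
import weakref


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False on the usual pickling errors."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_runtime_safe(entry):
    """Hypothetical stand-in: drop the view replay metadata when inputs are symbolic."""
    if entry["has_symbolic_inputs"]:
        entry = dict(entry, view_meta_sequence=None)
    return entry


# A WeakValueDictionary stands in for the unpicklable object reached via SymInt references.
unpicklable = weakref.WeakValueDictionary()
entry = {"has_symbolic_inputs": True, "view_meta_sequence": [unpicklable]}

picklable_before = can_pickle(entry)
picklable_after = can_pickle(make_runtime_safe(entry))
print(picklable_before, picklable_after)
```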

This PR was authored with Claude.

Fixes: pytorch#174299

Pull Request resolved: pytorch#174334
Approved by: https://github.com/aorenste
Reduced Dynamo compile time from 14.71 seconds to 13.896 seconds.

1) Cache only on the Source object, which makes lookup faster.
2) Extend the variable tracker cache to lazy variable trackers. Earlier, we
were creating duplicate copies of the VT and unnecessarily calling the
`__call__` method of LazyVT to construct the variable tracker many times.
Now, the cache just returns the cached lazy VT, and if it is realized,
we just use the realized VT.
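
A minimal sketch of the caching idea follows, under the assumption that a lazy tracker builds its value at most once. `LazyTracker` and `TrackerCache` are hypothetical names and do not mirror Dynamo's classes.

```python
class LazyTracker:
    """Hypothetical lazy wrapper that builds its value at most once."""

    def __init__(self, source, build):
        self.source = source
        self._build = build
        self._realized = None
        self.build_calls = 0

    def is_realized(self):
        return self._realized is not None

    def realize(self):
        if self._realized is None:
            self.build_calls += 1
            self._realized = self._build(self.source)
        return self._realized


class TrackerCache:
    """Cache keyed only on the source, so a lookup is a single dict access."""

    def __init__(self):
        self._by_source = {}

    def lookup_or_create(self, source, build):
        cached = self._by_source.get(source)
        if cached is None:
            cached = LazyTracker(source, build)
            self._by_source[source] = cached
        # Hand back the realized value if there is one, else the shared lazy tracker.
        return cached.realize() if cached.is_realized() else cached


cache = TrackerCache()
build = lambda src: "tracker for " + src  # placeholder for the expensive construction
first = cache.lookup_or_create("L['x']", build)
second = cache.lookup_or_create("L['x']", build)
first.realize()
third = cache.lookup_or_create("L['x']", build)
```

Here `first` and `second` are the same lazy object, and `third` is the realized value, so the expensive construction runs only once.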

Pull Request resolved: pytorch#174242
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
Differential Revision: D92291036

Pull Request resolved: pytorch#174302
Approved by: https://github.com/zhxchen17
malfet and others added 23 commits February 11, 2026 01:08
Purely claude-coded using metal-kernel writing skill.

Performance comparison collected using `python test/bench_mps_ops.py grid_sampler_2d`
| Benchmark | MPSGraph (us) | Metal Shader (us) |
  |---|---|---|
  | grid_sample-bilinear-64x64 (torch.float16) | 114.7 | 120.0 |
  | grid_sample-bilinear-128x128 (torch.float16) | 180.4 | 151.1 |
  | grid_sample-bilinear-256x256 (torch.float16) | 423.5 | 364.9 |
  | grid_sample-bilinear-512x512 (torch.float16) | 2393.1 | 1145.3 |
  | grid_sample-nearest-64x64 (torch.float16) | 107.7 | 112.3 |
  | grid_sample-nearest-128x128 (torch.float16) | 131.6 | 124.3 |
  | grid_sample-nearest-256x256 (torch.float16) | 215.1 | 204.2 |
  | grid_sample-nearest-512x512 (torch.float16) | 1089.2 | 565.0 |
  | grid_sample-bilinear-64x64 (torch.float32) | 117.4 | 139.5 |
  | grid_sample-bilinear-128x128 (torch.float32) | 165.4 | 188.9 |
  | grid_sample-bilinear-256x256 (torch.float32) | 462.0 | 398.8 |
  | grid_sample-bilinear-512x512 (torch.float32) | 4311.3 | 1483.5 |
  | grid_sample-nearest-64x64 (torch.float32) | 113.6 | 100.3 |
  | grid_sample-nearest-128x128 (torch.float32) | 134.6 | 122.1 |
  | grid_sample-nearest-256x256 (torch.float32) | 263.4 | 208.6 |
  | grid_sample-nearest-512x512 (torch.float32) | 2289.0 | 896.6 |
  | grid_sample-bilinear-64x64 (torch.bfloat16) | 114.3 | 132.9 |
  | grid_sample-bilinear-128x128 (torch.bfloat16) | 152.4 | 182.5 |
  | grid_sample-bilinear-256x256 (torch.bfloat16) | 343.4 | 369.3 |
  | grid_sample-bilinear-512x512 (torch.bfloat16) | 2333.9 | 1155.2 |
  | grid_sample-nearest-64x64 (torch.bfloat16) | 107.5 | 106.1 |
  | grid_sample-nearest-128x128 (torch.bfloat16) | 130.4 | 114.0 |
  | grid_sample-nearest-256x256 (torch.bfloat16) | 211.9 | 190.3 |
  | grid_sample-nearest-512x512 (torch.bfloat16) | 795.9 | 540.7 |

TODOs:
 - Code sharing for interpolation mode between upsample and grid-sampler

Fixes pytorch#174339 and pytorch#125098

Pull Request resolved: pytorch#174343
Approved by: https://github.com/manuelcandales
ghstack dependencies: pytorch#174676, pytorch#174677, pytorch#174678
Add an XPU_DRIVER activity to the profiler so it reports XPU L0 driver activities.
It is the counterpart to the CUDA_DRIVER activity.

Updates the third_party/kineto submodule.
Adds a test.

Pull Request resolved: pytorch#172940
Approved by: https://github.com/guangyey, https://github.com/sraikund16
As the title suggests, for better documentation.
Pull Request resolved: pytorch#174453
Approved by: https://github.com/EikanWang
More optimizations will follow!

This one is simple:
if we are evaluating a+b+c+... > 0 and all terms are symbols/constants with a value
range > 0, then return true before calling into the expensive static evaluator.
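
The shortcut can be sketched as follows. `Term`, `sum_is_positive`, and the `slow_path` callback are hypothetical names used only for illustration. The real check operates on symbolic expressions and their value ranges, not on this simplified data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Hypothetical term: a symbol or constant with a known lower bound."""
    name: str
    lower: float


def sum_is_positive(terms, slow_path):
    """Decide whether sum(terms) > 0, trying the cheap check first."""
    # If every term is strictly positive, so is their sum.
    if terms and all(t.lower > 0 for t in terms):
        return True
    # Otherwise defer to the expensive evaluator.
    return slow_path(terms)


slow_calls = []


def expensive_evaluator(terms):
    slow_calls.append(terms)
    return None  # "unknown" in this sketch


fast = sum_is_positive([Term("s0", 1), Term("s1", 2), Term("c", 5)], expensive_evaluator)
slow = sum_is_positive([Term("s0", 1), Term("u0", -3)], expensive_evaluator)
```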

***results***
export time
 5m4.868s -> 3m4.165s
 (two minutes saved)

Pull Request resolved: pytorch#174615
Approved by: https://github.com/Lucaskabela
…4610)

There is an interesting use case I need to call out here:

FlexAttention BlockMask's pytree registration contains an arbitrary user-defined mask_mod function. This becomes a problem when we export via dynamo_graph_capture_for_export, because we re-run the model code multiple times and the output bytecode contains logic to reconstruct the user-defined mask_mod. This doesn't work with aot_export's pytree thunkify logic, which receives a spec that has a different id for the mask_mod (because we reconstructed it multiple times). This was not a problem for torch.compile because we always just re-run the inner graph module without input/output processing. I think this is a result of our independent APIs working correctly while the integration point between them (torch IR API + aot_autograd) is a little awkward.

The fix is to wrap the user-defined function in a _MaskMod wrapper that does value-based checking instead of identity checking, so that two different reconstructions of the same mask_mod still compare equal. I had to special-case _MaskMod for the old export path since torch.export.export is still on _dynamo_graph_capture_for_export.
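
One way such a value-based wrapper could look is sketched below. What the real `_MaskMod` compares is not stated here, so comparing the code object and closure values is an assumption, and `ValueEqualCallable` is a hypothetical name.

```python
class ValueEqualCallable:
    """Hypothetical wrapper: equal when code and closure values match, not by identity."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

    def _key(self):
        # Assumption: code object plus captured values are enough to identify the function.
        closure = tuple(cell.cell_contents for cell in (self.fn.__closure__ or ()))
        return (self.fn.__code__, closure)

    def __eq__(self, other):
        if not isinstance(other, ValueEqualCallable):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self.fn.__code__)


def make_window_mask(window):
    def mask_mod(b, h, q_idx, kv_idx):
        return 0 <= q_idx - kv_idx <= window
    return mask_mod


# Two reconstructions of the same mask_mod are distinct function objects...
first, second = make_window_mask(4), make_window_mask(4)
# ...but their wrappers compare equal, so a spec containing them matches.
same = ValueEqualCallable(first) == ValueEqualCallable(second)
different = ValueEqualCallable(first) == ValueEqualCallable(make_window_mask(8))
```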

Pull Request resolved: pytorch#174610
Approved by: https://github.com/zhxchen17, https://github.com/drisspg
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@077a6c](intel/torch-xpu-ops@077a6ce), which includes:

- Adjust layer_norm_backward_kernel interface to match that of PyTorch
- Fix incorrect Tensor Size for NestedTensor QKV Transform
- Support calling oneCCL AllToAll API directly
- Add NaN input checks to prevent false singular matrix errors in oneMKL linear algebra operations
Pull Request resolved: pytorch#174591
Approved by: https://github.com/EikanWang
The recursion limit has to be reset before exiting `subTest`, or the test may fail inside pytest because of the low limit set in the test.
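
The pattern can be sketched with plain `unittest`. The test class and helpers below are hypothetical and unrelated to the actual PyTorch test. They only show the limit being restored in a `finally` block before control leaves `subTest`.

```python
import io
import sys
import unittest


def current_depth():
    """Count Python frames on the current stack."""
    depth, frame = 0, sys._getframe()
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth


def recurse_forever(n=0):
    return recurse_forever(n + 1)


class RecursionLimitDemo(unittest.TestCase):
    def test_low_limit(self):
        original = sys.getrecursionlimit()
        for headroom in (50, 80):
            with self.subTest(headroom=headroom):
                sys.setrecursionlimit(current_depth() + headroom)
                try:
                    with self.assertRaises(RecursionError):
                        recurse_forever()
                finally:
                    # Restore before leaving subTest, whose exit code needs stack room too.
                    sys.setrecursionlimit(original)


def run_demo():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RecursionLimitDemo)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


limit_before = sys.getrecursionlimit()
result = run_demo()
```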

Pull Request resolved: pytorch#174693
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
This PR fixes `RuntimeError: CUDA driver error: invalid argument` when combo kernels have large ynumels that exceed the grid.y limit. It adds y/z grid overflow handling similar to `Grid2DWithYZOverflow`.

This issue happens when `combo_kernel_per_subkernel_blocks = False` (the default). After the flatten dispatch PR pytorch#172527 is added, `combo_kernel_per_subkernel_blocks = True` will make this issue obsolete.
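
The arithmetic behind y/z overflow handling can be sketched in plain Python. The function names are hypothetical and this is not Inductor's implementation. It assumes the CUDA limit of 65535 blocks along grid.y and grid.z, and spills the excess y blocks into the z dimension.

```python
MAX_GRID_YZ = 65535  # CUDA limit for gridDim.y and gridDim.z


def ceildiv(a, b):
    return -(-a // b)


def split_y_grid(ynumel, yblock, max_grid=MAX_GRID_YZ):
    """Return (grid_y, grid_z) covering ceildiv(ynumel, yblock) blocks (sketch)."""
    y_blocks = ceildiv(ynumel, yblock)
    if y_blocks <= max_grid:
        return y_blocks, 1
    grid_z = ceildiv(y_blocks, max_grid)
    grid_y = ceildiv(y_blocks, grid_z)
    return grid_y, grid_z


def linear_y_block(pid_y, pid_z, grid_y):
    """Recover the logical y block index inside the kernel."""
    return pid_z * grid_y + pid_y


small = split_y_grid(100, 1)
grid_y, grid_z = split_y_grid(200_000, 1)
print(small, (grid_y, grid_z))
```

Because `grid_y * grid_z` can exceed the number of logical blocks, a kernel using this scheme has to mask out indices at or beyond `ceildiv(ynumel, yblock)`.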

Pull Request resolved: pytorch#174354
Approved by: https://github.com/mlazos
…#174533)

Differential Revision: D92629416

Support tlparse's fx_graph_runnable with nested user defined triton kernels and constexprs. Also fixes some edge cases with user defined triton kernels.

Pull Request resolved: pytorch#174533
Approved by: https://github.com/eellison
This converts NanCheck into an op so it can be used from outside of ProcessGroupNCCL. This can be used from torchcomms.

Misc changes:
* add CPU implementation
* use CUDA_KERNEL_ASSERT macro so it logs a more helpful message when nancheck fires

Test plan:

CI

```
$ python -c "import torch; torch.ops.c10d.check_for_nan(torch.tensor(float('nan'), device='cuda')); torch.cuda.synchronize()"                 (pytorch-3.12)
/home/tristanr/pytorch/torch/csrc/distributed/c10d/NanCheck.cu:217: checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(tailPtr[threadIdx.x])` failed.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/tristanr/pytorch/torch/cuda/__init__.py", line 1165, in synchronize
    return torch._C._cuda_synchronize()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Pull Request resolved: pytorch#174736
Approved by: https://github.com/kwen2501, https://github.com/allenwang28
Changes:

- Add `launch_pdl: True` to combo kernel `triton_meta` when PDL is enabled
- Fix missing `shape=()` parameter in `_handle_pdl_after_load()`
- Add tests for PDL + combo kernel integration

See the example kernel: https://gist.github.com/eellison/50fea54d1096b0ece3c97f6e8ee02d5b
Written with Claude.

Pull Request resolved: pytorch#174232
Approved by: https://github.com/karthickai, https://github.com/v0i0
# Conflicts:
#	.ci/docker/requirements-ci.txt
#	requirements-build.txt
#	torch/utils/hipify/cuda_to_hip_mappings.py
@rocm-repo-management-api

rocm-repo-management-api Bot commented Feb 11, 2026

Jenkins build for 241aa87f0fde758bc85bd988fb3812d02a1f43a2 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented Feb 12, 2026

Jenkins build for 3ee04a9830bea722779f6591ffb9a2386afcfc14 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pragupta pragupta merged commit cc3acaf into develop Feb 12, 2026
79 of 83 checks passed
@pragupta pragupta deleted the develop_IFU_20260211 branch February 12, 2026 16:26