forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251104 #2784
Merged
Make the bucketer use multi-dtype bucketing for all-gathers. Pull Request resolved: pytorch#166527 Approved by: https://github.com/IvanKobzarev, https://github.com/ezyang
Bucket all-reduce in the bucketer, building on @IvanKobzarev's earlier PR. Pull Request resolved: pytorch#166528 Approved by: https://github.com/IvanKobzarev ghstack dependencies: pytorch#166527
…orch#164518)" This reverts commit 311ea0d. Reverted pytorch#164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](pytorch#164518 (comment)))
… func `_init_one_process_group` (pytorch#166614) To make the pg cache change easier and to modularize the code, we isolate the logic of process group creation into a separate function named `_init_one_process_group`. Pull Request resolved: pytorch#166614 Approved by: https://github.com/lw
Fixes pytorch#165177 When converting guards to sources, if we are unable to get the expected symbol from symbol_to_source, fall back to var_to_sources. I was unable to produce a simpler repro than the one described in the issue (it relies on llama3, so it is inappropriate for a unit test). Pull Request resolved: pytorch#165723 Approved by: https://github.com/bobrenjc93
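A minimal sketch of the fallback described above, using hypothetical local names (the real logic lives in Dynamo's guard-to-source conversion):
```python
# Hypothetical helper illustrating the fallback; symbol_to_source and
# var_to_sources are plain dicts mapping a symbol to a list of sources.
def sources_for(symbol, symbol_to_source, var_to_sources):
    sources = symbol_to_source.get(symbol)
    if not sources:
        # Symbol was never recorded in symbol_to_source; try var_to_sources.
        sources = var_to_sources.get(symbol, [])
    return sources
```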
Summary: After a precision study, we concluded it is OK to use ACL's exp function in f32's erf(). We can keep erf inline this way. Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiling with clang-19: Before: f32 erf: 2539.179us; After: f32 erf: 1329.063us. Test Plan: Correctness: buck2 test mode/opt //caffe2/test:test_ops buck2 test mode/opt //caffe2/test:torch Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Differential Revision: D85730452 Pull Request resolved: pytorch#166594 Approved by: https://github.com/mcfi, https://github.com/fadara01
This reverts commit 9051940. Reverted pytorch#166541 on behalf of https://github.com/atalman due to breaks internal test ([comment](pytorch#166541 (comment)))
…rd error (pytorch#166367) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#166367 Approved by: https://github.com/bdhirsh
…#165139) As the title states. - `AOTI_TORCH_CHECK` depends on `TORCH_CHECK_MSG`, which is located in c10/util/Exception.h and may break BC. - `AOTI_TORCH_CHECK` is not used everywhere. - `STD_TORCH_CHECK` has ABI check tests. Pull Request resolved: pytorch#165139 Approved by: https://github.com/Skylion007, https://github.com/janeyx99
As the title states. **Changes**: - torch/csrc/inductor (Part 2) Pull Request resolved: pytorch#165119 Approved by: https://github.com/janeyx99 ghstack dependencies: pytorch#165139
It appears to be unused based on `cd torch; rg has_symints`. Pull Request resolved: pytorch#163667 Approved by: https://github.com/xmfan, https://github.com/azahed98, https://github.com/albanD ghstack dependencies: pytorch#162990
Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions. Pull Request resolved: pytorch#166624 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
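An illustrative example of the style rule described above (not a line from the actual patch):
```python
import torch

# Prefer identity checks when comparing against torch functions,
# mirroring how `is`/`is not` is used for class comparisons.
def dispatch(fn, x):
    if fn is torch.add:        # preferred over `fn == torch.add`
        return x + x
    if fn is not torch.mul:    # preferred over `fn != torch.mul`
        return x
    return x * x

print(dispatch(torch.add, torch.ones(2)))  # tensor([2., 2.])
```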
In some cases the warning from pytorch#147744 still gets emitted because [atexit hooks aren't called](python/cpython#114279). Even in those cases, if the atexit hooks _were_ called you could end up with issues due to the directory being deleted in one process, but still being used elsewhere. It's better all round to load these modules entirely in-memory. Pull Request resolved: pytorch#157713 Approved by: https://github.com/xush6528
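A generic illustration of the in-memory loading technique (standard-library only; not the exact change from pytorch#157713): because no temporary directory is created, there is nothing for an atexit hook to clean up and nothing another process can delete out from under you.
```python
import importlib.util
import sys

# Build a module entirely in memory from a source string.
source = "def greet():\n    return 'hello'\n"
spec = importlib.util.spec_from_loader("inmemory_mod", loader=None)
module = importlib.util.module_from_spec(spec)
exec(compile(source, "<in-memory>", "exec"), module.__dict__)
sys.modules["inmemory_mod"] = module

from inmemory_mod import greet
print(greet())  # hello
```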
After pytorch#163049, this PR fixes the type annotations to match the actual implementation for ProcessGroupXCCL::Options. Pull Request resolved: pytorch#166418 Approved by: https://github.com/guangyey, https://github.com/ezyang
…h#165511) # Motivation Aligned with other backends, this PR introduces a new API torch.xpu.get_per_process_memory_fraction to allow users to retrieve the allowed memory fraction for a single process. Pull Request resolved: pytorch#165511 Approved by: https://github.com/EikanWang, https://github.com/ezyang ghstack dependencies: pytorch#165508, pytorch#165509, pytorch#165510
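A hedged usage sketch of the new API, assuming it mirrors torch.cuda.get_per_process_memory_fraction (device index argument, float return):
```python
import torch

# Query how much of device 0's total memory this process is allowed to use.
if torch.xpu.is_available():
    frac = torch.xpu.get_per_process_memory_fraction(0)
    print(f"process may use up to {frac:.0%} of device 0's total memory")
```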
Fixes pytorch#166630 Pull Request resolved: pytorch#166631 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Fixes the disk space cutoff message to report that the machine does not have `difference = 100 - diskspace_cutoff_int` percent of disk space available. Pull Request resolved: pytorch#166645 Approved by: https://github.com/jeffdaily
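Illustrative arithmetic for the corrected message (the cutoff value here is hypothetical):
```python
# Hypothetical cutoff: builds require diskspace_cutoff_int percent of disk to be free.
diskspace_cutoff_int = 90
difference = 100 - diskspace_cutoff_int
print(f"The machine does not have {difference}% of disk space available")
```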
…serts (pytorch#166171) Summary: Using `CUDA_KERNEL_ASSERT_PRINTF` inside kernels allows us to log invalid values to the console (which can in turn be used to surface _hopefully_ clearer error messages). This does have an impact on the number of registers needed for the values being logged (I confirmed via diffing PTX that there is no other impact relative to using `__assert_fail`). To avoid causing perf bottlenecks, this change adds a compile-time switch to enable more verbose errors in some of the common kernels that cause DSAs. There is also a Buck config that can be used to configure this switch more conveniently.

## Alternatives considered

I considered making the behavior of `CUDA_KERNEL_ASSERT_PRINTF` controllable via a compile-time macro instead of writing another wrapper for it, but there are kernels where the extra register pressure is not as severe, and in those cases having more useful error messages by default is pretty useful.

Test Plan:

## Simple Python Driver:
```
# scatter_errors.py
import torch

def main() -> None:
    a = torch.rand(128, device="cuda:0")
    idx = torch.randint(0, 128, (100,), device="cuda:0")
    idx[0] = 9999
    b = torch.scatter(a, 0, idx, 555.0)
    print(b)
```
When running normally via:
```
$ buck2 run @//mode/opt :scatter_errors
```
we see the following DSA message:
```
fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```
Running via:
```
$ buck2 run @//mode/opt -c fbcode.c10_enable_verbose_assert=1 :scatter_errors
```
however produces:
```
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0]: Assertion failed: `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"`: Expected 0 <= idx_dim < index_size (128), but got idx_dim = 9999
```
Differential Revision: D85185987 Pull Request resolved: pytorch#166171 Approved by: https://github.com/ngimel
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected. Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC. Pull Request resolved: pytorch#166456 Approved by: https://github.com/Skylion007
Dynamo now supports Python 3.13. Pull Request resolved: pytorch#166497 Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
…icient SM count (pytorch#165921) Pull Request resolved: pytorch#165921 Approved by: https://github.com/ngimel
Summary: Blockwise 1x128 and 128x128 scaling is only available on CUDA >= 12.9 and only on Hopper GPUs. Attempting to run on B200 would give a hard-to-debug `CUBLAS_STATUS_NOT_SUPPORTED`. Add a more helpful `NotImplementedError` to catch this case. Also more explicitly disable ROCm builds for the relevant methods, based on lack of support per [hipBLASLt docs](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/reference/datatypes.html#_CPPv4N28hipblasLtMatmulMatrixScale_t40HIPBLASLT_MATMUL_MATRIX_SCALE_VEC128_32FE). Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: pytorch#166639 Approved by: https://github.com/drisspg
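A hedged sketch of the kind of capability check described above (not the exact code from pytorch#166639): blockwise 1x128 / 128x128 scaling needs CUDA >= 12.9 on a Hopper (SM90) GPU.
```python
import torch

def check_blockwise_scaling_supported() -> None:
    # Hypothetical guard; the real check lives inside the scaled-mm path.
    if torch.version.cuda is None:
        raise NotImplementedError("Blockwise 1x128/128x128 scaling requires a CUDA build")
    cuda_ok = tuple(int(v) for v in torch.version.cuda.split(".")[:2]) >= (12, 9)
    major, minor = torch.cuda.get_device_capability()
    if not (cuda_ok and (major, minor) == (9, 0)):
        raise NotImplementedError(
            "Blockwise 1x128/128x128 scaling requires CUDA >= 12.9 and a Hopper (SM90) GPU"
        )
```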
This PR removes unused loop variables in tests. Pull Request resolved: pytorch#166509 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
…ch#166603) Looking for feedback on this approach. Received user reports of spurious pyrefly errors from users using hg instead of git. I think this was due to the fact that, when using a venv and git, `make setup-env` installs requirements and pulls from a nightly torch wheel, which is needed for pyrefly to type check properly. Initial documentation for `make setup-env` can be found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch

Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner
# confirm error message / warning appears
>>> General linter failure: Warning (PYREFLY) nightly-wheel-not-run pytorch-nightly.pth not found. You may need to run make setup-env or make setup-env-conda to install nightly binaries and type stubs.
```
Pull Request resolved: pytorch#166603 Approved by: https://github.com/aorenste
…ytorch#166480) Pull Request resolved: pytorch#166480 Approved by: https://github.com/Skylion007, https://github.com/malfet
Replace assert statements with explicit if/raise patterns in: - torch/ao/quantization/fx/* (177 errors). Partially fixes pytorch#164878. Pull Request resolved: pytorch#165420 Approved by: https://github.com/RohitRathore1, https://github.com/fffrog, https://github.com/albanD
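A generic before/after illustration of the pattern (not a line from the actual patch; the variable name is hypothetical):
```python
def _get_qconfig(qconfig):
    # Before: assert qconfig is not None, "qconfig must be set"
    # After: an explicit check that still runs when Python is invoked with -O.
    if qconfig is None:
        raise AssertionError("qconfig must be set")
    return qconfig
```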
This config improves performance by 250% on some kernels that contain `t1.atomic_add(...)`. Again, we conditionalize for ROCm/HIP, so there is no impact to NV. Pull Request resolved: pytorch#166470 Approved by: https://github.com/PaulZhang12, https://github.com/mlazos, https://github.com/eellison, https://github.com/jansel
Fix for pytorch#163528 Pull Request resolved: pytorch#165064 Approved by: https://github.com/benjaminglass1, https://github.com/mlazos
…h#166397) According to the next commit to AOTriton (ROCm/aotriton@8625c4f), these changes were missed in the 0.11b release: pytorch#161754 Pull Request resolved: pytorch#166397 Approved by: https://github.com/jeffdaily
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected. Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC. This is the second PR. Pull Request resolved: pytorch#166501 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#166456
This PR enables `test_fxir_backend.py`'s formerly skipped XPU tests. No additional changes are needed for the features. Pull Request resolved: pytorch#166493 Approved by: https://github.com/angelayi, https://github.com/EikanWang
This ensures that we graph break at the right time, leading to the right stack trace. Pull Request resolved: pytorch#166909 Approved by: https://github.com/tugsbayasgalan
…#165686) Wraps the header in a C file and compiles it using a C compiler, which should support designated initializers. Fixes issue pytorch#160647. Pull Request resolved: pytorch#165686 Approved by: https://github.com/williamwen42
In pytorch#166440, didn't realize you could turn on anomaly mode while disabling NaN checks for these stacks. Adding them to `debug_mode.operators[*].fwd_stack_trace`. Pull Request resolved: pytorch#166842 Approved by: https://github.com/yushangdi, https://github.com/mikaylagawarecki
# Motivation This PR intends to add the `ExpandableSegment` struct, which is used to help support the expandable segment feature. I split it out into a separate PR to facilitate code review. Pull Request resolved: pytorch#166299 Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui
…6292) # Motivation This PR intends to add expandable segment feature support on XPU. This will help - Reduce memory fragmentation; - Gradually map physical pages into virtual address space as needed. # Additional Context The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor size, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail. The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead. The potential drawbacks are - Virtual memory overhead; - Potential page mapping overhead; - Increased complexity. Pull Request resolved: pytorch#166292 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: pytorch#166299
# Motivation This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices. It supports two categories of allocations: - Regular allocations; - Expandable segment allocations. # Additional Context The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported. Pull Request resolved: pytorch#166424 Approved by: https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: pytorch#166299, pytorch#166292
Adding distributed testing back to trunk since we have been observing [reasonable queueing](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-10-05T01%3A44%3A55.924Z&endDate=2025-11-04T01%3A44%3A55.925Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=linux.rocm.gpu.gfx942.1&items=linux.rocm.gpu.gfx942.1) based on current MI3xx capacity. Partially addresses pytorch#166108. Pull Request resolved: pytorch#166915 Approved by: https://github.com/jeffdaily
… to trace through typing.get_type_hints (pytorch#166757) This is covered by `test_get_type_hints` in test/dynamo/test_repros.py Pull Request resolved: pytorch#166757 Approved by: https://github.com/Lucaskabela
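A hedged sketch of the kind of code this enables (see `test_get_type_hints` for the real coverage): Dynamo should now trace through `typing.get_type_hints` instead of graph-breaking on it.
```python
import typing
import torch

class Point(typing.NamedTuple):
    x: int
    y: int

@torch.compile
def f(t):
    hints = typing.get_type_hints(Point)  # {'x': int, 'y': int}
    return t + len(hints)

print(f(torch.ones(2)))  # tensor([3., 3.])
```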
Pull Request resolved: pytorch#166894 Approved by: https://github.com/malfet ghstack dependencies: pytorch#166757
Pull Request resolved: pytorch#166895 Approved by: https://github.com/guilhermeleobas ghstack dependencies: pytorch#166757, pytorch#166894
Pull Request resolved: pytorch#166902 Approved by: https://github.com/malfet, https://github.com/pianpwk ghstack dependencies: pytorch#166757, pytorch#166894, pytorch#166895
This PR continues to apply ruff UP035 rule to inductor code. ruff UP035 rule aims to use Python 3.10 syntax and libraries. Pull Request resolved: pytorch#166225 Approved by: https://github.com/aorenste
…#166839) Fixes pytorch#166838 1. Register the `convolution_overrideable` key for the flop_counter. CUDA relies on keys such as `cudnn_convolution`; for devices like XPU, it falls back to `convolution_overrideable`. Without the correct registration, the flop_counter will silently return 0 for XPU in line: https://github.com/pytorch/pytorch/blob/e1d011d6eb571cd98ec7c7ed8e8b518a5463ec97/torch/_inductor/analysis/profile_analysis.py#L178-L179 2. Enable the tests when enabling XPU in `test_analysis.py`. Pull Request resolved: pytorch#166839 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jansel
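A hedged sanity check for the behavior described above: once the overrideable convolution key is registered, conv FLOPs counted on XPU should no longer silently come out as 0.
```python
import torch
from torch.utils.flop_counter import FlopCounterMode

# Count FLOPs for a small convolution on XPU and check they are non-zero.
if torch.xpu.is_available():
    conv = torch.nn.Conv2d(3, 8, kernel_size=3).to("xpu")
    x = torch.randn(1, 3, 32, 32, device="xpu")
    with FlopCounterMode(display=False) as counter:
        conv(x)
    assert counter.get_total_flops() > 0
```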
Make sure you're on cutlass 4.2.0+ Test Plan: Tritonbench(oss): `clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy` Unit Tests(oss): `clear; python test/inductor/test_cutedsl_grouped_mm.py` Differential Revision: D82010227 Pull Request resolved: pytorch#165036 Approved by: https://github.com/alexsamardzic, https://github.com/drisspg, https://github.com/mlazos
This reverts commit c21868b. Reverted pytorch#162275 on behalf of https://github.com/izaitsevfb due to breaking test_rms_norm_bwd_float32_split_reductions_True_shape2 ([comment](pytorch#162275 (comment)))
…#166830) # Motivation I believe this is a bug - here's why: In [dnnl_common_types.h](https://github.com/uxlfoundation/oneDNN/blob/98132c49080c19caffee1f7ba65be83339cba985/include/oneapi/dnnl/dnnl_common_types.h#L116-L125), `md_padded_dims` is defined as a pointer to an `int64_t[12]` array; we can confirm this from the implementation in [memory_desc.cpp](https://github.com/uxlfoundation/oneDNN/blob/98132c49080c19caffee1f7ba65be83339cba985/src/common/memory_desc.cpp#L746-L748), where the member indeed points to an internal array. # Solution Therefore, when accessing `md_padded_dims`, we should first dereference the pointer and then index into it - using it directly without dereferencing would corrupt memory. Pull Request resolved: pytorch#166830 Approved by: https://github.com/EikanWang
…peError (pytorch#165264) Fixes pytorch#164684

### Description
Symbolic tracing fails during multiplication between a `SymBool` and a `Tensor`. This scenario is triggered when `.item()` is called on a 0-dim boolean tensor within a `torch.compile` region. In compile mode, this yields a `SymBool`, and the subsequent `SymBool * FakeTensor` operation is unsupported, leading to a `TypeError` or a data-dependent `UserError`.

### Solution
This PR addresses the issue at the type-conversion level, as suggested by reviewers. The root cause of the `TypeError` is that `torch.sym_float()` (which is called by `_maybe_convert_to_dtype` during type promotion for `aten.mul`) lacks a conversion path for `SymBool` and incorrectly falls back to `builtins.float(SymBool)`. This fix implements the `__sym_float__(self)` method on the `SymBool` class (defined in torch/__init__.py). The `torch.sym_float(a)` utility function is already designed to check for `hasattr(a, "__sym_float__")` before falling back to `builtins.float()`. By adding this method, `SymBool` instances now correctly advertise their ability to be cast to `SymFloat`. The new method leverages `self.node.sym_float()` to convert the symbolic boolean value to its symbolic float representation (0.0 or 1.0), resolving the `TypeError` at its source. This approach is more fundamental than modifying a specific operation in builtin.py and ensures `SymBool` can be correctly promoted to `SymFloat` in any operation, while still preserving its boolean nature for control flow operations like `guard_or_false` (which is verified by a new test case).

### Verification
1. **Bug Reproduced**: The initial `UserError: Could not guard on data-dependent expression` was successfully reproduced with the script from the issue (screenshot in the PR).
2. **Fix Validated**: After applying the code changes, the same script now runs to completion, printing `✅ eager success` and `✅ compile success` (screenshot in the PR).
3. Added a new test class DynamoOpPromotionTests to test/dynamo/test_misc.py with three new test cases:
   1. test_symbool_tensor_mul_does_not_fail: verifies that the original bug report code (with `.item()` + `*`) no longer raises an error when compiled.
   2. test_symbool_guard_or_false: verifies that this fix does not cause a regression for `guard_or_false(SymBool)` (the concern raised by reviewers).
   3. test_symbool_tensor_mul: verifies the behavior of `Tensor(bool) * Tensor(float)` (without `.item()`) for completeness.

All new tests were added and pass locally. Pull Request resolved: pytorch#165264 Approved by: https://github.com/laithsakka, https://github.com/Lucaskabela
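A hedged end-to-end illustration of the scenario fixed above (it mirrors the shape of the issue's repro, not the exact script): `.item()` on a 0-dim bool tensor yields a `SymBool` inside the compiled region, and `SymBool * Tensor` now promotes via `__sym_float__`.
```python
import torch
import torch._dynamo

# Capture .item() as a symbolic scalar instead of graph-breaking on it.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def masked_scale(x, flag):
    keep = flag.item()   # SymBool in the compiled region
    return x * keep      # promotes to SymFloat (0.0 or 1.0) instead of raising

print(masked_scale(torch.randn(4), torch.tensor(True)))
```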
…pytorch#166613) Pull Request resolved: pytorch#166613 Approved by: https://github.com/hl475
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#166867 Approved by: https://github.com/albanD ghstack dependencies: pytorch#166868
# Motivation This PR aims to reuse some UTs to validate the expandable segment feature. # Additional Context Currently, the failure is related to the internal tracking ticket `GSD-11403`; we can pick up the fix when upgrading the driver to `ci-neo-master-034630` or greater. TODO: add conv and gemm tests to this test case when upgrading the driver. Pull Request resolved: pytorch#166495 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: pytorch#166299, pytorch#166292, pytorch#166424
Summary: cuBlasLt enforces size/stride requirements for 1x128 and 128x128 blockwise scaling kernels, some of which weren't being handled, causing silent incorrect answers, especially for 128x128 scaling cases. For deepseek-style scaling, cuBlasLt enforces ([docs](https://docs.nvidia.com/cuda/cublas/#scaling-factors-layouts)) the following for `A: MxN`, `B: KxN`:
```Py
L = K // 128
L4 = round_up(L, 4)

1x128 x 128x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [L4, N // 128], stride: [1, L4]

128x128 x 1x128:
* A_scale: [L4, M // 128], stride: [1, L4]
* B_scale: [N, K // 128], stride: [1, N]

1x128 x 1x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [N, K // 128], stride: [1, N]
```
Notable here is the `L4` term, which means that we must round up to the nearest multiple of 4 blocks in the `K` dimension. This wasn't enforced previously, and caused silent wrong answers where `(K // 128) % 4 != 0`. Subscribers: @vkuzo Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: pytorch#166752 Approved by: https://github.com/drisspg, https://github.com/vkuzo
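An illustrative helper for the `L4` rounding described above (the matrix sizes here are hypothetical):
```python
def round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple.
    return ((x + multiple - 1) // multiple) * multiple

M, N, K = 256, 512, 384
L = K // 128            # 3 blocks along K
L4 = round_up(L, 4)     # 4 -> for 1x128 x 128x128, B_scale must be [L4, N // 128] = [4, 4]
```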
Forward fix for failing Docker release builds. Related to: pytorch#166897. Nightly Docker build failure: https://github.com/pytorch/pytorch/actions/runs/18900508440/job/53946606434 Due to missing base image:
```
ERROR: failed to build: failed to solve: docker.io/nvidia/cuda:13.0.2-devel-ubuntu22.04: not found
```
Pull Request resolved: pytorch#166904 Approved by: https://github.com/tinglvv, https://github.com/malfet
…6869) Pull Request resolved: pytorch#166869 Approved by: https://github.com/eellison ghstack dependencies: pytorch#166521
# Conflicts:
#	.ci/docker/requirements-ci.txt
rocm_base: 56002f4
upstream_main: 24db5c4