forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251124 #2827
Merged
Conversation
…7661)" This reverts commit 1b43d6c. Reverted pytorch#167661 on behalf of https://github.com/yangw-dev due to break internal tests and build, please reach out meta fellas to have fix it and reland again, error examplke: hip/KernelUtils.cuh:74:5: error: no matching function for call to 'unsafeAtomicAdd' ([comment](pytorch#167661 (comment)))
Summary: The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative. The decorator will emit a FutureWarning at runtime, and the docstring now includes a .. deprecated:: directive for documentation visibility. Test Plan: Manual verification that the decorator is properly applied and the deprecation message is informative. Differential Revision: D87272399 Pull Request resolved: pytorch#168036 Approved by: https://github.com/valentinandrei
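For reference, a minimal sketch of the recommended replacement workflow, assuming a CUDA build (the snapshot filename is illustrative):

```python
import torch

# Start recording allocator events with stack traces.
torch.cuda.memory._record_memory_history(max_entries=100_000)

x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump a snapshot (viewable at pytorch.org/memory_viz), then stop recording.
torch.cuda.memory._export_memory_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```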
This PR introduces a `Tensor` subclass which represents a complex tensor in terms of two real ones. Ops are decomposed as individual ops on the real and imaginary parts. It is compatible with `torch.compile`, so long as the real ops used are also compatible. Autograd "works", but is WIP due to different edge-case behaviour. Pull Request resolved: pytorch#167621 Approved by: https://github.com/ezyang
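To illustrate the decomposition idea only (this is not the PR's subclass code), here is a complex multiply expressed as individual ops on the real and imaginary parts:

```python
import torch

def complex_mul(re1, im1, re2, im2):
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    return re1 * re2 - im1 * im2, re1 * im2 + im1 * re2

re, im = torch.randn(3), torch.randn(3)
out_re, out_im = complex_mul(re, im, re, im)
ref = torch.complex(re, im) * torch.complex(re, im)
assert torch.allclose(out_re, ref.real) and torch.allclose(out_im, ref.imag)
```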
Repetition of pytorch#155708. This has been broken for a while, and the ET pin in PyTorch is so old that `torch==2.10.0.dev20250915` could no longer be found in the nightly indices. Pull Request resolved: pytorch#168090 Approved by: https://github.com/atalman, https://github.com/yangw-dev
This PR enables special matmuls on Thor devices. This includes row-wise scaled matmul on `fp8` and group gemm on `bfloat16`. Pull Request resolved: pytorch#164836 Approved by: https://github.com/ngimel
…orch#167395) This adds a debug HTTP server for debugging stuck or slow jobs. It runs the WorkerServer on every worker and then launches a separate flask process on rank 0 that users connect to for debugging. This can easily be improved to trigger profilers as well as to visualize the data much better. Initial handlers:

* pytorch profiler
* FlightRecorder data
* Python stacks

```python
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server
enable_debug_server()
```

Test plan:

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
```

(Three screenshots of the debug UI omitted.)

Pull Request resolved: pytorch#167395 Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/atalman
…ytorch#167079) Summary: As title. Knowing the size of the leaked tensor is useful; it lets us focus on the largest leaks. Differential Revision: D86218574 Pull Request resolved: pytorch#167079 Approved by: https://github.com/kausv
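As a small aside, sizing a tensor for such a report can be done like this (an illustration, not the PR's code):

```python
import torch

t = torch.randn(1024, 1024)
print(t.numel() * t.element_size())  # 4194304 bytes of element data
print(t.untyped_storage().nbytes())  # bytes actually held by the backing storage
```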
…torch#161703) This is another PR porting distributed tensor tests to Intel GPU; the companion PR is pytorch#161604. We enable Intel GPU with the following methods, trying our best to keep the original code style:

- Use torch.accelerator for generic GPU support
- Skip a case if it runs on XPU and has known issues

Pull Request resolved: pytorch#161703 Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/albanD
Pull Request resolved: pytorch#167852 Approved by: https://github.com/fmassa
The all-gather bucketing already went partway toward fusing dtype casts into the bucket. We do this by allocating the group bucket buffer, then viewing each slice of it as the destination dtype. We then foreach_copy_ into the allocated buffer, with each collective copying into its destination dtype (a sketch follows below). This logic was causing an issue in a later part of the stack while not fully firing, so we might as well fix it. Note: custom ops don't yet support list[dtype], so I worked around it with list[int]; this will be fixed in a follow-up. Pull Request resolved: pytorch#167853 Approved by: https://github.com/ruisizhang123 ghstack dependencies: pytorch#167852
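A rough sketch of the copy-with-cast step, with illustrative shapes and dtypes (the PR's actual buffers come from collective bucketing):

```python
import torch

# Two fp32 shards copied into one bf16 bucket buffer with a single foreach
# call; copy_ semantics cast each source to its destination slice's dtype.
srcs = [torch.randn(4), torch.randn(2)]
bucket = torch.empty(6, dtype=torch.bfloat16)
dsts = [bucket[0:4], bucket[4:6]]
torch._foreach_copy_(dsts, srcs)
print(bucket.dtype, bucket.shape)
```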
The bucketing dtype fusing was causing nodes which had dependencies to be erased. Transfer those deps over to the new nodes, and also add an assertion that none of our deps are erased to catch this type of error in the future. Pull Request resolved: pytorch#167863 Approved by: https://github.com/fmassa ghstack dependencies: pytorch#167852, pytorch#167853
Since the currently intended workflow on the new MI3xx CI capacity is [trunk-rocm-mi300.yml](https://github.com/pytorch/pytorch/blob/d91269e8ce309437c1f849b5ab3362d69b178ef4/.github/workflows/trunk-rocm-mi300.yml#L54), which only needs the jammy images, we limit the builds to those images to optimize Docker caching times. Pull Request resolved: pytorch#168088 Approved by: https://github.com/jeffdaily
For GPU: it was previously reported that only a single sample could be tested with the huber_loss functional. The current snapshot of the code does not appear to suffer from the numerical issues reported before. For CPU: while testing the GPU path, it was discovered that the Half implementation appears to be numerically unstable. This commit resolves the CPU issue by upcasting Half to float for the computation. Pull Request resolved: pytorch#166952 Approved by: https://github.com/benjaminglass1, https://github.com/isuruf
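A minimal sketch of the upcast pattern at the user level (the actual change is in the kernel):

```python
import torch
import torch.nn.functional as F

x = torch.randn(16, dtype=torch.half)
y = torch.randn(16, dtype=torch.half)

# Compute in float for numerical stability, then cast the result back to half.
loss = F.huber_loss(x.float(), y.float()).to(torch.half)
print(loss)
```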
…h/csrc/Exceptions.h (pytorch#168056) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This:

```cpp
try {
  ...
} catch (exception& e) {
  // no use of e
}
```

should instead be written as

```cpp
} catch (exception&) {
```

If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: dtolnay Differential Revision: D87273132 Pull Request resolved: pytorch#168056 Approved by: https://github.com/malfet, https://github.com/Skylion007
Summary: If the Tensor has a PyObject, its use count will now be two instead of one. Test Plan: `buck test -j 18 fbcode//mode/dev-nosan fbcode//caffe2/test:torch` Differential Revision: D87297965 Pull Request resolved: pytorch#168060 Approved by: https://github.com/albanD, https://github.com/Skylion007
As the compiler has not been supported for the last 3 years and all manylinux2_28 builds should have at least gcc-11. Prep change for the C++20 standard migration. Pull Request resolved: pytorch#167933 Approved by: https://github.com/yangw-dev, https://github.com/atalman ghstack dependencies: pytorch#168090
Pull Request resolved: pytorch#166903 Approved by: https://github.com/malfet
…rch#168104) We only want to cache the latest CI docker image for `main` and `release` branches in cases where multiple `docker-builds` workflow runs get triggered in quick succession. This is because the latest run will overwrite the cached images anyway: we do not maintain a cached image per SHA, only one per branch (to minimize cache size and docker load times at runner bringup). Also removing `workflow_dispatch` as a trigger since it won't work (it needs artifacts from a `docker-builds` run). Pull Request resolved: pytorch#168104 Approved by: https://github.com/jeffdaily
Fixes pytorch#167905. The typo correction below has been made. Existing comment: `// List of Any can contains heterogenous types` Suggested comment: `// List of Any can contains heterogeneous types` Pull Request resolved: pytorch#167907 Approved by: https://github.com/albanD
Unclear which PR in the ghstack caused the ROCm failure. Stack was (oldest at bottom):

- pytorch#167962
- pytorch#167804
- pytorch#167803
- pytorch#167802
- pytorch#168025

Fixes the following test:

```
PYTORCH_TEST_WITH_ROCM=1 python test/cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility.py FunctionVersionCompatibilityTest.test_mv_tensor_accessor_cuda_works_with_2_9
```

Pull Request resolved: pytorch#168087 Approved by: https://github.com/jeffdaily, https://github.com/janeyx99 Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Fixes a false negative (the illusion that all B200 periodic nvshmem-triton tests passed). Pull Request resolved: pytorch#167760 Approved by: https://github.com/ngimel
# Motivation
This is definitely a bug: we were attempting to release cached memory back to the system without proper **synchronization**. Callers must ensure that all accesses to memory blocks allocated by SYCL APIs have completed before invoking `sycl::free`.
For a simple example, in the following code:
```python
pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device='xpu')
    sum = input.sum()
del pool
print(sum)
```
`sum` may exhibit undefined behavior because `input.sum()` might not have finished executing before `del pool` triggers `input`'s memory release.
With this fix, we ensure that all kernels on the associated streams complete before the memory pool is destroyed, guaranteeing that `sum` holds the correct value.
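Put differently, before the fix the user could only avoid the race by synchronizing manually; a hedged sketch of that workaround:

```python
import torch

pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device="xpu")
    sum_ = input.sum()
torch.xpu.synchronize()  # ensure input.sum() finished before the pool frees its blocks
del pool
print(sum_)
```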
# Solution
Because `c10::xpu::syncStreamsOnDevice` has host overhead, we use a boolean flag `streams_synced` to ensure it is called only once.
Pull Request resolved: pytorch#168074
Approved by: https://github.com/EikanWang
…rch#167923) As in the title. The my_shape test is added to reproduce https://github.com/pytorch/audio/actions/runs/19395471276/job/55494871226. Pull Request resolved: pytorch#167923 Approved by: https://github.com/janeyx99, https://github.com/mikaylagawarecki
…6833) The implementation plan of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] pytorch#166831
- [ ] ->pytorch#166833
- [ ] pytorch#166843

Pull Request resolved: pytorch#166833 Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key. Pull Request resolved: pytorch#168063 Approved by: https://github.com/yushangdi
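A loose sketch of the keying idea with hypothetical names (the real change lives in inductor's wrapper codegen):

```python
cache: dict[tuple[int, str], str] = {}

def lower_cached(graph, name: str) -> str:
    # Keying on id(graph) prevents entries for one GraphLowering from being
    # shared with, or keeping alive, another GraphLowering's state.
    key = (id(graph), name)
    if key not in cache:
        cache[key] = f"lowered::{name}"  # stand-in for real codegen output
    return cache[key]
```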
Pull Request resolved: pytorch#167769 Approved by: https://github.com/ngimel, https://github.com/leofang
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#168111 Approved by: https://github.com/ezyang
…torch#166273) Partially vibe-coded with ClaudeCode, and changes the following ops (summary also created by Claude):

- **Activation operations**: Added checks rejecting Long, Complex, and Bool types for operations like log_softmax, log_sigmoid, mish, softplus, and silu, as MPS doesn't support exponent operations on these types
- **Linear algebra operations**: Restricted linalg_lu_factor, linalg_solve, and linalg_solve_triangular to Float type only (previously only checked for complex types)
- **Pooling operations**: Added checks to reject Complex types for avg_pool2d and max_pool2d operations
- **Loss functions**: Added type checks for nll_loss (Complex), huber_loss (Long, Complex), and grid_sampler_2d (Complex)
- **Reduction operations**:
  - Fixed NANSUM to handle integral types correctly (they can't contain NaN, so it just performs a regular sum)
  - Added a Long type check for std/var operations
- **Other operations**:
  - softmax: now explicitly requires floating-point types
  - bincount: rejects Bool type to prevent crashes

All checks use `TORCH_CHECK_NOT_IMPLEMENTED`. Pull Request resolved: pytorch#166273 Approved by: https://github.com/manuelcandales
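A small check of the user-visible effect, assuming an MPS device is available; the exact exception type is an assumption based on `TORCH_CHECK_NOT_IMPLEMENTED`:

```python
import torch

if torch.backends.mps.is_available():
    x = torch.ones(4, dtype=torch.long, device="mps")
    try:
        torch.nn.functional.softplus(x)  # Long is now rejected up front
    except (NotImplementedError, RuntimeError) as e:
        print("rejected:", e)
```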
Summary: Shrink binary size to reduce relocation overflows. The most important change is to split `intrusive_ptr::reset_()` into two functions and mark the bigger one as `C10_NOINLINE`. Differential Revision: D87308588 Pull Request resolved: pytorch#168080 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet, https://github.com/ezyang
# Motivation
Thanks to @KarhouTam for finding the issue mentioned in pytorch#167172. This PR aims to improve the build logic for Kineto activities.
# Additional Context
Fixes pytorch#167172
Pull Request resolved: pytorch#167204 Approved by: https://github.com/EikanWang, https://github.com/ezyang
The test is skipped under a condition that also needs to be checked here; otherwise, if the condition is not met and the test still executes, it fails because the exit code is -6 rather than zero. Fixes pytorch#154441 @kwen2501 Pull Request resolved: pytorch#167971 Approved by: https://github.com/kwen2501
This is a ~20x speedup for benchmarks/dynamo/microbenchmarks/optree_tree_map.py Pull Request resolved: pytorch#168342 Approved by: https://github.com/anijain2305
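For context, the benchmarked operation maps a function over a nested pytree; a minimal optree example:

```python
import optree

tree = {"a": [1, 2], "b": (3, {"c": 4})}
print(optree.tree_map(lambda x: x * 2, tree))
# {'a': [2, 4], 'b': (6, {'c': 8})}
```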
`tuple[int]` -> `tuple[int, ...]` (one element -> arbitrarily many). Like a shape annotation: `shape: tuple[int, ...]  # [B, Hq, M, Hkv, N, D]`. Inspired by pytorch#168320. Pull Request resolved: pytorch#168892 Approved by: https://github.com/cyyever, https://github.com/Skylion007
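A two-line illustration of the difference:

```python
shape_one: tuple[int] = (2,)            # exactly one int
shape_any: tuple[int, ...] = (2, 3, 4)  # any number of ints, like a shape
```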
…ch/_inductor/codegen/cuda (pytorch#160685)" This reverts commit 7556637. Reverted pytorch#160685 on behalf of https://github.com/yangw-dev due to failed internal tests: test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name], KeyError: 'cuda.cutlass_dir'. Diff: D87660662 ([comment](pytorch#160174 (comment)))
…pytorch#160174)" This reverts commit 008ac43. Reverted pytorch#160174 on behalf of https://github.com/yangw-dev due to failed internal tests: test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name], KeyError: 'cuda.cutlass_dir'. Diff: D87660662 ([comment](pytorch#160174 (comment)))
…ytorch#167456)" This reverts commit 4ee6b3d. Reverted pytorch#167456 on behalf of https://github.com/yangw-dev due to a failed internal test (Diff D87660150), error: ModuleNotFoundError: No module named 'extension_backends' ([comment](pytorch#167456 (comment)))
…13 RecursionError problems (pytorch#167888)" This reverts commit 24e1958. Reverted pytorch#167888 on behalf of https://github.com/yangw-dev due to a failed internal test ("Tracing payload for Mock should not be called: pt2_compile_chromium_events; Fatal Python error: Segmentation fault"); please remerge after fixing it ([comment](pytorch#167888 (comment)))
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#168315 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#168365 Approved by: https://github.com/anijain2305
…68893) Pull Request resolved: pytorch#168893 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168365
…ytorch#168221) To avoid circular import issues: - utils.py used to include registration functions which import/depend on DTensor.sharding_propagator - I plan to use other utils from utils.py inside sharding_propagator.py Pull Request resolved: pytorch#168221 Approved by: https://github.com/albanD
This was a workaround for gcc-8 on ARM, introduced by pytorch#44199, which is no longer relevant as CentOS-8 is past its EOL. I was reminded of it while looking at pytorch#168907. Pull Request resolved: pytorch#168909 Approved by: https://github.com/Skylion007, https://github.com/nimeduhansaka
This PR adds comprehensive benchmarks for PyTorch optimizers to measure optimizer.step() performance across different parameter configurations (a sketch of the approach follows below).

### Optimizers benchmarked:
- AdamW
- Adam
- SGD (with momentum=0.9)
- RMSprop
- Adagrad

### Test configurations:
- num_params: [1, 10, 100]
- param_size: [100K, 1M, 10M]

Pull Request resolved: pytorch#168101 Approved by: https://github.com/slayton58
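A rough sketch of how such a step benchmark can be written; the harness below is an assumption, not the PR's code:

```python
import time
import torch

def bench_step(opt_cls, num_params=10, param_size=1_000_000, iters=50, **kwargs):
    params = [torch.randn(param_size, requires_grad=True) for _ in range(num_params)]
    for p in params:
        p.grad = torch.randn_like(p)
    opt = opt_cls(params, **kwargs)
    opt.step()  # warm-up; also materializes optimizer state
    start = time.perf_counter()
    for _ in range(iters):
        opt.step()
    return (time.perf_counter() - start) / iters

print(bench_step(torch.optim.SGD, lr=0.01, momentum=0.9))
```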
…orch#168369) Previously, these error messages would get truncated when they were hit on device 0, because the device index is a "char" (actually an int8_t), so device 0 becomes byte 0, which is interpreted as the null terminator of a C string. Essentially, it is the same issue as pytorch#123984. There is something strange in the TORCH_CHECK_WITH macro that is causing this; I don't feel like figuring out those obscure macro details right now, though. Pull Request resolved: pytorch#168369 Approved by: https://github.com/eqy
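A toy illustration of the mechanism, with Python standing in for the C++ formatting (chr(0) is the terminator a C-string consumer stops at):

```python
device = 0  # DeviceIndex is an int8_t, so formatting it as a char yields byte 0
msg = "CUDA error on device " + chr(device) + ": out of memory"
print(msg.split("\x00")[0])  # a C-string consumer sees only "CUDA error on device "
```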
…h#167713) Fix an API typo in the CUDA graph tutorial; the API given there is wrong. Pull Request resolved: pytorch#167713 Approved by: https://github.com/jerryzh168
## Remove deprecated `split_cat_fx_passes`

First of a couple of small PRs that remove deprecated and unused code. Remove the deprecated `split_cat_fx_passes` configuration variable from the inductor config and clean up associated test patches.

### Changes
- Remove `split_cat_fx_passes` from `torch/_inductor/config.py`
- Remove `@patch.object(config, "split_cat_fx_passes", False)` decorators from tests in `test/inductor/test_perf.py`

Pull Request resolved: pytorch#167738 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/cyyever
This PR removes unnecessary thrust::tie. Pull Request resolved: pytorch#168943 Approved by: https://github.com/ngimel
This PR removes unnecessary uses of thrust::tuple before moving to CCCL. Pull Request resolved: pytorch#168936 Approved by: https://github.com/ngimel
… func (pytorch#167723) Fixes pytorch#167197. The inductor backend was trying to convert the float infinity value to an integer in pow lowering (possibly for indexing, iteration counts, or type conversions). Python/C++ cannot convert float('inf') to an integer, which caused the overflow error; a minimal reproduction follows below. Pull Request resolved: pytorch#167723 Approved by: https://github.com/shunting314
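The failure mode reduces to:

```python
try:
    int(float("inf"))
except OverflowError as e:
    print(e)  # cannot convert float infinity to integer
```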
Pull Request resolved: pytorch#168394 Approved by: https://github.com/jansel
Pull Request resolved: pytorch#166436 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
This reverts commit 1328a02. Reverted pytorch#168122 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#168122 (comment)))
Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as NOT_BUILT
Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as SUCCESS
Jenkins build for ecdea869f81e8ecd9051f75810c1eafd6ec5523b commit finished as ABORTED
rocm_base: 5ca076d
upstream_main: 654c5fb