forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251103 #2777
Merged
Conversation
Pull Request resolved: pytorch#166575 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: This adds a waitcounter for whether or not the pool is running, as well as whether we are running jobs. This also adds waitcounters for the first job within a pool. First job and running are working correctly. The job waitcounter seems to either be detecting a leak of a job, or is subtly broken. Test Plan: We've tested this internally and see valid ods metrics. Note that we may be leaking jobs, or the job logic may not be handling an exception correctly. Differential Revision: D83705931 Pull Request resolved: pytorch#164527 Approved by: https://github.com/masnesral
# why
- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting them for the inputs

# what
- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing
- currently only supports templates that go through `V.choices.get_template_configs`

# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```

Differential Revision: [D85685743](https://our.internmc.facebook.com/intern/diff/D85685743) Pull Request resolved: pytorch#164978 Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Bumps [uv](https://github.com/astral-sh/uv) from 0.9.5 to 0.9.6. - [Release notes](https://github.com/astral-sh/uv/releases) - [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md) - [Commits](astral-sh/uv@0.9.5...0.9.6) --- updated-dependencies: - dependency-name: uv dependency-version: 0.9.6 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
) Closes pytorch#164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: pytorch#164518 Approved by: https://github.com/kwen2501
* We are separating out the rocm jobs of the periodic workflow
* We are introducing a new label `ciflow/periodic-rocm-mi200` to allow us to run distributed tests only on ROCm runners, without triggering many other jobs on the `periodic.yml` workflow (via `ciflow/periodic`)
* This new workflow will also be triggered via `ciflow/periodic`, thus maintaining the old status quo.
* We are reverting to the `linux.rocm.gpu.4` label since it targets a lot more CI nodes at this point than the K8s/ARC-based `linux.rocm.gpu.mi250.4` label, which is still having some network/scaling issues.

Pull Request resolved: pytorch#166544 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR fixes a syntactic error in test_indexing.py caused by a misplaced `if`/`else` expression. Pull Request resolved: pytorch#166390 Approved by: https://github.com/jerryzh168
…ytorch#166406) Pull Request resolved: pytorch#166406 Approved by: https://github.com/ezyang
Address pytorch#165081 Pull Request resolved: pytorch#166541 Approved by: https://github.com/mlazos
…ytorch#166384) This PR reused native_mm and mix_order_reduction for Intel GPU and enabled the corresponding test. Fixes pytorch#165370 Pull Request resolved: pytorch#166384 Approved by: https://github.com/jansel
Fixes pytorch#166475 Pull Request resolved: pytorch#166588 Approved by: https://github.com/titaiwangms
**Summary** This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.

**Benchmarking** To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.

Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs

|         | Variable Length API   | SDPA                  |
|---------|-----------------------|-----------------------|
| Runtime | 0.8189142608642578 ms | 3.263883056640625 ms  |
| TFLOPs  | 268.652               | 158.731               |

We can see that runtime for Varlen is >3x faster.

**Testing** Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA. For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify. Pull Request resolved: pytorch#164504 Approved by: https://github.com/drisspg
Summary: was digging through matmul padding for other work, and I noticed that the compute bound checking won't work on MI350 since we haven't supplied the tech specs yet. I added MI350 specs following the predefined format Test Plan: CI Differential Revision: D85804980 Pull Request resolved: pytorch#166576 Approved by: https://github.com/leitian
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#166597 Approved by: https://github.com/pytorchbot
Previously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph. With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs). Pull Request resolved: pytorch#164819 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#164304, pytorch#164522
Pull Request resolved: pytorch#165211 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#164304, pytorch#164522, pytorch#164819
merge into stream tests Pull Request resolved: pytorch#165212 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#164304, pytorch#164522, pytorch#164819, pytorch#165211
… sigmoid + CUDA kernel bug (pytorch#166568) Differential Revision: D85792537 Pull Request resolved: pytorch#166568 Approved by: https://github.com/minjang
…e index (pytorch#165356) Pull Request resolved: pytorch#165356 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#164304, pytorch#164522, pytorch#164819, pytorch#165211, pytorch#165212
…orch#161476) For pytorch#114850, we will port 3 distributed tests to Intel GPU. We enable Intel GPU with the following methods, trying our best to keep the original code style (see the sketch below):
- use `torch.accelerator.current_accelerator()` to determine the accelerator backend
- use `requires_accelerator_dist_backend` to enable `xccl`
- enable XPU for some test paths
- skip some test cases that Intel GPU does not support

Pull Request resolved: pytorch#161476 Approved by: https://github.com/weifengpy, https://github.com/guangyey
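A minimal sketch of the device-agnostic backend selection described above, assuming the standard mapping from accelerator type to collective backend; the dict and the commented-out init call are illustrative, not the tests' actual helper:

```python
import torch
import torch.distributed as dist

# Pick the collective backend from the current accelerator so the same test
# body can run on CUDA (nccl) and Intel GPU (xccl); fall back to gloo on CPU.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
backend = {"cuda": "nccl", "xpu": "xccl"}.get(device_type, "gloo")
# dist.init_process_group(backend=backend)  # rank/world_size come from the test harness
```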
This PR adds `strict=True/False` to zip calls in test utils. strict=True is passed when possible. Pull Request resolved: pytorch#166257 Approved by: https://github.com/janeyx99
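For illustration, the behavior `strict=True` enforces: mismatched iterable lengths raise instead of being silently truncated (plain Python, not code from the PR):

```python
expected = [1, 2, 3]
actual = [1, 2]

# Plain zip stops at the shorter input, silently dropping the last comparison.
assert len(list(zip(expected, actual))) == 2

# With strict=True (Python 3.10+), the mismatch raises instead of hiding a bug.
try:
    list(zip(expected, actual, strict=True))
except ValueError as exc:
    print(exc)  # zip() argument 2 is shorter than argument 1
```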
…ready (pytorch#165740) Fixes pytorch#165738 Pull Request resolved: pytorch#165740 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/desertfire
fix typos in other folders (follow-up to pytorch#166374, pytorch#166126), using the following `_typos.toml`:
```toml
[files]
extend-exclude = ["tools/linter/dictionary.txt"]

[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
Sur = "Sur"
nin = "nin"
tme = "tme"
inpt = "inpt"
mis = "mis"
Raison = "Raison"
ouput = "ouput"
nto = "nto"
Onwer = "Onwer"
callibrate = "callibrate"
ser = "ser"
Metdata = "Metdata"
```
Pull Request resolved: pytorch#166606 Approved by: https://github.com/ezyang
This PR removes unused loop variables. Pull Request resolved: pytorch#166258 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
This reverts commit 39e5cdd. Reverted pytorch#166257 on behalf of https://github.com/atalman due to Failing: test/distributed/fsdp/test_fsdp_mixed_precision.py::TestFSDPTrainEval::test_train_ema_eval_flow [GH job link](https://github.com/pytorch/pytorch/actions/runs/18934047991/job/54057218160) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/39e5cdddf7e57881c52473d1288a66f0222527e1) ([comment](pytorch#166257 (comment)))
…ut device index (pytorch#165356)" This reverts commit f1af679. Reverted pytorch#165356 on behalf of https://github.com/atalman due to test/test_rename_privateuse1_to_existing_device.py::TestRenamePrivateuseoneToExistingBackend::test_external_module_register_with_existing_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930365446/job/54046768884) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a5335263d32b5be2b2647661334d81225c3cc3fc) ([comment](pytorch#165356 (comment)))
This reverts commit a533526. Reverted pytorch#165212 on behalf of https://github.com/atalman due to test/test_rename_privateuse1_to_existing_device.py::TestRenamePrivateuseoneToExistingBackend::test_external_module_register_with_existing_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930365446/job/54046768884) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a5335263d32b5be2b2647661334d81225c3cc3fc) ([comment](pytorch#165212 (comment)))
…r triton sigmoid + CUDA kernel bug (pytorch#166568)" This reverts commit d46d8d6. Reverted pytorch#166568 on behalf of https://github.com/atalman due to Failed test/test_extension_utils.py::TestExtensionUtils::test_external_module_register_with_renamed_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18931754443/job/54050880312) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/d46d8d6f54b15ded4f2483c7bde31be124281ab8) ([comment](pytorch#166568 (comment)))
In the initial PR for overlap-preserving bucketing, for a graph like:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
We would add dependencies from mm -> ag, and from wait -> hiding_compute, to prevent bucketing from reordering these collectives so that overlap no longer occurred. However, there is an additional way for bucketing to prevent overlap.
If we were to reorder another collective so the graph looked like:
```
def foo(...):
ag = all_gather(...)
ar = all_reduce(...)
wait(ar)
hiding_compute = mm(...)
wait(ag)
```
Overlap would not occur, because the wait for the all reduce would also force realization of every collective enqueued on the same stream prior to the all reduce. NCCL uses a single stream per process group.
To model this, we initially set a strict ordering of all collective starts, waits, and hiding compute when bucketing. Then, when trying to add a collective to a bucket, we check whether we would interfere with overlap for all of the following possible bucketings:
[move collective start to bucket start, move bucket start to collective start] x [move collective wait to bucket wait, move bucket wait to collective wait].
For any of these positions, we check if overlap would have been interfered with because of stream queue semantics. Then, if not, we remove the moving start and wait from the constrained ordering of collectives, and see if it's topologically valid to merge the nodes.
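A hypothetical sketch of the stream-queue check described above; the event names and data structures are illustrative only, not the actual implementation in the PR:

```python
# Given a strict total order of collective starts, waits, and hiding compute,
# inserting a wait between a hidden collective's start and the compute hiding
# it breaks overlap: on the single per-process-group NCCL stream, that wait
# forces the hidden collective to complete before its hiding compute runs.
def move_breaks_overlap(order, hidden_pairs, new_wait_index):
    for start, hiding_compute in hidden_pairs:
        start_idx = order.index(start)
        compute_idx = order.index(hiding_compute)
        if start_idx < new_wait_index <= compute_idx:
            return True
    return False

# Example: moving wait(ar) so it lands between ag's start and the mm hiding ag.
order = ["ag_start", "ar_start", "mm", "ag_wait", "ar_wait"]
print(move_breaks_overlap(order, [("ag_start", "mm")], new_wait_index=2))  # True
```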
Pull Request resolved: pytorch#166324
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#166309
This reverts commit 79aee77. Reverted pytorch#165211 on behalf of https://github.com/atalman due to failure: test/test_python_dispatch.py::TestPythonDispatch::test_return_stream [GH job link](https://github.com/pytorch/pytorch/actions/runs/18942517662/job/54086481693) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/7563f61cc8a40a5ba21a498a2d98895b4eec3f39) ([comment](pytorch#165211 (comment)))
pytorch#165216) Replaces 71 assert statements across 11 files in `torch.distributed` with explicit if-checks raising AssertionError, to prevent the checks from being disabled by the Python `-O` flag. Fixes pytorch#164878 Pull Request resolved: pytorch#165216 Approved by: https://github.com/albanD
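For illustration, the replacement pattern (the function and message below are hypothetical, not code from the PR):

```python
import torch

def check_contiguous(tensor: torch.Tensor) -> None:
    # Before: `assert tensor.is_contiguous(), "expected a contiguous tensor"`,
    # which is skipped entirely under `python -O`; the explicit check is not.
    if not tensor.is_contiguous():
        raise AssertionError("expected a contiguous tensor")

check_contiguous(torch.ones(2, 3))
```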
Update devtoolset in Manylinux 2.28 ROCm builds. devtoolset-11 is too old and does not properly support compiling with C++20. Pull Request resolved: pytorch#166764 Approved by: https://github.com/sudharssun, https://github.com/jeffdaily
…pytorch#159689) Pull Request resolved: pytorch#159689 Approved by: https://github.com/msaroufim, https://github.com/Skylion007
This reverts commit c761999. Reverted pytorch#166361 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#166361 (comment)))
This PR cleans up unused assignments. Pull Request resolved: pytorch#166791 Approved by: https://github.com/xmfan
By wrapping the python objects with FakeScriptObject(FakeOpaqueQueue) we restrict users from doing anything to this object. torch.compile support can be easily enabled by the rest of [this stack](pytorch#163936) and existing support for ScriptObjects. One thing to note is that by default in functionalization we mark all ops that take in FakeScriptObjects as being effectful. Should this be the case for these custom ops that take in python objs? Pull Request resolved: pytorch#165005 Approved by: https://github.com/zou3519
Fixes pytorch#165865

## What this PR does
- [x] Add `generator` arg to `rand*_like` APIs (`rand_like()`, `randn_like()`, `randint_like()`).
- [x] Add unit tests for `rand*_like` APIs
- [x] Add corresponding arg docs
- [x] Refactor `rand*_like()` code in `TensorFactories.cpp`
- [x] Add corresponding and formerly missed items in `VmapModeRegistrations.cpp`

## Example (using `rand_like()`)
```python
gen0 = torch.Generator()
gen1 = torch.Generator()
gen2 = torch.Generator()

gen0.manual_seed(42)
gen1.manual_seed(42)
gen2.manual_seed(2025)

tensor = torch.empty(10)
t0 = torch.rand_like(tensor, generator=gen0)
t1 = torch.rand_like(tensor, generator=gen1)
t2 = torch.rand_like(tensor, generator=gen2)

assert torch.equal(t0, t1)
assert not torch.equal(t2, t0)
assert not torch.equal(t2, t1)
```
Pull Request resolved: pytorch#166160 Approved by: https://github.com/cyyever, https://github.com/albanD
Use the more generic `Benchmarker.benchmark` function to allow benchmarking other devices that support the required functionality, for example prologue and epilogue fusion can be benchmarked for triton CPU. Pull Request resolved: pytorch#164938 Approved by: https://github.com/nmacchioni, https://github.com/eellison
Fixes pytorch#166810 Pull Request resolved: pytorch#166811 Approved by: https://github.com/slayton58, https://github.com/Skylion007, https://github.com/malfet
Adding `conv` (conv1d, conv2d, conv3d) to the list of operator microbenchmarks run in the CI script (`.ci/pytorch/test.sh`), ensuring convolution operators are now benchmarked alongside existing ones. Pull Request resolved: pytorch#166331 Approved by: https://github.com/huydhn, https://github.com/jbschlosser
…torch#166806) I missed this API for MTIAGraph in D84457757(pytorch#165963) Differential Revision: [D86026706](https://our.internmc.facebook.com/intern/diff/D86026706/) Pull Request resolved: pytorch#166806 Approved by: https://github.com/albanD ghstack dependencies: pytorch#166805
…e. (pytorch#166775) Summary: as title, we should return an entire tracing_context object instead of fake_mode only, since tracing context should contain full set of information. Test Plan: pytest test/export/test_experimental.py Pull Request resolved: pytorch#166775 Approved by: https://github.com/tugsbayasgalan
Summary: dict_keys_getitem can show up in the bytecode, but it uses dict.keys(), which is not fx traceable. fx.wrap should make it a standalone function in the graph to be invoked later with real inputs. Test Plan: pytest test/export/test_experimental.py Pull Request resolved: pytorch#166776 Approved by: https://github.com/jamesjwu ghstack dependencies: pytorch#166775
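A minimal sketch of the `fx.wrap` mechanism the summary refers to; the helper body and module are illustrative, not the code added by the PR:

```python
import torch
import torch.fx

# Wrapping marks the helper as a leaf: symbolic tracing records a call to it
# instead of tracing into dict.keys(), which a Proxy cannot support.
@torch.fx.wrap
def dict_keys_getitem(d, i):
    return list(d.keys())[i]

class M(torch.nn.Module):
    def forward(self, d):
        return dict_keys_getitem(d, 0)

gm = torch.fx.symbolic_trace(M())
print(gm.graph)  # contains a call_function node for dict_keys_getitem
```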
) Summary: make_fx() will register tensor constants as new buffers while tracing a shuffle graph for dynamo graph capture. This breaks the invariant that the resulting graph looks identical to the original eager model in terms of state dict. So we need to de-register the buffers and set them as plain tensor constants. Test Plan: pytest test/export/test_experimental.py Pull Request resolved: pytorch#166777 Approved by: https://github.com/tugsbayasgalan ghstack dependencies: pytorch#166775, pytorch#166776
…ch#166554) Fixes pytorch#166253

## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.

## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.

## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.

## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.

Pull Request resolved: pytorch#166554 Approved by: https://github.com/Lucaskabela
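An illustrative repro of the cached-fill_value behavior described above (shapes and values are mine, not the PR's test case):

```python
import torch

@torch.compile
def make_full(fill_value):
    return torch.full((2, 2), fill_value)

a = make_full(torch.tensor(1.0))
b = make_full(torch.tensor(2.0))
# Before the fix, `b` could come back filled with 1.0 because the first
# tensor fill_value was baked into the compiled graph as a constant.
assert a.eq(1.0).all() and b.eq(2.0).all()
```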
…wise/reduction consumer (pytorch#166165)" This reverts commit 94f2657. Reverted pytorch#166165 on behalf of https://github.com/izaitsevfb due to breaks test_LinearAndSoftmax_codegen test ([comment](pytorch#166165 (comment)))
…66868) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#166868 Approved by: https://github.com/maggiemoss, https://github.com/zou3519
…rocess (pytorch#166560) Summary: DCP checkpoint background process currently determines the port used for pg via get_free_port(). During checkpoint background process initialization, gloo pg init occasionally times out on the first call but succeeds in a subsequent call. We hypothesized that the timeouts are related to the port being used, and the solution would be to create the pg with PrefixStore and reuse the master port. This diff adds the option for checkpoint background process to use PrefixStore with MASTER_ADDR + MASTER_PORT. The default behavior is unchanged. Enabling the new PrefixStore behavior requires setting "DCP_USE_PREFIX_STORE" env var to "1". context: https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/ Differential Revision: D84928180 Pull Request resolved: pytorch#166560 Approved by: https://github.com/meetv18
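A minimal sketch of opting in to the new behavior; only the `DCP_USE_PREFIX_STORE` env var name comes from the summary, and the commented-out save call is just for context:

```python
import os

# Opt in: the checkpoint background process builds its gloo pg on a PrefixStore
# over MASTER_ADDR/MASTER_PORT instead of picking a fresh free port.
os.environ["DCP_USE_PREFIX_STORE"] = "1"

# import torch.distributed.checkpoint as dcp
# dcp.async_save(state_dict, checkpoint_id=...)  # when using the background-process path
```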
…into fsdp (pytorch#166433) **Summary:** I have created a new composable replicate api that's integrated into FSDP's codebase with minimal changes. The key changes I made are when we use DDPMeshInfo, we use Replicate placements, prevent initial sharding of parameters, set worldsize to 1 to skip allgathers and reducescatter. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py 2. pytest test_pp_composability.py 3. pytest test_replicate_with_fsdp.py Pull Request resolved: pytorch#166433 Approved by: https://github.com/weifengpy
# Conflicts: # .ci/docker/requirements-ci.txt
Jenkins build for 2eea9c424bfdb5249d3e1cf516617dd616dc7f01 commit finished as FAILURE

Jenkins build for 9396162549a868bc4dc5af38a1c39f462f0b1aa2 commit finished as FAILURE
Due to issues with create_ifu_tag.yml, changes from this PR were reverted. All changes from here are now reflected in #2784
rocm_base: 777e73c