Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases #126598
base: gh/yifuwang/84/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126598
Note: Links to docs will display an error until the docs builds have been completed. ⏳ 1 Pending, 2 Unrelated Failures. As of commit ccc11e5 with merge base 4afc5c7: UNSTABLE. The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
## Context

See context [here](#122163).
This pass looks very clean and easy to understand! I mainly have some questions.
```python
@parametrize("A_dims", [2, 3])
@parametrize("gather_dim", [0, 1, 2])
@fresh_inductor_cache()
def test_fuse_all_gather_matmul(self, A_dims, gather_dim):
```
No need to add it in this PR; let's try to add the e2e integration test with ColwiseParallel for all-gather matmuls in follow-up PRs.
Why not test this in this PR? 🤔
Added a test for DTensor-based sequence parallelism.
```python
    match.nodes,
    aten.cat.default,
)[0]
shard_node = ag_node.args[0]
```
`shard_node` is not getting used?
```python
second reshape node is replaced with `new_node`.
In addition, we ensure that the original mm node ends up with zero
users by replacing it with a reverse reshape of `new_node`.
```
Could you elaborate on this part? How does replacing it with a reverse reshape of `new_node` result in the original mm node having zero users?
An ND matmul shows up in fx graphs as a reshape -> mm -> reshape sequence: the first reshape flattens the leading dims while the second one unflattens them. Consider the following fake fx graph:

```python
buf_0 = allgather(...)
buf_1 = aten.reshape(buf_0, ...)
buf_2 = aten.mm(buf_1, ...)
buf_3 = aten.reshape(buf_2, ...)
```

Since `fused_all_gather_matmul` semantically performs `matmul`s (as opposed to `mm`s), its results will replace `buf_0` and `buf_3`. It's okay if `buf_1` ends up with non-zero users, since it's just a view on `buf_0`. However, if for some reason `buf_2` ends up with non-zero users and can't be removed after the fusion (e.g. `buf_2` being returned for some reason), we'd be performing an extra `mm`.

To ensure `buf_2` has zero users after the fusion, we replace `buf_2` with the reverse reshape from `buf_3`, since `buf_3` is always available and is itself a reshape of `buf_2`.
```python
patterns = PatternMatcherPass()


def _is_backward(graph: torch.fx.Graph) -> bool:
```
Ditto: this function doesn't seem to be used; is it for debugging?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed; the first few are: trunk / macos-13-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-13-py3-arm64 / test (default, 3, 3, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Stack from ghstack (oldest at bottom):
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire