
Set allocation domain of sharded tensor #2271

Merged: 9 commits merged into NVIDIA:main on May 30, 2024

Conversation

cowanmeg (Collaborator)

Sets the allocation domain of sharded tensors during the pass propagateShardingsAndSetAllocationDomain. The two passes are merged in an attempt to reduce the number of passes over all expressions in the fusion.

The allocation domain is set to the tv's leaf domain. Since presegmentation passes and scheduling occur after the sharding passes, the leaf domain is identical to the rfactor domain. Once DID parallelization of the leaf domain is allowed, the leaf and rfactor domains will no longer be the same.

This will avoid issues such as #2245 (comment) and allow the AllocationDomainPass presegmentation pass to be turned on for the distributed matmul tests.
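Conceptually, the pass does something like the following (illustrative pseudocode, not the actual implementation; `setAllocationDomain` and `getLeafDomain` are the relevant TensorView APIs, but `shardedTensorViews` is a made-up helper name):

```cpp
// Sketch only: pin each sharded TensorView's allocation domain to its leaf
// domain (identical to the rfactor domain at this point in compilation) so
// that later presegmentation passes cannot silently reorder it.
for (TensorView* tv : shardedTensorViews(fusion)) {  // hypothetical helper
  tv->setAllocationDomain(tv->getLeafDomain(), /*new_contiguity=*/true);
}
```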

@cowanmeg cowanmeg marked this pull request as draft May 20, 2024 15:03
@wujingyue wujingyue marked this pull request as ready for review May 20, 2024 16:46
@wujingyue (Collaborator) left a comment

Is it safe now to replace https://github.com/NVIDIA/Fuser/pull/2245/files#diff-db5ba7cef14ad9a3c1eaab113a6f0a6f875e92890b34078f8185e0970022ce45R298 with a check for contiguity? Although there's little performance penalty to call .contiguous on an already contiguous tensor, leaving it there would give readers the wrong impression that the input slices can be non-contiguous, which IIRC shouldn't happen with this PR.

@cowanmeg (Collaborator, Author)

Is it safe now to replace https://github.com/NVIDIA/Fuser/pull/2245/files#diff-db5ba7cef14ad9a3c1eaab113a6f0a6f875e92890b34078f8185e0970022ce45R298 with a check for contiguity? Although there's little performance penalty to call .contiguous on an already contiguous tensor, leaving it there would give readers the wrong impression that the input slices can be non-contiguous, which IIRC shouldn't happen with this PR.

Yup, the added PipelineTwoStage tests pass when the contiguous call is removed, but PipelineTestStagedReduction.StagedReduction/ReductionOnly is failing. I would hold off on reviewing until I fix it!

@jjsjann123 (Collaborator) left a comment

Some nitpicks, but overall it looks straightforward to me.

I'll let @wujingyue stamp since he seems to have more questions. (feel free to delegate it to me for a more thorough review if you don't have time to wrap it up before your vacation 🍹 )

@jjsjann123 (Collaborator)

!build

@jjsjann123 (Collaborator)

FYI, be extra careful with CI for PRs touching allocation domain. Some of our schedulers have some sharp corners on this and could give you some unexpected failures.

@cowanmeg (Collaborator, Author)

Sorry everyone who viewed this before the big changes 😓
The new update only sets the allocation domain of TVs that are in a resharding expression. This fixed most of the tests, except one DistributedMatmulTest.
There are some big changes to contiguity that this doesn't address, namely that DID axes continue to have true/false settings even though they shouldn't have a value at all, since they aren't allocated. I'll fix that in a later PR!
Also, this isn't blocking anything, so feel free to review after you get back @wujingyue
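To illustrate the contiguity point (a standalone sketch, not nvFuser code; the helper name is hypothetical): an axis that is DID-parallelized is never allocated on a single device, so its contiguity flag should carry no value rather than true or false:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: contiguity flags for an ndims-dimensional tensor whose
// did_axis is DID-parallelized. Allocated axes get a concrete true/false;
// the sharded axis is not allocated locally, so it gets std::nullopt.
std::vector<std::optional<bool>> contiguityWithDid(
    size_t ndims,
    size_t did_axis) {
  std::vector<std::optional<bool>> flags(ndims, true);
  flags[did_axis] = std::nullopt;
  return flags;
}
```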

@cowanmeg (Collaborator, Author)

!build --dist

@wujingyue wujingyue self-requested a review May 20, 2024 21:52
@cowanmeg (Collaborator, Author)

!build

@jjsjann123 (Collaborator) left a comment

Looks pretty clean to me now. Stamping.

for (auto tv : ir_utils::filterByType<TensorView>(expr->inputs())) {
  for (auto c : tv->getContiguity()) {
    if (c.has_value()) {
      NVF_CHECK(
Collaborator:

I'm confused. I thought setShardedAllocationDomain ought to make inputs of resharding exprs contiguous rather than expect them to be contiguous. Am I missing something?

Collaborator:

^^^ That doesn't sound right. You cannot change the contiguity / stride order of inputs.

IIUC, the code here validates that each input entry is contiguous and then later explicitly sets the allocation domain on each one if it is implicit.

One part I'm not totally sure about: "Resharding expression input must be contiguous". Should this check also apply to TensorViews for which we are not specifying an allocation domain?

Collaborator:

You cannot change contiguity / stride order on inputs.

You are right for fusion inputs. However, resharding Expr's inputs usually have more flexibility.

Thinking more about this, maybe the logic belongs somewhere near

TensorView* input_permute = permute(input, {{sharding_axis, 0}});
TensorView* output_permute = set(input_permute);
TensorView* new_output = permute(output_permute, {{0, sharding_axis}});
? That's where we actively reorder the resharded dimension to be outermost in rfactor, without, however, enforcing it to be outermost in allocation.
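The effect of that reordering can be shown with a standalone stride argument (illustrative code, not nvFuser's): slicing a row-major tensor along axis k yields one contiguous block only when every axis outside k is trivial, which is exactly what moving the sharded axis to position 0 guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative helper (not nvFuser code): is one shard, i.e. a slice along
// sharded_axis of a row-major tensor, a single contiguous block of memory?
// Row-major slicing keeps the original strides, so the slice is dense only
// when all axes outer to the sharded axis have extent 1.
bool shardIsContiguous(const std::vector<int64_t>& sizes, size_t sharded_axis) {
  for (size_t i = 0; i < sharded_axis; ++i) {
    if (sizes[i] != 1) {
      return false;
    }
  }
  return true;
}
```

With the sharded axis outermost (`sharded_axis == 0`) the shard is always one dense block, which is why the permute/set/permute sequence avoids non-contiguous buffers.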

Collaborator:

ProcessGroup only accepts contiguous tensors (because nccl and ucc only deal with contiguous buffers, passed as void pointers). So for now it is reasonable to only support contiguous tensors. Later, we could add support for non-contiguous tensors, but to do that we'll have no choice but to make as many process-group calls as there are contiguous components.

So IMO this "assert" makes sense for this PR, and we could remove it later by implementing what I described above.

@wujingyue this line

TensorView* input_permute = permute(input, {{sharding_axis, 0}});
TensorView* output_permute = set(input_permute);
TensorView* new_output = permute(output_permute, {{0, sharding_axis}});

is a "trick" to allow non-outermost resharding while avoiding non-contiguous buffers, by reordering the axes to place the sharded axis at the outermost position (so the actual buffer is contiguous in memory).
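The contiguity requirement itself can be stated concretely (a standalone sketch, not the ProcessGroup implementation): a tensor can be handed to nccl/ucc as a single void* only if its sizes and strides describe one dense row-major block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standalone sketch (not ProcessGroup code): true iff (sizes, strides)
// describe one dense row-major block, i.e. a buffer that can be passed to
// nccl/ucc as a single void pointer.
bool isDenseRowMajor(const std::vector<int64_t>& sizes,
                     const std::vector<int64_t>& strides) {
  int64_t expected = 1;
  for (size_t i = sizes.size(); i-- > 0;) {
    if (sizes[i] == 1) {
      continue;  // size-1 axes impose no stride constraint
    }
    if (strides[i] != expected) {
      return false;
    }
    expected *= sizes[i];
  }
  return true;
}
```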

insertReshardings(&fusion);
insertShardedAxisReordering(&fusion);
setShardedAllocationDomain(&fusion);
for (auto expr : fusion.exprs()) {
Collaborator:

How about making the check more specific? For example, is it possible to check the following:

  • I'd expect there to be only one resharding Expr, which is a sum.
  • I'd also expect the input of that Expr to have DID as the first IterDomain in its allocation domain.

@wujingyue (Collaborator)

I'll let @wujingyue stamp since he seems to have more questions. (feel free to delegate it to me for a more thorough review if you don't have time to wrap it up before your vacation 🍹 )

I won't have time to take another look before my vacation. As long as you and @jjsjann123 are confident in the change, please merge it without me. Anyhow, this is a strict improvement and my remaining comments are about potentially making allocation domains around resharding more correct.

@samnordmann (Collaborator) left a comment

Looks good to me! Thanks!

expr);
}
}
setShardedAllocationDomain(tv);
Collaborator:

Can somebody please explain why this is necessary? IIUC, we confirm this tensor is contiguous. Isn't that sufficient?

Collaborator:

We need to explicitly set the allocation domain to avoid optimization passes mutating it (i.e., an empty allocation domain is fair game for optimization passes).

For example, allocation order inference might come in and change the stride order if it's left empty, which would trigger a scheduling error because we cannot yet support stride order for resharding operations.

Collaborator:

So, this is not necessary if the allocation order inference is not done?

Collaborator:

I think so.

For the record, it's not just allocation order inference; alias passes could also update the allocation domain, introducing a similar issue. See Jingyue's comment in #2245 (comment)

cc'ing @cowanmeg for sanity check

@cowanmeg (Collaborator, Author)

!build

1 similar comment
@cowanmeg (Collaborator, Author)

!build

@cowanmeg (Collaborator, Author)

I see an unrelated tolerance error in IndexingOpTest.TorchGatherSumAdd_CUDA, so I will merge this.

@cowanmeg cowanmeg merged commit b60ea8a into NVIDIA:main May 30, 2024
36 of 37 checks passed
protonu pushed a commit that referenced this pull request May 30, 2024