ExpressionEvaluator validates allocation domain for Set.Permute. by wujingyue · Pull Request #1167 · NVIDIA/Fuser

wujingyue · 2023-10-27T02:20:48Z

This was a fun exercise. It's half done -- see my TODO. But I'm curious to hear your thoughts before I continue.

Should ExpressionEvaluator guarantee to produce tensors that always respect allocation domain and contiguity?
If so, should it guarantee that for just fusion outputs or every intermediate tensor as well?
Is an empty allocation domain compatible with any stride order or with only row-major?

wujingyue · 2023-10-27T05:19:33Z

!build

jjsjann123

I think we should respect explicit alloc domain and contiguity, be it an intermediate or output. If fusion API has it and it's going to be used by scheduler as well as fusion definition, stride in those sense becomes semantic and is no longer an implementation detail.

Is an empty allocation domain compatible with any stride order or with only row-major?

For empty allocation domain, I tend to think this is left to interpretation. i.e. if there's no constraint, we don't necessarily enforce anything there.
i.e. feel free to create aliases if that means we are returning tensor in arbitrary stride order.

jjsjann123 · 2023-10-27T08:27:20Z

csrc/ir/nodes.cpp

      at::Tensor out_tensor = in_tensor.permute(computePermutation(
          out_tv->getRootDomain(), out_tv->getRFactorDomain()));
+
+      inferAndValidateAllocationSizesAndStrides(out_tensor, out_tv, ee);


oh, so we are just totally not using allocDomain at all here. 😿

So the validation here would just always fail when we have permuted/transformed alloc domain?

Generally, I think it does make sense to validate with allocation domain, but my concern are:

inferAndValidateAllocationSizesAndStrides is slow, and IIUC LoadStoreOp::evaluate will become a hot path because it will be executing at runtime. At some point, we might need to worry about CPU overhead here.

I don't know how the alias analysis works, but, for example, if I have rFactor [I1, I2] and allocation [I2, I1], then view(-1) should be invalid, and we should do reshape(-1), which is not an alias. I think the alias analysis should take this into consideration. I think we can not really have an accurate check unless that is built out. But depending on how you would like to incrementally approach the final destination, I have no preference on whether we should add this check first or add complete handling allocation first.

inferAndValidateAllocationSizesAndStrides is slow, and IIUC LoadStoreOp::evaluate will become a hot path because it will be executing at runtime. At some point, we might need to worry about CPU overhead here.

Good point -- I'm quite concerned about that too. @jjsjann123 and I happened to talk about this the other day. An alternative is to do more symbolic execution at compile time. For example, if fusion IR could represent strides (as Val*s) and propagate these symbolic strides through meta ops like reshapes and transposes, we could spend less run time. This would involve a much larger change (possibly redefine allocation domain) so I didn't take that path for now.

cc @tfogal

I think the alias analysis should take this into consideration.

It does, e.g.,

Fuser/csrc/optimization/alias_analysis.cpp

Line 103 in 6751f27

// This is a sufficient but not necessary condition for `out` to alias

. The invariant I've been enforcing is: if alias analysis reports out_tv is an alias of in_tv, then ExpressionEvaluator is able to bind in_tv to any valid in_tensor (i.e. matching allocation domain and contiguity etc) and evaluate out_tv as a out_tensor that's an alias of in_tensor.

oh, so we are just totally not using allocDomain at all here. 😿

So the validation here would just always fail when we have permuted/transformed alloc domain?

Right. In fact, I don't even know how to handle an allocation domain that's not merely a permutation of the rfactor domain (see my TODO below). For example, let's say the rfactor domain is [N,H,W,C] and the allocation domain is [N,C/32,H,W,32]. out_tensor must still have the shape of [N,H,W,C] (because the rfactor domain dictates the logical shape), but what are going to be the strides?!

PS: It's interesting that, at this point, fusion IR is more expressive than ATen ops in some cases (e.g. the allocation domain example above) and less expressive in some other cases (e.g. not being able to represent strides). I'm getting nervous about inconsistency like this will bite us at some point in the future.

less expressive in some other cases (e.g. not being able to represent strides)

This feels more like a implementation thing. Up until this point, we only deal with strides in our indexing. So the concept of stride isn't part of our fusion IR.
I'm not saying your concern isn't legit. I don't know if there's something in our indexing code that can be unified/repurposed to represent stride.

I'm getting nervous about inconsistency like this will bite us at some point in the future.

For the place where we need to interface with ATen, we should check consistency and disallow things like [N, C/32, H, W, 32] but still allow things like [N, H, W, C/32, 32]. For intermediate tensors, I don't think we should have any restriction on what allocation domain can be.

For the place where we need to interface with ATen

Due to the power of ExpressionEvaluator, it's too hard to differentiate such places. A user can ask ExpressionEvaluator to turn any TensorView (including intermediate ones) into an at::Tensor. Furthermore, even if the user asks ExpressionEvaluator for a fusion output TensorView, ExpressionEvaluator internally computes all dependent TensorViews, many of which are intermediate, as at::Tensors.

AFAICS, for this particular case, it's probably best to proceed with the assumption that allocation must be an rfactor permutation. Because I don't see another good way around. Do you all agree? We should probably have another chat about what's the right assumption between allocation and rfactor in general.

AFAICS, for this particular case, it's probably best to proceed with the assumption that allocation must be an rfactor permutation.

This is definitely better than just ignoring the allocation domain and produce silent wrong result. I think we can start from here.

zasdfgbnm · 2023-10-27T12:31:50Z

csrc/tensor_metadata.cpp

          [](auto c_flag) { return c_flag.has_value(); }),
      "The size of contiguity mismatch with the dimensionality of allocation domain");

+  // Validate that strides must be in decreasing order modulo reduction and


Why strides must be decreasing? If we have a discontiguous tensor, can't it be arbitrary?

IIUC, the input sizes and strides are rearranged according to the allocation domain and therefore in major-to-minor order. So I'm expecting strides to be in decreasing order modulo broadcast and reduction. Otherwise, it's not compatible with the allocation domain.

I think:

Fusion executor cache should reorder it as you described.

Even if fusion executor cache does not reorder it, as long as it is marked as discontiguous, the rest of our system should support it.

So I do think we need such a check in our system, but not here.

I think we might have talked past each other. The reorder happens

Fuser/csrc/tensor_metadata.cpp

Line 318 in b4ef869

// need to put them to the correct order.

in inferAndValidateAllocationSizesAndStrides. validateAllocationSizesAndStrides is an internal function whose input sizes and strides are guaranteed to be reordered (and thus my expectation). Does that make sense?

Are we talking about the same reorder? I think there are two different "reorder"s we can think about. One reorder is to reorder the computed sizes and strides to be consistent with the allocation domain. This reorder does not change allocation domain itself. For this reordering, it happens in the code you refers to. But the "reorder" I was referring to is to set the allocation domain so that the order of the allocation domain matches with the stride order of the ATen tensor. This is a process similar to how we set the contiguity of a tensor.

allocDomain is only semantically meaningful when we have the corresponding contiguity flag set as true?

Interesting. Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?

What's the purpose of allocation domain if it doesn't dictate a physical memory layout (and therefore a stride order)?

It does dictate the physical memory layout, but "dictate the physical memory layout" can not be simply translated to "require the stride order to be decreasing". For example, if I have a tensor whose rFactor domain is [I1, I2, I3], shape is (4, 5, 6), and stride is (6, 2, 1), then it might makes more sense to have allocation domain [I2, I1, I3] and contiguity (false, true, true), instead of having allocation domain [I1, I2, I3] and contiguity (false, false, true).

Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?

No, it is not completely useless. I think there are two separate questions:

If we see a tensor allocated by someone else, would we accept it or not?

If we should allocate a tensor, how would we allocate it?

Following this principle: https://en.wikipedia.org/wiki/Robustness_principle, which basically says "be conservative in what you do, be liberal in what you accept from others", for question 1, as long as we can correctly indexing it, we should accept it. When contiguity is false, we are indexing it in a pretty general way that does not assume stride order, so we should accept it. For question 2, when we should allocate our tensor, we should follow the order defined by allocation domain. In that sense, the allocation is still dictating the physical layout.

maybe we are not that far away from the reality that allocDomain and contiguity should come as a single package 🤷

I think they are already a single package, no?

Re @zasdfgbnm

I think they are already a single package, no?

Conceptually I think yes. But looking at the implementation we have, it's likely a no... hence the confusion that Jingyue is having regarding the topic.

Re @wujingyue

allocDomain is only semantically meaningful when we have the corresponding contiguity flag set as true?

Interesting. Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?

Sorta yes. You would also need contiguity[i-1] == false.
But this isn't a definitive answer there. As @zasdfgbnm points out, we can still rely on the order for scheduler, but that's should be just a performance hint.
I think Xiang's point here is that, we shouldn't assert on stride order when corresponding contiguity flags were set to false.
I agree with this ^^^. Even though in general, I'm not a big fan of "be conservative in what you do, be liberal in what you accept from others" 😆

wujingyue · 2023-10-27T19:32:08Z

I think we should respect explicit alloc domain and contiguity, be it an intermediate or output. If fusion API has it and it's going to be used by scheduler as well as fusion definition, stride in those sense becomes semantic and is no longer an implementation detail.

Makes sense -- that's most strict and costly to implement. I think it implies most of our XyzOp::evaluates are wrong, because AFAICT none of them (except this one I'm going to submit) look into the allocation domain at all.

jjsjann123 · 2023-10-27T22:22:44Z

I'm wondering if it makes sense to limit some of our analysis on alloc_domain to be permutation of rfactor.

I know @zasdfgbnm has use case of transformed alloc_domain to represent swizzle, but that's mostly just a scheduling thing?!?! Alias analysis happening before the scheduler shouldn't need to worry about that, since we are looking at fusion before scheduling any way.

wujingyue · 2023-10-27T22:30:09Z

I'm wondering if it makes sense to limit some of our analysis on alloc_domain to be permutation of rfactor.

It's not all about alias analysis. We call ExpressionEvaluator after all compilation. We can't stop people from using ExpressionEvaluator on a ViewOp whose allocation is not a rfactor permutation. Can we?

tfogal · 2023-10-30T17:53:20Z

I have a lot to catch up on here, so apologies for not commenting technically.

At a more logistical level: it seems we're discussing semantics (and a possible change in semantics) of core components of nvFuser's IR. I suggest we don't have these discussions inside a github PR, or issue, etc., but rather as a group at, say, the Tuesday meeting series.

jjsjann123 · 2023-10-30T18:42:39Z

I have a lot to catch up on here, so apologies for not commenting technically.

At a more logistical level: it seems we're discussing semantics (and a possible change in semantics) of core components of nvFuser's IR. I suggest we don't have these discussions inside a github PR, or issue, etc., but rather as a group at, say, the Tuesday meeting series.

Good point!

We do have tomorrow's meeting scheduled for discussion on allocDomain, I'll put a slide on this topic and get some input from the group on this.

wujingyue · 2023-11-06T06:02:06Z

This PR needs more work.

#986 explains why memory swizzling requires allocation domain to be more than an rfactor permutation. This means that ExpressionEvaluator::evaluate(TensorView*) might not always be able to return a valid at::Tensor, e.g., when the input TensorView has a rfactor domain of [i0,i1] and an allocation domain of [i0/2,i1/2,4].

This leaves ExpressionEvaluator two options to treat allocation domains.

The strict approach. ExpressionEvaluator::evaluate(TensorView*) errors out when the allocation domain of the input TensorView is beyond what strides can represent. This applies to intermediate TensorViews as well because ExpressionEvaluator::evaluate is recursive.
The loose approach. ExpressionEvaluator::evaluate(TensorView*) takes the best effort to produce an at::Tensor that complies with the allocation domain. This way, the user of ExpressionEvaluator::evaluate(TensorView*) bears the responsibility to check the allocation domain required (e.g. by alias analysis or by Framework) is compatible with the output strides. This check likely only needs to happen at fusion outputs, so it tolerates incompatible allocation domains of intermediate tensors.

I haven't decided which option to take. Chances are, in the short term, they only matter in terms of implementation complexity.

wujingyue added 3 commits October 26, 2023 15:11

WIP -- could have used inferAndValidateAllocationSizesAndStrides.

5103b1f

Use inferAndValidate to validate stride order and contiguity.

0c3c142

More tests.

f8aab50

wujingyue requested review from jjsjann123 and zasdfgbnm October 27, 2023 02:20

Fix clang-tidy.

a56d9bf

wujingyue added the enhancement label Oct 27, 2023

wujingyue added the allocation domain issues related to allocation domain support label Oct 27, 2023

jjsjann123 reviewed Oct 27, 2023

View reviewed changes

zasdfgbnm reviewed Oct 27, 2023

View reviewed changes

wujingyue requested review from jjsjann123 and zasdfgbnm October 27, 2023 21:03

wujingyue marked this pull request as draft November 5, 2023 23:55

jjsjann123 mentioned this pull request Nov 7, 2023

Clear the ambiguity for allocation domain when contiguity isn't required. #1237

Closed

wujingyue closed this Apr 21, 2025

Comments

Conversation

wujingyue commented Oct 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wujingyue commented Oct 27, 2023

Uh oh!

jjsjann123 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zasdfgbnm Oct 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zasdfgbnm Oct 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wujingyue Oct 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wujingyue commented Oct 27, 2023

Uh oh!

jjsjann123 commented Oct 27, 2023

Uh oh!

wujingyue commented Oct 27, 2023

Uh oh!

tfogal commented Oct 30, 2023

Uh oh!

jjsjann123 commented Oct 30, 2023

Uh oh!

wujingyue commented Nov 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wujingyue commented Oct 27, 2023 •

edited

Loading

zasdfgbnm Oct 27, 2023 •

edited

Loading

zasdfgbnm Oct 29, 2023 •

edited

Loading

wujingyue Oct 30, 2023 •

edited

Loading

wujingyue commented Nov 6, 2023 •

edited

Loading