Skip to content

Comments

ExpressionEvaluator validates allocation domain for Set.Permute. #1167

Closed
wujingyue wants to merge 4 commits intomainfrom
evaluator_allocation_domain
Closed

ExpressionEvaluator validates allocation domain for Set.Permute. #1167
wujingyue wants to merge 4 commits intomainfrom
evaluator_allocation_domain

Conversation

@wujingyue
Copy link
Collaborator

@wujingyue wujingyue commented Oct 27, 2023

This was a fun exercise. It's half done -- see my TODO. But I'm curious to hear your thoughts before I continue.

  1. Should ExpressionEvaluator guarantee to produce tensors that always respect allocation domain and contiguity?
  2. If so, should it guarantee that for just fusion outputs or every intermediate tensor as well?
  3. Is an empty allocation domain compatible with any stride order or with only row-major?

@wujingyue
Copy link
Collaborator Author

!build

@wujingyue wujingyue added the allocation domain issues related to allocation domain support label Oct 27, 2023
Copy link
Collaborator

@jjsjann123 jjsjann123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should respect explicit alloc domain and contiguity, be it an intermediate or output. If fusion API has it and it's going to be used by scheduler as well as fusion definition, stride in those sense becomes semantic and is no longer an implementation detail.

Is an empty allocation domain compatible with any stride order or with only row-major?

For empty allocation domain, I tend to think this is left to interpretation. i.e. if there's no constraint, we don't necessarily enforce anything there.
i.e. feel free to create aliases if that means we are returning tensor in arbitrary stride order.

at::Tensor out_tensor = in_tensor.permute(computePermutation(
out_tv->getRootDomain(), out_tv->getRFactorDomain()));

inferAndValidateAllocationSizesAndStrides(out_tensor, out_tv, ee);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, so we are just totally not using allocDomain at all here. 😿

So the validation here would just always fail when we have permuted/transformed alloc domain?

Copy link
Collaborator

@zasdfgbnm zasdfgbnm Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I think it does make sense to validate with allocation domain, but my concern are:

  1. inferAndValidateAllocationSizesAndStrides is slow, and IIUC LoadStoreOp::evaluate will become a hot path because it will be executing at runtime. At some point, we might need to worry about CPU overhead here.
  2. I don't know how the alias analysis works, but, for example, if I have rFactor [I1, I2] and allocation [I2, I1], then view(-1) should be invalid, and we should do reshape(-1), which is not an alias. I think the alias analysis should take this into consideration. I think we can not really have an accurate check unless that is built out. But depending on how you would like to incrementally approach the final destination, I have no preference on whether we should add this check first or add complete handling allocation first.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inferAndValidateAllocationSizesAndStrides is slow, and IIUC LoadStoreOp::evaluate will become a hot path because it will be executing at runtime. At some point, we might need to worry about CPU overhead here.

Good point -- I'm quite concerned about that too. @jjsjann123 and I happened to talk about this the other day. An alternative is to do more symbolic execution at compile time. For example, if fusion IR could represent strides (as Val*s) and propagate these symbolic strides through meta ops like reshapes and transposes, we could spend less run time. This would involve a much larger change (possibly redefine allocation domain) so I didn't take that path for now.

cc @tfogal

I think the alias analysis should take this into consideration.

It does, e.g.,

// This is a sufficient but not necessary condition for `out` to alias
. The invariant I've been enforcing is: if alias analysis reports out_tv is an alias of in_tv, then ExpressionEvaluator is able to bind in_tv to any valid in_tensor (i.e. matching allocation domain and contiguity etc) and evaluate out_tv as a out_tensor that's an alias of in_tensor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, so we are just totally not using allocDomain at all here. 😿

So the validation here would just always fail when we have permuted/transformed alloc domain?

Right. In fact, I don't even know how to handle an allocation domain that's not merely a permutation of the rfactor domain (see my TODO below). For example, let's say the rfactor domain is [N,H,W,C] and the allocation domain is [N,C/32,H,W,32]. out_tensor must still have the shape of [N,H,W,C] (because the rfactor domain dictates the logical shape), but what are going to be the strides?!

PS: It's interesting that, at this point, fusion IR is more expressive than ATen ops in some cases (e.g. the allocation domain example above) and less expressive in some other cases (e.g. not being able to represent strides). I'm getting nervous about inconsistency like this will bite us at some point in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less expressive in some other cases (e.g. not being able to represent strides)

This feels more like a implementation thing. Up until this point, we only deal with strides in our indexing. So the concept of stride isn't part of our fusion IR.
I'm not saying your concern isn't legit. I don't know if there's something in our indexing code that can be unified/repurposed to represent stride.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting nervous about inconsistency like this will bite us at some point in the future.

For the place where we need to interface with ATen, we should check consistency and disallow things like [N, C/32, H, W, 32] but still allow things like [N, H, W, C/32, 32]. For intermediate tensors, I don't think we should have any restriction on what allocation domain can be.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the place where we need to interface with ATen

Due to the power of ExpressionEvaluator, it's too hard to differentiate such places. A user can ask ExpressionEvaluator to turn any TensorView (including intermediate ones) into an at::Tensor. Furthermore, even if the user asks ExpressionEvaluator for a fusion output TensorView, ExpressionEvaluator internally computes all dependent TensorViews, many of which are intermediate, as at::Tensors.

AFAICS, for this particular case, it's probably best to proceed with the assumption that allocation must be an rfactor permutation. Because I don't see another good way around. Do you all agree? We should probably have another chat about what's the right assumption between allocation and rfactor in general.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, for this particular case, it's probably best to proceed with the assumption that allocation must be an rfactor permutation.

This is definitely better than just ignoring the allocation domain and produce silent wrong result. I think we can start from here.

[](auto c_flag) { return c_flag.has_value(); }),
"The size of contiguity mismatch with the dimensionality of allocation domain");

// Validate that strides must be in decreasing order modulo reduction and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why strides must be decreasing? If we have a discontiguous tensor, can't it be arbitrary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the input sizes and strides are rearranged according to the allocation domain and therefore in major-to-minor order. So I'm expecting strides to be in decreasing order modulo broadcast and reduction. Otherwise, it's not compatible with the allocation domain.

Copy link
Collaborator

@zasdfgbnm zasdfgbnm Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think:

  1. Fusion executor cache should reorder it as you described.
  2. Even if fusion executor cache does not reorder it, as long as it is marked as discontiguous, the rest of our system should support it.

So I do think we need such a check in our system, but not here.

Copy link
Collaborator Author

@wujingyue wujingyue Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might have talked past each other. The reorder happens

// need to put them to the correct order.
in inferAndValidateAllocationSizesAndStrides. validateAllocationSizesAndStrides is an internal function whose input sizes and strides are guaranteed to be reordered (and thus my expectation). Does that make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we talking about the same reorder? I think there are two different "reorder"s we can think about. One reorder is to reorder the computed sizes and strides to be consistent with the allocation domain. This reorder does not change allocation domain itself. For this reordering, it happens in the code you refers to. But the "reorder" I was referring to is to set the allocation domain so that the order of the allocation domain matches with the stride order of the ATen tensor. This is a process similar to how we set the contiguity of a tensor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocDomain is only semantically meaningful when we have the corresponding contiguity flag set as true?

Interesting. Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of allocation domain if it doesn't dictate a physical memory layout (and therefore a stride order)?

It does dictate the physical memory layout, but "dictate the physical memory layout" can not be simply translated to "require the stride order to be decreasing". For example, if I have a tensor whose rFactor domain is [I1, I2, I3], shape is (4, 5, 6), and stride is (6, 2, 1), then it might makes more sense to have allocation domain [I2, I1, I3] and contiguity (false, true, true), instead of having allocation domain [I1, I2, I3] and contiguity (false, false, true).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?

No, it is not completely useless. I think there are two separate questions:

  1. If we see a tensor allocated by someone else, would we accept it or not?
  2. If we should allocate a tensor, how would we allocate it?

Following this principle: https://en.wikipedia.org/wiki/Robustness_principle, which basically says "be conservative in what you do, be liberal in what you accept from others", for question 1, as long as we can correctly indexing it, we should accept it. When contiguity is false, we are indexing it in a pretty general way that does not assume stride order, so we should accept it. For question 2, when we should allocate our tensor, we should follow the order defined by allocation domain. In that sense, the allocation is still dictating the physical layout.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we are not that far away from the reality that allocDomain and contiguity should come as a single package 🤷

I think they are already a single package, no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re @zasdfgbnm

I think they are already a single package, no?

Conceptually I think yes. But looking at the implementation we have, it's likely a no... hence the confusion that Jingyue is having regarding the topic.

Re @wujingyue

allocDomain is only semantically meaningful when we have the corresponding contiguity flag set as true?

Interesting. Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?

Sorta yes. You would also need contiguity[i-1] == false.
But this isn't a definitive answer there. As @zasdfgbnm points out, we can still rely on the order for scheduler, but that's should be just a performance hint.
I think Xiang's point here is that, we shouldn't assert on stride order when corresponding contiguity flags were set to false.
I agree with this ^^^. Even though in general, I'm not a big fan of "be conservative in what you do, be liberal in what you accept from others" 😆

@wujingyue
Copy link
Collaborator Author

I think we should respect explicit alloc domain and contiguity, be it an intermediate or output. If fusion API has it and it's going to be used by scheduler as well as fusion definition, stride in those sense becomes semantic and is no longer an implementation detail.

Makes sense -- that's most strict and costly to implement. I think it implies most of our XyzOp::evaluates are wrong, because AFAICT none of them (except this one I'm going to submit) look into the allocation domain at all.

@jjsjann123
Copy link
Collaborator

I'm wondering if it makes sense to limit some of our analysis on alloc_domain to be permutation of rfactor.

I know @zasdfgbnm has use case of transformed alloc_domain to represent swizzle, but that's mostly just a scheduling thing?!?! Alias analysis happening before the scheduler shouldn't need to worry about that, since we are looking at fusion before scheduling any way.

@wujingyue
Copy link
Collaborator Author

I'm wondering if it makes sense to limit some of our analysis on alloc_domain to be permutation of rfactor.

It's not all about alias analysis. We call ExpressionEvaluator after all compilation. We can't stop people from using ExpressionEvaluator on a ViewOp whose allocation is not a rfactor permutation. Can we?

@tfogal
Copy link
Contributor

tfogal commented Oct 30, 2023

I have a lot to catch up on here, so apologies for not commenting technically.

At a more logistical level: it seems we're discussing semantics (and a possible change in semantics) of core components of nvFuser's IR. I suggest we don't have these discussions inside a github PR, or issue, etc., but rather as a group at, say, the Tuesday meeting series.

@jjsjann123
Copy link
Collaborator

I have a lot to catch up on here, so apologies for not commenting technically.

At a more logistical level: it seems we're discussing semantics (and a possible change in semantics) of core components of nvFuser's IR. I suggest we don't have these discussions inside a github PR, or issue, etc., but rather as a group at, say, the Tuesday meeting series.

Good point!

We do have tomorrow's meeting scheduled for discussion on allocDomain, I'll put a slide on this topic and get some input from the group on this.

@wujingyue wujingyue marked this pull request as draft November 5, 2023 23:55
@wujingyue
Copy link
Collaborator Author

wujingyue commented Nov 6, 2023

This PR needs more work.

#986 explains why memory swizzling requires allocation domain to be more than an rfactor permutation. This means that ExpressionEvaluator::evaluate(TensorView*) might not always be able to return a valid at::Tensor, e.g., when the input TensorView has a rfactor domain of [i0,i1] and an allocation domain of [i0/2,i1/2,4].

This leaves ExpressionEvaluator two options to treat allocation domains.

  1. The strict approach. ExpressionEvaluator::evaluate(TensorView*) errors out when the allocation domain of the input TensorView is beyond what strides can represent. This applies to intermediate TensorViews as well because ExpressionEvaluator::evaluate is recursive.
  2. The loose approach. ExpressionEvaluator::evaluate(TensorView*) takes the best effort to produce an at::Tensor that complies with the allocation domain. This way, the user of ExpressionEvaluator::evaluate(TensorView*) bears the responsibility to check the allocation domain required (e.g. by alias analysis or by Framework) is compatible with the output strides. This check likely only needs to happen at fusion outputs, so it tolerates incompatible allocation domains of intermediate tensors.

I haven't decided which option to take. Chances are, in the short term, they only matter in terms of implementation complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

allocation domain issues related to allocation domain support

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants