ExpressionEvaluator validates allocation domain for Set.Permute. #1167
ExpressionEvaluator validates allocation domain for Set.Permute. #1167
Conversation
|
!build |
jjsjann123
left a comment
There was a problem hiding this comment.
I think we should respect explicit alloc domain and contiguity, be it an intermediate or output. If fusion API has it and it's going to be used by scheduler as well as fusion definition, stride in those sense becomes semantic and is no longer an implementation detail.
Is an empty allocation domain compatible with any stride order or with only row-major?
For empty allocation domain, I tend to think this is left to interpretation. i.e. if there's no constraint, we don't necessarily enforce anything there.
i.e. feel free to create aliases if that means we are returning tensor in arbitrary stride order.
| at::Tensor out_tensor = in_tensor.permute(computePermutation( | ||
| out_tv->getRootDomain(), out_tv->getRFactorDomain())); | ||
|
|
||
| inferAndValidateAllocationSizesAndStrides(out_tensor, out_tv, ee); |
There was a problem hiding this comment.
oh, so we are just totally not using allocDomain at all here. 😿
So the validation here would just always fail when we have permuted/transformed alloc domain?
There was a problem hiding this comment.
Generally, I think it does make sense to validate with allocation domain, but my concern are:
inferAndValidateAllocationSizesAndStridesis slow, and IIUCLoadStoreOp::evaluatewill become a hot path because it will be executing at runtime. At some point, we might need to worry about CPU overhead here.- I don't know how the alias analysis works, but, for example, if I have rFactor
[I1, I2]and allocation[I2, I1], thenview(-1)should be invalid, and we should doreshape(-1), which is not an alias. I think the alias analysis should take this into consideration. I think we can not really have an accurate check unless that is built out. But depending on how you would like to incrementally approach the final destination, I have no preference on whether we should add this check first or add complete handling allocation first.
There was a problem hiding this comment.
inferAndValidateAllocationSizesAndStrides is slow, and IIUC LoadStoreOp::evaluate will become a hot path because it will be executing at runtime. At some point, we might need to worry about CPU overhead here.
Good point -- I'm quite concerned about that too. @jjsjann123 and I happened to talk about this the other day. An alternative is to do more symbolic execution at compile time. For example, if fusion IR could represent strides (as Val*s) and propagate these symbolic strides through meta ops like reshapes and transposes, we could spend less run time. This would involve a much larger change (possibly redefine allocation domain) so I didn't take that path for now.
cc @tfogal
I think the alias analysis should take this into consideration.
It does, e.g.,
Fuser/csrc/optimization/alias_analysis.cpp
Line 103 in 6751f27
out_tv is an alias of in_tv, then ExpressionEvaluator is able to bind in_tv to any valid in_tensor (i.e. matching allocation domain and contiguity etc) and evaluate out_tv as a out_tensor that's an alias of in_tensor.
There was a problem hiding this comment.
oh, so we are just totally not using allocDomain at all here. 😿
So the validation here would just always fail when we have permuted/transformed alloc domain?
Right. In fact, I don't even know how to handle an allocation domain that's not merely a permutation of the rfactor domain (see my TODO below). For example, let's say the rfactor domain is [N,H,W,C] and the allocation domain is [N,C/32,H,W,32]. out_tensor must still have the shape of [N,H,W,C] (because the rfactor domain dictates the logical shape), but what are going to be the strides?!
PS: It's interesting that, at this point, fusion IR is more expressive than ATen ops in some cases (e.g. the allocation domain example above) and less expressive in some other cases (e.g. not being able to represent strides). I'm getting nervous about inconsistency like this will bite us at some point in the future.
There was a problem hiding this comment.
less expressive in some other cases (e.g. not being able to represent strides)
This feels more like a implementation thing. Up until this point, we only deal with strides in our indexing. So the concept of stride isn't part of our fusion IR.
I'm not saying your concern isn't legit. I don't know if there's something in our indexing code that can be unified/repurposed to represent stride.
There was a problem hiding this comment.
I'm getting nervous about inconsistency like this will bite us at some point in the future.
For the place where we need to interface with ATen, we should check consistency and disallow things like [N, C/32, H, W, 32] but still allow things like [N, H, W, C/32, 32]. For intermediate tensors, I don't think we should have any restriction on what allocation domain can be.
There was a problem hiding this comment.
For the place where we need to interface with ATen
Due to the power of ExpressionEvaluator, it's too hard to differentiate such places. A user can ask ExpressionEvaluator to turn any TensorView (including intermediate ones) into an at::Tensor. Furthermore, even if the user asks ExpressionEvaluator for a fusion output TensorView, ExpressionEvaluator internally computes all dependent TensorViews, many of which are intermediate, as at::Tensors.
AFAICS, for this particular case, it's probably best to proceed with the assumption that allocation must be an rfactor permutation. Because I don't see another good way around. Do you all agree? We should probably have another chat about what's the right assumption between allocation and rfactor in general.
There was a problem hiding this comment.
AFAICS, for this particular case, it's probably best to proceed with the assumption that allocation must be an rfactor permutation.
This is definitely better than just ignoring the allocation domain and produce silent wrong result. I think we can start from here.
| [](auto c_flag) { return c_flag.has_value(); }), | ||
| "The size of contiguity mismatch with the dimensionality of allocation domain"); | ||
|
|
||
| // Validate that strides must be in decreasing order modulo reduction and |
There was a problem hiding this comment.
Why strides must be decreasing? If we have a discontiguous tensor, can't it be arbitrary?
There was a problem hiding this comment.
IIUC, the input sizes and strides are rearranged according to the allocation domain and therefore in major-to-minor order. So I'm expecting strides to be in decreasing order modulo broadcast and reduction. Otherwise, it's not compatible with the allocation domain.
There was a problem hiding this comment.
I think:
- Fusion executor cache should reorder it as you described.
- Even if fusion executor cache does not reorder it, as long as it is marked as discontiguous, the rest of our system should support it.
So I do think we need such a check in our system, but not here.
There was a problem hiding this comment.
I think we might have talked past each other. The reorder happens
Fuser/csrc/tensor_metadata.cpp
Line 318 in b4ef869
inferAndValidateAllocationSizesAndStrides. validateAllocationSizesAndStrides is an internal function whose input sizes and strides are guaranteed to be reordered (and thus my expectation). Does that make sense?
There was a problem hiding this comment.
Are we talking about the same reorder? I think there are two different "reorder"s we can think about. One reorder is to reorder the computed sizes and strides to be consistent with the allocation domain. This reorder does not change allocation domain itself. For this reordering, it happens in the code you refers to. But the "reorder" I was referring to is to set the allocation domain so that the order of the allocation domain matches with the stride order of the ATen tensor. This is a process similar to how we set the contiguity of a tensor.
There was a problem hiding this comment.
allocDomain is only semantically meaningful when we have the corresponding contiguity flag set as true?
Interesting. Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?
There was a problem hiding this comment.
What's the purpose of allocation domain if it doesn't dictate a physical memory layout (and therefore a stride order)?
It does dictate the physical memory layout, but "dictate the physical memory layout" can not be simply translated to "require the stride order to be decreasing". For example, if I have a tensor whose rFactor domain is [I1, I2, I3], shape is (4, 5, 6), and stride is (6, 2, 1), then it might makes more sense to have allocation domain [I2, I1, I3] and contiguity (false, true, true), instead of having allocation domain [I1, I2, I3] and contiguity (false, false, true).
There was a problem hiding this comment.
Does that mean
allocation_domain[i]is completely useless whencontiguity[i]==false?
No, it is not completely useless. I think there are two separate questions:
- If we see a tensor allocated by someone else, would we accept it or not?
- If we should allocate a tensor, how would we allocate it?
Following this principle: https://en.wikipedia.org/wiki/Robustness_principle, which basically says "be conservative in what you do, be liberal in what you accept from others", for question 1, as long as we can correctly indexing it, we should accept it. When contiguity is false, we are indexing it in a pretty general way that does not assume stride order, so we should accept it. For question 2, when we should allocate our tensor, we should follow the order defined by allocation domain. In that sense, the allocation is still dictating the physical layout.
There was a problem hiding this comment.
maybe we are not that far away from the reality that
allocDomainandcontiguityshould come as a single package 🤷
I think they are already a single package, no?
There was a problem hiding this comment.
Re @zasdfgbnm
I think they are already a single package, no?
Conceptually I think yes. But looking at the implementation we have, it's likely a no... hence the confusion that Jingyue is having regarding the topic.
Re @wujingyue
allocDomain is only semantically meaningful when we have the corresponding contiguity flag set as true?
Interesting. Does that mean allocation_domain[i] is completely useless when contiguity[i]==false?
Sorta yes. You would also need contiguity[i-1] == false.
But this isn't a definitive answer there. As @zasdfgbnm points out, we can still rely on the order for scheduler, but that's should be just a performance hint.
I think Xiang's point here is that, we shouldn't assert on stride order when corresponding contiguity flags were set to false.
I agree with this ^^^. Even though in general, I'm not a big fan of "be conservative in what you do, be liberal in what you accept from others" 😆
Makes sense -- that's most strict and costly to implement. I think it implies most of our |
|
I'm wondering if it makes sense to limit some of our analysis on alloc_domain to be I know @zasdfgbnm has use case of transformed alloc_domain to represent swizzle, but that's mostly just a scheduling thing?!?! Alias analysis happening before the scheduler shouldn't need to worry about that, since we are looking at fusion before scheduling any way. |
It's not all about alias analysis. We call |
|
I have a lot to catch up on here, so apologies for not commenting technically. At a more logistical level: it seems we're discussing semantics (and a possible change in semantics) of core components of nvFuser's IR. I suggest we don't have these discussions inside a github PR, or issue, etc., but rather as a group at, say, the Tuesday meeting series. |
Good point! We do have tomorrow's meeting scheduled for discussion on |
|
This PR needs more work. #986 explains why memory swizzling requires allocation domain to be more than an rfactor permutation. This means that This leaves
I haven't decided which option to take. Chances are, in the short term, they only matter in terms of implementation complexity. |
This was a fun exercise. It's half done -- see my TODO. But I'm curious to hear your thoughts before I continue.