Fix TensorView::clearReductionIterDomains specifies wrong contiguity flag #2200
Conversation
!build
Patching an issue exposed in #2168.
Looks good to me, but I want to make sure I understand. The allocation domain can contain any type of IterDomain, including IterType::Reduction and IterType::Broadcast. Contiguity refers specifically to TensorDomain::noReductions(tv->getAllocationDomain()). Is there a reason we use std::nullopt for broadcast contiguity whereas we don't represent reduction axes at all?
tv1->setAllocationDomain(
    {tv1->axis(0), tv1->axis(2), tv1->axis(1)},
    {std::nullopt, true, std::nullopt});
tv1->clearReductionIterDomains();
Is it possible to come up with a better test? I assume a user of the nvFuser API (Python or csrc/ops) never calls clearReductionIterDomains. So the test as is seems to be testing an internal implementation that's subject to change, not an API behavior.
> Is it possible to come up with a better test?

I like having a tight-fitting test to go with a patch PR. TensorView::clearReductionIterDomains is a public API, so I think it's fair to have a test that it no longer triggers the assert. But I agree it's not a very interesting thing to put here, and as Naoya suggested below, if I add validation here, it becomes a bad test that asserts internal implementation.
So I agree with @wujingyue here; I'll dig up the other PR for an end-to-end test that goes through FusionExecutorCache.
I'm not sure if I understand correctly here, but why not just validate that the root and allocation domains are set as intended?
> I'm not sure if I understand correctly here, but why not just validate that the root and allocation domains are set as intended?
I think for this specific API that's a fairly reasonable validation, i.e. reduction axes should be dropped and contiguity flags updated as well. @wujingyue was arguing that clearReductionIterDomains is not a publicly exposed API, so the test isn't meaningful. Does adding the validation suggested by @naoyam make it worthwhile for us to keep the simple test?
I'm OK with the test as is. My only complaint was/is that this test can be fragile if we later want to use a different way to process the output IterDomains for Welford. We might not even have a counterpart API to test against then.
> My only complaint was/is that this test can be fragile if we later want to use a different way to process the output IterDomains for Welford. We might not even have a counterpart API to test against.
Yeah, that part is understood, and I agree that we should have better test coverage for the reduction/normalization schedulers in general.
The issue patched in this PR is just one of the issues discovered in the refactor work in #2168.
I have #2202 tracking allocation domain support in the reduction scheduler, which I think would be a better place for functional end-to-end tests with the scheduler.
I don't think our protocol treats broadcast and reduction differently regarding contiguity. (See lines 3201 to 3219 in 2dbf1d0.)
You're right. I was confused, but I understand now: they're both treated the same, as you say. This code skips reduction axes in the new contiguity because this method is meant to strip out all the reduction domains.
!build
Merging with green CI. The reviewer's concern about the test case is tracked in issue #2202.
TensorView::clearReductionIterDomains constructs the new contiguity vector on the root domain, causing a validation assert when an output's allocation domain is not identical to its root domain. This PR patches the logic and adds a test.