Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TensorView::clearReductionIterDomains specifies wrong contiguity flag #2200

Merged
merged 13 commits into from
May 7, 2024

Conversation

jjsjann123
Copy link
Collaborator

@jjsjann123 jjsjann123 commented May 4, 2024

TensorView::clearReductionIterDomains constructs new contiguity vector on root domain, causing validation assert when output's allocation domain is not identical to its root domain.

This PR patches the logic and adds a test.

@jjsjann123
Copy link
Collaborator Author

!build

@jjsjann123
Copy link
Collaborator Author

!build

@jjsjann123
Copy link
Collaborator Author

!build

@jjsjann123 jjsjann123 added the allocation domain issues related to allocation domain support label May 4, 2024
@jjsjann123 jjsjann123 changed the title Clear reduction iter domains patch Fix TensorView::clearReductionIterDomains specifies wrong contiguity flag May 6, 2024
@jjsjann123
Copy link
Collaborator Author

!build

@jjsjann123 jjsjann123 marked this pull request as ready for review May 6, 2024 16:29
@jjsjann123
Copy link
Collaborator Author

patching issue exposed in #2168

Copy link
Collaborator

@jacobhinkle jacobhinkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but I want to make sure I understand. Allocation domain can contain any type of IterDomain including IterType::Reduction and IterType::Broadcast. Contiguity refers specifically to TensorDomain::noReductions(tv->getAllocationDomain()). Is there a reason we use std::nullopt for broadcast contiguity whereas we don't represent reduction axes at all?

tv1->setAllocationDomain(
{tv1->axis(0), tv1->axis(2), tv1->axis(1)},
{std::nullopt, true, std::nullopt});
tv1->clearReductionIterDomains();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to come up with a better test? I assume the user of nvFuser API (Python or csrc/ops) never calls clearReductionIterDomains. So the test as is seems to be testing an internal implementation that's subject to change, not an API behavior.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to come up with a better test?

I like having a tight-fitting test to go with a patch PR.

TensorView::clearReductionIterDomain is a public API. So I think it's fair to have this as a test that there's no assert thing. But I agree that it's not that interesting a thing to put here. But as Naoya suggested below, if I put validation here, then it's a bad test on asserting internal implementation.

So I agree with @wujingyue here, I'll dig up the other PR for an end-2-end test that goes through FusionExecutorCache.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand here correctly, but why not just validating the root and allocation domains are set as intended?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand here correctly, but why not just validating the root and allocation domains are set as intended?

I think for this specific API, it's a fairly reasonable validation. i.e. Reduction should be dropped and contiguity flags updated as well. @wujingyue was arguing that clearReductionIterDomains is not a publicly exposed API. So the test isn't meaningful. Does adding the validation suggested by @naoyam make more sense for us to keep the simple test?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the test as is. My only complaint was/is that this test can be fragile if we later want to use a different way to process the output IterDomains for Welford. We might not even have a counterpart API to test against.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only complaint was/is that this test can be fragile if we later want to use a different way to process the output IterDomains for Welford. We might not even have a counterpart API to test against.

Yeah that part is understood and I agree that we should have better test coverage for reduction/normalization scheduler in general.

The issue patched in this PR is just one of the issues discovered in the refactor work #2168.
I have #2202 to track patching allocation domain support in reduction scheduler, which I think would be a better place to put functional end-2-end tests with the scheduler.

csrc/tensor_view.cpp Outdated Show resolved Hide resolved
csrc/tensor_view.cpp Outdated Show resolved Hide resolved
csrc/tensor_view.cpp Outdated Show resolved Hide resolved
csrc/tensor_view.cpp Outdated Show resolved Hide resolved
@jjsjann123
Copy link
Collaborator Author

Contiguity refers specifically to TensorDomain::noReductions(tv->getAllocationDomain()). Is there a reason we use std::nullopt for broadcast contiguity whereas we don't represent reduction axes at all?

I don't think our protocol trates broadcast and reduction differently regarding contiguity.

Fuser/csrc/ir/nodes.cpp

Lines 3201 to 3219 in 2dbf1d0

void validateContiguity(
const std::vector<IterDomain*>& allocation_domain,
const std::vector<std::optional<bool>>& contiguity) {
NVF_CHECK(
contiguity.size() == allocation_domain.size(),
"Invalid contiguity information provided, incorrect size. Received vector of size ",
contiguity.size(),
" but needed one of size ",
allocation_domain.size());
for (auto i : c10::irange(contiguity.size())) {
bool expect_null =
(allocation_domain.at(i)->isBroadcast() ||
allocation_domain.at(i)->isReduction());
NVF_CHECK(
expect_null != contiguity.at(i).has_value(),
"The contiguity of a broadcast/reduction dimension must be None. "
"The contiguity of a non-broadcast/reduction dimension must be true/false");
}
}

@jacobhinkle
Copy link
Collaborator

I don't think our protocol trates broadcast and reduction differently regarding contiguity.

You're right. I was confused but I understand now. They're both treated the same as you say. This code is skipping reduction axes in the new contiguity because this method is meant to strip out all the reduction domains.

@jjsjann123
Copy link
Collaborator Author

!build

@jjsjann123
Copy link
Collaborator Author

merging with green CI. Reviewer's concern on test case was tracked in issue #2202

@jjsjann123 jjsjann123 merged commit 729f36c into main May 7, 2024
35 checks passed
@jjsjann123 jjsjann123 deleted the clearReductionIterDomains_patch branch May 7, 2024 04:56
@jjsjann123 jjsjann123 restored the clearReductionIterDomains_patch branch May 7, 2024 05:02
@jjsjann123 jjsjann123 deleted the clearReductionIterDomains_patch branch May 7, 2024 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
allocation domain issues related to allocation domain support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants