Fix TensorView::clearReductionIterDomains specifies wrong contiguity flag #2200

jjsjann123 · 2024-05-04T07:00:08Z

TensorView::clearReductionIterDomains constructs the new contiguity vector from the root domain, which triggers a validation assert when the output's allocation domain is not identical to its root domain.

This PR patches the logic and adds a test.
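A minimal sketch of the intended behavior, assuming mock types rather than nvFuser's real classes: `MockIterDomain`, `Cleared`, and `clearReductions` are hypothetical names used only for illustration. The point is that contiguity flags are indexed by position in the allocation domain, so the filtered contiguity has to be built while walking the allocation domain, not the root domain.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for nvFuser's IterType / IterDomain.
enum class IterType { Iteration, Broadcast, Reduction };

struct MockIterDomain {
  int id;        // position of this axis in the (mock) root domain
  IterType type; // kind of axis
};

using Contiguity = std::vector<std::optional<bool>>;

struct Cleared {
  std::vector<MockIterDomain> allocation;
  Contiguity contiguity;
};

// Drop reduction axes from the allocation domain and drop the contiguity
// entry at the same position, so the two vectors stay aligned.
Cleared clearReductions(
    const std::vector<MockIterDomain>& allocation,
    const Contiguity& contiguity) {
  assert(allocation.size() == contiguity.size());
  Cleared out;
  for (std::size_t i = 0; i < allocation.size(); ++i) {
    if (allocation[i].type == IterType::Reduction) {
      continue; // reduction axes are not represented after clearing
    }
    out.allocation.push_back(allocation[i]);
    out.contiguity.push_back(contiguity[i]);
  }
  return out;
}
```

With an allocation order of (axis 0 broadcast, axis 2 iteration, axis 1 reduction) and contiguity `{nullopt, true, nullopt}`, this sketch keeps axes 0 and 2 with contiguity `{nullopt, true}`.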

jjsjann123 · 2024-05-04T07:00:18Z

!build

jjsjann123 · 2024-05-04T07:32:24Z

!build

jjsjann123 · 2024-05-04T08:03:22Z

!build

jjsjann123 · 2024-05-06T16:28:46Z

!build

jjsjann123 · 2024-05-06T16:29:35Z

patching issue exposed in #2168

jacobhinkle

Looks good to me, but I want to make sure I understand. Allocation domain can contain any type of IterDomain including IterType::Reduction and IterType::Broadcast. Contiguity refers specifically to TensorDomain::noReductions(tv->getAllocationDomain()). Is there a reason we use std::nullopt for broadcast contiguity whereas we don't represent reduction axes at all?

wujingyue · 2024-05-06T17:34:30Z

tests/cpp/test_allocation_domain.cpp

+  tv1->setAllocationDomain(
+      {tv1->axis(0), tv1->axis(2), tv1->axis(1)},
+      {std::nullopt, true, std::nullopt});
+  tv1->clearReductionIterDomains();


Is it possible to come up with a better test? I assume the user of nvFuser API (Python or csrc/ops) never calls clearReductionIterDomains. So the test as is seems to be testing an internal implementation that's subject to change, not an API behavior.

> Is it possible to come up with a better test?

I like having a tight-fitting test to go with a patch PR.

TensorView::clearReductionIterDomains is a public API, so I think it's fair to have a test that checks it doesn't trigger an assert. But I agree that it's not that interesting a thing to put here. And as Naoya suggested below, if I add validation here, then it becomes a bad test that asserts on internal implementation.

So I agree with @wujingyue here, I'll dig up the other PR for an end-2-end test that goes through FusionExecutorCache.

I'm not sure if I understand this correctly, but why not just validate that the root and allocation domains are set as intended?

> I'm not sure if I understand this correctly, but why not just validate that the root and allocation domains are set as intended?

I think for this specific API, it's a fairly reasonable validation, i.e. reductions should be dropped and contiguity flags updated as well. @wujingyue was arguing that clearReductionIterDomains is not a publicly exposed API, so the test isn't meaningful. Would adding the validation suggested by @naoyam make it more reasonable for us to keep the simple test?

I'm OK with the test as is. My only complaint was/is that this test can be fragile if we later want to use a different way to process the output IterDomains for Welford. We might not even have a counterpart API to test against.

> My only complaint was/is that this test can be fragile if we later want to use a different way to process the output IterDomains for Welford. We might not even have a counterpart API to test against.

Yeah that part is understood and I agree that we should have better test coverage for reduction/normalization scheduler in general.

The issue patched in this PR is just one of the issues discovered in the refactor work #2168.
I have #2202 to track patching allocation domain support in reduction scheduler, which I think would be a better place to put functional end-2-end tests with the scheduler.

jjsjann123 · 2024-05-06T18:32:44Z

> Contiguity refers specifically to TensorDomain::noReductions(tv->getAllocationDomain()). Is there a reason we use std::nullopt for broadcast contiguity whereas we don't represent reduction axes at all?

I don't think our protocol treats broadcast and reduction differently regarding contiguity.

Fuser/csrc/ir/nodes.cpp

Lines 3201 to 3219 in 2dbf1d0

    
    void validateContiguity(
        const std::vector<IterDomain*>& allocation_domain,
        const std::vector<std::optional<bool>>& contiguity) {
      NVF_CHECK(
          contiguity.size() == allocation_domain.size(),
          "Invalid contiguity information provided, incorrect size. Received vector of size ",
          contiguity.size(),
          " but needed one of size ",
          allocation_domain.size());
      for (auto i : c10::irange(contiguity.size())) {
        bool expect_null =
            (allocation_domain.at(i)->isBroadcast() ||
             allocation_domain.at(i)->isReduction());
        NVF_CHECK(
            expect_null != contiguity.at(i).has_value(),
            "The contiguity of a broadcast/reduction dimension must be None. "
            "The contiguity of a non-broadcast/reduction dimension must be true/false");
      }
    }
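To make the rule above concrete, here is a standalone sketch of the same check, assuming mock types in place of nvFuser's `IterDomain` and a `bool` return in place of `NVF_CHECK`. `MockIterDomain` and `contiguityIsValid` are hypothetical names. It shows why flags ordered by a different domain can fail: the entry at each position must be null exactly when the allocation-domain axis at that position is broadcast or reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for nvFuser's IterType / IterDomain.
enum class IterType { Iteration, Broadcast, Reduction };

struct MockIterDomain {
  IterType type;
};

using Contiguity = std::vector<std::optional<bool>>;

// Mirrors the two conditions in validateContiguity, returning false
// where the real code would raise via NVF_CHECK.
bool contiguityIsValid(
    const std::vector<MockIterDomain>& allocation_domain,
    const Contiguity& contiguity) {
  if (contiguity.size() != allocation_domain.size()) {
    return false; // one entry is required per allocation-domain axis
  }
  for (std::size_t i = 0; i < contiguity.size(); ++i) {
    const bool expect_null =
        allocation_domain[i].type == IterType::Broadcast ||
        allocation_domain[i].type == IterType::Reduction;
    // Broadcast/reduction axes must carry nullopt; all others true/false.
    if (expect_null == contiguity[i].has_value()) {
      return false;
    }
  }
  return true;
}
```

For an allocation domain ordered (broadcast, iteration), `{nullopt, true}` passes this check, while `{true, nullopt}`, which is what the flags look like if they follow a root order of (iteration, broadcast), does not.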

jacobhinkle · 2024-05-06T19:03:25Z

> I don't think our protocol treats broadcast and reduction differently regarding contiguity.

You're right. I was confused but I understand now. They're both treated the same as you say. This code is skipping reduction axes in the new contiguity because this method is meant to strip out all the reduction domains.

jjsjann123 · 2024-05-06T22:26:19Z

!build

jjsjann123 · 2024-05-07T04:56:40Z

merging with green CI. Reviewer's concern on test case was tracked in issue #2202

jjsjann123 added 3 commits May 3, 2024 23:50

fixing nvfuser::TensorView::clearReductionIterDomains

646c2b8

fixing part 2

69c0c67

clangformat and tests

d7c8a5e

typo

010aac0

jjsjann123 added 4 commits May 4, 2024 00:48

fixing test

0b94a2e

trying to fix test again

e25b459

fixing test for real this time

0ba4d16

clangformat

3f0c191

jjsjann123 added the "allocation domain" label (issues related to allocation domain support) May 4, 2024

jjsjann123 mentioned this pull request May 5, 2024

Allocation order refactor #2168

Merged

jjsjann123 added 2 commits May 6, 2024 09:19

quick refactor / clean up

239bf10

Merge remote-tracking branch 'origin/main' into HEAD

28f3959

jjsjann123 changed the title ~~Clear reduction iter domains patch~~ Fix TensorView::clearReductionIterDomains specifies wrong contiguity flag May 6, 2024

quick_fix

fbc1823

jjsjann123 requested review from naoyam, zasdfgbnm, jacobhinkle and wujingyue May 6, 2024 16:29

jjsjann123 marked this pull request as ready for review May 6, 2024 16:29

jacobhinkle reviewed May 6, 2024

wujingyue reviewed May 6, 2024

naoyam reviewed May 6, 2024

naoyam reviewed May 6, 2024

review comments

488223f

wujingyue reviewed May 6, 2024

updating minimal repro test

12ac2a9

jjsjann123 requested review from wujingyue and naoyam May 6, 2024 22:02

wujingyue approved these changes May 6, 2024

jjsjann123 mentioned this pull request May 7, 2024

Reduction scheduler does not handle allocation domain properly and trigger assert when reduction output has specified allocation domain #2202

Open

jjsjann123 merged commit 729f36c into main May 7, 2024
35 checks passed

jjsjann123 deleted the clearReductionIterDomains_patch branch May 7, 2024 04:56

jjsjann123 restored the clearReductionIterDomains_patch branch May 7, 2024 05:02

jjsjann123 deleted the clearReductionIterDomains_patch branch May 7, 2024 05:02
