Translate MmaOp patterns properly on Hopper #4072
Conversation
TODO: Squeeze, permute, chain tests
Failing at computing TMA descriptor now
Build errors on clang
```cpp
}

void AmpereMultipleMatmulScheduler::run() {
  // Clears memory spaces on intermediate tensors, calls
```
Changes to Ampere scheduler should not change generated code, but do let us use a common cacheInputsAndOutputs method.
```cpp
for (Val* dv : fusion_->outputs()) {
  auto* d = dv->as<TensorView>();
```
`dc` is ignored anyway. We long ago stopped using `cached_outputs_` in the Hopper scheduler, so I removed it for Ampere too, as it was causing a problem due to the refactor not filling that vector.
```cpp
namespace {

// Check whether tv has all the output_groups in its logical domain, and
```
These utilities are no longer needed since we can now safely translate all matmul patterns the same way on both Hopper and Ampere. The differences are purely in downstream scheduling.
tests/cpp/test_matmul_scheduler.cpp
Outdated
```diff
 // Test that our fake plugin works to override the default heuristic
 TEST_F(MatmulSchedulerPluginTest, BasicMatmul) {
-  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 9, 0);
+  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 10, 0);
```
More tests can likely be unguarded once we update their params or set a fixture to create default sets of parameters.
Fixes the horizontal fusion tests
This can be made to work but currently the config factory generates an invalid config.
!test --diff

!test --diff
```diff
 void HopperMultipleMatmulScheduler::run() {
-  // Clears memory spaces on intermediate tensors, calls
-  // cache{After,Before,Fork} on inputs and outputs
-  cacheInputsAndOutputs();
-
   // Finds matmul patterns and translates them to MmaOps, then finds tensor
   // and dimension roles for all tensors in the fusion
   findPatterns();
   translatePatterns();
-  findRoles();

   // Clears memory spaces on intermediate tensors, calls
   // cache{After,Before,Fork} on inputs and outputs.
-  // Defines acw_smem/bcw_smem and acr/bcr by possibly calling cacheAfter.
-  // This also collects mma_results_
-  defineOperandCaches();
+  cacheInputsAndOutputs(/*skip_intermediates=*/true);
+
+  // We wait until we are done caching tensors to find roles, since this
+  // requires building an IdModel, which would not be updated during the cache
+  // calls.
+  findRoles();

   inspectPrologues();
```
Rearranged to not cache until after translation of patterns. This is helpful because it lets us cache the global tensors that have been "skipped" with producer tensor aliases, instead of the original fusion inputs.
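The ordering matters here. A minimal toy sketch (not nvFuser API; every name below is illustrative) of why caching must follow translation: translation may redirect an operand to an alias tensor, and caching should pick up that alias rather than the original global input.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model only. Translation may replace a fusion input with an alias
// tensor (e.g. a squeeze/permute alias), so caching must run afterward in
// order to cache the alias rather than the original global input.
struct ToyFusion {
  // Maps each operand role to the tensor that consumers currently read from.
  std::map<std::string, std::string> operand_source{{"A", "A"}, {"B", "B"}};
  std::vector<std::string> cached;

  // Translation step: operand A is now read through an alias of input A.
  void translatePatterns() {
    operand_source["A"] = "A_alias";
  }

  // Caching step: caches whatever tensor each operand resolves to right now.
  // If this ran before translatePatterns(), it would cache "A" instead.
  void cacheInputsAndOutputs() {
    for (const auto& [role, src] : operand_source) {
      cached.push_back(src);
    }
  }
};
```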
rdspring1 left a comment:
LGTM
```cpp
  std::vector<TensorView*> acw_smems_, bcw_smems_, acrs_, bcrs_, abs_, bbs_,
      splitk_sums_, smem_epilogues_;

 private:
  // Tensors used for loading operands from smem to registers, and the
```
🎉
```cpp
// This is like the above method, but tv should not have any K dimension
void transformLikeMmaOutputWithoutK(TensorView* tv);
```
🎉
```diff
 int64_t num_device_and_batch_dims_ = 0;

-std::vector<TensorView*> as_, bs_, mma_results_;
+std::vector<TensorView*> as_, bs_, acw_smems_, bcw_smems_, mma_results_,
```
The `as_`, `bs_`, `acw_smems_`, `bcw_smems_` variable names are not very intuitive to me.
Yeah, they are a holdover from the original matmul scheduler code, which is likely influenced by terminology in other projects like cutlass. It might be better to rename `as_` to `a_gmem_` and `acw_smems_` to `a_smem_`. We can do that in another PR.
csrc/scheduler/multi_matmul.cpp
Outdated
```cpp
if (auto it = tensor_roles_.find(MatmulTensorRole::EPILOGUE_INPUT);
    it != tensor_roles_.end()) {
  for (TensorView* tv : it->second) {
    tv->cacheAfter();
```
I'm not sure why we shouldn't track the epilogue input caches in a data member. It would be used in scheduleEpilogue.
!test

!test
#3986 fixes our most common use cases of MatmulOp and LinearOp translation on Hopper. It does so by scheduling global intermediates' mtypes and allocation domains during translation. However, in case there is no translation and we are already given an MmaOp, this fails. The current PR instead does mtype and allocation domain propagation while caching operands, so that we can properly set `as_` and `bs_` and so forth. This means that the input fusions no longer need to differ between Hopper and Ampere, so we can translate both cases in the same way and the only difference will be during scheduling. Note that this will also make it easier to maintain internal tooling which uses things like `canonicalizeInputToBMNK`.

Changes in this PR:

- Remove the `avoid_intermediates` argument to `MatmulPattern::translateToMmaOp` and update all call sites.
- Remove the no-longer-needed utilities from `mma_utils.cpp`.
- Introduce `scheduler_utils::scheduleInputToSkipIntermediates`, which schedules the allocation domains and memory types of consumers of inputs recursively to avoid "metadata ops" at the beginning of a fusion.
- Update `HopperMultipleMatmulScheduler` to remove `defineOperandCaches` and move `cacheInputsAndOutputs` after pattern translation but before `findRoles`. `cacheInputsAndOutputs` now uses `scheduler_utils::scheduleInputToSkipIntermediates` and defines the operand roles as the last gmem tensor returned by that utility.
- Update `AllocationDomainTest.BasicMatmul/*` to allow it to run on Hopper.
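As a rough illustration of the skip-intermediates idea, here is a hedged toy sketch. This is not the real `scheduler_utils::scheduleInputToSkipIntermediates` (which operates on nvFuser TensorViews and also sets memory types and allocation domains); it only models the traversal: walk forward from a fusion input while its consumers are metadata ops, and return the last such tensor, on which the operand role would be defined. All types and names below are illustrative.

```cpp
#include <string>

// Toy sketch only: "metadata op" stands in for ops like squeeze/permute/
// broadcast whose outputs can alias their producer in global memory. A
// single-consumer chain is assumed for simplicity.
struct ToyTensor {
  std::string name;
  bool produced_by_metadata_op = false;
  ToyTensor* consumer = nullptr; // simplification: at most one consumer
};

// Walk forward from a fusion input while consumers are metadata ops; the
// returned tensor is the last gmem tensor in the chain.
ToyTensor* skipIntermediates(ToyTensor* input) {
  ToyTensor* last = input;
  while (last->consumer != nullptr && last->consumer->produced_by_metadata_op) {
    last = last->consumer;
  }
  return last;
}
```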