
Conversation


@jacobhinkle jacobhinkle commented Mar 13, 2025

#3986 fixed our most common use cases of MatmulOp and LinearOp translation on Hopper by scheduling the memory types and allocation domains of global intermediates during translation. However, when there is no translation because we are already given an MmaOp, that approach fails. This PR instead propagates memory types and allocation domains while caching operands, so that we can properly set as_ and bs_ and so forth. As a result, input fusions no longer need to differ between Hopper and Ampere: we can translate both cases the same way, and the only difference is in scheduling.

Note that this will also make it easier to maintain internal tooling that uses utilities like canonicalizeInputToBMNK.
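
For context, the failing case looks roughly like the sketch below: a fusion that already contains an MmaOp, with metadata ops (here, broadcasts) between the global inputs and the MmaOp. The helpers used (makeContigTensor, broadcast, fusedMultiplySum) follow nvFuser's test conventions; this is an illustration, not code from this PR.

```cpp
#include <fusion.h>
#include <ops/all_ops.h>

using namespace nvfuser;

// A fusion handed an MmaOp directly: there is no MatmulOp/LinearOp to
// translate, so the translation-time scheduling from #3986 never runs.
void buildPreTranslatedMatmul(Fusion* fusion) {
  FusionGuard fg(fusion);

  TensorView* a = makeContigTensor(2, DataType::Half); // [M, K]
  TensorView* b = makeContigTensor(2, DataType::Half); // [N, K]
  fusion->addInput(a);
  fusion->addInput(b);

  // "Metadata ops": broadcast to the canonical [M, N, K] layout. These are
  // the intermediates whose memory types and allocation domains must be
  // propagated while caching operands.
  TensorView* a_b = broadcast(a, {false, true, false}); // [M, 1, K]
  TensorView* b_b = broadcast(b, {true, false, false}); // [1, N, K]

  // fusedMultiplySum creates an MmaOp directly, reducing over K.
  TensorView* mma = fusedMultiplySum(a_b, b_b, {2});
  fusion->addOutput(mma);
}
```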

Changes in this PR:

  • Remove avoid_intermediates argument to MatmulPattern::translateToMmaOp and update all call sites.
  • Remove some helper utilities in mma_utils.cpp
  • Introduce scheduler_utils::scheduleInputToSkipIntermediates, which recursively schedules the allocation domains and memory types of consumers of inputs in order to avoid "metadata ops" at the beginning of a fusion (see the sketch after this list).
  • Rearrange HopperMultipleMatmulScheduler to remove defineOperandCaches and move cacheInputsAndOutputs after pattern translation but before findRoles. Also cacheInputsAndOutputs now uses scheduler_utils::scheduleInputToSkipIntermediates and defines the operand roles as the last gmem tensor returned by that utility.
  • Unguard AllocationDomainTest.BasicMatmul/* to allow it to run on Hopper
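
Roughly, the idea behind scheduleInputToSkipIntermediates is sketched below. This is a hypothetical rendering, not the PR's actual implementation: isMetadataOp and the allocation-domain handling stand in for the real logic, and the utility's true signature may differ. Starting from a fusion input, we walk single-use chains of metadata ops, keep each intermediate in global memory so it can alias its producer, and return the last global tensor, which becomes the operand.

```cpp
// Hypothetical sketch of scheduler_utils::scheduleInputToSkipIntermediates.
// isMetadataOp is a stand-in predicate for ops like broadcast/squeeze/permute.
TensorView* skipIntermediatesSketch(TensorView* input) {
  TensorView* tv = input;
  while (true) {
    const auto& uses = tv->uses();
    // Stop at a fork, or at a real compute consumer such as the MmaOp.
    if (uses.size() != 1 || !isMetadataOp(uses.front())) {
      return tv; // the last gmem tensor; used as the operand role
    }
    TensorView* consumer = ir_utils::getTvOutput(uses.front());
    // Keep the intermediate in global memory so it aliases its producer
    // instead of materializing a copy at the start of the kernel.
    consumer->setMemoryType(MemoryType::Global);
    // Give it an allocation domain consistent with the producer's layout
    // (the real mapping is computed in the PR; simplified here).
    consumer->setAllocationDomain(
        consumer->getLogicalDomain(), /*new_contiguity=*/true);
    tv = consumer;
  }
}
```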

TODO: squeeze, permute, and chain tests
Currently failing when computing the TMA descriptor.
}

void AmpereMultipleMatmulScheduler::run() {
// Clears memory spaces on intermediate tensors, calls
@jacobhinkle (Collaborator, Author):

Changes to the Ampere scheduler should not change generated code, but they do let us use a common cacheInputsAndOutputs method.

Comment on lines +1098 to +1099
for (Val* dv : fusion_->outputs()) {
auto* d = dv->as<TensorView>();
@jacobhinkle (Collaborator, Author):

dc is ignored anyway. We stopped using cached_outputs_ in the Hopper scheduler long ago, so I removed it for Ampere too, since the refactor no longer fills that vector and leaving it around was causing a problem.
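
For illustration, the resulting output-caching loop is roughly the following (a hedged sketch; the surrounding member setup may differ):

```cpp
// cacheBefore() is called for its side effect; the returned cache tensor
// (formerly bound to dc and pushed into cached_outputs_) is now discarded.
for (Val* dv : fusion_->outputs()) {
  auto* d = dv->as<TensorView>();
  d->cacheBefore();
}
```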


namespace {

// Check whether tv has all the output_groups in its logical domain, and
@jacobhinkle (Collaborator, Author):

These utilities are no longer needed since we can now safely translate all matmul patterns the same way on both Hopper and Ampere. The differences are purely in downstream scheduling.

// Test that our fake plugin works to override the default heuristic
TEST_F(MatmulSchedulerPluginTest, BasicMatmul) {
-  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 9, 0);
+  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 10, 0);
@jacobhinkle (Collaborator, Author):

More tests can likely be unguarded once we update their params or set a fixture to create default sets of parameters.

@jacobhinkle jacobhinkle changed the title from "[WIP] Translate MmaOp patterns properly on Hopper" to "Translate MmaOp patterns properly on Hopper" on Mar 19, 2025
Fixes the horizontal fusion tests
This can be made to work but currently the config factory generates an invalid config.
@jacobhinkle

!test --diff

@jacobhinkle

!test --diff

@jacobhinkle jacobhinkle marked this pull request as ready for review March 25, 2025 15:42
Comment on lines 129 to 145
void HopperMultipleMatmulScheduler::run() {
// Clears memory spaces on intermediate tensors, calls
// cache{After,Before,Fork} on inputs and outputs
cacheInputsAndOutputs();

// Finds matmul patterns and translates them to MmaOps, then finds tensor
// and dimension roles for all tensors in the fusion
findPatterns();
translatePatterns();
findRoles();

// Clears memory spaces on intermediate tensors, calls
// cache{After,Before,Fork} on inputs and outputs.
// Defines acw_smem/bcw_smem and acr/bcr by possibly calling cacheAfter.
// This also collects mma_results_
defineOperandCaches();
cacheInputsAndOutputs(/*skip_intermediates=*/true);

// We wait until we are done caching tensors to find roles, since this
// requires building an IdModel, which would not be updated during the cache
// calls.
findRoles();

inspectPrologues();
@jacobhinkle (Collaborator, Author):

Rearranged so that we don't cache until after pattern translation. This is helpful because it lets us cache the global tensors that have been "skipped" via producer tensor aliases, instead of the original fusion inputs.
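
Concretely, operand caching after translation plausibly looks like the following sketch. The utility's exact signature and the member bookkeeping are assumptions based on the description above, not code lifted from the PR:

```cpp
// Collapse the metadata-op chain from the input to its last global tensor,
// record that tensor as the operand role, then cache it into shared memory.
TensorView* a_gmem = scheduler_utils::scheduleInputToSkipIntermediates(a_input);
as_.push_back(a_gmem); // operand role: the last gmem tensor
TensorView* a_smem = a_gmem->cacheAfter();
a_smem->setMemoryType(MemoryType::Shared);
acw_smems_.push_back(a_smem);
```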

@jacobhinkle jacobhinkle requested a review from rdspring1 March 25, 2025 15:55
@rdspring1 rdspring1 (Collaborator) left a comment:

LGTM

std::vector<TensorView*> acw_smems_, bcw_smems_, acrs_, bcrs_, abs_, bbs_,
splitk_sums_, smem_epilogues_;
private:
// Tensors used for loading operands from smem to registers, and the
Collaborator:

🎉


// This is like the above method, but tv should not have any K dimension
void transformLikeMmaOutputWithoutK(TensorView* tv);

Collaborator:

🎉

int64_t num_device_and_batch_dims_ = 0;

-  std::vector<TensorView*> as_, bs_, mma_results_;
+  std::vector<TensorView*> as_, bs_, acw_smems_, bcw_smems_, mma_results_,
Collaborator:

as_, bs_, acw_smems_, bcw_smems_ variable names are not very intuitive to me.

@jacobhinkle (Collaborator, Author):

Yeah, they are a holdover from the original matmul scheduler code, which was likely influenced by terminology in other projects like CUTLASS. It might be better to rename as_ to a_gmem_ and acw_smems_ to a_smem_. We can do that in another PR.

if (auto it = tensor_roles_.find(MatmulTensorRole::EPILOGUE_INPUT);
it != tensor_roles_.end()) {
for (TensorView* tv : it->second) {
tv->cacheAfter();
Collaborator:

I'm not sure why we shouldn't track the epilogue input caches in a data member. It would be used in scheduleEpilogue.
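
If we did want to track them, the change would presumably be small; a sketch, where cached_epilogue_inputs_ is a hypothetical member name, not one that exists in this PR:

```cpp
if (auto it = tensor_roles_.find(MatmulTensorRole::EPILOGUE_INPUT);
    it != tensor_roles_.end()) {
  for (TensorView* tv : it->second) {
    // Keep the returned cache so scheduleEpilogue could reuse it later.
    cached_epilogue_inputs_.push_back(tv->cacheAfter());
  }
}
```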

@jacobhinkle

!test

@jacobhinkle

!test

@jacobhinkle jacobhinkle merged commit 40d3cd1 into main Apr 3, 2025
53 checks passed
@jacobhinkle jacobhinkle deleted the jh/translate_mmaop_patterns branch April 3, 2025 11:21