CuTe DSL: work around SM120 rank-2 TMA cute.copy hang #3189
Closed
alecco wants to merge 3 commits into
Conversation
The SM120 direct TMA load helper forwards composed SMEM layouts to CuTe's TMA atom builder. Document the required contract for that path: when the atom is built with a swizzled SMEM layout, the destination SMEM pointer passed to sm120_tma_load_2d must carry the same swizzle. Extend the SM120 direct TMA smoke to exercise this contract with SW128 swizzled shared memory. The smoke constructs the descriptor with a composed SW128 layout, allocates the destination tensor with the matching swizzled pointer, issues the direct rank-2 TMA load, and reads back through the same swizzled CuTe tensor view. The default smoke keeps swizzled coverage to 64x64 FP16/BF16 tiles so the diagnostic remains well below RTX 50 / SM120's 99 KiB shared-memory limit and avoids the larger swizzled shapes that still need separate backend investigation.
alecco pushed a commit to alecco/quack that referenced this pull request on Apr 28, 2026:
Add narrow QuACK wrappers around the SM120 rank-2 direct TMA workaround introduced in NVIDIA/cutlass#3189 ("CuTe DSL: work around SM120 rank-2 TMA cute.copy hang"). The local CuTe DSL workaround keeps CuTe descriptor construction, but bypasses the currently problematic SM120 rank-2 cute.copy issue path by exposing a direct CTA-local TMA load helper.

This commit adds QuACK-side helpers for that path:
- feature checks for the required CuTe DSL cpasync helpers
- explicit row-major [seq, d] -> TMA-basis (d, seq) tensor construction
- direct rank-2 TMA atom construction
- descriptor address access
- CTA-local direct TMA load issue

This is intentionally not wired into GemmSm120 yet. GemmSm120 is under active development, and this keeps the new functionality opt-in and low-conflict. The helper is meant as reusable scaffolding for future SM120 FlashAttention-style K/V load experiments where direct TMA can stage large dense tiles while other warps do useful work.

Add copy-focused validation plus a benchmark for comparing direct TMA against two non-TMA baselines: a simple producer-warp cp.async path and a cooperative blocking copy path. The benchmark has an FA-like overlap scenario with two consumer models:
- mma: default synthetic BF16/FP16 Tensor Core work using SM120 warp-level MmaF16BF16Op, so staged K/V-like tiles feed ldmatrix and cute.gemm work
- scalar: diagnostic shared-memory read and FP32 accumulation work for isolating staging overhead

The benchmark is not intended to represent full GEMM or full FlashAttention performance. It is a focused tool for checking whether the direct TMA workaround is usable and for studying where TMA becomes worthwhile: generally larger tile transfers, enough independent consumer work to hide pipeline wait, and enough care around shared-memory footprint. The docs include Nsight Compute commands and note that workstation timing noise should be interpreted with sweeps and counters rather than single rows. Current GEMM behavior is unchanged.
alecco pushed a commit to alecco/quack that referenced this pull request on Apr 29, 2026.
Author
Closing PR. Sorry for the noise, and thanks for the patience.
Author: Alecco (& Codex) for Ologan
Summary
On SM120/SM120a, CuTe DSL rank-2 G2S descriptor TMA through the normal executable cute.copy path can compile and launch, but then hang while waiting on the TMA pipeline mbarrier.

A minimal failing shape is a CTA-local rank-2 TMA load over a direct-basis (d, seq) FP32 tensor. The kernel reaches launch, but torch.cuda.synchronize() does not return. A repro is included at the bottom of this PR.
This PR does not fix the underlying _cute_nvgpu executable TMA lowering. Instead, it adds a narrow SM120 direct TMA issue path that uses the same CuTe-generated descriptor but bypasses the failing cute.copy lowering. It also documents and validates the contract for using that path with swizzled shared-memory destinations.
Problem
The reduced diagnosis was:
Direct-basis CuTe descriptor + raw SM120 TMA issue + PipelineTmaAsync: pass
Driver API descriptor + raw SM120 TMA issue + PipelineTmaAsync: pass
Direct-basis CuTe descriptor + normal cute.copy / PipelineTmaAsync: timeout
Driver API descriptor + normal cute.copy / PipelineTmaAsync: timeout
FA-style canonicalize/logicalize descriptor + raw issue: coord1 ignored
This points to two separate issues:
The executable cute.copy lowering is not currently safe for this SM120 path.
Canonicalized (seq, d) tensors can construct a descriptor path where the second coordinate is not represented correctly.
The useful working shape is direct TMA basis from the start:
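As an illustration of that direct-basis rule, here is a minimal pure-Python stride sketch (not CuTe DSL code) of how a dense row-major [seq, d] buffer is reinterpreted as a (d, seq) view whose mode 0 is physically contiguous:

```python
def direct_basis_view(seq, d):
    """Reinterpret a dense row-major [seq, d] buffer as a (d, seq) view.

    Row-major [seq, d] has element strides (d, 1). The (d, seq)
    TMA-basis view of the same memory swaps the modes, giving strides
    (1, d): mode 0 (the d axis) is unit-stride, which is the shape the
    direct issue path needs from the start.
    """
    basis_shape = (d, seq)    # same memory, modes swapped
    basis_strides = (1, d)    # mode 0 physically contiguous
    return basis_shape, basis_strides

def linear_offset(idx, strides):
    """Linear element offset of a coordinate under given strides."""
    return sum(i * s for i, s in zip(idx, strides))
```

The two views address the same element: buf[s, k] under row-major strides (d, 1) and view[k, s] under the (d, seq) basis strides land on the same linear offset, so no data movement is implied, only a different descriptor shape.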
For swizzled shared memory, the same direct-basis rule applies. The TMA atom must be built with the intended composed SMEM layout, and the destination pointer passed to the direct issue helper must carry the matching swizzle.
Workaround Added
This PR adds an explicit SM120/SM120a rank-2 TMA issue helper under:
The intended path is:
This keeps descriptor construction in CuTe DSL, but avoids the failing executable cute.copy TMA issue path.

For swizzled SMEM, callers should use a matching layout and pointer contract. Conceptually:
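To make the "same swizzle on both sides" contract concrete, here is a pure-Python sketch of a 128-byte XOR swizzle in the shape commonly written as CuTe's Swizzle&lt;3,4,3&gt;; the parameters are illustrative, not taken from this PR's layouts:

```python
def swizzle_128b(byte_offset):
    """Illustrative 128B XOR swizzle (CuTe Swizzle<B=3, M=4, S=3> pattern).

    Takes bits [9:7] of the byte offset and XORs them into bits [6:4].
    The same mapping must be baked into the descriptor's composed SMEM
    layout AND applied to the destination SMEM pointer/view, or the TMA
    engine and the readback view will disagree on addresses.
    """
    mask = (byte_offset >> 7) & 0b111   # B=3 bits taken from position M+S=7
    return byte_offset ^ (mask << 4)    # XOR into bits [M, M+B) = [4, 7)
```

Because the XORed bits are drawn from higher, untouched bits, the mapping is an involution: applying it twice returns the original offset, which is why reading back through the same swizzled view recovers the loaded tile.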
The key requirement is that the descriptor’s SMEM layout and the destination SMEM pointer agree about the swizzle. The smoke test reads back through the same swizzled CuTe tensor view.
New API
cpasync.get_tma_desc_addr(tma_atom)
Returns the tiled TMA descriptor address associated with a TMA copy atom.
cpasync.sm120_tma_load_2d(...)
Issues a CTA-local SM120 rank-2 TMA load. By default it performs warp election internally, so callers do not accidentally issue multiple TMA transactions for one mbarrier phase.
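Why the internal election matters can be modeled in plain Python: without election every active lane would issue its own transaction and over-count the mbarrier's expected bytes, while with election exactly one lane issues (the lowest-active-lane rule below is a simplification of the hardware elect-one semantics):

```python
def issued_transactions(active_lanes, elect_one=True):
    """Count TMA issues for one mbarrier phase in a 32-lane warp model.

    With election, only the single elected lane's predicate is true, so
    exactly one transaction is issued; without it, every active lane
    issues a copy and the mbarrier transaction count would be wrong.
    """
    if not active_lanes:
        return 0
    if elect_one:
        leader = min(active_lanes)  # simplified elect-one: lowest active lane
        return sum(1 for lane in active_lanes if lane == leader)
    return len(active_lanes)
```

This is only a counting model of the contract, not the helper's implementation.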
Builds a narrow direct-basis rank-2 SM120 TMA load atom and returns:
This helper deliberately does not canonicalize or logicalize modes. Callers must pass a direct-basis tensor where mode 0 is physically contiguous.
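The "mode 0 physically contiguous" contract can be stated as a simple stride check; this validator is illustrative and not part of the PR's API:

```python
def is_direct_basis(shape, strides):
    """True if a rank-2 (d, seq) tensor meets the helper's contract.

    Mode 0 must be unit-stride (physically contiguous), and mode 1 must
    step over at least a full row of mode 0, i.e. the tensor is the
    (d, seq) view of a dense row-major [seq, d] buffer (possibly with
    row padding).
    """
    d, _seq = shape
    return strides[0] == 1 and strides[1] >= d
```

A (seq, d) row-major tensor passed as-is fails this check, which is exactly the case the helper refuses to canonicalize on the caller's behalf.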
Validation
Added a standalone SM120 smoke script:
It validates:
CuTe-generated direct-basis rank-2 descriptors
cpasync.get_tma_desc_addr(...)
cpasync.sm120_tma_load_2d(...)
PipelineTmaAsync
FP32 and BF16 coordinate sweeps with nonzero {d, seq} coordinates
.tile + .L2::cache_hint instruction spelling
FA-like K/V direct-basis loads for FP16/BF16 with D = 64, 96, 128 and seq_tile = 64, 128
SW128 swizzled SMEM load/readback for FP16/BF16 64x64 tiles
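For orientation, the shared-memory bytes staged per K/V tile in those sweeps (2-byte FP16/BF16 elements) are straightforward arithmetic:

```python
def kv_tile_bytes(d, seq_tile, elem_bytes=2):
    """Shared-memory footprint of one dense [seq_tile, d] FP16/BF16 tile."""
    return d * seq_tile * elem_bytes

# Sweep matching the smoke: D in {64, 96, 128}, seq_tile in {64, 128}
sizes = {(d, s): kv_tile_bytes(d, s) for d in (64, 96, 128) for s in (64, 128)}
```

Even the largest swept tile (D=128, seq_tile=128) is 32 KiB, so a single staged tile sits comfortably inside SM120's shared memory.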
Local result:
SM120 direct TMA smoke passed
Also checked syntax with:
The swizzled coverage is intentionally conservative. The local SM120 setup has a 99 KiB shared-memory limit, so the committed smoke keeps individual tested tiles below a 64 KiB budget:
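The arithmetic behind that choice: a 64x64 FP16/BF16 tile is 8 KiB, so even multi-stage pipelines stay far below both limits (the stage count below is illustrative, not a value from this PR):

```python
SMEM_LIMIT = 99 * 1024    # SM120 / RTX 50 shared-memory limit (bytes)
TILE_BUDGET = 64 * 1024   # per-tile budget used by the committed smoke

def tile_fits(d, seq_tile, elem_bytes=2, stages=2):
    """Check a staged tile against the per-tile budget and the SMEM limit."""
    tile = d * seq_tile * elem_bytes
    return tile <= TILE_BUDGET and stages * tile <= SMEM_LIMIT
```

A 64x64 FP16 tile passes easily (8 KiB per stage), while shapes whose staged footprint crosses 99 KiB do not, which is the kind of larger swizzled shape left to separate validation.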
Larger swizzled shapes are not claimed by this PR and should be validated separately.
Scope
This is intentionally narrow:
supported: SM120/SM120a CTA-local rank-2 G2S TMA
supported: direct-basis descriptors where mode 0 is physically contiguous
supported: optional swizzled SMEM when descriptor layout and destination pointer swizzle match
not added: multicast
not added: CTA group 2
not added: arbitrary ND TMA
not added: automatic (seq, d) -> (d, seq) canonicalize/logicalize routing
not changed: generic cute.copy TMA lowering
The underlying _cute_nvgpu executable TMA lowering should still be fixed separately. This PR gives CuTe DSL users a tested SM120 rank-2 path in the meantime, including the SMEM swizzle contract needed for optimized shared-memory staging.
Agent disclosure
This work was created with the OpenAI Codex CLI agent, but it was directed and supervised by a human, with every line reviewed.
Issue repro