
[BUG] [CuTeDSL] SM120 FP4 TMA descriptor lowering generates unusable descriptors #3236

@alecco

Description


Which component has the problem?

CuTe DSL

Bug Report

Describe the bug

CuTeDSL TMA descriptor lowering for SM120 FP4 generates a descriptor that does not match the one produced by the CUDA Driver API for the same tensor-map parameters. This is not only a byte-level mismatch: kernels that consume the CuTeDSL-generated descriptor fail at runtime.

For the same 128 x 128 FP4 tensor-map shape, the CuTeDSL-lowered descriptor differs from a host-side cuTensorMapEncodeTiled descriptor using
CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B.

In local checks with a 4.5.0.dev1 editable checkout and matching 4.5.0 runtime wheels, the descriptor differs at offsets 8, 9, 17, and 73. The descriptor generated by the CUDA Driver works as expected for this path, while the CuTeDSL-generated descriptor is not usable.

There is also an earlier, related lowering failure in 4.4.2: on that version the same descriptor-copy path never reaches the byte comparison, because cute_nvgpu.copy_tma_desc fails legalization.

Minimal standalone repro branch:

https://github.com/alecco/cutlass/tree/tma-descriptor-lowering-issue

Steps/Code to reproduce bug

Clone the repro branch:

git clone https://github.com/alecco/cutlass.git
cd cutlass
git checkout tma-descriptor-lowering-issue

Repro 1: broken generated descriptor on 4.5.0.dev1 / matching 4.5.0 runtime wheels:

python -m pip install -e python/CuTeDSL
python -m pip install -U nvidia-cutlass-dsl-libs-base==4.5.0 nvidia-cutlass-dsl-libs-cu13==4.5.0
CUTE_DSL_ARCH=sm_120a python sm120_tma_descriptor_lowering_repro.py --repro descriptor-mismatch

Observed output locally:

differing offsets: [8, 9, 17, 73]
offset  cute_dsl  cuda_driver
     8  0x92      0x20
     9  0x05      0x66
    17  0x00      0x02
    73  0x00      0x04

Repro 2: earlier 4.4.2 lowering failure:

python -m pip install -U nvidia-cutlass-dsl==4.4.2 nvidia-cutlass-dsl-libs-base==4.4.2 nvidia-cutlass-dsl-libs-cu13==4.4.2
CUTE_DSL_ARCH=sm_120a python sm120_tma_descriptor_lowering_repro.py --repro copy-tma-desc-lowering

Observed failure:

failed to legalize operation cute_nvgpu.copy_tma_desc that was explicitly marked illegal

Expected behavior

CuTeDSL should generate an SM120 FP4 TMA descriptor that is usable by generated code and matches the descriptor produced by the CUDA Driver API for the same tensor-map
parameters.

The descriptor generated through CuTeDSL lowering should not cause code using it to fail.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • GPU architecture: SM120
  • CUTE_DSL_ARCH: sm_120a
  • CUDA Toolkit / Driver: CUDA environment with cuTensorMapEncodeTiled
  • Python dependency: torch with torch.float4_e2m1fn_x2
  • CuTeDSL versions checked:
    • 4.5.0.dev1 editable checkout with nvidia-cutlass-dsl-libs-base==4.5.0 and nvidia-cutlass-dsl-libs-cu13==4.5.0
    • nvidia-cutlass-dsl==4.4.2 with matching 4.4.2 runtime wheels

Additional context

The repro compares only the 128-byte opaque tensor-map descriptor payload. It avoids a full GEMM kernel so the issue is isolated to descriptor generation/lowering.
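The comparison itself is a plain byte diff. A minimal sketch of the idea (the function name and the synthetic payloads below are illustrative, not taken from the repro script):

```python
def diff_descriptors(a: bytes, b: bytes) -> list[int]:
    """Return the offsets at which two 128-byte descriptor payloads differ."""
    assert len(a) == len(b) == 128, "TMA tensor-map descriptors are 128 bytes"
    return [i for i in range(128) if a[i] != b[i]]

if __name__ == "__main__":
    # Synthetic payloads standing in for the CuTeDSL and driver descriptors.
    dsl_desc = bytearray(128)
    drv_desc = bytearray(128)
    dsl_desc[8], drv_desc[8] = 0x92, 0x20  # mimic the first observed mismatch
    print("differing offsets:", diff_descriptors(bytes(dsl_desc), bytes(drv_desc)))
```

In the real repro, the two inputs are the opaque CUtensorMap payloads copied back from the DSL path and from cuTensorMapEncodeTiled.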

The CUDA Driver descriptor is created host-side with:

cuTensorMapEncodeTiled
CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B
rank = 3
globalDim = (128, 128, 1)
globalStrides = (64, 8192)
boxDim = (128, 128, 1)
elementStrides = (1, 1, 1)
interleave = CU_TENSOR_MAP_INTERLEAVE_NONE
swizzle = CU_TENSOR_MAP_SWIZZLE_128B
l2Promotion = CU_TENSOR_MAP_L2_PROMOTION_L2_128B
oobFill = CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
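For reference, the parameters above map onto the driver call roughly as follows. This is a sketch, not the repro's exact code: the wrapper function name is made up, error handling is elided, and actually running it requires a CUDA 12.8+ driver exposing CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B plus a suitably aligned device allocation.

```c
#include <cuda.h>

/* Sketch of the host-side encode with the parameters listed above.
 * global_ptr must point to a 16-byte-aligned device allocation. */
static CUresult encode_fp4_tensormap(CUtensorMap *map, void *global_ptr) {
    cuuint64_t global_dim[3]     = {128, 128, 1};
    cuuint64_t global_strides[2] = {64, 8192};  /* in bytes; rank-1 entries */
    cuuint32_t box_dim[3]        = {128, 128, 1};
    cuuint32_t elem_strides[3]   = {1, 1, 1};

    return cuTensorMapEncodeTiled(
        map,
        CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,
        3,                                   /* tensorRank */
        global_ptr,
        global_dim,
        global_strides,
        box_dim,
        elem_strides,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```

Note that globalStrides are byte strides with one entry per dimension beyond the innermost, which is why (64, 8192) is consistent with 128 packed FP4 elements (64 bytes) per row.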

The CuTeDSL descriptor is generated through the corresponding cpasync.make_tiled_tma_atom / TensorMapManager / cpasync.copy_tensormap path.
