Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
CuTeDSL TMA descriptor lowering for SM120 FP4 generates a descriptor that does not match the descriptor produced by the CUDA Driver API. This is not just a byte-level mismatch: code that uses the CuTeDSL-generated descriptor fails at runtime.
For the same 128 x 128 FP4 tensor-map shape, the CuTeDSL-lowered descriptor differs from a host-side cuTensorMapEncodeTiled descriptor using
CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B.
In local checks with a 4.5.0.dev1 editable checkout and matching 4.5.0 runtime wheels, the descriptors differ at offsets 8, 9, 17, and 73. The descriptor generated by the CUDA Driver works as expected for this path, while the CuTeDSL-generated descriptor is not usable.
There is also an earlier, related lowering failure in 4.4.2: the same descriptor-copy path never reaches the byte comparison because cute_nvgpu.copy_tma_desc fails legalization.
Minimal standalone repro branch:
https://github.com/alecco/cutlass/tree/tma-descriptor-lowering-issue
Steps/Code to reproduce bug
Clone the repro branch:
git clone https://github.com/alecco/cutlass.git
cd cutlass
git checkout tma-descriptor-lowering-issue
Repro 1: broken generated descriptor on 4.5.0.dev1 / matching 4.5.0 runtime wheels:
python -m pip install -e python/CuTeDSL
python -m pip install -U nvidia-cutlass-dsl-libs-base==4.5.0 nvidia-cutlass-dsl-libs-cu13==4.5.0
CUTE_DSL_ARCH=sm_120a python sm120_tma_descriptor_lowering_repro.py --repro descriptor-mismatch
Observed output locally:
differing offsets: [8, 9, 17, 73]
offset  cute_dsl  cuda_driver
8       0x92      0x20
9       0x05      0x66
17      0x00      0x02
73      0x00      0x04
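The byte comparison the repro performs can be sketched as follows. This is a minimal illustration, not the repro script itself; diff_tensor_maps is a hypothetical helper name.

```python
def diff_tensor_maps(dsl_desc: bytes, drv_desc: bytes) -> list[tuple[int, int, int]]:
    """Compare two 128-byte opaque CUtensorMap payloads byte by byte.

    Returns (offset, dsl_byte, driver_byte) for every position that differs.
    """
    assert len(dsl_desc) == len(drv_desc) == 128
    return [(i, dsl_desc[i], drv_desc[i])
            for i in range(128) if dsl_desc[i] != drv_desc[i]]

# Example with two synthetic payloads differing at offsets 8 and 9:
a, b = bytearray(128), bytearray(128)
a[8], b[8] = 0x92, 0x20
a[9], b[9] = 0x05, 0x66
print(diff_tensor_maps(bytes(a), bytes(b)))  # [(8, 146, 32), (9, 5, 102)]
```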
Repro 2: earlier 4.4.2 lowering failure:
python -m pip install -U nvidia-cutlass-dsl==4.4.2 nvidia-cutlass-dsl-libs-base==4.4.2 nvidia-cutlass-dsl-libs-cu13==4.4.2
CUTE_DSL_ARCH=sm_120a python sm120_tma_descriptor_lowering_repro.py --repro copy-tma-desc-lowering
Observed failure:
failed to legalize operation cute_nvgpu.copy_tma_desc that was explicitly marked illegal
Expected behavior
CuTeDSL should generate an SM120 FP4 TMA descriptor that is usable by generated code and matches the descriptor produced by the CUDA Driver API for the same tensor-map
parameters.
The descriptor generated through CuTeDSL lowering should not cause code using it to fail.
Environment details (please complete the following information):
- Environment location: Bare-metal
- GPU architecture: SM120 (CUTE_DSL_ARCH: sm_120a)
- CUDA Toolkit / Driver: CUDA environment with cuTensorMapEncodeTiled
- Python dependency: torch with torch.float4_e2m1fn_x2
- CuTeDSL versions checked:
  - 4.5.0.dev1 editable checkout with nvidia-cutlass-dsl-libs-base==4.5.0 and nvidia-cutlass-dsl-libs-cu13==4.5.0
  - nvidia-cutlass-dsl==4.4.2 with matching 4.4.2 runtime wheels
Additional context
The repro compares only the 128-byte opaque tensor-map descriptor payload. It avoids a full GEMM kernel so the issue is isolated to descriptor generation/lowering.
The CUDA Driver descriptor is created host-side with cuTensorMapEncodeTiled and the following parameters:
CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B
rank = 3
globalDim = (128, 128, 1)
globalStrides = (64, 8192)  (in bytes; the driver API takes rank-1 stride entries)
boxDim = (128, 128, 1)
elementStrides = (1, 1, 1)
interleave = CU_TENSOR_MAP_INTERLEAVE_NONE
swizzle = CU_TENSOR_MAP_SWIZZLE_128B
l2Promotion = CU_TENSOR_MAP_L2_PROMOTION_L2_128B
oobFill = CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
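The globalStrides values follow directly from the 4-bit element width. A quick sanity check, assuming a densely packed row-major 128 x 128 FP4 tensor (fp4_strides_bytes is a hypothetical helper, not part of the repro):

```python
def fp4_strides_bytes(rows: int, cols: int) -> tuple[int, int]:
    """Byte strides for a row-major tensor of packed 4-bit elements.

    Two FP4 values pack into one byte, so a row of `cols` elements
    occupies cols // 2 bytes.
    """
    assert cols % 2 == 0, "FP4 rows must pack to whole bytes"
    row_stride = cols // 2            # 128 elements -> 64 bytes
    plane_stride = rows * row_stride  # 128 rows -> 8192 bytes
    return (row_stride, plane_stride)

print(fp4_strides_bytes(128, 128))  # (64, 8192)
```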
The CuTeDSL descriptor is generated through the corresponding cpasync.make_tiled_tma_atom / TensorMapManager / cpasync.copy_tensormap path.