Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
CuTeDSL TMA descriptor lowering for SM120 FP4 generates a descriptor that does not match the descriptor produced by the CUDA Driver API. This is not just a byte-level mismatch: code that uses the CuTeDSL-generated descriptor fails at runtime.
For the same 128 x 128 FP4 tensor-map shape, the CuTeDSL-lowered descriptor differs from a host-side cuTensorMapEncodeTiled descriptor using
CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B.
In local checks with a 4.5.0.dev1 editable checkout and matching 4.5.0 runtime wheels, the descriptors differ at offsets 8, 9, 17, and 73. The descriptor generated by the CUDA Driver works as expected for this path, while the CuTeDSL-generated descriptor is not usable.
There is also an earlier, related lowering failure in 4.4.2: the same descriptor-copy path never reaches the byte comparison because cute_nvgpu.copy_tma_desc fails legalization.
Minimal standalone repro branch:
https://github.com/alecco/cutlass/tree/tma-descriptor-lowering-issue
Steps/Code to reproduce bug
Clone the repro branch:
git clone https://github.com/alecco/cutlass.git
cd cutlass
git checkout tma-descriptor-lowering-issue
Repro 1: broken generated descriptor on 4.5.0.dev1 / matching 4.5.0 runtime wheels:
python -m pip install -e python/CuTeDSL
python -m pip install -U nvidia-cutlass-dsl-libs-base==4.5.0 nvidia-cutlass-dsl-libs-cu13==4.5.0
CUTE_DSL_ARCH=sm_120a python sm120_tma_descriptor_lowering_repro.py --repro descriptor-mismatch
Observed output locally:
differing offsets: [8, 9, 17, 73]
offset  cute_dsl  cuda_driver
8       0x92      0x20
9       0x05      0x66
17      0x00      0x02
73      0x00      0x04
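The byte comparison the repro performs can be sketched as follows. This is a minimal illustration, not the repro script itself; diff_tensor_maps is a hypothetical helper name.

```python
def diff_tensor_maps(dsl_desc: bytes, drv_desc: bytes) -> list[tuple[int, int, int]]:
    """Compare two 128-byte opaque CUtensorMap payloads byte by byte.

    Returns (offset, dsl_byte, driver_byte) for every position that differs.
    """
    assert len(dsl_desc) == len(drv_desc) == 128
    return [(i, dsl_desc[i], drv_desc[i])
            for i in range(128) if dsl_desc[i] != drv_desc[i]]

# Example with two synthetic payloads differing at offsets 8 and 9:
a, b = bytearray(128), bytearray(128)
a[8], b[8] = 0x92, 0x20
a[9], b[9] = 0x05, 0x66
print(diff_tensor_maps(bytes(a), bytes(b)))  # [(8, 146, 32), (9, 5, 102)]
```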
Repro 2: earlier 4.4.2 lowering failure:
python -m pip install -U nvidia-cutlass-dsl==4.4.2 nvidia-cutlass-dsl-libs-base==4.4.2 nvidia-cutlass-dsl-libs-cu13==4.4.2
CUTE_DSL_ARCH=sm_120a python sm120_tma_descriptor_lowering_repro.py --repro copy-tma-desc-lowering
Observed failure:
failed to legalize operation cute_nvgpu.copy_tma_desc that was explicitly marked illegal
Expected behavior
CuTeDSL should generate an SM120 FP4 TMA descriptor that is usable by generated code and matches the descriptor produced by the CUDA Driver API for the same tensor-map
parameters.
The descriptor generated through CuTeDSL lowering should not cause code using it to fail.
Environment details (please complete the following information):
- Environment location: Bare-metal
- GPU architecture: SM120 (CUTE_DSL_ARCH: sm_120a)
- CUDA Toolkit / Driver: CUDA environment with cuTensorMapEncodeTiled
- Python dependency: torch with torch.float4_e2m1fn_x2
- CuTeDSL versions checked:
  - 4.5.0.dev1 editable checkout with nvidia-cutlass-dsl-libs-base==4.5.0 and nvidia-cutlass-dsl-libs-cu13==4.5.0
  - nvidia-cutlass-dsl==4.4.2 with matching 4.4.2 runtime wheels
Additional context
The repro compares only the 128-byte opaque tensor-map descriptor payload. It avoids a full GEMM kernel so the issue is isolated to descriptor generation/lowering.
The CUDA Driver descriptor is created host-side with cuTensorMapEncodeTiled and the following parameters:
CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B
rank = 3
globalDim = (128, 128, 1)
globalStrides = (64, 8192)  (in bytes; the driver API takes rank-1 stride entries)
boxDim = (128, 128, 1)
elementStrides = (1, 1, 1)
interleave = CU_TENSOR_MAP_INTERLEAVE_NONE
swizzle = CU_TENSOR_MAP_SWIZZLE_128B
l2Promotion = CU_TENSOR_MAP_L2_PROMOTION_L2_128B
oobFill = CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
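The globalStrides values follow directly from the 4-bit element width. A quick sanity check, assuming a densely packed row-major 128 x 128 FP4 tensor (fp4_strides_bytes is a hypothetical helper, not part of the repro):

```python
def fp4_strides_bytes(rows: int, cols: int) -> tuple[int, int]:
    """Byte strides for a row-major tensor of packed 4-bit elements.

    Two FP4 values pack into one byte, so a row of `cols` elements
    occupies cols // 2 bytes.
    """
    assert cols % 2 == 0, "FP4 rows must pack to whole bytes"
    row_stride = cols // 2            # 128 elements -> 64 bytes
    plane_stride = rows * row_stride  # 128 rows -> 8192 bytes
    return (row_stride, plane_stride)

print(fp4_strides_bytes(128, 128))  # (64, 8192)
```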
The CuTeDSL descriptor is generated through the corresponding cpasync.make_tiled_tma_atom / TensorMapManager / cpasync.copy_tensormap path.