`nvidia-cutlass-dsl` 4.5.0: `nvvm.mma.block_scale` lowering produces PTX rejected by ptxas (sm_120/120f/121a)

## Summary

In `nvidia-cutlass-dsl==4.5.0` (the stable release that fixed the `MmaSM120BlockScaledOp` admissibility check for sm_121a), the MLIR-to-PTX lowering of `nvvm.mma.block_scale` for FP4 (E2M1) inputs with UE4M3 scales produces PTX that CUDA 13.0 `ptxas` rejects with:

```
ptxas application ptx input, line 969; error : Unexpected instruction types specified for '_mma'
...50+ instructions, same error...
```

The hand-written equivalent `mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3` compiles cleanly on `.target sm_120a`, `.target sm_120f`, and `.target sm_121a`. So the lowering pipeline is emitting *something different* — likely a slightly-different LLVM intrinsic encoding — that the user-facing ptxas can't decode.

## Environment

- `nvidia-cutlass-dsl==4.5.0` (also `nvidia-cutlass-dsl-libs-base==4.5.0`, `nvidia-cutlass-dsl-libs-cu13==4.5.0`)
- CUDA 13.0 toolkit (ptxas built `Wed_Aug_20_01:53:56_PM_PDT_2025`)
- Hardware: NVIDIA GB10 (sm_121a) — DGX Spark
- Triggered via `flashinfer-python==0.6.11` (the `b12x_fused_moe` backend's NVFP4 path)
- Also reproduced via vLLM PR vllm-project/vllm#40082, `--moe-backend flashinfer_b12x`

## Reproduction

The full IR dump is large (~2000 lines) because ptxas flags ~50 mma instructions per kernel — happy to attach if helpful. The offending MLIR op is:

```mlir
%X = "nvvm.mma.block_scale"(%a, %sa_data, %sa_byteid_lo, %sa_byteid_hi,
                             %b, %sb_data, %sb_byteid_lo, %sb_byteid_hi, %c) <{
    aType = #nvvm.mma_type<e2m1>,
    bType = #nvvm.mma_type<e2m1>,
    blockScaleFormat = #nvvm.block_scale_format<ue4m3>,
    cType = #nvvm.mma_type<f32>,
    layoutA = #nvvm.mma_layout<row>,
    layoutB = #nvvm.mma_layout<col>,
    scaleVecSize = #nvvm.scale_vec_size<x4>,
    shape = #nvvm.shape<m = 16, n = 8, k = 64>
}> : (vector<4xi32>, i32, i16, i16, vector<2xi32>, i32, i16, i16, vector<4xf32>) -> vector<4xf32>
```

With target attribute:

```mlir
"gpu.module"() <{sym_name = "kernels", targets = [#nvvm.target<chip = "sm_121a", flags = {"ptx-cmd-options" = []}>]}> ({...})
```

Switching the target to `sm_120a`, `sm_120f`, or `sm_121a` produces the same ptxas error.

## Hand-written PTX (compiles fine on all three targets)

```ptx
.version 8.8
.target sm_121a       // or sm_120a or sm_120f — all work
.address_size 64
.visible .entry test() {
  .reg .b32 a<4>, b<2>, c<4>, d<4>;
  .reg .b32 sa, sb;
  mov.b32 a0, 0; mov.b32 a1, 0; mov.b32 a2, 0; mov.b32 a3, 0;
  mov.b32 b0, 0; mov.b32 b1, 0;
  mov.b32 c0, 0; mov.b32 c1, 0; mov.b32 c2, 0; mov.b32 c3, 0;
  mov.b32 sa, 0; mov.b32 sb, 0;
  mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3
    {d0,d1,d2,d3}, {a0,a1,a2,a3}, {b0,b1}, {c0,c1,c2,c3}, sa, {0,0}, sb, {0,0};
  ret;
}
```

```bash
$ ptxas test.ptx -o test.cubin --gpu-name=sm_121a; echo exit=$?
exit=0
```

So ptxas accepts this exact instruction at every Blackwell consumer-card target. The cute-dsl emission must be producing slightly different operand-encoding, register-encoding, or intrinsic name that ptxas can't decode (note ptxas calls it `_mma`, with underscore — possibly the LLVM-intrinsic name leaking through rather than the user-facing `mma.sync` mnemonic).

## Smaller related bug

`base_dsl/runtime/cuda.py:_get_gpu_arch_info` has no `(12, 1)` entry in `gpu_arch_map`:

```python
gpu_arch_map = {
    ...
    (12, 0): ("Blackwell", "sm_120a", ["sm_120a"]),  # RTX PRO 6000 / RTX 50 Series
    # missing: (12, 1) for GB10 / DGX Spark
}
return gpu_arch_map.get((major, minor),
                        ("Unknown", f"sm_{major}{minor}", [f"sm_{major}{minor}"]))
```

GB10 falls through to `("Unknown", "sm_121", ["sm_121"])` — note `sm_121` *without* the `a` suffix. Users have to manually set `CUTE_DSL_ARCH=sm_121a` to override.

Suggested one-line fix:

```python
(12, 1): ("Blackwell", "sm_121a", ["sm_121a", "sm_120f"]),  # GB10 / DGX Spark
```

## Why this matters

Multiple downstream issues are blocked on this:
- vllm-project/vllm#40082 (b12x MoE / FP4 GEMM kernels for sm_120/121) — cannot land usefully until cute-dsl emits valid PTX for this MMA kind
- vllm-project/vllm#41519 (shadowlilac-oss's `mimo_v2` repro on RTX PRO 6000 sm_120) — same root cause downstream
- flashinfer-ai/flashinfer#2776 (NVFP4 MoE crash on sm_120/121)
- NVIDIA forum thread "MiMo-V2.5 (New model)" (multiple consumer-Blackwell users)

Note that the precompiled-cubin alternative path (`--moe-backend flashinfer_cutlass` in vLLM) is ALSO broken on consumer Blackwell, but for a different reason: `flashinfer-cubin==0.6.11` ships 12,681 FP4 cubins targeting `Sm100a`/`Sm100f`/`Sm103a` only — **zero sm_120 or sm_121 cubins**. So neither the JIT path (this bug) nor the AOT path (missing wheel coverage) currently works. Closing this one would unblock the JIT path, at minimum.

Happy to provide full reproducer (Dockerfile + recipe + IR dump) if useful for triage.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`nvidia-cutlass-dsl` 4.5.0: `nvvm.mma.block_scale` lowering produces PTX rejected by ptxas (sm_120/120f/121a) #3227

Summary

Environment

Reproduction

Hand-written PTX (compiles fine on all three targets)

Smaller related bug

Why this matters

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

nvidia-cutlass-dsl 4.5.0: nvvm.mma.block_scale lowering produces PTX rejected by ptxas (sm_120/120f/121a) #3227

Description

Summary

Environment

Reproduction

Hand-written PTX (compiles fine on all three targets)

Smaller related bug

Why this matters

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`nvidia-cutlass-dsl` 4.5.0: `nvvm.mma.block_scale` lowering produces PTX rejected by ptxas (sm_120/120f/121a) #3227