Skip to content

nvidia-cutlass-dsl 4.5.0: nvvm.mma.block_scale lowering produces PTX rejected by ptxas (sm_120/120f/121a) #3227

@idonati

Description

@idonati

Summary

In nvidia-cutlass-dsl==4.5.0 (the stable release that fixed the MmaSM120BlockScaledOp admissibility check for sm_121a), the MLIR-to-PTX lowering of nvvm.mma.block_scale for FP4 (E2M1) inputs with UE4M3 scales produces PTX that CUDA 13.0 ptxas rejects with:

ptxas application ptx input, line 969; error : Unexpected instruction types specified for '_mma'
...50+ instructions, same error...

The hand-written equivalent mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3 compiles cleanly on .target sm_120a, .target sm_120f, and .target sm_121a. So the lowering pipeline is emitting something different — likely a slightly-different LLVM intrinsic encoding — that the user-facing ptxas can't decode.

Environment

  • nvidia-cutlass-dsl==4.5.0 (also nvidia-cutlass-dsl-libs-base==4.5.0, nvidia-cutlass-dsl-libs-cu13==4.5.0)
  • CUDA 13.0 toolkit (ptxas built Wed_Aug_20_01:53:56_PM_PDT_2025)
  • Hardware: NVIDIA GB10 (sm_121a) — DGX Spark
  • Triggered via flashinfer-python==0.6.11 (the b12x_fused_moe backend's NVFP4 path)
  • Also reproduced via vLLM PR Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 vllm-project/vllm#40082, --moe-backend flashinfer_b12x

Reproduction

The full IR dump is large (~2000 lines) because ptxas flags ~50 mma instructions per kernel — happy to attach if helpful. The offending MLIR op is:

%X = "nvvm.mma.block_scale"(%a, %sa_data, %sa_byteid_lo, %sa_byteid_hi,
                             %b, %sb_data, %sb_byteid_lo, %sb_byteid_hi, %c) <{
    aType = #nvvm.mma_type<e2m1>,
    bType = #nvvm.mma_type<e2m1>,
    blockScaleFormat = #nvvm.block_scale_format<ue4m3>,
    cType = #nvvm.mma_type<f32>,
    layoutA = #nvvm.mma_layout<row>,
    layoutB = #nvvm.mma_layout<col>,
    scaleVecSize = #nvvm.scale_vec_size<x4>,
    shape = #nvvm.shape<m = 16, n = 8, k = 64>
}> : (vector<4xi32>, i32, i16, i16, vector<2xi32>, i32, i16, i16, vector<4xf32>) -> vector<4xf32>

With target attribute:

"gpu.module"() <{sym_name = "kernels", targets = [#nvvm.target<chip = "sm_121a", flags = {"ptx-cmd-options" = []}>]}> ({...})

Switching the target to sm_120a, sm_120f, or sm_121a produces the same ptxas error.

Hand-written PTX (compiles fine on all three targets)

.version 8.8
.target sm_121a       // or sm_120a or sm_120f — all work
.address_size 64
.visible .entry test() {
  .reg .b32 a<4>, b<2>, c<4>, d<4>;
  .reg .b32 sa, sb;
  mov.b32 a0, 0; mov.b32 a1, 0; mov.b32 a2, 0; mov.b32 a3, 0;
  mov.b32 b0, 0; mov.b32 b1, 0;
  mov.b32 c0, 0; mov.b32 c1, 0; mov.b32 c2, 0; mov.b32 c3, 0;
  mov.b32 sa, 0; mov.b32 sb, 0;
  mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3
    {d0,d1,d2,d3}, {a0,a1,a2,a3}, {b0,b1}, {c0,c1,c2,c3}, sa, {0,0}, sb, {0,0};
  ret;
}
$ ptxas test.ptx -o test.cubin --gpu-name=sm_121a; echo exit=$?
exit=0

So ptxas accepts this exact instruction at every Blackwell consumer-card target. The cute-dsl emission must be producing slightly different operand-encoding, register-encoding, or intrinsic name that ptxas can't decode (note ptxas calls it _mma, with underscore — possibly the LLVM-intrinsic name leaking through rather than the user-facing mma.sync mnemonic).

Smaller related bug

base_dsl/runtime/cuda.py:_get_gpu_arch_info has no (12, 1) entry in gpu_arch_map:

gpu_arch_map = {
    ...
    (12, 0): ("Blackwell", "sm_120a", ["sm_120a"]),  # RTX PRO 6000 / RTX 50 Series
    # missing: (12, 1) for GB10 / DGX Spark
}
return gpu_arch_map.get((major, minor),
                        ("Unknown", f"sm_{major}{minor}", [f"sm_{major}{minor}"]))

GB10 falls through to ("Unknown", "sm_121", ["sm_121"]) — note sm_121 without the a suffix. Users have to manually set CUTE_DSL_ARCH=sm_121a to override.

Suggested one-line fix:

(12, 1): ("Blackwell", "sm_121a", ["sm_121a", "sm_120f"]),  # GB10 / DGX Spark

Why this matters

Multiple downstream issues are blocked on this:

Note that the precompiled-cubin alternative path (--moe-backend flashinfer_cutlass in vLLM) is ALSO broken on consumer Blackwell, but for a different reason: flashinfer-cubin==0.6.11 ships 12,681 FP4 cubins targeting Sm100a/Sm100f/Sm103a only — zero sm_120 or sm_121 cubins. So neither the JIT path (this bug) nor the AOT path (missing wheel coverage) currently works. Closing this one would unblock the JIT path, at minimum.

Happy to provide full reproducer (Dockerfile + recipe + IR dump) if useful for triage.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions