Summary

In nvidia-cutlass-dsl==4.5.0 (the stable release that fixed the MmaSM120BlockScaledOp admissibility check for sm_121a), the MLIR-to-PTX lowering of nvvm.mma.block_scale for FP4 (E2M1) inputs with UE4M3 scales produces PTX that CUDA 13.0 ptxas rejects with:

    ptxas application ptx input, line 969; error : Unexpected instruction types specified for '_mma'
    ...50+ instructions, same error...

The hand-written equivalent mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3 compiles cleanly on .target sm_120a, .target sm_120f, and .target sm_121a. So the lowering pipeline is emitting something different, likely a slightly different LLVM intrinsic encoding, that the user-facing ptxas can't decode.
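To narrow down what differs, the mnemonics in the emitted PTX can be compared qualifier by qualifier against the hand-written one. The following is a minimal Python sketch, not part of cutlass-dsl: the helper names (mma_mnemonics, qualifier_diff) are hypothetical, and it assumes the PTX dump is available as text with one instruction per line. Whether the mismatch shows up in the qualifiers at all, rather than in the operand registers, is not established by this report.

```python
import re

# The hand-written mnemonic that ptxas accepts, copied from the report.
EXPECTED = ("mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4"
            ".block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3")

# Optional guard predicate (e.g. "@%p1"), then an opcode starting with
# "mma." or "_mma." followed by its dot-separated qualifiers.
_MMA_RE = re.compile(r"^\s*(?:@!?%\w+\s+)?(_?mma\.[\w.:]+)")


def mma_mnemonics(ptx_text):
    """Return (line_number, mnemonic) for every mma instruction found."""
    found = []
    for lineno, line in enumerate(ptx_text.splitlines(), start=1):
        match = _MMA_RE.match(line)
        if match:
            found.append((lineno, match.group(1)))
    return found


def qualifier_diff(actual, expected=EXPECTED):
    """Return (missing, extra) qualifiers of `actual` relative to `expected`."""
    actual_parts = actual.split(".")
    expected_parts = expected.split(".")
    missing = [q for q in expected_parts if q not in actual_parts]
    extra = [q for q in actual_parts if q not in expected_parts]
    return missing, extra
```

Running mma_mnemonics over the PTX from the IR dump and passing each mnemonic to qualifier_diff would show at a glance whether, for example, the scale_vec or scale-type qualifier differs from the hand-written form.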
Environment
- nvidia-cutlass-dsl==4.5.0 (also nvidia-cutlass-dsl-libs-base==4.5.0, nvidia-cutlass-dsl-libs-cu13==4.5.0)
- … Wed_Aug_20_01:53:56_PM_PDT_2025)
- flashinfer-python==0.6.11 (the b12x_fused_moe backend's NVFP4 path)
- --moe-backend flashinfer_b12x

Reproduction
The full IR dump is large (~2000 lines) because ptxas flags ~50 mma instructions per kernel — happy to attach if helpful. The offending MLIR op is:
With target attribute:
Switching the target to sm_120a, sm_120f, or sm_121a produces the same ptxas error.

Hand-written PTX (compiles fine on all three targets)
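A minimal sketch of how the three-target check could be scripted, assuming the hand-written kernel is available as PTX text and ptxas is on PATH. The helper names (retarget, ptxas_command, check_targets) are hypothetical. The sketch also assumes that rewriting the .target directive and passing a matching -arch to ptxas is enough to exercise each target.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path

# The three targets named in this report.
TARGETS = ("sm_120a", "sm_120f", "sm_121a")

# Captures the ".target " prefix so only the sm_XXX token is replaced.
_TARGET_RE = re.compile(r"^([ \t]*\.target[ \t]+)sm_\w+", re.MULTILINE)


def retarget(ptx_text, target):
    """Return ptx_text with the sm_XXX in its .target directive replaced."""
    new_text, count = _TARGET_RE.subn(
        lambda m: m.group(1) + target, ptx_text, count=1)
    if count == 0:
        raise ValueError("no .target directive found")
    return new_text


def ptxas_command(ptx_path, target, out_path):
    """Build the ptxas invocation for one target."""
    return ["ptxas", f"-arch={target}", str(ptx_path), "-o", str(out_path)]


def check_targets(ptx_text, targets=TARGETS):
    """Run ptxas once per target; return {target: (returncode, stderr)}."""
    if shutil.which("ptxas") is None:
        raise RuntimeError("ptxas not found on PATH")
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for target in targets:
            src = Path(tmp) / f"kernel_{target}.ptx"
            out = Path(tmp) / f"kernel_{target}.cubin"
            src.write_text(retarget(ptx_text, target))
            proc = subprocess.run(ptxas_command(src, target, out),
                                  capture_output=True, text=True)
            results[target] = (proc.returncode, proc.stderr.strip())
    return results
```

Feeding the same function the PTX emitted by the DSL would give a side-by-side view of which targets accept which input.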
So ptxas accepts this exact instruction at every Blackwell consumer-card target. The cute-dsl emission must be producing a slightly different operand encoding, register encoding, or intrinsic name that ptxas can't decode (note that ptxas calls it _mma, with an underscore, possibly the LLVM intrinsic name leaking through rather than the user-facing mma.sync mnemonic).

Smaller related bug
base_dsl/runtime/cuda.py:_get_gpu_arch_info has no (12, 1) entry in gpu_arch_map, so GB10 falls through to ("Unknown", "sm_121", ["sm_121"]). Note sm_121 without the a suffix. Users have to manually set CUTE_DSL_ARCH=sm_121a to override.

Suggested one-line fix:
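The exact contents of gpu_arch_map are not reproduced here, so the following is only a sketch of the lookup and of the kind of entry that would close the gap. GPU_ARCH_MAP, its (12, 0) entry, the "Blackwell" label, and the effective_arch helper are all illustrative assumptions. Only the fallback tuple shape and the CUTE_DSL_ARCH variable come from the report.

```python
import os

# Hypothetical stand-in for gpu_arch_map; the (12, 0) entry is illustrative.
GPU_ARCH_MAP = {
    (12, 0): ("Blackwell", "sm_120a", ["sm_120a"]),
}

# The same table with the missing (12, 1) entry added (label is assumed).
PATCHED_ARCH_MAP = {
    **GPU_ARCH_MAP,
    (12, 1): ("Blackwell", "sm_121a", ["sm_121a"]),
}


def get_gpu_arch_info(major, minor, arch_map=GPU_ARCH_MAP):
    """Look up a compute capability; unknown ones get a suffix-less arch."""
    plain = f"sm_{major}{minor}"
    return arch_map.get((major, minor), ("Unknown", plain, [plain]))


def effective_arch(major, minor, arch_map=GPU_ARCH_MAP, env=None):
    """Arch string after applying the CUTE_DSL_ARCH override, if set."""
    env = os.environ if env is None else env
    override = env.get("CUTE_DSL_ARCH")
    return override or get_gpu_arch_info(major, minor, arch_map)[1]
```

With the unpatched table, get_gpu_arch_info(12, 1) yields the suffix-less fallback described above. With the added entry, the lookup resolves to sm_121a without needing the environment override.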
Why this matters
Multiple downstream issues are blocked on this:
- … mimo_v2 repro on RTX PRO 6000 sm_120), same root cause downstream
- NVIDIA forum thread "MiMo-V2.5 (New model)" (multiple consumer-Blackwell users)

Note that the precompiled-cubin alternative path (--moe-backend flashinfer_cutlass in vLLM) is ALSO broken on consumer Blackwell, but for a different reason: flashinfer-cubin==0.6.11 ships 12,681 FP4 cubins targeting Sm100a/Sm100f/Sm103a only, with zero sm_120 or sm_121 cubins. So neither the JIT path (this bug) nor the AOT path (missing wheel coverage) currently works. Closing this one would unblock the JIT path, at minimum.

Happy to provide full reproducer (Dockerfile + recipe + IR dump) if useful for triage.
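For anyone who wants to re-check the cubin coverage claim, a tally by architecture tag can be sketched as follows. This assumes the wheel has been unpacked to a directory, that the files end in .cubin, and that each path carries a tag such as Sm100a. The layout and the cubin_arch_counts helper are assumptions, not documented flashinfer-cubin behaviour.

```python
import re
from collections import Counter
from pathlib import Path

# Matches tags such as "Sm100a", "sm_120", or "SM103a" anywhere in a path.
_ARCH_RE = re.compile(r"sm_?(\d{2,3}[af]?)", re.IGNORECASE)


def cubin_arch_counts(root):
    """Count *.cubin files under `root`, grouped by the first arch tag found."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.cubin"):
        match = _ARCH_RE.search(str(path.relative_to(root)))
        key = f"sm_{match.group(1).lower()}" if match else "unknown"
        counts[key] += 1
    return counts
```

A result with no sm_120 or sm_121 keys would be consistent with the coverage gap described above.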