fix(CuTeDSL): correct FP4 tensor K dimension in grouped blockscaled GEMM #3102
Open · Hale423 wants to merge 1 commit into NVIDIA:main
Conversation
Fixes #3057
Float4E2M1FN packs two elements per byte, so the K storage dimension must be halved (k // 2) when creating the int8 device tensors for A and B. This matches the existing, correct handling in dense_blockscaled_gemm_persistent.py (lines 2493-2498).
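A minimal sketch of the packing arithmetic behind the fix (the helper name, shapes, and torch-based allocation are illustrative assumptions, not the exact code touched by this PR):

```python
import torch

def make_fp4_operand(m: int, k: int, device: str = "cuda") -> torch.Tensor:
    """Allocate the int8 backing store for an (m, k) Float4E2M1FN operand.

    Hypothetical helper: Float4E2M1FN packs two 4-bit elements per byte,
    so the byte-level K extent of the storage tensor is k // 2, not k.
    """
    assert k % 2 == 0, "K must be even so FP4 elements pack into whole bytes"
    # The buggy version allocated (m, k), giving each operand twice the
    # intended K extent in bytes:
    #   torch.randint(-128, 128, (m, k), dtype=torch.int8, device=device)
    return torch.randint(-128, 128, (m, k // 2), dtype=torch.int8, device=device)

a = make_fp4_operand(256, 128)  # backing store shape: (256, 64)
```

The int8 dtype here is raw storage only: each byte is later reinterpreted by the kernel as two packed FP4 values, which is why the logical K (128 elements) and the storage K (64 bytes) differ by a factor of two.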