cuTile Rust 0.2.0 adds low-precision inference support and accompanies our paper, Fearless Concurrency on the GPU.
Paper: https://arxiv.org/abs/2606.15991
Artifacts: cutile-benchmarks/paper/
Announcement: #164
Highlights
- CUDA 13.3-oriented low-precision support, including NVFP4 packing and unpacking and block-scaled matrix multiply.
- Runnable NVFP4 and MXFP8 examples.
- New cutile-kernels crate with reusable inference-oriented kernels written in cuTile Rust.
- Paper reproducibility artifacts under cutile-benchmarks/paper/, including benchmark harnesses, CSV results, hardware details, and clock settings.
- Expanded compile-only regression coverage and example/test cleanup for release validation.
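For readers new to the format, the NVFP4 packing and unpacking mentioned in the highlights can be illustrated on the host in plain Rust. This is a minimal sketch and not the cuTile Rust API: every function name here is hypothetical, the low-nibble-first byte layout is an assumption, the per-block scale is kept as f32 for simplicity (NVFP4 stores it in FP8), and ties round toward the smaller magnitude rather than following hardware rounding.

```rust
/// Magnitudes representable by a 4-bit E2M1 value, indexed by the low 3 bits.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Decode a 4-bit E2M1 code (sign in bit 3) to f32.
fn decode_e2m1(code: u8) -> f32 {
    let mag = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -mag } else { mag }
}

/// Encode an f32 as the nearest E2M1 code, saturating at +/-6.
/// Simplification: ties go to the smaller magnitude; NaN is not handled specially.
fn encode_e2m1(x: f32) -> u8 {
    let sign: u8 = if x.is_sign_negative() { 0x8 } else { 0x0 };
    let a = x.abs().min(6.0);
    let mut best = 0usize;
    for (i, m) in E2M1_MAGNITUDES.iter().enumerate() {
        if (a - m).abs() < (a - E2M1_MAGNITUDES[best]).abs() {
            best = i;
        }
    }
    sign | best as u8
}

/// Pack two 4-bit codes per byte. Assumed layout: first element in the low nibble.
fn pack_fp4(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|p| (p[0] & 0xF) | ((p.get(1).copied().unwrap_or(0) & 0xF) << 4))
        .collect()
}

/// Unpack bytes back into 4-bit codes, low nibble first.
fn unpack_fp4(bytes: &[u8]) -> Vec<u8> {
    bytes.iter().flat_map(|b| [b & 0xF, b >> 4]).collect()
}

/// Quantize one block: pick a scale so the largest magnitude maps to 6,
/// then encode and pack every element. Returns (scale, packed bytes).
fn quantize_block(values: &[f32]) -> (f32, Vec<u8>) {
    let amax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if amax > 0.0 { amax / 6.0 } else { 1.0 };
    let codes: Vec<u8> = values.iter().map(|v| encode_e2m1(v / scale)).collect();
    (scale, pack_fp4(&codes))
}

/// Reverse of `quantize_block`; `len` drops the padding nibble of odd-length blocks.
fn dequantize_block(scale: f32, packed: &[u8], len: usize) -> Vec<f32> {
    unpack_fp4(packed)
        .into_iter()
        .take(len)
        .map(|c| decode_e2m1(c) * scale)
        .collect()
}
```

Calling `quantize_block` on a block of values yields one scale plus half as many bytes as elements, and `dequantize_block` recovers an approximation of the inputs. The runnable NVFP4 example in the repository is the reference for the actual layout and scale format.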
Performance context
On NVIDIA B200, cuTile Rust reaches about 7 TB/s for element-wise operations (about 91% of peak memory bandwidth) and about 2 PFlop/s for GEMM (about 92% of dense f16 peak). Safe Rust persistent GEMM reaches 2.07 PFlop/s at M=N=K=8192, within 0.3% of the corresponding low-level Tile IR variant.
We also evaluated Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200. These results show strong performance on memory-bound inference tasks and are consistent with our HBM roofline analysis.
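To see why batch-1 decode is treated as memory-bound, the sketch below computes a simple bandwidth roofline: if every generated token has to stream all model weights from memory once, throughput cannot exceed bandwidth divided by weight bytes. This is an illustration, not the paper's analysis. The function names are hypothetical, and the model ignores KV-cache and activation traffic.

```rust
/// Upper bound on batch-1 decode throughput (tokens/s), assuming each token
/// reads every weight exactly once and nothing else touches memory.
fn roofline_tokens_per_sec(
    params: f64,
    bytes_per_param: f64,
    bandwidth_bytes_per_sec: f64,
) -> f64 {
    bandwidth_bytes_per_sec / (params * bytes_per_param)
}

/// Fraction of the roofline bound that a measured throughput achieves.
fn fraction_of_roofline(measured_tokens_per_sec: f64, bound_tokens_per_sec: f64) -> f64 {
    measured_tokens_per_sec / bound_tokens_per_sec
}
```

As a worked example under loudly stated assumptions: taking a nominal 32e9 parameters for Qwen3-32B, 2 bytes per parameter (16-bit weights, which this release note does not state), and the roughly 7 TB/s element-wise bandwidth reported above for B200, `roofline_tokens_per_sec(32e9, 2.0, 7e12)` gives about 109 tokens/s. The reported 82 tokens/s would then be roughly 75% of that bound. Different weight precision or parameter counts change the figure proportionally.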
Notes
cuTile Rust remains early-stage research software. CUDA 13.3 is required for the new low-precision features, and hardware-specific features such as native FP4 require architectures that support them.