v0.2.0

elibol released this 16 Jun 19:41
de6c5cd

cuTile Rust 0.2.0 adds low-precision inference support and accompanies our paper, Fearless Concurrency on the GPU.

Paper: https://arxiv.org/abs/2606.15991
Artifacts: cutile-benchmarks/paper/
Announcement: #164

Highlights

  • Low-precision support targeting CUDA 13.3, including NVFP4 packing and unpacking as well as block-scaled matrix multiply.
  • Runnable NVFP4 and MXFP8 examples.
  • New cutile-kernels crate with reusable inference-oriented kernels written in cuTile Rust.
  • Paper reproducibility artifacts under cutile-benchmarks/paper/, including benchmark harnesses, CSV results, hardware details, and clock settings.
  • Expanded compile-only regression coverage and example/test cleanup for release validation.
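
For readers new to the format, the NVFP4 packing and block scaling mentioned above can be sketched on the host in plain Rust. This is an illustration of the data layout only, not the cuTile Rust or cutile-kernels API: every function name here is hypothetical, the nibble order (first element in the low nibble) is an assumption, ties round toward the smaller magnitude for simplicity, and the per-block scale is kept as an f32 where NVFP4 stores it as FP8 E4M3.

```rust
/// Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
/// The array index equals the low three bits of the 4-bit code.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Encode an f32 as a 4-bit E2M1 code (bit 3 = sign), picking the nearest
/// representable magnitude and saturating at 6.0. Ties go to the smaller magnitude.
fn encode_e2m1(x: f32) -> u8 {
    let sign: u8 = if x.is_sign_negative() { 0x8 } else { 0x0 };
    let mag = x.abs();
    let mut best = 0usize;
    for (i, &m) in E2M1_MAGNITUDES.iter().enumerate() {
        if (mag - m).abs() < (mag - E2M1_MAGNITUDES[best]).abs() {
            best = i;
        }
    }
    sign | best as u8
}

/// Decode a 4-bit E2M1 code back to f32.
fn decode_e2m1(code: u8) -> f32 {
    let mag = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -mag } else { mag }
}

/// Pack two 4-bit codes per byte (assumed order: first code in the low nibble).
fn pack_fp4(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0xF;
            let hi = pair.get(1).copied().unwrap_or(0) & 0xF;
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack `len` 4-bit codes from packed bytes.
fn unpack_fp4(bytes: &[u8], len: usize) -> Vec<u8> {
    bytes.iter().flat_map(|&b| [b & 0xF, b >> 4]).take(len).collect()
}

/// Quantize one block (16 elements in NVFP4) to a scale plus packed codes.
/// The scale maps the block's largest magnitude onto 6.0, the largest E2M1 value.
fn quantize_block(values: &[f32]) -> (f32, Vec<u8>) {
    let amax = values.iter().fold(0.0f32, |a, &v| a.max(v.abs()));
    let scale = if amax > 0.0 { amax / 6.0 } else { 1.0 };
    let codes: Vec<u8> = values.iter().map(|&v| encode_e2m1(v / scale)).collect();
    (scale, pack_fp4(&codes))
}

/// Dequantize a packed block back to f32.
fn dequantize_block(scale: f32, packed: &[u8], len: usize) -> Vec<f32> {
    unpack_fp4(packed, len)
        .into_iter()
        .map(|c| decode_e2m1(c) * scale)
        .collect()
}

/// Dot product of two quantized blocks: accumulate the unscaled products,
/// then apply both block scales once. A block-scaled matrix multiply repeats
/// this for every block along K.
fn block_scaled_dot(a_scale: f32, a: &[u8], b_scale: f32, b: &[u8], len: usize) -> f32 {
    let a_codes = unpack_fp4(a, len);
    let b_codes = unpack_fp4(b, len);
    let acc: f32 = a_codes
        .iter()
        .zip(b_codes.iter())
        .map(|(&x, &y)| decode_e2m1(x) * decode_e2m1(y))
        .sum();
    acc * a_scale * b_scale
}

fn main() {
    let block: [f32; 16] = [
        0.1, -0.4, 0.9, 1.2, -2.5, 3.3, 0.0, 5.8,
        -0.7, 1.1, 2.2, -3.9, 4.4, 0.3, -1.6, 0.05,
    ];
    let (scale, packed) = quantize_block(&block);
    let restored = dequantize_block(scale, &packed, block.len());
    println!("scale = {scale}, packed bytes = {}", packed.len());
    println!("restored = {restored:?}");
    println!("dot(block, block) = {}", block_scaled_dot(scale, &packed, scale, &packed, 16));
}
```

Applying the scales once per block, outside the inner accumulation, is what lets the low-precision products run in hardware while the scale handles dynamic range.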

Performance context

On NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations (about 91% of peak memory bandwidth) and 2 PFlop/s for GEMM (about 92% of dense f16 peak). The safe Rust persistent GEMM reaches 2.07 PFlop/s at M=N=K=8192, within 0.3% of the corresponding low-level Tile IR variant.
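
To put the GEMM figure in concrete terms, the sketch below converts the reported 2.07 PFlop/s at M=N=K=8192 into a time per multiply. It uses the usual convention of 2·M·N·K floating-point operations for a dense GEMM (one multiply and one add per term); the helper names are hypothetical and the calculation is not taken from the paper's benchmark harness.

```rust
/// Floating-point operations in a dense M x N x K GEMM,
/// counting each multiply-add as two operations.
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Seconds needed to execute `flops` at a sustained rate of `flops_per_sec`.
fn seconds_at_rate(flops: u64, flops_per_sec: f64) -> f64 {
    flops as f64 / flops_per_sec
}

fn main() {
    let flops = gemm_flops(8192, 8192, 8192);
    // 2.07 PFlop/s is the rate reported in these release notes.
    let secs = seconds_at_rate(flops, 2.07e15);
    println!("{flops} FLOPs, {:.3} ms per GEMM", secs * 1e3);
}
```

At this size the GEMM performs 2^40 (about 1.1 trillion) operations, so the reported rate corresponds to roughly 0.53 ms per multiply.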

We also evaluated Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200. These results show strong performance on memory-bound inference tasks and are consistent with our HBM roofline analysis.
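
As a rough guide to what a memory-bound roofline means for batch-1 decode, the sketch below uses a common simplification: each generated token reads every weight once, so throughput is bounded by memory bandwidth divided by the weight footprint. This is not the paper's roofline model, the function names are hypothetical, and the numbers in `main` are round placeholders rather than measurements or hardware specifications.

```rust
/// Upper bound on batch-1 decode throughput (tokens/s) under the simplifying
/// assumption that each token streams all weights from memory exactly once.
fn decode_tokens_per_sec_bound(weight_bytes: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bandwidth_bytes_per_sec / weight_bytes
}

/// Fraction of the roofline bound achieved by a measured throughput.
fn roofline_fraction(measured_tokens_per_sec: f64, bound_tokens_per_sec: f64) -> f64 {
    measured_tokens_per_sec / bound_tokens_per_sec
}

fn main() {
    // Placeholder inputs for illustration only: 8 GB of weights, 1 TB/s of bandwidth.
    let bound = decode_tokens_per_sec_bound(8.0e9, 1.0e12);
    // Placeholder "measured" value, not a result from this release.
    let fraction = roofline_fraction(100.0, bound);
    println!("bound = {bound} tokens/s, achieved fraction = {fraction}");
}
```

A fuller model would also account for KV-cache reads and activation traffic, which lower the bound as context length grows.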

Notes

cuTile Rust remains early-stage research software. CUDA 13.3 is required for the new low-precision features, and hardware-specific features such as native FP4 require architectures that support them.