cuTile Rust v0.2.0: low-precision inference support and paper artifacts #164
elibol announced in Announcements
cuTile Rust 0.2.0 is now available.
This release focuses on low-precision inference support. It adds CUDA 13.3-oriented support for NVFP4 packing and unpacking, block-scaled matrix multiply, runnable NVFP4 and MXFP8 examples, and the new cutile-kernels crate as a source of high-performance inference kernels written in cuTile Rust.
This release also accompanies our paper, Fearless Concurrency on the GPU: https://arxiv.org/abs/2606.15991. The paper artifacts are included in the repository under cutile-benchmarks/paper/.
Our results show that cuTile Rust adds safety without measurable runtime overhead. On NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, about 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. Safe Rust persistent GEMM reaches 2.07 PFlop/s at M=N=K=8192, within 0.3% of the corresponding low-level Tile IR variant.
We also evaluated Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis.
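For readers unfamiliar with roofline reasoning, here is a rough sketch of why batch-1 decode is memory-bound: each generated token has to stream roughly all model weights from memory once, so throughput is capped near memory bandwidth divided by weight size. The code below is plain Rust and is not taken from the paper or from the cuTile Rust API. The function names and the input numbers are hypothetical, chosen only to make the arithmetic easy to follow, and the estimate ignores KV-cache and activation traffic.

```rust
/// Upper bound on batch-1 decode throughput, assuming every generated
/// token streams all model weights from memory exactly once.
/// (Hypothetical helper for illustration; not part of cuTile Rust.)
fn roofline_tokens_per_s(bandwidth_bytes_per_s: f64, weight_bytes: f64) -> f64 {
    bandwidth_bytes_per_s / weight_bytes
}

/// Fraction of the roofline bound that a measured throughput achieves.
fn roofline_fraction(measured_tokens_per_s: f64, bound_tokens_per_s: f64) -> f64 {
    measured_tokens_per_s / bound_tokens_per_s
}

fn main() {
    // Illustrative inputs only, not measurements from the release:
    // a 4e9-parameter model at 2 bytes per parameter on a device
    // with 1e12 bytes/s of memory bandwidth.
    let params = 4.0e9;
    let bytes_per_param = 2.0;
    let bandwidth = 1.0e12;

    let bound = roofline_tokens_per_s(bandwidth, params * bytes_per_param);
    println!("roofline bound: {bound:.1} tokens/s");

    // A hypothetical measured throughput of 100 tokens/s.
    let fraction = roofline_fraction(100.0, bound);
    println!("fraction of bound: {:.0}%", fraction * 100.0);
}
```

Substituting a real device's bandwidth and a model's actual weight footprint gives the ceiling against which a measured tokens/s figure can be compared.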
cuTile Rust remains early-stage research software, but 0.2.0 is a meaningful step toward writing practical inference kernels in idiomatic Rust while preserving Rust's ownership discipline across the GPU launch boundary.
Release notes: https://github.com/NVlabs/cutile-rs/releases/tag/v0.2.0
Crates.io: https://crates.io/crates/cutile