Release v0.2.0 · NVlabs/cutile-rs

cuTile Rust 0.2.0 adds low-precision inference support and accompanies our paper, Fearless Concurrency on the GPU.

Paper: https://arxiv.org/abs/2606.15991
Artifacts: cutile-benchmarks/paper/
Announcement: #164

Highlights

Low-precision support targeting CUDA 13.3, including NVFP4 packing and unpacking and block-scaled matrix multiply.
Runnable NVFP4 and MXFP8 examples.
New cutile-kernels crate with reusable inference-oriented kernels written in cuTile Rust.
Paper reproducibility artifacts under cutile-benchmarks/paper/, including benchmark harnesses, CSV results, hardware details, and clock settings.
Expanded compile-only regression coverage and example/test cleanup for release validation.
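
To make the NVFP4 packing and block-scaling highlight concrete, here is a minimal host-side sketch in plain Rust. It is not the cuTile Rust API: every function name below is hypothetical. It assumes the standard FP4 E2M1 value set (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit), assumes the first element of each pair goes in the low nibble, and keeps the per-block scale as an `f32` for simplicity, whereas NVFP4 stores an FP8 scale per 16-element block.

```rust
/// Magnitudes selected by the 3 low bits of an FP4 E2M1 code.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Encode one f32 as a 4-bit E2M1 code (sign in bit 3).
/// Picks the nearest magnitude, saturates at 6.0, and breaks ties toward the
/// smaller magnitude (a simplification; hardware rounding modes may differ).
fn encode_e2m1(x: f32) -> u8 {
    let sign: u8 = if x.is_sign_negative() { 0x8 } else { 0x0 };
    let mag = x.abs().min(6.0);
    let mut best = 0usize;
    for (i, &m) in E2M1_MAGNITUDES.iter().enumerate() {
        if (mag - m).abs() < (mag - E2M1_MAGNITUDES[best]).abs() {
            best = i;
        }
    }
    sign | best as u8
}

/// Decode a 4-bit E2M1 code back to f32.
fn decode_e2m1(code: u8) -> f32 {
    let mag = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -mag } else { mag }
}

/// Pack two FP4 codes per byte (assumed layout: first element in the low nibble).
/// An odd trailing element is padded with a zero code.
fn pack_fp4(values: &[f32]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = encode_e2m1(pair[0]);
            let hi = if pair.len() == 2 { encode_e2m1(pair[1]) } else { 0 };
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack `len` values from packed FP4 bytes.
fn unpack_fp4(bytes: &[u8], len: usize) -> Vec<f32> {
    bytes
        .iter()
        .flat_map(|&b| [decode_e2m1(b & 0xF), decode_e2m1(b >> 4)])
        .take(len)
        .collect()
}

/// Block scaling: map the block's largest magnitude onto 6.0, then pack.
/// Returns the scale together with the packed codes.
fn quantize_block(block: &[f32]) -> (f32, Vec<u8>) {
    let amax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if amax > 0.0 { amax / 6.0 } else { 1.0 };
    let scaled: Vec<f32> = block.iter().map(|&v| v / scale).collect();
    (scale, pack_fp4(&scaled))
}

/// Inverse of `quantize_block`: unpack, then multiply by the block scale.
fn dequantize_block(scale: f32, bytes: &[u8], len: usize) -> Vec<f32> {
    unpack_fp4(bytes, len).into_iter().map(|v| v * scale).collect()
}
```

For example, `quantize_block(&[12.0, -6.0, 0.0, 3.0])` yields a scale of 2.0 and two packed bytes, and `dequantize_block` recovers the inputs exactly because each scaled value lands on a representable E2M1 magnitude. Inputs that fall between representable values are rounded and do not round-trip exactly.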

Performance context

On NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, about 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. Safe Rust persistent GEMM reaches 2.07 PFlop/s at M=N=K=8192, within 0.3% of the corresponding low-level Tile IR variant.
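
As a back-of-the-envelope check on the GEMM figure, the sketch below uses the conventional 2·M·N·K operation count for a dense matrix multiply to estimate the time per GEMM at the reported rate, and back-computes the peak implied by a quoted fraction. The helper names are hypothetical. Applying the 92% figure to the 2.07 PFlop/s result is an assumption, and none of these derived numbers come from the paper.

```rust
/// Floating-point operations in a dense M x N x K GEMM
/// (one multiply and one add per inner-product term).
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Time in milliseconds to execute `flops` operations at `rate` flop/s.
fn time_ms(flops: u64, rate_flops_per_s: f64) -> f64 {
    flops as f64 / rate_flops_per_s * 1e3
}

/// Peak implied by an achieved rate and the fraction of peak it represents.
fn implied_peak(achieved: f64, fraction_of_peak: f64) -> f64 {
    achieved / fraction_of_peak
}
```

With M=N=K=8192, `gemm_flops` gives 2^40 ≈ 1.1e12 operations, so `time_ms(gemm_flops(8192, 8192, 8192), 2.07e15)` comes to roughly 0.53 ms per GEMM. Under the assumption above, `implied_peak(2.07, 0.92)` is about 2.25 PFlop/s.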

We also evaluated Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200. These results show strong performance on memory-bound inference tasks and are consistent with our HBM roofline analysis.
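
The memory-bound argument can be sketched as a simple bandwidth roofline: in batch-1 decode, every generated token must stream roughly all model weights from device memory once, so tokens/s is bounded by bandwidth divided by weight bytes. The function below is a hypothetical helper, not part of Grout or cuTile Rust. It ignores KV-cache and activation traffic, and the weight precision Grout uses is not stated here, so any inputs are illustrative assumptions rather than the paper's analysis.

```rust
/// Upper bound on batch-1 decode throughput when weight streaming dominates.
/// Assumes every parameter is read exactly once per generated token.
fn roofline_tokens_per_s(
    mem_bandwidth_bytes_per_s: f64,
    num_params: f64,
    bytes_per_param: f64,
) -> f64 {
    mem_bandwidth_bytes_per_s / (num_params * bytes_per_param)
}

/// Fraction of the roofline bound that a measured throughput achieves.
fn roofline_fraction(measured_tokens_per_s: f64, bound_tokens_per_s: f64) -> f64 {
    measured_tokens_per_s / bound_tokens_per_s
}
```

As an illustration only: assuming about 4e9 parameters stored at 2 bytes each and 1.792e12 bytes/s of memory bandwidth, `roofline_tokens_per_s(1.792e12, 4.0e9, 2.0)` gives a bound of 224 tokens/s. Different weight precisions or a more detailed traffic model would shift this bound.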

Notes

cuTile Rust remains early-stage research software. CUDA 13.3 is required for the new low-precision features, and hardware-specific features such as native FP4 require architectures that support them.
