Patch release: correctness fixes in the device pipeline, plus a licensing cleanup in cuda-bindings that supersedes v0.2.0 for anyone consuming that crate.
Correctness
- Device functions and call sites now carry LLVM's convergent attribute, preventing illegal transforms around warp-synchronous operations (#145).
- The detected GPU architecture is treated as a hint, not a hard target, so explicitly requested --arch builds behave as asked (#145).
- NaN float literals are emitted as hex bit patterns instead of bare nan, which llc rejects (#116, ported from #63).
- Loads, stores, and allocas now carry Rust-side ABI alignment instead of LLVM defaults (#122, ported from #113).
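To illustrate the NaN fix, here is a minimal sketch of formatting a double as an LLVM-style hexadecimal constant from its IEEE 754 bit pattern. The function name `f64_to_llvm_hex` is hypothetical and is not taken from this project's code. The actual emitter may differ, for example in how it treats non-NaN values or 32-bit floats.

```rust
/// Hypothetical helper: render an f64 as a 16-digit hexadecimal constant
/// built from its raw IEEE 754 bits, the form LLVM IR accepts for doubles.
fn f64_to_llvm_hex(value: f64) -> String {
    // `to_bits` exposes the exact bit pattern, so NaN payloads survive
    // instead of being printed as the bare token `nan`.
    format!("0x{:016X}", value.to_bits())
}
```

Because the output is derived from the bits rather than from a decimal rendering, parsing the hex digits back with `f64::from_bits` yields a value that is still NaN.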
cuda-bindings licensing cleanup
crates/cuda-bindings ships under the NVIDIA Software License and accepts no external contributions. Two externally-authored patches that briefly landed there have been removed and their functionality re-implemented under NVIDIA authorship (#152): the CUDA_HOME toolkit fallback and the CUDA 12.8 cuEventElapsedTime_v2 compatibility shim.
- The FFI bindings themselves were never affected: they are generated at build time by bindgen from your local CUDA headers and are not checked in.
- The crate's SPDX headers now match its declared license, and CI now rejects non-NVIDIA-authored changes to the crate (#152).
Testing, CI, and docs
- 14 new mir-lower control-flow unit tests (#111, thanks @goog00).
- Error-demo examples are now classified in STATUS.md and enforced in CI (#86, thanks @ronakv).
- gemm_sol gained a live cublasLt speed-of-light baseline for perf work.
- thread::index_1d/index_2d uniqueness is documented as launch-conditional (#127).
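As a rough illustration of why index uniqueness can depend on the launch shape, the host-side sketch below assumes a 1D index follows the conventional formula block index x times block width plus thread x. That formula and both function names are assumptions made for this example. They are not taken from the crate, whose definition may differ.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a 1D global index (assumed formula):
/// block_idx.x * block_dim.x + thread_idx.x.
fn index_1d(block_x: u32, block_dim_x: u32, thread_x: u32) -> u32 {
    block_x * block_dim_x + thread_x
}

/// Enumerate every thread of a simulated launch with the given (x, y) grid
/// and block shapes, and report whether all 1D indices are distinct.
fn index_1d_is_unique(grid: (u32, u32), block: (u32, u32)) -> bool {
    let mut seen = HashSet::new();
    for _block_y in 0..grid.1 {
        for block_x in 0..grid.0 {
            for _thread_y in 0..block.1 {
                for thread_x in 0..block.0 {
                    // The y coordinates never enter the 1D index, so any
                    // launch with a y extent above 1 repeats indices.
                    if !seen.insert(index_1d(block_x, block.0, thread_x)) {
                        return false;
                    }
                }
            }
        }
    }
    true
}
```

Under this assumption, a launch with grid (4, 1) and block (32, 1) yields distinct indices, while block (32, 2) makes two threads share each index.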