Release v0.4.0 · NVIDIA/numba-cuda-mlir

This first update focuses on platform support, debugging, ecosystem enablement, performance, and broader CUDA Python compatibility.

Added support for Windows and integrated Windows tests into CI.
Added CUDA-gdb CI workflow and debugging support validation.
Added experimental third-party ecosystem coverage for nvmath-python, RAPIDS/cuDF, and numbast extension backends.
Improved warm compile-time performance by redesigning extension registry refresh behavior. On our benchmark suite, this delivers an additional ~40% speedup over the previous implementation and reaches a ~1.8x geomean warm compile-time speedup vs. numba-cuda.
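For readers unfamiliar with the "geomean speedup" figure quoted above, the following is a minimal sketch of how such a number is typically computed from per-benchmark timings. The function name and the timing values are hypothetical and are not taken from the project's benchmark suite.

```python
import math


def geomean_speedup(baseline_s, candidate_s):
    """Geometric mean of per-benchmark speedups (baseline time / candidate time)."""
    if not baseline_s or len(baseline_s) != len(candidate_s):
        raise ValueError("need two non-empty timing lists of equal length")
    ratios = [b / c for b, c in zip(baseline_s, candidate_s)]
    # Average in log space so that a 2x gain and a 2x loss cancel out exactly.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical warm compile times in seconds, for illustration only.
baseline = [0.20, 0.80, 0.45]
candidate = [0.10, 0.50, 0.25]
print(f"geomean speedup: {geomean_speedup(baseline, candidate):.2f}x")
```

The geometric mean is the usual choice for summarizing ratios because it does not let a single benchmark with a large speedup dominate the result.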

Introduced Windows CI coverage and related build fixes, including static CRT usage.
Added CUDA-gdb workflow coverage to validate debugging behavior.
Improved compatibility with newer libc++ versions.
Removed the implicit nvjitlink dependency derived from cudatoolkit.

Replaced implicit context refresh with explicit initialization and version-tracked registries, reducing warm compile overhead.
Optimized CUDA Array Interface launch caching.
Avoided finalizing internal device callees during compilation.
Made disabling LTOIR linker optimization user-controlled instead of unconditional.
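The idea behind version-tracked registries with explicit initialization, mentioned in the first item of this group, can be sketched roughly as follows. This is an illustration of the general technique only: `VersionedRegistry`, `CompilerContext`, and all method names are hypothetical and do not reflect the actual numba-cuda-mlir implementation.

```python
class VersionedRegistry:
    """Hypothetical extension registry that bumps a counter on every change."""

    def __init__(self):
        self._entries = {}
        self.version = 0

    def register(self, name, impl):
        self._entries[name] = impl
        self.version += 1  # any mutation invalidates consumers' snapshots

    def snapshot(self):
        return dict(self._entries)


class CompilerContext:
    """Hypothetical consumer that refreshes only when the registry has changed."""

    def __init__(self, registry):
        self._registry = registry
        self._seen_version = None
        self._installed = {}
        self.refresh_count = 0

    def initialize(self):
        # Explicit entry point: a cheap integer comparison on the warm path.
        if self._seen_version == self._registry.version:
            return
        self._installed = self._registry.snapshot()
        self._seen_version = self._registry.version
        self.refresh_count += 1

    def lookup(self, name):
        return self._installed[name]


registry = VersionedRegistry()
registry.register("add", lambda a, b: a + b)

ctx = CompilerContext(registry)
ctx.initialize()  # first call copies the registry contents
ctx.initialize()  # registry unchanged, so this is a no-op
print(ctx.refresh_count)
```

In this sketch the warm path costs one version comparison, whereas an implicit refresh on every compile would rebuild the installed table each time.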

Unified vector type handling by replacing VectorTypeStub with VectorType / VectorTypeClass.
Introduced a value/storage data model to fix float16 and bool memory representation issues.
Fixed lowering for defaults, tuples, dtype tokens, heterogeneous tuple assignment, optional values, and string constant folding.
Fixed array.real / array.imag on shared-memory arrays so that the address space is preserved.
Fixed VectorType-to-complex setitem behavior.
Fixed to_numba_type handling for NumPy dtypes.
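As background for the value/storage data model item in this group, the sketch below illustrates the general distinction between a value representation (what computations operate on) and a storage representation (the bytes placed in memory). It uses only the Python standard library; the helper names are hypothetical and the byte layouts shown are assumptions for illustration, not a description of the layouts numba-cuda-mlir uses.

```python
import struct


def bool_to_storage(value: bool) -> bytes:
    # Assumption: a boolean value occupies one full byte in memory.
    return struct.pack("<B", 1 if value else 0)


def bool_from_storage(raw: bytes) -> bool:
    # Any non-zero stored byte is read back as True.
    return struct.unpack("<B", raw)[0] != 0


def float16_to_storage(value: float) -> bytes:
    # The "e" format is IEEE 754 binary16: two bytes in memory.
    return struct.pack("<e", value)


def float16_from_storage(raw: bytes) -> float:
    return struct.unpack("<e", raw)[0]


stored_flag = bool_to_storage(True)
stored_half = float16_to_storage(1.5)
print(len(stored_flag), bool_from_storage(stored_flag))
print(len(stored_half), float16_from_storage(stored_half))
```

Keeping the two conversions explicit means loads and stores always go through one well-defined layout, which is the kind of mismatch a value/storage split is meant to rule out.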

Enabled extension linkage in MLIR lowerings.
Added Extension API documentation.
Added Numbast MLIR source CI tests.
Added experimental cuDF / RAPIDS third-party test coverage, including use of pylibcudf from the active conda environment.
Prevented unintended invocation of the Numba-CUDA JIT and addressed resulting issues.

Fixed an internal compiler error (ICE) for raise-only kernels.
Fixed shared-memory view behavior with None starts.
Fixed array slicing issues.
Fixed multiple lowering edge cases involving tuples, optionals, constants, and complex/vector interactions.
Fixed cuDF CI and Numbast CI issues.
