This first update focuses on platform support, debugging, ecosystem enablement, performance, and broader CUDA Python compatibility.
Highlights
- Added support for Windows and integrated Windows tests into CI.
- Added CUDA-gdb CI workflow and debugging support validation.
- Added experimental third-party ecosystem coverage for nvmath-python, RAPIDS/cuDF, and Numbast extension backends.
- Improved warm compile-time performance by redesigning extension registry refresh behavior. This delivers an additional ~40% speedup over the previous implementation on our benchmark suite and brings the geomean warm compile-time speedup vs. numba-cuda to ~1.8x.
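For readers unfamiliar with the "geomean speedup" metric quoted above, the following is a minimal sketch of how such a figure is typically computed from per-benchmark timings. The benchmark names and timings are made up purely to show the arithmetic; they are not results from this project's benchmark suite, and the project's own measurement harness may differ.

```python
from statistics import geometric_mean

# Hypothetical warm compile times in seconds (illustrative values only).
baseline_times = {"bench_a": 2.0, "bench_b": 1.0, "bench_c": 4.0}
candidate_times = {"bench_a": 1.0, "bench_b": 1.0, "bench_c": 1.0}

# Per-benchmark speedup is baseline time divided by candidate time.
speedups = [baseline_times[name] / candidate_times[name] for name in baseline_times]

# The geometric mean is the n-th root of the product of the speedups,
# so a single outlier benchmark cannot dominate the summary figure.
geomean_speedup = geometric_mean(speedups)
print(f"geomean speedup: {geomean_speedup:.2f}x")
```

The geometric mean is the conventional choice for averaging ratios because it treats a 2x speedup and a 2x slowdown symmetrically.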
Platform and Tooling Support
- Introduced Windows CI coverage and related build fixes, including static CRT usage.
- Added CUDA-gdb workflow coverage to validate debugging behavior.
- Improved compatibility with newer libc++ versions.
- Removed the implicit nvjitlink dependency derived from cudatoolkit.
Performance and Compilation
- Replaced implicit context refresh with explicit initialization and version-tracked registries, reducing warm compile overhead.
- Optimized CUDA Array Interface launch caching.
- Avoided finalizing internal device callees during compilation.
- Made disabling of LTOIR linker optimization user-controlled instead of unconditional.
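The idea behind version-tracked registries can be sketched generically as follows. This is an illustration of the general pattern only, not the project's implementation: `VersionedRegistry`, `Context`, and every method name here are hypothetical. The registry bumps a counter on each registration, and a consumer rebuilds its cached lookup table only when the counter has changed since it last looked.

```python
class VersionedRegistry:
    """Hypothetical registry that bumps a version counter on every change."""

    def __init__(self):
        self._entries = {}
        self.version = 0

    def register(self, name, impl):
        self._entries[name] = impl
        self.version += 1  # signals consumers that cached state is stale

    def snapshot(self):
        return dict(self._entries)


class Context:
    """Hypothetical consumer that refreshes its table only when stale."""

    def __init__(self, registry):
        self._registry = registry
        self._seen_version = -1  # forces a rebuild on first use
        self._table = {}
        self.rebuilds = 0

    def resolve(self, name):
        # A cheap integer comparison replaces an unconditional refresh.
        if self._seen_version != self._registry.version:
            self._table = self._registry.snapshot()
            self._seen_version = self._registry.version
            self.rebuilds += 1
        return self._table[name]


registry = VersionedRegistry()
registry.register("add", lambda a, b: a + b)

ctx = Context(registry)
for _ in range(1000):
    ctx.resolve("add")  # only the first call rebuilds the table
rebuilds_after_warm_loop = ctx.rebuilds

registry.register("mul", lambda a, b: a * b)
ctx.resolve("mul")  # version changed, so exactly one more rebuild
```

In this sketch, repeated lookups on an unchanged registry cost one comparison each, which is the kind of overhead reduction that matters on a warm path.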
CUDA Python Compatibility Improvements
- Added full array.view() support, including dtype bitwidth changes.
- Added support for vector types in local and shared memory.
- Added CUDA vector / scalar operations and vector-to-complex conversions.
- Added support for custom dtypes.
- Added complex constructor support, including complex32.
- Added support for complex CPointer getitem/setitem lowering.
- Added support for NamedTuple usage in kernels.
- Improved support for array slicing and shared-memory views.
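As a reminder of what a view with a dtype bitwidth change means, here is a host-side NumPy sketch. It assumes the in-kernel array.view() support follows NumPy's semantics, which these notes do not state explicitly, and it does not use this project's API.

```python
import numpy as np

# Four float32 values occupy 4 * 4 = 16 bytes.
a = np.arange(4, dtype=np.float32)

# Narrower dtype: the same 16 bytes are seen as 16 one-byte elements.
as_u8 = a.view(np.uint8)

# Wider dtype: the same 16 bytes are seen as 2 eight-byte elements.
as_f64 = a.view(np.float64)

# Same width: the element count is unchanged, only the interpretation differs.
as_i32 = a.view(np.int32)

# Views share memory with the original array. 0x3F800000 is the IEEE 754
# bit pattern of float32 1.0, so writing it through the int32 view
# changes a[0] from 0.0 to 1.0.
as_i32[0] = 0x3F800000

print(as_u8.shape, as_f64.shape, as_i32.shape, a[0])
```

The last axis length scales by the ratio of the old item size to the new one, and no data is copied.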
Lowering and Type System Fixes
- Unified vector type handling by replacing VectorTypeStub with VectorType/VectorTypeClass.
- Introduced a value/storage data model to fix float16 and bool memory representation issues.
- Fixed lowering for defaults, tuples, dtype tokens, heterogeneous tuple assignment, optional values, and string constant folding.
- Fixed array.real/array.imag on shared-memory arrays preserving address space.
- Fixed VectorType to complex setitem behavior.
- Fixed to_numba_type handling for NumPy dtypes.
Ecosystem and Extension Support
- Enabled extension linkage in MLIR lowerings.
- Added Extension API documentation.
- Added Numbast MLIR source CI tests.
- Added experimental cuDF / RAPIDS third-party test coverage, including use of pylibcudf from the active conda environment.
- Prevented unintended invocation of the Numba-CUDA JIT and addressed resulting issues.
Documentation and Maintenance
- Updated reference documentation.
- Added PR documentation preview infrastructure.
- Fixed PyPI-hosted README links.
- Removed outdated conda install documentation.
- Removed legacy @intrinsic implementations.
- Removed dead NRT C++ code.
- Removed cudasim support.
- Removed unnecessary packaging dependency from numba_cuda._compat.
Bug Fixes
- Fixed an internal compiler error (ICE) for raise-only kernels.
- Fixed shared-memory view behavior with None starts.
- Fixed array slicing issues.
- Fixed multiple lowering edge cases involving tuples, optionals, constants, and complex/vector interactions.
- Fixed cuDF CI and Numbast CI issues.