This release is aligned with the CUDA Tile IR specification included in CUDA Toolkit 13.3.
*Supported Architectures*
* Added support for the Hopper (`sm_90`) architecture
*New Operations*
* Added op `alloca` for automatic memory allocation
* Added op `mmaf_scaled` for floating-point matrix-multiply-accumulate with scaled inputs on `sm_100` and above
* Added op `pack` to pack a tile into a byte array
* Added op `unpack` to unpack a byte array into a tile
* Added op `make_strided_view` to create a strided view from a tensor view
* Added op `make_gather_scatter_view` to create a gather/scatter view from a tensor view
* Added op `atomic_red_view_tko` for view-based atomic reduction on global memory
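
As a conceptual illustration of the strided and gather/scatter access patterns named above, the following plain-Python sketch models both over a flat buffer. It does not use Tile IR syntax. The helper names `strided_tile`, `gather`, and `scatter_add` are hypothetical, and reading the traversal stride as the distance between successive tile origins is an assumption made for illustration only.

```python
# Conceptual model only: plain Python lists stand in for global memory.
# All function names here are hypothetical and are not part of Tile IR.

def strided_tile(buffer, tile_index, tile_size, traversal_stride):
    """Return the tile whose origin is tile_index * traversal_stride (assumed meaning)."""
    start = tile_index * traversal_stride
    return buffer[start:start + tile_size]

def gather(buffer, indices):
    """Load one element per entry of a 1D index list."""
    return [buffer[i] for i in indices]

def scatter_add(buffer, indices, values):
    """Accumulate values into buffer at the given indices (models a reduction)."""
    for i, v in zip(indices, values):
        buffer[i] += v

data = list(range(16))

# With a stride smaller than the tile size, neighbouring tiles overlap.
overlapping = strided_tile(data, tile_index=1, tile_size=4, traversal_stride=2)

# Indices may repeat and need not be contiguous.
picked = gather(data, [0, 5, 5, 12])

# Repeated indices accumulate into the same slot.
acc = [0.0] * 4
scatter_add(acc, [0, 2, 2, 3], [1.0, 2.0, 3.0, 4.0])
```

In this model, repeated indices in `scatter_add` are the case where a real implementation would need an atomic reduction, since several contributions target the same location.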
*New Types*
* Added type `strided_view` for strided tile views with configurable traversal strides
* Added type `gather_scatter_view` for gather/scatter access patterns over tensor views
* Added type `i4` (4-bit integer) for quantization support; `i4` tiles must be converted to a supported integer type before use in operations
* Added type `f4E2M1FN` (4-bit float)
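
To make the two 4-bit encodings above concrete, here is a small Python sketch that decodes individual 4-bit patterns. The E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1, no infinities or NaNs) follows the common FP4 definition. Treating `i4` as two's-complement signed is an assumption here, since signedness may depend on the consuming operation. The function names are hypothetical.

```python
def decode_f4e2m1fn(bits):
    """Decode a 4-bit E2M1 pattern: sign, 2 exponent bits, 1 mantissa bit, bias 1."""
    sign = -1.0 if bits & 0b1000 else 1.0
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 0b1
    if exponent == 0:
        # Subnormal range: the only magnitudes are 0 and 0.5.
        magnitude = mantissa * 0.5
    else:
        # Normal range: implicit leading 1, one fractional bit.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude

def sign_extend_i4(bits):
    """Interpret a 4-bit pattern as a two's-complement integer in [-8, 7] (assumed signed)."""
    return bits - 16 if bits & 0b1000 else bits

# The eight non-negative E2M1 values, in encoding order.
positive_values = [decode_f4e2m1fn(b) for b in range(8)]
```

The non-negative range works out to 0, 0.5, 1, 1.5, 2, 3, 4, and 6, which shows how coarse the format is and why it is usually paired with scaling factors.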
*Modified Operations*
* Modified op `entry` to add `num_worker_warps_per_cta` optimization hint
* Modified op `entry` to add `default` key support for optimization hints (fallback when no target-specific hint is given)
* Modified op `mmaf` to add `fast_acc` attribute for faster but less precise FP8 MMA accumulation on Hopper GPUs
* Modified op `global` to add `constant` attribute to mark globals as immutable/read-only
* Modified op `global` to add `symbol_visibility` attribute (public/private)
* Modified op `module` to add `producer` attribute for identifying the generating tool
* Modified op `exp` to add `rounding_mode` attribute (approx and full modes)
* Modified op `atomic_rmw_tko` to add `bf16` support to `ADDF` mode
* Modified ops `load_view_tko` and `store_view_tko` to accept 1D tensor indices in addition to scalar indices (for gather/scatter)
* Modified ops `exti`, `trunci`, `pack`, and `unpack` to add `i4` type support
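
The `default` fallback for optimization hints described above can be pictured as a simple lookup. The sketch below is a Python model, not Tile IR syntax: the `resolve_hints` helper, the dictionary layout, and the warp counts are all hypothetical, and it assumes a plain fallback with no merging between target-specific and default entries.

```python
def resolve_hints(hints, target):
    """Return the target-specific hints if present, else the `default` entry, else {}."""
    if target in hints:
        return hints[target]
    return hints.get("default", {})

# Hypothetical hint table; the values are illustrative only.
hints = {
    "sm_100": {"num_worker_warps_per_cta": 8},
    "default": {"num_worker_warps_per_cta": 4},
}

for_sm_100 = resolve_hints(hints, "sm_100")  # target-specific entry wins
for_sm_90 = resolve_hints(hints, "sm_90")    # no entry, so `default` applies
```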
*Documentation Improvements*
* Added 4-bit memory layout documentation to tensor view type
* Improved documentation of overflow and undefined behavior for `ftoi` and `itof`
*Fixed Issues*
* Fixed a bug where `atomic_rmw_tko` in `ADDF` mode with `f16` could behave incorrectly.
*Known Issues*
* Declaring an `f16` constant, converting to an FP8 type, and then printing it on `sm_120` can cause a compiler crash.