Minor fixes given the Tile IR spec by maleadt · Pull Request #203 · JuliaGPU/cuTile.jl

maleadt · 2026-04-27T15:54:14Z

I had Claude check our implementation against the spec, and it flagged a couple of correctness issues and some minor missing features.

cuTile had no real TFloat32 support in bytecode emission: `float_to_bits` returned raw Float32 bits, scalar TFloat32 didn't resolve to a tile type, and `constant_to_bytes` had no path for it. TF32 reduce/scan identities silently emitted 4 bytes where the spec wants 3, and any bare scalar TFloat32 reaching codegen errored. Pack TFloat32 into 19 bits per spec §5.1.2 (sign | 8-bit exp | 10-bit mantissa, RNE on the dropped bits), add the missing branches in `tile_type_for_julia!` and `constant_to_bytes`, and drop the underscore on `_tile_type_for_julia!` while we're touching it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
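The 19-bit packing described above can be sketched in plain Julia. This is a minimal illustration, not the code in this PR: `pack_tf32` and `tf32_bytes` are hypothetical names, the NaN handling is an assumption, and the little-endian 3-byte layout is assumed rather than taken from the spec.

```julia
# Hypothetical helper: pack a Float32 into a 19-bit TF32 pattern
# (1 sign bit | 8 exponent bits | 10 mantissa bits).
function pack_tf32(x::Float32)::UInt32
    bits = reinterpret(UInt32, x)
    if isnan(x)
        # Assumption: keep NaN a NaN by truncating and forcing a mantissa bit.
        return (bits >> 13) | UInt32(1)
    end
    # Round to nearest, ties to even, on the 13 dropped mantissa bits.
    lsb = (bits >> 13) & UInt32(1)
    rounded = bits + UInt32(0x0fff) + lsb
    return rounded >> 13
end

# Hypothetical serializer: 19 bits fit in 3 bytes (little-endian is an assumption).
function tf32_bytes(x::Float32)
    p = pack_tf32(x)
    return UInt8[p & 0xff, (p >> 8) & 0xff, (p >> 16) & 0xff]
end
```

For example, `pack_tf32(1.0f0)` gives `0x0001fc00`, and a value exactly halfway between two TF32 neighbours rounds to the one whose last mantissa bit is zero.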

Spec §8.3.14 / §8.3.16 deliver block args as (element, accumulator) pairs, but Julia's `Base.reduce` / `Base.accumulate` convention is `op(acc, elem)`. The previous binding silently swapped operands for non-commutative combiners; existing tests masked it because every combiner exercised was commutative. Swap each pair of body block args at the body-region boundary before mapping them onto the user combiner. Add non-commutative reduce and scan tests that pin the convention via `subf` operand order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
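To see why commutative combiners hid the swap, here is a small plain-Julia model. `reduce_with_body` is a hypothetical stand-in that hands its body `(element, accumulator)` pairs in a sequential fold; the real lowering and evaluation order in cuTile may differ.

```julia
# Hypothetical stand-in for a reduce body that receives block args as
# (element, accumulator) pairs, in the order the spec is described to use.
function reduce_with_body(body, elems, init)
    acc = init
    for elem in elems
        acc = body(elem, acc)   # element first, accumulator second
    end
    return acc
end

combiner = -   # non-commutative user combiner, written as op(acc, elem)

# Passing the body args straight through swaps the operands.
unswapped = reduce_with_body((elem, acc) -> combiner(elem, acc), [1, 2, 3], 10)

# Swapping the pair at the boundary restores Julia's op(acc, elem) convention.
swapped = reduce_with_body((elem, acc) -> combiner(acc, elem), [1, 2, 3], 10)
```

Here `swapped` is `4`, matching `foldl(-, [1, 2, 3]; init=10)`, while `unswapped` is `-8`. With `+` as the combiner both variants agree, which is how the old tests passed.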

Spec §8.10 forbids `weak` ordering on both `atomic_cas_tko` and `atomic_rmw_tko`. cuTile's `MemoryOrder` enum exposes `Weak` for non-atomic loads/stores; routing it to an atomic intrinsic emitted spec-invalid bytecode with no Julia-level diagnostic. Guard at the intrinsic-emit boundary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
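A guard of this kind might look roughly like the following. The enum and function names are hypothetical, and every member other than `Weak` is an assumption about what cuTile's `MemoryOrder` contains.

```julia
# Hypothetical enum; only `Weak` is named in the text above.
@enum MemoryOrderSketch Weak Relaxed Acquire Release AcqRel

# Hypothetical guard, called before emitting an atomic op.
function check_atomic_order(order::MemoryOrderSketch, op::Symbol)
    order == Weak &&
        throw(ArgumentError("$op does not accept weak memory ordering"))
    return order
end
```

Failing here gives the caller a Julia-level error instead of bytecode that the spec rejects.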

Spec §8.8.10 defines `cuda_tile.mulhii` for unsigned integers only. The intrinsic previously took an unused `Signedness` argument that was silently dropped at codegen, so `mul_hi(::Int32, ::Int32)` returned the unsigned high half. Drop the unused parameter (breaking change to `Intrinsics.mulhii`) and validate the element type at the intrinsic-emit boundary, pointing signed callers to `reinterpret`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
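The difference between the signed and unsigned high halves is easy to reproduce with ordinary integers. The helpers below are plain-Julia reference semantics with hypothetical names, not cuTile intrinsics.

```julia
# High 32 bits of a 32x32-bit product, computed by widening to 64 bits.
mulhi_unsigned(a::UInt32, b::UInt32) = ((UInt64(a) * UInt64(b)) >> 32) % UInt32
mulhi_signed(a::Int32, b::Int32) = ((Int64(a) * Int64(b)) >> 32) % Int32

a, b = Int32(-1), Int32(2)

# Signed interpretation: -1 * 2 = -2, whose high half is all ones.
signed_hi = mulhi_signed(a, b)

# Unsigned interpretation of the same bit patterns: 0xffffffff * 2.
unsigned_hi = mulhi_unsigned(reinterpret(UInt32, a), reinterpret(UInt32, b))
```

`signed_hi` is `-1` while `unsigned_hi` is `1`, so silently dropping the signedness changes the result for negative inputs. An explicit `reinterpret` at the call site at least makes the unsigned interpretation visible.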

…3.1. Tile IR < 13.2 had no token result on `cuda_tile.print_tko`. The previous fallback created a fresh root `MakeTokenOp` as the SSA print's result token — but a fresh root has no happens-after edge to prior ops, so subsequent stores/atomics chained through it lost the ordering established before the print. Forward the input token instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Spec §8.3.10 restricts `cuda_tile.iota` to a 1-d tile of integer element type. cuTile previously accepted multi-dimensional shapes and float types, surfacing as opaque downstream errors from cuda_tile_translate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
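As an illustration of the restriction, a check along these lines would reject the invalid cases up front. `check_iota` is a hypothetical name and works on a plain shape tuple rather than cuTile's tile types.

```julia
# Hypothetical validation: iota needs an integer element type and a 1-d shape.
function check_iota(::Type{T}, shape::Dims) where {T}
    T <: Integer ||
        throw(ArgumentError("iota: element type must be an integer, got $T"))
    length(shape) == 1 ||
        throw(ArgumentError("iota: expected a 1-d shape, got $(length(shape)) dimensions"))
    return nothing
end
```

`check_iota(Int32, (16,))` passes, while `check_iota(Float32, (16,))` and `check_iota(Int32, (4, 4))` both throw an `ArgumentError`.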

Spec §8.3.2 requires the lhs, rhs, and result of a concatenation to share rank, element type, and shape in every non-cat dimension. cuTile previously took all three from the first operand, silently emitting malformed bytecode (or aborting tfunc with a BoundsError) on mismatched calls. Validate at the intrinsic-emit boundary; widen tfunc to `Tile{T}` on rank mismatch so the diagnostic surfaces from `emit_intrinsic!`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
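The shape rule can be sketched on plain tuples. `check_cat_shapes` is a hypothetical helper using Julia's 1-based dimensions; it leaves out the element-type check and is not the validation added in this PR.

```julia
# Hypothetical check on shape tuples: operands must agree in rank and in
# every dimension except the concatenation dimension `dim`.
function check_cat_shapes(lhs::Dims, rhs::Dims, dim::Int)
    length(lhs) == length(rhs) ||
        throw(ArgumentError("cat: rank mismatch ($(length(lhs)) vs $(length(rhs)))"))
    1 <= dim <= length(lhs) ||
        throw(ArgumentError("cat: dimension $dim out of range"))
    for d in eachindex(lhs)
        d == dim && continue
        lhs[d] == rhs[d] ||
            throw(ArgumentError("cat: shape mismatch in dimension $d"))
    end
    # Result shape: sizes add along `dim` and are unchanged elsewhere.
    return ntuple(d -> d == dim ? lhs[d] + rhs[d] : lhs[d], length(lhs))
end
```

For instance, `check_cat_shapes((4, 8), (4, 16), 2)` returns `(4, 24)`. Checking rank first means a mismatch produces a readable error rather than an out-of-bounds index into the shorter shape.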

Spec §8.3.12 restricts the offsets operand of `cuda_tile.offset` to integer types. The high-level gather/scatter helpers always pass integers, but direct `Intrinsics.offset` callers could silently emit invalid bytecode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`atan2` was added in spec §8.7.5 (see the 13.2 release notes) but was missing from cuTile entirely, so `Base.atan(y, x)` over float tiles failed with "no Tile IR equivalent". Add the opcode, encoder (gated on bytecode version >= v13.2), intrinsic, and `Base.atan(y, x)` overlay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
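For reference, this is what two-argument `atan` computes on ordinary arrays in Base Julia. It is only the scalar semantics the overlay is expected to match, not tile code.

```julia
# One point in each quadrant, as (x, y) pairs split into two vectors.
y = [1.0, 1.0, -1.0, -1.0]
x = [1.0, -1.0, -1.0, 1.0]

two_arg = atan.(y, x)     # quadrant-aware angle in (-pi, pi]
one_arg = atan.(y ./ x)   # the ratio alone loses the quadrant
```

`two_arg` distinguishes all four quadrants, whereas `one_arg` maps opposite quadrants to the same angle.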

Spec §8.8.8 defines i8 × i8 → i32 MMA with per-input signedness. The opcode was reserved (`MmaIOp = 74`) but no encoder, intrinsic dispatch, or language-level routing existed. Add `encode_MmaIOp!` and extend `Intrinsics.mma` to dispatch on element types: matching float types route to `MmaFOp`; i8 × i8 → i32 routes to `MmaIOp` with signedness derived from the Julia type (`Int8` → signed, `UInt8` → unsigned). The existing `Base.muladd` wrapper handles the column-major swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
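The per-input signedness matters because the same byte widens to different `Int32` values. A reference model on plain arrays, with hypothetical names and no relation to the actual `encode_MmaIOp!` encoder, might read:

```julia
using LinearAlgebra

# Hypothetical mapping from Julia element type to signedness.
is_signed_input(::Type{Int8}) = true
is_signed_input(::Type{UInt8}) = false

# Reference i8 x i8 -> i32 multiply-accumulate on ordinary matrices.
function mma_i8_reference(a::Matrix{TA}, b::Matrix{TB}, acc::Matrix{Int32}) where
        {TA<:Union{Int8,UInt8}, TB<:Union{Int8,UInt8}}
    # Int32.(...) sign-extends Int8 and zero-extends UInt8.
    return Int32.(a) * Int32.(b) .+ acc
end

a = Int8[-1 2]                     # 1x2, signed
b = reshape(UInt8[200, 3], 2, 1)   # 2x1, unsigned
acc = fill(Int32(10), 1, 1)

mixed = mma_i8_reference(a, b, acc)
as_signed = mma_i8_reference(a, reinterpret(Int8, b), acc)
```

With `b` treated as unsigned the result is `-184`; reinterpreting the same bytes as `Int8` turns `200` into `-56` and the result into `72`.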

maleadt and others added 10 commits April 27, 2026 18:51

maleadt force-pushed the tb/spec branch from de40206 to 5695844 Compare April 27, 2026 16:59

maleadt merged commit 74a4df8 into main Apr 27, 2026
13 checks passed

maleadt deleted the tb/spec branch April 27, 2026 18:10
