Minor fixes given the Tile IR spec #203

Merged
maleadt merged 10 commits into main from tb/spec on Apr 27, 2026

maleadt commented Apr 27, 2026

I had Claude check our implementation against the spec, and it flagged a couple of correctness issues and some minor missing features.

maleadt and others added 10 commits April 27, 2026 18:51
cuTile had no real TFloat32 support in bytecode emission:
`float_to_bits` returned raw Float32 bits, scalar TFloat32 didn't
resolve to a tile type, and `constant_to_bytes` had no path for it.
TF32 reduce/scan identities silently emitted 4 bytes where the spec
wants 3, and any bare scalar TFloat32 reaching codegen errored.

Pack TFloat32 into 19 bits per spec §5.1.2 (sign | 8-bit exp |
10-bit mantissa, RNE on the dropped bits), add the missing branches
in `tile_type_for_julia!` and `constant_to_bytes`, and drop the
underscore on `_tile_type_for_julia!` while we're touching it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
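
The 19-bit TFloat32 packing described in the commit above can be illustrated with a small Python sketch. This is not the cuTile code: `pack_tf32` and `tf32_bytes` are hypothetical names, the NaN handling is one reasonable choice, and the little-endian 3-byte layout is an assumption rather than something the commit message states.

```python
import struct

def pack_tf32(x: float) -> int:
    """Pack a value into 19 bits: sign | 8-bit exponent | 10-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw Float32 bits
    sign = bits >> 31
    mag = bits & 0x7FFFFFFF                # exponent and 23-bit mantissa
    exp = mag >> 23
    if exp == 0xFF:                        # Inf or NaN: do not round
        mant10 = (mag & 0x7FFFFF) >> 13
        if (mag & 0x7FFFFF) != 0 and mant10 == 0:
            mant10 = 1                     # keep a NaN from collapsing to Inf
        return (sign << 18) | (exp << 10) | mant10
    kept = mag >> 13                       # 8-bit exponent + top 10 mantissa bits
    dropped = mag & 0x1FFF                 # the 13 mantissa bits being discarded
    half = 0x1000
    # Round to nearest, ties to even (RNE). A carry may ripple into the
    # exponent, which correctly rounds the largest finite values up to Inf.
    if dropped > half or (dropped == half and (kept & 1)):
        kept += 1
    return (sign << 18) | kept

def tf32_bytes(x: float) -> bytes:
    # Assumption: the 19-bit pattern is stored in 3 little-endian bytes.
    return pack_tf32(x).to_bytes(3, "little")
```

For example, `1.0 + 2**-11` sits exactly halfway between two TF32 values and rounds to the even neighbour, which is `1.0`.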
Spec §8.3.14 / §8.3.16 deliver block args as (element, accumulator)
pairs, but Julia's `Base.reduce` / `Base.accumulate` convention is
`op(acc, elem)`. The previous binding silently swapped operands for
non-commutative combiners; existing tests masked it because every
combiner exercised was commutative.

Swap each pair of body block args at the body-region boundary
before mapping them onto the user combiner. Add non-commutative
reduce and scan tests that pin the convention via `subf` operand
order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
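
Why the operand order matters for non-commutative combiners can be shown with a short Python model. It is only a sketch: `run_reduce` stands in for a reduce body that receives `(element, accumulator)`, and `bind_combiner` is a hypothetical name for the swap applied at the body boundary.

```python
from functools import reduce

def run_reduce(body, values, init):
    """Model of a reduce whose body receives (element, accumulator)."""
    acc = init
    for elem in values:
        acc = body(elem, acc)
    return acc

def bind_combiner(user_op):
    """Adapt a user combiner written as op(acc, elem) to the (elem, acc) order."""
    def body(elem, acc):
        return user_op(acc, elem)
    return body

def sub(acc, elem):
    # Non-commutative combiner: the result depends on the operand order.
    return acc - elem

values, init = [1, 2, 3], 10
expected = reduce(sub, values, init)                    # ((10 - 1) - 2) - 3
swapped = run_reduce(bind_combiner(sub), values, init)  # matches `expected`
unswapped = run_reduce(sub, values, init)               # operands reversed
```

With a commutative combiner such as addition, `swapped` and `unswapped` would agree, which is how the earlier tests could miss the problem.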
Spec §8.10 forbids `weak` ordering on both `atomic_cas_tko` and
`atomic_rmw_tko`. cuTile's `MemoryOrder` enum exposes `Weak` for
non-atomic loads/stores; routing it to an atomic intrinsic emitted
spec-invalid bytecode with no Julia-level diagnostic.

Guard at the intrinsic-emit boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
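
A guard of this kind might look like the following Python sketch. The enum members other than `WEAK` and the helper name `check_atomic_order` are assumptions made for illustration; only the two op names and the ban on `weak` come from the commit message.

```python
from enum import Enum

class MemoryOrder(Enum):
    WEAK = "weak"          # valid for non-atomic loads/stores only
    RELAXED = "relaxed"    # remaining members are illustrative assumptions
    ACQUIRE = "acquire"
    RELEASE = "release"
    ACQ_REL = "acq_rel"

ATOMIC_OPS = ("atomic_cas_tko", "atomic_rmw_tko")

def check_atomic_order(op_name: str, order: MemoryOrder) -> MemoryOrder:
    """Reject `weak` ordering before any bytecode for an atomic op is emitted."""
    if op_name in ATOMIC_OPS and order is MemoryOrder.WEAK:
        raise ValueError(f"{op_name} does not accept weak memory ordering")
    return order
```

Raising at the emit boundary turns what used to be silently invalid bytecode into a diagnostic at the call site.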
Spec §8.8.10 defines `cuda_tile.mulhii` for unsigned integers only.
The intrinsic previously took an unused `Signedness` argument that
was silently dropped at codegen, so `mul_hi(::Int32, ::Int32)`
returned the unsigned high half.

Drop the unused parameter (breaking change to `Intrinsics.mulhii`)
and validate the element type at the intrinsic-emit boundary,
pointing signed callers to `reinterpret`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
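
The difference between the signed and unsigned high halves is easy to see in a Python model of 32-bit arithmetic. The helper names are hypothetical and this is not the intrinsic itself, only an illustration of why dropping the signedness silently changed results.

```python
MASK32 = 0xFFFFFFFF

def reinterpret_u32(x: int) -> int:
    """View a 32-bit two's-complement value as unsigned."""
    return x & MASK32

def mulhi_u32(a: int, b: int) -> int:
    """High 32 bits of the unsigned 64-bit product."""
    return ((a & MASK32) * (b & MASK32)) >> 32

def mulhi_s32(a: int, b: int) -> int:
    """High 32 bits of the signed 64-bit product (a, b in Int32 range)."""
    return (a * b) >> 32   # Python's >> is an arithmetic (floor) shift

signed_hi = mulhi_s32(-1, 2)                     # -2 >> 32
unsigned_hi = mulhi_u32(reinterpret_u32(-1), 2)  # (0xFFFFFFFF * 2) >> 32
```

Here `signed_hi` is `-1` while `unsigned_hi` is `1`, so a signed caller who receives the unsigned result gets a wrong answer.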
…3.1.

Tile IR < 13.2 had no token result on `cuda_tile.print_tko`. The
previous fallback created a fresh root `MakeTokenOp` as the SSA
print's result token — but a fresh root has no happens-after edge
to prior ops, so subsequent stores/atomics chained through it lost
the ordering established before the print.

Forward the input token instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
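
The ordering problem can be modelled with a toy token graph in Python. `Token`, `ordered_after`, and the two `print_result_token_*` functions are hypothetical names for this sketch; they only mimic the happens-after reasoning in the commit message and are not cuTile's data structures.

```python
class Token:
    """A token whose predecessors are the tokens it happens after."""
    def __init__(self, name, preds=()):
        self.name = name
        self.preds = tuple(preds)

def ordered_after(token, earlier):
    """True if `earlier` is reachable from `token` through predecessor edges."""
    stack, seen = [token], set()
    while stack:
        t = stack.pop()
        if t is earlier:
            return True
        if id(t) in seen:
            continue
        seen.add(id(t))
        stack.extend(t.preds)
    return False

def print_result_token_old(input_token):
    # Previous fallback: a fresh root with no predecessors.
    return Token("fresh_root")

def print_result_token_new(input_token):
    # Fix: forward the input token, preserving the existing chain.
    return input_token

store_token = Token("store", preds=[Token("root")])
lost = ordered_after(print_result_token_old(store_token), store_token)
kept = ordered_after(print_result_token_new(store_token), store_token)
```

In this model `lost` is `False` and `kept` is `True`: only the forwarded token keeps later ops ordered after the store that preceded the print.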
Spec §8.3.10 restricts `cuda_tile.iota` to a 1-d tile of integer
element type. cuTile previously accepted multi-dimensional shapes
and float types, surfacing as opaque downstream errors from
cuda_tile_translate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §8.3.2 requires lhs/rhs/result to share rank, element type,
and shape in every non-cat dimension. cuTile previously trusted
the first operand for all three, silently emitting malformed
bytecode (or aborting tfunc with a BoundsError) on mismatched
calls.

Validate at the intrinsic-emit boundary; widen tfunc to `Tile{T}`
on rank mismatch so the diagnostic surfaces from `emit_intrinsic!`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
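
The three checks can be sketched in Python as below. `check_cat` is a hypothetical helper, element types are represented as plain strings, and the result extent along the concatenation dimension is assumed to be the sum of the operand extents (the usual meaning of concatenation, not something the commit message spells out).

```python
def check_cat(lhs_shape, rhs_shape, lhs_eltype, rhs_eltype, dim):
    """Validate both operands of a concatenation and return the result shape."""
    if len(lhs_shape) != len(rhs_shape):
        raise ValueError(f"rank mismatch: {len(lhs_shape)} vs {len(rhs_shape)}")
    if lhs_eltype != rhs_eltype:
        raise ValueError(f"element type mismatch: {lhs_eltype} vs {rhs_eltype}")
    if not 0 <= dim < len(lhs_shape):
        raise ValueError(f"cat dimension {dim} out of range")
    for i, (l, r) in enumerate(zip(lhs_shape, rhs_shape)):
        if i != dim and l != r:
            raise ValueError(f"shape mismatch in dimension {i}: {l} vs {r}")
    result = list(lhs_shape)
    result[dim] += rhs_shape[dim]   # assumed: extents add along `dim`
    return tuple(result)
```

Checking both operands, rather than trusting the first, is what lets a mismatch surface as a readable error instead of malformed bytecode.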
Spec §8.3.12 restricts the offsets operand of `cuda_tile.offset`
to integer types. The high-level gather/scatter helpers always
pass integers, but direct `Intrinsics.offset` callers could
silently emit invalid bytecode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
atan2 was added in spec §8.7.5 / 13.2 release notes but was missing
from cuTile entirely. `Base.atan(y, x)` over float tiles failed
with "no Tile IR equivalent".

Add the opcode, encoder (gated on bytecode version >= v13.2),
intrinsic, and `Base.atan(y, x)` overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §8.8.8 defines i8 × i8 → i32 MMA with per-input signedness.
The opcode was reserved (`MmaIOp = 74`) but no encoder, intrinsic
dispatch, or language-level routing existed.

Add `encode_MmaIOp!` and extend `Intrinsics.mma` to dispatch on
element types: matching float types route to `MmaFOp`; i8 × i8 →
i32 routes to `MmaIOp` with signedness derived from the Julia type
(`Int8` → signed, `UInt8` → unsigned). The existing `Base.muladd`
wrapper handles the column-major swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
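
The dispatch rule can be summarised in a Python sketch. `select_mma` is a hypothetical function, types are modelled as strings, and the set of float type names is an assumption; only the `MmaFOp` / `MmaIOp` routing and the `Int8` / `UInt8` signedness mapping come from the commit message.

```python
FLOAT_TYPES = {"Float16", "BFloat16", "Float32", "Float64"}  # assumed set
INT8_SIGNED = {"Int8": True, "UInt8": False}

def select_mma(a_type: str, b_type: str, acc_type: str):
    """Pick an MMA op from element types; return (op name, signedness or None)."""
    if a_type == b_type and a_type in FLOAT_TYPES:
        return ("MmaFOp", None)
    if a_type in INT8_SIGNED and b_type in INT8_SIGNED and acc_type == "Int32":
        # Signedness is tracked per input, so mixed Int8 / UInt8 is allowed.
        return ("MmaIOp", (INT8_SIGNED[a_type], INT8_SIGNED[b_type]))
    raise TypeError(f"unsupported mma operand types: {a_type}, {b_type}, {acc_type}")
```

For instance, `select_mma("Int8", "UInt8", "Int32")` yields `("MmaIOp", (True, False))` in this model.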
maleadt merged commit 74a4df8 into main on Apr 27, 2026
13 checks passed
maleadt deleted the tb/spec branch on April 27, 2026 18:10