cuTile had no real TFloat32 support in bytecode emission: `float_to_bits` returned raw Float32 bits, scalar TFloat32 didn't resolve to a tile type, and `constant_to_bytes` had no path for it. TF32 reduce/scan identities silently emitted 4 bytes where the spec wants 3, and any bare scalar TFloat32 reaching codegen errored. Pack TFloat32 into 19 bits per spec §5.1.2 (sign | 8-bit exp | 10-bit mantissa, RNE on the dropped bits), add the missing branches in `tile_type_for_julia!` and `constant_to_bytes`, and drop the underscore on `_tile_type_for_julia!` while we're touching it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
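The 19-bit packing described above can be illustrated with a minimal Python sketch. This is not cuTile's Julia code: the function names `tf32_bits` and `tf32_bytes` are hypothetical, and both the NaN handling and the little-endian byte order of the 3-byte constant are assumptions made for the example.

```python
import struct

def tf32_bits(x: float) -> int:
    """Pack a value into 19 bits: sign | 8-bit exponent | 10-bit mantissa."""
    # Reinterpret the Float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits >> 23) & 0xFF == 0xFF:
        # Inf/NaN: truncate the mantissa instead of rounding.
        out = bits >> 13
        if (bits & 0x7FFFFF) and not (out & 0x3FF):
            out |= 0x200  # assumption: keep a NaN a NaN by setting the quiet bit
        return out
    # Round to nearest, ties to even, on the 13 dropped mantissa bits.
    bias = 0xFFF + ((bits >> 13) & 1)
    return ((bits + bias) >> 13) & 0x7FFFF

def tf32_bytes(x: float) -> bytes:
    # 19 bits fit in 3 bytes; little-endian order is an assumption here.
    return tf32_bits(x).to_bytes(3, "little")
```

A value exactly halfway between two TF32 neighbours, such as `1 + 2**-11`, rounds to the neighbour with an even mantissa, and the result always occupies 3 bytes rather than the 4 of a raw Float32.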
Spec §8.3.14 / §8.3.16 deliver block args as (element, accumulator) pairs, but Julia's `Base.reduce` / `Base.accumulate` convention is `op(acc, elem)`. The previous binding silently swapped operands for non-commutative combiners; existing tests masked it because every combiner exercised was commutative. Swap each pair of body block args at the body-region boundary before mapping them onto the user combiner. Add non-commutative reduce and scan tests that pin the convention via `subf` operand order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
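Why commutative tests could not catch the swap is easiest to see with subtraction. The following Python sketch models a reduction body that receives `(element, accumulator)` and an adapter that reorders the pair before calling a combiner written in `op(acc, elem)` style. The names `bind_combiner` and `run_reduce` are hypothetical, and the model uses a single operand rather than the spec's general list of pairs.

```python
from functools import reduce
from operator import sub

def bind_combiner(op):
    """Adapt an op(acc, elem) combiner to a body called as body(elem, acc)."""
    def body(elem, acc):
        return op(acc, elem)  # swap the pair at the body boundary
    return body

def run_reduce(body, identity, xs):
    """Toy reduction that delivers arguments in (element, accumulator) order."""
    acc = identity
    for x in xs:
        acc = body(x, acc)
    return acc

xs = [1, 2, 3]
swapped = run_reduce(bind_combiner(sub), 0, xs)  # matches reduce(sub, xs, 0)
unswapped = run_reduce(sub, 0, xs)               # passes (elem, acc) straight through
```

With addition both variants agree, which is why only a non-commutative combiner pins the convention.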
Spec §8.10 forbids `weak` ordering on both `atomic_cas_tko` and `atomic_rmw_tko`. cuTile's `MemoryOrder` enum exposes `Weak` for non-atomic loads/stores; routing it to an atomic intrinsic emitted spec-invalid bytecode with no Julia-level diagnostic. Reject `Weak` for atomics at the intrinsic-emit boundary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
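A guard of this kind might look like the Python sketch below. It is illustrative only: apart from `Weak`, the enum members are hypothetical stand-ins, and `check_atomic_order` is an invented name rather than a cuTile function.

```python
from enum import Enum

class MemoryOrder(Enum):
    # Only WEAK is taken from the text above; the other members are placeholders.
    WEAK = "weak"
    RELAXED = "relaxed"
    ACQ_REL = "acq_rel"

def check_atomic_order(op_name: str, order: MemoryOrder) -> MemoryOrder:
    """Reject `weak` before any bytecode for an atomic op is emitted."""
    if order is MemoryOrder.WEAK:
        raise ValueError(
            f"{op_name}: `weak` memory ordering is not allowed on atomic operations"
        )
    return order
```

Checking at emit time keeps `Weak` usable for ordinary loads and stores while turning the invalid combination into an immediate, readable error.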
Spec §8.8.10 defines `cuda_tile.mulhii` for unsigned integers only. The intrinsic previously took an unused `Signedness` argument that was silently dropped at codegen, so `mul_hi(::Int32, ::Int32)` returned the unsigned high half. Drop the unused parameter (breaking change to `Intrinsics.mulhii`) and validate the element type at the intrinsic-emit boundary, pointing signed callers to `reinterpret`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
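The difference between the unsigned and signed high halves is easy to reproduce with 32-bit operands. In this Python sketch the helper names `mulhi_u32` and `mulhi_s32` are hypothetical; they model the arithmetic only, not the cuTile intrinsic.

```python
MASK32 = 0xFFFFFFFF

def mulhi_u32(a: int, b: int) -> int:
    """High 32 bits of the product, treating both operands as unsigned."""
    return ((a & MASK32) * (b & MASK32)) >> 32

def mulhi_s32(a: int, b: int) -> int:
    """High 32 bits of the signed 64-bit product (a and b in Int32 range)."""
    # Python's >> on negative ints is an arithmetic (flooring) shift.
    return (a * b) >> 32

a, b = -1, 2
unsigned_hi = mulhi_u32(a, b)  # -1 is read as 0xFFFFFFFF
signed_hi = mulhi_s32(a, b)
```

For `a = -1`, `b = 2` the unsigned high half is `1` while the signed high half is `-1`, which is the discrepancy signed callers previously received without warning.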
…3.1. Tile IR < 13.2 had no token result on `cuda_tile.print_tko`. The previous fallback created a fresh root `MakeTokenOp` as the SSA print's result token — but a fresh root has no happens-after edge to prior ops, so subsequent stores/atomics chained through it lost the ordering established before the print. Forward the input token instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
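The ordering loss can be modelled as reachability in a small token graph. The Python sketch below is a toy model with hypothetical names (`Token`, `happens_after`); it only illustrates why a fresh root drops the edge that forwarding the input token preserves.

```python
class Token:
    """A node in a toy ordering graph; deps are the tokens it was built from."""
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = tuple(deps)

def happens_after(tok, earlier):
    """True if `earlier` is reachable from `tok` through dependency edges."""
    stack, seen = [tok], set()
    while stack:
        t = stack.pop()
        if t is earlier:
            return True
        if id(t) in seen:
            continue
        seen.add(id(t))
        stack.extend(t.deps)
    return False

store_tok = Token("store_before_print")  # ordering established before the print
fresh_root = Token("fresh_root")         # old fallback: a root with no deps
forwarded = store_tok                    # new fallback: reuse the input token

later_old = Token("later_store", deps=[fresh_root])
later_new = Token("later_store", deps=[forwarded])
```

In this model `later_old` has no path back to `store_tok`, whereas `later_new` does.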
Spec §8.3.10 restricts `cuda_tile.iota` to a 1-d tile of integer element type. cuTile previously accepted multi-dimensional shapes and float types, surfacing as opaque downstream errors from cuda_tile_translate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §8.3.2 requires lhs/rhs/result to share rank, element type,
and shape in every non-cat dimension. cuTile previously trusted
the first operand for all three, silently emitting malformed
bytecode (or aborting tfunc with a BoundsError) on mismatched
calls.
Validate at the intrinsic-emit boundary; widen tfunc to `Tile{T}`
on rank mismatch so the diagnostic surfaces from `emit_intrinsic!`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
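The three checks on the concatenation operands can be sketched as follows. This Python function is illustrative, not cuTile's validator: `check_cat` is a hypothetical name, `dim` is 0-based here, and the result extent along the cat dimension is assumed to be the sum of the operand extents.

```python
def check_cat(lhs_shape, rhs_shape, lhs_elty, rhs_elty, dim):
    """Validate cat operands and return the result shape."""
    if len(lhs_shape) != len(rhs_shape):
        raise ValueError(f"cat: rank mismatch ({len(lhs_shape)} vs {len(rhs_shape)})")
    if lhs_elty != rhs_elty:
        raise TypeError(f"cat: element type mismatch ({lhs_elty} vs {rhs_elty})")
    if not 0 <= dim < len(lhs_shape):
        raise ValueError(f"cat: dimension {dim} out of range")
    for i, (l, r) in enumerate(zip(lhs_shape, rhs_shape)):
        # Only the concatenated dimension may differ between operands.
        if i != dim and l != r:
            raise ValueError(f"cat: shapes differ in non-cat dimension {i} ({l} vs {r})")
    result = list(lhs_shape)
    result[dim] = lhs_shape[dim] + rhs_shape[dim]
    return tuple(result)
```

Checking the rank first means a mismatched call produces a diagnostic instead of an out-of-bounds access while comparing shapes.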
Spec §8.3.12 restricts the offsets operand of `cuda_tile.offset` to integer types. The high-level gather/scatter helpers always pass integers, but direct `Intrinsics.offset` callers could silently emit invalid bytecode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
atan2 was added in spec §8.7.5 / 13.2 release notes but was missing from cuTile entirely. `Base.atan(y, x)` over float tiles failed with "no Tile IR equivalent". Add the opcode, encoder (gated on bytecode version >= v13.2), intrinsic, and `Base.atan(y, x)` overlay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §8.8.8 defines i8 × i8 → i32 MMA with per-input signedness. The opcode was reserved (`MmaIOp = 74`) but no encoder, intrinsic dispatch, or language-level routing existed. Add `encode_MmaIOp!` and extend `Intrinsics.mma` to dispatch on element types: matching float types route to `MmaFOp`; i8 × i8 → i32 routes to `MmaIOp` with signedness derived from the Julia type (`Int8` → signed, `UInt8` → unsigned). The existing `Base.muladd` wrapper handles the column-major swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
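The dispatch rule can be summarised in a short Python sketch. Type names are plain strings, `select_mma` is a hypothetical helper, the set of float types is a placeholder, and the sketch assumes that only the two input types need to match on the float path and that the integer accumulator is `Int32`.

```python
# Placeholder list; the float types cuTile actually accepts are not stated above.
FLOAT_TYPES = {"Float16", "BFloat16", "Float32", "Float64"}
INT8_SIGNEDNESS = {"Int8": "signed", "UInt8": "unsigned"}

def select_mma(a_ty: str, b_ty: str, acc_ty: str):
    """Pick an op name (and per-input signedness for the integer path)."""
    if a_ty in INT8_SIGNEDNESS and b_ty in INT8_SIGNEDNESS:
        if acc_ty != "Int32":
            raise TypeError(f"mma: i8 x i8 requires an Int32 accumulator, got {acc_ty}")
        # Signedness is chosen independently for each input.
        return ("MmaIOp", INT8_SIGNEDNESS[a_ty], INT8_SIGNEDNESS[b_ty])
    if a_ty in FLOAT_TYPES and a_ty == b_ty:
        return ("MmaFOp",)
    raise TypeError(f"mma: unsupported element types {a_ty} x {b_ty} -> {acc_ty}")
```

Mixed inputs such as `Int8` with `UInt8` stay on the integer path and carry one signedness flag per operand.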
I had Claude check our implementation against the spec, and it flagged a couple of correctness issues and some minor missing features.