Suppress divby assume on tile-of-pointers from offset. #219

Merged: maleadt merged 1 commit into `main` from `tb/moe_assumes` on May 4, 2026

@maleadt (Member) commented May 4, 2026

For gather/scatter, `Intrinsics.offset(base_ptr, tile_offset)` produces
a tile-of-pointers whose per-element divisor is `gcd(ptr_align,
off_div * sizeof(elem))`. When that gcd collapses to the natural pointee
alignment (e.g. 2 for f16), stamping it onto the IR as `assume div_by<2>`
is vacuously true and actively harmful: tileiras's vectorizer trusts the
explicit assume as the source of truth and skips its own offset-SSA walk,
emitting `STG.E.U16` (2-byte stores) where the structurally-equivalent
Python kernel gets `STG.E.128` (16-byte stores). The result is 8x narrower
stores and a ~9x wall-time regression on the MoE scatter (121 GB/s vs
Python's 1111 GB/s on the isolated MWE).
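
To make the gcd collapse concrete, here is a minimal Python sketch of the divisor arithmetic described above. The helper name `offset_divisor` and the example values (a 16-byte-aligned base pointer, offset divisors of 1 and 8) are assumptions for illustration, not code from this PR.

```python
from math import gcd


def offset_divisor(ptr_align: int, off_div: int, elem_size: int) -> int:
    """Hypothetical helper: per-element byte divisor of base_ptr + tile_offset.

    ptr_align : known byte alignment of the base pointer
    off_div   : known divisor of the element offsets
    elem_size : sizeof(elem) in bytes
    """
    return gcd(ptr_align, off_div * elem_size)


# f16 elements are 2 bytes wide. With arbitrary gather/scatter indices
# (off_div = 1), a 16-byte-aligned base still only guarantees 2-byte alignment,
# which is just the natural pointee alignment.
collapsed = offset_divisor(ptr_align=16, off_div=1, elem_size=2)

# Only when the offsets are known multiples of 8 elements does the full
# 16-byte alignment survive.
preserved = offset_divisor(ptr_align=16, off_div=8, elem_size=2)

print(collapsed, preserved)
```

In the first case the resulting `div_by<2>` carries no information beyond the element type itself, which is the situation the description calls harmful.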

Fix the divby transfer for `Intrinsics.offset` to mirror cuTile Python's
`PointerOffset` rule: only propagate alignment when the result is
0-D (`Array.slice` style); for tile-shaped offsets, return 1 and let
tileiras handle vectorization. MoE bench: 1.27 ms -> 0.95 ms (now 25%
faster than Python). Other benchmarks unchanged.
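
The rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual cuTile.jl or cuTile Python implementation: the function name `offset_divby` and its parameters are hypothetical, and a result shape of `()` stands in for a 0-D result.

```python
from math import gcd
from typing import Tuple


def offset_divby(ptr_align: int, off_div: int, elem_size: int,
                 result_shape: Tuple[int, ...]) -> int:
    """Hypothetical divby transfer for a pointer-offset operation.

    Returns the divisor to stamp on the result; 1 means "no useful assume".
    """
    if len(result_shape) == 0:
        # 0-D result (Array.slice style): propagate the combined alignment.
        return gcd(ptr_align, off_div * elem_size)
    # Tile-shaped result (gather/scatter): report nothing and leave
    # vectorization to the downstream compiler's own offset analysis.
    return 1


print(offset_divby(16, 8, 2, ()))      # scalar pointer offset
print(offset_divby(16, 8, 2, (128,)))  # tile-of-pointers
```

Returning 1 for tile-shaped results means no explicit assume is emitted for them, so the vectorizer is free to derive wider accesses on its own.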

maleadt merged commit e935769 into main on May 4, 2026 (1 check passed).
maleadt deleted the tb/moe_assumes branch on May 4, 2026, 19:01.