Update benchmarks #208

Merged
maleadt merged 3 commits into main from tb/benchmarks on Apr 29, 2026
maleadt commented Apr 29, 2026

No description provided.

maleadt added 3 commits April 29, 2026 11:35
Matches the Python sample's fast paths: flush_to_zero on every op and
rounding<approx> on the reciprocal. SiLU is bandwidth-bound, but the
approximate divide on the SFU shaves a few percent off the MoE pipeline.

Bench (RTX 5080, 4096³ MoE config): 18.8 → 19.2 TFLOPS.

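
For readers who want to see where the reciprocal sits, here is a minimal CPU-side sketch in plain Julia. It assumes the usual definition silu(x) = x / (1 + exp(-x)), which the commit message does not spell out. `silu_ref` is a hypothetical name, not taken from the PR, and the `flush_to_zero` and `rounding<approx>` hints are kernel-level options that this reference code does not reproduce.

```julia
# Reference SiLU in Float32, written as x * (1 / (1 + exp(-x))) so the
# reciprocal is explicit. `silu_ref` is a hypothetical helper name.
# In the GPU kernel, this reciprocal is the op the commit marks as
# approximate; here it is an ordinary correctly rounded division.
silu_ref(x::Float32) = x * inv(one(Float32) + exp(-x))

xs = Float32[-4, -1, 0, 1, 4]
ys = silu_ref.(xs)   # broadcast over the vector, result stays Float32
```

Since the reciprocal is the only division in the expression, it is the natural place for an approximate fast path; the rest is one exponential, one add, and one multiply.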
Softmax's chunked kernel was looping over `Int32(0):num_chunks - Int32(1)`
where `num_chunks` was Int64 (because the kernel takes `n_elems::Int`).
That widens the inner ForOp to `tile<i64>` and forces a `trunci` per
iteration in the offset arithmetic. Casting `num_chunks` to Int32 keeps
the loop in i32 throughout, matching cuTile Python's IR shape.

Bench (RTX 5080, 4096² Float32 chunked): 1587 → 1601 GB/s.
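
The widening described above can be reproduced in plain Julia on the host, independent of the kernel. This is a sketch under stated assumptions: `chunk` and the concrete sizes are hypothetical values chosen for illustration, and where exactly the PR places the cast is not shown in the commit message.

```julia
# Stand-in for the kernel argument `n_elems::Int`; Int64 is used explicitly
# so the result does not depend on the host word size.
n_elems = Int64(4096)^2
chunk = Int64(1024)                  # hypothetical chunk size
num_chunks = cld(n_elems, chunk)     # Int64, inherited from n_elems

# `-` binds tighter than `:`, so the upper bound is computed first.
# Int64 - Int32 promotes to Int64, and the range endpoints are then
# promoted to a common type, giving a UnitRange{Int64}.
wide = Int32(0):num_chunks - Int32(1)

# Casting num_chunks first keeps both endpoints, and the range, in Int32.
narrow = Int32(0):Int32(num_chunks) - Int32(1)
```

Both ranges cover the same values; only the element type differs (`eltype(wide)` is `Int64`, `eltype(narrow)` is `Int32`). `Int32(num_chunks)` throws an `InexactError` if the value does not fit in 32 bits, so the cast is only safe when the chunk count is known to be small enough.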

maleadt marked this pull request as ready for review April 29, 2026 10:41
maleadt merged commit e478c44 into main Apr 29, 2026
13 checks passed
maleadt deleted the tb/benchmarks branch April 29, 2026 11:05