Update benchmarks by maleadt · Pull Request #208 · JuliaGPU/cuTile.jl

maleadt · 2026-04-29T10:41:10Z

No description provided.

Matches the Python sample's fast paths: flush_to_zero on every op and rounding<approx> on the reciprocal. SiLU is bandwidth-bound, but the approximate divide on the SFU still shaves a few percent off the MoE pipeline. Bench (RTX 5080, 4096³ MoE config): 18.8 → 19.2 TFLOPS (about 2%).
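For readers unfamiliar with the rewrite, here is a minimal plain-Julia sketch of the divide-versus-reciprocal form of SiLU. It assumes the kernel computes SiLU as `x * sigmoid(x)`; the names `silu_div` and `silu_rcp` are hypothetical and are not taken from the cuTile.jl sources. On the CPU `inv` is an exact reciprocal, so this only shows the algebraic shape that an approximate, flush-to-zero GPU reciprocal would slot into, not the hardware behaviour itself.

```julia
# Reference SiLU written with a divide: x / (1 + exp(-x)).
silu_div(x::Float32) = x / (1f0 + exp(-x))

# Same function with the divide replaced by multiply-by-reciprocal.
# Assumption: on the GPU the `inv` here would be the approximate
# reciprocal; on the CPU it is an ordinary exact 1/x.
silu_rcp(x::Float32) = x * inv(1f0 + exp(-x))

xs = Float32[-4, -1, 0, 1, 4]

# Largest absolute difference between the two forms on the sample inputs.
max_abs_diff = maximum(abs.(silu_div.(xs) .- silu_rcp.(xs)))
println("max |silu_div - silu_rcp| = ", max_abs_diff)
```

The two forms agree to within Float32 rounding on the CPU; any additional error on the GPU would come from the approximate reciprocal and from flushing denormals to zero.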

Softmax's chunked kernel was looping over `Int32(0):num_chunks - Int32(1)` where `num_chunks` was Int64 (because the kernel takes `n_elems::Int`). That widens the inner ForOp to `tile<i64>` and forces a `trunci` per iteration in the offset arithmetic. Casting `num_chunks` to Int32 keeps the loop in i32 throughout, matching cuTile Python's IR shape. Bench (RTX 5080, 4096² Float32 chunked): 1587 → 1601 GB/s (about 0.9%).
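The widening described above follows from Julia's ordinary integer promotion and can be reproduced on the CPU without any GPU code. The sketch below uses a hypothetical `num_chunks` value and plain ranges; it illustrates only the element type of the loop range, not the generated IR.

```julia
# Hypothetical chunk count, held as Int64 as it would be when derived
# from an `n_elems::Int` argument on a 64-bit platform.
num_chunks = Int64(16384)

# Mixed Int32/Int64 endpoints: `num_chunks - Int32(1)` is Int64, and the
# range constructor promotes both endpoints, so the range is Int64.
wide = Int32(0):num_chunks - Int32(1)

# Casting first keeps every operand, and therefore the range, in Int32.
narrow = Int32(0):Int32(num_chunks) - Int32(1)

println(typeof(wide))    # UnitRange{Int64}
println(typeof(narrow))  # UnitRange{Int32}
```

Both ranges cover the same iterations; only the induction variable's type differs, which is what determines whether the loop body needs a truncation back to 32 bits.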

maleadt added 3 commits April 29, 2026 11:35

Update README.

866304d

maleadt marked this pull request as ready for review April 29, 2026 10:41

maleadt merged commit e478c44 into main Apr 29, 2026
13 checks passed

maleadt deleted the tb/benchmarks branch April 29, 2026 11:05
