FFT: 100x speedup by cutting the BS by AntonOresten · Pull Request #232 · JuliaGPU/cuTile.jl

AntonOresten · 2026-05-25T21:56:26Z

The tiles currently have a batch dimension with size equivalent to the array itself. This means that only the first block is doing useful work, and the kernel slows down massively with problem size as things start to spill.

In this PR I interpret the "BS" kernel argument as a mini-batch size and adjust the grid size accordingly. I also clean up some integer arguments that will constant-fold anyway.
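
As a rough illustration of that grid adjustment (a sketch only, not the code in examples/fft.jl; the helper names `fft_grid` and `block_range` are hypothetical, and 1-based block indexing is assumed), treating BS as a mini-batch means launching `cld(batch, BS)` blocks, each responsible for its own slice of the batch:

```julia
# Hypothetical helpers, not part of cuTile.jl: they only model how a batch of
# independent FFTs could be split across blocks for a mini-batch size BS.

# Number of blocks needed when each block handles up to BS transforms.
fft_grid(batch::Integer, BS::Integer) = cld(batch, BS)

# Batch indices (1-based) covered by block `bid` (1-based); the range is
# empty if the block has nothing left to process.
function block_range(bid::Integer, batch::Integer, BS::Integer)
    lo = (bid - 1) * BS + 1
    hi = min(bid * BS, batch)
    return lo:hi
end

batch = 64

# Tile batch dimension as large as the whole batch: block 1 covers
# everything, so any further block would have no work.
@assert block_range(1, batch, batch) == 1:64
@assert isempty(block_range(2, batch, batch))

# BS as a mini-batch: the work is spread over cld(64, 2) == 32 blocks.
@assert fft_grid(batch, 2) == 32
@assert block_range(32, batch, 2) == 63:64
```

Using `cld` rather than plain division keeps the last, possibly partial, mini-batch covered when BS does not divide the batch size evenly.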

On the default problem size for benchmarks, it now runs ~5x faster than cuFFT (without a plan?) instead of ~25x slower. Timed on a DGX Spark. It isn't consistently faster than cuFFT though, especially for larger factors.

For the benchmarks, the problem size should definitely be updated as we're in the ~10 microsecond range with this fix, and the relative performance over cutile-python (currently reported as ~equal) may or may not hold up.

Ideally we'd drop BS entirely, since I'm seeing that BS=1 is the fastest. But it might still offer slight performance benefits at small problem sizes, and it makes the kernel more expressive.

maleadt · 2026-05-26T12:22:22Z

Great improvements, thanks!

Before:

========================================================================
  fft
========================================================================
Implementation      Min (ms)    Mean (ms)   Throughput
------------------------------------------------------------------------
cuFFT               0.065       0.067       65 μs
cuTile              0.543       0.55        543 μs
------------------------------------------------------------------------

After:

========================================================================
  fft
========================================================================
Implementation      Min (ms)    Mean (ms)   Throughput
------------------------------------------------------------------------
cuTile              0.011       0.012       11 μs
cuFFT               0.062       0.066       62 μs
------------------------------------------------------------------------

With plan outside of the cuFFT benchmark:

========================================================================
  fft
========================================================================
Implementation      Min (ms)    Mean (ms)   Throughput
------------------------------------------------------------------------
cuFFT               0.004       0.005       4 μs
cuTile              0.011       0.012       11 μs
------------------------------------------------------------------------
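
For reference, the ratios implied by the minimum timings in the three tables above can be worked out directly. This is plain arithmetic on the reported numbers; the variable names are only for illustration.

```julia
# Minimum timings in microseconds, copied from the tables above.
cutile_before   = 543
cutile_after    = 11
cufft_unplanned = 62   # cuFFT before the plan was moved out of the benchmark
cufft_planned   = 4    # cuFFT with the plan outside of the benchmark

# How many times slower `slow` is than `fast`.
ratio(slow, fast) = slow / fast

println("cuTile, before vs. after:   ", round(ratio(cutile_before, cutile_after); digits=2), "x")
println("unplanned cuFFT vs. cuTile: ", round(ratio(cufft_unplanned, cutile_after); digits=2), "x")
println("cuTile vs. planned cuFFT:   ", round(ratio(cutile_after, cufft_planned); digits=2), "x")
```

On these numbers the fix makes cuTile roughly 49x faster than before and about 5.6x faster than cuFFT without a hoisted plan, while cuFFT with the plan outside the benchmark is about 2.75x faster than cuTile.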

CUDA.jl's HandleCache was supposed to cache plans already, but it's GC-driven and so not entirely reliable. So I pushed a commit moving the planning out of the benchmark loop.

[ci skip]

AntonOresten mentioned this pull request May 25, 2026

FFT: fix grid and mini-batch NVIDIA/cutile-python#87

Open

3 tasks

AntonOresten and others added 3 commits May 26, 2026 14:08

FFT: 100x speedup by cutting the BS

4bf784f

Replace hardcoded 1 with BS

3b6ccd9

cuFFT: Lift plan outside of benchmark loop.

482676a

maleadt force-pushed the fix-fft-bs branch from 5e15ec6 to 482676a Compare May 26, 2026 12:22

Bump benchmark and update README.

17b5468

[ci skip]

AntonOresten commented May 26, 2026

Comment thread examples/fft.jl Outdated

Bump factors for a more realistic invocation.

2bf33f6

maleadt merged commit bf6489b into JuliaGPU:main May 26, 2026
1 check passed

maleadt merged 5 commits into JuliaGPU:main from AntonOresten:fix-fft-bs
