FFT: 100x speedup by cutting the BS #232

Merged: maleadt merged 5 commits into JuliaGPU:main from AntonOresten:fix-fft-bs on May 26, 2026

Conversation

AntonOresten (Contributor) commented May 25, 2026

The tiles currently have a batch dimension with size equivalent to the array itself. This means that only the first block is doing useful work, and the kernel slows down massively with problem size as things start to spill.

In this PR I interpret the "BS" kernel argument as a minibatch, and adjust the grid size accordingly, as well as cleaning up some integer arguments that will constant-fold anyway.
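
A rough sketch of the grid-size arithmetic this implies, in plain Julia with no GPU code. The helper names `grid_size` and `batch_range` and the sizes used are made up for illustration and are not taken from `examples/fft.jl`; the real kernel's indexing may differ.

```julia
# Hypothetical helper: number of blocks needed when each block handles
# `bs` of the `batch` independent FFTs (rounded up).
grid_size(batch::Integer, bs::Integer) = cld(batch, bs)

# Hypothetical helper: batch indices covered by 1-based block `block`.
# The last block may get a shorter range; an out-of-range block gets an empty one.
function batch_range(block::Integer, batch::Integer, bs::Integer)
    first_idx = (block - 1) * bs + 1
    last_idx = min(block * bs, batch)
    return first_idx:last_idx
end

batch = 64  # illustrative batch size, not the benchmark default

# Tile batch dimension equal to the whole batch: block 1 covers everything,
# so any additional block has nothing left to do.
println(batch_range(1, batch, batch))           # 1:64
println(isempty(batch_range(2, batch, batch)))  # true

# BS treated as a minibatch of 4: the work is spread over 16 blocks.
println(grid_size(batch, 4))        # 16
println(batch_range(16, batch, 4))  # 61:64
```

With `bs = 1` this degenerates to one block per FFT, which is the configuration reported as fastest further down.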

On the default benchmark problem size, it now runs ~5x faster than cuFFT (without a plan?) instead of ~25x slower. Timed on a DGX Spark. It isn't consistently faster than cuFFT though, especially for larger factors.

For the benchmarks, the problem size should definitely be updated as we're in the ~10 microsecond range with this fix, and the relative performance over cutile-python (currently reported as ~equal) may or may not hold up.

Ideally we'd drop the BS entirely, as I'm seeing BS=1 is the fastest. But it might still offer slight performance benefits at small problem sizes, and makes the kernel more expressive.

maleadt (Member) commented May 26, 2026

Great improvements, thanks!

Before:

========================================================================
  fft
========================================================================
Implementation      Min (ms)    Mean (ms)   Throughput
------------------------------------------------------------------------
cuFFT               0.065       0.067       65 μs
cuTile              0.543       0.55        543 μs
------------------------------------------------------------------------

After:

========================================================================
  fft
========================================================================
Implementation      Min (ms)    Mean (ms)   Throughput
------------------------------------------------------------------------
cuTile              0.011       0.012       11 μs
cuFFT               0.062       0.066       62 μs
------------------------------------------------------------------------

With plan outside of the cuFFT benchmark:

========================================================================
  fft
========================================================================
Implementation      Min (ms)    Mean (ms)   Throughput
------------------------------------------------------------------------
cuFFT               0.004       0.005       4 μs
cuTile              0.011       0.012       11 μs
------------------------------------------------------------------------
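
For reference, the ratios implied by the three tables above can be checked in a few lines of Julia. The numbers are the minimum times in ms copied from the tables; the variable names are only illustrative.

```julia
# Minimum times in milliseconds, copied from the benchmark tables above.
cutile_before   = 0.543  # cuTile, "Before"
cufft_before    = 0.065  # cuFFT, "Before"
cutile_after    = 0.011  # cuTile, "After"
cufft_unplanned = 0.062  # cuFFT, "After" (plan created inside the benchmark)
cufft_planned   = 0.004  # cuFFT with the plan created outside the benchmark

speedup            = cutile_before / cutile_after    # ≈ 49.4x faster than before
before_vs_cufft    = cutile_before / cufft_before    # ≈ 8.4x slower than cuFFT before
vs_cufft_unplanned = cufft_unplanned / cutile_after  # ≈ 5.6x faster than unplanned cuFFT
vs_cufft_planned   = cutile_after / cufft_planned    # ≈ 2.75x slower than planned cuFFT

println((speedup, before_vs_cufft, vs_cufft_unplanned, vs_cufft_planned))
```

These ratios apply only to the machine that produced the tables; the ~25x and ~5x figures in the PR description were timed on a DGX Spark.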

CUDA.jl's HandleCache was supposed to cache plans already, but it's GC-driven, so it's not entirely reliable. So I pushed a commit moving the planning out of the benchmark.
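
A minimal sketch of what moving the planning out can look like. It is not the code from the commit. It assumes a CUDA.jl release that exposes the cuFFT wrappers as `CUDA.CUFFT` and a working GPU; the array shape is made up and is not the benchmark's configuration.

```julia
using CUDA
using CUDA.CUFFT  # assumption: cuFFT wrappers live here in the installed CUDA.jl

# Hypothetical problem: 64 independent length-512 FFTs along the first dimension.
x = CUDA.rand(ComplexF32, 512, 64)

# Build the plan once, outside anything that gets timed.
plan = plan_fft(x, 1)

# Warm up both paths so compilation and first-use setup are not measured.
CUDA.@sync fft(x, 1)
CUDA.@sync plan * x

# Unplanned: each call has to obtain a plan, which may or may not come from a cache.
t_unplanned = CUDA.@elapsed fft(x, 1)

# Planned: only the transform itself is inside the timed region.
t_planned = CUDA.@elapsed plan * x

println("unplanned: ", t_unplanned, " s, planned: ", t_planned, " s")
```

Both paths compute the same transform; the only difference is whether plan creation or lookup is inside the timed region.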

maleadt merged commit bf6489b into JuliaGPU:main on May 26, 2026
1 check passed