CUDA.jl cuFFT underperforming against CuPy cuFFT #1682

@dreycenfoiles

Describe the bug

FFTs run through cuFFT from CUDA.jl are underperforming compared with CuPy: I consistently see a ~2x performance gap. Below is an example run.

Julia (BenchmarkTools):

BenchmarkTools.Trial: 7144 samples with 1 evaluation.
 Range (min … max):  475.800 μs … 34.983 ms  ┊ GC (min … max): 0.00% … 8.24%
 Time  (median):     620.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   688.121 μs ±  1.282 ms  ┊ GC (mean ± σ):  0.61% ± 0.33%

CuPy (cupyx.profiler.benchmark):

fft_func            :    CPU:  166.498 us   +/-64.221 (min:  112.600 / max: 1298.200) us     GPU-0:  336.961 us   +/-62.120 (min:  238.592 / max: 1489.920) us
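
As a quick check on the ~2x figure, the ratios can be computed directly from the numbers in the two outputs above. This is only a minimal sketch: the dictionary names and the `slowdown` helper are illustrative, and it assumes that the BenchmarkTools wall time (which includes the synchronization) is best compared against CuPy's GPU-0 column rather than its CPU column.

```python
# Timings copied from the two benchmark outputs above, in microseconds.
# Assumption: Julia wall time (with CUDA sync) is compared to CuPy's GPU-0 column.
julia_us = {"min": 475.800, "mean": 688.121}
cupy_gpu_us = {"min": 238.592, "mean": 336.961}


def slowdown(julia, cupy):
    """Return the Julia/CuPy time ratio for each statistic present in both."""
    return {name: julia[name] / cupy[name] for name in julia if name in cupy}


ratios = slowdown(julia_us, cupy_gpu_us)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# prints:
# min: 1.99x
# mean: 2.04x
```

Both the minimum and the mean land close to 2x, so the gap does not appear to come from a few slow outlier samples.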

To reproduce

The Minimal Working Example (MWE) for this bug:

Julia code I ran.

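using CUDA
using CUDA.CUFFT
using BenchmarkTools

# 500x500 single-precision input on the GPU
A = CUDA.rand(Float32, 500, 500)

function fft_func(A)
    return fft(A)
end

# Interpolate A with $ so the benchmark does not include global-variable access;
# CUDA.@sync blocks until the GPU work has finished.
@benchmark CUDA.@sync fft_func($A)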

Python code I ran.

import cupyx.scipy.fft as cufft
import cupy as cp
from cupyx.profiler import benchmark

# 500x500 single-precision input on the GPU
A = cp.random.random((500, 500)).astype(cp.float32)

def fft_func(A):
    # FFT over all axes (2D here), matching fft(A) in the Julia code
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=10000))

Expected behavior

Given that both CuPy and CUDA.jl should call out to the same cuFFT routines, I would expect their runtimes to be almost identical.

Version info

Details on Julia

Julia Version 1.8.1
Commit afb6c60d69 (2022-09-06 15:09 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 6 × Intel(R) Core(TM) i5-9400 CPU @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 1 on 6 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =

Details on CUDA:

CUDA toolkit 11.7, artifact installation
NVIDIA driver 516.94.0, for CUDA 11.7
CUDA driver 11.7

Libraries:

  • CUBLAS: 11.10.1
  • CURAND: 10.2.10
  • CUFFT: 10.7.1
  • CUSOLVER: 11.3.5
  • CUSPARSE: 11.7.3
  • CUPTI: 17.0.0
  • NVML: 11.0.0+516.94
  • CUDNN: 8.30.2 (for CUDA 11.5.0)
  • CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:

  • Julia: 1.8.1
  • LLVM: 13.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
  • Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
0: NVIDIA GeForce GTX 1050 Ti (sm_61, 3.366 GiB / 4.000 GiB available)
