Description
Describe the bug
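cuFFT transforms run through CUDA.jl underperform compared with CuPy: I consistently see a ~2x performance gap. Below is an example run, with the BenchmarkTools output from Julia first and the cupyx.profiler output from CuPy after it.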
BenchmarkTools.Trial: 7144 samples with 1 evaluation.
Range (min … max): 475.800 μs … 34.983 ms ┊ GC (min … max): 0.00% … 8.24%
Time (median): 620.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 688.121 μs ± 1.282 ms ┊ GC (mean ± σ): 0.61% ± 0.33%
fft_func : CPU: 166.498 us +/-64.221 (min: 112.600 / max: 1298.200) us GPU-0: 336.961 us +/-62.120 (min: 238.592 / max: 1489.920) u
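To make the "~2x" figure concrete, here is a small sketch that divides the Julia timings by the CuPy GPU timings reported above. It rests on an assumption: that the BenchmarkTools wall-clock time (measured with the synchronizing macro) is comparable to the GPU-0 time from cupyx.profiler, which reports CPU and GPU time separately. The names `julia_us`, `cupy_gpu_us`, and `slowdown` are made up for this illustration.

```python
# Timings copied from the two benchmark outputs above, in microseconds.
# Assumption: BenchmarkTools wall time (with synchronization) is compared
# against the GPU-0 column of cupyx.profiler; the tools measure differently.
julia_us = {"min": 475.800, "median": 620.500, "mean": 688.121}
cupy_gpu_us = {"min": 238.592, "mean": 336.961}


def slowdown(julia_time_us, cupy_time_us):
    """Return how many times slower the Julia timing is than the CuPy timing."""
    return julia_time_us / cupy_time_us


# Only compare statistics that both tools report (cupyx prints no median).
ratios = {
    stat: slowdown(julia_us[stat], cupy_gpu_us[stat])
    for stat in ("min", "mean")
}

for stat, ratio in ratios.items():
    print(f"{stat}: Julia / CuPy = {ratio:.2f}x")
```

Under that assumption, both the minimum and the mean come out close to 2x, which matches the gap described in this report.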
To reproduce
The Minimal Working Example (MWE) for this bug:
Julia code I ran.
using CUDA
using CUDA.CUFFT
using BenchmarkTools
A = CUDA.rand(Float32, 500, 500)
function fft_func(A)
    return fft(A)
end
@benchmark @CUDA.sync fft_func(A)
Python code I ran.
import cupyx.scipy.fft as cufft
import cupy as cp
from cupyx.profiler import benchmark
A = cp.random.random((500,500)).astype(cp.float32)
def fft_func(A):
    return cufft.fftn(A)
print(benchmark(fft_func, (A,), n_repeat=10000))
Expected behavior
Given that both CuPy and CUDA.jl should call out to the same cuFFT routines, I would expect their runtimes to be almost identical.
Version info
Details on Julia
Julia Version 1.8.1
Commit afb6c60d69 (2022-09-06 15:09 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 6 × Intel(R) Core(TM) i5-9400 CPU @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 1 on 6 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =
Details on CUDA:
CUDA toolkit 11.7, artifact installation
NVIDIA driver 516.94.0, for CUDA 11.7
CUDA driver 11.7
Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.1
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+516.94
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.8.1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA GeForce GTX 1050 Ti (sm_61, 3.366 GiB / 4.000 GiB available)