Closed
Labels: performance (How fast can we go?), regression (Something that used to work, doesn't anymore.)
Description
Describe the bug
When running my script, I found that the first run of a main function containing a for loop is unreasonably slow.
Moving the code out of main into global scope makes the first loop iteration much faster, and the following iterations as fast as warmed-up code.
To reproduce
The Minimal Working Example (MWE) for this bug:
using CUDA
using Statistics
function some_func(x, y)
    a = CUDA.rand(1000)
    b = CUDA.rand(1000)
    c = a .* b
    return c .* x .* y
end

function main()
    result = []
    count = 0
    for i in 1:3
        t = time()
        for j in 0:20
            for k in 1:20
                count += 1
                for m in 1:20
                    a = CUDA.rand(1000)
                    b = CUDA.rand(1000)
                    c = some_func(a, b)
                    push!(result, mean(c))
                end
            end
        end
        println(time() - t)
    end
end

main()

The printed timing is:
First run of main():
42.087000131607056
33.35199999809265
34.365999937057495
Second run of main():
1.1009998321533203
0.8480000495910645
1.4190001487731934
Without main (same loop run in global scope):
10.562000036239624
0.935999870300293
1.316999912261963
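One caveat about these numbers (a side note, not part of the original measurements): CUDA.jl launches kernels asynchronously, so a wall-clock timing taken with time() can under- or over-count queued GPU work depending on where the loop happens to synchronize. In this MWE the mean(c) reduction forces a device-to-host copy each iteration, so the timings are likely dominated by real work, but a cleaner per-step measurement would synchronize explicitly. A minimal sketch using CUDA.@sync:

```julia
using CUDA

# Hedged sketch: CUDA.@sync blocks until all GPU work queued so far has
# finished, so the elapsed time covers the kernels themselves rather
# than just their (asynchronous) launch overhead.
function timed_step()
    t = time()
    CUDA.@sync begin
        a = CUDA.rand(1000)
        b = CUDA.rand(1000)
        c = a .* b
    end
    return time() - t
end
```

With or without the explicit sync, the 42 s vs 10 s gap between the function and global-scope versions remains.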
Expected behavior
The outer loop in main should take around 10 s for the first iteration and around 1 s for the following ones, not 42 s for the first and ~33 s for each subsequent iteration. 42 s and 33 s are far too long to attribute to compilation alone.
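For what it's worth, a workaround that sidesteps the symptom (without fixing the underlying regression) is to pay the compilation cost up front by running one small warm-up call before the timed loop. A hypothetical sketch, reusing some_func from the MWE above:

```julia
# Hypothetical warm-up sketch: call the hot path once on tiny inputs so
# that the first timed iteration of main() excludes JIT compilation.
function warmup()
    a = CUDA.rand(10)
    b = CUDA.rand(10)
    c = some_func(a, b)  # compiles the broadcast kernels
    mean(c)              # also warms the reduction / copy-to-host path
    return nothing
end

warmup()
main()
```

This does not explain why the first iteration inside main is 42 s rather than the ~10 s seen at global scope, which is the regression being reported.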
Version info
Details on Julia:
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_GPU_ALLOWSCALAR = false
Details on CUDA:
CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.3.0
NVIDIA driver 466.77.0
Libraries:
- CUBLAS: 11.5.1
- CURAND: 10.2.4
- CUFFT: 10.4.2
- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+466.77
- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)
Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: NVIDIA GeForce RTX 2080 Ti (sm_75, 8.582 GiB / 11.000 GiB available)