
## PDE

Kuramoto-Sivashinsky algorithm benchmark (original benchmark).

This benchmark is dominated by the cost of the FFT, leading to worse results for OpenCL with CLFFT compared to the faster CUFFT. Similarly, the multithreaded backend doesn't improve much over Julia base, since both use the same FFT implementation. Result of the benchmarked PDE:

[images: PDE result and benchmark plot]

| device | time (N = 10¹) | speedup | time (N = 10⁷) | speedup |
| --- | --- | --- | --- | --- |
| cuarrays | 0.0013 s | 0.0069x | 1.7 s | 16.4x |
| clarrays gpu | 0.0128 s | 0.0007x | 3.8 s | 7.3x |
| clarrays cpu | 0.0244 s | 0.0004x | 16.8 s | 1.6x |
| gpuarrays threaded | 0.0002 s | 0.0531x | 22.5 s | 1.2x |
| julia base | 8.719e-6 s | 1.0x | 27.7 s | 1.0x |

code
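For context, here is a minimal CPU sketch of the kind of FFT-based time stepping such a benchmark performs. It is not the benchmarked implementation; the semi-implicit splitting, domain size, and initial condition are illustrative:

```julia
using FFTW

# Semi-implicit spectral step for the Kuramoto-Sivashinsky equation
#   u_t = -u*u_x - u_xx - u_xxxx
# on a periodic domain: the stiff linear terms are treated implicitly,
# the nonlinear term explicitly. Illustrative sketch only.
function ks_step(u, k, dt)
    û  = fft(u)
    Nû = -0.5im .* k .* fft(u .^ 2)         # -u*u_x = -(1/2) * d/dx (u^2)
    L  = k .^ 2 .- k .^ 4                   # Fourier symbol of -∂xx - ∂xxxx
    û  = (û .+ dt .* Nû) ./ (1 .- dt .* L)  # implicit Euler on the linear part
    real(ifft(û))
end

function run_ks(u, k, dt, nsteps)
    for _ in 1:nsteps
        u = ks_step(u, k, dt)
    end
    u
end

n, Lx = 256, 32π
x = range(0, Lx; length = n + 1)[1:end-1]
k = 2π .* fftfreq(n, n / Lx)                # spatial wavenumbers
u = run_ks(cos.(x ./ 16) .* (1 .+ sin.(x ./ 16)), k, 0.1, 1000)
```

Each step costs several FFTs and only cheap element-wise work in between, which is why the FFT library (CLFFT vs. CUFFT) dominates the timings above.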


## Juliaset

The Julia set benchmark. The unrolled variant uses generated functions to emit an unrolled version of the inner loop. This currently doesn't yield a speedup, although it was quite a bit faster in initial tests; why it slowed down needs further research. Potentially, N == 16 for the inner iteration is too big. Image of the benchmarked Julia set:

[images: Julia set result and benchmark plot]

| device | time (N = 2¹²) | speedup | time (N = 2²⁴) | speedup |
| --- | --- | --- | --- | --- |
| clarrays gpu | 0.0356 ms | 2.3x | 2.9 ms | 96.5x |
| cuarrays | 0.0099 ms | 8.3x | 4.2 ms | 66.7x |
| clarrays cpu | 0.0495 ms | 1.7x | 31.9 ms | 8.7x |
| gpuarrays threaded | 0.0337 ms | 2.4x | 109.4 ms | 2.5x |
| julia base | 0.0821 ms | 1.0x | 278.8 ms | 1.0x |

code
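The benchmarked kernel has roughly this shape. This is a hedged sketch: the constant `c`, the coordinate bounds, and the default of 16 iterations are assumptions, and the real benchmark broadcasts over GPU arrays:

```julia
# Scalar Julia set kernel; broadcasting it over an array of points is
# the benchmark's workload. Constant and escape radius are illustrative.
function juliaset(z0::Complex{T}, maxiter::Int = 16) where {T}
    c = Complex{T}(-0.5, 0.75)      # assumed constant, not from the benchmark
    z = z0
    for i in 1:maxiter
        abs2(z) > T(4) && return UInt8(i - 1)  # escaped: return iteration count
        z = z * z + c
    end
    return UInt8(maxiter)
end

w = h = 2^6
q = [ComplexF32(r, i) for i in range(-1.5f0, 1.5f0; length = h),
                          r in range(-1.5f0, 1.5f0; length = w)]
img = juliaset.(q)   # the same broadcast dispatches to GPU kernels for GPU arrays
```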


## Juliaset Unrolled

[image: Juliaset Unrolled benchmark plot]

| device | time (N = 2¹²) | speedup | time (N = 2²⁴) | speedup |
| --- | --- | --- | --- | --- |
| clarrays gpu | 0.0355 ms | 1.6x | 2.9 ms | 68.4x |
| cuarrays | 0.0095 ms | 5.8x | 3.8 ms | 52.0x |
| clarrays cpu | 0.0485 ms | 1.1x | 18.3 ms | 10.8x |
| gpuarrays threaded | 0.0285 ms | 1.9x | 108.4 ms | 1.8x |
| julia base | 0.0552 ms | 1.0x | 197.2 ms | 1.0x |

code
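A minimal sketch of the generated-function approach described above: the N inner iterations are spliced into straight-line code at compile time. Names and constants are illustrative, not the benchmark's source:

```julia
# Emit the inner loop unrolled N times; early exit is kept via `return`.
@generated function juliaset_unrolled(z0::Complex{T}, ::Val{N}) where {T, N}
    steps = Expr(:block)
    for i in 1:N
        push!(steps.args, quote
            abs2(z) > T(4) && return UInt8($(i - 1))
            z = z * z + c
        end)
    end
    quote
        c = Complex{T}(-0.5, 0.75)   # same illustrative constant as above
        z = z0
        $steps
        return UInt8($N)
    end
end

img_unrolled = juliaset_unrolled.(q, Ref(Val(16)))   # `q` from the previous sketch
```

With N == 16 this splices 16 copies of the loop body into every kernel, which is consistent with the suspicion above that the unrolled inner iteration may simply be too large.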


## Blackscholes

Black-Scholes is a nice benchmark for broadcasting performance: it's a medium-heavy calculation per array element, and each element's calculation is completely independent of the others. The CuArrays package is a bit slower here compared to GPUArrays, which should be straightforward to fix. I suspect it's due to more promotions between integer types in the indexing code.

[image: Blackscholes benchmark plot]

| device | time (N = 10¹) | speedup | time (N = 10⁷) | speedup |
| --- | --- | --- | --- | --- |
| arrayfire cu | 0.1 ms | 0.0075x | 2.7 ms | 301.2x |
| cuarrays | 0.0087 ms | 0.1x | 2.8 ms | 288.5x |
| clarrays gpu | 0.0419 ms | 0.0222x | 2.9 ms | 280.5x |
| arrayfire cl | 0.1 ms | 0.0076x | 3.2 ms | 250.7x |
| clarrays cpu | 0.048 ms | 0.0194x | 14.8 ms | 53.9x |
| gpuarrays threaded | 0.0015 ms | 0.6x | 173.8 ms | 4.6x |
| julia base | 0.0009 ms | 1.0x | 800.0 ms | 1.0x |

code
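A hedged sketch of the kind of per-element kernel being broadcast. The formula is the standard Black-Scholes call price; the function names and the exact normal-CDF helper are assumptions, not the benchmark's source:

```julia
using SpecialFunctions: erf

cndf(x) = (1 + erf(x / sqrt(2f0))) / 2        # cumulative normal distribution

# Per-element call-price kernel: medium-heavy math with no cross-element
# dependencies, which makes it ideal for a single fused broadcast.
function blackscholes(S, K, r, σ, T)
    d1 = (log(S / K) + (r + σ^2 / 2) * T) / (σ * sqrt(T))
    d2 = d1 - σ * sqrt(T)
    S * cndf(d1) - K * exp(-r * T) * cndf(d2)
end

S = rand(Float32, 10^7) .* 100f0 .+ 1f0       # spot prices
K = rand(Float32, 10^7) .* 100f0 .+ 1f0       # strike prices
price = blackscholes.(S, K, 0.02f0, 0.3f0, 1f0)  # one fused element-wise kernel
```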


## Poincare

Poincaré section of a chaotic neuronal network. OpenCL's dominance in this benchmark might be due to better use of vector intrinsics in Transpiler.jl, but this needs more investigation. Result of the calculation:

[images: Poincaré section result and benchmark plot]

| device | time (N = 10³) | speedup | time (N = 10⁹) | speedup |
| --- | --- | --- | --- | --- |
| clarrays gpu | 4.928e-5 s | 0.0827x | 0.1 s | 303.1x |
| clarrays cpu | 5.2626e-5 s | 0.0775x | 1.3 s | 34.1x |
| gpuarrays threaded | 0.0003 s | 0.0126x | 7.3 s | 6.1x |
| julia base | 4.078e-6 s | 1.0x | 44.4 s | 1.0x |

code
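The benchmark's neuronal-network model isn't reproduced here; as a stand-in, this sketch shows the general technique of computing a Poincaré section by integrating a chaotic flow (the Rössler system, chosen arbitrarily) and recording plane crossings:

```julia
# One explicit Euler step of the Rössler system (illustrative stand-in).
function roessler_step(x, y, z, dt; a = 0.2, b = 0.2, c = 5.7)
    x + dt * (-y - z), y + dt * (x + a * y), z + dt * (b + z * (x - c))
end

# Record (x, z) each time the trajectory crosses the plane y = 0 upwards.
function poincare_section(nsteps; dt = 1e-3)
    pts = Tuple{Float64, Float64}[]
    x, y, z = 1.0, 1.0, 0.0
    for _ in 1:nsteps
        x2, y2, z2 = roessler_step(x, y, z, dt)
        y < 0 && y2 >= 0 && push!(pts, (x2, z2))
        x, y, z = x2, y2, z2
    end
    pts
end

section = poincare_section(1_000_000)
```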


## Mapreduce

Mapreduce, e.g. sum!. Interestingly, for the sum benchmark the ArrayFire OpenCL backend is the fastest, while the GPUArrays OpenCL backend is the slowest. This means we should be able to remove the slowdown for GPUArrays + OpenCL, and maybe also for the CUDA backends.

### Sum

[image: sum benchmark plot]

| device | time (N = 10¹) | speedup | time (N = 10⁷) | speedup |
| --- | --- | --- | --- | --- |
| arrayfire cl | 0.0106 ms | 0.0005x | 0.4 ms | 3.8x |
| arrayfire cu | 0.0096 ms | 0.0006x | 0.5 ms | 3.4x |
| cuarrays | 0.1 ms | 4.7355e-5x | 0.5 ms | 3.1x |
| clarrays gpu | 0.0088 ms | 0.0006x | 0.6 ms | 2.6x |
| gpuarrays threaded | 0.0114 ms | 0.0005x | 1.4 ms | 1.2x |
| julia base | 5.262e-6 ms | 1.0x | 1.7 ms | 1.0x |
| clarrays cpu | 0.0073 ms | 0.0007x | 13.9 ms | 0.1x |

code
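For reference, the benchmarked operation boils down to a generic reduction; a minimal sketch (GPU usage appears only as a comment, since it depends on the 2017-era package APIs):

```julia
x = rand(Float32, 10^7)

s = sum(x)                 # Julia base reference measured above
m = mapreduce(abs2, +, x)  # general form: map `abs2`, reduce with `+`

# The same generic calls are what the GPU backends accelerate, e.g.
# (assumed usage) `sum(CuArray(x))` with CuArrays, where the reduction
# runs on the device and only the scalar result is copied back.
```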