
OpenBLAS 4 times slower than MKL on DDOT() #530

Closed
hiccup7 opened this issue Apr 3, 2015 · 13 comments · Fixed by #1491


hiccup7 commented Apr 3, 2015

28 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const v=ones(Float64,100000)
@time for k=1:1000000;s=dot(v,v);end

7.5 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import ddot
from timeit import default_timer as timer
v = np.ones(100000)
start = timer()
for k in range(1000000):
    s = ddot(v,v)
exec_time=(timer() - start)
print()
print("Execution took", str(round(exec_time, 3)), "seconds")

The test environment is WinPython-64bit-3.4.3.2FlavorJulia from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/
The same Python time was measured in 64-bit Anaconda3 v2.1.0.

From versioninfo(true) in Julia:

Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)

hiccup7 (Author) commented Apr 3, 2015

The vector size I used is representative of DSP code I run frequently, which takes hours to complete. The DDOT() function accounts for over 99% of my code's CPU time, so a 4x slowdown makes OpenBLAS unusable for me. Watching a CPU meter (such as Task Manager), I see that OpenBLAS uses 1 thread while MKL uses 4 threads in parallel, so I think this is the cause of the slowdown. Although Intel processors use hyperthreading, with two logical cores per physical core, I suspect that 4 threads is optimal for my CPU because it has 4 physical cores and the code is FPU-bound.

Some people may suggest refactoring my code to use a matrix multiply instead of DDOT(). However, my testing in Python shows that this makes execution more than twice as slow with MKL, presumably because the larger number of elements in the matrix-multiply result increases cache demand and yields fewer cache hits. Therefore, I really need a multi-threaded DDOT() in order to move from MKL to OpenBLAS. Since sum-of-products is the foundation of many DSP algorithms, I believe a multi-threaded DDOT() would greatly benefit the whole DSP community. Right now, OpenBLAS's single-threaded DDOT() is the only thing keeping me from moving from Python to Julia.
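For reference, the two formulations being compared could be sketched like this in SciPy (illustrative, not code from the thread; `dgemm` stands in for the matrix-multiply refactor):

```python
# Sketch (assumed, not from the thread): the same sum-of-products expressed
# as a level-1 BLAS dot product and as a 1xN-by-Nx1 level-3 matrix multiply.
import numpy as np
from scipy.linalg.blas import ddot, dgemm

v = np.ones(100000)

s_dot = ddot(v, v)                                  # DDOT: level-1 BLAS
s_gemm = dgemm(1.0, v[None, :], v[:, None])[0, 0]   # DGEMM: level-3 BLAS

# Both compute the same reduction; only the BLAS code path differs.
assert abs(s_dot - s_gemm) < 1e-6
```

Which path is faster depends on the BLAS implementation and cache behavior, which is exactly the trade-off described above.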

hiccup7 (Author) commented Apr 3, 2015

Another user confirmed this problem with DDOT() in OpenBLAS here:
http://stackoverflow.com/questions/29398202/how-to-get-a-multi-threaded-dot-function

xianyi (Collaborator) commented Apr 3, 2015

@hiccup7 , thank you for the feedback.

I think OpenBLAS doesn't parallelize the ddot function. I will parallelize it next week.

xianyi (Collaborator) commented Apr 5, 2015

@hiccup7 , could you try the develop branch?

hiccup7 (Author) commented Apr 5, 2015

@xianyi , thank you for your quick update to the code. I noticed that you improved the SDOT() function also, which is very helpful.

Do I understand correctly that I would need to replace only the libopenblas.dll file in my WinPython environment with a new one from the develop branch? I would be glad to do the testing, but I don't know how to do the build. I always use binary distributions, such as WinPython, and I don't have time to learn the build process with my full-time job. Options I see are:

  1. Use the OpenBLAS build options from my first post and provide a downloadable dll
  2. Ask the WinPython owner to provide a beta build

hiccup7 (Author) commented Apr 23, 2015

The Julia team built libopenblas.dll from the develop branch on April 16th, which is long after the changes were committed to fix this issue. As I documented in JuliaLang/julia#10780, I got the same performance results as with OpenBLAS v0.2.14. Thus, this issue remains unresolved because of problems in the OpenBLAS develop branch.

brada4 (Contributor) commented Nov 28, 2015

I tried this on openSUSE with the default Python 2.7 and Julia 0.4.1 on an i5-4430. In the numpy case I get similar results from MKL and OpenBLAS 0.2.15, and about the same from Julia too. I can induce a slowdown of similar proportion in the Julia case by setting blas_set_num_threads(CPU_CORES*CPU_CORES). I wonder if this might be a short-lived bug in the Windows Julia threading-layer configuration; I don't have Windows to confirm or deny this suspicion.

tkelman (Contributor) commented Nov 30, 2015

ref JuliaLang/julia#5728 for some past performance numbers (I suspect just linking to OpenBLAS from C would show the same results; other than our build-time configuration of OpenBLAS, we aren't doing anything unusual at runtime in Julia here), though that current test case is segfaulting due to #697

haampie (Contributor) commented Mar 15, 2018

Surprisingly, native Julia code with threading gets dot products up to 8x faster than OpenBLAS; ref https://discourse.julialang.org/t/innefficient-paralellization-need-some-help-optimizing-a-simple-dot-product/9723/20

I thought it'd be good to revive this issue then!

martin-frbg (Collaborator) commented

  1. Tests on your MacBook may be slowed down by the .align problem from #1470 (daxpy 10x slower on macOS (Haswell)) unless you use current develop (or snapshots less than three weeks old).
  2. I do not think I see any evidence of the parallelized dot function supposedly committed by xianyi in early April 2015 (cf. his message earlier in this thread), though optimized assembly kernels for x86_64 were added by wernsaar in that time frame. (In #537 (Parallelized ZDOT?), wernsaar claimed that dot was memory-bound to the extent that multithreading would not bring any improvement.)

haampie (Contributor) commented Mar 16, 2018

Concerning 1: The comment right after the one I linked to was run on a Linux machine.

And 2: On Linux with an Intel Xeon E3-1230 v5 I get the results listed below for a threaded dot product with doubles (Float64). Measurements were done with BenchmarkTools.jl.

Size n = 10 000 000 (Julia w/o threading: 7.114 ms)

              1 thread     2 threads    4 threads
OpenBLAS      7.169 ms     7.209 ms     7.183 ms
Julia         7.170 ms     6.109 ms     6.058 ms

Size n = 1 000 000 (Julia w/o threading: 551.457 μs)

              1 thread     2 threads    4 threads
OpenBLAS      565.068 μs   565.750 μs   574.344 μs
Julia         552.558 μs   397.631 μs   371.471 μs

Size n = 100 000 (Julia w/o threading: 21.637 μs)

              1 thread     2 threads    4 threads
OpenBLAS      22.788 μs    22.823 μs    22.793 μs
Julia         23.185 μs    12.093 μs    6.392 μs

Size n = 10 000 (Julia w/o threading: 1.649 μs)

              1 thread     2 threads    4 threads
OpenBLAS      1.679 μs     1.570 μs     1.533 μs
Julia         1.884 μs     1.250 μs     1.080 μs

So for vector sizes in the range 100 000 to 1 000 000 there's a lot to gain with threading on my architecture!
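The split-and-reduce scheme behind such a threaded dot product can be sketched as follows (illustrative Python, not the Julia benchmark code; NumPy's `np.dot` releases the GIL, so the slices can genuinely overlap):

```python
# Sketch of the chunk-and-reduce scheme a threaded dot product uses:
# each worker reduces one contiguous slice, then the partial sums are added.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_dot(x, y, nthreads=4):
    n = len(x)
    # Split [0, n) into nthreads contiguous slices.
    bounds = [(i * n // nthreads, (i + 1) * n // nthreads)
              for i in range(nthreads)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = pool.map(lambda b: np.dot(x[b[0]:b[1]], y[b[0]:b[1]]),
                            bounds)
    return sum(partials)

v = np.ones(1_000_000)
print(threaded_dot(v, v))  # 1000000.0
```

As the tables show, the fork/join overhead of a scheme like this only pays off above some vector-size threshold, which is presumably why a real driver function would fall back to the single-threaded kernel for small n.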

martin-frbg (Collaborator) commented Mar 16, 2018

  1. That was just a precaution, as I saw that you ran your test on a MacBook.
  2. It seems that at least the ARMv8 ThunderX2T99 has a multithreaded driver function for its ddot kernel thanks to the work of ashwinyes, maybe the x86_64 implementation can be stolen from there.

martin-frbg (Collaborator) commented

You can try the hack from my PR if you like :-)
