
OpenBLAS 4 times slower than MKL on DDOT() #530

Closed
hiccup7 opened this issue Apr 3, 2015 · 13 comments · Fixed by #1491


hiccup7 commented Apr 3, 2015

28 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const v=ones(Float64,100000)
@time for k=1:1000000;s=dot(v,v);end

7.5 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import ddot
from timeit import default_timer as timer
v = np.ones(100000)
start = timer()
for k in range(1000000):
    s = ddot(v,v)
exec_time=(timer() - start)
print()
print("Execution took", str(round(exec_time, 3)), "seconds")

The test environment is WinPython-64bit-3.4.3.2FlavorJulia from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/
The same Python time was measured in 64-bit Anaconda3 v2.1.0.

From versioninfo(true) in Julia:

Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)

hiccup7 (Author) commented Apr 3, 2015

The vector size I used is representative of DSP code I run frequently, which takes hours to complete. The DDOT() function accounts for over 99% of my code's CPU time, so a 4x slowdown makes OpenBLAS unusable for me. Watching a CPU meter (such as Task Manager), I see that OpenBLAS uses 1 thread while MKL uses 4 threads in parallel, so I think this is the cause of the slowdown. Although Intel processors use hyperthreading, with two logical cores per physical core, I suspect that 4 threads is optimal for my CPU because it has 4 physical cores and the code is FPU-bound.

Some people may suggest refactoring my code to use a matrix multiply instead of DDOT(). However, my testing in Python shows that this makes execution more than twice as slow with MKL, presumably because the larger number of elements in the matrix-multiply result increases cache demand and yields fewer cache hits. Therefore, I really need a multi-threaded DDOT() in order to move from MKL to OpenBLAS. Since sum-of-products is the foundation of many DSP algorithms, I believe a multi-threaded DDOT() would greatly benefit the whole DSP community. Right now, OpenBLAS's single-threaded DDOT() is the only thing keeping me from moving from Python to Julia.
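For reference, the two formulations being compared could be sketched like this in SciPy (illustrative, not code from the thread; `dgemm` stands in for the matrix-multiply refactor):

```python
# Sketch (assumed, not from the thread): the same sum-of-products expressed
# as a level-1 BLAS dot product and as a 1xN-by-Nx1 level-3 matrix multiply.
import numpy as np
from scipy.linalg.blas import ddot, dgemm

v = np.ones(100000)

s_dot = ddot(v, v)                                  # DDOT: level-1 BLAS
s_gemm = dgemm(1.0, v[None, :], v[:, None])[0, 0]   # DGEMM: level-3 BLAS

# Both compute the same reduction; only the BLAS code path differs.
assert abs(s_dot - s_gemm) < 1e-6
```

Which path is faster depends on the BLAS implementation and cache behavior, which is exactly the trade-off described above.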

hiccup7 (Author) commented Apr 3, 2015

Another user confirmed this problem with DDOT() in OpenBLAS here:
http://stackoverflow.com/questions/29398202/how-to-get-a-multi-threaded-dot-function

xianyi (Collaborator) commented Apr 3, 2015

@hiccup7 , thank you for the feedback.

I think OpenBLAS doesn't parallelize the ddot function. I will parallelize it next week.

xianyi (Collaborator) commented Apr 5, 2015

@hiccup7 , could you try the develop branch?

hiccup7 (Author) commented Apr 5, 2015

@xianyi , thank you for your quick update to the code. I noticed that you improved the SDOT() function also, which is very helpful.

Do I understand correctly that I would need to replace only the libopenblas.dll file in my WinPython environment with a new one from the develop branch? I would be glad to do the testing, but I don't know how to do the build. I always use binary distributions, such as WinPython, and I don't have time to learn the build process with my full-time job. Options I see are:

  1. Use the OpenBLAS build options from my first post and provide a downloadable dll
  2. Ask the WinPython owner to provide a beta build

hiccup7 (Author) commented Apr 23, 2015

The Julia team built libopenblas.dll from the develop branch on April 16th, which is long after the changes were committed to fix this issue. As I documented in JuliaLang/julia#10780, I got the same performance results as with OpenBLAS v0.2.14. Thus, this issue remains unresolved because of problems in the OpenBLAS develop branch.

brada4 (Contributor) commented Nov 28, 2015

I tried this on openSUSE with the default Python 2.7 and Julia 0.4.1 on an i5-4430. In the numpy case I get similar results from MKL and OpenBLAS 0.2.15, and about the same from Julia too. I can induce a slowdown of similar proportion in the Julia case by setting blas_set_num_threads(CPU_CORES*CPU_CORES). I wonder if this might be a short-lived bug in the Windows Julia threading-layer configuration; I don't have Windows to confirm or deny this suspicion.

tkelman (Contributor) commented Nov 30, 2015

ref JuliaLang/julia#5728 for some past performance numbers (I suspect just linking to OpenBLAS from C would show the same results; other than our build-time configuration of OpenBLAS, we aren't doing anything unusual at runtime in Julia here), though that current test case is segfaulting due to #697

haampie (Contributor) commented Mar 15, 2018

Surprisingly, native Julia code with threading gets dot products up to 8x faster than OpenBLAS; ref https://discourse.julialang.org/t/innefficient-paralellization-need-some-help-optimizing-a-simple-dot-product/9723/20

I thought it'd be good to revive this issue then!

martin-frbg (Collaborator) commented

  1. Tests on your MacBook may be slowed down by the .align problem from #1470 (daxpy 10x slower on macOS (Haswell)) unless you use current develop (or snapshots less than three weeks old).
  2. I do not think I see any evidence of the parallelized dot function supposedly committed by xianyi in early April 2015 (cf. his message earlier in this thread), though optimized assembly kernels for x86_64 were added by wernsaar in that time frame. (In #537 (Parallelized ZDOT?), wernsaar claimed that dot was memory-bound to the extent that multithreading would not bring any improvement.)

haampie (Contributor) commented Mar 16, 2018

Concerning 1: The comment right after the one I linked to was run on a Linux machine.

And 2: On Linux with an Intel Xeon E3-1230 v5 I get the results listed below for a threaded dot product with doubles (Float64). Measurements were done with BenchmarkTools.jl.

Size n = 10 000 000 (Julia w/o threading: 7.114 ms)

              1 thread     2 threads    4 threads
OpenBLAS      7.169 ms     7.209 ms     7.183 ms
Julia         7.170 ms     6.109 ms     6.058 ms

Size n = 1 000 000 (Julia w/o threading: 551.457 μs)

              1 thread     2 threads    4 threads
OpenBLAS      565.068 μs   565.750 μs   574.344 μs
Julia         552.558 μs   397.631 μs   371.471 μs

Size n = 100 000 (Julia w/o threading: 21.637 μs)

              1 thread     2 threads    4 threads
OpenBLAS      22.788 μs    22.823 μs    22.793 μs
Julia         23.185 μs    12.093 μs    6.392 μs

Size n = 10 000 (Julia w/o threading: 1.649 μs)

              1 thread     2 threads    4 threads
OpenBLAS      1.679 μs     1.570 μs     1.533 μs
Julia         1.884 μs     1.250 μs     1.080 μs

So for vector sizes in the range 100 000 to 1 000 000 there's a lot to gain with threading on my architecture!
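The split-and-reduce scheme behind such a threaded dot product can be sketched as follows (illustrative Python, not the Julia benchmark code; NumPy's `np.dot` releases the GIL, so the slices can genuinely overlap):

```python
# Sketch of the chunk-and-reduce scheme a threaded dot product uses:
# each worker reduces one contiguous slice, then the partial sums are added.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_dot(x, y, nthreads=4):
    n = len(x)
    # Split [0, n) into nthreads contiguous slices.
    bounds = [(i * n // nthreads, (i + 1) * n // nthreads)
              for i in range(nthreads)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = pool.map(lambda b: np.dot(x[b[0]:b[1]], y[b[0]:b[1]]),
                            bounds)
    return sum(partials)

v = np.ones(1_000_000)
print(threaded_dot(v, v))  # 1000000.0
```

As the tables show, the fork/join overhead of a scheme like this only pays off above some vector-size threshold, which is presumably why a real driver function would fall back to the single-threaded kernel for small n.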

martin-frbg (Collaborator) commented Mar 16, 2018

  1. That was just a precaution, as I saw that you ran your test on a MacBook.
  2. It seems that at least the ARMv8 ThunderX2T99 has a multithreaded driver function for its ddot kernel thanks to the work of ashwinyes, maybe the x86_64 implementation can be stolen from there.

martin-frbg (Collaborator) commented

You can try the hack from my PR if you like :-)
