
OpenBLAS 6 times slower than MKL on DGEMV() #532

Closed
hiccup7 opened this issue Apr 6, 2015 · 32 comments

@hiccup7

hiccup7 commented Apr 6, 2015

Small vector scenario.
26.7 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((201, 150))
const x = ones(150)
@time for k=1:1000000; s = BLAS.gemv(trans, a, x); end

4.6 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer
alpha = 1.0
a = np.ones((201, 150), order='F')
x = np.ones(150)
start = timer()
for k in range(1000000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()  # bare "print" is a Python 2 leftover; it prints nothing in Python 3
print("Execution took", round(exec_time, 3), "seconds")

Large vector scenario.
15.7 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((4, 100000))
const x = ones(100000)
@time for k=1:100000; s = BLAS.gemv(trans, a, x); end

7.9 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer
alpha = 1.0
a = np.ones((4, 100000), order='F')
x = np.ones(100000)
start = timer()
for k in range(100000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()  # bare "print" is a Python 2 leftover; it prints nothing in Python 3
print("Execution took", round(exec_time, 3), "seconds")

The tested environment is WinPython-64bit-3.4.3.2FlavorJulia from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/
The same Python time was measured in 64-bit Anaconda3 v2.1.0.

From versioninfo(true) in Julia:

Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)

I observed with the CPU meter (Task Manager) that OpenBLAS is single-threaded while MKL uses 4 threads. From this I would predict OpenBLAS to be 4 times slower than MKL, but for the small vector scenario OpenBLAS is actually about 6 times slower. Maybe an optimization for Haswell would help OpenBLAS match MKL's speed.

I haven't tested SGEMV(), but it may need to be parallelized too. DGEMV() and SGEMV() are commonly used functions in DSP, and their performance is important for me to be able to move from Python to Julia.

@jeromerobert
Contributor

OpenBLAS gemv is already multi-threaded (see https://github.com/xianyi/OpenBLAS/blob/develop/interface/gemv.c and https://github.com/xianyi/OpenBLAS/blob/develop/driver/level2/gemv_thread.c). Could you try building OpenBLAS with MAX_STACK_ALLOC=2048 and test again? See #482 and #478 for details.
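For reference, a minimal sketch of such a build from the OpenBLAS source tree (assuming a working GCC/gfortran toolchain; the flags besides MAX_STACK_ALLOC mirror the configuration posted later in this thread, and the exact invocation on Windows may differ):

make BINARY=64 USE_THREAD=1 MAX_STACK_ALLOC=2048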

@hiccup7
Author

hiccup7 commented Apr 8, 2015

I use Windows to develop embedded DSP code, and I have never built any Windows apps before. I don't think my employer will support me spending the time to learn how to do this for Julia. I hope someone else can build with MAX_STACK_ALLOC=2048 and confirm that it fixes the issue. Otherwise, I will need to stay with Python.

If the root cause of the issue is that OpenBLAS needs to be built with MAX_STACK_ALLOC=2048 to perform properly, then perhaps:
a) OpenBLAS could default to MAX_STACK_ALLOC=2048 when MAX_STACK_ALLOC is unspecified.
b) The Julia build environment could be updated. Is this the appropriate makefile?
https://github.com/JuliaLang/julia/blob/master/Make.inc

@carlkl

carlkl commented Apr 22, 2015

Unfortunately, the latest OpenBLAS develop (406d9d6) together with a cherry-picked jeromerobert@ee71dd3 patch does not solve the gemv performance issue #532. The library was built with MAX_STACK_ALLOC=2048. See winpython/winpython#82 (comment)

@carlkl

carlkl commented Apr 28, 2015

Something weird happens with OpenBLAS dgemv: running @hiccup7's scipy code above with ONLY ONE thread in OpenBLAS gives about the same performance as scipy-MKL, and the performance drops as more threads are involved. @wernsaar?

  • 1 thread
    • OpenBLAS: 5.5 s
    • MKL: 5.6 s
  • 4 threads
    • OpenBLAS: 16.0 s
    • MKL: 5.0 s

@wernsaar
Contributor

Hi,

I need more information:

  1. What platform or target?
  2. Is the matrix transposed or not transposed (because different kernels are called)?
  3. What is the size of the matrix (m and n)?

Best regards
Werner


@carlkl

carlkl commented Apr 28, 2015

here it is:

@xianyi
Collaborator

xianyi commented Apr 28, 2015

@carlkl, I already merged @jeromerobert's patch on the develop branch.
Do you know how many threads MKL used?

@carlkl

carlkl commented Apr 28, 2015

About 4, according to the Task Manager. MKL's performance is not degraded when more than one thread is used.
A solution might be to increase GEMM_MULTITHREAD_THRESHOLD. Was the default of 4 found during benchmarking?

@carlkl

carlkl commented May 4, 2015

With the latest develop from wernsaar (updated dgemv_n kernel for Nehalem and Haswell) I still see the same behaviour with and without threads (controlled with the corresponding environment variables).

@hiccup7's scipy test (see above), execution time on 64-bit Windows:

  • MKL: around 5.9 sec, regardless of whether 4 threads or only one thread is used
  • OpenBLAS: 24.7 sec with 4 threads (Haswell) and 6.0 sec with only one thread

@wernsaar
Contributor

wernsaar commented May 4, 2015

Hi,

I ran dgemv benchmark tests on our Haswell machine (Linux) in our lab.
MKL dgemv is always single-threaded on this platform; OpenBLAS is multithreaded.
For matrix sizes from 256x256 to 2048x2048, OpenBLAS is faster than MKL.
Using 2 threads with OpenBLAS, you can expect 60% better performance.
More than 2 threads are not useful.

Please give me more details:
Size of the matrix
Increment for vector x
Increment for vector y
Is the matrix transposed or not transposed

Regards

Werner


@carlkl

carlkl commented May 4, 2015

  • Platform:
    • windows amd64
    • gcc with win32thread model
    • openblas: latest wernsaar develop
  • Makefile.rule:
    • TARGET = HASWELL
    • DYNAMIC_ARCH = 0
    • CC = gcc
    • FC = gfortran
    • BINARY = 64
    • USE_THREAD = 1
    • USE_OPENMP = 0
    • NUM_THREADS = 32
    • NO_WARMUP = 1
    • NO_AFFINITY = 1
    • USE_SIMPLE_THREADED_LEVEL3 = 1 (also tested with 0)
    • COMMON_OPT = -O2 -march=x86-64 -mtune=generic
    • FCOMMON_OPT = -frecursive
    • MAX_STACK_ALLOC = 2048
  • Matrix:
    • Fortran ordering (C ordering is much slower)
    • M x N = 201 x 150

@xianyi
Collaborator

xianyi commented May 4, 2015

@wernsaar, @carlkl's test case uses a small matrix size. I think it should use only a single thread instead of multithreading.

Actually, it is a long-standing OpenBLAS issue that MKL makes a better choice between single-threaded and multithreaded execution based on the input size.

@hiccup7
Author

hiccup7 commented May 4, 2015

As I mentioned in the opening post, MKL uses 4 threads for both scenarios I tested.
Also note from the opening post that the increments for x and y are 1, and there is no transpose.

@xianyi
Collaborator

xianyi commented May 4, 2015

@hiccup7, could you test more MKL dgemv results with 1, 2, and 4 threads? Please refer to this article on controlling the number of MKL threads: https://software.intel.com/en-us/node/528546
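For anyone reproducing this, here is a minimal sketch of the small vector benchmark with the MKL thread count pinned from inside Python. MKL_NUM_THREADS must be set before NumPy/SciPy load MKL; setting it in the shell before launching Python works as well.

import os
os.environ["MKL_NUM_THREADS"] = "2"  # pin MKL to 2 threads; try "1", "2", "4"

# These imports must come after MKL_NUM_THREADS is set.
import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer

a = np.ones((201, 150), order='F')
x = np.ones(150)
start = timer()
for k in range(1000000):
    s = dgemv(1.0, a, x)
print("Execution took", round(timer() - start, 3), "seconds")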

@hiccup7
Author

hiccup7 commented May 4, 2015

Using the Python+MKL code from my opening post:
Results for the small vector scenario:
6.6 seconds, MKL_NUM_THREADS=1
5.3 seconds, MKL_NUM_THREADS=2
4.6 seconds, MKL_NUM_THREADS=4

Results for the large vector scenario:
13.5 seconds, MKL_NUM_THREADS=1
12.8 seconds, MKL_NUM_THREADS=2
7.9 seconds, MKL_NUM_THREADS=4

@hiccup7
Author

hiccup7 commented May 5, 2015

OpenBLAS developers have access to MKL for free:
https://winpython.github.io/ (Windows-only)
https://store.continuum.io/cshop/anaconda/ (only the Windows version contains MKL for free)
https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor (Linux-only)

@stonebig

stonebig commented May 5, 2015

@hiccup7 don't you mean OpenBLAS users?

@carlkl

carlkl commented May 5, 2015

Python users? Be aware that MKL, as included in numpy-MKL, is free but not for every use case. I'm not a lawyer, but I think you need to buy an MKL license for any commercial usage.
For comparison: OpenBLAS has a genuinely free BSD license that fits perfectly into the so-called scipy-stack landscape.

@stonebig

stonebig commented May 5, 2015

Sorry, I meant that the OpenBLAS developers can't make any use of MKL, except to benchmark against it.

@hiccup7
Author

hiccup7 commented May 5, 2015

Yes, my intention in pointing out free sources for MKL was to support benchmarking, not copying source code.

@xianyi
Collaborator

xianyi commented May 6, 2015

Hi all,

I just ran the latest develop branch on our Haswell machine (Intel Core i7-4770 CPU, Ubuntu 14.04.1 64-bit).

For 201x150,

OPENBLAS_NUM_THREADS=1 ./test_gemv_open 201 150 1000000
201x150 1000000 loops   4.261447 s   14150.123186 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv_open 201 150 1000000
201x150 1000000 loops   3.361230 s   17939.861301 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv_open 201 150 1000000
201x150 1000000 loops   4.208811 s   14327.086676 MGFLOPS

OpenBLAS got the best performance with 2 threads.

For 4x100000,

OPENBLAS_NUM_THREADS=1 ./test_gemv_open 4 100000 100000
4x100000 100000 loops   11.901841 s   6721.649197 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv_open 4 100000 100000
4x100000 100000 loops   12.399255 s   6452.000544 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv_open 4 100000 100000
4x100000 100000 loops   12.463332 s   6418.829250 MGFLOPS

The performance is the same because OpenBLAS only uses one thread for 4x100000. The reason is that OpenBLAS splits the gemv_n workload along the m (row) direction; in the 4x100000 case, m (4) is too small to split, so OpenBLAS uses only one thread.

For the small-m, large-n case, we need to parallelize gemv along the n (column) direction: every thread computes a part of the result, and then the main thread does the reduction, as sketched below.
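To make the split concrete, here is an illustrative plain-NumPy sketch (not OpenBLAS internals) of partitioning gemv along the n direction with a final reduction:

import numpy as np

# The short-and-wide case from this issue: y = A*x with A of size 4 x 100000.
m, n, nthreads = 4, 100000, 2
a = np.ones((m, n), order='F')
x = np.ones(n)

# Split the columns of A (and the matching entries of x) into one block per
# thread; each block yields a partial y of length m.
bounds = np.linspace(0, n, nthreads + 1).astype(int)
partials = [a[:, lo:hi].dot(x[lo:hi])
            for lo, hi in zip(bounds[:-1], bounds[1:])]

# Reduction step: the main thread sums the partial results.
y = np.sum(partials, axis=0)

assert np.allclose(y, a.dot(x))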

Here is the test code: https://gist.github.com/xianyi/65aef3c2e5bc32049806

@xianyi
Collaborator

xianyi commented May 6, 2015

@hiccup7, what is CPU_CORES in your test code? Is it 4 (the number of physical cores) or 8 (the number of logical cores)?

@hiccup7
Author

hiccup7 commented May 6, 2015

Julia sets CPU_CORES to 8 for my Intel Haswell CPU (with 4 physical cores and 8 logical cores).

Does Julia's blas_set_num_threads() function set the maximum allowed number of threads, so that OpenBLAS can reduce the number of threads if needed to get higher speed? I hope so.

@xianyi
Collaborator

xianyi commented May 7, 2015

@hiccup7, OpenBLAS can only fall back to a single thread for some small input sizes. However, OpenBLAS cannot switch between 2, 4, or 8 threads dynamically based on the input size.

xianyi added a commit that referenced this issue May 7, 2015
@xianyi
Collaborator

xianyi commented May 7, 2015

I improved the performance for the 4x100000 case.
With two threads, it achieves the best performance.

OPENBLAS_NUM_THREADS=1 ./test_gemv 4 100000 100000
4x100000 100000 loops   12.048461 s   6639.852177 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv 4 100000 100000
4x100000 100000 loops   6.176924 s   12951.430194 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv 4 100000 100000
4x100000 100000 loops   12.482034 s   6409.211832 MGFLOPS

@hiccup7
Author

hiccup7 commented May 7, 2015

@xianyi , Wonderful! Thanks for the improvement.

For my two test cases, 2 threads give the fastest performance. Would it make sense for OpenBLAS to use 2 threads for GEMV() automatically, unless the input size is small or OPENBLAS_NUM_THREADS=1?

@xianyi
Collaborator

xianyi commented May 12, 2015

@hiccup7, you can set 2 threads yourself in your application; see the sketch below.

For OpenBLAS, I think we need to test more inputs and CPUs.
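A minimal sketch of pinning OpenBLAS to 2 threads from Python, assuming a NumPy/SciPy build linked against OpenBLAS (the variable must be set before the library is loaded; in Julia the equivalent call is blas_set_num_threads(2)):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # read once, when OpenBLAS is loaded

# This import must come after the environment variable is set, and assumes
# this NumPy/SciPy build is linked against OpenBLAS.
import numpy as np
from scipy.linalg.blas import dgemv

a = np.ones((4, 100000), order='F')
x = np.ones(100000)
s = dgemv(1.0, a, x)  # now runs with at most 2 OpenBLAS threads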

@xianyi
Collaborator

xianyi commented May 12, 2015

@hiccup7, I applied for Intel's tools for open-source contributors a week ago. However, I haven't received a response yet. :(

@hiccup7
Author

hiccup7 commented May 13, 2015

The two Python distributions I mentioned for Windows are easy to install. You don't have to learn much of the Python language to modify the code I provided to test nearly all of the BLAS functions. The Spyder IDE included in these Python distributions makes it easy to edit, debug, and run your scripts.

@jakirkham
Contributor

Did this ever get resolved?

@fenrus75
Contributor

fenrus75 commented Aug 4, 2018

#1712

@martin-frbg
Collaborator

Fixed by #4441 (on top of the PRs mentioned above).
