
OpenBLAS 6 times slower than MKL on DGEMV() #532

Closed
hiccup7 opened this issue Apr 6, 2015 · 32 comments

@hiccup7

hiccup7 commented Apr 6, 2015

Small vector scenario.
26.7 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((201, 150))
const x = ones(150)
@time for k=1:1000000; s = BLAS.gemv(trans, a, x); end

4.6 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer
alpha = 1.0
a = np.ones((201, 150), order='F')
x = np.ones(150)
start = timer()
for k in range(1000000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()  # bare "print" is a Python 2 leftover; it prints nothing in Python 3
print("Execution took", round(exec_time, 3), "seconds")

Large vector scenario.
15.7 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((4, 100000))
const x = ones(100000)
@time for k=1:100000; s = BLAS.gemv(trans, a, x); end

7.9 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer
alpha = 1.0
a = np.ones((4, 100000), order='F')
x = np.ones(100000)
start = timer()
for k in range(100000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()  # bare "print" is a Python 2 leftover; it prints nothing in Python 3
print("Execution took", round(exec_time, 3), "seconds")

The tested environment is WinPython-64bit-3.4.3.2FlavorJulia from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/
The same Python time was measured in 64-bit Anaconda3 v2.1.0.

From versioninfo(true) in Julia:

Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)

I observed with the CPU meter (Task Manager) that OpenBLAS is single-threaded while MKL uses 4 threads. From this I would predict OpenBLAS to be 4 times slower than MKL, but for the small vector scenario OpenBLAS is actually about 6 times slower. Maybe an optimization for Haswell would help OpenBLAS match MKL's speed.

I haven't tested SGEMV(), but it may need to be parallelized too. DGEMV() and SGEMV() are commonly used functions in DSP, and their performance is important for me to be able to move from Python to Julia.

@jeromerobert
Contributor

OpenBLAS gemv is already multi-threaded (see https://github.com/xianyi/OpenBLAS/blob/develop/interface/gemv.c and https://github.com/xianyi/OpenBLAS/blob/develop/driver/level2/gemv_thread.c). Could you try building OpenBLAS with MAX_STACK_ALLOC=2048 and test again? See #482 and #478 for details.
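For reference, a minimal sketch of such a build from the OpenBLAS source tree (assuming a working GCC/gfortran toolchain; the flags besides MAX_STACK_ALLOC mirror the configuration posted later in this thread, and the exact invocation on Windows may differ):

make BINARY=64 USE_THREAD=1 MAX_STACK_ALLOC=2048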

@hiccup7
Author

hiccup7 commented Apr 8, 2015

I use Windows to develop embedded DSP code, and I have never built any Windows apps before. I don't think my employer will support me spending the time to learn how to do this for Julia. I hope someone else can build with MAX_STACK_ALLOC=2048 and confirm that it fixes the issue. Otherwise, I will need to stay with Python.

If the root cause of the issue is that OpenBLAS needs to be built with MAX_STACK_ALLOC=2048 to perform properly, then perhaps:
a) OpenBLAS could default to MAX_STACK_ALLOC=2048 when MAX_STACK_ALLOC is unspecified.
b) The Julia build environment could be updated. Is this the appropriate makefile?
https://github.com/JuliaLang/julia/blob/master/Make.inc

@carlkl

carlkl commented Apr 22, 2015

Unfortunately, the latest OpenBLAS develop (406d9d6) together with a cherry-picked jeromerobert@ee71dd3 patch does not solve the gemv performance issue #532. The library was built with MAX_STACK_ALLOC=2048. See winpython/winpython#82 (comment)

@carlkl

carlkl commented Apr 28, 2015

Something weird happens with OpenBLAS dgemv: running @hiccup7's scipy code above with ONLY ONE thread in OpenBLAS gives about the same performance as scipy-MKL, and the performance drops as more threads are involved. @wernsaar?

  • 1 thread
    • OpenBLAS: 5.5 s
    • MKL: 5.6 s
  • 4 threads
    • OpenBLAS: 16.0 s
    • MKL: 5.0 s

@wernsaar
Contributor

Hi,

I need more information:

  1. What platform or target?
  2. Is the matrix transposed or not transposed (because different kernels are called)?
  3. What is the size of the matrix (m and n)?

Best regards
Werner


@carlkl

carlkl commented Apr 28, 2015

here it is:

@xianyi
Collaborator

xianyi commented Apr 28, 2015

@carlkl, I already merged @jeromerobert's patch on the develop branch.
Do you know how many threads MKL used?

@carlkl

carlkl commented Apr 28, 2015

About 4, according to the Task Manager. MKL's performance is not degraded when more than one thread is used.
A solution might be to increase GEMM_MULTITHREAD_THRESHOLD. Was the default of 4 found during benchmarking?

@carlkl

carlkl commented May 4, 2015

With the latest develop from wernsaar (updated dgemv_n kernel for Nehalem and Haswell) I still see the same behaviour with and without threads (controlled with the corresponding environment variables).

@hiccup7's scipy test (see above), execution time on 64-bit Windows:

  • MKL: around 5.9 sec, regardless of whether 4 threads or only one thread is used
  • OpenBLAS: 24.7 sec with 4 threads (Haswell) and 6.0 sec with only one thread

@wernsaar
Contributor

wernsaar commented May 4, 2015

Hi,

I ran dgemv benchmark tests on our Haswell machine (Linux) in our lab.
MKL dgemv is always single-threaded on this platform; OpenBLAS is multithreaded.
For matrix sizes from 256x256 to 2048x2048, OpenBLAS is faster than MKL.
Using 2 threads with OpenBLAS, you can expect 60% better performance.
More than 2 threads are not useful.

Please give me more details:
Size of the matrix
Increment for vector x
Increment for vector y
Is the matrix transposed or not transposed

Regards

Werner


@carlkl

carlkl commented May 4, 2015

  • Platform:
    • windows amd64
    • gcc with win32thread model
    • openblas: latest wernsaar develop
  • Makefile.rule:
    • TARGET = HASWELL
    • DYNAMIC_ARCH = 0
    • CC = gcc
    • FC = gfortran
    • BINARY = 64
    • USE_THREAD = 1
    • USE_OPENMP = 0
    • NUM_THREADS = 32
    • NO_WARMUP = 1
    • NO_AFFINITY = 1
    • USE_SIMPLE_THREADED_LEVEL3 = 1 (also tested with 0)
    • COMMON_OPT = -O2 -march=x86-64 -mtune=generic
    • FCOMMON_OPT = -frecursive
    • MAX_STACK_ALLOC = 2048
  • Matrix:
    • Fortran ordering (C ordering is much slower)
    • M x N = 201 x 150

@xianyi
Collaborator

xianyi commented May 4, 2015

@wernsaar, @carlkl's test case uses a small matrix size. I think it should use only a single thread instead of multithreading.

Actually, it is a long-standing OpenBLAS issue that MKL makes a better choice between single-threaded and multithreaded execution based on the input size.

@hiccup7
Author

hiccup7 commented May 4, 2015

As I mentioned in the opening post, MKL uses 4 threads for both scenarios I tested.
Also note from the opening post that the increments for x and y are 1, and there is no transpose.

@xianyi
Collaborator

xianyi commented May 4, 2015

@hiccup7, could you test more MKL dgemv results with 1, 2, and 4 threads? Please refer to this article on controlling the number of MKL threads: https://software.intel.com/en-us/node/528546
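For anyone reproducing this, here is a minimal sketch of the small vector benchmark with the MKL thread count pinned from inside Python. MKL_NUM_THREADS must be set before NumPy/SciPy load MKL; setting it in the shell before launching Python works as well.

import os
os.environ["MKL_NUM_THREADS"] = "2"  # pin MKL to 2 threads; try "1", "2", "4"

# These imports must come after MKL_NUM_THREADS is set.
import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer

a = np.ones((201, 150), order='F')
x = np.ones(150)
start = timer()
for k in range(1000000):
    s = dgemv(1.0, a, x)
print("Execution took", round(timer() - start, 3), "seconds")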

@hiccup7
Author

hiccup7 commented May 4, 2015

Using the Python+MKL code from my opening post:
Results for the small vector scenario:
6.6 seconds, MKL_NUM_THREADS=1
5.3 seconds, MKL_NUM_THREADS=2
4.6 seconds, MKL_NUM_THREADS=4

Results for the large vector scenario:
13.5 seconds, MKL_NUM_THREADS=1
12.8 seconds, MKL_NUM_THREADS=2
7.9 seconds, MKL_NUM_THREADS=4

@hiccup7
Author

hiccup7 commented May 5, 2015

OpenBLAS developers have access to MKL for free:
https://winpython.github.io/ (Windows-only)
https://store.continuum.io/cshop/anaconda/ (only the Windows version contains MKL for free)
https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor (Linux-only)

@stonebig

stonebig commented May 5, 2015

@hiccup7 don't you mean OpenBLAS users?

@carlkl

carlkl commented May 5, 2015

Python users? Be aware that MKL, as included in numpy-MKL, is free but not for every use case. I'm not a lawyer, but I think you need to buy an MKL license for any commercial usage.
For comparison: OpenBLAS has a genuinely free BSD license that fits perfectly into the so-called scipy-stack landscape.

@stonebig

stonebig commented May 5, 2015

Sorry, I meant that the OpenBLAS developers can't make any use of MKL, except to benchmark against it.

@hiccup7
Author

hiccup7 commented May 5, 2015

Yes, my intention in pointing out free sources for MKL was to support benchmarking, not copying source code.

@xianyi
Collaborator

xianyi commented May 6, 2015

Hi all,

I just ran the latest develop branch on our Haswell machine (Intel Core i7-4770 CPU, Ubuntu 14.04.1 64-bit).

For 201x150,

OPENBLAS_NUM_THREADS=1 ./test_gemv_open 201 150 1000000
201x150 1000000 loops   4.261447 s   14150.123186 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv_open 201 150 1000000
201x150 1000000 loops   3.361230 s   17939.861301 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv_open 201 150 1000000
201x150 1000000 loops   4.208811 s   14327.086676 MGFLOPS

OpenBLAS got the best performance with 2 threads.

For 4x100000,

OPENBLAS_NUM_THREADS=1 ./test_gemv_open 4 100000 100000
4x100000 100000 loops   11.901841 s   6721.649197 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv_open 4 100000 100000
4x100000 100000 loops   12.399255 s   6452.000544 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv_open 4 100000 100000
4x100000 100000 loops   12.463332 s   6418.829250 MGFLOPS

The performance is the same because OpenBLAS only uses one thread for 4x100000. The reason is that OpenBLAS splits the gemv_n workload along the m (row) direction; in the 4x100000 case, m (4) is too small to split, so OpenBLAS uses only one thread.

For the small-m, large-n case, we need to parallelize gemv along the n (column) direction: every thread computes a part of the result, and then the main thread does the reduction, as sketched below.
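To make the split concrete, here is an illustrative plain-NumPy sketch (not OpenBLAS internals) of partitioning gemv along the n direction with a final reduction:

import numpy as np

# The short-and-wide case from this issue: y = A*x with A of size 4 x 100000.
m, n, nthreads = 4, 100000, 2
a = np.ones((m, n), order='F')
x = np.ones(n)

# Split the columns of A (and the matching entries of x) into one block per
# thread; each block yields a partial y of length m.
bounds = np.linspace(0, n, nthreads + 1).astype(int)
partials = [a[:, lo:hi].dot(x[lo:hi])
            for lo, hi in zip(bounds[:-1], bounds[1:])]

# Reduction step: the main thread sums the partial results.
y = np.sum(partials, axis=0)

assert np.allclose(y, a.dot(x))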

Here is the test code: https://gist.github.com/xianyi/65aef3c2e5bc32049806

@xianyi
Collaborator

xianyi commented May 6, 2015

@hiccup7, what is CPU_CORES in your test code? Is it 4 (the number of physical cores) or 8 (the number of logical cores)?

@hiccup7
Author

hiccup7 commented May 6, 2015

Julia sets CPU_CORES to 8 for my Intel Haswell CPU (with 4 physical cores and 8 logical cores).

Does Julia's blas_set_num_threads() function set the maximum allowed number of threads, so that OpenBLAS can reduce the number of threads if needed to get higher speed? I hope so.

@xianyi
Collaborator

xianyi commented May 7, 2015

@hiccup7, OpenBLAS can only fall back to a single thread for some small input sizes. However, OpenBLAS cannot switch between 2, 4, or 8 threads dynamically based on the input size.

xianyi added a commit that referenced this issue May 7, 2015
@xianyi
Collaborator

xianyi commented May 7, 2015

I improved the performance for the 4x100000 case.
With two threads, it achieves the best performance.

OPENBLAS_NUM_THREADS=1 ./test_gemv 4 100000 100000
4x100000 100000 loops   12.048461 s   6639.852177 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv 4 100000 100000
4x100000 100000 loops   6.176924 s   12951.430194 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv 4 100000 100000
4x100000 100000 loops   12.482034 s   6409.211832 MGFLOPS

@hiccup7
Author

hiccup7 commented May 7, 2015

@xianyi , Wonderful! Thanks for the improvement.

For my two test cases, 2 threads give the fastest performance. Would it make sense for OpenBLAS to use 2 threads for GEMV() automatically, unless the input size is small or OPENBLAS_NUM_THREADS=1?

@xianyi
Collaborator

xianyi commented May 12, 2015

@hiccup7, you can set 2 threads yourself in your application; see the sketch below.

For OpenBLAS, I think we need to test more inputs and CPUs.
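A minimal sketch of pinning OpenBLAS to 2 threads from Python, assuming a NumPy/SciPy build linked against OpenBLAS (the variable must be set before the library is loaded; in Julia the equivalent call is blas_set_num_threads(2)):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # read once, when OpenBLAS is loaded

# This import must come after the environment variable is set, and assumes
# this NumPy/SciPy build is linked against OpenBLAS.
import numpy as np
from scipy.linalg.blas import dgemv

a = np.ones((4, 100000), order='F')
x = np.ones(100000)
s = dgemv(1.0, a, x)  # now runs with at most 2 OpenBLAS threads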

@xianyi
Collaborator

xianyi commented May 12, 2015

@hiccup7, I applied for Intel's tools for open-source contributors a week ago. However, I haven't received a response yet. :(

@hiccup7
Author

hiccup7 commented May 13, 2015

The two Python distributions I mentioned for Windows are easy to install. You don't have to learn much of the Python language to modify the code I provided to test nearly all of the BLAS functions. The Spyder IDE included in these Python distributions makes it easy to edit, debug, and run your scripts.

@jakirkham
Contributor

Did this ever get resolved?

@fenrus75
Contributor

fenrus75 commented Aug 4, 2018

#1712

@martin-frbg
Collaborator

Fixed by #4441 (on top of the PRs mentioned above).
