OpenBLAS 6 times slower than MKL on DGEMV() #532
OpenBLAS gemv is already multi-threaded (see https://github.com/xianyi/OpenBLAS/blob/develop/interface/gemv.c and https://github.com/xianyi/OpenBLAS/blob/develop/driver/level2/gemv_thread.c). Could you try to build OpenBLAS with MAX_STACK_ALLOC=2048 and test again? See #482 and #478 for details.
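For anyone unfamiliar with the build, passing that flag from a source checkout typically looks like the following build-command fragment (a sketch; the job count and install prefix are illustrative, not from this thread):

```shell
# In a clone of https://github.com/xianyi/OpenBLAS:
make -j4 MAX_STACK_ALLOC=2048                            # build with the larger stack-alloc threshold
make PREFIX=/opt/OpenBLAS MAX_STACK_ALLOC=2048 install   # example install prefix
```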
I use Windows to develop embedded DSP code. I have never built any Windows apps before. I don't think my employer will support me spending the time to learn how to do this for Julia. I hope someone else can build with MAX_STACK_ALLOC=2048 and confirm that this fixes the issue. Otherwise, I will need to stay with Python. If the root cause of the issue is that OpenBLAS needs to be built with MAX_STACK_ALLOC=2048 to perform properly, then perhaps:
Unfortunately the latest OpenBLAS develop 406d9d6, together with a cherry-picked jeromerobert@ee71dd3 patch, didn't solve the gemv performance issue #532. The library was built with MAX_STACK_ALLOC=2048. See winpython/winpython#82 (comment)
Hi, I need more information.

Best regards

On 04/28/2015 02:25 PM, carlkl wrote:
here it is:
@carlkl, I already merged @jeromerobert's patch on the develop branch.
About 4, according to Task Manager. The MKL performance is not degraded if more than one thread is used.
With the latest develop from wernsaar (updated dgemv_n kernel for Nehalem and Haswell) I still see the same behaviour with and without threads (steered with the corresponding environment variables). Execution time of @hiccup7's scipy test (see above) on Windows 64-bit:
Hi, I ran dgemv benchmark tests on our Haswell machine (Linux) in our lab. Please give me more details.

Regards, Werner

On 05/04/2015 11:11 AM, carlkl wrote:
As I mentioned in the opening post, MKL is using 4 threads for both scenarios I tested.
@hiccup7, could you test MKL dgemv with 1, 2, and 4 threads? Please refer to this article https://software.intel.com/en-us/node/528546 to control the number of MKL threads.
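Besides the linked article, the simplest way to pin MKL's thread count from Python is the MKL_NUM_THREADS environment variable. A minimal sketch (the matrix size is illustrative, and the script must be rerun once per thread count because the variable is read at library load time):

```python
import os

# MKL reads this variable when it loads, so it must be set before the
# first import of numpy (or anything else that pulls MKL in).
os.environ["MKL_NUM_THREADS"] = "1"  # rerun with "1", "2", "4" to compare

import numpy as np

a = np.random.rand(1000, 1000)
x = np.random.rand(1000)
y = a @ x  # a dgemv-shaped matrix-vector product
print(y.shape)  # prints (1000,)
```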
Using the Python+MKL code from my opening post, here are the results from the large vector scenario:
OpenBLAS developers have access to MKL for free:
@hiccup7, don't you mean OpenBLAS
Python users? Be aware that MKL as included in numpy-MKL is free, but not for every use case. I'm not a lawyer, but I think you need to buy an MKL license for any commercial usage.
Sorry, I meant the OpenBLAS
Yes, my intention for pointing out free sources for MKL was to support benchmarking, not copying source code.
Hi all, I just ran the latest develop branch on our Haswell machine (Intel Core i7-4770 CPU, Ubuntu 14.04.1 64-bit). For
OpenBLAS got the best performance with 2 threads. For
The performance is the same since OpenBLAS only uses one thread for For small Here is the test code: https://gist.github.com/xianyi/65aef3c2e5bc32049806
@hiccup7, what's
Julia sets Does Julia's
@hiccup7, OpenBLAS can only choose one thread for some small input sizes. However, OpenBLAS cannot switch between 2, 4, or 8 threads dynamically based on the input size.
Improve the performance for
@xianyi, wonderful! Thanks for the improvement. For my two test cases, 2 threads provides the fastest performance. Would it make sense for OpenBLAS to use 2 threads for GEMV() automatically unless the input size is small or
@hiccup7, you can set them to 2 threads in your application. For OpenBLAS, I think we need to test more inputs and CPUs.
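Until OpenBLAS picks the thread count itself, an application can pin it via the OPENBLAS_NUM_THREADS environment variable, which the library reads once at initialization. A NumPy sketch (the count 2 simply mirrors the result reported above; the matrix size is illustrative):

```python
import os

# Must be set before the first import of NumPy (or whatever loads OpenBLAS),
# because OpenBLAS reads it once when it initializes its thread pool.
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # example: the 2-thread setting above

import numpy as np

a = np.random.rand(500, 500)
x = np.random.rand(500)
y = a @ x  # dgemv-shaped call, now limited to at most 2 threads
print(y.shape)  # prints (500,)
```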
@hiccup7, I applied for Intel's tools for open source contributors a week ago. However, I haven't gotten a response yet. :(
The two Python distributions I mentioned for Windows are easy to install. You don't have to learn much of the Python language to modify the code I provided and test almost all the BLAS functions for your needs. The Spyder IDE included in these Python distributions makes it easy to edit, debug, and run your scripts.
Did this ever get resolved? |
Fixed by #4441 (on top of the PRs mentioned above).
Small vector scenario:
26.7 seconds for OpenBLAS in Julia:
4.6 seconds for MKL in Python:

Large vector scenario:
15.7 seconds for OpenBLAS in Julia:
7.9 seconds for MKL in Python:
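The original Julia and Python listings behind these timings did not survive in this copy of the thread. As a stand-in, here is a NumPy sketch of the same shape of dgemv benchmark (the matrix sizes and repetition counts are illustrative guesses, not the original values):

```python
import time
import numpy as np

def bench_dgemv(m, n, reps):
    """Time `reps` matrix-vector products; NumPy dispatches 2-D @ 1-D to BLAS dgemv."""
    a = np.random.rand(m, n)
    x = np.random.rand(n)
    start = time.perf_counter()
    for _ in range(reps):
        a @ x
    return time.perf_counter() - start

# Illustrative sizes only: many small multiplies vs. a few large ones.
print("small-vector scenario: %.3f s" % bench_dgemv(64, 64, 10000))
print("large-vector scenario: %.3f s" % bench_dgemv(2000, 2000, 100))
```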
The tested environment is WinPython-64bit-3.4.3.2FlavorJulia from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/
The same Python time was measured in 64-bit Anaconda3 v2.1.0.
From versioninfo(true) in Julia:
I observed using the CPU meter (Task Manager) that OpenBLAS is single-threaded and MKL uses 4 threads. From this I would predict OpenBLAS to be 4 times slower than MKL, but for the small vector scenario, OpenBLAS is actually about 6 times slower. Maybe an optimization for Haswell will help OpenBLAS match MKL's speed.
I haven't tested SGEMV(), but it may need to be parallelized too. DGEMV() and SGEMV() are commonly-used functions in DSP. These are important to allow me to move from Python to Julia.