
Julia + OpenBLAS vs. MATLAB + MKL - Matrix Operations Benchmark #1090

Open
RoyiAvital opened this issue Feb 9, 2017 · 22 comments

@RoyiAvital

Hi,

I did some tests with MATLAB and Julia:

Matlab & Julia Matrix Operations Benchmark

I think they reflect, at least in part, OpenBLAS vs. Intel MKL, so the numbers may be worth knowing for the developers.

See also here:

Benchmark MATLAB & Julia for Matrix Operations

Thank You.

@martin-frbg
Collaborator

Thanks for the pointer. Unfortunately it is not quite clear what you are testing here if you are pitting two "teams" against each other - how much of the difference in efficiency comes from each component?
At first glance, the divergence of the graphs after a matrix size of ~1000 primarily suggests that MATLAB+MKL is using threading to its advantage while Julia+OpenBLAS is not. Whether that is due to limitations in Julia or in OpenBLAS (how did you build each, and did you check whether and with how many threads they run?) is unclear at best, though inferior performance compared to MKL in some functions/circumstances has been noted in the past, e.g. #530, #532.
I suspect it would be more instructive to run benchmarks on isolated BLAS/LAPACK functions with both MKL and OpenBLAS first - unfortunately none of the current developers appears to have access to MKL.
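As a rough sketch of what "benchmarking an isolated BLAS function" could look like (this is an illustrative Python/NumPy analogue, not code from the benchmark in question; it assumes NumPy is linked against the BLAS library under test, and that `OPENBLAS_NUM_THREADS` is honored when set before NumPy is imported):

```python
# Time an isolated GEMM call through NumPy, so only the BLAS backend
# is measured. OPENBLAS_NUM_THREADS must be set before NumPy loads.
import os
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import time
import numpy as np

def time_gemm(n, repeats=5):
    """Best wall-clock time (s) and rough GFLOP/s for an n x n dgemm."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    a @ b  # warm-up call, excluded from timing
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return best, 2.0 * n ** 3 / best / 1e9

secs, gflops = time_gemm(256)
print(f"256x256 dgemm: {secs:.5f} s (~{gflops:.1f} GFLOP/s)")
```

Running the same script against builds linked to MKL and to OpenBLAS would attribute the difference to the BLAS layer alone, taking the language runtime out of the comparison.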

@RoyiAvital
Author

@martin-frbg, no pitting at all.

I just thought the data might help the developers: seeing the numbers might show where to invest effort.

You raise an interesting point about multithreading, so I rechecked: all of my cores (6 cores) are being utilized. So it is not that multithreading is disabled under Julia.

It seems the eigendecomposition, Cholesky, and SVD are weak points of OpenBLAS compared to MKL. Are you aware of that?
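To make that kind of claim checkable per routine, the individual LAPACK-backed decompositions can be timed in isolation. A minimal sketch (Python/NumPy is used here as an illustrative stand-in; the same script run against an MKL-linked and an OpenBLAS-linked NumPy would isolate the backend difference per routine):

```python
# Time Cholesky, symmetric eigendecomposition, and SVD separately,
# so each LAPACK driver can be compared across backends on its own.
import time
import numpy as np

def bench(label, fn, repeats=3):
    """Best-of-`repeats` wall-clock time for fn(), after one warm-up call."""
    fn()  # warm-up
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    print(f"{label:>8}: {best:.4f} s")
    return best

n = 300
rng = np.random.default_rng(1)
m = rng.standard_normal((n, n))
spd = m @ m.T + n * np.eye(n)  # symmetric positive definite, for Cholesky

t_chol = bench("cholesky", lambda: np.linalg.cholesky(spd))
t_eigh = bench("eigh", lambda: np.linalg.eigh(spd))
t_svd = bench("svd", lambda: np.linalg.svd(m))
```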

Thank You.

@martin-frbg
Collaborator

I am certainly aware of #1077 (SVD, I suspect we will need a reduced testcase for that one to investigate further) and I think we also have issues involving *syrk (used in Cholesky) on at least some platforms. There is certainly room for improvement...

@brada4
Contributor

brada4 commented Feb 10, 2017

Can you produce graphs of Julia+MKL vs. Julia+OpenBLAS? The curves look so proportional that the difference could boil down to micro-timing peculiarities in each.

@RoyiAvital
Author

@brada4, I wish I could.
I don't have access to Julia + MKL (I'm on Windows, and I'm not going to hack my way to a Julia + MKL build).

@brada4
Contributor

brada4 commented Feb 10, 2017

What about Windows Octave with OpenBLAS vs. MATLAB?

@RoyiAvital
Author

RoyiAvital commented Feb 13, 2017

Here is a comparison of Julia + OpenBLAS vs. Julia + MKL on the same tests:

http://imgur.com/a/rBOo8

Those were generated following:

JuliaLang/julia#18374 (comment)

Thank You.

@brada4
Contributor

brada4 commented Feb 13, 2017

There is #843, merged post-0.2.19, which adds optimizing Fortran flags to LAPACK and should align the graphs better.

@RoyiAvital
Author

(comment content not captured - likely an image attachment)

@brada4
Contributor

brada4 commented Feb 17, 2017

'Matrix generation' does not involve any BLAS; it just measures your libc RNG speed and malloc behavior at various sizes. It was probably the fastest measurement because it ran first on the same system.
'Reductions' shows a wrong threading threshold in OpenBLAS.
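The allocation/RNG point above can be made concrete: if generation happens inside the timed region, the measurement mixes malloc and RNG cost into the result. A minimal illustration (Python/NumPy, as an assumed analogue of the benchmark's structure):

```python
# Keep allocation and RNG outside the timed region, so the measurement
# reflects the computation rather than malloc and RNG speed.
import time
import numpy as np

n = 1000
rng = np.random.default_rng(0)

# Naive version: generation is inside the timed region.
t0 = time.perf_counter()
a = rng.standard_normal((n, n))
s_naive = a.sum()
t_naive = time.perf_counter() - t0

# Better: generate first, then time only the reduction.
a = rng.standard_normal((n, n))
t0 = time.perf_counter()
s = a.sum()
t_reduction = time.perf_counter() - t0

print(f"generation+sum: {t_naive:.6f} s, sum only: {t_reduction:.6f} s")
```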

@RoyiAvital
Author

(comment content not captured - likely an image attachment)

@brada4
Contributor

brada4 commented Feb 20, 2017

It lacks an anchor to the OpenBLAS version. Obviously it cannot include the #843 fix, which was not in place at that time.

@martin-frbg
Collaborator

Early March 2016 would mean 0.2.15, or at best 0.2.16rc1, but I guess the point is the availability of the benchmark code and the MKL result. (Not that much has changed performance-wise for that one function on Haswell, I think. It might be interesting to see how much restoring the compiler optimization level for the LAPACK functions, as per #843, actually buys us here, but I doubt it is enough to close the gap. I do not have an ultrabook Haswell as used for the test, however.) With regard to the "matrix generation" test mentioned above, there is no harm in having those numbers as well - at the very least they show that OpenBLAS is not doing something fundamentally wrong in the way it stores and handles matrices.

@brada4
Contributor

brada4 commented Feb 20, 2017

From the NumPy eigh documentation:

The eigenvalues/eigenvectors are computed using LAPACK routines _syevd, _heevd
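For reference, a minimal NumPy example of the routine in question - `numpy.linalg.eigh` on a real symmetric matrix goes through the LAPACK driver quoted above and returns eigenvalues in ascending order, so its speed directly reflects the underlying LAPACK build:

```python
# numpy.linalg.eigh on a real symmetric matrix (LAPACK dsyevd path).
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric; eigenvalues are 1 and 3
w, v = np.linalg.eigh(a)

print(w)  # [1. 3.] (ascending order)
# The eigenvectors diagonalize a: v.T @ a @ v == diag(w)
print(np.allclose(v.T @ a @ v, np.diag(w)))  # True
```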

@martin-frbg
Collaborator

Well, that one is rather obvious. But no matter how silly that LAPACK handbrake turned out to be in retrospect, I am not optimistic enough to assume that just bringing LAPACK back to its normal speed would let "our" optimized BLAS calls show enough gain to actually match the MKL data.

@aminya

aminya commented Jun 28, 2019

I updated this repository, adding benchmarks for Julia+OpenBLAS and Julia+Intel MKL:
https://github.com/aminya/MatlabJuliaMatrixOperationsBenchmark

Julia+Intel MKL is faster than OpenBLAS64 most of the time.

@brada4
Contributor

brada4 commented Jun 29, 2019

Not sure if the 100 Hz clock of MATLAB counts as bad performance; could you time more iterations to get past that?

@aminya

aminya commented Jun 29, 2019

Not sure if the 100 Hz clock of MATLAB counts as bad performance; could you time more iterations to get past that?

There is not much difference in most of the functions when running 50 iterations instead of 4 around timeit, though there are some differences in some cases.
I used a number of iterations around timeit, but timeit itself calls the function multiple times and returns the median; I then calculate the average of those returned medians.

For example, for matrix inversion timeit internally runs 11*100 iterations and returns the median, and my 4 outer iterations wrap that, averaging the medians of 1100 iterations each. If I raise the outer iteration count to 50, that becomes something like 55,000 iterations - while I explicitly set Julia's sample count to 700 in the code.

In a real-world situation, nobody runs a function 200 times, so repeatability and stability of performance are also important.
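The nesting described above (inner batches whose median is taken, wrapped in an outer averaging loop) can be sketched as follows. This is a rough Python analogue using the standard-library `timeit` module, not MATLAB's actual `timeit` code; the batch counts 11 and 100 mirror the numbers quoted above:

```python
# Outer average over inner median-of-batches, analogous to wrapping
# MATLAB's timeit (which returns a median) in an averaging loop.
import statistics
import timeit

def inner_measure(fn, calls=100, batches=11):
    """Median over `batches` timed batches of `calls` calls each."""
    samples = timeit.repeat(fn, repeat=batches, number=calls)
    return statistics.median(samples) / calls  # seconds per call

def outer_average(fn, outer=4):
    """Average of `outer` independent inner measurements."""
    return sum(inner_measure(fn) for _ in range(outer)) / outer

# Total function calls here: 4 * 11 * 100 = 4400.
per_call = outer_average(lambda: sum(range(1000)))
print(f"~{per_call * 1e6:.2f} microseconds per call")
```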
(screenshot of benchmark results attached)

@RoyiAvital
Author

RoyiAvital commented Jun 29, 2019

I think that, since MATLAB's function handles aren't as efficient as they should be, you shouldn't use the timeit() function.

This is the reason I didn't use it: it adds overhead.
I think it is better to use the approach I used in the original test.

Update

Looking at the code of timeit(), it seems to estimate that overhead and subtract it.

For more information:

For more accurate timing in MATLAB - High Accuracy Timer.

@aminya

aminya commented Jun 29, 2019

Update

Looking at the code of timeit() they seem to try calculating the overhead and remove it.

Yes, timeit() is much more accurate than a simple tic and toc. timeit() uses tic and toc internally, but it is smarter about getting a good benchmark; that is why it is the function MathWorks recommends for benchmarking.
The situation is the same for both languages: we pass a function handle to a benchmarking tool, and it measures the time spent running that function.

@RoyiAvital
Author

I'd still prefer direct use of tic() and toc().
The function timeit() is recommended because of the warm start and the use of the median.

I wouldn't use timeit() in MATLAB.
I'd just do a warm start and measure either each iteration or a few iterations combined.
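The warm-start-then-measure approach described here can be sketched as follows (a Python analogue of the MATLAB tic/toc pattern, not the original benchmark code): one untimed warm-up call absorbs one-time setup and cache effects, then each iteration is timed individually so run-to-run spread stays visible instead of being collapsed into a single median.

```python
# Warm start once, then time each iteration separately, keeping the
# full sample distribution rather than only a median.
import statistics
import time

def warm_start_timings(fn, iterations=20):
    fn()  # warm start: excluded from the measurements
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return times

samples = warm_start_timings(lambda: sum(i * i for i in range(10_000)))
print(f"median {statistics.median(samples):.6f} s, "
      f"spread {max(samples) - min(samples):.6f} s")
```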

@aminya

aminya commented Jun 30, 2019

I'd still prefer direct use of tic() and toc().
The function timeit() is recommended because of the warm start and the use of the median.

I wouldn't use timeit() in MATLAB.
I'd just do a warm start and measure either each iteration or a few iterations combined.

I don't understand the reason for this. What happens inside timeit() is what I would do if I measured the timing myself - except that they have thought more than I have about accuracy and the various aspects of their commercial code. https://www.mathworks.com/help/matlab/matlab_prog/measure-performance-of-your-program.html
If you feel the result is biased you can send me another matlabBench file so I can run the test with your code.
