
Squeeze More Performance from Intel MKL #7

Closed

RoyiAvital opened this issue May 14, 2019 · 8 comments

Comments

@RoyiAvital

RoyiAvital commented May 14, 2019

Hello,
I don't see how MKL is built here (I see that a DLL is opened; could it be that no compilation is done and the library is only called through a DLL?).

But it would be great if it were built and compiled into Julia in a way that exploits more features of Intel MKL to improve performance:

[Image: slide listing Intel MKL performance features, including the Packed API and Compact API]

I think the Packed API and Compact API are trickier, but it would be great if they were exposed.
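
For context, here is a minimal C sketch of the Packed API (the cblas_dgemm_pack* functions are MKL's documented interface; the helper name, shapes, and batching scheme are made up for illustration). The idea is to pack the reused matrix once and amortize the packing cost over many products:

```c
#include <mkl.h>

/* Multiply a fixed matrix A (m x k) by many B's (k x n):
   pack A once, then reuse the packed form for every product. */
void many_products(MKL_INT m, MKL_INT n, MKL_INT k,
                   const double *a, const double *b_batch,
                   double *c_batch, int num_b) {
    /* Query the buffer size MKL needs for the packed copy of A. */
    size_t bytes = cblas_dgemm_pack_get_size(CblasAMatrix, m, n, k);
    double *a_packed = (double *)mkl_malloc(bytes, 64);

    /* Pack A once; alpha is folded into the packed data. */
    cblas_dgemm_pack(CblasColMajor, CblasAMatrix, CblasNoTrans,
                     m, n, k, 1.0, a, m, a_packed);

    /* Each compute call skips the internal repacking of A. */
    for (int i = 0; i < num_b; ++i)
        cblas_dgemm_compute(CblasColMajor, CblasPacked, CblasNoTrans,
                            m, n, k, a_packed, m,
                            b_batch + (size_t)i * k * n, k,
                            0.0, c_batch + (size_t)i * m * n, m);

    mkl_free(a_packed);
}
```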

MKL JIT Feature

MKL also has a new JIT feature which I think could be a great addition: Intel® Math Kernel Library Improved Small Matrix Performance Using Just-in-Time (JIT) Code Generation for Matrix Multiplication (GEMM).

Remark
It seems you use Intel MKL 2019.0, while Intel MKL 2019.4 is out (the OpenMP runtime is even from the 2018 release).

Update (14/10/2019)
I listed what I wanted in Benchmark MATLAB & Julia for Matrix Operations - Message 145.

@andreasnoack
Member

This is not currently on our to-do list. It's a relatively low priority, since for smaller problems you'd generally want to use StaticArrays, which would probably be competitive with what MKL can offer for small matrices. Please let us know if you have evidence otherwise.

@RoyiAvital
Author

@andreasnoack,
Even if StaticArrays is competitive, why not enable MKL_DIRECT_CALL?
It is only a small compilation-flag change.
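
For reference, a minimal C sketch of what that flag change looks like on the calling side (the compile line assumes the usual MKL link setup; the sizes and data are made up):

```c
/* Compile with the macro defined, e.g.:
     icc -DMKL_DIRECT_CALL gemm_direct.c -mkl
   Defining MKL_DIRECT_CALL before including mkl.h lets MKL turn
   small-size cblas_dgemm calls into direct, low-overhead kernel
   invocations instead of going through the full library entry point. */
#include <mkl.h>

int main(void) {
    double a[4] = {1, 2, 3, 4};
    double b[4] = {5, 6, 7, 8};
    double c[4] = {0, 0, 0, 0};

    /* A 2x2 GEMM: exactly the size class the direct-call path targets. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);
    return 0;
}
```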

The other features go far beyond what you can do with StaticArrays.
They exploit the structure of a small problem (say, multiplying by the same matrix over and over) to get the gains of a large-size problem.

@RoyiAvital
Author

RoyiAvital commented Sep 13, 2019

Another way to achieve comparable (better?) performance to StaticArrays would be using the MKL JIT feature: Intel® Math Kernel Library Improved Small Matrix Performance Using Just-in-Time (JIT) Code Generation for Matrix Multiplication (GEMM).

The performance looks pretty impressive:

[Image: benchmark chart of MKL JIT GEMM performance on small matrices]

Note that one can use it in two manners (a sketch of the second follows this list):

  1. Enable JIT and let the MKL engine make the decisions.
  2. Ask MKL to export a pointer to a JITted variation of a function and call it directly.
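
A minimal C sketch of the second manner, using the mkl_jit_* API that shipped with MKL 2019 (the fixed 3x3 shape and the data are made up for illustration):

```c
#include <stdio.h>
#include <mkl.h>

int main(void) {
    void *jitter;

    /* Generate a kernel specialized for one fixed problem:
       C = 1.0 * A * B + 0.0 * C, 3x3 doubles, column-major. */
    mkl_jit_status_t status = mkl_jit_create_dgemm(
        &jitter, MKL_COL_MAJOR, MKL_NOTRANS, MKL_NOTRANS,
        3, 3, 3, 1.0, 3, 3, 0.0, 3);
    if (status == MKL_JIT_ERROR) {
        fprintf(stderr, "JIT kernel creation failed\n");
        return 1;
    }
    /* status may also be MKL_NO_JIT, in which case the returned
       kernel falls back to the standard GEMM path. */

    /* Fetch the pointer to the JITted kernel once... */
    dgemm_jit_kernel_t dgemm_3x3 = mkl_jit_get_dgemm_ptr(jitter);

    double a[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    double b[9] = {9, 8, 7, 6, 5, 4, 3, 2, 1};
    double c[9] = {0};

    /* ...and call it directly, e.g. in a hot loop, with no dispatch
       or argument-checking overhead on each call. */
    dgemm_3x3(jitter, a, b, c);

    printf("c[0] = %f\n", c[0]);
    mkl_jit_destroy(jitter);
    return 0;
}
```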

@RoyiAvital
Author

By the way, there is a simple trick to squeeze better performance from MKL on AMD CPUs: replacing the CPU dispatch function, as described by Agner Fog.

An example of this is given in:

https://github.com/fo40225/Anaconda-Windows-AMD
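
For illustration only, here is the override commonly attributed to Agner Fog's guidance (this is not taken from the linked repo; mkl_serv_intel_cpu_true is an undocumented internal MKL symbol reported in community threads, not an official API, so treat this sketch as an assumption that may break across MKL versions):

```c
/* fakeintel.c -- build as a shared object and preload it so the
   dynamic linker resolves this symbol before MKL's own copy:
     gcc -shared -fPIC -o libfakeintel.so fakeintel.c
     LD_PRELOAD=./libfakeintel.so ./my_mkl_program

   ASSUMPTION: mkl_serv_intel_cpu_true is an undocumented MKL
   internal. Returning 1 makes MKL's dispatcher believe it runs on
   an Intel CPU, so it selects the fast AVX2/AVX-512 code paths on
   AMD instead of the slow generic fallback. Unsupported by Intel. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```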

@ViralBShah
Contributor

I believe much of the discussion here is out of scope for the current MKL.jl package, and might perhaps find a bit more traction on Discourse.

@RoyiAvital
Author

@ViralBShah, what do you mean?
These features are about how MKL is integrated into a project.

@ViralBShah
Contributor

This package is mainly about providing MKL as a replacement for OpenBLAS. For further functionality, other packages can be created that leverage the presence of MKL. For now, I'm going to close this issue.

@RoyiAvital
Author

@ViralBShah, I think you are missing the point.
All the features above must be turned on when MKL is integrated in place of OpenBLAS; they can't be enabled anywhere else.

For instance, the direct-call flag can't be set at a later stage, only at integration time.
