
A way to disable MKL? #73

Closed
freemin7 opened this issue Nov 14, 2021 · 10 comments

Comments

@freemin7

freemin7 commented Nov 14, 2021

Hey, I am running on an ARM system, where MKL does not exist as such. Having BLASBenchmarksCPU automatically run MKL-related code in its init is not elegant there. What would be a good way to handle this?

@idevcde

idevcde commented Nov 14, 2021

Hey, just wondering: does it work the way "Example 2" suggests, or does the problem occur while the package is starting? https://julialinearalgebra.github.io/BLASBenchmarksCPU.jl/stable/usage/ Is it possible to use BLASBenchmarksCPU on ARM at all?

@freemin7
Author

No, the error happens before that: during using, in the init code.

ERROR: LoadError: UndefVarError: libmkl_rt not defined
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base ./Base.jl:35
 [2] top-level scope
   @ ~/.julia/packages/BLASBenchmarksCPU/63VfB/src/BLASBenchmarksCPU.jl:48
 [3] include
   @ ./Base.jl:420 [inlined]
 [4] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
   @ Base ./loading.jl:1318
 [5] top-level scope
   @ none:1
 [6] eval
   @ ./boot.jl:373 [inlined]
 [7] eval(x::Expr)
   @ Base.MainInclude ./client.jl:453
 [8] top-level scope
   @ none:1
in expression starting at /lustre/home/guest19/.julia/packages/BLASBenchmarksCPU/63VfB/src/BLASBenchmarksCPU.jl:1
ERROR: Failed to precompile BLASBenchmarksCPU [5fdc822c-4560-4d20-af7e-e5ee461714d5] to /lustre/home/guest19/.julia/compiled/v1.7/BLASBenchmarksCPU/jl_3yOXOv.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, ignore_loaded_modules::Bool)
   @ Base ./loading.jl:1466
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base ./loading.jl:1410
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1120
 [5] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
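
For context, a minimal sketch of how the MKL-specific line could be guarded so that using BLASBenchmarksCPU does not hit this on ARM; the names MKL_jll and libmkl_rt are assumed from the stack trace, and the structure below is illustrative rather than the package's actual code:

if Sys.ARCH === :x86_64 || Sys.ARCH === :i686
    # MKL binaries only exist on x86, so only reference libmkl_rt there
    using MKL_jll: libmkl_rt
    const maybe_libmkl = libmkl_rt
else
    # e.g. aarch64: skip MKL entirely and benchmark the remaining BLAS libraries
    const maybe_libmkl = nothing
end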

@chriselrod
Collaborator

LoopVectorization will still crash on the A64FX, so Octavian at least won't work.

@idevcde

idevcde commented Nov 15, 2021

Thanks for supporting FUGAKU! :-)

@chriselrod
Collaborator

chriselrod commented Nov 15, 2021

I hope to see FUGAKU and more vector-CPU supercomputers succeed in the future!

The issue that LoopVectorization/Octavian has looks like it would need some digging to isolate.
I hope to replace the current code base, hopefully within the next few months, and would rather have those issues resolved or avoided at that point than spend more time on the current approach.

@idevcde

idevcde commented Nov 16, 2021

Great to hear that! And thank you again for making the appropriate changes to BLASBenchmarksCPU.jl so it can be run on Arm! AFAIK, FUGAKU is open for trials, with project screening per submission throughout the whole year. Having the opportunity, can I ask some very basic questions about BLAS libraries?

So far I have mostly used the OpenBLAS shipped natively with Julia. Following your advice on Julia Discourse, I also used MKL.jl with Julia 1.7 (yeah, I know how it sounds, but it was very good advice). When I used MKL instead of OpenBLAS with some Julia packages, I did not have to change anything in the packages' code. It was as easy as writing "using MKL", and I understand the calculations then ran with MKL instead of OpenBLAS.

Is it the same with Octavian.jl? Or do I have to rewrite the code of the package in order to use Octavian.jl?

Also, I hear that on Arm, the Arm Performance Libraries' BLAS and BLIS in particular have good performance. If I wanted to try either of them, how do I make it work with Julia on Arm? Should I build a _jll and pin that _jll in Julia's package mode? Is there any tutorial you are aware of, or any resource you could point me to on this topic?

And the last question: is it correct to assume that Octavian.jl, if/when it works on Arm / vector CPUs, could bring a significant performance increase?

@freemin7
Author

freemin7 commented Nov 16, 2021

I'll address the questions I can answer.

Is it the same with Octavian.jl? Or do I have to rewrite the code of the package in order to use Octavian.jl?

Octavian's only exported public function is matmul!(C, A, B[, α, β, max_threads]) (according to the docs), which means it doesn't change the definition of *(A::AbstractMatrix, B::AbstractMatrix) or similar calls, whereas MKL.jl does exactly that. That is a change that would only need to be done by one person, once; without it, you would need to rewrite your code to call matmul! explicitly, as sketched below.
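
To make the difference concrete, a small sketch (matmul! is the documented Octavian entry point quoted above; the MKL.jl side is described only in comments, since it works by swapping the backend rather than changing your code):

using Octavian

A = rand(200, 300); B = rand(300, 100)
C = Matrix{Float64}(undef, 200, 100)

# Octavian has to be called explicitly; it does not overload *
matmul!(C, A, B)    # C now holds A * B, computed by Octavian

# By contrast, after `using MKL` your code stays as-is:
# A * B still goes through LinearAlgebra, which is now backed by MKL.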

How do I make the Arm Performance Libraries' BLAS work with Julia on Arm?

If it doesn't exist already, you need to write an equivalent to MKL.jl or BLIS.jl.

Should I build a _jll and pin that _jll in Julia's package mode?

A _jll (which is a reproducible build recipe for the binaries across a selection of targets) is not enough on its own. You will also need to write a Julia wrapper which provides definitions for *(A::AbstractMatrix, B::AbstractMatrix) and many other calls. Since the API is probably quite similar, you can adapt from MKL.jl, OpenBLAS, or BLIS.jl; a rough sketch of the lowest layer is below.
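
As a sketch of the kind of low-level glue such a wrapper needs, assuming an LP64 vendor BLAS (32-bit integers) that exposes the usual Fortran dgemm_ symbol; the library name libvendorblas is a placeholder, not a real artifact, and a real wrapper would cover many more routines:

# hypothetical: forward Float64 matrix multiplication to a vendor BLAS via ccall
const libvendorblas = "libvendorblas"   # placeholder shared-library name

function vendor_gemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A)
    n = size(B, 2)
    ccall((:dgemm_, libvendorblas), Cvoid,
          (Ref{UInt8}, Ref{UInt8}, Ref{Int32}, Ref{Int32}, Ref{Int32},
           Ref{Float64}, Ptr{Float64}, Ref{Int32}, Ptr{Float64}, Ref{Int32},
           Ref{Float64}, Ptr{Float64}, Ref{Int32}),
          UInt8('N'), UInt8('N'), m, n, k,
          1.0, A, max(1, stride(A, 2)),
          B, max(1, stride(B, 2)),
          0.0, C, max(1, stride(C, 2)))
    return C
end

# a full wrapper would then hook this into mul!, *, and friends, as MKL.jl and BLIS.jl do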

How do I make BLIS work with Julia on Arm?

Try the package BLIS.jl, see if it works and passes its tests, and if it doesn't, make it work. (The details depend on the failure mode.)

Is there any tutorial you are aware of, or any resource you could point me to on this topic?

Look at the source code of similar packages and watch the "Developing Julia packages" talk if you haven't already, although its recommendation to use TravisCI is outdated.

And the last question: is it correct to assume that Octavian.jl, if/when it works on Arm / vector CPUs, could bring a significant performance increase?

In Julia as it is now, probably not, as Julia likes to emit NEON vector instructions, which don't utilize the SVE vector registers. If Julia can be made to emit competitive SVE code, then it is likely that Octavian, with some tuning, would be competitive on A64FX.
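
A quick way to see what the compiler currently emits for a simple kernel (just a sketch; on A64FX you would look for SVE registers such as z0 and predicates such as p0 in the output, versus NEON's v0-style registers):

using InteractiveUtils   # provides code_native (already loaded in the REPL)

axpy_like!(y, x) = (y .= 2 .* x .+ y; y)

code_native(axpy_like!, (Vector{Float64}, Vector{Float64}); debuginfo = :none)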

@chriselrod
Collaborator

On Julia 1.7+, the LinearAlgebra BLAS libraries use libblastrampoline:

julia> LinearAlgebra.BLAS.libblas
"libblastrampoline"

Which is what allows swapping BLAS implementations at runtime.
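
From the user's side on 1.7+ the swap looks roughly like this (BLAS.get_config is the libblastrampoline-facing query in LinearAlgebra; MKL.jl essentially calls BLAS.lbt_forward for you, and it is x86-only, which is the point of this issue):

using LinearAlgebra

BLAS.get_config()   # shows which backend libblastrampoline currently forwards to

using MKL           # on x86_64, repoints the trampoline at libmkl_rt at runtime

BLAS.get_config()   # now lists MKL instead of the default OpenBLAS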

For this to work with a library like Octavian, it'd have to provide all the appropriate ccalls.

gemm is the building block of many more complicated BLAS/LAPACK algorithms, so it'd be interesting to see what the performance of LAPACK would be if you swap out OpenBLAS's gemm for Octavian (or even raw LV), which does much better at small sizes.

Longer term, these can be implemented in Julia, but I will be prioritizing rewriting LV before working on LinearAlgebra, as the rewrite will (a) help compile times and (b) make implementing many algorithms much easier.

@chriselrod
Collaborator

If Julia can be made to emit competitive SVE code, then it is likely that Octavian, with some tuning, would be competitive on A64FX.

This requires setting the min-SVE bits arg, but that seems to be causing crashes.
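
For reference, a hedged sketch of passing such a flag through to Julia's LLVM; the flag name below is assumed from LLVM's AArch64 backend option, and as noted above this has been reported to crash:

# The flag has to be set before Julia starts, e.g. from the shell:
#   JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" julia
# Inside the session you can only confirm what was passed through:
get(ENV, "JULIA_LLVM_ARGS", "(not set)")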

@idevcde

idevcde commented Nov 17, 2021

Thanks a lot for all the information! As for the Arm Performance Libraries' BLAS and writing a wrapper, I really do not think it is right for me to do it. As for BLIS, I see that it probably performs better on Neoverse N1 than the OpenBLAS I used for testing (please see https://github.com/flame/blis/blob/master/docs/Performance.md). I hope to have the opportunity to carry some of the tests further with BLIS in the near future. I think I can say that Julia on Neoverse N1 was already competitive with x86 and GPU using the standard OpenBLAS, or at least that is my current understanding. As for potentially competitive performance on A64FX, I am optimistic about doing some tests in the near future.

gemm is the building block of many more complicated BLAS/LAPACK algorithms, so it'd be interesting to see what the performance of LAPACK would be if you swap out OpenBLAS's gemm for Octavian (or even raw LV), which does much better at small sizes.

I do not have precise knowledge regarding the matrix sizes. The tests I was doing are related to AI training. I would be happy to carry out the tests suggested above in the near (or medium / long term) future when my time permits (I am optimistic about it, though there are sometimes constraints), even if the setup would not be appropriate for the above-mentioned AI problem. As for the technical requirements, I might need some advice then; for now I have to admit that even after re-reading the posts, I would not know how to do it.
