Benchmarking Scripts to tune the default algorithm choices #166

Closed · ChrisRackauckas opened this issue Jul 19, 2022 · 11 comments

@ChrisRackauckas (Member)

We should put together a benchmark script and have a bunch of people run it. It should just run `LUFactorization`, `RFLUFactorization`, and `FastLUFactorization` (and `MKLFactorization`, once that exists).

It would be nice for the script to take an option controlling what kind of matrix is generated as a function of some N, so that, for example, it can generate the matrices from the Brusselator equation for testing the sparse factorizations.
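A minimal sketch of what such a script could look like (the algorithm list, size grid, and `matgen` hook are illustrative placeholders, not a settled design):

```julia
using LinearSolve, BenchmarkTools

# Candidate methods to sweep; MKLFactorization would join this tuple
# once it exists.
algs = (LUFactorization(), RFLUFactorization(), FastLUFactorization())
ns   = (4, 8, 16, 32, 64, 128, 256, 512)

# Swap this generator out per problem class, e.g. a Brusselator Jacobian
# for the sparse-factorization benchmarks.
matgen(n) = rand(n, n)

for n in ns, alg in algs
    A, b = matgen(n), rand(n)
    prob = LinearProblem(A, b)
    t = @belapsed solve($prob, $alg)
    println(nameof(typeof(alg)), ", n = ", n, ": ", t, " s")
end
```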

@simonp0420 (Contributor)

This continues the benchmarking I presented in #159, where I showed results from running the perf/lu.jl script of the RecursiveFactorization package on a Linux desktop machine. I repeat that exercise here after correcting a bug in the script (see this PR). The results below are for a Windows desktop machine with the following configuration:

```
(RecursiveFactorization) pkg> status
     Project RecursiveFactorization v0.2.11
      Status `D:\peter\Documents\julia\dev\RecursiveFactorization\Project.toml`
  [a93c6f00] DataFrames v1.3.4
  [bdcacae8] LoopVectorization v0.12.120
  [33e6dc65] MKL v0.5.0
  [f517fe37] Polyester v0.6.13
  [7792a7ef] StrideArraysCore v0.3.15
  [d5829a12] TriangularSolve v0.1.12
  [3d5dd08c] VectorizationBase v0.21.42
  [112f6efa] VegaLite v2.6.0
  [37e2e46d] LinearAlgebra
```

```
julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.22000.795]
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz:
              speed         user         nice          sys         idle          irq
       #1  3000 MHz    2431453            0     19001781    516223109     13460718  ticks
       #2  3000 MHz    5596203            0      3129484    528930375       187468  ticks
       #3  3000 MHz    3831859            0      3086890    530737312        40078  ticks
       #4  3000 MHz    3733187            0      1884296    532038578        28156  ticks
       #5  3000 MHz    2444562            0      2263937    532947562        36484  ticks
       #6  3000 MHz    2220062            0      1327734    534108265        28578  ticks
       #7  3000 MHz    2176875            0      1391343    534087843        28593  ticks
       #8  3000 MHz    2917562            0      1682937    533055546        53328  ticks

  Memory: 31.85821533203125 GB (22317.58984375 MB free)
  Uptime: 537656.0 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = runemacs.exe
  CHOCOLATEYLASTPATHUPDATE = 132198172845121191
  HOME = D:\peter\Documents
  HOMEDRIVE = C:
  HOMEPATH = \Users\peter
  MIC_LD_LIBRARY_PATH = C:\Program Files (x86)\Common Files\Intel\Shared Libraries\compiler\lib\intel64_win_mic
  PATH = C:\Program Files\ImageMagick-7.1.0-Q16-HDRI;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2020.1.216\windows\mpi\intel64\bin;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\windows\System32\OpenSSH\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\compiler;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32\compiler;C:\Program Files (x86)\Common Files\Microsoft Shared\VSA\10.0\VsaEnv;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Program Files\MiKTeX 2.9\miktex\bin\x64;C:\Windows\twain_32\MP830;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;c:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\gs\gs8.64\bin;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files\Calibre2\;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Git\cmd;C:\Program Files (x86)\Git\bin;C:\Program Files\TortoiseGit\bin;C:\ProgramData\chocolatey\bin;C:\Program Files\MATLAB\R2022a\bin;C:\Program Files (x86)\Calibre2\;C:\Program Files\Microsoft VS Code\bin;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\Git\cmd;C:\Program Files\nodejs\;C:\Users\peter\AppData\Local\Microsoft\WindowsApps;c:\usr\local\bin;C:\Users\peter\AppData\Local\Programs\MiKTeX 2.9\miktex\bin\x64\;C:\Users\peter\AppData\Local\GitHubDesktop\bin;C:\Users\peter\AppData\Local\Pandoc\;C:\cygwin64\usr\i686-w64-mingw32\sys-root\mingw\lib;C:\Program Files (x86)\Aspell\bin;C:\Users\peter\AppData\Local\gitkraken\bin;C:\Users\peter\AppData\Roaming\npm;C:\Users\peter\AppData\Local\Microsoft\WindowsApps
  PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.JL;.CPL
  PSMODULEPATH = D:\peter\Documents\WindowsPowerShell\Modules;C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
```
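The script itself was run along these lines (a sketch; the checkout path is the one from the status output above, adjust as needed):

```julia
# Activate the dev'd RecursiveFactorization environment and run its
# benchmark script.
using Pkg
Pkg.activate(raw"D:\peter\Documents\julia\dev\RecursiveFactorization")
include(raw"D:\peter\Documents\julia\dev\RecursiveFactorization\perf\lu.jl")
```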

First, the result of running the script after starting Julia with `-t 8` and the default OpenBLAS backend:

[plot: lu_float64_1.7.3_skylake_8cores_OpenBLAS]

OpenBLAS performance seems incredibly bad!

Next, the 8-core result after loading MKL (`using MKL`):

[plot: lu_float64_1.7.3_skylake_8cores_MKL]

Next, the OpenBLAS result after starting Julia with `-t 1`:

[plot: lu_float64_1.7.3_skylake_1cores_OpenBLAS]

And finally, the MKL result after starting Julia with `-t 1`:

[plot: lu_float64_1.7.3_skylake_1cores_MKL]
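(For anyone reproducing these four configurations: the Julia thread count comes from the `-t` flag at startup, and the MKL runs presumably just load MKL.jl before anything else, which swaps the BLAS backend through libblastrampoline on Julia ≥ 1.7:)

```julia
# Start Julia as `julia -t 8` or `julia -t 1`, then for the MKL runs:
using MKL                # replaces OpenBLAS via libblastrampoline
using LinearAlgebra
BLAS.get_config()        # should list the MKL library when it is active
```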

@simonp0420 (Contributor)

Out of curiosity, I commented out this line in the script:

```julia
#BLAS.set_num_threads(nc)
```

and restarted Julia with `-t 8` using OpenBLAS. The result is:

[plot: new_lu_float64_1.7.3_skylake_8cores_OpenBLAS]

That didn't seem to help.

@chriselrod (Contributor) commented Jul 19, 2022

Ah, re what I said earlier about using RF with 1 vs multiple threads:
It doesn't benefit much from threading, but unlike OpenBLAS, Polyester won't weave a noose to hang itself with given enough threads.

OpenBLAS is dramatically faster with `BLAS.set_num_threads(1)` over this size range on most recent computers with many cores.
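A quick way to see this on a given machine (the size here is arbitrary, just in the range where it tends to matter):

```julia
using LinearAlgebra, BenchmarkTools

A = rand(300, 300)
BLAS.set_num_threads(Sys.CPU_THREADS)
@btime lu($A);           # often slower at these sizes on many-core machines
BLAS.set_num_threads(1)
@btime lu($A);           # frequently faster for moderate n
```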

@ChrisRackauckas (Member, Author)

These days, should we ever go back to OpenBLAS? There are so many cases where it's just... bad.

@chriselrod (Contributor) commented Jul 19, 2022

> These days, should we ever go back to OpenBLAS? There are so many cases where it's just... bad.

What do you get on your AMD machine? I'm guessing MKL is still much better there for `lu`?

For Apple silicon, we could use Accelerate. It does quite well with a single core:
[plot: lu_float64_1.9.0-DEV.623_apple-m1_1cores_OpenBLAS]

But that single core gets to use the chip's single matrix multiplier.
With 4 cores:
[plot: lu_float64_1.9.0-DEV.623_apple-m1_4cores_OpenBLAS]
OpenBLAS actually wins.

I should double-check whether Accelerate can benefit from multiple cores on the M1; maybe I just didn't set it. While it has only a single matrix multiplier, there are probably plenty of other things the remaining cores could be doing.

This was with 4 cores on a Mac mini.
I'm guessing OpenBLAS would hang itself again on an M1 Max/Ultra, given 8 threads.

RF wins below 100×100, at least.

@YingboMa and I should look into an algorithm better suited to threading.

@ChrisRackauckas (Member, Author)

How do you make use of Accelerate?

@chriselrod (Contributor) commented Jul 19, 2022

https://github.com/chriselrod/AppleAccelerateLinAlgWrapper.jl
I gave it a deliberately bad name to avoid stealing a good one, hoping someone would create a nicer package that can be used in some way other than `ccall`.
But we probably want to just `ccall` anyway, without otherwise changing BLAS's behavior here.

I also didn't bother wrapping anything other than what I wanted to test (matmul, `lu`, `ldiv`, and `rdiv`).
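For reference, the kind of direct `ccall` involved looks roughly like this (a sketch; the framework path and 32-bit LAPACK integer width are assumed, and AppleAccelerateLinAlgWrapper.jl remains the authoritative version):

```julia
# Sketch: call Accelerate's LAPACK dgetrf_ directly, bypassing
# libblastrampoline entirely. Accelerate's classic interface is
# assumed to use 32-bit (LP64) integers.
const Accelerate = "/System/Library/Frameworks/Accelerate.framework/Accelerate"

function accelerate_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Int32}(undef, min(m, n))
    info = Ref{Int32}(0)
    ccall((:dgetrf_, Accelerate), Cvoid,
          (Ref{Int32}, Ref{Int32}, Ptr{Float64}, Ref{Int32},
           Ptr{Int32}, Ref{Int32}),
          m, n, A, max(1, m), ipiv, info)
    info[] < 0 && error("bad argument $(-info[]) to dgetrf_")
    return A, ipiv, info[]  # factors overwrite A; info > 0 flags singularity
end
```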

@ChrisRackauckas (Member, Author)

We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit libblastrampoline.

@chriselrod (Contributor) commented Jul 19, 2022

> We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit libblastrampoline.

Currently, a problem with it is that despite hitting libblastrampoline, it doesn't actually replace any existing methods, so we still need to use `ccall` manually to test anything.

But, yeah, we should probably just call Accelerate directly to reduce the risk of something going wrong.
The trampoline may also introduce a small amount of overhead, i.e. an extra call?

@simonp0420 (Contributor)

Why does OpenBLAS perform so much worse on Windows than on Linux?

@ChrisRackauckas (Member, Author)

Added.
