Benchmarking Scripts to tune the default algorithm choices #166

ChrisRackauckas · 2022-07-19T00:58:32Z

We should put together a benchmark script and have a bunch of people run it. It should just run LUFactorization, RFLUFactorization, and FastLUFactorization (and MKLFactorization when that exists).

It would be nice for this to have an option for what kind of matrix is generated as a function of some N, so for example it can be used to generate the matrices from the Brusselator equation for testing the sparse factorizations.
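A minimal sketch of what such a script could look like, assuming a generator function of `N` and a list of named solver closures. The names `dense_matrix`, `sparse_tridiag`, and `run_benchmark` are hypothetical, and the solvers here are standard-library stand-ins rather than the LinearSolve.jl algorithms named above; those would be wrapped as additional entries in `solvers`.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical matrix generators, each a function of the size N.
dense_matrix(N) = rand(N, N) + N * I   # diagonally dominant, so well conditioned
# Stand-in for a structured sparse problem (NOT the Brusselator Jacobian itself).
sparse_tridiag(N) = spdiagm(-1 => fill(-1.0, N - 1),
                             0 => fill(4.0, N),
                             1 => fill(-1.0, N - 1))

# Stand-in solvers from the standard library only (assumption: a real script
# would add closures that call the LinearSolve.jl algorithms instead).
solvers = [
    "lu" => (A, b) -> lu(A) \ b,
    "qr" => (A, b) -> qr(A) \ b,
]

function run_benchmark(gen, solvers, sizes; repeats = 3)
    results = NamedTuple[]
    for N in sizes
        A = gen(N)
        b = rand(N)
        for (name, f) in solvers
            f(A, b)  # warm-up call so compilation is not timed
            t = minimum(@elapsed(f(A, b)) for _ in 1:repeats)
            resid = norm(A * f(A, b) - b) / norm(b)  # sanity check on the solution
            push!(results, (; name, N, seconds = t, resid))
        end
    end
    return results
end

results = run_benchmark(dense_matrix, solvers, (8, 32, 128))
for r in results
    println(r)
end
```

Passing `sparse_tridiag` instead of `dense_matrix` exercises the sparse code paths with the same harness. Taking the minimum over a few repeats is a crude substitute for a proper benchmarking package.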

simonp0420 · 2022-07-19T03:51:47Z

This is a continuation of the benchmarking I presented in #159. There, I presented the results of running the perf/lu.jl script of the RecursiveFactorization package for a Linux desktop machine. I repeat that exercise here, after correcting a bug in the script (see this PR). The results below are for a Windows desktop machine with the following configuration:

(RecursiveFactorization) pkg> status
     Project RecursiveFactorization v0.2.11
      Status `D:\peter\Documents\julia\dev\RecursiveFactorization\Project.toml`
  [a93c6f00] DataFrames v1.3.4
  [bdcacae8] LoopVectorization v0.12.120
  [33e6dc65] MKL v0.5.0
  [f517fe37] Polyester v0.6.13
  [7792a7ef] StrideArraysCore v0.3.15
  [d5829a12] TriangularSolve v0.1.12
  [3d5dd08c] VectorizationBase v0.21.42
  [112f6efa] VegaLite v2.6.0
  [37e2e46d] LinearAlgebra

julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.22000.795]
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz:
              speed         user         nice          sys         idle          irq
       #1  3000 MHz    2431453            0     19001781    516223109     13460718  ticks
       #2  3000 MHz    5596203            0      3129484    528930375       187468  ticks
       #3  3000 MHz    3831859            0      3086890    530737312        40078  ticks
       #4  3000 MHz    3733187            0      1884296    532038578        28156  ticks
       #5  3000 MHz    2444562            0      2263937    532947562        36484  ticks
       #6  3000 MHz    2220062            0      1327734    534108265        28578  ticks
       #7  3000 MHz    2176875            0      1391343    534087843        28593  ticks
       #8  3000 MHz    2917562            0      1682937    533055546        53328  ticks

  Memory: 31.85821533203125 GB (22317.58984375 MB free)
  Uptime: 537656.0 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = runemacs.exe
  CHOCOLATEYLASTPATHUPDATE = 132198172845121191
  HOME = D:\peter\Documents
  HOMEDRIVE = C:
  HOMEPATH = \Users\peter
  MIC_LD_LIBRARY_PATH = C:\Program Files (x86)\Common Files\Intel\Shared Libraries\compiler\lib\intel64_win_mic
  PATH = C:\Program Files\ImageMagick-7.1.0-Q16-HDRI;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2020.1.216\windows\mpi\intel64\bin;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\windows\System32\OpenSSH\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\compiler;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32\compiler;C:\Program Files (x86)\Common Files\Microsoft Shared\VSA\10.0\VsaEnv;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Program Files\MiKTeX 2.9\miktex\bin\x64;C:\Windows\twain_32\MP830;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;c:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\gs\gs8.64\bin;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files\Calibre2\;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Git\cmd;C:\Program Files (x86)\Git\bin;C:\Program Files\TortoiseGit\bin;C:\ProgramData\chocolatey\bin;C:\Program Files\MATLAB\R2022a\bin;C:\Program 
Files (x86)\Calibre2\;C:\Program Files\Microsoft VS Code\bin;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\Git\cmd;C:\Program Files\nodejs\;C:\Users\peter\AppData\Local\Microsoft\WindowsApps;c:\usr\local\bin;C:\Users\peter\AppData\Local\Programs\MiKTeX 2.9\miktex\bin\x64\;C:\Users\peter\AppData\Local\GitHubDesktop\bin;C:\Users\peter\AppData\Local\Pandoc\;C:\cygwin64\usr\i686-w64-mingw32\sys-root\mingw\lib;C:\Program Files (x86)\Aspell\bin;C:\Users\peter\AppData\Local\gitkraken\bin;C:\Users\peter\AppData\Roaming\npm;C:\Users\peter\AppData\Local\Microsoft\WindowsApps
  PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.JL;.CPL
  PSMODULEPATH = D:\peter\Documents\WindowsPowerShell\Modules;C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules

First, the result of running the script after starting Julia with -t 8 and with the default OpenBLAS BLAS:

OpenBLAS performance seems incredibly bad!

Next, the 8-core result after using MKL:

Next, the OpenBLAS result after starting Julia with -t 1:

And finally, the MKL result after starting Julia with -t 1:

simonp0420 · 2022-07-19T03:55:00Z

Out of curiosity I commented out the line in the script

#BLAS.set_num_threads(nc)

and restarted Julia with -t 8 using OpenBLAS. The result is:

That didn't seem to help.

chriselrod · 2022-07-19T03:55:19Z

Ah, re what I said earlier about using RF with 1 vs multiple threads:
It doesn't benefit much from threading, but unlike OpenBLAS, Polyester won't weave a noose to hang itself with given enough threads.

OpenBLAS is dramatically faster with BLAS.set_num_threads(1) over this size range on most recent computers with many cores.
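One way to check this on a given machine is to time `lu` with one BLAS thread and with all available threads. This is only a sketch: `time_lu` is a hypothetical helper, the sizes are arbitrary, and the numbers will depend on the BLAS backend that is loaded.

```julia
using LinearAlgebra

# Time one LU factorization of an N×N matrix with a given BLAS thread count,
# restoring the previous setting afterwards.
function time_lu(N, nthreads; repeats = 5)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nthreads)
    try
        A = rand(N, N)
        lu(A)  # warm-up call so compilation is not timed
        return minimum(@elapsed(lu(A)) for _ in 1:repeats)
    finally
        BLAS.set_num_threads(old)
    end
end

for N in (50, 100, 200)
    t1 = time_lu(N, 1)
    tn = time_lu(N, Sys.CPU_THREADS)
    println("N = $N: 1 thread $(t1) s, $(Sys.CPU_THREADS) threads $(tn) s")
end
```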

ChrisRackauckas · 2022-07-19T03:56:20Z

These days, should we ever go back to OpenBLAS? There's so many cases where it's just... bad.

chriselrod · 2022-07-19T04:08:01Z

> These days, should we ever go back to OpenBLAS? There's so many cases where it's just... bad.

What do you get on your AMD machine? I'm guessing MKL is still much better there for lu?

For Apple silicon, we could use Accelerate. It does quite well with a single core:

But that single core gets to use their single matrix multiplier.
With 4 cores:

OpenBLAS actually wins.

I should double check if Accelerate can benefit from multiple cores on the M1; maybe I just didn't set it. While it has only a single matrix multiplier, there's probably a lot of other things that can be done on the cores.

This was with 4 cores on a mac mini.
I'm guessing OpenBLAS would hang itself again on an M1-max/ultra, given 8 threads.

RF wins below 100x100, at least.

@YingboMa and I should look into an algorithm better suited to threading.

ChrisRackauckas · 2022-07-19T04:09:28Z

How do you make use of Accelerate?

chriselrod · 2022-07-19T04:12:50Z

https://github.com/chriselrod/AppleAccelerateLinAlgWrapper.jl
I gave it a deliberately bad name to avoid stealing a good one, hoping someone would create a nicer package that can be used in some way other than ccall.
But we probably want to just ccall anyway without otherwise changing BLAS's behavior here.

I also didn't bother wrapping anything other than what I wanted to test (matmul, lu, ldiv, and rdiv).

ChrisRackauckas · 2022-07-19T04:14:04Z

We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit LibBLASTrampoline.

chriselrod · 2022-07-19T04:15:58Z

> We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit LibBLASTrampoline.

Currently, a problem with it is that despite hitting LibBLASTrampoline, it doesn't actually replace any existing methods, so we still need to manually use ccall to test anything.

But, yeah, we should probably just call accelerate directly to reduce the risk of something going wrong.
The trampoline may also introduce a small amount of overhead, i.e. an extra call?
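For reference, on Julia 1.7 and later the libraries that the trampoline currently forwards to can be listed with `BLAS.get_config()`. A small sketch, useful for confirming which backend a benchmark run actually used:

```julia
using LinearAlgebra

# List the libraries that libblastrampoline currently forwards to (Julia 1.7+).
for lib in BLAS.get_config().loaded_libs
    println(basename(lib.libname), " (", lib.interface, ")")
end
```

With the default setup this should print an OpenBLAS library; after loading a package that swaps the backend, such as MKL.jl, the list should change accordingly.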

simonp0420 · 2022-07-19T13:54:59Z

Why does OpenBLAS perform so much worse on Windows than on Linux?

ChrisRackauckas · 2023-10-17T11:18:06Z

Added.

ChrisRackauckas mentioned this issue Jul 19, 2022

"Fast" solvers are slow for dense, complex matrix #159

Closed

ChrisRackauckas closed this as completed Oct 17, 2023
