Suboptimal dgemm OpenBLAS performance and dual-socket scaling #46123

Open
carstenbauer opened this issue Jul 21, 2022 · 5 comments
Labels: domain:linear algebra, performance

Comments

@carstenbauer
Member

carstenbauer commented Jul 21, 2022

From carstenbauer/julia-dgemm-noctua#2. (cc @ViralBShah)

I benchmarked a simple dgemm call (i.e. mul!(C,A,B)) on Noctua 1 (single and dual-socket Intel Xeon Gold 6148 "Skylake" 20-core CPUs) for multiple BLAS libraries (called from Julia via LBT):

  • 20 cores -> single-socket
  • 40 cores -> dual-socket (full node)
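
Roughly, the timed kernel looks like this (a minimal sketch, not the exact benchmark script; the full setup is in the linked repo):

using LinearAlgebra

BLAS.set_num_threads(20)    # 20 cores = single socket, 40 cores = full node
N = 10_240
A = randn(N, N); B = randn(N, N); C = zeros(N, N)
mul!(C, A, B)               # warm-up
t = @elapsed mul!(C, A, B)
gflops = 2e-9 * N^3 / t     # dgemm performs 2*N^3 floating-point operations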

Here are the benchmark results:

BLAS                               # cores   size   GFLOPS
---------------------------------  --------  -----  ------
Intel MKL v2022.0.0 (JLL)          40 cores  10240    2081
Intel MKL v2022.0.0 (JLL)          20 cores  10240    1054
BLIS 0.9.0 (JLL)                   40 cores  10240    1890
BLIS 0.9.0 (JLL)                   20 cores  10240     990
Octavian 0.3.15                    40 cores  10240    1053
Octavian 0.3.15                    20 cores  10240    1016
OpenBLAS (shipped with Julia 1.8)  40 cores  10240    1092
OpenBLAS (shipped with Julia 1.8)  20 cores  10240    1063
---------------------------------  --------  -----  ------
OpenBLAS 0.3.17 (custom)           40 cores  10240    1908
OpenBLAS 0.3.17 (custom)           20 cores  10240    1439
OpenBLAS 0.3.20 (custom)           40 cores  10240    1897
OpenBLAS 0.3.20 (custom)           20 cores  10240    1444
---------------------------------  --------  -----  ------
OpenBLAS 0.3.17 (JLL)              40 cores  10240    1437
OpenBLAS 0.3.17 (JLL)              20 cores  10240    1124
OpenBLAS 0.3.20 (JLL)              40 cores  10240    1535
OpenBLAS 0.3.20 (JLL)              20 cores  10240    1185

The custom OpenBLAS has been compiled with

make INTERFACE64=1 USE_THREAD=1 NO_AFFINITY=1 GEMM_MULTITHREADING_THRESHOLD=50 NO_STATIC=1 BINARY=64
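
For reference, one way to point Julia's LBT at such a custom build (a sketch; the library path is a placeholder and depends on the install prefix and symbol suffix of the build):

using LinearAlgebra

# Placeholder path to the custom build; adjust to your installation.
BLAS.lbt_forward("/path/to/custom/libopenblas.so"; clear=true)
BLAS.get_config()   # should now list the custom library as the active backend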

Primary observations/conclusions:

  • MKL and BLIS (through MKL.jl and BLISBLAS.jl) scale reasonably well from single to dual-socket, but the OpenBLAS shipped with Julia 1.8 doesn't scale at all. (Octavian also doesn't scale, see Dual-socket support JuliaLinearAlgebra/Octavian.jl#151.)
  • A custom build of OpenBLAS shows the best single-socket performance overall and scales reasonably well. So it is not just that OpenBLAS is inferior to MKL/BLIS; perhaps we use suboptimal build options?
  • What is particularly curious is that manually doing using OpenBLAS_jll (0.3.17 and 0.3.20) leads to strictly better performance (both in absolute numbers and in scaling) than the default/shipped OpenBLAS. How is the default integration of OpenBLAS_jll different from just manually doing using OpenBLAS_jll and BLAS.lbt_forward(...; clear=true) (see the sketch after this list)? (It's still worse than a custom build of OpenBLAS, though.)
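
That manual forwarding is roughly the following (a sketch; it assumes OpenBLAS_jll exposes the usual libopenblas_path variable):

using LinearAlgebra, OpenBLAS_jll

# Forward all LBT symbols to the JLL-provided library instead of the copy
# that Julia loaded at startup.
BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear=true)
BLAS.get_config()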

I hope we can improve the default OpenBLAS performance and scaling.

@ViralBShah
Member

@chriselrod Are you able to reproduce this on some of our machines?

@chriselrod
Contributor

chriselrod commented Jul 22, 2022

@chriselrod Are you able to reproduce this on some of our machines?

Not really on deapsea2, which I believe is 32 cores x 2 sockets:

julia> using LinearAlgebra

julia> M = K = N = 16_000;

julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);

julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
996.8121355318401

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1052.3059335030646

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1158.2279430700944

julia> BLAS.set_num_threads(64);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1424.0752368319172

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1597.7102493574532

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1599.5668391744443

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1566.9680236220754

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1688.7183359779663

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1694.5279138377305

julia> BLAS.set_num_threads(32);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1168.313201204457

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1151.4661029505437

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1112.8236207523348

MKL and OpenBLAS both show about the same speedup.
I used Threads.@threads to initialize the arrays because of first-touch NUMA placement (see the sketch below).
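
(Background, as a rough sketch not tied to this machine: on a NUMA system each memory page lands on the node of the thread that first writes to it, so a serial fill tends to put the whole matrix on one socket, while a threaded fill spreads it across the nodes the Julia threads run on.)

using Base.Threads

M = K = 16_000
A = Matrix{Float64}(undef, M, K)   # reserves address space; pages are not yet mapped

# Serial fill: the main thread touches every page first, so the matrix tends
# to end up on a single NUMA node.
# for i in eachindex(A); A[i] = randn(); end

# Threaded fill: each thread touches its own chunk, spreading pages across
# the NUMA nodes the threads run on.
@threads for i in eachindex(A)
    @inbounds A[i] = randn()
end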

On deapsea2, Octavian thinks we only have 32 cores, and thus refuses to use more than 32 threads.
Switching back to hwloc instead of CpuId would fix this:

julia> using Hwloc

julia> Hwloc.num_numa_nodes()
8

julia> Hwloc.num_physical_cores()
64

julia> Hwloc.num_packages()
2

If I recall correctly, we no longer need to support WINE, so I think switching back would be okay.
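
(A rough sketch of what the Hwloc-based detection could look like; hypothetical, not Octavian's actual code:)

using Hwloc

# Hypothetical topology query for sizing a thread pool: count physical cores
# across all packages from the hwloc topology, rather than per-socket cores from cpuid.
total_physical_cores() = Hwloc.num_physical_cores()                  # 64 here (2 packages x 32 cores)
max_threads() = min(Base.Threads.nthreads(), total_physical_cores())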

CpuId seems to figure out that we have 32 cores per socket:

julia> using CpuId, Octavian

julia> CpuId.cpucores()
32

julia> Octavian.num_cores()
static(32)

but doesn't realize that we have two of them:

julia> CpuId.cpunodes()
1

julia> CpuId.cpucores_total()
32

BTW, @carstenbauer also gave me access to the cluster he's running on, but I've been procrastinating on actually signing on and poking around.

@chriselrod
Contributor

julia> using LinearAlgebra

julia> BLAS.get_num_threads()
64

julia> M = K = N = 16_000;

julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);

julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
982.1094539637073

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1016.5287429664286

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
993.1180919251972

julia> BLAS.set_num_threads(128);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1253.0140262394689

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1224.9438668463697

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1216.3002770861356

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1299.188240131717

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1383.8168443837112

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1309.8513547840373

julia> BLAS.set_num_threads(128);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1340.979151645549

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1308.8692178797796

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1325.0394458234089

julia> BLAS.get_num_threads()
64

This time, CpuId gets it right, and Hwloc gets it wrong???

julia> using Hwloc

julia> Hwloc.num_physical_cores()
64

julia> Hwloc.num_packages()
1

julia> using CpuId

julia> CpuId.cpucores()
64

julia> CpuId.cpucores_total()
128

@carstenbauer
Member Author

carstenbauer commented Jul 22, 2022

FWIW, I ran (some of) the same benchmarks on a qualitatively different system (Noctua 2, AMD Milan Zen 3, 2x 64 cores) and also found that the default OpenBLAS doesn't scale. See https://github.com/carstenbauer/julia-dgemm-noctua#noctua-2. (Chris has access to both Noctua 1 and Noctua 2.)

(I might be able to add results from clusters at other HPC centers soon.)

@carstenbauer
Member Author

carstenbauer commented Jul 22, 2022

@chriselrod Which system was this run on? If it's a Noctua 2 login node, be aware that the Noctua 2 login nodes are special: they only have 1x64 physical cores but HT enabled (i.e. 128 virtual cores). Compute nodes, on the other hand, have 2x64 physical cores and HT disabled (i.e. 128 physical == virtual cores).

Also, for completeness, which Julia version and how many Julia threads are you using?
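
(E.g. the output of the following would cover both, plus which backend LBT is currently forwarding to:)

using InteractiveUtils, LinearAlgebra

versioninfo()            # Julia version, OS, CPU
Threads.nthreads()       # number of Julia threads
BLAS.get_num_threads()   # number of BLAS threads
BLAS.get_config()        # active LBT backend(s)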
