Suboptimal dgemm OpenBLAS performance and dual-socket scaling #46123
Comments
> @chriselrod Are you able to reproduce this on some of our machines?

Not really on deapsea2, which I believe is 32 cores x 2 sockets:

```julia
julia> using LinearAlgebra

julia> M = K = N = 16_000;

julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);

julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
996.8121355318401

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1052.3059335030646

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1158.2279430700944

julia> BLAS.set_num_threads(64);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1424.0752368319172

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1597.7102493574532

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1599.5668391744443

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1566.9680236220754

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1688.7183359779663

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1694.5279138377305

julia> BLAS.set_num_threads(32);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1168.313201204457

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1151.4661029505437

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1112.8236207523348
```

MKL and OpenBLAS both show about the same speedup. On deapsea2, Octavian thinks we only have 32 cores, and thus refuses to use more than 32 threads.

```julia
julia> using Hwloc

julia> Hwloc.num_numa_nodes()
8

julia> Hwloc.num_physical_cores()
64

julia> Hwloc.num_packages()
2
```

If I recall correctly, we no longer need to support WINE, so I think switching back would be okay. CpuId seems to figure out we have 32 cores per socket:

```julia
julia> using CpuId, Octavian

julia> CpuId.cpucores()
32

julia> Octavian.num_cores()
static(32)
```

but doesn't realize we have two of them:

```julia
julia> CpuId.cpunodes()
1

julia> CpuId.cpucores_total()
32
```

BTW, @carstenbauer also gave me access to the cluster he's running on, but I've been procrastinating on actually signing on and poking around.
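The throughput figures above all use the `2e-9M*K*N/@elapsed(...)` idiom: a dgemm performs M·K·N fused multiply-adds, i.e. 2·M·K·N flops, and dividing by wall time with a 1e-9 scale gives GFLOPS. A small helper makes the idiom reusable (a sketch; the names `gflops` and `bench_mul!` are mine, not from the thread):

```julia
using LinearAlgebra

# GFLOPS for C = A*B with A of size M×K and B of size K×N:
# a matrix multiply does 2*M*K*N flops (M*K*N multiply-adds).
gflops(M, K, N, seconds) = 2e-9 * M * K * N / seconds

# Time one in-place multiply and report its throughput in GFLOPS.
function bench_mul!(C, A, B)
    t = @elapsed mul!(C, A, B)
    return gflops(size(A, 1), size(A, 2), size(B, 2), t)
end
```

Note that a single `@elapsed` sample includes any first-call latency, which is why the first timing in each block above is slower than the later ones.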
```julia
julia> using LinearAlgebra

julia> BLAS.get_num_threads()
64

julia> M = K = N = 16_000;

julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);

julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
982.1094539637073

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1016.5287429664286

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
993.1180919251972

julia> BLAS.set_num_threads(128);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1253.0140262394689

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1224.9438668463697

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1216.3002770861356

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1299.188240131717

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1383.8168443837112

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1309.8513547840373

julia> BLAS.set_num_threads(128);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1340.979151645549

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1308.8692178797796

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1325.0394458234089

julia> BLAS.get_num_threads()
64
```

This time, CpuId gets it right, and Hwloc gets it wrong???

```julia
julia> using Hwloc

julia> Hwloc.num_physical_cores()
64

julia> Hwloc.num_packages()
1

julia> using CpuId

julia> CpuId.cpucores()
64

julia> CpuId.cpucores_total()
128
```
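Since CpuId and Hwloc disagree about the socket count on both machines, it can help to cross-check the topology outside Julia. On Linux, `lscpu` (from util-linux) reports sockets, cores per socket, and NUMA nodes directly (a sketch; the exact field names vary slightly between versions):

```shell
# Summarize the CPU topology as the kernel sees it:
# "Socket(s)" is the package count, "NUMA node(s)" the NUMA domains.
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|NUMA node\(s\)'
```

On a dual-socket 32-core machine with SMT disabled this should show `Socket(s): 2` and `Core(s) per socket: 32`, which would settle which Julia package has it right.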
FWIW, I ran (some of) the same benchmarks on a qualitatively different system (Noctua 2, AMD Milan Zen 3, 2x 64 cores) and also found that the default OpenBLAS doesn't scale. See https://github.com/carstenbauer/julia-dgemm-noctua#noctua-2. (Chris has access to both Noctua 1 and Noctua 2.) (Might be able to add results from clusters at other HPC centers soon.)
@chriselrod Which system is this run on? If it's a Noctua 2 login node, be aware that the login nodes on Noctua 2 are special in that they only have 1x64 physical cores but HT enabled (i.e. 128 virtual cores). OTOH, compute nodes have 2x64 physical cores and HT disabled (i.e. 128 physical == virtual cores). Also, for completeness, which Julia version and how many Julia threads are you using?
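A quick way to report all of the requested details in one go is (a sketch; `versioninfo` lives in the InteractiveUtils stdlib, which the REPL loads automatically but a script must import):

```julia
using InteractiveUtils, LinearAlgebra

versioninfo()  # Julia version, OS, CPU model, word size
println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())
```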
From carstenbauer/julia-dgemm-noctua#2. (cc @ViralBShah)
I benchmarked a simple dgemm call (i.e. `mul!(C,A,B)`) on Noctua 1 (single and dual-socket Intel Xeon Gold "Skylake" 6148 20-core CPUs) for multiple BLAS libraries (called from Julia using LBT). Here are the benchmark results:
The custom OpenBLAS has been compiled with
Primary observations/conclusions:

`using OpenBLAS_jll` and `BLAS.lbt_forward(...; clear=true)`? (It's still worse than a custom build of OpenBLAS though.)

I hope we can improve the default OpenBLAS performance and scaling.