
Thread overhead involving FFT/BLAS on dual sockets #29

@shipengcheng1230

Hi, sorry if the title is not crystal clear. I found a problem when running computations on a cluster:

using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random

println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")

function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end

function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end

begin
    N = 64
    n = (2N-1) * N
    Random.seed!(42)
    u = rand(2N-1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N-1, N)
    K = rand(n, n)
    w = zeros(n)
    FFTW.set_num_threads(Threads.nthreads()) # give the FFT as many threads as Julia
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
    BLAS.set_num_threads(Threads.nthreads()) # give BLAS as many threads as Julia
end

println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)

println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)

println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)

Running on my local computer (2 × Intel(R) Xeon(R) Gold 6136, 2 × 12 cores, hyperthreading enabled), performance scales well even with oversubscription (both Polyester and MKL handle it; I do observe that CPU usage is always larger than the specified thread count during ODE-2).
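For context, the element-wise loops run on the Julia threads started with `-t`, while the MKL FFT and BLAS keep their own thread pools, which would explain why the CPU usage can exceed the number of Julia threads. A minimal sketch of how to inspect the pools (not part of the original script; `BLAS.get_num_threads` is available from Julia 1.6):

using LinearAlgebra

println("Julia (Polyester) threads: ", Threads.nthreads())     # set by `julia -t N`
println("BLAS threads:              ", BLAS.get_num_threads()) # set via BLAS.set_num_threads
println("Total system CPUs:         ", Sys.CPU_THREADS)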

Good Results
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl                                   pshi@discover
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl                                  pshi@discover
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl                                  pshi@discover
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl                                  pshi@discover
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  77.470 μs (2 allocations: 160 bytes)

However, running on a remote computer cluster (2 × Intel(R) Xeon(R) Gold 5220, 2 × 18 cores, hyperthreading disabled), performance deteriorates as soon as Julia is started with more threads than a single socket has cores:

Bad Results
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  96.324 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  92.357 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  148.203 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  32.706 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
  10.987 ms (2 allocations: 160 bytes) # oops!

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  25.268 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
  13.059 ms (2 allocations: 160 bytes) # oops!

Replacing all `@batch` with `@threads` shows none of this overhead, but it is also not as fast (which is why I switched to this package in the first place). The replacement is sketched below, followed by the results:
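A minimal sketch of the `@threads` variant of `ode_1` (the other two functions are changed the same way; the name `ode_1_threads` is only for illustration):

using Base.Threads

function ode_1_threads(du, u, p, t)
    # same dummy element-wise work, but scheduled with Base @threads instead of Polyester @batch
    @threads for i in eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end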

Results with `@threads`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  97.260 μs (91 allocations: 8.09 KiB)
ODE-2: FFT + EW + FFT
  182.220 μs (93 allocations: 8.25 KiB)
ODE-3: EW + BLAS
  181.990 μs (93 allocations: 8.25 KiB)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  88.000 μs (96 allocations: 8.55 KiB)
ODE-2: FFT + EW + FFT
  185.106 μs (98 allocations: 8.70 KiB)
ODE-3: EW + BLAS
  190.014 μs (99 allocations: 8.73 KiB)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  94.474 μs (101 allocations: 8.98 KiB)
ODE-2: FFT + EW + FFT
  184.741 μs (103 allocations: 9.14 KiB)
ODE-3: EW + BLAS
  190.534 μs (104 allocations: 9.17 KiB)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  107.443 μs (136 allocations: 12.11 KiB)
ODE-2: FFT + EW + FFT
  173.397 μs (138 allocations: 12.27 KiB)
ODE-3: EW + BLAS
  169.724 μs (138 allocations: 12.27 KiB)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  124.303 μs (181 allocations: 16.11 KiB)
ODE-2: FFT + EW + FFT
  203.427 μs (183 allocations: 16.27 KiB)
ODE-3: EW + BLAS
  209.808 μs (183 allocations: 16.27 KiB)

When I benchmark the FFT, the BLAS call, or those dummy element-wise operations separately on the remote cluster, there is no problem at all; only when they are mixed does it misbehave. With 18 CPUs, the CPU usage is actually close to 3500%. With more than 18 CPUs, the usage is capped at 3600%, but the overheads shown above appear. Could it be related to hyperthreading or something else (like MKL)? Thank you!
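Two diagnostics that might help narrow this down (sketches under assumptions, not part of the runs above): pinning the whole process to a single socket from the shell, and capping the FFTW/MKL and BLAS pools so that the combined thread count stays within the 36 cores instead of handing every pool all of them:

# Shell-level pinning to socket 0 (assumes numactl is available on the cluster):
#   numactl --cpunodebind=0 --membind=0 julia --check-bounds=no -O3 -t 18 ex2.jl

# Or, inside the script, split the cores between the Julia/Polyester pool and the
# library pools instead of giving every pool all CPUs (a hypothetical split, not
# the settings used for the results above):
using FFTW, LinearAlgebra
nlib = max(1, Sys.CPU_THREADS - Threads.nthreads())
FFTW.set_num_threads(nlib)
BLAS.set_num_threads(nlib)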
