Threaded benchmarks #1861

Closed
hennyg888 opened this issue Jul 15, 2021 · 27 comments

Labels: performance 🏍️ So we can get the wrong answer even faster

Comments

@hennyg888
Collaborator

I recently ran some benchmarks on threading for Oceananigans based on scripts added by @francispoulin in an older branch.
https://github.com/CliMA/Oceananigans.jl/blob/fjp/multithreaded-benchmarks/benchmark/weak_scaling_shallow_water_model_threaded.jl
https://github.com/CliMA/Oceananigans.jl/blob/fjp/multithreaded-benchmarks/benchmark/weak_scaling_shallow_water_model_serial.jl
Aside from the benchmark scripts themselves, everything was up to date with the latest version of master.

Here are the results:

Oceananigans v0.58.8
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  EBVERSIONJULIA = 1.6.1
  JULIA_DEPOT_PATH = :
  EBROOTJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.1
  EBDEVELJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.1/easybuild/avx2-Core-julia-1.6.1-easybuild-devel
  JULIA_LOAD_PATH = :

                  Shallow water model weak scaling with multithreading benchmark
┌───────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────────┬─────────┬─────────┐
│          size │ threads │     min │  median │    mean │     max │    memory │  allocs │ samples │
├───────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼─────────┤
│   (8192, 512) │       1 │ 1.453 s │ 1.454 s │ 1.454 s │ 1.456 s │  1.37 MiB │    2318 │       4 │
│  (8192, 1024) │       2 │ 2.909 s │ 2.933 s │ 2.933 s │ 2.956 s │ 21.52 MiB │ 1303192 │       2 │
│  (8192, 2048) │       4 │ 2.096 s │ 2.115 s │ 2.125 s │ 2.165 s │ 16.38 MiB │  942343 │       3 │
│  (8192, 4096) │       8 │ 2.178 s │ 2.198 s │ 2.218 s │ 2.280 s │ 17.82 MiB │  987092 │       3 │
│  (8192, 8192) │      16 │ 2.201 s │ 2.218 s │ 2.216 s │ 2.230 s │ 18.33 MiB │  922426 │       3 │
│ (8192, 16384) │      32 │ 2.598 s │ 2.615 s │ 2.615 s │ 2.632 s │ 24.29 MiB │ 1116849 │       2 │
└───────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────────┴─────────┴─────────┘

        Shallow water model weak multithreading scaling speedup
┌───────────────┬─────────┬──────────┬────────────┬─────────┬─────────┐
│          size │ threads │ slowdown │ efficiency │  memory │  allocs │
├───────────────┼─────────┼──────────┼────────────┼─────────┼─────────┤
│   (8192, 512) │       1 │      1.0 │        1.0 │     1.0 │     1.0 │
│  (8192, 1024) │       2 │  2.01669 │   0.495862 │ 15.7412 │ 562.205 │
│  (8192, 2048) │       4 │  1.45397 │   0.687771 │ 11.9861 │ 406.533 │
│  (8192, 4096) │       8 │  1.51106 │   0.661786 │ 13.0337 │ 425.838 │
│  (8192, 8192) │      16 │  1.52536 │   0.655582 │ 13.4078 │  397.94 │
│ (8192, 16384) │      32 │  1.79793 │   0.556195 │ 17.7701 │ 481.816 │
└───────────────┴─────────┴──────────┴────────────┴─────────┴─────────┘

They're not terrific, but they're decent. I ran these on 32 CPUs, so I assume one thread per CPU up to 32 threads. The slight increase in efficiency going from 2 to 4 threads is likely a fixed threading overhead being overcome by the actual efficiency gains of multithreading.
@christophernhill @glwagner is there anything we can do to improve multithreading efficiency for Oceananigans? It's probably not as simple as adding @threads in front of the main for loops, but with just a little improvement, multithreading efficiency might match MPI efficiency.
As it is, multithreading is already a worthwhile option for achieving speedups on systems with multiple CPUs but no MPI.

So far I've only run the scripts on one node, up to 32 threads and CPUs. I'll update this issue with results from running on multiple nodes, going up to 64 or maybe 128 CPUs, to see whether efficiency is affected when going from one node to more.
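
For reference, the slowdown and efficiency columns in the speedup table are derived from the median times; a minimal sketch of the convention (not the exact code in the benchmark utilities):

# Weak scaling: the problem grows with the thread count, so ideally the runtime
# stays flat. Slowdown is t_N / t_1 and efficiency is its inverse.
median_times = [1.454, 2.933, 2.115, 2.198, 2.218, 2.615]  # seconds, from the table above
threads      = [1, 2, 4, 8, 16, 32]

slowdowns    = median_times ./ median_times[1]
efficiencies = 1 ./ slowdowns

for (n, s, e) in zip(threads, slowdowns, efficiencies)
    println("threads = $n, slowdown = $(round(s, digits=3)), efficiency = $(round(e, digits=3))")
end

(These reproduce the speedup table up to rounding of the displayed medians.)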

@vchuravy
Collaborator

It would be good to do some profiling (probably with a system profiler like perf) to understand where the time is spent. The kernels using KernelAbstractions are automatically multi-threaded.
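
For a quick first pass before reaching for perf, Julia's built-in sampling profiler can be run on a threaded toy loop like the sketch below (work! here is illustrative, not an Oceananigans kernel; running the real benchmark script under perf record -g julia --threads=32 ... would give the system-level view):

using Profile

function work!(a)
    Threads.@threads for i in eachindex(a)
        @inbounds a[i] = sin(a[i]) + cos(a[i])
    end
    return a
end

a = rand(10^7)
work!(a)                                 # warm up so compilation doesn't dominate the profile
@profile for _ in 1:100; work!(a); end
Profile.print(format = :flat, sortedby = :count)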

@glwagner
Member

glwagner commented Jul 15, 2021

To fill in a few more details for @hennyg888 --- almost all multithreading in Oceananigans is achieved via KernelAbstractions.jl. Improving efficiency for Oceananigans kernels likely means contributing to KernelAbstractions.jl (which @vchuravy may or may not be excited about :-D).

More specifically, all tendency evaluations, non-communicative / non-periodic halo fills (periodic halo filling uses Base broadcasting and thus is not parallelized), integrals (like the hydrostatic pressure integral, or vertical velocity computation in HydrostaticFreeSurfaceModel), evaluation of diagnostics, and broadcasting with fields all use KernelAbstractions via the Oceananigans function launch!:

function launch!(arch, grid, dims, kernel!, args...;
                 dependencies = nothing,
                 include_right_boundaries = false,
                 reduced_dimensions = (),
                 location = nothing,
                 kwargs...)

    workgroup, worksize = work_layout(grid, dims,
                                      include_right_boundaries = include_right_boundaries,
                                      reduced_dimensions = reduced_dimensions,
                                      location = location)

    loop! = kernel!(Architectures.device(arch), workgroup, worksize)

    @debug "Launching kernel $kernel! with worksize $worksize"

    event = loop!(args...; dependencies=dependencies, kwargs...)

    return event
end

The line

event = loop!(args...; dependencies=dependencies, kwargs...)

launches a kernel, using KernelAbstractions syntax. event is a token that can be "waited" on if we need to.
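
For anyone following along, a minimal standalone KernelAbstractions kernel (using the event-based API of the KernelAbstractions version in use at the time) looks roughly like this; on the CPU the work items are distributed over Julia's threads:

using KernelAbstractions

# Trivial kernel: each work item scales one entry of the array.
@kernel function scale!(a, s)
    i = @index(Global)
    @inbounds a[i] *= s
end

a = ones(1024)
kernel! = scale!(CPU(), 16)                    # CPU device, workgroup size 16
event = kernel!(a, 2.0, ndrange = length(a))   # returns an event, like loop! above
wait(event)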

So either we can improve multithreading by changing what happens when loop! is called --- or, possibly, by refining the dependency tree so that we can launch more kernels simultaneously. The second optimization is probably more important for small problems. You have mostly benchmarked fairly large problems, so I don't think we'd see much speedup for them. But I'm not 100% sure.
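
A sketch of what refining the dependency tree buys us, in the same API: two kernels with no mutual dependencies can be in flight at once, and only the code that needs both results waits on the combined event (fill_with! is illustrative, not an Oceananigans kernel):

using KernelAbstractions

@kernel function fill_with!(a, v)
    i = @index(Global)
    @inbounds a[i] = v
end

a, b = zeros(10^6), zeros(10^6)
kernel! = fill_with!(CPU(), 64)

event_a = kernel!(a, 1.0, ndrange = length(a))   # no dependencies: launch immediately
event_b = kernel!(b, 2.0, ndrange = length(b))   # independent of event_a
wait(KernelAbstractions.MultiEvent((event_a, event_b)))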

@christophernhill
Member

@hennyg888 do you have benchmarks for the same problems using MPI instead of multithreading, on the same CPU (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz)?

@hennyg888
Collaborator Author

For MPI I ran it on up to 128 Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz CPUs, with efficiencies around 80%. I think I have some results for the MPI weak and strong scaling benchmarks posted at the bottom of #1722.

@francispoulin
Collaborator

Thanks everyone for your feedback.

@vchuravy, great to know that multi-threading is built in!

I agree that profiling would be a good way to determine why the efficiency is not great. I have not used perf, but we can look into it.

Also, do you know of any benchmarking others have done with KernelAbstractions on threads that we could look at for comparison?

@christophernhill
Member

@hennyg888 and @francispoulin the results in #1722 look like they may be for an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz? (And a different Julia version: 1.6.0 vs. 1.6.1.)

I'm not sure how precise we want to be about what we compare with what, but it could be informative to have comparisons where only one thing is changed at a time, if that is possible, i.e. everything run on the Intel(R) Xeon(R) Platinum 8260 CPU with the same problem and problem size, and only threading vs. MPI different. We could also compare across CPUs and across Julia versions, but not all at the same time.

@francispoulin
Collaborator

I'm not sure why we have Julia v1.6.1, but we should be able to redo the results with Julia 1.6.0, since that's what we have on the servers.

When we do runs over hundreds of CPUs, I don't know that we will get CPUs that are all the same. Unfortunately, I don't see an easy fix for that.

@christophernhill
Member

@francispoulin (and @hennyg888 ) no worries. We can use what we have too.

I think both these tests ( #1861 and #1722 ) are on a single CPU (just lots of cores)?

@francispoulin
Collaborator

Sorry, I was thinking of the MPI tests (since that's what I'm looking at for the slides right now).

I agree that for one CPU vs one GPU, it would be nice to use the same CPU and GPU in the different tests. I know we can specify the GPU type in the SLURM script. Maybe we can do the same for the CPU?

@christophernhill
Member

@francispoulin and @hennyg888 do you think a metric of "number of points per second" would be useful? In general that would be Nx*Ny*Nz*Nt / t_bench. That could be a way to compare 1 GPU with 128 CPU cores on the same model but with different problem sizes?
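
As a sketch, with purely illustrative numbers:

# "Grid points advanced per wall-clock second": Nx*Ny*Nz*Nt / t_bench.
points_per_second(Nx, Ny, Nz, Nt, t_bench) = Nx * Ny * Nz * Nt / t_bench

points_per_second(256, 256, 256, 10, 2.5)   # e.g. a 256^3 grid, 10 steps, 2.5 s → ≈ 6.7e7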

@vchuravy
Collaborator

Also, do you know of any benchmarking others have done with KernelAbstractions on threads that we could look at for comparison?

I did some benchmarks in the beginning, but mostly focused on strong scaling.

@francispoulin
Collaborator

Interesting idea @christophernhill . For the last results that @hennyg888 posted in #1722, I did some calculations and found the following.

GPU
N=256    3.0e9
N=128    2.6e9
N=64     6.6e8

CPU
N=256      8.6e6
N=128      9.1e6
N=64       9.0e6

In an article that @ali-ramadhan referenced on the Slack channel recently, a paper on a shallow water model in Python, Roullet and Gaillard (2021) say they get 2 TFlops using a thousand cores. We are getting 3 GigaFlops on the GPU and 9 MegaFlops on the CPU.

It's certainly a very good speedup, since we see O(400) with WENO5, but this makes me wonder whether we could do better?

But to answer your question, when @hennyg888 has the data, we can certainly produce these plots easily enough (unless there is a problem that I'm missing).

@francispoulin
Collaborator

Also, do you know of any benchmarking others have done with KernelAbstractions on threads that we could look at for comparison?

I did some benchmarks in the beginning, but mostly focused on strong scaling.

Thanks for the information. Can you point me to where some of these results might be found?

@hennyg888
Collaborator Author

@christophernhill @francispoulin I ran the threaded benchmarks up to 32 threads on 32 cores with Julia 1.6.0, on the same CPUs the MPI benchmarks used. That makes sense, since they're all benchmarking parallel-computing efficiency.

Oceananigans v0.58.9
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)
Environment:
  EBVERSIONJULIA = 1.6.0
  JULIA_DEPOT_PATH = :
  EBROOTJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0
  EBDEVELJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0/easybuild/avx2-Core-julia-1.6.0-easybuild-devel
  JULIA_LOAD_PATH = :

                  Shallow water model weak scaling with multithreading benchmark
┌───────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────────┬─────────┬─────────┐
│          size │ threads │     min │  median │    mean │     max │    memory │  allocs │ samples │
├───────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼─────────┤
│   (8192, 512) │       1 │ 1.458 s │ 1.458 s │ 1.458 s │ 1.458 s │  1.37 MiB │    2318 │       4 │
│  (8192, 1024) │       2 │ 2.925 s │ 2.989 s │ 2.989 s │ 3.052 s │ 18.06 MiB │ 1076944 │       2 │
│  (8192, 2048) │       4 │ 2.296 s │ 2.381 s │ 2.397 s │ 2.515 s │ 13.60 MiB │  760190 │       3 │
│  (8192, 4096) │       8 │ 2.347 s │ 2.369 s │ 2.377 s │ 2.415 s │ 16.36 MiB │  891860 │       3 │
│  (8192, 8192) │      16 │ 2.407 s │ 2.548 s │ 2.517 s │ 2.595 s │ 17.44 MiB │  863941 │       3 │
│ (8192, 16384) │      32 │ 3.023 s │ 3.069 s │ 3.069 s │ 3.115 s │ 23.03 MiB │ 1034063 │       2 │
└───────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────────┴─────────┴─────────┘

        Shallow water model weak multithreading scaling speedup
┌───────────────┬─────────┬──────────┬────────────┬─────────┬─────────┐
│          size │ threads │ slowdown │ efficiency │  memory │  allocs │
├───────────────┼─────────┼──────────┼────────────┼─────────┼─────────┤
│   (8192, 512) │       1 │      1.0 │        1.0 │     1.0 │     1.0 │
│  (8192, 1024) │       2 │  2.04972 │   0.487872 │ 13.2156 │ 464.601 │
│  (8192, 2048) │       4 │  1.63302 │   0.612363 │ 9.95278 │ 327.951 │
│  (8192, 4096) │       8 │  1.62507 │   0.615359 │ 11.9706 │ 384.754 │
│  (8192, 8192) │      16 │  1.74747 │   0.572257 │  12.755 │  372.71 │
│ (8192, 16384) │      32 │  2.10486 │    0.47509 │  16.846 │ 446.101 │
└───────────────┴─────────┴──────────┴────────────┴─────────┴─────────┘

Also, after reviewing the new benchmarks and comparing them to the old benchmarks currently displayed in benchmarks.md, it seems that all the CPU vs. GPU benchmarks use one type of CPU, while all the MPI and threaded benchmarks use another.
The MPI and threaded benchmarks use the Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, while every other benchmark, including the CPU-to-GPU speedup benchmarks, uses the Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz.
This is mainly because the speedup benchmarks need a single faster CPU while the MPI benchmarks need many CPUs, and the exact types worked out this way on the cluster.

@glwagner glwagner added the performance 🏍️ So we can get the wrong answer even faster label Jul 17, 2021
@glwagner
Member

Interesting idea @christophernhill . For the last results that @hennyg888 posted in #1722, I did some calculations and found the following.

GPU
N=256    3.0e9
N=128    2.6e9
N=64     6.6e8

CPU
N=256      8.6e6
N=128      9.1e6
N=64       9.0e6

In an article that @ali-ramadhan referenced on the Slack channel recently, a paper on a shallow water model in Python, Roullet and Gaillard (2021) say they get 2 TFlops using a thousand cores. We are getting 3 GigaFlops on the GPU and 9 MegaFlops on the CPU.

It's certainly a very good speedup, since we see O(400) with WENO5, but this makes me wonder whether we could do better?

But to answer your question, when @hennyg888 has the data, we can certainly produce these plots easily enough (unless there is a problem that I'm missing).

We have to do more work to compare with Roullet and Gaillard (2021). First of all, there are typos in the paper: sometimes the performance is listed as 2 GFlops, other times as 2 TFlops. Second --- if I understand the situation correctly --- I don't think we've ever measured floating point operations per second. The numbers you've calculated are grid points per second; however we do many floating point operations per grid point. Roullet and Gaillard (2021) estimate their code performs something like 700-800 Flops per grid point.

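In other words, converting grid points per second into a Flop rate requires an estimate of Flops per grid point, which we have not measured for Oceananigans. A sketch of the conversion (750 is just the midpoint of the range quoted from the paper, not an Oceananigans number):

# Grid-point throughput -> estimated Flop rate, given an estimate of Flops per grid point.
estimated_flop_rate(points_per_second, flops_per_point) = points_per_second * flops_per_point

# Purely illustrative: if Oceananigans did ~750 Flops per point, 3.0e9 points/s
# would correspond to about 2.3e12 Flops, i.e. ~2 TFlops.
estimated_flop_rate(3.0e9, 750)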

@christophernhill
Member

christophernhill commented Jul 18, 2021

@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point.

P.S. 84% of CPU peak seems abnormally high; dense matrix-matrix multiplication typically maxes out at about 80%.

@francispoulin
Collaborator

Good points, @glwagner. The numbers that I posted are probably best ignored for now. I imagine this should come up in another issue when we look at the efficiency of the calculations in general. Focusing on threading in this issue seems best.

@glwagner
Member

@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point.

P.S. 84% of CPU peak seems abnormally high; dense matrix-matrix multiplication typically maxes out at about 80%.

Sounds like a letter to the editor. :-P

@glwagner
Member

I put together some utilities for testing multithreading with KernelAbstractions versus Base.Threads for a simple kernel:

https://github.com/glwagner/multithreaded-stencils

I've used a new repo because it might be worthwhile to test threaded computations in other programming languages.
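
The comparison boils down to timing a simple stencil under each paradigm; an illustrative Base.Threads version of such a kernel (not the exact code in that repo) is:

using BenchmarkTools

# 1D three-point stencil, threaded with Base.Threads.
function stencil_threads!(out, a)
    Threads.@threads for i in 2:length(a)-1
        @inbounds out[i] = a[i-1] - 2a[i] + a[i+1]
    end
    return out
end

a   = rand(2^20)
out = similar(a)
@btime stencil_threads!($out, $a)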

@christophernhill
Member

@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point.
P.S. 84% of CPU peak seems abnormally high; dense matrix-matrix multiplication typically maxes out at about 80%.

Sounds like a letter to the editor. :-P

Could it be they meant 84% of the memory-bandwidth-limited peak? It isn't crazy to get 84% of memory bandwidth, but that then gives a very low % of peak flops. I haven't read the article, I guess I should!
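
One quick sanity check along those lines is to estimate the achieved bandwidth from the bytes moved per grid point and compare it with the machine's peak; all numbers below are illustrative, not measurements from this thread:

# Achieved bandwidth in GB/s from grid points, time steps, bytes moved per point, and runtime.
achieved_bandwidth_GBs(points, steps, bytes_per_point, seconds) =
    points * steps * bytes_per_point / seconds / 1e9

bw = achieved_bandwidth_GBs(512^3, 10, 200, 5.0)   # ≈ 54 GB/s with these made-up inputs
bw / 130                                           # fraction of an (illustrative) 130 GB/s peak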

@francispoulin
Collaborator

Very nice work @glwagner , and thanks for making this. Lots of good stuff here.

In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent.

Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speed up.

@glwagner
Member

glwagner commented Jul 24, 2021

Very nice work @glwagner , and thanks for making this. Lots of good stuff here.

In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent.

Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speed up.

Ah, this machine has 48 cores. Since threading has an overhead cost, we expect saturation at some point. It's surprising that this happens at just 16 cores for such a large problem (512^3) though.

We can calculate more metrics for sure.

I think it would be worthwhile to investigate whether other threading paradigms scale differently for the same problem. Numba + parallel accelerator might be a good test case. @hennyg888 would you be interested in that?

Here are some docs:

https://numba.pydata.org/numba-doc/latest/user/parallel.html

@francispoulin
Collaborator

I agree that I would expect it to saturate at more than 16 threads given 48 cores, but clearly I'm wrong.

Getting another benchmark would be a good idea. I'm happy to consider the Numba + parallel idea, since that would be a good test of the architecture. This mini-course did give some threaded examples for solving the diffusion equation in 3D. I wonder if we might want to ask Ludovic whether they have done any multi-threading scaling studies?

I'm happy to discuss this with @hennyg888 on Monday and see what we come up with. Others are happy to join the discussion if they like.

@francispoulin
Collaborator

Below is a link to a paper that compares the scalability of multi-threading in Python, Julia and Chapel.

Brief summary: they find that none of them do as well as OpenMP, and give some reasons why. They do see some improvement going up to 64 threads, but the efficiency in some cases drops to 20%. It seems that Python might do better at low thread counts while Julia does better at higher ones. This was last year, so it should probably be redone.

Also, I should mention that I don't believe their problem is much like ours, but it's an example and has some figures, so that's nice to see.

https://hal.inria.fr/hal-02879767/document

@christophernhill
Member

Very nice work @glwagner , and thanks for making this. Lots of good stuff here.
In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent.
Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speed up.

Ah, this machine has 48 cores. Since threading has an overhead cost, we expect saturation at some point. It's surprising that this happens at just 16 cores for such a large problem (512^3) though.

We can calculate more metrics for sure.

I think it would be worthwhile to investigate whether other threading paradigms scale differently for the same problem. Numba + parallel accelerator might be a good test case. @hennyg888 would you be interested in that?

Here are some docs:

https://numba.pydata.org/numba-doc/latest/user/parallel.html

You run out of memory bandwidth at some point, usually before you manage to saturate all the cores for something like diffusion. So some of the 16-thread drop-off could be that.

I guess we could get even more minimalist and check a multi-threaded STREAM benchmark to see that?

@francispoulin
Collaborator

I am open to trying whatever simple example you suggest @christophernhill , but I'm not sure what you mean by stream benchmark. Sorry.

@glwagner
Member

Thanks @hennyg888 for the benchmarks!

@CliMA CliMA locked and limited conversation to collaborators Mar 22, 2023
@glwagner glwagner converted this issue into discussion #3007 Mar 22, 2023
