Threaded benchmarks #1861

Closed
hennyg888 opened this issue Jul 15, 2021 · 27 comments

Labels: performance 🏍️ So we can get the wrong answer even faster

Comments

@hennyg888
Collaborator

I recently ran some benchmarks on threading for Oceananigans based on scripts added by @francispoulin in an older branch.
https://github.com/CliMA/Oceananigans.jl/blob/fjp/multithreaded-benchmarks/benchmark/weak_scaling_shallow_water_model_threaded.jl
https://github.com/CliMA/Oceananigans.jl/blob/fjp/multithreaded-benchmarks/benchmark/weak_scaling_shallow_water_model_serial.jl
Aside from the benchmark scripts themselves, everything was up to date with the latest version of master.

Here are the results:

Oceananigans v0.58.8
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  EBVERSIONJULIA = 1.6.1
  JULIA_DEPOT_PATH = :
  EBROOTJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.1
  EBDEVELJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.1/easybuild/avx2-Core-julia-1.6.1-easybuild-devel
  JULIA_LOAD_PATH = :

                  Shallow water model weak scaling with multithreading benchmark
┌───────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────────┬─────────┬─────────┐
│          size │ threads │     min │  median │    mean │     max │    memory │  allocs │ samples │
├───────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼─────────┤
│   (8192, 512) │       1 │ 1.453 s │ 1.454 s │ 1.454 s │ 1.456 s │  1.37 MiB │    2318 │       4 │
│  (8192, 1024) │       2 │ 2.909 s │ 2.933 s │ 2.933 s │ 2.956 s │ 21.52 MiB │ 1303192 │       2 │
│  (8192, 2048) │       4 │ 2.096 s │ 2.115 s │ 2.125 s │ 2.165 s │ 16.38 MiB │  942343 │       3 │
│  (8192, 4096) │       8 │ 2.178 s │ 2.198 s │ 2.218 s │ 2.280 s │ 17.82 MiB │  987092 │       3 │
│  (8192, 8192) │      16 │ 2.201 s │ 2.218 s │ 2.216 s │ 2.230 s │ 18.33 MiB │  922426 │       3 │
│ (8192, 16384) │      32 │ 2.598 s │ 2.615 s │ 2.615 s │ 2.632 s │ 24.29 MiB │ 1116849 │       2 │
└───────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────────┴─────────┴─────────┘

        Shallow water model weak multithreading scaling speedup
┌───────────────┬─────────┬──────────┬────────────┬─────────┬─────────┐
│          size │ threads │ slowdown │ efficiency │  memory │  allocs │
├───────────────┼─────────┼──────────┼────────────┼─────────┼─────────┤
│   (8192, 512) │       1 │      1.0 │        1.0 │     1.0 │     1.0 │
│  (8192, 1024) │       2 │  2.01669 │   0.495862 │ 15.7412 │ 562.205 │
│  (8192, 2048) │       4 │  1.45397 │   0.687771 │ 11.9861 │ 406.533 │
│  (8192, 4096) │       8 │  1.51106 │   0.661786 │ 13.0337 │ 425.838 │
│  (8192, 8192) │      16 │  1.52536 │   0.655582 │ 13.4078 │  397.94 │
│ (8192, 16384) │      32 │  1.79793 │   0.556195 │ 17.7701 │ 481.816 │
└───────────────┴─────────┴──────────┴────────────┴─────────┴─────────┘

They're not terrific, but they're decent. I ran these on 32 CPUs, so I assume one thread per CPU up to 32 threads. The slight increase in efficiency going from 2 to 4 threads is likely a fixed threading overhead being overcome by the actual efficiency gains of multithreading.
@christophernhill @glwagner is there anything we can do to improve multithreading efficiency for Oceananigans? It's probably not as simple as adding @threads in front of the main for loops, but with just a little improvement, multithreading efficiency might match MPI efficiency.
As it is, multithreading is already a worthwhile option for achieving speedups on systems with multiple CPUs but no MPI.

So far I've only run the scripts on one node, up to 32 threads and CPUs. I'll update this issue with results from running on multiple nodes, going up to 64 or maybe 128 CPUs, to see whether efficiency is affected when going from one node to more.
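
For reference, the slowdown and efficiency columns in the speedup table are derived from the median times; a minimal sketch of the convention (not the exact code in the benchmark utilities):

# Weak scaling: the problem grows with the thread count, so ideally the runtime
# stays flat. Slowdown is t_N / t_1 and efficiency is its inverse.
median_times = [1.454, 2.933, 2.115, 2.198, 2.218, 2.615]  # seconds, from the table above
threads      = [1, 2, 4, 8, 16, 32]

slowdowns    = median_times ./ median_times[1]
efficiencies = 1 ./ slowdowns

for (n, s, e) in zip(threads, slowdowns, efficiencies)
    println("threads = $n, slowdown = $(round(s, digits=3)), efficiency = $(round(e, digits=3))")
end

(These reproduce the speedup table up to rounding of the displayed medians.)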

@vchuravy
Collaborator

It would be good to do some profiling (probably with a system profiler like perf) to understand where the time is spent. The kernels using KernelAbstractions are automatically multi-threaded.
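
For a quick first pass before reaching for perf, Julia's built-in sampling profiler can be run on a threaded toy loop like the sketch below (work! here is illustrative, not an Oceananigans kernel; running the real benchmark script under perf record -g julia --threads=32 ... would give the system-level view):

using Profile

function work!(a)
    Threads.@threads for i in eachindex(a)
        @inbounds a[i] = sin(a[i]) + cos(a[i])
    end
    return a
end

a = rand(10^7)
work!(a)                                 # warm up so compilation doesn't dominate the profile
@profile for _ in 1:100; work!(a); end
Profile.print(format = :flat, sortedby = :count)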

@glwagner
Member

glwagner commented Jul 15, 2021

To fill in a few more details for @hennyg888 --- almost all multithreading in Oceananigans is achieved via KernelAbstractions.jl. Improving efficiency for Oceananigans kernels likely means contributing to KernelAbstractions.jl (which @vchuravy may or may not be excited about :-D).

More specifically, all tendency evaluations, non-communicative / non-periodic halo fills (periodic halo filling uses Base broadcasting and thus is not parallelized), integrals (like the hydrostatic pressure integral, or vertical velocity computation in HydrostaticFreeSurfaceModel), evaluation of diagnostics, and broadcasting with fields all use KernelAbstractions via the Oceananigans function launch!:

function launch!(arch, grid, dims, kernel!, args...;
                 dependencies = nothing,
                 include_right_boundaries = false,
                 reduced_dimensions = (),
                 location = nothing,
                 kwargs...)

    workgroup, worksize = work_layout(grid, dims,
                                      include_right_boundaries = include_right_boundaries,
                                      reduced_dimensions = reduced_dimensions,
                                      location = location)

    loop! = kernel!(Architectures.device(arch), workgroup, worksize)

    @debug "Launching kernel $kernel! with worksize $worksize"

    event = loop!(args...; dependencies=dependencies, kwargs...)

    return event
end

The line

event = loop!(args...; dependencies=dependencies, kwargs...)

launches a kernel, using KernelAbstractions syntax. event is a token that can be "waited" on if we need to.
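
For anyone following along, a minimal standalone KernelAbstractions kernel (using the event-based API of the KernelAbstractions version in use at the time) looks roughly like this; on the CPU the work items are distributed over Julia's threads:

using KernelAbstractions

# Trivial kernel: each work item scales one entry of the array.
@kernel function scale!(a, s)
    i = @index(Global)
    @inbounds a[i] *= s
end

a = ones(1024)
kernel! = scale!(CPU(), 16)                    # CPU device, workgroup size 16
event = kernel!(a, 2.0, ndrange = length(a))   # returns an event, like loop! above
wait(event)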

So either we can improve multithreading by changing what happens when loop! is called --- or, possibly, by refining the dependency tree so that we can launch more kernels simultaneously. The second optimization is probably more important for small problems. You have mostly benchmarked fairly large problems, so I don't think we'd see much speedup for them. But I'm not 100% sure.
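
A sketch of what refining the dependency tree buys us, in the same API: two kernels with no mutual dependencies can be in flight at once, and only the code that needs both results waits on the combined event (fill_with! is illustrative, not an Oceananigans kernel):

using KernelAbstractions

@kernel function fill_with!(a, v)
    i = @index(Global)
    @inbounds a[i] = v
end

a, b = zeros(10^6), zeros(10^6)
kernel! = fill_with!(CPU(), 64)

event_a = kernel!(a, 1.0, ndrange = length(a))   # no dependencies: launch immediately
event_b = kernel!(b, 2.0, ndrange = length(b))   # independent of event_a
wait(KernelAbstractions.MultiEvent((event_a, event_b)))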

@christophernhill
Member

@hennyg888 do you have benchmarks for the same problems using MPI instead of multithreading, on the same CPU (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz)?

@hennyg888
Collaborator Author

For MPI I ran it on up to 128 Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz CPUs, with efficiencies around 80%. I think I have some results for the MPI weak and strong scaling benchmarks posted at the bottom of #1722.

@francispoulin
Collaborator

Thanks everyone for your feedback.

@vchuravy, great to know that multi-threading is built in!

I agree that profiling would be a good way to determine why the efficiency is not great. I have not used perf, but we can look into it.

Also, do you know of any benchmarking others have done with KernelAbstractions on threads that we could look at for comparison?

@christophernhill
Member

@hennyg888 and @francispoulin the results in #1722 look like they may be for an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz? (And a different Julia version: 1.6.0 vs. 1.6.1.)

I'm not sure how precise we want to be about what we compare with what, but it could be informative to have comparisons where only one thing is changed at a time, if that is possible, i.e. everything run on the Intel(R) Xeon(R) Platinum 8260 CPU with the same problem and problem size, and only threading vs. MPI different. We could also compare across CPUs and across Julia versions, but not all at the same time.

@francispoulin
Collaborator

I'm not sure why we have Julia v1.6.1, but we should be able to redo the results with Julia 1.6.0, since that's what we have on the servers.

When we do runs over hundreds of CPUs, I don't know that we will get CPUs that are all the same. Unfortunately, I don't see an easy fix for that.

@christophernhill
Member

@francispoulin (and @hennyg888 ) no worries. We can use what we have too.

I think both these tests ( #1861 and #1722 ) are on a single CPU (just lots of cores)?

@francispoulin
Collaborator

Sorry, I was thinking of the MPI tests (since that's what I'm looking at for the slides right now).

I agree that for one CPU vs one GPU, it would be nice to use the same CPU and GPU in the different tests. I know we can specify the GPU type in the SLURM script. Maybe we can do the same for the CPU?

@christophernhill
Member

@francispoulin and @hennyg888 do you think a metric of "number of points per second" would be useful? In general that would be Nx*Ny*Nz*Nt / t_bench. That could be a way to compare 1 GPU with 128 CPU cores on the same model but with different problem sizes?
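
As a sketch, with purely illustrative numbers:

# "Grid points advanced per wall-clock second": Nx*Ny*Nz*Nt / t_bench.
points_per_second(Nx, Ny, Nz, Nt, t_bench) = Nx * Ny * Nz * Nt / t_bench

points_per_second(256, 256, 256, 10, 2.5)   # e.g. a 256^3 grid, 10 steps, 2.5 s → ≈ 6.7e7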

@vchuravy
Collaborator

Also, do you know of any benchmarking others have done with KernelAbstractions on threads that we could look at for comparison?

I did some benchmarks in the beginning, but mostly focused on strong scaling.

@francispoulin
Collaborator

Interesting idea @christophernhill . For the last results that @hennyg888 posted in #1722, I did some calculations and found the following.

GPU
N=256    3.0e9
N=128    2.6e9
N=64     6.6e8

CPU
N=256      8.6e6
N=128      9.1e6
N=64       9.0e6

In an article that @ali-ramadhan referenced on the Slack channel recently, a paper on a shallow water model in Python, Roullet and Gaillard (2021) say they get 2 TFlops using a thousand cores. We are getting 3 GigaFlops on the GPU and 9 MegaFlops on the CPU.

It's certainly a very good speedup, since we see O(400) with WENO5, but this makes me wonder whether we could do better?

But to answer your question, when @hennyg888 has the data, we can certainly produce these plots easily enough (unless there is a problem that I'm missing).

@francispoulin
Collaborator

Also, do you know of any benchmarking others have done with KernelAbstractions on threads that we could look at for comparison?

I did some benchmarks in the beginning, but mostly focused on strong scaling.

Thanks for the information. Can you point me to where some of these results might be found?

@hennyg888
Collaborator Author

@christophernhill @francispoulin I ran the threaded benchmarks up to 32 threads on 32 cores with Julia 1.6.0, on the same CPUs the MPI benchmarks used. That makes sense, since they're all benchmarking parallel-computing efficiency.

Oceananigans v0.58.9
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)
Environment:
  EBVERSIONJULIA = 1.6.0
  JULIA_DEPOT_PATH = :
  EBROOTJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0
  EBDEVELJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0/easybuild/avx2-Core-julia-1.6.0-easybuild-devel
  JULIA_LOAD_PATH = :

                  Shallow water model weak scaling with multithreading benchmark
┌───────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────────┬─────────┬─────────┐
│          size │ threads │     min │  median │    mean │     max │    memory │  allocs │ samples │
├───────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼─────────┤
│   (8192, 512) │       1 │ 1.458 s │ 1.458 s │ 1.458 s │ 1.458 s │  1.37 MiB │    2318 │       4 │
│  (8192, 1024) │       2 │ 2.925 s │ 2.989 s │ 2.989 s │ 3.052 s │ 18.06 MiB │ 1076944 │       2 │
│  (8192, 2048) │       4 │ 2.296 s │ 2.381 s │ 2.397 s │ 2.515 s │ 13.60 MiB │  760190 │       3 │
│  (8192, 4096) │       8 │ 2.347 s │ 2.369 s │ 2.377 s │ 2.415 s │ 16.36 MiB │  891860 │       3 │
│  (8192, 8192) │      16 │ 2.407 s │ 2.548 s │ 2.517 s │ 2.595 s │ 17.44 MiB │  863941 │       3 │
│ (8192, 16384) │      32 │ 3.023 s │ 3.069 s │ 3.069 s │ 3.115 s │ 23.03 MiB │ 1034063 │       2 │
└───────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────────┴─────────┴─────────┘

        Shallow water model weak multithreading scaling speedup
┌───────────────┬─────────┬──────────┬────────────┬─────────┬─────────┐
│          size │ threads │ slowdown │ efficiency │  memory │  allocs │
├───────────────┼─────────┼──────────┼────────────┼─────────┼─────────┤
│   (8192, 512) │       1 │      1.0 │        1.0 │     1.0 │     1.0 │
│  (8192, 1024) │       2 │  2.04972 │   0.487872 │ 13.2156 │ 464.601 │
│  (8192, 2048) │       4 │  1.63302 │   0.612363 │ 9.95278 │ 327.951 │
│  (8192, 4096) │       8 │  1.62507 │   0.615359 │ 11.9706 │ 384.754 │
│  (8192, 8192) │      16 │  1.74747 │   0.572257 │  12.755 │  372.71 │
│ (8192, 16384) │      32 │  2.10486 │    0.47509 │  16.846 │ 446.101 │
└───────────────┴─────────┴──────────┴────────────┴─────────┴─────────┘

Also, after reviewing the new benchmarks and comparing them to the old benchmarks currently displayed in benchmarks.md, it seems that all the CPU vs. GPU benchmarks use one type of CPU, while all the MPI and threaded benchmarks use another.
The MPI and threaded benchmarks use the Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, while every other benchmark, including the CPU-to-GPU speedup benchmarks, uses the Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz.
This is mainly because the speedup benchmarks need a single faster CPU while the MPI benchmarks need many CPUs, and the exact types worked out this way on the cluster.

@glwagner glwagner added the performance 🏍️ So we can get the wrong answer even faster label Jul 17, 2021
@glwagner
Member

Interesting idea @christophernhill . For the last results that @hennyg888 posted in #1722, I did some calculations and found the following.

GPU
N=256    3.0e9
N=128    2.6e9
N=64     6.6e8

CPU
N=256      8.6e6
N=128      9.1e6
N=64       9.0e6

In an article that @ali-ramadhan referenced on the Slack channel recently, a paper on a shallow water model in Python, Roullet and Gaillard (2021) say they get 2 TFlops using a thousand cores. We are getting 3 GigaFlops on the GPU and 9 MegaFlops on the CPU.

It's certainly a very good speedup, since we see O(400) with WENO5, but this makes me wonder whether we could do better?

But to answer your question, when @hennyg888 has the data, we can certainly produce these plots easily enough (unless there is a problem that I'm missing).

We have to do more work to compare with Roullet and Gaillard (2021). First of all, there are typos in the paper: sometimes the performance is listed as 2 GFlops, other times as 2 TFlops. Second --- if I understand the situation correctly --- I don't think we've ever measured floating point operations per second. The numbers you've calculated are grid points per second; however we do many floating point operations per grid point. Roullet and Gaillard (2021) estimate their code performs something like 700-800 Flops per grid point.

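In other words, converting grid points per second into a Flop rate requires an estimate of Flops per grid point, which we have not measured for Oceananigans. A sketch of the conversion (750 is just the midpoint of the range quoted from the paper, not an Oceananigans number):

# Grid-point throughput -> estimated Flop rate, given an estimate of Flops per grid point.
estimated_flop_rate(points_per_second, flops_per_point) = points_per_second * flops_per_point

# Purely illustrative: if Oceananigans did ~750 Flops per point, 3.0e9 points/s
# would correspond to about 2.3e12 Flops, i.e. ~2 TFlops.
estimated_flop_rate(3.0e9, 750)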

@christophernhill
Member

christophernhill commented Jul 18, 2021

@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point.

P.S. 84% of CPU peak seems abnormally high; dense matrix-matrix multiplication typically maxes out at about 80%.

@francispoulin
Collaborator

Good points, @glwagner. The numbers that I posted are probably best ignored for now. I imagine this should come up in another issue when we look at the efficiency of the calculations in general. Focusing on threading in this issue seems best.

@glwagner
Member

@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point.

P.S. 84% of CPU peak seems abnormally high; dense matrix-matrix multiplication typically maxes out at about 80%.

Sounds like a letter to the editor. :-P

@glwagner
Member

I put together some utilities for testing multithreading with KernelAbstractions versus Base.Threads for a simple kernel:

https://github.com/glwagner/multithreaded-stencils

I've used a new repo because it might be worthwhile to test threaded computations in other programming languages.
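
The comparison boils down to timing a simple stencil under each paradigm; an illustrative Base.Threads version of such a kernel (not the exact code in that repo) is:

using BenchmarkTools

# 1D three-point stencil, threaded with Base.Threads.
function stencil_threads!(out, a)
    Threads.@threads for i in 2:length(a)-1
        @inbounds out[i] = a[i-1] - 2a[i] + a[i+1]
    end
    return out
end

a   = rand(2^20)
out = similar(a)
@btime stencil_threads!($out, $a)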

@christophernhill
Member

@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point.
P.S. 84% of CPU peak seems abnormally high; dense matrix-matrix multiplication typically maxes out at about 80%.

Sounds like a letter to the editor. :-P

Could it be they meant 84% of the memory-bandwidth-limited peak? It isn't crazy to get 84% of memory bandwidth, but that then gives a very low % of peak flops. I haven't read the article, I guess I should!
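
One quick sanity check along those lines is to estimate the achieved bandwidth from the bytes moved per grid point and compare it with the machine's peak; all numbers below are illustrative, not measurements from this thread:

# Achieved bandwidth in GB/s from grid points, time steps, bytes moved per point, and runtime.
achieved_bandwidth_GBs(points, steps, bytes_per_point, seconds) =
    points * steps * bytes_per_point / seconds / 1e9

bw = achieved_bandwidth_GBs(512^3, 10, 200, 5.0)   # ≈ 54 GB/s with these made-up inputs
bw / 130                                           # fraction of an (illustrative) 130 GB/s peak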

@francispoulin
Collaborator

Very nice work @glwagner , and thanks for making this. Lots of good stuff here.

In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent.

Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speed up.

@glwagner
Member

glwagner commented Jul 24, 2021

Very nice work @glwagner , and thanks for making this. Lots of good stuff here.

In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent.

Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speed up.

Ah, this machine has 48 cores. Since threading has an overhead cost, we expect saturation at some point. It's surprising that this happens at just 16 cores for such a large problem (512^3) though.

We can calculate more metrics for sure.

I think it would be worthwhile to investigate whether other threading paradigms scale differently for the same problem. Numba + parallel accelerator might be a good test case. @hennyg888 would you be interested in that?

Here are some docs:

https://numba.pydata.org/numba-doc/latest/user/parallel.html

@francispoulin
Collaborator

I agree that I would expect it to saturate at more than 16 threads given 48 cores, but clearly I'm wrong.

Getting another benchmark would be a good idea. I'm happy to consider the Numba + parallel idea, since that would be a good test of the architecture. This mini-course did give some threaded examples for solving the diffusion equation in 3D. I wonder if we might want to ask Ludovic whether they have done any multi-threading scaling studies?

I'm happy to discuss this with @hennyg888 on Monday and see what we come up with. Others are happy to join the discussion if they like.

@francispoulin
Collaborator

Below is a link to a paper that compares the scalability of multi-threading in Python, Julia and Chapel.

Brief summary: they find that none of them do as well as OpenMP, and give some reasons why. They do see some improvement going up to 64 threads, but the efficiency in some cases drops to 20%. It seems that Python might do better at low thread counts while Julia does better at higher ones. This was last year, so it should probably be redone.

Also, I should mention that I don't believe their problem is much like ours, but it's an example and has some figures, so that's nice to see.

https://hal.inria.fr/hal-02879767/document

@christophernhill
Member

Very nice work @glwagner , and thanks for making this. Lots of good stuff here.
In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent.
Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speed up.

Ah, this machine has 48 cores. Since threading has an overhead cost, we expect saturation at some point. It's surprising that this happens at just 16 cores for such a large problem (512^3) though.

We can calculate more metrics for sure.

I think it would be worthwhile to investigate whether other threading paradigms scale differently for the same problem. Numba + parallel accelerator might be a good test case. @hennyg888 would you be interested in that?

Here are some docs:

https://numba.pydata.org/numba-doc/latest/user/parallel.html

You run out of memory bandwidth at some point, usually before you manage to saturate all the cores for something like diffusion. So some of the 16-thread drop-off could be that.

I guess we could get even more minimalist and check a multi-threaded STREAM benchmark to see that?

@francispoulin
Collaborator

I am open to trying whatever simple example you suggest @christophernhill , but I'm not sure what you mean by stream benchmark. Sorry.

@glwagner
Member

Thanks @hennyg888 for the benchmarks!

@CliMA CliMA locked and limited conversation to collaborators Mar 22, 2023
@glwagner glwagner converted this issue into discussion #3007 Mar 22, 2023
