
MPI distributed parallelism #590

Merged · 100 commits · Mar 10, 2021

Conversation

ali-ramadhan
Member

This PR takes a stab at designing a non-invasive interface to running Oceananigans on multiple CPUs and GPUs, i.e. distributed parallelism with MPI. By non-invasive I mean that no existing code will have to change. It may not be the best solution but I really like it and I'm hoping we can discuss this design. I see no reason why it won't perform well.

Vision of this PR:

  1. The Oceananigans core code base will be free of MPI and will keep using GPUifyLoops.jl as it already does, so nothing changes there.
  2. Everything needed for distributed parallelism will live in the Oceananigans.Distributed submodule.
  3. Support for x, y, and z decompositions. In practice, choice of pressure solver may limit which decomposition we can use but they're all supported right now.
  4. Each rank will time step its own submodel, communicating with its neighbors as needed, i.e. in fill_halo_regions!. There is no "master rank".
  5. Halo communication is implemented by injecting HaloCommunication boundary conditions wherever a submodel shares a halo with another rank. These are then dispatched on, so there is no need to modify existing code (see the sketch right after this list).
  6. With PR Pressure solvers for everything! #589 we will be able to easily slide in a DistributedPressureSolver struct that can be used to dispatch on solve_for_pressure!. So again, no need to modify existing code.
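To make point 5 concrete, here is a minimal sketch of the idea. The type and function names are illustrative, not the PR's actual implementation; it assumes a halo width of 2 and copies the halo slice into a contiguous buffer before sending:

```julia
using MPI

# Hypothetical boundary condition type marking a halo shared with another rank.
struct HaloCommunicationBC
    neighbor_rank :: Int
end

# The existing serial method for ordinary boundary conditions stays untouched;
# the distributed behavior is selected purely by dispatching on the BC type.
function fill_west_halo!(field, bc::HaloCommunicationBC, comm=MPI.COMM_WORLD)
    H = 2                                  # assumed halo width
    send = field[H+1:2H, :, :]             # copy of interior points adjacent to the boundary
    recv = similar(send)
    MPI.Sendrecv!(send, bc.neighbor_rank, 0,
                  recv, bc.neighbor_rank, 0, comm)
    field[1:H, :, :] .= recv               # write the neighbor's data into our halo
    return nothing
end
```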

This way MPI does not invade the core code base, making it easier to maintain, and there will be a very clear boundary between "core Oceananigans" and "distributed parallelism functionality". I think this will serve us well in the future, as MPI seems to permeate deeply into other codes, making them hard to modify.

The big thing that is missing is of course the distributed pressure solver, which is the hard part to implement. This is where DistributedTranspose.jl will come in handy. I also recently found PencilFFTs.jl, which looks interesting. cc @leios

For testing purposes, I'm tempted to do the pressure solve via an MPI.Gather onto rank 0 where it can be solved locally then an MPI.Scatter to pass the pressure to all ranks. Super inefficient but might be good to ensure that the DistributedModel can reproduce existing regression tests.
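Roughly something like the sketch below, assuming a y-slab decomposition. solve_poisson_locally is a placeholder for the existing serial solver, I use a Bcast! of the full field instead of a Scatter just to keep the sketch short, and the positional-root collective signatures may differ between MPI.jl versions:

```julia
using MPI

# Stand-in for the existing serial Poisson solve on the full grid (hypothetical helper).
solve_poisson_locally(rhs) = rhs

# Naive "gather, solve on rank 0, broadcast back" pressure solve for a y-slab
# decomposition. Horribly inefficient, but handy for checking answers against
# the serial regression tests.
function naive_pressure_solve(local_rhs, comm)
    rank   = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    Nx, ny, Nz = size(local_rhs)                  # each rank owns ny = Ny / nranks rows

    gathered = MPI.Gather(local_rhs, 0, comm)     # flat vector on rank 0

    global_rhs = zeros(eltype(local_rhs), Nx, ny * nranks, Nz)
    if rank == 0
        blocks = reshape(gathered, Nx, ny, Nz, nranks)
        for r in 0:nranks-1                       # reassemble the global right-hand side
            global_rhs[:, r*ny+1:(r+1)*ny, :] .= view(blocks, :, :, :, r+1)
        end
    end

    global_p = rank == 0 ? solve_poisson_locally(global_rhs) : global_rhs
    MPI.Bcast!(global_p, 0, comm)                 # everyone gets the full pressure field

    return global_p[:, rank*ny+1:(rank+1)*ny, :]  # keep only this rank's slab
end
```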

Performance issues:

  • Right now MPI.Isend, MPI.Recv!, and MPI.Sendrecv! all expect send and receive buffers to be contiguous in memory, I believe. To get around this I allocate memory for these buffers, but this is definitely not performant. @vchuravy suggested that we may be able to send and receive into strided buffers, so will look into this. cc @simonbyrne maybe you know more about this?

Quality of life features we may want in the future (which might affect design choices):

  • Distributed diagnostics: these will be pretty expensive no matter how we implement them due to the extra reduction step (MPI.Gather) required across all ranks. I wonder if it's even worth thinking about them much. If we really need things like a DistributedHorizontalAverage then we can look into that.
  • Distributed output writers: I wonder if we should add e.g. a distributed NetCDF output writer or if each rank just writes out its own output and we have a utility function that post-processes the output files and merges them at the end (this seems easier than writing another output writer).

@ali-ramadhan ali-ramadhan added the "feature 🌟 Something new and shiny" and "abstractions 🎨 Whatever that means" labels on Jan 6, 2020
@simonbyrne
Member

simonbyrne commented Jan 6, 2020

  • @vchuravy suggested that we may be able to send and receive into strided buffers, so will look into this. cc @simonbyrne maybe you know more about this?

Yes! I have just made this easier in JuliaParallel/MPI.jl#329: you will need the master version of MPI.jl for now, but I'd appreciate it if you could try it out.

The simplest option is to pass a SubArray directly into Isend/Irecv: it should figure out the appropriate stride and construct the necessary MPI Datatype. There are also some lower-level functions if you need them.
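Something like this minimal sketch, assuming MPI.jl master; the neighbor ranks and halo slices are just illustrative:

```julia
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
right = mod(rank + 1, MPI.Comm_size(comm))
left  = mod(rank - 1, MPI.Comm_size(comm))

field = rand(16, 16, 16)                     # stand-in for a field with 2-wide halos

# Non-contiguous views: MPI.jl builds the strided MPI Datatype under the hood,
# so no intermediate contiguous buffer needs to be allocated.
send_view = @view field[end-3:end-2, :, :]   # easternmost interior points
recv_view = @view field[1:2, :, :]           # western halo

send_req = MPI.Isend(send_view, right, 0, comm)
recv_req = MPI.Irecv!(recv_view, left, 0, comm)
MPI.Waitall!([send_req, recv_req])
```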

@francispoulin
Collaborator

@ali-ramadhan I am trying to follow the current state of affairs of this PR but am having some difficulties. Can you maybe give me a brief summary?

@ali-ramadhan ali-ramadhan changed the title from "[WIP] MPI distributed parallelism" to "MPI distributed parallelism" on Nov 24, 2020
@ali-ramadhan ali-ramadhan marked this pull request as draft November 24, 2020 12:00
@ali-ramadhan
Member Author

@francispoulin Sorry for the neglected state of this PR. It's in a half-baked state and I keep meaning to revisit it.

You can decompose the domain into x, y, z cubes/slabs/pencils/etc. Then "halo communication boundary conditions" are injected on edges where the model needs to communicate with another rank, and the communication occurs as part of fill_halo_regions!. I was working on adding a test to verify that all halo region data was communicated properly for CPU and GPU models but encountered some deadlocking (according to the latest commit message haha).

Hoping to revisit soon. I don't think it should be too hard to get this PR to work for a shallow water model or compressible model (the pressure solver will be the hard part of an MPI incompressible model).

@francispoulin
Collaborator

Thanks @ali-ramadhan for the summary. Very helpful and sounds like there's a lot there to play with.

I saw a link here that looks at MPI scaling in Julia for the 2D diffusion equation. I presume it would be easy to do a similar test with Oceananigans in 2D or 3D?

What I find especially interesting is that right before the conclusions they compare with Fortran and C and, if I'm reading this correctly, Julia seems to be faster in serial and on two cores. Impressive!

P.S. You seem to be very busy nowadays so please don't let me add to your already full plate.

@ali-ramadhan
Member Author

ali-ramadhan commented Feb 8, 2021

Progress update

I decided to take a stab at the simplest case: triply-periodic on the CPU. Surprisingly I was able to get a distributed IncompressibleModel running just using multiple dispatch without having to modify any existing code, although a cleaner design might require a bit of refactoring.

The PR is still a work-in-progress so it's a bit messy; the purpose was to demonstrate a proof of concept. MPI.jl and PencilFFTs.jl are new dependencies but I haven't updated the Project.toml yet.

So far this PR adds some new infrastructure:

  1. Distributed/multi-architectures (e.g. MultiCPU) that know about rank connectivity.
  2. Halo communication between ranks is done via multiple dispatch on a new HaloCommunicationBC type.
  3. A DistributedFFTBasedPoissonSolver for solving Poisson equations across ranks.
  4. A DistributedModel that creates local grids for each rank, injects halo communication BCs, and passes the distributed pressure solver to a local IncompressibleModel.

I added some simple tests for multi-architecture rank connectivity, local grid construction, injection of halo communication BCs, and halo communication (testing x, y, and z slab decompositions), as well as tests for the distributed Poisson solver ensuring the solution is divergence-free.

The next step for testing would probably be to verify that the code handles Bounded topologies correctly, then add tests that reproduce the existing regression tests on multiple ranks and ensure the output matches the regression output.

Some notes

Domain decomposition

Domain decomposition is supported and tested in x, y, and z. But for IncompressibleModel z-decomposition won't work right now because vertical integrals are done inside GPU kernels (so we probably don't want to decompose in z). And PencilFFTs.jl supports decompositions in dimensions 2 and 3 (since dimension 1 FFTs are the fastest). As a result, right now Oceananigans.jl only supports slab decompositions in y although we should figure out if pencil decompositions are possible.
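For reference, a bare-bones sketch of how PencilFFTs.jl is driven (grid size and process grid are illustrative; this is the underlying distributed transform, not the PR's DistributedFFTBasedPoissonSolver):

```julia
using MPI, PencilFFTs

MPI.Init()
comm = MPI.COMM_WORLD

size_global = (64, 64, 64)
proc_dims   = (MPI.Comm_size(comm),)   # 1D process grid → slab decomposition
# proc_dims = (P1, P2)                 # 2D process grid → pencil decomposition

plan = PencilFFTPlan(size_global, Transforms.FFT(), proc_dims, comm)

u  = allocate_input(plan)              # this rank's portion of the global array
fill!(u, 1)
û  = plan * u                          # distributed forward transform
u2 = plan \ û                          # distributed inverse transform
```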

Local topologies

The local grid topology may need to be different from the global/full grid topology. This isn't an issue for triply-periodic so I haven't thought about how to tackle it yet.

How would topologies work for grids on a single rank (i.e. a piece of a larger grid)? Do we need a new Communication topology?

If a rank is Periodic on the left and communicating with another rank on the right, then it makes sense to be Communication on both the left and right. But you could be Bounded on the left and communicating with another rank on the right, so no single topology would be correct.

I guess if we consider communication as a boundary condition, then it makes sense to use Bounded in any case where halo communication is required. Then there is no need for extra topologies.

Distributed architecture types

What is the architecture of an IncompressibleModel that is part of a DistributedModel? I think it should be MultiCPU.

If CPU and GPU become abstract types we can continue to dispatch on them for SingleCPU and MultiCPU?

This might be needed for multiple dispatch.

DistributedModel output

Probably not high priority until the DistributedModel is fleshed out.

For now, I'm thinking of just adding a function merge_output(filepaths) that merges a list of output files filepaths after the simulation is done.

A long-term solution might involve parallel I/O, e.g. with NetCDF, but this is not supported by NCDatasets.jl yet: CliMA/ClimateMachine.jl#2007 and Alexander-Barth/NCDatasets.jl#122

I think it would be generally useful to add a DistributedField that acts like a regular field except that compute!(distributed_field) would gather the grid points from all ranks to form one field.

A DistributedField could then be passed to an output writer to provide output writing for DistributedModel, but I don't think this approach will scale well to R >> 1 ranks.

Proof of concept

Running mpi_turbulence.jl with 4 ranks should produce this expanded version of the 2D turbulence example:

mpi_turbulence_reencoded.mp4

@ali-ramadhan
Member Author

Right now this PR requires the grid points to be split evenly among the ranks, e.g. if you have 4 ranks along the x-direction then grid.Nx must be a multiple of 4.

@christophernhill suggested generalizing this. He's written some code for dividing nearly evenly into N subdomains when the global number of points is not exactly divisible by N: https://github.com/christophernhill/iap-2021-12.091/blob/e79dfe9dca5441e561cefd65b4c052b1a1dea5a3/step6.py#L46-58
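The idea is simple enough; here is a sketch (not Chris's code) that gives the first rem(N, R) ranks one extra point:

```julia
# Split N grid points across R ranks as evenly as possible:
# the first rem(N, R) ranks get one extra point.
function local_extent(N, R, rank)     # rank ∈ 0:R-1
    q, r = divrem(N, R)
    n      = rank < r ? q + 1 : q
    offset = rank * q + min(rank, r)  # 0-based global index of this rank's first point
    return offset, n
end

# Example: 10 points over 4 ranks → sizes of 3, 3, 2, 2
[local_extent(10, 4, rank) for rank in 0:3]   # [(0, 3), (3, 3), (6, 2), (8, 2)]
```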

@tomchor
Collaborator

tomchor commented Feb 24, 2021

@ali-ramadhan thanks a lot for this great work. I'm super excited to have this feature!

Domain decomposition is supported and tested in x, y, and z. But for IncompressibleModel z-decomposition won't work right now because vertical integrals are done inside GPU kernels (so we probably don't want to decompose in z). And PencilFFTs.jl supports decompositions in dimensions 2 and 3 (since dimension 1 FFTs are the fastest). As a result, right now Oceananigans.jl only supports slab decompositions in y although we should figure out if pencil decompositions are possible.

Sorry if I'm misunderstanding some things (I'm not very familiar with the pressure-solver implementations) but it looks to me like this isn't an issue for the pressure solver, right? (Based on the fact that you got a 2D turbulence example going.)

If that's the case, then this limitation might be okay for now, as it would only affect cases where you'd want to parallelize in x for efficiency purposes, but you could still get away with running your set-up in a less-than-optimal configuration, no? I think this would be similar to the old limitation of being able to run (Bounded, Periodic, Bounded) domains only on CPUs. This was not optimal, but if you wanted to circumvent it you could easily do so by rotating your domain to make it (Periodic, Bounded, Bounded), which could be run on GPUs.

Also (again correct me if I'm missing something) from my talks with you guys about this it seems that the primary goal of MPI-distributed parallelism is to run multi-GPU simulations, since GPUs have a relatively low memory limit. In this case, I think the way to use MPI is very different from the traditional multi-CPU runs. That is, I think the important capabilities when distributing a simulation across a few GPUs are different from the important capabilities when distributing it across hundreds of CPU cores.

I guess if we consider communication as a boundary conditions, then it makes sense to use Bounded in any case where halo communication is required. Then there is no need for extra topologies.

In principle I completely agree with this. In principle communicating with another sub-domain is treated exactly the same as you treat BCs, right? Which is to fill the halo region accordingly. The only difference being that the proper way to fill it now depends on other processes, instead of depending on the user definitions of the BCs.

This is also true when a subdomain is, say, Periodic on the left and communicating on the right. Both the left and right halo regions have to be filled using information from outside of that particular subdomain. So I think treating that subdomain direction as Bounded is appropriate!

@ali-ramadhan
Member Author

Sorry if I'm misunderstanding some things (I'm not very familiar with the pressure-solver implementations) but it looks to me like this isn't an issue for the pressure solver, right? (Based on the fact that you got a 2D turbulence example going.)

As long as you only decompose the domain in the y direction (dimension 2) then it should work right now. That's the current limitation, I guess; it should be the same for 1D, 2D, and 3D.

So in the 2D turbulence example the domain is decomposed in y since that's the only decomposition currently supported.

Also (again correct me if I'm missing something) from my talks with you guys about this it seems that the primary goal of MPI-distributed parallelism is to run multi-GPU simulations, since GPUs have a relatively low memory limit. In this case, I think the way to use MPI is very different from the traditional multi-CPU runs. That is, I think the important capabilities when distributing a simulation across a few GPUs are different from the important capabilities when distributing it across hundreds of CPU cores.

Yeah for sure.

If you can only decompose in y and you want to run on hundreds of CPU ranks, then each rank gets a very thin domain, almost a slice. The surface-to-volume ratio of each rank is high, which means lots of communication overhead.

If you're running a huge simulation on a moderate number of GPUs, then each rank gets a pretty thick slice, so the surface-to-volume ratio is smaller, leading to less communication overhead.
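A back-of-the-envelope illustration of that trade-off, assuming a 256^3 grid, a y-slab decomposition, and counting two communicated faces per rank:

```julia
# Points a rank must exchange (two faces) relative to the points it owns (its slab):
comm_fraction(N, ranks) = 2 * N^2 / (N^2 * (N ÷ ranks))

comm_fraction(256, 8)    # ≈ 0.06 → thick slabs, little communication (multi-GPU regime)
comm_fraction(256, 128)  # = 1.0  → 2-cell-thick slabs, halos as big as the interior
```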

Definitely agree that it's probably not a huge deal for now.

In principle I completely agree with this. In principle communicating with another sub-domain is treated exactly the same as you treat BCs, right? Which is to fill the halo region accordingly. The only difference being that the proper way to fill it now depends on other processes, instead of depending on the user definitions of the BCs.

This is also true when a subdomain is, say, Periodic on the left and communicating on the right. Both the left and right halo regions have to be filled using information from outside of that particular subdomain. So I think treating that subdomain direction as Bounded is appropriate!

Thanks for the thoughts! I think using Bounded makes a lot of sense for the reasons you mentioned.

It would also be good to avoid adding a new non-physical topology just for halo communication haha.

@francispoulin
Collaborator

Perhaps a silly question, but with the new MPI parallelism, will we be able to do a traditional multi-CPU run with Oceananigans? I presume so, but the fact that people are talking about MPI mostly supporting GPUs has me a bit confused.

@ali-ramadhan
Member Author

Perhaps a silly question, but with the new MPI parallelism, will we be able to do a traditional multi-CPU run with Oceananigans? I presume so, but the fact that people are talking about MPI mostly supporting GPUs has me a bit confused.

Yes! Actually only CPU + MPI is supported right now haha.

GPU + MPI will require distributed FFTs across multiple GPUs. Unfortunately right now PencilFFTs.jl only provides distributed FFTs across multiple CPUs.

The benefit for multi-GPU parallelism was brought up since we are currently limited to decomposing the domain in the y-direction only. This is probably fine if you don't have too many ranks, so the current limitation favors multi-GPU performance over multi-CPU performance.

For models that don't need distributed FFTs, e.g. ShallowWaterModel, we can decompose in x, y, and/or z, and getting multi-GPU support might not be that much more extra work.

@francispoulin
Collaborator

That's great news @ali-ramadhan. I guess by looking at your code I can learn how to adapt it to ShallowWaterModel. Maybe I will start by doing some tests for the two-dimensional turbulence example, or has someone done that already? By that I mean checking scalability.

@ali-ramadhan
Member Author

I haven't done any scaling tests; that would be super useful!

Been meaning to clean up this PR a little bit and integrate it into the main code (right now it's completely separate) so the PR can be merged and development can continue in future PRs. Should probably also rename DistributedModel to DistributedIncompressibleModel.

@francispoulin
Collaborator

Thanks @ali-ramadhan. Would it make sense for me to wait to do tests until after the merge, just in case anything changes along the way?

I'm keen to play but can certainly wait as well.

@ali-ramadhan
Member Author

I think the cleanup mostly involves moving stuff around in files, etc. so the functions themselves shouldn't change.

Should be able to play around right now! Might just need to change the using/import statements once this PR is merged.

Collaborator

@francispoulin francispoulin left a comment


The changes you've made so far also look good to me. I see that one of them changed grid to my_grid. Not sure if that was a bug or not, but if so, glad you caught it.

@ali-ramadhan
Member Author

ali-ramadhan commented Mar 5, 2021

Alright I think this PR is ready to be merged.

Extra change since yesterday: I think we can have a cleaner design by getting rid of DistributedIncompressibleModel and just reusing IncompressibleModel but with architecture = MultiCPU.

So now, instead of dispatching on ::CPU and ::GPU, we would dispatch on ::AbstractCPUArchitecture and ::AbstractGPUArchitecture, and things should work out whether the model is distributed or not.
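Schematically, the hierarchy could look something like this (a sketch; only CPU, GPU, MultiCPU, and the two abstract architecture names come from the discussion above, everything else is illustrative):

```julia
abstract type AbstractArchitecture end
abstract type AbstractCPUArchitecture <: AbstractArchitecture end
abstract type AbstractGPUArchitecture <: AbstractArchitecture end

struct CPU <: AbstractCPUArchitecture end
struct GPU <: AbstractGPUArchitecture end

struct MultiCPU{C} <: AbstractCPUArchitecture
    connectivity :: C    # rank connectivity for halo communication
end

# Methods written against the abstract types then work for serial and distributed
# models alike, e.g.
array_type(::AbstractCPUArchitecture) = Array
# array_type(::AbstractGPUArchitecture) = CuArray   # with CUDA.jl loaded
```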

I see that one of them changed grid to my_grid. Not sure if that was a bug or not, but if so, glad you caught it.

Yes, I think it was picking default boundary conditions based on the full_grid, but it should be using the local my_grid (not that it matters right now, since the local grid inherits the topology of the full grid, but we might change that soon).

EDIT: Also bumping to v0.53.0.

Member

@glwagner glwagner left a comment


A few things discussed on zoom with @ali-ramadhan:

  • Move the "injection" of halo communication boundary conditions from DistributedIncompressibleModel to Field constructor
  • Make sure that halo region locations are named and ordered from left to right, eg (west, east, south, north, bottom, top)
  • Use terminology "local" and "distributed" instead of "my" and "full" for operations in local memory versus distributed operations / data
  • Add a docstring to RankConnectivity to explain what's going on

@vchuravy
Collaborator

vchuravy commented Mar 8, 2021

Before merging it would also be good to do both a weak and a strong scaling analysis to understand where we currently stand.

@ali-ramadhan
Member Author

@vchuravy How big do you think we should go? I guess use one node then go up to 32 or 64 ranks or something? Or should we try doing some multi-node stuff?

Also, should we use multithreading via KernelAbstractions.jl to fill up each node and run on multiple nodes?

Right now we're limited to 1D/slab domain decompositions so I don't think IncompressibleModel will scale super well for many ranks until #1443 is resolved.

@francispoulin
Collaborator

@vchuravy I like the idea of doing both weak and strong scaling analysis. Can you point out any examples that you think do this well that we could use as a guide?

@ali-ramadhan
Member Author

Did a quick strong scaling benchmark on Tartarus (256^3) up to 16 cores, but the results don't look super great: ~9.5x speedup on 16 cores. Better than multi-threading though.

Maybe I'm not benchmarking properly though. Could also be missing some MPI barriers. Should probably learn how to profile MPI code.

                            Incompressible model strong scaling benchmark
┌─────────────────┬───────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────┐
│            size │ ranks │        min │     median │       mean │        max │     memory │ allocs │
├─────────────────┼───────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────┤
│ (256, 256, 256) │     1 │    3.641 s │    3.686 s │    3.686 s │    3.730 s │ 355.28 KiB │   2336 │
│ (256, 256, 256) │     2 │    1.917 s │    1.918 s │    1.921 s │    1.928 s │ 346.00 KiB │   2782 │
│ (256, 256, 256) │     4 │    1.249 s │    1.283 s │    1.279 s │    1.300 s │ 348.47 KiB │   2822 │
│ (256, 256, 256) │     8 │ 652.029 ms │ 714.833 ms │ 704.940 ms │ 738.885 ms │ 353.84 KiB │   2902 │
│ (256, 256, 256) │    16 │ 377.153 ms │ 388.435 ms │ 394.780 ms │ 415.562 ms │ 366.16 KiB │   3062 │
└─────────────────┴───────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────┘
       Incompressible model strong scaling speedup
┌─────────────────┬───────┬─────────┬──────────┬─────────┐
│            size │ ranks │ speedup │   memory │  allocs │
├─────────────────┼───────┼─────────┼──────────┼─────────┤
│ (256, 256, 256) │     1 │     1.0 │      1.0 │     1.0 │
│ (256, 256, 256) │     2 │ 1.92195 │ 0.973876 │ 1.19092 │
│ (256, 256, 256) │     4 │ 2.87312 │ 0.980825 │ 1.20805 │
│ (256, 256, 256) │     8 │ 5.15614 │ 0.995954 │ 1.24229 │
│ (256, 256, 256) │    16 │ 9.48879 │  1.03061 │ 1.31079 │
└─────────────────┴───────┴─────────┴──────────┴─────────┘

@ali-ramadhan ali-ramadhan merged commit 0cb4885 into master Mar 10, 2021
@ali-ramadhan ali-ramadhan deleted the ar/distributed-model branch March 10, 2021 06:23
@tomchor
Collaborator

tomchor commented Mar 10, 2021

Awesome work @ali-ramadhan! Excited to finally have this in the code.

About the scaling. I'm not sure how you're calculating the stats exactly, but one thing to remember is that, given Julia's JIT compiling + all the MPI stuff, I'm guessing the start-up for these simulations is pretty significant and might be impacting your results, no?

So I guess two ways to circumvent that are to (1) benchmark with pretty long simulations or (2) compile everything ahead of time with PackageCompiler.jl.

I don't think we necessarily need to do this now, but it might be good to keep in mind for the future.

Thanks again for the great work!

@francispoulin
Collaborator

@ali-ramadhan I agree you've done a great job with this!

One thing I might suggest we calculate is the efficiency. The way I have seen it defined before is the time for the serial run divided by (the number of cores times the total time for the MPI run). For these cases I find efficiencies of

np       efficiency
==       ==========
2         0.96
4         0.71
8         0.62
16        0.56

The numbers are decreasing, but I suspect we can probably do better by looking to see where the bottlenecks are. For example, I wonder if we could compute the efficiency of just computing the tendencies, which should be pretty fast. Not sure how easy that is to do though.
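Equivalently, efficiency = speedup / ranks can be read straight off the speedup table in the benchmark comment above (a quick sketch; the small differences from my table come down to which timing statistic is used):

```julia
speedups = Dict(2 => 1.92195, 4 => 2.87312, 8 => 5.15614, 16 => 9.48879)
efficiency(ranks) = speedups[ranks] / ranks

for r in (2, 4, 8, 16)
    println("np = $r   efficiency ≈ $(round(efficiency(r), digits=2))")
end
# np = 2    efficiency ≈ 0.96
# np = 4    efficiency ≈ 0.72
# np = 8    efficiency ≈ 0.64
# np = 16   efficiency ≈ 0.59
```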

@ali-ramadhan
Member Author

About the scaling. I'm not sure how you're calculating the stats exactly, but one thing to remember is that, given Julia's JIT compiling + all the MPI stuff, I'm guessing the start-up for these simulations is pretty significant and might be impacting your results, no?

So I guess two ways to circumvent that are to (1) benchmark with pretty long simulations or (2) compile everything ahead of time with PackageCompiler.jl.

That's true but the benchmarking scripts call time_step!(model, some_time_step) once before benchmarking as a warmup to make sure time stepping has been compiled before @benchmark is used. Should be enough but I could be wrong.

But yeah I only @benchmark 10 time steps, so maybe doing more time steps to collect better statistics would help.
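i.e. something like the pattern below, where model and Δt stand in for whatever the benchmark script sets up:

```julia
using BenchmarkTools

time_step!(model, Δt)   # warm-up call so compilation isn't timed

trial = @benchmark time_step!($model, $Δt) samples=10 evals=1
```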

Never considered PackageCompiler.jl for benchmarking, but it makes sense here seeing as it's pretty easy to use!

The numbers are decreasing, but I suspect we can probably do better by looking to see where the bottlenecks are.

Yes for sure. I'll open an issue to figure out how we should profile and benchmark Julia + MPI better.

I wonder if we could compute the efficiency of just computing the tendencies, which should be pretty fast. Not sure how easy that is to do though.

Ah the efficiency metric is great! I'll see if we can easily add it to the benchmark results table.

It should be pretty easy to isolate a part of time_step! to benchmark. I guess MPI only enters in through the pressure solve and the halo filling. Could be useful to benchmark the distributed pressure solve separately and do the same for the halo filling? We should hopefully get similar scaling results as PencilFFTs.jl if we benchmark the pressure solve.

@vchuravy
Collaborator

Better than multi-threading though.

Them fighting words ;)
