
MPI distributed parallelism #590

Merged · 100 commits · Mar 10, 2021

Conversation

ali-ramadhan
Member

This PR takes a stab at designing a non-invasive interface to running Oceananigans on multiple CPUs and GPUs, i.e. distributed parallelism with MPI. By non-invasive I mean that no existing code will have to change. It may not be the best solution but I really like it and I'm hoping we can discuss this design. I see no reason why it won't perform well.

Vision of this PR:

  1. The Oceananigans core code base will be free of MPI and will keep using GPUifyLoops.jl as it already does, so nothing changes there.
  2. Everything needed for distributed parallelism will live in the Oceananigans.Distributed submodule.
  3. Support for x, y, and z decompositions. In practice, choice of pressure solver may limit which decomposition we can use but they're all supported right now.
  4. Each rank will time step its own submodel, communicating with its neighbors as needed, i.e. in fill_halo_regions!. There is no "master rank".
  5. Halo communication is implemented by injecting HaloCommunication boundary conditions wherever a submodel shares a halo with another rank. These are then dispatched on, so there is no need to modify existing code (see the sketch right after this list).
  6. With PR Pressure solvers for everything! #589 we will be able to easily slide in a DistributedPressureSolver struct that can be used to dispatch on solve_for_pressure!. So again, no need to modify existing code.
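To make point 5 concrete, here is a minimal sketch of the idea. The type and function names are illustrative, not the PR's actual implementation; it assumes a halo width of 2 and copies the halo slice into a contiguous buffer before sending:

```julia
using MPI

# Hypothetical boundary condition type marking a halo shared with another rank.
struct HaloCommunicationBC
    neighbor_rank :: Int
end

# The existing serial method for ordinary boundary conditions stays untouched;
# the distributed behavior is selected purely by dispatching on the BC type.
function fill_west_halo!(field, bc::HaloCommunicationBC, comm=MPI.COMM_WORLD)
    H = 2                                  # assumed halo width
    send = field[H+1:2H, :, :]             # copy of interior points adjacent to the boundary
    recv = similar(send)
    MPI.Sendrecv!(send, bc.neighbor_rank, 0,
                  recv, bc.neighbor_rank, 0, comm)
    field[1:H, :, :] .= recv               # write the neighbor's data into our halo
    return nothing
end
```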

This way MPI does not invade the core code base, making it easier to maintain, and there will be a very clear boundary between "core Oceananigans" and "distributed parallelism functionality". I think this will serve us well in the future, as MPI seems to permeate deeply into other codes, making them hard to modify.

The big thing that is missing is of course the distributed pressure solver, which is the hard part to implement. This is where DistributedTranspose.jl will come in handy. I also recently found PencilFFTs.jl, which looks interesting. cc @leios

For testing purposes, I'm tempted to do the pressure solve via an MPI.Gather onto rank 0 where it can be solved locally then an MPI.Scatter to pass the pressure to all ranks. Super inefficient but might be good to ensure that the DistributedModel can reproduce existing regression tests.
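Roughly something like the sketch below, assuming a y-slab decomposition. solve_poisson_locally is a placeholder for the existing serial solver, I use a Bcast! of the full field instead of a Scatter just to keep the sketch short, and the positional-root collective signatures may differ between MPI.jl versions:

```julia
using MPI

# Stand-in for the existing serial Poisson solve on the full grid (hypothetical helper).
solve_poisson_locally(rhs) = rhs

# Naive "gather, solve on rank 0, broadcast back" pressure solve for a y-slab
# decomposition. Horribly inefficient, but handy for checking answers against
# the serial regression tests.
function naive_pressure_solve(local_rhs, comm)
    rank   = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    Nx, ny, Nz = size(local_rhs)                  # each rank owns ny = Ny / nranks rows

    gathered = MPI.Gather(local_rhs, 0, comm)     # flat vector on rank 0

    global_rhs = zeros(eltype(local_rhs), Nx, ny * nranks, Nz)
    if rank == 0
        blocks = reshape(gathered, Nx, ny, Nz, nranks)
        for r in 0:nranks-1                       # reassemble the global right-hand side
            global_rhs[:, r*ny+1:(r+1)*ny, :] .= view(blocks, :, :, :, r+1)
        end
    end

    global_p = rank == 0 ? solve_poisson_locally(global_rhs) : global_rhs
    MPI.Bcast!(global_p, 0, comm)                 # everyone gets the full pressure field

    return global_p[:, rank*ny+1:(rank+1)*ny, :]  # keep only this rank's slab
end
```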

Performance issues:

  • Right now MPI.Isend, MPI.Recv!, and MPI.Sendrecv! all expect send and receive buffers to be contiguous in memory, I believe. To get around this I allocate memory for these buffers, but this is definitely not performant. @vchuravy suggested that we may be able to send and receive into strided buffers, so will look into this. cc @simonbyrne maybe you know more about this?

Quality of life features we may want in the future (which might affect design choices):

  • Distributed diagnostics: these will be pretty expensive no matter how we implement them due to the extra reduction step (MPI.Gather) required across all ranks. I wonder if it's even worth thinking about them much. If we really need things like a DistributedHorizontalAverage then we can look into that.
  • Distributed output writers: I wonder if we should add e.g. a distributed NetCDF output writer or if each rank just writes out its own output and we have a utility function that post-processes the output files and merges them at the end (this seems easier than writing another output writer).

@ali-ramadhan ali-ramadhan added the "feature 🌟 Something new and shiny" and "abstractions 🎨 Whatever that means" labels on Jan 6, 2020
@simonbyrne
Member

simonbyrne commented Jan 6, 2020

  • @vchuravy suggested that we may be able to send and receive into strided buffers, so will look into this. cc @simonbyrne maybe you know more about this?

Yes! I have just made this easier in JuliaParallel/MPI.jl#329: you will need the master version of MPI.jl for now, but I'd appreciate it if you could try it out.

The simplest option is to pass a SubArray directly into Isend/Irecv: it should figure out the appropriate stride and construct the necessary MPI Datatype. There are also some lower-level functions if you need them.
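Something like this minimal sketch, assuming MPI.jl master; the neighbor ranks and halo slices are just illustrative:

```julia
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
right = mod(rank + 1, MPI.Comm_size(comm))
left  = mod(rank - 1, MPI.Comm_size(comm))

field = rand(16, 16, 16)                     # stand-in for a field with 2-wide halos

# Non-contiguous views: MPI.jl builds the strided MPI Datatype under the hood,
# so no intermediate contiguous buffer needs to be allocated.
send_view = @view field[end-3:end-2, :, :]   # easternmost interior points
recv_view = @view field[1:2, :, :]           # western halo

send_req = MPI.Isend(send_view, right, 0, comm)
recv_req = MPI.Irecv!(recv_view, left, 0, comm)
MPI.Waitall!([send_req, recv_req])
```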

@francispoulin
Collaborator

@ali-ramadhan I am trying to follow the current state of affairs of this PR but am having some difficulties. Can you maybe give me a brief summary?

@ali-ramadhan ali-ramadhan changed the title from "[WIP] MPI distributed parallelism" to "MPI distributed parallelism" on Nov 24, 2020
@ali-ramadhan ali-ramadhan marked this pull request as draft November 24, 2020 12:00
@ali-ramadhan
Member Author

@francispoulin Sorry for the neglected state of this PR. It's in a half-baked state and I keep meaning to revisit it.

You can decompose the domain into x, y, z cubes/slabs/pencils/etc. Then "halo communication boundary conditions" are injected on edges where the model needs to communicate with another rank, and the communication occurs as part of fill_halo_regions!. I was working on adding a test to verify that all halo region data was communicated properly for CPU and GPU models but encountered some deadlocking (according to the latest commit message haha).

Hoping to revisit soon. I don't think it should be too hard to get this PR to work for a shallow water model or compressible model (the pressure solver will be the hard part of an MPI incompressible model).

@francispoulin
Collaborator

Thanks @ali-ramadhan for the summary. Very helpful and sounds like there's a lot there to play with.

I saw a link here that looks at MPI scaling in Julia for the 2D diffusion equation. I presume it would be easy to do a similar test with Oceananigans in 2D or 3D?

What I find especially interesting is that right before the conclusions they compare with Fortran and C and, if I'm reading this correctly, Julia seems to be faster in serial and on two cores. Impressive!

P.S. You seem to be very busy nowadays so please don't let me add to your already full plate.

@ali-ramadhan
Member Author

ali-ramadhan commented Feb 8, 2021

Progress update

I decided to take a stab at the simplest case: triply-periodic on the CPU. Surprisingly I was able to get a distributed IncompressibleModel running just using multiple dispatch without having to modify any existing code, although a cleaner design might require a bit of refactoring.

The PR is still a work-in-progress so it's a bit messy; the purpose was to demonstrate a proof of concept. MPI.jl and PencilFFTs.jl are new dependencies but I haven't updated the Project.toml yet.

So far this PR adds some new infrastructure:

  1. Distributed/multi-architectures (e.g. MultiCPU) that know about rank connectivity.
  2. Halo communication between ranks is done via multiple dispatch on a new HaloCommunicationBC type.
  3. A DistributedFFTBasedPoissonSolver for solving Poisson equations across ranks.
  4. A DistributedModel that creates local grids for each rank, injects halo communication BCs, and passes the distributed pressure solver to a local IncompressibleModel.

I added some simple tests for multi-architecture rank connectivity, local grid construction, injection of halo communication BCs, and halo communication (testing x, y, and z slab decompositions), as well as tests for the distributed Poisson solver ensuring the solution is divergence-free.

The next step for testing would probably be to verify that the code handles Bounded topologies correctly, then add tests that reproduce the existing regression tests on multiple ranks and ensure the output matches the regression output.

Some notes

Domain decomposition

Domain decomposition is supported and tested in x, y, and z. But for IncompressibleModel z-decomposition won't work right now because vertical integrals are done inside GPU kernels (so we probably don't want to decompose in z). And PencilFFTs.jl supports decompositions in dimensions 2 and 3 (since dimension 1 FFTs are the fastest). As a result, right now Oceananigans.jl only supports slab decompositions in y although we should figure out if pencil decompositions are possible.
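For reference, a bare-bones sketch of how PencilFFTs.jl is driven (grid size and process grid are illustrative; this is the underlying distributed transform, not the PR's DistributedFFTBasedPoissonSolver):

```julia
using MPI, PencilFFTs

MPI.Init()
comm = MPI.COMM_WORLD

size_global = (64, 64, 64)
proc_dims   = (MPI.Comm_size(comm),)   # 1D process grid → slab decomposition
# proc_dims = (P1, P2)                 # 2D process grid → pencil decomposition

plan = PencilFFTPlan(size_global, Transforms.FFT(), proc_dims, comm)

u  = allocate_input(plan)              # this rank's portion of the global array
fill!(u, 1)
û  = plan * u                          # distributed forward transform
u2 = plan \ û                          # distributed inverse transform
```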

Local topologies

The local grid topology may need to be different from the global/full grid topology. This isn't an issue for triply-periodic so I haven't thought about how to tackle it yet.

How would topologies work for grids on a single rank (i.e. a piece of a larger grid)? Do we need a new Communication topology?

If a rank is Periodic on the left and communicating with another rank on the right, then it makes sense to be Communication on both the left and right. But you could be Bounded on the left and communicating with another rank on the right, so no single topology would be correct.

I guess if we consider communication as a boundary condition, then it makes sense to use Bounded in any case where halo communication is required. Then there is no need for extra topologies.

Distributed architecture types

What is the architecture of an IncompressibleModel that is part of a DistributedModel? I think it should be MultiCPU.

If CPU and GPU become abstract types we can continue to dispatch on them for SingleCPU and MultiCPU?

This might be needed for multiple dispatch.

DistributedModel output

Probably not high priority until the DistributedModel is fleshed out.

For now, I'm thinking of just adding a function merge_output(filepaths) that merges a list of output files filepaths after the simulation is done.

A long-term solution might involve parallel I/O, e.g. with NetCDF, but this is not supported by NCDatasets.jl yet: CliMA/ClimateMachine.jl#2007 and Alexander-Barth/NCDatasets.jl#122

I think it would be generally useful to add a DistributedField that acts like a regular field except that compute!(distributed_field) would gather the grid points from all ranks to form one field.

A DistributedField could then be passed to an output writer to provide output writing for DistributedModel, but I don't think this approach will scale well to R >> 1 ranks.

Proof of concept

Running mpi_turbulence.jl with 4 ranks should produce this expanded version of the 2D turbulence example:

mpi_turbulence_reencoded.mp4

@ali-ramadhan
Member Author

Right now this PR requires the grid points to be split evenly among the ranks, e.g. if you have 4 ranks along the x-direction then grid.Nx must be a multiple of 4.

@christophernhill suggested generalizing this. He's written some code for dividing nearly evenly into N subdomains when the global number of points is not exactly divisible by N: https://github.com/christophernhill/iap-2021-12.091/blob/e79dfe9dca5441e561cefd65b4c052b1a1dea5a3/step6.py#L46-58
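The idea is simple enough; here is a sketch (not Chris's code) that gives the first rem(N, R) ranks one extra point:

```julia
# Split N grid points across R ranks as evenly as possible:
# the first rem(N, R) ranks get one extra point.
function local_extent(N, R, rank)     # rank ∈ 0:R-1
    q, r = divrem(N, R)
    n      = rank < r ? q + 1 : q
    offset = rank * q + min(rank, r)  # 0-based global index of this rank's first point
    return offset, n
end

# Example: 10 points over 4 ranks → sizes of 3, 3, 2, 2
[local_extent(10, 4, rank) for rank in 0:3]   # [(0, 3), (3, 3), (6, 2), (8, 2)]
```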

@tomchor
Collaborator

tomchor commented Feb 24, 2021

@ali-ramadhan thanks a lot for this great work. I'm super excited to have this feature!

Domain decomposition is supported and tested in x, y, and z. But for IncompressibleModel z-decomposition won't work right now because vertical integrals are done inside GPU kernels (so we probably don't want to decompose in z). And PencilFFTs.jl supports decompositions in dimensions 2 and 3 (since dimension 1 FFTs are the fastest). As a result, right now Oceananigans.jl only supports slab decompositions in y although we should figure out if pencil decompositions are possible.

Sorry if I'm misunderstanding some things (I'm not very familiar with the pressure-solver implementations) but it looks to me like this isn't an issue for the pressure solver, right? (Based on the fact that you got a 2D turbulence example going.)

If that's the case, then this limitation might be okay for now, as it would only affect cases where you'd want to parallelize in x for efficiency purposes, but you could still get away with running your set-up in a less-than-optimal configuration, no? I think this would be similar to the old limitation of being able to run (Bounded, Periodic, Bounded) domains only on CPUs. This was not optimal, but if you wanted to circumvent it you could easily do so by rotating your domain to make it (Periodic, Bounded, Bounded), which could be run on GPUs.

Also (again correct me if I'm missing something) from my talks with you guys about this it seems that the primary goal of MPI-distributed parallelism is to run multi-GPU simulations, since GPUs have a relatively low memory limit. In this case, I think the way to use MPI is very different from the traditional multi-CPU runs. That is, I think the important capabilities when distributing a simulation across a few GPUs are different from the important capabilities when distributing it across hundreds of CPU cores.

I guess if we consider communication as a boundary conditions, then it makes sense to use Bounded in any case where halo communication is required. Then there is no need for extra topologies.

In principle I completely agree with this. In principle communicating with another sub-domain is treated exactly the same as you treat BCs, right? Which is to fill the halo region accordingly. The only difference being that the proper way to fill it now depends on other processes, instead of depending on the user definitions of the BCs.

This is also true when a subdomain is, say, Periodic on the left and communicating on the right. Both the left and right halo regions have to be filled using information from outside of that particular subdomain. So I think treating that subdomain direction as Bounded is appropriate!

@ali-ramadhan
Member Author

Sorry if I'm misunderstanding some things (I'm not very familiar with the pressure-solver implementations) but it looks to me like this isn't an issue for the pressure solver, right? (Based on the fact that you got a 2D turbulence example going.)

As long as you only decompose the domain in the y direction (dimension 2) then it should work right now. That's the current limitation, I guess; it should be the same for 1D, 2D, and 3D.

So in the 2D turbulence example the domain is decomposed in y since that's the only decomposition currently supported.

Also (again correct me if I'm missing something) from my talks with you guys about this it seems that the primary goal of MPI-distributed parallelism is to run multi-GPU simulations, since GPUs have a relatively low memory limit. In this case, I think the way to use MPI is very different from the traditional multi-CPU runs. That is, I think the important capabilities when distributing a simulation across a few GPUs are different from the important capabilities when distributing it across hundreds of CPU cores.

Yeah for sure.

If you can only decompose in y and you want to run on hundreds of CPU ranks, then each rank gets a very thin domain, almost a slice. The surface-to-volume ratio of each rank is high, which means lots of communication overhead.

If you're running a huge simulation on a moderate number of GPUs, then each rank gets a pretty thick slice, so the surface-to-volume ratio is smaller, leading to less communication overhead.
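A back-of-the-envelope illustration of that trade-off, assuming a 256^3 grid, a y-slab decomposition, and counting two communicated faces per rank:

```julia
# Points a rank must exchange (two faces) relative to the points it owns (its slab):
comm_fraction(N, ranks) = 2 * N^2 / (N^2 * (N ÷ ranks))

comm_fraction(256, 8)    # ≈ 0.06 → thick slabs, little communication (multi-GPU regime)
comm_fraction(256, 128)  # = 1.0  → 2-cell-thick slabs, halos as big as the interior
```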

Definitely agree that it's probably not a huge deal for now.

In principle I completely agree with this. In principle communicating with another sub-domain is treated exactly the same as you treat BCs, right? Which is to fill the halo region accordingly. The only difference being that the proper way to fill it now depends on other processes, instead of depending on the user definitions of the BCs.

This is also true when a subdomain is, say, Periodic on the left and communicating on the right. Both the left and right halo regions have to be filled using information from outside of that particular subdomain. So I think treating that subdomain direction as Bounded is appropriate!

Thanks for the thoughts! I think using Bounded makes a lot of sense for the reasons you mentioned.

It would also be good to avoid adding a new non-physical topology just for halo communication haha.

@francispoulin
Collaborator

Perhaps a silly question, but with the new MPI parallelism, will we be able to do a traditional multi-CPU run with Oceananigans? I presume so, but the fact that people are talking about MPI mostly supporting GPUs has me a bit confused.

@ali-ramadhan
Member Author

Perhaps a silly question, but with the new MPI parallelism, will we be able to do a traditional multi-CPU run with Oceananigans? I presume so, but the fact that people are talking about MPI mostly supporting GPUs has me a bit confused.

Yes! Actually only CPU + MPI is supported right now haha.

GPU + MPI will require distributed FFTs across multiple GPUs. Unfortunately right now PencilFFTs.jl only provides distributed FFTs across multiple CPUs.

The benefit for multi-GPU parallelism was brought up since we are currently limited to decomposing the domain in the y-direction only. This is probably fine if you don't have too many ranks, so the current limitation favors multi-GPU performance over multi-CPU performance.

For models that don't need distributed FFTs, e.g. ShallowWaterModel, we can decompose in x, y, and/or z, and getting multi-GPU support might not be that much more extra work.

@francispoulin
Collaborator

That's great news @ali-ramadhan. I guess by looking at your code I can learn how to adapt it to ShallowWaterModel. Maybe I will start by doing some tests for the two-dimensional turbulence example, or has someone done that already? By that I mean checking scalability.

@ali-ramadhan
Member Author

I haven't done any scaling tests; that would be super useful!

Been meaning to clean up this PR a little bit and integrate it into the main code (right now it's completely separate) so the PR can be merged and development can continue in future PRs. Should probably also rename DistributedModel to DistributedIncompressibleModel.

@francispoulin
Collaborator

Thanks @ali-ramadhan. Would it make sense for me to wait to do tests until after the merge, just in case anything changes along the way?

I'm keen to play but can certainly wait as well.

@ali-ramadhan
Member Author

I think the cleanup mostly involves moving stuff around in files, etc. so the functions themselves shouldn't change.

Should be able to play around right now! Might just need to change the using/import statements once this PR is merged.

Collaborator

@francispoulin francispoulin left a comment


The changes you've made so far also look good to me. I see that one of them changed grid to my_grid. Not sure if that was a bug or not, but if so, glad you caught it.

@ali-ramadhan
Member Author

ali-ramadhan commented Mar 5, 2021

Alright I think this PR is ready to be merged.

Extra change since yesterday: I think we can have a cleaner design by getting rid of DistributedIncompressibleModel and just reusing IncompressibleModel but with architecture = MultiCPU.

So now, instead of dispatching on ::CPU and ::GPU, we would dispatch on ::AbstractCPUArchitecture and ::AbstractGPUArchitecture, and things should work out whether the model is distributed or not.
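Schematically, the hierarchy could look something like this (a sketch; only CPU, GPU, MultiCPU, and the two abstract architecture names come from the discussion above, everything else is illustrative):

```julia
abstract type AbstractArchitecture end
abstract type AbstractCPUArchitecture <: AbstractArchitecture end
abstract type AbstractGPUArchitecture <: AbstractArchitecture end

struct CPU <: AbstractCPUArchitecture end
struct GPU <: AbstractGPUArchitecture end

struct MultiCPU{C} <: AbstractCPUArchitecture
    connectivity :: C    # rank connectivity for halo communication
end

# Methods written against the abstract types then work for serial and distributed
# models alike, e.g.
array_type(::AbstractCPUArchitecture) = Array
# array_type(::AbstractGPUArchitecture) = CuArray   # with CUDA.jl loaded
```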

I see that one of them changed grid to my_grid. Not sure if that was a bug or not, but if so, glad you caught it.

Yes, I think it was picking default boundary conditions based on the full_grid, but it should be using the local my_grid (not that it matters right now, since the local grid inherits the topology of the full grid, but we might change that soon).

EDIT: Also bumping to v0.53.0.

Member

@glwagner glwagner left a comment


A few things discussed on zoom with @ali-ramadhan:

  • Move the "injection" of halo communication boundary conditions from DistributedIncompressibleModel to Field constructor
  • Make sure that halo region locations are named and ordered from left to right, eg (west, east, south, north, bottom, top)
  • Use terminology "local" and "distributed" instead of "my" and "full" for operations in local memory versus distributed operations / data
  • Add a docstring to RankConnectivity to explain what's going on

@vchuravy
Collaborator

vchuravy commented Mar 8, 2021

Before merging it would also be good to do both a weak and a strong scaling analysis to understand where we currently stand.

@ali-ramadhan
Member Author

@vchuravy How big do you think we should go? I guess use one node then go up to 32 or 64 ranks or something? Or should we try doing some multi-node stuff?

Also, should we use multithreading via KernelAbstractions.jl to fill up each node and run on multiple nodes?

Right now we're limited to 1D/slab domain decompositions so I don't think IncompressibleModel will scale super well for many ranks until #1443 is resolved.

@francispoulin
Collaborator

@vchuravy I like the idea of doing both weak and strong scaling analysis. Can you point out any examples that you think do this well that we could use as a guide?

@ali-ramadhan
Member Author

Did a quick strong scaling benchmark on Tartarus (256^3) up to 16 cores, but the results don't look super great: ~9.5x speedup on 16 cores. Better than multi-threading though.

Maybe I'm not benchmarking properly though. Could also be missing some MPI barriers. Should probably learn how to profile MPI code.

                            Incompressible model strong scaling benchmark
┌─────────────────┬───────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────┐
│            size │ ranks │        min │     median │       mean │        max │     memory │ allocs │
├─────────────────┼───────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────┤
│ (256, 256, 256) │     1 │    3.641 s │    3.686 s │    3.686 s │    3.730 s │ 355.28 KiB │   2336 │
│ (256, 256, 256) │     2 │    1.917 s │    1.918 s │    1.921 s │    1.928 s │ 346.00 KiB │   2782 │
│ (256, 256, 256) │     4 │    1.249 s │    1.283 s │    1.279 s │    1.300 s │ 348.47 KiB │   2822 │
│ (256, 256, 256) │     8 │ 652.029 ms │ 714.833 ms │ 704.940 ms │ 738.885 ms │ 353.84 KiB │   2902 │
│ (256, 256, 256) │    16 │ 377.153 ms │ 388.435 ms │ 394.780 ms │ 415.562 ms │ 366.16 KiB │   3062 │
└─────────────────┴───────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────┘
       Incompressible model strong scaling speedup
┌─────────────────┬───────┬─────────┬──────────┬─────────┐
│            size │ ranks │ speedup │   memory │  allocs │
├─────────────────┼───────┼─────────┼──────────┼─────────┤
│ (256, 256, 256) │     1 │     1.0 │      1.0 │     1.0 │
│ (256, 256, 256) │     2 │ 1.92195 │ 0.973876 │ 1.19092 │
│ (256, 256, 256) │     4 │ 2.87312 │ 0.980825 │ 1.20805 │
│ (256, 256, 256) │     8 │ 5.15614 │ 0.995954 │ 1.24229 │
│ (256, 256, 256) │    16 │ 9.48879 │  1.03061 │ 1.31079 │
└─────────────────┴───────┴─────────┴──────────┴─────────┘

@ali-ramadhan ali-ramadhan merged commit 0cb4885 into master Mar 10, 2021
@ali-ramadhan ali-ramadhan deleted the ar/distributed-model branch March 10, 2021 06:23
@tomchor
Collaborator

tomchor commented Mar 10, 2021

Awesome work @ali-ramadhan! Excited to finally have this in the code.

About the scaling. I'm not sure how you're calculating the stats exactly, but one thing to remember is that, given Julia's JIT compiling + all the MPI stuff, I'm guessing the start-up for these simulations is pretty significant and might be impacting your results, no?

So I guess two ways to circumvent that are to (1) benchmark with pretty long simulations or (2) compile everything ahead of time with PackageCompiler.jl.

I don't think we necessarily need to do this now, but it might be good to keep in mind for the future.

Thanks again for the great work!

@francispoulin
Collaborator

@ali-ramadhan I agree you've done a great job with this!

One thing I might suggest we calculate is the efficiency. The way I have seen it defined before is the time for the serial run divided by (the number of cores times the total time for the MPI run). For these cases I find efficiencies of

np       efficiency
==       ==========
2         0.96
4         0.71
8         0.62
16        0.56

The numbers are decreasing, but I suspect we can probably do better by looking to see where the bottlenecks are. For example, I wonder if we could compute the efficiency of just computing the tendencies, which should be pretty fast. Not sure how easy that is to do though.
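Equivalently, efficiency = speedup / ranks can be read straight off the speedup table in the benchmark comment above (a quick sketch; the small differences from my table come down to which timing statistic is used):

```julia
speedups = Dict(2 => 1.92195, 4 => 2.87312, 8 => 5.15614, 16 => 9.48879)
efficiency(ranks) = speedups[ranks] / ranks

for r in (2, 4, 8, 16)
    println("np = $r   efficiency ≈ $(round(efficiency(r), digits=2))")
end
# np = 2    efficiency ≈ 0.96
# np = 4    efficiency ≈ 0.72
# np = 8    efficiency ≈ 0.64
# np = 16   efficiency ≈ 0.59
```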

@ali-ramadhan
Member Author

About the scaling. I'm not sure how you're calculating the stats exactly, but one thing to remember is that, given Julia's JIT compiling + all the MPI stuff, I'm guessing the start-up for these simulations is pretty significant and might be impacting your results, no?

So I guess two ways to circumvent that are to (1) benchmark with pretty long simulations or (2) compile everything ahead of time with PackageCompiler.jl.

That's true but the benchmarking scripts call time_step!(model, some_time_step) once before benchmarking as a warmup to make sure time stepping has been compiled before @benchmark is used. Should be enough but I could be wrong.

But yeah I only @benchmark 10 time steps, so maybe doing more time steps to collect better statistics would help.
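i.e. something like the pattern below, where model and Δt stand in for whatever the benchmark script sets up:

```julia
using BenchmarkTools

time_step!(model, Δt)   # warm-up call so compilation isn't timed

trial = @benchmark time_step!($model, $Δt) samples=10 evals=1
```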

Never considered PackageCompiler.jl for benchmarking, but it makes sense here seeing as it's pretty easy to use!

The numbers are decreasing, but I suspect we can probably do better by looking to see where the bottlenecks are.

Yes for sure. I'll open an issue to figure out how we should profile and benchmark Julia + MPI better.

I wonder if we could compute the efficiency of just computing the tendencies, which should be pretty fast. Not sure how easy that is to do though.

Ah the efficiency metric is great! I'll see if we can easily add it to the benchmark results table.

It should be pretty easy to isolate a part of time_step! to benchmark. I guess MPI only enters in through the pressure solve and the halo filling. Could be useful to benchmark the distributed pressure solve separately and do the same for the halo filling? We should hopefully get similar scaling results as PencilFFTs.jl if we benchmark the pressure solve.

@vchuravy
Collaborator

Better than multi-threading though.

Them fighting words ;)
