Write a better LoadBalancer #1578

Open
drwells opened this issue Apr 12, 2023 · 11 comments

drwells (Member) commented Apr 12, 2023

I've been doing some application profiling in the background while working on other things, trying to figure out why, for some problems, we spend an unreasonable amount of time communicating. The answer appears to be 'load imbalances and too much ghost data'. The default LoadBalancer class does its work in two separate steps (sketched after the list):

  1. chop boxes so that their workload is within some factor of the average workload
  2. bin pack those boxes
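
A minimal schematic of those two steps, just to fix ideas (this is not SAMRAI's implementation: box geometry is reduced to scalar workloads and "chopping" simply halves a workload):

```python
# Schematic of the default chop-then-pack strategy: chop oversized boxes
# first, then bin pack whatever boxes result. All names here are illustrative.

def chop(workloads, n_procs, max_factor=1.25):
    """Step 1: split any box whose workload exceeds max_factor times the
    average per-processor workload."""
    average = sum(workloads) / n_procs
    pending, boxes = list(workloads), []
    while pending:
        w = pending.pop()
        if w > max_factor * average:
            pending.extend([w / 2, w / 2])  # stand-in for chopping a box in half
        else:
            boxes.append(w)
    return boxes

def pack(boxes, n_procs):
    """Step 2: greedy largest-first bin packing of the chopped boxes."""
    bins = [[] for _ in range(n_procs)]
    for w in sorted(boxes, reverse=True):
        min(bins, key=sum).append(w)
    return bins

bins = pack(chop([40000, 13000, 9000, 2000], n_procs=4), n_procs=4)
print([sum(b) for b in bins])  # e.g. [20000.0, 20000.0, 13000, 11000] - still imbalanced
```

Because the chopping criterion never looks at where the pieces will end up, the packed bins can still differ substantially, which is the imbalance described below.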

The first step can be problematic when we have weird grids; e.g., for the nozzle example on 24 processors we get

[figure: patch layout for the nozzle example on 24 processors]

which isn't even that good: the number of cells per processor varies from about 13k to about 16k. A better approach would be to interleave these two steps so that a processor can, ideally, have only one big patch if it is the right size. One way to achieve this would be to bin pack first, pair the 'biggest' bins with the 'smallest' bins, and load balance in pairs.

Another nice followup would be to get rid of the SecondaryHierarchy and load balance by IB points and then by number of cells so that a single Eulerian partitioning is load-balanced for both cases.

drwells added the SAMRAI label Apr 12, 2023
boyceg (Contributor) commented Apr 12, 2023

Would it be easier/better to adopt a simpler grid generation algorithm, e.g., a tree structure with fixed-size patches?

drwells (Member, Author) commented Apr 12, 2023

That may be more efficacious, but I think just redoing the load balancer will be easier. If we use a different grid generation algorithm, we'll have to redo both the load balancer and the gridding algorithm instead of just the load balancer. Furthermore, being able to chop boxes in more arbitrary ways might be advantageous for IB load balancing.

drwells (Member, Author) commented Apr 18, 2023

This has come up a couple of times on Slack in different places, so I'll summarize what I know here.

I've been running the nozzle benchmark on my machine to try to address some IFED scalability issues. At this point we have more parallelization problems with SAMRAI than with IFED, which is good since, fundamentally, we can fix a lot of these by just writing a better LoadBalancer.

A few highlights from callgrind:

  • some basic things which are very easy to patch (Patch tbox::Schedule #1562, Patch ArrayDataOperationsUtilities for copies #1563, Compute nonoverlapping boxes on demand. #1564) and which marginally improve performance (each is a few percent of the overall solver time); the nonoverlapping-boxes patch (#1564) actually makes my benchmark run twice as fast
  • SAMRAI's communication uses extra buffers on both ends; we can get another marginal improvement in communication cost by writing directly into PatchData from MPI's buffer (a generic sketch of the idea follows this list)
  • when doing a MatMult we spend roughly half our time communicating ghost data: that's good motivation both to get rid of some more intermediate copies in SAMRAI and to reduce the total number of ghost cells by writing a better load balancer. Reducing copies is a marginal improvement compared to writing a really good load balancer. Maybe we can collaborate with the AMREX people on this? I haven't found many load-balancing algorithms for patch-based FDM in the literature; I should look more.
  • Nodal IFED is now fast enough that we spend a lot more time communicating Cartesian-grid data than actually doing IFED :) The global FE scatter isn't that bad (and there isn't much more we can do here - see drwells/fiddle@d24a478) but doing a global redistribution of Cartesian grid data is about 10x as expensive as doing IFED stuff. Scattering the accumulated forces back is also quite slow (maybe 3x the cost of doing the actual spreading). This implies that we need to get rid of SecondaryHierarchy and have a single global partitioning of the Cartesian-grid data which load balances both the total number of cells/processor and number of IB points/processor.
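
As a generic illustration of the buffer point above (mpi4py here, nothing SAMRAI-specific; the array names are made up): receiving straight into the destination array avoids the extra staging buffer and the separate unpack copy.

```python
# Run with: mpiexec -n 2 python this_script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
ghost = np.empty(1024, dtype=np.float64)  # stand-in for a PatchData ghost region

if comm.Get_rank() == 0:
    data = np.arange(1024, dtype=np.float64)
    comm.Send([data, MPI.DOUBLE], dest=1, tag=7)
elif comm.Get_rank() == 1:
    # Receive directly into the ghost region: no intermediate receive buffer
    # and no separate unpacking pass afterwards.
    comm.Recv([ghost, MPI.DOUBLE], source=0, tag=7)
```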

Update 1: some relevant discussion on AMR in the Parthenon paper (AFAICT they use trees, not patches)

  • https://arxiv.org/pdf/2202.12309.pdf: "PARTHENON provides infrastructure for AMR with both cell- and face-centered data. The size of these arrays must be the same on all MeshBlocks, and moreover the overall domain must contain an integer number of MeshBlocks in each dimension. However, the number and size of individual MeshBlocks tiling the computational domain is arbitrary."

Update 2: I spent a little bit of time playing around with a possible new load-balancing algorithm for patches. It does a better job than the default LoadBalancer because it alternates between chopping and balancing. The algorithm is something like:

  1. Calculate a weight for each patch (e.g., the number of cells).
  2. Use a bin packing algorithm (e.g., Karmarkar-Karp) to balance patches between bins.
  3. Pair the biggest and smallest bin quartiles together. The bigger bin donates something (either a patch or a subset of a patch) to the smaller bin to move it closer to the optimal workload.
  4. Repeat 6-7 times.

My rough Python implementation of this algorithm converges nicely: the difference between the biggest and smallest bins roughly halves at each step, and we create at most a constant (n_procs / 4) number of new patches per step. It's just a simulacrum (I approximate bad chops by adding some randomness), but I get far better results than SAMRAI's load balancer (60 total patches instead of 300, a maximum difference of 100 instead of 3000).
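
For reference, here's a minimal sketch of that alternating loop. It is not the actual prototype: greedy largest-first packing stands in for Karmarkar-Karp, a "chop" just splits off part of a patch's weight, and a bit of noise mimics imperfect cuts.

```python
import random

def alternating_balance(weights, n_bins, n_iters=7, chop_noise=0.1):
    # Steps 1-2: pack the per-patch weights into n_bins bins.
    bins = [[] for _ in range(n_bins)]
    for w in sorted(weights, reverse=True):
        min(bins, key=sum).append(w)

    target = sum(weights) / n_bins
    for _ in range(n_iters):
        # Step 3: pair the lightest quartile of bins with the heaviest.
        order = sorted(range(n_bins), key=lambda b: sum(bins[b]))
        quartile = max(1, n_bins // 4)
        for light, heavy in zip(order[:quartile], order[:-quartile - 1:-1]):
            deficit = target - sum(bins[light])
            if deficit <= 0 or not bins[heavy]:
                continue
            donor = max(bins[heavy])
            bins[heavy].remove(donor)
            if donor <= deficit:
                # Donate a whole patch.
                bins[light].append(donor)
            else:
                # Chop off roughly the deficit; the noise models a bad cut.
                piece = min(donor, deficit * (1 + random.uniform(-chop_noise, chop_noise)))
                if donor - piece > 0:
                    bins[heavy].append(donor - piece)
                bins[light].append(piece)
    return bins

random.seed(0)
weights = [random.randint(5000, 20000) for _ in range(16)]
bins = alternating_balance(weights, n_bins=8)
loads = sorted(sum(b) for b in bins)
print("spread:", round(loads[-1] - loads[0]), "total patches:", sum(len(b) for b in bins))
```

At most `quartile` (roughly n_procs / 4) new patches are created per pass, mirroring the convergence behavior described above.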

Update 3: We may have found a capable undergraduate student to work on this over the summer.

Another thing we should examine, if we end up using integer programming, is penalizing the use of more ghost regions. Formulations of bin packing as an integer program typically involve minimizing some quantity r. We could penalize ghost regions by adding up the number of ghost regions created by particular bin packing decisions. It might not help a lot, but it could preferentially keep adjacent patches together (and then we can just merge them in a postprocessing step).
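
For reference, a standard way to write that down is sketched below. Every symbol ($x_{ik}$ assigning patch $i$ to bin $k$, $w_i$ the patch weight, $g_{ij}$ marking an adjacent pair split across bins, $\lambda$ the penalty weight) is an illustrative name, not anything from SAMRAI:

$$
\begin{aligned}
\min_{x,\,g,\,r}\quad & r + \lambda \sum_{(i,j)\ \text{adjacent}} g_{ij} \\
\text{s.t.}\quad & \sum_k x_{ik} = 1 \quad \text{for every patch } i,\\
& \sum_i w_i\, x_{ik} \le r \quad \text{for every bin } k,\\
& g_{ij} \ge x_{ik} - x_{jk} \quad \text{for every adjacent pair } (i,j) \text{ and every bin } k,\\
& x_{ik},\, g_{ij} \in \{0,1\}, \qquad r \ge 0.
\end{aligned}
$$

Minimizing $r$ alone is the usual makespan-style objective; the $\lambda$ term charges every adjacent pair that ends up split across bins, which is a rough proxy for the number of ghost regions, so neighboring patches stay together whenever that doesn't hurt the balance (and can then be merged in postprocessing).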

knepley (Contributor) commented Apr 18, 2023 via email

drwells (Member, Author) commented Apr 18, 2023

@knepley We can probably interpret this load-balancing problem as a weighted graph-balancing problem. A fundamental difference between this and a more normal situation (e.g., FEM) is that we can chop patches. For example, on a single processor SAMRAI generates

[figure: single-processor patch layout generated by SAMRAI (visit0000)]

for exactly the same grid as the one in my first picture. Just partitioning this isn't going to work since there are only 16 patches: we need a way to chop them up in a nice way too.

boyceg (Contributor) commented Apr 19, 2023

In some sense, we don't have to use SAMR as long as the levels are properly nested. Would it make sense to use p4est?

knepley (Contributor) commented Apr 19, 2023

Also, by checking the refinement criteria you input, you can force p4est to deliver blocks if you want, and the 2:1 balance is also optional.

boyceg (Contributor) commented Apr 19, 2023

Is it straightforward to get p4est to generate what SAMR folks call "properly nested" grids?

drwells (Member, Author) commented Apr 19, 2023

> Would it make sense to use p4est?

Maybe! That's essentially what Parthenon is doing (and a lot of other libraries). Notably, AMREX does not (they still use SAMR).

> Is it straightforward to get p4est to generate what SAMR folks call "properly nested" grids?

I believe this is the default - e.g., everything in deal.II is properly nested.

boyceg (Contributor) commented Apr 19, 2023

Seems pretty tempting to try to use p4est here...

knepley (Contributor) commented Apr 19, 2023

I have been using it for a while now. All the plasma stuff I am doing with Mark Adams uses p4est. You can convert seamlessly to a Plex (and pretty seamlessly back). We have some built-in stuff to define Vec objects that define the refinement/coarsening (see VecTagger and DMAdaptLabel/DMAdaptMetric).
