Write a better LoadBalancer #1578
Would it be easier/better to adopt a simpler grid generation algorithm --- e.g. a tree-structure with fixed-sized patches?
That may be more efficacious, but I think just redoing the load balancer will be easier. If we use a different grid generation algorithm we'll have to redo both the load balancer and the gridding algorithm instead of just the load balancer. Furthermore, being able to chop boxes in more arbitrary ways might be advantageous for IB load balancing.
This has come up a couple of times on slack in different places so I'll summarize what I know here. I've been running the nozzle benchmark on my machine to try and address some IFED scalability issues - at this point we have more parallelization problems with SAMRAI than IFED, which is good since, fundamentally, we can fix a lot of these by just writing a better load balancer. A few highlights from callgrind:
Update 1: some relevant discussion on AMR in the Parthenon paper (AFAICT they use trees, not patches)
Update 2: I spent a little bit playing around with a possible new load-balancing algorithm for patches. It does a better job than the default LoadBalancer because it alternates between chopping and balancing. The algorithm is something like:
My rough Python implementation of this algorithm converges nicely: the difference between the biggest and smallest bins roughly halves at each step and we create a constant (n_procs / 4) number of new patches at most. It's just a simulacrum (I approximate bad chops by adding some randomness) but I get far better results than SAMRAI's load balancer (60 total patches instead of 300, maximum difference of 100 instead of 3000).
Update 3: We may have found a capable undergraduate student to work on this this summer. Another thing we should examine, if we end up using integer programming, is penalizing using more ghost regions. Formulations of bin packing as an integer program typically involve minimizing some quantity
Can the load balance problem be phrased as just graph partitioning with a balance constraint (as ParMETIS supports)?
(edited by @drwells to remove extra email text)
@knepley We can probably interpret this load-balancing problem as a weighted graph-balancing problem. A fundamental difference between this and a more normal situation (e.g., FEM) is that we can chop patches. For example, on a single processor SAMRAI generates for exactly the same grid as the one in my first picture. Just partitioning this isn't going to work since there are only 16 patches: we need a way to chop them up in a nice way too.
In some sense, we don't have to use SAMR as long as the levels are properly nested. Would it make sense to use
Also, by checking the refinement criteria you input, you can force p4est to deliver blocks if you want, and the 2:1 balance is also optional.
Is it
Maybe! That's essentially what Parthenon is doing (and a lot of other libraries). Notably, AMReX does not (they still use SAMR).
I believe this is the default - e.g., everything in deal.II is properly nested.
Seems pretty tempting to try to use
I have been using it for a while now. All the plasma stuff I am doing with Mark Adams uses p4est. You can convert seamlessly to a Plex (and pretty seamlessly back). We have some built-in stuff to define Vec objects that define the refinement/coarsening (see VecTagger and DMAdaptLabel/DMAdaptMetric).
I've been doing some application profiling in the background while working on other things and trying to figure out why, for some problems, we spend an unreasonable amount of time communicating. The answer appears to be 'load imbalances and too much ghost data'. The default LoadBalancer class does its work in two separate steps:
The first step can be problematic when we have weird grids, e.g., for the nozzle example and 24 processors we get
which isn't even that good: the number of cells per processor varies from about 13k to about 16k. A better approach would be to interleave these two steps so that a processor can, ideally, only have one big patch if it is the right size. One way to achieve this would be to bin pack first, pair the 'biggest' bins with the 'smallest' bins, and load balance in pairs.
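The bin-pack, pair, and balance-in-pairs idea above can be sketched in Python roughly as follows. This is only an illustration under strong assumptions, not SAMRAI code: a patch is reduced to its cell count, a chop may split a patch at any cell count (real boxes can only be chopped along index planes, and ghost regions are ignored), and the names bin_pack, balance_in_pairs, and imbalance are hypothetical helpers made up for this sketch.

```python
import heapq


def bin_pack(patch_sizes, n_procs):
    """Greedy packing: put the largest remaining patch in the lightest bin."""
    bins = [[] for _ in range(n_procs)]
    heap = [(0, rank) for rank in range(n_procs)]  # entries are (load, rank)
    heapq.heapify(heap)
    for size in sorted(patch_sizes, reverse=True):
        load, rank = heapq.heappop(heap)
        bins[rank].append(size)
        heapq.heappush(heap, (load + size, rank))
    return bins


def imbalance(bins):
    """Difference in cell count between the fullest and emptiest bins."""
    loads = [sum(b) for b in bins]
    return max(loads) - min(loads)


def balance_in_pairs(bins):
    """Pair the heaviest bins with the lightest ones and shift work between
    each pair. Returns the number of patches that had to be chopped."""
    order = sorted(range(len(bins)), key=lambda r: sum(bins[r]))
    n_chops = 0
    for k in range(len(order) // 2):
        light, heavy = bins[order[k]], bins[order[-1 - k]]
        move = (sum(heavy) - sum(light)) // 2  # half the gap, in cells
        if move <= 0:
            continue
        biggest = max(heavy)
        amount = min(move, biggest)
        heavy.remove(biggest)
        if amount < biggest:
            # Assumption: an idealized chop that can split off any cell count.
            heavy.append(biggest - amount)
            n_chops += 1
        light.append(amount)
    return n_chops


# Pack once, then alternate pairwise balancing passes.
bins = bin_pack([4000, 3000, 500, 500], n_procs=4)
for _ in range(3):
    balance_in_pairs(bins)
print(imbalance(bins), [len(b) for b in bins])
```

Because each pass moves at most half of a pair's gap, no bin in a pair ends up outside the range the pair started with, so the overall imbalance never grows and each pass creates at most n_procs / 2 new patches in this idealized setting. A processor whose single patch is already close to the right size is left alone.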
Another nice followup would be to get rid of the SecondaryHierarchy and load balance by IB points and then by number of cells so that a single Eulerian partitioning is load-balanced for both cases.