MAC projection maxing out on ABL calculations #886

Closed · lawrenceccheung opened this issue Jul 25, 2023 · 24 comments · Fixed by #1106
@lawrenceccheung
Contributor

I'm running a stable ABL case on Summit and am now encountering a situation where the MAC projection iterations are maxing out. Unlike issue #859, the nodal projections are fine, but the MAC projections hit 200 iterations immediately.

The stable case input file we're trying to run is here:
https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/StableABL_precursor2.inp

A few things to note:

  • If we switch from MLMG to hypre (see this input file), the MAC projection converges, but every time step takes much longer.
  • If we stick to the level 0 mesh only (remove all refinements), the case runs fine.
  • This ABL has run fine before: it uses the exact same boundary conditions as this case, except it's larger and has refinement levels. It's also the same stable ABL as this case here, which also included mesh refinement.

There was another case where the MAC projections also failed, on an unstable ABL case with multiple refinement levels; switching to hypre allowed it to keep running (see input file here), but this is again non-optimal -- ideally MLMG should work in these cases.
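
For reference, the core of the MLMG-to-hypre switch looks roughly like the lines below. This is only a sketch: the parameter names assume the usual mac_proj/nodal_proj input prefixes used elsewhere in this thread, and the full hypre configuration is in the linked input files.

mac_proj.bottom_solver = hypre      # hand the MAC projection bottom solve to hypre instead of the MLMG default
nodal_proj.bottom_solver = hypre    # optional: the same switch for the nodal projection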

Lawrence

@rthedin

rthedin commented Jul 26, 2023

Lawrence, I've run into similar issues in the past, and maybe our issues are related. Mine was with the bottom solver, meaning the smallest problem the multigrid solver could coarsen down to was still too large. I had a round number of grid points in some directions that was only divisible by 2 a few times. I also had issues with refinements, and that was especially bad when the refinements were touching the bottom boundary.

Looking at your number of cells, that seems similar to what I had. If you want to rule that out, you could try changing amr.n_cell to powers of 2 or something close (like 4096 5120 96 in your case); see the sketch below.
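
A minimal sketch of that change, using the suggested cell counts; the point is that each direction can be halved many times before the multigrid bottom solve:

amr.n_cell = 4096 5120 96    # 4096 = 2^12, 5120 = 2^10*5, 96 = 2^5*3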

@lawrenceccheung
Contributor Author

Thanks @rthedin, these are good suggestions. One thing I tried was moving the refinement zones higher so they wouldn't touch the bottom boundary, but it didn't help with the MAC projection. And since we need refinement zones close to the ground surface for this application, it wouldn't have been a complete solution anyway.

Yes, I think there is possibly something going on with the cell counts or grid sizes. Changing amr.blocking_factor and amr.max_grid_size in the unstable ABL case where the MAC projections failed didn't help (those are the kinds of parameters sketched below), but maybe a strict power of 2 is necessary.
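
For context, these are the parameters in question; the values here are only illustrative, not the ones from the actual case:

amr.blocking_factor = 8      # every grid dimension must be a multiple of this
amr.max_grid_size = 128      # upper bound on the size of each grid box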

Lawrence

@github-actions

This issue is stale because it has been open 30 days with no activity.

@lawrenceccheung
Contributor Author

More updates on this MAC projection issue: I tried this out on Frontier and see the same problem, so it's independent of the machine architecture. However, it does look like it's sensitive to a number of things, including:

  • mesh domain/n_cell
  • the number of cores that are used
  • the forcing used.

If you're interested in trying this out, StableABL_precursor1.inp is another case with the same ABL BCs as StableABL_precursor2.inp, but set up on a slightly smaller domain and with different mesh counts. I'm also only running it for 10 iterations so it fits in the debug queue.

  1. It works on 128 nodes/1024 GPUs on Frontier. The first nodal projection step takes 92 iterations, but after that, both the MAC and nodal projections seem to converge within 10 iterations.
  2. If you use 256 nodes/2048 GPUs, the MAC projections max out.
  3. If you turn on ABLMeanBoussinesq, the MAC projections also max out (see the sketch after this list).
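
For item 3, "turning on ABLMeanBoussinesq" means adding it to the momentum source-term list, roughly as below. The other source terms shown are placeholders for whatever the input file already uses, not a copy of it:

ICNS.source_terms = BoussinesqBuoyancy CoriolisForcing ABLForcing ABLMeanBoussinesq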

Any thoughts @psakievich, @asalmgren, or @jrood-nrel? I can try other cases or mesh counts, but this seems to be fairly hit-or-miss at this point.

Lawrence

@github-actions

This issue is stale because it has been open 30 days with no activity.

@lawrenceccheung
Contributor Author

Just to confirm: on a recent build of AMR-Wind (a1caec3) on Frontier, the MAC projections are still hitting the iteration limit when I run on GPUs. I haven't been able to test very many different combinations of GPUs/CPUs, but other ABL cases with level 0 only have been running fine on Frontier so far.

Lawrence

@asalmgren
Contributor

asalmgren commented Dec 28, 2023 via email

@lawrenceccheung
Contributor Author

I don't have a case right now that consistently runs on CPUs and fails on GPUs, but because there seems to be a processor-count dependence, it's likely that the StableABL_precursor1.inp case mentioned above will work on some number of CPUs and fail on some other number of GPUs.

Lawrence

@asalmgren
Contributor

asalmgren commented Dec 30, 2023 via email

@jrood-nrel
Contributor

I just changed a working case from a blocking_factor of 32 to 64 and the simulation went from fine to maxing out iterations on all solvers. It was on the GPU and I haven't tried it on the CPU.

@asalmgren
Contributor

@jrood-nrel - is there any update on this? It would be good to determine if it's machine weirdness or something in the code we can fix...

@marchdf
Contributor

marchdf commented Jun 3, 2024

I am just playing catch-up here and trying to replicate what others have done.

Preliminaries

  • The first thing I noticed was that the time step in the input file I was given was a bit large and led to CFL violation warnings, so I dropped the time step a bit to avoid that.
  • This is the input file I am running:
    StableABL_precursor1.inp.txt
  • I am using this command to run on Frontier GPUs:
 srun -N128 -n1024 -c1 --gpus-per-node=8 --gpu-bind=closest amr_wind StableABL_precursor1.inp time.max_step=20 amrex.abort_on_out_of_gpu_memory=1 amrex.the_arena_is_managed=0 amr.blocking_factor=16 amr.max_grid_size=128 amrex.use_profiler_syncs=0 amrex.async_out=0
  • I am using this version of amr-wind:
==============================================================================
                AMR-Wind (https://github.com/exawind/amr-wind)

  AMR-Wind version :: v2.1.0-13-ge986100e
  AMR-Wind Git SHA :: e986100e5722648d991f9102c7b3859b1d1a03a5
  AMReX version    :: 24.05-20-g5d02c6480a0d

  Exec. time       :: Fri May 31 16:44:38 2024
  Build time       :: May 24 2024 12:40:23
  C++ compiler     :: Clang 17.0.0

  MPI              :: ON    (Num. ranks = 1024)
  GPU              :: ON    (Backend: HIP)
  OpenMP           :: OFF

  Enabled third-party libraries:
    NetCDF    4.9.2

           This software is released under the BSD 3-clause license.
 See https://github.com/Exawind/amr-wind/blob/development/LICENSE for details.
-----------------------------------------------------------------------------

Observations

  • This case is a pain to debug. Turnaround time in the Frontier debug queue for one run is several hours; the run itself is about 7 minutes, but you end up sitting in the queue forever. Here's a summary of the grid.
Grid summary:
  Level 0   1848 grids  1200291840 cells  100 % of domain
            smallest grid: 128 x 32 x 80  biggest grid: 128 x 64 x 80
  Level 1   3100 grids  4846387200 cells  50.47092547 % of domain
            smallest grid: 112 x 128 x 96  biggest grid: 128 x 128 x 96
  Level 2   6729 grids  13753548800 cells  17.90391244 % of domain
            smallest grid: 32 x 16 x 128  biggest grid: 128 x 128 x 128
  • I did a run with 128 nodes and one with 256 nodes. They both show MAC_projection hitting 200 iterations, but at different time steps. The run with fewer nodes shows just 1 instance of this in the first 20 steps, while the 256-node case shows 4 instances.
  • These max-iteration hits seem to happen early in the run: in the first 10 steps, not in the next 10. I haven't run further than that.
  • Comparing the steps at which this happens across runs with different node counts doesn't reveal much. In the snapshot below, the 256-node case hit the max iterations on step 4, but the values of everything else (residuals, min/max velocities and their locations, etc.) are basically identical, except of course for the MAC_projection line itself (O(1e-8) vs O(1e-9)):
    [Screenshot: side-by-side log output from the 128-node and 256-node runs around step 4]

Next steps

@asalmgren I will take suggestions for things to try. I can try to make this case smaller, but it's going to take a while to find a smaller case where this happens, given that things keep changing with node counts, blocking factors, grids, etc. Maybe the MAC projection tolerances are too tight? (The knobs I have in mind are sketched below.)
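
The tolerance knobs in question would be the usual MLMG ones under the mac_proj prefix; the values below are only an example of loosening them, not what the case currently uses:

mac_proj.mg_rtol = 1.0e-10    # relative tolerance for the MAC projection solve
mac_proj.mg_atol = 1.0e-12    # absolute tolerance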

@asalmgren
Contributor

asalmgren commented Jun 3, 2024 via email

@marchdf
Contributor

marchdf commented Jun 3, 2024

OK, will give these a shot. The thought had also occurred to me about the other issue...

@marchdf
Contributor

marchdf commented Jun 3, 2024

OK, so I am getting a whole bunch of:

MLCGSolver_BiCGStab: Initial error (error0) =        1.349805439e-11
MLCGSolver_BiCGStab: Final: Iteration  201 rel. err. 0.003540589525
MLCGSolver_BiCGStab:: failed to converge!!
MLCGSolver_BiCGStab: Initial error (error0) =        1.344908633e-11
MLCGSolver_BiCGStab: Final: Iteration  201 rel. err. 0.005534598243
MLCGSolver_BiCGStab:: failed to converge!!
MLCGSolver_BiCGStab: Initial error (error0) =        1.340034281e-11
MLCGSolver_BiCGStab: Final: Iteration  201 rel. err. 0.002623853639
MLCGSolver_BiCGStab:: failed to converge!!
MLMG: Timers: Solve = 43.64502304 Iter = 43.60381034 Bottom = 28.08960259
  MAC_projection               200         0.01090075748       3.583547755e-08

and I am running with:

mac_proj.verbose = 1
mac_proj.bottom_verbose = 2

Is that verbose enough, or do you want more? It's hard to know what the verbosity levels correspond to, so I took a wild guess.

@asalmgren
Contributor

asalmgren commented Jun 3, 2024 via email

@marchdf
Contributor

marchdf commented Jun 3, 2024

Things are getting verbose ;) Here's the output:
debug_mac_segfault.o1997522.txt

@asalmgren
Contributor

asalmgren commented Jun 3, 2024 via email

@marchdf
Contributor

marchdf commented Jun 3, 2024

It is non-deterministic. Just because things weren't fun enough. Here's the other log file so you can look as well:
debug_mac_segfault.o1997967.txt

[Screenshot: log excerpt from the repeated run, for comparison with the earlier one]

@asalmgren
Contributor

asalmgren commented Jun 4, 2024 via email

@marchdf
Contributor

marchdf commented Jun 20, 2024

Worked with @WeiqunZhang and @asalmgren on this. I think this is solved once this PR, AMReX-Codes/amrex#3991, gets merged in. I ran with the following case, StableABL_precursor1.inp.txt, and all MAC_projection iterations are around 6-9.

Per @WeiqunZhang:

The observation is that for the failed bottom solves, the bottom solver was able to reduce the residual by 1.e-2, but not to the target of 1.e-4. In the development branch of amrex, we discard that result and apply the smoother 8 times. That probably makes things worse compared to the result of bicgstab. In the draft PR, the starting point for the smoothing is the result of bicgstab if it is an improvement, even if unconverged. I also added an option to AMReX-Hydro to change the default number of smoothing passes after the bicgstab. The default in amrex is 8, but you can use something like mac_proj.num_final_smooth = 16 if 8 is not sufficient.
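
In input-file terms, once the submodules are updated, the workaround from the quote above would just be a line like this (16 being the example value mentioned above):

mac_proj.num_final_smooth = 16    # extra smoothing passes after the bottom bicgstab; the amrex default is 8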

I will close this issue once I've updated the submodules.
