MAC projection maxing out on ABL calculations #886

I'm running a stable ABL case on Summit, and I'm now encountering a situation where the MAC projection iterations max out. Unlike issue #859, the nodal projections are fine, but the MAC projections hit 200 iterations immediately.

The stable case input file we're trying to run is here:
https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/StableABL_precursor2.inp

A few things to note: …

There was another case where the MAC projections also failed, on an unstable ABL case with multiple refinement levels; switching to hypre allowed it to keep running (see input file here), but this is again non-optimal -- ideally MLMG should work in these cases.

Lawrence
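[Editor's note: the hypre fallback mentioned above is a one-line input change. A minimal sketch, assuming the bottom_solver option exposed through AMReX's linear-solver inputs and a hypre-enabled build:

mac_proj.bottom_solver = hypre   # use hypre for the bottom solve instead of the native MG bottom solver
]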
Comments
Lawrence, I've run into similar issues in the past, and maybe our issues are related. Mine was with the bottom solver, meaning my smallest problem for the multigrid solver was still too large. I had a round number of grid points in some directions, and that was only divisible by 2 so many times. I also had issues with refinements, and that was especially bad when the refinements were touching the bottom boundary. Looking at your number of cells, that seems similar to what I had. If you want to rule that out, you could try modifying the …
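[Editor's note: to make the divisibility point concrete, take the smallest level-0 grid reported later in this thread, 128 x 32 x 80. The 80-cell direction coarsens 80 → 40 → 20 → 10 → 5, so multigrid can halve that direction only four times before hitting an odd count, leaving the bottom solver a relatively large coarse problem; a power-of-two extent like 128 = 2^7 would allow seven halvings.]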
Thanks @rthedin, these are good suggestions. One thing I tried was to move the refinement zones higher so they wouldn't touch the bottom boundary, but it didn't help with the MAC projection. We do need refinement zones close to the ground surface for this application, though, so it wouldn't have been a perfect solution anyway. Yes, I think something is possibly going on with the cell counts or grid sizes. Changing the …
Lawrence
This issue is stale because it has been open 30 days with no activity.
More updates on this MAC projection issue: I tried this out on Frontier and saw the same problem happening, so it's independent of the machine architecture. However, it does look like it's sensitive to a number of things, including: …
If you're interested in trying this out, StableABL_precursor1.inp is another case, which has the same ABL BCs as …
Any thoughts @psakievich, @asalmgren, or @jrood-nrel? I can try other cases or mesh counts, but this seems to be fairly hit-or-miss at this point.
Lawrence
This issue is stale because it has been open 30 days with no activity.
This issue is stale because it has been open 30 days with no activity.
Just to confirm: on a recent build of AMR-Wind (a1caec3) on Frontier, the MAC projections are still hitting the iteration limit when I run on GPUs. I haven't been able to test very many combinations of GPUs/CPUs, but other ABL cases with only level 0 have been running fine on Frontier so far.
Lawrence
Do we have a specific case that runs on CPU and fails on GPU?
I don't have any case right now that consistently runs on CPUs and fails on GPUs, but because there seems to be a processor-count dependence, it's likely that the StableABL_precursor1.inp case mentioned above will work on some number of CPUs and fail on some other number of GPUs.
Lawrence
Maybe the best use of everyone's time is to wait until the next time this happens so we can go after a specific case?
This issue is stale because it has been open 30 days with no activity.
I just changed a working case from a blocking_factor of 32 to 64, and the simulation went from fine to maxing out iterations on all solvers. This was on the GPU; I haven't tried it on the CPU.
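[Editor's note: for reference, that change corresponds to the following input parameter (the name appears in the srun command later in this thread; the value is the one described above):

amr.blocking_factor = 64   # was 32; with 64 all solvers max out iterations
]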
@jrood-nrel -- is there any update on this? It would be good to determine if it's machine weirdness or something in the code we can fix...
This issue is stale because it has been open 30 days with no activity.
I am just playing catchup here and trying to replicate what others have done.

Preliminaries

- The first thing I noticed was that the time step for the input file I was given was a bit large and led to CFL violation warnings, so I dropped the time step a bit to avoid that.
- This is the input file I am running: StableABL_precursor1.inp.txt <https://github.com/user-attachments/files/15536581/StableABL_precursor1.inp.txt>
- I am using this command to run on Frontier GPUs:

srun -N128 -n1024 -c1 --gpus-per-node=8 --gpu-bind=closest amr_wind StableABL_precursor1.inp time.max_step=20 amrex.abort_on_out_of_gpu_memory=1 amrex.the_arena_is_managed=0 amr.blocking_factor=16 amr.max_grid_size=128 amrex.use_profiler_syncs=0 amrex.async_out=0

- I am using this version of amr-wind:

==============================================================================
AMR-Wind (https://github.com/exawind/amr-wind)
AMR-Wind version :: v2.1.0-13-ge986100e
AMR-Wind Git SHA :: e986100
AMReX version    :: 24.05-20-g5d02c6480a0d
Exec. time       :: Fri May 31 16:44:38 2024
Build time       :: May 24 2024 12:40:23
C++ compiler     :: Clang 17.0.0
MPI              :: ON (Num. ranks = 1024)
GPU              :: ON (Backend: HIP)
OpenMP           :: OFF
Enabled third-party libraries:
NetCDF 4.9.2
This software is released under the BSD 3-clause license.
See https://github.com/Exawind/amr-wind/blob/development/LICENSE for details.
-----------------------------------------------------------------------------

Observations

- This case is a pain to debug. Turnaround time in the Frontier debug queue for one run is several hours; the run itself is about 7 minutes, but you end up sitting in the queue forever. Here's a summary of the grid:

Grid summary:
  Level 0   1848 grids   1200291840 cells   100 % of domain
            smallest grid: 128 x 32 x 80    biggest grid: 128 x 64 x 80
  Level 1   3100 grids   4846387200 cells   50.47092547 % of domain
            smallest grid: 112 x 128 x 96   biggest grid: 128 x 128 x 96
  Level 2   6729 grids   13753548800 cells  17.90391244 % of domain
            smallest grid: 32 x 16 x 128    biggest grid: 128 x 128 x 128

- I did a run with 128 nodes and one with 256 nodes. They both show MAC_projection hitting 200 iterations, but at different time steps; the run with fewer nodes shows just one instance of this in the first 20 steps, while the 256-node case shows four instances.
- These max iters seem to happen early in the run -- in the first 10 steps, not in the next 10. I haven't run beyond that.
- Comparing the steps at which this happens across runs with different node counts doesn't yield much information. In this snapshot, at step 4, the 256-node case hit the max iters, but the values all look pretty similar: residuals, min/max velocities and their locations, etc., are basically the same, except obviously where the mac_projection line differs (O(1e-8) vs O(1e-9)):

Screenshot.2024-06-03.at.9.55.15.AM.png <https://github.com/Exawind/amr-wind/assets/15038415/579c157b-d035-404d-9eaa-7fca9667eb93>

Next steps

@asalmgren I will take suggestions for things to try. I can try to make this case smaller, but it's going to take a while to find a smaller case where this happens, given that things keep changing with node counts, blocking factors, grids, etc. Maybe the MAC projection tolerances are too tight?
Marc -- we need to know why it isn't converging -- so the first thing to look at is whether the bottom solver is converging. Turn on the bottom solver verbosity enough to tell whether it is maxing out on iterations.
Something that would also help would be to see what the residual is at each level up and down the V-cycle -- that will tell us whether the issue is something in one of the higher AMR levels or something coarser than AMR level 0. Can you turn on more verbosity so we can see that as well?
One final thought -- I'd be more comfortable resolving the other MAC issue (with inconsistent filling of boundary values) before going after this. Is it possible that has caused this issue (i.e. the BCs are making the system unsolvable due to inconsistent filling of boundary values)?
So maybe do both -- set off the runs with a bunch of verbosity, and at the same time see if you can resolve the fillpatch issue? Those are my best suggestions for a path forward.
Ok, will give these a shot. The thought had also occurred to me about the other issue...
Ok so I am getting a whole bunch of:

MLCGSolver_BiCGStab: Initial error (error0) = 1.349805439e-11
MLCGSolver_BiCGStab: Final: Iteration 201 rel. err. 0.003540589525
MLCGSolver_BiCGStab:: failed to converge!!
MLCGSolver_BiCGStab: Initial error (error0) = 1.344908633e-11
MLCGSolver_BiCGStab: Final: Iteration 201 rel. err. 0.005534598243
MLCGSolver_BiCGStab:: failed to converge!!
MLCGSolver_BiCGStab: Initial error (error0) = 1.340034281e-11
MLCGSolver_BiCGStab: Final: Iteration 201 rel. err. 0.002623853639
MLCGSolver_BiCGStab:: failed to converge!!
MLMG: Timers: Solve = 43.64502304 Iter = 43.60381034 Bottom = 28.08960259
MAC_projection 200 0.01090075748 3.583547755e-08

and I am running with:

mac_proj.verbose = 1
mac_proj.bottom_verbose = 2

Is that verbose enough, or do you want more? It's hard to know what the verbosity levels correspond to, so I took a wild guess.
If the bottom solver is going 1e-11 to 1e-14, that is fine.
Go ahead and set the bottom solver absolute tolerance to 1e-14.
Set mac_proj.verbose to 4
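[Editor's note: in input-file form, those suggestions would look something like the following. mac_proj.verbose and mac_proj.bottom_verbose appear verbatim elsewhere in this thread; mac_proj.bottom_atol is an assumed spelling for the bottom-solver absolute tolerance:

mac_proj.verbose        = 4        # more MLMG output, per the suggestion above
mac_proj.bottom_verbose = 2        # report bottom-solver (BiCGStab) iterations
mac_proj.bottom_atol    = 1.0e-14  # assumed name: bottom-solver absolute tolerance
]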
Things are getting verbose ;) Here's the output: debug_mac_segfault.o1997522.txt <https://github.com/user-attachments/files/15539414/debug_mac_segfault.o1997522.txt>
Ah ok -- that is useful.
Next question: if you run this again with exactly the same executable and inputs file on the same number of ranks/nodes, will it fail exactly the same way, i.e. will you see exactly these same numbers at the same steps?

MAC_projection   9 0.01059370304 2.170464121e-09
MAC_projection 200 0.01090075747 3.406772667e-08
MAC_projection 200 0.01070578287 3.728472752e-08
MAC_projection 200 0.01059956843 3.751627153e-08
MAC_projection 200 0.01398727573 3.031940042e-08
MAC_projection   7 0.01349900866 4.186323505e-09
MAC_projection   7 0.0133575733  3.643233593e-09
MAC_projection   6 0.01556601993 1.358084751e-08
MAC_projection   6 0.01523504224 1.387033766e-08
MAC_projection   7 0.01389531987 3.696991249e-09
MAC_projection   7 0.01349196508 3.703382658e-09
MAC_projection   7 0.01397295297 3.724432835e-09
MAC_projection   7 0.0145134622  4.762901849e-09
MAC_projection   7 0.0155179704  4.115332929e-09
MAC_projection   7 0.01654709104 2.063066006e-09
MAC_projection   6 0.01760117101 1.673764457e-08
MAC_projection   6 0.01867633017 1.72870855e-08
MAC_projection   6 0.01973103806 1.747265777e-08
MAC_projection   6 0.02076201332 1.68252392e-08
MAC_projection   6 0.0217731548  1.751393021e-08
MAC_projection   6 0.02276651583 1.749625856e-08
MAC_projection   6 0.02374135767 1.806140752e-08
MAC_projection   6 0.02469769443 1.901520436e-08
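[Editor's note: one way to check that (a sketch; the srun flags are elided here and should match the full command given earlier in this thread):

srun ... amr_wind StableABL_precursor1.inp time.max_step=20 > run1.log 2>&1
srun ... amr_wind StableABL_precursor1.inp time.max_step=20 > run2.log 2>&1
# an empty diff means the solver history is reproducible; any output means it is not
diff <(grep MAC_projection run1.log) <(grep MAC_projection run2.log)
]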
It is non-deterministic. Just because things weren't fun enough. Here's the other log file so you can look as well: debug_mac_segfault.o1997967.txt <https://github.com/user-attachments/files/15540032/debug_mac_segfault.o1997967.txt>
Screenshot.2024-06-03.at.3.37.16.PM.png <https://github.com/Exawind/amr-wind/assets/15038415/6f41a9df-ef14-444d-8283-47c97d1522eb>
Ok, just looking at the second MAC projection in each stack, you can see the differences in the first V-cycle. This suggests something is either uninitialized or at least not consistently initialized in the data. Let's get the other MAC (fillpatch) issue fixed first and see if unraveling that thread fixes this as well.
Worked with @WeiqunZhang and @asalmgren on this. I think this is solved once AMReX-Codes/amrex#3991 gets merged in. I ran with the StableABL_precursor1.inp.txt case, and all MAC_projection iteration counts are around 6-9. Per @WeiqunZhang: …
I will close this issue once I've updated the submodules.
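[Editor's note: the submodule bump might look something like this (a sketch; the submods/amrex path is an assumption about amr-wind's repository layout, and the SHA is a hypothetical placeholder for a commit containing the AMReX-Codes/amrex#3991 fix):

cd amr-wind
git -C submods/amrex fetch origin
git -C submods/amrex checkout <sha-with-the-amrex-fix>
git add submods/amrex
git commit -m "Bump AMReX to pick up MLMG bottom-solver fix"
]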