Nodal projections maxing out on ABL calculation #859

Open
lawrenceccheung opened this issue Jun 20, 2023 · 38 comments

@lawrenceccheung
Contributor

I am re-running an ABL case as part of AWAKEN, and with the latest build of amr-wind (a75d2ec) the nodal_projections are maxing out. This is a case that I've run before, but I'm adding different sampling planes. You can see the basic configuration here: https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/UnstableABL_farmrun1/UnstableABL_farmrun1_noturbs.inp. The last time I ran this, both the nodal projections and MAC projections required only 8 iterations per timestep.

I tried this case with a slightly older build of amr-wind (185c360) from April, and the case is working fine with that exe. So sometime between then and now, something was introduced which affected the ABL solver. I'll continue trying to find which commit is causing the issue, but I'm curious if anybody else is seeing this problem.

Lawrence

@psakievich
Contributor

This sounds like what we are seeing for the blade-resolved cases as well. @ashesh2512 @PaulMullowney @marchdf

@PaulMullowney
Contributor

There is a problem in the amr-wind boundary conditions. I hope that it's the same problem.

@ashesh2512
Contributor

The nodal projections maxing out in the middle of a GPU simulation has always been an issue for the large blade-resolved runs. I have observed the issue for over a year now.

@psakievich
Contributor

This is a pretty high priority issue for AWAKEN. They have several runs planned for Summit before the ALCC allocation is up in the next few weeks. If anyone has ideas please jump on this.

@lawrenceccheung
Contributor Author

Quick update: the problem occurs somewhere between 257c13c (May 1) and the current commit a75d2ec.

Lawrence

@michaelasprague

@asalmgren Wanted to get this on your radar.

@asalmgren
Contributor

@lawrenceccheung -- could you do some additional git bisection to see which git commit breaks things?

@lawrenceccheung
Contributor Author

Yes, the latest bisection I did shows that the problem is happening somewhere between bbe0fdd and a75d2ec. I also tried the very latest commit (4b71037), and that also maxes out on the nodal projection.

However, the more frustrating thing I've found is that this problem seems to have a random element to it. On a commit that I thought was working (9eb5e61, from Phil's b/awaken-runs branch), I resubmitted the exact same job with the same executable, and something that was working before is now maxing out on nodal_projections. Is there some Summit hardware component to this issue? Commits that were never working seem to be consistently failing, though.

Lawrence

@asalmgren
Contributor

asalmgren commented Jun 28, 2023 via email

@lawrenceccheung
Contributor Author

@asalmgren -- they're run with the amrex library as a submodule. Everything's been built with spack-manager.

Lawrence

@asalmgren
Contributor

asalmgren commented Jun 28, 2023 via email

@lawrenceccheung
Contributor Author

Yes, everything is a submodule. My perspective is that something changed over the last two months -- either in the ExaWind code, or Summit hardware, or both -- which is causing the bottom solver to not converge. I'd like to eliminate the ExaWind code as a possible source of the problem. There are some commits which seem to always fail, so if we can get to a commit that at least works part of the time, we can go from there.

@asalmgren
Contributor

asalmgren commented Jun 28, 2023 via email

@lawrenceccheung
Contributor Author

These cases I'm running are without hypre, so with the amrex defaults. Another data point we just got is that if we run just a simple, single-level precursor:
https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/KingPlains_stable_precursor9.inp
then the latest commit works fine, with no issues in the MAC or nodal projection. However, the production cases, which use multiple levels and inflow/outflow BCs, see projection problems both with and without turbines.

I'll continue bisecting to see if I can isolate the problem down to a single commit, but obviously we will need to run multiple times to get a sense of whether things are truly working or not.
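As a rough guide to how many repeat runs per commit an intermittent failure calls for, here is a minimal sketch. It assumes each run fails independently with a fixed per-run probability, which is not established anywhere in this thread; the function name `runs_needed` and the example rates are hypothetical.

```python
import math


def runs_needed(fail_rate: float, confidence: float) -> int:
    """Smallest number of runs n such that a commit which fails with
    probability `fail_rate` per run shows at least one failure with
    probability >= `confidence`.

    Assumes independent runs with a constant failure rate (an assumption,
    not something measured on Summit).
    """
    if not 0.0 < fail_rate < 1.0:
        raise ValueError("fail_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(no failure in n runs) = (1 - fail_rate)**n, and we need this to be
    # at most 1 - confidence. Solve for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))


if __name__ == "__main__":
    # Hypothetical per-run failure rates, for illustration only.
    for rate in (0.25, 0.5, 0.75):
        print(f"fail_rate={rate}: {runs_needed(rate, 0.95)} runs for 95% confidence")
```

Under these assumptions, a commit that fails half the time needs about five clean runs before "works" can be trusted at the 95% level, so a single passing run says little during bisection.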

Lawrence

@PaulMullowney
Contributor

These symptoms are identical to what Marc and I debugged last week.

@asalmgren
Contributor

asalmgren commented Jun 28, 2023 via email

@PaulMullowney
Contributor

Specifically, you need 5ae3533

@lawrenceccheung
Contributor Author

Yes, I tried out 4b71037, which includes Paul's fixes.

Just to keep everyone up to date, I talked with @PaulMullowney and @psakievich earlier, and I'm going to get some debug information from the nodal projection operation to help diagnose things. We will also try this problem with a CPU-only build on Summit to see if that has different behavior.

Lawrence

@lawrenceccheung
Contributor Author

More data for those interested in this problem. Here's the verbose output from the nodal projection:

Nodal Projection:
 >> Before projection:
  * On lev 0 max(abs(rhs)) = 0.05129219173
  * On lev 1 max(abs(rhs)) = 0.08140351992
  * On lev 2 max(abs(rhs)) = 0.06800058622
  * On lev 3 max(abs(rhs)) = 0.1416014512

MLMG: # of AMR levels: 4
      # of MG levels on the coarsest AMR level: 7
MLMG: Initial rhs               = 0.1416014512
MLMG: Initial residual (resid0) = 0.1416014512
MLMG: Iteration   1 Fine resid/bnorm = 0.003253548735
MLMG: Iteration   2 Fine resid/bnorm = 0.0001444938807
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration   3 Fine resid/bnorm = 2.6269658e-05
MLMG: Iteration   4 Fine resid/bnorm = 5.256018808e-06
MLMG: Iteration   5 Fine resid/bnorm = 1.160243444e-06
MLMG: Iteration   6 Fine resid/bnorm = 2.735217813e-07
MLMG: Iteration   6 Crse resid/bnorm = 0.01787294653
MLMG: Iteration   7 Fine resid/bnorm = 6.391661217e-08
MLMG: Iteration   7 Crse resid/bnorm = 0.01787773617
MLMG: Iteration   8 Fine resid/bnorm = 1.489605667e-08
MLMG: Iteration   8 Crse resid/bnorm = 0.01788884284
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration   9 Fine resid/bnorm = 3.472879871e-09
MLMG: Iteration   9 Crse resid/bnorm = 0.01787472629
MLMG: Iteration  10 Fine resid/bnorm = 8.11588805e-10
MLMG: Iteration  10 Crse resid/bnorm = 0.01787888439
MLMG: Iteration  11 Fine resid/bnorm = 1.903855552e-10
MLMG: Iteration  11 Crse resid/bnorm = 0.01787938321
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  12 Fine resid/bnorm = 4.502377799e-11
MLMG: Iteration  12 Crse resid/bnorm = 0.01787897501
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  13 Fine resid/bnorm = 1.091071081e-11
MLMG: Iteration  13 Crse resid/bnorm = 0.01787297062
MLMG: Iteration  14 Fine resid/bnorm = 2.759064114e-12
MLMG: Iteration  14 Crse resid/bnorm = 0.01789029449
MLMG: Iteration  15 Fine resid/bnorm = 5.996495376e-13
MLMG: Iteration  15 Crse resid/bnorm = 0.01787960706
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  16 Fine resid/bnorm = 1.656790951e-13
MLMG: Iteration  16 Crse resid/bnorm = 0.01787786377
MLMG: Iteration  17 Fine resid/bnorm = 3.194018445e-13
MLMG: Iteration  17 Crse resid/bnorm = 0.01787936302
MLMG: Iteration  18 Fine resid/bnorm = 2.408747894e-12
MLMG: Iteration  18 Crse resid/bnorm = 0.01787785641
MLMG: Iteration  19 Fine resid/bnorm = 6.594079131e-14
MLMG: Iteration  19 Crse resid/bnorm = 0.01787782496
MLMG: Iteration  20 Fine resid/bnorm = 1.525122225e-14
MLMG: Iteration  20 Crse resid/bnorm = 0.01787782455
MLMG: Iteration  21 Fine resid/bnorm = 2.666321369e-12
MLMG: Iteration  21 Crse resid/bnorm = 0.01787936213
MLMG: Iteration  22 Fine resid/bnorm = 3.029578972e-13
MLMG: Iteration  22 Crse resid/bnorm = 0.01787939374
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  23 Fine resid/bnorm = 2.882786317e-13
MLMG: Iteration  23 Crse resid/bnorm = 0.01787785749
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  24 Fine resid/bnorm = 2.615021368e-13
MLMG: Iteration  24 Crse resid/bnorm = 0.0178782228
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  25 Fine resid/bnorm = 2.342811696e-13
MLMG: Iteration  25 Crse resid/bnorm = 0.017872955
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  26 Fine resid/bnorm = 2.099695633e-13
MLMG: Iteration  26 Crse resid/bnorm = 0.01787854881
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  27 Fine resid/bnorm = 1.886630268e-13
MLMG: Iteration  27 Crse resid/bnorm = 0.01787895266
MLMG: Iteration  28 Fine resid/bnorm = 2.005579417e-12
MLMG: Iteration  28 Crse resid/bnorm = 0.01788886749
MLMG: Iteration  29 Fine resid/bnorm = 1.385191902e-13
MLMG: Iteration  29 Crse resid/bnorm = 0.01787805164
MLMG: Iteration  30 Fine resid/bnorm = 3.14270313e-12
MLMG: Iteration  30 Crse resid/bnorm = 0.0178834543
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  31 Fine resid/bnorm = 3.28548558e-12
MLMG: Iteration  31 Crse resid/bnorm = 0.01787303399
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  32 Fine resid/bnorm = 3.230135942e-12
MLMG: Iteration  32 Crse resid/bnorm = 0.01788467295
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  33 Fine resid/bnorm = 3.156100853e-12
MLMG: Iteration  33 Crse resid/bnorm = 0.01788489252
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  34 Fine resid/bnorm = 3.081448632e-12
MLMG: Iteration  34 Crse resid/bnorm = 0.01787796392
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  35 Fine resid/bnorm = 3.007326257e-12
MLMG: Iteration  35 Crse resid/bnorm = 0.01787782794
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  36 Fine resid/bnorm = 2.935984035e-12
MLMG: Iteration  36 Crse resid/bnorm = 0.01787782485
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  37 Fine resid/bnorm = 2.865849277e-12
MLMG: Iteration  37 Crse resid/bnorm = 0.01787294671
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  38 Fine resid/bnorm = 2.798755001e-12
MLMG: Iteration  38 Crse resid/bnorm = 0.01787811886
MLMG: Iteration  39 Fine resid/bnorm = 1.464744421e-12
MLMG: Iteration  39 Crse resid/bnorm = 0.01787865283
MLMG: Iteration  40 Fine resid/bnorm = 1.891242674e-12
MLMG: Iteration  40 Crse resid/bnorm = 0.01787937996
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  41 Fine resid/bnorm = 1.973565393e-12
MLMG: Iteration  41 Crse resid/bnorm = 0.01787298012
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  42 Fine resid/bnorm = 1.941662152e-12
MLMG: Iteration  42 Crse resid/bnorm = 0.01787300104
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  43 Fine resid/bnorm = 1.898929252e-12
MLMG: Iteration  43 Crse resid/bnorm = 0.017873055
MLMG: Iteration  44 Fine resid/bnorm = 1.232316368e-12
MLMG: Iteration  44 Crse resid/bnorm = 0.01787852775
MLMG: Iteration  45 Fine resid/bnorm = 1.714398555e-12
MLMG: Iteration  45 Crse resid/bnorm = 0.01787348297
MLMG: Iteration  46 Fine resid/bnorm = 9.002805586e-14
MLMG: Iteration  46 Crse resid/bnorm = 0.01787286996
MLMG: Iteration  47 Fine resid/bnorm = 2.82221786e-14
MLMG: Iteration  47 Crse resid/bnorm = 0.0178777298
MLMG: Iteration  48 Fine resid/bnorm = 3.003004806e-12
MLMG: Iteration  48 Crse resid/bnorm = 0.01787936012
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  49 Fine resid/bnorm = 3.137323521e-12
MLMG: Iteration  49 Crse resid/bnorm = 0.01788476414
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  50 Fine resid/bnorm = 3.091100689e-12
MLMG: Iteration  50 Crse resid/bnorm = 0.01788714677
MLMG: Iteration  51 Fine resid/bnorm = 1.551855651e-12
MLMG: Iteration  51 Crse resid/bnorm = 0.01787955199
MLMG: Iteration  52 Fine resid/bnorm = 2.030891852e-12
MLMG: Iteration  52 Crse resid/bnorm = 0.01787786083
MLMG: Iteration  53 Fine resid/bnorm = 1.185811003e-13
MLMG: Iteration  53 Crse resid/bnorm = 0.01787782542
MLMG: Iteration  54 Fine resid/bnorm = 2.747905981e-12
MLMG: Iteration  54 Crse resid/bnorm = 0.01787936216
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  55 Fine resid/bnorm = 2.709708154e-12
MLMG: Iteration  55 Crse resid/bnorm = 0.01787785774
MLMG: Iteration  56 Fine resid/bnorm = 1.351785648e-12
MLMG: Iteration  56 Crse resid/bnorm = 0.01787864724
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  57 Fine resid/bnorm = 1.376379021e-12
MLMG: Iteration  57 Crse resid/bnorm = 0.01787784289
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  58 Fine resid/bnorm = 1.351846136e-12
MLMG: Iteration  58 Crse resid/bnorm = 0.01787782532
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  59 Fine resid/bnorm = 1.322915232e-12
MLMG: Iteration  59 Crse resid/bnorm = 0.01787782483
MLMG: Iteration  60 Fine resid/bnorm = 3.455525941e-12
MLMG: Iteration  60 Crse resid/bnorm = 0.01788628357
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  61 Fine resid/bnorm = 3.537860911e-12
MLMG: Iteration  61 Crse resid/bnorm = 0.01788603893
MLMG: Iteration  62 Fine resid/bnorm = 2.154511215e-12
MLMG: Iteration  62 Crse resid/bnorm = 0.01787798669
MLMG: Iteration  63 Fine resid/bnorm = 1.304578162e-13
MLMG: Iteration  63 Crse resid/bnorm = 0.01787782822
MLMG: Iteration  64 Fine resid/bnorm = 1.465418978e-14
MLMG: Iteration  64 Crse resid/bnorm = 0.01787782462
MLMG: Iteration  65 Fine resid/bnorm = 1.124168693e-14
MLMG: Iteration  65 Crse resid/bnorm = 0.01787294584
MLMG: Iteration  66 Fine resid/bnorm = 7.215115409e-15
MLMG: Iteration  66 Crse resid/bnorm = 0.01787773615
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  67 Fine resid/bnorm = 7.196787144e-15
MLMG: Iteration  67 Crse resid/bnorm = 0.01788007489
MLMG: Iteration  68 Fine resid/bnorm = 2.608413622e-14
MLMG: Iteration  68 Crse resid/bnorm = 0.01787787559
MLMG: Iteration  69 Fine resid/bnorm = 1.268920833e-13
MLMG: Iteration  69 Crse resid/bnorm = 0.01787782574
MLMG: Iteration  70 Fine resid/bnorm = 5.934529569e-14
MLMG: Iteration  70 Crse resid/bnorm = 0.01787294656
MLMG: Iteration  71 Fine resid/bnorm = 2.371935932e-14
MLMG: Iteration  71 Crse resid/bnorm = 0.01787773583
MLMG: Iteration  72 Fine resid/bnorm = 1.981232829e-14
MLMG: Iteration  72 Crse resid/bnorm = 0.01787782253
MLMG: Iteration  73 Fine resid/bnorm = 1.784378647e-14
MLMG: Iteration  73 Crse resid/bnorm = 0.01787782448
MLMG: Iteration  74 Fine resid/bnorm = 1.577417599e-14
MLMG: Iteration  74 Crse resid/bnorm = 0.01787783286
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  75 Fine resid/bnorm = 1.472195172e-14
MLMG: Iteration  75 Crse resid/bnorm = 0.01787311883
MLMG: Iteration  76 Fine resid/bnorm = 2.57176628e-12
MLMG: Iteration  76 Crse resid/bnorm = 0.01787880959
MLMG: Iteration  77 Fine resid/bnorm = 1.205481874e-12
MLMG: Iteration  77 Crse resid/bnorm = 0.01787316656
MLMG: Iteration  78 Fine resid/bnorm = 6.95883745e-13
MLMG: Iteration  78 Crse resid/bnorm = 0.01788467394
MLMG: Iteration  79 Fine resid/bnorm = 1.121402704e-13
MLMG: Iteration  79 Crse resid/bnorm = 0.01787795867
MLMG: Iteration  80 Fine resid/bnorm = 2.807411685e-14
MLMG: Iteration  80 Crse resid/bnorm = 0.01787782755
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  81 Fine resid/bnorm = 2.310720483e-14
MLMG: Iteration  81 Crse resid/bnorm = 0.01788629447
MLMG: Iteration  82 Fine resid/bnorm = 3.203596997e-13
MLMG: Iteration  82 Crse resid/bnorm = 0.01787799248
MLMG: Iteration  83 Fine resid/bnorm = 1.347752089e-12
MLMG: Iteration  83 Crse resid/bnorm = 0.01787894036
MLMG: Iteration  84 Fine resid/bnorm = 1.875598279e-12
MLMG: Iteration  84 Crse resid/bnorm = 0.01787938449
MLMG: Iteration  85 Fine resid/bnorm = 3.467252968e-12
MLMG: Iteration  85 Crse resid/bnorm = 0.01787785312
MLMG: Iteration  86 Fine resid/bnorm = 1.61910842e-13
MLMG: Iteration  86 Crse resid/bnorm = 0.01787294289
MLMG: Iteration  87 Fine resid/bnorm = 2.41671241e-12
MLMG: Iteration  87 Crse resid/bnorm = 0.01788620696
MLMG: Iteration  88 Fine resid/bnorm = 2.710866232e-13
MLMG: Iteration  88 Crse resid/bnorm = 0.01787952809
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  89 Fine resid/bnorm = 2.688125784e-13
MLMG: Iteration  89 Crse resid/bnorm = 0.01787271123
MLMG: Iteration  90 Fine resid/bnorm = 1.394295738e-12
MLMG: Iteration  90 Crse resid/bnorm = 0.0178788482
MLMG: Iteration  91 Fine resid/bnorm = 1.193897609e-12
MLMG: Iteration  91 Crse resid/bnorm = 0.01788477827
MLMG: Iteration  92 Fine resid/bnorm = 1.825113814e-13
MLMG: Iteration  92 Crse resid/bnorm = 0.01787794673
MLMG: Iteration  93 Fine resid/bnorm = 3.025474972e-14
MLMG: Iteration  93 Crse resid/bnorm = 0.01787782761
MLMG: Iteration  94 Fine resid/bnorm = 2.962240447e-12
MLMG: Iteration  94 Crse resid/bnorm = 0.01787936209
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  95 Fine resid/bnorm = 3.100728245e-12
MLMG: Iteration  95 Crse resid/bnorm = 0.01787785774
MLMG: Iteration  96 Fine resid/bnorm = 1.396405929e-12
MLMG: Iteration  96 Crse resid/bnorm = 0.01787893771
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  97 Fine resid/bnorm = 1.306403523e-12
MLMG: Iteration  97 Crse resid/bnorm = 0.01787296948
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  98 Fine resid/bnorm = 1.309746445e-12
MLMG: Iteration  98 Crse resid/bnorm = 0.01787285825
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration  99 Fine resid/bnorm = 1.289355844e-12
MLMG: Iteration  99 Crse resid/bnorm = 0.01787773427
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 100 Fine resid/bnorm = 1.263828351e-12
MLMG: Iteration 100 Crse resid/bnorm = 0.01787782264
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 101 Fine resid/bnorm = 1.23434157e-12
MLMG: Iteration 101 Crse resid/bnorm = 0.01787782463
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 102 Fine resid/bnorm = 1.207193916e-12
MLMG: Iteration 102 Crse resid/bnorm = 0.01787782464
MLMG: Iteration 103 Fine resid/bnorm = 2.738137542e-13
MLMG: Iteration 103 Crse resid/bnorm = 0.01787782454
MLMG: Iteration 104 Fine resid/bnorm = 5.367004168e-14
MLMG: Iteration 104 Crse resid/bnorm = 0.0178775522
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 105 Fine resid/bnorm = 5.294853971e-14
MLMG: Iteration 105 Crse resid/bnorm = 0.01787449382
MLMG: Iteration 106 Fine resid/bnorm = 5.030654182e-14
MLMG: Iteration 106 Crse resid/bnorm = 0.0178777665
MLMG: Iteration 107 Fine resid/bnorm = 1.484204732e-13
MLMG: Iteration 107 Crse resid/bnorm = 0.01787822016
MLMG: Iteration 108 Fine resid/bnorm = 7.001659934e-12
MLMG: Iteration 108 Crse resid/bnorm = 0.01788584473
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 109 Fine resid/bnorm = 7.21933962e-12
MLMG: Iteration 109 Crse resid/bnorm = 0.01787953222
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 110 Fine resid/bnorm = 7.084311127e-12
MLMG: Iteration 110 Crse resid/bnorm = 0.01787786253
MLMG: Iteration 111 Fine resid/bnorm = 2.959174698e-12
MLMG: Iteration 111 Crse resid/bnorm = 0.01787936305
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 112 Fine resid/bnorm = 3.067200249e-12
MLMG: Iteration 112 Crse resid/bnorm = 0.01787785776
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 113 Fine resid/bnorm = 3.029565955e-12
MLMG: Iteration 113 Crse resid/bnorm = 0.01787782597
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 114 Fine resid/bnorm = 2.968606241e-12
MLMG: Iteration 114 Crse resid/bnorm = 0.01787782462
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 115 Fine resid/bnorm = 2.900174337e-12
MLMG: Iteration 115 Crse resid/bnorm = 0.01787780239
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 116 Fine resid/bnorm = 2.829401775e-12
MLMG: Iteration 116 Crse resid/bnorm = 0.0178775525
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 117 Fine resid/bnorm = 2.76173248e-12
MLMG: Iteration 117 Crse resid/bnorm = 0.01787429843
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 118 Fine resid/bnorm = 2.694767603e-12
MLMG: Iteration 118 Crse resid/bnorm = 0.01787776272
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 119 Fine resid/bnorm = 2.631992481e-12
MLMG: Iteration 119 Crse resid/bnorm = 0.0178778233
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 120 Fine resid/bnorm = 2.571245623e-12
MLMG: Iteration 120 Crse resid/bnorm = 0.01788007727
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 121 Fine resid/bnorm = 2.51664864e-12
MLMG: Iteration 121 Crse resid/bnorm = 0.01787787558
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 122 Fine resid/bnorm = 2.46879799e-12
MLMG: Iteration 122 Crse resid/bnorm = 0.01787685507
MLMG: Iteration 123 Fine resid/bnorm = 1.400786336e-12
MLMG: Iteration 123 Crse resid/bnorm = 0.01787892081
MLMG: Iteration 124 Fine resid/bnorm = 1.63741027e-12
MLMG: Iteration 124 Crse resid/bnorm = 0.01787296994
MLMG: Iteration 125 Fine resid/bnorm = 7.471117727e-14
MLMG: Iteration 125 Crse resid/bnorm = 0.01787773658
MLMG: Iteration 126 Fine resid/bnorm = 3.228607661e-14
MLMG: Iteration 126 Crse resid/bnorm = 0.01787782255
MLMG: Iteration 127 Fine resid/bnorm = 3.698972775e-12
MLMG: Iteration 127 Crse resid/bnorm = 0.01787936208
MLMG: Iteration 128 Fine resid/bnorm = 2.936675436e-12
MLMG: Iteration 128 Crse resid/bnorm = 0.01787433599
MLMG: Iteration 129 Fine resid/bnorm = 1.359480073e-12
MLMG: Iteration 129 Crse resid/bnorm = 0.017879301
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 130 Fine resid/bnorm = 1.317434555e-12
MLMG: Iteration 130 Crse resid/bnorm = 0.01787297835
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 131 Fine resid/bnorm = 1.266002093e-12
MLMG: Iteration 131 Crse resid/bnorm = 0.01788467087
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 132 Fine resid/bnorm = 1.218271653e-12
MLMG: Iteration 132 Crse resid/bnorm = 0.01787795302
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 133 Fine resid/bnorm = 1.176062472e-12
MLMG: Iteration 133 Crse resid/bnorm = 0.01787782788
MLMG: Iteration 134 Fine resid/bnorm = 1.484070739e-12
MLMG: Iteration 134 Crse resid/bnorm = 0.01787893702
MLMG: Iteration 135 Fine resid/bnorm = 3.183126765e-14
MLMG: Iteration 135 Crse resid/bnorm = 0.01788264927
MLMG: Iteration 136 Fine resid/bnorm = 1.688581638e-12
MLMG: Iteration 136 Crse resid/bnorm = 0.01787792994
MLMG: Iteration 137 Fine resid/bnorm = 2.202664114e-13
MLMG: Iteration 137 Crse resid/bnorm = 0.01788884724
MLMG: Iteration 138 Fine resid/bnorm = 8.118799378e-14
MLMG: Iteration 138 Crse resid/bnorm = 0.01787453072
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 139 Fine resid/bnorm = 7.805496109e-14
MLMG: Iteration 139 Crse resid/bnorm = 0.01787888564
MLMG: Iteration 140 Fine resid/bnorm = 2.020067695e-13
MLMG: Iteration 140 Crse resid/bnorm = 0.01787784561
MLMG: Iteration 141 Fine resid/bnorm = 1.496696186e-13
MLMG: Iteration 141 Crse resid/bnorm = 0.01787782505
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 142 Fine resid/bnorm = 1.418957543e-13
MLMG: Iteration 142 Crse resid/bnorm = 0.0178786484
MLMG: Iteration 143 Fine resid/bnorm = 1.421040649e-13
MLMG: Iteration 143 Crse resid/bnorm = 0.01787784119
MLMG: Iteration 144 Fine resid/bnorm = 3.490094102e-13
MLMG: Iteration 144 Crse resid/bnorm = 0.01788475439
MLMG: Iteration 145 Fine resid/bnorm = 2.183553907e-13
MLMG: Iteration 145 Crse resid/bnorm = 0.01787796058
MLMG: Iteration 146 Fine resid/bnorm = 3.989054372e-12
MLMG: Iteration 146 Crse resid/bnorm = 0.01788856701
MLMG: Iteration 147 Fine resid/bnorm = 1.226244209e-12
MLMG: Iteration 147 Crse resid/bnorm = 0.01793223411
MLMG: Iteration 148 Fine resid/bnorm = 3.042942957e-12
MLMG: Iteration 148 Crse resid/bnorm = 0.01787886469
MLMG: Iteration 149 Fine resid/bnorm = 9.026924243e-14
MLMG: Iteration 149 Crse resid/bnorm = 0.01787296888
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 150 Fine resid/bnorm = 6.345982373e-14
MLMG: Iteration 150 Crse resid/bnorm = 0.01787435767
MLMG: Iteration 151 Fine resid/bnorm = 4.328724693e-14
MLMG: Iteration 151 Crse resid/bnorm = 0.01788469694
MLMG: Iteration 152 Fine resid/bnorm = 3.456356695e-14
MLMG: Iteration 152 Crse resid/bnorm = 0.0178779592
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 153 Fine resid/bnorm = 3.195836915e-14
MLMG: Iteration 153 Crse resid/bnorm = 0.01787936387
MLMG: Iteration 154 Fine resid/bnorm = 3.434313008e-13
MLMG: Iteration 154 Crse resid/bnorm = 0.01787785652
MLMG: Iteration 155 Fine resid/bnorm = 1.503857226e-12
MLMG: Iteration 155 Crse resid/bnorm = 0.01787893309
MLMG: Iteration 156 Fine resid/bnorm = 1.474168308e-12
MLMG: Iteration 156 Crse resid/bnorm = 0.01787784709
MLMG: Iteration 157 Fine resid/bnorm = 1.48779956e-13
MLMG: Iteration 157 Crse resid/bnorm = 0.01787294707
MLMG: Iteration 158 Fine resid/bnorm = 2.233000025e-14
MLMG: Iteration 158 Crse resid/bnorm = 0.01787773064
MLMG: Iteration 159 Fine resid/bnorm = 1.80161583e-14
MLMG: Iteration 159 Crse resid/bnorm = 0.01787294466
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 160 Fine resid/bnorm = 1.81144514e-14
MLMG: Iteration 160 Crse resid/bnorm = 0.01787927228
MLMG: Iteration 161 Fine resid/bnorm = 3.509517278e-13
MLMG: Iteration 161 Crse resid/bnorm = 0.01788478787
MLMG: Iteration 162 Fine resid/bnorm = 2.954479982e-13
MLMG: Iteration 162 Crse resid/bnorm = 0.0178779613
MLMG: Iteration 163 Fine resid/bnorm = 6.336857099e-12
MLMG: Iteration 163 Crse resid/bnorm = 0.01789193808
MLMG: Iteration 164 Fine resid/bnorm = 5.738945636e-12
MLMG: Iteration 164 Crse resid/bnorm = 0.01787810343
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 165 Fine resid/bnorm = 5.834140109e-12
MLMG: Iteration 165 Crse resid/bnorm = 0.01787936827
MLMG: Iteration 166 Fine resid/bnorm = 1.713827746e-12
MLMG: Iteration 166 Crse resid/bnorm = 0.01787312027
MLMG: Iteration 167 Fine resid/bnorm = 1.224733587e-12
MLMG: Iteration 167 Crse resid/bnorm = 0.01787885203
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 169 Fine resid/bnorm = 1.178231285e-12
MLMG: Iteration 169 Crse resid/bnorm = 0.01787297988
MLMG: Iteration 170 Fine resid/bnorm = 3.374586025e-12
MLMG: Iteration 170 Crse resid/bnorm = 0.01787927495
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 171 Fine resid/bnorm = 3.536430636e-12
MLMG: Iteration 171 Crse resid/bnorm = 0.01787297775
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 172 Fine resid/bnorm = 3.472631046e-12
MLMG: Iteration 172 Crse resid/bnorm = 0.01787773741
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 173 Fine resid/bnorm = 3.396518689e-12
MLMG: Iteration 173 Crse resid/bnorm = 0.01788007562
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 174 Fine resid/bnorm = 3.319256294e-12
MLMG: Iteration 174 Crse resid/bnorm = 0.01787787563
MLMG: Iteration 175 Fine resid/bnorm = 1.43151274e-12
MLMG: Iteration 175 Crse resid/bnorm = 0.01788079541
MLMG: Iteration 176 Fine resid/bnorm = 2.043819596e-12
MLMG: Iteration 176 Crse resid/bnorm = 0.01787789081
MLMG: Iteration 177 Fine resid/bnorm = 4.745514995e-12
MLMG: Iteration 177 Crse resid/bnorm = 0.01793850996
MLMG: Iteration 178 Fine resid/bnorm = 5.700713353e-12
MLMG: Iteration 178 Crse resid/bnorm = 0.01787410392
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 179 Fine resid/bnorm = 5.871527808e-12
MLMG: Iteration 179 Crse resid/bnorm = 0.01787887286
MLMG: Iteration 180 Fine resid/bnorm = 6.283989768e-13
MLMG: Iteration 180 Crse resid/bnorm = 0.0178847788
MLMG: Iteration 181 Fine resid/bnorm = 1.615490622e-13
MLMG: Iteration 181 Crse resid/bnorm = 0.01787308401
MLMG: Iteration 182 Fine resid/bnorm = 2.862311108e-12
MLMG: Iteration 182 Crse resid/bnorm = 0.01788620955
MLMG: Iteration 183 Fine resid/bnorm = 1.153170421e-13
MLMG: Iteration 183 Crse resid/bnorm = 0.01787952824
MLMG: Iteration 184 Fine resid/bnorm = 3.304176391e-12
MLMG: Iteration 184 Crse resid/bnorm = 0.01787786042
MLMG: Iteration 185 Fine resid/bnorm = 3.238101703e-12
MLMG: Iteration 185 Crse resid/bnorm = 0.01787936302
MLMG: Iteration 186 Fine resid/bnorm = 2.538988068e-12
MLMG: Iteration 186 Crse resid/bnorm = 0.01787785645
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 187 Fine resid/bnorm = 2.84285386e-12
MLMG: Iteration 187 Crse resid/bnorm = 0.01788007925
MLMG: Iteration 188 Fine resid/bnorm = 2.889767328e-13
MLMG: Iteration 188 Crse resid/bnorm = 0.01788480878
MLMG: Iteration 189 Fine resid/bnorm = 2.383486089e-13
MLMG: Iteration 189 Crse resid/bnorm = 0.01788204861
MLMG: Iteration 190 Fine resid/bnorm = 2.152865787e-13
MLMG: Iteration 190 Crse resid/bnorm = 0.01787791889
MLMG: Iteration 191 Fine resid/bnorm = 6.015901324e-14
MLMG: Iteration 191 Crse resid/bnorm = 0.01787782691
MLMG: Iteration 192 Fine resid/bnorm = 2.895272507e-14
MLMG: Iteration 192 Crse resid/bnorm = 0.01787294658
MLMG: Iteration 193 Fine resid/bnorm = 2.123705182e-14
MLMG: Iteration 193 Crse resid/bnorm = 0.0178777196
MLMG: Iteration 194 Fine resid/bnorm = 1.527093829e-14
MLMG: Iteration 194 Crse resid/bnorm = 0.01788884259
MLMG: Iteration 195 Fine resid/bnorm = 3.0959338e-12
MLMG: Iteration 195 Crse resid/bnorm = 0.01788367561
MLMG: Iteration 196 Fine resid/bnorm = 1.84887117e-13
MLMG: Iteration 196 Crse resid/bnorm = 0.01787948996
MLMG: Iteration 197 Fine resid/bnorm = 3.430498432e-12
MLMG: Iteration 197 Crse resid/bnorm = 0.0178729816
MLMG: Iteration 198 Fine resid/bnorm = 1.228458914e-13
MLMG: Iteration 198 Crse resid/bnorm = 0.01787773691
MLMG: Iteration 199 Fine resid/bnorm = 4.128692974e-14
MLMG: Iteration 199 Crse resid/bnorm = 0.01787782261
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 200 Fine resid/bnorm = 3.439665053e-14
MLMG: Iteration 200 Crse resid/bnorm = 0.01789094759
MLMG: Timers: Solve = 31.88271638 Iter = 31.85704381 Bottom = 17.38664955
 >> After projection:
  * On lev 0 max(abs(rhs)) = 0.05281370431
  * On lev 1 max(abs(rhs)) = 0.08601693711
  * On lev 2 max(abs(rhs)) = 0.06112537326
  * On lev 3 max(abs(rhs)) = 0.06269878415

  Nodal_projection             200          0.1416014512        0.002533384142

This is run with 9eb5e61, which is based off of the b/awaken-runs branch that Phil put together.
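For anyone who wants to compare logs like the one above across runs, here is a small sketch of a parser. It is not part of amr-wind or AMReX; the function name `summarize_mlmg` is hypothetical, and the regular expression only assumes the line format shown in this output. The sample lines are copied from iterations 6 to 8 of the log.

```python
import re

# Matches lines such as
# "MLMG: Iteration   6 Crse resid/bnorm = 0.01787294653"
ITER_RE = re.compile(
    r"MLMG: Iteration\s+(\d+)\s+(Fine|Crse) resid/bnorm = ([0-9.eE+-]+)"
)


def summarize_mlmg(log_text):
    """Count bottom-solve failures and collect Fine/Crse residual ranges."""
    resid = {"Fine": {}, "Crse": {}}  # kind -> {iteration: resid/bnorm}
    bottom_failures = 0
    for line in log_text.splitlines():
        if "Bottom solve failed" in line:
            bottom_failures += 1
            continue
        match = ITER_RE.search(line)
        if match:
            iteration, kind, value = match.groups()
            resid[kind][int(iteration)] = float(value)

    def span(per_iteration):
        values = list(per_iteration.values())
        return (min(values), max(values)) if values else None

    iterations = [i for per_kind in resid.values() for i in per_kind]
    return {
        "bottom_failures": bottom_failures,
        "last_iteration": max(iterations) if iterations else 0,
        "fine_range": span(resid["Fine"]),
        "crse_range": span(resid["Crse"]),
    }


# Lines copied verbatim from the log above (iterations 6 through 8).
SAMPLE = """\
MLMG: Iteration   6 Fine resid/bnorm = 2.735217813e-07
MLMG: Iteration   6 Crse resid/bnorm = 0.01787294653
MLMG: Iteration   7 Fine resid/bnorm = 6.391661217e-08
MLMG: Iteration   7 Crse resid/bnorm = 0.01787773617
MLMG: Iteration   8 Fine resid/bnorm = 1.489605667e-08
MLMG: Iteration   8 Crse resid/bnorm = 0.01788884284
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
"""

summary = summarize_mlmg(SAMPLE)

if __name__ == "__main__":
    print(summary)
```

Run on the full log, a summary like this makes the pattern easy to see: the Fine resid/bnorm keeps dropping while the Crse resid/bnorm stays near 0.0179 through all 200 iterations.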

Now, what's super interesting is that I've got a production run (same ABL setup, but with OpenFAST turbines) going right now, running simultaneously with this debug case on Summit, and using the same exe, but that one so far has no issues with the bottom solver (knock on wood). I'm not sure how to explain any of this behavior.

Lawrence

@alhs6577

I recently re-ran the simple case Lawrence mentioned previously (https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/KingPlains_stable_precursor9.inp) on Summit (with GPUs), but with two added levels of refinement. With the latest amr-wind build, neither the nodal nor the MAC projections maxed out.

@psakievich
Contributor

@alhs6577 would you comment on your build process? It would be good to see if we can reproduce this. @lawrenceccheung has seen builds that work for a set of runs and then suddenly stop converging, so the issue appears to be intermittent.

@alhs6577

@psakievich I used one of the latest Summit builds from @lawrenceccheung so he would be the person to ask.

@lawrenceccheung
Contributor Author

Oh, the build that @alhs6577 used is from commit 4b71037, compiled using spack-manager.

@lawrenceccheung
Contributor Author

lawrenceccheung commented Jun 29, 2023

In case it helps, I put together a log of different runs I was doing to try and bisect the case. I will keep adding to this list with more data points.

| Date | Commit | Job type | Result |
| --- | --- | --- | --- |
| June 18 | a75d2ec | ABL only, bndry I/O, no turbines | failed |
| June 19 | a75d2ec | ABL only, bndry I/O, no turbines | failed |
| June 19 | f92aae1 | ABL only, bndry I/O, no turbines | works |
| June 19 | 185c360 | ABL only, bndry I/O, no turbines | works |
| June 21 | 257c13c | ABL only, bndry I/O, no turbines | works |
| June 21 | 9eb5e61 | ABL only, bndry I/O, no turbines | works |
| June 22 | 9eb5e61 | ABL only, bndry I/O, no turbines | works |
| June 24 | 9eb5e61 | ABL only, bndry I/O, no turbines | failed |
| June 24 | 9eb5e61 | Production run w/turbines | works |
| June 27 | f92aae1 | ABL Production run, no turbines | works |
| June 26 | 4b71037 | ABL only, bndry I/O, no turbines | failed |
| June 27 | bbe0fdd | ABL only, bndry I/O, no turbines | works |
| June 27 | 9eb5e61 | Production run w/turbines | failed |
| June 28 | 9eb5e61 | ABL only, bndry I/O, no turbines, with nodal verbose output | failed |
| June 28 | 9eb5e61 | Production run w/turbines | works |
| June 29 | 4b71037 | Periodic ABL, 2 levels of refinement, no turbines | works |
| June 30 | 9537522 | Production run w/turbines and radar | failed |
| June 30 | 9cb0abaa | Production run w/turbines and radar (debug) | works |
| July 6 | 9cb0abaa | Production run w/turbines and radar | works |
| July 7 | e97f8472 | Debug run, CPU only | failed (see note below) |
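
One way to read the run log above is to tally results per commit and flag commits that have both working and failing runs, since those cannot be classified as good or bad for a bisection. The following is a rough sketch under that framing: the (commit, result) pairs are copied from the table, the variable names are hypothetical, and the tally deliberately ignores the job type, which differs between some rows.

```python
from collections import Counter, defaultdict

# (commit, result) pairs copied from the run log, in table order.
runs = [
    ("a75d2ec", "failed"), ("a75d2ec", "failed"),
    ("f92aae1", "works"), ("185c360", "works"),
    ("257c13c", "works"), ("9eb5e61", "works"),
    ("9eb5e61", "works"), ("9eb5e61", "failed"),
    ("9eb5e61", "works"), ("f92aae1", "works"),
    ("4b71037", "failed"), ("bbe0fdd", "works"),
    ("9eb5e61", "failed"), ("9eb5e61", "failed"),
    ("9eb5e61", "works"), ("4b71037", "works"),
    ("9537522", "failed"), ("9cb0abaa", "works"),
    ("9cb0abaa", "works"), ("e97f8472", "failed"),
]

# Count "works" and "failed" separately for every commit.
tally = defaultdict(Counter)
for commit, result in runs:
    tally[commit][result] += 1

# Commits with both outcomes give no clean signal for bisecting.
mixed = sorted(c for c, t in tally.items() if t["works"] and t["failed"])

for commit, counts in tally.items():
    print(f"{commit}: works={counts['works']} failed={counts['failed']}")
print("commits with mixed results:", mixed)
```

With the rows above, 9eb5e61 and 4b71037 both show mixed results, although the two 4b71037 runs used different setups and the e97f8472 failure occurred in the MAC projection rather than the nodal projection, so the tally is only a starting point.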

@lawrenceccheung
Contributor Author

@asalmgren @PaulMullowney @psakievich it occurred to me that we might have a way to determine whether this is a software or a hardware issue. I have an old executable from commit f92aae1, compiled on April 7, which hasn't previously shown any issues with the nodal projections. We can run a debug test case with this exe many times (say 10 times), and if there aren't any issues with the bottom solver on Summit, then something must have happened to the code itself to cause these changes.

Lawrence
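
How much ten clean runs would actually show depends on how often a bad configuration fails. As a rough sketch, assuming each run fails independently with the same probability (an assumption that may not hold on a shared machine), the chance of seeing no failures in `n_runs` runs is `(1 - p_fail) ** n_runs`. The failure probabilities below are hypothetical values chosen for illustration, not measurements.

```python
def prob_all_pass(p_fail, n_runs):
    """Probability that n_runs independent runs all pass,
    given a per-run failure probability p_fail."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    return (1.0 - p_fail) ** n_runs


# Hypothetical per-run failure probabilities, for illustration only.
for p_fail in (0.1, 0.3, 0.5):
    chance = prob_all_pass(p_fail, 10)
    print(f"p_fail={p_fail}: P(10 clean runs) = {chance:.4f}")
```

If the underlying failure rate were low, ten clean runs would still be fairly likely even with the problem present, so more repetitions would be needed before drawing a conclusion.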

@asalmgren
Contributor

asalmgren commented Jun 30, 2023 via email

@lawrenceccheung
Contributor Author

Yes, I'm including system software when I say hardware. Although, I haven't changed the way I compile amr-wind in the last 6 months or so -- the builds should all be using the gcc 10.2.0 toolset (see https://github.com/sandialabs/spack-manager/blob/main/configs/summit/compilers.yaml) -- so hopefully that's not a factor.

@lawrenceccheung
Contributor Author

Another interesting data point. I just tried a run using 9537522, which is based off the most recent branch 4b71037 with additional radar scan functionality. That failed, but the bottom solver failed differently than before.

Normally when the nodal projections max out, it does so consistently after the first few iterations, like so:

$ grep Nodal_projection wturbs.3018138
  Nodal_projection               8          0.3642743485       1.249139105e-07
  Nodal_projection               7          0.4217680262       1.907383986e-07
  Nodal_projection               7          0.4151295041        1.40615603e-07
  Nodal_projection               7          0.4089079335       1.485313398e-07
  Nodal_projection             200          0.4104877735        0.004535525282
  Nodal_projection             200          0.4062099661        0.003023266983
  Nodal_projection             200          0.3979261111        0.001511616422
  Nodal_projection             200          0.3886770588        0.008969860162
  Nodal_projection             200          0.3817778914        0.004502683967
...

However, with this radar functionality built in, it will fail intermittently:

  Nodal_projection               8          0.3642743489       1.249344916e-07
  Nodal_projection               7          0.4217680263       1.908786274e-07
  Nodal_projection               7          0.4151295047       1.407023882e-07
  Nodal_projection               7          0.4089079343       1.486727061e-07
  Nodal_projection               7          0.4104877746       1.586428909e-07
  Nodal_projection             200          0.4062096545        0.001289872447
  Nodal_projection             200           0.397925796        0.002579744251
  Nodal_projection               7          0.3886768938       1.959483487e-07
  Nodal_projection             200          0.3817772175        0.003831758237
  Nodal_projection               7          0.3772330206       2.150512131e-07
  Nodal_projection             200          0.3746421235        0.005678088441
  Nodal_projection               7            0.37307631       2.163975566e-07
  Nodal_projection             200          0.3712630136        0.004975038183
  Nodal_projection             200          0.3692193623        0.009950076829

Not sure what to make of that either, but I'm posting it in case it helps.
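
The difference between the two listings above (failing on every step after the first failure versus recovering in between) can be checked automatically. This is a minimal sketch with hypothetical function names; it assumes the second column of each `Nodal_projection` line is the iteration count and that 200 is the iteration cap, as the logs in this thread suggest.

```python
def nodal_iteration_counts(lines):
    """Extract the iteration count from each Nodal_projection line."""
    counts = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Nodal_projection":
            counts.append(int(parts[1]))
    return counts


def maxed_out_steps(counts, max_iter=200):
    """Indices of the listed steps that hit the iteration cap."""
    return [i for i, n in enumerate(counts) if n >= max_iter]


def recovers_after_failure(counts, max_iter=200):
    """True if any step converges again after the first maxed-out step."""
    maxed = maxed_out_steps(counts, max_iter)
    if not maxed:
        return False
    return any(n < max_iter for n in counts[maxed[0] + 1:])


# Lines copied from the start of the second (radar) listing above.
radar_log = """\
  Nodal_projection               7          0.4104877746       1.586428909e-07
  Nodal_projection             200          0.4062096545        0.001289872447
  Nodal_projection             200           0.397925796        0.002579744251
  Nodal_projection               7          0.3886768938       1.959483487e-07
  Nodal_projection             200          0.3817772175        0.003831758237
""".splitlines()

radar_counts = nodal_iteration_counts(radar_log)
print("iteration counts:", radar_counts)
print("maxed-out steps:", maxed_out_steps(radar_counts))
print("recovers after first failure:", recovers_after_failure(radar_counts))
```

Applied to the first listing, where every step after the first failure reports 200 iterations, `recovers_after_failure` would return `False`.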

@lawrenceccheung
Contributor Author

The latest run with CPUs only (July 7 with e97f8472 above) also failed, but please note -- it failed (as in core dumped) on the first MAC projection step, not on the nodal projection step. So something else is going on too.

Lawrence

@asalmgren
Contributor

Hey Lawrence -- can you turn on the verbosity and see where in the MAC it failed?

@lawrenceccheung
Contributor Author

lawrenceccheung commented Jul 11, 2023

@asalmgren I'm away right now, but I can get more details on where the MAC was failing next week. It's actually something I've seen in multiple cases now, so we might need to open a separate issue for it. Looping in @alhs6577 and @ndevelder here too -- we can put together a series of cases where we've seen MAC issues.

@github-actions

This issue is stale because it has been open 30 days with no activity.

@ashesh2512
Contributor

@lawrenceccheung Is this still an issue? It went away for me a few months back. It is concerning that I don't know what fixed it.

@lawrenceccheung
Contributor Author

> @lawrenceccheung Is this still an issue? It went away for me a few months back. It is concerning that I don't know what fixed it.

@ashesh2512 I will be testing this out again soon.

Lawrence
