CAM Crashes with 58 levels and higher horiz resolution #442
Comments
However, if the model crashes within 10 time steps, even in aquaplanet mode, when it runs with 120-km resolution, then perhaps a variable is not initialized properly. Maybe running with floating point trapping turned on could catch the first NaN, which might lead us to an initialization error. |
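For reference, a minimal sketch of turning on floating-point trapping from within Fortran itself, independent of the CESM DEBUG build flags, using the standard ieee_exceptions module; where such a call would actually be placed in CAM is an assumption, not existing code:
program trap_fpes
  ! Sketch: ask the runtime to halt on invalid operations (NaNs), divide-by-zero,
  ! and overflow, so the first bad operation aborts with a traceback instead of
  ! silently producing a NaN that only shows up later in the log.
  use, intrinsic :: ieee_exceptions, only: ieee_set_halting_mode, &
       ieee_invalid, ieee_divide_by_zero, ieee_overflow
  implicit none
  real :: x, y

  call ieee_set_halting_mode(ieee_invalid,        .true.)
  call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
  call ieee_set_halting_mode(ieee_overflow,       .true.)

  x = 0.0
  y = x / x   ! an invalid operation; would halt here with trapping enabled
  print *, y
end program trap_fpes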
@vlarson The model works for CAM-MPAS at 120km resolution with ZM2 (replaced the zm_conv_intr.F90 and zm_conv.F90 as shared by Adam). However, that solution does not work for the 60km aquaplanet run (Bill tested here) or the 60-3km full-topography one (I did here); both fail with the same error message. Here are the first NaN values printed out in my cesm log file:
edsclrm # 9 = NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values. Setting the edsclrm values to zero causes other issues, with the model crashing at the same time step. The code blocks highlighted by Adam cover the functions windm_edsclrm_rhs, windm_edsclrm_lhs, and windm_edsclrm_solve and the clubb_at_least_debug_level check:
https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L464-L524
https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L555-L593 |
Could you send the location of your log files? |
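For pulling out the offending column, a rough sketch of the kind of scan that could be dropped in just before the CLUBB call; the field name, the shape (column, level), and the use of CAM's phys_grid helpers get_rlat_p/get_rlon_p are assumptions for illustration, not existing code:
subroutine locate_first_nan(lchnk, ncol, pver, field, name)
  ! Sketch: scan a physics-chunk field for the first NaN and report where it is.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  use shr_kind_mod, only: r8 => shr_kind_r8
  use phys_grid,    only: get_rlat_p, get_rlon_p
  implicit none
  integer,          intent(in) :: lchnk, ncol, pver
  real(r8),         intent(in) :: field(:,:)      ! (column, level)
  character(len=*), intent(in) :: name
  integer :: i, k

  do i = 1, ncol
     do k = 1, pver
        if (ieee_is_nan(field(i,k))) then
           write(*,*) trim(name)//' first NaN: chunk=', lchnk, ' col=', i, ' lev=', k, &
                      ' lat=', get_rlat_p(lchnk, i), ' lon=', get_rlon_p(lchnk, i)
           return
        end if
     end do
  end do
end subroutine locate_first_nan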
Yes, it was running. It crashed after 16 time steps. Also, I am using 120s as the dtime given the refined 3km resolution. |
Not sure about that. How should that be initialized normally? Let me turn on the debug mode and also print out the input variables to the windm_edsclrm_solve function in the log file. |
@JulioTBacmeister Sure. The log file is here: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/run/cesm.log.660137.chadmin1.ib0.cheyenne.ucar.edu.210925-034021 |
Thanks Xingying. This is a question for everyone. The log file shows error messages that clearly originate in Adam's blocked-out portions of 'advance_windm_edsclrm', yet the messages in the log file don't advance through the entire
if ( err_code == clubb_fatal_error ) then
...
endif
block. They are always interrupted by what appears to be a new entry to the error block, e.g.:
776: up2 = 0.351420540093032 0.297728678007049 0.266079981897943
776: 0.247924301489710 0.234454106712590 0.222438014461757
776: 0.210447703011471 0.199152298849457 0.192144528317254
776: 0.195549715651495 0.214431704931684 0.244532312820753
800: Fatal error solving for eddsclrm
800: Error in advance_windm_edsclrm
800: Intent(in)
800: dt = 40.0000000000000
800: wm_zt = 0.000000000000000E+000 9.109764803756008E-005
800: 3.344162123594033E-004 6.675431334319516E-004 1.065847183270905E-003
So here, as up2 is being written out, suddenly a new error stream starts. Does this mean the code is failing on multiple processes at the same time? If so, can the error messages be forced to go through the entire block so that we get a clear idea of the profiles going into advance_windm_edsclrm? |
@JulioTBacmeister Looking through the log, I am only seeing NaN's for processor 776. However, I am finding that the entire write-statement code block is being written out for 776, because its last entries in the cesm.log are the end of that block. But 776 isn't the only processor with "Error in advance_windm_edsclrm"; again, 776 is the only one with NaNs. If it were me, I'd probably write out all the input/output variables to the subroutine calls. Lastly, if you set ./xmlchange DEBUG=TRUE, does that turn on floating point trapping as suggested by Vince (@cacraigucar)? And I presume this needs to be run on cheyenne (since it's such a large grid), so I think only the intel compiler is used. |
@adamrher Thanks for the suggestion. I will give it a try. Also, after setting DEBUG to TRUE as Vince suggested here, the run crashed with an MPT ERROR, as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this? |
Both the 60km (aquaplanet) and 60-3km (full topography) runs crashed with this CLUBB error when using L58. Can we locate those NaN values? I am also trying to figure out how it works for CAM-SE at 25km (setting up a test for that). |
The line of code that contained the first FPE is here:
825:MPT: #1 0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2 MPI_SGI_stacktraceback (
825:MPT: ***@***.***=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22 03"...) at sig.c:340
825:MPT: #3 0x00002b55ec5dc4ff in first_arriver_handler ***@***.***=8,
825:MPT: ***@***.***=0x2b55f6c80080) at sig.c:489
825:MPT: #4 0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=<optimized out>,
825:MPT: extra=<optimized out>) at sig.c:565
825:MPT: #5 <signal handler called>
825:MPT: #6 0x000000000a554001 in __libm_log_l9 ()
825:MPT: #7 0x0000000002ce9481 in mo_drydep::drydep_xactive (sfc_temp=...,
825:MPT: pressure_sfc=..., wind_speed=..., spec_hum=..., air_temp=...,
825:MPT: pressure_10m=..., rain=..., snow=..., solar_flux=..., dvel=..., dflx=...,
825:MPT: mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12, ocnfrc=..., icefrc=...,
825:MPT: beglandtype=7, endlandtype=8)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110
825:MPT: #8 0x0000000002cb913d in mo_drydep::drydep_fromlnd (ocnfrac=..., icefrac=...,
825:MPT: sfc_temp=..., pressure_sfc=..., wind_speed=..., spec_hum=...,
825:MPT: air_temp=..., pressure_10m=..., rain=..., snow=..., solar_flux=...,
825:MPT: dvelocity=..., dflx=...,
825:MPT: mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:210
825:MPT: #9 0x0000000002da7d99 in mo_gas_phase_chemdr::gas_phase_chemdr (lchnk=12,
825:MPT: ncol=16, imozart=10,
825:MPT: q=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, phis=..., zm=..., zi=..., calday=1.0208333333333333, tfld=...,
825:MPT: pmid=..., pdel=..., pint=..., cldw=..., troplev=..., troplevchem=...,
825:MPT: ncldwtr=..., ufld=..., vfld=..., delt=120, ps=..., xactive_prates=.FALSE.,
825:MPT: fsds=..., ts=..., asdir=..., ocnfrac=..., icefrac=..., precc=...,
825:MPT: precl=..., snowhland=..., ghg_chem=.FALSE., latmapback=..., drydepflx=...,
825:MPT: wetdepflx=..., cflx=..., fire_sflx=<not associated>,
825:MPT: fire_ztop=<not associated>, nhx_nitrogen_flx=..., noy_nitrogen_flx=...,
825:MPT: qtend=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, pbuf=0x2b75ed00a248)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_gas_phase_chemdr.F90:1063
825:MPT: #10 0x00000000025278a4 in chemistry::chem_timestep_tend (state=..., ptend=...,
825:MPT: cam_in=..., cam_out=..., dt=120, pbuf=0x2b75ed00a248, fh2o=...)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/chemistry.F90:1290
825:MPT: #11 0x0000000000ee1b7c in physpkg::tphysac (ztodt=120, cam_in=...,
825:MPT: cam_out=..., state=..., tend=..., pbuf=0x2b75ed00a248)
825:MPT: at /glade/work/xyhuang/CAM-1/src/physics/cam/physpkg.F90:1562
One question is whether this line of code has an FPE even when using standard CAM code that runs fine. If so, then it's a red herring. If not, then it would be interesting to know if the problem is an uninitialized variable, or if the FPE appears after a few time steps. |
I see. I can set up a quick test using the 32 levels with DEBUG on to check that out. |
@vlarson As tested, I did not notice an FPE when using 32 levels, which runs without any issue. Also, for the 58 levels, the FPE appears after 15 steps. @adamrher I have the log file with all the input/output variables to the subroutine calls.
Any further ideas about something (any particular values) being abnormal here? |
To summarize, we have two leads: (1) when clubb diffuses scalars # 9-12, it gives them NaNs on task 776. However, the same clubb error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like task 776). (2) we are getting floating point exceptions with DEBUG=TRUE that are not present in the 120km 58-level MPAS runs. Not sure what to make of this. I'm working with Xingying to reverse engineer these NaNs in (1), and will update the git issue if/when we learn anything. |
There was also an issue with giving MPAS the right topography file. Fixing that seems to resolve some (or all) of the issues with the 60km (uniform) MPAS 58L run, according to @PeterHjortLauritzen. |
I wonder what you'd find if you print out values of scalars 9-12 whenever they are negative, or NaN, or too large. Maybe that would lead to an initialization problem.
If 120 km can run stably for a year without strange output in scalars 9-12, but 60 km crashes after 32 time steps, then I speculate that something is mis-configured or left uninitialized in the 60 km run. |
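A rough sketch of the kind of check-and-print Vince is describing, to be placed near the scalar update in advance_windm_edsclrm; the array name and shape edsclrm(level, scalar), the 9-12 index range, and the "too large" threshold are assumptions for illustration:
subroutine report_bad_edsclrm(nz, edsclrm)
  ! Sketch: flag scalars 9-12 whenever a value is NaN, negative, or suspiciously large.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer,  intent(in) :: nz
  real(r8), intent(in) :: edsclrm(nz, *)        ! (vertical level, scalar index)
  real(r8), parameter :: big = 1.0e10_r8        ! assumed "too large" threshold
  integer :: k, j

  do j = 9, 12
     do k = 1, nz
        if ( ieee_is_nan(edsclrm(k,j)) .or. edsclrm(k,j) < 0.0_r8 .or. &
             abs(edsclrm(k,j)) > big ) then
           write(*,'(a,i3,a,i4,a,es15.7)') 'bad edsclrm: scalar ', j, ' level ', k, &
                ' value ', edsclrm(k,j)
        end if
     end do
  end do
end subroutine report_bad_edsclrm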
That fix is only for the 60km with topography. @skamaroc has been seeing this issue with the 60km aquaplanet as well; I just recreated the case. |
@MiCurry are there NaN's in the cesm.log? If so, for which variable? |
@adamrher It's the same as @xhuang-ncar's: |
Thanks @MiCurry. It looks like there are also NaNs there. I've been paying attention to this code block: |
I had Xingying print those out. I'd like to proceed with Vince's suggestion: write out the array if there are any NaNs. I would like to focus on that code block first. |
One notable thing is that in this example, edsclrm 9 has NaNs at all grid levels. Usually if CLUBB suffers a numerical instability, NaNs appear first at just a few grid levels. Maybe there is a memory error. |
It might be worth trying to free up memory by halving the tasks per node: |
@adamrher Sure, I can write out the array when there are any NaNs for lhs, rhs, solution and edsclrm first. Not sure how to do that for every variable though (if needed). Oh, also, I tried to free up memory and it did not work. |
Could you also write out those variables? |
I'm thinking that it might help to determine whether there is an initialization problem. Do scalars 9-12 look bad already at the first time step? Perhaps some plots comparing the run that crashes and a run that works would be illuminating. It might also be useful to know whether the problem occurs near the lower surface or aloft. If it's near the lower surface, the problem might have to do with either fine grid spacing, coupling with the surface, or surface emissions. If aloft, it might help to reduce the time step. Again, some plots might help. |
I thought we had some action items from the meeting last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see the stack trace earlier in this thread). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong. And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps, as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates? |
I spoke very briefly with @vlarson about this yesterday and wanted to share some detail that may be helpful.
A.) I am certainly a bit gunshy, but we did have a bit of a red herring issue a few years ago where I was convinced there was an error in CLM5 + updated CAM-SE, but it really was just a better error trap from CLM's side. Chasing NaN's and stack traces led me down the wrong path -- it was only once I started dumping all state fields every timestep that I was able to pinpoint the regime causing the blowups.
B.) At the head of the E3SM repo, there have been modifications to CLUBB tuning compared to v1. These changes seem to have resulted in situations with "noise," even in vertically integrated fields. For example, @jjbenedict is running some VR-E3SM simulations with me for a DoE project and noted noise in the precip field of Hurricane Irene hindcasts (below) with dev code at the E3SM head (2 subtly different tuning configs). A quick look at the published v1 data showed either no (or at least much less) noise, which seems to imply a response to changes in CLUBB tuning. NOTE: while we are focused on the TC, you can also see some noise in the top right, which means this may not be 100% a TC response.
After removing the lowest model level (i.e., merging lev(71) and lev(70) from a 0-based perspective), this raises the lowest model level from ~15m to ~40m (still lower than CAM6's L32...). Just "thickening" the lowest model level eliminates such instability, even with everything else (init, config, etc.) identical as before (again, the only change is L72 -> L71). See below precip snapshot. Anecdotally, we have also had crashes (generally over cold mountainous areas) with L72 that go away with L71. Haven't followed up too closely, however.
My (completely unsubstantiated) hypothesis is that both CLUBB and the dycore are essentially providing vertical diffusion to the model -- whether that be considered implicit or explicit. Generally, this isn't an issue with thicker layers, but thinner layers have less mass/inertia and are more susceptible to CFL errors, so it's possible they are also more susceptible to the combined diffusion from the dynamics + physics. This may also help explain some of the horizontal resolution sensitivity, as the dycore's explicit diffusion will scale with resolution, so some of the power in the smaller scales + sharper gradients will still modulate the vertical diffusion, even in the absence of explicit changes to the vertical coord.
Anyway, my (super stupid easy) test would be to run a case with MPAS similar to the one that crashes but with a thicker lowest level -- everything else (horiz res, init, forcing, config) remains identical. Merging the bottom two levs was the easiest test for me, but I suppose just shifting everything up a bit probably would work. If the model follows the same stability trajectory, we are right where we started, but if it doesn't, it provides a more focused target for debugging. |
Question for @vlarson: I noticed that the vertical dissipation in CLUBB is implemented using semi-implicit Crank-Nicolson time integration. This can produce oscillatory behavior for large eddy viscosities (i.e. large K dt/dz^2). Is there a way we can change the weights on the time levels in the integration? It looks like the weights (1/2, 1/2) are hardwired in the code. It may be that the vertical discretization in the nonhydrostatic solver in MPAS, that uses a Lorenz vertical staggering, is not happy with oscillations that may be produced by the mixing in these thin layers. I'm also wondering about potential feedback from the mixing and the nonlinear computation of the eddy viscosities. |
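For reference, a schematic of the weighting being discussed (a generic theta-weighted diffusion step on a uniform grid, not the actual CLUBB discretization in advance_windm_edsclrm_module.F90): theta = 1/2 is the hardwired Crank-Nicolson choice, while theta = 1 (backward Euler) damps rather than oscillates at large K dt/dz^2.
\[
\frac{\phi_k^{n+1}-\phi_k^{n}}{\Delta t}
 = \frac{K}{\Delta z^{2}}\Big[\theta\big(\phi_{k+1}^{n+1}-2\phi_k^{n+1}+\phi_{k-1}^{n+1}\big)
 + (1-\theta)\big(\phi_{k+1}^{n}-2\phi_k^{n}+\phi_{k-1}^{n}\big)\Big]
\]
\[
A(\theta,\lambda)=\frac{1-(1-\theta)\,\lambda}{1+\theta\,\lambda},
\qquad \lambda=\frac{4K\,\Delta t}{\Delta z^{2}}\sin^{2}\!\Big(\frac{\kappa\,\Delta z}{2}\Big)
\]
For $\theta=\tfrac{1}{2}$, $A\to-1$ as $\lambda\to\infty$: the scheme stays bounded but becomes oscillatory for large $K\,\Delta t/\Delta z^{2}$, which is the behavior @skamaroc describes, whereas $\theta=1$ gives $A\to 0$.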
I recently heard from Adam H that when the model is compiled with debugging ON, the crash is seen to originate in the dry deposition code, not in CLUBB. There is a particular line flagged that involves calculations with surface pressure. Am I misunderstanding? |
@adamrher, can you clarify @JulioTBacmeister's comment... might it be coming from surface pressure calculations in the dry dep code? Thanks! |
@andrewgettelman my earlier comment seems to have gotten lost in the thread. And here is the line that triggers the FPE in mo_drydep.F90:
cvarb = vonkar/log( z(i)/z0b(i) )
where z is an elevation derived from hydrostatic balance, and so uses the surface pressure. In parallel, I think we should also consider @skamaroc's point that the time integration of CLUBB's eddy diffusivity may be producing oscillations that do not play well with the MPAS solver. On our end, I suggest we continue with the other action item from last week's meeting. |
Thanks @adamrher, super helpful for reminding us. I think we can check the dry dep code (which does look problematic, and probably needs to be rewritten to get heights if appropriate, or make sure they do not go negative...). I also agree that Bill is probably on to something as well with the CLUBB solver, and that could be adjusted. |
From @nusbaume: I am pretty early in my effort (I think you are all ahead of me), but I just wanted to say that in my run the NaN is coming into CLUBB via the state%q variable with constituent index 12 (i.e. DMS). So I can at least say that CLUBB isn't just randomly producing a NaN internally, but is instead being fed one (which would possibly point to dry deposition being the culprit). I'll hold off for now until I hear how the dry deposition check goes. If turning off dry deposition doesn't work, or if you run into a new error, then just let me know and I will keep going with the debugging. |
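A rough sketch of the kind of check that can identify which constituent carries the NaN into CLUBB; hooking it in at the top of the CLUBB interface, and the exact modules pulled in, are assumptions about how it would be wired up:
subroutine find_nan_constituents(state)
  ! Sketch: report which constituents in state%q contain NaNs, and where.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  use shr_kind_mod,  only: r8 => shr_kind_r8
  use physics_types, only: physics_state
  use constituents,  only: pcnst, cnst_name
  use ppgrid,        only: pver
  implicit none
  type(physics_state), intent(in) :: state
  integer :: i, k, m

  do m = 1, pcnst
     do k = 1, pver
        do i = 1, state%ncol
           if (ieee_is_nan(state%q(i,k,m))) then
              write(*,*) 'NaN in state%q: constituent ', m, ' (', trim(cnst_name(m)), &
                         ') chunk ', state%lchnk, ' col ', i, ' lev ', k
              exit   ! report at most one column per level per constituent
           end if
        end do
     end do
  end do
end subroutine find_nan_constituents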
After commenting out this line, the model did not run. Is there another way to turn this off?
That's insightful. I am also checking the state variables at each time step before the crash; I will update later.
Jesse is working on this and has figured out that the NaN is coming into CLUBB. |
The z/z0 is the height (above the surface?) of the lowest model level, divided by a roughness length z0. Reasonable values would be ln(50 m / 0.1 m) ~ 6. So why not just set z in this line to a fixed, reasonable lowest-level height? |
@xhuang-ncar Last week we had discussed simply commenting out the call to the dry deposition routine. Was that tried? |
Brian Griffin (@bmg929) and I looked at the advance_windm_edsclrm code. We'll think about how best to relax the hard-wiring of the weights. There are a number of places where that would need to happen. For my own reference, I'll take this opportunity to reference possibly related tickets (#342 and E3SM-Project/E3SM#4073). Do we know whether the bad values originate near the lower surface or aloft? |
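To illustrate the kind of change being considered (purely schematic; not the actual CLUBB code, which builds the LHS/RHS in several places), here is a theta-weighted tridiagonal setup in which the hardwired one-half Crank-Nicolson weights become a single named parameter; all names are illustrative:
subroutine build_theta_diffusion(nz, dt, dz, Kdiff, phi, theta, lower, diag, upper, rhs)
  ! Schematic theta-weighted tridiagonal system for a diffusion step of phi.
  ! theta = 0.5 reproduces Crank-Nicolson (the hardwired (1/2, 1/2) weights);
  ! theta = 1.0 gives backward Euler. Interior points only; boundary rows are identity.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer,  intent(in)  :: nz
  real(r8), intent(in)  :: dt, dz(nz), Kdiff(nz), phi(nz), theta
  real(r8), intent(out) :: lower(nz), diag(nz), upper(nz), rhs(nz)
  real(r8) :: lam
  integer  :: k

  lower = 0.0_r8; diag = 1.0_r8; upper = 0.0_r8; rhs = phi
  do k = 2, nz-1
     lam      = Kdiff(k) * dt / dz(k)**2
     lower(k) = -theta * lam                         ! implicit (LHS) part
     diag(k)  = 1.0_r8 + 2.0_r8 * theta * lam
     upper(k) = -theta * lam
     rhs(k)   = phi(k) + (1.0_r8 - theta) * lam * &  ! explicit (RHS) part
                (phi(k+1) - 2.0_r8*phi(k) + phi(k-1))
  end do
end subroutine build_theta_diffusion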
@adamrher @dan800 @vlarson & everyone. Sorry for a late catch-up on this; I was taking some time off for personal business. Here is what I got. In this problematic line:
cvarb = vonkar/log( z(i)/z0b(i) )
I set z(i) to 5.0, similar to what Andrew and Adam suggested, and that works. At least, the run (I am testing the 60km aqua one) is going on without crashing again. I will run a longer simulation and also track back to the bad z values (as computed by z(i) = - r/grav * air_temp(i) * (1._r8 + .61_r8*spec_hum(i)) * log(p/pg)). Thanks all! Will update further. |
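For clarity, a schematic of the temporary workaround being described (not the final fix; the 5 m value and the surrounding names follow the snippet quoted above):
! In mo_drydep.F90 (schematic): the lowest-level height z(i) is derived from
! hydrostatic balance, so if the mid-level pressure p reaches or exceeds the
! surface pressure pg, log(p/pg) >= 0 and z(i) <= 0, making log(z(i)/z0b(i)) invalid.
! z(i) = - r/grav * air_temp(i) * (1._r8 + .61_r8*spec_hum(i)) * log(p/pg)
z(i)  = 5._r8                          ! temporary workaround: fixed ~5 m lowest-level height
cvarb = vonkar/log( z(i)/z0b(i) )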
@adamrher and all. Good news: I think we found the issue (at least for now). I have tested both the 60km and 60-3km runs for a five-day simulation, and both went well. I am now tracing back why the z values become invalid, which relates to p/pg (i.e. the pressure at the midpoint of the first layer divided by the surface pressure). Still figuring out what's going on with the pressures though. Many thanks to Adam and everyone! I will update further as more tests are conducted. |
Further updates:
So for the moment we can fix this lowest-layer height in dry deposition at 5 or 10m and it runs. The medium-term solution is to make sure the interface and mid-point pressure calculations in dp_coupling are more consistent for a non-hydrostatic model. This might take help from @PeterHjortLauritzen. Thanks! |
Reporting on results from @skamaroc:
I'm assuming this is in src/dynamics/mpas/dp_coupling.F90. I cannot quite follow the logic in the code (since I think the call is to hydrostatic_pressure for both pint and pmid), but it seems like we need to make pint and pmid consistent for MPAS. Should we use the full non-hydrostatic calculation and adjust the dp coupling? Or is there a reason not to? We will need guidance from @skamaroc and @PeterHjortLauritzen on the path forward, but it seems pretty simple to adjust and then test. Thanks! |
looking into it |
The current physics-dynamics coupling is discussed (in note form) in: https://www.overleaf.com/read/jzjvmnqdxcfc (equation numbers I refer to may not continue to match as the Overleaf document evolves - apologies!)
As described by @skamaroc, the issue is that the interface pressure is computed using hydrostatic balance (equation 8 in the Overleaf notes) and the mid-level pressure is computed using the equation of state (equation 19 in the Overleaf notes). The mid-level pressure is therefore the full non-hydrostatic pressure, which is not necessarily bounded by the hydrostatic interface pressures if there is non-hydrostatic motion. If that is the case, the state passed to physics is inconsistent (unphysical).
Why this choice, and a possible solution? (This is going to be technical - choose your poison!) Since CAM physics is pressure-based and hydrostatic, and MPAS is constant-volume and non-hydrostatic, it is hard to make CAM-MPAS physics-dynamics coupling 100% consistent (unless CAM physics is reformulated as constant-volume physics and relaxes the hydrostatic assumption). But some level of consistency can be enforced; in particular, one does not want the energy fixer to fix the wrong energy! In cam6_3_032 I chose the following constraints and assumptions: |
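Schematically (these are not the exact equations from the Overleaf notes), the two pressures being compared are the hydrostatically integrated interface pressure and the equation-of-state, non-hydrostatic mid-level pressure; the inconsistency arises when the latter falls outside the interval spanned by the bounding interfaces:
% Hydrostatic interface pressure, integrated downward from the model top:
\[
p_{k+1/2} \;=\; p_{k-1/2} \;+\; \rho_k\, g\, \Delta z_k
\]
% Non-hydrostatic mid-level pressure from a (schematic) moist equation of state,
% using the density and temperature that MPAS prognoses:
\[
p_k^{\mathrm{nh}} \;=\; \rho_k\, R_d\, T_{v,k}
\]
% Physical consistency of the state handed to physics requires
\[
\min\!\big(p_{k-1/2},\,p_{k+1/2}\big) \;\le\; p_k \;\le\; \max\!\big(p_{k-1/2},\,p_{k+1/2}\big),
\]
% which the non-hydrostatic mid-level pressure can violate when there is non-hydrostatic motion.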
Solution to pressure problem: Height is a diagnostic field in CAM physics. The energy formula in CAM physics does not use height, so energetically speaking the heights do not matter in terms of the energy budget. That said, the height is used by certain parameterizations (and diagnostics) as an input variable to do their thing (e.g. compute thresholds for some process). Hence we could choose to violate 1, i.e. the mid-level height will not be entirely consistent with z in MPAS computed from equation (1) in the Overleaf document. Perhaps that is OK since we have a splitting error anyway (with physics evolving on constant pressure and applied at constant volume index-wise).
I propose we use a hydrostatic mean pressure computed using a finite-volume mean approach (A7 in the Overleaf document). In this case the mid z-level will not exactly match MPAS, but since it is a diagnostic variable in CAM physics maybe that is "good enough". This will make sure that pmid is bounded by the interface pressures, sacrificing the z consistency.
Draft code with fix 2 is here (search for #define pmid_fix) - @skamaroc please advise which you prefer: https://github.com/PeterHjortLauritzen/CAM/blob/mpas_pressure_fix/src/dynamics/mpas/dp_coupling.F90 (I also discovered an indexing bug in the energy diagnostics made during the PR changes; this is fixed in this code too.)
It has been verified that the energy consistency persists with the pressure fix. |
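A minimal sketch of the proposed construction (schematic only; the real dp_coupling.F90 handles dry and moist constituents, indexing conventions, and the finite-volume mean carefully): interface pressures come from downward hydrostatic integration, and the mid-level pressure is then taken halfway between the bounding interfaces, so it can never fall outside them:
subroutine hydrostatic_pmid(nlev, ptop, gravit, rho, dz, pint, pmid)
  ! Sketch: hydrostatic interface pressures integrated downward from the model top,
  ! with mid-level pressure set halfway between the bounding interfaces.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer,  intent(in)  :: nlev
  real(r8), intent(in)  :: ptop              ! pressure at model top (Pa)
  real(r8), intent(in)  :: gravit            ! gravitational acceleration (m/s^2)
  real(r8), intent(in)  :: rho(nlev)         ! layer air density, top-to-bottom (kg/m^3)
  real(r8), intent(in)  :: dz(nlev)          ! layer thickness, top-to-bottom (m)
  real(r8), intent(out) :: pint(nlev+1)      ! interface pressures (Pa)
  real(r8), intent(out) :: pmid(nlev)        ! mid-level pressures (Pa)
  integer :: k

  pint(1) = ptop
  do k = 1, nlev
     pint(k+1) = pint(k) + rho(k)*gravit*dz(k)      ! hydrostatic: dp = rho*g*dz downward
  end do
  do k = 1, nlev
     pmid(k) = 0.5_r8*(pint(k) + pint(k+1))         ! bounded by the interfaces by construction
  end do
end subroutine hydrostatic_pmid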
Update on testing of the proposed solution using the full-level pressure halfway between the hydrostatic interface pressures: ran 13 months (used the last 12 months for analysis; L32, F2000climo compset, 1 degree).
With the fix, the z-levels diagnosed in physics do not match the MPAS z-levels. Plotting the difference (in meters, and normalized) between CAM physics z-levels (based on hydrostatic integration) and MPAS z-levels for "random" locations as a function of level, the differences are on the order of 0.1-0.2%. For comparison, similar plots of how much the z-levels move during physics (z at the end of physics minus z at the beginning of physics, normalized by z) show that the levels move less due to physics processes than the discrepancy between MPAS z-levels and the CAM physics z-levels computed as described above.
The z-level discrepancy impacts the energy fixer in the geopotential term (rho*g*z*dz). In terms of the dry mass adjustment (dme_adjust) the change is -0.410553 W/m^2 (dme_adjust using physics z-levels) versus -0.41074 W/m^2 (dme_adjust using dycore z-levels). The difference is in the 4th digit, so small!
In all, the modification is acceptable in terms of energy budget closure; however, it would be more consistent to change the thermodynamic potential in physics so that z stays fixed (that is, however, a research project). |
These are small, but isn't this only measuring the change in levels due to the physics? Whether you start with MPAS z or CAM z, it is only measuring the change due to physics, right? Might these 100m differences in the top ten or so levels have a larger impact on the energy fixer (not dme adjust)? |
Correct (z using CAM physics z versus MPAS z)!
In terms of energy, the terms are weighted with rho*dz. Since there is very little mass up there, these differences do not really show up (but it is indeed a bias).
One more figure. I used the same code to plot the zonal and annual averaged 'PMID' on each pressure level. The difference with the new dp_coupling formulation is the lower right panel. In millibars, so generally < 1mb and small, up to 2mb near the surface. Might be related to the wind/temperature differences.... |
From @xhuang-ncar:
Regarding @PeterHjortLauritzen's comment: do we need to investigate further? I guess the issue is what parameterizations use z as an input, and can we diagnose potential differences? |
cam6_3_040: MPAS pressure fix. MPAS uses hydrostatic interface pressures and non-hydrostatic mid-level pressures when passing the state to CAM physics (this ensures that geopotential_t will exactly recover the MPAS z-levels from the CAM physics state); unfortunately, the hydrostatic interface pressures did not bound the non-hydrostatic mid-level pressure at high resolution, leading to a physically inconsistent pressure field. The mid-level pressure computation is changed to the arithmetic mean of the hydrostatic interface pressures. This makes sure that the mid-level pressure is bounded (but violates the geopotential_t consistency for the mid-level height). This PR also addresses an error in the calculation of some MPAS energy diagnostics due to swapped array dimensions in the rhod and dz arrays. closes #442 closes #487
Opening an issue to describe crashes with high vertical resolution.
So far this has only been seen with higher resolution simulations, and with CAM-MPAS.
The basic test case is a 58L CAM-MPAS aquaplanet, which crashes almost immediately with an error from CLUBB:
The errors are coming out of CLUBB (we are not necessarily convinced it's CLUBB's fault yet) in advance_windm_edsclrm_module.F90.
The error is:
405: Fatal error solving for eddsclrm
405: Error in advance_windm_edsclrm
The error has been seen by @skamaroc and Xingying Huang (not sure of their GitHub names yet).
Vince Larson notes that:
The cause could be initial conditions (not initializing CLUBB variables). Or it could be upstream of CLUBB (and be the input winds).
Still trying to debug....