
CAM Crashes with 58 levels and higher horiz resolution #442

Closed

andrewgettelman opened this issue Oct 1, 2021 · 69 comments
@andrewgettelman
Collaborator

andrewgettelman commented Oct 1, 2021

Opening an issue to describe crashes with high vertical resolution.

So far this has only been seen with higher resolution simulations, and with CAM-MPAS.

The basic test case is a 58L CAM-MPAS aquaplanet, which crashes almost immediately with an error from CLUBB.

The errors are coming out of CLUBB (we are not necessarily convinced it's CLUBB's fault yet), in advance_windm_edsclrm_module.F90.

The error is:

405: Fatal error solving for eddsclrm
405: Error in advance_windm_edsclrm

The error has been seen by @skamaroc and Xingying Huang (not sure of their GitHub names yet).

Vince Larson notes that:

edsclrm is CLUBB's array of scalars diffused by CLUBB's eddy diffusivity. windm is CLUBB's representation of the horizontal wind components. ("m" in CLUBB-speak denotes "grid-mean." So "windm" refers to the grid-mean values of u and v.)

I am guessing that the wind is the problem, not the eddy scalars, which are chemical species, etc.

The cause could be initial conditions (not initializing CLUBB variables). Or it could be upstream of CLUBB (and be the input winds).

Still trying to debug....

@vlarson

vlarson commented Oct 1, 2021

edsclrm is CLUBB's array of scalars diffused by CLUBB's eddy diffusivity. windm is CLUBB's representation of the horizontal wind components. ("m" in CLUBB-speak denotes "grid-mean." So "windm" refers to the grid-mean values of u and v.)
I am guessing that the wind is the problem, not the eddy scalars, which are chemical species, etc.

However, if the model crashes within 10 time steps, even in aquaplanet mode, when it runs with 120-km resolution, then perhaps a variable is not initialized properly. Maybe running with floating point trapping turned on could catch the first NaN, which might lead us to an initialization error.

@xhuang-ncar

xhuang-ncar commented Oct 4, 2021

@vlarson The model works for CAM-MPAS at 120km resolution with ZM2 (replacing zm_conv_intr.F90 and zm_conv.F90 as shared by Adam). However, that solution does not work for the 60km aquaplanet run (tested by Bill) or the 60-3km full-topography one (tested by me); both fail with the same error message.

Here are the first NaN values printed out in my cesm log file:

edsclrm # 9 = NaN Na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.

Setting the edsclrm values to zero causes other issues, with the model crashing at the same time step.

Adam has highlighted the relevant code blocks for the functions windm_edsclrm_rhs, windm_edsclrm_lhs, and windm_edsclrm_solve, and the clubb_at_least_debug_level check:

https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L464-L524

https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L555-L593

@swrneale
Collaborator

swrneale commented Oct 4, 2021

Sorry if I missed it, but is the crash in a 3km column and is that column at or near orography?

@vlarson

vlarson commented Oct 4, 2021

edsclrm # 9 = NaN Na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.

Is edsclrm # 9, 10, 11, or 12 initialized to a reasonable value?

Turning on floating point trapping might catch an uninitialized variable if there is one.
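(For reference, a minimal sketch of enabling trapping programmatically via Fortran's standard ieee_exceptions module; in practice CESM's DEBUG=TRUE compiler flags are the usual route, so treat this as illustrative only.)

  ! Illustrative only: halt at the first invalid operation (NaN-producing),
  ! divide-by-zero, or overflow, so the traceback points at the first bad
  ! value rather than at a later symptom.
  program trap_fpes
    use, intrinsic :: ieee_exceptions, only: ieee_set_halting_mode, &
        ieee_invalid, ieee_divide_by_zero, ieee_overflow
    implicit none
    real :: x, y
    call ieee_set_halting_mode( ieee_invalid,        .true. )
    call ieee_set_halting_mode( ieee_divide_by_zero, .true. )
    call ieee_set_halting_mode( ieee_overflow,       .true. )
    y = -1.0
    x = log( y )   ! signals IEEE_INVALID and now halts instead of quietly returning NaN
    print *, x
  end program trap_fpes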

@MiCurry

MiCurry commented Oct 4, 2021

Is the dynamics running at all? Or does this failure happen before the dynamics is run?

@andrewgettelman
Collaborator Author

@swrneale: we can get the crash in an aquaplanet model, so no land or topography. We have not located the point and level where it is happening. @xhuang-ncar will need some guidance on that (I don't really know how to pull out a column). Thanks!

@JulioTBacmeister
Collaborator

JulioTBacmeister commented Oct 4, 2021 via email

@xhuang-ncar

xhuang-ncar commented Oct 4, 2021

Is the dynamics running at all? Or does this failure happen before the dynamics is run?

Yes, it was running. It crashed after 16 time steps. Also, I am using 120s as the dtime given the refined 3km resolution.

@xhuang-ncar

edsclrm # 9 = NaN Na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.

Is edsclrm # 9, 10, 11, or 12 initialized to a reasonable value?

Turning on floating point trapping might catch an uninitialized variable if there is one.

Not sure about that. How should that be initialized normally? Let me turn on the debug mode and also print out the input variables to the windm_edsclrm_solve function in the log file.

@JulioTBacmeister
Collaborator

JulioTBacmeister commented Oct 4, 2021 via email

@adamrher

adamrher commented Oct 5, 2021

@JulioTBacmeister Looking through the log, I am only seeing NaN's for processor 776. However, I can tell that the entire write-statement code block was written out for 776, because its last entries in the cesm.log are:

776: wpedsclrp = 0.000000000000000E+000 1.038482017559906E-005 776: 9.159381691910688E-006 7.466857448689597E-006 7.067816049197199E-006 776: 8.285625416776367E-006 1.226171921624444E-005 1.963394231215161E-005 776: 3.335552805102376E-005 4.535006155606812E-005 4.986023016470998E-005 776: 5.817265387986108E-005 6.110989684521623E-005 2.089284626521818E-005 776: 2.964944981635684E-006 2.922172823855216E-007 2.468995295154835E-008 776: 1.433156461244229E-008 3.468759159148850E-008 2.422783580467906E-008 ...

Where wpedsclrp is the last entry in that code block:
https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L555-L593

776 isn't the only processor with "Error in advance_windm_edsclrm", but it is the only one with NaNs.

If it were me, I'd probably write out all the input/output variables to the subroutine calls windm_edsclrm_rhs, windm_edsclrm_lhs, windm_edsclrm_solve. I'd first try conditioning these write statements on if ( err_code == clubb_fatal_error ) then right here:
https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L520-L524

Lastly, if you set ./xmlchange DEBUG=TRUE, does that turn on floating point trapping as suggested by Vince (@cacraigucar)? And I presume this needs to be run on Cheyenne (because it's such a large grid), so I think only the Intel compiler is used.

@xhuang-ncar

xhuang-ncar commented Oct 5, 2021

@adamrher Thanks for the suggestion. I will give it a try. Also, after setting DEBUG to TRUE as Vince suggested here, the run crashed with an MPT ERROR as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this?
(It is on Cheyenne with intel compiler.)

@xhuang-ncar

Sorry if I missed it, but is the crash in a 3km column and is that column at or near orography?

Both the 60km (aquaplanet) and 60-3km (full topography) runs crashed with this CLUBB error when using L58. Can we locate those NaN values?

I am also trying to figure out how it works for the CAM-SE at 25km (setting up a test for that).

@vlarson

vlarson commented Oct 5, 2021

@adamrher Thanks for the suggestion. I will give it try. Also, after setting the DEBUG to TRUE as Vince suggested here, the run crashed with MPT ERROR as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this?

The line of code that contained the first FPE is here:

825:MPT: #1  0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2  MPI_SGI_stacktraceback (
825:MPT:     header=header@entry=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22  03"...) at sig.c:340
825:MPT: #3  0x00002b55ec5dc4ff in first_arriver_handler (signo=signo@entry=8, 
825:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2b55f6c80080) at sig.c:489
825:MPT: #4  0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=<optimized out>, 
825:MPT:     extra=<optimized out>) at sig.c:565
825:MPT: #5  <signal handler called>
825:MPT: #6  0x000000000a554001 in __libm_log_l9 ()
825:MPT: #7  0x0000000002ce9481 in mo_drydep::drydep_xactive (sfc_temp=..., 
825:MPT:     pressure_sfc=..., wind_speed=..., spec_hum=..., air_temp=..., 
825:MPT:     pressure_10m=..., rain=..., snow=..., solar_flux=..., dvel=..., dflx=..., 
825:MPT:     mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12, ocnfrc=..., icefrc=..., 
825:MPT:     beglandtype=7, endlandtype=8)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110
825:MPT: #8  0x0000000002cb913d in mo_drydep::drydep_fromlnd (ocnfrac=..., icefrac=..., 
825:MPT:     sfc_temp=..., pressure_sfc=..., wind_speed=..., spec_hum=..., 
825:MPT:     air_temp=..., pressure_10m=..., rain=..., snow=..., solar_flux=..., 
825:MPT:     dvelocity=..., dflx=..., 
825:MPT:     mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:210
825:MPT: #9  0x0000000002da7d99 in mo_gas_phase_chemdr::gas_phase_chemdr (lchnk=12, 
825:MPT:     ncol=16, imozart=10, 
825:MPT:     q=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, phis=..., zm=..., zi=..., calday=1.0208333333333333, tfld=..., 
825:MPT:     pmid=..., pdel=..., pint=..., cldw=..., troplev=..., troplevchem=..., 
825:MPT:     ncldwtr=..., ufld=..., vfld=..., delt=120, ps=..., xactive_prates=.FALSE., 
825:MPT:     fsds=..., ts=..., asdir=..., ocnfrac=..., icefrac=..., precc=..., 
825:MPT:     precl=..., snowhland=..., ghg_chem=.FALSE., latmapback=..., drydepflx=..., 
825:MPT:     wetdepflx=..., cflx=..., fire_sflx=<not associated>, 
825:MPT:     fire_ztop=<not associated>, nhx_nitrogen_flx=..., noy_nitrogen_flx=..., 
825:MPT:     qtend=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, pbuf=0x2b75ed00a248)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_gas_phase_chemdr.F90:1063
825:MPT: #10 0x00000000025278a4 in chemistry::chem_timestep_tend (state=..., ptend=..., 
825:MPT:     cam_in=..., cam_out=..., dt=120, pbuf=0x2b75ed00a248, fh2o=...)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/chemistry.F90:1290
825:MPT: #11 0x0000000000ee1b7c in physpkg::tphysac (ztodt=120, cam_in=..., 
825:MPT:     cam_out=..., state=..., tend=..., pbuf=0x2b75ed00a248)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/physics/cam/physpkg.F90:1562

One question is whether this line of code has a FPE even when using standard CAM code that runs fine. If so, then it's a red herring. If not, then it would be interesting to know if the problem is an uninitialized variable, or if the FPE appears after a few time steps.

@xhuang-ncar

xhuang-ncar commented Oct 5, 2021 via email

@xhuang-ncar

xhuang-ncar commented Oct 5, 2021

I see. I can set up a quick test using the 32 levels with the DEBUG on to check that out.

@vlarson As tested, I did not notice an FPE when using the 32 levels, which run without any issue. Also, with the 58 levels, the FPE appears after 15 steps.

@adamrher I have the log file with all the input/output variables to the subroutine calls (windm_edsclrm_rhs, windm_edsclrm_lhs, windm_edsclrm_solve):

https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.874406.chadmin1.ib0.cheyenne.ucar.edu.211005-160912

(or on Cheyenne: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/run/cesm.log.874406.chadmin1.ib0.cheyenne.ucar.edu.211005-160912)

Any further ideas about something (any particular values) being abnormal here?

@adamrher

adamrher commented Oct 6, 2021

To summarize, we have two leads:

(1) when CLUBB diffuses scalars # 9-12, it gives them NaNs on task 776. However, the same CLUBB error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like task 776).

(2) we are getting floating point exceptions with DEBUG=TRUE that are not present in the 120km 58-level MPAS runs. Not sure what to make of this.

I'm working with Xingying to reverse engineer these NaNs in (1). And will update the git issue if/when we learn anything.

@andrewgettelman
Collaborator Author

There was also an issue with giving MPAS the right topography file. Fixing that seems to resolve some (or all) of the issues with the 60km (uniform) MPAS 58L, according to @PeterHjortLauritzen.

@vlarson

vlarson commented Oct 6, 2021

To summarize, we have two leads:

(1) when clubb diffuses scalars # 9-12, it gives them NaNs on task 776. However, the same clubb error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like tasks 776).

I wonder what you'd find if you print out values of scalars 9-12 whenever they are negative, or NaN, or too large. Maybe that would lead to an initialization problem.

(2) we are getting floating point exceptions with DEBUG=TRUE, that are not present in the 120km 58 level MPAS runs. Not sure what to make of this.

If 120 km can run stably for a year without strange output in scalars 9-12, but 60 km crashes after 32 time steps, then I speculate that something is mis-configured or left uninitialized in the 60 km run.

@MiCurry

MiCurry commented Oct 6, 2021

There was also an issue with giving MPAS the right topography file. This seems to fix some of the issues (or all) with the 60km (uniform) MPAS 58L according to @PeterHjortLauritzen

That fix is only for the 60km with topography. @skamaroc has been seeing this issue with the 60km aquaplanet as well; I just recreated the case.

@adamrher

adamrher commented Oct 6, 2021

@MiCurry are there NaN's in the cesm.log? If so, for which variable?

@MiCurry

MiCurry commented Oct 6, 2021

@MiCurry are there NaN's in the cesm.log? If so, for which variable?

@adamrher It's the same as @xhuang-ncar's: edsclrm. @skamaroc's case is here: /glade/scratch/skamaroc/qpc6-60km-58L/run.

@adamrher

adamrher commented Oct 7, 2021

Thanks @MiCurry. It looks like there are also NaNs in wpedsclrp, which isn't surprising because it is derived from edsclrm. This seems like a cheaper grid to debug with than the 60-3km grid.

I've been paying attention to this code block:

  ! Decompose and back substitute for all eddy-scalar variables
  call windm_edsclrm_solve( edsclr_dim, 0, &     ! in
                            lhs, rhs, &          ! in/out
                            solution )           ! out

  if ( clubb_at_least_debug_level( 0 ) ) then
    if ( err_code == clubb_fatal_error ) then
      write(fstderr,*) "Fatal error solving for eddsclrm"
    end if
  end if

  !----------------------------------------------------------------
  ! Update Eddy-diff. Passive Scalars
  !----------------------------------------------------------------
  edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)

I had Xingying print out lhs, rhs and solution. I expected solution to have NaNs for edsclrm # 9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:

[Screenshots: log excerpts showing NaNs throughout edsclrm # 9 while solution has none]

I'd like to proceed with Vince's suggestion: write out the array if there are any NaNs. I would like to focus on lhs, rhs, solution and edsclrm first. And I'd like to comment out all the write statements with the if ( err_code == clubb_fatal_error ) then conditional, so that the logs are easier to read. Anyone have any other suggestions? @xhuang-ncar or @MiCurry, can you do this?
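A minimal sketch of such a NaN check (a standalone helper with hypothetical names; real(8) stands in for CLUBB's working precision, and fstderr is the output unit already used above):

  ! Sketch only: dump a 1-D profile whenever it contains a NaN.
  ! The x /= x comparison is true only for NaN; ieee_is_nan from the
  ! intrinsic ieee_arithmetic module is an equivalent alternative.
  subroutine report_nans( varname, x, nz, fstderr )
    implicit none
    character(len=*), intent(in) :: varname
    integer,          intent(in) :: nz, fstderr
    real(8),          intent(in) :: x(nz)

    if ( any( x /= x ) ) then
      write(fstderr,*) "NaN detected in ", varname, ", nz = ", nz
      write(fstderr,*) varname, " = ", x
    end if
  end subroutine report_nans

Called, for example, right after the solve as report_nans( "edsclrm 9", edsclrm(1:gr%nz,9), gr%nz, fstderr ), and similarly for rhs, solution, and each band of lhs.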

@vlarson

vlarson commented Oct 7, 2021

I had Xingying print out lhs, rhs and solution. I expected solution to have NaNs for edsclrm # 9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:

One notable thing is that in this example, edsclrm 9 has NaNs at all grid levels. Usually if CLUBB suffers a numerical instability, NaNs appear first at just a few grid levels. Maybe there is a memory error.

@adamrher

adamrher commented Oct 7, 2021

It might be worth trying to free up memory by halving the tasks per node:
./xmlchange MAX_TASKS_PER_NODE=16,MAX_MPITASKS_PER_NODE=16

@xhuang-ncar

@adamrher Sure, I can write out the array when there are any NaNs for lhs, rhs, solution and edsclrm first. Not sure how to do that for every variable though (if needed). Oh, also, I tried to free up memory and it did not work.

@adamrher

adamrher commented Oct 7, 2021

Could you also write out gr%nz when there are NaNs for any of the arrays? This is just a double-check that there's no funny business with this expression:

  edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)

@vlarson

vlarson commented Oct 19, 2021

I'm thinking that it might help to determine whether there is an initialization problem.  Do scalars 9-12 look bad already at the first time step?  Perhaps some plots comparing the run that crashes and a run that works would be illuminating.

It might also be useful to know whether the problem occurs near the lower surface or aloft.  If it's near the lower surface, the problem might have to do with either fine grid spacing, coupling with the surface, or surface emissions.  If aloft, it might help to reduce the time step.  Again, some plots might help.

@adamrher

I thought we had some action items from the meeting last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong. And then we determined that the RHS variable is all NaNs in CLUBB's scalar mixing subroutine, so some further debugging there would be helpful too (perhaps, as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?

825:MPT: #1 0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2 MPI_SGI_stacktraceback (
825:MPT: header=header@entry=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22 03"...) at sig.c:340
825:MPT: #3 0x00002b55ec5dc4ff in first_arriver_handler (signo=signo@entry=8,
825:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b55f6c80080) at sig.c:489
825:MPT: #4 0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=,
825:MPT: extra=) at sig.c:565
825:MPT: #5
825:MPT: #6 0x000000000a554001 in __libm_log_l9 ()
825:MPT: #7 0x0000000002ce9481 in mo_drydep::drydep_xactive (sfc_temp=...,
825:MPT: pressure_sfc=..., wind_speed=..., spec_hum=..., air_temp=...,
825:MPT: pressure_10m=..., rain=..., snow=..., solar_flux=..., dvel=..., dflx=...,
825:MPT: mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12, ocnfrc=..., icefrc=...,
825:MPT: beglandtype=7, endlandtype=8)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110

@zarzycki

I spoke very briefly with @vlarson about this yesterday and wanted to share some detail that may be helpful.

A.) I am certainly a bit gunshy, but we did have a bit of a red herring issue a few years ago where I was convinced there was an error in CLM5 + updated CAM-SE, but it really was just a better error trap from CLM's side. Chasing NaN's and stack traces led me down the wrong path -- it was only once I started dumping all state fields every timestep that I was able to pinpoint the regime causing the blowups.

B.) At the head of the E3SM repo, there have been modifications to CLUBB tuning compared to v1. These changes seem to have resulted in situations with "noise," even in vertically integrated fields. For example, @jjbenedict is running some VR-E3SM simulations with me for a DoE project and noted noise in the precip field of Hurricane Irene hindcasts (below) with dev code at E3SM head (2 subtly different tuning configs). A quick look at the published v1 data showed either no (or at least much less) noise, which seems to imply a response to changes in CLUBB tuning.

NOTE: while we are focused on the TC, you can also see some noise in the top right, which means this may not be 100% a TC response.
NOTEx2: this noise is also apparent in at least some CLUBB moments -- I don't have any of these runs anymore, but I distinctly remember seeing a ~2dx mode in upwp.

[Figure: Irene_L72 - precipitation snapshot from the Hurricane Irene hindcast with L72, showing noise]

Removing the lowest model level (i.e., merging lev(71) and lev(70) from a 0-based perspective) raises the lowest model level from ~15m to ~40m (still lower than CAM6's L32...). Just "thickening" the lowest model level eliminates the instability, even with everything else (init, config, etc.) identical to before (again, the only change is L72 -> L71). See the precip snapshot below.

[Figure: Irene_L71 - the same precipitation snapshot with the thickened lowest level (L71); the noise is gone]

Anecdotally, we have also had crashes (generally over cold mountainous areas) with L72 that go away with L71. Haven't followed up too closely, however.

My (completely unsubstantiated) hypothesis is that both CLUBB and the dycore are essentially providing vertical diffusion to the model -- whether that be considered implicit or explicit. Generally, this isn't an issue with thicker layers, but thinner layers have less mass/inertia and are more susceptible to CFL errors, so it's possible they are also more susceptible to the combined diffusion from the dynamics + physics. This may also help explain some of the horizontal resolution sensitivity, as the dycore's explicit diffusion will scale with resolution, so some of the power in the smaller scales + sharper gradients will still modulate the vertical diffusion, even in the absence of explicit changes to the vertical coordinate.

Anyway, my (super stupid easy) test would be to run a case with MPAS similar to the one that crashes but with a thicker lowest level -- everything else (horiz res, init, forcing, config) remains identical. Merging the bottom two levs was the easiest test for me, but I suppose just shifting everything up a bit would probably work. If the model follows the same stability trajectory, we are right where we started, but if it doesn't, it provides a more focused target for debugging.

@skamaroc

Question for @vlarson: I noticed that the vertical dissipation in CLUBB is implemented using semi-implicit Crank-Nicolson time integration. This can produce oscillatory behavior for large eddy viscosities (i.e. large K dt/dz^2). Is there a way we can change the weights on the time levels in the integration? It looks like the weights (1/2, 1/2) are hardwired in the code. It may be that the vertical discretization in the nonhydrostatic solver in MPAS, which uses a Lorenz vertical staggering, is not happy with oscillations that may be produced by the mixing in these thin layers. I'm also wondering about potential feedback between the mixing and the nonlinear computation of the eddy viscosities.
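For reference, the standard single-mode analysis behind this concern (a textbook sketch, not CLUBB's actual discretization): a theta-weighted step of the 1-D diffusion equation with constant K can be written

  % theta-weighted vertical diffusion step (theta = 1/2 is Crank-Nicolson)
  \frac{x^{n+1}_k - x^n_k}{\Delta t}
    = K \left[ \theta \, \delta_z^2 x^{n+1}_k + (1-\theta) \, \delta_z^2 x^n_k \right]
  % amplification factor of a mode with vertical wavenumber m
  A = \frac{1 - (1-\theta)\,\lambda}{1 + \theta\,\lambda},
  \qquad
  \lambda = \frac{4 K \Delta t}{\Delta z^2} \sin^2\!\left( \frac{m \Delta z}{2} \right)

For theta = 1/2 (Crank-Nicolson), |A| <= 1 so the scheme is stable, but A -> -1 as K dt/dz^2 grows, so the 2*dz mode flips sign every step with almost no damping; theta = 1 (backward Euler) gives A -> 0 in that limit, at the cost of first-order time accuracy.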

@JulioTBacmeister
Collaborator

JulioTBacmeister commented Oct 22, 2021 via email

@andrewgettelman
Collaborator Author

@adamrher, can you clarify @JulioTBacmeister 's comment... might it be coming from surface pressure calculations in the dry dep code? Thanks!

@adamrher

@andrewgettelman my earlier comment seems to have gotten lost in the thread:

I thought we had some action items from the meeting last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong.

And here is the line

   cvarb  = vonkar/log( z(i)/z0b(i) )

where z is an elevation derived from hydrostatic balance, and so uses the surface pressure.

In parallel, I think we should also consider @skamaroc's point that the time integration in CLUBB's eddy diffusivity may be producing oscillations that do not play well with the MPAS solver. On our end, I suggest we continue with the other action item from last week's meeting:

And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?

@andrewgettelman
Collaborator Author

Thanks @adamrher, super helpful for reminding us. I think we can check the dry dep code (which does look problematic, and probably needs to be rewritten to get heights if appropriate, or make sure they do not go negative...).

I also agree that Bill is probably on to something as well with the CLUBB solver, and that could be adjusted.

@andrewgettelman
Collaborator Author

From @nusbaume 👍

I am pretty early in my effort (I think you are all ahead of me), but I just wanted to say that in my run the NaN is coming into CLUBB via the state%q variable with constituent index 12 (i.e. DMS). So I can at least say that CLUBB isn't just randomly producing a NaN internally, but is instead being fed one (which would possibly point to dry deposition being the culprit).

I'll hold off for now until I hear how the dry deposition check goes. If turning off dry deposition doesn't work, or if you run into a new error, then just let me know and I will keep going with the debugging.

@xhuang-ncar

@andrewgettelman my earlier comment seems to have gotten lost in the thread:

I thought we had some action items from the meeting last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong.

And here is the line

   cvarb  = vonkar/log( z(i)/z0b(i) )

where z is an elevation derived from hydrostatic balance, and so uses the surface pressure.

After commenting out this line, the model did not run at all. Is there another way to turn this off?

In parallel, I think we should also consider @skamaroc point, that the time integration in CLUBB eddy diffusivity may be producing oscillations that does not play well with the MPAS solver. On our end, I suggest we continue with the other action item from last weeks mtg:

That's insightful. I am also checking the state variables at each time step before the crash; I will update later.

And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?

Jesse is working on this and has figured out that the NaN is coming into CLUBB.

@adamrher

adamrher commented Oct 22, 2021

Commenting out this line, the model did not get to run. Is there other way to turn this off?

The z/z0 is the height (above the surface?) of the lowest model level, divided by a roughness length z0. Reasonable values would be ln(50 m / 0.1 m) ~ 6. So why not just set this line to:

   cvarb  = vonkar/6._r8

@dan800
Collaborator

dan800 commented Oct 22, 2021

@xhuang-ncar Last week we had discussed simply commenting out the call to the dry deposition routine. Was that tried?

@vlarson

vlarson commented Oct 22, 2021

Question for @vlarson: I noticed that the vertical dissipation in CLUBB is implemented using semi-implicit Crank-Nicolson time integration. This can produce oscillatory behavior for large eddy viscosities (i.e. large K dt/dz^2). Is there a way we can change the weights on the time levels in the integration? It looks like the weights (1/2, 1/2) are hardwired in the code. It may be that the vertical discretization in the nonhydrostatic solver in MPAS, that uses a Lorenz vertical staggering, is not happy with oscillations that may be produced by the mixing in these thin layers. I'm also wondering about potential feedback from the mixing and the nonlinear computation of the eddy viscosities.

Brian Griffin (@bmg929) and I looked at the advance_windm_edsclrm code. We'll think about how best to relax the hard-wiring of the weights. There are a number of places where that would need to happen.

For my own reference, I'll take this opportunity to reference possibly related tickets (#342 and E3SM-Project/E3SM#4073).

Do we know whether the bad values originate near the lower surface or aloft?

@xhuang-ncar

@adamrher @dan800 @vlarson & everyone. Sorry for the late catch-up on this. I was taking some time off for personal business.

Here is what I got: in the problematic line cvarb = vonkar/log( z(i)/z0b(i) ), I set z(i) to 5.0, similar to what Andrew and Adam suggested, and that works. At least, the run (I am testing the 60km aqua one) keeps going without crashing. I will run a longer simulation and also track back to the bad z values (as in z(i) = - r/grav * air_temp(i) * (1._r8 + .61_r8*spec_hum(i)) * log(p/pg)).

Thanks all! Will update further.
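Spelling out the failure path with the expressions above (paraphrased from mo_drydep.F90; p is the lowest mid-level pressure and pg the surface pressure):

  ! Lowest-layer height from hydrostatic balance (as quoted above):
  z(i)  = -r/grav * air_temp(i) * (1._r8 + .61_r8*spec_hum(i)) * log( p/pg )
  ! If the (non-hydrostatic) mid-level pressure p exceeds the surface
  ! pressure pg, then log(p/pg) > 0 and z(i) < 0 ...
  cvarb = vonkar/log( z(i)/z0b(i) )
  ! ... and the log of a negative argument returns NaN (or SIGFPE with
  ! trapping on), which propagates into the deposition velocities and
  ! from there into state%q.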

@xhuang-ncar

@adamrher and all. Good news. I think we found the issue (at least for now). I have tested both the 60km and 60-3km runs for a five-day simulation. Both went well. So far, I am tracing back why the z values become invalid, which relates to p/pg (i.e. the pressure at the midpoint of the first layer over the surface pressure). Still figuring out what's going on with the pressures though. Many thanks to Adam and everyone! I will update further as more tests are conducted.

@andrewgettelman
Collaborator Author

Further updates:

  1. The two pressures put into the offending line in the dry deposition (p/pg) are pressure_10m and psurf. After a traceback through the dry deposition code interfaces, pressure_10m = pmid(:,k), i.e. the midpoint pressure of the bottom level. Sometimes this ends up greater than the surface pressure. Why?
  2. Reporting from @skamaroc:

Here's what's going on in dp_coupling for the pressure. The interface pressures are calculated using the hydrostatic equation, integrating down from the top of the model. The mid-level pressures are calculated using the equation of state and the MPAS state, i.e. the mid-level pressures computed in dp_coupling are the full nonhydrostatic pressures in the MPAS model. This difference in how the interface pressures and mid-level pressures are computed is what can lead to the problem we're seeing here. Of course for hydrostatic models this is not a problem because the hydrostatic relation is used to compute all the pressures. I'm not sure what the best way to deal with this is at this point.

So for the moment we can fix this lowest-layer height in dry deposition at 5 or 10m and it runs. The medium-term solution is to make the interface and mid-point pressure calculation in dp_coupling more consistent for a non-hydrostatic model. This might take help from @PeterHjortLauritzen.

Thanks!

@andrewgettelman
Collaborator Author

Reporting on results from @skamaroc :

Here's what's going on in dp_coupling for the pressure. The interface pressures are calculated using the hydrostatic equation, integrating down from the top of the model.

The mid-level pressures are calculated using the equation of state and the MPAS state, i.e. the mid-level pressures computed in dp_coupling are the full nonhydrostatic pressures in the MPAS model.

This difference in how the interface pressures and mid-level pressures are computed is what can lead to the problem we're seeing here. Of course for hydrostatic models this is not a problem because the hydrostatic relation is used to compute all the pressures. I'm not sure what the best way to deal with this is at this point.

I'm assuming this is in: src/dynamics/mpas/dp_coupling.F90

I cannot quite follow the logic in the code (since I think the call is to hydrostatic_pressure for both pint and pmid).

But it seems like we need to make pint and pmid consistent for MPAS. Should we use the full non-hydrostatic calculation and adjust dp_coupling? Or is there a reason not to? We will need guidance from @skamaroc and @PeterHjortLauritzen on the path forward, but it seems pretty simple to adjust and then test.

Thanks!

@PeterHjortLauritzen
Collaborator

looking into it

@PeterHjortLauritzen
Collaborator

PeterHjortLauritzen commented Nov 12, 2021

The current physics-dynamics coupling is discussed (in note form) in:

https://www.overleaf.com/read/jzjvmnqdxcfc

(equation numbers I refer to may not continue to match as the Overleaf document evolves - apologies!)

As described by @skamaroc, the issue is that the interface pressure is computed using hydrostatic balance (equation 8 in the Overleaf notes) and the mid-level pressure is computed using the equation of state (equation 19 in the Overleaf notes). The mid-level pressure is therefore the full non-hydrostatic pressure, which is not necessarily bounded by the hydrostatic interface pressures if there is non-hydrostatic motion. If that is the case, the state passed to physics is inconsistent (unphysical).

Why this choice and possible solution:

Why? (this is going to be technical) Choose your poison! Since CAM physics is pressure-based and hydrostatic, and MPAS is constant volume and non-hydrostatic, it is hard to make CAM-MPAS physics-dynamics coupling 100% consistent (unless CAM physics is reformulated as constant volume physics and relaxes the hydrostatic assumption). But some level of consistency can be enforced; in particular, one does not want the energy fixer to fix the wrong energy!

In cam6_3_032 I chose the following constraints and assumptions:

  1. When the MPAS state is passed to CAM physics, the z-levels diagnosed from pressure, temperature and moisture in physics are consistent with MPAS z levels (which are constant in MPAS) - when this is done using the hydrostatic interface pressure and non-hydrostatic mid-level pressure as described above (more details in the Overleaf doc), this is fulfilled.

  2. The total energy increment from heating yields the same column energy change in pressure and z-coordinates (this assumes that there is no flux of heat in-out the model top which is different in CAM physics and MPAS; constant p versus constant z) since column heating should be independent of the vertical coordinate system.

  3. The energy change associated with water change in the column (dme_adjust) is computed consistently with hydrostatic MPAS (water removed at a constant MPAS z-level is not the same as removing water from a constant pressure level in CAM physics where the heights evolve throughout physics!). Note that we have chosen that physics tendencies in physics and dynamics are associated according to index (i.e. we are not mapping tendencies from the new z-levels in physics back to MPAS z-levels). Hence there is a splitting error here ...

  4. The total energy fixer needs to use an energy formula consistent with hydrostatic MPAS (note that the total energy fixer fixes energy errors in the dynamical core, 3., and physics-dynamics coupling).

  5. Neglect non-hydrostatic effects (note that the only difference between hydrostatic and non-hydrostatic total energy formulas is the kinetic energy of vertical velocity; likely a small term in a global energy budget).

  6. As with SE and FV3, we neglect the mass-effect of condensates in the total energy formula (this is a very small term globally; but could also quite easily be fixed!).

Solution to pressure problem: Height is a diagnostic field in CAM physics. The energy formula in CAM physics does not use height. So energetically speaking the heights don’t matter in terms of the energy budget. That said, the height is used by certain parameterizations (and diagnostics) as an input variable to do their thing (e.g. compute thresholds for some process). Hence we could choose to violate 1, i.e. the mid-level height will not be entirely consistent with z in MPAS computed from equation (1) in Overleaf document. Perhaps that is OK since we have a splitting error anyway (with physics evolving on constant pressure and applied at constant volume index-wise).

I propose we use a hydrostatic mean pressure computed using a finite-volume mean approach (A7 in the Overleaf document). In this case the mid z-level will not exactly match MPAS, but since it is a diagnostic variable in CAM physics maybe that is "good enough". This will make sure that pmid is bounded by the interface pressures, sacrificing the "z consistency".

Draft code with the fixes is here (search for #define pmid_fix) - @skamaroc please advise which you prefer:

https://github.com/PeterHjortLauritzen/CAM/blob/mpas_pressure_fix/src/dynamics/mpas/dp_coupling.F90

(I also discovered an indexing bug in the energy diagnostics, introduced during PR changes - this is fixed in this code too.)

It has been verified that the energy consistency persists with the pressure fix.
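For readers following along, a minimal sketch of the proposed pmid change (hypothetical variable names; the actual draft is in the linked dp_coupling.F90): integrate the hydrostatic interface pressures down from the model top, then take each mid-level pressure as the mean of its bounding interfaces, so pmid is bounded by construction:

  ! Sketch only (hypothetical names, not the actual dp_coupling.F90 code).
  ! pint(k): hydrostatic interface pressure, k = 1 at the model top.
  ! pmid(k): mid-level pressure halfway between the bounding interfaces.
  pint(1) = ptop
  do k = 1, nlev
    pint(k+1) = pint(k) + grav * rho(k) * dz(k)    ! hydrostatic increment
    pmid(k)   = 0.5_r8 * ( pint(k) + pint(k+1) )   ! pint(k) <= pmid(k) <= pint(k+1)
  end do

This keeps the lowest mid-level pressure from exceeding the surface pressure, at the cost of the exact z-consistency in constraint 1 above.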

@PeterHjortLauritzen
Collaborator

Update on testing of the proposed solution using a full-level pressure halfway between the hydrostatic interface pressures:

Ran 13 months (used last 12 months for analysis, L32, F2000climo compset, 1 degree).

With the fix, the z-levels diagnosed in physics do not match the MPAS z levels. The figure below shows the difference (in meters) between CAM physics z levels (based on hydrostatic integration) and MPAS z levels for "random" locations as a function of level:

[Figure: cam phys z versus mpas z - difference (m) by level]

Same Figure but normalized (not exactly same locations):

[Figure: cam phys z versus mpas z - normalized]

The differences are on the order of 0.1-0.2%. For comparison here are similar Figures but showing how much z-levels move during physics (z end of physics minus z beginning of physics normalized by z):

[Figure: how much levels move in physics on average - normalized]

=> the levels move less due to physics processes compared to the discrepancy between MPAS z levels and CAM physics z levels computed as described above.


The z-level discrepancy impacts the energy fixer in the geopotential term (rho*g*z*dz):

In terms of the dry mass adjustment (dme_adjust) the change is: -0.410553 W/m^2 (dme_adjust using physics z levels) versus -0.41074 W/m^2 (dme_adjust using dycore z levels). The difference is in the 4th digit, so it is small!

In all, the modification is acceptable in terms of energy budget closure; however, it would be more consistent to change the thermodynamic potential in physics so that z stays fixed (that is, however, a research project).

@adamrher

adamrher commented Nov 16, 2021

In terms of the dry mass adjustment (dme_adjust) the change is: -0.410553 W/m^2 (dme_adjust using physics z levels) versus -0.41074 W/m^2 (dme_adjust using dycore z levels). The difference is in the 4th digit, so it is small!

These are small, but isn't this only measuring the change in levels due to the physics? Whether you start with MPAS z or CAM z, it is only measuring the change due to physics, right?

Might these 100m differences in the top ten or so levels have a larger impact on the energy fixer (not dme adjust)?

@andrewgettelman
Collaborator Author

Ran this last night and did some analysis this morning (6 years, 120km, 32L). Looks good to me:

Atmospheric Energy balance:

MPAS RESTOM 1.905 RESSURF 1.836 RESID 0.068
MPAS-dplinV2 RESTOM 2.091 RESSURF 2.064 RESID 0.028

Differences from the previous MPAS are minor for (A) cloud variables (single level), (B) temperature and (C) zonal wind. The only differences in B & C are at higher latitudes and upper levels, which could be just noise. Or they might be real, but the differences are small.

If this works with the 60km APE 58L and the 60-3km 58L, I am fine moving forward with this change.
[Figures: zmean_dplinV2, zmean_T_diff_dplinV2, zmean_U_diff_dplinV2 - zonal-mean cloud, temperature and zonal-wind differences]

@PeterHjortLauritzen
Collaborator

In terms of the dry mass adjustment (dme_adjust) the change is: -0.410553 W/m^2 (dme_adjust using physics z levels) versus -0.41074 W/m^2 (dme_adjust using dycore z levels). The difference is in the 4th digit, so it is small!

These are small, but isn't this only measuring the change in levels due to the physics? Whether you start with MPAS z or CAM z, it is only measuring the change due to physics, right?

Correct (z using CAM physics z versus MPAS z)!

Might these 100m differences in the top ten or so levels have a larger impact on the energy fixer (not dme adjust)?

In terms of energy the terms are weighted with rho*dz. Since there is very little mass up there, these differences do not really show (but it is indeed a bias).

@PeterHjortLauritzen
Collaborator

[Quoted figure: zmean_U_diff_dplinV2]

Above level 15 or so, CAM physics z is biased low (on average), so physics "thinks" that the heights are lower than they really are. I wonder if that systematically changes the gravity wave scheme (or some other parameterization) that uses z as an input variable, which leads to the jet changes?

@andrewgettelman
Collaborator Author

One more figure. I used the same code to plot the zonally and annually averaged 'PMID' on each pressure level. The difference with the new dp_coupling formulation is in the lower-right panel. In millibars, so generally < 1 mb and small, up to 2 mb near the surface. It might be related to the wind/temperature differences.

[Figure: zmean_PMID_diff_dplinV2 - zonal- and annual-mean PMID difference with the new dp_coupling formulation]

@andrewgettelman
Collaborator Author

From @xhuang-ncar

Glad to see the fix and to know that the pmid difference is small. As a follow-up to Andrew's comparison with the control run, I can also confirm that the pmid fix works with the 60km APE 58L and the 60-3km 58L. I am going to apply this update to future CAM-MPAS simulations.

Regarding @PeterHjortLauritzen's comment: do we need to investigate further? I guess the issue is which parameterizations use z as an input, and whether we can diagnose potential differences.

MiCurry removed their assignment Nov 30, 2021
jtruesdal added a commit that referenced this issue Dec 23, 2021
cam6_3_040: MPAS pressure fix
MPAS uses hydrostatic interface pressures and non-hydrostatic mid-level pressures when passing state to CAM physics (this ensures that geopotential_t will exactly recover MPAS z levels from the CAM physics state); unfortunately, the hydrostatic interface pressures did not bound the non-hydrostatic mid-level pressure at high resolution, leading to a physically inconsistent pressure field. The mid-level pressure computation is changed to the arithmetic mean of hydrostatic interface pressures. This makes sure that the mid-level pressure is bounded (but violates the geopotential_t consistency for the mid-level height).
This PR also addresses an error in the calculation of some MPAS energy diagnostics due to swapped array dimensions in the rhod and dz arrays.
closes #442
closes #487