SMS_Lm1.ne30pg2_ECwISC30to60E2r1.WCYCL1850 broken #6230

Closed
darincomeau opened this issue Feb 10, 2024 · 17 comments · Fixed by #6233

@darincomeau
Member

The test SMS_Lm1.ne30pg2_ECwISC30to60E2r1.WCYCL1850 off master dies in the land component at the end of the month, with this at the end of lnd.log:

 time integrated flux =  -1.898220433018436E+050
 net change in state  =    15326895.4805661
 current state        =    11.9818119753306
 relative error [%]   =   1.990829226678872E+043
 ENDRUN:
 ERROR in CNPBudgetMod.F90 at line 929






 ERROR: Unknown error submitted to shr_abort_abort.
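
For readers unfamiliar with this kind of check, here is a minimal sketch of a generic mass-balance test, assuming the error is normalized by the current state. The function names, the tolerance, and the normalization are hypothetical; this is not the code in CNPBudgetMod.F90 and it is not meant to reproduce the logged percentage. It only shows why a flux term near 1e50 cannot close against a state change near 1.5e7.

```python
import math


def budget_relative_error(integrated_flux, net_change, current_state):
    """Percent mismatch between the integrated flux and the change in state."""
    # Assumption: normalize by the current state, with a small floor to avoid
    # dividing by zero. ELM may normalize differently.
    denom = max(abs(current_state), 1.0e-30)
    return abs(integrated_flux - net_change) / denom * 100.0


def check_budget(integrated_flux, net_change, current_state, tol_percent=1.0e-3):
    """Raise if the budget does not close to within tol_percent (hypothetical)."""
    err = budget_relative_error(integrated_flux, net_change, current_state)
    if not math.isfinite(err) or err > tol_percent:
        raise RuntimeError(f"budget error: relative error [%] = {err:.6e}")
    return err


# A closed budget passes: the fluxes account for the whole change in state.
check_budget(integrated_flux=2.5, net_change=2.5, current_state=12.0)

# Values copied from the lnd.log above: the flux term is ~1e50, so the check aborts.
try:
    check_budget(-1.898220433018436e50, 15326895.4805661, 11.9818119753306)
except RuntimeError as exc:
    print(exc)
```

A flux that many orders of magnitude away from the state change usually points at a value that was never set rather than at a real imbalance.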
@darincomeau
Member Author

I also see this with SMS_Lm1_P1280.ne30pg2_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.
This is off master
* 5886da6f5b (HEAD -> master, origin/master, origin/HEAD) Merge branch 'jgfouca/fix_prov_git_config' into master (PR #6227)

@darincomeau
Member Author

Narrowing down the range a bit, SMS_Lm1_P1280.ne30pg2_IcoswISC30E3r5.WCYCL1850.chrysalis_intel passes on master as of Jan 24, * e832fb54a5 (HEAD) Merge branch E3SM-Project/ndk/machinefiles/pm-gpu-nvidiagpu-fix (PR #6166)

@ndkeen
Contributor

ndkeen commented Feb 11, 2024

This might be a similar issue to #6188.

I also tried SMS_D_Lm1.ne4_oQU240.F2010.pm-cpu_gnu and SMS_D_Lm1.ne4_oQU240.F2010.pm-cpu_intel, which failed the same way as described above. In the land log:

time integrated flux =   -1.2116544587617718E+050
 net change in state  =    15566989.875112809
 current state        =    12.274692326702574
 relative error [%]   =    1.2404464877883807E+043
 ENDRUN:ERROR in /global/cfs/cdirs/e3sm/ndk/repos/nexty-feb5/components/elm/src/biogeochem/CNPBudgetMod.F90 at line 929

 ERROR: Unknown error submitted to shr_abort_abort.

And then with SMS_D_Lm1.ne30pg2_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel:

228: forrtl: error (65): floating invalid
228: Image              PC                Routine            Line        Source
228: libpthread-2.31.s  000014E4D63EF910  Unknown               Unknown  Unknown
228: e3sm.exe           000000000C25352D  ocn_diagnostics_m        1061  mpas_ocn_diagnostics.f90
228: e3sm.exe           000000000C237869  ocn_diagnostics_m         292  mpas_ocn_diagnostics.f90
228: e3sm.exe           000000000D9178B0  ocn_init_routines         168  mpas_ocn_init_routines.f90
228: e3sm.exe           000000000D90C945  ocn_forward_mode_         326  mpas_ocn_forward_mode.f90
228: e3sm.exe           000000000DAFCC87  ocn_core_mp_ocn_c          86  mpas_ocn_core.f90
228: e3sm.exe           000000000BF60118  ocn_comp_mct_mp_o         563  ocn_comp_mct.f90
228: e3sm.exe           00000000004A04A5  component_mod_mp_         257  component_mod.F90
228: e3sm.exe           000000000044D984  cime_comp_mod_mp_        1469  cime_comp_mod.F90
228: e3sm.exe           0000000000497422  MAIN__                    122  cime_driver.F90
228: e3sm.exe           0000000000439D3D  Unknown               Unknown  Unknown
228: libc-2.31.so       000014E4D5E3C24D  __libc_start_main     Unknown  Unknown

Those errors are with next of Feb 5th.

@rljacob
Member

rljacob commented Feb 11, 2024

#6188 is from the FAN model and this case doesn't have FAN on.

@ndkeen
Contributor

ndkeen commented Feb 21, 2024

I also see the error with ERS_Ld31.ne4pg2_oQU480.F2010.pm-cpu_intel using rljacob/update-test-v3res.

@rljacob
Member

rljacob commented Feb 26, 2024

@bishtgautam this needs to be fixed before the v3 tag.

@rljacob
Member

rljacob commented Feb 26, 2024

@darincomeau which science case is using ne30pg2_ECwISC30to60E2r1?
The only tests we have for that MPAS mesh are these:
"ERS_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-PISMF",
"PEM_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-PISMF",

@darincomeau
Member Author

@rljacob the grid is misleading; it's just what I had on hand when I created the issue. I believe the real issue is the compsets, from here: #6108

I've been using #6233 as a fix in my testing that requires more than one month.

bishtgautam added a commit that referenced this issue Feb 26, 2024
Sets the values for three fluxes that are included in monthly CNP budgets.

Fixes #6230
[BFB]
@ndkeen
Contributor

ndkeen commented Feb 27, 2024

This is the error that causes tests like the following to fail (on master of Feb 26th):

ERS_Ld31.ne4pg2_oQU480.F2010.pm-cpu_intel
SMS_Ly1.ne4pg2_oQU480.F2010.pm-cpu_intel
SMS_Lm1.ne4pg2_oQU480.F2010.pm-cpu_intel

@rljacob
Member

rljacob commented Feb 27, 2024

If you have a chance, try with the latest version of next.

@ndkeen
Contributor

ndkeen commented Feb 27, 2024

Same error with next of Feb 26th with ERS_Ld31.ne4pg2_oQU480.F2010.pm-cpu_intel

@rljacob
Member

rljacob commented Feb 27, 2024

Does the SMS_Lm1 test work?

@ndkeen
Contributor

ndkeen commented Feb 27, 2024

Yes, I was about to post an update: SMS_Lm1.ne4pg2_oQU480.F2010.pm-cpu_intel does complete using next of Feb 26th.

@bishtgautam
Contributor

bishtgautam commented Feb 27, 2024

I can reproduce @ndkeen's failure of ERS_Ld31.ne4pg2_oQU480.F2010.pm-cpu_intel. The C-budget in ELM reports a large relative error for the second run of the ERS test. I suspect some variable isn't being written out in (or read from) the elm.rh0. I will debug this error today.
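
To illustrate the suspected failure mode, here is a toy sketch, not ELM code, of how a budget accumulator that is left out of a restart file breaks the balance only in the restarted run. The class `ToyBudget`, its fields, and the helpers `write_restart` and `read_restart` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ToyBudget:
    state: float = 0.0        # current pool size
    begin_state: float = 0.0  # pool size at the start of the budget period
    flux_sum: float = 0.0     # time-integrated flux since the period started

    def step(self, flux, dt=1.0):
        self.state += flux * dt
        self.flux_sum += flux * dt

    def imbalance(self):
        # Zero when the integrated flux explains the whole change in state.
        return (self.state - self.begin_state) - self.flux_sum


def write_restart(model, fields):
    """Save only the named fields (stand-in for a restart file)."""
    return {name: getattr(model, name) for name in fields}


def read_restart(data):
    """Fields missing from the restart silently fall back to their defaults."""
    return ToyBudget(**data)


def run_with_restart(fields):
    model = ToyBudget(state=100.0, begin_state=100.0)
    for _ in range(3):
        model.step(flux=2.0)
    restarted = read_restart(write_restart(model, fields))
    for _ in range(2):
        restarted.step(flux=2.0)
    return restarted.imbalance()


print(run_with_restart(("state", "begin_state", "flux_sum")))  # complete restart
print(run_with_restart(("state", "begin_state")))              # flux_sum dropped
```

With the complete field list the restarted run still balances. With `flux_sum` omitted, the fluxes from before the restart are lost and the budget no longer closes, even though a continuous run of the same length would pass.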

@rljacob
Member

rljacob commented Feb 27, 2024

I'm seeing differences in fill pattern reported on gcp. 1 for ERS and 2 for ERP which implies one field has a number-of-tasks dependency.

@mahf708
Contributor

mahf708 commented Feb 28, 2024

> I'm seeing differences in fill pattern reported on gcp. 1 for ERS and 2 for ERP which implies one field has a number-of-tasks dependency.

xref #6266

@rljacob
Member

rljacob commented Feb 28, 2024

Turned out the ERP diff was just from #6233

bishtgautam added a commit that referenced this issue Feb 28, 2024
The following updates are made to fix the CNP budget in ELM.

- The value for three fluxes used in the monthly CNP budget is now initialized.
- An MPI_Allreduce (instead of an MPI_Reduce) is used when budgets are computed.
- An additional field is written out in the ELM restart file.
- Long names and units of a few variables in the restart files are corrected.

Fixes #6230
[BFB] (except a new field is added in the ELM restart file)
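
As background on the second bullet above, here is a small pure-Python simulation of the difference between the two collectives. It uses no MPI library; `simulated_reduce` and `simulated_allreduce` are hypothetical helpers that return what each rank would hold afterwards. In MPI, the receive buffer of a reduce is significant only on the root rank, while an allreduce delivers the result to every rank. Why the commit needed this change is not stated in the thread; one plausible reading, offered as an assumption, is that ranks other than the root also use the global budget totals.

```python
def simulated_reduce(local_values, root=0):
    """Per-rank result of a sum-reduce: only the root holds the total."""
    total = sum(local_values)
    # None stands in for a receive buffer whose contents are not defined.
    return [total if rank == root else None for rank in range(len(local_values))]


def simulated_allreduce(local_values):
    """Per-rank result of a sum-allreduce: every rank holds the total."""
    total = sum(local_values)
    return [total] * len(local_values)


# Hypothetical per-rank partial budget sums.
partials = [1.0, 2.0, 3.0]

print(simulated_reduce(partials))
print(simulated_allreduce(partials))
```

If every rank goes on to evaluate the budget check, the reduce variant leaves the non-root ranks working from whatever happened to be in their buffers, which is consistent with the kind of task-count dependence discussed earlier in this thread.
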
mahf708 pushed a commit that referenced this issue Feb 28, 2024
Sets the values for three fluxes that are included in monthly CNP budgets.

Fixes #6230
[BFB]