Reproducibility problem with clm4_5_10_r187 -- 30% of the time randomly dies with a methane error #143

Closed
ekluzek opened this issue Dec 16, 2017 · 5 comments
Labels
closed: wontfix (We won't fix this issue, because it would be too difficult and/or isn't important enough to fix)
type: bug (something is working incorrectly)

Comments


ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-08-08 13:16:05 -0600
Bugzilla Id: 2344
Bugzilla Depends: 2345,
Bugzilla CC: andre, dlawren, dll, mvertens, oleson, rfisher, sacks,

A reproducibility issue came in with clm4_5_10_r187: about 30% of the time the following case dies with a methane error, and the rest of the time it runs to completion. I think simpler cases will die as well, but the simpler tests I ran succeeded, and I ran them before I realized the failure wasn't reproducible.

create_newcase -case clm4_5_10_r187_SP_4x5_ADspin -mach yellowstone_intel -res f45_f45 -user_compset 1850_DATM%CRU_CLM50%BGC_SICE_SOCN_MOSART_SGLC_SWAV -user_pes_setby clm

./xmlchange CLM_ACCELERATED_SPINUP="on",STOP_N=19,STOP_OPTION="nmonths",CLM_FORCE_COLDSTART="on",DATM_CLMNCEP_YR_ALIGN=1901
./xmlchange DATM_CLMNCEP_YR_END=1905,DATM_CLMNCEP_YR_START=1901,DATM_MODE=CLMGSWP3

user_nl_clm
paramfile='/glade/u/home/rfisher/Matlab/pft_files/FUNparams/TRY_default_c160708_hydr1.5.nc'
fsurdat='/glade/p/cesm/sdwg_dev/lawrence/surfdata/surfdata_4x5_16pftsmidarctic_simyr2000_c160419.nc'
use_hydrstress=.true.
use_luna=.true.

SourceMods/src.datm/namelist_defaults_datm.xml

<namelist_defaults>

CLMGSWP3.Solar,CLMGSWP3.Precip,CLMGSWP3.TPQW,CLMCRUNCEP_V5.TPQW

<strm_datvar stream="CLMCRUNCEP_V5.TPQW">
TBOT tbot
QBOT shum
</strm_datvar>

<strm_datvar stream="CLMGSWP3.TPQW">
WIND wind
PSRF pbot
FLDS lwdn
</strm_datvar>

</namelist_defaults>

An example error message in cesm.log is...

41: Negative conc. in ch4tran. c,j,deficit (mol): 3512 1
41: 8.843302222960208E-003
0: memory_write: model date = 10623 0 memory = 104.44 MB (highwater) 1946.05 MB (usage) (pe= 0 comps= ATM ICE OCN WAV ESP)
65: Methane demands exceed methane available. Error in methane competition (mol/m^3
65: /s), c,j: -2.755768946371973E-010 6818 2
65: Latdeg,Londeg= -30.0000000000000 35.0000000000000
65: ENDRUN:
65: ERROR: Methane demands exceed methane available.ERROR in ch4Mod.F90 at line 33
65: 90
65:
65:
65:
65:
65:
65:
65: ERROR: Unknown error submitted to shr_sys_abort.
65:Image PC Routine Line Source
65:cesm.exe 000000000153ABA8 Unknown Unknown Unknown
65:cesm.exe 0000000000E43893 shr_sys_mod_mp_sh 401 shr_sys_mod.F90
65:cesm.exe 0000000000509058 abortutils_mp_end 43 abortutils.F90
65:cesm.exe 0000000000A4352D ch4mod_mp_ch4_tra 3389 ch4Mod.F90
65:cesm.exe 0000000000A377CD ch4mod_mp_ch4_ 1980 ch4Mod.F90
65:cesm.exe 0000000000511D82 clm_driver_mp_clm 835 clm_driver.F90
65:cesm.exe 00000000004FDC49 lnd_comp_mct_mp_l 437 lnd_comp_mct.F90
65:cesm.exe 000000000042EDFB component_mod_mp_ 1079 component_mod.F90
65:cesm.exe 0000000000419FB2 cesm_comp_mod_mp_ 2509 cesm_comp_mod.F90
65:cesm.exe 000000000042EB92 MAIN__ 93 cesm_driver.F90
65:cesm.exe 0000000000417B7E Unknown Unknown Unknown
65:libc.so.6 00002ACAABC8FD5D Unknown Unknown Unknown
65:cesm.exe 0000000000417A89 Unknown Unknown Unknown

A precursor to the problem seems to be these ch4tran warnings:

grep ch4tran cesm.log.160806-113740
36: Negative conc. in ch4tran. c,j,deficit (mol): 2842 1
98: Negative conc. in ch4tran. c,j,deficit (mol): 11343 1
24: Negative conc. in ch4tran. c,j,deficit (mol): 1205 1
34: Negative conc. in ch4tran. c,j,deficit (mol): 2561 1
141: Negative conc. in ch4tran. c,j,deficit (mol): 17127 1
53: Negative conc. in ch4tran. c,j,deficit (mol): 5161 1
73: Negative conc. in ch4tran. c,j,deficit (mol): 7928 1
37: Negative conc. in ch4tran. c,j,deficit (mol): 2964 1
40: Negative conc. in ch4tran. c,j,deficit (mol): 3375 1
93: Negative conc. in ch4tran. c,j,deficit (mol): 10667 1

The ch4tran warnings don't occur in the simulations that were successful.
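
A quick way to check a given log for this precursor is to scan it for the warning and collect the reporting task, column index, and level. The following is a minimal sketch, assuming the log line format shown in the grep output above; the function name `find_ch4tran_warnings` is hypothetical and not part of CLM or CIME.

```python
import re

# Matches lines such as:
#   "36: Negative conc. in ch4tran. c,j,deficit (mol): 2842 1"
# Captures the task prefix, the column index c, and the level index j.
_CH4TRAN_RE = re.compile(
    r"^\s*(\d+):\s*Negative conc\. in ch4tran\. c,j,deficit \(mol\):\s*(\d+)\s+(\d+)"
)


def find_ch4tran_warnings(lines):
    """Return a list of (task, column, level) tuples, one per warning line."""
    hits = []
    for line in lines:
        match = _CH4TRAN_RE.match(line)
        if match:
            task, column, level = (int(g) for g in match.groups())
            hits.append((task, column, level))
    return hits
```

Passing an open log file (for example `open("cesm.log.160806-113740")`) as `lines` works, since the function only iterates over lines. An empty result would be consistent with the successful runs described above.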

@ekluzek ekluzek added this to the clm5 milestone Dec 16, 2017

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-08-08 13:17:17 -0600

Rosie tried similar submissions with r185 and r186 and didn't see runs that died with the same problem for 8 submissions. Hence we think this came in with r187.
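
To gauge how strong that evidence is: if r185 and r186 had the same roughly 30% failure rate, eight clean submissions in a row would be fairly unlikely. Here is a back-of-the-envelope sketch, assuming each submission fails independently with the rate quoted above and treating the 8 submissions as one pool. Both of those are assumptions, not something established in this report, and the helper name `prob_all_pass` is hypothetical.

```python
def prob_all_pass(failure_rate, submissions):
    """Probability that every one of `submissions` independent runs succeeds."""
    return (1.0 - failure_rate) ** submissions


# Roughly 30% failure rate (from the report) and 8 submissions on r185/r186.
p = prob_all_pass(0.30, 8)
print(f"chance of 8 clean runs if the bug were already present: {p:.3f}")
```

That works out to about 6%, so the clean r185/r186 runs are suggestive but not conclusive on their own.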


ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-08-08 13:19:30 -0600

This is for a sequential layout without threading...

env_mach_pes.xml: NTASKS_ATM 15
env_mach_pes.xml: NTASKS_CPL 135
env_mach_pes.xml: NTASKS_ESP 15
env_mach_pes.xml: NTASKS_GLC 135
env_mach_pes.xml: NTASKS_ICE 15
env_mach_pes.xml: NTASKS_LND 135
env_mach_pes.xml: NTASKS_OCN 15
env_mach_pes.xml: NTASKS_ROF 135
env_mach_pes.xml: NTASKS_WAV 15
[erik@yslogin5 clm4_5_10_r187_SP_4x5_ADspin]$ ./xmlquery list | grep THRDS
env_mach_pes.xml: NTHRDS_ATM 1
env_mach_pes.xml: NTHRDS_CPL 1
env_mach_pes.xml: NTHRDS_ESP 1
env_mach_pes.xml: NTHRDS_GLC 1
env_mach_pes.xml: NTHRDS_ICE 1
env_mach_pes.xml: NTHRDS_LND 1
env_mach_pes.xml: NTHRDS_OCN 1
env_mach_pes.xml: NTHRDS_ROF 1
env_mach_pes.xml: NTHRDS_WAV 1

and with daily barriers...

env_run.xml: BARRIER_N 1
env_run.xml: BARRIER_OPTION ndays


ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-08-08 15:01:27 -0600

We think this problem occurs simply because Rosie's params file doesn't have rootprof_beta as a 2D variable. The second dimension (for carbon) is therefore never initialized and holds garbage, which would explain both the methane error and why it only shows up some of the time.

This file has rootprof_beta as a 1D variable rather than 2D:

/glade/u/home/rfisher/Matlab/pft_files/FUNparams/TRY_default_c160708_hydr1.5.nc

So this problem shouldn't occur for the default code, where the standard params file is used. It only affects cases where the params file doesn't have rootprof_beta as a 2D variable.

As a separate issue we should fix the code so it doesn't allow such a thing to happen.
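
One way such a guard could look, sketched outside the model: compare the rank of each parameter variable against what the code expects before using it. This is an illustration only. `EXPECTED_RANK`, `check_param_ranks`, and the dimension name in the usage lines are hypothetical, and a real check would presumably live in the Fortran parameter-reading code and query the netCDF file directly.

```python
# Expected number of dimensions per parameter variable (hypothetical table;
# only rootprof_beta's 2D requirement comes from the discussion above).
EXPECTED_RANK = {"rootprof_beta": 2}


def check_param_ranks(file_dims, expected=EXPECTED_RANK):
    """Return a list of error strings for missing or wrongly ranked variables.

    file_dims maps variable name -> tuple of dimension names as found in the
    params file (e.g. gathered from `ncdump -h` output or a netCDF reader).
    """
    errors = []
    for name, rank in expected.items():
        if name not in file_dims:
            errors.append(f"{name}: missing from params file")
        elif len(file_dims[name]) != rank:
            errors.append(
                f"{name}: expected {rank}D but file has "
                f"{len(file_dims[name])}D {file_dims[name]}"
            )
    return errors


# A 1D rootprof_beta (the dimension name "pft" is illustrative) is flagged,
# so a run could abort up front instead of reading uninitialized values.
for problem in check_param_ranks({"rootprof_beta": ("pft",)}):
    print(problem)
```

Failing early with a clear message would turn an intermittent methane error deep in the run into a deterministic error at startup.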


ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-08-22 16:33:43 -0600

This also occurs on hobart_intel, with both PIO1 and PIO2.

@ekluzek ekluzek removed this from the clm5 milestone Jul 7, 2019
@ekluzek ekluzek added the closed: wontfix and type: bug labels Jul 7, 2019

ekluzek commented Jul 7, 2019

I think this issue should have been closed earlier, since the comments indicate it was caused by a params file that didn't have a variable with the right dimensions. We don't seem to be running into this problem of late, so I'm going to close it.

@ekluzek ekluzek closed this as completed Jul 7, 2019
billsacks added a commit to billsacks/ctsm that referenced this issue Nov 13, 2020
ekluzek added a commit that referenced this issue Dec 16, 2023