Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cam6_3_129: Update externals to match cesm2_3_alpha16d and bring in test_driver which supports derecho #888

Merged

Conversation

cacraigucar
Copy link
Collaborator

Closes #879 (test_driver.sh supports derecho)
Closes #886 (fix line too long in HEMCO external)

Updates externals to match cesm2_3_alpha16d (except ccs_config is newer)

Limited review as reviews were performed in #884 before it was broken into a more limited PR

@cacraigucar
Copy link
Collaborator Author

@nusbaume - This should have all of Chris' updates

Copy link
Collaborator

@nusbaume nusbaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one change requested based off a previous conversation.

@@ -551,6 +618,10 @@ if [ "${cesm_test_suite}" != "none" -a -n "${cesm_test_mach}" ]; then
chey* | r* )
testargs="${testargs} --queue ${CAM_BATCHQ} --test-root ${cesm_testdir} --output-root ${cesm_testdir}"
;;
# derecho
derc* | dec* )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be derec* based off the statement here:

#879 (comment)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committing that change bit both Chris and I. It is now committed!

Copy link
Collaborator

@nusbaume nusbaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great now, thanks!

@cacraigucar
Copy link
Collaborator Author

My changes in test_driver.sh were committed back to #884 and not #879, so when I merged #879, the changes were not brought in. I have confirmed the only differences in test_driver.sh are now Chris' most recent changes to an error code return and wallclock time

@cacraigucar cacraigucar changed the title Update externals to match cesm2_3_alpha16d and bring in test_driver which supports derecho cam6_3_129: Update externals to match cesm2_3_alpha16d and bring in test_driver which supports derecho Sep 21, 2023
@cacraigucar
Copy link
Collaborator Author

@jtruesdal - I have a failing test: /glade/scratch/cacraig/ERP_Ln9_Vnuopc.C96_C96_mg17.F2000climo.cheyenne_intel.cam-outfrq9s_mg3.20230925_121636_y2r02w/bld/atm.bldlog.230925-123421

This test is failing with an error message about no rule to make libfms.a which is needed by libfv3core.a. Up until this update of externals, we have been holding the FMS external frozen, but it appears that that is no longer going to work. Any ideas on how to solve this? Do we just break FV3 with CAM checkouts with this tag until FV3 is brought in, or do you have a solution I should apply?

@jtruesdal
Copy link
Collaborator

I think this can be fixed with a change to buildlib under cime_config and am looking into it.

@jtruesdal
Copy link
Collaborator

@cacraigucar It turns out the fix is a little more involved since the external updates for this PR include changes to the way the fms library is built. To allow the older FMS/FV3 code that cam is currently using to compile under the latest set of externals would require changes to 4 files.

cime - Makefile
CAM - cime_config/buildlib
FMS_Interface - buildlib
FMS_Interface - Makefile.cesm

The cime and CAM changes would also have been applied when we bring FV3/fms versions up to what CESM is using. We could apply this part of the fms/FV3 update now, the only caveat is that we need a new cime tag for the cime/Makefile change.

  1. make this PR without external updates.
  2. make this PR with only the external updates needed for derecho. Presumably all other external updates would occur when FV3/fms is updated in the near future.
  3. make the fixes to allow the older fv3/fms code to work with the set of external updates in this PR. This would require waiting on a new cime tag and including that in the external updates for this PR.

@cacraigucar
Copy link
Collaborator Author

@cacraigucar It turns out the fix is a little more involved since the external updates for this PR include changes to the way the fms library is built. To allow the older FMS/FV3 code that cam is currently using to compile under the latest set of externals would require changes to 4 files.

cime - Makefile CAM - cime_config/buildlib FMS_Interface - buildlib FMS_Interface - Makefile.cesm

The cime and CAM changes would also have been applied when we bring FV3/fms versions up to what CESM is using. We could apply this part of the fms/FV3 update now, the only caveat is that we need a new cime tag for the cime/Makefile change.

  1. make this PR without external updates.
  2. make this PR with only the external updates needed for derecho. Presumably all other external updates would occur when FV3/fms is updated in the near future.
  3. make the fixes to allow the older fv3/fms code to work with the set of external updates in this PR. This would require waiting on a new cime tag and including that in the external updates for this PR.

@jtruesdal - I vote for #3. The main purpose of this PR was to bring in all of the externals. I am willing to wait for a new cime tag and try this PR again at that point.

@jtruesdal
Copy link
Collaborator

@cacraigucar I found a workaround that just requires changes to cime_config/buildlib which will allow the older version of fms/fv3 to be built and the failing regression tests to pass. I can update the cime component Makefile and cime_config/buildlib files when bringing fms/FV3 up to the current cesm version. Can I commit the mods to your cam_update_ext_test_driver branch for review or would you like the file update another way?

@cacraigucar
Copy link
Collaborator Author

cacraigucar commented Sep 27, 2023 via email

@jtruesdal
Copy link
Collaborator

@nusbaume if you are reviewing changes to buildlib these were mostly provided by Jim Edwards in PR #808. This PR allows CAM to use a shared version of libfms which is now built by the CESM framework instead of CAM.

import sys, os, filecmp, shutil, imp


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtruesdal - I am getting the following error:

File "/glade/u/home/cacraig/cam_update_ext_test_driver/cime_config/buildlib", line 128, in
_main_func()
File "/glade/u/home/cacraig/cam_update_ext_test_driver/cime_config/buildlib", line 122, in _main_func
_build_cam(caseroot, libroot, bldroot)
File "/glade/u/home/cacraig/cam_update_ext_test_driver/cime_config/buildlib", line 92, in _build_cam
libfms = os.path.join(fmsbuilddir, "libfms.a")
UnboundLocalError: local variable 'fmsbuilddir' referenced before assignment

Comment on lines 92 to 95
libfms = os.path.join(fmsbuilddir, "libfms.a")
expect(os.path.isfile(libfms),
"err: Missing the FMS library libfms.a - cannot complete CAM build")
shutil.copy(libfms, libroot)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this section be put inside an if cam_dycore == "fv3": ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtruesdal - I went ahead and put this in an if block and it appears to be working. I have asked for a review from you to confirm this change

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cacraigucar Yes. I forgot this is run for all dycores but only needed for fv3.

Copy link
Collaborator

@jtruesdal jtruesdal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching and fixing that error.

@cacraigucar
Copy link
Collaborator Author

@jtruesdal - This regression is still failing:

/glade/scratch/cacraig/aux_cam_20230927145521/ERP_Ln9_Vnuopc.C96_C96_mg17.F2000climo.cheyenne_intel.cam-outfrq9s_mg3.GC.aux_cam_20230927145521

The error message is:
ERROR: err: Missing the FMS library libfms.a - cannot complete CAM build

@jtruesdal
Copy link
Collaborator

jtruesdal commented Sep 28, 2023

@jedwards4b when determining whether to use 'threads' or 'nothreads' in the in the shared library path, the code you provided for PR #808 checked the case environment variable BUILD_THREADED to decide between the two options. In the failing regression test mentioned by Cheryl, the library path being built includes 'threaded' even though BUILD_THREADED is false. I do see that SMP_PRESENT is set TRUE for this case denoting that at least one component is multithreaded although as far as I can tell all components seem to be using 1 thread. Should BUILD_THREADED and SMP_PRESENT should be consistent? In the ERP test failure mentioned above they are not. I've verified that this is the case in older regression test suites for aux_cam so this behavior has been present well before this PR. If this is a bug I could commit a workaround for this PR that checks for BUILD_THREADED or SMP_PRESENT to determine whether to use 'threads' or 'nothreads' in the path name.

@jedwards4b
Copy link

@jtruesdal This is a feature of the ERP test. https://github.com/ESMCI/cime/blob/master/CIME/data/config/config_tests.xml#L263
Because ERP may switch between threaded and non-threaded builds they need to have SMP enabled. I'll look closer at this test and see if I can figure it out.

@jedwards4b
Copy link

@cacraigucar I have a fix for cams buildlib - can I push it to your branch?

@jtruesdal
Copy link
Collaborator

Thanks Jim!

@cacraigucar
Copy link
Collaborator Author

@cacraigucar I have a fix for cams buildlib - can I push it to your branch?

That change did not impact the failing test. I still get the error message about the FMS library not being found.

I also have a couple of other failing tests that I am still looking into with the error message:

MPI error (MPI_File_write_al_all) : I/O error

@jedwards4b
Copy link

@cacraigucar can you point me to the case that can't find libfms.a

@cacraigucar
Copy link
Collaborator Author

The case is:
/glade/scratch/cacraig/aux_cam_20230928121031/ERP_Ln9_Vnuopc.C96_C96_mg17.F2000climo.cheyenne_intel.cam-outfrq9s_mg3.GC.aux_cam_20230928121031

The source code (with your change) is at:
/glade/u/home/cacraig/cam_update_ext_test_driver

@jedwards4b
Copy link

@cacraigucar I tried again here: /glade/scratch/jedwards/ERP_Ln9_Vnuopc.C96_C96_mg17.F2000climo.cheyenne_intel.cam-outfrq9s_mg3.20230929_102509_ua4p8l it works for me and I can't spot the difference - can you? My source is in /glade/work/jedwards/sandboxes/CAM

@cacraigucar
Copy link
Collaborator Author

@jedwards4b - I tried a single test and it worked fine for me as well. I have started up a the full suite testing again, but it will be awhile until it gets to this point. I'm wondering if there could be an incompatibility between the SHAREDLIB_BUILD and this test? I won't know until my testing gets to this point again.

@cacraigucar
Copy link
Collaborator Author

@jedwards4b and @jtruesdal - I did confirm that when I run with the full regression test suite, it fails to find the compiled FMS library (which appears to be there, but I guess it can't find it). When I run the single test (test_cac in the testlist_cam.xml) it runs fine. The two case directories to compare:

/glade/scratch/cacraig/aux_cam_20230929122120/ERP_Ln9_Vnuopc.C96_C96_mg17.F2000climo.cheyenne_intel.cam-outfrq9s_mg3.GC.aux_cam_20230929122120

/glade/scratch/cacraig/test_cac_20230929113325/ERP_Ln9_Vnuopc.C96_C96_mg17.F2000climo.cheyenne_intel.cam-outfrq9s_mg3.GC.test_cac_20230929113325/

@jtruesdal
Copy link
Collaborator

I've made a change to print the missing file and path so I should be able to see what is going on. I'll report back.

@cacraigucar
Copy link
Collaborator Author

The test which is failing with the missing FMS library is the only one which is holding back this tag. (That and awaiting review by @nusbaume). The tests with the MPI I/O error appear to have just been a transitory failure as they are now passing without any change to the source code.

@jtruesdal
Copy link
Collaborator

@JEdwards @cacraig I see what the problem is. There is a common sharedlibroot for the aux_cam regression testing where the fms library should be built but it is being built in the sharedrootlib directory of the ERP case. Thats why it works when executing that one case directly and not via a set of regression tests that is supposed to use a common sharedlibroot directory. Is this an effect of us using an older FMS version with its buildlib and Makefile.cesm or is it an issue with the regression test harness not building FMS in the correct directory?

@cacraigucar
Copy link
Collaborator Author

This error just arose when I updated the externals and the changes that were provided in this PR. It also is only this one test which has this "feature".

@jedwards4b
Copy link

@jtruesdal I think that the error is https://github.com/ESCOMP/CAM/pull/888/files#diff-7ce3095aab695d30cc02c4ee2a2d78290dfa34060661daade519474eb91beac0R93 - the directory where it is looking for libfms.a is wrong.

Copy link
Collaborator

@nusbaume nusbaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just a minor typo and a question related to the ChangeLog.

doc/ChangeLog Show resolved Hide resolved
doc/ChangeLog Show resolved Hide resolved
@cacraigucar cacraigucar merged commit ce55d98 into ESCOMP:cam_development Oct 4, 2023
@cacraigucar cacraigucar deleted the cam_update_ext_test_driver branch October 4, 2023 15:52
@cacraigucar cacraigucar self-assigned this Apr 19, 2024
gold2718 pushed a commit to gold2718/CAM that referenced this pull request May 2, 2024
Merge pull request ESCOMP#888 from cacraigucar/cam_update_ext_test_driver

cam6_3_129:  Update externals to match cesm2_3_alpha16d and bring in test_driver which supports derecho

ESCOMP commit: ce55d98
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Tag
Development

Successfully merging this pull request may close these issues.

5 participants