part of cam6_3_056: PUMAS GPU regression test suite #577
Conversation
Update cam_development branch
modified: Externals.cfg
modified: cime_config/testdefs/testlist_cam.xml
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/shell_commands
renamed: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_casper/user_nl_cam -> cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/user_nl_cam
renamed: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_casper/user_nl_clm -> cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/user_nl_clm
renamed: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_casper/shell_commands -> cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/shell_commands
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/user_nl_cam
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/user_nl_clm
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/shell_commands
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/user_nl_cam
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/user_nl_clm
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/shell_commands
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/user_nl_cam
new file: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/user_nl_clm
modified: test/system/test_driver.sh
Couple of comments, but otherwise looks good!
./xmlchange NTASKS_ATM=36
./xmlchange NTHRDS_ATM=1
./xmlchange ROOTPE_ATM='0'
./xmlchange NTASKS_LND=36
./xmlchange NTHRDS_LND=1
./xmlchange ROOTPE_LND='0'
./xmlchange NTASKS_ROF=36
./xmlchange NTHRDS_ROF=1
./xmlchange ROOTPE_ROF='0'
./xmlchange NTASKS_ICE=36
./xmlchange NTHRDS_ICE=1
./xmlchange ROOTPE_ICE='0'
./xmlchange NTASKS_OCN=36
./xmlchange NTHRDS_OCN=1
./xmlchange ROOTPE_OCN='0'
./xmlchange NTASKS_GLC=36
./xmlchange NTHRDS_GLC=1
./xmlchange ROOTPE_GLC='0'
./xmlchange NTASKS_WAV=36
./xmlchange NTHRDS_WAV=1
./xmlchange ROOTPE_WAV='0'
./xmlchange NTASKS_CPL=36
./xmlchange NTHRDS_CPL=1
./xmlchange ROOTPE_CPL='0'
For future reference, `xmlchange` recognizes these variables as categories, so you can replicate this behavior with:
./xmlchange NTASKS=36
./xmlchange NTHRDS=1
./xmlchange ROOTPE='0'
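As a quick sanity check (not part of this PR, and assuming a standard CIME case directory), the category setting can be confirmed with `xmlquery`, which prints the per-component values that the single command expands to:

# Set the PE layout once via the category variables...
./xmlchange NTASKS=36
./xmlchange NTHRDS=1
./xmlchange ROOTPE='0'
# ...then confirm that ATM, LND, ROF, ICE, OCN, GLC, WAV, and CPL all picked up the values.
./xmlquery NTASKS,NTHRDS,ROOTPE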
Thanks @gold2718. I just updated all the `shell_commands` files with your suggestions.
@@ -251,7 +251,7 @@ case $hostname in
 mach_workspace="/glade/scratch"

 # Check for CESM baseline directory
-if [ -n "{$BL_TESTDIR}" ] && [ ! -d "${BL_TESTDIR}" ]; then
+if [ -n "${BL_TESTDIR}" ] && [ ! -d "${BL_TESTDIR}" ]; then
Nice catch! I guess folks were not running this with a blank value for `BL_TESTDIR`.
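For anyone reading along, here is a minimal illustration (a hypothetical shell session, not code from the PR) of why the old spelling slipped through: with the brace misplaced, the tested string is never empty, so the `-n` check passes even when `BL_TESTDIR` is blank.

BL_TESTDIR=""
# Buggy form: "{$BL_TESTDIR}" expands to the literal string "{}", which is non-empty,
# so the -n test succeeds even though no baseline directory was given.
[ -n "{$BL_TESTDIR}" ] && echo "buggy check fires"
# Fixed form: "${BL_TESTDIR}" expands to the empty string, so -n correctly fails.
[ -n "${BL_TESTDIR}" ] || echo "fixed check correctly skips"

With the buggy form, the missing-baseline error path could therefore trigger even when `BL_TESTDIR` was intentionally left empty.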
Thanks! That is exactly how I found this issue, and I hope I am not the only one using an empty `BL_TESTDIR`.
On second thought, I would like a chance to see your ChangeLog entry (can be before you have final test data).
modified: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/shell_commands
modified: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/shell_commands
modified: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/shell_commands
modified: cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/shell_commands
Looks good to me! Just one possible typo.
test/system/test_driver.sh (outdated)
@@ -465,7 +520,12 @@ if [ "${cesm_test_suite}" != "none" -a -n "${cesm_test_mach}" ]; then
fi

## Setup CESM work directory
cesm_testdir=$mach_workspace/$LOGNAME/$test_id
if [ "${hostname:0:6}" == "casper" ] || [ "${hostname:0:5}" == "crhtc" ]; then
## Would fail to compile on Casper with long foler name
Possibly a typo here (`foler`)?
Thanks @nusbaume for catching it. Just fixed the typo!
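The diff excerpt above cuts off before the body of the Casper branch. For readers skimming the thread, here is a rough sketch of the pattern being added; the shortened directory name inside the branch is an assumption for illustration, not the PR's actual value:

## Setup CESM work directory
cesm_testdir=$mach_workspace/$LOGNAME/$test_id
if [ "${hostname:0:6}" == "casper" ] || [ "${hostname:0:5}" == "crhtc" ]; then
    ## Builds would fail to compile on Casper with a long folder name,
    ## so fall back to a shorter work directory on Casper login/HTC nodes (hypothetical name).
    cesm_testdir=$mach_workspace/$LOGNAME/gpu_test
fi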
This was brought in with #581.
This PR introduces a new regression test suite for the GPU-enabled PUMAS codes, which works on Casper.
There is no source code change here, just a few XML files. It also requires the updated CICE5/CICE6/CIME/ccs_config modules to work properly. This regression test suite includes five different ERP tests.

Previously I made some GPU changes and somehow the CAM run could finish but return wrong results. The latter two new tests are able to detect those unexpected NBFB (non-bit-for-bit) changes quickly before doing an ECT test, and therefore I added them to the test suite.
To generate a baseline for this GPU regression test suite on Casper, use the following commands:
cd /path_to_CAM_main_dir/test/system
module load python/3.7.9
env BL_TESTDIR='' CAM_ACCOUNT=YOUR_PROJECT_ID CAM_FC=nvhpc-gpu CIME_MACHINE=casper ./test_driver.sh --cesm casper_gpu --baseline-dir /path_to_save_baseline -f
To perform a GPU regression test against the baseline generated above, use the following command:
env BL_TESTDIR=/path_to_save_baseline CAM_ACCOUNT=YOUR_PROJECT_ID CAM_FC=nvhpc-gpu CIME_MACHINE=casper ./test_driver.sh --cesm casper_gpu --no-baseline -f
The status of the regression tests can be viewed with the following commands:
cd /glade/scratch/$user/casper_gpu
./cs.status.casper_gpu_nvhpc-gpu_xxxx
where `xxxx` is the time stamp for this test run.
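As a small convenience (not part of the PR), the non-passing results can be pulled out of that report with an ordinary grep:

# Show only the failing test phases (the timestamp suffix is a placeholder).
./cs.status.casper_gpu_nvhpc-gpu_xxxx | grep FAIL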
Ideally we should observe `PASS` for all the output from the command above. However, I saw `FAIL` for `COMPARE_base_rest`, and there were 4 different fields in the cpl restart file. I later ran the ERP test for the PUMAS CPU codes on Casper and got the same error. However, if I switched the `ERP` test to an `ERS` test and re-ran it with the PUMAS GPU codes, there was no `FAIL` in the output. Thus I think the error message here is not caused by the PUMAS GPU codes, but I do not know how to fix this issue.

Fix #512.
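As a debugging aside (not part of this PR): if I understand the CIME test infrastructure correctly, the field differences reported by `COMPARE_base_rest` come from the `cprnc` comparison tool, so one way to see exactly which fields differ is to run `cprnc` by hand on the two coupler files being compared; the file names below are placeholders.

# Compare the coupler output from the base and restart runs (hypothetical paths).
cprnc base_run/cpl_file.nc rest_run/cpl_file.nc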