part of cam6_3_056: PUMAS GPU regression test suite #577

@sjsprecious sjsprecious commented Apr 20, 2022

This PR introduces a new regression test suite for the GPU-enabled PUMAS code. The suite runs on Casper.

There are no source code changes here, only a few XML files. The suite also requires the updated CICE5/CICE6/CIME/ccs_config modules to work properly.

This regression test suite includes five different ERP tests:

  • CAM run with MG2, default namelist
  • CAM run with MG3, default namelist
  • CAM run with MG3, with the MG logical variables flipped from their default values
  • CAM run with MG3, default namelist, different PCOLS values
  • CAM run with MG3, default namelist, different number of GPUs per node

Previously, some GPU changes I made let the CAM run finish but produce wrong results. The last two tests in the list can detect such unexpected NBFB (not bit-for-bit) changes quickly, before an ECT test is run, so I added them to the test suite.

To generate a baseline for this GPU regression test suite on Casper, use the following commands:

  • cd /path_to_CAM_main_dir/test/system
  • module load python/3.7.9
  • env BL_TESTDIR='' CAM_ACCOUNT=YOUR_PROJECT_ID CAM_FC=nvhpc-gpu CIME_MACHINE=casper ./test_driver.sh --cesm casper_gpu --baseline-dir /path_to_save_baseline -f

To perform a GPU regression test against the baseline generated above, use the following command:

  • env BL_TESTDIR=/path_to_save_baseline CAM_ACCOUNT=YOUR_PROJECT_ID CAM_FC=nvhpc-gpu CIME_MACHINE=casper ./test_driver.sh --cesm casper_gpu --no-baseline -f
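The two invocations above differ only in `BL_TESTDIR` and in the baseline flag. A minimal sketch of a helper that builds either command line is shown below. The function name `gpu_regress_cmd` and the mode names `generate` and `compare` are hypothetical, and the helper only prints the command so it can be inspected before being run from `test/system`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): print the test_driver.sh command
# line for one of the two modes described above.
# Usage: gpu_regress_cmd generate|compare BASELINE_DIR PROJECT_ID
gpu_regress_cmd() {
  mode=$1
  baseline=$2
  project=$3
  # Settings shared by both invocations, copied from the commands above.
  common="CAM_ACCOUNT=${project} CAM_FC=nvhpc-gpu CIME_MACHINE=casper"
  case "$mode" in
    generate)
      # Baseline generation: empty BL_TESTDIR plus --baseline-dir.
      echo "env BL_TESTDIR='' ${common} ./test_driver.sh --cesm casper_gpu --baseline-dir ${baseline} -f"
      ;;
    compare)
      # Comparison run: BL_TESTDIR points at the baseline, plus --no-baseline.
      echo "env BL_TESTDIR=${baseline} ${common} ./test_driver.sh --cesm casper_gpu --no-baseline -f"
      ;;
    *)
      echo "unknown mode: ${mode}" >&2
      return 1
      ;;
  esac
}
```

For example, `gpu_regress_cmd compare /path_to_save_baseline YOUR_PROJECT_ID` prints the comparison command shown above.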

The status of the regression test can be viewed with the following commands:

  • cd /glade/scratch/$user/casper_gpu
  • ./cs.status.casper_gpu_nvhpc-gpu_xxxx, where xxxx is the time stamp for this test run
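A sketch for picking out failures from the status output follows. It assumes each result line begins with a status word followed by the test name and phase (for example `FAIL <test> COMPARE_base_rest`); that layout and the function name `report_failures` are assumptions, so the pattern may need adjusting to the real output.

```shell
# Hypothetical helper: read cs.status-style text on stdin and print every
# line whose first field is FAIL. The exit status is 0 only when no FAIL
# line was seen. Assumes lines shaped like "<STATUS> <test name> <phase>".
report_failures() {
  awk '$1 == "FAIL" { print; bad = 1 } END { exit (bad ? 1 : 0) }'
}
```

It could be used as `./cs.status.casper_gpu_nvhpc-gpu_xxxx | report_failures`, with xxxx replaced by the time stamp of the test run.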

Ideally, every line of output from the command above should report PASS. However, I saw FAIL for COMPARE_base_rest, with 4 fields differing in the cpl restart file. I later ran the ERP test with the PUMAS CPU code on Casper and got the same error. When I switched from the ERP test to an ERS test and re-ran it with the PUMAS GPU code, there was no FAIL in the output. I therefore think the failure is not caused by the PUMAS GPU code, but I do not know how to fix it.

Fix #512.

Katetc and others added 4 commits April 19, 2022 11:28
Update cam_development branch
	modified:   cime_config/testdefs/testlist_cam.xml
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/shell_commands
	renamed:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_casper/user_nl_cam -> cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/user_nl_cam
	renamed:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_casper/user_nl_clm -> cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/user_nl_clm
	renamed:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_casper/shell_commands -> cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/shell_commands
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/user_nl_cam
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/user_nl_clm
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/shell_commands
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/user_nl_cam
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/user_nl_clm
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/shell_commands
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/user_nl_cam
	new file:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/user_nl_clm
	modified:   test/system/test_driver.sh
@sjsprecious sjsprecious added the enhancement (New feature or request), BFB (bit for bit tag), and externals (externals updating issue or PR) labels Apr 20, 2022
@sjsprecious sjsprecious marked this pull request as draft April 20, 2022 16:51
@sjsprecious sjsprecious removed the BFB (bit for bit tag) label Apr 22, 2022
@sjsprecious sjsprecious marked this pull request as ready for review April 23, 2022 03:27
@gold2718 gold2718 left a comment

Couple of comments but otherwise, looks good!

Comment on lines 1 to 24
./xmlchange NTASKS_ATM=36
./xmlchange NTHRDS_ATM=1
./xmlchange ROOTPE_ATM='0'
./xmlchange NTASKS_LND=36
./xmlchange NTHRDS_LND=1
./xmlchange ROOTPE_LND='0'
./xmlchange NTASKS_ROF=36
./xmlchange NTHRDS_ROF=1
./xmlchange ROOTPE_ROF='0'
./xmlchange NTASKS_ICE=36
./xmlchange NTHRDS_ICE=1
./xmlchange ROOTPE_ICE='0'
./xmlchange NTASKS_OCN=36
./xmlchange NTHRDS_OCN=1
./xmlchange ROOTPE_OCN='0'
./xmlchange NTASKS_GLC=36
./xmlchange NTHRDS_GLC=1
./xmlchange ROOTPE_GLC='0'
./xmlchange NTASKS_WAV=36
./xmlchange NTHRDS_WAV=1
./xmlchange ROOTPE_WAV='0'
./xmlchange NTASKS_CPL=36
./xmlchange NTHRDS_CPL=1
./xmlchange ROOTPE_CPL='0'
Collaborator

For future reference, xmlchange recognizes these variables as categories so you can replicate this behavior with:

Suggested change
./xmlchange NTASKS=36
./xmlchange NTHRDS=1
./xmlchange ROOTPE='0'

Collaborator Author

Thanks @gold2718. I just updated all the shell_commands files with your suggestions.
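To make the equivalence concrete, the sketch below expands the three short commands back into the per-component form. It only prints the commands and does not call `xmlchange`; the function name `expand_pe_layout` is hypothetical, and the component list is simply the one used in the original file.

```shell
# Hypothetical illustration: print the per-component xmlchange commands that
# the three category-level commands in the suggestion stand for.
expand_pe_layout() {
  for comp in ATM LND ROF ICE OCN GLC WAV CPL; do
    echo "./xmlchange NTASKS_${comp}=36"
    echo "./xmlchange NTHRDS_${comp}=1"
    echo "./xmlchange ROOTPE_${comp}='0'"
  done
}

expand_pe_layout
```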

@@ -251,7 +251,7 @@ case $hostname in
mach_workspace="/glade/scratch"

# Check for CESM baseline directory
-if [ -n "{$BL_TESTDIR}" ] && [ ! -d "${BL_TESTDIR}" ]; then
+if [ -n "${BL_TESTDIR}" ] && [ ! -d "${BL_TESTDIR}" ]; then
Collaborator

Nice catch! I guess folks were not running this with a blank value for BL_TESTDIR.

Collaborator Author

Thanks! That is exactly how I found this issue, and I hope I am not the only one using an empty BL_TESTDIR.
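A small standalone illustration of why the misplaced brace matters when `BL_TESTDIR` is empty is given below. The variable names `old_check` and `new_check` are only for this sketch and do not appear in `test_driver.sh`.

```shell
# With BL_TESTDIR empty, "{$BL_TESTDIR}" expands to the literal string "{}",
# which is non-empty, so [ -n ... ] succeeds. "${BL_TESTDIR}" expands to "".
BL_TESTDIR=''

if [ -n "{$BL_TESTDIR}" ]; then old_check=nonempty; else old_check=empty; fi
if [ -n "${BL_TESTDIR}" ]; then new_check=nonempty; else new_check=empty; fi

echo "old: ${old_check}, new: ${new_check}"
# prints "old: nonempty, new: empty"
```

With the old form, the `-n` test is always true, so the directory check that follows runs even when no baseline directory was requested.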

@gold2718 gold2718 left a comment

On second thought, I would like a chance to see your ChangeLog entry (it can come before you have final test data).

	modified:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg2_default/shell_commands
	modified:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_default/shell_commands
	modified:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_nondefault/shell_commands
	modified:   cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3_pcols1536/shell_commands
@nusbaume nusbaume left a comment

Looks good to me! Just one possible typo.

@@ -465,7 +520,12 @@ if [ "${cesm_test_suite}" != "none" -a -n "${cesm_test_mach}" ]; then
fi

## Setup CESM work directory
cesm_testdir=$mach_workspace/$LOGNAME/$test_id
if [ "${hostname:0:6}" == "casper" ] || [ "${hostname:0:5}" == "crhtc" ]; then
## Would fail to compile on Casper with long foler name
Collaborator

Possibly a typo here (foler)?

Collaborator Author

Thanks @nusbaume for catching it. I just fixed the typo!

peverwhee pushed a commit to peverwhee/CAM that referenced this pull request Apr 28, 2022
@cacraigucar cacraigucar changed the title PUMAS GPU regression test suite part of cam6_3_056: PUMAS GPU regression test suite Apr 29, 2022
@peverwhee

This was brought in with #581.

@peverwhee peverwhee closed this Apr 30, 2022