Add regression test suite for GPU-enabled PUMAS run on Casper #31

Closed
sjsprecious opened this issue Sep 29, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@sjsprecious
Collaborator

As suggested by @johnmauff, we need a regression test suite for the GPU-enabled PUMAS code so that we can better maintain its GPU compatibility and catch changes in the CAM code (less likely) or the CIME code (more likely) that may break the GPU run.

Based on the discussion at the AMP SE WG meeting on 09/28, we will probably focus on a GPU test suite for the PUMAS GitHub repo only for now. A test suite configuration will be added to https://github.com/PUMASDevelopment/CAM, and the regression test will be run whenever the PUMAS code is revised.

It was also suggested during the meeting that we run a regression test for each individual CAM parameterization once it is ported to GPU. These GPU test suites could later be integrated into the standard CAM test suite once Derecho, which has both CPUs and GPUs available on the same machine, is online.

@sjsprecious sjsprecious added the enhancement New feature or request label Sep 29, 2021
@andrewgettelman
Collaborator

@sjsprecious, @johnmauff indicated that GPU testing is currently broken for CAM. Is that the case? If so, maybe an issue needs to be raised in the ESCOMP/CAM repository.

I am a bit nervous about where to put the testing for GPUs (or for PUMAS in general). I'm not exactly sure of the flow between ESCOMP/PUMAS, PUMASDevelopment/CAM, and ESCOMP/CAM. We might want a more general solution that works for any parameterization?

@sjsprecious
Collaborator Author

Thanks @andrewgettelman for your comment. The current GPU testing on Casper works fine for CAM. However, we did realize that the testing could fail if the nvhpc compiler and CUDA module versions in CIME were updated. This is one of our motivations for pushing regression tests for the GPU run, so that such changes can be caught easily.

I have added a test suite for Casper in ESCOMP/CAM (https://github.com/ESCOMP/CAM/blob/cam_development/cime_config/testdefs/testlist_cam.xml), but it is not in PUMASDevelopment/CAM yet. I am not sure whether this looks like a general solution to you and others.

@sjsprecious
Collaborator Author

Sorry @andrewgettelman, I just realized that my previous statement was incorrect. GPU testing has been broken since PUMAS v1.17, when PPE and implicit sedimentation were introduced without being GPU-enabled. I am working on porting that code to GPU now.

@andrewgettelman
Collaborator

andrewgettelman commented Oct 4, 2021

Hi @sjsprecious , so does that mean the issue is not CIME but just the new PUMAS code? That would be a relief to know. If it is CIME, then we definitely should figure out a way to do better testing for GPUs. But also good to figure out testing for GPUs to know if we have broken things. It might enable other PUMAS developers like me to try to implement GPU directives in the code (usually copy and paste from a similar part of the code works...)

@sjsprecious
Collaborator Author

> Hi @sjsprecious , so does that mean the issue is not CIME but just the new PUMAS code? That would be a relief to know. If it is CIME, then we definitely should figure out a way to do better testing for GPUs. But also good to figure out testing for GPUs to know if we have broken things. It might enable other PUMAS developers like me to try to implement GPU directives in the code (usually copy and paste from a similar part of the code works...)

Hi @andrewgettelman, yes, you are right. The current issue comes only from the new PUMAS code that is not GPU-enabled, and I am working on it now.

Regarding the CIME code, we have already found that updating the nvhpc compiler to nvhpc/21.7 and using some advanced compiler flags on Casper breaks the GPU test, even with a GPU-enabled PUMAS code such as v1.16. We are currently working with NVIDIA colleagues to check whether an upcoming nvhpc compiler release will resolve this issue. In any case, we need to be cautious whenever someone updates the CIME code related to the Casper or Derecho (GPU) configuration in the future.

It would be extremely helpful for maintaining the GPU-enabled PUMAS code if PUMAS developers could implement GPU directives during code development. Feel free to reach out to us if you or others have any questions.

@andrewgettelman
Collaborator

Hi @sjsprecious. Thanks for clarifying. I think we do need to get this test stood up for CIME updates, and probably elevate it to the CESM level.

Regarding adding code directives during development: we probably need a GPU test for PUMAS to know when we break stuff. I can try to copy directives when I add new loops or variables, but we don't yet have the skill to really maintain them and know what we are doing. This will be an ongoing issue.

Not sure how we solve it. But probably need to raise it again at the AMP development meeting.

@sjsprecious
Collaborator Author

A similar issue has been opened at ESCOMP/CAM#512, and a PR has been opened to address it: ESCOMP/CAM#577.
