Add regression test suite for GPU-enabled PUMAS run on Casper #31
Comments
@sjsprecious, @johnmauff indicated that GPU testing is currently broken for CAM. Is that the case? If so, maybe an issue needs to be raised in the ESCOMP/CAM repository. I am a bit nervous about where to put the testing for GPUs (or for PUMAS in general). I'm not exactly sure how changes flow between ESCOMP/PUMAS, PUMASDevelopment/CAM and ESCOMP/CAM. We might want a more general solution that works for any parameterization?
Thanks @andrewgettelman for your comment. The current GPU testing on Casper works fine for CAM. But we did realize that the testing could fail if we updated the nvhpc compiler and CUDA module version in CIME. This is one of our motivations for pushing the idea of a regression test for the GPU run, so that such changes can be caught easily. I have added a test suite for Casper in ESCOMP/CAM (https://github.com/ESCOMP/CAM/blob/cam_development/cime_config/testdefs/testlist_cam.xml), but it is not in PUMASDevelopment/CAM yet. I am not sure whether this looks like a general solution to you and others.
Sorry @andrewgettelman, I just realized that my previous statement was incorrect. The GPU testing was broken since PUMAS
Hi @sjsprecious, so does that mean the issue is not CIME but just the new PUMAS code? That would be a relief to know. If it is CIME, then we definitely should figure out a way to do better testing for GPUs. Either way, it would be good to set up GPU testing so that we know when we have broken things. It might also enable other PUMAS developers like me to try implementing GPU directives in the code (usually copying and pasting from a similar part of the code works...)
Hi @andrewgettelman, yes, you are right. The current issue only comes from the new PUMAS code, which is not GPU-enabled yet, and I am working on it now. Regarding the CIME code, we have already realized that updating nvhpc compiler to It will be extremely helpful for maintaining the GPU-enabled PUMAS code if the PUMAS developers could implement GPU directives during code development. Feel free to reach out to us if you or others have any questions.
Hi @sjsprecious. Thanks for clarifying. I think we do need to get this test stood up for CIME updates, and probably elevate it to the CESM level. Regarding adding code directives during development: we probably need a GPU test for PUMAS to know when we break stuff. I can try to copy directives when I add new loops or variables, but we don't yet have the skill to really maintain them and know what we are doing. This will be an ongoing issue, and I am not sure how we solve it. We probably need to raise it again at the AMP development meeting.
A similar issue has been opened at ESCOMP/CAM#512, and a PR, ESCOMP/CAM#577, has been opened to address it.
As suggested by @johnmauff, we need to add a regression test suite for the GPU-enabled PUMAS code so that we can better maintain its GPU compatibility and catch changes in the CAM code (less likely) or the CIME code (more likely) that may break the GPU run.
Based on the discussion during the AMP SE WG meeting on 09/28, we will probably focus on a GPU test suite for the PUMAS GitHub repo only for now. A test suite configuration will be added to https://github.com/PUMASDevelopment/CAM, and the regression test will be run whenever the PUMAS code is revised.
It was also suggested during the meeting that we should run a regression test for each individual CAM parameterization once it is ported to GPU. These GPU test suites could later be integrated into the standard CAM test suite when Derecho, which has both CPUs and GPUs available on the same machine, comes online.