Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SCREAMv1 test to e3sm_gpucxx suite #5495

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

brhillman
Copy link
Contributor

@brhillman brhillman commented Mar 2, 2023

Add F2010-SCREAMv1 test to e3sm_gpucxx suite to get test coverage on GPU for EAMxx codebase in main E3SM repo.

[BFB]

@brhillman brhillman added BFB PR leaves answers BFB Testing Anything related to unit/system tests EAMxx PRs focused on capabilities for EAMxx labels Mar 2, 2023
@brhillman brhillman requested review from xyuan and rljacob March 2, 2023 15:39
Copy link
Contributor

@xyuan xyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested on Summit using both cpu and gpu, and it works fine, thanks.

@amametjanov
Copy link
Member

./cime/scripts/create_test ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1 --compiler gnugpu

with latest master on Summit runs into bld errors:

components/eamxx/src/dynamics/homme/atmosphere_dynamics.cpp(817): error: class "Homme::SimulationParams" has no member "rearth"
/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/ext/new_allocator.h(147): error: no instance of constructor "Homme::Diagnostics::Diagnostics" matches the argument list
            argument types are: (int, __nv_bool)

On Crusher with --compiler crayclanggpu:

externals/ekat/extern/kokkos/core/src/../../tpls/desul/include/desul/atomics/Generic.hpp:618:9: error: call to 'atomic_fetch_add' is ambiguous
  (void)atomic_fetch_add(dest, val, order, scope);
        ^~~~~~~~~~~~~~~~

Does this look familiar in runs of this case with SCREAM repo?

@brhillman
Copy link
Contributor Author

Thanks @amametjanov this actually passes for me on the SCREAM repo, and confirmed it fails with a local master merge into E3SM on summit. I'll try to figure out what's going on with the master merge.

@brhillman
Copy link
Contributor Author

brhillman commented Mar 7, 2023

Okay, I see what happened. It looks like #5481 changed some things in hommexx that haven't been fixed upstream in E3SM yet. The fixes are on the SCREAM repo, so I will need to open another PR to pull those into E3SM I think, or pull those into this PR (looks like the diffs only affect sources in the eamxx directory, plus shoc).

@brhillman brhillman force-pushed the brhillman/cime_config/add-scream-gpucxx-test branch from 979b436 to e2caf84 Compare April 3, 2023 17:45
Copy link
Member

@rljacob rljacob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not be 356 commits for such a simple addition.

@brhillman
Copy link
Contributor Author

@rljacob I had to pull in a scream->e3sm merge because there were conflicts between the two repos since this PR was opened that prevented SCREAMv1 from building on the E3SM repo. I can move this to a separate PR, or rename this one to reflect this. Was waiting on my tests to update this PR (which just passed).

@brhillman brhillman changed the title Add F2010-SCREAMv1 test to e3sm_gpucxx suite Fix F2010-SCREAMv1 build and add test to e3sm_gpucxx suite Apr 5, 2023
@rljacob
Copy link
Member

rljacob commented Apr 5, 2023

Please make a separate PR.

@rljacob
Copy link
Member

rljacob commented Apr 5, 2023

If you need to resync SCREAM with E3SM and its 300+ commits, that should be its own PR.

@brhillman brhillman force-pushed the brhillman/cime_config/add-scream-gpucxx-test branch from 9d34608 to 5111a4f Compare April 5, 2023 23:26
@brhillman brhillman changed the title Fix F2010-SCREAMv1 build and add test to e3sm_gpucxx suite Add test to e3sm_gpucxx suite Apr 5, 2023
@brhillman brhillman changed the title Add test to e3sm_gpucxx suite Add SCREAMv1 test to e3sm_gpucxx suite Apr 5, 2023
@brhillman
Copy link
Contributor Author

This requires #5582 to fix SCREAMv1 build errors.

@brhillman
Copy link
Contributor Author

If you need to resync SCREAM with E3SM and its 300+ commits, that should be its own PR.

Done. My reasoning here was to consider this PR the upstream merge, and as a side effect add the test to make sure it doesn’t break again, rather than considering the upstream merge the effect. In any event, #5582 is opened to bring EAMxx up to date, then this PR will just add the test.

amametjanov added a commit that referenced this pull request May 24, 2023
(PR #5495)

Add SCREAMv1 test to e3sm_gpucxx suite

Add F2010-SCREAMv1 test to e3sm_gpucxx suite to get test coverage
on GPU for EAMxx codebase in main E3SM repo.

[BFB]
@amametjanov
Copy link
Member

Do we need to add machine-specific cmake in components/eamxx/cmake/machine-files/?
From a JLSE run:

-- Found scream component
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/LINUX.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/jlse.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/oneapi-ifxgpu_LINUX.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/oneapi-ifxgpu_jlse.cmake
CMake Error at cmake/build_eamxx.cmake:39 (include):
  include could not find requested file:

    /gpfs/jlse-fs0/projects/climate/testing/E3SM/components/eamxx/cmake/machine-files/jlse.cmake
Call Stack (most recent call first):
  CMakeLists.txt:120 (build_eamxx)

Similar error on Crusher: https://my.cdash.org/viewTest.php?buildid=2341352

Works on Ascent: https://my.cdash.org/viewTest.php?buildid=2341401

@rljacob
Copy link
Member

rljacob commented Jul 6, 2023

notes: needs some syncing and restesting. @brhillman will do.

@brhillman brhillman force-pushed the brhillman/cime_config/add-scream-gpucxx-test branch from 5111a4f to 13d1962 Compare July 6, 2023 22:27
@rljacob
Copy link
Member

rljacob commented Jul 12, 2023

@brhillman is this ready now?

@amametjanov
Copy link
Member

If I copy crusher-scream-gpu.cmake to components/eamxx/cmake/machine-files/crusher.cmake, the build succeeds, but the run fails during atm-init with

$ tail /lustre/orion/cli115/proj-shared/azamat/e3sm_scratch/crusher/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.G.20230714-chk-new-scream-gputest4/run/e3sm.log.350998.230714-173930 
0: (seq_comm_printcomms)    31     0     1     1  ALLESPID:
0: (seq_comm_printcomms)    32     0     8     1  CPLALLESPID:
0: (seq_comm_printcomms)    33     0     1     1  ESP:
0: (seq_comm_printcomms)    34     0     8     1  CPLESP:
0: (seq_comm_printcomms)    35     0     1     1  ALLIACID:
0: (seq_comm_printcomms)    36     0     8     1  CPLALLIACID:
0: (seq_comm_printcomms)    37     0     1     1  IAC:
0: (seq_comm_printcomms)    38     0     8     1  CPLIAC:
srun: error: crusher188: tasks 0-7: Bus error
srun: Terminating StepId=350998.0

@jgfouca
Copy link
Member

jgfouca commented Dec 4, 2023

@rljacob , currently waiting on a couple CIME PRs and then I will do a CIME update. Once that is done, I will do an upstream merge to SCREAM, which will force me to sort out the remaining build system issues.

@rljacob
Copy link
Member

rljacob commented Jan 29, 2024

@brhillman I believe this modified suite will now build and run after merging to next. Can you confirm?

@rljacob
Copy link
Member

rljacob commented Feb 1, 2024

@amametjanov can you verify the new version of this suite will build on a system running gpucxx ?

@rljacob
Copy link
Member

rljacob commented Feb 1, 2024

It probably needs to be rebased to get Jim's recent build system changes.

@amametjanov
Copy link
Member

I'm trying on pm-gpu on latest E3SM master with

./cime/scripts/create_test --machine pm-gpu --compiler gnugpu ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1

but getting

CMake Error at eamxx/src/physics/rrtmgp/CMakeLists.txt:39 (target_link_libraries):
  INTERFACE library can only be used with the INTERFACE keyword of
  target_link_libraries


-- Disabling all warnings for target yakl
CMake Error at /global/u2/a/azamat/saul/E3SM/externals/ekat/cmake/EkatUtils.cmake:102 (target_compile_options):
  target_compile_options may only set INTERFACE properties on INTERFACE
  targets
Call Stack (most recent call first):
  eamxx/src/physics/rrtmgp/CMakeLists.txt:40 (EkatDisableAllWarning)


-- Found CUDA: /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7 (found version "11.7")
CMake Error at /global/u2/a/azamat/saul/E3SM/externals/ekat/cmake/EkatSetCompilerFlags.cmake:379 (target_compile_options):
  target_compile_options may only set INTERFACE properties on INTERFACE
  targets
Call Stack (most recent call first):
  eamxx/src/physics/rrtmgp/CMakeLists.txt:53 (SetCudaFlags)

@jgfouca
Copy link
Member

jgfouca commented Mar 7, 2024

@amametjanov , I've fixed that on the EAMXX fork. I will do a downstream merge soon.

@rljacob rljacob assigned jgfouca and unassigned amametjanov Apr 30, 2024
jgfouca added a commit that referenced this pull request May 2, 2024
… (PR #5495)

Add SCREAMv1 test to e3sm_gpucxx suite

Add F2010-SCREAMv1 test to e3sm_gpucxx suite to get test coverage on
GPU for EAMxx codebase in main E3SM repo.

[BFB]
@jgfouca
Copy link
Member

jgfouca commented May 2, 2024

The downstream merge is done, merged to next.

@rljacob
Copy link
Member

rljacob commented May 8, 2024

The only place we run this test is pm-gpu and looks like SCREAM has a runtime error:
https://my.cdash.org/viewTest.php?onlyfailed&buildid=2558361

At least its not build time!

@ndkeen
Copy link
Contributor

ndkeen commented May 8, 2024

Rob that job looks like it ran out of walltime

@rljacob
Copy link
Member

rljacob commented May 15, 2024

@jgfouca how do we increase the walltime for the scream test?

@jgfouca
Copy link
Member

jgfouca commented May 15, 2024

@rljacob , there are a few ways. This new test in the suite e3sm_gpucxx which does not have "time" field. We could add this field and set it to something long enough to give it a chance to finish. As an alternative, if this test is only too slow on one platform, we can go to that machine, run the test by hand ./create_test $test --walltime=4:00:00. If it passes, it will store the runtime under the $baseline/walltimes area and our "smart" walltime system will give that time the highest precedence when choosing a default time.

@rljacob
Copy link
Member

rljacob commented May 15, 2024

Lets do the second option but wait until pm comes back online.

@ndkeen
Copy link
Contributor

ndkeen commented May 15, 2024

Note that when I try this same test on muller-gpu (generally same as pm-gpu), I get an error and a hang. We would not expect any ne4 test to take much time, so maybe don't increase the walltime for the test.

2: (GTL DEBUG: 2) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

Checking out scream instead (again on muller-gpu) and these tests work fine. So must be something we are doing differently.

@jgfouca
Copy link
Member

jgfouca commented May 17, 2024

Should I merge to master or hold off?

@rljacob
Copy link
Member

rljacob commented May 29, 2024

@ndkeen is the one SCREAM test in this suite still running out time?

@ndkeen
Copy link
Contributor

ndkeen commented May 29, 2024

If I recall, I think we found some issue with the test hanging.

When I tried it just now, I get a build error:

components/eamxx/src/share/atm_process/atmosphere_process_dag.cpp(149): error: class "ekat::units::Units" has no member "get_string"
          s += " [" + fid.get_units().get_string() + "]";
                                      ^

@rljacob
Copy link
Member

rljacob commented May 29, 2024

@bartgol since Jim is on vacation, could you take a look at the above build error on pm-gpu?

@bartgol
Copy link
Contributor

bartgol commented May 30, 2024

It seems like the ekat submodule is not recent enough. However, it is recent enough in e3sm/master. Perhaps the branch is old, and is forcing a roll back on the ekat submodule? I'll try merging into next locally now, to see what happens to the submodule.

@bartgol
Copy link
Contributor

bartgol commented May 30, 2024

Oh, wait, this has the new ekat, but an old EAMxx. EAMxx is still using code from an old ekat. We need to update eamxx first. Perhaps the cleanest thing is to remove the branch from next, hard reset the branch to track an up-to-date eamxx, and then remerge to next.

@rljacob
Copy link
Member

rljacob commented May 30, 2024

I'd rather have an eamxx update in a separate PR. Then when that is on next, this should work.

@bartgol
Copy link
Contributor

bartgol commented May 30, 2024

That's fine with me. And now, let's draw straws! :)

Jokes aside, how soon do we want this to be integrated?

@rljacob
Copy link
Member

rljacob commented Jun 3, 2024

I'm really sick of seeing this open PR so an eamxx downstream merge soon would be good. But I think there is an upstream still pending over in scream. @jgfouca can you push that along?

@bartgol
Copy link
Contributor

bartgol commented Jun 10, 2024

@rljacob the downstream merge in the scream repo should move along today. There was a failing test due to a bad auto-resolution of a conflict. Should be fixed now, and tests should be passing on the next AT round (mappy is a bit backed up due to it being offline for a few days).

@rljacob
Copy link
Member

rljacob commented Jul 13, 2024

After the emaxx downstream merge in #6489 there's no longer a kokkos build error but there is an error from COSP. @jgfouca can you look at it next week?

@mahf708
Copy link
Contributor

mahf708 commented Jul 25, 2024

@rljacob, this scream build of cosp was broken due to: #6407

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB EAMxx PRs focused on capabilities for EAMxx Testing Anything related to unit/system tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants