Update Hera intel modulefile to Rocky 8 #715

Merged
merged 7 commits into from
Mar 18, 2024

Conversation

RussTreadon-NOAA
Contributor

@RussTreadon-NOAA RussTreadon-NOAA commented Mar 12, 2024

DUE DATE for merger of this PR into develop is 4/23/2024 (six weeks after PR creation).

Description
This PR updates the Hera intel modulefile to build gsi.x and enkf.x on Rocky 8 nodes.

Fixes #710
Fixes #711

Type of change

  • Maintenance

How Has This Been Tested?

  • clone, build, and run ctests on Hera Rocky 8 nodes

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • New and existing tests pass with my changes

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Mar 12, 2024
@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA , this PR contains changes required to build and run GSI on Hera Rocky 8 nodes. Cross link this PR with g-w #2329

@RussTreadon-NOAA
Contributor Author

RussTreadon-NOAA commented Mar 12, 2024

Hera intel Rocky 8 test

Build feature/rocky8 on hfe11 using the intel compiler. Modify regression_var.out so contrl is also the feature/rocky8 intel build. Run ctests with the following results:

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/rocky8/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  617.91 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  735.17 sec
3/7 Test #7: global_enkf ......................   Passed  1089.40 sec
4/7 Test #2: rtma .............................   Passed  1281.30 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1415.53 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1542.82 sec
7/7 Test #1: global_4denvar ...................***Failed  1794.29 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 1794.52 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
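
As a rough aid for reading summaries like the one above, here is a minimal Python sketch that parses the ctest result lines and reports the failed tests. The regular expression and the helper name `parse_ctest` are assumptions made for this illustration; they are not part of the GSI regression scripts.

```python
import re

# Result lines copied from the ctest output above.
CTEST_LINES = """\
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  617.91 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  735.17 sec
3/7 Test #7: global_enkf ......................   Passed  1089.40 sec
4/7 Test #2: rtma .............................   Passed  1281.30 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1415.53 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1542.82 sec
7/7 Test #1: global_4denvar ...................***Failed  1794.29 sec
"""

# Hypothetical pattern for one result line: index, test number, name, status, time.
RESULT_RE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#(?P<num>\d+):\s+(?P<name>\S+)\s+\.+\s*"
    r"(?P<status>\*\*\*Failed|Passed)\s+(?P<secs>[\d.]+)\s+sec"
)


def parse_ctest(text):
    """Return a list of (name, passed, seconds) tuples from ctest result lines."""
    results = []
    for line in text.splitlines():
        m = RESULT_RE.match(line)
        if m:
            results.append((m["name"], m["status"] == "Passed", float(m["secs"])))
    return results


results = parse_ctest(CTEST_LINES)
failed = [name for name, ok, _ in results if not ok]
pass_pct = round(100 * sum(ok for _, ok, _ in results) / len(results))
longest = max(secs for _, _, secs in results)

print(f"{pass_pct}% passed, failed: {failed}, longest test: {longest} sec")
```

The longest single test (1794.29 sec) is close to the reported total real time (1794.52 sec), which would be consistent with the seven tests running concurrently, although the log does not state this directly.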

The global_4denvar failure is due to

The runtime for global_4denvar_hiproc_updat is 343.729136 seconds.  This has exceeded maximum allowable threshold time of 330.344874 seconds, resulting in Failure of timethresh2 the regression test.

A check of the gsi.x wall times shows considerable variability between contrl and updat and between loproc and hiproc:

global_4denvar_hiproc_contrl/stdout:The total amount of wall time                        = 300.313522
global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 343.729136
global_4denvar_loproc_contrl/stdout:The total amount of wall time                        = 425.272795
global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 384.462055

This is not a fatal failure.
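
To put numbers on the variability, here is a small Python sketch that parses the four wall-time lines above and compares updat against contrl. The parsing pattern and names are assumptions for illustration. The sketch also divides the threshold from the failure message by the hiproc contrl wall time; with the posted numbers that ratio comes out at 1.1 to within rounding, but whether the regression scripts actually derive the threshold this way is not confirmed in this thread.

```python
import re

# Lines copied from the wall-time listing above.
WALL_TIME_LINES = """\
global_4denvar_hiproc_contrl/stdout:The total amount of wall time                        = 300.313522
global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 343.729136
global_4denvar_loproc_contrl/stdout:The total amount of wall time                        = 425.272795
global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 384.462055
"""

# Maximum allowable time quoted in the timethresh2 failure message (seconds).
THRESHOLD = 330.344874

# Hypothetical pattern: captures hiproc/loproc, contrl/updat, and the wall time.
LINE_RE = re.compile(
    r"^global_4denvar_(\w+?)_(contrl|updat)/stdout:.*=\s*([\d.]+)\s*$"
)


def parse_wall_times(text):
    """Map (procs, run) pairs such as ('hiproc', 'updat') to wall time in seconds."""
    times = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            times[(m.group(1), m.group(2))] = float(m.group(3))
    return times


times = parse_wall_times(WALL_TIME_LINES)

# Relative difference of updat versus contrl, in percent, for each configuration.
rel_diff = {}
for procs in ("hiproc", "loproc"):
    contrl = times[(procs, "contrl")]
    updat = times[(procs, "updat")]
    rel_diff[procs] = 100.0 * (updat - contrl) / contrl
    print(f"{procs}: updat differs from contrl by {rel_diff[procs]:+.1f}%")

# Ratio of the quoted threshold to the hiproc contrl wall time.
threshold_factor = THRESHOLD / times[("hiproc", "contrl")]
overshoot = times[("hiproc", "updat")] - THRESHOLD
print(f"threshold / hiproc contrl = {threshold_factor:.4f}")
print(f"hiproc updat exceeds the threshold by {overshoot:.2f} sec")
```

With these numbers, updat is slower than contrl for hiproc but faster for loproc, even though both executables come from the same feature/rocky8 build. That supports reading the failure as run-to-run timing noise rather than a regression.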

@RussTreadon-NOAA RussTreadon-NOAA changed the title Modulefile updates Modulefile updates - Hera Rocky 8 & GSI-fix Mar 13, 2024
@RussTreadon-NOAA RussTreadon-NOAA changed the title Modulefile updates - Hera Rocky 8 & GSI-fix Update Hera modulefiles to Rocky 8 Mar 14, 2024
@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA , while I can build gsi.x and enkf.x on Hera Rocky 8 nodes using the updated modulefiles/gsi_hera.gnu.lua, attempts to run ctests fail. For example, global_4denvar fails with

srun: lua: This job was submitted from a host running Rocky 8. Assigning job to el8 reservation.
[h1c08:287709] mca_base_component_repository_open: unable to open mca_pmix_s1: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[h1c14:256122] mca_base_component_repository_open: unable to open mca_pmix_s1: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)

This is the first time I have tried to run GSI ctests with gnu executables. Have you run gnu executables in GSI ctests? If "yes", what changes were necessary to make this work?
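
One way to narrow down messages like the `libpmi.so.0` one above is to ask the dynamic loader directly whether it can open the library in the current environment. The following is a minimal Python sketch using the standard `ctypes` module; the helper name `can_load` is hypothetical. The answer only reflects the node and loaded modules where the script runs, so it would need to be run inside a job on the affected nodes to be meaningful. Note also that the log marks the message as "(ignored)", so the thread does not establish that this missing library is what makes the ctests fail.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be found.
        return False
    return True


# libpmi.so.0 is the library named in the error message above.
name = "libpmi.so.0"
status = "found" if can_load(name) else "NOT found"
print(f"{name}: {status} by the dynamic loader")
```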

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Mar 14, 2024

@RussTreadon-NOAA I was able to run the ctests with gnu executables on Hera-CentOS. Most would fail, but not due to library linking issues. The UFS appears to be going through similar issues with GNU compilers on Rocky8 (see ufs-community/ufs-weather-model#2143 (comment)) and the issue appears to be related to slurm.

I also wonder if openmpi needs to be recompiled for Rocky8. The installation the GSI and UFS are using was compiled on CentOS. I asked that question on spack-stack here.

Personally, I am OK with just updating the Intel modulefile for now and returning to GNU when the issues are resolved.

@RussTreadon-NOAA
Contributor Author

Thank you @DavidHuber-NOAA . Good to know the GSI is not alone when it comes to challenges running gnu executables on Hera Rocky 8 nodes. I agree with you. Let me revert the change to modulefiles/gsi_hera.gnu.lua. This can be revisited at a later date.

@RussTreadon-NOAA RussTreadon-NOAA changed the title Update Hera modulefiles to Rocky 8 Update Hera intel modulefile to Rocky 8 Mar 14, 2024
Collaborator

@hu5970 hu5970 left a comment

It works in my test.

@RussTreadon-NOAA RussTreadon-NOAA marked this pull request as ready for review March 14, 2024 14:57
@RussTreadon-NOAA
Contributor Author

Thank you @hu5970 for testing the changes in this PR. Good to hear that it works for you.

Question for @ShunLiu-NOAA , @CoryMartin-NOAA , you, and me:

  • When do we want to merge this PR into GSI develop? Do we wait until 4/2, when the Hera Rocky 8 transition is complete, or should we merge after 3/19, when 2/3 of Hera is Rocky 8?
  • I will be on leave 4/2 so if we wait until then, one of you will need to merge.

@ShunLiu-NOAA
Contributor

Since @hu5970 reviewed this PR and he is actively working on hera, I suggest @hu5970 merge this PR to develop.

@RussTreadon-NOAA
Contributor Author

Since @hu5970 reviewed this PR and he is actively working on hera, I suggest @hu5970 merge this PR to develop.

Works for me. I take your reply, @ShunLiu-NOAA, to mean that you prefer merging after the 4/2 transition is complete. Is this correct?

@CoryMartin-NOAA
Contributor

I vote to merge after 3/19, that way we have people starting to rebuild beforehand and then on 4/2, people can move ahead with their science.

@RussTreadon-NOAA
Contributor Author

RussTreadon-NOAA commented Mar 14, 2024

I vote to merge after 3/19, that way we have people starting to rebuild beforehand and then on 4/2, people can move ahead with their science.

I agree. A 3/19 merger allows g-w issue #2329 to update the gsi_enkf.fd submodule hash sooner rather than later. A similar comment applies to GDASApp PR #969.

@hu5970
Collaborator

hu5970 commented Mar 14, 2024

Agree merge after 3/19.

@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA , we plan on merging this PR into GSI develop after the 3/19 Hera outage which transitions 2/3 of Hera to Rocky 8. Would you mind taking a quick look at the changes as a peer reviewer?

@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA and @HenryWinterbottom-NOAA : We can merge this PR into GSI develop on 3/19 during the Hera outage.

@DavidHuber-NOAA
Collaborator

@RussTreadon-NOAA That sounds good to us, thanks!

@RussTreadon-NOAA
Contributor Author

RDHPCS admins informed users that after the 3/19 Hera maintenance the default for users logging into Hera will be Rocky 8. You will have to hit ^C and select one of hfe01-hfe04 to access a CentOS 7 login node. All jobs submitted from a Rocky 8 login node will run on Rocky 8 compute nodes.

Given this we should merge this PR into develop no later than 3/19.
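
Because the login node determines which compute nodes a job lands on, it can help to confirm which OS a given node is running before building. The following is a minimal Python sketch that parses os-release style text and produces a short label. The function names and the sample contents are illustrative assumptions and were not copied from Hera.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict, stripping quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def os_label(info):
    """Return a short label such as 'rocky8' from the ID and VERSION_ID fields."""
    major = info.get("VERSION_ID", "").split(".")[0]
    return f"{info.get('ID', 'unknown')}{major}"


# Illustrative samples only (assumed field values, not taken from a Hera node).
ROCKY_SAMPLE = 'NAME="Rocky Linux"\nID="rocky"\nVERSION_ID="8.9"\n'
CENTOS_SAMPLE = 'NAME="CentOS Linux"\nID="centos"\nVERSION_ID="7"\n'

print(os_label(parse_os_release(ROCKY_SAMPLE)))
print(os_label(parse_os_release(CENTOS_SAMPLE)))
```

On an actual node, the same functions could be applied to the contents of `/etc/os-release`, for example `os_label(parse_os_release(open("/etc/os-release").read()))`.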

@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA, would you be able to serve as a peer reviewer for this PR?

@DavidHuber-NOAA
Collaborator

Yes, I'm happy to do that.

Collaborator

@DavidHuber-NOAA DavidHuber-NOAA left a comment

Verified that the environment path has been updated correctly. Approve.

@RussTreadon-NOAA
Contributor Author

@ShunLiu-NOAA , @hu5970 , & @CoryMartin-NOAA , we need to merge this PR into develop and pass a new gsi_enkf.fd hash to the g-w team. When shall we merge this PR into develop - today (3/18) or tomorrow (3/19)?

@CoryMartin-NOAA
Contributor

I'm fine with either. No one will be using Hera tomorrow and no one should be building CentOS7 today on Hera anyways.

@ShunLiu-NOAA
Contributor

ShunLiu-NOAA commented Mar 18, 2024 via email

@RussTreadon-NOAA RussTreadon-NOAA merged commit dfb958f into NOAA-EMC:develop Mar 18, 2024
4 checks passed
@RussTreadon-NOAA RussTreadon-NOAA deleted the feature/rocky8 branch March 18, 2024 19:31
@RussTreadon-NOAA
Contributor Author

@HenryWinterbottom-NOAA and @aerorahul : The GSI Hera intel build has been updated to Rocky 8. Done at dfb958f

Development

Successfully merging this pull request may close these issues.

  • Update RDHPCS Hera to Rocky-8 compliance
  • compile GSI on Hera with Rocky 8
5 participants