
Install and test unified environment on supported HPCs #478

Closed · 7 tasks done · climbfuji opened this issue Feb 18, 2023 · 13 comments
Labels: INFRA (JEDI Infrastructure)
@climbfuji (Collaborator) commented Feb 18, 2023:

Is your feature request related to a problem? Please describe.
We need to install and test the unified environment on all supported HPCs. A good starting point is the list of preconfigured and configurable (generic) platforms in https://spack-stack.readthedocs.io/en/latest/Platforms.html.

Describe the solution you'd like

Left over from previous PR #454:

  • Remove ncl from global-workflow-env (also affects macos site config)

See epic #503 for the list of final installations and successful tests. Consider this issue completed once all the required boxes in the epic are ticked.

Preliminary testing done beforehand:

  • Orion
    • Update site config to contain Intel, GNU, and legacy Intel 18 (for global workflow) configurations
    • Install unified environment for Intel + GNU
    • Install unified environment for legacy Intel
    • Test unified environment for Intel + GNU
      • JEDI-Skylab
      • UFS Weather Model
      • UFS SRW App

Additional context
n/a

@ulmononian (Collaborator) commented:

@climbfuji I can install this in the role.epic space on Orion, Jet, and Cheyenne to start. Hera may have to wait for an EPIC-owned installation because our nems account is at capacity.

Would you mind sharing the install recipe?

@climbfuji (Collaborator, Author) commented:

> @climbfuji I can install this in the role.epic space on Orion, Jet, and Cheyenne to start. Hera may have to wait for an EPIC-owned installation because our nems account is at capacity.
>
> Would you mind sharing the install recipe?

Thanks for volunteering. I think we need to agree on the directory structure and naming conventions first, then create an install recipe that we can more or less copy and paste or automate with Jenkins. I wonder if this can wait until Thursday when we have our spack-stack meeting.

Also, we need to update all site configs to have the compilers configured correctly. That can be a separate PR that goes in first. For example, we have this for Orion (https://github.com/NOAA-EMC/spack-stack/blob/develop/configs/sites/orion/packages.yaml):

packages:
  all:
    compiler:: [intel@2022.0.2, intel@18.0.5, gcc@10.2.0]
    providers:
      mpi:: [intel-oneapi-mpi@2021.5.1, intel-mpi@2018.5.274, openmpi@4.0.4]

but what we want is

packages:
  all:
    compiler:: [intel@2022.0.2, gcc@10.2.0]
    #compiler:: [intel@18.0.5]
    providers:
      mpi:: [intel-oneapi-mpi@2021.5.1, openmpi@4.0.4]
      #mpi:: [intel-mpi@2018.5.274]

and then our instructions/automation need to take care of swapping between Intel-latest+GNU and Intel-18 for the global workflow (see the sketch below). Also, most sites do not have an Intel 18 configuration; we need to add one for sites where users run the global workflow. This is only a small number of sites; all others are fine with whatever Intel and GNU versions are already there.
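
A minimal sketch of what that swap could look like, assuming each such site keeps two pre-edited variants of its packages.yaml side by side (the file names packages_intel2022_gnu.yaml and packages_intel18.yaml are hypothetical placeholders, not files that exist today):

# Hypothetical layout: two pre-edited variants of the site packages config.
SITE=configs/sites/orion

# Default unified environment: Intel-latest + GNU
cp ${SITE}/packages_intel2022_gnu.yaml ${SITE}/packages.yaml

# Legacy global-workflow environment: Intel 18
#cp ${SITE}/packages_intel18.yaml ${SITE}/packages.yaml

# Re-concretize the active spack environment against the swapped config
spack concretize --force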

@ulmononian (Collaborator) commented Feb 21, 2023:

@climbfuji

Thanks for this information. Totally happy to wait until Thursday's meeting to discuss things before beginning the installs. Since these site configs need to be updated and some sites need Intel 18, there is plenty of prep to do. Are we using spack to install intel@18.0.5 on sites where the GW will be run that do not yet have it?

@climbfuji (Collaborator, Author) commented Feb 21, 2023 via email:

No, global workflow runs on a few HPCs that all have intel 18.

@ulmononian (Collaborator) commented Feb 21, 2023:

> No, global workflow runs on a few HPCs that all have intel 18.

10-4

@KateFriedman-NOAA commented:

> No, global workflow runs on a few HPCs that all have intel 18.

Once the GSI is able to move off of intel 18 and onto the same Intel version as the other GFS components, we shouldn't need intel 18 anywhere anymore. Hoping this happens soon!

@ulmononian (Collaborator) commented Feb 22, 2023:

@KateFriedman-NOAA @climbfuji Speaking of:

"Dear RDHPCS users,

We plan to deprecate the software module intel/18.0.5.274 and impi/2018.0.4 from Hera .

You are receiving this email because you have loaded the module from either your login profile or your batch jobs during the past year. Deprecating a software module means:

This software module will not be supported by the RDHPCS Application Support Group, including related help tickets. 

The module name will be hidden and users will not see it from the “module avail” command.

The module and related software packages are still on the System(s) without any changes, therefore users can still load and use it as they did before. 

The module and related software packages will NOT be removed from the system(s) until they do not function (e.g. future OS or System upgrades) or it is no longer used. 

The deprecated software list can be found in “ /apps/modules/modulefiles/.modulerc”.

If you believe this module should remain supported (un-deprecated) please start a help ticket to request reversing this change within 5 work days. Otherwise, no response needed. https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php/Help_Requests.

Thank you very much!

RDHPCS User Support Group
"

Similar email for Intel 18 and wgrib2/2.0.8 on Jet...

@AlexanderRichert-NOAA (Collaborator) commented:

I'm working on #333 on Hera (testing the unified environment with esmf@8.4.1 and mapl@2.35.2), and I've run into the following:

  • UFS appears to fail when hdf5 is built with +threadsafe (verifying this right now); I think I've run into this on at least one other machine. The problem is that cdo requires hdf5+threadsafe when built with +hdf5 or +netcdf. If that's really the issue, I don't have a good solution other than to remove the threadsafe requirement from cdo and hope for the best, or to try to resolve it in hdf5/UFS, which could of course be a rabbit hole. I don't think we've been using hdf5+threadsafe in hpc-stack, so maybe it would be okay to remove the requirement?
  • nco tar file needs to be deleted from source cache due to checksum change (someone with permissions: rm /scratch1/NCEPDEV/global/spack-stack/source-cache/nco/nco-5.0.6.tar.gz /scratch1/NCEPDEV/global/spack-stack/source-cache/_source-cache/archive/37/37d11ffe582aa0ee89f77a7b9a176b41e41900e9ab709e780ec0caf52ad60c4b.tar.gz)
  • We need network access to ftp://ftp.ssec.wisc.edu/ for crtm-fix
  • We need network access to https://download.osgeo.org/ for GDAL and GEOS

@climbfuji (Collaborator, Author) commented:

@AlexanderRichert-NOAA To get around the network access restriction, you should be able to transfer Hera's spack.lock to a machine that has access (e.g. Orion, Cheyenne, or your laptop ...) and then create a mirror that you transfer back to Hera to install from; a possible sketch is below. There are a few packages that download additional files during the build, but I don't think this applies to gdal, geos, or crtm-fix.
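
A rough sketch of that workflow, assuming the lock file has been copied to a connected machine (the environment name and directory paths below are placeholders):

# On the machine WITH network access: recreate the environment from the
# transferred lock file and download all of its sources into a mirror.
spack env create hera-mirror-env spack.lock
spack env activate hera-mirror-env
spack mirror create -a -d ./hera-sources

# Copy ./hera-sources back to Hera, then on Hera point the environment
# at the local mirror and install without outbound network access:
spack mirror add local-sources file:///path/to/hera-sources
spack install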

Regarding nco, do you not have write access? I can check whether I'm able to delete the cached nco files.

hdf5+threadsafe: I'm not sure it's a good idea to remove +threadsafe and hope for the best; someone must have put it in for a reason. But if you know for sure that cdo only ever gets used without OpenMP parallelism, then it may be okay. Let's make sure first that hdf5+threadsafe really is the problem.

@climbfuji (Collaborator, Author) commented:

@AlexanderRichert-NOAA I removed the link and the source file behind it:

[ =0 03:03:20 10000 emc.nemspara@hfe01 ]
~> rm /scratch1/NCEPDEV/global/spack-stack/source-cache/nco/nco-5.0.6.tar.gz
[ =0 03:03:26 10001 emc.nemspara@hfe01 ]
~> rm /scratch1/NCEPDEV/global/spack-stack/source-cache/_source-cache/archive/37/37d11ffe582aa0ee89f77a7b9a176b41e41900e9ab709e780ec0caf52ad60c4b.tar.gz
[ =0 03:03:39 10002 emc.nemspara@hfe01 ]

@AlexanderRichert-NOAA (Collaborator) commented:

Well, rats, cdo does use OpenMP... and yet in hpc-stack, hdf5 is built without thread safety. @KateFriedman-NOAA do you know whether cdo could be run without OpenMP for the global workflow? If so, I could probably ease the thread safety requirement for cdo by adding "+openmp" to the when condition (see the sketch below).
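
For reference, a sketch of what easing that requirement might look like in cdo's spack recipe, assuming the current package declares a hard hdf5+threadsafe dependency (the depends_on lines below are illustrative, not copied from the real recipe):

# Open cdo's package recipe in $EDITOR
spack edit cdo

# ...then narrow the thread-safety requirement so that it only applies
# when cdo itself is built with OpenMP, e.g. change something like
#   depends_on("hdf5+threadsafe", when="+hdf5")
# to
#   depends_on("hdf5+threadsafe", when="+hdf5+openmp")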

@KateFriedman-NOAA commented:

> do you know whether cdo could be run without OpenMP for the global workflow?

I do not know, unfortunately. I don't know much about cdo; I'm just a user of it. :)

@climbfuji (Collaborator, Author) commented:

Done, finally. See #503.
