Update hpc-stack and compiler versions #812

Closed

WalterKolczynski-NOAA opened this issue May 25, 2022 · 14 comments
Labels
maintenance Regular updates and maintenance work

@WalterKolczynski-NOAA
Contributor

Description
HPCs are moving away from the Intel 2018 compiler, so we should update to a newer compiler version. We should also take the opportunity to move to hpc-stack/1.2.0.

Requirements
Programs should compile and run using Intel 2021 or 2022 compilers

Acceptance Criteria (Definition of Done)

  • All workflow-controlled modulefiles updated to use hpc-stack/1.2.0
  • All workflow-controlled programs build using Intel 2021 or 2022
  • All workflow jobs run loading Intel 2021 or 2022 at runtime
    (These changes may also require other library updates, depending on what is available. A sketch of the modulefile change follows below.)
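
As a rough illustration only, the per-machine modulefile update might look like the following Lua sketch; the stack location and version strings below are assumptions, not the actual values for any platform:

```lua
-- Hypothetical excerpt from modulefiles/module_base.<machine>.lua after the
-- update; the stack path and exact versions are assumptions and vary by machine.
prepend_path("MODULEPATH", "/path/to/hpc-stack/modulefiles/stack")

load(pathJoin("hpc", "1.2.0"))
load(pathJoin("hpc-intel", "2022.1.2"))
load(pathJoin("hpc-impi", "2022.1.2"))
```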

Dependencies
The workflow-controlled programs themselves have no dependencies, but programs controlled by other components will need similar changes to move the entire system to a newer compiler. Updating everything together also prevents library mismatches at runtime.

@WalterKolczynski-NOAA added the maintenance label May 25, 2022
@WalterKolczynski-NOAA changed the title from "Update hpc and compiler versions" to "Update hpc-stack and compiler versions" May 25, 2022
@ulmononian

ulmononian commented May 25, 2022

Hi @WalterKolczynski-NOAA. I encountered an issue loading met/9.1 on Orion in S2SW (coupled, forecast-only) and ATM (uncoupled, forecast-only) tests using the develop branch. The same error is produced in the gfsfcst task of the ATM case and the gfswaveinit task of the S2SW case:
[Screenshots: error output from the failing gfsfcst and gfswaveinit tasks]

I was able to bypass the issue by removing line 55 of module_base.orion.lua (load(pathJoin("met", "9.1"))). However, when I check the modules loaded by load_fv3gfs_modules.sh outside of a workflow run, met/9.1 still seems to load despite removing this line:

[Screenshot: module list showing met/9.1 still loaded]

I did not know if this was worth its own issue, and thought perhaps it was related to this one. Thank you, Cameron.

@WalterKolczynski-NOAA
Contributor Author

> I did not know if this was worth its own issue, and thought perhaps it was related to this one. Thank you, Cameron.

This looks like an unrelated issue. Please open a new one.

@arunchawla-NOAA
Contributor

Hello, was an issue created for the METplus problem?

@WalterKolczynski-NOAA
Contributor Author

> Hello, was an issue created for the METplus problem?

hpc-stack has had a PR open since September to add it: NOAA-EMC/hpc-stack/pull/324

Unless you are talking about Cameron's issue, which was just a permissions problem in my hacked-up stopgap solution to a completely separate problem from this issue.

@arunchawla-NOAA
Contributor

Thanks Walter! I believe that is the thing.

@DavidHuber-NOAA
Contributor

@WalterKolczynski-NOAA I'm working on porting the global workflow and subcomponents to a Google Cloud instance (not an RDHPCS instance) using Intel 2021 compilers and hpc-stack 1.2.0. Would it be beneficial to this issue to document hurdles and workarounds?

@WalterKolczynski-NOAA
Contributor Author

Yes, although AFAIK the big hurdle is GSI.

@DavidHuber-NOAA
Contributor

DavidHuber-NOAA commented Nov 3, 2022

@WalterKolczynski-NOAA Good to know, thanks.

So far, I have only built the UFS from build_ufs.sh, which just required changing the Intel version to 2021, the hpc-stack version to 1.2.0, and png to libpng in module_base (see the sketch below). I am moving on to the GSI now.
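
Roughly, the module_base edits described above amount to something like the following Lua sketch (version strings are illustrative assumptions, not the exact values used):

```lua
-- Before (illustrative):
--   load(pathJoin("hpc", "1.1.0"))
--   load(pathJoin("hpc-intel", "2018.4"))
--   load(pathJoin("png", "1.6.35"))

-- After (versions are assumptions):
load(pathJoin("hpc", "1.2.0"))
load(pathJoin("hpc-intel", "2021.4.0"))
load(pathJoin("libpng", "1.6.35"))
```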

@DavidHuber-NOAA
Contributor

No hurdles building the GSI. I won't be able to run regression tests on the GCP since they contain restricted data. I'm guessing the issues were at runtime? I will update when I get cycling running.

@DavidHuber-NOAA
Contributor

The GSI and EnKF were a little tricky. The GSI in particular does not play well with Intel 2021/2022 compilers and -O3 optimization. I tried to get regression tests to run using Intel 2021.3.0, 2022.1.2, and 2022.3.0, but all continue to crash at the same line (see NOAA-EMC/GSI#447). Turning the optimization down to -O0 does allow the regression tests to run to completion, while -O2 allows some regression tests to pass; others still fail, albeit at different lines.

On the GCP, I have been able to cycle with -O2 with only one disruption. The job gdasanalcalc often (but not always) hangs when executing calc_anl.x at gsi_utils/src/netcdf_io/calc_analysis.fd/inc2anl.f90:139 when attempting to write the sfcanl netcdf file via ncio (hanging at write_vardata_code.f90:73). Rebooting the job enough times does clear this hurdle, but it is annoying.

I am in the process of building a global workflow on Hera with Intel 2022.1.2 with the GSI's optimization turned down to -O2 and will see if I can tweak flags for the gsi_utils build as well to get cycling to run more smoothly.

@DavidHuber-NOAA
Contributor

@WalterKolczynski-NOAA I will lose access to this GCP instance tomorrow, so I won't be able to continue this work there. I will note one other important issue that cropped up at C384. One of the EnKF recentering applications in gdasecen000-2 produces completely nonsensical values (e.g. temperatures on the order of 10^30). This is either a silent failure or possibly an issue with ncio. Unfortunately, I did not have time to root it out.

Is there any other information you would like me to gather before I lose access?

@WalterKolczynski-NOAA
Contributor Author

I don't believe so. The DA group is well aware of the compiler version problems, and we reminded them again last week and asked to have it made a higher priority as it is blocking standardization work.

@DavidHuber-NOAA
Contributor

Capturing a discussion by @RussTreadon-NOAA, @WalterKolczynski-NOAA, @KateFriedman-NOAA, @hu5970, and myself from the GSI upgrade to Intel 2022 (NOAA-EMC/GSI#447, NOAA-EMC/GSI#571). The miniconda installations used by the GSI are being updated to newer, EPIC-managed locations. These should also be updated when the global workflow moves to Intel 2022. Note that these locations will change again when migrating to spack-stack.
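
In modulefile terms, that repoint might look like the following sketch; the EPIC-managed path and the miniconda version here are placeholders, not the actual locations:

```lua
-- Hypothetical: pick up miniconda from the EPIC-managed install instead of the
-- old location. Both the MODULEPATH and the version below are placeholders.
prepend_path("MODULEPATH", "/path/to/epic-managed/modulefiles")
load(pathJoin("miniconda3", "4.12.0"))
```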

@DavidHuber-NOAA
Contributor

I am moving towards using spack-stack (#1868) on all systems and will bypass the newer Intel 2022 hpc-stack builds. Closing.

@DavidHuber-NOAA closed this as not planned Oct 10, 2023
kayeekayee pushed a commit to kayeekayee/global-workflow that referenced this issue May 30, 2024
* update submodule pointer for regression testing of ccpp-physics#812