Upgrade to spack-stack libraries on non-production machines #624

Merged
merged 45 commits into NOAA-EMC:develop on Nov 30, 2023

Conversation

DavidHuber-NOAA
Collaborator

@DavidHuber-NOAA commented Sep 19, 2023

Description

The latest version of spack-stack, 1.4.1, is installed on all platforms. This PR points all of the modulefiles on tier-1 machines and S4 to their respective spack-stack installations. This fixes #589, #574, and #563.
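
For context, a minimal sketch of what pointing a build environment at a spack-stack installation looks like; the module path and versions below are illustrative placeholders, not the exact entries in this PR's modulefiles:

```bash
# Illustrative only: the module path and versions are placeholders.
module purge
module use /path/to/spack-stack-1.4.1/envs/unified-env/install/modulefiles/Core
# spack-stack exposes compiler/MPI meta-modules, followed by the library modules.
module load stack-intel/2021.5.0
module load stack-intel-oneapi-mpi/2021.5.1
module load netcdf-c/4.9.2 netcdf-fortran/4.6.0 crtm ncio
```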

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

The ctests were run on Orion and Hera. As noted in #589, differences were seen when calling CRTM and NCIO functions and subroutines. These were traced to the different compiler options used for their respective builds between hpc-stack and spack-stack. Recompiling these libraries with spack-stack netCDF 4.9.2 and hpc-stack compiler options resulted in a 1-to-1 comparison.
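
For illustration, recompiling one of these libraries against the spack-stack netCDF while pinning the hpc-stack optimization level might look roughly like the following sketch, assuming a CMake-based build; the source and install paths are placeholders:

```bash
# Sketch only: source and install paths are placeholders.
# RelWithDebInfo stands in for the hpc-stack compiler options noted above.
cd crtm/build
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/opt/crtm
make -j8 install
```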

The GSI was also built on Jet and S4, as well as on Hera with GNU compilers. I will also run build tests on Gaea tomorrow when it comes up from maintenance. Before #571 was merged, regression tests had also been performed on Jet, Gaea, and Cheyenne, with some differences noted (possibly due to the different build options).

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

DUE DATE for this PR is 6 weeks from when this PR is marked ready for review.

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Sep 21, 2023

Merged the develop branch into spack-stack and reran the regression tests:

Test project /scratch1/NCEPDEV/nems/David.Huber/GSI/gsi_spackstack/build/regression
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #5: hwrf_nmm_d3 ......................   Passed  500.12 sec
2/9 Test #7: rrfs_3denvar_glbens ..............   Passed  551.35 sec
3/9 Test #9: global_enkf ......................   Passed  561.74 sec
4/9 Test #8: netcdf_fv3_regional ..............***Failed  606.70 sec
5/9 Test #4: hwrf_nmm_d2 ......................***Failed  610.43 sec
6/9 Test #6: rtma .............................   Passed  1212.57 sec
7/9 Test #3: global_4denvar ...................***Failed  1627.33 sec
8/9 Test #2: global_4dvar .....................***Failed  1744.26 sec
9/9 Test #1: global_3dvar .....................***Failed  1868.29 sec

44% tests passed, 5 tests failed out of 9

The following tests failed due to differing penalties from the first iteration onward: hwrf_nmm_d2, global_4denvar, global_4dvar, global_3dvar, and netcdf_fv3_regional. Differences first arise in the radiance penalties, consistent with the previously reported CRTM optimization differences between spack-stack and hpc-stack (Release instead of RelWithDebInfo). Not surprisingly, this also resulted in differences in the siginc file for the global_4dvar test case.

The global_4denvar and global_3dvar tests also failed the maxmem test with used/allowable values of 5,188,800/2,516,582 and 4,236,432/2,516,582 KB, respectively. These differences also disappear when spack-stack CRTM is compiled with hpc-stack flags.
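
For anyone reproducing these numbers, the suite above is driven by ctest from the build's regression directory; a minimal sketch (directory layout assumed from the "Test project" line above):

```bash
# Run the full regression suite, or re-run a single test by name.
cd build/regression
ctest
ctest -R global_4denvar
```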

@DavidHuber-NOAA
Collaborator Author

The latest develop branch was merged into the spack-stack branch and the ctests were run again. This time, I ran them both with the standard spack-stack/1.4.1 packages and with my custom builds of ncio and crtm (with hpc-stack optimization). The results were as expected for the standard set. For the custom builds, the results were as follows:

Test project /scratch1/NCEPDEV/nems/David.Huber/GSI/gsi_spackstack/build/regression
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............   Passed  492.93 sec
2/9 Test #5: hwrf_nmm_d3 ......................   Passed  508.25 sec
3/9 Test #9: global_enkf ......................   Passed  536.42 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  552.12 sec
5/9 Test #7: rrfs_3denvar_glbens ..............***Failed  616.91 sec
6/9 Test #6: rtma .............................   Passed  1215.48 sec
7/9 Test #3: global_4denvar ...................***Failed  1627.04 sec
8/9 Test #2: global_4dvar .....................***Failed  1743.71 sec
9/9 Test #1: global_3dvar .....................***Failed  1807.66 sec

The rrfs_3denvar_glbens test failed the scalability check (updat ran at 35.984 s/node versus 36.316 s/node for contrl).
The global_4dvar test failed because the siginc file differed between updat and contrl. On closer inspection, however, the data and netCDF header sections are identical; only the HDF5 header differs. This is the typical result of the upgrade to hdf5/1.14.0, so these failures are not fatal.
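
One quick way to confirm that only the HDF5 file header differs is to compare CDL dumps of the two files, which ignores the binary HDF5 layout; the filenames below are placeholders:

```bash
# Dump both files to CDL text; apart from the first line (which embeds the
# filename), any remaining difference would be in the data or netCDF metadata,
# not in the binary HDF5 header.
diff <(ncdump siginc.updat.nc) <(ncdump siginc.contrl.nc)
```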

The global_4denvar and global_3dvar tests both experienced maxmem failures, exceeding the memory thresholds by > 1000GB. This seems problematic. I will look into other libraries to see if I can find which one is causing this memory issue and, in the meantime, convert this PR to a draft.

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA , will this PR add a Hercules build option to NOAA-EMC/GSI? Daryl asked about Hercules yesterday. Are all of the libraries needed to build NOAA-EMC/GSI available in a Hercules installation of spack-stack?

@DavidHuber-NOAA
Collaborator Author

@RussTreadon-NOAA Yes, thank you for the reminder. I will check on the available libraries and add Hercules if all requirements are available (#574).

@RussTreadon-NOAA
Contributor

Thank you, @DavidHuber-NOAA !

@DavidHuber-NOAA
Collaborator Author

The culprit behind the maxmem exceedance for the global_4denvar and global_3dvar tests is HDF5/netCDF 1.14.0/4.9.2. I first attempted to build these libraries in the hpc-stack fashion, but still received the maxmem failures. However, this also required loading zstd and a newer zlib. To further isolate the issue, I reverted to hpc-stack for all libraries and loaded hdf5/1.14.0 and netcdf/4.9.2 from hpc-stack, which also resulted in maxmem failures. So this appears to be a characteristic of one or both of these new library versions.

To move forward, I will increase the maxmem and memthresh values for these tests.
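
A minimal sketch of locating the values to change; the directory and variable names are taken from the discussion above, and the exact files are part of the GSI regression scripts:

```bash
# Find where the memory limits for the regression tests are defined,
# then raise maxmem/memthresh for global_4denvar and global_3dvar.
grep -rnE "maxmem|memthresh" regression/
```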

@RussTreadon-NOAA linked an issue Sep 29, 2023 that may be closed by this pull request
@JingCheng-NOAA
Contributor

@BijuThomas-NOAA and @JingCheng-NOAA, could you please help test whether this branch works for HAFS on all supported platforms (WCOSS2, Jet, Hera, Orion, and Hercules)? Thanks!

I've compiled this branch on Orion. It passed the GSI regression tests. However, when I replaced gsi.x with this newly compiled one and tested one of my HAFS experiment cases, the following error showed up:

"./hafs_gsi.x: error while loading shared libraries: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory"

The modulefile that HAFS currently uses on Orion is "/work/noaa/hwrf/save/jcheng/HAFS/modulefiles/hafs.orion.lua". I'm not sure if there are any inconsistencies between the libraries.
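
A generic way to diagnose this kind of failure (not HAFS-specific; the executable name is taken from the error above) is to check which shared libraries the binary cannot resolve under the currently loaded modules:

```bash
# List unresolved shared libraries for the executable as built.
ldd ./hafs_gsi.x | grep "not found"
# Check whether the MKL library directory is actually on the runtime search path.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i mkl
```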

@DavidHuber-NOAA
Collaborator Author

@JingCheng-NOAA I would be happy to take a look at the modulefile and see if I can find any issue. Would you mind opening permissions for me with find /work/noaa/hwrf/save/jcheng -type f -exec chmod o+r {} +; and find /work/noaa/hwrf/save/jcheng -type d -exec chmod o+x {} =;? Thanks!

@JingCheng-NOAA
Contributor

Thanks for helping! However, the second command came back with the error message "find: missing argument to `-exec'".

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA , this PR can be scheduled for merger into develop pending approval from peer reviewers. Thank you for working with them to resolve issues.

@DavidHuber-NOAA
Collaborator Author

@JingCheng-NOAA Whoops, that should have been a +, not an =. Please try this: find /work/noaa/hwrf/save/jcheng -type d -exec chmod o+x {} +;. Thanks!

@JingCheng-NOAA
Contributor

Thanks! It's done. Please feel free to check.

@DavidHuber-NOAA
Collaborator Author

@JingCheng-NOAA Hmm, that is odd. When I load that module and echo $LD_LIBRARY_PATH, I do see /apps/intel-2022.1.2/intel-2022.1.2/mkl/2022.0.2/lib/intel64 in the list, which is where the libmkl_intel_lp64.so.2 library resides. Can you point me to the script that runs the GSI and/or the log output?

@JingCheng-NOAA
Contributor

Thanks for checking. Yes, after I used the updated gsi_common.lua, libmkl_intel_lp64.so.2 was no longer the issue. However, the error message is now: ./hafs_gsi.x: error while loading shared libraries: libnetcdff.so.7: cannot open shared object file: No such file or directory.
You can find the log file here: /work/noaa/hwrf/scrub/jcheng/HAFS_v1p1a_amv/2023091100/13L/hafs_analysis.log

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Nov 27, 2023

@JingCheng-NOAA Looking through the log, I see that the netcdf-fortran/4.6.0 module is successfully loaded when hafs.orion.lua is loaded. Later, after gsi_orion.lua is loaded, several modules appear to reload from the new location specified by gsi_orion.lua, but netcdf-fortran is not among them (see lines 1195-1224), and the environment variables set by lmod do not include anything for netCDF-Fortran. I believe the script executing the analysis is /work/noaa/hwrf/save/jcheng/HAFS/jobs/JHAFS_ANALYSIS. It executes module purge before loading gsi_${machine} for Hera and Jet, but not for Orion. Perhaps adding this for Orion would help?
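
If that is indeed the cause, the change in the HAFS job script could be as small as the following hedged sketch; the actual script and module names live in the HAFS repository, gsi_orion is assumed from the log excerpt above, and the module path is a placeholder:

```bash
# In JHAFS_ANALYSIS, mirror what is already done for Hera and Jet:
module purge
module use /path/to/GSI/modulefiles   # placeholder for the GSI modulefile directory
module load gsi_orion
module list   # confirm netcdf-fortran now appears among the loaded modules
```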

@JingCheng-NOAA
Contributor

Thank you! I will give it a try!

@aerorahul
Contributor

@DavidHuber-NOAA @RussTreadon-NOAA
If the RTs are passing, can we prioritize merging this PR?
If there are issues in downstream applications, please consider opening separate issues and recommending that the work be handled within those issue scopes.

@RussTreadon-NOAA
Contributor

@aerorahul , I approved this PR. We are waiting for two peer reviews as per GSI code management policy.

@aerorahul
Contributor

Great. Do we have an estimate?

@RussTreadon-NOAA
Contributor

Reviews have been requested. No estimate as to when reviews will be complete. It would be good to ping peer reviewers for an update.

@DavidHuber-NOAA
Collaborator Author

@aerorahul: @JingCheng-NOAA and I have worked through the issues on Orion. He is going to run tests on Jet and said he will approve if no issues remain. However, if I understand the comments above correctly, Jet had preexisting issues within the HAFS system before this PR. If issues are still present, I would argue for a separate issue in the HAFS repo to handle them as well.

I'm not sure if @hu5970 has had a chance to test this yet or not.

@BijuThomas-NOAA

@DavidHuber-NOAA It seems there is an uninitialized variable, "toff", in the subroutine read_radar_l2rw_novadqc (read_radar.f90). This variable becomes NaN in our failed HAFS regression tests on Jet, and I assume this is the reason for the HAFS RT failures there.

@DavidHuber-NOAA
Collaborator Author

@BijuThomas-NOAA I'm glad you were able to find the issue on Jet. Looking at read_radar, toff is set by reading from the VAD file, which is not read in read_radar_l2rw_novadqc. The same is true for read_radar_l2rw, where toff is also not initialized. I do not know how to resolve this issue properly, nor how to test it. Since this is a preexisting issue unrelated to the spack-stack upgrade, should a new issue be opened to track it so it gets the proper attention?
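
Since the NaN stems from an uninitialized real, one hedged way to surface such cases while testing a fix is to build a debug executable that initializes reals to signaling NaN and traps on their first use; the flags shown are gfortran's, and the Intel build would need its equivalents:

```bash
# Debug-only flags to expose uninitialized reals such as toff (gfortran syntax).
export FFLAGS="-g -fbacktrace -finit-real=snan -ffpe-trap=invalid,zero,overflow"
# Reconfigure and rebuild the GSI with these flags, then rerun the failing HAFS test.
```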

@JingCheng-NOAA
Contributor

The GSI regression tests on Jet passed for all cases except 5 - hafs_4denvar_glbens (Failed). The reason for the failure: "The runtime for hafs_4denvar_glbens_loproc_updat is 432.617227 seconds. This has exceeded maximum allowable threshold time of 399.070562 seconds, resulting in Failure time-thresh of the regression test." I think it's OK for now, but we need to double-check the runtime limits for these two cases. Since @BijuThomas-NOAA has already found the cause of the HAFS regression test failure, I have no issue approving this PR. @BinLiu-NOAA

@BijuThomas-NOAA

@DavidHuber-NOAA Thanks. Opened a new issue here: #661

@BinLiu-NOAA
Contributor

@JingCheng-NOAA Thanks for conducting the GSI regression tests from the HAFS end. Also, thanks to @BijuThomas-NOAA for testing from the HAFS workflow side. Meanwhile, I agree with @DavidHuber-NOAA that the issue @BijuThomas-NOAA encountered on Jet is probably separate; he has created issue #661 for it. With that, we will approve this PR from the HAFS side. Thanks a lot!

@RussTreadon-NOAA
Contributor

@ShunLiu-NOAA and @hu5970: if you are OK with this PR, would you please review and approve. Once we get sufficient reviews we can schedule this PR for merger into develop.

@RussTreadon-NOAA
Contributor

@ShunLiu-NOAA , @hu5970 , @CoryMartin-NOAA , I would like to merge this PR into develop no later than 12:00 pm EST today (11/30/2023). Please let me know if you have any concerns.

@RussTreadon-NOAA merged commit c94bc72 into NOAA-EMC:develop Nov 30, 2023
4 checks passed
@RussTreadon-NOAA mentioned this pull request Nov 30, 2023
@aerorahul
Contributor

Thank you @DavidHuber-NOAA for your effort in porting the GSI to spack-stack and the prep work for identifying and resolving the older compiler issues.
Thanks also to @RussTreadon-NOAA for providing the support and testing during this process.
Thanks to all involved who contributed to achieving this result.
