
Test spack-stack provided on Orion #1310

Closed
KateFriedman-NOAA opened this issue Feb 13, 2023 · 26 comments
KateFriedman-NOAA commented Feb 13, 2023

Description

This issue documents work to test an install of spack-stack on Orion provided by @climbfuji and others. Details provided by Dom:

module purge
module use /work/noaa/da/role-da/spack-stack/modulefiles
module load miniconda/3.9.7
module load ecflow/5.8.4

module use  /work2/noaa/da/dheinzel-new/spack-stack-unified-env-io-updates/envs/unified-dev-test2/install/modulefiles/Core
module load stack-intel/2022.0.2
module load stack-intel-oneapi-mpi/2021.5.1
module load stack-python/3.9.7

module av
module load global-workflow-env/unified-dev

module li

Currently Loaded Modules:
  1) ecflow/5.8.4                     19) sqlite/3.40.0           37) netcdf-cxx4/4.3.1         55) ncio/1.1.2
  2) intel/2022.1.2                   20) proj/8.1.0              38) openblas/0.3.19           56) antlr/2.7.7
  3) stack-intel/2022.0.2             21) udunits/2.2.28          39) py-setuptools/59.4.0      57) nco/5.0.6
  4) impi/2022.1.2                    22) util-linux-uuid/2.38.1  40) py-numpy/1.22.3           58) nemsio/2.5.2
  5) stack-intel-oneapi-mpi/2021.5.1  23) cdo/2.0.5               41) py-cftime/1.0.3.4         59) nemsiogfs/2.5.3
  6) miniconda/3.9.7                  24) netcdf-fortran/4.6.0    42) py-mpi4py/3.1.4           60) prod-util/1.2.2
  7) stack-python/3.9.7               25) esmf/8.3.0b09           43) py-netcdf4/1.5.3          61) sigio/2.3.2
  8) bacio/2.4.1                      26) libjpeg/2.1.0           44) py-bottleneck/1.3.5       62) py-cython/0.29.32
  9) bufr/11.7.1                      27) jasper/2.0.32           45) py-pyparsing/3.0.9        63) py-f90nml/1.4.3
 10) openjpeg/2.3.1                   28) libpng/1.6.37           46) py-packaging/21.3         64) py-markupsafe/2.1.1
 11) eccodes/2.27.0                   29) g2/3.4.5                47) py-numexpr/2.8.3          65) py-jinja2/3.1.2
 12) fftw/3.3.10                      30) sp/2.3.3                48) py-six/1.16.0             66) py-pyyaml/6.0
 13) pkg-config/0.27.1                31) ip/3.3.3                49) py-python-dateutil/2.8.2  67) ufs-pyenv/1.0.0
 14) zlib/1.2.13                      32) w3nco/2.4.1             50) py-pytz/2022.2.1          68) w3emc/2.9.2
 15) hdf5/1.14.0                      33) grib-util/1.2.3         51) py-pandas/1.4.0           69) wgrib2/2.0.8
 16) curl/7.85.0                      34) landsfcutil/2.4.1       52) py-xarray/2022.3.0        70) global-workflow-env/unified-dev
 17) zstd/1.5.2                       35) g2c/1.6.4               53) met/10.1.1
 18) netcdf-c/4.9.0                   36) gsl/2.7.1               54) metplus/4.1.1

Issues encountered during testing and requests to add modules will be submitted via the spack-stack repo: https://github.com/NOAA-EMC/spack-stack

Primary relevant spack-stack issue: JCSDA/spack-stack#454

@KateFriedman-NOAA KateFriedman-NOAA added the feature New feature or request label Feb 13, 2023
@KateFriedman-NOAA KateFriedman-NOAA self-assigned this Feb 13, 2023
@KateFriedman-NOAA

Opened JCSDA/spack-stack#471 to get the following modules added to the global-workflow-env/unified-dev module:

ncl/6.6.2
g2tmpl/1.10.2
gsi-ncdiag/1.0.0
crtm/2.4.0

Also need gempak, but it is not currently within spack-stack. Will load it separately for now and look at submitting a spack-stack issue to get it added.

@KateFriedman-NOAA KateFriedman-NOAA added this to the GW March 2023 milestone Feb 28, 2023
@KateFriedman-NOAA

Status summary

I am able to run most of the system using the provided spack-stack install on Orion; however, there are a few stumbling blocks to resolve:

  1. hanging python scripts
  2. hanging interp_inc.x

Working with @climbfuji and @ulmononian, I was able to get past the hanging python scripts by switching to a different spack-stack environment put together by @ulmononian:

In an attempt to fix the hanging Python script issue, I installed a version of the unified environment with py-netcdf4
and py-h5py built without MPI, as Dom suggested this might have been the root cause of the problem. - Cameron
module use /work2/noaa/epic-ps/cbook/spack_work/spack-stack-1.2.0/envs/unified-env/install/modulefiles/Core
module load stack-intel
module load stack-intel-oneapi-mpi

This helped with issue 1. Issue 2 now occurs (once able to get to that step).

Last reply to email thread:

The second python script that was having issues got past the prior issues but hit a new one. The second python script
invokes an executable that appears to be hanging. Please see the following log on Orion (jump to the very bottom for
the timeout, python script starts line 1028 and the executable starts line 1074):

/work/noaa/stmp/kfriedma/comrot/spackcyc192/logs/2022010200/gfsanalcalc.log

I tried this analcalc job both with and without setting I_MPI_HYDRA_BOOTSTRAP=ssh --> no difference.

The chgres_inc.x exec points to this executable: /work/noaa/global/kfriedma/git/develop_spack/exec/interp_inc.x

Orion-login-3[190] /work/noaa/stmp/kfriedma/comrot/spackcyc192$ ll /work/noaa/global/kfriedma/git/develop_spack/exec/interp_inc.x
lrwxrwxrwx 1 kfriedma global 87 Feb 16 15:45 /work/noaa/global/kfriedma/git/develop_spack/exec/interp_inc.x -> /work/noaa/global/kfriedma/git/develop_spack/sorc/gsi_utils.fd/install/bin/interp_inc.x*

That interp_inc.x exec comes from the GSI utils component, which was built with the intel 2018 spack-stack install.
Perhaps I need an intel 2018 version with the same mpi-less python within? And then rebuild the GSI components with that?

This is where I am currently stuck.

@KateFriedman-NOAA

Status Summary

I was able to get past the hanging python utilities by using a different spack-stack install made by @ulmononian (without MPI):

/work2/noaa/epic-ps/cbook/spack_work/spack-stack-1.2.0/envs/unified-env/install/modulefiles/Core

The python in the analcalc jobs completed without hanging; however, the chgres_inc.x/interp_inc.x exec now fails with what appears to be an HDF5 error:

 FATAL ERROR: CREATING FILE=inc.fullres.03: Permission denied
 STOP.
999
999
999
999
999
999
999
999
999
srun: launch/slurm: _task_finish: Received task exit notification for 9 tasks of StepId=9900781.0 (status=0xe700).
srun: error: Orion-05-22: tasks 0-8: Exited with exit code 231
srun: launch/slurm: _step_signal: Terminating StepId=9900781.0
slurmstepd: error: *** STEP 9900781.0 ON Orion-05-22 CANCELLED AT 2023-03-31T09:48:00 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
chgres_inc.x       0000000000493D1E  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B82A20825D0  Unknown               Unknown  Unknown
libc-2.17.so       00002B82A2D55603  epoll_wait            Unknown  Unknown
libucs.so.0.0.0    00002B82AC35F3AB  ucs_event_set_wai     Unknown  Unknown
libucs.so.0.0.0    00002B82AC34642C  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B82A207ADD5  Unknown               Unknown  Unknown
libc-2.17.so       00002B82A2D5502D  clone                 Unknown  Unknown
HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 0:
  #000: /work2/noaa/epic-ps/cbook/spack_work/spack-stack-1.2.0/cache/build_stage/spack-stage-hdf5-1.14.0-wcprhcxjrt7qbgfi3sqgdrj37wgd3mdc/spack-src/src/H5T.c line 1911 in H5Tclose(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 0:

Log (starting line 1075): /work/noaa/stmp/kfriedma/comrot/spackcyc192/logs/2022010200/gfsanalcalc.log

Word from @GeorgeVandenberghe-NOAA is that the GSI fails with hdf5/1.14.0 (have previously used hdf5/1.10.6 with the GSI). Still using intel 2018 with the GSI execs.

Currently stuck here...need to retest or rebuild the GSI with hdf5/1.10.6 to see if we can get past this error and run the few remaining jobs to complete a full cycle with spack-stack.

Test on Orion:

clone with updated components: /work/noaa/global/kfriedma/git/develop_spack
EXPDIR: /work/noaa/global/kfriedma/expdir/spackcyc192
COMROT: /work/noaa/stmp/kfriedma/comrot/spackcyc192

Note: clone is now several months old and does not have all GFSv17-dev updates since then, particularly the COM reorg updates that went in this week. Likely need to redo spack-stack updates in all components in new clone using global-workflow develop.

@AlexanderRichert-NOAA

@KateFriedman-NOAA for what it's worth, looking at your UFS build logs, it looks like it's still using hpc-stack's parallelio somehow. The one in spack-stack is a shared library, so I think your UFS CMakeLists.txt and CMakeModules will need to be updated (see the current UFS develop branch; basically, use a more recent commit of CMakeModules, and remove the STATIC from the PIO line in ufs_model.fd/CMakeLists.txt).
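As an illustration of the suggested build change, the edit amounts to dropping the STATIC requirement from the PIO lookup in ufs_model.fd/CMakeLists.txt. This is only a sketch; the exact line (version constraint, component list) in the UFS source may differ:

```cmake
# Old: forces the static PIO library, which finds hpc-stack's build
#   find_package(PIO REQUIRED COMPONENTS C Fortran STATIC)

# New: drop STATIC so the shared PIO library from spack-stack is found
find_package(PIO REQUIRED COMPONENTS C Fortran)
```

Pairing this with a more recent CMakeModules commit, as suggested above, is what lets the shared-library detection work.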

@KateFriedman-NOAA

@KateFriedman-NOAA for what it's worth, looking at your UFS build logs, it looks like it's still using hpc-stack's parallelio somehow. The one in spack-stack is a shared library, so I think your UFS CMakeLists.txt and CMakeModules will need to be updated (see the current UFS develop branch; basically, use a more recent commit of CMakeModules, and remove the STATIC from the PIO line in ufs_model.fd/CMakeLists.txt).

Ah, yes, let me double check this, I now remember having an issue with the UFS build. Let me get back to you on this...

@GeorgeVandenberghe-NOAA

GeorgeVandenberghe-NOAA commented May 1, 2023 via email

@junwang-noaa

One concern I have with the shared library is that we can't run an executable on different platforms, e.g., run an executable compiled on WCOSS2 on Gaea C5, or a Hera executable on Orion (we recently confirmed the baselines on Hera/Orion reproduce).

@AlexanderRichert-NOAA

@junwang-noaa have you done the former (build on WCOSS2->run on C5) with executables built with hpc-stack? With WCOSS2, I think there are some inevitable shared dependencies because of Cray, but I haven't tried it (I guess those same libraries exist on C5, just different versions potentially, like cray-mpich).

@KateFriedman-NOAA

@AlexanderRichert-NOAA

@KateFriedman-NOAA for what it's worth, looking at your UFS build logs, it looks like it's still using hpc-stack's parallelio somehow. The one in spack-stack is a shared library, so I think your UFS CMakeLists.txt and CMakeModules will need to be updated (see the current UFS develop branch; basically, use a more recent commit of CMakeModules, and remove the STATIC from the PIO line in ufs_model.fd/CMakeLists.txt).

Okie dokie...so I had changed modulefiles/ufs_common.lua to use the spack-stack parallelio module:

+parallelio_ver=os.getenv("parallelio_ver") or "2.5.7"
+load(pathJoin("parallelio", parallelio_ver))

...but I remember trying to make that change and having issues, so, like you said, using a more recent version would help. I will see if I can find time to try that; I'd probably want to update to newer versions of all components and remake the module/stack changes too. Will see if I can find time to do that before I go on leave this month. I have a bunch of higher priority tasks to complete first though. Thanks!

@AlexanderRichert-NOAA

@KateFriedman-NOAA looking at your gfsanalcalc.log, I just noticed you're getting an issue that I was running into: MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case. I think this means it's running the tasks as separate jobs without MPI comms, but I could be wrong (so if this is normal/expected, please disregard this). In any case, I got around it in exglobal_forecast.sh by adding export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi.so right before the $APRUN_FV3 line (this isn't the ideal way to do it of course). Again assuming this isn't the expected behavior, can you try making a similar change to this job and see if it changes anything?

It looks like this variable gets set by the impi/2022.1.2 module and the value looks correct, so I don't know why it's not getting propagated...
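For reference, the workaround described above amounts to something like the following in exglobal_forecast.sh. This is a sketch, not the ideal fix: the library path is the Orion-specific one quoted above, and $APRUN_FV3 is the existing launch variable in that script:

```shell
# Workaround sketch (not the ideal fix): point Intel MPI at Slurm's PMI
# library so srun-launched tasks join one MPI job instead of running as
# PMI-less singletons. Path is Orion-specific.
export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi.so

# ...placed immediately before the existing launch line, e.g.:
# ${APRUN_FV3} ${FCSTEXEC} ...
echo "I_MPI_PMI_LIBRARY=${I_MPI_PMI_LIBRARY}"
```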

@AlexanderRichert-NOAA

By setting I_MPI_PMI_LIBRARY I'm able to get to the point where chgres_inc.x just hangs indefinitely. Switching to hdf5/1.12.2 at runtime does not fix it, but I have not yet tried building with it. I have also tried building and running the executable using intel 2022 to avoid the mismatch between modules (it was building with old intel but running with new intel), but no luck. It appears that none of the processes are making it through the nf90_get_var call in gsi_utils.fd/src/netcdf_io/interp_inc.fd/driver.f90 (line 346). I'll keep digging and see if I can figure out where exactly it's getting stuck...

@AlexanderRichert-NOAA

AlexanderRichert-NOAA commented May 3, 2023

No luck with netcdf-c@4.9.2, nor with hdf5@1.12.2...

@climbfuji

climbfuji commented May 3, 2023 via email

@AlexanderRichert-NOAA

Yeah, I tried building and running that executable with spack-stack-1.3.1 (hdf5@1.12.2/netcdf-c@4.9.2/netcdf-fortran@4.6.0) and it was still hanging in the same place.

@GeorgeVandenberghe-NOAA

GeorgeVandenberghe-NOAA commented May 3, 2023 via email

@AlexanderRichert-NOAA

I just tested that -- it does run to completion when building with the existing hpc-stack libraries, so it's not a matter of runtime misconfiguration of MPI settings, etc. I'm building my own stack with intel 2018 and I'll just keep changing things until it works (probably starting with the hdf5 version)...

For my own memory-- I've now tried hdf5 builds with and without thread safety enabled, so that doesn't seem to be the issue.

@AlexanderRichert-NOAA

I built a spack-stack-based stack with hdf5@1.10.6 and, for better or for worse, that fixed it. I'm going to see if I can narrow down the version where it breaks. It seems odd that it would break in the context of a high-level netcdf function (n*_get_var), so I wonder if it has to do with how it's being used in the context of MPI.

@GeorgeVandenberghe-NOAA

GeorgeVandenberghe-NOAA commented May 4, 2023 via email

@AlexanderRichert-NOAA

Here's a summary of my findings on Orion in terms of hdf5 versions and interp_inc.x:
Working (no hang): 1.10.6, 1.10.7, 1.10.8
Not working (hanging): 1.10.9, 1.12.2, 1.14.0

So whatever the issue is, it appears to have emerged in the 1.10.8->1.10.9 transition. I'll see if I can pin down the culprit.

@KateFriedman-NOAA

To document this from an email thread with @climbfuji:

I have wgrib2@3.1.1 on Orion in:

/work2/noaa/da/dheinzel-new/spack-stack-unified-env-io-updates/envs/unified-dev-test4-intel-18

and

/work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.3.1/envs/unified-env

(the latter has a default=newer Intel and a GNU environment)

@AlexanderRichert-NOAA

Before I go digging deeper in terms of hdf5 code, @KateFriedman-NOAA if we had an intel 18-based stack with hdf5@1.10.6, would that at least take care of things in the short-to-mid term, especially in terms of transitioning to spack-stack?

@KateFriedman-NOAA

if we had an intel 18-based stack with hdf5@1.10.6, would that at least take care of things in the short-to-mid term, especially in terms of transitioning to spack-stack?

For supporting the GSI moving to spack-stack, yes, probably. We would need to decide whether and how we mix intel versions across the GFS...meaning, the GSI builds with 2018 but all other pieces build with 2022, and the whole system (from the workflow level) runs with 2022 loaded.

Our move to the EPIC stack plans to move everything to 2022, with the GSI being a question right now as well.

@GeorgeVandenberghe-NOAA

GeorgeVandenberghe-NOAA commented May 9, 2023 via email

@AlexanderRichert-NOAA

AlexanderRichert-NOAA commented May 19, 2023

Okay, I at least have a fix for the gfsanalcalc/interp_inc.x issue. For whatever reason, as of hdf5@1.10.9, certain parallel operations require the involvement of the root process (MPI_RANK==0), and if it's not available for work, it hangs. So my workaround is to make [MPI world size]-1 the root process (replace mype == 0 with mype == npes-1 and so on in netcdf_io/interp_inc.fd/driver.f90), such that the parallel work is done on ranks 0 through 8 and rank 9 is the "root" MPI process. See /work/noaa/nems/arichert/develop_spack/sorc/gsi_utils.fd/src/netcdf_io/interp_inc.fd/driver.f90_WORKING

I'll submit a bug report to HDF5 and see what they say, and I'll cross-reference the issue here. @GeorgeVandenberghe-NOAA any chance this could explain other issues you've encountered? And if so, is this kind of workaround viable as far as level of effort required?
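A minimal sketch of that rank swap (variable names follow the description above; the actual edit is in gsi_utils.fd/src/netcdf_io/interp_inc.fd/driver.f90, and the surrounding code is omitted):

```fortran
! Workaround sketch: make the last rank the "root" instead of rank 0,
! so ranks 0..npes-2 stay available for the parallel HDF5/netCDF work
! and rank npes-1 handles the root-only bookkeeping.
call mpi_comm_size(mpi_comm_world, npes, ierror)
call mpi_comm_rank(mpi_comm_world, mype, ierror)

! was: if (mype == 0) then
if (mype == npes - 1) then
   ! root-only work runs here
end if
```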


@WalterKolczynski-NOAA WalterKolczynski-NOAA removed this from the GW March 2023 milestone Jul 5, 2023
@KateFriedman-NOAA

Being continued in issue #1868. Closing.
