Global workflow v16.3.7 will not build on WCOSS2 #1812

Closed
ADCollard opened this issue Aug 21, 2023 · 35 comments
Labels: bug (Something isn't working)

@ADCollard
Contributor

The current operational version of the GFS workflow is not building correctly.

git clone -b EMC-v16.3.7 https://github.com/NOAA-EMC/global-workflow.git  global_workflow_v16.3.7
cd global_workflow_v16.3.7/sorc
./checkout.sh
./build_tropcy_NEMS.sh   (./build_all.sh produces the same error but this is the script that fails)

Results in:

+ module load modulefile.storm_reloc_v6.0.0.wcoss2
++ /usr/share/lmod/lmod/libexec/lmod bash load  intel/19.1.3.304
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "w3nco/2.4.1"
   Try: "module spider w3nco/2.4.1" to see how to load the module(s).

modulefile.storm_reloc_v6.0.0.wcoss2 will load with a clean login environment, just not in the ./build_tropcy_NEMS.sh script or when ./machine-setup.sh is run first.

@RussTreadon-NOAA
Contributor

Changing the order of module loads in modulefile.storm_reloc_v6.0.0.wcoss2.lua to

--- a/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
+++ b/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
@@ -12,9 +12,9 @@ load(pathJoin("libpng", os.getenv("libpng_ver")))
 load(pathJoin("zlib", os.getenv("zlib_ver")))
 
 load(pathJoin("bacio", os.getenv("bacio_ver")))
-load(pathJoin("w3nco", os.getenv("w3nco_ver")))
 load(pathJoin("nemsio", os.getenv("nemsio_ver")))
 load(pathJoin("nemsiogfs", os.getenv("nemsiogfs_ver")))
+load(pathJoin("w3nco", os.getenv("w3nco_ver")))
 load(pathJoin("sigio", os.getenv("sigio_ver")))
 load(pathJoin("w3emc", os.getenv("w3emc_ver")))
 load(pathJoin("sp", os.getenv("sp_ver")))

yields an rc=0 execution of build_tropcy_NEMS.sh.

I do not know why moving the w3nco load makes a difference. Is the above behavior known and expected?

@WalterKolczynski-NOAA , who is the tropcy code manager? We should assign this issue to him/her for resolution.

@WalterKolczynski-NOAA
Contributor

I think it is @JiayiPeng-NOAA

@WalterKolczynski-NOAA
Contributor

Also, w3nco may not even be needed anymore.

@RussTreadon-NOAA
Contributor

Thanks @WalterKolczynski-NOAA . @JiayiPeng-NOAA needs to fix the tropcy build and test it before g-w staff cut a gfs.v16.3.8 tag for NCO to pick up.

@JiayiPeng-NOAA

JiayiPeng-NOAA commented Aug 21, 2023 via email

@WalterKolczynski-NOAA
Contributor

@JiayiPeng-NOAA at the bottom of every email is a link that says "view it on GitHub".

@JiayiPeng-NOAA

JiayiPeng-NOAA commented Aug 21, 2023 via email

@RussTreadon-NOAA
Contributor

@JiayiPeng-NOAA , @ADCollard found a problem with the operational build for tropcy. This issue (#1812) documents the problem. We had to move the w3nco load after nemsiogfs in order to get tropcy to build. Is this known behavior?

@aerorahul
Contributor

@RussTreadon-NOAA @ADCollard
Are we sure we are sourcing versions/build.ver before we build the software?
I think calling build_tropcy_NEMS.sh directly bypasses that.

@ADCollard
Contributor Author

@aerorahul That gets sourced in sorc/machine-setup.sh

@ADCollard
Contributor Author

@WalterKolczynski-NOAA @aerorahul I can confirm that Russ's solution allows the build to proceed.

If I put in a PR for this change to be added to https://github.com/NOAA-EMC/global-workflow/tree/release/gfs.v16.3.8, can we get this version tagged? This is getting time-critical and I don't think we are going to find a definitive solution soon.

We can advise NCO not to do a full re-build as none of the changes are related to code (two fix files, one script).

@aerorahul
Contributor

aerorahul commented Aug 21, 2023

So, I commented out the w3nco/ line in the modulefile.
When I list the modules after loading this, I see nemsio/2.5.4 being loaded, even though nemsio_ver=2.5.2 is set in build.ver.

sorc [EMC-v16.3.7|✚ 2]
11:15 $ git diff
diff --git i/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua w/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
index 33cd59f0..7a03b195 100755
--- i/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
+++ w/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
@@ -12,7 +12,7 @@ load(pathJoin("libpng", os.getenv("libpng_ver")))
 load(pathJoin("zlib", os.getenv("zlib_ver")))

 load(pathJoin("bacio", os.getenv("bacio_ver")))
-load(pathJoin("w3nco", os.getenv("w3nco_ver")))
+--load(pathJoin("w3nco", os.getenv("w3nco_ver")))
 load(pathJoin("nemsio", os.getenv("nemsio_ver")))
 load(pathJoin("nemsiogfs", os.getenv("nemsiogfs_ver")))
 load(pathJoin("sigio", os.getenv("sigio_ver")))
diff --git i/sorc/build_tropcy_NEMS.sh w/sorc/build_tropcy_NEMS.sh
index 0e96cfcc..08d2cd13 100755
--- i/sorc/build_tropcy_NEMS.sh
+++ w/sorc/build_tropcy_NEMS.sh
@@ -13,7 +13,9 @@
 #
 set -eux

+set +x
 source ./machine-setup.sh > /dev/null 2>&1
+set -x
 cwd=`pwd`

 # Check final exec folder exists
@@ -21,23 +23,30 @@ if [ ! -d "../exec" ]; then
   mkdir ../exec
 fi

+set +x
 module use ${cwd}/../modulefiles
 module load modulefile.storm_reloc_v6.0.0.$target
+module list
+#module show w3nco/2.4.1
+#module show nemsio
+#module avail
+set -x
+exit

And then calling this script:

/sorc [EMC-v16.3.7|✚ 2]
11:15 $ ./build_tropcy_NEMS.sh
+ set +x
++ pwd
+ cwd=/lfs/h2/emc/eib/noscrub/rahul.mahajan/ops/global-workflow/sorc
+ '[' '!' -d ../exec ']'
+ set +x

Currently Loaded Modules:
  1) craype-x86-rome     (H)   5) PrgEnv-intel/8.1.0   9) jasper/2.0.25  13) nemsio/2.5.4     17) sp/2.3.3
  2) libfabric/1.11.0.0. (H)   6) craype/2.7.10       10) libpng/1.6.37  14) nemsiogfs/2.5.3  18) g2/3.4.5
  3) craype-network-ofi  (H)   7) intel/19.1.3.304    11) zlib/1.2.11    15) sigio/2.3.2      19) modulefile.storm_reloc_v6.0.0.wcoss2
  4) envvar/1.0                8) cray-mpich/8.1.9    12) bacio/2.4.1    16) w3emc/2.9.2

  Where:
   H:  Hidden Module



+ exit
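
A check like the one done by eye above (nemsio/2.5.4 loaded while nemsio_ver=2.5.2 is pinned) could be scripted. This is a minimal sketch, not part of global-workflow: the function name `check_pins` is hypothetical, and it assumes each pin lives in a shell variable named `<module>_ver`, as in versions/build.ver.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare loaded module versions with <name>_ver pins.
# Reads `module list`-style text on stdin and prints one line per mismatch.
check_pins() {
  local tokens token name loaded var pinned status=0
  # Extract every "name/version" token, e.g. "nemsio/2.5.4".
  tokens=$(grep -oE '[A-Za-z][A-Za-z0-9_-]*/[0-9][A-Za-z0-9._-]*') || true
  while read -r token; do
    [[ -z $token ]] && continue
    name=${token%%/*}
    loaded=${token#*/}
    var="${name//-/_}_ver"      # nemsio -> nemsio_ver (assumed naming)
    pinned=${!var:-}            # indirect lookup; empty when no pin is set
    if [[ -n $pinned && $pinned != "$loaded" ]]; then
      echo "MISMATCH ${name}: pinned ${pinned}, loaded ${loaded}"
      status=1
    fi
  done <<< "$tokens"
  return "$status"
}

# Possible usage (Lmod normally prints to stderr, hence 2>&1):
#   module list 2>&1 | check_pins
```

Modules without a pin are skipped, so a mismatch is only reported where a version was explicitly requested.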

@aerorahul
Contributor

aerorahul commented Aug 21, 2023

I can confirm that loading of nemsiogfs forces the load of nemsio/2.5.4. There is no version specified for nemsiogfs_ver in build.ver.
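
Since the modulefile builds each module name from os.getenv("<name>_ver"), an unset variable presumably leaves only the bare module name, and Lmod then chooses the version itself. A minimal sketch for spotting such gaps before loading follows; the helper name `unset_module_vars` is hypothetical and the check only covers variables read through os.getenv in uncommented lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the environment variables that a Lua modulefile
# reads through os.getenv("...") but that are unset or empty right now.
unset_module_vars() {
  local modulefile=$1 vars var missing=0
  # Skip Lua comment lines (starting with --), then pull out variable names.
  vars=$(grep -vE '^[[:space:]]*--' "$modulefile" \
           | grep -oE 'os\.getenv\("[A-Za-z_][A-Za-z0-9_]*"\)' \
           | sed -E 's/.*"([^"]+)".*/\1/' \
           | sort -u) || true
  while read -r var; do
    [[ -z $var ]] && continue
    # printenv only sees exported variables, which is what os.getenv sees.
    if [[ -z $(printenv "$var") ]]; then
      echo "$var"
      missing=1
    fi
  done <<< "$vars"
  return "$missing"
}

# Possible usage, after sourcing the version file:
#   unset_module_vars ../modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
```

A non-zero return means at least one version variable is missing from the environment.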

I can also confirm that removing the nemsiogfs dependency in building the tropcy module allows the build_tropcy_NEMS.sh to complete successfully. Is nemsiogfs required for the tropcy programs?

@aerorahul
Contributor

This is the diff that goes with the above comment:

11:41 $
✔ /lfs/h2/emc/eib/noscrub/rahul.mahajan/ops/global-workflow/sorc [EMC-v16.3.7|✚ 2]
11:41 $ git diff
diff --git i/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua w/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
index 33cd59f0..e03d8994 100755
--- i/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
+++ w/modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
@@ -14,7 +14,7 @@ load(pathJoin("zlib", os.getenv("zlib_ver")))
 load(pathJoin("bacio", os.getenv("bacio_ver")))
 load(pathJoin("w3nco", os.getenv("w3nco_ver")))
 load(pathJoin("nemsio", os.getenv("nemsio_ver")))
-load(pathJoin("nemsiogfs", os.getenv("nemsiogfs_ver")))
+--load(pathJoin("nemsiogfs", os.getenv("nemsiogfs_ver")))
 load(pathJoin("sigio", os.getenv("sigio_ver")))
 load(pathJoin("w3emc", os.getenv("w3emc_ver")))
 load(pathJoin("sp", os.getenv("sp_ver")))
diff --git i/sorc/build_tropcy_NEMS.sh w/sorc/build_tropcy_NEMS.sh
index 0e96cfcc..0091cc1f 100755
--- i/sorc/build_tropcy_NEMS.sh
+++ w/sorc/build_tropcy_NEMS.sh
@@ -13,7 +13,9 @@
 #
 set -eux

+set +x
 source ./machine-setup.sh > /dev/null 2>&1
+set -x
 cwd=`pwd`

 # Check final exec folder exists
@@ -21,8 +23,11 @@ if [ ! -d "../exec" ]; then
   mkdir ../exec
 fi

+set +x
 module use ${cwd}/../modulefiles
 module load modulefile.storm_reloc_v6.0.0.$target
+module list
+set -x

 export FC=$myFC
 export JASPER_LIB=${JASPER_LIB:-$JASPER_LIBRARIES/libjasper.a}
@@ -33,7 +38,8 @@ export LIBS_SUP="${W3EMC_LIBd} ${W3NCO_LIBd}"
 echo lset
 echo lset
  export LIBS_REL="${W3NCO_LIB4}"
-export LIBS_REL="${NEMSIOGFS_LIB} ${NEMSIO_LIB} ${LIBS_REL} ${SIGIO_LIB} ${BACIO_LIB4} ${SP_LIBd}"
+#export LIBS_REL="${NEMSIOGFS_LIB} ${NEMSIO_LIB} ${LIBS_REL} ${SIGIO_LIB} ${BACIO_LIB4} ${SP_LIBd}"
+export LIBS_REL="${NEMSIO_LIB} ${LIBS_REL} ${SIGIO_LIB} ${BACIO_LIB4} ${SP_LIBd}"
 export LIBS_SIG="${SIGIO_INC}"
 export LIBS_SYN_GET="${W3NCO_LIB4}"
 export LIBS_SYN_MAK="${W3NCO_LIB4} ${BACIO_LIB4}"

@JiayiPeng-NOAA

JiayiPeng-NOAA commented Aug 21, 2023 via email

@aerorahul
Contributor

aerorahul commented Aug 21, 2023

@Qingfu-Liu Can you please take a look at the comments from @RussTreadon-NOAA and @aerorahul to address build issues with the tropcy programs?

Specifically
#1812 (comment) from @RussTreadon-NOAA
and
#1812 (comment) from @aerorahul

Please let us know which solution you would prefer. We need to provide a tag to NCO asap.

@WalterKolczynski-NOAA
Contributor

The nemsio version thing isn't something new for this release, so let's not pull on that thread here. Let's just fix the w3nco blocker for this release.

@ADCollard Once the release branch is finalized and tested to work, I will create a tag.

@yangfanglin
Contributor

gfsnemsio is a wrapper to call nemsio, tailored for the I/O of the GFS NEMSIO version. It is probably the right time to clean up the storm_reloc code, and any other code, to remove dependencies on nemsio and gfsnemsio. (Sorry, Walter, for pulling on this.)

@WalterKolczynski-NOAA
Contributor

@yangfanglin This needs to get to NCO to be implemented in 8 days. It can be fixed in develop, or even a future v16 release. Unless something is actually broken, we should leave it.

@ADCollard
Contributor Author

@WalterKolczynski-NOAA OK. Once @Qingfu-Liu has replied to @aerorahul with regards to which solution we should follow I will issue the new PR. Testing obviously will have to wait until this evening due to Cactus being unavailable unless @Qingfu-Liu can suggest a standalone test we can run on Acorn, for example.

@Qingfu-Liu

Qingfu-Liu commented Aug 21, 2023 via email

@ADCollard
Contributor Author

@Qingfu-Liu there is probably no reason to test this on Hera as the build does not fail there. This is a WCOSS2 issue.

@Qingfu-Liu

Qingfu-Liu commented Aug 21, 2023 via email

@Qingfu-Liu

Qingfu-Liu commented Aug 21, 2023

I am not able to log in to WCOSS2 to test these now. My guess is that the failure is related to new software or libraries on WCOSS2. I do not have any ideas about the change and need help from experts who know about the software changes on WCOSS2.

@RussTreadon-NOAA
Contributor

Qingfu, you have devonprod access.

russ.treadon@dlogin02:/lfs/h1/ops/prod/output/20230821> groups qingfu.liu
qingfu.liu : emc primarysys rstprod physics pioneers qingfu.liu

Are devonprod users permitted to log onto the production machine to compile code?

@RussTreadon-NOAA
Contributor

I logged onto Dogwood, cloned g-w branch gfs.v16.3.8, and executed build_tropcy_NEMS.sh twice. One execution placed the w3nco load in modulefile.storm_reloc_v6.0.0.wcoss2.lua after nemsiogfs. The other execution left w3nco alone and removed nemsiogfs references from the module and build_tropcy_NEMS.sh. Both approaches generate six executables. Nothing new here. I am simply repeating what @aerorahul, @ADCollard , and I did earlier.

@Qingfu-Liu, would you please identify the approach to pass to NCO and confirm that the modified build does not alter gfs/gdas tropcy results?

@Qingfu-Liu

Qingfu-Liu commented Aug 21, 2023

@RussTreadon-NOAA I am looking at the changes, but I am not sure what the differences between the two are. I feel that nemsiogfs should be loaded before w3nco. I am doing more research on this.

@Qingfu-Liu

I might have data on Cactus, but I can't log in now. I will try later today to see if I have files on Cactus to compare.

@Qingfu-Liu

@RussTreadon-NOAA I looked at the script change and the executables produced; the generated executables are the same in both cases. The NEMSIOGFS library is no longer needed by the relocation program, so it does not have to be there. So both changes are good.
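
The thread does not say how the two sets of executables were compared. One possible way, shown only as a sketch with a hypothetical helper name and placeholder directories, is a byte-for-byte comparison of the two exec directories. Compiled binaries can differ in embedded paths or timestamps even when the code is equivalent, so a DIFFERS result would need a closer look rather than being read as a functional change.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the regular files in two build output
# directories byte for byte and report the result per file.
compare_execs() {
  local dir_a=$1 dir_b=$2 f name differ=0
  for f in "$dir_a"/*; do
    [[ -f $f ]] || continue
    name=${f##*/}
    if [[ ! -f $dir_b/$name ]]; then
      echo "MISSING $name"
      differ=1
    elif ! cmp -s "$f" "$dir_b/$name"; then
      echo "DIFFERS $name"
      differ=1
    else
      echo "SAME $name"
    fi
  done
  return "$differ"
}

# Possible usage (both paths are placeholders):
#   compare_execs build_reorder/exec build_no_nemsiogfs/exec
```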

@aerorahul
Contributor

The nemsio/2.5.2 library depends on bacio and w3nco. As a result, they should appear before the loading of nemsio.
nemsiogfs depends on nemsio, so it is loaded after nemsio.

If w3nco/2.4.1 is placed after nemsiogfs/2.5.3, the load of nemsiogfs/2.5.3 will change the version of nemsio from 2.5.2 to 2.5.4.

GFSv16.3.x depends on nemsio/2.5.2 as is defined in versions/build.ver.

@Qingfu-Liu

@aerorahul the easiest way to fix this is to change the library order as @RussTreadon-NOAA suggested

@aerorahul
Contributor

It may be easy, but it alters the library versions of nemsio. I will take the suggestion from the implementation team and expert on the program.

@aerorahul
Contributor

So the root question is why nemsiogfs loads nemsio/2.5.4 now when it did not do so earlier.
The answer lies in the nemsiogfs modulefile at /apps/ops/prod/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.4/nemsiogfs/2.5.3.lua

load("nemsio")
prereq("nemsio")

These lines force nemsiogfs to load the latest version of nemsio available, which is now 2.5.4 as seen under:

15:26 $ ls -lrt /apps/ops/prod/libs/intel/19.1.3.304/cray-mpich/8.1.4/nemsio
total 8
drwxr-sr-x 5 ops.prod prod 2048 Oct 14  2021 2.5.2
lrwxrwxrwx 1 ops.prod prod   24 Aug 15 11:44 2.5.4 -> ../../8.1.9/nemsio/2.5.4

On Aug 15, 2023 NCO (or someone) installed a new version of nemsio (2.5.4), and Lmod is picking that one up.

If we want to keep the package the same, NCO can remove the two lines from the nemsiogfs module or explicitly add nemsio/2.5.2 to it.
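
To illustrate why a new install changes what an unversioned load("nemsio") resolves to: when no default is marked, Lmod generally picks the highest version it finds. The sketch below only approximates that rule with `sort -V` over a directory listing. The function name is hypothetical, real Lmod version comparison and default markers are more involved, and the directory passed in would be the modulefile directory for nemsio (an assumption, since the listing above shows the install tree).

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate "highest version wins" by version-sorting
# the entries of a directory (a trailing .lua suffix is stripped if present).
latest_version() {
  local dir=$1 entry
  for entry in "$dir"/*; do
    # Accept real entries and symlinks; skip the literal glob on empty dirs.
    [[ -e $entry || -L $entry ]] || continue
    entry=${entry##*/}
    printf '%s\n' "${entry%.lua}"
  done | sort -V | tail -n 1
}

# Possible usage (path is a placeholder):
#   latest_version /path/to/modulefiles/nemsio
```

With only 2.5.2 present the function returns 2.5.2; once 2.5.4 appears beside it, the same unversioned request resolves to 2.5.4, which matches the behavior described above.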

@aerorahul
Contributor

To make sure that there is no nemsio_gfs use in any of the tropcy programs, I did the following:

✔ /lfs/h2/emc/eib/noscrub/rahul.mahajan/ops/global-workflow/sorc [EMC-v16.3.7|✚ 2]
17:45 $ grep -ir nemsio_gfs *.fd/*
✘-1 /lfs/h2/emc/eib/noscrub/rahul.mahajan/ops/global-workflow/sorc [EMC-v16.3.7|✚ 2]

The public interfaces for nemsiogfs library are all named nemsio_gfs* as seen in: https://github.com/NOAA-EMC/NCEPLIBS-nemsiogfs/blob/develop/src/nemsio_gfs.f90

There were no hits, confirming that nothing in the tropcy programs depends on the nemsiogfs library.
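
The same grep check can be wrapped so that "no match" is reported explicitly instead of showing up only as a non-zero exit status. This is a minimal sketch with a hypothetical function name; it relies on grep's documented exit codes (0 for a match, 1 for no match, 2 for an error).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a symbol appears anywhere under the
# given paths, using a case-insensitive recursive grep.
report_dependency() {
  local pattern=$1 rc
  shift
  grep -irq -- "$pattern" "$@" && rc=0 || rc=$?
  case $rc in
    0) echo "found: $pattern" ;;
    1) echo "not found: $pattern" ;;
    *) echo "grep error ($rc)" >&2 ;;
  esac
  return "$rc"
}

# Possible usage from sorc/:
#   report_dependency nemsio_gfs *.fd
```

Capturing the status with `&& rc=0 || rc=$?` keeps the function safe to call from scripts that run under `set -e`, such as build_tropcy_NEMS.sh.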

@Qingfu-Liu

Here is my suggestion:

  1. in script sorc/build_tropcy_NEMS.sh, remove the following lines:
    line 35: export LIBS_REL="${W3NCO_LIB4}"
    line 36: export LIBS_REL="${NEMSIOGFS_LIB} ${NEMSIO_LIB} ${LIBS_REL} ${SIGIO_LIB} ${BACIO_LIB4} ${SP_LIBd}"
    line 41: echo $LIBS_REL
  2. in modulefiles/modulefile.storm_reloc_v6.0.0.wcoss2.lua
    remove line: load(pathJoin("nemsiogfs", os.getenv("nemsiogfs_ver")))
    so we remove the NEMSIOGFS_LIB

WalterKolczynski-NOAA pushed a commit that referenced this issue Aug 22, 2023
For some reason the sorc/build_tropcy_NEMS.sh no longer works on WCOSS2 machines.   @RussTreadon-NOAA and @aerorahul found fixes that allow the build to complete.  This PR implements one of these fixes - removing the references to the NEMSIOGFS module, which is being loaded but does not have a version number attached.

Six executables result from this build as expected.   They should still be tested for functionality.

This references #1812.