
changes to allow USE_MGBF be able to be off #772

Draft
wants to merge 4 commits into base: develop

Conversation

TingLei-daprediction
Contributor

Fixes #765
Added changes in the GSI code to allow the MGBF library to be turned off.
The changes mainly use preprocessing directives to wrap the MGBF code in hybrid_ensemble_isotropic.F90.
As discussed with @ManuelPondeca-NOAA and @MiodragRancic-NOAA, the current treatment is easy and straightforward, but the added directives in hybrid_ensemble_isotropic.F90 do not look as clean as expected.
This draft PR is for GSI developers to review and discuss whether this is an efficient way to resolve issue 765.
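Since .F90 sources are run through a cpp-style preprocessor, the wrapping pattern described above can be sketched with a minimal C analogue (the function name here is made up for illustration; GSI's real guards wrap Fortran `use` statements and MGBF calls):

```c
#include <string.h>

/* Minimal analogue of the wrapping described above. When the build
 * system passes the macro (e.g. -DUSE_MGBF_def), the guarded code is
 * compiled in; otherwise the preprocessor removes it entirely, so the
 * executable never references an MGBF symbol and can link without
 * libmgbf.a. The function name is hypothetical. */
const char *mgbf_build_mode(void)
{
#ifdef USE_MGBF_def
    return "on";   /* MGBF code paths compiled in */
#else
    return "off";  /* MGBF code paths compiled out */
#endif
}
```

Compiled without the macro, `mgbf_build_mode()` returns `"off"`, mirroring a build with the MGBF option turned off.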

@RussTreadon-NOAA
Contributor

Install TingLei-daprediction:feature/gsi_mgbf_turned_off on Hera. Found it necessary to make the following change to the top-level CMakeLists.txt in order to not build MGBF

-option(BUILD_MGBF "Build MGBF Library" ON)
+option(BUILD_MGBF "Build MGBF Library" OFF)

Without this change, MGBF source code was still compiled and libmgbf.a created. With BUILD_MGBF set to OFF, ush/build.sh does not compile MGBF source code and libmgbf.a is not created. Only gsi.x and enkf.x are built. Ran ctests with the following results

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  737.44 sec
2/6 Test #6: global_enkf ......................   Passed  886.38 sec
3/6 Test #2: rtma .............................   Passed  1090.47 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1112.24 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1345.83 sec
6/6 Test #1: global_4denvar ...................   Passed  2285.50 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 2285.51 sec

The changes in this draft PR along with defaulting BUILD_MGBF to OFF in CMakeLists.txt achieve the desired purpose of this PR.

We do not have a MGBF ctest so I can not confirm that setting BUILD_MGBF to ON yields a MGBF library which behaves as intended.

@ShunLiu-NOAA self-assigned this Jul 12, 2024
@RussTreadon-NOAA
Contributor

@TingLei-daprediction , based on my tests I think this PR is ready for review. May we mark this PR as ready for review?

@TingLei-daprediction
Contributor Author

@RussTreadon-NOAA Thanks a lot! And sorry I hadn't reported my test results in time for GSI with the MGBF option turned off.
We are in the middle of several related things, and I will report as soon as possible on an off-line test/verification of ensemble MGBF localization (for GSI with the MGBF option turned on).

@RussTreadon-NOAA
Contributor

OK. @TingLei-daprediction. I will mark this PR as ready for review and add my review.

@RussTreadon-NOAA marked this pull request as ready for review July 12, 2024 13:53
@RussTreadon-NOAA self-requested a review July 12, 2024 13:53
Contributor

@RussTreadon-NOAA left a comment

The top-level CMakeLists.txt needs to change the BUILD_MGBF default from ON to OFF in order to achieve the stated purpose of this PR.

@RussTreadon-NOAA
Contributor

@TingLei-daprediction , we need two peer reviewers for this PR. Who do you recommend?

@TingLei-daprediction
Contributor Author

@RussTreadon-NOAA Thanks a lot (sorry for not replying sooner).
I would recommend @GangZhao-NOAA and @XuLu-NOAA as the reviewers.

Contributor

@XuLu-NOAA left a comment

The changes look reasonable and clean to me.

@RussTreadon-NOAA
Contributor

WCOSS ctests
Install TingLei-daprediction:feature/gsi_mgbf_turned_off at cf3a3a3 on Dogwood. Run ctests with the following results

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  738.65 sec
2/6 Test #6: global_enkf ......................   Passed  858.96 sec
3/6 Test #2: rtma .............................   Passed  971.08 sec
4/6 Test #5: hafs_3denvar_hybens ..............***Failed  1216.29 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1278.19 sec
6/6 Test #1: global_4denvar ...................***Failed  1684.25 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1684.37 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          5 - hafs_3denvar_hybens (Failed)
Errors while running CTest

Examine the failures and find that each is due to non-reproducible analysis results between the updat (TingLei-daprediction:feature/gsi_mgbf_turned_off) and contrl (develop) analyses.

For rrfs_3denvar_rdasens differences show up in the last few digits of the initial stepsize written to stdout
updat

cost,grad,step,b,step? =   1   0  1.608182401781457011E+05  1.278388086096595998E+03  2.164417019337178338E+00  0.000000000000000000E+00  good

contrl

cost,grad,step,b,step? =   1   0  1.608182401781457011E+05  1.278388086096595998E+03  2.164417019337169901E+00  0.000000000000000000E+00  good

For hafs_3denvar_hybens differences show up in the last few digits of the initial step size
updat

cost,grad,step,b,step? =   1   0  1.522570550456362253E+05  5.089882891394057879E+03  3.159083965720937970E-01  0.000000000000000000E+00  good

contrl

cost,grad,step,b,step? =   1   0  1.522570550456362253E+05  5.089882891394057879E+03  3.159083965720939080E-01  0.000000000000000000E+00  good

For hafs_4denvar_glbens differences first appear in the last few digits of the initial step size
updat

cost,grad,step,b,step? =   1   0  1.640321269846514042E+05  3.663464403379373380E+03  1.078684720834985011E+00  0.000000000000000000E+00  good

contrl

cost,grad,step,b,step? =   1   0  1.640321269846514042E+05  3.663464403379373380E+03  1.078684720834984789E+00  0.000000000000000000E+00  good

For global_4denvar the difference shows up in the step size for the second iteration on the first outer loop
updat

cost,grad,step,b,step? =   1   0  7.456350531189966714E+05  1.771749011413640574E+03  1.159295676216339777E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  7.413499265815886902E+05  2.375440109105811644E+03  1.539623514636089263E+00  1.536107770547574303E+00  good

contrl

cost,grad,step,b,step? =   1   0  7.456350531189966714E+05  1.771749011413640574E+03  1.159295676216339777E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  7.413499265815886902E+05  2.375440109105811644E+03  1.539623514636094370E+00  1.536107770547574303E+00  good

I see that TingLei-daprediction:feature/gsi_mgbf_turned_off is two commits behind the head of develop. @TingLei-daprediction , you should update to the current head of develop and rerun ctests on Dogwood.

I reran ctests on Hera. All ctests pass using TingLei-daprediction:feature/gsi_mgbf_turned_off at cf3a3a3 and develop at a5e2a43. The Dogwood test uses the same snapshots of TingLei-daprediction:feature/gsi_mgbf_turned_off and develop.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Thanks. I will take a look to see what the problem is for the failed ctests on Dogwood while I update this PR to the current develop branch. It is interesting: the two recent PRs (which are missing in this PR) don't cause any differences in ctests, so why would this PR, lacking those PRs, produce different results in the failed ctests? I will see if I missed anything.

@RussTreadon-NOAA
Contributor

Agreed. The two PRs not in TingLei-daprediction:feature/gsi_mgbf_turned_off should not alter results but this is why we run ctests on multiple platforms. The unexpected can and does occur. Tests should be run on Hercules, Orion, and Jet to see if ctests on these machines behave like Hera (all tests pass) or like Dogwood (some tests fail).

@RussTreadon-NOAA
Contributor

Unfortunately updating TingLei-daprediction:feature/gsi_mgbf_turned_off to the current head of develop does not alter ctest behavior on Dogwood. The same tests fail

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  792.77 sec
2/6 Test #6: global_enkf ......................   Passed  917.62 sec
3/6 Test #5: hafs_3denvar_hybens ..............***Failed  1274.96 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed  1337.52 sec
5/6 Test #2: rtma .............................   Passed  1692.73 sec
6/6 Test #1: global_4denvar ...................***Failed  1986.11 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1986.26 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          5 - hafs_3denvar_hybens (Failed)
Errors while running CTest

I'll stop here and let the MGBF team investigate.

@RussTreadon-NOAA
Contributor

Hercules and Orion ctests
Install TingLei-daprediction:feature/gsi_mgbf_turned_off at cf3a3a3 on MSU machines. Run ctests with the following results

Hercules

Test project /work/noaa/da/rtreadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  485.36 sec
2/6 Test #6: global_enkf ......................   Passed  726.20 sec
3/6 Test #2: rtma .............................   Passed  964.97 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1091.30 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1158.67 sec
6/6 Test #1: global_4denvar ...................   Passed  1741.25 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1741.26 sec

Orion

Test project /work2/noaa/da/rtreadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #6: global_enkf ......................   Passed  968.96 sec
2/6 Test #2: rtma .............................   Passed  1687.83 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  4221.41 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  4282.75 sec
5/6 Test #1: global_4denvar ...................   Passed  4862.98 sec

rrfs_3denvar_rdasens hung on Orion as documented in GSI issue #766. The significant increase in Orion wall times has been reported in GSI issue #771.

The Hercules and Orion Passed ctest results are the same as Hera's. It's only on WCOSS2 (Dogwood) that several tests fail due to non-reproducible results.

Several modules on WCOSS2 are built using hpc-stack, not spack-stack. Also the WCOSS2 build uses intel 19.1.3.304 whereas the Hera, Hercules, and Orion builds use intel 2021.x.0.

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA , do you know if we can build gsi.x on WCOSS2 using an intel/2020+ compiler? I recall an effort to do so on acorn a while ago but I can't find details regarding this attempt.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA The ctests for this PR are running on Cactus and I will keep posting my findings.

@DavidHuber-NOAA
Collaborator

@RussTreadon-NOAA The Spack-Stack configuration file for Acorn shows installations using Intel 2022 and Intel 19. However, Acorn is down right now, so I cannot verify that such installations are available ATTM.

For Cactus and Dogwood, there are two installations of Intel 2022, one for 'classic' and one for 'One API' compilers. However, these are not approved for operations. Additionally, the libraries that are available upon loading the intel-classic/2022.2.0.262 module (e.g. HDF5) appear to have been compiled with Intel 19. Thus, I would advise against using Intel 2022 on Cactus or Dogwood.

@RussTreadon-NOAA
Contributor

Thank you @DavidHuber-NOAA for the update. It's unfortunate that building with intel/2020+ is a no go on Cactus and Dogwood. I hope more recent intel compilers are soon approved for these machines. I didn't realize Acorn was down. This explains why I can't log into Acorn this morning.

@RussTreadon-NOAA
Contributor

Thank you, @TingLei-daprediction . I ran ctests on Cactus following the Saturday production switch. Results are the same as Dogwood.

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed   60.27 sec
2/6 Test #6: global_enkf ......................   Passed  850.28 sec
3/6 Test #2: rtma .............................   Passed  968.73 sec
4/6 Test #5: hafs_3denvar_hybens ..............***Failed  1091.75 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1211.76 sec
6/6 Test #1: global_4denvar ...................***Failed  1682.63 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1682.64 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          5 - hafs_3denvar_hybens (Failed)
Errors while running CTest

The global_4denvar, hafs_4denvar_glbens, and hafs_3denvar_hybens failures are due to non-reproducible results between the updat and contrl gsi.x. The rrfs_3denvar_rdasens failure is due to not pointing at the correct files for the new rrfs ctest. After the path was updated, the rrfs_3denvar_rdasens test ran all four jobs but failed due to non-reproducible results between the updat and contrl gsi.x.

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 3: rrfs_3denvar_rdasens
1/1 Test #3: rrfs_3denvar_rdasens .............***Failed  726.65 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 726.67 sec

The following tests FAILED:
          3 - rrfs_3denvar_rdasens (Failed)
Errors while running CTest

Contributor

@GangZhao-NOAA left a comment

These modifications seem good.
I noticed that there is an issue with the ctest run on wcoss2. Hope it could be fixed soon.

@RussTreadon-NOAA
Contributor

@GangZhao-NOAA , this PR can not be approved until the reason(s) for non-reproducible results on WCOSS2 is/are explained and resolved. This is a serious problem since operational realizations of the GSI run on WCOSS2.

@TingLei-daprediction
Contributor Author

@RussTreadon-NOAA Thanks a lot. Would you please point me to the develop branch with your above changes? I will use them to continue the investigation.

@RussTreadon-NOAA
Contributor

@TingLei-daprediction , @wx20jjung is investigating the CADS issue reported in #775 and will develop the correct solution. What I have simply enables global_4denvar to run to completion when gsi.x is built in debug mode. The global_4denvar test is the only test which runs with CADS on. None of the regional tests turn on CADS.

I recommend that you focus on the failed regional tests

          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          5 - hafs_3denvar_hybens (Failed)

since you are very familiar with regional DA.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Ok, I will use the regional tests to see what the problem is.

@TingLei-daprediction
Contributor Author

Using GSI built in debug mode, the rrfs_3denvar_rdasens test passed. I will run the rest of the regional tests and then see whether the optimization levels cause any differences.

@RussTreadon-NOAA
Contributor

@TingLei-daprediction , good suggestion about lowering the optimization levels. I recompiled develop and TingLei-daprediction:feature/gsi_mgbf_turned_off using -O1 on Cactus. All ctests generate reproducible results with this level of optimization.

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  728.55 sec
2/6 Test #6: global_enkf ......................   Passed  859.76 sec
3/6 Test #2: rtma .............................***Failed  1029.43 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1223.65 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1279.04 sec
6/6 Test #1: global_4denvar ...................   Passed  1683.33 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 1683.40 sec

The following tests FAILED:
          2 - rtma (Failed)
Errors while running CTest

The rtma failure is due to the time threshold check

The runtime for rtma_hiproc_updat is 212.882223 seconds.  This has exceeded maximum allowable threshold time of 210.621013 seconds, resulting in Failure of timethresh2 the regression test.

This is not a fatal fail. The updat and contrl wall times are comparable

rtma_hiproc_contrl/stdout:The total amount of wall time                        = 191.473649
rtma_hiproc_updat/stdout:The total amount of wall time                        = 212.882223
rtma_loproc_contrl/stdout:The total amount of wall time                        = 201.818444
rtma_loproc_updat/stdout:The total amount of wall time                        = 207.358389

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Thanks for this update. It appears that some behavior left unspecified by the Fortran standard caused the differences observed when higher optimization levels are used. I will focus on this possibility and provide updates as soon as possible.
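The trailing-digit differences in the step sizes quoted earlier are consistent with one well-known mechanism, sketched here as an illustration only (not GSI code; the function name is made up): IEEE-754 addition is not associative, so an optimizer that reassociates a sum can legally perturb the last few bits of a result.

```c
/* Illustration only (not GSI code; the function name is made up):
 * IEEE-754 double addition is not associative. A compiler that
 * reassociates a reduction at higher optimization levels can therefore
 * change the last few digits of a result -- the size of difference
 * reported in the failed ctests. */
int sums_reassociate_equal(void)
{
    double left  = (0.1 + 0.2) + 0.3;  /* one evaluation order   */
    double right = 0.1 + (0.2 + 0.3);  /* the reassociated order */
    return left == right;              /* 0: last bits differ    */
}
```

This is also why rebuilding both updat and contrl at -O1, as in the Cactus test above, is a reasonable way to isolate optimization-dependent differences.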

@RussTreadon-NOAA
Contributor

Build gsi.x and enkf.x on Cactus using intel-classic/2022.2.0.262 instead of intel/19.1.3.304. Do this for develop and TingLei-daprediction:feature/gsi_mgbf_turned_off. Run ctests with the following results

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  974.30 sec
2/6 Test #6: global_enkf ......................   Passed  1276.24 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  1395.37 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1573.62 sec
5/6 Test #2: rtma .............................   Passed  1809.34 sec
6/6 Test #1: global_4denvar ...................   Passed  2403.33 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 2403.39 sec

All tests pass as is the case on Hera, Hercules, and Orion. This isn't too surprising since these machines compile GSI with intel/2020+ compilers.

It should be noted that the intel-classic/2022 test did not recompile all the supporting modules. These remain intel/19 builds. The only change was to the intel compiler in modulefiles/gsi_wcoss2.intel.lua

@@ -3,6 +3,7 @@ help([[
 
 local PrgEnv_intel_ver=os.getenv("PrgEnv_intel_ver") or "8.1.0"
 local intel_ver=os.getenv("intel_ver") or "19.1.3.304"
+local intel_classic_ver=os.getenv("intel_classic_ver") or "2022.2.0.262"
 local craype_ver=os.getenv("craype_ver") or "2.7.8"
 local cray_mpich_ver=os.getenv("cray_mpich_ver") or "8.1.7"
 local cmake_ver= os.getenv("cmake_ver") or "3.20.2"
@@ -24,7 +25,8 @@ local crtm_ver=os.getenv("crtm_ver") or "2.4.0.1"
 local ncdiag_ver=os.getenv("ncdiag_ver") or "1.1.1"
 
 load(pathJoin("PrgEnv-intel", PrgEnv_intel_ver))
-load(pathJoin("intel", intel_ver))
+--load(pathJoin("intel", intel_ver))
+load(pathJoin("intel-classic", intel_classic_ver))
 load(pathJoin("craype", craype_ver))
 load(pathJoin("cray-mpich", cray_mpich_ver))
 load(pathJoin("cmake", cmake_ver))

@GangZhao-NOAA
Contributor

@RussTreadon-NOAA Thank you for keeping us updated.
So the non-reproducible issue and the satellite-CADS seg fault issue are both fixed by using an Intel 2020+ compiler? Or is the CADS seg fault not solved yet?
-Gang

@RussTreadon-NOAA
Contributor

Additional tests with intel/19 and intel/2022 compilers on Cactus find that it is not necessary to recompile develop using intel/2022 in order to get the Passed result.

Leave develop gsi_wcoss2.intel.lua unchanged so that the develop build uses intel/19. Change gsi_wcoss2.intel.lua for TingLei-daprediction:feature/gsi_mgbf_turned_off to use intel-classic/2022.2.0.262.

With this set up the ctests yield

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr772/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  728.13 sec
2/6 Test #6: global_enkf ......................   Passed  853.82 sec
3/6 Test #2: rtma .............................   Passed  970.10 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1155.08 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1214.54 sec
6/6 Test #1: global_4denvar ...................   Passed  1683.13 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1683.14 sec

I do not know if NCO accepts packages compiled with intel-classic/2022.2.0.262.

@TingLei-daprediction
Contributor Author

@RussTreadon-NOAA Thanks a lot! Your findings indicate that the higher optimization levels treat the "active" code differently once the "inactive" code is excluded by the preprocessing directives introduced in this PR. I will dig into this as soon as possible.

@RussTreadon-NOAA
Contributor

@GangZhao-NOAA : The issue with CADS is not compiler related. Source code changes are needed to resolve the CADS failure in debug mode. GSI issue #775 is tracking this problem.

Compiling TingLei-daprediction:feature/gsi_mgbf_turned_off with intel-classic/2022.2.0.262 on Cactus yields a gsi.x which reproduces analysis results from develop.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA an update that is confusing to me :(. My recent GSI build (and the control) compiled in release mode with intel/19.1.3.304 still gave passing ctest results:

 The following tests passed:
        rrfs_3denvar_rdasens
        hafs_4denvar_glbens
        hafs_3denvar_hybens

I didn't find any changes to GSI itself and wonder whether anything changed in the system modules over the last couple of days.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , where are your GSI clone and ctest run directories on Cactus?

@TingLei-NOAA
Contributor

@RussTreadon-NOAA The tmpdir for ctests is (on Cactus) /lfs/h2/emc/ptmp/ting.lei/GSI_optVsopt.
The GSI is /lfs/h2/emc/da/noscrub/Ting.Lei/dr-daprediction-gsi_optimized/GSI_optVsopt/ (the output of build.sh is build.out ).
The control GSI is /lfs/h2/emc/da/noscrub/Ting.Lei/dr-emc-gsi_optmized/GSI/.
Thanks!

@TingLei-NOAA
Contributor

@RussTreadon-NOAA I just realized that there was a difference in the GSI build process above. For GSI_optVsopt/, I copied it from a previous GSI built in debug mode and, after deleting the build dir, rebuilt with optimization. That should give a clean rebuild, but it is possible it did not. Hence, I am building GSI from a fresh clone and will keep you posted.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Jul 24, 2024

Thank you @TingLei-NOAA for sharing that there is a mistake in your test.

Unfortunately, this information came too late. I spent a lot of time trying to reproduce your result. Despite various tests I could not reproduce your Passed results. Here is a sample of the tests I ran

  • copy /lfs/h2/emc/da/noscrub/Ting.Lei/dr-daprediction-gsi_optimized/GSI_optVsopt/ to /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/GSI_optVsopt
  • execute ush/build.sh. The build log file is ush/build.log_rt
  • execute ctest -j 6 in build directory. The ctests use the gsi.x I compiled as updat. The contrl gsi.x is /lfs/h2/emc/da/noscrub/Ting.Lei/dr-emc-gsi_optmized/GSI/install/bin/gsi.x.
    ctest output saved in stdout_ctest_rt_exec_updat_ting_exec_contrl.txt
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  732.01 sec
2/6 Test #6: global_enkf ......................   Passed  861.41 sec
3/6 Test #2: rtma .............................   Passed  1034.74 sec
4/6 Test #5: hafs_3denvar_hybens ..............***Failed  1218.28 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1334.95 sec
6/6 Test #1: global_4denvar ...................***Failed  1744.59 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1744.73 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          5 - hafs_3denvar_hybens (Failed)

I repeated the above test and used /lfs/h2/emc/da/noscrub/Ting.Lei/dr-daprediction-gsi_optimized/GSI_optVsopt/build/src/gsi/gsi.x as updat. With this setup, the ctests reproduce your Passed result.

I can not see how your and my compilation of the same code in GSI_optVsopt/src can yield different results.

Your most recent comment points to potential problems in your build.

FYI, your branch TingLei-daprediction:feature/gsi_mgbf_turned_off is three commits behind the head of develop. Please update your branch before you run any more tests.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA So sorry for not realizing the above factor earlier, and I really appreciate your efforts as always!
I will see whether this new finding gives some hints on the issue.

@@ -182,7 +184,9 @@ module hybrid_ensemble_isotropic
integer(r_kind) :: nval_loc_en

! For MGBF
#ifdef USE_MGBF_def
Contributor

USE_MGBF_def is the wrong variable. CMakeLists.txt sets USE_MGBF. USE_MGBF_def needs to be changed to USE_MGBF

The same comment applies to all occurrences of USE_MGBF_def in this source code file.

Contributor

I wanted to use different names in the cmake files and the Fortran source code.
In src/gsi/CMakeLists.txt, there is:

if(USE_MGBF)
  list(APPEND GSI_Fortran_defs USE_MGBF_def)
endif()

That causes USE_MGBF_def to be used as the preprocessor macro.

Contributor

I set USE_MGBF to ON in CMakeLists.txt. This broke the build

[ 80%] Building Fortran object src/gsi/CMakeFiles/gsi_fortran_obj.dir/hybrid_ensemble_isotropic.F90.o
/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/test/src/gsi/hybrid_ensemble_isotropic.F90(188): error #6457: This derived type name has not been declared.   [MG_INTSTATE_TYPE]
  type (mg_intstate_type), allocatable, dimension(:) :: obj_mgbf
--------^
/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/test/src/gsi/hybrid_ensemble_isotropic.F90(4184): error #6158: The structure-name is invalid or is missing.   [OBJ_MGBF]
  real(r_kind)   ,intent(inout) :: g(grd_loc%nsig,obj_mgbf(ig)%nm,obj_mgbf(ig)%mm)
--------------------------------------------------^
/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/test/src/gsi/hybrid_ensemble_isotropic.F90(4184): error #6158: The structure-name is invalid or is missing.   [OBJ_MGBF]
  real(r_kind)   ,intent(inout) :: g(grd_loc%nsig,obj_mgbf(ig)%nm,obj_mgbf(ig)%mm)
------------------------------------------------------------------^
/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/test/src/gsi/hybrid_ensemble_isotropic.F90(1769): error #6404: This name does not have a type, and must have an explicit type.   [PRINT_CPU]
       if(l_mgbf_loc) call print_mg_timers("mgbf_timing_cpu.csv", print_cpu, mype)
------------------------------------------------------------------^
/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/test/src/gsi/hybrid_ensemble_isotropic.F90(3749): error #6404: This name does not have a type, and must have an explicit type.   [OBJ_MGBF]
     allocate(work_mgbf(obj_mgbf(1)%km_a_all,obj_mgbf(1)%nm,obj_mgbf(1)%mm))
------------------------^

Looks like the error is that the first MGBF ifdef in hybrid_ensemble_isotropic.F90 references USE_MGBF

 ! For MGBF
#ifdef USE_MGBF
   use mg_intstate
   use mg_timers
 #endif /* End of USE_MGBF block */

while all the other ifdefs reference USE_MGBF_def. We need to pick one.

My preference is to use USE_MGBF consistently through CMakeLists.txt and the source code. That said, I am not a cmake expert. Maybe the two-variable approach of USE_MGBF and USE_MGBF_def is better.
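For reference, the one-variable style would amount to a small change on the CMake side (a sketch only, assuming the existing USE_MGBF option and GSI_Fortran_defs list shown earlier from src/gsi/CMakeLists.txt):

```cmake
# Sketch: pass the option's own name to the Fortran preprocessor, so
# every guard in the source can read "#ifdef USE_MGBF" throughout.
if(USE_MGBF)
  list(APPEND GSI_Fortran_defs USE_MGBF)
endif()
```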

Contributor

Russ, thanks for exploring these. I think the use of one variable or two is just a stylistic matter; in this case there is no actual difference. Using two variables (as with USE_GSDCLOUD vs RR_CLOUDANALYSIS) just serves as a reminder of how the directives in the source code are recognized by the compiler.
If it is ok with you, I still prefer to follow the "two variable" approach used previously in GSI, as in the GSDCLOUD example above.
If you think the "one variable" approach is enough better that we don't need the "two variable" approach in GSI for similar situations, I can change it later.

Contributor

My preference remains to use one variable, USE_MGBF, all the way through. This is more readable than USE_MGBF_def and less confusing. You are the developer. It's your PR. Do as you wish.

@RussTreadon-NOAA
Contributor

As a test make the following changes in a working copy of TingLei-daprediction:feature/gsi_mgbf_turned_off

top level CMakeLists.txt

 option(BUILD_GSDCLOUD "Build GSD Cloud Analysis Library" OFF)
-option(BUILD_MGBF "Build MGBF Library" OFF)
+option(BUILD_MGBF "Build MGBF Library" ON)
 option(BUILD_GSI "Build GSI" ON)

src/gsi/hybrid_ensemble_isotropic.F90

 ! For MGBF
-#ifdef USE_MGBF
+#ifdef USE_MGBF_def
   use mg_intstate
   use mg_timers
 #endif /* End of USE_MGBF_def block */

Build gsi.x and enkf.x. Run ctests. All tests Passed. This is an expected result because with the above changes we do not actually alter the build with respect to develop.
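As an aside, the default in CMakeLists.txt does not have to be edited to run this kind of test: `option()` creates a cached variable, so it can be overridden at configure time, which makes it easy to compare both settings from one working copy (paths below are illustrative):

```shell
# Configure with MGBF disabled, without touching CMakeLists.txt
cmake -DBUILD_MGBF=OFF -B build -S .

# ...and re-enable it in a second build tree for comparison
cmake -DBUILD_MGBF=ON -B build_mgbf -S .
```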

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , based on discussions with EIB we cannot implement operational code using an intel-classic build. All supporting libraries and modules must be built with the same compiler.

We should look ahead to intel OneAPI. A future implementation of spack-stack on WCOSS might use intel OneAPI. The GSI build could then be updated to intel OneAPI. Doing so might resolve the non-reproducible results currently found with this PR when compiled on WCOSS using intel/19.

One path forward for this PR is to default BUILD_MGBF to ON in the top level CMakeLists.txt. This reverts CMakeLists.txt to what is currently in develop. However, all the other changes in this PR remain. Then at a later date when the WCOSS compiler is upgraded we can retest with BUILD_MGBF set to OFF.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Thanks a lot for laying out a path forward. That is great!
Please give me a couple of days to see if I can make some progress on this issue.
Thanks again!

@dkokron

dkokron commented Jul 25, 2024

The issue with "floating divide by zero" in CRTM_Planck_Functions.f90 using a debug build of GSI has been seen and reported before. I don't recall who was working on the issue, but the following modification to src/gsi/read_iasi.f90 resolves the failure.

433 ! Allocate arrays to hold data
434 ! The number of channels is obtained from the satinfo file being used.
435 nele=nreal+satinfo_nchan
436 allocate(data_all(nele,itxmax),nrec(itxmax))
437 allocate(temperature(1))    ! dependent on # of channels in the bufr file
438 allocate(allchan(2,1))      ! actual values set after ireadsb
439 allocate(bufr_chan_test(1)) ! actual values set after ireadsb
440 allocate(scalef(1))
441
442 data_all(:,:)=zero          ! <-- added line

@RussTreadon-NOAA mentioned this pull request Jul 30, 2024
@RussTreadon-NOAA
Contributor

@TingLei-NOAA , what is the status of this PR?

@TingLei-NOAA
Contributor

@RussTreadon-NOAA , on Cactus I encountered the reproducibility issue for the "optimized" GSI even with USE_MGBF on. Considering also the previously reported success of the "optimized" GSI with USE_MGBF off (neither run started from a freshly cloned branch, though a "clean" build was done in both cases), I'd prefer to keep this PR in "draft" status and investigate it more thoroughly in the future.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , your branch TingLei-daprediction:feature/gsi_mgbf_turned_off is two commits behind the current head of develop. GSI PR #743 alters global_4denvar and global_enkf results.

Please keep your branch in sync with the head of the authoritative develop.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Thanks for the reminder. My local branch on Cactus has been updated as below:

commit 4e85ebd906d8c2761873fa41cf6b6f4a0d47876b (HEAD -> feature/gsi_mgbf_turned_off, origin/feature/gsi_mgbf_turned_off)
Merge: f3ef08c7 3e27bb8a
Author: Tinglei-daprediction <leiting2002@gmail.com>
Date:   Fri Jul 26 12:45:53 2024 +0000

    Merge branch 'develop' into feature/gsi_mgbf_turned_off

commit 3e27bb8ada1c7202ea1eb99957f0b143464410c7 (origin/develop, origin/HEAD, develop)
Author: RussTreadon-NOAA <26926959+RussTreadon-NOAA@users.noreply.github.com>
Date:   Mon Jul 22 10:53:55 2024 -0400

    Update global_4denvar and global_enkf ctests to reflect GFS v17 (#774)

I will sync my branch on GitHub soon.
Thanks.

Successfully merging this pull request may close these issues.

Set BUILD_MGBF default to OFF