Remove HIRS from GSI regression tests #672

Merged: 3 commits merged into NOAA-EMC:develop on Jan 18, 2024

Conversation

@ADCollard (Contributor) commented Dec 21, 2023

Description

With the update to CRTM 2.4.0.1, an issue with versioning of CRTM HIRS files has been exposed. This is documented in JCSDA/spack-stack#922 and JCSDA/crtm#42.

As HIRS radiances are no longer used in the GSI, these data should be removed from processing until a suitable update to CRTM is available.

Resolves #673

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Build and run ctests on Hera

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

DUE DATE for merger of this PR into develop is 2/1/2024 (six weeks after PR creation).

@RussTreadon-NOAA (Contributor) left a comment:

Changes look good. Approve

@RussTreadon-NOAA (Contributor):

@ADCollard, we need two peer reviews. Would you please assign two peer reviewers? This is a straightforward PR; the review should not take long.

I'm running regression tests on Orion and will post results in the PR later. Can you run ctests on WCOSS2 and Hera to ensure everything works as expected?

@ADCollard (Contributor, Author):

@RussTreadon-NOAA ctests are running on Hera. I will start the tests on WCOSS2.

@RussTreadon-NOAA (Contributor):

Orion ctests

Test project /work2/noaa/da/rtreadon/git/gsi/regression/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  1387.23 sec
2/7 Test #2: rtma .............................   Passed  2234.87 sec
3/7 Test #5: hafs_4denvar_glbens ..............   Passed  2964.23 sec
4/7 Test #7: global_enkf ......................   Passed  2980.83 sec
5/7 Test #3: rrfs_3denvar_glbens ..............   Passed  3054.81 sec
6/7 Test #6: hafs_3denvar_hybens ..............   Passed  3106.43 sec
7/7 Test #1: global_4denvar ...................   Passed  3258.46 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 3258.71 sec

The global_4denvar, hafs_3denvar_hybens, hafs_4denvar_glbens, netcdf_fv3_regional tests process HIRS4 data.

The first three tests monitor the HIRS4 data, and removing monitored data does not alter analysis results.

The netcdf_fv3_regional test assimilates NOAA-19 and Metop-A HIRS4 data. Removing the HIRS4 data alters the netcdf_fv3_regional analysis results, but the removal is done in both the update and the control, hence the Passed result for the netcdf_fv3_regional test.

@ADCollard (Contributor, Author):

I am getting a very small difference in the final penalty values for the global_4denvar test on WCOSS. I do not know why this is. I am going to try to re-run with a freshly built version of control and develop.

This straightforward PR might take a little longer than expected.

@RussTreadon-NOAA (Contributor):

Hmm, we faced a similar issue with PR #616. What's up with WCOSS2? Orion results are identical between develop and ADCollard:develop. I'll try Hera.

@RussTreadon-NOAA (Contributor):

Hera ctests

Hera(hfe04):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr672/build$ ctest -j 7
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr672/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  975.26 sec
2/7 Test #7: global_enkf ......................   Passed  1449.94 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  3316.17 sec
4/7 Test #1: global_4denvar ...................   Passed  3482.03 sec
5/7 Test #5: hafs_4denvar_glbens ..............   Passed  3516.99 sec
6/7 Test #6: hafs_3denvar_hybens ..............   Passed  4114.00 sec
7/7 Test #2: rtma .............................   Passed  5902.02 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 5902.11 sec

Orion and Hera ctest results are consistent. All tests pass on both machines.

@ADCollard (Contributor, Author):

I re-ran the WCOSS2 ctests with clean builds. I get two failures:

andrew.collard@dlogin06:/lfs/h2/emc/da/noscrub/andrew.collard/git/regtest/GSI_ADC/build> cat ctest.out 
Test project /lfs/h2/emc/da/noscrub/andrew.collard/git/regtest/GSI_ADC/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............***Failed  485.05 sec
2/7 Test #7: global_enkf ......................   Passed  620.58 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  670.34 sec
4/7 Test #2: rtma .............................***Failed  1157.13 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1213.30 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1394.29 sec
7/7 Test #1: global_4denvar ...................   Passed  1505.60 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1505.61 sec

The following tests FAILED:
          2 - rtma (Failed)
          4 - netcdf_fv3_regional (Failed)

The rtma failure is for wall time:

The runtime for rtma_hiproc_updat is 214.941543 seconds.  This has exceeded maximum allowable threshold time of 201.688033 seconds,
resulting in Failure of timethresh2 the regression test.

The netcdf_fv3_regional failure is for memory:

The memory for netcdf_fv3_regional_loproc_updat is 283992 KBs.  This has exceeded maximum allowable memory of 203645 KBs,
resulting in Failure memthresh of the regression test.

Not sure why this would be, so re-running these two tests.

@ADCollard (Contributor, Author):

The second run of netcdf_fv3_regional again fails because of memory, and the result is even worse. I am also not sure why the maximum allowable memory has changed:

The memory for netcdf_fv3_regional_loproc_updat is 305184 KBs.  This has exceeded maximum allowable memory of 197722 KBs,
resulting in Failure memthresh of the regression test.
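
The moving limit suggests the thresholds are not fixed constants. As a rough sketch (this is an assumption about how the check works, not the actual GSI regression script), the allowable wall time and memory may be derived from the control run's measured usage scaled by a tolerance factor; the function name and factors below are purely illustrative:

```python
# Hypothetical sketch of a control-relative threshold check; names and tolerance
# factors are illustrative, not taken from the GSI regression scripts.
def check_thresholds(updat_time, contrl_time, updat_mem, contrl_mem,
                     time_factor=1.25, mem_factor=1.20):
    """Compare an update run's resources against limits derived from the control run."""
    failures = []

    max_time = contrl_time * time_factor  # allowable wall time tracks the control run
    if updat_time > max_time:
        failures.append(
            f"runtime {updat_time:.1f} s exceeded allowable {max_time:.1f} s (timethresh)")

    max_mem = contrl_mem * mem_factor     # allowable memory tracks the control run
    if updat_mem > max_mem:
        failures.append(
            f"memory {updat_mem} KB exceeded allowable {max_mem:.0f} KB (memthresh)")

    return failures

# Illustrative values: a memory failure is reported, and if the control run's measured
# memory drifts between submissions, the derived "maximum allowable" drifts with it.
print(check_thresholds(updat_time=215.0, contrl_time=195.0,
                       updat_mem=305184, contrl_mem=250000))
```

Under that assumption, run-to-run variation in the control run's own memory usage would explain why the maximum allowable value changes between submissions.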

@RussTreadon-NOAA (Contributor):

Thank you @ADCollard for rerunning the WCOSS2 tests.

The rtma failure is not a fatal failure. The timing check often generates false positives; a failure can reflect an issue with system load more than an actual issue with the code.

The netcdf_fv3_regional result is odd. Why would the task 0 maximum resident set size vary between repeated submissions of the test? Same script, same executables, same data ... this should yield the same memory usage statistics. Maybe the memory usage returned by w3tagb and w3tage isn't what we think it is on WCOSS2. This is a question for the library team and/or system administrators.

While the netcdf_fv3_regional result is intriguing, I do not view this as a fatal fail.

@RussTreadon-NOAA (Contributor):

I installed ADCollard:develop on Dogwood along with the current head of develop and ran the rtma and netcdf_fv3_regional ctests with the following results.

russ.treadon@dlogin06:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build> ctest -R rtma
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build
    Start 2: rtma
1/1 Test #2: rtma .............................   Passed  972.43 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 972.44 sec
russ.treadon@dlogin06:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build> ctest -R netcdf_fv3_regional 
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build
    Start 4: netcdf_fv3_regional
1/1 Test #4: netcdf_fv3_regional ..............   Passed  484.50 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 484.51 sec

Below are the rtma gsi.x wall times for each job

tmpreg_rtma/rtma_hiproc_contrl/stdout:The total amount of wall time                        = 180.486884
tmpreg_rtma/rtma_hiproc_updat/stdout:The total amount of wall time                        = 175.442958
tmpreg_rtma/rtma_loproc_contrl/stdout:The total amount of wall time                        = 196.570069
tmpreg_rtma/rtma_loproc_updat/stdout:The total amount of wall time                        = 194.194533

Below are the task 0 maximum resident set sizes for the netcdf_fv3_regional jobs

tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 361028
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 361784
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 302376
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 305108

@RussTreadon-NOAA (Contributor):

I reran the netcdf_fv3_regional test on Dogwood and observed behavior similar to what @ADCollard found. While the rerun passed,

russ.treadon@dlogin06:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build> ctest -R netcdf_fv3_regional 
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build
    Start 4: netcdf_fv3_regional
1/1 Test #4: netcdf_fv3_regional ..............   Passed  484.44 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 484.54 sec

the task 0 maximum resident set size differs from the first run:

russ.treadon@dlogin09:/lfs/h2/emc/ptmp/russ.treadon/pr672> grep resident  tmpreg_netcdf_fv3_regional/*/stdout
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 362736
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 363396
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 280036
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 212156

I cannot explain the run-to-run difference in the task 0 maximum resident set size.

@ADCollard (Contributor, Author):

@RussTreadon-NOAA Thanks for running the WCOSS2 regression tests! I am sorry you had to do that.

Where do we go from here? I cannot see how minor changes such as these can cause this behavior, so I assume something more fundamental is going on?

@RussTreadon-NOAA (Contributor):

I reran the ctests on Orion and note similar behavior with regard to differing task 0 maximum resident set sizes. For example:

netcdf_fv3_regional run 1

tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 297004
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 298948
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 302504
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 305772

netcdf_fv3_regional run 2

tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 304284
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 250416
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 305904
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 252720

I will reach out to George V. to see if he has observed similar behavior and, more importantly, has an explanation.

@RussTreadon-NOAA (Contributor):

Based on George's informative reply, we should reconsider using the task 0 maximum resident set size in the GSI regression test check script.

There are several possibilities. The most likely is a race condition in allocation and deallocation, such that deallocated address space does not become available until shortly after an allocation, so the allocation gets a larger address and creates a temporary hole in the address space. This hole likely fills in later with other allocations. MPI buffers are a common source of this. We can make memory use much more predictable by making MPI synchronous and getting rid of asynchronous calls as much as possible. The performance impact from this is often much less than one would think, because communication is itself expensive and the picture of CPUs sitting around waiting for communication is often inaccurate on modern high-speed interconnects.

W3tage accurately measures the maximum resident set size. I don't think it's misleading us.

Another possibility is very sparse data structures, which land differently on page boundaries from run to run, so sometimes the same structure uses many more real pages. I do not know the current page sizes of our systems, but this is more of an issue with large pages, if we support them. Large pages are otherwise a huge advantage because managing the page table is a significant cost on large-memory machines.
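
On the MPI point above: the following is a minimal, hypothetical sketch (Python with mpi4py, not the GSI's actual Fortran communication code) contrasting nonblocking sends, whose buffers must persist until the requests complete and so can shift the memory high-water mark from run to run, with synchronous sends, whose buffer lifetimes are deterministic:

```python
# Illustrative only; assumes mpi4py and NumPy. Run with exactly two ranks:
#   mpiexec -n 2 python sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank

payload = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty_like(payload)

# Asynchronous pattern: send/receive buffers (and any internal staging) stay live
# until the requests complete, so peak memory depends on how requests happen to overlap.
req_send = comm.Isend(payload, dest=peer, tag=0)
req_recv = comm.Irecv(recv_buf, source=peer, tag=0)
MPI.Request.Waitall([req_send, req_recv])

# Synchronous pattern: Ssend does not complete until the matching receive has started,
# so buffer lifetimes (and the resident-set high-water mark) are predictable. The ranks
# are ordered to avoid deadlock.
if rank == 0:
    comm.Ssend(payload, dest=peer, tag=1)
    comm.Recv(recv_buf, source=peer, tag=1)
else:
    comm.Recv(recv_buf, source=peer, tag=1)
    comm.Ssend(payload, dest=peer, tag=1)
```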

@RussTreadon-NOAA (Contributor):

@ADCollard, I think this PR can be merged pending @emilyhcliu's review. Thank you for making these changes.

@ADCollard (Contributor, Author):

@RussTreadon-NOAA @emilyhcliu Thank you! Sorry it took so long - nothing is ever simple, it seems!

@DavidHuber-NOAA (Collaborator):

@RussTreadon-NOAA @ADCollard Just thought I would add that w3tagb and w3tage work by reading the memory usage from the /proc/<pid> files, which are updated by the OS. These are not updated constantly, so it may not be surprising that you are getting different results, especially if WCOSS2 updates them at a slower rate than other systems.
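
For reference, here is a minimal sketch (Python, assuming a Linux system) of reading the peak resident set size from the OS-maintained /proc files described above, cross-checked against getrusage; the VmHWM field is the process's resident-set high-water mark:

```python
# Illustrative sketch only; this is not the w3tagb/w3tage implementation.
import resource

def vm_hwm_kb(pid="self"):
    """Return the peak resident set size (VmHWM) in kB from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmHWM:"):
                return int(line.split()[1])  # field is reported in kB
    return None

# On Linux, ru_maxrss from getrusage is reported in kilobytes.
rusage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"/proc VmHWM      : {vm_hwm_kb()} kB")
print(f"getrusage maxrss : {rusage_kb} kB")
```

If the /proc values are refreshed less frequently on WCOSS2, the number sampled at program end could lag the true high-water mark, which would be consistent with the run-to-run scatter seen above.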

@RussTreadon-NOAA (Contributor):

Thank you @DavidHuber-NOAA for the additional information. I didn't know how w3tagb and w3tage get memory stats.

@DavidHuber-NOAA (Collaborator):

I just wanted to check on the status of this PR. I'm starting to do an official upgrade to spack-stack version 1.6.0, including CRTM v2.4.0.1, which will need HIRS removed.

@RussTreadon-NOAA (Contributor):

@emilyhcliu, would you please review this PR?

@emilyhcliu (Contributor):

I am catching up on my e-mails. @RussTreadon-NOAA, I am reviewing the PR now.

@emilyhcliu (Contributor):

It looks like the issue with the regression tests has been resolved.
Approved! Thanks for your patience.

@RussTreadon-NOAA merged commit 7f37de3 into NOAA-EMC:develop on Jan 18, 2024. 4 checks passed.