Remove HIRS from GSI regression tests #672

Merged: 3 commits merged into NOAA-EMC:develop on Jan 18, 2024

Conversation

@ADCollard (Contributor) commented Dec 21, 2023

Description

With the update to CRTM 2.4.0.1, an issue with versioning of CRTM HIRS files has been exposed. This is documented in JCSDA/spack-stack#922 and JCSDA/crtm#42.

As HIRS radiances are no longer used in the GSI, these data should be removed from processing until a suitable update to CRTM is available.

Resolves #673

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Build and run ctests on Hera

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

DUE DATE for merger of this PR into develop is 2/1/2024 (six weeks after PR creation).

@RussTreadon-NOAA (Contributor) left a comment:

Changes look good. Approve

@RussTreadon-NOAA (Contributor):

@ADCollard, we need two peer reviews. Would you please assign two peer reviewers? This is a straightforward PR; the review should not take long.

I'm running regression tests on Orion and will post results in the PR later. Can you run ctests on WCOSS2 and Hera to ensure everything works as expected?

@ADCollard (Contributor, Author):

@RussTreadon-NOAA ctests are running on Hera. I will start the tests on WCOSS2.

@RussTreadon-NOAA (Contributor):

Orion ctests

Test project /work2/noaa/da/rtreadon/git/gsi/regression/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  1387.23 sec
2/7 Test #2: rtma .............................   Passed  2234.87 sec
3/7 Test #5: hafs_4denvar_glbens ..............   Passed  2964.23 sec
4/7 Test #7: global_enkf ......................   Passed  2980.83 sec
5/7 Test #3: rrfs_3denvar_glbens ..............   Passed  3054.81 sec
6/7 Test #6: hafs_3denvar_hybens ..............   Passed  3106.43 sec
7/7 Test #1: global_4denvar ...................   Passed  3258.46 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 3258.71 sec

The global_4denvar, hafs_3denvar_hybens, hafs_4denvar_glbens, netcdf_fv3_regional tests process HIRS4 data.

The first three tests monitor the HIRS4 data, and removing monitored data does not alter analysis results.

The netcdf_fv3_regional test assimilates NOAA-19 and Metop-A HIRS4 data. Removing the HIRS4 data alters the netcdf_fv3_regional analysis results, but the removal is done in both the update and the control, hence the Passed result for the netcdf_fv3_regional test.

@ADCollard (Contributor, Author):

I am getting a very small difference in the final penalty values for the global_4denvar test on WCOSS. I do not know why this is. I am going to try to re-run with a freshly built version of control and develop.

This straightforward PR might take a little longer than expected.

@RussTreadon-NOAA (Contributor):

Hmm, we faced a similar issue with PR #616. What's up with WCOSS2? Orion results are identical between develop and ADCollard:develop. I'll try Hera.

@RussTreadon-NOAA (Contributor):

Hera ctests

Hera(hfe04):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr672/build$ ctest -j 7
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr672/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  975.26 sec
2/7 Test #7: global_enkf ......................   Passed  1449.94 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  3316.17 sec
4/7 Test #1: global_4denvar ...................   Passed  3482.03 sec
5/7 Test #5: hafs_4denvar_glbens ..............   Passed  3516.99 sec
6/7 Test #6: hafs_3denvar_hybens ..............   Passed  4114.00 sec
7/7 Test #2: rtma .............................   Passed  5902.02 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 5902.11 sec

Orion and Hera ctest results are consistent. All tests pass on both machines.

@ADCollard (Contributor, Author):

I re-ran the WCOSS2 ctests with clean builds. I get two failures:

andrew.collard@dlogin06:/lfs/h2/emc/da/noscrub/andrew.collard/git/regtest/GSI_ADC/build> cat ctest.out 
Test project /lfs/h2/emc/da/noscrub/andrew.collard/git/regtest/GSI_ADC/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............***Failed  485.05 sec
2/7 Test #7: global_enkf ......................   Passed  620.58 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  670.34 sec
4/7 Test #2: rtma .............................***Failed  1157.13 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1213.30 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1394.29 sec
7/7 Test #1: global_4denvar ...................   Passed  1505.60 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1505.61 sec

The following tests FAILED:
          2 - rtma (Failed)
          4 - netcdf_fv3_regional (Failed)

The rtma failure is for wall time:

The runtime for rtma_hiproc_updat is 214.941543 seconds.  This has exceeded maximum allowable threshold time of 201.688033 seconds,
resulting in Failure of timethresh2 the regression test.

The netcdf_fv3_regional failure is for memory:

The memory for netcdf_fv3_regional_loproc_updat is 283992 KBs.  This has exceeded maximum allowable memory of 203645 KBs,
resulting in Failure memthresh of the regression test.

Not sure why this would be, so re-running these two tests.

@ADCollard (Contributor, Author):

The second run of netcdf_fv3_regional again fails because of memory, and the result is even worse. I am also not sure why the maximum allowable memory has changed:

The memory for netcdf_fv3_regional_loproc_updat is 305184 KBs.  This has exceeded maximum allowable memory of 197722 KBs,
resulting in Failure memthresh of the regression test.
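
The moving limit suggests the thresholds are not fixed constants. As a rough sketch (this is an assumption about how the check works, not the actual GSI regression script), the allowable wall time and memory may be derived from the control run's measured usage scaled by a tolerance factor; the function name and factors below are purely illustrative:

```python
# Hypothetical sketch of a control-relative threshold check; names and tolerance
# factors are illustrative, not taken from the GSI regression scripts.
def check_thresholds(updat_time, contrl_time, updat_mem, contrl_mem,
                     time_factor=1.25, mem_factor=1.20):
    """Compare an update run's resources against limits derived from the control run."""
    failures = []

    max_time = contrl_time * time_factor  # allowable wall time tracks the control run
    if updat_time > max_time:
        failures.append(
            f"runtime {updat_time:.1f} s exceeded allowable {max_time:.1f} s (timethresh)")

    max_mem = contrl_mem * mem_factor     # allowable memory tracks the control run
    if updat_mem > max_mem:
        failures.append(
            f"memory {updat_mem} KB exceeded allowable {max_mem:.0f} KB (memthresh)")

    return failures

# Illustrative values: a memory failure is reported, and if the control run's measured
# memory drifts between submissions, the derived "maximum allowable" drifts with it.
print(check_thresholds(updat_time=215.0, contrl_time=195.0,
                       updat_mem=305184, contrl_mem=250000))
```

Under that assumption, run-to-run variation in the control run's own memory usage would explain why the maximum allowable value changes between submissions.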

@RussTreadon-NOAA (Contributor):

Thank you @ADCollard for rerunning the WCOSS2 tests.

The rtma failure is not a fatal failure. The timing check often generates false positives; a failure can reflect an issue with system load more than an actual issue with the code.

The netcdf_fv3_regional result is odd. Why would the task 0 maximum resident set size vary between repeated submissions of the test? Same script, same executables, same data ... this should yield the same memory usage statistics. Maybe the memory usage returned by w3tagb and w3tage isn't what we think it is on WCOSS2. This is a question for the library team and/or system administrators.

While the netcdf_fv3_regional result is intriguing, I do not view this as a fatal fail.

@RussTreadon-NOAA (Contributor):

I installed ADCollard:develop on Dogwood along with the current head of develop and ran the rtma and netcdf_fv3_regional ctests with the following results.

russ.treadon@dlogin06:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build> ctest -R rtma
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build
    Start 2: rtma
1/1 Test #2: rtma .............................   Passed  972.43 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 972.44 sec
russ.treadon@dlogin06:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build> ctest -R netcdf_fv3_regional 
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build
    Start 4: netcdf_fv3_regional
1/1 Test #4: netcdf_fv3_regional ..............   Passed  484.50 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 484.51 sec

Below are the rtma gsi.x wall times for each job

tmpreg_rtma/rtma_hiproc_contrl/stdout:The total amount of wall time                        = 180.486884
tmpreg_rtma/rtma_hiproc_updat/stdout:The total amount of wall time                        = 175.442958
tmpreg_rtma/rtma_loproc_contrl/stdout:The total amount of wall time                        = 196.570069
tmpreg_rtma/rtma_loproc_updat/stdout:The total amount of wall time                        = 194.194533

Below are the task 0 maximum resident set sizes for the netcdf_fv3_regional jobs

tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 361028
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 361784
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 302376
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 305108

@RussTreadon-NOAA (Contributor):

I reran the netcdf_fv3_regional test on Dogwood and observed behavior similar to what @ADCollard found. While the rerun passed,

russ.treadon@dlogin06:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build> ctest -R netcdf_fv3_regional 
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr672/build
    Start 4: netcdf_fv3_regional
1/1 Test #4: netcdf_fv3_regional ..............   Passed  484.44 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 484.54 sec

the task 0 maximum resident set size differs from the first run:

russ.treadon@dlogin09:/lfs/h2/emc/ptmp/russ.treadon/pr672> grep resident  tmpreg_netcdf_fv3_regional/*/stdout
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 362736
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 363396
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 280036
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 212156

I cannot explain the run-to-run difference in the task 0 maximum resident set size.

@ADCollard (Contributor, Author):

@RussTreadon-NOAA Thanks for running the WCOSS2 regression tests! I am sorry you had to do that.

Where do we go from here? I cannot see how minor changes such as these can cause this behavior, so I assume something more fundamental is going on?

@RussTreadon-NOAA (Contributor):

I reran the ctests on Orion and note similar behavior with regard to differing task 0 maximum resident set sizes. For example:

netcdf_fv3_regional run 1

tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 297004
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 298948
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 302504
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 305772

netcdf_fv3_regional run 2

tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 304284
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 250416
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 305904
tmpreg_netcdf_fv3_regional/netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 252720

I will reach out to George V. to see if he has observed similar behavior and, more importantly, has an explanation.

@RussTreadon-NOAA (Contributor):

Based on George's informative reply, we should reconsider using the task 0 maximum resident set size in the GSI regression test check script.

There are several possibilities. The most likely is a race condition in allocation and deallocation, such that deallocated address space does not become available until shortly after an allocation, so the allocation gets a larger address and creates a temporary hole in the address space. This hole likely fills in later with other allocations. MPI buffers are a common source of this. We can make memory use much more predictable by making MPI synchronous and getting rid of asynchronous calls as much as possible. The performance impact from this is often much less than one would think, because communication is itself expensive and the picture of CPUs sitting around waiting for communication is often inaccurate on modern high-speed interconnects.

W3tage accurately measures the maximum resident set size. I don't think it's misleading us.

Another possibility is very sparse data structures, which land differently on page boundaries from run to run, so sometimes the same structure uses many more real pages. I do not know the current page sizes of our systems, but this is more of an issue with large pages, if we support them. Large pages are otherwise a huge advantage because managing the page table is a significant cost on large-memory machines.
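
On the MPI point above: the following is a minimal, hypothetical sketch (Python with mpi4py, not the GSI's actual Fortran communication code) contrasting nonblocking sends, whose buffers must persist until the requests complete and so can shift the memory high-water mark from run to run, with synchronous sends, whose buffer lifetimes are deterministic:

```python
# Illustrative only; assumes mpi4py and NumPy. Run with exactly two ranks:
#   mpiexec -n 2 python sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank

payload = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty_like(payload)

# Asynchronous pattern: send/receive buffers (and any internal staging) stay live
# until the requests complete, so peak memory depends on how requests happen to overlap.
req_send = comm.Isend(payload, dest=peer, tag=0)
req_recv = comm.Irecv(recv_buf, source=peer, tag=0)
MPI.Request.Waitall([req_send, req_recv])

# Synchronous pattern: Ssend does not complete until the matching receive has started,
# so buffer lifetimes (and the resident-set high-water mark) are predictable. The ranks
# are ordered to avoid deadlock.
if rank == 0:
    comm.Ssend(payload, dest=peer, tag=1)
    comm.Recv(recv_buf, source=peer, tag=1)
else:
    comm.Recv(recv_buf, source=peer, tag=1)
    comm.Ssend(payload, dest=peer, tag=1)
```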

@RussTreadon-NOAA (Contributor):

@ADCollard, I think this PR can be merged pending @emilyhcliu's review. Thank you for making these changes.

@ADCollard (Contributor, Author):

@RussTreadon-NOAA @emilyhcliu Thank you! Sorry it took so long - nothing is ever simple, it seems!

@DavidHuber-NOAA (Collaborator):

@RussTreadon-NOAA @ADCollard Just thought I would add that w3tagb and w3tage work by reading the memory usage from the /proc/<pid> files, which are updated by the OS. These are not updated constantly, so it may not be surprising that you are getting different results, especially if WCOSS2 updates them at a slower rate than other systems.
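
For reference, here is a minimal sketch (Python, assuming a Linux system) of reading the peak resident set size from the OS-maintained /proc files described above, cross-checked against getrusage; the VmHWM field is the process's resident-set high-water mark:

```python
# Illustrative sketch only; this is not the w3tagb/w3tage implementation.
import resource

def vm_hwm_kb(pid="self"):
    """Return the peak resident set size (VmHWM) in kB from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmHWM:"):
                return int(line.split()[1])  # field is reported in kB
    return None

# On Linux, ru_maxrss from getrusage is reported in kilobytes.
rusage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"/proc VmHWM      : {vm_hwm_kb()} kB")
print(f"getrusage maxrss : {rusage_kb} kB")
```

If the /proc values are refreshed less frequently on WCOSS2, the number sampled at program end could lag the true high-water mark, which would be consistent with the run-to-run scatter seen above.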

@RussTreadon-NOAA (Contributor):

Thank you @DavidHuber-NOAA for the additional information. I didn't know how w3tagb and w3tage get memory stats.

@DavidHuber-NOAA (Collaborator):

I just wanted to check on the status of this PR. I'm starting to do an official upgrade to spack-stack version 1.6.0, including CRTM v2.4.0.1, which will need HIRS removed.

@RussTreadon-NOAA (Contributor):

@emilyhcliu, would you please review this PR?

@emilyhcliu (Contributor):

I am catching up on my e-mails. @RussTreadon-NOAA, I am reviewing the PR now.

@emilyhcliu (Contributor):

It looks like the issue with the regression tests has been resolved.
Approved! Thanks for your patience.

@RussTreadon-NOAA merged commit 7f37de3 into NOAA-EMC:develop on Jan 18, 2024. 4 checks passed.