increase tolerance for lgetkf.x reference check #1148

Closed
RussTreadon-NOAA opened this issue Jun 6, 2024 · 6 comments

@RussTreadon-NOAA
Contributor

test_gdasapp_atm_jjob_ens_run using GDASApp develop at 825f19c (update JEDI hashes) fails on Hercules. This test passes on Hera and Orion.

The Hercules failure occurs in the reference check that runs after lgetkf completes.

0: OOPS_STATS Run end                                  - Runtime:    456.38 sec,  Memory: total:    22.59 Gb, per task: min =     3.36 Gb, max =     4.06 Gb
0: Run: Finishing oops::LocalEnsembleDA<FV3JEDI, UFO and IODA observations> with status = 0
0: terminate called after throwing an instance of 'oops::TestReferenceFloatMismatchError'
0:   what():  Test reference Float mismatch @ Line:149
0: Test Val : 3.9113397703941590e-04
0: Ref  Val : 3.9113337546012496e-04
0: Delta    : 6.0157929093924978e-10
0: Relative tolerance: 3.9113367624977039e-10
0: Absolute tolerance: 0.0000000000000000e+00
0: Test Line: 'cloud_liquid_ice                             | Min:+0.0000000000000000e+00 Max:+3.9113397703941590e-04 RMS:+1.0484802023479406e-05'
0: Ref Line : 'cloud_liquid_ice                             | Min:+0.0000000000000000e+00 Max:+3.9113337546012496e-04 RMS:+1.0484801913924773e-05'
srun: error: hercules-07-15: task 0: Aborted (core dumped)

The input yaml ends with

test:
  reference filename: /work2/noaa/da/rtreadon/git/global-workflow/pr2641_hercules/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
  test output filename: ./lgetkf.out
  float relative tolerance: 1e-06
  float absolute tolerance: 0.0
  integer tolerance: 0

Increasing float relative tolerance to 1e-05 allows the reference check to pass.
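
For illustration, the numbers in the log above are consistent with the check comparing the delta against the relative tolerance multiplied by the mean of the test and reference values: 1e-06 times that mean reproduces the printed "Relative tolerance" of about 3.911e-10. This is an inference from the printed values, not a statement about how OOPS implements the check. A minimal Python sketch under that assumption, using a hypothetical helper name:

```python
def within_relative_tolerance(test_val, ref_val, rel_tol):
    """Hypothetical check: pass when |test - ref| <= rel_tol * mean(test, ref).

    The mean-based threshold is an assumption inferred from the log output,
    not taken from the OOPS source code.
    """
    delta = abs(test_val - ref_val)
    threshold = rel_tol * abs(0.5 * (test_val + ref_val))
    return delta <= threshold


# Values copied from the Hercules log for cloud_liquid_ice Max.
test_val = 3.9113397703941590e-04
ref_val = 3.9113337546012496e-04

print(within_relative_tolerance(test_val, ref_val, 1e-06))  # False: delta ~6.0e-10 > ~3.9e-10
print(within_relative_tolerance(test_val, ref_val, 1e-05))  # True: delta ~6.0e-10 <= ~3.9e-09
```

The delta is only about 1.5 times the 1e-06 threshold, so the Hercules result misses the tolerance narrowly.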

1e-06 works on Orion and Hera. Test test_gdasapp_atm_jjob_ens_run does not yet run on WCOSS2. It is possible that a larger float relative tolerance is needed on WCOSS2.

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Jun 6, 2024
@RussTreadon-NOAA
Contributor Author

Repeated this test on Cactus. test_gdasapp_atm_jjob_ens_run passes the reference check on Cactus with float relative tolerance set to 1e-06.

OOPS_STATS Run end                                  - Runtime:    415.42 sec,  Memory: total:    10.94 Gb, per task: min =     1.41 Gb, max =     2.11 Gb
Run: Finishing oops::LocalEnsemblnid002305.cactus.wcoss2.ncep.noaa.gov 0: eDA<FV3JEDI, UFO and IODA observations> with status = 0
nid002305.cactus.wcoss2.ncep.noaa.gov 0: [TestReference] Comparison is done
OOPS Ending   2024-06-06 17:26:21 (UTC+0000)
Application 9072bd70-3674-4906-baf9-4a6f7343b9f6 resources: utime=2394s stime=47s maxrss=2064092KB inblock=1975532 oublock=2299120 minflt=22567728 majflt=268 nvcsw=44485 nivcsw=1010
2024-06-06 17:26:21,374 - INFO     - atmens_analysis:   END: pygfs.task.atmens_analysis.letkf
2024-06-06 17:26:21,375 - DEBUG    - atmens_analysis:  returning: None
+ 134467411.cbqs01.SC[21]: status=0
+

@RussTreadon-NOAA
Contributor Author

@DavidNew-NOAA, what do you think? Should we increase float relative tolerance to 1e-05 in order to get test_gdasapp_atm_jjob_ens_run to pass on all supported machines?

One thing which bothers me is why we need to increase the tolerance by an order of magnitude on Hercules. The var test passes on Hercules with 1e-06. 1e-06 works as the tolerance for the ens test on other supported machines. Hercules is the outlier for the ens test. Why?

@DavidNew-NOAA
Collaborator

@RussTreadon-NOAA I have float relative tolerance set to 1e-03 and float absolute tolerance set to 1e-05 for test_gdasapp_atm_jjob_ens_run and test_gdasapp_atm_jjob_var_run. Could you clarify?

@RussTreadon-NOAA
Contributor Author

Thank you @DavidNew-NOAA for your question. This prompted me to look more closely at our jcb files.

parm/jcb-algorithms/local_ensemble_da.yaml.j2 contains

  float relative tolerance: {{test_float_relative_tolerance | default(1.0e-6, true)}}
  float absolute tolerance: {{test_float_absolute_tolerance | default(0.0, true) }}
  integer tolerance: {{test_integer_tolerance | default(0, true) }}

test/atm/global-workflow/jcb-prototype_lgetkf.yaml.j2 contains

# Testing things
# --------------
test_reference_filename: {{ HOMEgfs }}/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
test_output_filename: ./lgetkf.out
float_relative_tolerance: 1.0e-3
float_absolute_tolerance: 1.0e-5

Note that the float keywords above do not include the test_ prefix. Thus the ens_init job winds up using the default values of 1e-06 and 0.0 when creating the input yaml for the ens_run job.
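
To make the fallback concrete, here is a small stdlib-only Python sketch that mimics the Jinja2 expression `{{ name | default(value, true) }}`, which substitutes the default when the variable is undefined or falsy. The helper `jinja_default` and the two context dictionaries are hypothetical stand-ins for illustration. They are not part of jcb or GDASApp.

```python
def jinja_default(context, name, fallback):
    """Mimic `{{ name | default(fallback, true) }}`.

    Returns the fallback when the variable is undefined or falsy.
    Hypothetical helper; the real rendering is done by Jinja2.
    """
    value = context.get(name)
    return value if value else fallback


# Keys as written in the prototype before the fix (no test_ prefix).
before = {"float_relative_tolerance": 1.0e-3, "float_absolute_tolerance": 1.0e-5}
# Keys after adding the test_ prefix.
after = {"test_float_relative_tolerance": 1.0e-3, "test_float_absolute_tolerance": 1.0e-5}

for label, context in (("before", before), ("after", after)):
    rel = jinja_default(context, "test_float_relative_tolerance", 1.0e-6)
    abs_tol = jinja_default(context, "test_float_absolute_tolerance", 0.0)
    print(label, rel, abs_tol)
```

With the unprefixed keys, the lookup never finds the names the template reads, so the defaults are used without any error or warning.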

I added the prefix test_ to the float_ keywords in jcb-prototype_lgetkf.yaml.j2 and reran test_gdasapp_atm_jjob_ens_init. Now I see the desired values in enkfgdas.t18z.atmens.yaml

test:
  reference filename: /work2/noaa/da/rtreadon/git/global-workflow/pr2641_hercules/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
  test output filename: ./lgetkf.out
  float relative tolerance: 0.001
  float absolute tolerance: 1e-05
  integer tolerance: 0

Which did you intend? Do we want users to override default tolerances via keywords starting with test_, or drop test_ and set the float_ keywords?

@DavidNew-NOAA
Collaborator

@RussTreadon-NOAA Ah, yes, nice catch. They should match, so we can change the jcb prototypes for the jjob test to use test_float_relative_tolerance and test_float_absolute_tolerance.
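
Because a misnamed key falls back to the default silently, a simple consistency check could catch this kind of mismatch earlier. The sketch below is only an illustration: the function names are hypothetical, the template text is the three lines quoted earlier in this thread, and the check covers only the missing test_ prefix case.

```python
import re

# Template lines quoted earlier from parm/jcb-algorithms/local_ensemble_da.yaml.j2.
template_text = """
  float relative tolerance: {{test_float_relative_tolerance | default(1.0e-6, true)}}
  float absolute tolerance: {{test_float_absolute_tolerance | default(0.0, true) }}
  integer tolerance: {{test_integer_tolerance | default(0, true) }}
"""

# Keys set by the prototype before the fix, as quoted earlier in this thread.
prototype_keys = [
    "test_reference_filename",
    "test_output_filename",
    "float_relative_tolerance",
    "float_absolute_tolerance",
]


def template_variables(text):
    """Return the leading variable name of each {{ ... }} expression."""
    return set(re.findall(r"\{\{\s*([A-Za-z_]\w*)", text))


def keys_missing_prefix(keys, text, prefix="test_"):
    """Return keys the template does not read but would read with the prefix added."""
    variables = template_variables(text)
    return sorted(k for k in keys if k not in variables and prefix + k in variables)


print(keys_missing_prefix(prototype_keys, template_text))
# ['float_absolute_tolerance', 'float_relative_tolerance']
```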

CoryMartin-NOAA pushed a commit that referenced this issue Jun 7, 2024
@RussTreadon-NOAA caught a bug in the JCB prototype files for the
jjob tests. They are missing "test_" in the keywords for the float
tolerances, so that the jjob tests are just using the defaults.

#1148 (comment)
@RussTreadon-NOAA
Contributor Author

Resolved by #1154
