GFS_phys_time_vary_init does not report errmsg/errflg correctly due to thread race condition #1031

Open
SamuelTrahanNOAA opened this issue Sep 21, 2023 · 0 comments
@SamuelTrahanNOAA
Collaborator

Description

Normally I don't cross-post bugs between forks, but this is a pretty big one. I want to make sure everyone is aware.

I reported it in the UFS fork already: ufs-community#105

GFS_phys_time_vary_init is parallelized using OpenMP sections, but it does not handle errmsg or errflg correctly. All threads update the same errmsg and errflg, so a failure reported by one step can be overwritten by a success reported by a step that finishes later.

To visualize this, suppose two threads are running at once. For simplicity's sake, let's say there are only two initialization calls: init_that_fails() and init_that_succeeds().

Failure happens first

Events happened in this order:

Thread 1: Completes init_that_fails() and sets errflg=1
Thread 2: Completes init_that_succeeds() and sets errflg=0

The errflg is 0, so the model will run even though one of the initialization steps failed.

Failure happens second

Events happened in this order:

Thread 2: Completes init_that_succeeds() and sets errflg=0
Thread 1: Completes init_that_fails() and sets errflg=1

The errflg is 1, so the model will abort as expected.
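
The two orderings above can be replayed in a small sketch. This is not the CCPP code (which is Fortran with OpenMP). It is a Python illustration that replays the completion order deterministically instead of using real threads. The functions init_that_fails() and init_that_succeeds() are the hypothetical calls from the example. The (errflg, errmsg) return convention, the message text, and the run_shared/run_private helpers are assumptions made for illustration. run_private shows one possible way to avoid the overwrite, by keeping a separate result per step and merging afterwards. It is not a fix proposed in this issue.

```python
# Hypothetical stand-ins for the two initialization calls in the example.
# Each returns (errflg, errmsg); a nonzero errflg means failure.
def init_that_fails():
    return 1, "init_that_fails: initialization failed"


def init_that_succeeds():
    return 0, ""


def run_shared(completion_order):
    """Replay completions where every step writes the SAME errflg/errmsg."""
    shared = {"errflg": 0, "errmsg": ""}
    for step in completion_order:
        # Unconditional overwrite: the last step to finish decides the outcome.
        shared["errflg"], shared["errmsg"] = step()
    return shared["errflg"], shared["errmsg"]


def run_private(completion_order):
    """Replay completions where each step keeps its own errflg/errmsg."""
    results = [step() for step in completion_order]
    # Merge after all steps are done: the first recorded failure is reported.
    for errflg, errmsg in results:
        if errflg != 0:
            return errflg, errmsg
    return 0, ""


failure_first = [init_that_fails, init_that_succeeds]
failure_second = [init_that_succeeds, init_that_fails]

print("shared,  failure first: ", run_shared(failure_first))
print("shared,  failure second:", run_shared(failure_second))
print("private, failure first: ", run_private(failure_first))
print("private, failure second:", run_private(failure_second))
```

With the shared variables, the failure survives only when the failing step finishes last. With per-step results, the failure is reported in both orderings.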

Steps to Reproduce


  1. Delete noahmptable.tbl
  2. Use a scheme that does not require that file.
  3. Run the model a few times with at least two threads.
  4. Notice that it fails sporadically instead of 100% of the time.

Additional Context

This was discovered in an RRFS parallel. The machine, compiler, etc. do not matter. However, the easiest way to see it is to run a non-NOAHMP suite without noahmptable.tbl.
