Remap broken in going from MOM5 ICA ➡️ MOM6 NL3 on mil with develop version #55

Closed
sanAkel opened this issue Feb 2, 2024 · 14 comments

Comments

@sanAkel
Contributor

sanAkel commented Feb 2, 2024

A few things with the remap script:

  1. The yaml file format has changed, which means old files that worked in the past simply can't be used.
    I'm not asking for backward compatibility or a translator, but please note down here what has changed (see the sketch after this list). Appreciate it!

  2. Output from my test case is at: /discover/nobackup/sakella/restarts/emc_golden_period/output/*_rst; I ran on Milan/SCU17.
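For illustration, one way to spot what changed between an old and a new yaml config is to diff their top-level keys. This is a minimal sketch; the file names below are placeholders, not the actual config names used by remap_restarts.py.

```python
# Minimal sketch: compare the top-level keys of an old and a new remap yaml
# config. File names are placeholders, not the actual remap_restarts.py names.
import yaml  # PyYAML

def top_level_diff(old_path, new_path):
    with open(old_path) as f:
        old = yaml.safe_load(f) or {}
    with open(new_path) as f:
        new = yaml.safe_load(f) or {}
    # Keys dropped or renamed show up here; keys added show up below.
    print("keys only in the old config:", sorted(set(old) - set(new)))
    print("keys only in the new config:", sorted(set(new) - set(old)))

if __name__ == "__main__":
    top_level_diff("remap_params_old.yaml", "remap_params_new.yaml")
```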

A run from those restarts left me with a super interesting crash! :-o /discover/nobackup/sakella/free_run_gold_per/slurm-28387344.out

A run using restarts that I have from the past works fine!

mepo status is as follows; everything is at GEOSgcm v11.5.1, except for the develop branch of GEOS_Util (as of 02/01/2024).

GEOSgcm                | (t) v11.5.1 (DH)
env                    | (t) v4.25.1 (DH)
cmake                  | (t) v3.38.0 (DH)
ecbuild                | (t) geos/v1.3.0 (DH)
NCEP_Shared            | (t) v1.3.0 (DH)
GMAO_Shared            | (t) v1.9.7 (DH)
GEOS_Util              | (b) main
MAPL                   | (t) v2.43.1 (DH)
FMS                    | (t) geos/2019.01.02+noaff.8 (DH)
GEOSgcm_GridComp       | (t) v2.5.1 (DH)
FVdycoreCubed_GridComp | (t) v2.10.0 (DH)
fvdycore               | (t) geos/v2.8.1 (DH)
GEOSchem_GridComp      | (t) v1.13.1 (DH)
HEMCO                  | (t) geos/v2.2.3 (DH)
geos-chem              | (t) geos/v13.0.0-rc1 (DH)
GOCART                 | (t) sdr_v2.2.1.1 (DH)
QuickChem              | (t) v1.0.0 (DH)
TR                     | (t) v1.1.0 (DH)
GMI                    | (t) v1.1.0 (DH)
StratChem              | (t) v1.0.0 (DH)
GEOS_OceanGridComp     | (t) v2.1.5 (DH)
mom                    | (t) geos/5.1.0+1.2.0 (DH)
mom6                   | (t) geos/v2.2.3 (DH)
mit                    | (t) checkpoint68o (DH)
cice6                  | (t) geos/v0.1.3 (DH)
icepack                | (t) geos/v0.1.1 (DH)
sis2                   | (t) geos/v0.0.1 (DH)
GEOSradiation_GridComp | (t) v1.6.0 (DH)
RRTMGP                 | (t) geos/v1.6+1.1.0 (DH)
ww3                    | (t) v6.07.1-geos-r2 (DH)
umwm                   | (t) v2.0.0-geos-r1 (DH)
GEOSgcm_App            | (t) v2.3.1 (DH)
UMD_Etc                | (t) v1.3.0 (DH)
CPLFCST_Etc            | (t) v1.0.1 (DH)

cc: @GEOS-ESM/gmao-si-team

@sanAkel
Contributor Author

sanAkel commented Feb 2, 2024

More info:

  1. I cloned version v10.26.0 of GEOSgcm (the version I wish to use is v11.5.1; see ⬆️), built it, and ran the remap_restarts app from there (on sky/cas, as opposed to my wish of using mil). Output is at: /discover/nobackup/sakella/restarts/emc_golden_period/output-GEOSgcm-v10.26.0 ➡️ seems okay!

  2. A log of the GEOSgcm run with these restarts is at /discover/nobackup/sakella/free_run_gold_per/slurm-28448900.out and has none of the errors seen in /discover/nobackup/sakella/free_run_gold_per/slurm-28387344.out

This establishes beyond doubt that there is a problem somewhere:

  • Either the boundary conditions are bad?
  • Or the code and/or script(s) have a bug somewhere?

@gmao-rreichle
Contributor

@sanAkel : Sorry for these issues. I can't say what the problem might be. Did you use "MOM[x] tripolar grid" bcs from:
/discover/nobackup/projects/gmao/bcs_shared/fvInput/ExtData/esm/tiles/NL3/
I thought those were "manually" created by Yury and Sarith. It's anyone's guess if those would work sensibly with remap restarts, or a recent version of the GCM.

@sanAkel
Contributor Author

sanAkel commented Feb 2, 2024

@sanAkel : Sorry for these issues. I can't say what the problem might be. Did you use "MOM[x] tripolar grid" bcs from:

/discover/nobackup/projects/gmao/bcs_shared/fvInput/ExtData/esm/tiles/NL3/

I thought those were "manually" created by Yury and Sarith. It's anyone's guess if those would work sensibly with remap restarts, or a recent version of the GCM.

Yes @gmao-rreichle, that's the path to the bcs that the script fills in.

If there is a good reason to believe that it's garbage, I would insist on the script echoing a warning that it won't work.

But having someone go through a futile exercise is meaningless!

To be specific, either:

  1. The app gives up: raise a warning and exit (a hypothetical sketch of such a guard is below).
  2. Or let the user fill in a bcs and DIY.
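
For option 1, something along these lines is what I have in mind. This is a hypothetical sketch only; the function name and the list of validated paths are placeholders, and nothing like this exists in remap_restarts.py today.

```python
# Hypothetical guard: warn and exit when the bcs directory has not been
# validated for remapping. Names and paths below are placeholders.
import sys

VALIDATED_BCS_DIRS = {
    "/path/to/validated/bcs",  # placeholder; to be maintained by the SI team
}

def check_bcs_or_exit(bcs_dir: str) -> None:
    if bcs_dir not in VALIDATED_BCS_DIRS:
        print(f"WARNING: bcs at {bcs_dir} have not been validated with "
              "remap_restarts; the remapped restarts may not work. Exiting.",
              file=sys.stderr)
        sys.exit(1)
```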

@gmao-rreichle
Contributor

If there is a good reason to believe that it's garbage, I would insist on the script echoing a warning that it won't work.
But having someone go through a futile exercise is meaningless!

Agreed, and sorry again for the troubles. The problem is that we rely on the others in GMAO to find these kinds of issues before we can document and (hopefully) fix them. The cleanup and reorganization of the bcs and remap_restarts is a massive undertaking, and the ocean bcs have been particularly gnarly. We do what we can, but we can't possibly test everything on our end. Someone has to be the guinea pig. Thanks for being patient.

@sanAkel
Contributor Author

sanAkel commented Feb 3, 2024

@gmao-rreichle no worries at all! Thank you!

It is a massive undertaking to maintain and develop an earth system model like ours!

Whenever @GEOS-ESM/gmao-si-team gets to it is fine; I have a restart set that works in the meantime.

@gmao-rreichle
Contributor

@sanAkel : We are a bit lost as to what the exact problem is. Per @biljanaorescanin on the separate Teams chat, there was an issue in some of the archived bcs files, but I understand that has been corrected now, and those were not used by you anyway. Besides documentation fixes, the only remaining problem (i.e., something not working) that I can see in the comments above is your crash report on Milan:

A run from those restarts left me with a super interesting crash! :-o /discover/nobackup/sakella/free_run_gold_per/slurm-28387344.out

Can this issue be summarized as "The coupled model does not work on Milan" ? If so, maybe start a new issue, or change the title of this issue and confirm in a new comment what exactly the problem is. Thanks!

cc: @weiyuan-jiang

@sanAkel changed the title from "Remap broken in going from MOM5 ICA ➡️ MOM6 NL3" to "Remap broken in going from MOM5 ICA ➡️ MOM6 NL3 on mil with develop version" on Feb 5, 2024
@sanAkel
Contributor Author

sanAkel commented Feb 5, 2024

@sanAkel : We are a bit lost as to what the exact problem is. Per @biljanaorescanin on the separate Teams chat, there was an issue in some of the archived bcs files, but I understand that has been corrected now, and those were not used by you anyway. Besides documentation fixes, the only remaining problem (i.e., something not working) that I can see in the comments above is your crash report on Milan:

A run from those restarts left me with a super interesting crash! :-o /discover/nobackup/sakella/free_run_gold_per/slurm-28387344.out

Can this issue be summarized as "The coupled model does not work on Milan" ? If so, maybe start a new issue, or change the title of this issue and confirm in a new comment what exactly the problem is. Thanks!

cc: @weiyuan-jiang

@gmao-rreichle I edited this issue title.

@gmao-rreichle
Contributor

Thanks, @sanAkel. If I can rephrase the title change of the present Issue in my own words, you're finding that remap_restarts.py produces bad restarts on Milan with the "develop" version? But @biljanaorescanin told me (somewhere in the depths of Teams) that she can reproduce binary identical restarts across SLES12 and SLES15. Since remap_restarts.py is in the GEOS_Util external repo, which only has a "main" branch (not "develop"), is it possible that the GEOS_Util release in components.yaml of the coupled model is out of date?

@biljanaorescanin
Contributor

I used this PR to make restarts: #53
SLES12 vs SLES15 restarts are zero-diff. So if the SLES12-created restarts worked, SLES15 has to work as well.

Here is the full path for SLES15:
/discover/nobackup/borescan/brisi/util_pr_43/feb_2/testing/c180_ica_72_mom5_to_c180_nl3_72_mom6_sles15
Here is the full path for SLES12:
/discover/nobackup/borescan/brisi/util_pr_43/feb_2/testing/c180_ica_72_mom5_to_c180_nl3_72_mom6_72
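
For anyone who wants to re-check the zero-diff claim, a minimal sketch comparing these two directories file by file (assumes both contain the same set of restart file names):

```python
# Minimal sketch: check that the SLES12 and SLES15 output directories quoted
# above hold bit-identical files.
import filecmp
from pathlib import Path

sles15 = Path("/discover/nobackup/borescan/brisi/util_pr_43/feb_2/testing/"
              "c180_ica_72_mom5_to_c180_nl3_72_mom6_sles15")
sles12 = Path("/discover/nobackup/borescan/brisi/util_pr_43/feb_2/testing/"
              "c180_ica_72_mom5_to_c180_nl3_72_mom6_72")

names = sorted(p.name for p in sles15.iterdir() if p.is_file())
# shallow=False forces a byte-by-byte comparison instead of stat-only.
match, mismatch, errors = filecmp.cmpfiles(sles15, sles12, names, shallow=False)
print(f"{len(match)} identical, {len(mismatch)} differ, {len(errors)} missing/unreadable")
for name in mismatch:
    print("differs:", name)
```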

Maybe the problem is not the restarts but the coupled model setup on SLES15?

I can double-check main, but on the linked PR we only changed the upper restarts relative to remap; that shouldn't make the coupled run fail, it would just give a different starting point.

@sanAkel
Contributor Author

sanAkel commented Feb 5, 2024

@biljanaorescanin, @gmao-rreichle, as time permits, I'll re-clone and test it all again, as long as the discover file system is stable.

Please let's pause until then.

Meanwhile if @weiyuan-jiang wants to troubleshoot, that's fine with me!

@gmao-rreichle
Contributor

I'll re-clone and test it all again, as long as the discover file system is stable.

@sanAkel: Before you get to build and run, I suggest looking at the release/branch info for GEOS_Util and GEOSgcm_GridComp. If the "wrong" (outdated) versions are associated with the coupled model, remap_restarts.py wouldn't be expected to work.
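
A minimal sketch of how one could check what is pinned, assuming the usual components.yaml layout where each top-level entry may carry a tag, branch, or sha field (adjust if the schema differs):

```python
# Minimal sketch: print the version pinned in components.yaml for a couple of
# components. Run from the top of the GEOSgcm clone; the tag/branch/sha field
# names are assumptions about the schema, not guaranteed.
import yaml  # PyYAML

with open("components.yaml") as f:
    components = yaml.safe_load(f)

for name in ("GEOS_Util", "GEOSgcm_GridComp"):
    info = components.get(name, {}) or {}
    pinned = info.get("tag") or info.get("branch") or info.get("sha", "<not found>")
    print(f"{name}: {pinned}")
```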

@sanAkel
Contributor Author

sanAkel commented Feb 5, 2024

@gmao-rreichle point noted.

For the case/testing I have right now, I indeed have GEOS_Util at develop and the rest at v11.5.1; see above.

@gmao-rreichle
Contributor

For the case/testing I have right now, I indeed have GEOS_Util at develop and the rest at v11.5.1; see #55 (comment).

Sorry, @sanAkel, I missed that you already provided all these details. Given this info and @biljanaorescanin's success in reproducing identical restarts on SLES12 and SLES15, I'm having a hard time believing that there's anything wrong with remap_restarts.py per se that would have caused your coupled model failure. It's still possible that something was wrong in the configuration of your remapping of the restarts, but if remap_restarts.py produced the expected files, that's also unlikely. I agree, running the test again and seeing if the failure is reproducible is the logical next step.

@sanAkel
Contributor Author

sanAkel commented Feb 5, 2024

Will reopen this issue if I can reproduce the error!

@sanAkel closed this as completed on Feb 5, 2024