Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with topo linking, pass in stretched grid options #14

Merged
merged 12 commits into from
May 17, 2023

Conversation

mathomp4
Copy link
Member

@mathomp4 mathomp4 commented Apr 24, 2023

As found by @wmputman and others, the python remap script did not work for stretched grid. Working with @weiyuan-jiang, the issue was found that the BCs dirs for the stretched grid had both nc4 and binary topographies, but the Fortran code doing the remapping in FV3 can only read binary.

But, the python was linking in the NC4 and boom. Bad things happen (aka NaNs). This is a fix to make sure only the binary .data files are found.

Also, working with @bena-nasa it was found the correct way to do the stretched grid is to pass in options to interp_restarts.x and not rely solely on the namelist.

@mathomp4 mathomp4 added the 0 diff The changes in this pull request have verified to be zero-diff with the target branch. label Apr 24, 2023
@mathomp4 mathomp4 requested review from a team as code owners April 24, 2023 14:09
sanAkel
sanAkel previously approved these changes Apr 24, 2023
@biljanaorescanin biljanaorescanin marked this pull request as draft April 24, 2023 19:09
@biljanaorescanin
Copy link
Contributor

All, I moved it to draft until we test/confirm few more things.

@mathomp4 mathomp4 changed the title Fix issue with topo linking Fix issue with topo linking, pass in stretched grid options Apr 25, 2023
@biljanaorescanin
Copy link
Contributor

biljanaorescanin commented May 3, 2023

I definitely have these warning and error messages with regrid.pl. @mathomp4 ... I run regrid.pl with on screen questions not with command line options.
sbatch: error: Batch job submission failed: Invalid qos specification
Use of uninitialized value $cnlist[0] in concatenation (.) or string at ./regrid.pl line 1657, line 24.
Use of uninitialized value $cnlist[0] in concatenation (.) or string at ./regrid.pl line 2914, line 26.
Use of uninitialized value $cnlist[0] in concatenation (.) or string at ./regrid.pl line 2981, line 26.

I think it is same case you've run C1440 Ostia 137 lev to C2880 Ostia 137 lev.

@mathomp4
Copy link
Member Author

mathomp4 commented May 3, 2023

@biljanaorescanin Yeah, those "Use of uninitialized value" seem to be legion with regrid.pl. I usually ignore them if the script works. Does it work for you?

@biljanaorescanin
Copy link
Contributor

Upper restarts have first error:
sbatch: error: Batch job submission failed: Invalid qos specification

There is no "high" qos on discover
@gmao-rreichle
Copy link
Contributor

@mathomp4: Many thanks for fixing the issue with remap_restarts. I'm surprised, however, that regrid.pl is edited as part of this PR. The new remap_restarts utilities should supersede regrid.pl, which should be obsolete. After talking with @biljanaorescanin, my understanding is that regrid.pl cannot work for stretched-grid restarts because the associated stretched-grid bcs were created in the revised (and much more sensible) directory tree and naming convention. It's of course possible to provide the stretched-grid bcs in the old directory tree and naming convention, but we're trying hard to get away from this. Is there a reason why regrid.pl is strung along here? And do we really want to have a set of stretched-grid bcs in the old directory tree and naming conventions?
In short, my strong preference would be to not touch regrid.pl.

@mathomp4
Copy link
Member Author

mathomp4 commented May 4, 2023

@gmao-rreichle I edited regrid.pl mainly because:

  1. @wmputman was still using it(probably because of the the issues with stretched grid in remap_restarts), and I guessed if I fixed one, I fix the other.
  2. @biljanaorescanin found a bug that's been there for...years? I dunno. But don't want the bug there.

But we can fix it in this one and announce its deprecation. At least then we can say regrid.pl is obsolete and will not be updated. Then we can move to remove it.

Though that is up to @sdrabenh and others. I just wanted to fix bugs I saw.

@gmao-rreichle
Copy link
Contributor

@gmao-rreichle I edited regrid.pl mainly because:

  1. @wmputman was still using it(probably because of the the issues with stretched grid in remap_restarts), and I guessed if I fixed one, I fix the other.
  2. @biljanaorescanin found a bug that's been there for...years? I dunno. But don't want the bug there.

But we can fix it in this one and announce its deprecation. At least then we can say regrid.pl is obsolete and will not be updated. Then we can move to remove it.

Though that is up to @sdrabenh and others. I just wanted to fix bugs I saw.

@mathomp4: Thanks for the insights. It makes sense to fix bugs since regrid.pl is still used. But I don't think it makes sense to edit regrid.pl for the stretched-grid options unless we also make it work with bcs provided in the updated directory tree and file name convention. Without these additional modifications regrid.pl will still not work for the stretched-grid restarts. What's worse, the current incomplete edits of regrid.pl towards stretched-grid functionality may make the user think that regrid.pl should work for stretched-grid restarts when in fact does not. The only other alternative is to maintain new bcs in the old dir tree and file name convention. I don't think we want to go there.
@wmputman, @sdrabenh, please advise.
cc: @biljanaorescanin @weiyuan-jiang

@biljanaorescanin
Copy link
Contributor

biljanaorescanin commented May 10, 2023

I find this error in both python and regrid.pl
I will share path to files with @weiyuan-jiang . @mathomp4 let me know did you encounter same issue when running from Bill's restart files: /discover/nobackup/projects/gmao/osse2/Ops12KM_R/forecasts_EC/06KM/Feature-c1440_L137/restarts/
BCS trying to be used that are mismatched: /discover/nobackup/projects/gmao/share/gmao_ops/fvInput/g5gcm/bcs/Icarus-NLv3/Icarus-NLv3_Ostia/

Used extreme measures 0 times

Writing CH
FAIL at line=00037 NetCDF4_get_var.H <Unable to get variable: CH from file: InData/Feature-c1440_L137.openwater_internal_rst.e20230417_21z.GEOSgcm-v11.0.0.Icarus-NLv3_Ostia_CF1440x6C_CF1440x6C.nc4>
MAPL_VarReadNCpar_R4_1dError reading variable 1
Operation not permitted

FAIL at line=02085 NCIO.F90 <status=1>
FAIL at line=00145 mk_LakeLandiceSaltRestarts.F90 <status=1>
Attempting to use an MPI routine before initializing MPICH

I don't see this error if I run from M2 restarts.

@biljanaorescanin
Copy link
Contributor

Turns out this error I had is due to input restarts we try to remap didn't match our input boundary conditions. Once that was changed there is no error. That is why M2 option worked from start since that is already predefined (all except date from which to start). I wonder could Bill's run crash also related to this mismatch ... since he did see at some point error pointing to LAI and that is coming from bcs files? What do you think @mathomp4 ? I will do more testing but so far all looks ok.

@gmao-rreichle
Copy link
Contributor

Did the linked bcs have a different number of land tiles than the restart file? If so, it should be trivial to implement a check and clean exit with a helpful error message.
If the linked bcs and the restart file have the same number of tiles, but the linked bcs are still not consistent with the restart file, we can check the parameter values and exit cleanly with a helpful error message. Some (but not all) of the bcs parameters are in the restart file. It should be possible to verify these parameters against the values in the linked bcs files.
If faintly remember that in an older version we were actually reading some parameters twice, once from the restart and again from the linked bcs. (Maybe I'm wrong on this.) In any case, by reading and comparing some key parameters from the linked bcs and the restart file we can check for inconsistency. Such a check won't catch all cases of inconsistency, but it should help.

@biljanaorescanin
Copy link
Contributor

@gmao-rreichle yes adding check would solve this.

One other problem I see with this PR is with regid.pl and boundary conditions.

For example if we run c1440 to c540 it will looks for bcs files on designated location which doesn't host them:

Error; Cannot find bcs directory: /discover/nobackup/projects/gmao/share/gmao_ops/fvInput/g5gcm/bcs/Icarus-NLv3/Icarus-NLv3_Ostia/CF1440x6C_CF1440x6C; at ./regrid.pl line 2185, line 24.

we don't have that problem with remap_restarts.py since we can give our own bcs path.

But it really is a game of chance what I am running, I could have picked some other resolution and find it at that location. In this case right path would be this one: /gpfsm/dnb03/projects/p38/stage/BCS_FILES/Icarus-NLv3/Icarus-NLv3_Ostia/

@sdrabenh what do you think? Should we just go with what we have now and focus development to remap_restats.py or try to figure out what happens with missing resolutions and regrid.pl?

@gmao-rreichle
Copy link
Contributor

gmao-rreichle commented May 11, 2023

@biljanaorescanin:
I'm still unclear if in your earlier error message the linked bcs were inconsistent in the number of land tiles or in the parameter values.
Also, re. this error:

Error; Cannot find bcs directory: /discover/nobackup/projects/gmao/share/gmao_ops/fvInput/g5gcm/bcs/Icarus-NLv3/Icarus-NLv3_Ostia/CF1440x6C_CF1440x6C; at ./regrid.pl line 2185, line 24.

This particular error isn't as problematic because it generates a helpful error message. My understanding is that the error you encountered in testing this PR did not have a helpful error message, and that's what I'm most concerned about.

@biljanaorescanin
Copy link
Contributor

@gmao-rreichle yes that first error was not helpful, and yes number of tiles was different mismatched.
@weiyuan-jiang confirmed that the tile number with type = 0 in the tile file does not match the tile number in the openwater restart file.  Basically, all restart files don't match the input bcs. I was trying to run with restart files I taught were using one set of bcs files but actually they were created using other. Once I picked correct set that was resolved. Still any user could make a mistake...
So as you said error message should resolve this in case of mismatch but maybe @weiyuan-jiang can say more

@sdrabenh
Copy link
Contributor

@sdrabenh what do you think? Should we just go with what we have now and focus development to remap_restats.py or try to figure out what happens with missing resolutions and regrid.pl?

My initial thoughts are to just focus development on remap_restarts.py. It works with the new structure and is cleaner and more people know python. I don't view maintaining 2 different scripts that do the same thing as sustainable or practical in the long-term. As soon as @weiyuan-jiang has the inline command line options working like regrid.pl, I am fine with removing regrid.pl from GEOS_Util. That will be the only way to get users to stop using that pearl script.

@wmputman
Copy link
Contributor

I have a regrid.pl that 'works' for most of these resolutions.

I just committed this to feature/wmputman/hwt_modifications @GEOS_Util

Many of the 'missing' BCS locs point to my project space

@wmputman
Copy link
Contributor

But, ideally, we just need to move the BCS files I have staged in project space to the proper locations

…l only be implemented into remap_restarts.py package
@gmao-rreichle
Copy link
Contributor

gmao-rreichle commented May 12, 2023

After discussion with Scott and Bill, we decided to implement the stretched-grid option only in the new remap_restart.py package, and not in the old regrid.pl script (which in any case would have needed additional mods to accommodate the new bcs output dir structure and naming conventions). I reverted regrid.pl back to the main branch.

DONE: check merge compatibility of this PR with PR #17.

Updated 12 May 2023, 5pm: This PR is now compatible with PR #17, see #17 (comment)

@biljanaorescanin biljanaorescanin marked this pull request as ready for review May 12, 2023 19:20
@weiyuan-jiang
Copy link
Contributor

The error message is already in the log. But maybe we can do more and make it exit without moving forward.

@sdrabenh sdrabenh merged commit 675827d into main May 17, 2023
5 checks passed
@sdrabenh sdrabenh deleted the hotfix/mathomp4/fix-topo-link branch May 17, 2023 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 diff The changes in this pull request have verified to be zero-diff with the target branch.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants