Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zero-diff bug fixes in make_bcs package #601

Merged
merged 21 commits into from
Sep 8, 2022

Conversation

weiyuan-jiang
Copy link
Contributor

@weiyuan-jiang weiyuan-jiang commented Jun 24, 2022

This PR addresses issue #596 .

Without passing the range into LRRasterize(), the "completed successfully" message was never printed out for EASE grid.

if(.not.failed) print *, 'Rasterization completed successfully'

PR contains additional cleanup and bug fixes. Without the fixes, array indices were out of bounds and array elements were assigned garbage numbers, hence the non-zero-diff results for some intermediate files generated in M09 and M36 bcs. These intermediate files do not seem to be used and are no longer created in the present branch, so effectively the PR is 0-diff for make_bcs.

The PR is trivially 0-diff for running the GCM or GEOSldas.

Adds functionality to point out non-0-diff results between:

  • bcs generated with the current code and
  • archived bcs used in ops that were generated with an older compiler.
    In this case, the default string for the output dir is appended to clarify expected non-0-diff results from make_bcs.

@weiyuan-jiang weiyuan-jiang requested a review from a team as a code owner June 24, 2022 16:25
@weiyuan-jiang weiyuan-jiang added the bugfix This fixes a bug label Jun 24, 2022
@gmao-rreichle
Copy link
Contributor

@weiyuan-jiang, I have one more question: I can see how passing the min/max lat/lon into LRRasterize() is important for the EASE and EASEv2 grids because these grid do not fully cover the globe. However, in #596 @biljanaorescanin reports that she hasn't seen the "Rasterization completed successfully" message for the CF grids either: #596 (comment)
Since the CF grids should provide full global coverage and (I assume) their min/max lat/lon values match the defaults [-90,90], [-180,180] that are hardcoded in LRRasterize(), the PR should not change anything for the CF grids. Consequently, we still would not see the "successfully completed" message for the CF grids. Then the question becomes if that's ok or if we're missing a related issue for the CF grids.
In any case, the key is for @biljanaorescanin to 0-diff test the PR for at least a few representative EASEv2 and CF tile spaces.

@weiyuan-jiang weiyuan-jiang added the Non 0-diff The changes in this pull request are non-zero-diff label Jun 25, 2022
@weiyuan-jiang
Copy link
Contributor Author

I think this PR is non-zero diff for EASEv2 grid. For CF grid, it should be zero-diff. I have printed out the x values, it is exact 180.000000 and -180.000000. Of course it is highly possible it could be 180.000001 since the last digit is not reliable. But this PR it is already non-zero diff anyway.

@weiyuan-jiang
Copy link
Contributor Author

We could not see the message "Rasterization completed successfully" because this message is redirected to >/dev/null . I tested NL3 C24 case. If I remove ">/dev/null" in the job script, the successful message would be printed out. @gmao-rreichle @biljanaorescanin

@weiyuan-jiang
Copy link
Contributor Author

This line is obviously a bug. tmpstring is not assigned at this point. We probably can just remove it.

@gmao-rreichle
Copy link
Contributor

@weiyuan-jiang, thanks for the explanation and adding the non-0-diff label. If the PR is indeed not 0-diff for EASEv2, figuring out how to proceed will need more thought. We would lose the 0-diff backward compatibility for all existing EASEv2 bcs datasets, including the brandnew NLv5 datasets.
We also still need to fix the issue in item 6. of #539, for which the solution may or may not be 0-diff.
Re. the following comment:

This line is obviously a bug. tmpstring is not assigned at this point. We probably can just remove it.

I agree. We should be able to just remove this line.

@mathomp4
Copy link
Member

Note from the Gallery, but if you are updating this file, I might recommend making the following changes:

  • iargc()command_argument_count()
  • getarg()get_command_argument()
  • getenv()get_environment_variable()
  • system()execute_command_line()

The commands on the left are all Fortran extensions that some compilers (e.g., NAG) do not support because, well, Fortran can do it now.

I have a WIP PR (#500) trying to change this everywhere in GCM GC, but that's something I only go back to every so often (low priority task).

But I figure this is one file where the changes can be tested to make sure they work! 😄

@biljanaorescanin
Copy link
Contributor

Nightly tests say this is zero diff as it is now. It is not zero diff for boundary conditions is that what you wanted to say @weiyuan-jiang ?

@weiyuan-jiang
Copy link
Contributor Author

Yes. These routines are used to create BCs.

@weiyuan-jiang
Copy link
Contributor Author

I think the PR is incomplete. There are many hard-coded range , LL ( low left) UR (upper right) for EASE grid. For example

It is not 180 any more.

@gmao-rreichle
Copy link
Contributor

Thanks for the update, @biljanaorescanin. Yes, the "non-0-diff" aspect should only apply to creating bcs, and it should only apply to EASEv2 bcs (but not sure about the latter). The important test is to generate a bcs datasets and see to what extent the output differs from what we generated with a recent tag (or "develop"). I suggest starting with NLv3 for c180 and EASEv2 M36.

@gmao-rreichle
Copy link
Contributor

[...] I might recommend making the following changes:

  • iargc()command_argument_count()
    [...]

Fine with me, although at this point it's hard to predict when we can merge this PR, and it's unlikely to be 0-diff for bcs, so throwing in an unrelated change that should be 0-diff may not be ideal.

@biljanaorescanin
Copy link
Contributor

I suggest starting with NLv3 for c180 and EASEv2 M36.

@gmao-rreichle I will tests this as soon as bcs package runs again we are still debugging at this moment.

@weiyuan-jiang
Copy link
Contributor Author

I pushed more changes to avoid crash in debug mode. It is slow for debug mode. I think we don't have to call "create_mapping" every time. We should be able to improve it.

@biljanaorescanin
Copy link
Contributor

biljanaorescanin commented Jun 30, 2022

@weiyuan-jiang there are still errors, this is from C24 NLv3 CS run:
Screen Shot 2022-06-30 at 1 43 16 PM

For EASE grid M09 NL3
Screen Shot 2022-06-30 at 1 45 48 PM

@weiyuan-jiang
Copy link
Contributor Author

@weiyuan-jiang there are still errors, this is from C24 NLv3 CS run: Screen Shot 2022-06-30 at 1 43 16 PM

For EASE grid M09 NL3 Screen Shot 2022-06-30 at 1 45 48 PM

I saw it and fixing it...

@weiyuan-jiang
Copy link
Contributor Author

@gmao-rreichle @biljanaorescanin Now I can finish make_bcs with NL3 and m9 choices. It is time to check the difference.

@biljanaorescanin
Copy link
Contributor

@weiyuan-jiang @gmao-rreichle with last commit we are able now to run both ease grid case and cf case with no errors. As we suspected it is not zero diff.

My control/develop EASE m36 NLv3 case from May: /discover/nobackup/borescan/brisi/develop_3/SMAP_EASEv2_M36
This branch:
/discover/nobackup/borescan/brisi/test_pr_601/try_4/bcs_ease/SMAP_EASEv2_M36

Files that differ are:
ar.new, bf.dat, catchment.def, cti_stats.dat, soil_param.dat, ts.dat , SMAP_EASEv2_M36_964x406.til , Pfafstetter.til, SMAP_EASEv2_M36_964x406.til, SMAP-EASEv2-M36-Pfafstetter.til and SMAP-EASEv2-M36.til file.

I can take a closer look at all the differences next week. But sole fact that .til and catchment.def are different make all past bcs files in comparison non zero diff. Some differences are not related to this branch, but to peat update like soil_param.dat has 5 tiles that are different. Maybe I should just rerun control to isolate just this PR impact, I didn't realize I didn't run these cases after last peat change.

For C90 NLv3 CS case similar differences:
Control/develop from May:
/discover/nobackup/borescan/brisi/develop_3/NL3/CF0090x6C_CF0090x6C
This branch:
/discover/nobackup/borescan/brisi/test_pr_601/try_4/bcs_cf/CF0090x6C_CF0090x6C

Same files are affected as in ease case summary.

#setenv NCPUS `/usr/bin/lscpu | grep '^CPU(s)' | cut -d ':' -f2 | head -1 `
#@ NCPUS = $NCPUS / 4
#@ NCPUS = $NCPUS * 3
setenv NCPUS 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@biljanaorescanin:
I'm not sure why NCPUS was hardcoded here, overwriting the earlier setting(s).
I'm also not sure if NCPUS is needed as a global environment variable. Maybe we could just use
set NCPUS = 20 ?
For now, I'm leaving it as an env variable, but if NCPUS is just a local variable, we should use "set", not "setenv".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can use:
set NCPUS = 20
I am still trying to figure out why we use 20 and not more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to Matt and he thinks it was either memory issue or it stopped scaling around then.

@biljanaorescanin
Copy link
Contributor

If we remove from make_bcs script this section:

if(${MGRID} == M09 | ${MGRID} == M36) then
bin/mkLandRaster.x -x ${NX} -y ${NY} -v -t ${NT}
bin/mkSMAPTilesPara_v2.x -smap_grid ${MGRID} -pfaf_til T
bin/CombineRasters.x -f 0 -t 232000000 ${THISGRID} Pfafstetter > /dev/null
bin/CombineRasters.x -t 232000000 ${THISGRID} ${THISGRID}-Pfafstetter
/bin/mv til/${THISGRID}_${THISGRID}-Pfafstetter.til til/${THISGRID}_${THISGRID}-Pfafstetter.ind
endif

We get zero diff for EASE grid cases m9 and m36. So without "if loop" section on this branch vs when compared to develop.

Two things:

  1. we never run EASE grid with AGCM
  2. we always run bcs script for EASE grid options with O1 ocean

So question is why did we ever have this section? Maybe there was preparation to run some cases with different oceans?

I am able to run GEOSldas without these files and experiments are zero diff if I use this branch produced m36 vs develop branch produced m36 bcs files. ( those /til files are never used in experiment setup)

@gmao-rreichle gmao-rreichle added 0 diff The changes in this pull request have verified to be zero-diff with the target branch. and removed 0 diff trivial The changes in this pull request are trivially zero-diff (documentation, build failure, &c.) Non 0-diff The changes in this pull request are non-zero-diff labels Aug 26, 2022
@gmao-rreichle
Copy link
Contributor

If we remove from make_bcs script this section:

if(${MGRID} == M09 | ${MGRID} == M36) then
bin/mkLandRaster.x -x ${NX} -y ${NY} -v -t ${NT}
bin/mkSMAPTilesPara_v2.x -smap_grid ${MGRID} -pfaf_til T
bin/CombineRasters.x -f 0 -t 232000000 ${THISGRID} Pfafstetter > /dev/null
bin/CombineRasters.x -t 232000000 ${THISGRID} ${THISGRID}-Pfafstetter
/bin/mv til/${THISGRID}_${THISGRID}-Pfafstetter.til til/${THISGRID}_${THISGRID}-Pfafstetter.ind
endif

We get zero diff for EASE grid cases m9 and m36. So without "if loop" section on this branch vs when compared to develop.

Two things:

  1. we never run EASE grid with AGCM
  2. we always run bcs script for EASE grid options with O1 ocean

So question is why did we ever have this section? Maybe there was preparation to run some cases with different oceans?

I am able to run GEOSldas without these files and experiments are zero diff if I use this branch produced m36 vs develop branch produced m36 bcs files. ( those /til files are never used in experiment setup)

@biljanaorescanin : Are you saying we could and should remove the lines in question? If that's the case, we should remove this block.

@biljanaorescanin
Copy link
Contributor

With this clean up we are zero diff to develop created ease grid m36 and m09.

This branch will not create files in 2 subdirectories /til and files in subdirectory /rst.

These files are in /rst :
image
And in /til :
image

This result is now same as we have for M03 resolution.

@biljanaorescanin biljanaorescanin changed the title non-zero-diff bug fixes in make_bcs package bug fixes in make_bcs package Aug 31, 2022
@gmao-rreichle gmao-rreichle changed the title bug fixes in make_bcs package zero-diff bug fixes in make_bcs package Aug 31, 2022
@gmao-rreichle
Copy link
Contributor

[...] I might recommend making the following changes:

  • iargc()command_argument_count()
    [...]

Fine with me, although at this point it's hard to predict when we can merge this PR, and it's unlikely to be 0-diff for bcs, so throwing in an unrelated change that should be 0-diff may not be ideal.

@biljanaorescanin, @mathomp4: Since the PR turned out to be 0-diff for make_bcs after all, it would be ok to now add the changes suggested by @mathomp4. If the final tests unexpectedly show non-0-diff, we'd have to undo the changes, but that's unlikely. So if you're still interested in cleaning this up, please go ahead.

@biljanaorescanin
Copy link
Contributor

@mathomp4 I've changed all in "Raster" directory.

image
I can change on some other branch all in "mk_restarts" (since this PR is just raster related).

@biljanaorescanin biljanaorescanin marked this pull request as ready for review September 1, 2022 15:39
@biljanaorescanin biljanaorescanin removed the Contingent - DNA These changes are contingent on other PRs (DNA=do not approve) label Sep 1, 2022
@gmao-rreichle
Copy link
Contributor

@sdrabenh, I just approved this PR for the land group. It's trivially 0-diff for running the GCM because the changes are confined to the ./Raster (bcs generation) directory. We successfully tested the bcs generation. Please let me know if you have any questions.

@sdrabenh sdrabenh merged commit 3dea2e4 into develop Sep 8, 2022
@sdrabenh sdrabenh deleted the bugfix/wjiang/pass_range_to_easev2 branch September 8, 2022 16:58
@gmao-rreichle gmao-rreichle linked an issue Nov 23, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 diff The changes in this pull request have verified to be zero-diff with the target branch. bugfix This fixes a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

debug mode exposed an error that exists in optimized mode
5 participants