Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates the UFS hash and global-workflow for CICE fixes #2505

Conversation

HenryWinterbottom-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA HenryWinterbottom-NOAA commented Apr 18, 2024

Description

This PR address issue #2490. The following is accomplished:

Type of change

  • Maintenance (submodule hash update)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

The new ufs_model.fd hash has been tested using the global-workflow C48 ATM only and S2SW CI configurations. The relevant rocotostat information is as follows:

ATM

       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202103231200             gfsstage_ic                    58293800           SUCCEEDED                   0         1          19.0
202103231200                 gfsfcst                    58293836           SUCCEEDED                   0         1        2572.0

S2SW

       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202103231200             gfsstage_ic                    58293800           SUCCEEDED                   0         1          19.0
202103231200             gfswaveinit                           -                   -                   -         -             -
202103231200                 gfsfcst                    58293836           SUCCEEDED                   0         1        2572.0

The CICE post-processing also completed accordingly:

202103231200  gfsice_prod_f006-f018          58413985      SUCCEEDED          0     1     51.0
202103231200  gfsice_prod_f024-f036          58413986      SUCCEEDED          0     1     52.0
202103231200  gfsice_prod_f042-f054          58413987      SUCCEEDED          0     1     51.0
202103231200  gfsice_prod_f060-f072          58413988      SUCCEEDED          0     1     51.0

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

@aerorahul
Copy link
Contributor

@DeniseWorthen
Would you be able to confirm the updates to parsing_namelist_CICE.sh to just make sure the defaults the global-workflow is using for some of the newly added CICE ice_in namelist variables are consistent with the UFSWM RT.
Thanks!

@JessicaMeixner-NOAA
Copy link
Contributor

@aerorahul would you like me to run the C768 test for this?

@HenryWinterbottom-NOAA HenryWinterbottom-NOAA changed the title Feature/gfsv17 issue 2490 UFS hash updates relevant for CICE Apr 18, 2024
@HenryWinterbottom-NOAA HenryWinterbottom-NOAA changed the title UFS hash updates relevant for CICE Updates the UFS and global-workflow for CICE fixes Apr 18, 2024
@HenryWinterbottom-NOAA HenryWinterbottom-NOAA changed the title Updates the UFS and global-workflow for CICE fixes Updates the UFS hash and global-workflow for CICE fixes Apr 18, 2024
@aerorahul
Copy link
Contributor

@aerorahul would you like me to run the C768 test for this?

I think we will run the usual CI tests to ensure the model hash update does not break the baseline functionality.
If the C768 run fails due to resources or something else unexpected, we could open a more targeted issue for troubleshooting the problem.

Does that work?

@JessicaMeixner-NOAA
Copy link
Contributor

@aerorahul that sounds like a great plan. As long as the existing CI tests pass, I see no reason to not merge this PR and address any additional issues separately. @DeniseWorthen is the best person to look at the CICE updates, but if she is unavailable, I'd be happy to double check what I can in reguards to that.

Thank you @HenryWinterbottom-NOAA for making this update!

@DeniseWorthen
Copy link

@aerorahul The changes to the cice_in parsing looks correct. There is an issue for running on WCOSS2, since there seems to be some problem w/ using pio/pnetcdf so you currently can't use pnetcdf2 on that platform. We had to use hdf5 on WCOSS2

+  if [[ ${MACHINE_ID} == wcoss2 ]]; then
+    export CICE_RESTART_FORMAT='hdf5'
+    export CICE_HISTORY_FORMAT='hdf5'
+  fi

I did a set of tests w/ the PIO feature and found pnetcdf2 was the best performance, and at high task counts, switching the rearrangers to subset had a big impact. I don't know where or if that optional sort of setting would be included in the G-W.

Copy link
Contributor

@JessicaMeixner-NOAA JessicaMeixner-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HenryWinterbottom-NOAA is there a reason you chose ufs-community/ufs-weather-model@8a5f711 ? The bug fix that issue #2490 needs is the next commit. My test C768 failed with the error that we are trying to fix and that the next commit should fix.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@HenryWinterbottom-NOAA is there a reason you chose ufs-community/ufs-weather-model@8a5f711 ? The bug fix that issue #2490 needs is the next commit. My test C768 failed with the error that we are trying to fix and that the next commit should fix.

@JessicaMeixner-NOAA I chose that hash following the discussion here.

I will try the next commit now and follow-up.

Actually, @JessicaMeixner-NOAA, could you replace sorc/ufs_model.fd with the hash that you are referencing, rebuild, and test the C768? I was repeatedly having issues getting C768 working regardless of the hashes I tested. If you can identify explicitly the hash that needs to be pushed, I can make the changes.

@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA I see no reason not to move to the latest develop branch of ufs-waether-model. We should at minimum move to ufs-community/ufs-weather-model@281b32f which includes the bug-fxies that are required for fixing the C768 issues.

I'll update to the latest, rebuild and test C768, but as @aerorahul said, if the low-resolution passes tests we should not delay this PR for the C768 issues. We could even mrege this PR as is, but it would not close issue 2490.

@KateFriedman-NOAA KateFriedman-NOAA removed their request for review April 19, 2024 14:40
@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA I used ufs-weather-model commit hash: 47c00995706380f9827a7e98bd242ffc6e4a7f01
and it's been running for an hour: /scratch1/NCEPDEV/climate/Jessica.Meixner/testgw2505/test02/COMROOT/test02/logs/2019120300/gfsfcst.log

I think it's safe to update to this commit hash and move forward with this PR.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@JessicaMeixner-NOAA Thank you for following-up.

I am testing against develop at the moment and my test should be closing out soon. If successful, do you suggest pushing to the UFS develop hash or the commit hash that you are referencing above?

@JessicaMeixner-NOAA
Copy link
Contributor

The commit hash I pointed you to is the latest commit to the develop branch hash (although a commit might be coming here in the next hour or so). It is from Wed Apr 17 ufs-community/ufs-weather-model@47c0099

@HenryWinterbottom-NOAA
Copy link
Contributor Author

OK, once the ocean/ice post apps complete, I will make the hash update and hopefully we can get this completed today. Thank you for helping out with this.

@JessicaMeixner-NOAA
Copy link
Contributor

Is this just a WCOSS2 issue? I successfully ran the C768 test case on hera. Unfortunately I just cleaned that up thinking it was no longer needed, so I can't double check version numbers or anything.

@aerorahul
Copy link
Contributor

Is this just a WCOSS2 issue? I successfully ran the C768 test case on hera. Unfortunately I just cleaned that up thinking it was no longer needed, so I can't double check version numbers or anything.

Yes. It appears to be a WCOSS2 thing. If someone else can verify it, that would be great.

@aerorahul
Copy link
Contributor

@HenryWinterbottom-NOAA @JessicaMeixner-NOAA has offered to help test on Hera w/ hdf5 to reproduce the issue seen on WCOSS2.
If you can share your paths to logs/build/etc with her, she might be able to get a head start.
Thanks

@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA I have a C48 S2SW forecast test submitted on hera. I'll be interested to see what issues you're running into and will keep you up to date on my progress too.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@HenryWinterbottom-NOAA I have a C48 S2SW forecast test submitted on hera. I'll be interested to see what issues you're running into and will keep you up to date on my progress too.

Thank you, @JessicaMeixner-NOAA . I am trying one more test to make sure I did botch something earlier working with @aerorahul. I will also keep you updated. Thank you for your help.

@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA I'll be interested to see what tests you were running and failing with. So far my C48_S2SW test is running. It's about half way through the forecast. COM/EXP dirs are here: /scratch1/NCEPDEV/climate/Jessica.Meixner/updatemodelgw/test01/s2s01

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@JessicaMeixner-NOAA I am cross checking that I have the correct UFSWM hash. The one I tested against earlier today was different. I see there were more recent pushes to develop so that is likely where things got messed up for -- or at least a reason.

I will update this thread as I learn more.

@JessicaMeixner-NOAA
Copy link
Contributor

Here's my clone if that's of any help: /scratch1/NCEPDEV/climate/Jessica.Meixner/updatemodelgw/test01/global-workflow I just checked out your latest branch from like 2 hours ago, it was up to date with develop.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

And this is working for you, correct? If so, that suggests that I botched something.

@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA sounds like maybe the merge unintentionally changed the UFS hash on your end. Maybe a fresh clone will help?

My test has finished the forecast job and some of the post is finishing up now, so it looks to be successful.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@JessicaMeixner-NOAA OK, the confirms the branch isn't broken (a good thing). I am not sure what happened with earlier, but I am already doing as you are suggesting and so far, so good.

Thank you for taking the time to help us (me) to debug. It is appreciated.

@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA @aerorahul @DeniseWorthen I went ahead and ran another run with HDF5 as the option on hera to see if we could narrow this down to machine or HDF5, and got the following error /scratch1/NCEPDEV/climate/Jessica.Meixner/updatemodelgw/test01/s2s02hdf5/COMROOT/s2s02hdf5/logs/2021032312/gfsfcst.log.0

 0:  fcst_realize, cpl_grid_id=           1
 0:  zeroing coupling accumulated fields at kdt=            1
 0:  zeroing coupling accumulated fields at kdt=            1
20:
20:  (abort_ice)ABORTED:
20:  (abort_ice) called from ice_pio.F90
20:  (abort_ice) line number          223
20:  (abort_ice) error =
20:  (ice_pio_check)Permission denied, (ice_pio_init) ERROR: Failed to create file .
20:  /CICE_OUTPUT/iceh_ic.2021-03-23-43200.nc
21:
21:  (abort_ice)ABORTED:
21:  (abort_ice) called from ice_pio.F90
21:  (abort_ice) line number          223
21:  (abort_ice) error =
21:  (ice_pio_check)Permission denied, (ice_pio_init) ERROR: Failed to create file .
21:  /CICE_OUTPUT/iceh_ic.2021-03-23-43200.nc
22:
22:  (abort_ice)ABORTED:
22:  (abort_ice) called from ice_pio.F90
22:  (abort_ice) line number          223
22:  (abort_ice) error =
22:  (ice_pio_check)Permission denied, (ice_pio_init) ERROR: Failed to create file .
22:  /CICE_OUTPUT/iceh_ic.2021-03-23-43200.nc
20: Abort(128) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 20

I believe this means that hdf5 has issues with linking in CICE and this is not necessarily a machine specific issue. What I do not know is if this is the way CICE uses HDF5, or just an HDF5 issue. If it's just the way CICE uses HDF5, is there anyway to change that so linking files is okay? If not, I think our solutions to being able to update the model are:

  • Get new libraries on WCOSS2 (which will take time)
  • Update workflow to not use linked files, which is in progress but will take time as well.
    @aerorahul is that a correct summary? I know there's another option to just let WCOSS2 fail, but there are a lot of people who are using WCOSS2, particularly for C1152 runs.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@JessicaMeixner-NOAA I can make the changes in my branch to that it can be tested WRT the symlink versus copy.

Would this be worth testing?

@JessicaMeixner-NOAA
Copy link
Contributor

@HenryWinterbottom-NOAA it's my understanding that @aerorahul might be somewhere in the progress of that update, so perhaps coordinating that with him would be a good next step.

I'd be curious to hear @DeniseWorthen thoughts on if we should create an issue for the CICE+HDF5+linking to see if there's anything that can be done on the CICE side?

@aerorahul
Copy link
Contributor

@JessicaMeixner-NOAA
When this issue came up, we started working on a branch that creates output into a DATAoutput/ equivalent to the DATArestart/.
We are able to write the model output for MOM6 and CICE in DATAoutput/MOM6_OUTPUT/ and DATAoutput/CICE_OUTPUT/ respectively. The files here are in their model native formats provided by diag_table and cice.
The branch then copies this data to COM/. This is done after the model has finished integrating. There in lies the problem for post-processing and downstream applications. They can't wait for the GFS 384 hour (or GEFS 48 day) forecast to finish before data is seen in COM.

The post-processing jobs can look for model output in DATAoutput/, however since the output files are named differently between components e.g. GFSPRS.FHH or GFSPRS.FHHH for atmos, ocn_%Y_%m_%d_%H.nc for ocean, iceh_06h.%Y-%m-%d-%seconds.nc for CICE, WW3 and GOCART have other formats, coding this into the dependencies for Rocoto is going to require a lot more effort. Even so, this will be prone to errors, as there is no standard in the model output for the components. This will be a maintenance burden.

An alternative is to employ a forecast manager job that runs in parallel with the forecast job, that copies the output from DATAoutput/ to COM/ as the model integrates. This seems to be a viable solution, but it also has a few downsides. It needs to know the forecast output frequency, and special naming conventions and they need to be replicated from the forecast_postdet.sh, COMPONENT_out functions. It also will likely need to depend on some forecast_predet.sh functions to know the explicit forecast output hours.
The bigger complication, is the condition of RERUN=YES. When that happens, where does this forecast manager job kick off from? When does it run? How long does it run for?

Addressing these will take some thinking, tinkering, trial and error. It will also need a discussion with NCO on what will be acceptable to ops, given that we already have open tickets to replace links with copies.

This is just a status update to keep everyone in the loop. Please let me know if there are any questions or anything needs clarification.

@aerorahul
Copy link
Contributor

  • Get new libraries on WCOSS2 (which will take time)
  • Update workflow to not use linked files, which is in progress but will take time as well.
    @aerorahul is that a correct summary? I know there's another option to just let WCOSS2 fail, but there are a lot of people who are using WCOSS2, particularly for C1152 runs.

Thanks for summarizing. This is correct.

New libraries on WCOSS2 will allow us to continue linking, but it does not resolve NCO bugzilla requiring replacing links with copies. The second bullet does address it, but raises some other concerns (noted in this comment.
As you surmised, letting WCOSS2 fail is not really an acceptable solution.

@HenryWinterbottom-NOAA
Copy link
Contributor Author

@aerorahul Should we icebox this PR UFN?

@JessicaMeixner-NOAA
Copy link
Contributor

I've asked on ufs-community/ufs-weather-model#2232 when new libraries might be ready.

@DeniseWorthen
Copy link

DeniseWorthen commented Apr 29, 2024

I've tested a sandbox using linked output directories on Hera and it worked for both HDF5 and pnetcdf2. This is the HDF5 case:

[Hera: hfe03] /scratch1/NCEPDEV/stmp2/Denise.Worthen/oitest/test: ls -l CICE_*
lrwxrwxrwx 1 Denise.Worthen stmp 51 Apr 29 19:44 CICE_OUTPUT -> /scratch1/NCEPDEV/stmp4/Denise.Worthen/CICE_OUTPUT/
lrwxrwxrwx 1 Denise.Worthen stmp 52 Apr 29 19:44 CICE_RESTART -> /scratch1/NCEPDEV/stmp4/Denise.Worthen/CICE_RESTART/

I'm not clear on why this is failing on WCOSS2. It doesn't appear to me to have anything to do w/ the CICE PIO implementation.

@aerorahul
Copy link
Contributor

I've tested a sandbox using linked output directories on Hera and it worked for both HDF5 and pnetcdf2. This is the HDF5 case:

[Hera: hfe03] /scratch1/NCEPDEV/stmp2/Denise.Worthen/oitest/test: ls -l CICE_*
lrwxrwxrwx 1 Denise.Worthen stmp 51 Apr 29 19:44 CICE_OUTPUT -> /scratch1/NCEPDEV/stmp4/Denise.Worthen/CICE_OUTPUT/
lrwxrwxrwx 1 Denise.Worthen stmp 52 Apr 29 19:44 CICE_RESTART -> /scratch1/NCEPDEV/stmp4/Denise.Worthen/CICE_RESTART/

I'm not clear on why this is failing on WCOSS2.

@DeniseWorthen
In your test, it appears you are linking output directory, and not the output file.
What does your test reveal, when you do something like:

CICE_OUTPUT/iceh_06h.2021-03-23-64800.nc -> /scratch1/NCEPDEV/stmp4/Denise.Worthen/CICE_OUTPUT/f018.nc

@HenryWinterbottom-NOAA
Copy link
Contributor Author

Closing. A new issue will be drafted and subsequent PR will be created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update ufs-weather-model hash in g-w
6 participants