Print NetCDF error message if pFIO_NetCDF4_FileFormatterMod::open() fails #962

Merged
merged 3 commits into from
Aug 17, 2021

LiamBindle
Contributor

@LiamBindle LiamBindle commented Aug 13, 2021

Description

Hi all, this is a trivial update to print nf90_strerror(status) if nf90_open() in pFIO_NetCDF4_FileFormatterMod::open() fails.

Motivation and Context

Currently there is no error handling if nf90_open() fails in pFIO_NetCDF4_FileFormatterMod::open(). As a result, a file that is missing read permissions produces an error trace like this:

pe=00213 FAIL at line=00245    NetCDF4_FileFormatter.F90                <status=13>
pe=00213 FAIL at line=00090    MAPL_ExtDataCollection.F90               <status=13>
pe=00213 FAIL at line=00232    FileMetadata.F90                         <can not find time>
pe=00213 FAIL at line=00083    FileMetadataUtilities.F90                <status=1>
pe=00213 FAIL at line=02777    MAPL_ExtDataGridCompMod.F90              <status=1>
pe=00213 FAIL at line=01512    MAPL_ExtDataGridCompMod.F90              <status=1>

and the offending file isn't printed.

With this change, an error message like this is written to stderr:

nf90_open: returned error code (13) opening ./MetDir/2016/12/MERRA2.20161214.A3dyn.05x0625.nc4 [Permission denied]

followed by the normal stack trace from the failed _VERIFY(status).
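For builds that predate this change, a pre-flight check on the input files can surface the same kind of problem before the model runs. The following is a minimal sketch, assuming a POSIX shell; `check_inputs` is a hypothetical helper and not part of MAPL, and it only covers missing or unreadable files, not files with missing time records.

```shell
# Hypothetical helper (not part of MAPL): report input files that are
# missing or not readable by the current user.
# Returns 0 if every path is readable, 1 otherwise.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING: $f"
            rc=1
        elif [ ! -r "$f" ]; then
            echo "UNREADABLE: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

It could be called as `check_inputs ./MetDir/2016/12/*.nc4`, using the directory from the example message above. Note that a privileged user may pass the readability test even when the user running the model would not.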

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Trivial change (affects only documentation or cleanup)

Checklist:

  • I have tested this change with a run of GEOSgcm (if non-trivial)
  • I have added one of the required labels (0 diff, 0 diff trivial, 0 diff structural, non 0-diff)
  • I have updated the CHANGELOG.md accordingly following the style of Keep a Changelog

@LiamBindle LiamBindle added the 0 Diff Trivial label (the changes in this pull request are trivially zero-diff: documentation, build failure, &c.) Aug 13, 2021
@LiamBindle LiamBindle requested a review from a team as a code owner August 13, 2021 18:47
end if
!$omp end critical
Member

A thought from @tclune as we looked at this: Should the !$omp end critical come after the write?

Contributor Author

Thanks for the suggestion. That makes sense. I fixed it.

@tclune
Collaborator

tclune commented Aug 16, 2021

The reason being that write is itself not thread-safe. (Though STDERR might be for most compilers.)

@mathomp4 mathomp4 self-requested a review August 17, 2021 12:23
@mathomp4
Member

Doing a test now with GEOS.

@mathomp4
Member

Well, I tested this and as far as I can see, it's zero-diff and doesn't do anything in a GEOS run.

@mathomp4 mathomp4 requested a review from tclune August 17, 2021 13:28
@mathomp4
Member

I've added @tclune as a reviewer as well. Just so he sees it.

@gopikrishnangs44

I am also facing this kind of issue in GCHP 13.4.0. Any solution?
slurm-954598.out.txt

@mathomp4
Member

@gopikrishnangs44 What version of MAPL is in GCHP 13.4.0? It's possible it has an older MAPL without the fix from @LiamBindle.

@gopikrishnangs44

@mathomp4 MAPL 2.6.3

@mathomp4
Member

@gopikrishnangs44 Ooh. Yeah, that's before this was put into 2.8.3, and I'm not sure it was ever backported to a 2.6 release. Your best bet might be to try and put the changes from @LiamBindle into your code and rebuild. Then at least you can figure out what file is failing.
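
Whether a given build is older than the release named above can be checked by comparing version strings. This is a rough sketch, assuming GNU `sort -V` is available; `version_lt` is a hypothetical helper, and the comparison says nothing about whether the change was backported to an older release line.

```shell
# Hypothetical helper: succeed if version $1 is strictly older than $2.
# Relies on GNU coreutils `sort -V` (version sort).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# 2.6.3 and 2.8.3 are the versions mentioned in this thread.
if version_lt "2.6.3" "2.8.3"; then
    echo "MAPL 2.6.3 predates 2.8.3; the nf90_open message is probably absent"
fi
```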

@gopikrishnangs44

I installed the model using Spack. Could you please guide me through the steps needed to fix this?

@mathomp4
Member

Ah. Spack. Now that is harder as you don't really have the ability to hand-edit the code and recompile. Hmm. Can you provide your logging.yaml? We might need to add something to it. I know there is a way to get ExtData to spit out a lot of extra information. We might have to use that instead to track down the file it's having issues with.

@gopikrishnangs44

Please find it attached, @mathomp4.
logging.yml.txt

@mathomp4
Member

@gopikrishnangs44 Okay. I know how to turn on the ExtData debug prints with GEOS, I suppose you can try it with yours. In my logging.yaml I can change both the console and CAP handlers to level: DEBUG:

   console:
      class: streamhandler
      formatter: basic
      unit: OUTPUT_UNIT
      level: DEBUG
...
   CAP:
       level: WARNING
       root_level: DEBUG

and get things like:

        EXTDATA: DEBUG: ExtData Run_: INTERP_LOOP: interpolating between bracket times, variable: AEF_ISOPRENE file: ExtData/chemistry/HEMCO/v0.0.0/sfc/MEGAN2.1_EF.geos.025x03125.esmf.nc
        EXTDATA: DEBUG:    MAPL_ExtDataInterpField: Uninterpolated field MEGAN_AEF_ISOP set to sample L 19850101 000000
        EXTDATA: DEBUG: ExtData Run_: INTERP_LOOP: interpolating between bracket times, variable: AEF_MBO file: ExtData/chemistry/HEMCO/v0.0.0/sfc/MEGAN2.1_EF.geos.025x03125.esmf.nc
        EXTDATA: DEBUG:    MAPL_ExtDataInterpField: Uninterpolated field MEGAN_AEF_MBOX set to sample L 19850101 000000

in my run log.

Now it looks like GCHP has extra stuff in their yaml, so maybe this won't do it? First thing to try I suppose.

NOTE: This gets very verbose!
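
Because the debug output is so verbose, it can help to reduce the run log to the list of input files it mentions; the last file listed before the failure is a reasonable first suspect. A minimal sketch, assuming the log lines look like the EXTDATA sample above (GCHP output may be formatted differently); `extdata_files` is a hypothetical helper.

```shell
# Hypothetical helper: list the unique input files named in EXTDATA
# debug lines of a run log, in order of first appearance.
# Assumes each relevant line contains " file: <path>" as in the sample.
extdata_files() {
    grep 'EXTDATA: DEBUG:' "$1" |
        sed -n 's/.* file: \([^ ]*\).*/\1/p' |
        awk '!seen[$0]++'
}
```

For example, `extdata_files run.log | tail -n 1` prints the last distinct file that ExtData reported, where `run.log` stands in for the actual log file name.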

@gopikrishnangs44

gopikrishnangs44 commented Oct 27, 2022

@mathomp4

I have replaced my logging settings with

   console:
      class: streamhandler
      formatter: basic
      unit: OUTPUT_UNIT
      level: DEBUG
...
   CAP:
       level: WARNING
       root_level: DEBUG

and tried to run the model, but got the same error.

@mathomp4
Member

@gopikrishnangs44 And you didn't get a zillion prints from extdata in your log?

@gopikrishnangs44

gopikrishnangs44 commented Oct 31, 2022

Dear @mathomp4,
Sorry for the late reply.

I tried installing GCHP 14.1 with an updated MAPL, but I still have the same issue.

This is my jobscript for slurm:

#!/bin/bash
#SBATCH -J gc # name of the job
#SBATCH -p standard-low # name of the partition: available options "standard,standard-low,gpu,gpu-low,hm"
#SBATCH -n 24 # no of processes or tasks
#SBATCH -N 1 #nodes
#SBATCH --cpus-per-task=1 # no of threads per process or task
#SBATCH -t 72:00:00 # walltime in HH:MM:SS, Max value 72:00:00
#SBATCH --mem-per-cpu=8G

ulimit -c 0                  # coredumpsize
ulimit -l unlimited          # memorylocked
ulimit -u 50000              # maxproc
ulimit -v unlimited          # vmemoryuse
ulimit -s unlimited          # stacksize

source /home/21cl91p01/spack/share/spack/setup-env.sh
spack load gcc@9.3.0
spack load git@2.17.0
spack load netcdf-fortran@4.5.4
spack load cmake@3.24.2
spack load esmf@8.3.1
spack load /gmukwis          #openmpi@4.1.4
export MPI_ROOT=$(spack location -i /gmukwis)
sh /home/21cl91p01/test_14.1/setCommonRunSettings.sh
mpirun --oversubscribe -np 24 ./gchp

I am facing the same issue at the same time step for the new version as well.

log file:
slurm-958758.out.txt

@mathomp4
Member

@gopikrishnangs44 Well the answer is one of your input files is missing a date or stopped being provided or something. That's what we need to find out.

Maybe try running with this part of your logging.yaml:

   CAP.EXTDATA:
       handlers: [mpi_shared]
       level: WARNING
       root_level: DEBUG
       propagate: false

as well as the other debugs for CAP and CONSOLE. Perhaps this is hijacking the debug messages?

@gopikrishnangs44

gopikrishnangs44 commented Nov 1, 2022

Dear @mathomp4

I changed logging.yml to:

   CAP:
       level: WARNING
       root_level: DEBUG

   CAP.EXTDATA:
       handlers: [mpi_shared]
       level: WARNING
       root_level: DEBUG
       propagate: false
   console:
      class: streamhandler
#      formatter: legacy
      formatter: basic
      unit: OUTPUT_UNIT
      level: DEBUG

But the error is still there.

@mathomp4
Member

mathomp4 commented Nov 1, 2022

@gopikrishnangs44 At this point I'm not sure what to do as I don't have access to your run or model or machine to try things out. Some file is missing or missing data.

You might try asking the GCHP folks how to turn on the debug logger prints for ExtData in their model. I know how to do it in GEOS, but obviously it's different in GCHP.

Beyond that, it's adding prints to the MAPL you are building with and re-building.
