FATES PIO issue for f19_g16 resolution ERP tests #6316

Open
glemieux opened this issue Apr 1, 2024 · 15 comments

Labels
FATES pm-cpu Perlmutter at NERSC (CPU-only nodes) SCORPIO The E3SM I/O library (derived from PIO)

@glemieux
Contributor

glemieux commented Apr 1, 2024

In the fates test list we have two debug mode ERP tests using the f19_g16 resolution for the default set of fates run modes. The difference between the tests is that one runs with the gnu compiler and the other runs with intel. Both of these tests are failing with a PIO error while accessing the restart file:

 64: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3.elm.r.0001-01-03-00000.nc, ncid=56) failed (Number of pending requests on file = 129, Number of variables with pending requests = 129, Number of request blocks = 2, Current block being waited on = 0, Number of requests in current block = 92).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u1/g/glemieux/E3SM-project/e3sm/externals/scorpio/src/clib/pio_darray_int.c: 2087)
 64: Obtained 10 stack frames.
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a40be8]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a3fc95]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a823bf]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a73b7c]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a827d8]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a74cc3]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a303f4]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x4616652]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x467faf3]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0xb81e4e]
 64: MPICH ERROR [Rank 64] [job id 23646208.0] [Fri Mar 29 12:37:52 2024] [nid006735] - Abort(-1) (rank 64 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 64
 64:
 64: aborting job:
 64: application called MPI_Abort(MPI_COMM_WORLD, -1) - process 64
srun: error: nid006735: task 64: Exited with exit code 255

We have other similar fates ERP tests that run on ne4pg2_ne4pg2 and f09_g16 that don't seem to hit this issue, although those are not run in debug mode.

@glemieux changed the title from "FATES PIO issue for f19_g16 resolution ERP tests" to "FATES PIO issue for f19_g16 resolution ERP_D tests" Apr 1, 2024
@jayeshkrishna
Contributor

  • Is this issue occurring with the latest E3SM master?
  • Do you see this issue on other machines (apart from pm)?
  • Is the test using PnetCDF or NetCDF for writes (xmlquery for PIO_TYPENAME)?
  • How many MPI processes is the test using on PM?

@glemieux changed the title from "FATES PIO issue for f19_g16 resolution ERP_D tests" back to "FATES PIO issue for f19_g16 resolution ERP tests" Apr 1, 2024
@glemieux
Contributor Author

glemieux commented Apr 1, 2024

  • Is this issue occurring with the latest E3SM master?

Yes, nearly the latest master. This was discovered when generating new fates test list baselines using E3SM v3.0.0-104-g7792c63c19 (commit from 4 days ago) and fates tag sci.1.70.0_api.32.0.0_tools.1.1.0.

  • Do you see this issue on other machines (apart from pm)?

To be determined.

  • Is the test using PnetCDF or NetCDF for writes (xmlquery for PIO_TYPENAME)?

Looks like land is using PnetCDF:
PIO_TYPENAME: ['CPL:pnetcdf', 'ATM:netcdf', 'LND:pnetcdf', 'ICE:pnetcdf', 'OCN:pnetcdf', 'ROF:pnetcdf', 'GLC:pnetcdf', 'WAV:pnetcdf', 'IAC:pnetcdf', 'ESP:pnetcdf']

  • How many MPI processes is the test using on PM?

128 tasks. Here's the preview_run output:

CASE INFO:
  nodes: 1
  total tasks: 128
  tasks per node: 128
  thread count: 1
  ngpus per node: 0

BATCH INFO:
  FOR JOB: case.test
    ENV:
      Setting Environment ADIOS2_ROOT=/global/cfs/cdirs/e3sm/3rdparty/adios2/2.9.1/cray-mpich-8.1.25/gcc-11.2.0
      Setting Environment Albany_ROOT=/global/common/software/e3sm/mali_tpls/albany-e3sm-serial-release-gcc
      Setting Environment BLA_VENDOR=Generic
      Setting Environment FI_CXI_RX_MATCH_MODE=software
      Setting Environment GATOR_INITIAL_MB=4000MB
      Setting Environment HDF5_USE_FILE_LOCKING=FALSE
      Setting Environment MPICH_COLL_SYNC=MPI_Bcast
      Setting Environment MPICH_ENV_DISPLAY=1
      Setting Environment MPICH_VERSION_DISPLAY=1
      Setting Environment NETCDF_PATH=/opt/cray/pe/netcdf-hdf5parallel/4.9.0.3/gnu/9.1
      Setting Environment OMP_NUM_THREADS=1
      Setting Environment OMP_PLACES=threads
      Setting Environment OMP_PROC_BIND=spread
      Setting Environment OMP_STACKSIZE=128M
      Setting Environment PERL5LIB=/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch
      Setting Environment PNETCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.3/gnu/9.1
      Setting Environment Trilinos_ROOT=/global/common/software/e3sm/mali_tpls/trilinos-e3sm-serial-release-gcc

    SUBMIT CMD:
      sbatch --time 00:31:40 -q regular --account m2420 .case.test 

    MPIRUN (job=case.test):
      srun  --label  -n 128 -N 1 -c 2  --cpu_bind=cores   -m plane=128 /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_gnu.elm-fates_cold.G.20240329_093430_nc2qos/bld/e3sm.exe   >> e3sm.log.$LID 2>&1 

I've confirmed this fails in non-debug mode as well.

@jayeshkrishna
Contributor

jayeshkrishna commented Apr 1, 2024

Thanks, can you also print out the PIO_BUFFER_SIZE_LIMIT (./xmlquery PIO_BUFFER_SIZE_LIMIT) for the test?

We might be able to overcome this limit by increasing the number of I/O tasks too (setting PIO_NUMTASKS to say 8)

@jayeshkrishna
Contributor

jayeshkrishna commented Apr 1, 2024

Try adding a testmod (like SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.anlgce_gnu.elm-force_netcdf_pio uses -- ./components/elm/cime_config/testdefs/testmods_dirs/elm/force_netcdf_pio) to set the number of I/O tasks for the test to 8 (./xmlchange PIO_NUMTASKS=8; ./xmlchange PIO_STRIDE=-99) and see if it works.
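
For concreteness, a minimal sketch of what the shell_commands for such a testmod could contain, assuming the same layout as the force_netcdf_pio testmod referenced above (the comments are illustrative; the xmlchange values are the ones suggested in this comment):

#!/bin/bash
# Hypothetical testmod sketch: raise the number of SCORPIO I/O tasks so each
# I/O task handles a smaller share of the restart write (the error above
# reports a single I/O request exceeding INT_MAX).
./xmlchange PIO_NUMTASKS=8
# -99 leaves the I/O task stride for CIME to work out from PIO_NUMTASKS
# (the value suggested above).
./xmlchange PIO_STRIDE=-99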

@jayeshkrishna added the FATES, SCORPIO (The E3SM I/O library, derived from PIO), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels Apr 1, 2024
@glemieux
Contributor Author

glemieux commented Apr 1, 2024

Thanks, can you also print out the PIO_BUFFER_SIZE_LIMIT (./xmlquery PIO_BUFFER_SIZE_LIMIT) for the test?

We might be able to overcome this limit by increasing the number of I/O tasks too (setting PIO_NUMTASKS to say 8)

PIO_BUFFER_SIZE_LIMIT: -1

@glemieux
Contributor Author

glemieux commented Apr 1, 2024

force_netcdf_pio

I'm sorry, I don't quite understand what you're suggesting here. Do you want me to modify the failing f19_g16 test to use the force_netcdf_pio testmod shell script, adding the ./xmlchange commands you noted to it as well?

@jayeshkrishna
Contributor

No, just add a testmod for the failing ERP test so that you can set the PIO_NUMTASKS to 8 and PIO_STRIDE to -99 (I mentioned the *elm-force_netcdf_pio test to use as a reference on how to add/set testmods for CIME tests).
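
For anyone following the mechanics, a hedged sketch of how that wiring works (the fates_cold_pio name below is hypothetical; the directory layout mirrors the force_netcdf_pio path given above): an elm testmod is a directory under components/elm/cime_config/testdefs/testmods_dirs/elm/ containing a shell_commands file, and it is selected through the elm-<name> suffix of the test name.

# create a new testmod directory next to the existing ones (name is illustrative)
mkdir -p components/elm/cime_config/testdefs/testmods_dirs/elm/fates_cold_pio
# put the xmlchange commands above into its shell_commands file; the test
# would then be spelled with the matching suffix, e.g.
#   ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_gnu.elm-fates_cold_pio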

@glemieux
Contributor Author

glemieux commented Apr 2, 2024

🎉 That did the trick. The test passes using the PIO_NUMTASKS and PIO_STRIDE values you suggested above, @jayeshkrishna. What are the next steps for addressing this?

@jayeshkrishna
Contributor

jayeshkrishna commented Apr 2, 2024

Can you also check if PIO_NUMTASKS=4 works?
The solution for this issue would be to set the number of I/O tasks (8 or 4) permanently in a testmod for this test (add the above xmlchange commands to the testmod associated with this test). The value should get reset by E3SM (share utils) if the test is run with fewer than 8 or 4 procs.

@glemieux
Contributor Author

glemieux commented Apr 3, 2024

PIO_NUMTASKS=4 works as well.

This particular testmod, fates_cold, is used pretty widely across a number of resolutions and is also the basis for other testmods. I can create a resolution-specific testmod for this one test, but I'm wondering if there are other options for updating the PIO settings without having to tie a testmod to a given resolution.

@jayeshkrishna
Contributor

OK, I will try to recreate the issue and find a fix for it in SCORPIO. Meanwhile, you can add the testmod to get the test working on PM.

@glemieux
Contributor Author

glemieux commented Apr 3, 2024

Thanks for all your help @jayeshkrishna

@jayeshkrishna self-assigned this Apr 3, 2024
@rljacob
Member

rljacob commented Apr 3, 2024

You can put "if" statements in the shell_commands file and only take action if it's a certain resolution. See this example for the "noio" testmod in eam:

(base) jacob@Roberts-MacAirM2 noio % more shell_commands
#!/bin/bash
./xmlchange --append CAM_CONFIG_OPTS='-cosp'

# save benchmark timing info for provenance
./xmlchange SAVE_TIMING=TRUE

# on KNLs, run hyper-threaded with 64x2
if [ `./xmlquery --value MACH` == theta ]||[ `./xmlquery --value MACH` == cori-knl ]; then
  ./xmlchange MAX_MPITASKS_PER_NODE=64
  ./xmlchange MAX_TASKS_PER_NODE=128
  ./xmlchange NTHRDS=2
  # avoid over-decomposing LND beyond 7688 clumps (grid cells)
  if [ `./xmlquery --value NTASKS_LND` -gt 3844 ]; then ./xmlchange NTHRDS_LND=1; fi
else
  ./xmlchange NTHRDS=1
fi

@glemieux
Contributor Author

glemieux commented Apr 3, 2024

Thanks for the suggestion @rljacob. I forgot I could xmlquery LND_GRID.
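
A minimal sketch of how that check could look in the fates_cold shell_commands, combining the pattern above with the PIO settings from earlier in the thread (the 1.9x2.5 grid name is an assumption about what ./xmlquery --value LND_GRID returns for f19_g16):

# bump the SCORPIO I/O task count only for the f19_g16 land grid
if [ `./xmlquery --value LND_GRID` == 1.9x2.5 ]; then
  ./xmlchange PIO_NUMTASKS=8
  ./xmlchange PIO_STRIDE=-99
fi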

@glemieux
Contributor Author

glemieux commented Apr 3, 2024

I should note for reference that this test was working as of 67abd00. It stopped working sometime between then and 069c226.
