EarthWorks porting to Perlmutter #33

Closed
6 of 8 tasks
gdicker1 opened this issue Mar 1, 2024 · 7 comments
Assignees: gdicker1
Labels: enhancement (New feature or request)

Comments

gdicker1 (Contributor) commented Mar 1, 2024

This issue is intended to capture the work needed and the issues encountered when running EarthWorks on Perlmutter. It can be closed once there is a reliable initial state on Perlmutter. This includes:

  • Appropriate machine configuration (ccs_configs)
  • Modules (nvhpc software stack)
  • A generally accessible input data space (at least by all EW developers)
  • Any necessary code changes in externals to run on Perlmutter

To check this, an example test (FHS94 on the mpasa120_mpasa120 grid) should be able to (see the command sketch after this list):

  • run the create_newcase script
  • run case.setup
  • run case.build
  • run case.submit

(It's fine if these steps require some small edits, as long as those edits are in case-specific files like user_nl_cam.)
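
For reference, a minimal sketch of that sequence (the case name and directory are illustrative placeholders; the flags follow the test described above):

# From the cime/scripts directory of an EarthWorks checkout; the
# case name "FHS94_test" is an illustrative placeholder.
./create_newcase --case FHS94_test --compset FHS94 \
    --res mpasa120_mpasa120 --machine perlmutter_ew_debug
cd FHS94_test
./case.setup
./case.build
./case.submit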

gdicker1 (Contributor, Author) commented Mar 1, 2024

  • Appropriate machine configuration (ccs_configs)

Currently I'm having an issue with this. After following the provided instructions (here), I get errors during the case.setup step:

./case.setup
ERROR: module command /usr/share/lmod/lmod/libexec/lmod python purge  failed with message:
Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.12/restore_lmod_system_defaults.[csh|sh]'.
ERROR: case.setup failed
--- End loop for EWv21_PmtrDbg_FHS94.mpasa120.perlmutter_ew_debug.nvhpc.64 ---

I think I can solve this by removing the "purge" and "rm" commands from the perlmutter_ew_debug entry.
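
In the meantime, a possible workaround (a sketch based only on the error message above, not a tested fix) is to source the suggested restore script before retrying:

# Restore the Cray PE Lmod defaults, as the error message suggests,
# then retry setup from the case directory.
source /opt/cray/pe/cpe/23.12/restore_lmod_system_defaults.sh
./case.setup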

gdicker1 added the enhancement label Mar 1, 2024
gdicker1 self-assigned this Mar 1, 2024
gdicker1 (Contributor, Author) commented Mar 1, 2024

  • Modules (nvhpc software stack)

I think we already have this; it's just something to keep track of and update.
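
As a quick sanity check that the stack is visible (Perlmutter uses Lmod, per the error above; the versions listed will vary):

# List the nvhpc versions Lmod knows about, then load one.
module spider nvhpc
module load nvhpc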

gdicker1 (Contributor, Author) commented Mar 1, 2024

To check this, an example test (FHS94 on the mpasa120_mpasa120 grid) should be able to:

  • run create_newcase script
  • run case.setup
  • run case.build
  • run case.submit

Right now, case.submit fails during the check_input_data step due to missing MPAS-A partition files (i.e., files that I knew weren't in a CESM input data source and didn't copy over).
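
A sketch of the unblocking step (the source path and destination layout are assumptions; the inputdata root matches the -i path used in a later comment):

# Copy the MPAS-A graph partition files into the shared inputdata
# space so check_input_data can find them; the source path and
# subdirectory below are hypothetical.
cp /path/to/mpasa120.graph.info.part.* \
    /global/cfs/cdirs/m4180/inputdata/atm/mpas/
./check_input_data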

gdicker1 (Contributor, Author) commented Mar 5, 2024

Related PR: EarthWorksOrg/ccs_config_cesm#17

gdicker1 (Contributor, Author) commented

  • A generally accessible input data space (at least by all EW developers)

See this comment in #34

gdicker1 (Contributor, Author) commented Mar 11, 2024

  • run case.submit

@supreethms1809 I think I need some help here. I created a case on Perlmutter (using --machine perlmutter_ew_debug and -i /global/cfs/cdirs/m4180/inputdata), but things fail with an MPI abort error when running the compset. The run fails early (during init) with no output in drv.log (the only other file in the run directory).

From file: "/pscratch/sd/g/gdicker/2024Mar08-164113_EWv21_PmtrDbg_FHS94.mpasa120.perlmutter_ew_debug.nvhpc.64/run/cesm.log.22715130.240308-170402" on Perlmutter

... # repeated (t_initf) output per thread
26:  (t_initf)       profile_ovhd_measurement=  F
26:  (t_initf)       profile_add_detail=        F
26:  (t_initf)       profile_papi_enable=       F
 5:  (t_initf) Read in prof_inparm namelist from: drv_in
 5:  (t_initf) Using profile_disable=           F
 5:  (t_initf)       profile_timer=                       4
 5:  (t_initf)       profile_depth_limit=                 4
 5:  (t_initf)       profile_detail_limit=                2
 5:  (t_initf)       profile_barrier=           F
 5:  (t_initf)       profile_outpe_num=                   1
 5:  (t_initf)       profile_outpe_stride=                0
 5:  (t_initf)       profile_single_file=       F
 5:  (t_initf)       profile_global_stats=      T
 5:  (t_initf)       profile_ovhd_measurement=  F
 5:  (t_initf)       profile_add_detail=        F
 5:  (t_initf)       profile_papi_enable=       F
 ... # repeated MPI_ABORT output per thread
 5: --------------------------------------------------------------------------
 5: MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
 5:   Proc: [[38668,0],0]
 5:   Errorcode: 1
 5:
 5: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
 5: You may or may not see output from other processes, depending on
 5: exactly when Open MPI kills them.
 5: --------------------------------------------------------------------------
srun: error: nid002180: tasks 5,14-16,18-19,23,26,39,42,47,50,59: Exited with exit code 1
srun: Terminating StepId=22715130.0
 0: slurmstepd: error: *** STEP 22715130.0 ON nid002180 CANCELLED AT 2024-03-09T01:04:16 ***
srun: error: nid002180: tasks 0-4,6-13,17,20-22,24-25,27-38,40-41,43-46,48-49,51-58,60-63: Terminated
srun: Force Terminated StepId=22715130.0

I haven't tried a run with DEBUG=true because I think NVHPC dies in general when we turn that on for EW/CESM.
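
If a debug build ever becomes viable here, the usual next diagnostic step would be (a sketch; as noted, DEBUG=true may itself fail with NVHPC for EW/CESM):

# Rebuild with debug flags to try to get a traceback for the abort.
./xmlchange DEBUG=TRUE
./case.build --clean-all
./case.build
./case.submit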

gdicker1 (Contributor, Author) commented Aug 2, 2024

Closing due to lack of progress/interest. This can be re-opened later.

gdicker1 closed this as completed Aug 2, 2024
gdicker1 added a commit to gdicker1/EarthWorks that referenced this issue on Sep 20, 2024:
  Use tags for EarthWorksOrg/CAM EarthWorksOrg#33 and EarthWorksOrg/CTSM EarthWorksOrg#12