
Model hangs #19

Closed · jpolton opened this issue Jul 9, 2021 · 18 comments


jpolton commented Jul 9, 2021

Model seems to run but hangs without terminating properly


jpolton commented Jul 9, 2021

Chris passed on insights from Adam: in make_xios.sh, make_nemo.sh and the submit.slurm scripts, change the modules

from

```
module -s restore /work/n01/shared/acc/n01_modules/ucx_env
```

to

```
module load cpe/21.03
module load cray-hdf5-parallel
module load cray-netcdf-hdf5parallel
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
```

In the slurm script this block goes above the `OMP_NUM_THREADS=1` line.
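For orientation, a minimal sketch of how the top of the modified submit.slurm might look. Only the module block and its position come from this thread; the #SBATCH lines are placeholders, not the project's real header:

```bash
#!/bin/bash
#SBATCH --job-name=SE-NEMO     # placeholder; keep the real script's SBATCH lines
#SBATCH --account=n01-CLASS    # placeholder budget code

# replacement module environment (instead of 'module -s restore ...')
module load cpe/21.03
module load cray-hdf5-parallel
module load cray-netcdf-hdf5parallel
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

export OMP_NUM_THREADS=1       # the module block sits above this line
```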


jpolton commented Jul 9, 2021

Rebuilding nemo.exe and xios_server.exe as above on branch feature/new_module made no difference to the hanging


jpolton commented Jul 9, 2021

Try investigating the allocation of nodes and related settings. For example, from Chris:

```
/work/n01/n01/cwi/mkslurm_hetjob -S 8 -s 16 -m 2 -C 831 -g 16 -N 128 -t 00:10:00 -a n01-CLASS -j SE-NEMO > runscript_831.slurm
```

The number of cores (-C) and gaps (-g) are the main things to vary here; 831 and 16 are a sweet spot for eORCA025, but other options may be better for a different NEMO configuration.
Scroll to the bottom of https://docs.archer2.ac.uk/research-software/nemo/nemo.html for more information.
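For reference, the same command split across lines, with a reading of the flags as comments. Only the cores and gap are explained in the thread itself; the other annotations are assumptions, not documented behaviour of mkslurm_hetjob:

```bash
# -C 831  number of NEMO ocean cores (the "number of cores" referred to above)
# -g 16   spacing/gap between NEMO ranks (the "gaps" referred to above)
# -S 8, -s 16, -m 2  appear to control XIOS server count and placement (assumption)
# -N 128  presumably cores per node (ARCHER2 nodes have 128 cores)
# -t wallclock limit, -a budget account, -j job name
/work/n01/n01/cwi/mkslurm_hetjob -S 8 -s 16 -m 2 -C 831 -g 16 -N 128 \
    -t 00:10:00 -a n01-CLASS -j SE-NEMO > runscript_831.slurm
```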

jpolton mentioned this issue Jul 9, 2021

jpolton commented Jul 9, 2021

@mpayopayo @micdom IT IS RUNNING!!
On a new branch https://github.com/JMMP-Group/SEVERN-SWOT/tree/feature/new_modules I tried a couple of things:

  1. Tried out the alternative modules as Chris suggested. This did not work (on its own).
  2. Rebuilt the slurm script that does all the fancy MPI allocation stuff.

Together these got NEMO running and outputting again (though "1." above may not be necessary).

It didn't quite complete - possibly an issue with the domain - but this is progress.


jpolton commented Jul 9, 2021

As a minimum-effort test of whether only the new slurm script was needed: copy https://github.com/JMMP-Group/SEVERN-SWOT/blob/feature/new_modules/RUN_DIRECTORIES/EXP_unforced/submit.slurm, swap the jelt path on line 28, swap the modules back (line 16 instead of lines 17-20), and put n01-ACCORD in line 6.
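A sketch of what those three edits might look like in the copied script. The line numbers refer to the linked submit.slurm, and the line-28 path below is a hypothetical placeholder, since the original line is not reproduced here:

```bash
# line 6: use your own budget code
#SBATCH --account=n01-ACCORD

# line 16 back in, instead of the new module lines 17-20:
module -s restore /work/n01/shared/acc/n01_modules/ucx_env

# line 28: swap the jelt-specific path for your own (hypothetical placeholder)
cd /work/n01/n01/<username>/SEVERN-SWOT/RUN_DIRECTORIES/EXP_unforced
```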

Unfortunately I have already rebuilt my NEMO and XIOS executables using the new modules, so I haven't tested whether they were important or not.


micdom commented Jul 9, 2021

@jpolton @mpayopayo I'm a bit lost. I'm trying to run the unforced run without the boundary file (which I didn't manage to build).
I got an output, not sure what I got though... but I didn't change the submit.slurm script...


jpolton commented Jul 9, 2021

> @jpolton @mpayopayo I'm a bit lost. I'm trying to run the unforced run without the boundary file (which I didn't manage to build).
> I got an output, not sure what I got though... but I didn't change the submit.slurm script...

@micdom Can you do a `chmod a+rx -R /work/n01/n01/micdom`? (This recursively gives everyone read and execute permission on your work directory, so others can inspect the run.)

mpayopayo commented:

@jpolton @micdom OK I'll try now just with the new submit slurm


micdom commented Jul 9, 2021

@jpolton @mpayopayo done: `chmod a+rx -R /work/n01/n01/micdom`

mpayopayo commented:

@jpolton maybe silly, but I'm not at ease yet with git - do I have to do the test in a new branch?


jpolton commented Jul 9, 2021

> @jpolton @mpayopayo I'm a bit lost. I'm trying to run the unforced run without the boundary file (which I didn't manage to build).
> I got an output, not sure what I got though... but I didn't change the submit.slurm script...

Looks like @micdom is the winner so far. Even got RESTART files written!! The run log is ocean.output. The XIOS output (defined in field_def_nemo-oce.xml) is SEVERN_unforced_1d_t.nc.
Well done


jpolton commented Jul 9, 2021

> @jpolton maybe silly, but I'm not at ease yet with git - do I have to do the test in a new branch?

You could copy it off the web page and paste it into your file.


jpolton commented Jul 9, 2021

@micdom
[screenshot: Screen_Capture_-_9_Jul__5_37_pm]

Elevations ~1e-12 m after 288 steps without forcing. Good job.

Riding high on this success, I'm calling it quits for the week before something goes wrong!


mpayopayo commented Jul 9, 2021

@jpolton I'm getting a segmentation fault - maybe because different modules were used for compiling and running?
I'm running with the bathy that is missing the SW bit, @micdom is running with the "full" bathy, and I did not have problems with your bathy. Could it all come from the bathy and the domain?

I'll try next week generating the bathy again.


micdom commented Jul 9, 2021

@jpolton @mpayopayo I'm using a different bathy with the SW bit!

```python
# edit applied to the bathymetry: zero the 'elevation' variable over two index
# ranges (dout is assumed to be the bathy file opened for writing as a netCDF4 Dataset)
dout.variables['elevation'][0:99, :] = 0
dout.variables['elevation'][0:200, 650:] = 0
```

For the rest I've followed the instructions, made a last pull this afternoon, and just changed ln_bdy=.false. and nn_itend=288 in the namelist_cfg.
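For reference, a sketch of where those two settings live in namelist_cfg. The group names follow the standard NEMO 4 namelist layout; this fragment is illustrative, not copied from the configuration:

```fortran
&namrun        ! run control
   nn_itend = 288       ! last timestep of the run
/
&nambdy        ! open boundary conditions
   ln_bdy = .false.     ! switch open-boundary forcing off
/
```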

have a nice weekend!


micdom commented Jul 20, 2021

@jpolton @mpayopayo I have not updated the wiki for the unforced run, but maybe I should.

The unforced run can be done without creating the boundary file first.
It is sufficient to set ln_bdy=.false. in the namelist_cfg.

The wiki section "Run Unforced" can go before "Make tidal boundary conditions".

mpayopayo commented:

@jpolton, @micdom I'm redoing the bathy and the unforced run; I'm happy to modify the wiki afterwards.

mpayopayo commented:

@jpolton it hangs/gives a segmentation fault with the cropped bathymetry but not with the full bathymetry, so I think that is where the problem lies.
