
Fixes to COMPASS to support conda MPI #480

Merged
merged 8 commits into MPAS-Dev:ocean/develop from add_compass_mpi_support on Apr 10, 2020

Conversation

@xylar (Collaborator) commented Mar 20, 2020

This merge converts several script calls to function calls, which seems to work more reliably with conda MPI:

  • The paraview extractor can now be called as a function rather than a script, and this is done during base-mesh generation and culling
  • SCRIP files are now created with a function call (see the sketch after this list for the general shape of these changes)
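
For illustration, the change from script calls to function calls looks roughly like the sketch below. The module paths, function names, and arguments shown (extract_vtk, scrip_from_mpas, and their parameters) are assumptions about the mpas_tools API and may not match the exact calls used in this PR.

```python
# Hedged sketch only: the function names, module paths and arguments below are
# assumptions about the mpas_tools API, not code taken from this PR.
from mpas_tools.viz.paraview_extractor import extract_vtk
from mpas_tools.scrip.from_mpas import scrip_from_mpas

# Instead of shelling out to a paraview extractor script, call it as a function
# during base-mesh generation and culling:
extract_vtk(filename_pattern='base_mesh.nc',
            variable_list=['allOnCells'],
            out_dir='base_mesh_vtk')

# Likewise, create a SCRIP file with a function call rather than a script call:
scrip_from_mpas('culled_mesh.nc', 'culled_mesh.scrip.nc')
```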

With these changes, calls to python scripts that use NetCDF in the parallel conda environment will now work as long as they are called with mpirun -np 1.

This merge also adds support for a conda_mpi attribute to step tags in COMPASS XML files.
If this attribute is set to true, the step will have mpirun -np 1 prepended to its executable in conda environments with MPI support; if it is set to false, it will not. If no conda_mpi attribute is specified, mpirun -np 1 is prepended only to python scripts (calls starting with python or ending with .py).
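
As a rough sketch of that rule (the helper name and inputs below are hypothetical, for illustration only, and not the actual setup_testcase.py implementation):

```python
# Hypothetical sketch of the prepending rule described above; not the actual
# setup_testcase.py code.
def maybe_prepend_mpirun(command, conda_mpi_attr, env_has_mpi):
    """Return 'command' with 'mpirun -np 1' prepended when appropriate.

    command        -- the step's command as a list, e.g. ['python', 'build_mesh.py']
    conda_mpi_attr -- the step's conda_mpi attribute: 'true', 'false' or None
    env_has_mpi    -- True if the conda environment has MPI support
    """
    if not env_has_mpi:
        return command
    if conda_mpi_attr is not None:
        # An explicit attribute wins: true means prepend, false means don't
        prepend = conda_mpi_attr.lower() == 'true'
    else:
        # No attribute: prepend only for python scripts
        prepend = command[0] == 'python' or command[0].endswith('.py')
    return ['mpirun', '-np', '1'] + command if prepend else command
```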

This is needed to support compass conda environments with MPI. Python scripts and modules that use the netcdf4 package with mpich support will work properly if they are called with mpirun.

Changes are made to setup_testcase.py in b1a2b3b, so this commit should be merged to develop in a separate PR.

@xylar (Collaborator, Author) commented Mar 20, 2020

This PR is based off #468 so it should be merged with (or after) that PR.

@xylar (Collaborator, Author) commented Mar 20, 2020

Testing

I successfully ran all steps of the QU240wISC test case on Anvil with these changes and with the compass and mpas_tools packages from MPAS-Dev/MPAS-Tools#303 (not yet released).

@xylar (Collaborator, Author) commented Apr 6, 2020

I got this working on my laptop and on Grizzly today. This required modifications to the approach, but these changes mean COMPASS users will not have to do anything other than load a compass conda environment, and setup_testcase.py will figure out whether MPI is available in the conda environment or not and what to do from there.

So far, it is not necessary to specify the conda_mpi attribute on the step tag anywhere I've tested. setup_testcase.py automatically determines that it should not be used to call run.py scripts from the driver script but that it should be used elsewhere. However, the attribute is there if we discover we need it.
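
For context, here is a minimal sketch of one way such detection could work; this is an assumption for illustration, not the logic actually implemented in setup_testcase.py:

```python
# Illustrative assumption only: one plausible way to detect whether the active
# conda environment provides MPI, not the code actually used by setup_testcase.py.
import os
import shutil


def conda_env_has_mpi():
    """Return True if the active conda environment appears to ship its own mpirun."""
    conda_prefix = os.environ.get('CONDA_PREFIX')
    if conda_prefix is None:
        return False
    mpirun = shutil.which('mpirun')
    # Only count mpirun if it lives inside the conda environment itself,
    # not a system MPI picked up from a module or the base OS.
    return mpirun is not None and mpirun.startswith(conda_prefix)
```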

@xylar (Collaborator, Author) commented Apr 6, 2020

I will try to run all the test cases I can once #506, #507, #508, #510 and #511 are merged, and I can run cleaner tests.

@xylar force-pushed the add_compass_mpi_support branch 3 times, most recently from 5e93015 to cd7ae92 on April 7, 2020 at 10:35. The commit messages on the branch were:
This merge moves soma/4km/32to4km and soma/8km/32to8km test cases
into a new subdirectory called "broken" since these test cases are
not working and won't be fixed anytime soon.  With this change,
`./list_testcases.py` and `./setup_testcase.py` won't pick up
these tests because their driver config files aren't at the
expected directory level.
The Maine, QU60 and SOQU60to15 test cases now have the links to
the python script that defines their vertical grids, which they
need in order to be set up successfully.
We no longer define a path to metis in the config file, so the
version from the conda environment needs to be used instead.
Add support for a "conda_mpi" attribute to "step" tags. If this
attribute is set to "true" and MPI is present in the conda
environment, that command will be called with `mpirun` from
the conda environment.

This is needed to support compass conda environments with mpich.
Python scripts and modules that use the netcdf4 package with
mpich support don't work properly on many compute nodes (e.g.
Grizzly at LANL and Anvil at ANL) unless they are prefixed with
`mpirun -np 1`.
The paraview extractor can now be called as a function rather
than a script, and this is done during base-mesh generation
and culling.

SCRIP files can now also be created with a function, so a script
call is replaced with a function here as well.

With these changes, calls to python scripts that use NetCDF in
the parallel conda environment will now work as long as they
are called with `mpirun -np 1`.
Also rename the load script for convenience.
This will make sure the compatible version of MPI gets used.
Since we can't automatically detect that this is a python script
(and that it needs to support compass MPI), we need to say so
explicitly.
@xylar (Collaborator, Author) commented Apr 9, 2020

Testing of all ocean test cases on Grizzly:

Successful tests

Tests checked here were successful; those unchecked have not run yet:

  • 42: -o ocean -c baroclinic_channel -r 10km -t decomp_test
  • 43: -o ocean -c baroclinic_channel -r 10km -t default
  • 44: -o ocean -c baroclinic_channel -r 10km -t restart_test
  • 45: -o ocean -c baroclinic_channel -r 10km -t rpe_test
  • 46: -o ocean -c baroclinic_channel -r 10km -t threads_test
  • 48: -o ocean -c baroclinic_channel -r 4km -t rpe_test
  • 51: -o ocean -c dam_break -r default -t 004m
  • 52: -o ocean -c dam_break -r default -t 012m
  • 55: -o ocean -c drying_slope -r meshes -t 1km
  • 56: -o ocean -c drying_slope -r meshes -t 250m
  • 61: -o ocean -c global_ocean -r ARM60to10 -t init
  • 62: -o ocean -c global_ocean -r ARM60to6 -t init
  • 63: -o ocean -c global_ocean -r CUSP12 -t init
  • 65: -o ocean -c global_ocean -r CUSP8 -t init
  • 67: -o ocean -c global_ocean -r EC60to30 -t init
  • 68: -o ocean -c global_ocean -r EC60to30 -t spin_up
  • 69: -o ocean -c global_ocean -r EC60to30wISC -t init
  • 70: -o ocean -c global_ocean -r EC60to30wISC -t spin_up
  • 71: -o ocean -c global_ocean -r QU240 -t analysis_test
  • 72: -o ocean -c global_ocean -r QU240 -t init
  • 73: -o ocean -c global_ocean -r QU240 -t performance_test
  • 74: -o ocean -c global_ocean -r QU240 -t restart_test
  • 75: -o ocean -c global_ocean -r QU240 -t rk4_blocks_test
  • 76: -o ocean -c global_ocean -r QU240 -t se_blocks_test
  • 77: -o ocean -c global_ocean -r QU240 -t test
  • 78: -o ocean -c global_ocean -r QU240wISC -t init
  • 79: -o ocean -c global_ocean -r QU60 -t init
  • 80: -o ocean -c global_ocean -r QU60 -t spin_up
  • 81: -o ocean -c global_ocean -r SO60to10wISC -t init
  • 82: -o ocean -c global_ocean -r SO60to10wISC -t spin_up
  • 83: -o ocean -c global_ocean -r SOQU60to15 -t init
  • 84: -o ocean -c global_ocean -r SOQU60to15 -t spin_up
  • 89: -o ocean -c internal_waves -r 5km -t default
  • 90: -o ocean -c internal_waves -r 5km -t rpe_test
  • 91: -o ocean -c internal_waves -r 5km -t ten-day
  • 92: -o ocean -c isomip -r 10km -t expt1.01
  • 93: -o ocean -c isomip -r 10km -t expt2.01
  • 94: -o ocean -c isomip_plus -r 2km -t Ocean0
  • 95: -o ocean -c isomip_plus -r 2km -t Ocean1
  • 96: -o ocean -c isomip_plus -r 2km -t Ocean2
  • 97: -o ocean -c isomip_plus -r 2km -t time_varying_Ocean0
  • 98: -o ocean -c isomip_plus -r 5km -t Ocean0
  • 99: -o ocean -c isomip_plus -r 5km -t Ocean1
  • 100: -o ocean -c isomip_plus -r 5km -t Ocean2
  • 101: -o ocean -c lock_exchange -r 0.5km -t default
  • 102: -o ocean -c lock_exchange -r 0.5km -t rpe_test
  • 103: -o ocean -c lock_exchange -r 16km -t default
  • 104: -o ocean -c overflow -r 10km -t default
  • 105: -o ocean -c overflow -r 1km -t rpe_test
  • 106: -o ocean -c periodic_planar -r 20km -t default_light
  • 107: -o ocean -c periodic_planar -r 20km -t region_reset_light_test
  • 108: -o ocean -c periodic_planar -r 20km -t time_reset_light_test
  • 109: -o ocean -c sea_mount -r 6.7km -t default
  • 110: -o ocean -c single_column_model -r planar -t cvmix_test
  • 111: -o ocean -c single_column_model -r sphere -t cvmix_test
  • 112: -o ocean -c soma -r 16km -t 3layer
  • 113: -o ocean -c soma -r 16km -t default
  • 114: -o ocean -c soma -r 16km -t surface_restoring
  • 115: -o ocean -c soma -r 32km -t 3layer
  • 117: -o ocean -c soma -r 32km -t surface_restoring
  • 118: -o ocean -c soma -r 32km -t time_varying_wind
  • 119: -o ocean -c soma -r 4km -t 3layer
  • 120: -o ocean -c soma -r 4km -t default
  • 121: -o ocean -c soma -r 4km -t surface_restoring
  • 122: -o ocean -c soma -r 8km -t 3layer
  • 123: -o ocean -c soma -r 8km -t default
  • 124: -o ocean -c soma -r 8km -t surface_restoring
  • 125: -o ocean -c sub_ice_shelf_2D -r 5km -t Haney_number_init
  • 126: -o ocean -c sub_ice_shelf_2D -r 5km -t Haney_number_iterative_init
  • 127: -o ocean -c sub_ice_shelf_2D -r 5km -t default
  • 128: -o ocean -c sub_ice_shelf_2D -r 5km -t iterative_init
  • 129: -o ocean -c sub_ice_shelf_2D -r 5km -t restart_test
  • 130: -o ocean -c sub_ice_shelf_2D -r 5km -t with_frazil
  • 133: -o ocean -c ziso -r 10km -t default
  • 135: -o ocean -c ziso -r 20km -t default
  • 136: -o ocean -c ziso -r 20km -t with_frazil
  • 137: -o ocean -c ziso -r 5km -t default

Tests that fail

All of these tests failed for reasons unrelated to this PR.

These tests need to copy define_base_mesh.py, not link to it (the "old" way), because of a change to build_mesh.py in #495:

  • 40: -o ocean -c Gaussian_hump -r USDEQU120cr10rr2 -t build_mesh
  • 49: -o ocean -c coastal -r Maine -t init
  • 50: -o ocean -c coastal -r USDEQU120cr10rr2 -t build_mesh
  • 85: -o ocean -c hurricane -r USDEQU120at30cr10rr2 -t build_mesh
  • 87: -o ocean -c hurricane -r USDEQU60at15cr5rr1 -t build_mesh

These tests are missing local links to a python script called ./comparison.py:

  • 53: -o ocean -c drying_slope -r hybrid -t 1km
  • 54: -o ocean -c drying_slope -r hybrid -t 250m
  • 57: -o ocean -c drying_slope -r sigma -t 1km
  • 58: -o ocean -c drying_slope -r sigma -t 250m
  • 59: -o ocean -c drying_slope -r zstar -t 1km
  • 60: -o ocean -c drying_slope -r zstar -t 250m
  • 131: -o ocean -c surface_waves -r direct -t 1km
  • 132: -o ocean -c surface_waves -r thickness_source -t 1km

All the above test cases should be fixed in #514

This test case is missing a local link to a file called ./check_particle_sampling.py:

  • 116: -o ocean -c soma -r 32km -t default

This test case crashes during the forward run on 4 nodes (144 cores) with insufficient memory, and
should be configured to indicate how many cores are actually needed:

  • 134: -o ocean -c ziso -r 2.5km -t default

Tests that were skipped

Some have prerequisites that are broken; others are too big to test:

  • 41: -o ocean -c Gaussian_hump -r USDEQU120cr10rr2 -t delaware: needs 40 (broken)
  • 64: -o ocean -c global_ocean -r CUSP12 -t spin_up: too involved, too many nodes
  • 66: -o ocean -c global_ocean -r CUSP8 -t spin_up: same as above
  • 86: -o ocean -c hurricane -r USDEQU120at30cr10rr2 -t sandy: needs 85 (broken)
  • 88: -o ocean -c hurricane -r USDEQU60at15cr5rr1 -t sandy: needs 87 (broken)

Ran only partly (because it takes too long):

  • 47: -o ocean -c baroclinic_channel -r 1km -t rpe_test

@xylar (Collaborator, Author) commented Apr 10, 2020

@mark-petersen, this has been thoroughly tested on Grizzly and is now ready to test and merge. To test, please make sure you use the environment compass_0.1.3_mpich:

source /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/etc/profile.d/conda.sh
conda activate compass_0.1.3_mpich

Important: you need to use this environment both to set up the test cases and to run them. If you don't use this conda environment during setup, links to the wrong mpirun will be hard-coded into run.py scripts in your work directory (or mpirun from the conda environment won't be detected, and ./setup_testcase.py won't do its magic in determining which scripts need to be called with it). If you run with a different conda environment, you risk not having libraries (such as libnetcdf) that are compatible with the version of mpirun that is being called.
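
To illustrate why the setup-time environment matters (a hypothetical sketch, not the actual setup code): whichever mpirun is visible while setting up the test case is what gets written into the generated run.py scripts.

```python
# Hypothetical illustration of how the mpirun path gets baked in at setup time;
# this is not the actual setup_testcase.py code.
import shutil

# Whatever mpirun is found on the PATH while setting up the test case...
mpirun_at_setup = shutil.which('mpirun')

# ...is the absolute path that ends up hard-coded into the generated run.py
# scripts, even if a different conda environment is active at run time.
print('run.py will keep calling:', mpirun_at_setup, '-np 1 python <script>')
```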

@xylar removed the "in progress" label Apr 10, 2020
mark-petersen added a commit that referenced this pull request Apr 10, 2020
…evelop

Optionally add links to load_compass_env.sh in test cases #492

This is specified either in the config file or at the command line.

Like #480, this involves changes to common COMPASS infrastructure and we
should consider making a separate PR to develop instead of merging those
changes to ocean/develop.

closes #490
@mark-petersen merged commit 190338d into MPAS-Dev:ocean/develop Apr 10, 2020
@xylar deleted the add_compass_mpi_support branch April 10, 2020 18:40
@xylar (Collaborator, Author) commented Apr 11, 2020

@mark-petersen, I'm not sure what is different in compass_0.1.3_mpich vs compass_test_mpich (or perhaps something else in my testing), but I am now seeing errors when I run analysis_mapping, no matter how many MPI tasks I use (I tried 1 or 2). I'm seeing errors when I run with 1 task on my laptop as well, but it works there when I run with 2.

So it seems like there's still something to sort out here and maybe the MPICH environment still isn't ready for general use.

@pwolfram (Contributor) commented:

@xylar, check_particle_sampling.py is at https://github.com/MPAS-Dev/MPAS-Model/blob/ocean/coastal/testing_and_setup/compass/ocean/soma/analysis/check_particle_sampling.py

@pwolfram (Contributor) commented:

The comparison.py file is at https://github.com/MPAS-Dev/MPAS-Model/blob/ocean/coastal/testing_and_setup/compass/ocean/drying_slope/analysis/comparison.py

@pwolfram (Contributor) commented:

Can you please be more specific about what "broken link" means? Also, what is the best way you propose to remediate the issue now that the file location is identified?

@xylar (Collaborator, Author) commented May 14, 2020

@pwolfram, I suggest you try setting up the test case with --work_dir, see that it fails, and debug that. I don't feel like I have the time to maintain every test case, which is why I had asked if the number of test cases could be reduced. You said no, so I will keep reporting problems.

@mark-petersen (Contributor) commented:

So it seems like there's still something to sort out here and maybe the MPICH environment still isn't ready for general use.

@xylar I confirmed on LANL IC that mapping_analysis fails using compass_0.1.5, trying to run either with python or mpirun commands.

Error message:

(compass_0.1.5) gr1361:e3sm_coupling$ /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/mpirun -np 1 python create_E3SM_coupling_files.py
****** Creating E3SM coupling files ******
- ice_shelf_cavities set to False
- author's name autodetected from git config: Mark Petersen
- author's email autodetected from git config: mpetersen@lanl.gov
- date string autodetected from today's date: 200518
- creation date autodetected from today's date: 05/18/2020 15:08:50
- maximum ocean depth autodetected mesh file: 3000.0
- number of vertical levels in the ocean autodetected mesh file: 16
- mesh long name specified in config file: QU240kmL16E3SMv2r01
- mesh short name specified in config file: QU240E2r01

****** initial_condition_ocean ******
Disabled in .ini file

****** graph_partition_ocean ******
Disabled in .ini file

****** initial_condition_seaice ******
Disabled in .ini file

****** scrip ******
Disabled in .ini file

****** transects_and_regions ******
Disabled in .ini file

****** mapping_analysis ******
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
[cli_0]: write_line error; fd=5 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=5 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(586):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=5 buf=:cmd=abort exitcode=1093647
:
system msg for write_line failure : Bad file descriptor
!!! FAILURE !!!
Traceback (most recent call last):
  File "create_E3SM_coupling_files.py", line 127, in main
    function(config)
  File "create_E3SM_coupling_files.py", line 321, in mapping_analysis
    make_analysis_lat_lon_map(config, mesh_name)
  File "create_E3SM_coupling_files.py", line 1022, in make_analysis_lat_lon_map
    tempdir='.')
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/lib/python3.7/site-packages/pyremap/remapper.py", line 214, in build_mapping_file
    subprocess.check_call(args, stdout=DEVNULL)
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/ESMF_RegridWeightGen', '--source', './src_mesh.nc', '--destination', './dst_mesh.nc', '--weight', 'map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc', '--method', 'bilinear', '--netcdf4', '--no_log', '--src_regional', '--ignore_unmapped']' returned non-zero exit status 15.

Running python create_E3SM_coupling_files.py (without mpirun) gives the same error message.

I can get mapping_analysis to work with conda activate compass_0.1.4:

Details:

python create_E3SM_coupling_files.py
WARNING:root:Setting cartopy.config["pre_existing_data_dir"] to /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/share/cartopy. Don't worry, this is probably intended behaviour to avoid failing downloads of geological data behind a firewall.
****** Creating E3SM coupling files ******
- ice_shelf_cavities set to False
- author's name autodetected from git config: Mark Petersen
- author's email autodetected from git config: mpetersen@lanl.gov
- date string autodetected from today's date: 200518
- creation date autodetected from today's date: 05/18/2020 15:58:36
- maximum ocean depth autodetected mesh file: 3000.0
- number of vertical levels in the ocean autodetected mesh file: 16
- mesh long name specified in config file: QU240kmL16E3SMv2r01
- mesh short name specified in config file: QU240E2r01

****** initial_condition_ocean ******
Disabled in .ini file

****** graph_partition_ocean ******
Disabled in .ini file

****** initial_condition_seaice ******
Disabled in .ini file

****** scrip ******
Disabled in .ini file

****** transects_and_regions ******
Disabled in .ini file

****** mapping_analysis ******
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Antarctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Arctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
SUCCESS

This may be a clue: the mpirun command from the compass_0.1.5 directory is successful after conda activate compass_0.1.4.

Details:

(compass_0.1.4) gr0235:e3sm_coupling$ /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/mpirun -np 1 python create_E3SM_coupling_files.py
WARNING:root:Setting cartopy.config["pre_existing_data_dir"] to /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/share/cartopy. Don't worry, this is probably intended behaviour to avoid failing downloads of geological data behind a firewall.
****** Creating E3SM coupling files ******
- ice_shelf_cavities set to False
- author's name autodetected from git config: Mark Petersen
- author's email autodetected from git config: mpetersen@lanl.gov
- date string autodetected from today's date: 200518
- creation date autodetected from today's date: 05/18/2020 16:03:11
- maximum ocean depth autodetected mesh file: 3000.0
- number of vertical levels in the ocean autodetected mesh file: 16
- mesh long name specified in config file: QU240kmL16E3SMv2r01
- mesh short name specified in config file: QU240E2r01

****** initial_condition_ocean ******
Disabled in .ini file

****** graph_partition_ocean ******
Disabled in .ini file

****** initial_condition_seaice ******
Disabled in .ini file

****** scrip ******
Disabled in .ini file

****** transects_and_regions ******
Disabled in .ini file

****** mapping_analysis ******
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Antarctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Arctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
SUCCESS

So we don't need to revert this commit; we could temporarily point load_latest_compass.sh to compass_0.1.4 and everything still works.

One side note: it seems to be hanging on something for a long time on LANL IC. With all steps disabled, the create_E3SM_coupling_files.py script takes 41 seconds. Based on the output, it looks like it is the header section that is slow.

caozd999 pushed a commit to caozd999/MPAS-Model that referenced this pull request Jan 14, 2021
… ocean/develop

Optionally add links to load_compass_env.sh in test cases MPAS-Dev#492

This is specified either in the config file or at the command line.

Like MPAS-Dev#480, this involves changes to common COMPASS infrastructure and we
should consider making a separate PR to develop instead of merging those
changes to ocean/develop.

closes MPAS-Dev#490