
e3sm_coupling failing with load_latest_compass.sh on LANL IC #564

Closed
xylar opened this issue May 19, 2020 · 11 comments

xylar commented May 19, 2020

This is a copy of the following comment from @mark-petersen:
#480 (comment)

So it seems like there's still something to sort out here and maybe the MPICH environment still isn't ready for general use.

@xylar I confirmed on LANL IC that mapping_analysis fails using compass_0.1.5, whether it is run with python or with mpirun.

error message

(compass_0.1.5) gr1361:e3sm_coupling$ /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/mpirun -np 1 python create_E3SM_coupling_files.py
****** Creating E3SM coupling files ******
- ice_shelf_cavities set to False
- author's name autodetected from git config: Mark Petersen
- author's email autodetected from git config: mpetersen@lanl.gov
- date string autodetected from today's date: 200518
- creation date autodetected from today's date: 05/18/2020 15:08:50
- maximum ocean depth autodetected mesh file: 3000.0
- number of vertical levels in the ocean autodetected mesh file: 16
- mesh long name specified in config file: QU240kmL16E3SMv2r01
- mesh short name specified in config file: QU240E2r01

****** initial_condition_ocean ******
Disabled in .ini file

****** graph_partition_ocean ******
Disabled in .ini file

****** initial_condition_seaice ******
Disabled in .ini file

****** scrip ******
Disabled in .ini file

****** transects_and_regions ******
Disabled in .ini file

****** mapping_analysis ******
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
[cli_0]: write_line error; fd=5 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=5 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(586):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=5 buf=:cmd=abort exitcode=1093647
:
system msg for write_line failure : Bad file descriptor
!!! FAILURE !!!
Traceback (most recent call last):
  File "create_E3SM_coupling_files.py", line 127, in main
    function(config)
  File "create_E3SM_coupling_files.py", line 321, in mapping_analysis
    make_analysis_lat_lon_map(config, mesh_name)
  File "create_E3SM_coupling_files.py", line 1022, in make_analysis_lat_lon_map
    tempdir='.')
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/lib/python3.7/site-packages/pyremap/remapper.py", line 214, in build_mapping_file
    subprocess.check_call(args, stdout=DEVNULL)
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/ESMF_RegridWeightGen', '--source', './src_mesh.nc', '--destination', './dst_mesh.nc', '--weight', 'map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc', '--method', 'bilinear', '--netcdf4', '--no_log', '--src_regional', '--ignore_unmapped']' returned non-zero exit status 15.

python create_E3SM_coupling_files.py
--> same error message
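
For reference, here is a minimal sketch of the failing call pattern, reconstructed from the traceback above (the argument list is copied from the log; the surrounding code is a simplification of pyremap's build_mapping_file, and the ESMF_RegridWeightGen path is shortened):

import subprocess
from subprocess import DEVNULL

# Argument list copied from the log above.
args = ['ESMF_RegridWeightGen',
        '--source', './src_mesh.nc',
        '--destination', './dst_mesh.nc',
        '--weight', 'map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc',
        '--method', 'bilinear',
        '--netcdf4', '--no_log', '--src_regional', '--ignore_unmapped']

# With an MPI-enabled ESMF build and no process manager available, the binary
# exits with status 15 and this raises CalledProcessError, as in the traceback.
subprocess.check_call(args, stdout=DEVNULL)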

I can get mapping_analysis to work with conda activate compass_0.1.4:

details

python create_E3SM_coupling_files.py
WARNING:root:Setting cartopy.config["pre_existing_data_dir"] to /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/share/cartopy. Don't worry, this is probably intended behaviour to avoid failing downloads of geological data behind a firewall.
****** Creating E3SM coupling files ******
- ice_shelf_cavities set to False
- author's name autodetected from git config: Mark Petersen
- author's email autodetected from git config: mpetersen@lanl.gov
- date string autodetected from today's date: 200518
- creation date autodetected from today's date: 05/18/2020 15:58:36
- maximum ocean depth autodetected mesh file: 3000.0
- number of vertical levels in the ocean autodetected mesh file: 16
- mesh long name specified in config file: QU240kmL16E3SMv2r01
- mesh short name specified in config file: QU240E2r01

****** initial_condition_ocean ******
Disabled in .ini file

****** graph_partition_ocean ******
Disabled in .ini file

****** initial_condition_seaice ******
Disabled in .ini file

****** scrip ******
Disabled in .ini file

****** transects_and_regions ******
Disabled in .ini file

****** mapping_analysis ******
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Antarctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Arctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
SUCCESS

This may be a clue: the mpirun command from the compass_0.1.5 directory is successful after conda activate compass_0.1.4.

details

(compass_0.1.4) gr0235:e3sm_coupling$/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.5/bin/mpirun -np 1 python create_E3SM_coupling_files.py
WARNING:root:Setting cartopy.config["pre_existing_data_dir"] to /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/share/cartopy. Don't worry, this is probably intended behaviour to avoid failing downloads of geological data behind a firewall.
****** Creating E3SM coupling files ******
- ice_shelf_cavities set to False
- author's name autodetected from git config: Mark Petersen
- author's email autodetected from git config: mpetersen@lanl.gov
- date string autodetected from today's date: 200518
- creation date autodetected from today's date: 05/18/2020 16:03:11
- maximum ocean depth autodetected mesh file: 3000.0
- number of vertical levels in the ocean autodetected mesh file: 16
- mesh long name specified in config file: QU240kmL16E3SMv2r01
- mesh short name specified in config file: QU240E2r01

****** initial_condition_ocean ******
Disabled in .ini file

****** graph_partition_ocean ******
Disabled in .ini file

****** initial_condition_seaice ******
Disabled in .ini file

****** scrip ******
Disabled in .ini file

****** transects_and_regions ******
Disabled in .ini file

****** mapping_analysis ******
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Antarctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
running: /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.4/bin/ESMF_RegridWeightGen --source ./src_mesh.nc --destination ./dst_mesh.nc --weight map_QU240kmL16E3SMv2r01_to_6000.0x6000.0km_10.0km_Arctic_stereo_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --dst_regional --ignore_unmapped
SUCCESS

So we don't need to revert this commit; we could temporarily point load_latest_compass.sh to compass_0.1.4 and everything would still work.

One side note: it seems to be hanging on something for a long time on IC. With all steps disabled, the create_E3SM_coupling_files.py script takes 41 seconds. Based on the output, it looks like it is hanging in the header section.

xylar self-assigned this May 19, 2020

xylar commented May 19, 2020

@mark-petersen, I'm moving this discussion to a new issue because it's not related to the PR you're commenting on. That PR and the comment you're referring to were for an older version of COMPASS and the compass metapackage. I don't think the errors you're seeing are related to that work.

The pull request to use compass_0.1.5 is here: #545

You approved it, noting that the nightly regression suite passed. I had also tested e3sm_coupling from that branch, and it worked fine for me. So I don't think 0.1.5 is broken; I just think it doesn't work because #545 hasn't been merged yet.

The safest way to know what compass environment you should be using from a given COMPASS branch is with load_compass_env.sh. You can get a local link to that script with ./setup_testcase.py --link_load_compass. You can also look at README_ocean.md to see what the compatible version is.

I think it's reasonable to assume that load_latest_compass.sh should also always work on ocean/develop. @pwolfram, will it cause you trouble if I edit this to point to compass_0.1.4 for now?

xylar commented May 19, 2020

One side note: it seems to be hanging on something for a long time on IC. With all steps disabled, the create_E3SM_coupling_files.py script takes 41 seconds. Based on the output, it looks like it is hanging in the header section.

This has to be some sort of IC file-system issue. I see similar things on Cori. There's nothing I can do about that.

Update: Sorry, my previous comment isn't correct. This seems to be related to calling the python script with mpirun and should go away in #545

xylar commented May 19, 2020

The error @mark-petersen is seeing is still present in #545. It results from running ESMF_RegridWeightGen without mpirun when only 1 MPI task is requested. This will need to be fixed in pyremap (see MPAS-Dev/pyremap#15).
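
For illustration, a minimal sketch of the kind of change needed (this is hypothetical, not the actual pyremap patch; see MPAS-Dev/pyremap#15 for the real fix): launch the MPI-enabled ESMF_RegridWeightGen under mpirun even when only one task is requested, so it always starts under a process manager.

import shutil
import subprocess
from subprocess import DEVNULL

def run_esmf_regrid_weight_gen(esmf_args, ntasks=1):
    # Hypothetical helper: prefix the command with mpirun, avoiding the PMI
    # "write_line error" failures seen above when the MPI-enabled ESMF binary
    # is started without a process manager.
    mpirun = shutil.which('mpirun')
    if mpirun is not None:
        esmf_args = [mpirun, '-np', str(ntasks)] + list(esmf_args)
    subprocess.check_call(esmf_args, stdout=DEVNULL)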

@vanroekel

@xylar, not sure if this is related to this issue, but when I try to run a test case (Jamil's horizontal advection test) using the environment from ./setup_testcase.py --link_load_compass, I get the following errors:

[mpiexec@gr1210.localdomain] fn_kvs_get (pm/pmiserv/pmiserv_pmi_v2.c:299): assert (idx != -1) failed
[mpiexec@gr1210.localdomain] handle_pmi_cmd (pm/pmiserv/pmiserv_cb.c:49): PMI handler returned error
[mpiexec@gr1210.localdomain] control_cb (pm/pmiserv/pmiserv_cb.c:286): unable to process PMI command
[mpiexec@gr1210.localdomain] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@gr1210.localdomain] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@gr1210.localdomain] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

but if I activate compass_0.1.4, the test case runs fine.

Sorry if this is in the wrong place.

xylar commented May 19, 2020

@vanroekel, I think that may be a different issue, but it can stay here. I would need more details on how to reproduce it. compass_0.1.4 is serial (no MPI) whereas compass_0.1.5 includes MPI, so it's certainly related to that.

@vanroekel

Let me open a new issue with a summary and reproduction steps; I think that will be clearer. I'll also first try a case off ocean/develop to see if I can reproduce it.

xylar commented May 19, 2020

@vanroekel, could you try first loading the python environment and then setting the MPI modules? I think that might make a difference here in which mpirun/mpiexec, etc. is being used.

@vanroekel

Sure, I will try that as well.

@vanroekel

@xylar yes, it was the order. If I load MPI second, it works fine. I'm perfectly happy doing that and not worrying about any new issues. Thanks!

xylar commented May 19, 2020

That is always going to be a requirement once MPI is part of the conda environment, and we need it to be so that we can run ESMF_RegridWeightGen under MPI. I'll make sure this is part of the documentation once there is some...
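
As a quick sanity check (a hypothetical diagnostic, not something from this thread), you can print which executables win on PATH after your shell setup, since the relative order of conda activate and module load determines whether the conda environment's MPI or the system MPI is found first:

import os
import shutil

# Hypothetical diagnostic: report which mpirun/mpiexec and ESMF binary are
# first on PATH and which conda environment is active.
for exe in ('mpirun', 'mpiexec', 'ESMF_RegridWeightGen'):
    print(exe, '->', shutil.which(exe))
print('active conda env:', os.environ.get('CONDA_PREFIX'))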

xylar commented Nov 6, 2020

Much of this has been addressed. The issues about the order of loading have been moved to MPAS-Dev/compass#10 and will be documented in the new COMPASS repo's documentation.
