
Calling cutn.distributed_reset_configuration() with MPICH might fail with CUTENSORNET_STATUS_DISTRIBUTED_FAILURE #31

Closed
leofang opened this issue Jan 23, 2023 · 5 comments

leofang commented Jan 23, 2023

MPICH users running the example22_mpi_auto.py sample might see the following error:

$ mpiexec -n 2 python example22_mpi_auto.py
Traceback (most recent call last):
  File "/home/leof/dev/cuquantum/python/samples/cutensornet/coarse/example22_mpi_auto.py", line 60, in <module>
    cutn.distributed_reset_configuration(
  File "cuquantum/cutensornet/cutensornet.pyx", line 2306, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 2328, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 229, in cuquantum.cutensornet.cutensornet.check_status
cuquantum.cutensornet.cutensornet.cuTensorNetError: CUTENSORNET_STATUS_DISTRIBUTED_FAILURE
Traceback (most recent call last):
  File "/home/leof/dev/cuquantum/python/samples/cutensornet/coarse/example22_mpi_auto.py", line 60, in <module>
    cutn.distributed_reset_configuration(
  File "cuquantum/cutensornet/cutensornet.pyx", line 2306, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 2328, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 229, in cuquantum.cutensornet.cutensornet.check_status
cuquantum.cutensornet.cutensornet.cuTensorNetError: CUTENSORNET_STATUS_DISTRIBUTED_FAILURE

This is a known issue with the automatic MPI support in cuQuantum Python 22.11 / cuTensorNet 2.0.0 when used with mpi4py + MPICH.

The reason is that Python, by default, dynamically loads shared libraries into a private scope (see, e.g., the documentation for ctypes.DEFAULT_MODE), which breaks the assumption made by libcutensornet_distributed_interface_mpi.so (whose path is set via $CUTENSORNET_COMM_LIB) that the MPI symbols have been loaded into the public (global) scope.

Open MPI is immune to this problem because mpi4py already "breaks" this assumption (it loads the MPI library into the global scope) to work around a few old Open MPI issues.
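
To see whether this is the failure mode at hand, here is a minimal diagnostic sketch (not from the original report) that checks whether the MPI symbols are publicly visible after importing mpi4py:

import ctypes

from mpi4py import MPI  # loads libmpi under Python's current dlopen flags

# ctypes.CDLL(None) exposes the symbols loaded into the global scope of the
# running process (dlopen(NULL) on Linux).
try:
    ctypes.CDLL(None).MPI_Init
    print("MPI symbols are publicly visible")
except AttributeError:
    print("MPI symbols are private; libcutensornet_distributed_interface_mpi.so "
          "will not be able to resolve them")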

There are multiple workarounds that users can choose from:

  1. Load the MPI symbols via LD_PRELOAD, e.g., mpiexec -n 2 -env LD_PRELOAD=$MPI_HOME/lib/libmpi.so python example22_mpi_auto.py
  2. Change Python's default dynamic-loading mode to public (global) before any other imports (see the sketch after this list):
import os, sys
sys.setdlopenflags(os.RTLD_LAZY | os.RTLD_GLOBAL)
import ...
  3. If compiling libcutensornet_distributed_interface_mpi.so manually, link the MPI library to it via -lmpi
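
For concreteness, here is a minimal sketch of workaround 2 applied to the top of the sample (an illustration, not the sample's exact code; the distributed_reset_configuration/get_mpi_comm_pointer calls follow the sample's usage):

import os, sys
# Must run before any import that loads libmpi (in particular mpi4py.MPI),
# so that the MPI symbols end up in the public (global) scope.
sys.setdlopenflags(os.RTLD_LAZY | os.RTLD_GLOBAL)

from mpi4py import MPI
from cuquantum import cutensornet as cutn

comm = MPI.COMM_WORLD
handle = cutn.create()
# Hand the MPI communicator over to cuTensorNet, as done in the sample
cutn.distributed_reset_configuration(handle, *cutn.get_mpi_comm_pointer(comm))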

In a future release, we will add a fix to work around this limitation. See also #30 for discussion.

leofang self-assigned this Jan 23, 2023
leofang pinned this issue Jan 23, 2023

leofang commented Jan 23, 2023

Internal ticket: CUQNT-1594.

yangcal unpinned this issue Mar 17, 2023

leofang commented Apr 6, 2023

This is fixed in cuQuantum 23.03. We now ask users who build libcutensornet_distributed_interface_mpi.so themselves to link it to MPI by passing -lmpi to the compiler/linker.

The cuTensorNet-MPI wrapper library (libcutensornet_distributed_interface_mpi.so) needs to be linked to the MPI library libmpi.so. If you use our conda-forge packages or the cuQuantum Appliance container, or compile your own using the provided activate_mpi.sh script, this is taken care of for you.

https://docs.nvidia.com/cuda/cuquantum/cutensornet/release_notes.html#cutensornet-v2-1-0
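
As a quick sanity check (a diagnostic sketch, not part of the release; it assumes $CUTENSORNET_COMM_LIB points at the wrapper library), one can verify that the library's libmpi dependency resolves before running a distributed contraction:

import ctypes, os

lib_path = os.environ["CUTENSORNET_COMM_LIB"]
try:
    ctypes.CDLL(lib_path)  # fails here if libmpi cannot be found or its symbols resolved
    print(f"{lib_path}: loads fine, MPI dependency resolved")
except OSError as e:
    print(f"{lib_path}: failed to load ({e})")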

leofang closed this as completed Apr 6, 2023

yapolyak commented Nov 22, 2023

Hi @leofang, sorry to re-open this, but after a while I tried automatic contraction with cuQuantum 23.06 (basically this script: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py) on Perlmutter, using its MPICH, and I again got the CUTENSORNET_STATUS_DISTRIBUTED_FAILURE error.

I do install cuQuantum from conda-forge, so according to you the linking to MPI should be sorted... However, I fetch only an "external" placeholder mpich from conda-forge and then build mpi4py locally - could it be that, because of this, libcutensornet_distributed_interface_mpi.so is not linked properly?


yapolyak commented Nov 22, 2023

Ah there we go:

ldd ~/.conda/envs/py-cuquantum-23.06.0-mypich-py3.9/lib/libcutensornet_distributed_interface_mpi.so
	linux-vdso.so.1 (0x00007fffc8de5000)
	libmpi.so.12 => not found
	libc.so.6 => /lib64/libc.so.6 (0x00007ff8c1f36000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ff8c2157000)

Let me try to link it manually if I can...

@yapolyak

Done - I added MPICH's /lib-abi-mpich path to $LD_LIBRARY_PATH, relinked, and it works now! Sorry for the noise :)
