
Calling cutn.distributed_reset_configuration() with MPICH might fail with CUTENSORNET_STATUS_DISTRIBUTED_FAILURE #31

Closed
leofang opened this issue Jan 23, 2023 · 5 comments

leofang commented Jan 23, 2023

MPICH users running the example22_mpi_auto.py sample might see the following error:

$ mpiexec -n 2 python example22_mpi_auto.py
Traceback (most recent call last):
  File "/home/leof/dev/cuquantum/python/samples/cutensornet/coarse/example22_mpi_auto.py", line 60, in <module>
    cutn.distributed_reset_configuration(
  File "cuquantum/cutensornet/cutensornet.pyx", line 2306, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 2328, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 229, in cuquantum.cutensornet.cutensornet.check_status
cuquantum.cutensornet.cutensornet.cuTensorNetError: CUTENSORNET_STATUS_DISTRIBUTED_FAILURE
Traceback (most recent call last):
  File "/home/leof/dev/cuquantum/python/samples/cutensornet/coarse/example22_mpi_auto.py", line 60, in <module>
    cutn.distributed_reset_configuration(
  File "cuquantum/cutensornet/cutensornet.pyx", line 2306, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 2328, in cuquantum.cutensornet.cutensornet.distributed_reset_configuration
  File "cuquantum/cutensornet/cutensornet.pyx", line 229, in cuquantum.cutensornet.cutensornet.check_status
cuquantum.cutensornet.cutensornet.cuTensorNetError: CUTENSORNET_STATUS_DISTRIBUTED_FAILURE

This is a known issue with the automatic MPI support in cuQuantum Python 22.11 / cuTensorNet 2.0.0 when used with mpi4py + MPICH.

The reason is that Python, by default, dynamically loads shared libraries into a private scope (see, e.g., the documentation for ctypes.DEFAULT_MODE), which breaks the assumption made by libcutensornet_distributed_interface_mpi.so (whose path is set via $CUTENSORNET_COMM_LIB) that the MPI symbols have been loaded into the public (global) scope.

Open MPI is immune to this problem because mpi4py already "breaks" this assumption (it loads the MPI library into the global scope) to work around a few old Open MPI issues.
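
To see whether this is the failure mode at hand, here is a minimal diagnostic sketch (not from the original report) that checks whether the MPI symbols are publicly visible after importing mpi4py:

import ctypes

from mpi4py import MPI  # loads libmpi under Python's current dlopen flags

# ctypes.CDLL(None) exposes the symbols loaded into the global scope of the
# running process (dlopen(NULL) on Linux).
try:
    ctypes.CDLL(None).MPI_Init
    print("MPI symbols are publicly visible")
except AttributeError:
    print("MPI symbols are private; libcutensornet_distributed_interface_mpi.so "
          "will not be able to resolve them")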

There are multiple workarounds that users can choose from:

  1. Load the MPI symbols via LD_PRELOAD, e.g., mpiexec -n 2 -env LD_PRELOAD=$MPI_HOME/lib/libmpi.so python example22_mpi_auto.py
  2. Change Python's default dynamic-loading mode to public (global) before any other imports (see the sketch after this list):
import os, sys
sys.setdlopenflags(os.RTLD_LAZY | os.RTLD_GLOBAL)
import ...
  3. If compiling libcutensornet_distributed_interface_mpi.so manually, link the MPI library to it via -lmpi
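
For concreteness, here is a minimal sketch of workaround 2 applied to the top of the sample (an illustration, not the sample's exact code; the distributed_reset_configuration/get_mpi_comm_pointer calls follow the sample's usage):

import os, sys
# Must run before any import that loads libmpi (in particular mpi4py.MPI),
# so that the MPI symbols end up in the public (global) scope.
sys.setdlopenflags(os.RTLD_LAZY | os.RTLD_GLOBAL)

from mpi4py import MPI
from cuquantum import cutensornet as cutn

comm = MPI.COMM_WORLD
handle = cutn.create()
# Hand the MPI communicator over to cuTensorNet, as done in the sample
cutn.distributed_reset_configuration(handle, *cutn.get_mpi_comm_pointer(comm))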

In a future release, we will add a fix to work around this limitation. See also #30 for discussion.

leofang self-assigned this Jan 23, 2023
leofang pinned this issue Jan 23, 2023

leofang commented Jan 23, 2023

Internal ticket: CUQNT-1594.

yangcal unpinned this issue Mar 17, 2023

leofang commented Apr 6, 2023

This is fixed in cuQuantum 23.03. We now ask users who build libcutensornet_distributed_interface_mpi.so themselves to link it to MPI by passing -lmpi to the compiler/linker.

The cuTensorNet-MPI wrapper library (libcutensornet_distributed_interface_mpi.so) needs to be linked to the MPI library libmpi.so. If you use our conda-forge packages or the cuQuantum Appliance container, or compile your own using the provided activate_mpi.sh script, this is taken care of for you.

https://docs.nvidia.com/cuda/cuquantum/cutensornet/release_notes.html#cutensornet-v2-1-0
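
As a quick sanity check (a diagnostic sketch, not part of the release; it assumes $CUTENSORNET_COMM_LIB points at the wrapper library), one can verify that the library's libmpi dependency resolves before running a distributed contraction:

import ctypes, os

lib_path = os.environ["CUTENSORNET_COMM_LIB"]
try:
    ctypes.CDLL(lib_path)  # fails here if libmpi cannot be found or its symbols resolved
    print(f"{lib_path}: loads fine, MPI dependency resolved")
except OSError as e:
    print(f"{lib_path}: failed to load ({e})")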

leofang closed this as completed Apr 6, 2023

yapolyak commented Nov 22, 2023

Hi @leofang, sorry to re-open this, but after a while I tried automatic contraction with cuQuantum 23.06 (basically this script: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py) on Perlmutter, using its MPICH, and I again got the CUTENSORNET_STATUS_DISTRIBUTED_FAILURE error.

I do install cuQuantum from conda-forge, so according to you the linking to MPI should be sorted... However, I fetch only an "external" placeholder mpich from conda-forge and then build mpi4py locally - could it be that, because of this, libcutensornet_distributed_interface_mpi.so is not linked properly?


yapolyak commented Nov 22, 2023

Ah there we go:

ldd ~/.conda/envs/py-cuquantum-23.06.0-mypich-py3.9/lib/libcutensornet_distributed_interface_mpi.so
	linux-vdso.so.1 (0x00007fffc8de5000)
	libmpi.so.12 => not found
	libc.so.6 => /lib64/libc.so.6 (0x00007ff8c1f36000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ff8c2157000)

Let me try to link it manually if I can...

@yapolyak

Done - I added MPICH's /lib-abi-mpich path to $LD_LIBRARY_PATH, relinked, and it works now! Sorry for the noise :)
