Calling cutn.distributed_reset_configuration() with MPICH might fail with CUTENSORNET_STATUS_DISTRIBUTED_FAILURE #31
Internal ticket: CUQNT-1594.
This is fixed in cuQuantum 23.03; we now ask users to link the MPI wrapper library against MPI. See the release notes: https://docs.nvidia.com/cuda/cuquantum/cutensornet/release_notes.html#cutensornet-v2-1-0
Hi @leofang, sorry to re-open this, but after a while I tried automatic contraction with cuQuantum 23.06 (basically this script: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py) on Perlmutter, using its MPICH, and I again got the CUTENSORNET_STATUS_DISTRIBUTED_FAILURE error. I do build cuQuantum from conda-forge, so according to you the linking to MPI should be sorted... however, I fetch only an "external" placeholder mpich from conda-forge and then build mpi4py locally. Could it be that, because of that, libcutensornet_distributed_interface_mpi.so is not linked properly?
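One quick way to check that, sketched here under the assumption that $CUTENSORNET_COMM_LIB points at the wrapper library and that ldd is available on the node, is to inspect the library's recorded dependencies:

```python
# Minimal check (paths/sonames are assumptions, not from this thread):
# does libcutensornet_distributed_interface_mpi.so declare a dependency
# on MPICH's libmpi, or does it expect MPI symbols to already be public?
import os
import subprocess

lib = os.environ.get("CUTENSORNET_COMM_LIB",
                     "libcutensornet_distributed_interface_mpi.so")
out = subprocess.run(["ldd", lib], capture_output=True, text=True)
print(out.stdout)
# A line like "libmpi.so.12 => /path/to/mpich/lib/libmpi.so.12" means the
# wrapper is linked against MPICH; if no libmpi line appears, it relies on
# MPI symbols being loaded into the public (global) scope at runtime.
```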
Ah, there we go:
Let me try to link it manually if I can...
Done - I added the MPICH's
MPICH users running this sample might see a CUTENSORNET_STATUS_DISTRIBUTED_FAILURE error.
This is a known issue for the automatic MPI support using cuQuantum Python 22.11 / cuTensorNet 2.0.0 + mpi4py + MPICH.
The reason is that Python by default dynamically loads shared libraries in the private mode (see, e.g., the documentation for ctypes.DEFAULT_MODE), which breaks the assumption of libcutensornet_distributed_interface_mpi.so (whose path is set via $CUTENSORNET_COMM_LIB) that MPI symbols would be loaded into the public scope. Open MPI is immune to this problem because mpi4py had to "break" this assumption due to a few old Open MPI issues.
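To make that concrete, here is a minimal sketch of why a private-scope load is not enough. It assumes MPICH's ABI soname is libmpi.so.12, that $CUTENSORNET_COMM_LIB is set, and that the wrapper library is not itself linked against libmpi:

```python
import ctypes
import os

# Loading MPICH with the default (private) mode keeps its MPI_* symbols
# local to this handle; nothing is added to the public scope.
ctypes.CDLL("libmpi.so.12")  # assumption: MPICH's ABI soname on this system

# The wrapper library has undefined MPI_* symbols. With the MPI symbols kept
# private, dlopen-ing it fails, which cuTensorNet then reports as
# CUTENSORNET_STATUS_DISTRIBUTED_FAILURE.
try:
    ctypes.CDLL(os.environ["CUTENSORNET_COMM_LIB"])
except OSError as exc:
    print("private-scope load failed:", exc)  # typically "undefined symbol: MPI_..."
```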
There are multiple workarounds that users can choose from:

1. Preload the MPI library via LD_PRELOAD, e.g., mpiexec -n 2 -env LD_PRELOAD=$MPI_HOME/lib/libmpi.so python example22_mpi_auto.py (a script-side variant is sketched after this list).
2. Build libcutensornet_distributed_interface_mpi.so manually and link the MPI library to it via -lmpi.
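A script-side variant of the first workaround (unofficial, and assuming MPICH's libmpi.so.12 is on the loader's search path) is to promote the MPI symbols to the public scope from Python itself before the distributed configuration is set, following the structure of example22_mpi_auto.py:

```python
import ctypes

# Load MPICH into the *global* scope before anything dlopen-s the wrapper
# library; this mimics the effect of LD_PRELOAD. The soname "libmpi.so.12"
# is an assumption (MPICH ABI); adjust it for your installation.
ctypes.CDLL("libmpi.so.12", mode=ctypes.RTLD_GLOBAL)

from mpi4py import MPI                      # noqa: E402
from cuquantum import cutensornet as cutn   # noqa: E402

handle = cutn.create()
comm = MPI.COMM_WORLD
# The call below is the one that otherwise fails with
# CUTENSORNET_STATUS_DISTRIBUTED_FAILURE under MPICH.
cutn.distributed_reset_configuration(handle, *cutn.get_mpi_comm_pointer(comm))
cutn.destroy(handle)
```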
In a future release, we will add a fix to work around this limitation. See also #30 for discussion.