cuTensorNetError: CUTENSORNET_STATUS_DISTRIBUTED_FAILURE on Perlmutter #30

Answered by yapolyak
yapolyak asked this question in Q&A

Two more updates:

  • I used the same setup (cuQuantum-python and mpi4py installed in a local Conda env, with Perlmutter's default MPICH) to run the CUDA-aware MPI example at https://github.com/mpi4py/mpi4py/blob/master/demo/cuda-aware-mpi/use_cupy.py, and it works fine. So I suppose my issue must come from cuQuantum (cuTensorNet) not talking correctly to MPICH.
  • I tried installing OpenMPI from within Conda together with cuQuantum-python, but every time it seems to install an "external package", specifically openmpi-4.1.4-external_2, and as a result there are neither MPI libraries nor executables such as mpirun in my environment. Would you have any suggestion of wha…
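
For anyone wanting to reproduce the second observation, here is a minimal sketch of how one might check whether an environment actually ships an MPI launcher or MPI shared libraries. It uses only the Python standard library. The helper name `find_mpi_in_prefix` is hypothetical, and the `bin/` and `lib/` layout plus the `libmpi*` file pattern are assumptions about a typical Conda environment on Linux, not something confirmed in this thread.

```python
import glob
import os
import shutil
import sys


def find_mpi_in_prefix(prefix):
    """Return MPI launchers and libmpi* files found under `prefix`.

    Hypothetical helper: assumes a Conda-style layout with launchers in
    <prefix>/bin and shared libraries in <prefix>/lib.
    """
    bin_dir = os.path.join(prefix, "bin")
    lib_dir = os.path.join(prefix, "lib")

    # Look only inside this environment's bin/, not the whole PATH,
    # so a system-wide mpirun is not mistaken for a Conda-provided one.
    launchers = [
        name
        for name in ("mpirun", "mpiexec")
        if shutil.which(name, path=bin_dir) is not None
    ]

    # Assumed naming pattern for MPI shared libraries (e.g. libmpi.so.*).
    pattern = os.path.join(glob.escape(lib_dir), "libmpi*")
    libraries = sorted(glob.glob(pattern))

    return {"launchers": launchers, "libraries": libraries}


if __name__ == "__main__":
    # CONDA_PREFIX is set by `conda activate`; fall back to sys.prefix.
    prefix = os.environ.get("CONDA_PREFIX", sys.prefix)
    report = find_mpi_in_prefix(prefix)
    print(f"prefix:    {prefix}")
    print(f"launchers: {report['launchers'] or 'none found'}")
    print(f"libraries: {report['libraries'] or 'none found'}")
```

If both lists come back empty inside the activated environment, that would match what I see with the openmpi-4.1.4-external_2 build: the package is present in name only, and the actual MPI has to come from somewhere else on the system.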

Replies: 3 comments 23 replies

Answer selected by yapolyak
3 participants