Test NCCL failure common.cu:1285 : internal error #50

Open
Eliasj42 opened this issue Oct 2, 2023 · 0 comments

Eliasj42 commented Oct 2, 2023

Hi, when I run the example from the README (./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4), I get this error:

# nThreads: 1 nGpus: 4 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid 117738 on sharbox-ultra device  0 [0000:63:00.0] AMD Instinct MI210
#   Rank  1 Pid 117738 on sharbox-ultra device  1 [0000:43:00.0] AMD Instinct MI210
#   Rank  2 Pid 117738 on sharbox-ultra device  2 [0000:30:00.0] AMD Instinct MI210
#   Rank  3 Pid 117738 on sharbox-ultra device  3 [0000:03:00.0] AMD Instinct MI210
sharbox-ultra: Test NCCL failure common.cu:1285 'internal error - please report this issue to the NCCL developers'
 .. sharbox-ultra pid 117738: Test failure common.cu:1161
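
For reference, the header line in the output above follows from the flags: -b 8 and -e 128M give minBytes 8 and maxBytes 134217728, and -f 2 is reported as step: 2(factor). Here is a minimal Python sketch of that size sweep, assuming the tool multiplies the message size by the factor until it exceeds the maximum. The function name sweep_sizes is made up for illustration and is not part of the test tool.

```python
def sweep_sizes(min_bytes, max_bytes, factor):
    """Return the message sizes from min_bytes up to max_bytes,
    multiplying by factor at each step (assumed sweep behavior)."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


# -b 8 -e 128M -f 2, with 128M taken as 128 * 1024 * 1024 bytes
sizes = sweep_sizes(8, 128 * 1024 * 1024, 2)
print(sizes[0], sizes[-1], len(sizes))  # prints: 8 134217728 25
```

The last value matches the maxBytes: 134217728 shown in the header.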

When I ran the unit tests from my RCCL build, I got this output:

[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from AllReduce
[ RUN      ] AllReduce.OutOfPlace
[ INFO     ] Calling PIPE_READ to Child 0

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:125 NCCL WARN Missing "amd_iommu=on" from kernel command line which can lead to system instablity or hang!

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:127 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:132 NCCL WARN Missing "HSA_FORCE_FINE_GRAIN_PCIE=1" from environment which can lead to low RCCL performance, system instablity or hang!
[ INFO     ] Got PIPE_READ 128 from Child 0
[ INFO     ] Calling PIPE_READ to Child 0
[ INFO     ] Got PIPE_READ 4 from Child 0
[ INFO     ] Calling PIPE_READ to Child 0
RCCL version 2.18.3+hip5.5 develop:6ecf771+

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:220 NCCL WARN hipIpcGetMemHandle failed : invalid argument

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:222 NCCL WARN Cuda failure 'invalid argument'

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 1] Failed to execute operation Setup from rank 1, retcode 1

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:220 NCCL WARN hipIpcGetMemHandle failed : invalid argument

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:222 NCCL WARN Cuda failure 'invalid argument'

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 0] Failed to execute operation Setup from rank 0, retcode 1

sharbox-ultra:204802:204835 [1] /home/elias/rccl/build/release/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer sharbox-ultra<48295>

sharbox-ultra:204802:204835 [1] /home/elias/rccl/build/release/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x7fe95066b760

sharbox-ultra:204802:204834 [0] /home/elias/rccl/build/release/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer sharbox-ultra<35867>

sharbox-ultra:204802:204834 [0] /home/elias/rccl/build/release/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x7fe958288240
[ ERROR    ] Child process 0 fails NCCL call ncclGroupEnd with code 3
[ ERROR    ] Child 0 failed on command [INIT_COMMS]:
[ INFO     ] Got PIPE_READ 4 from Child 0
[ ERROR    ] Child 0 reports failure
/home/elias/rccl/test/common/TestBed.cpp:178: Failure
Expected equality of these values:
  response
    Which is: 1
  TEST_SUCCESS
    Which is: 0
[  FAILED  ] AllReduce.OutOfPlace (665 ms)

Do you have any idea what could be causing this failure?
The hipIpcGetMemHandle "invalid argument" errors look like the root cause, but I'm unsure what to do about them.
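
The first three NCCL WARN lines in the test output name settings that are missing on this machine: amd_iommu=on and iommu=pt on the kernel command line, and HSA_FORCE_FINE_GRAIN_PCIE=1 in the environment. Below is a minimal Python sketch for checking them, assuming a Linux system where the kernel command line is readable at /proc/cmdline. The helper names are hypothetical, and the logs do not confirm whether setting these resolves the hipIpcGetMemHandle failure.

```python
import os

# Settings named in the NCCL WARN lines of the log
REQUIRED_KERNEL_ARGS = ("amd_iommu=on", "iommu=pt")
REQUIRED_ENV = {"HSA_FORCE_FINE_GRAIN_PCIE": "1"}


def missing_kernel_args(cmdline, required=REQUIRED_KERNEL_ARGS):
    """Return required kernel arguments absent from the command line.
    Matches whole tokens so "iommu=pt" is not confused with "amd_iommu=...".
    """
    present = set(cmdline.split())
    return [arg for arg in required if arg not in present]


def missing_env(env, required=REQUIRED_ENV):
    """Return required NAME=value settings that env does not match."""
    return [f"{name}={value}" for name, value in required.items()
            if env.get(name) != value]


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line; return "" if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


print("missing kernel args:", missing_kernel_args(read_cmdline()))
print("missing environment:", missing_env(os.environ))
```

Both lists come back empty when all three settings are in place. Kernel arguments only take effect after the bootloader configuration is changed and the machine is rebooted, while the environment variable has to be set in the shell that launches the test.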
