NCCL WARN Failed to open libibverbs.so[.1] #12219
-
I just received two A6000s, and these are not compatible
So I upgraded my docker to
I also changed my code for the Lightning breaking change from
to
When I try to train, it just stops, so I set the environment variable NCCL_DEBUG=WARN
The same happens when I try
My old setup was 2x RTX Titan with NVLink, while the new setup is 2x A6000 without NVLink. The NVIDIA docs say that PCIe is used, but it is unclear whether I need to do anything to enable this. The distributed communication docs say "NCCL backends are built and included in PyTorch distributed (NCCL only when building with CUDA)". I suspect I am missing something about the breaking changes from PL 1.0 to 1.5. I would appreciate hints as to what to look for.
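For anyone debugging a similar setup, here is a minimal sketch of a pre-flight check that could be run before launching training. It only uses the Python standard library. The helper names `nccl_debug_env` and `ibverbs_available` are hypothetical and not part of PyTorch, Lightning, or NCCL. `NCCL_DEBUG` and `NCCL_IB_DISABLE` are documented NCCL environment variables. The assumption here is that the libibverbs warning means the InfiniBand verbs library is not present in the container. Whether that is what makes training stop is not established in this thread.

```python
import ctypes.util
import os


def nccl_debug_env(disable_ib: bool = False) -> dict:
    """Build NCCL environment variables for more verbose diagnostics.

    Hypothetical helper: it only assembles a dict and does not touch NCCL.
    """
    # INFO prints more detail than WARN, including which transport is picked.
    env = {"NCCL_DEBUG": "INFO"}
    if disable_ib:
        # Ask NCCL not to use the InfiniBand/RoCE transport at all.
        env["NCCL_IB_DISABLE"] = "1"
    return env


def ibverbs_available() -> bool:
    """Return True if the dynamic loader can locate a libibverbs library."""
    return ctypes.util.find_library("ibverbs") is not None


if __name__ == "__main__":
    has_ib = ibverbs_available()
    # Only disable the IB transport when the library is missing anyway.
    env = nccl_debug_env(disable_ib=not has_ib)
    # Must be set before the training process initializes NCCL.
    os.environ.update(env)
    print("libibverbs found:", has_ib)
    print("NCCL environment:", env)
```

On a single machine without InfiniBand hardware, NCCL is expected to fall back to other transports (shared memory, PCIe peer-to-peer, or sockets), so the INFO-level log should show which one it selects for the two GPUs.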
Replies: 2 comments 1 reply
-
I reduced the delta to only
staying on
-
Duplicate of #12235.