Hi, I don't know much about NCCL.
I want to train a deep learning model on multiple GPU devices within a single node using PyTorch.
I don't know the exact reason, but the model "freezes" (gets stuck) when using 4 or more GPUs. While trying various things, I confirmed that the model works after setting the environment variable NCCL_P2P_DISABLE=1.
As far as I know, if NCCL_P2P_DISABLE is set to 1, communication between GPUs goes through shared memory instead of P2P/IPC.
I would like to know what potential problems can arise when NCCL_P2P_DISABLE is set to 1 like this. I'm guessing there won't be any problems, right?
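For reference, here is roughly how I'm applying the workaround (a minimal sketch; the torchrun launch, `train.py` name, and the DDP setup are just placeholders, not my exact script):

```python
import os
import torch
import torch.distributed as dist

def main():
    # NCCL reads this variable when the communicator is created, so it must
    # be set before init_process_group (or exported before launching).
    os.environ["NCCL_P2P_DISABLE"] = "1"

    # Assumes launch via: torchrun --nproc_per_node=4 train.py
    # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # ... build the model, wrap it in DistributedDataParallel, train ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```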