Closed
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), won't fix (This will not be worked on)
Description
Is your feature request related to a problem? Please describe.
I am working on a cluster that forbids opening sockets on compute nodes that anyone can connect to unauthenticated. This means I can't use the 'env://' initialization method for torch.distributed. Furthermore, I may want to train on systems that do not have CUDA, and would therefore like the option of using an MPI backend for torch.distributed.
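For concreteness, a minimal sketch of the two initialization paths this is asking about, using plain torch.distributed. The shared path `/shared/scratch/ddp_init` and the `WORLD_SIZE`/`RANK` environment variables are assumptions to be adapted to the cluster's job launcher:

```python
import os
import torch.distributed as dist

def init_distributed(use_mpi: bool = False) -> None:
    if use_mpi:
        # MPI backend: rank and world size come from the MPI runtime
        # (e.g. mpirun), so no rendezvous socket is opened at all.
        # Requires a PyTorch build with MPI support.
        dist.init_process_group(backend="mpi")
    else:
        # Shared-filesystem rendezvous: processes coordinate through a
        # file instead of a listening TCP socket, sidestepping the
        # cluster's restriction on open, unauthenticated sockets.
        dist.init_process_group(
            backend="gloo",  # CPU-capable; use "nccl" where CUDA is available
            init_method="file:///shared/scratch/ddp_init",  # assumed path
            world_size=int(os.environ["WORLD_SIZE"]),
            rank=int(os.environ["RANK"]),
        )
```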
Describe the solution you'd like
- Options to change the distributed backend for torch.distributed
- Options to change the default bootstrapping to include things like shared-filesystem support (see the sketch after this list).
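One way such options could surface for a user, as a hedged sketch only: it assumes pytorch-lightning exposes an overridable connection hook on the LightningModule (`init_ddp_connection` below is illustrative, not a confirmed API), and the filesystem path is again a placeholder:

```python
import torch.distributed as dist
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def init_ddp_connection(self, proc_rank: int, world_size: int) -> None:
        # Replace the default 'env://' bootstrap with a shared-filesystem
        # store; backend and path here are assumptions for illustration.
        dist.init_process_group(
            backend="gloo",
            init_method="file:///shared/scratch/ddp_init",
            world_size=world_size,
            rank=proc_rank,
        )
```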
Describe alternatives you've considered
- I've debated coding everything up myself using Horovod, but I don't like how it locks me into MPI and won't let me use NCCL afterwards. Using Horovod also breaks pytorch-lightning's abstraction of handling such parallelism for you, and conflicts with the PyTorch default of using the 'dp' backend when possible.