Multi-node, multi-GPU data parallel training. #105
Conversation
@samwaltonnorwood Amazing, thank you! I will test this and merge as soon as possible.
mace/tools/slurm_distributed.py (outdated)
def _setup_distr_env(self):
    hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])
    os.environ['MASTER_ADDR'] = hostnames[0]
    os.environ['MASTER_PORT'] = '33333'  # arbitrary
It's unlikely that port 33333 will be occupied, but it might be better to use something like the following to guarantee that an open port is used:
from socket import socket

s = socket()
s.bind((hostnames[0], 0))  # port 0: let the OS pick a free port
os.environ['MASTER_PORT'] = str(s.getsockname()[1])  # environment variables must be strings
This is much cleaner on paper, I agree, but it's not quite applicable here: run_train.py is launched once per training process, so each process's socket() call would pick a different port.
We could use socket() to choose the port on only one of the processes, but then we would have to find some way to communicate that port to the other processes before the distributed communication has been initialized, which gets messy. It might be preferable to simply allow the port to be set in the shell with an environment variable:
os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '33333')
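For illustration, a minimal sketch of the setup helper with that fallback folded in (same hostlist-based approach as the snippet above; the default of 33333 remains arbitrary):

import os
import hostlist  # python-hostlist, as used in slurm_distributed.py

def _setup_distr_env(self):
    # Every process derives the same MASTER_ADDR from the Slurm node list,
    # so no extra inter-process communication is needed.
    hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])
    os.environ['MASTER_ADDR'] = hostnames[0]
    # Allow the port to be overridden from the shell; fall back to the default.
    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '33333')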
Hey @santi921, we were wondering if you could run both multi-GPU training and validation with this PR. If not, could you describe here the trouble you ran into?
So I tested this on two clusters we use. One ran with --distributed; the other ran fine without --distributed but wouldn't initialize training when I turned the flag on. I can update with my environments and logs in a day or two. I also have yet to try getting distributed training to work on non-Slurm systems, but @samwaltonnorwood mentioned it wouldn't be crazy to implement.
Currently suitable for Slurm clusters. Compatible with HDF5: I didn't encounter any issues with HDF5 parallel data loading using 8 CPU workers per GPU, but this isn't thoroughly tested. Training is parallelized, but validation is done on a single GPU (to be amended in a future pull request). An example script (2 nodes / 20 GPUs, loading from HDF5) is provided at scripts/distributed_example.sbatch.
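For readers unfamiliar with the pattern, here is a minimal sketch of the data-parallel setup described above in plain PyTorch (illustrative only; the function, dataset, and loss names are placeholders, not the actual MACE run_train code):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train_distributed(model, loss_fn, train_set, valid_set, num_epochs=10, batch_size=32):
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are assumed to be set from the
    # Slurm environment (as in slurm_distributed.py) before this is called.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    # One process per GPU; the node-local rank selects the device (SLURM_LOCALID under srun).
    device = torch.device("cuda", int(os.environ.get("SLURM_LOCALID", 0)))
    torch.cuda.set_device(device)

    ddp_model = DDP(model.to(device), device_ids=[device.index])
    optimizer = torch.optim.Adam(ddp_model.parameters())

    # Each rank loads a disjoint shard of the (e.g. HDF5-backed) training set,
    # with several CPU workers per GPU handling data loading in parallel.
    sampler = DistributedSampler(train_set)
    train_loader = DataLoader(train_set, batch_size=batch_size, sampler=sampler, num_workers=8)
    valid_loader = DataLoader(valid_set, batch_size=batch_size, num_workers=8)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        ddp_model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs.to(device)), targets.to(device))
            loss.backward()  # gradients are averaged across all ranks here
            optimizer.step()

        # Validation currently runs on a single GPU (rank 0 only), as noted above.
        if rank == 0:
            ddp_model.eval()
            with torch.no_grad():
                val_loss = sum(
                    loss_fn(ddp_model(x.to(device)), y.to(device)).item()
                    for x, y in valid_loader
                )
        dist.barrier()

    dist.destroy_process_group()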