Multi-node, multi-GPU data parallel training. #105

Merged: 2 commits into ACEsuit:multi-GPU on Jul 24, 2023

Conversation

samwaltonnorwood (Contributor) commented May 22, 2023:

Currently suitable for Slurm clusters and compatible with HDF5. I didn't encounter any issues with HDF5 parallel data loading using 8 CPU workers per GPU, but this isn't thoroughly tested. Training is parallelized, but validation is done on a single GPU (to be amended in a future pull request). An example script (2 nodes / 20 GPUs, loading from HDF5) is provided at scripts/distributed_example.sbatch.
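For context, the data-parallel pattern this targets is standard PyTorch DistributedDataParallel driven by Slurm task variables. A minimal, generic sketch follows; it is illustrative only and not this PR's code (the toy model and dataset are placeholders, and num_workers=8 simply mirrors the 8 CPU workers per GPU mentioned above):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Slurm exposes the global rank, world size and per-node GPU index for each task
    rank = int(os.environ['SLURM_PROCID'])
    world_size = int(os.environ['SLURM_NTASKS'])
    local_rank = int(os.environ['SLURM_LOCALID'])

    # MASTER_ADDR / MASTER_PORT are assumed to be set already (see _setup_distr_env below)
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # placeholder model and data; the real training setup lives in run_train.py
    model = DDP(torch.nn.Linear(8, 1).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)          # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=8)

    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.MSELoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()        # DDP all-reduces the gradients
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()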

ilyes319 (Contributor) commented:

@samwaltonnorwood Amazing, thank you! I will test that and merge as soon as possible.

def _setup_distr_env(self):
    hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])
    os.environ['MASTER_ADDR'] = hostnames[0]   # first allocated node hosts the rendezvous
    os.environ['MASTER_PORT'] = '33333'        # arbitrary fixed port
Contributor commented on this snippet:

It's unlikely that port 33333 will be occupied, but it might be better to use the following to guarantee that an open port is used:

from socket import socket

s = socket()
s.bind((hostnames[0], 0))                              # port 0 lets the OS pick a free port
os.environ['MASTER_PORT'] = str(s.getsockname()[1])    # environment values must be strings

samwaltonnorwood (Contributor, Author) replied Jun 16, 2023:

This is much cleaner on paper, I agree -- but it's not quite applicable here, because run_train.py is launched multiple times, once for each training process, so on each process, socket() will find a different port.

Now, we could use socket() to choose the port on only one of the processes, but then we have to find some means to communicate the port to the other processes before the distributed communication has been initialized, which gets messy. It might be preferable to simply allow the port to be set in the shell with an environment variable:

os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '33333')
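Concretely, with that fallback the setup would read roughly as follows (a sketch, assuming the hostlist-based MASTER_ADDR from the snippet above stays unchanged):

import os
import hostlist  # python-hostlist, as used above to expand SLURM_JOB_NODELIST

def _setup_distr_env(self):
    hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])
    os.environ['MASTER_ADDR'] = hostnames[0]
    # honour a port exported in the shell; otherwise fall back to 33333
    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '33333')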

ilyes319 (Contributor) commented Jul 6, 2023:

Hey @santi921,

We were wondering if you could run both multi-GPU training and validation with this PR. If not, could you describe here the problems you ran into?

santi921 commented Jul 6, 2023:

I tested this on two clusters we use: one ran fine with --distributed, while the other ran fine without --distributed but wouldn't initialize training when I turned the flag on. I can update with my environments and logs in a day or two.

I also have yet to try getting distributed training to work on non-Slurm systems, but @samwaltonnorwood mentioned it wouldn't be crazy to implement.

ilyes319 merged commit 7befa8f into ACEsuit:multi-GPU on Jul 24, 2023.