Multi-node, multi-GPU data parallel training. #105
Conversation
@samwaltonnorwood Amazing, thank you! I will test this and merge as soon as possible.
mace/tools/slurm_distributed.py (outdated)
def _setup_distr_env(self):
    hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])
    os.environ['MASTER_ADDR'] = hostnames[0]
    os.environ['MASTER_PORT'] = '33333'  # arbitrary
It's unlikely that port 33333 will be occupied, but it might be better to use something like the following to guarantee that an open port is used:
from socket import socket

s = socket()
s.bind((hostnames[0], 0))  # port 0: let the OS pick a free port
os.environ['MASTER_PORT'] = str(s.getsockname()[1])  # environment variables must be strings
This is much cleaner on paper, I agree, but it's not quite applicable here: run_train.py is launched once per training process, so each process's socket() call would pick a different port.
We could use socket() to choose the port on only one of the processes, but then we would have to find some way to communicate that port to the other processes before the distributed communication has been initialized, which gets messy. It might be preferable to simply allow the port to be set in the shell with an environment variable:
os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '33333')
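For illustration, a minimal sketch of the setup helper with that fallback folded in (same hostlist-based approach as the snippet above; the default of 33333 remains arbitrary):

import os
import hostlist  # python-hostlist, as used in slurm_distributed.py

def _setup_distr_env(self):
    # Every process derives the same MASTER_ADDR from the Slurm node list,
    # so no extra inter-process communication is needed.
    hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])
    os.environ['MASTER_ADDR'] = hostnames[0]
    # Allow the port to be overridden from the shell; fall back to the default.
    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '33333')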
Hey @santi921, we were wondering if you could run both multi-GPU training and validation with this PR. If not, could you describe here the trouble you ran into?
So I tested this on two clusters we use. One ran with --distributed; the other ran fine without --distributed but wouldn't initialize training when I turned the flag on. I can update with my environments and logs in a day or two. I also have yet to try getting distributed training to work on non-Slurm systems, but @samwaltonnorwood mentioned it wouldn't be crazy to implement.
Currently suitable for Slurm clusters. Compatible with HDF5: I didn't encounter any issues with HDF5 parallel data loading using 8 CPU workers per GPU, but this isn't thoroughly tested. Training is parallelized, but validation is done on a single GPU (to be amended in a future pull request). An example script (2 nodes / 20 GPUs, loading from HDF5) is provided at scripts/distributed_example.sbatch.
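For readers unfamiliar with the pattern, here is a minimal sketch of the data-parallel setup described above in plain PyTorch (illustrative only; the function, dataset, and loss names are placeholders, not the actual MACE run_train code):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train_distributed(model, loss_fn, train_set, valid_set, num_epochs=10, batch_size=32):
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are assumed to be set from the
    # Slurm environment (as in slurm_distributed.py) before this is called.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    # One process per GPU; the node-local rank selects the device (SLURM_LOCALID under srun).
    device = torch.device("cuda", int(os.environ.get("SLURM_LOCALID", 0)))
    torch.cuda.set_device(device)

    ddp_model = DDP(model.to(device), device_ids=[device.index])
    optimizer = torch.optim.Adam(ddp_model.parameters())

    # Each rank loads a disjoint shard of the (e.g. HDF5-backed) training set,
    # with several CPU workers per GPU handling data loading in parallel.
    sampler = DistributedSampler(train_set)
    train_loader = DataLoader(train_set, batch_size=batch_size, sampler=sampler, num_workers=8)
    valid_loader = DataLoader(valid_set, batch_size=batch_size, num_workers=8)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        ddp_model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs.to(device)), targets.to(device))
            loss.backward()  # gradients are averaged across all ranks here
            optimizer.step()

        # Validation currently runs on a single GPU (rank 0 only), as noted above.
        if rank == 0:
            ddp_model.eval()
            with torch.no_grad():
                val_loss = sum(
                    loss_fn(ddp_model(x.to(device)), y.to(device)).item()
                    for x, y in valid_loader
                )
        dist.barrier()

    dist.destroy_process_group()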