Description
🐛 Bug
Hey everyone,
I am trying to train a model on the GPU workstation of our lab (which has 10 GPUs, of which usually only one is in use) using Lightning and DDP. I have tried with several models (including the BoringModel) without success. In particular, I get a CUDA OOM error when DDP initializes. I tried the BoringModel with the following Trainer configuration:
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1,
    limit_val_batches=1,
    max_epochs=1,
    weights_summary=None,
    gpus=2,
    accelerator="ddp",
    auto_select_gpus=True,
)

And the output I get is the following:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "boring_model.py", line 138, in <module>
run_test()
File "boring_model.py", line 133, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 138, in <module>
run_test()
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 133, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
The BoringModel script I run on our workstation is in this gist.
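For reference, here is a minimal sketch of what that script boils down to (the model and dataset are the standard BoringModel/RandomDataset template, reconstructed here rather than copied from the gist; the Trainer configuration is the one shown above):

import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    # Random tensors, just enough to drive the training loop.
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return {"loss": self(batch).sum()}

    def validation_step(self, batch, batch_idx):
        return {"val_loss": self(batch).sum()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run_test():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        accelerator="ddp",
        auto_select_gpus=True,
    )
    trainer.fit(model, train_data, val_data)


if __name__ == "__main__":
    run_test()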
However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).
I also tried running the same notebook locally, and on the first attempt the result is the following:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)
<ipython-input-10-d400f0366266> in test_x(tmpdir)
16
17 # Train the model ⚡
---> 18 trainer.fit(model, train, val)
19
20 trainer.test(test_dataloaders=test)
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
442 self.call_hook('on_fit_start')
443
--> 444 results = self.accelerator_backend.train()
445 self.accelerator_backend.teardown()
446
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in train(self)
146 model = self.trainer.model
147
--> 148 results = self.ddp_train(process_idx=self.task_idx, model=model)
149 if 'WORLD_SIZE' in os.environ:
150 del os.environ['WORLD_SIZE']
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in ddp_train(self, process_idx, model)
236 # where to store ip_table
237 model.trainer = self.trainer
--> 238 self.init_ddp_connection(
239 self.trainer.global_rank,
240 self.trainer.world_size,
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks)
213 f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"
214 )
--> 215 torch_distrib.init_process_group(
216 torch_backend, rank=global_rank, world_size=world_size
217 )
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
440 # process groups including global variables are updated correctly on all
441 # ranks.
--> 442 barrier()
443
444 def _new_process_group_helper(world_size,
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in barrier(group, async_op)
1945 if group == GroupMember.WORLD:
1946 _check_default_pg()
-> 1947 work = _default_pg.barrier()
1948 else:
1949 work = group.barrier()
RuntimeError: CUDA error: out of memory
On the second attempt, though, it works as expected (i.e. the model trains with no errors, even with multiple GPUs)! So in the script I tried the following to attempt the fit twice, as in the notebook:

try:
    trainer.fit(model, train_data, val_data)
except:
    trainer.fit(model, train_data, val_data)

As a result, I get this stack trace:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "boring_model.py", line 135, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "boring_model.py", line 143, in <module>
run_test()
File "boring_model.py", line 137, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
model = self.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
model = self.ddp_plugin.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
model = LightningDistributedDataParallel(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 135, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 143, in <module>
run_test()
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 137, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
model = self.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
model = self.ddp_plugin.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
model = LightningDistributedDataParallel(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: Broken pipe
Expected behavior
The models should train without issues.
Environment
- CUDA:
- GPU:
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- TITAN V
- available: True
- version: 10.1
- Packages:
- numpy: 1.19.2
- pyTorch_debug: True
- pyTorch_version: 1.7.0
- pytorch-lightning: 1.0.6
- tqdm: 4.52.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.5
- version: #1 SMP Fri Oct 18 17:15:30 UTC 2019
Additional context
I tried installing torch, torchvision and pytorch-lightning with both Conda and pip in fresh environments, and the problem persists.
This also happens if I select (free) GPUs manually by specifying them in the gpus flag as a List[int]. Also interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. Final interesting fact: with accelerator="dp" I have no issues either.
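For clarity, a minimal sketch of those two variants (the GPU indices below are placeholders for whichever devices happen to be free, not the exact ones I used):

# Manually selecting free GPUs still fails with ddp:
trainer = Trainer(
    default_root_dir=os.getcwd(),
    max_epochs=1,
    gpus=[2, 3],        # placeholder indices for two free GPUs
    accelerator="ddp",
)

# The same setup with dp instead of ddp trains without errors:
trainer = Trainer(
    default_root_dir=os.getcwd(),
    max_epochs=1,
    gpus=[2, 3],
    accelerator="dp",
)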
Thanks in advance!