
CUDA OOM when initializing DDP #4705

@dedeswim


🐛 Bug

Hey everyone,

I am trying to train a model on our lab's GPU workstation (which has 10 GPUs, of which usually only one is in use) using Lightning and DDP. I have tried with several models (including the BoringModel) without success: in particular, I get a CUDA OOM error when DDP initializes. I tried the BoringModel with the following Trainer configuration:

trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1,
    limit_val_batches=1,
    max_epochs=1,
    weights_summary=None,
    gpus=2,
    accelerator="ddp",
    auto_select_gpus=True,
)

And the output I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "boring_model.py", line 138, in <module>
    run_test()
  File "boring_model.py", line 133, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 138, in <module>
    run_test()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 133, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe

The script with the BoringModel I run on our workstation is in this gist.
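For reference, a minimal script along those lines (a sketch modelled on the standard BoringModel template, not the exact gist contents) looks roughly like this:

import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    # Random vectors; just enough data to exercise the DDP setup path.
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def validation_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run_test():
    model = BoringModel()
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        accelerator="ddp",
        auto_select_gpus=True,
    )
    trainer.fit(model, train_data, val_data)


if __name__ == "__main__":
    # Guard is required: with accelerator="ddp", Lightning re-launches this
    # script once per GPU as a subprocess.
    run_test()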

However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).

I also tried to run the same notebook locally, and the result at the first attempt is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-10-d400f0366266> in test_x(tmpdir)
     16 
     17     # Train the model ⚡
---> 18     trainer.fit(model, train, val)
     19 
     20     trainer.test(test_dataloaders=test)

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    442         self.call_hook('on_fit_start')
    443 
--> 444         results = self.accelerator_backend.train()
    445         self.accelerator_backend.teardown()
    446 

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in train(self)
    146         model = self.trainer.model
    147 
--> 148         results = self.ddp_train(process_idx=self.task_idx, model=model)
    149         if 'WORLD_SIZE' in os.environ:
    150             del os.environ['WORLD_SIZE']

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in ddp_train(self, process_idx, model)
    236         # where to store ip_table
    237         model.trainer = self.trainer
--> 238         self.init_ddp_connection(
    239             self.trainer.global_rank,
    240             self.trainer.world_size,

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks)
    213                 f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"
    214             )
--> 215             torch_distrib.init_process_group(
    216                 torch_backend, rank=global_rank, world_size=world_size
    217             )

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
    440     # process groups including global variables are updated correctly on all
    441     # ranks.
--> 442     barrier()
    443 
    444 def _new_process_group_helper(world_size,

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in barrier(group, async_op)
   1945     if group == GroupMember.WORLD:
   1946         _check_default_pg()
-> 1947         work = _default_pg.barrier()
   1948     else:
   1949         work = group.barrier()

RuntimeError: CUDA error: out of memory

At the second attempt, though, it works as expected (i.e., the model trains with no errors, even with multiple GPUs)! So in the script I tried the following, to attempt the fit twice as in the notebook:

try:
    trainer.fit(model, train_data, val_data)
except:
    trainer.fit(model, train_data, val_data)
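As an aside, the bare except: above retries on any error at all; a narrower variant, assuming the first failure always surfaces as a RuntimeError as in the traces above, would be:

try:
    trainer.fit(model, train_data, val_data)
except RuntimeError:
    # Second attempt, mirroring the notebook's "works at the second try" behaviour.
    trainer.fit(model, train_data, val_data)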

As a result, I get this stack trace:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "boring_model.py", line 135, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "boring_model.py", line 143, in <module>
    run_test()
  File "boring_model.py", line 137, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
    model = self.ddp_plugin.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
    model = LightningDistributedDataParallel(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 135, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 143, in <module>
    run_test()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 137, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
    model = self.ddp_plugin.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
    model = LightningDistributedDataParallel(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Broken pipe

Expected behavior

The models should train without issues.

Environment

  • CUDA:
    • GPU:
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.2
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0
    • pytorch-lightning: 1.0.6
    • tqdm: 4.52.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.5
    • version: #1 SMP Fri Oct 18 17:15:30 UTC 2019

Additional context

I tried installing torch, torchvision, and pytorch-lightning with both conda and pip in fresh environments, and the problem persists.

This also happens if I select (free) GPUs manually by specifying them in the gpus flag as a List[int]. Interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. One final interesting fact: with accelerator="dp" I have no issues either.
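One more untested thought (an assumption on my side, not something I have verified): since the OOM happens inside barrier() before any model memory is allocated, it looks like a CUDA-context problem on the busy GPU, so masking devices at the environment level, before anything initializes CUDA, might behave differently from the gpus flag. A minimal sketch, assuming GPUs 1 and 2 are the free ones:

# Hide the busy GPU from the process entirely instead of selecting devices
# through the `gpus` flag. The mask must be set before torch is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # assumption: GPUs 1 and 2 are free

import torch  # imported only after the mask is set
from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, accelerator="ddp")

Equivalently, launching with CUDA_VISIBLE_DEVICES=1,2 python boring_model.py sets the same mask.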

Thanks in advance!


Labels

bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on)
