
DDP failed with big model under deepspeed_stage_3 strategy #11760

Closed
hepengfe opened this issue Feb 5, 2022 · 8 comments
Labels
bug (Something isn't working), strategy: deepspeed, won't fix (This will not be worked on)

Comments


hepengfe commented Feb 5, 2022

🐛 Bug

I am running a T5-11B (~40 GB) model with the deepspeed_stage_3 strategy and it fails in _init_deepspeed_distributed().
The Trainer works for smaller models such as T5-large (~3 GB).

Here is the traceback.

initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/6
Traceback (most recent call last):
  File "/home/usr/2022/cc-GKM/scripts/kg_ftt5.py", line 259, in <module>
    main()
  File "/home/usr/2022/cc-GKM/scripts/kg_ftt5.py", line 246, in main
    trainer.fit(model, train_dl, valid_dl)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run
    self.accelerator.setup_environment()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu.py", line 39, in setup_environment
    super().setup_environment()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 185, in setup_environment
    self.setup_distributed()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 358, in setup_distributed
    self._init_deepspeed_distributed()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 374, in _init_deepspeed_distributed
    self.torch_distributed_backend, distributed_port=self.cluster_environment.master_port()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/deepspeed/utils/distributed.py", line 51, in init_distributed
    init_method=init_method)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
(The same traceback is printed by a second rank.)

Another run gives a different traceback:

  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/deepspeed/utils/distributed.py", line 51, in init_distributed
    init_method=init_method)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 247, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=6, worker_count=3, timeout=0:30:00)

To Reproduce

I think a PyTorch Lightning Trainer with a t5-11b model is enough to reproduce the bug; a minimal sketch is below.

Expected behavior

Training is expected to start.

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.1+cu113
    • pytorch-lightning: 1.5.9
    • tqdm: 4.62.3
  • System:

Additional context

cc @SeanNaren @awaelchli @rohitgr7 @akihironitta

hepengfe added the bug (Something isn't working) label on Feb 5, 2022

hepengfe commented Feb 5, 2022

It's pretty similar to #4420. I am currently trying out the methods suggested there.


justusschock commented Feb 6, 2022

@feipenghe can you report back after trying #4420? Did it resolve your issue? :)


hepengfe commented Feb 6, 2022

@justusschock After several trials, I finally determined that it's an issue of not enough RAM. My machine has 256 GB of RAM. T5-3B used a peak of 126 GB of RAM, so I think T5-11B needs around 500 GB.

When I ran the code, RAM usage climbed to a peak and then dropped after messages like initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/6. It climbed again while initializing deepspeed distributed for the next device, and this cycle of RAM ups and downs repeated until all devices were initialized. For T5-3B, which as mentioned peaks at only 126 GB, the process gets through these cycles. For T5-11B, however, which by my estimate peaks at around 500 GB, the process first climbs to ~250 GB (almost my full RAM capacity) without any warning or stop, then goes through the ups and downs. After initializing deepspeed distributed for all GPUs, the process hangs for 30 minutes, and then the first traceback I reported appears.

I read some DeepSpeed documentation (https://deepspeed.readthedocs.io/en/latest/memory.html). Under the general RAM section, it says a certain amount of memory is needed at the beginning to initialize the model in CPU memory, which roughly matches my case according to the formula there. However, I am not using "deepspeed_stage_3_offload", only "deepspeed_stage_3". I wonder if ~500 GB of RAM is still needed just to initialize deepspeed distributed? A sketch of how I read the estimators from that page is below.

SeanNaren commented

Hey @feipenghe, I can confirm that I can reproduce this simply by instantiating t5-11b in the BoringModel.

Further investigation leads me to believe that this memory issue is caused by the pre-trained weights being loaded on each individual device inside from_pretrained. This adds up, and I think a few processes silently die because they cannot allocate memory.

I'm unsure what a fix could be in this case; I'll brainstorm a bit here, as this would be a good use case to tackle!

HaokunLiu commented
I met the same issue when using T5-11B. I had some discussions with a DeepSpeed developer: microsoft/DeepSpeed#1814

HaokunLiu commented
@SeanNaren, have you come up with a way to fix this?
I need it soon. If you already have an idea but don't have the bandwidth to carry it out, I would be happy to contribute.

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Apr 16, 2022
stale bot closed this as completed on Apr 24, 2022
ZeyiLiao commented
Hi, any updates here? I'm hitting the same issue with PL.
