
DDP failed with big model under deepspeed_stage_3 strategy #11760

Closed
hepengfe opened this issue Feb 5, 2022 · 8 comments
Labels
bug (Something isn't working), strategy: deepspeed, won't fix (This will not be worked on)

Comments


hepengfe commented Feb 5, 2022

🐛 Bug

I am running a T5-11B (~40 GB) model with the deepspeed_stage_3 strategy and it fails in _init_deepspeed_distributed().
The Trainer works for smaller models such as T5-large (~3 GB).

Here is the traceback.

initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/6
Traceback (most recent call last):
  File "/home/usr/2022/cc-GKM/scripts/kg_ftt5.py", line 259, in <module>
    main()
  File "/home/usr/2022/cc-GKM/scripts/kg_ftt5.py", line 246, in main
    trainer.fit(model, train_dl, valid_dl)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run
    self.accelerator.setup_environment()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu.py", line 39, in setup_environment
    super().setup_environment()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 185, in setup_environment
    self.setup_distributed()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 358, in setup_distributed
    self._init_deepspeed_distributed()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 374, in _init_deepspeed_distributed
    self.torch_distributed_backend, distributed_port=self.cluster_environment.master_port()
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/deepspeed/utils/distributed.py", line 51, in init_distributed
    init_method=init_method)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
(The same traceback is printed by a second rank.)

Another run gives a different traceback:

  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/deepspeed/utils/distributed.py", line 51, in init_distributed
    init_method=init_method)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/usr/miniconda3/envs/ccgkm/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 247, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=6, worker_count=3, timeout=0:30:00)

To Reproduce

I think a PyTorch Lightning Trainer with a t5-11b model is enough to reproduce the bug; a minimal sketch is below.

Expected behavior

Training is expected to start.

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.1+cu113
    • pytorch-lightning: 1.5.9
    • tqdm: 4.62.3
  • System:

Additional context

cc @SeanNaren @awaelchli @rohitgr7 @akihironitta

hepengfe added the bug (Something isn't working) label on Feb 5, 2022

hepengfe commented Feb 5, 2022

It's pretty similar to #4420. I am currently trying out the methods suggested there.


justusschock commented Feb 6, 2022

@feipenghe can you report back after trying #4420? Did it resolve your issue? :)


hepengfe commented Feb 6, 2022

@justusschock After several trials, I finally determined that it's an issue of not enough RAM. My machine has 256 GB of RAM. T5-3B used a peak of 126 GB of RAM, so I think T5-11B needs around 500 GB.

When I ran the code, RAM usage climbed to a peak and then dropped after messages like initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/6. It climbed again while initializing deepspeed distributed for the next device, and this cycle of RAM ups and downs repeated until all devices were initialized. For T5-3B, which as mentioned peaks at only 126 GB, the process gets through these cycles. For T5-11B, however, which by my estimate peaks at around 500 GB, the process first climbs to ~250 GB (almost my full RAM capacity) without any warning or stop, then goes through the ups and downs. After initializing deepspeed distributed for all GPUs, the process hangs for 30 minutes, and then the first traceback I reported appears.

I read some DeepSpeed documentation (https://deepspeed.readthedocs.io/en/latest/memory.html). Under the general RAM section, it says a certain amount of memory is needed at the beginning to initialize the model in CPU memory, which roughly matches my case according to the formula there. However, I am not using "deepspeed_stage_3_offload", only "deepspeed_stage_3". I wonder if ~500 GB of RAM is still needed just to initialize deepspeed distributed? A sketch of how I read the estimators from that page is below.

SeanNaren commented

Hey @feipenghe, I can confirm that I can reproduce this simply by instantiating t5-11b in the BoringModel.

Further investigation leads me to believe that this memory issue is caused by the pre-trained weights being loaded on each individual device inside from_pretrained. This adds up, and I think a few processes silently die because they cannot allocate memory.

I'm unsure what a fix could be in this case; I'll brainstorm a bit here, as this would be a good use case to tackle!

HaokunLiu commented
I met the same issue when using T5-11B. I had some discussions with a DeepSpeed developer: microsoft/DeepSpeed#1814

HaokunLiu commented
@SeanNaren, have you come up with a way to fix this?
I need it soon. If you already have an idea but don't have the bandwidth to carry it out, I would be happy to contribute.

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Apr 16, 2022
stale bot closed this as completed on Apr 24, 2022
ZeyiLiao commented
Hi, any updates here? I'm hitting the same issue with PL.
