Split one model's different parts on different gpus #7162
-
PyTorch Lightning has support for sequential model parallelism via the `RPCSequentialPlugin`: you express the model as an `nn.Sequential` and pass a `balance` describing how many of its stages go on each GPU.

```python
self.model = nn.Sequential(Bert(), nn.Linear(10, 20))  # __init__()
...
self.model(x)  # forward()
...
plugin = RPCSequentialPlugin(balance=[1, 1])  # one stage on each of two GPUs
trainer = Trainer(gpus=2, plugins=[plugin])
```
-
Hey @dalek-who, I wouldn't recommend using the `RPCSequentialPlugin`, as it is being deprecated. Instead, you can use the DeepSpeed integration: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html?highlight=deepspeed#deepspeed. We managed to scale crazily large models with it. It can also be used on only 1 GPU with CPU offloading. Give it a try and give us feedback. Best,
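For reference, a minimal sketch of what enabling the integration might look like with the 1.3-era API. `MyLightningModule` is a placeholder, and the `cpu_offload` flag for the single-GPU case is an assumption here; check the linked docs for the exact arguments:

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

model = MyLightningModule()  # placeholder: any LightningModule

# Multi-GPU: ZeRO Stage 3 shards parameters, gradients and optimizer state
trainer = pl.Trainer(gpus=4, precision=16, plugins="deepspeed_stage_3")

# Single GPU: CPU offloading (assumed flag) moves state into CPU memory
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    plugins=DeepSpeedPlugin(stage=3, cpu_offload=True),
)
trainer.fit(model)
```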
-
@tchaton Can you provide a simple example?
-
Oh, I wasn't aware of the deprecation. Sorry about that.
-
Hey guys :) Regarding the deprecation of the `RPCSequentialPlugin`: DeepSpeed Stage 3 offers the same practice, and we already have it within Lightning. A minimal example of how all this can work can be found here: https://github.com/SeanNaren/minGPT/tree/stage3

Regarding a layer that is too large to fit on a single GPU (in this case the classifier), you can define it inside `configure_sharded_model`, so it is sharded across all GPUs as soon as it is created rather than being materialized on one device first.

We are planning a refresh of the documentation to make it easier to find these tidbits, as things have become a bit complex in the ecosystem. For a small example:

```python
class MyLargeModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # a large backbone like bert
        self.bert = Bert()

    def configure_sharded_model(self):
        # a very very large classifier layer with 6 million classes,
        # now sharded instantly onto all GPUs using DeepSpeed Stage 3
        self.classifier = nn.Linear(768, 6_000_000)

    def forward(self, x):
        emb = self.bert(x)
        score = self.classifier(emb)
        return score


model = MyLargeModel()
trainer = pl.Trainer(
    gpus=4,
    plugins='deepspeed_stage_3'
)
trainer.fit(model)
```

DeepSpeed Stage 3 shards the model across all GPUs, but defining the largest layers in `configure_sharded_model` means they never have to fit on a single device in the first place.
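To make the sketch above trainable end to end, the pieces it leaves out (a loss, an optimizer, some data) could be filled in roughly as below. These are standard Lightning hooks added to the same class; the shapes, learning rate, and dummy data are purely illustrative:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class MyLargeModel(pl.LightningModule):
    # __init__, configure_sharded_model and forward as in the snippet above ...

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        return F.cross_entropy(logits, y)

    def configure_optimizers(self):
        # Lightning hands this optimizer to DeepSpeed, which wraps it internally
        return torch.optim.Adam(self.parameters(), lr=1e-4)

    def train_dataloader(self):
        # dummy token ids and labels, just to keep the sketch self-contained
        x = torch.randint(0, 30_000, (64, 128))
        y = torch.randint(0, 6_000_000, (64,))
        return DataLoader(TensorDataset(x, y), batch_size=8)
```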
-
@SeanNaren which torch and pytorch-lightning version should I use?
-
Dear @dalek-who, You should use PyTorch Lightning 1.3.0rc1 and the latest PyTorch. Best,
-
@tchaton I use pl-1.3.0rc1 and torch-1.8.1. Some problems with this solution:

```
  File "/home/projects/long_tail_link/link_main.py", line 479, in main
    trainer.test(model=pl_module, verbose=False)
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 956, in test
    results = self.fit(model)
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in fit
    self.pre_dispatch()
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 512, in pre_dispatch
    self.accelerator.pre_dispatch(self)
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 105, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 234, in pre_dispatch
    self.init_deepspeed()
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 239, in init_deepspeed
    self._format_config()
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 395, in _format_config
    self._format_batch_size_and_grad_accum_config()
  File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 407, in _format_batch_size_and_grad_accum_config
    batch_size = self.lightning_module.train_dataloader().batch_sampler.batch_size
AttributeError: 'NoneType' object has no attribute 'batch_sampler'
```
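The last frame suggests the plugin only reaches for `train_dataloader()` to infer `train_micro_batch_size_per_gpu`, and here only `trainer.test` is being called so no training dataloader exists. A possible workaround, assuming the 1.3-era `DeepSpeedPlugin` accepts an explicit `config` dict so the batch size never has to be inferred (the batch size and stage below are illustrative):

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

# state the micro batch size explicitly so the plugin does not have to
# read it from a (missing) training dataloader
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 3},
}
trainer = pl.Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(config=ds_config))
trainer.test(model=pl_module, verbose=False)  # pl_module: the module from the traceback
```

Alternatively, defining a `train_dataloader()` on the LightningModule (even a dummy one) would give the plugin a real `batch_sampler` to read from.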
-
🚀 Feature

Motivation

In my case, I have a simplified large model in which `self.classifier` is so large that it must be placed on another GPU. However, if I simply set `gpus=2` in `pl.Trainer`, it copies the whole model onto both GPUs (and both raise CUDA out of memory) rather than splitting it across the two GPUs.

Pitch

An easy way to manually split one model across different devices, like in the tutorial above.
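For comparison, the kind of manual split being asked for is straightforward in plain PyTorch (this follows the standard model-parallel pattern; `Bert()` and the layer sizes are placeholders taken from the discussion), but there is no equally direct switch in `pl.Trainer`:

```python
import torch
import torch.nn as nn

class ManuallySplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # backbone on the first GPU, oversized classifier head on the second
        self.bert = Bert().to("cuda:0")  # Bert() is a placeholder backbone
        self.classifier = nn.Linear(768, 6_000_000).to("cuda:1")

    def forward(self, x):
        emb = self.bert(x.to("cuda:0"))
        # hand the activations across devices explicitly
        return self.classifier(emb.to("cuda:1"))
```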
Alternatives
Additional context