:orphan:
When using distributed training, make sure to adjust your learning rate according to your effective batch size.

Let's say you have a batch size of 7 in your dataloader.
.. testcode::

    class LitModel(LightningModule):
        def train_dataloader(self):
            return DataLoader(..., batch_size=7)
In DDP, DDP_SPAWN, DeepSpeed, DDP_SHARDED, or Horovod, your effective batch size will be 7 * devices * num_nodes.
.. code-block:: python

    # effective batch size = 7 * 8
    Trainer(accelerator="gpu", devices=8, strategy="ddp")
    Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
    Trainer(accelerator="gpu", devices=8, strategy="ddp_sharded")
    Trainer(accelerator="gpu", devices=8, strategy="horovod")

    # effective batch size = 7 * 8 * 10
    Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp")
    Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp_spawn")
    Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp_sharded")
    Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="horovod")
.. note::

    Huge batch sizes are actually really bad for convergence. Check out:
    `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_
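If you do scale the learning rate, the linear scaling rule from that paper is a common starting point. Here is a minimal sketch of the arithmetic only; the base learning rate and reference batch size below are illustrative assumptions, not Lightning defaults:

.. code-block:: python

    # Linear scaling rule (illustrative): grow the learning rate in proportion
    # to the effective batch size relative to a reference batch size.
    base_lr = 0.1          # learning rate tuned for the reference batch size (assumed)
    base_batch_size = 256  # reference batch size (assumed)

    batch_size, devices, num_nodes = 7, 8, 10
    effective_batch_size = batch_size * devices * num_nodes  # 7 * 8 * 10 = 560

    scaled_lr = base_lr * effective_batch_size / base_batch_size
    print(scaled_lr)  # 0.21875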
In DP, which does not support multi-node, the effective batch size will be just 7, regardless of how many devices are being used. The reason is that the full batch gets split as evenly as possible across all devices.
.. code-block:: python

    # effective batch size = 7, each GPU sees a batch size of 1 except the last GPU
    Trainer(accelerator="gpu", devices=8, strategy="dp")

    # effective batch size = 7, the first GPU sees a batch size of 4, the other sees a batch size of 3
    Trainer(accelerator="gpu", devices=2, num_nodes=10, strategy="dp")
To use multiple GPUs in notebooks, use the DP mode.

.. code-block:: python

    Trainer(accelerator="gpu", devices=4, strategy="dp")

If you want to use other strategies, please launch your training from the command line.
.. note::

    Learn how to :ref:`access a cloud machine with multiple GPUs <grid_cloud_session_basic>` in this guide.
Pickle is Python's mechanism for serializing and deserializing data. Most distributed modes require that your code is fully pickle-compliant. If you run into a pickling issue, try the following to narrow it down:
.. code-block:: python

    import pickle

    model = YourModel()
    pickle.dumps(model)
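If ``pickle.dumps(model)`` fails, the traceback usually names the offending object. A rough way to narrow it down further is to try pickling each attribute on its own; this is a hypothetical debugging snippet, not part of the Lightning API:

.. code-block:: python

    import pickle

    model = YourModel()

    # Try each attribute individually to find the one that cannot be pickled.
    for name, value in vars(model).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            print(f"Attribute {name!r} is not picklable: {err}")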
If you use ``ddp``, your code doesn't need to be pickled:

.. code-block:: python

    Trainer(accelerator="gpu", devices=4, strategy="ddp")

If you use ``ddp_spawn``, the pickling requirement remains. This is a limitation of Python.

.. code-block:: python

    Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn")