
Update strategy flag in docs #10000

Merged
merged 14 commits on Oct 20, 2021
56 changes: 28 additions & 28 deletions docs/source/advanced/multi_gpu.rst
@@ -253,11 +253,11 @@ Distributed modes
-----------------
Lightning allows multiple ways of training, each selected with a flag on the ``Trainer`` (see the example after this list):

- Data Parallel (``accelerator='dp'``) (multiple-gpus, 1 machine)
- DistributedDataParallel (``accelerator='ddp'``) (multiple-gpus across many machines (python script based)).
- DistributedDataParallel (``accelerator='ddp_spawn'``) (multiple-gpus across many machines (spawn based)).
- DistributedDataParallel 2 (``accelerator='ddp2'``) (DP in a machine, DDP across machines).
- Horovod (``accelerator='horovod'``) (multi-machine, multi-gpu, configured at runtime)
- Data Parallel (``strategy='dp'``) (multiple GPUs, 1 machine)
- DistributedDataParallel (``strategy='ddp'``) (multiple GPUs across many machines, Python script based)
- DistributedDataParallel (``strategy='ddp_spawn'``) (multiple GPUs across many machines, spawn based)
- DistributedDataParallel 2 (``strategy='ddp2'``) (DP in a machine, DDP across machines)
- Horovod (``strategy='horovod'``) (multi-machine, multi-GPU, configured at runtime)
- TPUs (``tpu_cores=8|x``) (tpu or TPU pod)
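
Whichever mode you use, it is selected through the ``strategy`` flag on the ``Trainer``. A minimal sketch (the ``gpus``/``num_nodes`` values are illustrative):

.. code-block:: python

from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, strategy="dp")  # DataParallel on a single machine
trainer = Trainer(gpus=8, strategy="ddp")  # DistributedDataParallel, script based
trainer = Trainer(gpus=8, strategy="ddp_spawn")  # DistributedDataParallel, spawn based
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp2")  # DP within a node, DDP across nodes
trainer = Trainer(strategy="horovod", gpus=1)  # Horovod, worker count set at launch time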

.. note::
@@ -287,7 +287,7 @@ after which the root node will aggregate the results.
:skipif: torch.cuda.device_count() < 2

# train on 2 GPUs (using DP mode)
trainer = Trainer(gpus=2, accelerator="dp")
trainer = Trainer(gpus=2, strategy="dp")
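
As a rough illustration of the splitting (a sketch of the idea, not Lightning internals): with 2 GPUs, each device sees half of the batch and the per-device results are gathered back on the root GPU.

.. code-block:: python

import torch

batch = torch.randn(32, 3)  # illustrative batch of 32 samples
sub_batches = batch.chunk(2)  # DP-style split across 2 devices
assert [len(s) for s in sub_batches] == [16, 16]
# after the forward pass, the per-device outputs are gathered on the root GPU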

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -308,10 +308,10 @@ Distributed Data Parallel
.. code-block:: python

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(gpus=8, accelerator="ddp")
trainer = Trainer(gpus=8, strategy="ddp")

# train on 32 GPUs (4 nodes)
trainer = Trainer(gpus=8, accelerator="ddp", num_nodes=4)
trainer = Trainer(gpus=8, strategy="ddp", num_nodes=4)

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment
variables:
@@ -356,7 +356,7 @@ In this case, we can use DDP2 which behaves like DP in a machine and DDP across
.. code-block:: python

# train on 32 GPUs (4 nodes)
trainer = Trainer(gpus=8, accelerator="ddp2", num_nodes=4)
trainer = Trainer(gpus=8, strategy="ddp2", num_nodes=4)

Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -374,7 +374,7 @@ project module) you can use the following method:
.. code-block:: python

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(gpus=8, accelerator="ddp_spawn")
trainer = Trainer(gpus=8, strategy="ddp_spawn")

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

@@ -446,10 +446,10 @@ Horovod can be configured in the training script to run with any number of GPUs
.. code-block:: python

# train Horovod on GPU (number of GPUs / machines provided on command-line)
trainer = Trainer(accelerator="horovod", gpus=1)
trainer = Trainer(strategy="horovod", gpus=1)

# train Horovod on CPU (number of processes / machines provided on command-line)
trainer = Trainer(accelerator="horovod")
trainer = Trainer(strategy="horovod")

When starting the training job, the driver application will then be used to specify the total
number of worker processes:
@@ -583,11 +583,11 @@ Below are the possible configurations we support.
+-------+---------+----+-----+--------+------------------------------------------------------------+
| Y | | | | Y | `Trainer(gpus=1, precision=16)` |
+-------+---------+----+-----+--------+------------------------------------------------------------+
| | Y | Y | | | `Trainer(gpus=k, accelerator='dp')` |
| | Y | Y | | | `Trainer(gpus=k, strategy='dp')` |
+-------+---------+----+-----+--------+------------------------------------------------------------+
| | Y | | Y | | `Trainer(gpus=k, accelerator='ddp')` |
| | Y | | Y | | `Trainer(gpus=k, strategy='ddp')` |
+-------+---------+----+-----+--------+------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(gpus=k, accelerator='ddp', precision=16)` |
| | Y | | Y | Y | `Trainer(gpus=k, strategy='ddp', precision=16)` |
+-------+---------+----+-----+--------+------------------------------------------------------------+


@@ -616,29 +616,29 @@ In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod your effective batch size
.. code-block:: python

# effective batch size = 7 * 8
Trainer(gpus=8, accelerator="ddp")
Trainer(gpus=8, accelerator="ddp_spawn")
Trainer(gpus=8, accelerator="ddp_sharded")
Trainer(gpus=8, accelerator="horovod")
Trainer(gpus=8, strategy="ddp")
Trainer(gpus=8, strategy="ddp_spawn")
Trainer(gpus=8, strategy="ddp_sharded")
Trainer(gpus=8, strategy="horovod")

# effective batch size = 7 * 8 * 10
Trainer(gpus=8, num_nodes=10, accelerator="ddp")
Trainer(gpus=8, num_nodes=10, accelerator="ddp_spawn")
Trainer(gpus=8, num_nodes=10, accelerator="ddp_sharded")
Trainer(gpus=8, num_nodes=10, accelerator="horovod")
Trainer(gpus=8, num_nodes=10, strategy="ddp")
Trainer(gpus=8, num_nodes=10, strategy="ddp_spawn")
Trainer(gpus=8, num_nodes=10, strategy="ddp_sharded")
Trainer(gpus=8, num_nodes=10, strategy="horovod")

In DDP2 or DP, your effective batch size will be 7 * num_nodes.
The reason is that the full batch is visible to all GPUs on the node when using DDP2.

.. code-block:: python

# effective batch size = 7
Trainer(gpus=8, accelerator="ddp2")
Trainer(gpus=8, accelerator="dp")
Trainer(gpus=8, strategy="ddp2")
Trainer(gpus=8, strategy="dp")

# effective batch size = 7 * 10
Trainer(gpus=8, num_nodes=10, accelerator="ddp2")
Trainer(gpus=8, accelerator="dp")
Trainer(gpus=8, num_nodes=10, strategy="ddp2")
Trainer(gpus=8, strategy="dp")


.. note:: Huge batch sizes are actually really bad for convergence. Check out:
@@ -652,7 +652,7 @@ Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant

.. code-block:: python

Trainer(gpus=8, accelerator="ddp")
Trainer(gpus=8, strategy="ddp")

To launch a fault-tolerant job, run the following on all nodes.

6 changes: 3 additions & 3 deletions docs/source/clouds/cluster.rst
@@ -55,7 +55,7 @@ To train a model using multiple nodes, do the following:
.. code-block:: python

# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, accelerator="ddp")
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")


Submit a job to the cluster
@@ -127,7 +127,7 @@ To train a model using multiple nodes, do the following:
.. code-block:: python

# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, accelerator="ddp")
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")

3. It's a good idea to structure your training script like this:

@@ -137,7 +137,7 @@ To train a model using multiple nodes, do the following:
def main(hparams):
model = LightningTemplateModel(hparams)

trainer = Trainer(gpus=8, num_nodes=4, accelerator="ddp")
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")

trainer.fit(model)

8 changes: 4 additions & 4 deletions docs/source/common/weights_loading.rst
@@ -143,15 +143,15 @@ You can manually save checkpoints and restore your model from the checkpointed s
trainer.save_checkpoint("example.ckpt")
new_model = MyModel.load_from_checkpoint(checkpoint_path="example.ckpt")

Manual saving with accelerators
===============================
Manual saving with strategies
=============================

Lightning also handles accelerators where multiple processes are running, such as DDP. For example, when using the DDP accelerator our training script is running across multiple devices at the same time.
Lightning also handles strategies where multiple processes are running, such as DDP. For example, when using the DDP strategy, your training script runs across multiple devices at the same time.
Lightning automatically ensures that the model is saved only on the main process, whilst other processes do not interfere with saving checkpoints. This requires no code changes as seen below.

.. code-block:: python

trainer = Trainer(accelerator="ddp")
trainer = Trainer(strategy="ddp")
model = MyLightningModule(hparams)
trainer.fit(model)
# Saves only on the main process
trainer.save_checkpoint("example.ckpt")
1 change: 1 addition & 0 deletions docs/source/extensions/accelerators.rst
@@ -58,3 +58,4 @@ Accelerator API
CPUAccelerator
GPUAccelerator
TPUAccelerator
IPUAccelerator
2 changes: 1 addition & 1 deletion docs/source/extensions/plugins.rst
@@ -77,7 +77,7 @@ can then be passed into the Trainer directly or via a (custom) accelerator:
.. code-block:: python

# custom plugins
trainer = Trainer(plugins=[CustomDDPPlugin(), CustomPrecisionPlugin()])
trainer = Trainer(strategy=CustomDDPPlugin(), plugins=[CustomPrecisionPlugin()])

# fully custom accelerator and plugins
accelerator = MyAccelerator(