
Update strategy flag in docs #10000

Merged
merged 14 commits on Oct 20, 2021
56 changes: 28 additions & 28 deletions docs/source/advanced/multi_gpu.rst
@@ -253,11 +253,11 @@ Distributed modes
-----------------
Lightning allows multiple ways of training, each selected with a flag on the ``Trainer`` (see the example after this list):

- Data Parallel (``accelerator='dp'``) (multiple-gpus, 1 machine)
- DistributedDataParallel (``accelerator='ddp'``) (multiple-gpus across many machines (python script based)).
- DistributedDataParallel (``accelerator='ddp_spawn'``) (multiple-gpus across many machines (spawn based)).
- DistributedDataParallel 2 (``accelerator='ddp2'``) (DP in a machine, DDP across machines).
- Horovod (``accelerator='horovod'``) (multi-machine, multi-gpu, configured at runtime)
- Data Parallel (``strategy='dp'``) (multiple GPUs, 1 machine)
- DistributedDataParallel (``strategy='ddp'``) (multiple GPUs across many machines, Python script based)
- DistributedDataParallel (``strategy='ddp_spawn'``) (multiple GPUs across many machines, spawn based)
- DistributedDataParallel 2 (``strategy='ddp2'``) (DP in a machine, DDP across machines)
- Horovod (``strategy='horovod'``) (multi-machine, multi-GPU, configured at runtime)
- TPUs (``tpu_cores=8|x``) (tpu or TPU pod)
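
Whichever mode you use, it is selected through the ``strategy`` flag on the ``Trainer``. A minimal sketch (the ``gpus``/``num_nodes`` values are illustrative):

.. code-block:: python

from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, strategy="dp")  # DataParallel on a single machine
trainer = Trainer(gpus=8, strategy="ddp")  # DistributedDataParallel, script based
trainer = Trainer(gpus=8, strategy="ddp_spawn")  # DistributedDataParallel, spawn based
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp2")  # DP within a node, DDP across nodes
trainer = Trainer(strategy="horovod", gpus=1)  # Horovod, worker count set at launch time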

.. note::
@@ -287,7 +287,7 @@ after which the root node will aggregate the results.
:skipif: torch.cuda.device_count() < 2

# train on 2 GPUs (using DP mode)
trainer = Trainer(gpus=2, accelerator="dp")
trainer = Trainer(gpus=2, strategy="dp")
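
As a rough illustration of the splitting (a sketch of the idea, not Lightning internals): with 2 GPUs, each device sees half of the batch and the per-device results are gathered back on the root GPU.

.. code-block:: python

import torch

batch = torch.randn(32, 3)  # illustrative batch of 32 samples
sub_batches = batch.chunk(2)  # DP-style split across 2 devices
assert [len(s) for s in sub_batches] == [16, 16]
# after the forward pass, the per-device outputs are gathered on the root GPU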

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -308,10 +308,10 @@ Distributed Data Parallel
.. code-block:: python

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(gpus=8, accelerator="ddp")
trainer = Trainer(gpus=8, strategy="ddp")

# train on 32 GPUs (4 nodes)
trainer = Trainer(gpus=8, accelerator="ddp", num_nodes=4)
trainer = Trainer(gpus=8, strategy="ddp", num_nodes=4)

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment
variables:
@@ -356,7 +356,7 @@ In this case, we can use DDP2 which behaves like DP in a machine and DDP across
.. code-block:: python

# train on 32 GPUs (4 nodes)
trainer = Trainer(gpus=8, accelerator="ddp2", num_nodes=4)
trainer = Trainer(gpus=8, strategy="ddp2", num_nodes=4)

Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -374,7 +374,7 @@ project module) you can use the following method:
.. code-block:: python

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(gpus=8, accelerator="ddp_spawn")
trainer = Trainer(gpus=8, strategy="ddp_spawn")

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

@@ -446,10 +446,10 @@ Horovod can be configured in the training script to run with any number of GPUs
.. code-block:: python

# train Horovod on GPU (number of GPUs / machines provided on command-line)
trainer = Trainer(accelerator="horovod", gpus=1)
trainer = Trainer(strategy="horovod", gpus=1)

# train Horovod on CPU (number of processes / machines provided on command-line)
trainer = Trainer(accelerator="horovod")
trainer = Trainer(strategy="horovod")

When starting the training job, the driver application will then be used to specify the total
number of worker processes:
@@ -583,11 +583,11 @@ Below are the possible configurations we support.
+-------+---------+----+-----+--------+------------------------------------------------------------+
| Y | | | | Y | `Trainer(gpus=1, precision=16)` |
+-------+---------+----+-----+--------+------------------------------------------------------------+
| | Y | Y | | | `Trainer(gpus=k, accelerator='dp')` |
| | Y | Y | | | `Trainer(gpus=k, strategy='dp')` |
+-------+---------+----+-----+--------+------------------------------------------------------------+
| | Y | | Y | | `Trainer(gpus=k, accelerator='ddp')` |
| | Y | | Y | | `Trainer(gpus=k, strategy='ddp')` |
+-------+---------+----+-----+--------+------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(gpus=k, accelerator='ddp', precision=16)` |
| | Y | | Y | Y | `Trainer(gpus=k, strategy='ddp', precision=16)` |
+-------+---------+----+-----+--------+------------------------------------------------------------+


@@ -616,29 +616,29 @@ In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod your effective batch size
.. code-block:: python

# effective batch size = 7 * 8
Trainer(gpus=8, accelerator="ddp")
Trainer(gpus=8, accelerator="ddp_spawn")
Trainer(gpus=8, accelerator="ddp_sharded")
Trainer(gpus=8, accelerator="horovod")
Trainer(gpus=8, strategy="ddp")
Trainer(gpus=8, strategy="ddp_spawn")
Trainer(gpus=8, strategy="ddp_sharded")
Trainer(gpus=8, strategy="horovod")

# effective batch size = 7 * 8 * 10
Trainer(gpus=8, num_nodes=10, accelerator="ddp")
Trainer(gpus=8, num_nodes=10, accelerator="ddp_spawn")
Trainer(gpus=8, num_nodes=10, accelerator="ddp_sharded")
Trainer(gpus=8, num_nodes=10, accelerator="horovod")
Trainer(gpus=8, num_nodes=10, strategy="ddp")
Trainer(gpus=8, num_nodes=10, strategy="ddp_spawn")
Trainer(gpus=8, num_nodes=10, strategy="ddp_sharded")
Trainer(gpus=8, num_nodes=10, strategy="horovod")

In DDP2 or DP, your effective batch size will be 7 * num_nodes.
The reason is that the full batch is visible to all GPUs on the node when using DDP2.

.. code-block:: python

# effective batch size = 7
Trainer(gpus=8, accelerator="ddp2")
Trainer(gpus=8, accelerator="dp")
Trainer(gpus=8, strategy="ddp2")
Trainer(gpus=8, strategy="dp")

# effective batch size = 7 * 10
Trainer(gpus=8, num_nodes=10, accelerator="ddp2")
Trainer(gpus=8, accelerator="dp")
Trainer(gpus=8, num_nodes=10, strategy="ddp2")
Trainer(gpus=8, strategy="dp")


.. note:: Huge batch sizes are actually really bad for convergence. Check out:
@@ -652,7 +652,7 @@ Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant

.. code-block:: python

Trainer(gpus=8, accelerator="ddp")
Trainer(gpus=8, strategy="ddp")

To launch a fault-tolerant job, run the following on all nodes.

6 changes: 3 additions & 3 deletions docs/source/clouds/cluster.rst
@@ -55,7 +55,7 @@ To train a model using multiple nodes, do the following:
.. code-block:: python

# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, accelerator="ddp")
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")


Submit a job to the cluster
@@ -127,7 +127,7 @@ To train a model using multiple nodes, do the following:
.. code-block:: python

# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, accelerator="ddp")
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")

3. It's a good idea to structure your training script like this:

@@ -137,7 +137,7 @@ To train a model using multiple nodes, do the following:
def main(hparams):
model = LightningTemplateModel(hparams)

trainer = Trainer(gpus=8, num_nodes=4, accelerator="ddp")
trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")

trainer.fit(model)

8 changes: 4 additions & 4 deletions docs/source/common/weights_loading.rst
@@ -143,15 +143,15 @@ You can manually save checkpoints and restore your model from the checkpointed s
trainer.save_checkpoint("example.ckpt")
new_model = MyModel.load_from_checkpoint(checkpoint_path="example.ckpt")

Manual saving with accelerators
===============================
Manual saving with strategies
=============================

Lightning also handles accelerators where multiple processes are running, such as DDP. For example, when using the DDP accelerator our training script is running across multiple devices at the same time.
Lightning also handles strategies where multiple processes are running, such as DDP. For example, when using the DDP strategy, your training script runs across multiple devices at the same time.
Lightning automatically ensures that the model is saved only on the main process, whilst other processes do not interfere with saving checkpoints. This requires no code changes as seen below.

.. code-block:: python

trainer = Trainer(accelerator="ddp")
trainer = Trainer(strategy="ddp")
model = MyLightningModule(hparams)
trainer.fit(model)
# Saves only on the main process
trainer.save_checkpoint("example.ckpt")
1 change: 1 addition & 0 deletions docs/source/extensions/accelerators.rst
@@ -58,3 +58,4 @@ Accelerator API
CPUAccelerator
GPUAccelerator
TPUAccelerator
IPUAccelerator
2 changes: 1 addition & 1 deletion docs/source/extensions/plugins.rst
@@ -77,7 +77,7 @@ can then be passed into the Trainer directly or via a (custom) accelerator:
.. code-block:: python

# custom plugins
trainer = Trainer(plugins=[CustomDDPPlugin(), CustomPrecisionPlugin()])
trainer = Trainer(strategy=CustomDDPPlugin(), plugins=[CustomPrecisionPlugin()])

# fully custom accelerator and plugins
accelerator = MyAccelerator(