(10/10) Support 2D Parallelism - Port Fabric docs to PL (#19899)
awaelchli committed May 23, 2024
1 parent 7874cd0 commit c09356d
Showing 14 changed files with 795 additions and 12 deletions.
3 changes: 1 addition & 2 deletions docs/source-fabric/advanced/model_parallel/tp_fsdp.rst
@@ -21,7 +21,6 @@ We will start off with the same feed forward example model as in the :doc:`Tenso

.. code-block:: python
import torch
import torch.nn as nn
import torch.nn.functional as F
@@ -164,7 +163,7 @@ Finally, the tensor parallelism will apply to each group, splitting the sharded
model = fabric.setup(model)
# Define the optimizer
- optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, foreach=True)
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
optimizer = fabric.setup_optimizers(optimizer)
# Define dataset/dataloader
5 changes: 5 additions & 0 deletions docs/source-fabric/glossary/index.rst
@@ -19,6 +19,11 @@ Glossary
<div class="display-card-container">
<div class="row">

.. displayitem::
:header: 2D Parallelism
:button_link: ../advanced/model_parallel/tp_fsdp.html
:col_css: col-md-4

.. displayitem::
:header: Accelerator
:button_link: ../fundamentals/accelerators.html
10 changes: 10 additions & 0 deletions docs/source-pytorch/_static/main.css
@@ -1,3 +1,13 @@
col {
width: 50% !important;
}

ul.no-bullets {
list-style-type: none; /* Remove default bullets */
padding-left: 0; /* Remove default padding */
}

ul.no-bullets li {
padding-left: 0.5em;
text-indent: -2em;
}
2 changes: 1 addition & 1 deletion docs/source-pytorch/accelerators/gpu_advanced.rst
@@ -22,7 +22,7 @@ For experts pushing the state-of-the-art in model development, Lightning offers
:header: Train models with billions of parameters
:description:
:col_css: col-md-4
- :button_link: ../advanced/model_parallel.html
+ :button_link: ../advanced/model_parallel/index.html
:height: 150
:tag: advanced

2 changes: 1 addition & 1 deletion docs/source-pytorch/advanced/model_parallel/fsdp.rst
@@ -20,7 +20,7 @@ The memory consumption for training is generally made up of
|
When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
- One of the methods that can alleviate this limitation is called **model-parallel** training, and known as **FSDP** in PyTorch, and in this guide, you will learn how to effectively scale large models with it.
+ One of the methods that can alleviate this limitation is called **Fully Sharded Data Parallel (FSDP)**, and in this guide, you will learn how to effectively scale large models with it.


----
162 changes: 162 additions & 0 deletions docs/source-pytorch/advanced/model_parallel/index.rst
@@ -0,0 +1,162 @@
###########################################
Training models with billions of parameters
###########################################

Today, large models with billions of parameters are trained with many GPUs across several machines in parallel.
Even a single H100 GPU with 80 GB of VRAM (one of the largest available today) is not enough to train a 30B-parameter model, even with a batch size of 1 and 16-bit precision.
The memory consumption for training is generally made up of

1. the model parameters,
2. the layer activations (forward),
3. the gradients (backward),
4. the optimizer states (e.g., Adam has two additional exponential averages per parameter) and
5. model outputs and loss.

|
When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
To alleviate this limitation, we need to introduce **Model Parallelism**.
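
A rough back-of-the-envelope estimate makes this concrete. The numbers below are a sketch (the exact footprint depends on the optimizer, the precision settings, and the activation memory), but the weights and gradients of a 30B-parameter model alone already exceed 80 GB:

.. code-block:: python

    # Rough memory estimate for a 30B-parameter model, ignoring activations.
    params = 30e9

    weights_gb = params * 2 / 1e9   # 16-bit weights:           ~60 GB
    grads_gb = params * 2 / 1e9     # 16-bit gradients:         ~60 GB
    adam_gb = params * 2 * 4 / 1e9  # two fp32 Adam moments:   ~240 GB

    print(weights_gb + grads_gb + adam_gb)  # ~360 GB, far beyond a single 80 GB GPU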


----


**************************
What is Model Parallelism?
**************************

There are different types of model parallelism, each with its own trade-offs.

**Fully Sharded Data Parallelism (FSDP)** shards both model parameters and optimizer states across multiple GPUs, significantly reducing memory usage per GPU.
This method, while highly memory-efficient, involves frequent synchronization between GPUs, introducing communication overhead and complexity in implementation.
FSDP is advantageous when memory constraints are the primary issue, provided there are high-bandwidth interconnects to minimize latency.
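
In PyTorch Lightning, FSDP can be enabled without touching the model code by selecting the strategy on the Trainer. The snippet below is a minimal sketch; ``LitModel`` and ``my_project`` are placeholders for your own LightningModule, and the FSDP guide linked below covers the available configuration options:

.. code-block:: python

    import lightning as L

    from my_project import LitModel  # placeholder for your own LightningModule

    trainer = L.Trainer(
        accelerator="cuda",
        devices=8,
        strategy="fsdp",         # shard weights, gradients, and optimizer states
        precision="bf16-mixed",  # 16-bit compute further reduces memory
    )
    trainer.fit(LitModel())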

**Tensor Parallelism (TP)** splits individual tensors across GPUs, enabling fine-grained distribution of computation and memory.
It scales well to a large number of GPUs but requires synchronization of tensor slices after each operation, which adds communication overhead.
TP is most effective with models that have many linear layers (LLMs), offering a balance between memory distribution and computational efficiency.
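
The sketch below shows the idea using PyTorch's built-in tensor-parallel primitives; the toy ``FeedForward`` module and its layer names are purely illustrative, and the Tensor Parallel guide linked below walks through a complete Lightning example:

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )


    class FeedForward(nn.Module):
        def __init__(self, dim=8192, hidden=32768):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden)
            self.w2 = nn.Linear(hidden, dim)

        def forward(self, x):
            return self.w2(self.w1(x).relu())


    # One tensor-parallel group spanning 8 GPUs (launch with torchrun --nproc_per_node=8).
    tp_mesh = init_device_mesh("cuda", (8,))

    # Split w1 column-wise and w2 row-wise so each GPU holds a slice of both layers
    # and only the output of w2 needs to be reduced across GPUs.
    model = parallelize_module(
        FeedForward(),
        tp_mesh,
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )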

**Pipeline Parallelism (PP)** divides model layers into segments, each processed by different GPUs, reducing memory load per GPU and minimizing inter-GPU communication to pipeline stage boundaries.
While this reduces communication overhead, it can introduce pipeline bubbles where some GPUs idle, leading to potential inefficiencies.
PP is ideal for deep models with sequential architectures (LLMs), though it requires careful management to minimize idle times.

Choosing a model parallelism style involves considering model architecture, hardware interconnects, and training efficiency.
In practice, hybrid approaches combining FSDP, TP, and PP are often used to leverage the strengths of each method while mitigating their weaknesses.
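
As a sketch of how such a hybrid (2D) setup is expressed, PyTorch describes the GPU arrangement with a device mesh whose dimensions map to the two parallelism styles. The shape below is illustrative, and the 2D Parallel guide linked below shows the full training setup:

.. code-block:: python

    from torch.distributed.device_mesh import init_device_mesh

    # 32 GPUs arranged as 4 data-parallel groups of 8 tensor-parallel ranks each.
    mesh_2d = init_device_mesh(
        "cuda",
        (4, 8),
        mesh_dim_names=("data_parallel", "tensor_parallel"),
    )

    dp_mesh = mesh_2d["data_parallel"]    # FSDP shards the model across machines
    tp_mesh = mesh_2d["tensor_parallel"]  # TP splits layers across GPUs within a machine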


----


***********
Get started
***********

.. raw:: html

<div class="display-card-container">
<div class="row">

.. displayitem::
:header: Fully-Sharded Data Parallel (FSDP)
:description: Get started training large multi-billion parameter models with minimal code changes
:col_css: col-md-4
:button_link: fsdp.html
:height: 180
:tag: advanced

.. displayitem::
:header: Tensor Parallel (TP)
:description: Learn the principles behind tensor parallelism and how to apply it to your model
:col_css: col-md-4
:button_link: tp.html
:height: 180
:tag: advanced

.. displayitem::
:header: 2D Parallel (FSDP + TP)
:description: Combine Tensor Parallelism with FSDP (2D Parallel) to train efficiently on 100s of GPUs
:button_link: tp_fsdp.html
:col_css: col-md-4
:height: 180
:tag: advanced

.. displayitem::
:header: Pipeline Parallelism
:description: Coming soon
:col_css: col-md-4
:height: 180
:tag: advanced

.. raw:: html

</div>
</div>


----


*********************
Parallelisms compared
*********************


**Distributed Data Parallel (DDP)**

.. raw:: html

<ul class="no-bullets">
<li>✅ &nbsp; No model code changes required</li>
<li>✅ &nbsp; Training with very large batch sizes (batch size scales with number of GPUs)</li>
<li>❗ &nbsp; Model (weights, optimizer state, activations / gradients) must fit into a GPU</li>
</ul>

|
**Fully-Sharded Data Parallel (FSDP)**

.. raw:: html

<ul class="no-bullets">
<li>✅ &nbsp; No model code changes required </li>
<li>✅ &nbsp; Training with very large batch sizes (batch size scales with number of GPUs) </li>
<li>✅ &nbsp; Model (weights, optimizer state, gradients) gets distributed across all GPUs </li>
<li>❗ &nbsp; A single FSDP layer, when gathered during forward/backward, must fit into GPU memory </li>
<li>❗ &nbsp; Requires some knowledge about the model architecture to set configuration options correctly </li>
<li>❗ &nbsp; Requires very fast networking (multi-node); data transfers between GPUs often become a bottleneck </li>
</ul>

|
**Tensor Parallel (TP)**

.. raw:: html

<ul class="no-bullets">
<li>❗ &nbsp; Model code changes required </li>
<li>🤔 &nbsp; Fixed global batch size (does not scale with number of GPUs) </li>
<li>✅ &nbsp; Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
<li>✅ &nbsp; Parallelizes the computation of layers that are too large to fit onto a single GPU </li>
<li>❗ &nbsp; Requires lots of knowledge about model architecture to set configuration options correctly </li>
<li>🤔 &nbsp; Fewer GPU data transfers required, but data transfers don't overlap with computation as they do in FSDP </li>
</ul>

|
**2D Parallel (FSDP + TP)**

.. raw:: html

<ul class="no-bullets">
<li>❗ &nbsp; Model code changes required</li>
<li>✅ &nbsp; Training with very large batch sizes (batch size scales across the data-parallel dimension)</li>
<li>✅ &nbsp; Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
<li>✅ &nbsp; Parallelizes the computation of layers that are too large to fit onto a single GPU</li>
<li>❗ &nbsp; Requires lots of knowledge about model architecture to set configuration options correctly</li>
<li>✅ &nbsp; Tensor-parallel within machines and FSDP across machines reduces data transfer bottlenecks</li>
</ul>

|
PyTorch Lightning supports all of the parallelisms above natively through PyTorch; only pipeline parallelism (PP) is not yet supported.
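
As a rough orientation, the parallelism is chosen through the Trainer's ``strategy`` argument. The snippet below is a sketch: the device counts and parallel sizes are illustrative, and the guides linked above document the exact options, including the ``ModelParallelStrategy`` used for TP and 2D parallelism:

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import ModelParallelStrategy

    # DDP and FSDP are selected with shortcut strings:
    trainer = L.Trainer(accelerator="cuda", devices=8, strategy="ddp")
    trainer = L.Trainer(accelerator="cuda", devices=8, strategy="fsdp")

    # TP and 2D parallel (FSDP + TP) go through ModelParallelStrategy; the product
    # of the parallel sizes must match the total number of devices.
    trainer = L.Trainer(
        accelerator="cuda",
        devices=8,
        strategy=ModelParallelStrategy(data_parallel_size=2, tensor_parallel_size=4),
    )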

|