(10/10) Support 2D Parallelism - Port Fabric docs to PL (#19899)
awaelchli committed May 23, 2024
1 parent 7874cd0 commit c09356d
Showing 14 changed files with 795 additions and 12 deletions.
3 changes: 1 addition & 2 deletions docs/source-fabric/advanced/model_parallel/tp_fsdp.rst
@@ -21,7 +21,6 @@ We will start off with the same feed forward example model as in the :doc:`Tenso

.. code-block:: python
import torch
import torch.nn as nn
import torch.nn.functional as F
@@ -164,7 +163,7 @@ Finally, the tensor parallelism will apply to each group, splitting the sharded
model = fabric.setup(model)
# Define the optimizer
- optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, foreach=True)
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
optimizer = fabric.setup_optimizers(optimizer)
# Define dataset/dataloader
5 changes: 5 additions & 0 deletions docs/source-fabric/glossary/index.rst
@@ -19,6 +19,11 @@ Glossary
<div class="display-card-container">
<div class="row">

.. displayitem::
:header: 2D Parallelism
:button_link: ../advanced/model_parallel/tp_fsdp.html
:col_css: col-md-4

.. displayitem::
:header: Accelerator
:button_link: ../fundamentals/accelerators.html
10 changes: 10 additions & 0 deletions docs/source-pytorch/_static/main.css
@@ -1,3 +1,13 @@
col {
width: 50% !important;
}

ul.no-bullets {
list-style-type: none; /* Remove default bullets */
padding-left: 0; /* Remove default padding */
}

ul.no-bullets li {
padding-left: 0.5em;
text-indent: -2em;
}
2 changes: 1 addition & 1 deletion docs/source-pytorch/accelerators/gpu_advanced.rst
@@ -22,7 +22,7 @@ For experts pushing the state-of-the-art in model development, Lightning offers
:header: Train models with billions of parameters
:description:
:col_css: col-md-4
- :button_link: ../advanced/model_parallel.html
+ :button_link: ../advanced/model_parallel/index.html
:height: 150
:tag: advanced

2 changes: 1 addition & 1 deletion docs/source-pytorch/advanced/model_parallel/fsdp.rst
@@ -20,7 +20,7 @@ The memory consumption for training is generally made up of
|
When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
- One of the methods that can alleviate this limitation is called **model-parallel** training, and known as **FSDP** in PyTorch, and in this guide, you will learn how to effectively scale large models with it.
+ One of the methods that can alleviate this limitation is called **Fully Sharded Data Parallel (FSDP)**, and in this guide, you will learn how to effectively scale large models with it.


----
162 changes: 162 additions & 0 deletions docs/source-pytorch/advanced/model_parallel/index.rst
@@ -0,0 +1,162 @@
###########################################
Training models with billions of parameters
###########################################

Today, large models with billions of parameters are trained with many GPUs across several machines in parallel.
Even a single H100 GPU with 80 GB of VRAM (one of the largest available today) is not enough to train a 30B-parameter model, even with a batch size of 1 and 16-bit precision.
The memory consumption for training is generally made up of

1. the model parameters,
2. the layer activations (forward),
3. the gradients (backward),
4. the optimizer states (e.g., Adam has two additional exponential averages per parameter) and
5. model outputs and loss.

|
When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
To alleviate this limitation, we need to introduce **Model Parallelism**.
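
A rough back-of-the-envelope estimate makes this concrete. The numbers below are a sketch (the exact footprint depends on the optimizer, the precision settings, and the activation memory), but the weights and gradients of a 30B-parameter model alone already exceed 80 GB:

.. code-block:: python

    # Rough memory estimate for a 30B-parameter model, ignoring activations.
    params = 30e9

    weights_gb = params * 2 / 1e9   # 16-bit weights:           ~60 GB
    grads_gb = params * 2 / 1e9     # 16-bit gradients:         ~60 GB
    adam_gb = params * 2 * 4 / 1e9  # two fp32 Adam moments:   ~240 GB

    print(weights_gb + grads_gb + adam_gb)  # ~360 GB, far beyond a single 80 GB GPU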


----


**************************
What is Model Parallelism?
**************************

There are different types of model parallelism, each with its own trade-offs.

**Fully Sharded Data Parallelism (FSDP)** shards both model parameters and optimizer states across multiple GPUs, significantly reducing memory usage per GPU.
This method, while highly memory-efficient, involves frequent synchronization between GPUs, introducing communication overhead and complexity in implementation.
FSDP is advantageous when memory constraints are the primary issue, provided there are high-bandwidth interconnects to minimize latency.
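
In PyTorch Lightning, FSDP can be enabled without touching the model code by selecting the strategy on the Trainer. The snippet below is a minimal sketch; ``LitModel`` and ``my_project`` are placeholders for your own LightningModule, and the FSDP guide linked below covers the available configuration options:

.. code-block:: python

    import lightning as L

    from my_project import LitModel  # placeholder for your own LightningModule

    trainer = L.Trainer(
        accelerator="cuda",
        devices=8,
        strategy="fsdp",         # shard weights, gradients, and optimizer states
        precision="bf16-mixed",  # 16-bit compute further reduces memory
    )
    trainer.fit(LitModel())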

**Tensor Parallelism (TP)** splits individual tensors across GPUs, enabling fine-grained distribution of computation and memory.
It scales well to a large number of GPUs but requires synchronization of tensor slices after each operation, which adds communication overhead.
TP is most effective with models that have many linear layers (LLMs), offering a balance between memory distribution and computational efficiency.
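
The sketch below shows the idea using PyTorch's built-in tensor-parallel primitives; the toy ``FeedForward`` module and its layer names are purely illustrative, and the Tensor Parallel guide linked below walks through a complete Lightning example:

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )


    class FeedForward(nn.Module):
        def __init__(self, dim=8192, hidden=32768):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden)
            self.w2 = nn.Linear(hidden, dim)

        def forward(self, x):
            return self.w2(self.w1(x).relu())


    # One tensor-parallel group spanning 8 GPUs (launch with torchrun --nproc_per_node=8).
    tp_mesh = init_device_mesh("cuda", (8,))

    # Split w1 column-wise and w2 row-wise so each GPU holds a slice of both layers
    # and only the output of w2 needs to be reduced across GPUs.
    model = parallelize_module(
        FeedForward(),
        tp_mesh,
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )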

**Pipeline Parallelism (PP)** divides model layers into segments, each processed by different GPUs, reducing memory load per GPU and minimizing inter-GPU communication to pipeline stage boundaries.
While this reduces communication overhead, it can introduce pipeline bubbles where some GPUs idle, leading to potential inefficiencies.
PP is ideal for deep models with sequential architectures (LLMs), though it requires careful management to minimize idle times.

Choosing a model parallelism style involves considering model architecture, hardware interconnects, and training efficiency.
In practice, hybrid approaches combining FSDP, TP, and PP are often used to leverage the strengths of each method while mitigating their weaknesses.
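
As a sketch of how such a hybrid (2D) setup is expressed, PyTorch describes the GPU arrangement with a device mesh whose dimensions map to the two parallelism styles. The shape below is illustrative, and the 2D Parallel guide linked below shows the full training setup:

.. code-block:: python

    from torch.distributed.device_mesh import init_device_mesh

    # 32 GPUs arranged as 4 data-parallel groups of 8 tensor-parallel ranks each.
    mesh_2d = init_device_mesh(
        "cuda",
        (4, 8),
        mesh_dim_names=("data_parallel", "tensor_parallel"),
    )

    dp_mesh = mesh_2d["data_parallel"]    # FSDP shards the model across machines
    tp_mesh = mesh_2d["tensor_parallel"]  # TP splits layers across GPUs within a machine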


----


***********
Get started
***********

.. raw:: html

<div class="display-card-container">
<div class="row">

.. displayitem::
:header: Fully-Sharded Data Parallel (FSDP)
:description: Get started training large multi-billion parameter models with minimal code changes
:col_css: col-md-4
:button_link: fsdp.html
:height: 180
:tag: advanced

.. displayitem::
:header: Tensor Parallel (TP)
:description: Learn the principles behind tensor parallelism and how to apply it to your model
:col_css: col-md-4
:button_link: tp.html
:height: 180
:tag: advanced

.. displayitem::
:header: 2D Parallel (FSDP + TP)
:description: Combine Tensor Parallelism with FSDP (2D Parallel) to train efficiently on 100s of GPUs
:button_link: tp_fsdp.html
:col_css: col-md-4
:height: 180
:tag: advanced

.. displayitem::
:header: Pipeline Parallelism
:description: Coming soon
:col_css: col-md-4
:height: 180
:tag: advanced

.. raw:: html

</div>
</div>


----


*********************
Parallelisms compared
*********************


**Distributed Data Parallel (DDP)**

.. raw:: html

<ul class="no-bullets">
<li>✅ &nbsp; No model code changes required</li>
<li>✅ &nbsp; Training with very large batch sizes (batch size scales with number of GPUs)</li>
<li>❗ &nbsp; Model (weights, optimizer state, activations / gradients) must fit into a GPU</li>
</ul>

|
**Fully-Sharded Data Parallel (FSDP)**

.. raw:: html

<ul class="no-bullets">
<li>✅ &nbsp; No model code changes required </li>
<li>✅ &nbsp; Training with very large batch sizes (batch size scales with number of GPUs) </li>
<li>✅ &nbsp; Model (weights, optimizer state, gradients) gets distributed across all GPUs </li>
<li>❗ &nbsp; A single FSDP layer, when gathered during forward/backward, must fit into GPU memory </li>
<li>❗ &nbsp; Requires some knowledge about the model architecture to set configuration options correctly </li>
<li>❗ &nbsp; Requires very fast networking (multi-node); data transfers between GPUs often become a bottleneck </li>
</ul>

|
**Tensor Parallel (TP)**

.. raw:: html

<ul class="no-bullets">
<li>❗ &nbsp; Model code changes required </li>
<li>🤔 &nbsp; Fixed global batch size (does not scale with number of GPUs) </li>
<li>✅ &nbsp; Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
<li>✅ &nbsp; Parallelizes the computation of layers that are too large to fit onto a single GPU </li>
<li>❗ &nbsp; Requires lots of knowledge about model architecture to set configuration options correctly </li>
<li>🤔 &nbsp; Fewer GPU data transfers required, but data transfers don't overlap with computation as they do in FSDP </li>
</ul>

|
**2D Parallel (FSDP + TP)**

.. raw:: html

<ul class="no-bullets">
<li>❗ &nbsp; Model code changes required</li>
<li>✅ &nbsp; Training with very large batch sizes (batch size scales across the data-parallel dimension)</li>
<li>✅ &nbsp; Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
<li>✅ &nbsp; Parallelizes the computation of layers that are too large to fit onto a single GPU</li>
<li>❗ &nbsp; Requires lots of knowledge about model architecture to set configuration options correctly</li>
<li>✅ &nbsp; Tensor-parallel within machines and FSDP across machines reduces data transfer bottlenecks</li>
</ul>

|
PyTorch Lightning supports all of the parallelisms above natively through PyTorch; only pipeline parallelism (PP) is not yet supported.
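
As a rough orientation, the parallelism is chosen through the Trainer's ``strategy`` argument. The snippet below is a sketch: the device counts and parallel sizes are illustrative, and the guides linked above document the exact options, including the ``ModelParallelStrategy`` used for TP and 2D parallelism:

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import ModelParallelStrategy

    # DDP and FSDP are selected with shortcut strings:
    trainer = L.Trainer(accelerator="cuda", devices=8, strategy="ddp")
    trainer = L.Trainer(accelerator="cuda", devices=8, strategy="fsdp")

    # TP and 2D parallel (FSDP + TP) go through ModelParallelStrategy; the product
    # of the parallel sizes must match the total number of devices.
    trainer = L.Trainer(
        accelerator="cuda",
        devices=8,
        strategy=ModelParallelStrategy(data_parallel_size=2, tensor_parallel_size=4),
    )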

|