
(1/n) Support 2D Parallelism #19846

Merged: 16 commits into master on May 7, 2024

Conversation

awaelchli (Member) commented May 4, 2024

What does this PR do?

Adds a new ModelParallelStrategy that enables user-defined model parallelism.

def parallelize_my_model(model, device_mesh):
    # User-defined function that applies the desired parallelizations specific to the model
    # (TP, FSDP2, activation checkpointing, ...)
    ...


strategy = ModelParallelStrategy(
    parallelize_fn=parallelize_my_model,
    # Define the size of the 2D parallelism
    # Set to "auto" to apply TP intra-node and DP inter-node
    data_parallel_size=2,
    tensor_parallel_size=2,
)

fabric = L.Fabric(..., strategy=strategy)
fabric.launch()

# 1. Initializes the device mesh
# 2. Runs `parallelize_fn` here
# 3. Calls `.to_empty()` if model is on meta-device
# 4. Calls `.reset_parameters()` on submodules
model = fabric.setup(model)

The emphasis here must be on user-defined. The strategy does not do anything to the model except set up the device mesh for the user to consume. It is the user's responsibility to correctly parse the device mesh and set up the parallelization in their model. This means applying TP, FSDP, activation checkpointing, etc.

See examples/fabric/tensor_parallel for a full example.

Future PRs will add documentation pages.

What's not supported yet?

  • Not all checkpoint features are implemented. Only supports saving and loading distributed checkpoints right now. Future PRs will add the missing code.
  • Mixed precision with a grad scaler (16-mixed) is not supported. Only bf16-mixed and bf16-true are supported. Future PRs will add the missing code.
  • Trainer: Future PRs will implement the same strategy for the PL Trainer, where the parallelize_fn will be replaced by a hook in the LightningModule.
  • HSDP (requires additional dimension of the device mesh). Could be exposed by an additional optional argument or making data_parallel_size accept a tuple.
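The tuple idea in the last bullet could be sketched like this (a hypothetical illustration of how `data_parallel_size` might map to a mesh shape, not an API in this PR):

```python
def mesh_shape(data_parallel_size, tensor_parallel_size):
    """Map the strategy arguments to a device-mesh shape (hypothetical)."""
    if isinstance(data_parallel_size, tuple):
        # HSDP: (replicate, shard) adds a third mesh dimension
        replicate, shard = data_parallel_size
        return (replicate, shard, tensor_parallel_size)
    return (data_parallel_size, tensor_parallel_size)


# Plain 2D parallelism: 2 x 2 = 4 devices
assert mesh_shape(2, 2) == (2, 2)
# HSDP: 2-way replication x 2-way sharding x 2-way TP = 8 devices
assert mesh_shape((2, 2), 2) == (2, 2, 2)
```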

cc @Borda @carmocca @justusschock @awaelchli

@github-actions github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels May 4, 2024
@awaelchli awaelchli added feature Is an improvement or enhancement strategy and removed fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels May 4, 2024
@awaelchli awaelchli added this to the 2.3 milestone May 4, 2024
@github-actions github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels May 4, 2024
@awaelchli awaelchli marked this pull request as ready for review May 4, 2024 16:22
github-actions bot (Contributor) commented May 4, 2024

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 2.0, oldest) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.10, 2.1) success
pl-cpu (macOS-11, lightning, 3.10, 2.2) success
pl-cpu (macOS-14, lightning, 3.10, 2.3) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 2.0, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.2) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.3) success
pl-cpu (windows-2022, lightning, 3.8, 2.0, oldest) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.10, 2.1) success
pl-cpu (windows-2022, lightning, 3.10, 2.2) success
pl-cpu (windows-2022, lightning, 3.10, 2.3) success
pl-cpu (macOS-11, pytorch, 3.8, 2.0) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 2.0) success
pl-cpu (windows-2022, pytorch, 3.8, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.1) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.1) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.1) success

These checks are required after the changes to src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py, src/lightning/pytorch/strategies/fsdp.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) (testing Lightning | latest) success
pytorch-lightning (GPUs) (testing PyTorch | latest) success

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py, src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py, src/lightning/pytorch/strategies/fsdp.py.

🟢 fabric: Docs
Check ID Status
docs-make (fabric, doctest) success
docs-make (fabric, html) success

These checks are required after the changes to src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py, docs/source-pytorch/conf.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 2.0, oldest) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.11, 2.1) success
fabric-cpu (macOS-11, lightning, 3.11, 2.2) success
fabric-cpu (macOS-14, lightning, 3.10, 2.3) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 2.0, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.1) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.2) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.3) success
fabric-cpu (windows-2022, lightning, 3.8, 2.0, oldest) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.11, 2.1) success
fabric-cpu (windows-2022, lightning, 3.11, 2.2) success
fabric-cpu (windows-2022, lightning, 3.11, 2.3) success
fabric-cpu (macOS-11, fabric, 3.8, 2.0) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 2.0) success
fabric-cpu (windows-2022, fabric, 3.8, 2.0) success
fabric-cpu (macOS-12, fabric, 3.11, 2.0) success
fabric-cpu (macOS-12, fabric, 3.11, 2.1) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.0) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.1) success
fabric-cpu (windows-2022, fabric, 3.11, 2.0) success
fabric-cpu (windows-2022, fabric, 3.11, 2.1) success

These checks are required after the changes to src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py, tests/tests_fabric/strategies/test_fsdp.py, tests/tests_fabric/strategies/test_fsdp_integration.py, tests/tests_fabric/strategies/test_model_parallel.py, tests/tests_fabric/strategies/test_model_parallel_integration.py, tests/tests_fabric/utilities/test_init.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) (testing Fabric | latest) success
lightning-fabric (GPUs) (testing Lightning | latest) success

These checks are required after the changes to examples/fabric/tensor_parallel/data.py, examples/fabric/tensor_parallel/model.py, examples/fabric/tensor_parallel/parallelism.py, examples/fabric/tensor_parallel/train.py, src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py, tests/tests_fabric/strategies/test_fsdp.py, tests/tests_fabric/strategies/test_fsdp_integration.py, tests/tests_fabric/strategies/test_model_parallel.py, tests/tests_fabric/strategies/test_model_parallel_integration.py, tests/tests_fabric/utilities/test_init.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py, src/lightning/pytorch/strategies/fsdp.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/fabric/fabric.py, src/lightning/fabric/strategies/__init__.py, src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/strategies/model_parallel.py, src/lightning/fabric/utilities/init.py, src/lightning/pytorch/strategies/fsdp.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and is updated every 180 seconds for 60 minutes. If you have any other questions, contact carmocca for help.

codecov bot commented May 4, 2024

Codecov Report

Attention: Patch coverage is 90.69767%, with 20 lines in your changes missing coverage. Please review.

Project coverage is 59%. Comparing base (0f12271) to head (0d9afe8).

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #19846     +/-   ##
=========================================
- Coverage      84%      59%    -25%     
=========================================
  Files         424      420      -4     
  Lines       34702    34802    +100     
=========================================
- Hits        29097    20437   -8660     
- Misses       5605    14365   +8760     

@github-actions github-actions bot added the docs Documentation related label May 4, 2024
TModel = TypeVar("TModel", bound=Module)


class ModelParallelStrategy(ParallelStrategy):
Contributor

I find this name confusing. Data parallelism is not model parallelism, yet this class supports both. Can we think of something else?

What about ManualParallelStrategy? It was the name proposed a million years ago: #11922

Member Author

There is some internal discussion about another name; I will rename the strategy in a follow-up if we converge on a final decision.

src/lightning/fabric/strategies/model_parallel.py: two outdated review threads (resolved)
fabric.launch()
assert fabric.strategy.device_mesh.mesh_dim_names == ("data_parallel", "tensor_parallel")
assert fabric.strategy.device_mesh.size(0) == 1
assert fabric.strategy.device_mesh.size(1) == 4
Contributor

CI runs with 2 devices. This could introduce a deadlock if PyTorch adds a collective call in the device mesh. Could we change it to 2 just in case?

Member Author

For 2D I need at least 2*2=4 devices.
What do you mean this could add a deadlock? Where? I'm calling it on all ranks.

Contributor

I didn't notice that you had min_cuda_gpus=4.

Be aware then that this will not run on our CI (gets skipped) because all our agents run with only 2 visible CUDA devices.

https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=201341&view=logs&j=3f274fac-2e11-54ca-487e-194c91f3ae9f&t=8e4ceb7c-ceed-5ee0-b6fc-c9023e41cb74&l=1306

Member Author (awaelchli, May 7, 2024)

Yes, I'm aware. I was torn between doing this vs. simulating it on CPU. DTensor works on CPU, but I'd still need processes spawned for proper e2e testing, and we currently don't have the combination of standalone=True and min_gpus=x in the CI. Also, I wasn't convinced that CPU testing would be representative enough. Maybe we need to revisit this later, because it could be easy to miss updates to the tests.

if dp_mesh.size() > 1:
assert dp_mesh.ndim == 1 # Hybrid-sharding not supported

mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
Contributor

Isn't this meant to be encapsulated by fabric?

Member Author

Since the user modifies the model directly here, probably not. In any case not for this iteration.

Contributor

Worth leaving a comment then, because as a user I would be confused about the interaction between doing this and setting precision="something" in Fabric.

Member Author (awaelchli, May 7, 2024)

Right, I added a comment. We will need to do something about precision soon; let's brainstorm this in a follow-up.

@awaelchli awaelchli requested a review from carmocca May 7, 2024 07:38

@mergify mergify bot added the ready PRs ready to be merged label May 7, 2024
@awaelchli awaelchli merged commit 0c8a193 into master May 7, 2024
117 of 118 checks passed
@awaelchli awaelchli deleted the examples/tp branch May 7, 2024 21:02