
Implement no_sync for thunder.distributed.fsdp (PR2457) #45

Merged: 17 commits into main on May 3, 2024

Conversation

@crcrpar (Collaborator) commented on Mar 22, 2024

tldr

Enables no_sync for thunder.jit(thunder.distributed.fsdp(model)); a usage sketch follows the list below. The accompanying changes are:

  • new argument return_none_instead_of_grads for ThunderFunction.forward
    • This could be eliminated once a TraceCtx's bound symbols are no longer deleted when they just return one or more Nones
  • removal of the no_sync check before applying dist_prims.synchronize to args and kwargs
    • FSDP's forward needs this prim for its param AllGather
    • [ddp] visitor_transform removes dist_prims.all_reduce, dist_prims.wait, and pre-averaging when no_sync
    • [fsdp] visitor_transform removes comms and inserts dist_prims.stash_grad_for_fsdp and an optional param AllGather when no_sync
      • The generated trace and its executable Python code return unsynchronized unsharded gradients.
      • The prim's implementation accumulates the grads as param._thunder_fsdp_unsharded_grad.
      • ThunderFunction's backward returns Nones instead of such grads to avoid a shape mismatch between params and unsharded grads.
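
Usage sketch (not taken from this PR; model, optimizer, and micro_batches are placeholder names) of gradient accumulation with the no_sync support this change enables:

    from contextlib import nullcontext

    import thunder
    import thunder.distributed

    # Placeholder setup: assumes torch.distributed is initialized and that
    # `model`, `optimizer`, and `micro_batches` are defined elsewhere.
    cmodel = thunder.jit(thunder.distributed.fsdp(model))

    for i, batch in enumerate(micro_batches):
        # Skip gradient synchronization for all but the last micro batch;
        # unsharded grads are accumulated locally in the meantime.
        ctx = cmodel.no_sync() if i < len(micro_batches) - 1 else nullcontext()
        with ctx:
            cmodel(batch).sum().backward()

    optimizer.step()
    optimizer.zero_grad()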

as of fa61c49

  • llama-2-7b-hf
  • world size 8 H100s
  • micro batch size 1
  • global batch size 32
  • gradient accumulation 4
  • no bucketing (of AllGather and ReduceScatter)
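
For reference, these settings are consistent: global batch size 32 = micro batch size 1 × 8 ranks × 4 gradient-accumulation steps.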

zero2

command: torchrun --nproc-per-node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor --distributed_mode=fsdp --nsys_enabled=False --micro_batch_size=1 --global_batch_size=32 --skip_data_sync <false|true> --model_name=Llama-2-7b-hf --shard_mode=zero2 --bucketing_mode=none --json_path "<filename>.json" --return_metrics_as_json=true

                            w/ no_sync    w/o no_sync
tokens/sec                     82713.0        80341.0
memory consumption [GB]           65.6           40.3

zero3

command: torchrun --nproc-per-node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor --distributed_mode=fsdp --nsys_enabled=False --micro_batch_size=1 --global_batch_size=32 --skip_data_sync <false|true> --model_name=Llama-2-7b-hf --shard_mode=zero3 --bucketing_mode=none --json_path "<filename>.json" --return_metrics_as_json=true

                            w/ no_sync    w/o no_sync
tokens/sec                     77839.0        75511.9
memory consumption [GB]           52.5           27.1

- def forward(ctx, compiled_backward, saved_tensors, saved_other, flat_output, *flat_args):
+ def forward(
+     ctx,
+     return_none_instead_of_grads,

@crcrpar (Collaborator, Author) commented:

[RFC] new argument to ThunderFunction.apply
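
For illustration only, a minimal self-contained sketch (not thunder's actual ThunderFunction) of how a leading flag argument to a torch.autograd.Function can make backward return Nones in the gradient slots:

    import torch

    class SketchFunction(torch.autograd.Function):
        """Toy stand-in for the pattern; the real ThunderFunction differs."""

        @staticmethod
        def forward(ctx, return_none_instead_of_grads, *flat_args):
            ctx.return_none_instead_of_grads = return_none_instead_of_grads
            # Toy computation standing in for the compiled forward.
            return tuple(a * 2 for a in flat_args)

        @staticmethod
        def backward(ctx, *grad_outputs):
            # The flag argument itself never receives a gradient, hence the
            # leading None. When the flag is set (the no_sync case), grads are
            # stashed on the params instead of being returned, which avoids a
            # shape mismatch between sharded params and unsharded grads.
            if ctx.return_none_instead_of_grads:
                return (None,) + (None,) * len(grad_outputs)
            return (None,) + tuple(2 * g for g in grad_outputs)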

Comment on lines +297 to +362
if self.skip_data_sync:
    data_sync_ctx = self.model.no_sync
else:
    data_sync_ctx = nullcontext
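
A minimal sketch of how the selected context-manager factory can then be entered uniformly (model, skip_data_sync, and batch are placeholder names; the actual loop lives in thunder/benchmarks/benchmark_litgpt.py and is not reproduced here):

    from contextlib import nullcontext

    # Both `model.no_sync` and `nullcontext` are callables that return a
    # context manager, so the accumulation step can enter `data_sync_ctx()`
    # without branching on --skip_data_sync again.
    data_sync_ctx = model.no_sync if skip_data_sync else nullcontext
    with data_sync_ctx():
        model(batch).sum().backward()

Passing the context-manager factory around, rather than a boolean, keeps the call site branch-free.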
Resolved review threads on thunder/distributed/__init__.py (one marked outdated).
@crcrpar (Collaborator, Author) commented on Apr 23, 2024:

@t-vi this is ready for merge

@crcrpar (Collaborator, Author) commented on Apr 23, 2024:

How are HF's Accelerate and Lightning's Fabric wrapping PyTorch's no_sync?

@crcrpar force-pushed the crpa/fsdp-no-sync branch 3 times, most recently from e9bde84 to 078e7b1 on May 1, 2024 at 06:44.
crcrpar and others added 17 commits on May 3, 2024 at 17:47, each signed off by Masaki Kozuki <mkozuki@nvidia.com>; one commit was co-authored by Ivan Yashchuk <IvanYashchuk@users.noreply.github.com>.
@t-vi (Collaborator) left a review comment:

Thank you, @crcrpar @IvanYashchuk

@t-vi merged commit 85b2cd8 into main on May 3, 2024; 37 of 39 checks passed.
@t-vi deleted the crpa/fsdp-no-sync branch on May 3, 2024 at 12:26.