[distributed][Tensor Parallelism] Implement early transforms for column-wise and row-wise linear and embedding #410

Merged: 95 commits merged into main from crpa/tensor-parallel on May 31, 2024

Conversation

@crcrpar crcrpar (Collaborator) commented May 13, 2024

This implements a trace transform that converts one or more linear and/or embedding layers into column-wise or row-wise tensor-parallel ones by (1) sharding their weight and bias and (2) inserting the needed communication and/or scatter operations before and/or after the modified layers.

Of the four supported ops, only row-wise parallel linear leads to a BoundSymbol modification: the bias term is omitted from the sharded linear, and the bias is added to the result of the communication (after post-processing).
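For context, here is a minimal sketch of what that bias handling means for a row-wise parallel linear (illustrative only; the function name, process-group handling, and shapes are assumptions, not the transform's actual code):

import torch
import torch.distributed as dist

def row_parallel_linear(x_local: torch.Tensor, w_local: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Each rank holds a shard of the weight along the input dimension,
    # so the local matmul produces a partial sum and must omit the bias.
    y = x_local @ w_local.t()
    # Sum the partial results across ranks.
    dist.all_reduce(y, op=dist.ReduceOp.SUM)
    # The bias is added exactly once, after the communication.
    return y + bias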


Example

import torch
import torch.nn as nn
import torch.nn.functional as F

import thunder
import thunder.distributed


class Model(nn.Module):
    def __init__(self, n_in: int, n_hidden: int, n_out: int) -> None:
        super().__init__()
        self.l1 = nn.Linear(n_in, n_hidden)
        self.l2 = nn.Linear(n_hidden, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.l2(F.gelu(self.l1(x)))


# rank, n_in, n_hidden, and n_out are assumed to be defined; rank is this
# process's index in the initialized process group.
device = torch.device(f"cuda:{rank}")

model = Model(n_in, n_hidden, n_out).to(device)
jitted_model = thunder.jit(model)
tp_jitted_model = thunder.distributed.column_parallel(jitted_model, ("l1",))
tp_jitted_model = thunder.distributed.row_parallel(tp_jitted_model, ("l2",))

x = torch.randn(..., device=device)  # batch of inputs with n_in features
y = tp_jitted_model(x)
assert y.size(1) == n_out
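Assuming the usual column/row-parallel convention (stated as an assumption here, not verified against the transform's internals), column_parallel shards l1's weight along its output dimension and row_parallel shards l2's along its input dimension, with the transform inserting the necessary communication around each converted layer as described above. The snippet expects one process per GPU with a distributed process group already initialized, e.g. when launched via torchrun.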

cc @Borda @apaz-cli @carmocca @awaelchli @crcrpar

@crcrpar crcrpar requested a review from IvanYashchuk May 13, 2024 13:16
@crcrpar crcrpar force-pushed the crpa/tensor-parallel branch 2 times, most recently from 60844aa to f2278ed on May 13, 2024 16:53
@github-actions github-actions bot added the "documentation" (Improvements or additions to documentation) label on May 13, 2024
@crcrpar crcrpar marked this pull request as ready for review May 13, 2024 18:04
@crcrpar crcrpar force-pushed the crpa/tensor-parallel branch 4 times, most recently from 25b2d14 to f724a88 on May 17, 2024 16:04
@crcrpar crcrpar (Collaborator, Author) commented May 17, 2024

The failures as of f724a88 look related to #432.

Review threads (outdated, resolved): thunder/distributed/tensor_parallel.py (2 threads)
@crcrpar crcrpar changed the title from "[distributed][Tensor Parallelism] Implement Column-wise Linear" to "[distributed][Tensor Parallelism] Implement early transform for Column-wise Parallel" on May 20, 2024
@crcrpar crcrpar marked this pull request as draft May 23, 2024 07:07
@crcrpar crcrpar changed the title from "[distributed][Tensor Parallelism] Implement early transform for Column-wise Parallel" to "[distributed][Tensor Parallelism] Implement early transforms for column-wise and row-wise linear and embedding" on May 24, 2024
@crcrpar crcrpar marked this pull request as ready for review May 28, 2024 12:48
@lantiga lantiga (Collaborator) left a comment:

Amazing work @crcrpar

Mostly nitpicks; it was great fun reviewing this.

Review threads (resolved): thunder/distributed/__init__.py (3), thunder/distributed/prims.py (2), thunder/distributed/tensor_parallel/row_wise.py (1), thunder/executors/torchex.py (1), thunder/tests/distributed/test_ddp.py (3)
@t-vi t-vi (Collaborator) left a comment:
Supergood, LGTM. Very excited to see the test_tensor_parallel_both_column_and_row be the first actual example of composing early_transforms!
I added a few minor nits.

Review threads (resolved): thunder/distributed/prims.py (3), thunder/executors/torchex.py (1), thunder/distributed/tensor_parallel/column_wise.py (2)
This is to avoid passing the preprocessed input into other ops that are supposed to take the original input.

For example, suppose we have two embeddings and only one of them is column-parallel; the previous implementation modified the input regardless of each embedding's parallelism.
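To illustrate the scenario described above (a hypothetical setup; module names, sizes, and device handling are made up for illustration, not taken from the tests):

import torch
import torch.nn as nn

import thunder
import thunder.distributed


class TwoEmbeddings(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.emb_parallel = nn.Embedding(1000, 64)    # to be converted to column-parallel
        self.emb_replicated = nn.Embedding(1000, 64)  # left as-is

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # Both embeddings consume the same original indices; only the
        # column-parallel one should see the transform's input preprocessing.
        return self.emb_parallel(idx) + self.emb_replicated(idx)


jitted = thunder.jit(TwoEmbeddings().to("cuda"))
# Only "emb_parallel" is converted; "emb_replicated" must keep receiving the unmodified input.
tp_model = thunder.distributed.column_parallel(jitted, ("emb_parallel",))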

crcrpar added 15 commits May 30, 2024 07:38
@t-vi t-vi enabled auto-merge (squash) May 31, 2024 10:50
@lantiga lantiga (Collaborator) left a comment:

Looks great! Ship it! 🚀

@t-vi t-vi merged commit 9107a3d into main May 31, 2024
37 checks passed
@t-vi t-vi deleted the crpa/tensor-parallel branch May 31, 2024 16:47
@crcrpar crcrpar added the "tensor parallel" (distributed - tensor parallel) label on Jun 6, 2024
Labels: distributed · documentation (Improvements or additions to documentation) · tensor parallel (distributed - tensor parallel)
Projects: None yet
Development: Successfully merging this pull request may close these issues: None yet
4 participants