Do we need an N-dim sub-DeviceMesh? #126530

Closed
botbw opened this issue May 17, 2024 · 2 comments
Labels: module: dtensor, oncall: distributed, triaged

Comments


botbw commented May 17, 2024

🚀 The feature, motivation and pitch

Hey there! Currently torch.distributed._tensor.DeviceMesh only supports slicing out 1-D sub-meshes. Would it be possible to manipulate it like an NDArray and generate N-dim sub-meshes?

For example, in 2-D tensor parallelism combined with pipeline parallelism, the mesh looks like [pp, tp0, tp1] == [2, 2, 2]. If an all-gather/all-reduce over the whole tp group is needed for rank == 0, mesh['tp0'] and mesh['tp1'] only give [0, 1] and [0, 2], so rank 3 is missing.

import os

from torch.distributed._tensor import DeviceMesh

# Repro: run with e.g. `torchrun --nproc_per_node=8 <this_script>.py`
rank = int(os.environ["RANK"])

if __name__ == "__main__":
    # 3-D mesh of shape [pp, tp0, tp1] == [2, 2, 2]
    mesh = DeviceMesh(
        "cpu",
        [
            [
                [0, 1],
                [2, 3],
            ],  # pp_rank == 0
            [
                [4, 5],
                [6, 7],
            ],  # pp_rank == 1
        ],
        mesh_dim_names=["pp", "tp0", "tp1"],
    )
    if rank == 0:  # pp_rank == tp0_rank == tp1_rank == 0
        # Each 1-D slice covers only 2 of the 4 tp ranks; rank 3 never appears.
        print(mesh["tp0"])
        print(mesh["tp1"])

A workaround is to represent tp0 and tp1 as a single dim, but there could be scenarios in which tp0 is replicated and only tp1 needs the communication; a sketch of that flattening workaround follows below.
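
For illustration, here is a minimal sketch of that flattening workaround (the flattened "tp" dim name and the 8-rank layout are choices made for this example, not part of the original report):

import os

from torch.distributed._tensor import DeviceMesh

# Sketch: collapse tp0 and tp1 into a single 4-way "tp" dim so that one
# 1-D slice covers the whole tensor-parallel group for the calling rank.
# Run with e.g. `torchrun --nproc_per_node=8 <this_script>.py`
rank = int(os.environ["RANK"])

if __name__ == "__main__":
    mesh = DeviceMesh(
        "cpu",
        [
            [0, 1, 2, 3],  # pp_rank == 0
            [4, 5, 6, 7],  # pp_rank == 1
        ],
        mesh_dim_names=["pp", "tp"],
    )
    if rank == 0:
        # Covers ranks 0-3, but tp0 and tp1 can no longer be addressed separately.
        print(mesh["tp"])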

Alternatively, could you please suggest another workaround for this?

Alternatives

No response

Additional context

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @msaroufim

@awgu added the oncall: distributed label May 17, 2024
@yf225 added the module: dtensor and triaged labels May 20, 2024
botbw (Author) commented May 21, 2024

Hey, any updates on this?
@awgu @yf225 @wanchaol @wz337

wz337 (Contributor) commented May 21, 2024

@botbw Thanks for raising the issue. We are actually working on this feature and hoping to have it in the nightlies ASAP so we can get it into the 2.4 release as well.
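
For reference, a minimal sketch of what N-dim submesh slicing could look like once the feature lands (this assumes DeviceMesh indexing accepts multiple mesh dim names, as in later PyTorch releases; the exact API may differ):

# Sketch only: assumes a PyTorch version where DeviceMesh supports slicing
# a submesh with multiple dim names, e.g. mesh["tp0", "tp1"].
# Run with e.g. `torchrun --nproc_per_node=8 <this_script>.py`
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cpu", (2, 2, 2), mesh_dim_names=("pp", "tp0", "tp1"))

# A 2-D submesh spanning both tensor-parallel dims for the calling rank,
# so a collective over it reaches all 4 tp ranks in the same pp group.
tp_submesh = mesh["tp0", "tp1"]
print(tp_submesh)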

bigfootjon pushed a commit that referenced this issue Jun 5, 2024
Fixes #126530

Pull Request resolved: #127465
Approved by: https://github.com/wconstab

(cherry picked from commit e72232f)
bigfootjon pushed a commit that referenced this issue Jun 5, 2024
petrex pushed a commit to petrex/pytorch that referenced this issue Jun 5, 2024