add the basic apis for auto_parallel #33804

Merged
merged 60 commits into PaddlePaddle:develop from auto_parallel_basic on Aug 11, 2021

Conversation

sandyhouse

@sandyhouse sandyhouse commented Jun 28, 2021

PR types

New features

PR changes

Others

Describe

  1. add the basic directory for auto_parallel (python/paddle/distributed/auto_parallel)
  2. add the following APIs:
  • ProcessMesh
  • shard_tensor
  • shard_op
  • set_pipeline_stage
  • set_offload_device
  • set_shard_mask

Usage (per-API snippets; a combined sketch follows the list):

  • ProcessMesh
    import numpy as np
    import paddle
    import paddle.distributed as dist
    
    paddle.enable_static()
    
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    assert mesh.parent is None
    assert mesh.topology == [2, 3]
    assert mesh.process_group == [2, 4, 5, 0, 1, 3]
    mesh.set_placement([0, 1, 2, 3, 4, 5])
  • shard_tensor
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    x = paddle.ones([4, 6])
    dist.shard_tensor(x, mesh, [0, -1])
  • shard_op
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    x = paddle.ones([4, 6])
    y = paddle.zeros([4, 6])
    kwargs = {'x': x, 'y': y}
    dist.shard_op(paddle.add, mesh, None, **kwargs)
  • set_pipeline_stage
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    dist.set_pipeline_stage(1)
  • set_offload_device
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    x = paddle.ones([4, 6])
    dist.set_offload_device(x, 'cpu')
  • set_shard_mask
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    x = paddle.ones([4, 6])
    mask = np.array([[1, 0, 1], [0, 1, 0]])  # illustrative values; the mask shape must match the [2, 3] mesh topology
    dist.set_shard_mask(x, mask)
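
Putting the pieces together, a minimal sketch of annotating a toy program with these APIs; the combination and the comments on dims-mapping and stage semantics are illustrative assumptions, not part of this PR's description:

    import numpy as np
    import paddle
    import paddle.distributed as dist

    paddle.enable_static()

    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))

    x = paddle.ones([4, 6])
    y = paddle.zeros([4, 6])

    # shard x: map its first axis to mesh dimension 0; -1 leaves an axis unsharded (assumed semantics)
    dist.shard_tensor(x, mesh, [0, -1])

    # annotate the add op with the same mesh
    kwargs = {'x': x, 'y': y}
    dist.shard_op(paddle.add, mesh, None, **kwargs)

    # keep y on the host and mark the current pipeline stage (assumed scope)
    dist.set_offload_device(y, 'cpu')
    dist.set_pipeline_stage(1)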

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@sandyhouse sandyhouse changed the title add the basic directory for auto_parallel [WIP] add the basic directory for auto_parallel Jul 1, 2021
@paddle-bot-old

Sorry to inform you that bf24fb7's CIs passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@sandyhouse sandyhouse changed the title [WIP] add the basic directory for auto_parallel add the basic directory and related apis for auto_parallel Aug 5, 2021
@sandyhouse sandyhouse changed the title add the basic directory and related apis for auto_parallel add the basic apis for auto_parallel Aug 6, 2021
fuyinno4 previously approved these changes Aug 6, 2021
PangHua previously approved these changes Aug 6, 2021
And the first logical process is the one with id=2.

Args:
mesh (numpy.ndarray): an N-dimensional array that describes the topology
Contributor:
What is the reason for using numpy.ndarray as the parameter type here? Judging from the example code, wouldn't a plain Python list be enough?

Author:
Changed to use a Python list.
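
For reference, a minimal sketch of the updated usage, assuming the final API takes a nested Python list directly in place of the numpy.ndarray:

    import paddle
    import paddle.distributed as dist

    paddle.enable_static()

    # the mesh topology is now given as a plain Python list
    mesh = dist.ProcessMesh([[2, 4, 5], [0, 1, 3]])
    assert mesh.topology == [2, 3]
    assert mesh.process_group == [2, 4, 5, 0, 1, 3]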


Args:
x (Tensor): the tensor to process.
mask (numpy.ndarray): the shape of `mask` must be the same as the topology of the ProcessMesh the tensor belongs to
Contributor:
Could the mask here also just be a Python list?

Author:
Changed to use a Python list.
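
Likewise, a minimal sketch with the mask given as a nested Python list; the 0/1 values are illustrative, and the mask shape must match the [2, 3] mesh topology:

    import paddle
    import paddle.distributed as dist

    paddle.enable_static()

    mesh = dist.ProcessMesh([[2, 4, 5], [0, 1, 3]])
    x = paddle.ones([4, 6])
    # illustrative mask values; shape matches the [2, 3] mesh topology
    mask = [[1, 0, 1], [0, 1, 0]]
    dist.set_shard_mask(x, mask)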


Args:
x (tensor): the tensor to process.
device (str): the device that the tensor `x` will be put on, e.g., 'gpu:0', 'cpu'.
Contributor:
In what situation would set_offload_device need to be set to 'gpu:0', and what would that mean?

Author:
Looking at real usage scenarios, the need for offloading is to move a specified tensor to the CPU, so 'gpu:0' has been removed here.

@@ -175,6 +191,7 @@ message VarDesc {
optional bool need_check_feed = 4 [ default = false ];
optional bool is_parameter = 5 [ default = false ];
optional bool stop_gradient = 6 [ default = false ];
repeated Attr attrs = 7;
Contributor:
Will these newly added fields be stored when the model is saved? From the example code, the fields are added as soon as the model is defined; if the model is saved right after it is defined, will all of these fields be saved as well? When are they removed?

Author:
Auto parallel mainly consists of the following stages: 1. annotating key tensors or ops with the auto-parallel APIs; 2. auto completion, which completes the distributed attributes of all tensors and ops; 3. logical partitioning; 4. physical mapping; 5. training. The newly added fields are only used in stages 1-3, so they are removed once those stages finish, and this is transparent to the user.

In the normal workflow, the model is saved after part or all of the training has run, and by that time the added fields have already been removed completely.

There is one special case: if the user saves the model immediately right after building the network, the related fields would be stored. We consider this case unreasonable in practice, because saving a model right after network construction is pointless.

mesh_id = self.attr(mesh_attr_name)
return _g_process_mesh_map[mesh_id]

def dims_mapping(self, name):
Contributor:
Use 'dimension' for the overall dimensionality of a Tensor, usually counted from 1 (1-D Tensor, 2-D Tensor). Use 'axis'/'axes' when referring to a particular dimension of a Tensor, usually numbered from 0 (the 1st axis, the 2nd axis). Since this looks like the overall-dimensionality concept, it would be better to use the singular dim_mapping directly.

Author:
Done.

@sandyhouse sandyhouse dismissed stale reviews from PangHua and fuyinno4 via 773516b August 10, 2021 05:33
@sandyhouse sandyhouse requested review from XiaoguangHu01 and removed request for chenwhql August 10, 2021 10:02
@XiaoguangHu01 XiaoguangHu01 (Contributor) left a comment:

LGTM

@sandyhouse sandyhouse merged commit 3f962e7 into PaddlePaddle:develop Aug 11, 2021
@sandyhouse sandyhouse deleted the auto_parallel_basic branch March 8, 2022 10:01