[AutoParallel] Dygraph basic impl for semi auto parallel #55698

Merged (29 commits into PaddlePaddle:develop, Aug 16, 2023)

Conversation

@chenwhql (Contributor) commented Jul 25, 2023

PR types

Breaking changes

PR changes

Others

Description

Pcard-73145

[AutoParallel] Dygraph basic impl for semi auto parallel

Open up the basic flow for executing DistTensor in dygraph mode:

  • Create a DistTensor on the Python side -> Eager forward API -> PHI forward API -> forward result -> Eager backward -> PHI backward API (a minimal Python sketch of this flow follows below)
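
A minimal Python sketch of this flow (illustrative only: the mesh and DistAttr creation follow the test snippet quoted later in this thread, while the construction call dist.shard_tensor(...) is an assumption, since the exact test code is not reproduced here):

import paddle
import paddle.distributed as dist
import paddle.nn.functional as F

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
# Replicated placement; per the review comments below, non-replicate
# sharding_specs have no effect at this stage.
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=[None, None])

# NOTE: assumed construction call, used only to illustrate "create a DistTensor
# on the Python side".
x = dist.shard_tensor(paddle.rand([2, 2]), dist_attr=dist_attr)
x.stop_gradient = False

out = F.relu(x)   # Eager forward API -> PHI forward API (dist branch, see the generated code below)
out.backward()    # Eager backward -> PHI backward API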

Main work in this PR:

  1. Add basic DistTensor handling logic to the dygraph forward and backward execution paths
  2. Add auto code generation logic for the dygraph semi-auto-parallel branches in the PHI forward and backward APIs

Because the semi-auto-parallel API execution logic requires intrusive changes to PHI's original execution logic, and to keep the code auto-generation from becoming even harder to maintain, the generation logic for the semi-auto-parallel API branches has been reorganized in newly added files and is maintained separately. This makes it easier to integrate InferSPMD, Reshard, and other features later.

Example of a generated semi-auto-parallel code branch:

  // Auto Parallel condition
  if (AllInputsAreDistTensor(x)) {
    // 1. Create API Output & Prepare Dist and Dense Output
    Tensor api_output;

    auto dist_out = SetKernelDistOutput(&api_output);
    auto dense_out = dist_out->mutable_value();

    // 2. InferSPMD (Infer Global Shape and DistAttr of Inputs&Outputs)
    phi::MetaTensor meta_dist_out(dist_out);
    phi::UnchangedInferMeta(MakeMetaTensor(*x.impl()), &meta_dist_out);

    // 3. Select Kernel
    VLOG(6) << "relu API dist branch: kernel key: [" << kernel_backend << ", " << kernel_layout << ", "<< kernel_data_type << "]";
    auto kernel_result = phi::KernelFactory::Instance().SelectKernelOrThrowError(
        "relu", {kernel_backend, kernel_layout, kernel_data_type});
    const auto& kernel = kernel_result.kernel;
    VLOG(6) << "relu kernel: " << kernel;
    auto* dev_ctx = GetDeviceContextByBackend(kernel_result.has_fallback_cpu ? Backend::CPU : kernel_backend);

    // 4. Reshard Input

    // 5. PrepareData (DataTransform & Prepare Dist and Dense Input)
    auto dist_input_x = PrepareDataForDistTensor(x, GetKernelInputArgDef(kernel.InputAt(0), kernel_backend), {}, kernel_result.is_stride_kernel);
    auto input_x = dist_input_x->mutable_value();

    // 6. Infer Local DenseTensor Meta
    phi::MetaTensor meta_dense_out(dense_out);
    phi::UnchangedInferMeta(MakeMetaTensor(*input_x), &meta_dense_out);

    // 7. DenseTensor Kernel Call
    using kernel_signature = void(*)(const phi::DeviceContext&, const phi::DenseTensor&, phi::DenseTensor*);
    auto* kernel_fn = kernel.GetVariadicKernelFn<kernel_signature>();
    (*kernel_fn)(*dev_ctx, *input_x, dense_out);

    // 8. Reshard Output

    // 9. Return
    return api_output;
  }

This PR only opens up the basic execution flow; many issues still need to be handled later:

  1. Auto-parallel execution branches are currently generated only for APIs whose inputs and outputs are all plain Tensor; other input types (optional Tensor, vector of Tensor, optional vector of Tensor) and other output types (vector of Tensor) are not yet supported (see the short example after this list)
  2. InferSPMD still needs to be designed. If only the shape needs to be inferred, DistTensor may not need a DenseTensorMeta member but only a DDim member; otherwise storing two copies of layout and dtype may become inconsistent. How to reuse InferMeta, and how to combine it with DistAttr inference, is also still unclear and needs design
  3. The boundary of WITH_DISTRIBUTE is currently blurry in the code: some places are guarded by the macro and some are not (e.g. communication-related utils and kernels), which easily causes build errors. This needs to be cleaned up later; it could be handled uniformly at the Python-to-C++ boundary so users see a correct error message, while the C++ side avoids special-casing as much as possible, otherwise both the development and reading experience suffer
  4. The profiling logic in the original generated APIs is messy and has no clear boundary; it needs to be modularized for maintainability and will be added back later
  5. Many remaining TODOs are recorded directly in the code and will be addressed one by one later
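
A short illustration of item 1 (the ops below are used only as examples of the signature categories; dist.shard_tensor is again an assumed construction call, with mesh and dist_attr built as in the sketch above):

x_dist = dist.shard_tensor(paddle.rand([2, 2]), dist_attr=dist_attr)  # assumed API
y_dist = dist.shard_tensor(paddle.rand([2, 2]), dist_attr=dist_attr)  # assumed API

out = F.relu(x_dist)                   # Tensor in / Tensor out: a dist branch is generated
out = paddle.add(x_dist, y_dist)       # still all plain Tensors: covered by this PR
out = paddle.concat([x_dist, y_dist])  # vector-of-Tensor input: not yet covered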

Comment on lines 65 to 66
mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=["x", None])
Contributor

I think the tests can just create a tensor in replicate state directly for now. The DistTensor constructor does not yet separate local_meta from dist_meta, and there is no logic yet for creating the local_tensor from dist_attr, so by default everything is created in replicate state; specifying a non-replicate sharding_specs has no effect for the moment.

Contributor Author

done, thx

: meta_(meta), dist_attr_(dist_attr) {
value_ = std::make_unique<DenseTensor>(*dense_tensor);
}

Contributor

This constructor and the one above both take a DenseTensor; in principle, wouldn't keeping just one of them be enough?

Contributor Author

done, thx

Comment on lines 21 to 22
#include "paddle/phi/core/distributed/auto_parallel/dist_attr.h"
#include "paddle/phi/core/distributed/auto_parallel/dist_tensor.h"
Contributor

Shouldn't these includes also be wrapped with WITH_DISTRIBUTE? Although right now dist_attr gets compiled whether WITH_DISTRIBUTE is on or not.

Contributor Author

done, thx

Comment on lines 243 to 245
#ifdef PADDLE_WITH_DISTRIBUTE
|| is_dist_tensor
#endif
Contributor

If you set is_dist_tensor = false up front, one of the #ifdefs here can be dropped.

Contributor Author

done, thx

"The kernel of ({}) for input tensors is unimplemented, please check the type of input tensors."));
"""

# TODO(chenweihang): add profle function code later
Contributor

profile?

Contributor Author

done, thx

Comment on lines +636 to +642
if len(self.kernel['func']) > 1:
kernel_dispatch_code = ''
for kernel_name in self.kernel['func']:
kernel_dispatch_code += self.gene_dispatch_code(
kernel_name, inplace_flag
)
return API_IMPL_TEMPLATE.format(
Contributor

Is this the handling for the case where multiple kernels are configured and dist is not supported?

Contributor Author

Yes, for now. DistTensor only supports DenseTensor and reuses the DenseTensor kernels.

@@ -178,6 +199,10 @@ void GradTensorHolder::add(size_t slot_id,
&buffer_values);
}
}
#ifdef PADDLE_WITH_DISTRIBUTE
} else if (t.is_dist_tensor()) {
buffer_tensor = add_ad_func(t, buffer_tensor);
Member

Can add_ad_func support DistTensor?

Contributor Author

It should. add_ad_func calls paddle::experimental::add, and this PR generates a dist branch for paddle::experimental::add.

@@ -248,6 +248,27 @@ static PyObject* tensor_method_numpy(TensorObject* self,
place,
dense_tensor->Holder()->ptr(),
dense_tensor->Holder()->size());
#ifdef PADDLE_WITH_DISTRIBUTE
} else if (self->tensor.is_dist_tensor()) {
// TODO(chenweihang): deal with DistTensor as local DenseTensor now,
Member

A DistTensor constructed directly from numpy should be a global tensor.

Contributor Author

Yes, but this is the tensor.numpy() method; for now it just takes the local tensor's values for printing. If a user calls numpy() on an arbitrary tensor inside the network, the result may be sharded or partial. We can decide later whether to gather the full values back first and then convert to numpy.
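
A short sketch of the caveat described above (the construction call is again an assumption; only the numpy() semantics are the point):

import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=[None, None])
x = dist.shard_tensor(paddle.ones([2, 2]), dist_attr=dist_attr)  # assumed API

# Under this PR, numpy() returns the values of the local DenseTensor on the
# current rank; for a sharded or partial intermediate tensor this is only the
# local piece, not the reassembled global value.
local_vals = x.numpy()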

@ForFishes (Member) left a comment

LGTM

@@ -93,6 +93,7 @@ std::shared_ptr<DistTensor> RToSReshardFunction::Eval(

return std::make_shared<DistTensor>(
std::make_shared<DenseTensor>(out_physical_tensor_cur_rank),
out_physical_tensor_cur_rank.meta(),
Contributor

Should the second argument of the DistTensor constructor here be the distributed meta? It feels like the input's (in) meta should be used here; taking the post-split meta makes the distributed (global) shape wrong.

Contributor Author

The current unit tests check this; using the input's (in) meta fails, so the post-split meta is used for now.

@LiYuRio (Contributor) left a comment

LGTM

@raindrops2sea (Collaborator) left a comment

LGTM

@chenwhql chenwhql merged commit 7039bef into PaddlePaddle:develop Aug 16, 2023
26 checks passed