[AutoParallel] Dygraph basic impl for semi auto parallel #55698
Conversation
force-pushed the ap/dygraph_basic_impl branch from ea82d63 to 016c953
force-pushed the ap/dygraph_basic_impl branch from 186eba3 to fcd9b5a
mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=["x", None])
For the tests, it seems we can just create a tensor in the replicated state directly. The DistTensor constructor does not yet separate local_meta from dist_meta, and the logic for creating the local_tensor from the dist_attr has not been added, so everything created defaults to the replicated state; specifying non-replicated sharding_specs is meaningless for the time being.
done, thx
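For reference, a minimal sketch of what this suggests, assuming the `dist.shard_tensor` test API used elsewhere in this PR (hypothetical shape; all-`None` sharding_specs denote the replicated state the constructor currently produces by default):

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
# All-None sharding_specs = replicated along every mesh dimension, the only
# state the current DistTensor constructor actually creates.
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=[None, None])
d_tensor = dist.shard_tensor(paddle.rand([4, 4]), dist_attr=dist_attr)
```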
    : meta_(meta), dist_attr_(dist_attr) {
  value_ = std::make_unique<DenseTensor>(*dense_tensor);
}
This and the constructor above both take a DenseTensor; in theory, shouldn't one of them be enough?
done, thx
#include "paddle/phi/core/distributed/auto_parallel/dist_attr.h" | ||
#include "paddle/phi/core/distributed/auto_parallel/dist_tensor.h" |
Shouldn't these also be wrapped in WITH_DISTRIBUTE? Although at the moment dist_attr gets compiled whether or not WITH_DISTRIBUTE is enabled.
done, thx
paddle/phi/core/meta_tensor.cc (outdated)
#ifdef PADDLE_WITH_DISTRIBUTE
      || is_dist_tensor
#endif
If an is_dist_tensor = false is declared up front, one of the #ifdef blocks here could be dropped.
done, thx
"The kernel of ({}) for input tensors is unimplemented, please check the type of input tensors.")); | ||
""" | ||
|
||
# TODO(chenweihang): add profle function code later |
profile?
done, thx
if len(self.kernel['func']) > 1:
    kernel_dispatch_code = ''
    for kernel_name in self.kernel['func']:
        kernel_dispatch_code += self.gene_dispatch_code(
            kernel_name, inplace_flag
        )
    return API_IMPL_TEMPLATE.format(
Is this where dist is left unsupported when multiple kernels are configured?
For now, yes. DistTensor only supports DenseTensor and reuses the DenseTensor kernels.
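A hedged sketch of the guard being described (the helper and template names below are illustrative, not the actual generator code): the dist branch is only emitted for single-kernel APIs, since a DistTensor currently wraps exactly one DenseTensor and reuses its kernel.

```python
# Illustrative only: DIST_BRANCH_TEMPLATE and gene_dist_branch_code are
# hypothetical names, not the real code-generation entry points.
DIST_BRANCH_TEMPLATE = """
  if (run_auto_parallel) {{
    // reuse the DenseTensor kernel "{kernel_name}" on DistTensor inputs
  }}
"""


def gene_dist_branch_code(self, kernel_name):
    # APIs configured with multiple kernels (e.g. dense + selected_rows)
    # get no dist branch; they keep the existing dispatch-only path.
    if len(self.kernel['func']) > 1:
        return ''
    return DIST_BRANCH_TEMPLATE.format(kernel_name=kernel_name)
```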
@@ -178,6 +199,10 @@ void GradTensorHolder::add(size_t slot_id,
                           &buffer_values);
    }
  }
#ifdef PADDLE_WITH_DISTRIBUTE
  } else if (t.is_dist_tensor()) {
    buffer_tensor = add_ad_func(t, buffer_tensor);
Can add_ad_func support DistTensor?
It should. add_ad_func calls paddle::experimental::add, and this PR generates a dist branch for paddle::experimental::add.
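As a rough usage-level illustration (a sketch assuming the test-style Python API from this PR; exact signatures may differ), an add over two DistTensors would now route through that generated branch:

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=[None, None])

a = dist.shard_tensor(paddle.ones([2, 2]), dist_attr=dist_attr)
b = dist.shard_tensor(paddle.ones([2, 2]), dist_attr=dist_attr)

# paddle.add lowers to paddle::experimental::add, which gains a DistTensor
# branch in this PR, so gradient accumulation via add_ad_func also works.
c = paddle.add(a, b)
```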
@@ -248,6 +248,27 @@ static PyObject* tensor_method_numpy(TensorObject* self,
                place,
                dense_tensor->Holder()->ptr(),
                dense_tensor->Holder()->size());
#ifdef PADDLE_WITH_DISTRIBUTE
  } else if (self->tensor.is_dist_tensor()) {
    // TODO(chenweihang): deal with DistTensor as local DenseTensor now,
A DistTensor constructed directly from numpy should be a global tensor.
Yes, but this is the tensor.numpy() method, and for now it just takes the local tensor value for printing. If a user calls numpy() on an arbitrary tensor inside the network, the value could be sharded or partial; we can decide later whether to first gather the full value back before converting to numpy.
LGTM
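In other words, a hedged sketch of the behavior described above (replicated case shown; sharded construction is not implemented yet in this PR):

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=[None, None])
d = dist.shard_tensor(paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]]),
                      dist_attr=dist_attr)

# numpy() currently returns the *local* DenseTensor value. For this
# replicated tensor it equals the global value; for a shard or partial
# tensor produced mid-network it generally would not.
print(d.numpy())
```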
@@ -93,6 +93,7 @@ std::shared_ptr<DistTensor> RToSReshardFunction::Eval(

  return std::make_shared<DistTensor>(
      std::make_shared<DenseTensor>(out_physical_tensor_cur_rank),
      out_physical_tensor_cur_rank.meta(),
Shouldn't the second argument of the DistTensor constructor receive the distributed meta? It feels like the input's meta should be used here; taking the post-split meta makes the distributed (global) shape wrong.
The unit tests currently check this, and using the input's meta fails, so we're using the post-split meta for now.
LGTM
LGTM
PR types: Breaking changes
PR changes: Others
Description
Pcard-73145
[AutoParallel] Dygraph basic impl for semi auto parallel
Establishes the basic flow for executing DistTensor in dygraph mode.
Main work in this PR:
Because the semi-auto-parallel API execution logic requires intrusive changes to PHI's original execution flow, and to keep the auto code generation logic from becoming even harder to maintain, the generation logic for the semi-auto API branches has been reworked in newly added files and is maintained separately, which also makes it easier to integrate InferSPMD, Reshard, and other features later.
Example of the generated semi-auto-parallel code branch:
This PR only establishes the basic execution flow; a number of issues remain to be addressed in follow-up work.
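For illustration, the basic dygraph flow this establishes, as a hedged end-to-end sketch (Python API as in this PR's tests; exact signatures and supported ops may differ):

```python
import paddle
import paddle.distributed as dist
import paddle.nn.functional as F

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
dist_attr = dist.DistAttr(mesh=mesh, sharding_specs=[None, None])

x = dist.shard_tensor(paddle.rand([4, 4]), dist_attr=dist_attr)
x.stop_gradient = False

y = F.relu(x)    # forward runs through the generated DistTensor branch
loss = y.sum()
loss.backward()  # GradTensorHolder::add accumulates DistTensor grads

print(x.grad)    # the gradient is a DistTensor as well
```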