
[AutoParallel] Adapt static spmd rules for dynamic graph #56367

Merged
merged 31 commits into from Aug 31, 2023

Conversation

chenwhql
Contributor

@chenwhql chenwhql commented Aug 16, 2023

PR types

New features

PR changes

Others

Description

Pcard-73145

[AutoParallel] Adapt static infer spmd

This PR adapts the existing SPMD (sharding propagation) inference rules for dynamic-graph semi-auto parallelism. This requires some local changes to the current design, explained below.

The core functions of the existing SPMD rule base class are:

class SPMDRuleBase {
 public:
  virtual ~SPMDRuleBase() {}

  virtual std::pair<std::vector<TensorDistAttr>, std::vector<TensorDistAttr>>
  InferForward(const std::vector<DistTensorSpec>& input_specs,
               const paddle::framework::AttributeMap& attrs);

  virtual std::pair<std::vector<TensorDistAttr>, std::vector<TensorDistAttr>>
  InferBackward(const std::vector<DistTensorSpec>& output_specs,
                const paddle::framework::AttributeMap& attrs);
  ...
};
  1. Since dynamic graphs have strict requirements on dispatch performance at execution time, and the overall architecture currently generates specialized interfaces per operator, the SPMD inference functions themselves need to be implemented in variadic form, similar to InferMeta and phi Kernels, for example:
using SpmdInfo =
    std::pair<std::vector<TensorDistAttr>, std::vector<TensorDistAttr>>;

SpmdInfo MatmulSpmdInferForward(const DistMetaTensor& x,
                                const DistMetaTensor& y,
                                bool trans_x,
                                bool trans_y);
  2. Since unified dynamic/static execution must also be supported, the variadic SPMD inference functions need to be normalized into one uniform function form that static-graph semi-auto parallelism can dispatch, so template metaprogramming macros implement this normalization, for example:
using InferSpmdFn = SpmdInfo (*)(const InferSpmdContext&);

#define PD_INFER_SPMD(...)                                    \
  ::phi::distributed::InferSpmdFnImpl<decltype(&__VA_ARGS__), \
                                      &__VA_ARGS__>::Call

// PD_INFER_SPMD(MatmulInferSpmd) converts the function above into the InferSpmdFn form; the exact normalized form is still to be settled
  3. Since the core sharding inference and conversion logic of dynamic-graph semi-auto parallelism must be callable from phi, and phi cannot depend on fluid, the SPMD rules and their core data structures need to migrate into phi. This PR migrates them as follows (the exact layout is open for discussion):
  • core data structures move to phi/core/distributed/auto_parallel
  • per-operator SPMD inference moves to phi/infermeta/spmd_rules; operator-specific implementations should in principle not live under the core directory, and since spmd is a kind of tensor meta information, placing it under infermeta is reasonable
  4. The return value of Spmd inference functions stays std::pair<std::vector<TensorDistAttr>, std::vector<TensorDistAttr>> for now, consistent with the original design. Given the dynamic graph's dispatch-performance requirements this is probably not the final form: constructing and destructing STL containers noticeably affects API dispatch performance, so the result may eventually need to be set directly on DistTensor's dist_attr_ member, to be decided after measuring the scheduling overhead.

  5. The input parameters of Spmd inference functions are normalized through InferSpmdContext, for the following reasons:

  • For now, const std::vector<DistTensorSpec>& input_specs plus const paddle::framework::AttributeMap& attrs would meet the need, but if inputs later mix Tensor and vector<Tensor> parameters, ranges may have to be introduced to tell them apart; normalizing through a context absorbs such future changes without rewriting every function signature
  • The context can swap in more efficient container implementations as needed, without affecting the inference function signatures
  6. The input tensors of a Spmd inference function need an extra container to hold them; small_vector is used instead of std::vector to save some heap construction/destruction overhead.

  7. The input attributes of a Spmd inference function must use a vector rather than a map, because attributes carry no names when passed in the dynamic-graph execution flow; a vector can also accommodate the static graph's map-style inputs, and if necessary the many arg-mapping functions phi already provides can be reused here.

  8. Following the naming conventions in CodeStyle, the naming uses Spmd rather than SPMD.

  9. The original SPMD rules are migrated rather than copied, keeping only one copy of the code; the utils functions are copied for now since they are used in many places, and the original implementations can be removed once migration completes, so this PR does not do a global replacement.

  10. The original Python-side unit tests keep their form as much as possible, so the pybind layer adapts the different argument forms through parameter handling; this can be revisited according to the needs of static-graph semi-auto parallelism.

@paddle-bot

paddle-bot bot commented Aug 16, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@chenwhql chenwhql changed the title [AutoParallel] Adapt static infer spmd [AutoParallel] Adapt static infer spmd demo Aug 22, 2023
@chenwhql chenwhql changed the title [AutoParallel] Adapt static infer spmd demo [AutoParallel] Adapt static spmd rules for dynamic graph Aug 25, 2023
})
    .def("infer_backward",
         [](const phi::distributed::SpmdRule &self,
            const std::vector<DistTensorSpec> &input_specs,
Contributor

infer_backward needs the info of both input tensors and output tensors for inference; please refer to the new API:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/distributed/auto_parallel/spmd_rules/common.h#L62

Contributor Author

Done, changed the pybind infer_backward API to this format.

@@ -340,6 +343,44 @@ void BindAutoParallel(py::module *m) {
.def("infer_forward", &SPMDRuleBase::InferForward)
.def("infer_backward", &SPMDRuleBase::InferBackward);

py::class_<phi::distributed::SpmdRule>(*m, "SpmdRule")
    .def("infer_forward",
         [](const phi::distributed::SpmdRule &self,
Contributor

DistTensorSpec seems redundant now. Would it be better to expose the InferSpmdContext and MetaTensor APIs to Python, and have static mode build the input ctx directly?

Contributor Author

Yes, this can be decided according to the needs of static-graph semi-auto parallelism. This PR tries to change the original test framework as little as possible.

y_dist_tensor_spec.set_dims_mapping({-1, 0});
infered_dist_attrs = matmul_rule->InferForward(
    {x_dist_tensor_spec, y_dist_tensor_spec}, attrs);
x_dist_attr.set_dims_mapping({-1, -1});
Contributor

Would it be better to provide an API that builds a MetaTensor from "shape" and "dist_attr" directly, or that builds an InferSpmdContext from "shape", "dist_attr", and the attributes directly?

Contributor Author
@chenwhql chenwhql Aug 28, 2023

Not recommended: MetaTensor is a thin encapsulation and does not own the object's lifetime. If such a constructor is required, I would rather derive a DistMetaTensor to do it.

}

// TODO(chenweihang): support other attr type later by needed
PD_SPECIALIZE_InferSpmdFnCallHelper_FOR_ATTRIBUTE(bool);
Contributor

Will this method cover complex attribute types like std::vector<int64_t> and other std::vector specializations?

Contributor Author

Yes, we can support these types later; matmul does not cover these types for now.

static SpmdInfo Call(const InferSpmdContext& ctx, PreviousArgs&... pargs) {
  static_assert(attr_idx == 0,
                "InferSpmd's Input should appear before Attributes.");
  const DistMetaTensor& arg = ctx.InputAt(in_idx);
Contributor

Should ctx maintain input_tensor_list and output_tensor_list separately?

In the case of variadic input/output ops, it may be a problem:

  • variadic input and single output op (concat, add_n): could be adapted by assuming the last tensor is the output
  • single input and variadic output op (split, unstack): could be adapted by assuming the first tensor is the input
  • variadic input and variadic output op (none yet, but in the future?): could not be adapted

Contributor Author

There is no need to distinguish input and output lists here.

For vector inputs and outputs, the rule function's signature will faithfully reflect the vector type, so no extra merging is needed.

For example:

  • concat op
SpmdInfo ConcatSpmdInferForward(const std::vector<const DistMetaTensor*>& x,
                                const DistMetaTensor& out,
                                const Scalar& axis_scalar);
  • split op
SpmdInfo SplitSpmdInferBackward(const DistMetaTensor& x,
                                const std::vector<const DistMetaTensor*>& out,
                                const IntArray& sections,
                                const Scalar& axis);

auto out_shape = output_specs[0].shape();
SpmdInfo MatmulSpmdInferBackward(const DistMetaTensor& x,
                                 const DistMetaTensor& y,
                                 const DistMetaTensor& out,
Contributor

For variadic ops like split and concat, should we use a vector for the variadic slot?

Phi api for concat:
PADDLE_API Tensor concat(const std::vector<Tensor>& x, const Scalar& axis)

spmd for concat:
SpmdInfo ConcatSpmdInferBackward(const std::vector<DistMetaTensor>& x, const DistMetaTensor& out, const Scalar& axis)

And for the ReplicatedSpmd rule, which is the fallback rule for all ops that have no specific rule:
SpmdInfo ReplicatedSpmdInferBackward(const std::vector<DistMetaTensor>& x, const std::vector<DistMetaTensor>& out)

Contributor Author
@chenwhql chenwhql Aug 29, 2023

Same as above.

For the ReplicatedSpmd rule, we can use the general format:

SpmdInfo ReplicatedSpmdInferBackward(
    const std::vector<const DistMetaTensor*>& x,
    const std::vector<const DistMetaTensor*>& out,
    const std::vector<phi::Attribute>& attrs)

We can also unify its format into SpmdInfo (*)(const InferSpmdContext&).

Comment on lines +28 to +30
# After replaced all spmd rules by phi impl, we can recover the
# api name to `get_spmd_rule`
self.rule = core.get_phi_spmd_rule("matmul")
Contributor

This comment could be added to the pybind interface.

Contributor Author

Thanks, I will adjust this in the next PR.

Contributor
@JZ-LIANG JZ-LIANG left a comment

LGTM

Contributor
@LiYuRio LiYuRio left a comment

LGTM

Contributor
@XieYunshen XieYunshen left a comment

LGTM

@chenwhql chenwhql merged commit 54fcd9a into PaddlePaddle:develop Aug 31, 2023
25 of 26 checks passed
BeingGod pushed a commit to BeingGod/Paddle that referenced this pull request Sep 9, 2023
…e#56367)

* move matmul spmd rules into phi

* add basic infer spmd utils

* addspmd factory

* fix compile error

* add unittest

* refine infer spmd test and utils

* debug infer spmd test

* adapt python test

* poish details

* change to vector attr arg

* revert needless change

* update matmul spmd rule test

* remove original rule

* polish details

* fix marco error

* add comment

* pass backward test

* fix compile error

* add cmake rule for spmd_rules_test

* add dist meta tensor

* update pybind impl

* add marco for rules