
【PIR Dist Op Reg No.10】 reg pull_gpups_sparse #62935

Merged · 24 commits · Apr 25, 2024

Conversation

@xiaoyewww (Contributor) commented Mar 21, 2024

PR Category

Execute Infrastructure

PR Types

Devs

Description

#60436
Register the pull_gpups_sparse operator.

paddle-bot commented Mar 21, 2024:

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle-bot added the contributor (External developers) label on Mar 21, 2024
@xiaoyewww (Contributor, Author):

@kangguangli A question here: the data_type for this op in ops.yaml is pinned to fp32 in GetExpectedKernelType. Can it be written directly as fp32?

@kangguangli (Contributor):

> @kangguangli A question here: the data_type for this op in ops.yaml is pinned to fp32 in GetExpectedKernelType. Can it be written directly as fp32?

Not at the moment; the code-generation mechanism doesn't support it.

@xiaoyewww (Contributor, Author):

> Not at the moment; the code-generation mechanism doesn't support it.

Then is there a problem with writing it this way: deriving it directly from out, with w as optional?

@kangguangli (Contributor):

> Then is there a problem with writing it this way: deriving it directly from out, with w as optional?

That is not reasonable either. You can refer to how nop is handled in #62384 and register interfaces : paddle::dialect::ParseKernelKeyInterface.

@xiaoyewww (Contributor, Author):

Thanks, fixed. But is there a problem with how my backward op is registered? Do I also need to register push_gpups_sparse with the same approach?

@kangguangli (Contributor) left a comment:

The registration of pull_gpups_sparse itself is fine. The main issue now is that the backward op of pull_gpups_sparse is push_gpups_sparse, not pull_gpups_sparse_grad. You registered pull_gpups_sparse_grad in some places and push_gpups_sparse in others; they need to be unified to push_gpups_sparse.

from paddle.base.layer_helper import LayerHelper


class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpTranslator):
Contributor:

Suggested change:
- class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpTranslator):
+ class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpWithBackwardTranslator):

Operators with a backward op inherit from this class. You also need to override its setUp method to set forward_op_type and backward_op_type; see the base-class implementation for the details.
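
For illustration, a minimal sketch of that override (assuming, per the description above, that the base class reads forward_op_type and backward_op_type attributes set in setUp):

    class TestPullGpupsSparseOpTranslator(
        test_op_translator.TestOpWithBackwardTranslator
    ):
        def setUp(self):
            # Attribute names follow the reviewer's description of the base
            # class; check the base-class implementation for anything else
            # setUp must initialize.
            self.forward_op_type = "pull_gpups_sparse"
            self.backward_op_type = "push_gpups_sparse"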

@@ -1823,6 +1823,16 @@
data_type : x
optional : boxes_num

- backward_op : pull_gpups_sparse_grad
Contributor:

Suggested change:
- - backward_op : pull_gpups_sparse_grad
+ - backward_op : push_gpups_sparse

The definition below needs the corresponding change as well.

Contributor Author:

Thanks, fixed.

luotao1 added the HappyOpenSource (快乐开源活动issue与PR) label on Mar 25, 2024
args : (Tensor w, Tensor ids, Tensor out, int[] size, bool is_sparse, bool is_distributed)
output : Tensor(out_grad)
infer_meta :
func : PushGpupsSparseInferMeta
Contributor:

This InferMeta isn't registered; it needs to be implemented.

Contributor Author:

Thanks. This infershape is an empty function, so I'll remove it.

Contributor:

Even if the infermeta is an empty function, infer_meta still has to be configured in the yaml. The current error is caused by the missing infer_meta.

Contributor Author:

@xingmingyyj Thanks, I'll fix it. Though for the forward op, I remember it could be configured even when there is no infershape implementation?

Contributor:

Let's remove the InferMeta here after all. I suspect the current error is related to the code-generation mechanism; it still needs to be pinpointed, so try deleting it first.
I'll start by deleting the infer_meta in backward.yaml here and see whether the unit test passes.

Comment on lines 206 to 208
void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*> ids,
const std::vector<const MetaTensor*> out,
const std::vector<int>& size,
Contributor:

The infermeta's parameter types don't match the parameters declared in backward.yaml; that is probably what's causing the current error.

@xiaoyewww (Contributor, Author):

@kangguangli Could you take another look? It feels like the backward op's parameters are wrong, but I've been searching for a long time and can't find where the inputs are actually defined.

Comment on lines +1842 to +1843
infer_meta :
func : PushGpupsSparseInferMeta
Contributor:

Suggested change:
- infer_meta :
-   func : PushGpupsSparseInferMeta

Contributor Author:

It looks like the same old problem again.

Contributor:

The earlier problem isn't quite the same as this one. It looks like an issue in the code-generation mechanism; the generated code may contain some scope-related syntax errors. We need to look at what the generated backward_api.cc looks like and do some deeper debugging. @xingmingyyj could you help take a look?

Contributor:

Sure.

@@ -1835,6 +1835,13 @@
data_type : x
optional : boxes_num

- backward_op : push_gpups_sparse
Contributor:

It needs to be registered like this:

- backward_op : push_gpups_sparse
  forward : pull_gpups_sparse (Tensor w, Tensor[] ids, int[] size={}, bool is_sparse=false, bool is_distributed=false) -> Tensor[](out)
  args : (Tensor[] ids, Tensor[] out_grad, int[] size, bool is_sparse, bool is_distributed)
  output : Tensor[](out_grad_grad){ids.size()}
  infer_meta :
    func : PushGpupsSparseInferMeta
  kernel :
    func : push_gpups_sparse
  inplace : (out_grad -> out_grad_grad)

The current error is because, after infer_meta was deleted, the code-generation script still requires the infer_meta key; since it is missing, the script fails and the generated code ends up with mismatched braces. After infer_meta is added back, the error comes from the missing {ids.size()}. With the configuration above, for the translation to work correctly you also need to add the following to paddle/fluid/ir_adaptor/translator/op_compat_gen.py:

    op_arg_name_mappings['push_gpups_sparse'].update(
        {"out_grad": "Out@GRAD", "out_grad_grad": "Out@GRAD"}
    )

@@ -161,6 +161,7 @@
'lars_momentum_',
'max_pool2d_v2',
'partial_sum',
'pull_gpups_sparse',
Contributor:

Because the backward has an inplace attribute, the following also need to be added:

    'push_gpups_sparse',
    'push_gpups_sparse_',

@@ -203,6 +203,13 @@ void NllLossRawInferMeta(const MetaTensor& input,
MetaTensor* total_weight,
MetaConfig config = MetaConfig());

void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*> ids,
Contributor:

References need to be added here:

void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*>& ids,
                              const std::vector<const MetaTensor*>& out,

self.op_type = "pull_gpups_sparse"
ids = paddle.ones(shape=(1,), dtype='int64')
out = paddle.ones(shape=(1,), dtype='int64')
attrs = {'size': [], 'is_sparse': False, 'is_distributed': False}
Contributor:

Suggested change:
- attrs = {'size': [], 'is_sparse': False, 'is_distributed': False}
+ attrs = {'size': [1], 'is_sparse': False, 'is_distributed': False}

outputs={"Out": out},
attrs=attrs,
)

Contributor:

The following two lines need to be added in append_op here:

    out.stop_gradient = False
    return out

As you can see in the base class, inserting the backward op depends on out, but out's stop_gradient attribute seems to default to True, so no backward op gets inserted.
Even better, set the stop_gradient attribute to False in the base class's build_model function:

    def build_model(self):
        with paddle.static.scope_guard(self.new_scope):
            with paddle.static.program_guard(self.main_program):
                out = self.append_op()
                out.stop_gradient = False
                append_backward(out)

Contributor Author:

Thanks, very detailed. One question though: why do we need to return out?

Contributor:

Because the base class wants to insert the backward Op, which is done through the append_backward API. That API actually takes a loss, but passing out directly is fine here, since we only want to verify that the op translates correctly.
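
For reference, a minimal self-contained sketch of this pattern (the tensor names and shapes are illustrative, not taken from the PR):

    import paddle
    from paddle.static import append_backward

    paddle.enable_static()
    main = paddle.static.Program()
    with paddle.static.program_guard(main):
        x = paddle.static.data(name='x', shape=[4, 2], dtype='float32')
        x.stop_gradient = False
        out = paddle.mean(x)
        # append_backward is documented to take a loss, but any output with
        # stop_gradient=False is enough when all we need is for the backward
        # ops to be inserted into the program.
        append_backward(out)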

Contributor Author:

2024-04-17 00:03:39 paddle/fluid/pir/dialect/operator/ir/pd_op_bwd.cc: In static member function 'static std::vector<pir::Type> paddle::dialect::PushGpupsSparseOp::InferMeta(const std::vector<pir::Value>&, pir::AttributeMap*)':
2024-04-17 00:03:39 paddle/fluid/pir/dialect/operator/ir/pd_op_bwd.cc:29665:63: error: 'None' was not declared in this scope; did you mean 'none'?
2024-04-17 00:03:39 29665 |    std::vector<paddle::dialect::IrTensor> vec_dense_out_grad((None), paddle::dialect::IrTensor());
2024-04-17 00:03:39       |                                                               ^~~~
2024-04-17 00:03:39       |                                                               none

It's still a meta error here. Did I get the parameters wrong? I removed the w variable earlier, since it doesn't seem to be needed in the backward.

Contributor:

The backward configuration can refer to:

- backward_op : push_gpups_sparse
  forward : pull_gpups_sparse (Tensor w, Tensor[] ids, int[] size={}, bool is_sparse=false, bool is_distributed=false) -> Tensor[](out)
  args : (Tensor[] ids, Tensor[] out_grad, int[] size, bool is_sparse, bool is_distributed)
  output : Tensor[](out_grad_grad){out_grad.size()}
  infer_meta :
    func : PushGpupsSparseInferMeta
  kernel :
    func : push_gpups_sparse
  inplace : (out_grad -> out_grad_grad)

The current error is because {out_grad.size()} is missing, so the code generator filled in None there.

Contributor Author:

2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc: In function 'PyObject* paddle::pybind::push_gpups_sparse(PyObject*, PyObject*, PyObject*)':
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc:3406:12: error: 'eager_api_push_gpups_sparse' was not declared in this scope; did you mean 'static_api_push_gpups_sparse'?
2024-04-17 23:10:15  3406 | return eager_api_push_gpups_sparse(self, args, kwargs);
2024-04-17 23:10:15       |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2024-04-17 23:10:15       |        static_api_push_gpups_sparse
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc: In function 'PyObject* paddle::pybind::push_gpups_sparse_(PyObject*, PyObject*, PyObject*)':
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc:3415:12: error: 'eager_api_push_gpups_sparse_' was not declared in this scope; did you mean 'static_api_push_gpups_sparse_'?
2024-04-17 23:10:15  3415 | return eager_api_push_gpups_sparse_(self, args, kwargs);
2024-04-17 23:10:15       |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
2024-04-17 23:10:15       |        static_api_push_gpups_sparse_

I got this error. Could it be that these apis haven't been added to NO_NEED_GEN_STATIC_ONLY_APIS?

Contributor:

Adding

    'push_gpups_sparse_',
    'push_gpups_sparse',

to NO_NEED_GEN_STATIC_ONLY_APIS should fix it.
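
Roughly, the entries would sit in the list like this (the file path is an assumption based on where the PIR ops-api generator lives; the surrounding entries are elided):

    # Assumed location: paddle/fluid/pir/dialect/op_generator/ops_api_gen.py
    NO_NEED_GEN_STATIC_ONLY_APIS = [
        # ... existing entries ...
        'push_gpups_sparse',
        'push_gpups_sparse_',
        # ... existing entries ...
    ]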

outputs={"Out": out},
attrs=attrs,
)

Contributor:

Suggested change:
+ return out

Just returning out here is enough.

Contributor Author:

Thanks, fixed.


for (size_t i = 0; i < n_ids; ++i) {
out[i]->set_dims(outs_dims[i]);
out[i]->share_lod(*ids[i], i);
Contributor:

Please also set the dtype for out here; the rest should now be fine.

Contributor Author:

Thanks, fixed. Is there any obvious error on CI? I saw that all the unit tests had already passed on CI earlier.

Contributor:

> Thanks, fixed. Is there any obvious error on CI? I saw that all the unit tests had already passed on CI earlier.

CI is fine. The point here is that InferMeta is responsible not only for InferShape but also for the dtype.

@xingmingyyj (Contributor):

You can rerun CI now and move this toward merging.

@xiaoyewww (Contributor, Author):

> You can rerun CI now and move this toward merging.

Do you mean re-pushing the PR? I can't rerun the PR-CI-Coverage job.

@luotao1 (Contributor) commented Apr 24, 2024:

> I can't rerun the PR-CI-Coverage job.

That one failed on coverage; I'll just grant you an exemption.

@luotao1 (Contributor) commented Apr 24, 2024:

[coverage screenshot] @xiaoyewww None of the new code is hit by coverage. Could you check whether a unit test needs to be added?

@kangguangli (Contributor):

> [coverage screenshot] @xiaoyewww None of the new code is hit by coverage. Could you check whether a unit test needs to be added?

A unit test already exists; it just cannot cover this part of the code. That is expected and not a problem: the current unit tests can only exercise the translation between the old and new IR, not the runtime path, because a runtime unit test for a distributed operator would be much more complex to write, and comparing precision would be more complex still. To keep the task manageable, this was never required in the first place. I think we can grant the exemption and skip adding a new unit test.

@luotao1 (Contributor) commented Apr 24, 2024:

Understood; exemption granted.

kangguangli merged commit cf0f65b into PaddlePaddle:develop on Apr 25, 2024
30 checks passed
xiaoyewww deleted the pir/pull_gpups_sparse branch on May 10, 2024