
【PIR Dist Op Reg No.10】 reg pull_gpups_sparse #62935

Merged · 24 commits · Apr 25, 2024

Conversation

@xiaoyewww (Contributor) commented Mar 21, 2024

PR Category

Execute Infrastructure

PR Types

Devs

Description

#60436
Register the pull_gpups_sparse operator.

paddle-bot commented Mar 21, 2024:

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle-bot added the contributor (External developers) label on Mar 21, 2024
@xiaoyewww (Contributor, Author):

@kangguangli A question here: the data_type for this op in ops.yaml is pinned to fp32 in GetExpectedKernelType. Can it be written directly as fp32?

@kangguangli (Contributor):

> @kangguangli A question here: the data_type for this op in ops.yaml is pinned to fp32 in GetExpectedKernelType. Can it be written directly as fp32?

Not at the moment; the code-generation mechanism doesn't support it.

@xiaoyewww (Contributor, Author):

> Not at the moment; the code-generation mechanism doesn't support it.

Then is there a problem with writing it this way: deriving it directly from out, with w as optional?

@kangguangli (Contributor):

> Then is there a problem with writing it this way: deriving it directly from out, with w as optional?

That is not reasonable either. You can refer to how nop is handled in #62384 and register interfaces : paddle::dialect::ParseKernelKeyInterface.

@xiaoyewww (Contributor, Author):

Thanks, fixed. But is there a problem with how my backward op is registered? Do I also need to register push_gpups_sparse with the same approach?

@kangguangli (Contributor) left a comment:

The registration of pull_gpups_sparse itself is fine. The main issue now is that the backward op of pull_gpups_sparse is push_gpups_sparse, not pull_gpups_sparse_grad. You registered pull_gpups_sparse_grad in some places and push_gpups_sparse in others; they need to be unified to push_gpups_sparse.

from paddle.base.layer_helper import LayerHelper


class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpTranslator):
Contributor:

Suggested change:
- class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpTranslator):
+ class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpWithBackwardTranslator):

Operators with a backward op inherit from this class. You also need to override its setUp method to set forward_op_type and backward_op_type; see the base-class implementation for the details.
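
For illustration, a minimal sketch of that override (assuming, per the description above, that the base class reads forward_op_type and backward_op_type attributes set in setUp):

    class TestPullGpupsSparseOpTranslator(
        test_op_translator.TestOpWithBackwardTranslator
    ):
        def setUp(self):
            # Attribute names follow the reviewer's description of the base
            # class; check the base-class implementation for anything else
            # setUp must initialize.
            self.forward_op_type = "pull_gpups_sparse"
            self.backward_op_type = "push_gpups_sparse"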

@@ -1823,6 +1823,16 @@
data_type : x
optional : boxes_num

- backward_op : pull_gpups_sparse_grad
Contributor:

Suggested change:
- - backward_op : pull_gpups_sparse_grad
+ - backward_op : push_gpups_sparse

The definition below needs the corresponding change as well.

Contributor Author:

Thanks, fixed.

luotao1 added the HappyOpenSource (快乐开源活动issue与PR) label on Mar 25, 2024
args : (Tensor w, Tensor ids, Tensor out, int[] size, bool is_sparse, bool is_distributed)
output : Tensor(out_grad)
infer_meta :
func : PushGpupsSparseInferMeta
Contributor:

This InferMeta isn't registered; it needs to be implemented.

Contributor Author:

Thanks. This infershape is an empty function, so I'll remove it.

Contributor:

Even if the infermeta is an empty function, infer_meta still has to be configured in the yaml. The current error is caused by the missing infer_meta.

Contributor Author:

@xingmingyyj Thanks, I'll fix it. Though for the forward op, I remember it could be configured even when there is no infershape implementation?

Contributor:

Let's remove the InferMeta here after all. I suspect the current error is related to the code-generation mechanism; it still needs to be pinpointed, so try deleting it first.
I'll start by deleting the infer_meta in backward.yaml here and see whether the unit test passes.

Comment on lines 206 to 208
void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*> ids,
const std::vector<const MetaTensor*> out,
const std::vector<int>& size,
Contributor:

The infermeta's parameter types don't match the parameters declared in backward.yaml; that is probably what's causing the current error.

@xiaoyewww (Contributor, Author):

@kangguangli Could you take another look? It feels like the backward op's parameters are wrong, but I've been searching for a long time and can't find where the inputs are actually defined.

Comment on lines +1842 to +1843
infer_meta :
func : PushGpupsSparseInferMeta
Contributor:

Suggested change:
- infer_meta :
-   func : PushGpupsSparseInferMeta

Contributor Author:

It looks like the same old problem again.

Contributor:

The earlier problem isn't quite the same as this one. It looks like an issue in the code-generation mechanism; the generated code may contain some scope-related syntax errors. We need to look at what the generated backward_api.cc looks like and do some deeper debugging. @xingmingyyj could you help take a look?

Contributor:

Sure.

@@ -1835,6 +1835,13 @@
data_type : x
optional : boxes_num

- backward_op : push_gpups_sparse
Contributor:

It needs to be registered like this:

- backward_op : push_gpups_sparse
  forward : pull_gpups_sparse (Tensor w, Tensor[] ids, int[] size={}, bool is_sparse=false, bool is_distributed=false) -> Tensor[](out)
  args : (Tensor[] ids, Tensor[] out_grad, int[] size, bool is_sparse, bool is_distributed)
  output : Tensor[](out_grad_grad){ids.size()}
  infer_meta :
    func : PushGpupsSparseInferMeta
  kernel :
    func : push_gpups_sparse
  inplace : (out_grad -> out_grad_grad)

The current error is because, after infer_meta was deleted, the code-generation script still requires the infer_meta key; since it is missing, the script fails and the generated code ends up with mismatched braces. After infer_meta is added back, the error comes from the missing {ids.size()}. With the configuration above, for the translation to work correctly you also need to add the following to paddle/fluid/ir_adaptor/translator/op_compat_gen.py:

    op_arg_name_mappings['push_gpups_sparse'].update(
        {"out_grad": "Out@GRAD", "out_grad_grad": "Out@GRAD"}
    )

@@ -161,6 +161,7 @@
'lars_momentum_',
'max_pool2d_v2',
'partial_sum',
'pull_gpups_sparse',
Contributor:

Because the backward has an inplace attribute, the following also need to be added:

    'push_gpups_sparse',
    'push_gpups_sparse_',

@@ -203,6 +203,13 @@ void NllLossRawInferMeta(const MetaTensor& input,
MetaTensor* total_weight,
MetaConfig config = MetaConfig());

void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*> ids,
Contributor:

References need to be added here:

void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*>& ids,
                              const std::vector<const MetaTensor*>& out,

self.op_type = "pull_gpups_sparse"
ids = paddle.ones(shape=(1,), dtype='int64')
out = paddle.ones(shape=(1,), dtype='int64')
attrs = {'size': [], 'is_sparse': False, 'is_distributed': False}
Contributor:

Suggested change:
- attrs = {'size': [], 'is_sparse': False, 'is_distributed': False}
+ attrs = {'size': [1], 'is_sparse': False, 'is_distributed': False}

outputs={"Out": out},
attrs=attrs,
)

Contributor:

The following two lines need to be added in append_op here:

    out.stop_gradient = False
    return out

As you can see in the base class, inserting the backward op depends on out, but out's stop_gradient attribute seems to default to True, so no backward op gets inserted.
Even better, set the stop_gradient attribute to False in the base class's build_model function:

    def build_model(self):
        with paddle.static.scope_guard(self.new_scope):
            with paddle.static.program_guard(self.main_program):
                out = self.append_op()
                out.stop_gradient = False
                append_backward(out)

Contributor Author:

Thanks, very detailed. One question though: why do we need to return out?

Contributor:

Because the base class wants to insert the backward Op, which is done through the append_backward API. That API actually takes a loss, but passing out directly is fine here, since we only want to verify that the op translates correctly.
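
For reference, a minimal self-contained sketch of this pattern (the tensor names and shapes are illustrative, not taken from the PR):

    import paddle
    from paddle.static import append_backward

    paddle.enable_static()
    main = paddle.static.Program()
    with paddle.static.program_guard(main):
        x = paddle.static.data(name='x', shape=[4, 2], dtype='float32')
        x.stop_gradient = False
        out = paddle.mean(x)
        # append_backward is documented to take a loss, but any output with
        # stop_gradient=False is enough when all we need is for the backward
        # ops to be inserted into the program.
        append_backward(out)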

Contributor Author:

2024-04-17 00:03:39 paddle/fluid/pir/dialect/operator/ir/pd_op_bwd.cc: In static member function 'static std::vector<pir::Type> paddle::dialect::PushGpupsSparseOp::InferMeta(const std::vector<pir::Value>&, pir::AttributeMap*)':
2024-04-17 00:03:39 paddle/fluid/pir/dialect/operator/ir/pd_op_bwd.cc:29665:63: error: 'None' was not declared in this scope; did you mean 'none'?
2024-04-17 00:03:39 29665 |    std::vector<paddle::dialect::IrTensor> vec_dense_out_grad((None), paddle::dialect::IrTensor());
2024-04-17 00:03:39       |                                                               ^~~~
2024-04-17 00:03:39       |                                                               none

It's still a meta error here. Did I get the parameters wrong? I removed the w variable earlier, since it doesn't seem to be needed in the backward.

Contributor:

The backward configuration can refer to:

- backward_op : push_gpups_sparse
  forward : pull_gpups_sparse (Tensor w, Tensor[] ids, int[] size={}, bool is_sparse=false, bool is_distributed=false) -> Tensor[](out)
  args : (Tensor[] ids, Tensor[] out_grad, int[] size, bool is_sparse, bool is_distributed)
  output : Tensor[](out_grad_grad){out_grad.size()}
  infer_meta :
    func : PushGpupsSparseInferMeta
  kernel :
    func : push_gpups_sparse
  inplace : (out_grad -> out_grad_grad)

The current error is because {out_grad.size()} is missing, so the code generator filled in None there.

Contributor Author:

2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc: In function 'PyObject* paddle::pybind::push_gpups_sparse(PyObject*, PyObject*, PyObject*)':
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc:3406:12: error: 'eager_api_push_gpups_sparse' was not declared in this scope; did you mean 'static_api_push_gpups_sparse'?
2024-04-17 23:10:15  3406 | return eager_api_push_gpups_sparse(self, args, kwargs);
2024-04-17 23:10:15       |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2024-04-17 23:10:15       |        static_api_push_gpups_sparse
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc: In function 'PyObject* paddle::pybind::push_gpups_sparse_(PyObject*, PyObject*, PyObject*)':
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc:3415:12: error: 'eager_api_push_gpups_sparse_' was not declared in this scope; did you mean 'static_api_push_gpups_sparse_'?
2024-04-17 23:10:15  3415 | return eager_api_push_gpups_sparse_(self, args, kwargs);
2024-04-17 23:10:15       |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
2024-04-17 23:10:15       |        static_api_push_gpups_sparse_

I got this error. Could it be that these apis haven't been added to NO_NEED_GEN_STATIC_ONLY_APIS?

Contributor:

Adding

    'push_gpups_sparse_',
    'push_gpups_sparse',

to NO_NEED_GEN_STATIC_ONLY_APIS should fix it.
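
Roughly, the entries would sit in the list like this (the file path is an assumption based on where the PIR ops-api generator lives; the surrounding entries are elided):

    # Assumed location: paddle/fluid/pir/dialect/op_generator/ops_api_gen.py
    NO_NEED_GEN_STATIC_ONLY_APIS = [
        # ... existing entries ...
        'push_gpups_sparse',
        'push_gpups_sparse_',
        # ... existing entries ...
    ]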

outputs={"Out": out},
attrs=attrs,
)

Contributor:

Suggested change:
+ return out

Just returning out here is enough.

Contributor Author:

Thanks, fixed.


for (size_t i = 0; i < n_ids; ++i) {
out[i]->set_dims(outs_dims[i]);
out[i]->share_lod(*ids[i], i);
Contributor:

Please also set the dtype for out here; the rest should now be fine.

Contributor Author:

Thanks, fixed. Is there any obvious error on CI? I saw that all the unit tests had already passed on CI earlier.

Contributor:

> Thanks, fixed. Is there any obvious error on CI? I saw that all the unit tests had already passed on CI earlier.

CI is fine. The point here is that InferMeta is responsible not only for InferShape but also for the dtype.

@xingmingyyj (Contributor):

You can rerun CI now and move this toward merging.

@xiaoyewww (Contributor, Author):

> You can rerun CI now and move this toward merging.

Do you mean re-pushing the PR? I can't rerun the PR-CI-Coverage job.

@luotao1 (Contributor) commented Apr 24, 2024:

> I can't rerun the PR-CI-Coverage job.

That one failed on coverage; I'll just grant you an exemption.

@luotao1 (Contributor) commented Apr 24, 2024:

[coverage screenshot] @xiaoyewww None of the new code is hit by coverage. Could you check whether a unit test needs to be added?

@kangguangli (Contributor):

> [coverage screenshot] @xiaoyewww None of the new code is hit by coverage. Could you check whether a unit test needs to be added?

A unit test already exists; it just cannot cover this part of the code. That is expected and not a problem: the current unit tests can only exercise the translation between the old and new IR, not the runtime path, because a runtime unit test for a distributed operator would be much more complex to write, and comparing precision would be more complex still. To keep the task manageable, this was never required in the first place. I think we can grant the exemption and skip adding a new unit test.

@luotao1 (Contributor) commented Apr 24, 2024:

Understood; exemption granted.

kangguangli merged commit cf0f65b into PaddlePaddle:develop on Apr 25, 2024
30 checks passed
xiaoyewww deleted the pir/pull_gpups_sparse branch on May 10, 2024