【PIR Dist Op Reg No.10】 reg pull_gpups_sparse #62935
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
@kangguangli A question here: the data_type for this op in ops.yaml is specified as fp32 in GetExpectedKernelType. Can we just write fp32 directly?
Not at the moment; the code generation mechanism does not support that.
Then is it valid to write it this way, deriving the type directly from out? w is an optional attribute.
That is not reasonable either. You can refer to the handling of nop in #62384 and register interfaces : paddle::dialect::ParseKernelKeyInterface.
Thanks, fixed. But is my backward op registered incorrectly? Do I need to register push_gpups_sparse with the same approach as well?
The registration of pull_gpups_sparse itself is fine. The main issue now is that the backward of pull_gpups_sparse is push_gpups_sparse, not pull_gpups_sparse_grad. Some places currently register pull_gpups_sparse_grad and others push_gpups_sparse; please unify them all to push_gpups_sparse.
from paddle.base.layer_helper import LayerHelper

class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpTranslator):
Suggested change:
- class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpTranslator):
+ class TestPullGpupsSparseOpTranslator(test_op_translator.TestOpWithBackwardTranslator):
Ops with a backward pass should inherit from this class. You also need to override its setUp method to set forward_op_type and backward_op_type; see the base class implementation for details.
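As a rough illustration of that suggestion (the stub base class below only stands in for test_op_translator.TestOpWithBackwardTranslator, whose real implementation builds the program and appends the backward op; only the attribute names come from the comment above):

```python
# Stub standing in for test_op_translator.TestOpWithBackwardTranslator,
# used here only to illustrate the suggested setUp override.
class TestOpWithBackwardTranslator:
    def setUp(self):
        self.forward_op_type = None
        self.backward_op_type = None


class TestPullGpupsSparseOpTranslator(TestOpWithBackwardTranslator):
    def setUp(self):
        super().setUp()
        # The forward op and its paired backward op: note the backward
        # is push_gpups_sparse, not pull_gpups_sparse_grad.
        self.forward_op_type = "pull_gpups_sparse"
        self.backward_op_type = "push_gpups_sparse"


t = TestPullGpupsSparseOpTranslator()
t.setUp()
print(t.forward_op_type, t.backward_op_type)
```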
paddle/phi/api/yaml/backward.yaml
Outdated
@@ -1823,6 +1823,16 @@
    data_type : x
    optional : boxes_num

- backward_op : pull_gpups_sparse_grad
Suggested change:
- - backward_op : pull_gpups_sparse_grad
+ - backward_op : push_gpups_sparse
The definition below also needs the corresponding change.
Thanks, fixed.
args : (Tensor w, Tensor ids, Tensor out, int[] size, bool is_sparse, bool is_distributed)
output : Tensor(out_grad)
infer_meta :
  func : PushGpupsSparseInferMeta
This InferMeta is not registered; it needs to be implemented.
Thanks. This infershape is an empty function, so I'll just delete it.
Even when the infermeta is an empty function, infer_meta still has to be configured in the yaml; the current error is caused by the missing infer_meta.
@xingmingyyj Thanks, I'll fix it. But I recall that the forward op can be configured this way when infershape has no implementation?
Let's delete the InferMeta here for now. I suspect the current error is related to the code generation mechanism; pinning it down needs more work, so try deleting it first.
I'll delete the infer_meta in backward.yaml first and see whether the unit test passes.
Force-pushed from fe58560 to c83300f
paddle/phi/infermeta/ternary.h
Outdated
void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*> ids,
                              const std::vector<const MetaTensor*> out,
                              const std::vector<int>& size,
The infermeta's parameter types do not match the parameters declared in backward.yaml; that should be the cause of the current error.
@kangguangli Could you take another look? The backward op's parameters seem wrong, but I have searched for a long time and cannot find where the inputs are actually defined.
infer_meta :
  func : PushGpupsSparseInferMeta
Suggested change (delete these lines):
- infer_meta :
-   func : PushGpupsSparseInferMeta
It looks like the same old problem.
The earlier problem was somewhat different from this one. This looks like an issue in the code generation mechanism; the generated code may have some scope-related syntax errors. We need to inspect the generated backward_api.cc and do some deeper debugging. @xingmingyyj please help take a look.
OK.
@@ -1835,6 +1835,13 @@
    data_type : x
    optional : boxes_num

- backward_op : push_gpups_sparse
It needs to be registered like this:
- backward_op : push_gpups_sparse
  forward : pull_gpups_sparse (Tensor w, Tensor[] ids, int[] size={}, bool is_sparse=false, bool is_distributed=false) -> Tensor[](out)
  args : (Tensor[] ids, Tensor[] out_grad, int[] size, bool is_sparse, bool is_distributed)
  output : Tensor[](out_grad_grad){ids.size()}
  infer_meta :
    func : PushGpupsSparseInferMeta
  kernel :
    func : push_gpups_sparse
  inplace : (out_grad -> out_grad_grad)
The current error occurs because after infer_meta was deleted, the code generation script still requires the infer_meta key; since it is missing, the script fails and the generated code ends up with mismatched {}. The error after adding infer_meta back is caused by not specifying {ids.size()}.
With the configuration above, for the translation to work correctly you also need to add the following in paddle/fluid/ir_adaptor/translator/op_compat_gen.py:
op_arg_name_mappings['push_gpups_sparse'].update(
    {"out_grad": "Out@GRAD", "out_grad_grad": "Out@GRAD"}
)
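As a rough sketch of the mapping pattern above (this is not the actual op_compat_gen.py code; it only illustrates how a defaultdict-based name-mapping table accumulates entries):

```python
from collections import defaultdict

# Illustrative stand-in for the op_arg_name_mappings table built by
# op_compat_gen.py: each op maps new-IR argument names to legacy names.
op_arg_name_mappings = defaultdict(dict)

# The entry suggested in the review comment: both out_grad and
# out_grad_grad translate back to the legacy name Out@GRAD.
op_arg_name_mappings['push_gpups_sparse'].update(
    {"out_grad": "Out@GRAD", "out_grad_grad": "Out@GRAD"}
)

print(op_arg_name_mappings['push_gpups_sparse'])
```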
@@ -161,6 +161,7 @@
    'lars_momentum_',
    'max_pool2d_v2',
    'partial_sum',
    'pull_gpups_sparse',
Because the backward has an inplace attribute, you also need to add:
'push_gpups_sparse',
'push_gpups_sparse_',
paddle/phi/infermeta/ternary.h
Outdated
@@ -203,6 +203,13 @@ void NllLossRawInferMeta(const MetaTensor& input,
                         MetaTensor* total_weight,
                         MetaConfig config = MetaConfig());

void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*> ids,
The parameters here need to be taken by reference:
void PushGpupsSparseInferMeta(const std::vector<const MetaTensor*>& ids,
const std::vector<const MetaTensor*>& out,
self.op_type = "pull_gpups_sparse"
ids = paddle.ones(shape=(1,), dtype='int64')
out = paddle.ones(shape=(1,), dtype='int64')
attrs = {'size': [], 'is_sparse': False, 'is_distributed': False}
Suggested change:
- attrs = {'size': [], 'is_sparse': False, 'is_distributed': False}
+ attrs = {'size': [1], 'is_sparse': False, 'is_distributed': False}
outputs={"Out": out},
attrs=attrs,
)
Here in append_op you need to add the following two lines:
out.stop_gradient = False
return out
As you can see in the base class, the backward op is inserted based on out, but out's stop_gradient attribute seems to default to True, so no backward op gets inserted. Even better, set stop_gradient to False in the base class's build_model function:
def build_model(self):
    with paddle.static.scope_guard(self.new_scope):
        with paddle.static.program_guard(self.main_program):
            out = self.append_op()
            out.stop_gradient = False
            append_backward(out)
Thanks, very detailed. One question though: why do we need to return out?
Because the base class wants to insert the backward op, which is done through the append_backward API. It actually takes a loss, but passing out directly is fine here, since we only want to verify the translation works.
2024-04-17 00:03:39 paddle/fluid/pir/dialect/operator/ir/pd_op_bwd.cc: In static member function 'static std::vector<pir::Type> paddle::dialect::PushGpupsSparseOp::InferMeta(const std::vector<pir::Value>&, pir::AttributeMap*)':
2024-04-17 00:03:39 paddle/fluid/pir/dialect/operator/ir/pd_op_bwd.cc:29665:63: error: 'None' was not declared in this scope; did you mean 'none'?
2024-04-17 00:03:39 29665 | std::vector<paddle::dialect::IrTensor> vec_dense_out_grad((None), paddle::dialect::IrTensor());
2024-04-17 00:03:39       |                                                               ^~~~
2024-04-17 00:03:39       |                                                               none
Still a meta error here. Did I fill in the parameters wrong? I deleted the w variable earlier, since it does not appear to be needed in the backward.
The backward configuration can refer to:
- backward_op : push_gpups_sparse
  forward : pull_gpups_sparse (Tensor w, Tensor[] ids, int[] size={}, bool is_sparse=false, bool is_distributed=false) -> Tensor[](out)
  args : (Tensor[] ids, Tensor[] out_grad, int[] size, bool is_sparse, bool is_distributed)
  output : Tensor[](out_grad_grad){out_grad.size()}
  infer_meta :
    func : PushGpupsSparseInferMeta
  kernel :
    func : push_gpups_sparse
  inplace : (out_grad -> out_grad_grad)
The current error is because {out_grad.size()} is missing, so the code generator filled in None.
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc: In function 'PyObject* paddle::pybind::push_gpups_sparse(PyObject*, PyObject*, PyObject*)':
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc:3406:12: error: 'eager_api_push_gpups_sparse' was not declared in this scope; did you mean 'static_api_push_gpups_sparse'?
2024-04-17 23:10:15  3406 |   return eager_api_push_gpups_sparse(self, args, kwargs);
2024-04-17 23:10:15       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2024-04-17 23:10:15       |          static_api_push_gpups_sparse
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc: In function 'PyObject* paddle::pybind::push_gpups_sparse_(PyObject*, PyObject*, PyObject*)':
2024-04-17 23:10:15 ../paddle/fluid/pybind/ops_api.cc:3415:12: error: 'eager_api_push_gpups_sparse_' was not declared in this scope; did you mean 'static_api_push_gpups_sparse_'?
2024-04-17 23:10:15  3415 |   return eager_api_push_gpups_sparse_(self, args, kwargs);
2024-04-17 23:10:15       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
2024-04-17 23:10:15       |          static_api_push_gpups_sparse_
I got this error. Is it because the api is not placed in NO_NEED_GEN_STATIC_ONLY_APIS?
Adding
push_gpups_sparse_,
push_gpups_sparse
to NO_NEED_GEN_STATIC_ONLY_APIS should fix it.
Force-pushed from a347954 to 76f6d9d
outputs={"Out": out},
attrs=attrs,
)
Suggested change: add
return out
Just returning out here is enough.
Thanks, fixed.
for (size_t i = 0; i < n_ids; ++i) {
  out[i]->set_dims(outs_dims[i]);
  out[i]->share_lod(*ids[i], i);
Please also set the dtype for out here; everything else looks fine.
Thanks, fixed. Is there any obvious error on CI? I saw that all UTs had already passed earlier.
> Thanks, fixed. Is there any obvious error on CI? I saw that all UTs had already passed earlier.

There is no problem on CI. Note that InferMeta's job is not only InferShape but also dtype.
You can rerun the CI and we can push this toward merging.
Do I need to re-push the PR? I cannot rerun the PR-CI-Coverage CI.
That one is a coverage failure; I'll just grant you an exemption.
@xiaoyewww The coverage is not hit at all; could you check whether a unit test should be added?
There is already a unit test; it just cannot cover this part of the code, which is expected and fine. The reason is that the current unit test can only exercise the translation between the old and new IR, not the runtime part: a runtime unit test for a distributed op would be much more complex, and comparing precision even more so. To keep the task simple, that was never required. I think we can grant the exemption; no new unit test is needed.
Understood; exemption granted.
PR Category
Execute Infrastructure
PR Types
Devs
Description
#60436
Register the pull_gpups_sparse operator.