Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

【PIR Dist Op Reg No.28】 reg dgc #62781

Merged
merged 9 commits into from
Apr 9, 2024

Conversation

xiaoyewww
Copy link
Contributor

@xiaoyewww xiaoyewww commented Mar 16, 2024

PR Category

Execute Infrastructure

PR Types

Devs

Description

#60436
注册算子dgc

Copy link

paddle-bot bot commented Mar 16, 2024

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Mar 16, 2024
@xiaoyewww xiaoyewww force-pushed the pir/dgc branch 2 times, most recently from 5b3b1bd to 2dc683d Compare March 17, 2024 04:36
@luotao1 luotao1 added the HappyOpenSource 快乐开源活动issue与PR label Mar 18, 2024
@@ -454,6 +454,13 @@
optional : in_accum, in_state, out_scale, out_accum, out_state
inplace : (scale -> out_scale, in_accum -> out_accum, in_state -> out_state)

- op : dgc
args : (Tensor u, Tensor v, Tensor grad, Tensor param, Tensor current_step, Tensor nranks, float[] sparsity, , float m=0.9, bool use_nesterov=true, float rampup_begin_step=0.0, float rampup_step=0.0, float regular_coeff=0.0, int regular_type=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

多打了逗号

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感谢已修改,请问一下新的报错是注册算子的时候还有地方写错了吗

Copy link
Contributor

@kangguangli kangguangli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

从代码来看,

if (WITH_DGC)
op_library(dgc_op DEPS dgc)

这里应该是对应流水线没有开启WITH_DGC选项,需要修改CMakeLists,
在第30行前加入这样的判断:

if(NOT WITH_DGC)
  list(REMOVE_ITEM TEST_INTERP_CASES test_dgc_translate)
endif()

@xiaoyewww
Copy link
Contributor Author

从代码来看,

if (WITH_DGC)
op_library(dgc_op DEPS dgc)

这里应该是对应流水线没有开启WITH_DGC选项,需要修改CMakeLists, 在第30行前加入这样的判断:

if(NOT WITH_DGC)
  list(REMOVE_ITEM TEST_INTERP_CASES test_dgc_translate)
endif()

多谢,已修改

@xiaoyewww
Copy link
Contributor Author

从代码来看,

if (WITH_DGC)
op_library(dgc_op DEPS dgc)

这里应该是对应流水线没有开启WITH_DGC选项,需要修改CMakeLists, 在第30行前加入这样的判断:

if(NOT WITH_DGC)
  list(REMOVE_ITEM TEST_INTERP_CASES test_dgc_translate)
endif()

这里这样修改后似乎还是遇到老问题

Comment on lines 30 to 33
if(NOT WITH_DGC)
list(REMOVE_ITEM TEST_INTERP_CASES test_dgc_translator)
endif()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

加错地方了,加在test/ir/pir/translator/CMakeLists.txt里

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

抱歉,不知道怎么加在那边了

@xiaoyewww
Copy link
Contributor Author

@kangguangli 这个麻烦再看一下,还是原来的问题

@kangguangli
Copy link
Contributor

现在的报错是在翻译过程中报错:

2024-03-26 19:29:43 ======================================================================
2024-03-26 19:29:43 ERROR: test_translator (test_dgc_translator.TestDgcOpTranslator)
2024-03-26 19:29:43 ----------------------------------------------------------------------
2024-03-26 19:29:43 Traceback (most recent call last):
2024-03-26 19:29:43 File "/mnt/paddle/build/test/ir/pir/translator/test_dgc_translator.py", line 73, in test_translator
2024-03-26 19:29:43 self.check()
2024-03-26 19:29:43 File "/mnt/paddle/build/test/ir/pir/translator/test_op_translator.py", line 42, in check
2024-03-26 19:29:43 pir_program = pir.translate_to_pir(self.main_program.desc)
2024-03-26 19:29:43 ValueError:
2024-03-26 19:29:43
2024-03-26 19:29:43 --------------------------------------
2024-03-26 19:29:43 C++ Traceback (most recent call last):
2024-03-26 19:29:43 --------------------------------------
2024-03-26 19:29:43 0 paddle::TranslateLegacyProgramToProgram(paddle::framework::ProgramDesc const&)
2024-03-26 19:29:43 1 paddle::translator::ProgramTranslator::Translate()
2024-03-26 19:29:43 2 paddle::translator::ProgramTranslator::TranslateBlock(paddle::framework::BlockDesc const&, unsigned long, unsigned long, paddle::translator::TranslationContext*, pir::Block*)
2024-03-26 19:29:43 3 paddle::translator::ProgramTranslator::TranslateGeneralOperation(paddle::framework::OpDesc const*, paddle::translator::TranslationContext*, pir::Block*)
2024-03-26 19:29:43 4 std::_Function_handler<pir::Operation* (pir::IrContext*, paddle::translator::TranslationContext*, paddle::framework::OpDesc const&, pir::Block*), paddle::translator::OpTranscriber>::_M_invoke(std::_Any_data const&, pir::IrContext*&&, paddle::translator::TranslationContext*&&, paddle::framework::OpDesc const&, pir::Block*&&)
2024-03-26 19:29:43 5 paddle::translator::OpTranscriber::operator()(pir::IrContext*, paddle::translator::TranslationContext*, paddle::framework::OpDesc const&, pir::Block*)
2024-03-26 19:29:43 6 common::enforce::EnforceNotMet::EnforceNotMet(common::ErrorSummary const&, char const*, int)
2024-03-26 19:29:43 7 common::enforce::GetCurrentTraceBackStringabi:cxx11
2024-03-26 19:29:43
2024-03-26 19:29:43 ----------------------
2024-03-26 19:29:43 Error Message Summary:
2024-03-26 19:29:43 ----------------------
2024-03-26 19:29:43 InvalidArgumentError: Input param not found when parsing op dgc
2024-03-26 19:29:43 [Hint: Expected legacy_input_vars.size() == 1UL, but received legacy_input_vars.size():0 != 1UL:1.] (at /paddle/paddle/fluid/ir_adaptor/translator/op_translator.cc:560)

这个错误是说翻译的时候在旧IR算子中没有发现 param 这个输入,一般是由于param为空导致的,原因是 旧IR下有些输入没有正确的声明optional,后续遇到此类问题,可以在算子的yaml里添加下面的声明试试。

   optional: param

@xiaoyewww
Copy link
Contributor Author

现在的报错是在翻译过程中报错:

2024-03-26 19:29:43 ======================================================================
2024-03-26 19:29:43 ERROR: test_translator (test_dgc_translator.TestDgcOpTranslator)
2024-03-26 19:29:43 ----------------------------------------------------------------------
2024-03-26 19:29:43 Traceback (most recent call last):
2024-03-26 19:29:43 File "/mnt/paddle/build/test/ir/pir/translator/test_dgc_translator.py", line 73, in test_translator
2024-03-26 19:29:43 self.check()
2024-03-26 19:29:43 File "/mnt/paddle/build/test/ir/pir/translator/test_op_translator.py", line 42, in check
2024-03-26 19:29:43 pir_program = pir.translate_to_pir(self.main_program.desc)
2024-03-26 19:29:43 ValueError:
2024-03-26 19:29:43
2024-03-26 19:29:43 --------------------------------------
2024-03-26 19:29:43 C++ Traceback (most recent call last):
2024-03-26 19:29:43 --------------------------------------
2024-03-26 19:29:43 0 paddle::TranslateLegacyProgramToProgram(paddle::framework::ProgramDesc const&)
2024-03-26 19:29:43 1 paddle::translator::ProgramTranslator::Translate()
2024-03-26 19:29:43 2 paddle::translator::ProgramTranslator::TranslateBlock(paddle::framework::BlockDesc const&, unsigned long, unsigned long, paddle::translator::TranslationContext*, pir::Block*)
2024-03-26 19:29:43 3 paddle::translator::ProgramTranslator::TranslateGeneralOperation(paddle::framework::OpDesc const*, paddle::translator::TranslationContext*, pir::Block*)
2024-03-26 19:29:43 4 std::_Function_handler<pir::Operation* (pir::IrContext*, paddle::translator::TranslationContext*, paddle::framework::OpDesc const&, pir::Block*), paddle::translator::OpTranscriber>::_M_invoke(std::_Any_data const&, pir::IrContext*&&, paddle::translator::TranslationContext*&&, paddle::framework::OpDesc const&, pir::Block*&&)
2024-03-26 19:29:43 5 paddle::translator::OpTranscriber::operator()(pir::IrContext*, paddle::translator::TranslationContext*, paddle::framework::OpDesc const&, pir::Block*)
2024-03-26 19:29:43 6 common::enforce::EnforceNotMet::EnforceNotMet(common::ErrorSummary const&, char const*, int)
2024-03-26 19:29:43 7 common::enforce::GetCurrentTraceBackStringabi:cxx11
2024-03-26 19:29:43
2024-03-26 19:29:43 ----------------------
2024-03-26 19:29:43 Error Message Summary:
2024-03-26 19:29:43 ----------------------
2024-03-26 19:29:43 InvalidArgumentError: Input param not found when parsing op dgc
2024-03-26 19:29:43 [Hint: Expected legacy_input_vars.size() == 1UL, but received legacy_input_vars.size():0 != 1UL:1.] (at /paddle/paddle/fluid/ir_adaptor/translator/op_translator.cc:560)

这个错误是说翻译的时候在旧IR算子中没有发现 param 这个输入,一般是由于param为空导致的,原因是 旧IR下有些输入没有正确的声明optional,后续遇到此类问题,可以在算子的yaml里添加下面的声明试试。

   optional: param

谢谢,已这样修改,但有点困惑,哪里可以看到旧IR算子中缺少这个输入呢

@kangguangli
Copy link
Contributor

从报错提示中可以看出旧IR中缺少这个参数,如果对相关代码感兴趣,可以看看这个目录下的代码paddle/fluid/ir_adaptor/translator,PFCC仓库应该也有一份文档。

另外,PR模版现在有变更,麻烦按照 https://github.com/PaddlePaddle/Paddle/wiki/PULL-REQUEST-TEMPLATE--REFERENCE 修改一下,我们的任务应该属于 执行架构(Execute Infrastructure)和 框架本身开发的改动。

@xiaoyewww
Copy link
Contributor Author

Details

多谢已修改

args : (Tensor u, Tensor v, Tensor grad, Tensor param, Tensor current_step, Tensor nranks, float[] sparsity, float m=0.9, bool use_nesterov=true, float rampup_begin_step=0.0, float rampup_step=0.0, float regular_coeff=0.0, int regular_type=0)
output : Tensor(u_out), Tensor(v_out), Tensor(encode_grad), Tensor(grad_out), Tensor(k), Tensor(gather_buff)
kernel :
func : dgc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func : dgc
func : dgc
param: [u, v, ...]

在kernel里面需要指定传给dgc_kernel参数,顺序和kernel参数一致,其他地方不用修改

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

多谢,已修改,请问一下这里为什么要添加呢

@xiaoyewww
Copy link
Contributor Author

@kangguangli 请问这个pr模版有什么问题吗,按照格式修改过了?

Copy link
Contributor

@kangguangli kangguangli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR模版需要把原来的注释部分删除

output : Tensor(u_out), Tensor(v_out), Tensor(encode_grad), Tensor(grad_out), Tensor(k), Tensor(gather_buff)
kernel :
func : dgc
param : [u, v, grad, param, current_step, nranks, sparsity, m, use_nesterov, rampup_begin_step, rampup_step, regular_coeff, regular_type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个顺序和kernel顺序不完全一致,再和paddle/phi/kernels/gpu/dgc_kernel.cu里的DGCKernel对照下

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改,因为这里我看sparsity没有提到默认值所以提前了一下

@xiaoyewww xiaoyewww force-pushed the pir/dgc branch 3 times, most recently from f5be4ad to 68bf75c Compare April 4, 2024 02:59
@xiaoyewww
Copy link
Contributor Author

@kangguangli 这个麻烦review一下,windows ci挂了

@@ -0,0 +1,77 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023 -> 2024

@kangguangli kangguangli merged commit 6adee41 into PaddlePaddle:develop Apr 9, 2024
30 checks passed
co63oc pushed a commit to co63oc/Paddle that referenced this pull request Apr 9, 2024
* feat(pir): reg dgc

* feat(pir): dgc

* feat(pir): dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc
co63oc pushed a commit to co63oc/Paddle that referenced this pull request Apr 10, 2024
* feat(pir): reg dgc

* feat(pir): dgc

* feat(pir): dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc

* feat(pir): reg dgc
@xiaoyewww xiaoyewww deleted the pir/dgc branch May 10, 2024 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor External developers HappyOpenSource 快乐开源活动issue与PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants