Calling OpRegistry::CreateOp with a misspelled input/output name produces an error message that is hard to understand #3899

Closed
Xreki opened this issue Sep 6, 2017 · 10 comments

@Xreki
Contributor

Xreki commented Sep 6, 2017

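FCOp is implemented by composing MulOp, RowwiseAddOp, SoftmaxOp, and other ops. The C++ implementation creates each of them by calling OpRegistry::CreateOp.

FCOp takes inputs X and W, while MulOp takes inputs X and Y. When creating the MulOp, I mistakenly wrote the inputs as: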

    AppendOp(framework::OpRegistry::CreateOp(
             "mul", {{"X", {Input("X")}}, {"W", {Input("W")}}},
             {{"Out", {Output("mul_out")}}}, {}));

The unit test then failed with the following error:

133: ======================================================================
133: ERROR: test_all (__main__.TestFCOp)
133: ----------------------------------------------------------------------
133: Traceback (most recent call last):
133:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/op_test_util.py", line 55, in test_all
133:     op = Operator(self.type, **kwargs)
133:   File "/home/liuyiqun01/github/Paddle/build_paddle/build/python/build/lib-python/paddle/v2/framework/op.py", line 161, in __call__
133:     return self.get_op_info(t).method(**kwargs)
133:   File "/home/liuyiqun01/github/Paddle/build_paddle/build/python/build/lib-python/paddle/v2/framework/op.py", line 132, in __impl__
133:     return core.Operator.create(opdesc.SerializeToString())
133: RuntimeError: basic_string::_S_construct null not valid
133: 
133: ----------------------------------------------------------------------

This error message is very hard to understand. It took me a long time to discover that the MulOp input name was wrong. The correct code is:

    AppendOp(framework::OpRegistry::CreateOp(
             "mul", {{"X", {Input("X")}}, {"Y", {Input("W")}}},
             {{"Out", {Output("mul_out")}}}, {}));

The refactored Paddle needs friendlier, more intuitive error messages.
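
As a rough illustration of what a friendlier check could look like, the sketch below compares the slot keys an op declares with the keys the caller passed, and reports both the missing key and the unexpected one. It is not Paddle's implementation: `DescribeSlotErrors`, `CheckSlots`, and the `VariableNameMap` alias defined here are hypothetical stand-ins, and the list of declared keys is passed in by hand rather than read from the op registry.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the framework's VariableNameMap:
// slot key -> names of the variables bound to that slot.
using VariableNameMap = std::map<std::string, std::vector<std::string>>;

// Compares the slot keys an op declares with the keys the caller passed.
// `kind` is "input" or "output". Returns one message per problem found.
std::vector<std::string> DescribeSlotErrors(
    const std::string& op_type, const std::string& kind,
    const std::vector<std::string>& declared, const VariableNameMap& given) {
  assert(kind == "input" || kind == "output");
  std::vector<std::string> problems;
  // Declared keys that the caller never set.
  for (const auto& key : declared) {
    if (given.find(key) == given.end()) {
      problems.push_back("Type " + op_type + "'s " + kind + " " + key +
                         " is not set");
    }
  }
  // Keys the caller passed that the op does not declare (likely typos).
  for (const auto& entry : given) {
    if (std::find(declared.begin(), declared.end(), entry.first) ==
        declared.end()) {
      problems.push_back("Type " + op_type + " has no " + kind + " named " +
                         entry.first);
    }
  }
  return problems;
}

// Throws a single exception that lists every problem at once.
void CheckSlots(const std::string& op_type, const std::string& kind,
                const std::vector<std::string>& declared,
                const VariableNameMap& given) {
  const auto problems = DescribeSlotErrors(op_type, kind, declared, given);
  if (problems.empty()) return;
  std::string message;
  for (const auto& problem : problems) {
    if (!message.empty()) message += "; ";
    message += problem;
  }
  throw std::invalid_argument(message);
}
```

For the mistake described above (declared inputs X and Y, supplied keys X and W), this sketch would report that input Y is not set and that mul has no input named W, which points directly at the typo.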

@QiJune
Member

QiJune commented Sep 6, 2017

I ran into the same problem and had to add a lot of logging to locate where it failed.

@reyoung
Collaborator

reyoung commented Sep 6, 2017

Yes, I ran into a similar problem this weekend.

This should have been fixed in PR #3831. @QiJune @Xreki It would be very kind if you could check whether the latest develop branch still has the same problem.

@Xreki
Contributor Author

Xreki commented Sep 6, 2017

@reyoung I am on the latest develop branch and still get the same error. It fails inside the CheckAllInputOutputSet function.

@lcy-seso
Contributor

lcy-seso commented Sep 6, 2017

I hit this problem too, also on the latest branch. My situation is the same as @Xreki's.

@Xreki
Contributor Author

Xreki commented Sep 8, 2017

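I tripped over an output name once again. On reflection, this was also caused by my own carelessness. Here is what happened:

While writing FCOp, I wanted to change the output key from Out back to Y, so I updated the C++ code:

AddOutput("Y", "The activated output matrix of FC operator");

I also changed every reference to Output("Y"). The C++ code looked fine, but running the unit test produced the following error: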

136: Test command: /bin/env "PYTHONPATH=/home/liuyiqun01/github/Paddle/build_paddle/build/python/build/lib-python" "python2" "test_fc_op.py"
136: Test timeout computed to be: 9.99988e+06
136: Run
1/1 Test #136: test_fc_op .......................***Exception: SegFault 11.22 sec

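After several rounds of debugging, I found that the output name in the Python test program had not been updated. The op computes Y = X1 * W1 + X2 * W2 + b. I printed the DebugString of the op: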

136: Op(fc), inputs:{W[W2, W1], X[X2, X1], b[b]}, outputs:{Y[], add_out[add_out], mul_out[mul_out]}.
136: Op(mul), inputs:{X[X2], Y[W2]}, outputs:{Out[mul_out]}.
136: Op(mul), inputs:{X[X1], Y[W1]}, outputs:{Out[add_out]}.
136: Op(add), inputs:{X[mul_out], Y[add_out]}, outputs:{Out[mul_out]}.
136: Op(rowwise_add), inputs:{X[mul_out], b[b]}, outputs:{Out[add_out]}.
136: Op(identity), inputs:{X[add_out]}, outputs:{Y[@EMPTY@]}.
136: Op(scale), inputs:{X[add_out]}, outputs:{Out[@EMPTY@]}.

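There are two @EMPTY@ variables. This error was my own oversight, but I want to raise two points:
1. Should Paddle check internally whether an op's input/output variable names are EMPTY?
2. Paddle should print error messages as accurately as possible, instead of letting the system throw an Exception: SegFault like this, which is really hard to track down.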
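
The first point could be approached with a check along the following lines. This is only a sketch under assumptions: the `VariableNameMap` alias and `FindUnboundSlots` are hypothetical names, and the `@EMPTY@` marker is taken from the DebugString output above rather than from the framework's source.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the framework's VariableNameMap:
// slot key -> names of the variables bound to that slot.
using VariableNameMap = std::map<std::string, std::vector<std::string>>;

// Placeholder name as it appears in the DebugString output (assumption).
const char kEmptyMarker[] = "@EMPTY@";

// Returns the keys of all slots that have no usable variable bound:
// either the variable list is empty, or it contains an empty name or the
// placeholder marker.
std::vector<std::string> FindUnboundSlots(const VariableNameMap& slots) {
  std::vector<std::string> unbound;
  for (const auto& entry : slots) {
    assert(!entry.first.empty());  // a slot key itself must have a name
    bool bad = entry.second.empty();
    for (const auto& var : entry.second) {
      if (var.empty() || var == kEmptyMarker) bad = true;
    }
    if (bad) unbound.push_back(entry.first);
  }
  return unbound;
}
```

Applied to the outputs printed above, such a check would flag slot Y of the fc op (an empty list) and slot Y of the identity op (bound to @EMPTY@), so the problem could be reported by name before anything runs.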

@lcy-seso
Contributor

lcy-seso commented Sep 8, 2017

Can we work together to get this PR merged as soon as possible: #3452

@lcy-seso
Contributor

lcy-seso commented Sep 8, 2017

Another small naming-related issue: #3976

@Xreki
Contributor Author

Xreki commented Sep 8, 2017

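@reyoung I found the root cause of the log above: it comes from the PADDLE_ENFORCE statement.
To verify PADDLE_ENFORCE, I ran the following experiment:

  • git pull the latest develop branch and change one PADDLE_ENFORCE in mul_op.cc to a condition that can never hold: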
 45     PADDLE_ENFORCE_EQ(
 46         // x_mat_dims[1], y_mat_dims[0],
 47         1, 2,
 48         "First matrix's width must be equal with second matrix's height.");
  • With this change, the test_mul_op unit test should fail, and the expected output is:
125: Check failed in mul_op.cc, line 45
125: First matrix's width must be equal with second matrix's height.
...
  • But the actual output is:
test 125
    Start 125: test_mul_op

125: Test command: /bin/env "PYTHONPATH=/home/liuyiqun01/github/Paddle/build_paddle/build/python/build/lib-python" "python2" "test_mul_op.py"
125: Test timeout computed to be: 9.99988e+06
125: EEEEEEEEEE
125: ======================================================================
125: ERROR: test_cpu_gpu_compare (__main__.TestMulGradOp)
125: ----------------------------------------------------------------------
125: Traceback (most recent call last):
125:   File "test_mul_op.py", line 45, in test_cpu_gpu_compare
125:     self.compare_grad(self.op, self.inputs)
125:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/gradient_checker.py", line 216, in compare_grad
125:     out_names, core.CPUPlace())
125:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/gradient_checker.py", line 164, in __get_gradient
125:     forward_op.infer_shape(scope)
125: RuntimeError: basic_string::_S_construct null not valid
125: 

The error message we set was not printed at all.
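
To make the expectation concrete, here is a minimal sketch of an enforce-style macro whose exception carries the user message together with the file and line of the failed check. `SKETCH_ENFORCE`, `FormatEnforceMessage`, and `CheckMulDims` are hypothetical names used for illustration only. This is not how PADDLE_ENFORCE is implemented.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds "<message> at [<file>:<line>]".
inline std::string FormatEnforceMessage(const std::string& message,
                                        const char* file, int line) {
  assert(file != nullptr);
  std::ostringstream os;
  os << message << " at [" << file << ":" << line << "]";
  return os.str();
}

// Hypothetical macro: throws when `cond` is false, keeping the message
// and the location of the check in the exception text.
#define SKETCH_ENFORCE(cond, message)                              \
  do {                                                             \
    if (!(cond)) {                                                 \
      throw std::runtime_error(                                    \
          FormatEnforceMessage((message), __FILE__, __LINE__));    \
    }                                                              \
  } while (0)

// Mirrors the experiment above: the check fails whenever the two
// dimensions differ.
inline void CheckMulDims(int x_width, int y_height) {
  SKETCH_ENFORCE(
      x_width == y_height,
      "First matrix's width must be equal with second matrix's height.");
}
```

With a macro of this shape, calling `CheckMulDims(1, 2)` throws an exception whose `what()` starts with the message that was written at the call site, so a caller that prints `what()` sees the intended text rather than an unrelated library error.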

Now, about why a wrong input/output key in CreateOp leads to this error. Look at the OperatorBase constructor in operator.cc:

OperatorBase::OperatorBase(const std::string& type,
                           const VariableNameMap& inputs,
                           const VariableNameMap& outputs,
                           const AttributeMap& attrs)
    : type_(type), inputs_(inputs), outputs_(outputs), attrs_(attrs) {
  GenerateTemporaryNames();
  CheckAllInputOutputSet();
}

CheckAllInputOutputSet() does in fact check the inputs and outputs:

void OperatorBase::CheckAllInputOutputSet() const {
  auto& info_map = OpInfoMap::Instance();
  auto* op_info = info_map.GetNullable(Type());
  if (op_info == nullptr || op_info->proto_ == nullptr) return;

  for (auto& in : op_info->Proto().inputs()) {
    PADDLE_ENFORCE(inputs_.find(in.name()) != inputs_.end(),
                   "Type %s's input %s is not set", Type(), in.name());
  }

  for (auto& out : op_info->Proto().outputs()) {
    PADDLE_ENFORCE(outputs_.find(out.name()) != outputs_.end(),
                   "Type %s's output %s is not set", Type(), out.name());
  }
}

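Indeed, when an input/output key is misconfigured, execution fails at one of these two PADDLE_ENFORCE calls. Without an accurate error message, however, debugging becomes extremely difficult.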

@Xreki
Contributor Author

Xreki commented Sep 11, 2017

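@Superjom said that PADDLE_ENFORCE can print stack information in C++ tests and that the problem only exists in Python tests. I ran the following test.
Create a new file enforce_failure_test.cc with this content: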

#include "gtest/gtest.h"
#include "paddle/platform/enforce.h"

TEST(ENFORCE, Failure) {
  PADDLE_ENFORCE(false, "Example to show the behavior when enforce is failed.");
}

Running the unit test with make test ARGS="-R enforce_failure_test -V" gives the following output:

$ make test ARGS="-R enforce_failure_test -V"  
Running tests...
UpdateCTestConfiguration  from :/home/liuyiqun01/github/Paddle/build_paddle/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/liuyiqun01/github/Paddle/build_paddle/build/DartConfiguration.tcl
Test project /home/liuyiqun01/github/Paddle/build_paddle/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 84
    Start 84: enforce_failure_test

84: Test command: /home/liuyiqun01/github/Paddle/build_paddle/build/paddle/platform/enforce_failure_test
84: Test timeout computed to be: 9.99988e+06
84: Running main() from gtest_main.cc
84: [==========] Running 1 test from 1 test case.
84: [----------] Global test environment set-up.
84: [----------] 1 test from ENFORCE
84: [ RUN      ] ENFORCE.Failure
84: unknown file: Failure
84: C++ exception with description "basic_string::_S_construct null not valid" thrown in the test body.
84: [  FAILED  ] ENFORCE.Failure (1 ms)
84: [----------] 1 test from ENFORCE (1 ms total)
84: 
84: [----------] Global test environment tear-down
84: [==========] 1 test from 1 test case ran. (1 ms total)
84: [  PASSED  ] 0 tests.
84: [  FAILED  ] 1 test, listed below:
84: [  FAILED  ] ENFORCE.Failure
84: 
84:  1 FAILED TEST
1/1 Test #84: enforce_failure_test .............***Failed    0.00 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.01 sec

The following tests FAILED:
         84 - enforce_failure_test (Failed)
Errors while running CTest
make: *** [test] Error 8
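
For context, `basic_string::_S_construct null not valid` is the text libstdc++ uses for the `std::logic_error` it raises when a `std::string` is constructed from a null `char` pointer. The thread does not say which pointer was null on the enforce path, so the following is only a sketch of the general defensive pattern. `SafeString` is a hypothetical helper, not part of Paddle.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: converts a possibly-null C string to std::string,
// substituting `fallback` instead of constructing from a null pointer.
inline std::string SafeString(const char* text,
                              const char* fallback = "(null)") {
  assert(fallback != nullptr);
  return std::string(text != nullptr ? text : fallback);
}
```

Routing any C string of uncertain origin through a guard like this before building an error message keeps a missing string from replacing the real diagnostic with a library exception.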

@Xreki
Contributor Author

Xreki commented Sep 11, 2017

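The PADDLE_ENFORCE problem has been fixed in #4002, thanks @reyoung. Now, when the op's input key Y is mistakenly written as W in CreateOp, the error message and the stack trace are printed correctly.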

136: ERROR: test_check_output (__main__.TestFCOp)
136: ----------------------------------------------------------------------
136: Traceback (most recent call last):
136:   File "test_fc_op.py", line 40, in test_check_output
136:     self.check_output()
136:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/op_test.py", line 203, in check_output
136:     self.check_output_with_place(place)
136:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/op_test.py", line 171, in check_output_with_place
136:     self.op = create_op(self.scope, self.op_type, self.inputs, self.outputs)
136:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/op_test.py", line 41, in create_op
136:     return Operator(op_type, **kwargs)
136:   File "/home/liuyiqun01/github/Paddle/build_paddle/build/python/build/lib-python/paddle/v2/framework/op.py", line 172, in __call__
136:     return self.get_op_info(t).method(**kwargs)
136:   File "/home/liuyiqun01/github/Paddle/build_paddle/build/python/build/lib-python/paddle/v2/framework/op.py", line 140, in __impl__
136:     return core.Operator.create(opdesc.SerializeToString())
136: RuntimeError: Type mul's input Y is not set at [/home/liuyiqun01/github/Paddle/paddle/framework/operator.cc:167]
136: PaddlePaddle Call Stacks: 
136: 0       0x7fe466e44527p _ZN6paddle8platform13EnforceNotMetC2ENSt15__exception_ptr13exception_ptrEPKci + 663
136: 1       0x7fe4683b55fap _ZNK6paddle9framework12OperatorBase22CheckAllInputOutputSetEv + 426
...
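
The frames in the call stack are mangled C++ symbols. As a small aid for reading them, the sketch below uses `abi::__cxa_demangle` from `<cxxabi.h>`, which is available with GCC and Clang. The wrapper name `Demangle` is hypothetical, and nothing in the thread indicates that Paddle demangles its stack output this way.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium C++ ABI symbol, or the input
// unchanged if it cannot be demangled.
inline std::string Demangle(const std::string& mangled) {
  int status = 0;
  char* raw =
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0) return mangled;
  assert(raw != nullptr);  // on success the result buffer is non-null
  std::string result(raw);
  std::free(raw);  // the buffer is allocated with malloc
  return result;
}
```

Passing frame 1 from the trace above, `_ZNK6paddle9framework12OperatorBase22CheckAllInputOutputSetEv`, yields `paddle::framework::OperatorBase::CheckAllInputOutputSet() const`, which matches the function where the check fails.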


5 participants