Segmentation fault when training with RecordIO and ParallelExecutor #13809

Closed
zzhzz opened this issue Oct 10, 2018 · 18 comments

@zzhzz
Contributor

zzhzz commented Oct 10, 2018

While using RecordIO together with ParallelExecutor to speed up training, a segmentation fault occurred. The error message is as follows:
*** Aborted at 1539160971 (unix time) try "date -d @1539160971" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***
@ 0x7f305bb7e7e0 (unknown)
@ 0x7f3000000002 (unknown)

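The neural network is a word-embedding model. With Paddle's logging enabled through an environment variable, the part of the log just before the error is as follows: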
I1010 08:42:51.656551 51287 operator.cc:130] CUDAPlace(0) Op(adam), inputs:{Beta1Pow[beta1_pow_acc_3:float1], Beta2Pow[beta2_pow_acc_3:float1], Grad[fc_1.b_0@GRAD:float173], LearningRate[learning_rate_0:float1], Moment1[moment1_3:float[173]({ })], Moment2[moment2_3:float173], Param[fc_1.b_0:float173]}, outputs:{Moment1Out[moment1_3173], Moment2Out[moment2_3173], ParamOut[fc_1.b_0173]}.
I1010 08:42:51.656599 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.656657 51287 operator.cc:142] CUDAPlace(0) Op(adam), inputs:{Beta1Pow[beta1_pow_acc_3:float1], Beta2Pow[beta2_pow_acc_3:float1], Grad[fc_1.b_0@GRAD:float173], LearningRate[learning_rate_0:float1], Moment1[moment1_3:float[173]({ })], Moment2[moment2_3:float173], Param[fc_1.b_0:float173]}, outputs:{Moment1Out[moment1_3173], Moment2Out[moment2_3173], ParamOut[fc_1.b_0173]}.
I1010 08:42:51.660423 51287 operator.cc:130] CUDAPlace(0) Op(scale), inputs:{X[beta2_pow_acc_3:float1]}, outputs:{Out[beta2_pow_acc_31]}.
I1010 08:42:51.660465 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.660521 51287 operator.cc:142] CUDAPlace(0) Op(scale), inputs:{X[beta2_pow_acc_3:float1]}, outputs:{Out[beta2_pow_acc_31]}.
I1010 08:42:51.660552 51287 operator.cc:130] CUDAPlace(0) Op(scale), inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.
I1010 08:42:51.660575 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.660604 51287 operator.cc:142] CUDAPlace(0) Op(scale), inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.
I1010 08:42:51.663774 51288 tensor_util.cu:107] TensorCopySync 1 from CUDAPlace(0) to CPUPlace
I1010 08:42:51.700305 51288 tensor_util.cu:25] TensorCopy 1 from CPUPlace to CPUPlace
I1010 08:42:51.700296 51286 tensor_util.cu:107] TensorCopySync 21639, 200 from CUDAPlace(0) to CPUPlace
I1010 08:42:51.703213 51286 tensor_util.cu:25] TensorCopy 21639, 200 from CPUPlace to CPUPlace
*** Aborted at 1539160971 (unix time) try "date -d @1539160971" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***
@ 0x7f305bb7e7e0 (unknown)
@ 0x7f3000000002 (unknown)
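
The `Aborted at 1539160971 (unix time)` line can also be decoded without GNU `date`. Below is a minimal Python sketch, assuming the value is seconds since the epoch (as the message itself states) and interpreting it in UTC. The function name `decode_abort_time` is illustrative and does not come from Paddle or glog.

```python
from datetime import datetime, timezone

def decode_abort_time(unix_seconds):
    """Convert the epoch value from the 'Aborted at ...' line to a UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

print(decode_abort_time(1539160971).isoformat())
# 2018-10-10T08:42:51+00:00
```

The decoded time lines up with the `I1010 08:42:51` prefix of the last log entries, which suggests (assuming the machine logs in UTC) that the crash followed the final `TensorCopy` lines almost immediately.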

@JiabinYang
Contributor

Do you have code that reproduces the problem?

@zzhzz
Contributor Author

zzhzz commented Oct 11, 2018

I'll post it shortly. I'm training on a single machine with a single GPU.

@chengduoZH
Contributor

chengduoZH commented Oct 11, 2018

What is your runtime environment? @zzhzz

@zzhzz
Contributor Author

zzhzz commented Oct 12, 2018

@chengduoZH CentOS release 6.9 (Final) with paddlepaddle-gpu 0.14.0

@zzhzz
Contributor Author

zzhzz commented Oct 18, 2018

*** Aborted at 1539834486 (unix time) try "date -d @1539834486" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f2d00000002) received by PID 21528 (TID 0x7f2e6907e700) from PID 2; stack trace: ***
@ 0x7f2e688507e0 (unknown)
@ 0x7f2e33cc6c8b paddle::framework::Scope::VarInternal()
@ 0x7f2e33cc6d5e paddle::framework::Scope::Var()
@ 0x7f2e33b7a355 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
@ 0x7f2e329df4b9 paddle::framework::ParallelExecutor::Run()
@ 0x7f2e328f48c0 ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEERKSsE101_vIS6_SB_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV
@ 0x7f2e32915b74 pybind11::cpp_function::dispatcher()
@ 0x7f2e68b69ce8 PyEval_EvalFrameEx
@ 0x7f2e68b6c37d PyEval_EvalCodeEx
@ 0x7f2e68b69d70 PyEval_EvalFrameEx
@ 0x7f2e68b6c37d PyEval_EvalCodeEx
@ 0x7f2e68b69d70 PyEval_EvalFrameEx
@ 0x7f2e68b6c37d PyEval_EvalCodeEx
@ 0x7f2e68b69d70 PyEval_EvalFrameEx
@ 0x7f2e68b69e9e PyEval_EvalFrameEx
@ 0x7f2e68b69e9e PyEval_EvalFrameEx
@ 0x7f2e68b6c37d PyEval_EvalCodeEx
@ 0x7f2e68b6c4b2 PyEval_EvalCode
@ 0x7f2e68b961c2 PyRun_FileExFlags
@ 0x7f2e68b97559 PyRun_SimpleFileExFlags
@ 0x7f2e68bad1dd Py_Main
@ 0x7f2e67e40d1d __libc_start_main
@ 0x4006b1 (unknown)

@chengduoZH According to the log, the error occurs while creating a variable.
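
The `SIGSEGV` header line carries the fault address, the process ID, and the thread ID, and these are easier to compare across crashes once parsed. Below is a minimal Python sketch. The regular expression and the `parse_sigsegv` helper are illustrative, not part of Paddle or glog. It also checks one hypothesis: that the reported `from PID 2` is simply the low 32 bits of the fault address and not a real sending process. That is an assumption about how the signal handler reads the signal info, and this thread does not confirm it.

```python
import re

# Pattern for the failure-handler header line shown in the trace above.
SIGSEGV_RE = re.compile(
    r"SIGSEGV \(@(0x[0-9a-f]+)\) received by PID (\d+) "
    r"\(TID (0x[0-9a-f]+)\) from PID (\d+)"
)

def parse_sigsegv(line):
    """Extract fault address, process ID, thread ID, and reported sender PID."""
    m = SIGSEGV_RE.search(line)
    if m is None:
        return None
    addr, pid, tid, from_pid = m.groups()
    return {
        "fault_addr": int(addr, 16),
        "pid": int(pid),
        "tid": int(tid, 16),
        "from_pid": int(from_pid),
    }

line = ("*** SIGSEGV (@0x7f2d00000002) received by PID 21528 "
        "(TID 0x7f2e6907e700) from PID 2; stack trace: ***")
info = parse_sigsegv(line)
low32 = info["fault_addr"] & 0xFFFFFFFF  # low 32 bits of the fault address
print(hex(info["fault_addr"]), info["pid"], low32 == info["from_pid"])
# 0x7f2d00000002 21528 True
```

Both crashes in this thread show the same pattern (`0x7f3000000002` and `0x7f2d00000002`, each with `from PID 2`). That is consistent with the hypothesis, and it hints at a pointer whose low half was overwritten or never initialized, but it does not prove either.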

@zzhzz
Contributor Author

zzhzz commented Oct 18, 2018

The core dump information is as follows:

RuntimeError: Cannot access memory at address 0x7f2d00000002
) at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/bits/basic_string.h:293
#1  _M_rep (this=0x7f2dfd4b9490, name=Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 556, in to_string
    header = ptr.cast(reptype) - 1
RuntimeError: Cannot access memory at address 0x7f2d00000002
) at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/bits/basic_string.h:301
#2  size (this=0x7f2dfd4b9490, name=Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 556, in to_string
    header = ptr.cast(reptype) - 1
RuntimeError: Cannot access memory at address 0x7f2d00000002
) at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/bits/basic_string.h:716
#3  operator<< <char, std::char_traits<char>, std::allocator<char> > (this=0x7f2dfd4b9490, name=Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 556, in to_string
    header = ptr.cast(reptype) - 1
RuntimeError: Cannot access memory at address 0x7f2d00000002
) at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/bits/basic_string.h:2758
#4  paddle::framework::Scope::VarInternal (this=0x7f2dfd4b9490, name=Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 556, in to_string
    header = ptr.cast(reptype) - 1
RuntimeError: Cannot access memory at address 0x7f2d00000002
) at /mnt/paddle/Paddle/paddle/fluid/framework/scope.cc:147
#5  0x00007f2e33cc6d5e in paddle::framework::Scope::Var (this=0x7f2dfd4b9490, name="mean_0.tmp_0@GRAD")
    at /mnt/paddle/Paddle/paddle/fluid/framework/scope.cc:59
#6  0x00007f2e33b7a355 in paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run (this=0x7f2dfd4dd070, 
    fetch_tensors=std::vector of length 2, capacity 2 = {...})
    at /mnt/paddle/Paddle/paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc:56
#7  0x00007f2e329df4b9 in paddle::framework::ParallelExecutor::Run (this=Unhandled dwarf expression opcode 0xf3
)
    at /mnt/paddle/Paddle/paddle/fluid/framework/parallel_executor.cc:262

@chengduoZH
Contributor

chengduoZH commented Oct 18, 2018

@zzhzz Does the error occur after several iterations, or on the very first iteration?

@zzhzz
Contributor Author

zzhzz commented Oct 19, 2018

@chengduoZH The error occurs during the first training pass.

@chengduoZH
Contributor

@zzhzz So there is no problem if you don't use RecordIO, is that right?

@zzhzz
Contributor Author

zzhzz commented Oct 19, 2018

@chengduoZH I haven't tested that. However, it runs fine when I use neither ParallelExecutor nor RecordIO.

@chengduoZH
Contributor

Please first check whether the problem still occurs without RecordIO.

@zzhzz
Contributor Author

zzhzz commented Oct 19, 2018

@chengduoZH You mean using the py_reader interface, right? I'll give it a try.

@chengduoZH
Contributor

You can also feed the data directly with feed.

@zzhzz
Contributor Author

zzhzz commented Oct 20, 2018

@chengduoZH py_reader + ParallelExecutor works without any problem.

@chengduoZH
Contributor

@zzhzz Have you found the cause of this problem?

@zzhzz
Contributor Author

zzhzz commented Nov 16, 2018

Not yet. My GPUs are currently busy with training jobs, so I have no spare resources for debugging at the moment.

@chengduoZH
Contributor

OK, I'll follow up on this.

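@chengduoZH
Contributor

Paddle no longer maintains RecordIO; we recommend using py_reader to read data instead. Using RecordIO requires users to convert their data to the RecordIO format first, and for most models the benefit of RecordIO is small.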
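
For readers unfamiliar with the recommended alternative, the general idea of feeding Python-generated batches through a bounded buffer can be sketched in plain Python. This is a conceptual illustration only, not Paddle's `py_reader` API or implementation. The names `prefetching_reader` and `toy_batches` are hypothetical, and the sketch assumes the generator does not raise (an exception in the worker thread would leave the consumer waiting).

```python
import queue
import threading

_END = object()  # sentinel marking the end of the data stream

def prefetching_reader(batch_generator, capacity=4):
    """Yield batches while a worker thread keeps a bounded queue filled."""
    buf = queue.Queue(maxsize=capacity)

    def worker():
        for batch in batch_generator():
            buf.put(batch)  # blocks while the queue is full
        buf.put(_END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item

def toy_batches():
    # Hypothetical data source: three small batches of word IDs.
    for i in range(3):
        yield [i, i + 1, i + 2]

batches = list(prefetching_reader(toy_batches))
print(batches)
# [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Unlike RecordIO, this style reads data in whatever format the Python generator already understands, so no offline conversion step is needed.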
