GPU error with 1.5 and later versions of the model #19628

Closed
kahitomi opened this issue Sep 4, 2019 · 8 comments

Labels: User (label for user questions)

Comments

kahitomi commented Sep 4, 2019

The model is an LSTM generation model with a copy mechanism.
Under versions 1.3 and 1.4, it runs fine on both GPU and CPU.
With paddle 1.5 and the latest version, the GPU build errors while the CPU build works. The GPU error is as follows:
(screenshot of the error)

The message does not point to the exact location of the error.
Following its hint, I searched for sum-related OPs, without success.
The code I finally narrowed it down to is inside DynamicRNN's block(), as follows:

......
with drnn.block():
    ......
    current_h_expand_seq = pd.reshape(
        current_h, [-1, 1, self.decoder_size])
    current_h_expand_seq = pd.expand(
        current_h_expand_seq, [1, self.max_length, 1])

    copy_score_sub = pd.elementwise_mul(
        copy_score_weight, current_h_expand_seq, axis=0)
    copy_score = pd.reduce_sum(copy_score_sub, dim=2)
    ......

Removing this piece of code makes the error go away.
Here self.decoder_size is the decoder LSTM size and self.max_length is the maximum number of generation steps.
current_h is the LSTM output at the current time step, with shape [batch_size, self.decoder_size].
copy_score_weight comes from all the tokens on the encoder side, with shape [batch_size, self.max_length, self.decoder_size].
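For reference, the reshape + expand + elementwise_mul + reduce_sum pattern above is just a per-position dot product between current_h and each encoder vector. A minimal NumPy sketch of the equivalent computation, with made-up shapes and random placeholder inputs:

```python
import numpy as np

batch_size, max_length, decoder_size = 2, 5, 4
rng = np.random.default_rng(0)
# hypothetical inputs with the shapes described above
current_h = rng.random((batch_size, decoder_size), dtype=np.float32)
copy_score_weight = rng.random((batch_size, max_length, decoder_size), dtype=np.float32)

# reshape + expand: repeat current_h across the max_length axis
current_h_expand_seq = np.broadcast_to(
    current_h[:, None, :], (batch_size, max_length, decoder_size))

# elementwise_mul + reduce_sum over dim=2: one dot product per encoder position
copy_score = (copy_score_weight * current_h_expand_seq).sum(axis=2)

print(copy_score.shape)  # (2, 5)
```

The same result could be computed without the explicit expand (NumPy would broadcast `current_h[:, None, :]` automatically), which is why the shapes alone give no hint of the GPU-side failure.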

I also tried removing the reduce_sum op; the code is:

......
with drnn.block():
    ......
    current_h_expand_seq = pd.reshape(
        current_h, [-1, 1, self.decoder_size])
    current_h_expand_seq = pd.expand(
        current_h_expand_seq, [1, self.max_length, 1])

    copy_score_in = pd.concat([copy_score_weight, current_h_expand_seq], axis=2)
    copy_score_in = pd.reshape(
        copy_score_in, [-1, self.decoder_size * 2])
    copy_score = fluid.layers.fc(input=copy_score_in,
        act='tanh',
        size=1,
        bias_attr=False,
        param_attr=fluid.ParamAttr(name="copy_score_combine_weight_w"))
    copy_score = pd.reshape(
        copy_score, [-1, self.max_length])
    ......
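This variant scores each encoder position with a learned linear layer instead of a dot product. A NumPy sketch of the same computation, using a random stand-in for the learned copy_score_combine_weight_w parameter:

```python
import numpy as np

batch_size, max_length, decoder_size = 2, 5, 4
rng = np.random.default_rng(0)
current_h = rng.random((batch_size, decoder_size), dtype=np.float32)
copy_score_weight = rng.random((batch_size, max_length, decoder_size), dtype=np.float32)

# expand current_h along the sequence axis, then concat on the feature axis
h_seq = np.broadcast_to(current_h[:, None, :], (batch_size, max_length, decoder_size))
score_in = np.concatenate([copy_score_weight, h_seq], axis=2)   # [B, T, 2D]
score_in = score_in.reshape(-1, decoder_size * 2)               # [B*T, 2D]

# stand-in for the fc layer: tanh(x @ W), size=1, no bias;
# W here is a random placeholder for the learned copy_score_combine_weight_w
w = rng.standard_normal((decoder_size * 2, 1)).astype(np.float32)
copy_score = np.tanh(score_in @ w).reshape(-1, max_length)      # [B, T]

print(copy_score.shape)  # (2, 5)
```

Note that both variants share the reshape + expand prefix, which is consistent with the error surviving the removal of reduce_sum.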

The resulting error is identical to the one at the start: it still reports sum to lod tensor.

The full error is:

Traceback (most recent call last):
  File "train.py", line 528, in <module>
    train()
  File "train.py", line 467, in train
    return_numpy=False)
  File "/home/slurm/job/tmp/job-128277/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 651, in run
    use_program_cache=use_program_cache)
  File "/home/slurm/job/tmp/job-128277/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 749, in _run
    exe.run(program.desc, scope, 0, True, True, fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: holder_ should not be null
Tensor holds no memory. Call Tensor::mutable_data first. at [/paddle/paddle/fluid/framework/tensor.cc:23]
PaddlePaddle Call Stacks: 
0 0x7fc4e021aff8p void paddle::platform::EnforceNotMet::Init<std::string>(std::string, char const*, int) + 360
1 0x7fc4e021b347p paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int) + 87
2 0x7fc4e21d8f09p paddle::framework::Tensor::check_memory_size() const + 185
3 0x7fc4e0221c59p float const* paddle::framework::Tensor::data<float>() const + 25
4 0x7fc4e06c38cbp void paddle::operators::SumToLoDTensor<float>(paddle::framework::ExecutionContext const&) + 763
5 0x7fc4e06cca38p std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::SumKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::SumKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::SumKernel<paddle::platform::CUDADeviceContext, int>, paddle::operators::SumKernel<paddle::platform::CUDADeviceContext, long>, paddle::operators::SumKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&) + 248
6 0x7fc4e2183037p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext*) const + 375
7 0x7fc4e2183411p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 529
8 0x7fc4e2180a0cp paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 332
9 0x7fc4e03a746ep paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool) + 382
10 0x7fc4e1c2139dp paddle::operators::WhileGradOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 1869
11 0x7fc4e2180a0cp paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 332
12 0x7fc4e03a746ep paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool) + 382
13 0x7fc4e03aa50fp paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool) + 143
14 0x7fc4e020bf8dp
15 0x7fc4e024d936p
16 0x7fc58a29fcc8p PyEval_EvalFrameEx + 28264
17 0x7fc58a2a235dp PyEval_EvalCodeEx + 2061
18 0x7fc58a29fd50p PyEval_EvalFrameEx + 28400
19 0x7fc58a2a235dp PyEval_EvalCodeEx + 2061
20 0x7fc58a29fd50p PyEval_EvalFrameEx + 28400
21 0x7fc58a2a235dp PyEval_EvalCodeEx + 2061
22 0x7fc58a29fd50p PyEval_EvalFrameEx + 28400
23 0x7fc58a2a235dp PyEval_EvalCodeEx + 2061
24 0x7fc58a2a2492p PyEval_EvalCode + 50
25 0x7fc58a2cc1a2p PyRun_FileExFlags + 146
26 0x7fc58a2cd539p PyRun_SimpleFileExFlags + 217
27 0x7fc58a2e31bdp Py_Main + 3149
28 0x7fc5894e0bd5p __libc_start_main + 245
29 0x4007a1p
@wanghaoshuang added the User (label for user questions) label Sep 4, 2019
@wanghaoshuang (Contributor) commented:

@zhaoyuchen2018 Could you help take a look at this sum-op-related problem?

@zhaoyuchen2018 (Contributor) commented:

It looks like one of the tensors fed to sum holds no data. Could you provide code to reproduce it?

@zenghsh3 commented:

Hi, has there been any progress on this issue? A user reported a similar problem: PaddlePaddle/PARL#151
@zhaoyuchen2018

@zhaoyuchen2018 (Contributor) commented:

> Hi, has there been any progress on this issue? A user reported a similar problem: PaddlePaddle/PARL#151
> @zhaoyuchen2018

I took a look; some tensor holds no data. I need reproduction code to debug this further.

@zenghsh3 commented:

So far this only occurs with multiple GPUs; after setting the environment variable export CUDA_VISIBLE_DEVICES=0 it runs normally.
For reproduction code, see https://github.com/PaddlePaddle/PARL/tree/develop/examples/IMPALA
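The same workaround can also be applied from inside the training script instead of the shell. A sketch, under the assumption that it runs before paddle initializes CUDA (i.e. before importing paddle or creating any CUDAPlace, otherwise it has no effect):

```python
import os

# Workaround from this thread: restrict the process to a single GPU.
# Must be set before any CUDA context is created in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```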

@jinxing94 commented:

Has there been any progress on this? I ran into a similar problem: #19679

@kahitomi (Author) commented Sep 18, 2019:

@zhaoyuchen2018
I put together a piece of reproduction code at https://github.com/kahitomi/paddle_test_code
You can pull it and run it directly.

@zhaoyuchen2018 (Contributor) commented Oct 14, 2019:

#20602, this patch fixes it; please give it a try. @kahitomi

5 participants