
loss.backward() fails when training PaddleOCR on my own data #4170

Closed
daeing opened this issue Sep 24, 2021 · 6 comments

@daeing

daeing commented Sep 24, 2021

The loss is computed without problems, but when loss.backward() runs, the print that follows it never produces output, the error below is raised, and part of the GPU memory is not released.
python: ../nptl/pthread_mutex_lock.c:79: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.


C++ Traceback (most recent call last):

0 paddle::imperative::BasicEngine::Execute()
1 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVariableWrapperMap const&, paddle::imperative::NameVariableWrapperMap const&, paddle::framework::AttributeMap const&)
2 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulGradKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulGradKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulGradKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3 paddle::operators::MatMulGradKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4 paddle::operators::MatMulGradKernel<paddle::platform::CUDADeviceContext, float>::CalcInputGrad(paddle::framework::ExecutionContext const&, paddle::framework::Tensor const&, bool, bool, paddle::framework::Tensor const&, bool, bool, paddle::framework::Tensor*) const
5 paddle::operators::MatMulGradKernel<paddle::platform::CUDADeviceContext, float>::MatMul(paddle::framework::ExecutionContext const&, paddle::framework::Tensor const&, bool, paddle::framework::Tensor const&, bool, paddle::framework::Tensor*) const
6 void paddle::operators::math::Blas<paddle::platform::CUDADeviceContext>::MatMul(paddle::framework::Tensor const&, paddle::operators::math::MatDescriptor const&, paddle::framework::Tensor const&, paddle::operators::math::MatDescriptor const&, float, paddle::framework::Tensor*, float) const
7 void paddle::operators::math::Blas<paddle::platform::CUDADeviceContext>::GEMM(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, float, float const*, float const*, float, float*) const
8 cublasSgemm_v2
9 paddle::framework::SignalHandle(char const*, int)
10 paddle::platform::GetCurrentTraceBackString[abi:cxx11]


Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: *** Aborted at 1632470105 (unix time) try "date -d @1632470105" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3e7000045b7) received by PID 17847 (TID 0x7fa8e38b50c0) from PID 17847 ***]

Aborted (core dumped)
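
A minimal sketch of the kind of dygraph step the report describes (hypothetical stand-in code, not the actual PaddleOCR training loop): the forward pass and the loss computation succeed, and the abort is reported only once loss.backward() reaches the matmul gradient kernel (cublasSgemm) on the GPU.

import paddle

# Hypothetical toy model standing in for the recognition network.
x = paddle.randn([8, 32])
label = paddle.randint(0, 10, shape=[8, 1])
head = paddle.nn.Linear(32, 10)

logits = head(x)                                           # forward pass succeeds
loss = paddle.nn.functional.cross_entropy(logits, label)
print(float(loss))                                         # the loss value prints fine

loss.backward()                                            # reported crash point: SIGABRT inside MatMulGrad / cublasSgemm
print("backward finished")                                 # per the report, this print is never reached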

@tink2123
Collaborator

Which model are you running? Apart from changing the data, did you modify any code?

@daeing
Author

daeing commented Sep 24, 2021

Which model are you running? Apart from changing the data, did you modify any code?

python tools/train.py -c configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec.yml
The project was git-cloned as is: no code changes, only the data was swapped out. What I am running is m1_enhance (LCNet); the dataloader loads without problems and the loss is also computed without problems, but executing loss.backward() raises the error above. The paddlepaddle and paddleocr versions are:
paddleocr 2.3.0.1
paddlepaddle-gpu 2.1.3.post101

paddle.utils.run_check() also runs and prints its output successfully.
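
The environment check mentioned above corresponds to something like the following (standard paddle calls, shown only to document what was verified; run_check() confirms that the CUDA build works in general, not that every kernel, e.g. the matmul backward, will succeed):

import paddle

print(paddle.__version__)              # pip package reported as paddlepaddle-gpu 2.1.3.post101
print(paddle.is_compiled_with_cuda())  # should print True for the GPU wheel
paddle.utils.run_check()               # reported to run and print successfully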

@tink2123
Collaborator

Check whether GPU memory is running out; try halving the batch size and see whether training runs.
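
Concretely, assuming the batch-size key is Train.loader.batch_size_per_card as in the recognition configs and using PaddleOCR's -o command-line override (64 here is only an example value for "half"), the suggestion amounts to something like:

python tools/train.py -c configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec.yml -o Train.loader.batch_size_per_card=64

or editing the same key directly in the yml.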

@daeing
Author

daeing commented Sep 24, 2021

Check whether GPU memory is running out; try halving the batch size and see whether training runs.

After reducing it to 4 it runs, but only about 3 GB of GPU memory is used, and my P40 has 22 GB. When I increase it to 8 it fails again. What could be the cause?
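
One way to check whether the failure is really a transient memory spike (rather than the roughly 3 GB steady state that a single nvidia-smi call shows) is to poll the driver from another terminal while training runs; a rough sketch, assuming nvidia-smi is on PATH:

import subprocess
import time

def sample_peak_gpu_memory(seconds=120, interval=0.5):
    """Poll nvidia-smi and return the peak memory.used value (MiB) seen on any GPU."""
    peak_mib = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        ).decode()
        peak_mib = max([peak_mib] + [int(v) for v in out.split()])
        time.sleep(interval)
    return peak_mib

if __name__ == "__main__":
    print("peak GPU memory used (MiB):", sample_peak_gpu_memory())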

@vance-coder

Check whether GPU memory is running out; try halving the batch size and see whether training runs.

After reducing it to 4 it runs, but only about 3 GB of GPU memory is used, and my P40 has 22 GB. When I increase it to 8 it fails again. What could be the cause?

May I ask whether you found a solution? Also, how large a batch size can a P40 handle? Doesn't the P40 have 24 GB of memory?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
