Crash in the RNN operator when calling the C++ inference library from multiple threads #49737

Open

marsbzp opened this issue Jan 11, 2023 · 6 comments

Labels: status/need-more-info (incomplete information), type/debug (help the user debug)
marsbzp commented Jan 11, 2023

Describe the Bug

System Environment: CentOS
Version: Paddle_inference 2.3, PaddleOCR 2.4, CUDA 10.1
Command Code:
Complete Error Message:

When calling the C++ inference library from multiple threads to run OCR inference, the process crashes. I am using the recognition model ch_ppocr_server_v2.0_rec_infer downloaded from the official site. The crash is not always reproducible; the more threads there are, the more likely it is to occur. When it does not crash, inference works normally. It almost always crashes in the RNN operator, so I suspect a problem in Paddle's RNN operator implementation (each thread has its own detection and recognition models; models are not shared between threads).
This is essentially the same as the following issues, and the reproduction steps described there can be followed:
PaddlePaddle/PaddleOCR#6514
PaddlePaddle/FastDeploy#1143
Inference through the FastDeploy library with the Paddle backend has the same problem.
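For reference, the per-thread setup described above (each thread holding its own, unshared predictor) can be sketched with the Paddle Inference C++ API roughly as follows. This is a minimal illustration, not the reporter's actual code; the model path, GPU memory size, and thread count are placeholder assumptions.

```cpp
#include <string>
#include <thread>
#include <vector>
#include "paddle/include/paddle_inference_api.h"

// One predictor per thread, nothing shared between threads.
// Per the report, this still crashes, suggesting some process-wide
// state (e.g. cuda_stream/cudnn_handle) is shared underneath.
void worker(const std::string& model_dir, int device_id) {
    paddle_infer::Config config;
    config.SetModel(model_dir + "/inference.pdmodel",
                    model_dir + "/inference.pdiparams");
    config.EnableUseGpu(500 /* MB of initial GPU memory */, device_id);
    auto predictor = paddle_infer::CreatePredictor(config);
    // ... feed inputs via predictor->GetInputHandle(...), then:
    predictor->Run();
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 8; ++t)  // more threads -> crash more likely
        pool.emplace_back(worker, "./ch_ppocr_server_v2.0_rec_infer", 0);
    for (auto& th : pool) th.join();
    return 0;
}
```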

The error message is as follows:

The detection visualized image saved in ./ocr_vis.png
Detected boxes num: 8
Detected boxes num: 8
Detected boxes num: 8
The detection visualized image saved in ./ocr_vis.png
The detection visualized image saved in ./ocr_vis.png
The detection visualized image saved in ./ocr_vis.png

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffd397fe700 (LWP 5157)]
0x00007fff5308b04e in ?? () from /lib64/libcuda.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 libgomp-4.8.5-44.el7.x86_64
(gdb) bt
#0 0x00007fff5308b04e in ?? () from /lib64/libcuda.so.1
#1 0x00007fff53204b0f in ?? () from /lib64/libcuda.so.1
#2 0x00007fff530751e0 in ?? () from /lib64/libcuda.so.1
#3 0x00007fff531dded6 in ?? () from /lib64/libcuda.so.1
#4 0x00007fff52f85a1b in ?? () from /lib64/libcuda.so.1
#5 0x00007fff52f85c98 in ?? () from /lib64/libcuda.so.1
#6 0x00007fff52f85cde in ?? () from /lib64/libcuda.so.1
#7 0x00007fff5310c806 in cuLaunchKernel () from /lib64/libcuda.so.1
#8 0x00007fff64b5aa19 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#9 0x00007fff64b5aaa7 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#10 0x00007fff64b90e9b in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#11 0x00007fff647f83de in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#12 0x00007fff647f29ea in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#13 0x00007fff646d7a76 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#14 0x00007fff647a4bf3 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#15 0x00007fff647e908f in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#16 0x00007fff647ebfd8 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#17 0x00007fff647ec7bf in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#18 0x00007fff647f23b0 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#19 0x00007fff645a4342 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#20 0x00007fff645cf335 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#21 0x00007fff645d3e81 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#22 0x00007fff645c1d20 in ?? () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#23 0x00007fff645c277f in cudnnRNNForwardInference () from /usr/local/cuda-10.1/lib64/libcudnn.so.7
#24 0x00007fff995599ef in ?? () from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#25 0x00007fff9955e2f4 in void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor const&>, float, bool, int, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*) () from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#26 0x00007fff9955ee5d in void phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor const&>, float, bool, int, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*), &(void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor const&>, float, bool, int, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*))>::KernelCallHelper<std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor const&>, float, bool, int, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*, phi::TypeTag<int> >::Compute<1, 1, 0, 0, phi::GPUContext const, phi::DenseTensor const>(phi::KernelContext*, phi::GPUContext const&, phi::DenseTensor const&) ()
from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#27 0x00007fff9cefaa3a in paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const ()
from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#28 0x00007fff9cefb629 in paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const ()
from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#29 0x00007fff9ceec10b in paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&) () from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#30 0x00007fff96fa23d0 in paddle::framework::NaiveExecutor::Run() () from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#31 0x00007fff96bfd8fb in paddle::AnalysisPredictor::ZeroCopyRun() () from /home/share/disk2/zhangjinlong21/inference_project/ocr_inference/cpp_infer_release/3rdparty/paddle_inference_2.3/paddle/lib/libpaddle_inference.so
#32 0x000000000044198c in PaddleOCR::CRNNRecognizer::Run (this=0x7ffd397ecfd0, img_list=..., times=0x7ffd397ecf70) at /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/src/ocr_rec.cpp:63
#33 0x000000000042ce4a in predictor_thread (cv_all_img_names=..., thread_id=14) at /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/src/main.cpp:139
#34 0x00000000004314e7 in std::__invoke_impl<void, void (*)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >, int>(std::__invoke_other, void (*&&)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >&&, int&&) (__f=<unknown type in /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/build/ppocr, CU 0x0, DIE 0x40211>,
__args#0=<unknown type in /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/build/ppocr, CU 0x0, DIE 0x40234>, __args#1=<unknown type in /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/build/ppocr, CU 0x0, DIE 0x40244>)
at /usr/local/gcc-8.2/include/c++/8.2.0/bits/invoke.h:60
#35 0x000000000042fca9 in std::__invoke<void (*)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >, int>(void (*&&)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >&&, int&&) (__fn=<unknown type in /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/build/ppocr, CU 0x0, DIE 0x42458>,
__args#0=<unknown type in /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/build/ppocr, CU 0x0, DIE 0x4247a>, __args#1=<unknown type in /home/share/disk1/bzp/ocr_inference_multithread/cpp_infer_qm/build/ppocr, CU 0x0, DIE 0x42489>)
at /usr/local/gcc-8.2/include/c++/8.2.0/bits/invoke.h:95
#36 0x00000000004350b9 in std::thread::_Invoker<std::tuple<void (*)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >, int> >::_M_invoke<0ul, 1ul, 2ul> (this=0x2ace388)
at /usr/local/gcc-8.2/include/c++/8.2.0/thread:234
#37 0x0000000000435058 in std::thread::_Invoker<std::tuple<void (*)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >, int> >::operator() (this=0x2ace388) at /usr/local/gcc-8.2/include/c++/8.2.0/thread:243
#38 0x000000000043503c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::vector<cv::String, std::allocator<cv::String> >, int), std::vector<cv::String, std::allocator<cv::String> >, int> > >::_M_run (this=0x2ace380)
at /usr/local/gcc-8.2/include/c++/8.2.0/thread:186
#39 0x00007ffff7f3f19d in std::execute_native_thread_routine (__p=0x2ace380) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#40 0x00007fff62ccaea5 in start_thread () from /lib64/libpthread.so.0
#41 0x00007fff624db9fd in clone () from /lib64/libc.so.6
(gdb)

paddle-bot commented Jan 11, 2023

Hi! We've received your issue; please be patient while we get it answered. We will arrange for technicians to respond to your questions as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for answers in the official API documentation, the FAQ, historical issues, and the AI community. Have a nice day!

zhangbo9674 (Contributor) commented

Hi, please try upgrading Paddle_inference to version 2.4 first.

marsbzp (Author) commented Jan 31, 2023

> Hi, please try upgrading Paddle_inference to version 2.4 first.

Hi, upgrading to 2.4 behaves the same, and using your FastDeploy Paddle inference backend also fails in the same way:
PaddlePaddle/FastDeploy#1143

marsbzp (Author) commented Feb 2, 2023

For FastDeploy inference with the Paddle backend, see
PaddlePaddle/FastDeploy#1143
which contains detailed reproduction steps.

marsbzp (Author) commented Feb 22, 2023

Have you located the cause of this issue yet?

marsbzp (Author) commented Apr 11, 2023

> Have you located the cause of this issue yet?

How is the investigation going? I looked at the ONNX Runtime code: each of its threads has its own hardware resource handles, so multi-threaded inference with ORT does not crash. Paddle's resource pool is a single instance shared by all threads, including the cuda_stream and the cudnn_handle, and the crash happens right at cudnnRNNForward. The function your RNN operator ultimately calls is the same one ORT calls, so the likely difference is that the cuda_stream and cudnn_handle are shared across threads. Please take another look.
