Paddle multi-CPU training and prediction problem #19354

Closed

xuzhenglei1991 opened this issue Aug 22, 2019 · 6 comments

xuzhenglei1991 commented Aug 22, 2019

I was previously training with Paddle 1.2, but it was too slow, so I switched to multi-CPU training. Paddle 1.2 reports an error when run on multiple CPUs; the exact error is:

File "python/train_multi_cpu.py", line 397, in train
    feed=feeder.feed(full_batch))
  File "/home/disk7/paddle_release_home/python/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 247, in run
    feed_tensor_dict)
paddle.fluid.core.EnforceNotMet: Enforce failed. Expected member_->places_.size() == lod_tensors.size(), but received member_->places_.size():48 != lod_tensors.size():40.
The number of samples of current batch is less than the count of devices, currently, it is not allowed. (48 vs 40) at [/paddle/paddle/fluid/framework/parallel_executor.cc:314]
PaddlePaddle Call Stacks:
0       0x7f96bfe40986p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1       0x7f96bff2ec4ep paddle::framework::ParallelExecutor::FeedAndSplitTensorIntoLocalScopes(std::unordered_map<std::string, paddle::framework::LoDTensor, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, paddle::framework::LoDTensor> > > const&) + 1118
2       0x7f96bfe8e7f1p
3       0x7f96bfe7a7c0p
4       0x7f974c3d9bb8p PyEval_EvalFrameEx + 25016
5       0x7f974c3dd0bdp PyEval_EvalCodeEx + 2061
6       0x7f974c3da345p PyEval_EvalFrameEx + 26949
7       0x7f974c3da460p PyEval_EvalFrameEx + 27232
8       0x7f974c3dd0bdp PyEval_EvalCodeEx + 2061
9       0x7f974c3dd1f2p PyEval_EvalCode + 50
10      0x7f974c405f42p PyRun_FileExFlags + 146
11      0x7f974c4072d9p PyRun_SimpleFileExFlags + 217
12      0x7f974c41d00dp Py_Main + 3149
13      0x7f974b61abd5p __libc_start_main + 245
14            0x4007a1p
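
For context, the multi-CPU training path shown in the traceback above corresponds roughly to the sketch below (the actual train_multi_cpu.py is not part of this issue; the toy network, reader, and CPU_NUM value are assumptions used only to show how the error is triggered):

import os
import paddle
import paddle.fluid as fluid

os.environ['CPU_NUM'] = '48'  # ParallelExecutor creates one CPU place per CPU_NUM

# Toy stand-in network: a single fc layer over a dense feature.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
pred = fluid.layers.fc(input=x, size=1)
loss = fluid.layers.mean(fluid.layers.square_error_cost(input=pred, label=y))
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

place = fluid.CPUPlace()
fluid.Executor(place).run(fluid.default_startup_program())

pe = fluid.ParallelExecutor(use_cuda=False, loss_name=loss.name)
feeder = fluid.DataFeeder(feed_list=[x, y], place=place)

# batch_size (40) < CPU_NUM (48): feeding such a batch raises the EnforceNotMet above.
train_reader = paddle.batch(paddle.dataset.uci_housing.train(), batch_size=40)
for full_batch in train_reader():
    pe.run(fetch_list=[loss.name], feed=feeder.feed(full_batch))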

So I switched to Paddle 1.5 for training.

This raised another problem: the prediction code uses the C++ paddle1.2.0_pb32 library, and it core-dumps when loading the model. The stack trace is as follows:

(gdb) bt
#0  0x00007ffc829783f7 in raise () from /opt/compiler/gcc-4.8.2/lib/libc.so.6
#1  0x00007ffc829797d8 in abort () from /opt/compiler/gcc-4.8.2/lib/libc.so.6
#2  0x00007ffc83268c65 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffc83266e06 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#4  0x00007ffc83266e33 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#5  0x00007ffc83267052 in __cxxabiv1::__cxa_throw (obj=0x7ffc740319b0, tinfo=0x1785710 <typeinfo for paddle::platform::EnforceNotMet>, dest=
    0x64ddb4 <paddle::platform::EnforceNotMet::~EnforceNotMet()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:87
#6  0x000000000077c5f4 in paddle::framework::ExtractAttribute<std::vector<int, std::allocator<int> > >::operator()(boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator<int> >, std::vector<float, std::allocator<float> >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator<bool> >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator<long> > >&) const ()
#7  0x00000000007801cf in paddle::framework::TypedAttrChecker<std::vector<int, std::allocator<int> > >::operator()(std::unordered_map<std::string, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator<int> >, std::vector<float, std::allocator<float> >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator<bool> >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator<long> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator<int> >, std::vector<float, std::allocator<float> >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator<bool> >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator<long> > > > > >*) const ()
#8  0x000000000067a959 in paddle::framework::OpRegistry::CreateOp(std::string const&, std::map<std::string, std::vector<std::string, std::allocator<std::string> >, std::less<std::string>, std::allocator<std::pair<std::string const, std::vector<std::string, std::allocator<std::string> > > > > const&, std::map<std::string, std::vector<std::string, std::allocator<std::string> >, std::less<std::string>, std::allocator<std::pair<std::string const, std::vector<std::string, std::allocator<std::string> > > > > const&, std::unordered_map<std::string, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator<int> >, std::vector<float, std::allocator<float> >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator<bool> >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator<long> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator<int> >, std::vector<float, std::allocator<float> >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator<bool> >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator<long> > > > > >) ()
#9  0x000000000067aae3 in paddle::framework::OpRegistry::CreateOp(paddle::framework::OpDesc const&) ()
#10 0x0000000000653194 in paddle::framework::Executor::Prepare(paddle::framework::ProgramDesc const&, int, std::vector<std::string, std::allocator<std::string> > const&) ()
#11 0x000000000063b67e in visionary::lac::MainTagger::create_buff (this=0x19f0b00, buff=0x7ffc740008c0) at baidu/visionary/lac/src/main_tagger.cpp:100
#12 0x00000000006330ab in visionary::lac::Lac::create_buff (this=0x19f09d0) at baidu/visionary/lac/src/lac.cpp:107
#13 0x000000000062ef13 in visionary::lac::lac_buff_create (lac_handle=0x19f09d0) at baidu/visionary/lac/src/ilac.cpp:43
#14 0x0000000000629f3f in tagging (max_result_num=1000) at baidu/visionary/lac/tools/lac_class_demo.cpp:135
#15 0x000000000062a5b9 in thread_worker (arg=0x7fffa242eaf0) at baidu/visionary/lac/tools/lac_class_demo.cpp:205

Please take a look at how to solve this. Should I switch back to the Paddle 1.2 multi-CPU training API, or switch the prediction library to paddle1.5_pb32?

@seiriosPlus
Collaborator

First problem: this is not an incremental-training issue; it happens because the batch size is smaller than the number of concurrent threads. See the CPU_NUM section of the official documentation.

Second problem: this looks like a version incompatibility. Try regenerating the __model__ file with the newer version.
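
A minimal sketch of the two usual fixes for the first problem (the values are illustrative; the key point is that CPU_NUM must not exceed the number of samples in any batch that reaches ParallelExecutor.run):

import os
import paddle

batch_size = 64

# Option 1: cap CPU_NUM at the batch size so every place gets at least one sample.
os.environ['CPU_NUM'] = str(min(48, batch_size))

# Option 2: drop the last (usually smaller) batch so only full batches are fed.
train_reader = paddle.batch(paddle.dataset.uci_housing.train(),
                            batch_size=batch_size,
                            drop_last=True)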

@xuzhenglei1991
Author

About the second problem: the __model__ file was generated with version 1.5, and prediction was done with the 1.2 C++ library. I don't quite understand what you mean by generating the __model__ file with the newer version.

@seiriosPlus
Collaborator

Using a __model__ generated by 1.5 for prediction with 1.2 may cause incompatibility problems; please use version 1.5 for prediction as well.
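
For reference, regenerating the __model__ file just means re-running save_inference_model with the Paddle release that matches the predictor (a minimal sketch with a toy network; the real feed/fetch variables come from the trained program):

import paddle.fluid as fluid

# Toy stand-in network; in practice this is the trained program.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
pred = fluid.layers.fc(input=x, size=1)

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Writes __model__ (the serialized program) and the parameter files under
# ./inference_model; load this directory with a predictor built from the
# same Paddle release (here 1.5).
fluid.io.save_inference_model(dirname='./inference_model',
                              feeded_var_names=['x'],
                              target_vars=[pred],
                              executor=exe)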

@xuzhenglei1991
Author

Do we have a corresponding pb32 build of the 1.5 C++ prediction library?

@ysh329
Contributor

ysh329 commented Aug 26, 2019

Prebuilt packages are already available; see the "Install and Compile the C++ Inference Library" documentation.

@paddle-bot-old

Since you haven't replied for more than a year, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
