Summary of issues encountered so far #2061

Closed
lgcy opened this issue Feb 22, 2021 · 2 comments

Comments

@lgcy
Contributor

lgcy commented Feb 22, 2021

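I am not using the official image. I installed the framework myself with pip install paddlepaddle-gpu==2.0.0, and the check tool reports that the installation is fine.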
Running python tools/train.py -c configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml, I ran into the following problems:

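  1. With a single machine and multiple GPUs (4 V100s), the program hangs, either after iterating normally for a while or right at training startup. Reducing num_workers and the batch size does not fix it.
  2. With a single machine and a single GPU, the hang still occurs, but far less often than in the multi-GPU case: training gets through more iterations before it stalls.
  3. With a single machine and a single GPU, training crashes after 2 epochs with the following error: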
    C++ Traceback (most recent call last):
    0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
    1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>()>, bool)
    3 paddle::framework::SignalHandle(char const*, int)
    4 paddle::platform::GetCurrentTraceBackString()
    Error Message Summary:
    FatalError: A serious error (Segmentation fault) is detected by the operating system. (at /paddle/paddle/fluid/platform/init.cc:303)
    [TimeInfo: *** Aborted at 1613617811 (unix time) try "date -d @1613617811" if you are using GNU date ***]
    [SignalInfo: *** SIGSEGV (@0x7f56bed7c000) received by PID 31381 (TID 0x7f58b672d740) from PID 18446744072616394752 ***]
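
The TimeInfo line suggests running date -d @1613617811 to decode the abort time. A minimal Python equivalent is sketched below, using only values copied from the traceback. It also checks one observation about the SignalInfo line: the implausibly large "from PID" value equals the sign-extended low 32 bits of the fault address. That suggests the field is not a real sender PID for this SIGSEGV, though that reading is an assumption and is not confirmed anywhere in this thread. The helper names are hypothetical.

```python
from datetime import datetime, timezone

# Values copied from the TimeInfo and SignalInfo lines of the traceback.
ABORT_UNIX_TIME = 1613617811
FAULT_ADDRESS = 0x7F56BED7C000
REPORTED_PID = 18446744072616394752


def abort_time_utc(unix_time: int) -> str:
    """Return the abort time as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(unix_time, tz=timezone.utc).isoformat()


def sign_extend_low32(value: int) -> int:
    """Sign-extend the low 32 bits of value into an unsigned 64-bit integer."""
    low = value & 0xFFFFFFFF
    if low & 0x80000000:
        low |= 0xFFFFFFFF00000000
    return low


print(abort_time_utc(ABORT_UNIX_TIME))  # 2021-02-18T03:10:11+00:00
print(sign_extend_low32(FAULT_ADDRESS) == REPORTED_PID)  # True
```

The crash therefore happened on Feb 18, 2021 (UTC), a few days before this issue was opened.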
@tink2123
Collaborator

Does it crash at the same iteration every time? What are the batch size (bs) and num_worker values used for training?

@lgcy
Contributor Author

lgcy commented Feb 23, 2021

With bs 64 and num_worker 8 on 4 V100s, training hangs. With bs 64 and num_worker 4 it still hangs.
With bs 128 and num_worker 8 on a single V100, training crashes after 2 epochs.
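
One possible way to narrow down both symptoms (the hang and the crash at a possibly fixed iteration) is Python's standard faulthandler module. The sketch below is not PaddleOCR-specific and is not taken from this thread: the log path, the timeout, and the function names are hypothetical, and the lambda stands in for a real training step. Note that faulthandler reports Python frames only, so it complements the C++ traceback rather than replacing it.

```python
import faulthandler
import time

TRACE_LOG = "hang_trace.log"  # hypothetical output path
HANG_TIMEOUT_S = 600          # hypothetical: longest time one iteration should take


def install_crash_handlers(log_path=TRACE_LOG):
    """Dump the Python tracebacks of all threads to log_path on fatal signals such as SIGSEGV."""
    log_file = open(log_path, "a")  # must stay open while the handler is enabled
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


def run_with_watchdog(step, num_iters, log_file, timeout_s=HANG_TIMEOUT_S):
    """Run step(i) per iteration; dump all thread stacks if one iteration exceeds timeout_s."""
    completed = []
    for i in range(num_iters):
        # Re-arming replaces the previous timer, so the timeout applies per iteration.
        faulthandler.dump_traceback_later(timeout_s, file=log_file)
        step(i)
        completed.append(i)
        # Logging every iteration shows whether a crash recurs at the same index.
        print(f"iter {i} finished at {time.time():.0f}", flush=True)
    faulthandler.cancel_dump_traceback_later()
    return completed


if __name__ == "__main__":
    log = install_crash_handlers()
    # Stand-in for one training step; a real loop would run the model here.
    run_with_watchdog(lambda i: None, num_iters=3, log_file=log)
```

If the last logged iteration index is identical across runs, a specific sample or batch is a likely suspect. If it varies, the data-loading workers or the multi-GPU communication are more plausible candidates.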
