DBNet model training hangs on NPU with eight cards #10095

Closed
old-steel opened this issue Jun 5, 2023 · 3 comments
@old-steel

Please provide the following information to quickly locate the problem
Training DBNet hangs when launched on eight cards. Inspecting the process with gdb shows it is stuck inside the PaddlePaddle framework.

  • System Environment: Ubuntu18
  • Version: Paddle: PaddleOCR: Related components: PaddleOCR 2.6
  • Command Code: python3 -m paddle.distributed.launch --npus '0,1,2,3,4,5,6,7' tools/train.py -c configs/det/det_r50_vd_db.yml
  • Complete Error Message:
  • [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
    --Type <RET> for more, q to quit, c to continue without paging--
    syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
    38 ../sysdeps/unix/sysv/linux/aarch64/syscall.S: No such file or directory.
    warning: File "/opt/compiler/gcc-8.2/lib64/libstdc++.so.6.0.25-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
    To enable execution of this file add
    add-auto-load-safe-path /opt/compiler/gcc-8.2/lib64/libstdc++.so.6.0.25-gdb.py
    line to your configuration file "/root/.gdbinit".
    To completely disable this security protection add
    set auto-load safe-path /
    line to your configuration file "/root/.gdbinit".
    For more information about this security protection see the
    "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
    info "(gdb)Auto-loading safe path"
    (gdb) bt
    #0 syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
    #1 0x0000ffffa642e6f4 in std::__atomic_futex_unsigned_base::_M_futex_wait_until (this=<optimized out>, __addr=<optimized out>, __val=<optimized out>, __has_timeout=<optimized out>, __s=...,
    __ns=...) at ../../../../../gcc-8.2.0/libstdc++-v3/src/c++11/futex.cc:55
    #2 0x0000ffff9e1ae054 in paddle::pybind::MultiDeviceFeedReader<paddle::operators::reader::LoDTensorBlockingQueue>::WaitFutures(std::__exception_ptr::exception_ptr*) ()
    from /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    #3 0x0000ffff9e1ae3a4 in paddle::pybind::MultiDeviceFeedReader<paddle::operators::reader::LoDTensorBlockingQueue>::CheckNextStatus() ()
    from /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    #4 0x0000ffff9e1b432c in paddle::pybind::MultiDeviceFeedReader<paddle::operators::reader::LoDTensorBlockingQueue>::ReadNextList() ()
    from /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    #5 0x0000ffff9e1aa87c in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<std::vector<phi::TensorArray, std::allocatorphi::TensorArray >, paddle::pybind::MultiDeviceFerpaddle::operators::reader::LoDTensorBlockingQueue, , pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guardpybind11::gil_scoped_release >(std::vector<phi::TensorArr::allocatorphi::TensorArray > (paddle::pybind::MultiDeviceFeedReaderpaddle::operators::reader::LoDTensorBlockingQueue::)(), pybind11::name const&, pybind11::is_method const&, pybind11::sconst&, pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(paddle::pybind::MultiDeviceFeedReaderpaddle::operators::reader::LoDTensorBlockingQueue)#1}, std::vector<phi::Teny, std::allocatorphi::TensorArray >, paddle::pybind::MultiDeviceFeedReaderpaddle::operators::reader::LoDTensorBlockingQueue, pybind11::name, pybind11::is_method, pybind11::sibling, pybinll_guardpybind11::gil_scoped_release >(pybind11::cpp_function::initialize<std::vector<phi::TensorArray, std::allocatorphi::TensorArray >, paddle::pybind::MultiDeviceFeedReaderpaddle::ope:reader::LoDTensorBlockingQueue, , pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guardpybind11::gil_scoped_release >(std::vector<phi::TensorArray, std::allocator > (paddle::pybind::MultiDeviceFeedReaderpaddle::operators::reader::LoDTensorBlockingQueue::)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybinll_guardpybind11::gil_scoped_release const&)::{lambda(paddle::pybind::MultiDeviceFeedReaderpaddle::operators::reader::LoDTensorBlockingQueue)#1}&&, std::vector<phi::TensorArray, std::allphi::TensorArray> > ()(paddle::pybind::MultiDeviceFeedReaderpaddle::operators::reader::LoDTensorBlockingQueue), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&d11::call_guardpybind11::gil_scoped_release const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
    from /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    #6 0x0000ffff9ddacd00 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    #7 0x0000000000543470 in _PyMethodDef_RawFastCallKeywords ()
    #8 0x00000000005447ac in _PyObject_FastCallKeywords ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)
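
The native backtrace shows the main thread waiting in `MultiDeviceFeedReader::WaitFutures`, but it stops at the pybind11 boundary, so it does not say which Python line was asking the reader for data. A minimal sketch, assuming a few lines can be added near the top of the training script: Python's standard `faulthandler` module can print every thread's Python stack on a signal or at a fixed interval. The helper names `install_hang_dump` and `cancel_hang_dump` and the 600-second default are illustrative assumptions, not part of PaddleOCR or Paddle.

```python
import faulthandler
import signal
import sys


def install_hang_dump(timeout_s=600.0, stream=sys.stderr):
    """Hypothetical helper: print all Python thread stacks when the process looks stuck."""
    # On demand (Unix only): `kill -USR1 <pid>` dumps every thread's stack.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    # Periodically: dump every `timeout_s` seconds without terminating the process.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream, exit=False)


def cancel_hang_dump():
    """Undo install_hang_dump (call before closing the stream passed to it)."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

With eight launched processes, each one would write its own dump, so comparing the per-rank logs may show whether one rank is stuck somewhere different from the others.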
@github-actions
Contributor

github-actions bot commented Aug 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 5, 2023
@August-us

My training hangs as well, though usually in the second or third epoch. It looks as if the dataloader in one of the processes has exited; I don't know how to fix it.
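
If a dataloader really has stopped producing batches, the consumer waits indefinitely, which would look like a hang rather than an error. A minimal sketch of one way to turn that silent wait into a visible failure: wrap the batch iterator so that a stalled `next()` raises `TimeoutError`. The wrapper `iter_with_timeout` is hypothetical and written in plain Python; whether a Paddle `DataLoader` can safely be iterated from a helper thread like this is an assumption that has not been verified here.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the wrapped iterable


def iter_with_timeout(iterable, timeout_s):
    """Yield items from `iterable`; raise TimeoutError if the next item takes too long."""
    q = queue.Queue(maxsize=1)

    def produce():
        try:
            for item in iterable:
                q.put((item, None))
            q.put((_DONE, None))
        except BaseException as exc:  # forward producer errors to the consumer
            q.put((_DONE, exc))

    # Daemon thread: it will not keep the process alive after a timeout.
    threading.Thread(target=produce, daemon=True).start()
    while True:
        try:
            item, exc = q.get(timeout=timeout_s)
        except queue.Empty:
            raise TimeoutError(f"no batch received within {timeout_s} s") from None
        if exc is not None:
            raise exc
        if item is _DONE:
            return
        yield item
```

A training loop could then iterate over `iter_with_timeout(train_dataloader, 300)` (the variable name and the 300-second limit are placeholders), so the rank whose loader stalls fails with a traceback instead of blocking the other ranks silently.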

@github-actions github-actions bot removed the stale label Jan 3, 2024
@UserWangZz
Collaborator

This issue has not been updated for a long time, so it is being closed for now. It can be reopened if needed.
