Error during GPU training #1406

Closed
sarawon opened this issue Feb 21, 2017 · 17 comments
@sarawon

sarawon commented Feb 21, 2017

root@gputest:~/demo/mnist# sh train.sh 
Using debug command gdb --args
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/../opt/paddle/bin/paddle_trainer...done.
(gdb) r
Starting program: /usr/opt/paddle/bin/paddle_trainer --config=vgg_16_mnist.py --dot_period=10 --log_period=100 --test_all_data_in_one_period=1 --use_gpu=1 --trainer_count=4 --num_passes=10 --save_dir=./mnist_vgg_model
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
I0221 18:48:34.436648 29562 Util.cpp:155] commandline: /usr/opt/paddle/bin/paddle_trainer --config=vgg_16_mnist.py --dot_period=10 --log_period=100 --test_all_data_in_one_period=1 --use_gpu=1 --trainer_count=4 --num_passes=10 --save_dir=./mnist_vgg_model 
[New Thread 0x7ffff3686700 (LWP 29567)]
[New Thread 0x7ffff2e85700 (LWP 29568)]
*** stack smashing detected ***: /usr/opt/paddle/bin/paddle_trainer terminated

Program received signal SIGABRT, Aborted.
0x00007ffff5cd0c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff5cd0c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff5cd4028 in __GI_abort () at abort.c:89
#2  0x00007ffff5d0d2a4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff5e19113 "*** %s ***: %s terminated\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff5da4bbc in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x7ffff5e190fb "stack smashing detected") at fortify_fail.c:38
#4  0x00007ffff5da4b60 in __stack_chk_fail () at stack_chk_fail.c:28
#5  0x00000000008954ce in hl_create_global_resources (device_prop=<optimized out>) at /root/paddle/paddle/cuda/src/hl_cuda_device.cc:499
#6  0x00000000008959f9 in hl_specify_devices_start (device=device@entry=0x0, number=18345040, number@entry=0)
    at /root/paddle/paddle/cuda/src/hl_cuda_device.cc:593
#7  0x0000000000895d2d in hl_start () at /root/paddle/paddle/cuda/src/hl_cuda_device.cc:430
#8  0x0000000000818602 in paddle::initMain (argc=1, argc@entry=9, argv=argv@entry=0x7fffffffe528) at /root/paddle/paddle/utils/Util.cpp:179
#9  0x000000000052ac5b in main (argc=9, argv=0x7fffffffe528) at /root/paddle/paddle/trainer/TrainerMain.cpp:41
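
To locate the Paddle frame in a backtrace like the one above, the following Python sketch joins gdb's wrapped lines and picks the first frame whose source path starts with a given prefix. The names `parse_frames` and `first_frame_under` are made up for illustration, the sample text is frames #4 to #6 copied from the backtrace, and the regular expression only assumes the `#N ADDR in FUNC (ARGS) at FILE:LINE` layout seen here.

```python
import re

# Frames #4 to #6 copied from the backtrace above (frame #6 is wrapped by gdb).
SAMPLE_BT = """\
#4  0x00007ffff5da4b60 in __stack_chk_fail () at stack_chk_fail.c:28
#5  0x00000000008954ce in hl_create_global_resources (device_prop=<optimized out>) at /root/paddle/paddle/cuda/src/hl_cuda_device.cc:499
#6  0x00000000008959f9 in hl_specify_devices_start (device=device@entry=0x0, number=18345040, number@entry=0)
    at /root/paddle/paddle/cuda/src/hl_cuda_device.cc:593
"""

# Assumed layout of one frame once wrapped lines are joined:
#   #N  0xADDR in FUNC (ARGS) at FILE:LINE
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(?P<func>\S+)\s+"
    r"\((?P<args>.*)\)\s+at\s+(?P<file>\S+):(?P<line>\d+)$"
)


def parse_frames(bt_text):
    """Return a list of (frame number, function, file, line) tuples."""
    joined = []
    for raw in bt_text.splitlines():
        if raw.startswith("#"):
            # A new frame starts with "#N".
            joined.append(raw.strip())
        elif joined and raw.strip():
            # An indented line continues the previous frame.
            joined[-1] += " " + raw.strip()
    frames = []
    for text in joined:
        match = FRAME_RE.match(text)
        if match:
            frames.append(
                (int(match["num"]), match["func"], match["file"], int(match["line"]))
            )
    return frames


def first_frame_under(frames, prefix):
    """Return the first frame whose source file path starts with prefix."""
    for frame in frames:
        if frame[2].startswith(prefix):
            return frame
    return None


if __name__ == "__main__":
    print(first_frame_under(parse_frames(SAMPLE_BT), "/root/paddle/"))
```

On the sample, the prefix `/root/paddle/` selects frame #5, `hl_create_global_resources` at `hl_cuda_device.cc:499`, which is the frame that calls `__stack_chk_fail`.
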

@hedaoyuan
Contributor

Did you build Paddle yourself? Please paste the source code around /root/paddle/paddle/cuda/src/hl_cuda_device.cc:499.

@sarawon
Author

sarawon commented Feb 21, 2017

I am using the released 0.9.0 deb package.
Source code:
[image]

@hedaoyuan
Contributor

@gangliao Does this look like a problem with the Docker + GPU environment?

@gangliao
Contributor

@hedaoyuan It looks that way, yes.

@sarawon Does the problem also occur if you switch to a different type of demo?

@sarawon
Author

sarawon commented Feb 22, 2017

@gangliao @hedaoyuan Yes, I switched to another demo and got a similar stack trace.
I plan to try CUDA 7.5; the version currently installed is 8.0. What do you think?

@gangliao
Contributor

@sarawon I will try it later as well.

@sarawon
Author

sarawon commented Feb 22, 2017

By the way, I am not using Docker. We use a KVM virtual machine with a K40 GPU, CUDA 8.0, and PaddlePaddle 0.9.0.

@sarawon
Author

sarawon commented Feb 22, 2017

The operating system is Ubuntu.

@sarawon
Author

sarawon commented Feb 22, 2017

@hedaoyuan @gangliao After I switched to CUDA 7.5, training ran.

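@hedaoyuan
Contributor

@sarawon Thanks for your work. Also, what is the CUDA driver version in your environment? The nvidia-smi command prints a line like NVIDIA-SMI 352.79 Driver Version: 352.79. We will try to reproduce the problem under CUDA 8.0 later.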
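
For anyone collecting this information, the driver version can be pulled out of the nvidia-smi banner with a short Python sketch. The helper names `parse_driver_version` and `read_driver_version` are hypothetical, and the pattern only assumes the `NVIDIA-SMI <version> Driver Version: <version>` banner quoted above; other nvidia-smi releases may format the line differently.

```python
import re
import subprocess

# Assumed banner layout, as quoted in the comment above:
#   NVIDIA-SMI 352.79     Driver Version: 352.79
BANNER_RE = re.compile(r"NVIDIA-SMI\s+(\S+)\s+Driver Version:\s*(\S+)")


def parse_driver_version(smi_output):
    """Return the driver version string found in nvidia-smi output, or None."""
    match = BANNER_RE.search(smi_output)
    return match.group(2) if match else None


def read_driver_version():
    """Run nvidia-smi (assumed to be on PATH) and parse its banner."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No driver tools installed, or nvidia-smi reported an error.
        return None
    return parse_driver_version(result.stdout)
```

Calling `parse_driver_version` on the example line from the comment yields `352.79`; `read_driver_version()` returns `None` on a machine without nvidia-smi.
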

@sarawon
Author

sarawon commented Feb 22, 2017

@hedaoyuan The CUDA version on my machine is now 7.5 and it works. To reproduce the CUDA 8.0 problem, you can use the following CUDA package:
https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1404-8-0-local-ga2_8.0.61-1_amd64-deb
The Paddle deb package is at:
https://github.com/PaddlePaddle/Paddle/releases/download/v0.9.0/paddle-0.9.0-gpu-ubuntu14.04.deb

The GPU is a K40.

@rainmanzheng

Doesn't the GPU version's installation guide tell you to use 7.5? You may have been reading a fake document.

@sarawon
Author

sarawon commented Feb 22, 2017

I asked beforehand in the Paddle user group whether 8.0 would work, and someone said it would...

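@hedaoyuan
Contributor

  • Use 7.5; Paddle is used most often in that environment.
  • Building with 8.0 and running on 8.0 generally works too (we verified this earlier).
  • Building with 7.5 (the package at the link) and running on 8.0 also generally works, but we have not run it on your Ubuntu environment.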
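
A quick way to see which toolkit release is installed before choosing between these combinations is to parse the output of `nvcc --version`. The Python sketch below is only an illustration: the names `parse_nvcc_release` and `is_recommended` are hypothetical, the sample strings imitate the last line of nvcc output, and the 7.5 target comes from the recommendation in this thread, not from any Paddle API.

```python
import re

# Toolkit release recommended in this thread.
RECOMMENDED_RELEASE = (7, 5)

# nvcc prints a line such as "Cuda compilation tools, release 7.5, V7.5.17".
RELEASE_RE = re.compile(r"release\s+(\d+)\.(\d+)")


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` output, or None."""
    match = RELEASE_RE.search(nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_recommended(nvcc_output):
    """True when the installed toolkit matches the recommended release."""
    return parse_nvcc_release(nvcc_output) == RECOMMENDED_RELEASE


if __name__ == "__main__":
    # Illustrative sample lines, not captured from the reporter's machine.
    for sample in (
        "Cuda compilation tools, release 7.5, V7.5.17",
        "Cuda compilation tools, release 8.0, V8.0.61",
    ):
        print(parse_nvcc_release(sample), is_recommended(sample))
```

This only reports the toolkit release; it does not say which CUDA version the Paddle binary was built against.
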

@sarawon
Author

sarawon commented Feb 22, 2017

OK, thanks. I will use 7.5 for now. You could also try to reproduce what I described later; if you find the root cause, please comment on this issue as well.

@hedaoyuan
Contributor

Closing this issue for now. We will open a separate issue later to cover the problem with the 8.0 environment.

@sarawon
Author

sarawon commented Feb 22, 2017

OK, sounds good.

wangxicoding pushed a commit to wangxicoding/Paddle that referenced this issue Dec 9, 2021