Running the book example 02.recognize_digits with GPU configured fails with CUDA error: invalid device function #5629

Closed
shiyazhou121 opened this issue Nov 14, 2017 · 12 comments
Labels
User (used to mark user questions)

Comments

@shiyazhou121

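Problem description:
I followed the CentOS 6.3 GPU installation method for paddle from "[AI Learning] PaddlePaddle Deep Learning in Practice: Installing PaddlePaddle on Different Platforms" (http://learn.baidu.com/pages/index.html#/courseInfo/13655?courseId=13655&_k=usdv7x). I first installed python27-gcc482, then configured the GPU as shown in the video.
These are the environment variables I configured for cudnn and cuda: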
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/home/work/cudnn/cudnn_v5/cuda/lib64:/usr/local/ganglia/lib64:/usr/local/apr/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/ganglia/lib64:/usr/local/apr/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib::/home/work/cuda-8.0/lib64:/home/work/cuda-8.0/lib:/home/HGCP_Program/software-install/hadoop-v2/hadoop/lib:/home/HGCP_Program/software-install/hadoop-v2/hadoop/libhce:/home/HGCP_Program/software-install/hadoop-v2/hadoop/libhdfs:/home/HGCP_Program/software-install/openmpi-1.8.5/lib:/home/work/cuda-8.0/lib64:/home/work/cuda-8.0/lib:/home/HGCP_Program/software-install/hadoop-v2/hadoop/lib:/home/HGCP_Program/software-install/hadoop-v2/hadoop/libhce:/home/HGCP_Program/software-install/hadoop-v2/hadoop/libhdfs:/home/HGCP_Program/software-install/openmpi-1.8.5/lib

After configuration, running 02.recognize_digits from the book fails. The full log follows:

I1114 14:56:59.516850 4275 Util.cpp:166] commandline: --use_gpu=1 --trainer_count=1
W1114 14:57:08.683694 4275 CpuId.h:112] PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine and could speed up CPU computations via CMAKE .. -DWITH_AVX=ON
[INFO 2017-11-14 14:57:08,688 layers.py:2539] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-11-14 14:57:08,689 layers.py:2667] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-11-14 14:57:08,690 layers.py:2539] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-11-14 14:57:08,691 layers.py:2667] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
F1114 14:57:08.697180 4275 hl_gpu_matrix_kernel.cuh:181] Check failed: cudaSuccess == err (0 vs. 8) [hl_gpu_apply_unary_op failed] CUDA error: invalid device function
*** Check failure stack trace: ***
@ 0x7fe360c605ed google::LogMessage::Fail()
@ 0x7fe360c6409c google::LogMessage::SendToLog()
@ 0x7fe360c600e3 google::LogMessage::Flush()
@ 0x7fe360c655ae google::LogMessageFatal::~LogMessageFatal()
@ 0x7fe360aeaec4 hl_gpu_apply_unary_op<>()
@ 0x7fe360aeb205 paddle::BaseMatrixT<>::applyUnary<>()
@ 0x7fe360aeb433 paddle::BaseMatrixT<>::zero()
@ 0x7fe3609868d1 paddle::Parameter::enableType()
@ 0x7fe3609821cc paddle::parameterInitNN()
@ 0x7fe36098491a paddle::NeuralNetwork::init()
@ 0x7fe3609ad491 paddle::GradientMachine::create()
@ 0x7fe360c3d3b3 GradientMachine::createFromPaddleModelPtr()
@ 0x7fe360c3d58f GradientMachine::createByConfigProtoStr()
@ 0x7fe36084c4cd _wrap_GradientMachine_createByConfigProtoStr
@ 0x4b4cb9 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x52940f function_call
@ 0x422cba PyObject_Call
@ 0x4271ad instancemethod_call
@ 0x422cba PyObject_Call
@ 0x48121f slot_tp_init
@ 0x47eb1a type_call
@ 0x422cba PyObject_Call
@ 0x4b31dd PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b6c52 PyEval_EvalCode
Aborted

I then tried the other book examples, and every one of them fails with the same error. What causes this, and how can I fix it?

@kuke added the User (used to mark user questions) label Nov 14, 2017
@kuke
Contributor

kuke commented Nov 14, 2017

@shiyazhou121 This is probably a problem with your GPU. Could you post a screenshot of the nvidia-smi output?

@shiyazhou121
Author

image

@kuke
Contributor

kuke commented Nov 14, 2017

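@shiyazhou121 A common cause of invalid device function is insufficient GPU compute capability, but you are using a P40, so compute capability should not be the problem. 1) Please make sure your GPU is working properly; 2) Is PaddlePaddle installed correctly, or is the version too old?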

@shiyazhou121
Author

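@kuke Thank you for your continued help with this.
1) Please make sure your GPU is working properly;
How do I check this?
2) Is PaddlePaddle installed correctly, or is the version too old?
I downloaded python-gcc482 and used the pip it ships with to install paddlepaddle_gpu. After installation, I modified train.py in 02.recognize_digits.
image
I set use_gpu=1 to use the GPU, then ran it and got the error above.

I also tried use_gpu=with_gpu, which runs on the CPU. That works normally, and the results are as expected.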

@kuke
Contributor

kuke commented Nov 14, 2017

I forgot to ask earlier: have you set the CUDA_VISIBLE_DEVICES environment variable? You can set it as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
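
For a Python script such as train.py, the same setting can in principle be applied from inside the process, provided it happens before anything initializes CUDA. The following is a minimal sketch under that assumption; the helper name `restrict_visible_gpus` is hypothetical, and only the `CUDA_VISIBLE_DEVICES` variable itself comes from the thread.

```python
import os


def restrict_visible_gpus(device_ids):
    """Expose only the given GPU indices to CUDA in this process.

    Hypothetical helper for illustration. Call it before importing any
    library that initializes CUDA, since the variable is read when the
    CUDA runtime starts up and later changes are not picked up.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Same device list as in the comment above.
    print(restrict_visible_gpus(range(6)))
```

Note that this only selects which devices the process can see (they are renumbered from 0 inside the process); it does not change which GPU architectures the installed binary was compiled for.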

@shiyazhou121
Author

image
After setting it, I still get the same error as before.

@kuke
Contributor

kuke commented Nov 14, 2017

@shiyazhou121 I suggest you first try the docker image paddlepaddle/paddle:latest-gpu, which avoids all kinds of environment setup problems.

@shiyazhou121
Author

shiyazhou121 commented Nov 17, 2017

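@kuke I found a report of the same problem I am seeing: https://stackoverflow.com/questions/39850309/how-to-resolve-cudasuccess-err-0-vs-8-error-on-paddle-v0-8-0b

It seems paddle needs to be compiled from source. That post is from 2016, though, so things are probably somewhat different now?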

@qingqing01
Contributor

qingqing01 commented Nov 17, 2017

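@shiyazhou121 Judging from your screenshot, you are using a Tesla P40 GPU. Compiling for this GPU requires adding sm_61 to gencode, and our current CUDA build may not support that architecture, which is probably why it fails to run for you.

To see which CUDA architectures we build for, open https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/flags.cmake and search for sm_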

Paddle/cmake/flags.cmake

Lines 198 to 203 in 56f28a0

# Custom gpu architecture
set(CUDA_ARCH)
if(CUDA_ARCH)
specify_cuda_arch(${CUDA_VERSION} ${CUDA_ARCH})
endif()

I think there is a bit of a bug here; you may need to recompile.

Also, PR #5713 adds support for sm_61.
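
The "search for sm_" check suggested above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the flag string in the demo is made up for the example rather than copied from flags.cmake.

```python
import re


def compiled_sm_targets(nvcc_flags):
    """Return the sorted, de-duplicated sm_XX targets named in a flag string."""
    return sorted(set(re.findall(r"\bsm_\d+\b", nvcc_flags)))


def has_native_code_for(nvcc_flags, sm_target):
    """True if the flags name exactly `sm_target` as a code target."""
    return sm_target in compiled_sm_targets(nvcc_flags)


if __name__ == "__main__":
    # Hypothetical flag string for illustration, not taken from flags.cmake.
    flags = (
        "-gencode arch=compute_35,code=sm_35 "
        "-gencode arch=compute_52,code=sm_52 "
        "-gencode arch=compute_60,code=sm_60"
    )
    print(compiled_sm_targets(flags))
    print(has_native_code_for(flags, "sm_61"))
```

A missing exact match is only a hint. Whether a binary still runs can also depend on embedded PTX and on compatibility within the same major architecture, which this sketch does not model.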

@qingqing01
Contributor

@shiyazhou121

How to check the GPU architecture:

cd /usr/local/cuda/samples/1_Utilities/deviceQuery/
make
./deviceQuery

It is the number that follows "CUDA Capability Major/Minor version number" in the output.
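
To turn that number into the sm_ name used in the build flags, a small parser like the one below could be used. It is a sketch, assuming the output of ./deviceQuery has been captured as text; the function names are hypothetical, and the sample line only mimics the label quoted above, using the 6.1 implied by the sm_61 mentioned earlier for the Tesla P40.

```python
import re


def capability_from_device_query(output):
    """Extract (major, minor) for the first device in deviceQuery's output.

    Returns None if the expected line is not present.
    """
    match = re.search(
        r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)", output
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sm_name(major, minor):
    """Format a compute capability as the sm_XY name used in gencode flags."""
    return "sm_{}{}".format(major, minor)


if __name__ == "__main__":
    # Illustrative line, not real captured output from the reporter's machine.
    sample = "  CUDA Capability Major/Minor version number:    6.1"
    major, minor = capability_from_device_query(sample)
    print(sm_name(major, minor))
```

On a machine with several different GPUs, only the first reported device is read here, so the remaining devices would need to be checked separately.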

@Yancey1989
Contributor

Yancey1989 commented Nov 17, 2017

@shiyazhou121

  1. You can try using nvidia-docker to start the GPU image, which shields you from many environment and driver problems.
  2. If installing nvidia-docker is a problem, you can instead install the latest whl package built on CI (built with cuda8 cudnn7): http://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn5cp27cp27mu/.lastSuccessful/paddlepaddle-0.10.0-cp27-cp27mu-linux_x86_64.whl
    Some in-house python environments need the cp27m format, which you can download here: http://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn5cp27cp27mu/.lastSuccessful/paddlepaddle-0.10.0-cp27-cp27m-linux_x86_64.whl

@Yancey1989
Contributor

This issue has had no updates for a long time, so I am closing it for now. If you have any questions, feel free to reopen it at any time.
