ubuntu 14.04, CUDA 7.0: make runtest core dumped #2432

Closed
iwen-kang opened this issue May 8, 2015 · 7 comments

@iwen-kang

Setup: Ubuntu 14.04, CUDA 7.0, cudnn-6.5-linux-x64-v2, 2x GeForce GTX 970 (4 GB each), default ATLAS.
I cloned Caffe master; make all and make test succeed, but make runtest aborts with a core dump (see below).
Any idea what causes this, or how to work around it?
...
[----------] 3 tests from DeconvolutionLayerTest/3, where TypeParam = caffe::DoubleGPU
[ RUN ] DeconvolutionLayerTest/3.TestSetup
[ OK ] DeconvolutionLayerTest/3.TestSetup (0 ms)
[ RUN ] DeconvolutionLayerTest/3.TestSimpleDeconvolution
[ OK ] DeconvolutionLayerTest/3.TestSimpleDeconvolution (0 ms)
[ RUN ] DeconvolutionLayerTest/3.TestGradient
F0508 13:09:00.207708 13415 math_functions.cu:81] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x2afa1b93bdaa (unknown)
@ 0x2afa1b93bce4 (unknown)
@ 0x2afa1b93b6e6 (unknown)
@ 0x2afa1b93e687 (unknown)
@ 0x2afa1d390478 caffe::caffe_gpu_memcpy()
@ 0x2afa1d33d18e caffe::SyncedMemory::gpu_data()
@ 0x2afa1d33e472 caffe::Blob<>::gpu_data()
@ 0x2afa1d372861 caffe::DeconvolutionLayer<>::Forward_gpu()
@ 0x456e91 caffe::Layer<>::Forward()
@ 0x45a555 caffe::GradientChecker<>::CheckGradientSingle()
@ 0x45d39b caffe::GradientChecker<>::CheckGradientExhaustive()
@ 0x6e3ccc caffe::DeconvolutionLayerTest_TestGradient_Test<>::TestBody()
@ 0x721b83 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x7187c7 testing::Test::Run()
@ 0x71886e testing::TestInfo::Run()
@ 0x718975 testing::TestCase::Run()
@ 0x71bcb8 testing::internal::UnitTestImpl::RunAllTests()
@ 0x71bf47 testing::UnitTest::Run()
@ 0x44607a main
@ 0x2afa1df61ec5 (unknown)
@ 0x44b3a9 (unknown)
@ (nil) (unknown)
make: *** [runtest] Aborted (core dumped)

@iwen-kang
Author

Note: several tests fail before 'make runtest' dumps core, and they all seem to involve caffe::FloatGPU or caffe::DoubleGPU:
[ FAILED ] SigmoidCrossEntropyLossLayerTest/2.TestGradient, where TypeParam = caffe::FloatGPU (12 ms)
[ FAILED ] NeuronLayerTest/3.TestExpGradientBase2Scale3, where TypeParam = caffe::DoubleGPU (20 ms)
[ FAILED ] NeuronLayerTest/3.TestExpLayerBase2Scale3, where TypeParam = caffe::DoubleGPU (8 ms)
debug: (top_id, top_data_id, blob_id, feat_id)=0,119,0,119; feat = -0.75706510112861081; objective+ = -nan; objective- = -nan
[ FAILED ] NeuronLayerTest/3.TestSigmoidGradient, where TypeParam = caffe::DoubleGPU (6039 ms)
debug: (top_id, top_data_id, blob_id, feat_id)=0,118,0,118; feat = 2.0560870934207354; objective+ = -2.1944496275174755e+304; objective- = -2.1944496275174755e+304
[ FAILED ] NeuronLayerTest/3.TestExpGradientBase2Shift1, where TypeParam = caffe::DoubleGPU (22 ms)
...
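
For readers unfamiliar with the debug lines above: objective+ and objective- look like the two sides of a finite-difference gradient estimate. Below is a minimal Python sketch of a generic central-difference check. It is not Caffe's GradientChecker code; the function names, step size, and threshold are illustrative assumptions. It shows why a NaN objective makes such a check fail no matter what the analytic gradient is.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def check_gradient(f, grad, x, step=1e-2, threshold=1e-3):
    """Compare an analytic gradient with a central-difference estimate.

    step and threshold are illustrative values, not Caffe's defaults.
    """
    objective_plus = f(x + step)
    objective_minus = f(x - step)
    estimated = (objective_plus - objective_minus) / (2.0 * step)
    computed = grad(x)
    scale = max(abs(computed), abs(estimated), 1.0)
    # Any comparison involving NaN is False, so a NaN objective
    # (as in "objective+ = -nan") can never pass this check.
    return abs(computed - estimated) <= threshold * scale


if __name__ == "__main__":
    # Healthy case: the analytic sigmoid gradient agrees with the estimate.
    print(check_gradient(sigmoid, sigmoid_grad, 2.0560870934207354))
    # Broken case: the forward pass returns NaN, so the check fails.
    print(check_gradient(lambda x: float("nan"), sigmoid_grad, -0.757))
```

Under this reading, NaN or absurdly large objectives point at the forward computation on the GPU returning garbage, rather than at the gradient formula itself.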

@dberwick

FWIW, I see a similar situation (Ubuntu 14.04, CUDA 7.0, cudnn-6.5-linux-x64-v2, default ATLAS; caffe master cloned; make all and make test succeed, but make runtest dumps core), though with a different, single GPU (GTX 980M):
...
Value of: in_data_a[i] * in_data_b[i] * in_data_c[i]
Actual: 0.134864
Expected: data[i]
Which is: 0
[ FAILED ] EltwiseLayerTest/2.TestProd, where TypeParam = caffe::FloatGPU (3 ms)
[ RUN ] EltwiseLayerTest/2.TestSumCoeffGradient
[ OK ] EltwiseLayerTest/2.TestSumCoeffGradient (33 ms)
[ RUN ] EltwiseLayerTest/2.TestSetUp
[ OK ] EltwiseLayerTest/2.TestSetUp (0 ms)
[ RUN ] EltwiseLayerTest/2.TestSum
[ OK ] EltwiseLayerTest/2.TestSum (0 ms)
[ RUN ] EltwiseLayerTest/2.TestMaxGradient
[ OK ] EltwiseLayerTest/2.TestMaxGradient (14 ms)
[ RUN ] EltwiseLayerTest/2.TestSumGradient
[ OK ] EltwiseLayerTest/2.TestSumGradient (32 ms)
[----------] 10 tests from EltwiseLayerTest/2 (17354 ms total)

[----------] 8 tests from CuDNNNeuronLayerTest/0, where TypeParam = float
[ RUN ] CuDNNNeuronLayerTest/0.TestReLUGradientWithNegativeSlopeCuDNN
F0512 12:42:15.196022 21846 relu_layer.cu:27] Check failed: error == cudaSuccess (8 vs. 0) invalid device function
*** Check failure stack trace: ***
@ 0x2b82c00fddaa (unknown)
@ 0x2b82c00fdce4 (unknown)
@ 0x2b82c00fd6e6 (unknown)
@ 0x2b82c0100687 (unknown)
@ 0x2b82c1bbbd20 caffe::ReLULayer<>::Forward_gpu()
@ 0x2b82c1ba83eb caffe::CuDNNReLULayer<>::Forward_gpu()
@ 0x45a962 caffe::Layer<>::Forward()
@ 0x45ff2a caffe::GradientChecker<>::CheckGradientSingle()
@ 0x541344 caffe::GradientChecker<>::CheckGradientEltwise()
@ 0x57fe9f caffe::CuDNNNeuronLayerTest_TestReLUGradientWithNegativeSlopeCuDNN_Test<>::TestBody()
@ 0x7216e3 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x718327 testing::Test::Run()
@ 0x7183ce testing::TestInfo::Run()
@ 0x7184d5 testing::TestCase::Run()
@ 0x71b818 testing::internal::UnitTestImpl::RunAllTests()
@ 0x71baa7 testing::UnitTest::Run()
@ 0x4460ba main
@ 0x2b82c2703ec5 (unknown)
@ 0x44b3e9 (unknown)
@ (nil) (unknown)
make: *** [runtest] Aborted (core dumped)

@dberwick

OK, in my case the issue was my own fault. Looking around a bit more, I found someone pointing out that the "invalid device function" error indicates a mismatch between the CUDA build settings and the GPU. I had misread the comments in Makefile.config: although I had installed CUDA 7.0, I commented out the -gencode arch=compute_50 lines because my GPU only supports "CUDA 5.2" (which is its compute capability, not a toolkit version). Once I put those lines back in, all tests passed.
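
The rule behind this fix can be sketched in Python. The helper names are hypothetical, and the parser only understands the single-target code=sm_XX / code=compute_XX form used in Makefile.config. It assumes the usual CUDA compatibility rules: binary code built for sm_Xy runs on devices with the same major version and a minor version of at least y, and PTX emitted for compute_XX can be JIT-compiled on any device of capability XX or higher.

```python
import re


def parse_gencode(cuda_arch):
    """Return (sass, ptx): sets of capabilities such as 35 or 50.

    code=sm_XX embeds binary code (SASS); code=compute_XX embeds PTX.
    """
    sass, ptx = set(), set()
    for kind, num in re.findall(r"code=(sm|compute)_(\d+)", cuda_arch):
        (sass if kind == "sm" else ptx).add(int(num))
    return sass, ptx


def device_supported(cuda_arch, capability):
    """capability is major*10 + minor, e.g. 52 for compute capability 5.2."""
    sass, ptx = parse_gencode(cuda_arch)
    major, minor = divmod(capability, 10)
    binary_ok = any(s // 10 == major and s % 10 <= minor for s in sass)
    jit_ok = any(p <= capability for p in ptx)
    return binary_ok or jit_ok


if __name__ == "__main__":
    trimmed = (
        "-gencode arch=compute_20,code=sm_20 "
        "-gencode arch=compute_20,code=sm_21 "
        "-gencode arch=compute_30,code=sm_30 "
        "-gencode arch=compute_35,code=sm_35"
    )
    full = trimmed + (
        " -gencode arch=compute_50,code=sm_50"
        " -gencode arch=compute_50,code=compute_50"
    )
    print(device_supported(trimmed, 52))  # False: no 5.x binary and no PTX
    print(device_supported(full, 52))     # True: sm_50 binary runs on 5.2
```

With the *_50 lines commented out, the build contains neither a 5.x binary nor any PTX, which is consistent with kernels failing to launch on a 5.2 device with "invalid device function".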

@n-zhang

n-zhang commented May 12, 2015

Please ask usage questions on the caffe-users list. Thanks!

@n-zhang n-zhang closed this as completed May 12, 2015
@iwen-kang
Author

Hi dberwick,
Could you please share your Makefile.config? I did not comment out the *_50 lines at all, yet make runtest still dumps core. I think my problem is different from yours: you hit "invalid device function", whereas I get the math_functions.cu:81 error:
F0508 13:09:00.207708 13415 math_functions.cu:81] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure

Hi Ning,
A similar math_functions.cu:81 issue was reported on caffe-users (link below), but it seems no one there knows a solution:
https://groups.google.com/forum/#!msg/caffe-users/-XcvUmFGJco/7hljfiTRRkIJ

Could you please advise whether a fix or workaround is available?
Thanks!
Iwen
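
To make the difference between the two failures in this thread easier to see, here is a small Python sketch that pulls the source location, CUDA error code, and message out of a "Check failed" line. The regular expression is an assumption fitted to the two log lines quoted in this thread, not a general log parser, and the function name is hypothetical.

```python
import re

# Fitted to lines such as:
# F0508 13:09:00.207708 13415 math_functions.cu:81] Check failed: ...
CHECK_RE = re.compile(
    r"^F\d{4} [\d:.]+\s+\d+ (?P<file>[\w.]+):(?P<line>\d+)\] "
    r"Check failed: error == cudaSuccess "
    r"\((?P<code>\d+) vs\. 0\)\s+(?P<msg>.+)$"
)


def parse_cuda_check_failure(log_line):
    """Return a dict describing the failure, or None if the line differs."""
    m = CHECK_RE.match(log_line.strip())
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "line": int(m.group("line")),
        "code": int(m.group("code")),
        "message": m.group("msg"),
    }


if __name__ == "__main__":
    lines = [
        "F0508 13:09:00.207708 13415 math_functions.cu:81] Check failed: "
        "error == cudaSuccess (4 vs. 0) unspecified launch failure",
        "F0512 12:42:15.196022 21846 relu_layer.cu:27] Check failed: "
        "error == cudaSuccess (8 vs. 0) invalid device function",
    ]
    for line in lines:
        print(parse_cuda_check_failure(line))
```

The two reports differ in both error code (4 vs. 8) and message, which supports treating them as separate problems.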

@dberwick

## Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!

# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1

# CPU-only switch (uncomment to build without GPU support).
# CPU_ONLY := 1

# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
# CUSTOM_CXX := g++

# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
# CUDA_DIR := /usr

# CUDA architecture setting: going with all of them.
# For CUDA < 6.0, comment the *_50 lines for compatibility.
CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
    -gencode arch=compute_20,code=sm_21 \
    -gencode arch=compute_30,code=sm_30 \
    -gencode arch=compute_35,code=sm_35 \
    -gencode arch=compute_50,code=sm_50 \
    -gencode arch=compute_50,code=compute_50

# BLAS choice:
# atlas for ATLAS (default)
# mkl for MKL
# open for OpenBlas
BLAS := atlas
# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# Leave commented to accept the defaults for your choice of BLAS
# (which should work)!
# BLAS_INCLUDE := /path/to/your/blas
# BLAS_LIB := /path/to/your/blas

# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
# MATLAB_DIR := /usr/local
# MATLAB_DIR := /Applications/MATLAB_R2012b.app

# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
    /usr/lib/python2.7/dist-packages/numpy/core/include
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
# ANACONDA_HOME := $(HOME)/anaconda
# PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
    # $(ANACONDA_HOME)/include/python2.7 \
    # $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \

# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
# PYTHON_LIB := $(ANACONDA_HOME)/lib

# Uncomment to support layers written in Python (will link against Python libs)
# WITH_PYTHON_LAYER := 1

# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
#LIBRARY_DIRS := $(PYTHON_LIB) /usr/lib
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib/cudnn /usr/lib

# Uncomment to use pkg-config to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
# USE_PKG_CONFIG := 1

BUILD_DIR := build
DISTRIBUTE_DIR := distribute

# Uncomment for debugging. Does not work on OSX due to #171
# DEBUG := 1

# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0

# enable pretty build (comment to see full commands)
Q ?= @
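
When comparing two Makefile.config files, it can help to list only the settings that are actually active. The following Python sketch does that under simplifying assumptions: it folds backslash continuations the way make does (so a commented line ending in a backslash keeps the next line commented too), ignores trailing inline comments, and lets a later assignment overwrite an earlier one. The function name is hypothetical.

```python
import re

ASSIGN_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*(:=|\?=|\+=|=)\s*(.*)$")


def active_settings(text):
    """Map variable name -> value for uncommented assignments."""
    # Fold "backslash + newline" continuations into a single logical line.
    joined = re.sub(r"\\\n[ \t]*", " ", text)
    settings = {}
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including folded comments)
        m = ASSIGN_RE.match(line)
        if m:
            # Collapse runs of whitespace left over from the folding.
            settings[m.group(1)] = " ".join(m.group(3).split())
    return settings


if __name__ == "__main__":
    sample = (
        "# CPU_ONLY := 1\n"
        "CUDA_ARCH := -gencode arch=compute_35,code=sm_35 \\\n"
        "    -gencode arch=compute_50,code=sm_50\n"
        "TEST_GPUID := 0\n"
    )
    for name, value in active_settings(sample).items():
        print(name, "=", value)
```

Running it on two configs and diffing the resulting dictionaries would show, for example, whether the *_50 entries are present in CUDA_ARCH.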

@iwen-kang
Author

It is working now. I did nothing new except reboot my Ubuntu desktop; everything started to work after that!
Correct me if I'm wrong, but I don't recall any of the upstream installation instructions (CUDA, Boost, OpenCV, etc.) saying that a reboot is required. Perhaps it is specific to my hardware? Anyway, I'm glad to report that Caffe runs well on my platform now :-)

Thanks,
Iwen
