Segmentation fault during cudnnCreate #6205

Open
iamarcel opened this issue Jan 30, 2018 · 0 comments

Issue summary

The Caffe tests crash with a segmentation fault as soon as cudnnCreate is called. I compiled the latest master (08a95a4) using CMake with the following options:

-DBLAS=open 
-DUSE_LMDB=off 
-DUSE_LEVELDB=off 
-DCUDNN_INCLUDE=/cudnn/cudnn5.1/include 
-DCUDNN_LIBRARY=/cudnn/cudnn5.1/lib64/libcudnn.so

Here's the backtrace with a debug build:

$ gdb ./test/test.testbin-d
(gdb) catch signal SIGSEGV
(gdb) r
...
Cuda number of devices: 2
Current device id: 0
Current device name: Tesla K40c
[==========] Running 2073 tests from 279 test cases.
... (tests from PoolingLayerTest/0,1,2,3 that all succeed)

[----------] 8 tests from CuDNNPoolingLayerTest/0, where TypeParam = float
[ RUN      ] CuDNNPoolingLayerTest/0.TestSetupCuDNN

Thread 1 "test.testbin-d" hit Catchpoint 1 (signal SIGSEGV), 0x00007fffaec999ce in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
(gdb) bt
#0  0x00007fffaec999ce in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#1  0x00007fffaef2dc0f in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#2  0x00007fffaef2dc69 in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#3  0x00007fffaeb648fd in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#4  0x00007fffaeb6c53b in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#5  0x00007fffaf20dced in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#6  0x00007fffaeb6f240 in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#7  0x00007fffaeb709e3 in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#8  0x00007fffaeb6767c in __cuda_CallJitEntryPoint () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#9  0x00007fffaebe3366 in nvPTXCompilerCompile () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#10 0x00007fffb068d969 in fatBinaryCtl_Compile () from /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.111
#11 0x00007fffb0aad02f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#12 0x00007fffb0aadb33 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#13 0x00007fffb09fb9ed in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007fffb09fbd00 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007fffe927024d in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#16 0x00007fffe9265c70 in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#17 0x00007fffe92734c6 in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#18 0x00007fffe9276ad1 in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#19 0x00007fffe926a63c in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#20 0x00007fffe9253cc2 in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#21 0x00007fffe928c38f in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#22 0x00007fffe8d97c54 in cudnnCreate () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#23 0x00007ffff77d90f7 in caffe::CuDNNPoolingLayer<float>::LayerSetUp (this=0x7fffffffdb80, bottom=std::vector of length 1, capacity 1 = {...}, top=std::vector of length 1, capacity 1 = {...}) at /home/mlsamyn/caffe/src/caffe/layers/cudnn_pooling_layer.cpp:12
#24 0x0000000000ad1546 in caffe::Layer<float>::SetUp (this=0x7fffffffdb80, bottom=std::vector of length 1, capacity 1 = {...}, top=std::vector of length 1, capacity 1 = {...}) at /home/mlsamyn/caffe/include/caffe/layer.hpp:70
#25 0x0000000000ac09a6 in caffe::CuDNNPoolingLayerTest_TestSetupCuDNN_Test<float>::TestBody (this=0x1742ed0) at /home/mlsamyn/caffe/src/caffe/test/test_pooling_layer.cpp:974
#26 0x0000000000fa881a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x1742ed0, method=&virtual testing::Test::TestBody(), location=0x1075473 "the test body") at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3393
#27 0x0000000000fa3927 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x1742ed0, method=&virtual testing::Test::TestBody(), location=0x1075473 "the test body") at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3429
#28 0x0000000000f8f262 in testing::Test::Run (this=0x1742ed0) at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3465
#29 0x0000000000f8fa54 in testing::TestInfo::Run (this=0x161bc40) at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3641
#30 0x0000000000f9009f in testing::TestCase::Run (this=0x15f8140) at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3748
#31 0x0000000000f9554d in testing::internal::UnitTestImpl::RunAllTests (this=0x1612f20) at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:5540
#32 0x0000000000fa9a1d in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x1612f20, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0xf9528e <testing::internal::UnitTestImpl::RunAllTests()>, 
    location=0x1075f30 "auxiliary test code (environments or event listeners)") at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3393
#33 0x0000000000fa461a in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x1612f20, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0xf9528e <testing::internal::UnitTestImpl::RunAllTests()>, 
    location=0x1075f30 "auxiliary test code (environments or event listeners)") at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:3429
#34 0x0000000000f9415c in testing::UnitTest::Run (this=0x1466f00 <testing::UnitTest::GetInstance()::instance>) at /home/mlsamyn/caffe/src/gtest/gtest-all.cpp:5174
#35 0x0000000000e77f28 in main (argc=1, argv=0x7fffffffe3d8) at /home/mlsamyn/caffe/src/caffe/test/test_caffe_main.cpp:37
(gdb) f 23
#23 0x00007ffff77d90f7 in caffe::CuDNNPoolingLayer<float>::LayerSetUp (this=0x7fffffffdb80, bottom=std::vector of length 1, capacity 1 = {...}, top=std::vector of length 1, capacity 1 = {...}) at /home/mlsamyn/caffe/src/caffe/layers/cudnn_pooling_layer.cpp:12
12	  CUDNN_CHECK(cudnnCreate(&handle_));
(gdb) p *this
$1 = {> = {> = {_vptr.Layer = 0x1404f68 +16>, layer_param_ = { = {}, static kNameFieldNumber = 1, static kTypeFieldNumber = 2, static kBottomFieldNumber = 3, 
        static kTopFieldNumber = 4, static kPhaseFieldNumber = 10, static kLossWeightFieldNumber = 5, static kParamFieldNumber = 6, static kBlobsFieldNumber = 7, static kPropagateDownFieldNumber = 11, static kIncludeFieldNumber = 8, static kExcludeFieldNumber = 9, static kTransformParamFieldNumber = 100, 
        static kLossParamFieldNumber = 101, static kAccuracyParamFieldNumber = 102, static kArgmaxParamFieldNumber = 103, static kBatchNormParamFieldNumber = 139, static kBiasParamFieldNumber = 141, static kConcatParamFieldNumber = 104, static kContrastiveLossParamFieldNumber = 105, 
        static kConvolutionParamFieldNumber = 106, static kCropParamFieldNumber = 144, static kDataParamFieldNumber = 107, static kDropoutParamFieldNumber = 108, static kDummyDataParamFieldNumber = 109, static kEltwiseParamFieldNumber = 110, static kEluParamFieldNumber = 140, static kEmbedParamFieldNumber = 137, 
        static kExpParamFieldNumber = 111, static kFlattenParamFieldNumber = 135, static kHdf5DataParamFieldNumber = 112, static kHdf5OutputParamFieldNumber = 113, static kHingeLossParamFieldNumber = 114, static kImageDataParamFieldNumber = 115, static kInfogainLossParamFieldNumber = 116, 
        static kInnerProductParamFieldNumber = 117, static kInputParamFieldNumber = 143, static kLogParamFieldNumber = 134, static kLrnParamFieldNumber = 118, static kMemoryDataParamFieldNumber = 119, static kMvnParamFieldNumber = 120, static kParameterParamFieldNumber = 145, static kPoolingParamFieldNumber = 121, 
        static kPowerParamFieldNumber = 122, static kPreluParamFieldNumber = 131, static kPythonParamFieldNumber = 130, static kRecurrentParamFieldNumber = 146, static kReductionParamFieldNumber = 136, static kReluParamFieldNumber = 123, static kReshapeParamFieldNumber = 133, static kScaleParamFieldNumber = 142, 
        static kSigmoidParamFieldNumber = 124, static kSoftmaxParamFieldNumber = 125, static kSppParamFieldNumber = 132, static kSliceParamFieldNumber = 126, static kTanhParamFieldNumber = 127, static kThresholdParamFieldNumber = 128, static kTileParamFieldNumber = 138, static kWindowDataParamFieldNumber = 129, 
        _unknown_fields_ = {fields_ = 0x0}, _has_bits_ = {0, 512}, name_ = 0x1607c90, type_ = 0x1607c90, bottom_ = { = {static kInitialSize = 0, elements_ = 0x0, current_size_ = 0, allocated_size_ = 0, total_size_ = 0}, }, 
        top_ = { = {static kInitialSize = 0, elements_ = 0x0, current_size_ = 0, allocated_size_ = 0, total_size_ = 0}, }, loss_weight_ = {static kInitialSize = , elements_ = 0x0, current_size_ = 0, total_size_ = 0}, 
        param_ = { = {static kInitialSize = 0, elements_ = 0x0, current_size_ = 0, allocated_size_ = 0, total_size_ = 0}, }, blobs_ = { = {static kInitialSize = 0, elements_ = 0x0, current_size_ = 0, 
            allocated_size_ = 0, total_size_ = 0}, }, propagate_down_ = {static kInitialSize = , elements_ = 0x0, current_size_ = 0, total_size_ = 0}, include_ = { = {static kInitialSize = 0, elements_ = 0x0, current_size_ = 0, 
            allocated_size_ = 0, total_size_ = 0}, }, exclude_ = { = {static kInitialSize = 0, elements_ = 0x0, current_size_ = 0, allocated_size_ = 0, total_size_ = 0}, }, transform_param_ = 0x0, loss_param_ = 0x0, 
        accuracy_param_ = 0x0, argmax_param_ = 0x0, batch_norm_param_ = 0x0, bias_param_ = 0x0, concat_param_ = 0x0, contrastive_loss_param_ = 0x0, convolution_param_ = 0x0, crop_param_ = 0x0, data_param_ = 0x0, dropout_param_ = 0x0, dummy_data_param_ = 0x0, eltwise_param_ = 0x0, elu_param_ = 0x0, 
        embed_param_ = 0x0, exp_param_ = 0x0, flatten_param_ = 0x0, hdf5_data_param_ = 0x0, hdf5_output_param_ = 0x0, hinge_loss_param_ = 0x0, image_data_param_ = 0x0, infogain_loss_param_ = 0x0, inner_product_param_ = 0x0, input_param_ = 0x0, log_param_ = 0x0, lrn_param_ = 0x0, memory_data_param_ = 0x0, 
        mvn_param_ = 0x0, parameter_param_ = 0x0, pooling_param_ = 0x6fb0ff0, power_param_ = 0x0, prelu_param_ = 0x0, python_param_ = 0x0, recurrent_param_ = 0x0, reduction_param_ = 0x0, relu_param_ = 0x0, reshape_param_ = 0x0, scale_param_ = 0x0, sigmoid_param_ = 0x0, softmax_param_ = 0x0, spp_param_ = 0x0, 
        slice_param_ = 0x0, tanh_param_ = 0x0, threshold_param_ = 0x0, tile_param_ = 0x0, window_data_param_ = 0x0, phase_ = 0, _cached_size_ = 0, static default_instance_ = 0x16196e0}, phase_ = caffe::TRAIN, blobs_ = std::vector of length 0, capacity 0, 
      param_propagate_down_ = std::vector of length 0, capacity 0, loss_ = std::vector of length 0, capacity 0}, kernel_h_ = 3, kernel_w_ = 3, stride_h_ = 2, stride_w_ = 2, pad_h_ = 0, pad_w_ = 0, channels_ = 0, height_ = 0, width_ = -8560, pooled_height_ = 32767, pooled_width_ = -141436653, 
    global_pooling_ = false, rand_idx_ = {data_ = {px = 0x0, pn = {pi_ = 0x0}}, diff_ = {px = 0x0, pn = {pi_ = 0x0}}, shape_data_ = {px = 0x0, pn = {pi_ = 0x0}}, shape_ = std::vector of length 0, capacity 0, count_ = 0, capacity_ = 0}, max_idx_ = {data_ = {px = 0x0, pn = {pi_ = 0x0}}, diff_ = {px = 0x0, pn = {
          pi_ = 0x0}}, shape_data_ = {px = 0x0, pn = {pi_ = 0x0}}, shape_ = std::vector of length 0, capacity 0, count_ = 0, capacity_ = 0}}, handles_setup_ = false, handle_ = 0x0, bottom_desc_ = 0x3f800000, top_desc_ = 0xffffffff3f800000, pooling_desc_ = 0x7fffffffdf70, mode_ = (unknown: 4294958944)}
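
To make the call chain in the backtrace easier to read, the frames can be grouped by the shared library they come from. The following is a minimal Python sketch, not part of Caffe or gdb: `frames_by_library`, `FRAME_RE` and `SAMPLE` are hypothetical names, and the regular expression assumes gdb's `#N 0xADDR in func () from /path/lib.so` frame layout as it appears in this report.

```python
import re
from itertools import groupby

# A few frames copied verbatim from the backtrace in this report.
SAMPLE = """\
#0  0x00007fffaec999ce in ?? () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#9  0x00007fffaebe3366 in nvPTXCompilerCompile () from /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.1
#10 0x00007fffb068d969 in fatBinaryCtl_Compile () from /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.111
#11 0x00007fffb0aad02f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007fffb09fbd00 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007fffe927024d in ?? () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
#22 0x00007fffe8d97c54 in cudnnCreate () from /cudnn/cudnn5.1/lib64/libcudnn.so.5
"""

# Matches only frames that gdb attributes to a shared library ("... from <path>").
# Frames with debug info end in "at <file>:<line>" instead and are skipped.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in .* from (\S+)$")


def frames_by_library(backtrace):
    """Return [(library_basename, first_frame, last_frame), ...] in stack order."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            library = match.group(2).rsplit("/", 1)[-1]
            frames.append((int(match.group(1)), library))

    runs = []
    # groupby merges only consecutive frames that share a library name.
    for library, group in groupby(frames, key=lambda frame: frame[1]):
        numbers = [number for number, _ in group]
        runs.append((library, numbers[0], numbers[-1]))
    return runs


if __name__ == "__main__":
    for library, first, last in frames_by_library(SAMPLE):
        print(f"#{first}-#{last}: {library}")
```

Applied to the full backtrace, this grouping shows the crash path going from cudnnCreate in libcudnn.so.5 through libcuda.so.1 and libnvidia-fatbinaryloader into libnvidia-ptxjitcompiler.so.1, where the SIGSEGV is raised.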

Steps to reproduce

$ git clone https://github.com/BVLC/caffe.git
$ cd caffe
$ mkdir build && cd build
$ cmake .. -DBLAS=open -DUSE_LMDB=off -DUSE_LEVELDB=off -DCUDNN_INCLUDE=/cudnn/cudnn5.1/include -DCUDNN_LIBRARY=/cudnn/cudnn5.1/lib64/libcudnn.so
$ make all
$ make runtest
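
Since `make runtest` runs the whole suite, it may be quicker to run only the failing test. The sketch below is a hedged Python helper, not something shipped with Caffe: the function names are hypothetical, and the binary path `./test/test.testbin-d` is an assumption taken from the gdb session in this report (a non-debug build will likely use a different name). It relies on GoogleTest's `--gtest_filter` flag and on `subprocess` reporting a child killed by signal N as return code -N on POSIX systems.

```python
import os
import signal
import subprocess

# Assumed path (from the gdb session in this report) and the test name from the log.
TEST_BINARY = "./test/test.testbin-d"
FAILING_TEST = "CuDNNPoolingLayerTest/0.TestSetupCuDNN"


def gtest_command(binary, test_filter):
    """Build the argv that runs only the tests matching test_filter."""
    return [binary, "--gtest_filter=" + test_filter]


def died_from_segfault(returncode):
    """On POSIX, subprocess reports death by signal N as returncode -N."""
    return returncode == -signal.SIGSEGV


def run_failing_test(binary=TEST_BINARY, test_filter=FAILING_TEST):
    """Run the single test.

    Returns None if the binary does not exist, otherwise True when the
    process was killed by SIGSEGV and False for any other exit status.
    """
    if not os.path.exists(binary):
        return None
    result = subprocess.run(gtest_command(binary, test_filter))
    return died_from_segfault(result.returncode)


if __name__ == "__main__":
    print("command:", " ".join(gtest_command(TEST_BINARY, FAILING_TEST)))
    print("segfault reproduced:", run_failing_test())
```

Run from the build directory, this prints whether the single test still dies with SIGSEGV, which makes it easier to compare different cuDNN or driver setups.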

Your system configuration

Operating system

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.2 LTS
Release:	16.04
Codename:	xenial

Compiler: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
CUDA version (if applicable): 8.0.44
CUDNN version (if applicable): 5.1
BLAS: open
GPU configuration:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 00000000:02:00.0 Off |                  Off |
| 23%   42C    P8    20W / 235W |     18MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:03:00.0 Off |                  Off |
| 23%   41C    P8    20W / 235W |      1MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+