cuDNN bug in Caffe with "group" in conv layer (misaligned address) #5729

Open
svobora opened this Issue Jun 30, 2017 · 14 comments

svobora commented Jun 30, 2017

Issue summary

Using "group" parameter in any convolution layer, with CUDNN, I get "misaligned address" error when the training phase starts. The (first?) test phase is not affected. The error disappears when I build caffe with CUDA but without CUDNN. However such a training is 2x slower...

Steps to reproduce

Check out the repo, build with cuDNN, use the "group" parameter of a Convolution layer in some net, and run training.

Your system configuration

Operating system: Ubuntu 16.04
Compiler: gcc 5.4
CUDA version (if applicable): 8
CUDNN version (if applicable): 5.1
BLAS: open

deepali-c commented Jun 30, 2017

Could you please post the error log?

svobora commented Jun 30, 2017

F0630 15:37:53.939421 12138 benchmark.cpp:92] Check failed: error == cudaSuccess (74 vs. 0) misaligned address
*** Check failure stack trace: ***
F0630 15:37:53.939426 12256 math_functions.cu:79] Check failed: error == cudaSuccess (74 vs. 0) misaligned address
*** Check failure stack trace: ***
@ 0x7f71ff1c55cd google::LogMessage::Fail()
@ 0x7f71ff1c55cd google::LogMessage::Fail()
@ 0x7f71ff1c7433 google::LogMessage::SendToLog()
@ 0x7f71ff1c7433 google::LogMessage::SendToLog()
@ 0x7f71ff1c515b google::LogMessage::Flush()
@ 0x7f71ff1c515b google::LogMessage::Flush()
@ 0x7f71ff1c7e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71ff1c7e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71ff8101da caffe::Timer::MilliSeconds()
@ 0x7f71ff9bdc0a caffe::caffe_gpu_memcpy()
@ 0x7f71ff82eb1d caffe::SyncedMemory::mutable_cpu_data()
@ 0x7f71ff80f73a caffe::Timer::Seconds()
@ 0x7f71ff8161f2 caffe::Blob<>::mutable_cpu_data()
@ 0x7f71ff98e85d caffe::Solver<>::Step()
@ 0x7f71ff916144 caffe::ImageDataLayer<>::load_batch()
@ 0x7f71ff98f26a caffe::Solver<>::Solve()
@ 0x40e0ba train()
@ 0x40a687 main
@ 0x7f71ff8a1d37 caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f71fe135830 __libc_start_main
@ 0x40b029 _start
@ (nil) (unknown)

azhangwei commented Jul 4, 2017

I met the same issue: group size 2 is OK, but 3 or larger goes wrong.

ayushchopra96 commented Jul 21, 2017

I tried it and get the same error even after removing group. The network converges well without cuDNN, but slowly. Did you manage to fix the problem, @svobora?
What exactly is the root cause of the problem?

cateweb commented Jul 30, 2017

Same here. With group equal to the number of outputs it requires a huge amount of memory; even with the batch size reduced to the minimum, it crashes after 2000-3000 iterations with an out-of-memory error.

cpwei80 commented Aug 1, 2017

Consider using ConvolutionDepthwise (#5665) to replace convolution with the group parameter.

douzsh commented Jan 24, 2018

I got the same error with the following layer:

layer {
  name: "fc2_conv_b" type: "Convolution" bottom: "fc2_a" top: "fc2_b"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 64 pad: 1 kernel_size: 3 group: 4 stride: 1
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}

I don't know why, but with num_output 128 or kernel_size 5 there is no problem.
Can anyone fix this?

Noiredd (Member) commented Feb 26, 2018

I'm unable to reproduce the problem; more specific instructions are needed.
@douzsh What does the fc2_a blob look like? Grouped convolution will work or fail depending on its input, so its shape is important. Your layer worked for me when I shaped the blob to 1x16x32x32.

douzsh commented Feb 27, 2018

@Noiredd
The input blob's shape is 1x64x64x64.
I hope you can help me solve this issue.

Noiredd (Member) commented Feb 27, 2018

@douzsh I just ran this network with no problems, both from Python and with caffe time.
What's the output of caffe device_query -gpu=all on your machine? What are your CUDA and cuDNN versions?

douzsh commented Feb 27, 2018

@Noiredd
That's really good news. Can you tell me your CUDA and cuDNN versions too? I can upgrade my libraries for training.
BTW, my CUDA is 7.0 and cuDNN v6 is used for training.

Noiredd (Member) commented Feb 27, 2018

@douzsh I'm pretty sure you need at least CUDA 7.5 to run cuDNN 6 - see the download page for Nvidia cuDNN for a list of compatible releases. I ran my test on CUDA 9.0.176 and cuDNN 7.0.5.

hoszbh commented Apr 4, 2018

@svobora This is a bug in Caffe. I solved it by modifying cudnn_conv_layer.cpp to align the address to a multiple of 32.

You can insert two lines of code before size_t total_max_workspace = ... as follows:

       size_t m=32;
       max_workspace = (max_workspace + m-1) / m * m; //align address to be multiples of m

BTW, I think there is another bug: these lines should be put in an else block:

      for (int g = 0; g < (this->group_ * CUDNN_STREAMS_PER_GROUP); g++) {
        workspace[g] = reinterpret_cast<char *>(workspaceData)+g*max_workspace;
      }
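
Putting the two changes together, the relevant part of CuDNNConvolutionLayer<Dtype>::Reshape() would look roughly like this. This is only a sketch: the surrounding lines are paraphrased from the upstream cudnn_conv_layer.cpp of that time (workspaceData, workspaceSizeInBytes, workspace, group_ and CUDNN_STREAMS_PER_GROUP are the existing members), so the exact code in your checkout may differ:

      // get the maximum workspace needed by any single convolution call
      // (the total_workspace_* values are computed just above in Reshape)
      size_t max_workspace = std::max(total_workspace_fwd, total_workspace_bwd_data);
      max_workspace = std::max(max_workspace, total_workspace_bwd_filter);

      // fix 1: round the per-group workspace up to a multiple of 32 bytes so
      // that workspaceData + g * max_workspace is aligned for every group/stream
      const size_t m = 32;
      max_workspace = (max_workspace + m - 1) / m * m;

      size_t total_max_workspace = max_workspace *
          (this->group_ * CUDNN_STREAMS_PER_GROUP);

      if (total_max_workspace > workspaceSizeInBytes) {
        workspaceSizeInBytes = total_max_workspace;
        cudaFree(this->workspaceData);
        cudaError_t err = cudaMalloc(&(this->workspaceData), workspaceSizeInBytes);
        if (err != cudaSuccess) {
          // allocation failed: upstream falls back to zero-workspace algorithms
          // and NULLs out workspaceData and the per-group pointers here
        } else {
          // fix 2: set the per-group aliases only when the allocation succeeded,
          // so a failed cudaMalloc does not leave pointers offset from NULL
          for (int g = 0; g < (this->group_ * CUDNN_STREAMS_PER_GROUP); g++) {
            workspace[g] = reinterpret_cast<char *>(workspaceData) + g * max_workspace;
          }
        }
      }

The rounding presumably matters because the cuDNN kernels expect their workspace pointers to be aligned; when max_workspace is not a multiple of that alignment, the slices for g >= 1 start at odd offsets, which would explain the "misaligned address" error above.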

Prier commented May 3, 2018

@hoszbh Just wanted to confirm that your fix is working, thanks a lot. Do you know why this fix is not on master yet?
