cuDNN bug in Caffe with "group" in conv layer (misaligned address) #5729

Open
svobora opened this Issue Jun 30, 2017 · 14 comments

svobora commented Jun 30, 2017

Issue summary

Using "group" parameter in any convolution layer, with CUDNN, I get "misaligned address" error when the training phase starts. The (first?) test phase is not affected. The error disappears when I build caffe with CUDA but without CUDNN. However such a training is 2x slower...

Steps to reproduce

Check out the repo, build with cuDNN, use the "group" parameter of a Convolution layer in some net, and run training.

Your system configuration

Operating system: Ubuntu 16.04
Compiler: gcc 5.4
CUDA version (if applicable): 8
CUDNN version (if applicable): 5.1
BLAS: open

deepali-c commented Jun 30, 2017

Could you please post the error log?

svobora commented Jun 30, 2017

F0630 15:37:53.939421 12138 benchmark.cpp:92] Check failed: error == cudaSuccess (74 vs. 0) misaligned address
*** Check failure stack trace: ***
F0630 15:37:53.939426 12256 math_functions.cu:79] Check failed: error == cudaSuccess (74 vs. 0) misaligned address
*** Check failure stack trace: ***
@ 0x7f71ff1c55cd google::LogMessage::Fail()
@ 0x7f71ff1c55cd google::LogMessage::Fail()
@ 0x7f71ff1c7433 google::LogMessage::SendToLog()
@ 0x7f71ff1c7433 google::LogMessage::SendToLog()
@ 0x7f71ff1c515b google::LogMessage::Flush()
@ 0x7f71ff1c515b google::LogMessage::Flush()
@ 0x7f71ff1c7e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71ff1c7e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71ff8101da caffe::Timer::MilliSeconds()
@ 0x7f71ff9bdc0a caffe::caffe_gpu_memcpy()
@ 0x7f71ff82eb1d caffe::SyncedMemory::mutable_cpu_data()
@ 0x7f71ff80f73a caffe::Timer::Seconds()
@ 0x7f71ff8161f2 caffe::Blob<>::mutable_cpu_data()
@ 0x7f71ff98e85d caffe::Solver<>::Step()
@ 0x7f71ff916144 caffe::ImageDataLayer<>::load_batch()
@ 0x7f71ff98f26a caffe::Solver<>::Solve()
@ 0x40e0ba train()
@ 0x40a687 main
@ 0x7f71ff8a1d37 caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f71fe135830 __libc_start_main
@ 0x40b029 _start
@ (nil) (unknown)

azhangwei commented Jul 4, 2017

I met the same issue: group size 2 is OK, but 3 or larger goes wrong.

ayushchopra96 commented Jul 21, 2017

I tried it and get the same error even after removing group. The network converges well without cuDNN, but slowly. Did you manage to fix the problem, @svobora?
What exactly is the root cause of the problem?

cateweb commented Jul 30, 2017

Same here. With group equal to the number of outputs it requires a huge amount of memory; even with the batch size reduced to the minimum, it crashes after 2000-3000 iterations with an out-of-memory error.

cpwei80 commented Aug 1, 2017

Consider using ConvolutionDepthwise (#5665) to replace convolution with the group parameter.

douzsh commented Jan 24, 2018

I got the same error with the following layer:

layer {
  name: "fc2_conv_b" type: "Convolution" bottom: "fc2_a" top: "fc2_b"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 64 pad: 1 kernel_size: 3 group: 4 stride: 1
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}

I don't know why, but with num_output 128 or kernel_size 5 there is no problem.
Can anyone fix this?

Noiredd (Member) commented Feb 26, 2018

I'm unable to reproduce the problem; more specific instructions are needed.
@douzsh What does the fc2_a blob look like? Grouped convolution will work or fail depending on its input, so its shape is important. Your layer worked for me when I shaped the blob to 1x16x32x32.

douzsh commented Feb 27, 2018

@Noiredd
The input blob's shape is 1x64x64x64.
I hope you can help me solve this issue.

Noiredd (Member) commented Feb 27, 2018

@douzsh I just ran this network with no problems, both from Python and with caffe time.
What's the output of caffe device_query -gpu=all on your machine? What are your CUDA and cuDNN versions?

douzsh commented Feb 27, 2018

@Noiredd
That's really good news. Can you tell me your CUDA and cuDNN versions too? I can upgrade my libraries for training.
BTW, my CUDA is 7.0 and cuDNN v6 is used for training.

Noiredd (Member) commented Feb 27, 2018

@douzsh I'm pretty sure you need at least CUDA 7.5 to run cuDNN 6 - see the download page for Nvidia cuDNN for a list of compatible releases. I ran my test on CUDA 9.0.176 and cuDNN 7.0.5.

hoszbh commented Apr 4, 2018

@svobora This is a bug in Caffe. I solved it by modifying cudnn_conv_layer.cpp to align the address to a multiple of 32.

You can insert two lines of code before size_t total_max_workspace = ... as follows:

       size_t m=32;
       max_workspace = (max_workspace + m-1) / m * m; //align address to be multiples of m

BTW, I think there is another bug: these lines should be put in an else block:

      for (int g = 0; g < (this->group_ * CUDNN_STREAMS_PER_GROUP); g++) {
        workspace[g] = reinterpret_cast<char *>(workspaceData)+g*max_workspace;
      }
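
Putting the two changes together, the relevant part of CuDNNConvolutionLayer<Dtype>::Reshape() would look roughly like this. This is only a sketch: the surrounding lines are paraphrased from the upstream cudnn_conv_layer.cpp of that time (workspaceData, workspaceSizeInBytes, workspace, group_ and CUDNN_STREAMS_PER_GROUP are the existing members), so the exact code in your checkout may differ:

      // get the maximum workspace needed by any single convolution call
      // (the total_workspace_* values are computed just above in Reshape)
      size_t max_workspace = std::max(total_workspace_fwd, total_workspace_bwd_data);
      max_workspace = std::max(max_workspace, total_workspace_bwd_filter);

      // fix 1: round the per-group workspace up to a multiple of 32 bytes so
      // that workspaceData + g * max_workspace is aligned for every group/stream
      const size_t m = 32;
      max_workspace = (max_workspace + m - 1) / m * m;

      size_t total_max_workspace = max_workspace *
          (this->group_ * CUDNN_STREAMS_PER_GROUP);

      if (total_max_workspace > workspaceSizeInBytes) {
        workspaceSizeInBytes = total_max_workspace;
        cudaFree(this->workspaceData);
        cudaError_t err = cudaMalloc(&(this->workspaceData), workspaceSizeInBytes);
        if (err != cudaSuccess) {
          // allocation failed: upstream falls back to zero-workspace algorithms
          // and NULLs out workspaceData and the per-group pointers here
        } else {
          // fix 2: set the per-group aliases only when the allocation succeeded,
          // so a failed cudaMalloc does not leave pointers offset from NULL
          for (int g = 0; g < (this->group_ * CUDNN_STREAMS_PER_GROUP); g++) {
            workspace[g] = reinterpret_cast<char *>(workspaceData) + g * max_workspace;
          }
        }
      }

The rounding presumably matters because the cuDNN kernels expect their workspace pointers to be aligned; when max_workspace is not a multiple of that alignment, the slices for g >= 1 start at odd offsets, which would explain the "misaligned address" error above.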

Prier commented May 3, 2018

@hoszbh Just wanted to confirm that your fix is working, thanks a lot. Do you know why this fix is not on master yet?
