
cuDNN bug in Caffe with "group" in conv layer (misaligned address) #5729

Open
svobora opened this issue Jun 30, 2017 · 16 comments

@svobora

svobora commented Jun 30, 2017

Issue summary

Using "group" parameter in any convolution layer, with CUDNN, I get "misaligned address" error when the training phase starts. The (first?) test phase is not affected. The error disappears when I build caffe with CUDA but without CUDNN. However such a training is 2x slower...

Steps to reproduce

Check out the repo, build with CUDNN, use the "group" parameter of a Convolution layer in some net, and run training.

Your system configuration

Operating system: Ubuntu 16.04
Compiler: gcc 5.4
CUDA version (if applicable): 8
CUDNN version (if applicable): 5.1
BLAS: open

@deepali-c

Could you please post the error log?

@svobora

svobora commented Jun 30, 2017

F0630 15:37:53.939421 12138 benchmark.cpp:92] Check failed: error == cudaSuccess (74 vs. 0) misaligned address
*** Check failure stack trace: ***
F0630 15:37:53.939426 12256 math_functions.cu:79] Check failed: error == cudaSuccess (74 vs. 0) misaligned address
*** Check failure stack trace: ***
@ 0x7f71ff1c55cd google::LogMessage::Fail()
@ 0x7f71ff1c55cd google::LogMessage::Fail()
@ 0x7f71ff1c7433 google::LogMessage::SendToLog()
@ 0x7f71ff1c7433 google::LogMessage::SendToLog()
@ 0x7f71ff1c515b google::LogMessage::Flush()
@ 0x7f71ff1c515b google::LogMessage::Flush()
@ 0x7f71ff1c7e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71ff1c7e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71ff8101da caffe::Timer::MilliSeconds()
@ 0x7f71ff9bdc0a caffe::caffe_gpu_memcpy()
@ 0x7f71ff82eb1d caffe::SyncedMemory::mutable_cpu_data()
@ 0x7f71ff80f73a caffe::Timer::Seconds()
@ 0x7f71ff8161f2 caffe::Blob<>::mutable_cpu_data()
@ 0x7f71ff98e85d caffe::Solver<>::Step()
@ 0x7f71ff916144 caffe::ImageDataLayer<>::load_batch()
@ 0x7f71ff98f26a caffe::Solver<>::Solve()
@ 0x40e0ba train()
@ 0x40a687 main
@ 0x7f71ff8a1d37 caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f71fe135830 __libc_start_main
@ 0x40b029 _start
@ (nil) (unknown)

@azhangwei

I met the same issue: group size 2 is OK, but 3 or larger goes wrong.

@ayushchopra96

I tried it and get the same error even after removing group. The network converges well without cuDNN, but slowly. Did you manage to find any fix for the problem, @svobora?
What exactly is the root cause of the problem?

@cateweb

cateweb commented Jul 30, 2017

Same here. With group equal to the number of outputs it requires a huge amount of memory; even with the batch size reduced to the minimum, it crashes after 2000-3000 iterations with an out-of-memory error.

@cpwei80

cpwei80 commented Aug 1, 2017

Consider using ConvolutionDepthwise (#5665) to replace convolution with the group parameter.

@douzsh

douzsh commented Jan 24, 2018

I got the same error with the following layer

layer {
  name: "fc2_conv_b" type: "Convolution" bottom: "fc2_a" top: "fc2_b"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 64 pad: 1 kernel_size: 3 group: 4 stride: 1
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}

I don't know why, but with num_output 128 or kernel_size 5 there is no problem...
Can anyone fix this?

@Noiredd
Member

Noiredd commented Feb 26, 2018

I'm unable to reproduce the problem; more specific instructions are needed.
@douzsh What does the fc2_a blob look like? Grouped convolution will work or fail depending on its input, so its shape is important. Your layer worked for me when I shaped the blob to 1x16x32x32.

@douzsh

douzsh commented Feb 27, 2018

@Noiredd
The input blob's shape is 1x64x64x64.
I hope you can help me solve this issue.

@Noiredd
Member

Noiredd commented Feb 27, 2018

@douzsh I just ran this network with no problems, both in Python and with caffe time.
What's the output of caffe device_query -gpu=all on your machine? What are your CUDA and cuDNN versions?

@douzsh

douzsh commented Feb 27, 2018

@Noiredd
That's really good news. Can you tell me your CUDA and cuDNN versions too? I can upgrade my libraries for training...
BTW, my CUDA is 7.0 and cuDNN v6 is used while training.

@Noiredd
Member

Noiredd commented Feb 27, 2018

@douzsh I'm pretty sure you need at least CUDA 7.5 to run cuDNN 6 - see the download page for Nvidia cuDNN for a list of compatible releases. I ran my test on CUDA 9.0.176 and cuDNN 7.0.5.

@hoszbh

hoszbh commented Apr 4, 2018

@svobora This is a bug in Caffe. I solved it by modifying cudnn_conv_layer.cpp and aligning the address to a multiple of 32.

You can insert two lines of code before size_t total_max_workspace = ... as follows:

       size_t m = 32;
       max_workspace = (max_workspace + m - 1) / m * m;  // align address to a multiple of m

BTW, I think there is another bug: these lines should be put in the else block:

      for (int g = 0; g < (this->group_ * CUDNN_STREAMS_PER_GROUP); g++) {
        workspace[g] = reinterpret_cast<char *>(workspaceData) + g * max_workspace;
      }
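
For context, here is a rough sketch of how both suggestions would sit inside CuDNNConvolutionLayer<Dtype>::Reshape() in src/caffe/layers/cudnn_conv_layer.cpp. The surrounding variable names (max_workspace, workspaceData, workspace, workspaceSizeInBytes) follow the BVLC sources of that period; treat this as an illustration of the proposed patch, not the exact upstream code:

      // get the maximum workspace over forward / backward-data / backward-filter,
      // as in the existing code
      size_t max_workspace = std::max(total_workspace_fwd, total_workspace_bwd_data);
      max_workspace = std::max(max_workspace, total_workspace_bwd_filter);

      // suggested fix 1: round the per-group workspace size up to a multiple of 32
      // so that every slice workspaceData + g * max_workspace stays aligned
      size_t m = 32;
      max_workspace = (max_workspace + m - 1) / m * m;

      // ensure all groups have enough workspace
      size_t total_max_workspace = max_workspace *
                                   (this->group_ * CUDNN_STREAMS_PER_GROUP);

      if (total_max_workspace > workspaceSizeInBytes) {
        workspaceSizeInBytes = total_max_workspace;

        // free the existing workspace and allocate a new (larger) one
        cudaFree(this->workspaceData);
        cudaError_t err = cudaMalloc(&(this->workspaceData), workspaceSizeInBytes);
        if (err != cudaSuccess) {
          // allocation failed: fall back to zero-workspace algorithms
          // (fallback code from the original file elided here)
          workspaceData = NULL;
          workspaceSizeInBytes = 0;
        } else {
          // suggested fix 2: only set the per-group aliases when the allocation
          // succeeded, i.e. inside this else block instead of after the if
          for (int g = 0; g < (this->group_ * CUDNN_STREAMS_PER_GROUP); g++) {
            workspace[g] = reinterpret_cast<char *>(workspaceData) + g * max_workspace;
          }
        }
      }

If max_workspace is not a multiple of the alignment cuDNN expects, the pointers for groups g > 0 can land on misaligned addresses, which would explain why the crash depends on the particular combination of input shape, kernel size, and group count.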

@Prier

Prier commented May 3, 2018

@hoszbh Just wanted to confirm that your fix is working, thanks a lot. Do you know why this fix is not on master yet?

@hzxie

hzxie commented Aug 5, 2019

See also #6548

mentezar added a commit to mentezar/caffe that referenced this issue Oct 13, 2019
… This fixes the error by aligning the address to be a multiple of m (32). It also fixes another bug where the if..else was not correctly grouped.

see: BVLC#5729
@yulinhuyang

After fixing the code, should I compile Caffe again?
