caffe conv kernel for theano. tests work, but needs integration and some... #2002
Conversation
@@ -0,0 +1,33 @@
// Copyright 2014 BVLC and contributors.
This is not ok. You need to add the full license (either inline as a comment, or in a separate file that is referred to here).
The code is a little rough around the edges, but I'm pretty sure we want to merge after the issues are taken care of.
I think Fred wanted to clean it up and merge it properly, and that is why I left it so. However, I will make all the changes you suggest.
I made a PR to your branch with some advancements. It returns bad values right now.
… cuda_ndarray gpuval = cuda_ndarray.conv(img, kern, mode, subsample). So, I made the changes in test_conv_cuda_ndarray._test_dummy(). I see that the CPU version is computed using py_conv(), which in turn calls scipy.signal.convolve2d. How can the result 'gpuval' now be the same as scipy.signal.convolve2d instead of scipy.signal.correlate? Also, this still passes tests for all image, kernel, channel and batch sizes: https://github.com/stencilman/Theano-1/blob/fb66035292ef070b86466bf61c9c42b8faaa0a1c/theano/sandbox/cuda/tests/test_conv_gemm.py
…nv_cuda_ndarray.py. I rotated the kernel by 180 degrees before the convolution, and this now gives the same result as GpuConvMM. So, I think the cuda/c part is completely fine and the correct arguments are being passed to the cublas function.
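For reference, the rot180 identity being used here can be checked in a few lines of numpy/scipy (a standalone sketch, not the actual test code):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.RandomState(42)
img = rng.randn(6, 6)
kern = rng.randn(3, 3)

# Correlating with a kernel rotated by 180 degrees is the same as
# convolving with the original kernel (in 2D, rot180 == flip both axes).
flipped = kern[::-1, ::-1]
assert np.allclose(correlate2d(img, flipped, mode='valid'),
                   convolve2d(img, kern, mode='valid'))
```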
Fred, I made two commits, please have a look. I do not think there is any problem with the cuda/c code. Thanks!
@@ -0,0 +1,193 @@
/*
You still have to keep the original copyright notice, which was deleted here.
Thanks! Oh, support for non-square kernels is great. I will try and look into it later this week.
Hi all, I am sorry for being so negative, but I am very disheartened by the speed of Theano. The fprop is awesome and the speed is exactly like in torch7; however, the bprop in Theano is 5x slower!! I really wanted to use Theano, but if I can't get it to be faster, I will have no option. I would be very grateful if anyone could provide any input. :-( Thank you in advance.
If you look at these parts of the profile, the Theano linker time (which includes C/CUDA code generation and compiling) and the per-class breakdown (Class <% time> <#call> <#apply>), we see that we spend ~95% of the time inside the GpuCorrMM op. The per-op breakdowns (Ops <% time> <#call> <#apply>) also tell us that we spend much more time in the valid mode than in the full mode. Can you rerun the profiling with the extra Theano flag profile_memory=True? Fred
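For anyone following along, a minimal sketch of how to turn the profiler on (assuming a 2014-era Theano; profile=True prints the per-class/per-op tables quoted above, and profile_memory=True adds memory statistics):

```python
import theano

# Equivalent to running with THEANO_FLAGS=profile=True,profile_memory=True.
# Must be set before any theano.function is compiled.
theano.config.profile = True         # per-class / per-op timing tables
theano.config.profile_memory = True  # also collect memory statistics
```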
Hi Fred, thanks a ton for your reply. Please find below the complete log after also using profile_memory=True. What I don't understand is why the 'valid' mode is getting called at all? I only call conv2d with 'full'. Perhaps this happens during the backprop, but I am really not sure why. Any help would be greatly appreciated. Thanks a lot!!!
If the forward is valid, in the grad you will have a valid and a full convolution. Why do you use the full mode in the forward? That is very strange. I need to leave. I'll see if I can work on that tonight if the speed…
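A small sketch of what this means (assuming a 2014-era Theano; the exact Op names in the printout vary by version): take the grad of a valid conv2d, and the graph contains a full convolution for the grad with respect to the input and a valid one for the grad with respect to the filters.

```python
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

x = T.tensor4('x')  # images: (batch, channels, rows, cols)
w = T.tensor4('w')  # filters: (nfilters, channels, rows, cols)

y = conv2d(x, w, border_mode='valid')
gx = T.grad(y.sum(), x)  # internally a *full* convolution
gw = T.grad(y.sum(), w)  # internally a *valid* convolution

# Inspect the graphs; the conv Ops' modes appear in the printout.
theano.printing.debugprint(gx)
theano.printing.debugprint(gw)
```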
I looked at the caffe code and it works differently in the backward case than… Here is the profile of convnet-benchmark; you can compare it to what is on…
Part of the difference is that we do a separate grad for the weights and the inputs. We need to implement a new op, GPU corrgrad, that will compute both grads at once. Keep me updated on what you do and when, as I'll also work on it tonight.
Hi Fred, I use full because I want the output to have the same shape as the input. I can change to using valid in the fprop, but does that change anything? Yes, I checked the caffe code and how they calculate gradients; I can add this to GpuCorrMM. How can I add a grad function that can call a CUDA function to get the gradient value? I tried using cuda-memcheck, but it always seems to get stuck; I will run cuda-memcheck on a minimal program instead and let you know. Yes, I do think the backprop is as slow as I report. It would be amazing if you have some time to look at it later. Thank you! I am happy to do anything to make it fast, I really need this. Thanks a lot,
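On the "how can I add a grad function" question: in Theano, an Op's grad() method returns symbolic expressions, and those expressions may apply any Op, including a GPU one. A toy sketch with a made-up Op (not GpuCorrMM itself):

```python
import theano
import theano.tensor as T
from theano.gof import Op, Apply

class DoubleOp(Op):
    """Toy Op computing y = 2 * x; stands in for where GpuCorrMM's grad would go."""
    def make_node(self, x):
        x = T.as_tensor_variable(x)
        return Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        output_storage[0][0] = 2 * inputs[0]

    def grad(self, inputs, output_grads):
        gz, = output_grads
        # Return one symbolic expression per input; a conv Op would
        # return the full/valid convolution expressions here instead.
        return [2 * gz]

x = T.vector('x')
g = T.grad(DoubleOp()(x).sum(), x)
```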
I'm looking into it. Finally, I don't think we need to create a new op that… The problem is related to the fact that we use the same code for the… Fred
It is great to know we might not need to do a new op! Thanks for working on it; let me know if I can help in any way, I would be more than happy to! It would help us a lot to have fast convolution in Theano. :-)
I'll go sleep. I have something that runs, but doesn't return the right answer: https://github.com/nouiz/Theano/tree/conv_gemm If you can review it and try to fix it, that would be great.
Great, thanks! I will have a look and let you know if I can find anything.
I am a bit confused: why do you want to call col2im in the 'full' mode? Why do you want to handle the 'valid' and 'full' modes differently in the cuda code?
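For readers, a minimal numpy sketch of the im2col/col2im picture being discussed (single channel, valid mode; the helper names are illustrative, not caffe's actual signatures). im2col gathers every patch into a column so the convolution becomes one gemm; col2im is its transpose (a scatter-add), which is why it appears in the gradient with respect to the input:

```python
import numpy as np

def im2col(x, kh, kw):
    """Each kh x kw patch of x becomes one column (valid mode)."""
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def col2im(cols, H, W, kh, kw):
    """Transpose of im2col: scatter-add each column back into place."""
    x = np.zeros((H, W))
    oh, ow = H - kh + 1, W - kw + 1
    for i in range(oh):
        for j in range(ow):
            x[i:i + kh, j:j + kw] += cols[:, i * ow + j].reshape(kh, kw)
    return x

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)

# Forward: correlation becomes a single gemm after im2col.
y = np.dot(w.ravel(), im2col(x, 3, 3)).reshape(3, 3)

# Backward w.r.t. the input: gemm with the filter matrix transposed,
# then col2im to fold the columns back into an image-shaped gradient.
gy = np.random.randn(3, 3)
gcols = np.outer(w.ravel(), gy.ravel())  # (kh*kw, oh*ow)
gx = col2im(gcols, 5, 5, 3, 3)
```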
I added support for non-square kernels here: https://github.com/stencilman/Theano-1/commit/85b8a90f553699e67e85291cad6350e6abb5a944 It passes all tests. Is that what you were trying to do?
I'm trying to do as in this function: cunn_SpatialConvolutionMM_updateGradInput. In particular this part: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L303
Ok, so updateGradInput calculates the gradients. I am still a bit confused as to why, though. Is it because you want to get it to work first and then change it to a different function? All tests for the full mode will fail, right, if you calculate the gradient instead of doing the corr? Please do explain. And yes, the non-square stuff works perfectly; I will create a PR.
I created a PR (#2023) for the code that handles non-square kernels. About your branch and the changes you made last night: I am unsure why you want to do what cunn_SpatialConvolutionMM_updateGradInput does. Can you please explain?
Also, why do you think it is the 'full' corr that makes it slow? And how is the algorithm different for us in the full mode than for them? What I find is that the forward is super fast, even if it is full; somehow the backward is very slow. It would really help me a lot if you could fix this or tell me how to make it faster for the backprop. Thanks a lot.
I started to write this reply yesterday, but forgot to hit reply. Here it is:
I ran the convnet-benchmark before my PR and saw this result (tm is the… CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 (bs = 128, stride…). Here we see that it takes 1753ms (second to last line) for the bprop vs the… With the code in the master of Theano, we don't call col2im. There…
Can you check to make sure your torch implementation also uses the full mode?
I am sorry @nouiz, I still don't understand how full is any different from valid. For me, full is just valid with some padding, and that is how we implement it. col2im, in my opinion, has nothing to do with the full or valid modes; it is only useful for calculating the gradient, please correct me if I am wrong. The convnet-benchmark results that @nouiz reports are clearly weird, no? It takes 99ms for the fprop vs 1753ms for the bprop? Why? I will try to get to the bottom of the slow speed. From tomorrow morning I will not be available until next Friday, as I am going to Vancouver for a conference. I will see what I can do today.
You are right that the full mode is valid with padding. When we have 2… I guess that is the reason I wrote above.
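Concretely, for a kh x kw kernel, padding the input with kh-1 and kw-1 zeros on each side and running valid mode reproduces full mode; a quick scipy check:

```python
import numpy as np
from scipy.signal import correlate2d

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)

# Pad by (kernel size - 1) on each side, then run valid mode.
padded = np.pad(x, 2, mode='constant')
assert np.allclose(correlate2d(padded, w, mode='valid'),
                   correlate2d(x, w, mode='full'))
```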
Thanks a lot for your reply @nouiz. Hmm, I don't think so: I don't think it is the zero padding that is making it slow. Also, I think col2im and im2col do not deal with padding differently. IMHO, the difference in speed between this and torch would not be orders of magnitude because of zero padding; it is something else. @nouiz, I would be very grateful if you could tell me what Theano does in the case of the bprop. What size convolutions? Maybe we try to manually fprop the same size convolution and see how much time it takes? What do you think?
Btw @nouiz, we seem to have a misunderstanding of what caffe/torch does. Correct me if I am wrong: you seem to think they use col2im + gemm for the full mode. I don't think so; they don't even have a full mode. They use col2im only when they are doing their bprop. I don't understand how Theano does the bprop.
The bprop is also a convolution. So when you tell that they use col2im+gemm…

The grad graph of conv2d is defined in this method: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L775

The grad with respect to the filter is also a convolution, check here: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L898 You see there that this grad is always a valid convolution; here you see: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L848

The grad with respect to the image is also a convolution, check here: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L943 Here you see that in that case the mode of that convolution is full: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L916

So in the case where the fprop is valid, the bprop with respect to the inputs is a full convolution.
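Spelled out numerically, these identities look as follows (a standalone scipy sketch using cross-correlation, which is what GpuCorrMM computes; Theano's conv2d flips the kernel, so the exact flips in conv.py differ slightly):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.RandomState(0)
x = rng.randn(5, 5)   # input image
w = rng.randn(3, 3)   # filter
gy = rng.randn(3, 3)  # upstream gradient (same shape as the valid output)

# Forward: valid cross-correlation; take L = sum(y * gy) as a toy loss.
y = correlate2d(x, w, mode='valid')

# Grad w.r.t. the filter: a *valid* correlation of the input with the grad.
gw = correlate2d(x, gy, mode='valid')

# Grad w.r.t. the image: a *full* convolution of the grad with the filter.
gx = convolve2d(gy, w, mode='full')

# Finite-difference check of one entry of gw.
eps = 1e-6
w2 = w.copy()
w2[0, 0] += eps
L = lambda w_: (correlate2d(x, w_, mode='valid') * gy).sum()
assert np.allclose((L(w2) - L(w)) / eps, gw[0, 0], atol=1e-4)
```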
Thanks a ton for that comment @nouiz! That clarifies things a lot! So the grads are conv ops too. Is it possible that the "unroll" while doing the bprop conv doesn't have the optimum shapes and thus is not fast? I am still not sure if we need col2im or what gemm+col2im does. I have a suspicion the problem has something to do with the sizes.
I haven't been able to work on it today, and I won't work on it next week. So… Do you know if the torch code in full mode updates the weights in place? If… Also, it is possible that we need to allocate bigger memory with…
I will try to continue to look into this. Maybe this info helps you in figuring out the solution: they (torch) create a 'module' and a 'criterion'; when you do module:forward() it goes forward, and when you do module:backward() it goes backward (recursively).
And you: …
where optimMethod is, e.g., sgd (https://github.com/torch/optim/blob/master/sgd.lua).
I don't have the time to look, but where is the "parameter" used in gradWeight…? This would help solve some of the questions I had when writing the Theano…
gradWeight: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L407 finput and fgradInput are used in updateGradInput (which transfers the gradients for the chain rule):
The caffe convolution works and passes tests; however, the code needs some cleaning, which is marked with TODO comments. I created a new file, theano/sandbox/cuda/tests/test_conv_gemm.py, that calls GpuConvMM.
TODO:
Other possible follow-up in gh-2015.