Convolution layer using cudnn. #2096

Merged
10 commits merged into Theano:master on Sep 16, 2014

@abergeron
Member

TODO:

  • tests
  • a way to use the GPU without cuDNN
  • avoid GpuContiguous where possible
  • add support for subsampling
@abergeron
Member

With the tests we have a new problem: we need a way to detect that the cuDNN library is not present and skip the tests in that case rather than fail them.

Otherwise all should be ok. We can always work on avoiding GpuContiguous later.

@nouiz nouiz commented on the diff Sep 9, 2014
theano/gof/cc.py
@@ -661,7 +665,7 @@ def code_gen(self):
" didn't return a string for c_init_code_apply")
try:
- struct_init = op.c_init_code_struct(node, id + 1)
+ struct_init = op.c_init_code_struct(node, id + 1, sub_struct)
@nouiz
nouiz Sep 9, 2014 Member

This interface change should be reflected in the docs.

@abergeron
abergeron Sep 9, 2014 Member

It was partially done; I forgot to change extending/cop.txt. Now it's done.
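
For reference, a rough sketch of what the updated hook looks like from an Op implementer's point of view (the body and the exact contents of `sub` here are illustrative only; extending/cop.txt is the authoritative reference):

```python
import theano


class MyOp(theano.gof.Op):
    # Illustrative only: the new third argument passes the struct-level
    # substitutions (the `sub_struct` dict from the call site above) down to
    # the op, alongside the node and the generated name.
    def c_init_code_struct(self, node, name, sub):
        # Returns C code that runs once, when the struct for this apply node
        # is created.
        return "// one-time initialization for this apply node"
```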

@nouiz nouiz commented on the diff Sep 9, 2014
theano/sandbox/cuda/dnn.py
+
+from theano.sandbox.cuda.opt import (local_optimizer, gpu_contiguous,
+                                     gpu_optimizer)
+
+@local_optimizer([GpuConv])
+def local_conv_dnn(node):
+    if isinstance(node.op, GpuConv):
+        if (node.op.subsample != (1, 1) or
+            node.op.border_mode not in ['full', 'valid']):
+            return
+        img, kern = node.inputs
+        border_mode = node.op.border_mode
+        return [GpuDnnConv(border_mode)(gpu_contiguous(img),
+                                        gpu_contiguous(kern))]
+
+gpu_optimizer.register("conv_cudnn", local_conv_dnn, 'cudnn')
@nouiz
nouiz Sep 9, 2014 Member

Since this file is named dnn.py, add "dnn" as a tag as well.

@abergeron
abergeron Sep 9, 2014 Member

I chose 'cudnn' as a generic tag to enable cudnn-optimized ops. I can change it to 'dnn' but that feels too generic.

@nouiz nouiz commented on the diff Sep 9, 2014
theano/sandbox/cuda/dnn.py
@@ -0,0 +1,205 @@
+import copy
+import os
+
+import theano
+from theano import Apply
+from theano import tensor
+from theano.compat.six import StringIO
+from theano.sandbox.cuda.type import CudaNdarrayType
+from theano.sandbox.cuda import GpuOp
+from theano.sandbox.cuda.basic_ops import (as_cuda_ndarray_variable,
+                                           gpu_contiguous)
+from theano.sandbox.cuda.blas import GpuConv
+
+class GpuDnnConv(GpuOp):
@nouiz
nouiz Sep 9, 2014 Member

We need a mechanism that detects whether cuDNN is available. For this, base yourself on GCC_Compiler.try_flags() to test whether we can compile with the cudnn.h header. This will let us enable/disable the tests and produce a good error message in c_code() if the library isn't there.
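
A minimal sketch of that idea; the exact helper name, argument order, and return convention below are assumptions to be checked against the real compiler-probing API:

```python
from theano.gof import cmodule


def dnn_available():
    # Sketch: try to compile and link a trivial program that includes
    # cudnn.h.  The try_flags() call (arguments and return value) is an
    # assumption about the GCC_Compiler API, not a documented contract.
    preamble = "#include <cudnn.h>"
    body = "cudnnHandle_t handle; cudnnCreate(&handle); cudnnDestroy(handle);"
    try:
        return bool(cmodule.GCC_compiler.try_flags(["-lcudnn"], preamble, body))
    except Exception:
        return False
```

Tests could then call this and skip themselves when it returns False, and c_code() could raise a clear error at compile time instead of a cryptic compiler failure.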

@abergeron
abergeron Sep 9, 2014 Member

I'll try to make something like this.

@nouiz
Member
nouiz commented Sep 10, 2014

I get a segfault. The problem is that cudnnDestroyTensor4dDescriptor() is called in instantiate(). Probably there is another error that got hidden by this one. You should initialize the descriptors to NULL and call the destructor on them only if they aren't NULL.

@abergeron
Member

This should be fixed, but I haven't been able to make it crash before.

@nouiz
Member
nouiz commented Sep 10, 2014

If we use an unsupported card, the error during init doesn't make anything crash right away; we get a strange error at run time instead.

@abergeron
Member

Now it should just fail all the tests, instead of crashing.

@abergeron
Member

But you will need to clear the Theano cache before running it again.

@nouiz nouiz referenced this pull request in soumith/convnet-benchmarks Sep 11, 2014
Closed

Theano cuDNN binding #27

@stencilman
Contributor

@abergeron, @nouiz: wowow!! I am very impressed by the speed at which you pulled this into Theano! And also very, very happy, thank you! 👍

@nouiz nouiz referenced this pull request in soumith/convnet-benchmarks Sep 11, 2014
Merged

benchmark cudnn back-end. #28

@nouiz
Member
nouiz commented Sep 11, 2014

Here is what I think we should do before merging this PR:

  • Detect in the op whether the inputs are contiguous; if not, raise an error (see the sketch after these lists).
  • Return the NVIDIA error message in the error.
  • Documentation.

In another PR:

  • Support non-contiguous inputs.
  • Three ops, as with GpuCorrMM (to properly support subsampling). I think they should be cross-correlations; we could then more easily reuse them for convolution (with kernel flipping) outside the graph.
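
A minimal sketch of the first item, written as the Python c_code() hook returning a C fragment (the CudaNdarray_is_c_contiguous helper and the cudnnGetErrorString() call mentioned here are assumed to be available; this is not the code in the PR):

```python
class GpuDnnConvSketch(object):
    """Illustrative fragment only, not the op from this PR."""

    def c_code(self, node, name, inputs, outputs, sub):
        img, kern = inputs
        return """
        if (!CudaNdarray_is_c_contiguous(%(img)s) ||
            !CudaNdarray_is_c_contiguous(%(kern)s)) {
            PyErr_SetString(PyExc_ValueError,
                            "GpuDnnConv: inputs must be C-contiguous");
            %(fail)s
        }
        /* ... call cuDNN here; on a bad status, cudnnGetErrorString(status)
           could be used to put the NVIDIA error message in the exception. */
        """ % dict(img=img, kern=kern, fail=sub['fail'])
```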
@f0k
Contributor
f0k commented Sep 11, 2014

Great to see this is already started! As the API does not explicitly support full convolution, this should really be expanded into a setup akin to the three GpuCorrMM ops with defined gradients for manual use and a graph optimizer that selects one of the algorithms (forward, bprop wrt. weights, bprop wrt. inputs) to replace GpuConv instances as it sees fit. #2072 becomes more and more interesting...

@abergeron
Member

I use GpuContiguous in the optimizer, so the problem of detecting the "contiguousness" of inputs should be non-existent. If we want to make this independently usable, there could always be a wrapper function that inserts the GpuContiguous ops, as sketched below.
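
Such a wrapper could be as small as this (the dnn_conv name is made up; the call mirrors what the optimizer in this PR already does):

```python
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.dnn import GpuDnnConv


def dnn_conv(img, kern, border_mode='valid'):
    # Make both inputs C-contiguous before handing them to the cuDNN op,
    # exactly like the graph optimizer does.
    return GpuDnnConv(border_mode)(gpu_contiguous(img), gpu_contiguous(kern))
```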

Also, the API does support full convolution through the pad_{w,h} parameter. The kernel selection is already done for us so that we don't have to do algorithm selection like what is done for GpuCorrMM. This is the whole point of the library.

I can define the gradient for manual use, since the library provides functions for those. But it would be harder to have them replaced on the fly in existing graphs, unless we go ahead with the plan to do this for the regular conv operation.

The library can do both cross-correlation and convolution, so we don't need to choose one or the other, we can do both. Kernel-flipping in the graph is a bad idea since that will lead to negative strides which are not supported (and thus will require a copy).

@f0k
Contributor
f0k commented Sep 11, 2014

Also, the API does support full convolution through the pad_{w,h} parameter. The kernel selection is already done for us so that we don't have to do algorithm selection like what is done for GpuCorrMM.

Ah great, so running the backward pass wrt. inputs is just as fast as running the forward pass with correct padding?

I can define the gradient for manual use, since the library provides function for those.

I see that the library also supports upscaling the input, so then probably everything can just be expressed with cudnnConvolutionForward. If the cudnnConvolutionBackward* functions are not faster anyway, then I agree that there shouldn't be three separate Ops. But then maybe GpuDnnConv should support padding, upsampling, strides and kernel flipping and define its gradients in terms of itself. In addition there could be a convenience function with the limited border_mode and subsample interface known from conv2d.
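
For concreteness, such a convenience function might look roughly like this; everything here is hypothetical (the pad/stride keywords are not the interface of the op in this PR, and the kernel shape has to be known up front to translate 'full' into explicit padding):

```python
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.dnn import GpuDnnConv  # pad/stride kwargs assumed below


def dnn_conv2d(img, kern, kshp, border_mode='valid', subsample=(1, 1)):
    # kshp = (kernel_rows, kernel_cols); needed to compute 'full' padding.
    kr, kc = kshp
    if border_mode == 'valid':
        pad = (0, 0)
    elif border_mode == 'full':
        pad = (kr - 1, kc - 1)
    else:
        raise ValueError("unsupported border_mode: %r" % (border_mode,))
    return GpuDnnConv(pad=pad, stride=subsample)(gpu_contiguous(img),
                                                 gpu_contiguous(kern))
```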

But it would be harder to have them replaced on the fly in existing graphs, unless we go ahead with the plan to do this for the regular conv operation.

I think the only problem is strided convolutions, because ConvOp.grad() currently uses conv3D for the gradients in this case. To use the fast backpropagation algorithms for strided convolution provided by GpuCorrMM_grad* or cuDNN, we'd need graph optimizers that replace particular conv3D instantiations, or implement ConvOp.grad() differently.

The library can do both cross-correlation and convolution, so we don't need to choose one or the other, we can do both. Kernel-flipping in the graph is a bad idea since that will lead to negative strides which are not supported (and thus will require a copy).

Kernel flips are inserted by ConvOp.grad(), though. The easiest fix might be to check the kernel strides in the C code to see whether we are looking at flipped C-contiguous filters, and then avoid creating a copy by switching between CUDNN_CONVOLUTION and CUDNN_CROSS_CORRELATION.

@abergeron
Member

The plan, since not long ago, is to split all convolution ops into separate parts for the gradients. We do intend to use the Backward* functions for these purposes, but since this hasn't been done yet for the generic versions (Conv, GpuConv), inserting those ops into the graph via an optimizer is not really possible yet, which is why they have lower priority.

I do intend to extend the op to support at least the full interface of conv2d, and maybe other things that the library also supports.

All sorts of trickery to avoid copies are possible, but also time-consuming, so we didn't do it for now.

@f0k
Contributor
f0k commented Sep 11, 2014

The plan since not so long ago is to split all convolution ops into parts for the gradient.

Ah, I see. But wouldn't that complicate graph optimization because one would have to deal with three different convolution Ops rather than one? An alternative could be making ConvOp generic enough to express all cases (with padding, upsampling, strides, and kernel flipping) and then have graph optimizers select whatever algorithm fits the specification. Of course, if all convolution implementations ported to Theano come with three predefined algorithms (like cuda-convnet, GpuCorrMM, cuDNN), it might make things easier to introduce two convolution gradient ops, but it's not always optimal to follow the choice of the designers anyway (as I saw for GpuCorrMM).

I do intend to extend the op to support at least the full interface of conv2d and maybe other things that the library also support.

Cool!

@abergeron
Member

The problem with expressing all cases is that it is almost impossible to do in certain cases (like when you have subsampling or strides). This is why we use the 3D gradient in the 2D case. Otherwise the results would be wrong. Splitting the cases into different ops does not limit us to only using one implementation/algorithm though. So we can do some trickery like what is done for GpuCorrMM anyway, but inside these ops.

But I feel this discussion is better suited for the mailing list than for the comments on this PR.

As for the actual code: I believe I've addressed all your comments, @nouiz.

@nouiz

Is that syntax supported by Python 2.6?

Owner

Yes. It's even supported on 2.4 according to: http://legacy.python.org/dev/peps/pep-0328/

@nouiz

Why did you remove those inits to NULL? We added them to prevent errors during %(fail)s.

Owner

Because the C++ spec states that POD members will be initialized to all-zeros, so they are already NULL when the function starts. And that wasn't the cause of the errors.

I tried it again and it crashes as before. I checked and they aren't NULL. I see two ways to fix this: init them to NULL, or jump to a different fail in the init than the fail used during execution. The second option seems more complicated. Why not just add back the init to NULL?

What I have found is that in C++03 some of this is initialized. But here we use an old g++ that probably doesn't support that version, or at least doesn't enable C++03 support by default. In which version did you find that it should auto-init to NULL?

Owner

Yes, you're right. This changed in C++03, so I'll add them back.

@nouiz nouiz referenced this pull request Sep 16, 2014
Merged

Abergeron dnn #2112

@nouiz nouiz merged commit 74bc7ab into Theano:master Sep 16, 2014

@f0k f0k referenced this pull request in soumith/convnet-benchmarks Sep 18, 2014
Open

[August 2014] Discussion of results #16

@HapeMask
Contributor

Just a small comment, hopefully this is the right place. The latest master version doesn't install theano/sandbox/cuda/cudnn_helper.h as part of the normal install process. If I manually copy it to /lib/python3.4/site-packages/Theano-0.6.0-py3.4.egg/theano/sandbox/cuda/ (where it would normally be installed), everything works fine.

@abergeron
Member

This should be fixed when this PR is merged. Thanks for the report.

@abergeron
Member

Actually I mean when #2117 is merged.
