Convolution layer using cudnn. #2096

Merged
merged 10 commits into Theano:master on Sep 16, 2014

Conversation

5 participants
@abergeron
Member

abergeron commented Sep 8, 2014

TODO:

  • tests
  • a way to use the GPU without cuDNN
  • avoid GpuContiguous where possible
  • add support for subsampling
@abergeron
Member

abergeron commented Sep 8, 2014

With the tests we have a new problem: we need a way to detect that the cuDNN library is not present and skip the tests in that case rather than fail them.

Otherwise everything should be OK. We can always work on avoiding GpuContiguous later.

@@ -661,7 +665,7 @@ def code_gen(self):
                     " didn't return a string for c_init_code_apply")
         try:
-            struct_init = op.c_init_code_struct(node, id + 1)
+            struct_init = op.c_init_code_struct(node, id + 1, sub_struct)

@nouiz

nouiz Sep 9, 2014

Member

This interface change should be reflected in the docs.

@abergeron

abergeron Sep 9, 2014

Member

It was partially done; I forgot to change extending/cop.txt. Now it's done.
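For readers of the docs, a minimal sketch of what the changed interface looks like from an op implementer's side, inferred only from the call site above (the extra third argument, sub_struct at the call site). The class name, the parameter name sub, and the body are assumptions for illustration, not code from this PR:

    class MyOp(GpuOp):
        def c_init_code_struct(self, node, name, sub):
            # `name` is the per-apply identifier (`id + 1` at the call site);
            # `sub` is the extra struct-level dict introduced by this change.
            return "my_handle_%(name)s = NULL;" % dict(name=name)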

return [GpuDnnConv(border_mode)(gpu_contiguous(img),
gpu_contiguous(kern))]
gpu_optimizer.register("conv_cudnn", local_conv_dnn, 'cudnn')

@nouiz

nouiz Sep 9, 2014

Member

As this file is named dnn.py, add "dnn" as a tag as well.

@abergeron

abergeron Sep 9, 2014

Member

I chose 'cudnn' as a generic tag to enable cuDNN-optimized ops. I can change it to 'dnn', but that feels too generic.
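As a small illustration of the option under discussion (not a change made in this PR), and assuming register() accepts more than one trailing tag, as the single-tag call above suggests, both names could simply be passed:

    gpu_optimizer.register("conv_cudnn", local_conv_dnn, 'cudnn', 'dnn')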

gpu_contiguous)
from theano.sandbox.cuda.blas import GpuConv
class GpuDnnConv(GpuOp):

@nouiz

nouiz Sep 9, 2014

Member

We need a mechanism that detects whether cuDNN is available. For this, base it on GCC_Compiler.try_flags() to test whether we can compile with the cudnn.h header. This will let us enable/disable the tests and produce a good error message in c_code() if the lib isn't there.

@abergeron

abergeron Sep 9, 2014

Member

I'll try to make something like this.
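A rough standalone sketch of the detection idea. Inside Theano this would go through GCC_Compiler.try_flags() as suggested above; the helper below only illustrates the mechanism with the standard library plus nose's SkipTest, and the compiler invocation and library name are obvious guesses rather than code from this PR:

    import os
    import subprocess
    import tempfile

    from nose.plugins.skip import SkipTest


    def cudnn_available():
        """Return True if a trivial program that includes cudnn.h compiles and links."""
        fd, path = tempfile.mkstemp(suffix=".c")
        os.close(fd)
        try:
            with open(path, "w") as f:
                f.write("#include <cudnn.h>\nint main(void) { return 0; }\n")
            with open(os.devnull, "w") as devnull:
                ret = subprocess.call(["gcc", path, "-lcudnn", "-o", os.devnull],
                                      stdout=devnull, stderr=devnull)
            return ret == 0
        finally:
            os.remove(path)


    def test_gpu_dnn_conv():
        if not cudnn_available():
            # Skip rather than fail when the library is absent.
            raise SkipTest("cuDNN not available")
        # ... the actual convolution test would go here ...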

@nouiz


Member

nouiz commented Sep 10, 2014

I have a seg fault. The problem is that cudnnDestroyTensor4dDescriptor is called in instantiate(). Probably there is another error that got hidden by this one. You should init the descriptors to NULL and call the destructor on them only if they aren't NULL.
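A minimal sketch of the pattern being requested. The destroy call is the one named above; the matching create call and the descriptor type are assumed to follow the same cuDNN R1 naming, and the function and variable names are illustrative rather than the generated code:

    #include <cudnn.h>

    /* Keep the descriptors NULL until they have been successfully created. */
    static cudnnTensor4dDescriptor_t input_desc = NULL;
    static cudnnTensor4dDescriptor_t output_desc = NULL;

    static int setup_descriptors(void)
    {
        if (cudnnCreateTensor4dDescriptor(&input_desc) != CUDNN_STATUS_SUCCESS)
            return -1;
        if (cudnnCreateTensor4dDescriptor(&output_desc) != CUDNN_STATUS_SUCCESS)
            return -1;
        return 0;
    }

    static void cleanup_descriptors(void)
    {
        /* Only destroy what was actually created: calling the destructor on an
           uninitialized descriptor is what caused the reported segfault. */
        if (input_desc != NULL)
            cudnnDestroyTensor4dDescriptor(input_desc);
        if (output_desc != NULL)
            cudnnDestroyTensor4dDescriptor(output_desc);
        input_desc = NULL;
        output_desc = NULL;
    }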

@abergeron


Member

abergeron commented Sep 10, 2014

This should be fixed, but I haven't been able to make it crash before.

@nouiz


Member

nouiz commented Sep 10, 2014

If we use an unsupported card, the error during init doesn't make anything crash at that time; we get a strange error at run time instead.

On Wed, Sep 10, 2014 at 3:14 PM, abergeron notifications@github.com wrote:

This should be fixed, but I haven't been able to make it crash before.



@abergeron


Member

abergeron commented Sep 10, 2014

Now it should just fail all the tests, instead of crashing.

@abergeron


Member

abergeron commented Sep 10, 2014

But you would need to clear the theano cache before running it again.

@nouiz nouiz referenced this pull request Sep 11, 2014

Closed

Theano cuDNN binding #27

@stencilman


Contributor

stencilman commented Sep 11, 2014

@abergeron, @nouiz: wowow!! I am very impressed by the speed at which you pulled this in Theano! And also very very happy, thank you! 👍

@nouiz nouiz referenced this pull request Sep 11, 2014

Merged

benchmark cudnn back-end. #28

@nouiz

Member

nouiz commented Sep 11, 2014

Here is what I think we should do before merging this PR:

  • Detect in the op whether the inputs are contiguous and raise an error if they aren't (see the sketch after this list).
  • Return the NVIDIA error message in the error we raise.
  • Docs.

In another PR:

  • Support non-contiguous inputs.
  • Three ops, as for GpuCorrMM (to properly support subsampling). I think they should be cross-correlations; we could then more easily reuse them for convolution (with kernel flipping) outside the graph.
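A hedged sketch of the first two points, written as a fragment of the op's generated c_code() rather than a complete program. CudaNdarray_is_c_contiguous and cudnnGetErrorString are existing calls in the CUDA backend and cuDNN respectively; the variable names (img, kerns, err) and the %(fail)s substitution are placeholders in Theano's usual c_code style:

    if (!CudaNdarray_is_c_contiguous(img) || !CudaNdarray_is_c_contiguous(kerns))
    {
        PyErr_SetString(PyExc_ValueError,
                        "GpuDnnConv: inputs must be C-contiguous");
        %(fail)s
    }

    /* ... the cuDNN call goes here, storing its status in `err` ... */

    if (err != CUDNN_STATUS_SUCCESS)
    {
        /* Forward NVIDIA's own message instead of a generic one.
           %%s becomes %s after the template's % substitution. */
        PyErr_Format(PyExc_RuntimeError,
                     "GpuDnnConv: cuDNN call failed: %%s",
                     cudnnGetErrorString(err));
        %(fail)s
    }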
@f0k


Contributor

f0k commented Sep 11, 2014

Great to see this is already started! As the API does not explicitly support full convolution, this should really be expanded into a setup akin to the three GpuCorrMM ops with defined gradients for manual use and a graph optimizer that selects one of the algorithms (forward, bprop wrt. weights, bprop wrt. inputs) to replace GpuConv instances as it sees fit. #2072 becomes more and more interesting...

@abergeron


Member

abergeron commented Sep 11, 2014

I use GpuContiguous in the optimizer, so the problem of detecting the "contiguousness" of the inputs should be non-existent. If we want to make this independently usable, there could always be a wrapper function that inserts the GpuContiguous ops (a sketch follows below).

Also, the API does support full convolution, through the pad_{w,h} parameters. The kernel selection is already done for us, so we don't have to do algorithm selection like what is done for GpuCorrMM. That is the whole point of the library.

I can define the gradient for manual use, since the library provides functions for that. But it would be harder to have them replaced on the fly in existing graphs, unless we go ahead with the plan to do this for the regular conv operation.

The library can do both cross-correlation and convolution, so we don't need to choose one or the other; we can do both. Kernel flipping in the graph is a bad idea since it leads to negative strides, which are not supported (and thus would require a copy).
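A minimal sketch of such a wrapper, assuming it lives in the same dnn.py module that defines GpuDnnConv (so only gpu_contiguous needs importing); the name dnn_conv and the default border_mode are illustrative, not part of this PR:

    from theano.sandbox.cuda.basic_ops import gpu_contiguous


    def dnn_conv(img, kern, border_mode='valid'):
        """Build a GpuDnnConv node, forcing both inputs to be contiguous first."""
        return GpuDnnConv(border_mode)(gpu_contiguous(img),
                                       gpu_contiguous(kern))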

@f0k


Contributor

f0k commented Sep 11, 2014

> Also, the API does support full convolution, through the pad_{w,h} parameters. The kernel selection is already done for us, so we don't have to do algorithm selection like what is done for GpuCorrMM.

Ah great, so running the backward pass wrt. the inputs is just as fast as running the forward pass with the correct padding?

> I can define the gradient for manual use, since the library provides functions for that.

I see that the library also supports upscaling the input, so probably everything can just be expressed with cudnnConvolutionForward. If the cudnnConvolutionBackward* functions are not faster anyway, then I agree that there shouldn't be three separate Ops. But then maybe GpuDnnConv should support padding, upsampling, strides and kernel flipping, and define its gradients in terms of itself. In addition there could be a convenience function with the limited border_mode and subsample interface known from conv2d.

> But it would be harder to have them replaced on the fly in existing graphs, unless we go ahead with the plan to do this for the regular conv operation.

I think the only problem is strided convolutions, because ConvOp.grad() currently uses conv3D for the gradients in that case. To use the fast backpropagation algorithms for strided convolution provided by GpuCorrMM_grad* or cuDNN, we'd need graph optimizers that replace particular conv3D instantiations, or we'd need to implement ConvOp.grad() differently.

> The library can do both cross-correlation and convolution, so we don't need to choose one or the other; we can do both. Kernel flipping in the graph is a bad idea since it leads to negative strides, which are not supported (and thus would require a copy).

Kernel flips are inserted by ConvOp.grad(), though. The easiest might be to check the kernel strides in the C code to see whether we are looking at flipped C-contiguous filters, and then avoid creating a copy by swapping CUDNN_CONVOLUTION and CUDNN_CROSS_CORRELATION.

@abergeron


Member

abergeron commented Sep 11, 2014

The plan, since not so long ago, is to split all convolution ops into separate parts for the gradients. We do intend to use the Backward* functions for that, but since this hasn't been done yet for the generic versions (Conv, GpuConv), inserting those ops into the graph via an optimizer is not really possible, which is why they have lower priority.

I do intend to extend the op to support at least the full interface of conv2d, and maybe other things that the library also supports.

All sorts of trickery to avoid copies is possible, but also time-consuming, so we didn't do it for now.

@f0k


Contributor

f0k commented Sep 11, 2014

> The plan, since not so long ago, is to split all convolution ops into separate parts for the gradients.

Ah, I see. But wouldn't that complicate graph optimization, because one would have to deal with three different convolution Ops rather than one? An alternative could be making ConvOp generic enough to express all cases (with padding, upsampling, strides, and kernel flipping) and then having graph optimizers select whatever algorithm fits the specification. Of course, if all convolution implementations ported to Theano come with three predefined algorithms (like cuda-convnet, GpuCorrMM, cuDNN), it might make things easier to introduce two convolution gradient ops, but it's not always optimal to follow the choice of the designers anyway (as I saw for GpuCorrMM).

> I do intend to extend the op to support at least the full interface of conv2d, and maybe other things that the library also supports.

Cool!

@abergeron


Member

abergeron commented Sep 11, 2014

The problem with expressing all cases in one op is that it is almost impossible to do in certain cases (like when you have subsampling or strides). This is why we use the 3D gradient in the 2D case; otherwise the results would be wrong. Splitting the cases into different ops does not limit us to only one implementation/algorithm, though, so we can still do trickery like what is done for GpuCorrMM, but inside these ops.

But I feel this discussion is better suited for the mailing list than the comments on this issue.

As for the actual code: I believe I've addressed all your comments, @nouiz.

@nouiz

nouiz Sep 11, 2014

Is that syntax supported by python 2.6?


@abergeron

Owner

abergeron replied Sep 11, 2014

Yes. It's even supported on 2.4 according to: http://legacy.python.org/dev/peps/pep-0328/

@nouiz

nouiz Sep 11, 2014

Why did you remove those inits to NULL? We added them to prevent errors during %(fail)s.


@abergeron

Owner

abergeron replied Sep 11, 2014

Because the C++ spec states that POD members will be initialized to all-zeros. So they are already NULL when the function starts. And that wasn't the cause of the errors.

@nouiz

nouiz replied Sep 12, 2014

I tried it again and it crashes as before. I checked, and they aren't NULL. I see two ways to fix this: init them to NULL, or jump to a different fail in the init than the fail used during execution. The second option seems more complicated. Why not just add back the init to NULL?

@nouiz

nouiz replied Sep 12, 2014

What I have found is that in C++03 some things are initialized. But here we use an old g++ that probably doesn't support that version, or at least doesn't enable C++03 support by default. In which version did you find that it should auto-init to NULL?

@abergeron

Owner

abergeron replied Sep 16, 2014

Yes, you're right. This changed in C++03, so I'll add them back.


@nouiz nouiz referenced this pull request Sep 16, 2014

Merged

Abergeron dnn #2112

@nouiz nouiz merged commit 74bc7ab into Theano:master Sep 16, 2014

@HapeMask


Contributor

HapeMask commented Sep 18, 2014

Just a small comment, hopefully this is the right place. The latest master version doesn't install theano/sandbox/cuda/cudnn_helper.h as part of the normal install process. If I manually copy it to /lib/python3.4/site-packages/Theano-0.6.0-py3.4.egg/theano/sandbox/cuda/ (where it would normally be installed), everything works fine.
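For context, the usual way to ship a header such as cudnn_helper.h with a Python package is to list it in package_data in setup.py. The sketch below only illustrates that mechanism; the globs are guesses and this is not the actual fix that landed:

    from setuptools import setup, find_packages

    setup(
        name='Theano',
        version='0.6.0',
        packages=find_packages(),
        # Copy non-Python files under theano/sandbox/cuda (e.g. cudnn_helper.h)
        # into site-packages at install time.
        package_data={'theano.sandbox.cuda': ['*.h', '*.cuh']},
    )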

@abergeron


Member

abergeron commented Sep 19, 2014

This should be fixed when this PR is merged. Thanks for the report.

@abergeron


Member

abergeron commented Sep 19, 2014

Actually I mean when #2117 is merged.
