dnn_conv() always uses forward path #2268

Closed
f0k opened this Issue Nov 18, 2014 · 16 comments


@f0k
Contributor
f0k commented Nov 18, 2014

The performance of dnn_conv() is severely hampered by the fact that it always uses GpuDnnConv(), which in turn uses the CUDNN_CONVOLUTION_FWD path in cuDNN. When dnn_conv() is used to compute a full convolution without subsampling, it would be a lot faster to use the CUDNN_CONVOLUTION_DATA_GRAD path. Similarly, depending on the image and filter shape, it can be a lot faster to use the CUDNN_CONVOLUTION_WEIGHT_GRAD path.
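
To make the relationship between the three paths concrete: for a stride-1 "valid" cross-correlation, the gradient wrt. the inputs is a full convolution of the output gradient with the filter, and the gradient wrt. the weights is a valid cross-correlation of the input with the output gradient, so each path is itself just another convolution/correlation. A minimal NumPy/SciPy sketch (illustration only, single channel and batch element, not Theano code):

    import numpy as np
    from scipy.signal import correlate2d, convolve2d

    rng = np.random.RandomState(0)
    x = rng.randn(7, 6)   # input (one channel, one batch element)
    w = rng.randn(3, 2)   # filter
    g = rng.randn(5, 5)   # upstream gradient, same shape as the valid output

    # Forward path: stride-1 valid cross-correlation.
    y = correlate2d(x, w, mode='valid')               # shape (5, 5)

    # Backward-data path: grad wrt. the input is a *full* convolution.
    dx = convolve2d(g, w, mode='full')                # shape (7, 6)

    # Backward-filter path: grad wrt. the weights is a valid cross-correlation.
    dw = correlate2d(x, g, mode='valid')              # shape (3, 2)

    # Check both gradients against central finite differences of the
    # scalar loss L = sum(g * y).
    def numgrad(loss_fn, a, eps=1e-6):
        out = np.zeros_like(a)
        for idx in np.ndindex(a.shape):
            a[idx] += eps; fp = loss_fn()
            a[idx] -= 2 * eps; fm = loss_fn()
            a[idx] += eps
            out[idx] = (fp - fm) / (2 * eps)
        return out

    loss = lambda: np.sum(g * correlate2d(x, w, mode='valid'))
    assert np.allclose(dx, numgrad(loss, x))
    assert np.allclose(dw, numgrad(loss, w))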

For illustration, see the output of convnet-benchmarks:

$ SKIP_LEGACY=1 SKIP_GEMM=1 python pylearn2_benchmark.py i1x115x80,k32x8x6,b64
Using gpu device 0: GeForce GT 640

CONFIG: input = 1 x 115 x 80 * ker = 1 x 32 x 8 x 6 ( bs = 64 , stride = 1 )
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> fprop         ==>      13
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> bprop inputs  ==>     354
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> bprop weights ==>     148

(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> fprop         ==>      13
(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> bprop inputs  ==>      19
(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> bprop weights ==>      16

The upper convolutions are automatic replacements of GpuConv with dnn_conv(); the lower ones use GpuDnnConvGradI and GpuDnnConvGradW for the two gradients, respectively. Except for the forward pass, the upper convolutions are about an order of magnitude slower. Part of the difference is explained by the automatic version inserting some extra dimshuffles, but most of it is due to the convolution operation itself.

Note that this is an extreme case; depending on the input and filter shapes, the differences can be less pronounced, but also even more so.
Also note that this can be seen as a bug in cuDNN, which could itself select a better implementation for what it is asked to compute. I will file a bug report.

Still, we may want to change the behavior of dnn_conv() to choose a better algorithm, similar to what theano.sandbox.cuda.opt.local_conv_gemm does. Alternatively, or in addition, we could leave it up to the new meta-optimizer (#2266) to try all applicable options.

Any opinions on this?

@abergeron
Member

If I understand correctly you are saying that we suffer from the fact that GpuConv only has the forward mode and thus we always replace it with the forward op, right?

In that case there is a third option, which is to make proper gradient ops for GpuConv (which will help with the gradient in cases of subsampling) and possibly for Conv too. Then we can just replace those with the equivalent versions in dnn.

If that seems like too much short term work, then we can patch the dnn optimization in a similar way to the gemm one and select one of the ops directly that way. I kind of think that placing this in dnn_conv is a bad idea since we don't get problems when it's used correctly (as in: for the forward case only).

@f0k
Contributor
f0k commented Nov 18, 2014

If I understand correctly you are saying that we suffer from the fact that GpuConv only has the forward mode and thus we always replace it with the forward op, right?

Sort of. It can just as well be seen as suffering from the fact that dnn_conv() always uses GpuDnnConv().

In that case there is a third option, which is to make proper gradient ops for GpuConv (which will help with the gradient in cases of subsampling) and possibly for Conv too. Then we can just replace those with the equivalent versions in dnn.

That would also help for some cases, yes.

I kind of think that placing this in dnn_conv is a bad idea since we don't get problems when it's used correctly (as in: for the forward case only).

I think placing it in dnn_conv() would be a lot better than placing it in the optimizer. If a user manually uses dnn_conv() for a full convolution, it will currently also use a suboptimal implementation (namely, GpuDnnConv with a lot of padding instead of GpuDnnConvGradI without padding). The only problem I see is that dnn_conv() does not allow specifying the input and kernel shapes, so it cannot make as informed a choice as would be possible with shape information.
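
For reference, the padded forward pass and the direct backward-data path compute the same values; a minimal SciPy sketch of the identity (ignoring the convolution-vs.-correlation flag of the cuDNN ops):

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.RandomState(0)
    x = rng.randn(10, 9)
    w = rng.randn(4, 3)

    # What dnn_conv() currently builds for border_mode='full': zero-pad the
    # input by (kernel size - 1) on every side and run a valid convolution.
    xp = np.pad(x, ((3, 3), (2, 2)), mode='constant')
    via_padding = convolve2d(xp, w, mode='valid')

    # What the backward-data path computes directly, without explicit padding.
    direct_full = convolve2d(x, w, mode='full')

    assert np.allclose(via_padding, direct_full)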

Arguably, the best option of all would be fixing cuDNN.

@f0k
Contributor
f0k commented Nov 18, 2014

Update: I've created a gist demonstrating the performance difference (https://gist.github.com/f0k/1a5e15f130e5838d64fe) and tried to file a bug with Nvidia, but just got an error message, so I've sent it to the email address mentioned at the end of the arXiv paper (http://arxiv.org/abs/1410.0759) instead.

@nouiz
Member
nouiz commented Nov 19, 2014

To speed up the grad vs. the inputs (full mode), I see two ways:

  1. In local_conv_dnn, when we see the full mode, we can use GpuDnnConvGradI
     instead of GpuDnnConv. Better yet, in dnn_conv, use GpuDnnConvGradI when
     the mode is full.

  2. To differentiate the direction between the forward and backward passes,
     we can modify ConvOp.grad to store in the op a parameter for the
     direction and keep it. That way, we will know which op we need on the
     GPU.

The second "fix" would cover both the grad vs. inputs and the grad vs. weights.

What do you think?


@abergeron
Member

Your second option is probably the best way to go for now.

We should still talk to NVIDIA about the cuDNN problem that Jan demonstrated. It might help to make a C (or C++) example linking directly against cuDNN for that.

@f0k
Contributor
f0k commented Nov 19, 2014

Your second option is probably the best way to go for now.

That would be a quick "fix" (not to call it a hack), yes! But just as with caffe, it is not always optimal to choose the forward pass algorithm to compute the forward pass and the grad wrt. inputs algorithm to compute the grad wrt. inputs. Maybe this can be distinguished in the optimizer by looking at the shapes, as is done for caffe (if shape information is available), falling back to the information from the "fix" otherwise.

No matter what, I'm leaning towards just including all three algorithms in the meta-optimizer. This way the meta-optimizer should always find a solution at least as fast as manually using dnn_conv() (except for extra dimshuffles/flips).
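
For illustration, the kind of selection the meta-optimizer would do, sketched generically (the function and dictionary names here are made up; the real meta-optimizer would time compiled Theano functions built with GpuDnnConv, GpuDnnConvGradI and GpuDnnConvGradW):

    import time

    def pick_fastest(candidates, n_runs=5):
        """Time each compiled candidate on the real shapes and keep the
        fastest; unlike a shape-based heuristic, this cannot end up worse
        than any single fixed choice (up to timing noise)."""
        best_name, best_time = None, float('inf')
        for name, fn in candidates.items():
            fn()                                   # warm-up run
            start = time.time()
            for _ in range(n_runs):
                fn()
            elapsed = (time.time() - start) / n_runs
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        return best_name

    # Hypothetical usage: f_fwd, f_gradi, f_gradw would be Theano functions
    # compiled to use the three different cuDNN algorithms.
    # best = pick_fastest({'fwd': f_fwd, 'gradI': f_gradi, 'gradW': f_gradw})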

We should still talk to NVIDIA about the cuDNN problem that Jan demonstrated.

They already replied to my email, noting that my configuration has a single output feature, which leads to the matrix-matrix multiplication degrading into a matrix-vector multiplication. The faster runtime when using the backward pass algorithm stems from the fact that it does several small GEMMs instead of one big GEMV. Both configurations heavily underutilize the GPU, which is a good point, I guess, but that configuration is taken from a network that advanced the state of the art in its task, and I'd still be interested in making it as fast as possible, even if it is small.

I explained in more detail why Theano sometimes just knows it has to do a convolution, without further details, and why it would be great if cuDNN could hide the complexity of choosing the best algorithm. Let's see!

@f0k
Contributor
f0k commented Nov 19, 2014

Short-term proposal:

  • We add a direction_hint parameter to ConvOp and GpuConv, which is set to one of 'forward', 'backprop weights' or 'backprop inputs'. It is always set to 'forward' in nnet.conv2d, and set to the correct value in instances created by ConvOp.grad(). It is propagated to GpuConv by the local_gpu_conv optimizer.
  • The existing local_conv_gemm optimizer uses this hint whenever it doesn't have the necessary shape information to decide on what implementation to use.
  • dnn_conv() gets a direction parameter enforcing the algorithm to use. It's documented as being internal and temporary and not to be used by end users. (We can also move that into an internal function with an extra parameter called by the front-end dnn_conv(), but I'm not sure if that is any more elegant.) A rough sketch of the dispatch follows after this list.
  • The existing local_conv_dnn optimizer passes the direction hint to dnn_conv().
  • The new meta-optimizer tries dnn_conv() with all three values for the direction parameter.
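
The dispatch mentioned above could look roughly like this (a sketch only: dnn_conv_sketch and op_builders are made-up names for illustration, and the real construction of the GpuDnnConv/GpuDnnConvGradI/GpuDnnConvGradW apply nodes is left out):

    VALID_DIRECTIONS = ('forward', 'backprop inputs', 'backprop weights')

    def dnn_conv_sketch(img, kerns, border_mode='valid', subsample=(1, 1),
                        direction='forward', op_builders=None):
        """Sketch of a direction-aware dnn_conv(): `op_builders` maps each
        direction to a callable that builds the corresponding cuDNN apply
        node; descriptors and output buffers are omitted here."""
        if direction not in VALID_DIRECTIONS:
            raise ValueError("direction must be one of %s, got %r"
                             % (VALID_DIRECTIONS, direction))
        return op_builders[direction](img, kerns, border_mode, subsample)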

Long-term proposal:

  • ConvOp and GpuConv become placeholders instead of working ops. They only take note of their parameters (inputs, filters, border_mode/padding, stride, flipping, optionally input and filter shape). They provide a grad() method returning placeholders for the two gradients.
  • Graph optimizers replace the placeholders with real implementations. They can directly map the placeholders to the three cuDNN ops and the three caffe ops. For caffe, they would still include the heuristics to swap the forward pass and backward pass wrt. weights algorithms depending on the shape. They also cope with introducing the FFT or legacy implementations with correctly dimshuffled arguments (either directly, or via a step in between that turns a gradient placeholder into a forward pass placeholder with the necessary dimshuffling and flipping). If no suitable implementation can be found, the program will crash at runtime (e.g., if using custom padding, but disabling cuDNN and caffe).

No matter what, we always have the problem that there can be no strict mapping between "it's the forward pass in a neural network" and "we use the forward pass algorithm". If a user creates a neural network which does full convolutions (with a stride of 1), then we should always swap things around. That's why I'm not sure about distinguishing the forward pass and gradient in the graph. I can see that it's advantageous for strided convolutions, though, and to avoid extra dimshuffles/flipping.

Is there a way to define a general convolution placeholder that can express all three directions without saying "I'm a gradient wrt. weights" or "I'm a forward pass"?

@nouiz
Member
nouiz commented Nov 19, 2014

dnn_conv() gets a direction parameter enforcing the algorithm to use. [...] (We can also move that into an internal function with an extra parameter called by the front-end dnn_conv(), but I'm not sure if that is any more elegant.)

I would go simple and use just 1 function.

The existing local_conv_dnn optimizer passes the direction hint to dnn_conv(). The new meta-optimizer tries dnn_conv() with all three values for the direction parameter.

Agreed. To get a fix merged faster, I would prefer a PR that just adds the direction_hint parameter, and merge that. Can you do that? If you know when to swap the algorithm based on the shape, that can be done at the same time, but I don't think we should delay the first part by much. The current speed loss is very high.

I also think the meta-optimizer should test all three ops. Or maybe only some of them if the mode is valid, and just one if the mode is full?

Is there a way to define a general convolution placeholder that can express all three directions without saying "I'm a gradient wrt. weights" or "I'm a forward pass"?

I do not understand your question. Why not have, as you suggest, the default direction_hint be "forward" in conv2d(), and have the grad method introduce the correct hint based on the hint it received? What kind of placeholder are you looking for?

I think I would just create a new CorrOp and GPUCorrOp. I do not see a need for the convolution itself, as it slows things down. Or should corr vs. conv be a parameter of the placeholder at the same time?

@f0k
Contributor
f0k commented Nov 19, 2014

I do not understand your question. Why not have, as you suggest, the default direction_hint be "forward" in conv2d(), and have the grad method introduce the correct hint based on the hint it received? What kind of placeholder are you looking for?

I'm looking for a placeholder Op that describes the operation to perform, including any necessary kernel flipping and transposition. The quick fix with the hint would still define the ConvOp gradient as a graph with a ConvOp applied to some flipped and dimshuffled arguments, which is very difficult to clean up afterwards (e.g., it would require several new optimizers to turn this into a direct application of GpuDnnConvGradI, for example, without any redundant dimshuffles, flips and gpu_contiguous calls).

And the reason I'd use a placeholder Op instead of extending ConvOp with new parameters is that not all implementations support the same set of features; e.g., the ConvOp implementation does not support custom padding, and only the cuDNN implementation allows choosing whether the kernel is flipped. It would be cleaner to first introduce a placeholder Op that just says "we need to perform a convolution of input1 with input2 with this padding, striding, kernel flipping (yes or no) and output shape", and then leave it up to the optimizers to choose the best-fitting implementation when compiling the graph. But now my question was whether it's possible to define a single placeholder Op that encompasses everything, instead of three placeholder Ops for the forward pass and the two backward passes (the tricky thing is the striding and padding).
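
To illustrate, the kind of information such a placeholder would record (a made-up class for illustration only, not a real Theano Op):

    class ConvPlaceholder(object):
        """Records *what* convolution is wanted, nothing about *how* to
        compute it; graph optimizers would later swap it for GpuDnnConv,
        GpuDnnConvGradI/GradW, the gemm-based ops, FFT, etc."""

        def __init__(self, border_mode='valid', subsample=(1, 1),
                     flip_kernel=True, imshp=None, kshp=None):
            self.border_mode = border_mode   # 'valid', 'full' or explicit padding
            self.subsample = subsample       # stride
            self.flip_kernel = flip_kernel   # convolution (True) vs. correlation (False)
            self.imshp = imshp               # optional input shape, for the optimizers
            self.kshp = kshp                 # optional kernel shape, for the optimizers

        def grad(self, inputs, output_grads):
            # Would also return placeholders for the two gradients, so the
            # whole graph stays implementation-free until optimization time.
            raise NotImplementedError("to be resolved by graph optimizers")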

I think I would just create a new CorrOp and GPUCorrOp. [...] Or should corr vs. conv be a parameter of the placeholder at the same time?

Yes, I would definitely make that a parameter of the placeholder rather than creating more ops. Then the graph optimizers can deal with flipping the kernel as needed (depending on whether the chosen implementation does a convolution or a correlation), or just pass the correct argument to the cuDNN convolution.

Agreed. To get a fix merged faster, I would prefer a PR that just adds the direction_hint parameter, and merge that. Can you do that?

Yes. But now that you mention the "full" mode, probably the direction hint should just be treated as a hint in dnn_conv(), not as an enforcement. If border_mode is "full" and stride is (1,1) (the only stride resulting in the graph optimizers becoming active, because otherwise the gradients do not use GpuConv), we should always choose the backward pass wrt. inputs, I think. If border_mode is "valid" and stride is (1,1), we can use the direction hint. Otherwise we will have to ignore it.
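
In code, that rule would look roughly like this (choose_algo is a made-up helper, not actual Theano code; direction names as proposed above):

    def choose_algo(border_mode, subsample, direction_hint):
        """Sketch of the heuristic described above: the hint is advisory,
        not an enforcement."""
        if border_mode == 'full' and subsample == (1, 1):
            # Full convolution with unit stride: prefer the backward-data
            # path regardless of the hint, as argued above.
            return 'backprop inputs'
        if border_mode == 'valid' and subsample == (1, 1):
            # Here the hint is trustworthy, so follow it.
            return direction_hint
        # Strided or custom-padded cases: the gradient ops do not line up
        # with what is asked, so fall back to the forward path.
        return 'forward'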

@nouiz
Member
nouiz commented Nov 19, 2014

That sounds good.

Just to be sure the quick fix gets done rapidly: do you plan to do it, or do I need to find time for this or find someone else? Thanks.


@f0k
Contributor
f0k commented Nov 19, 2014

I think I can do it later today.

@f0k
Contributor
f0k commented Nov 19, 2014

Just to add some more information: On a "real" graphics card, things look a bit different:

$ SKIP_LEGACY=1 SKIP_GEMM=1 python pylearn2_benchmark.py i1x115x80,k32x8x6,b64
Using gpu device 0: GeForce GTX 780 Ti

CONFIG: input = 1 x 115 x 80 * ker = 1 x 32 x 8 x 6 ( bs = 64 , stride = 1 )
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> fprop         ==>       2
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> bprop inputs  ==>      46
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> bprop weights ==>     144

(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> fprop         ==>       2
(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> bprop inputs  ==>       3
(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> bprop weights ==>       4

  1. The relative difference is more pronounced (a factor of up to 36).
  2. The backprop wrt. weights is most strongly affected, while it was the backprop wrt. inputs on the GT 640. So we need the direction hint, not just a different choice for border_mode="full".

@f0k
Contributor
f0k commented Nov 19, 2014

And just for the record: There are even configurations where the forward pass with zero-padding is faster for a full convolution than the backward pass wrt. inputs:

$ python pylearn2_benchmark.py i128x36x12,k64x6x3,b256
Using gpu device 0: GeForce GTX 780 Ti

CONFIG: input = 128 x 36 x 12 * ker = 128 x 64 x 6 x 3 ( bs = 256 , stride = 1 )
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> fprop         ==>      18
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> bprop inputs  ==>      15
(auto) theano.sandbox.cuda.dnn.GpuDnnConv          ==> bprop weights ==>      33

(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> fprop         ==>      18
(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> bprop inputs  ==>      32
(manual conv) theano.sandbox.cuda.dnn.GpuDnnConv   ==> bprop weights ==>      28

Look at bprop inputs: The current, "wrong" code path inserted by the graph optimizer is twice as fast as the "correct" code path. So the meta-optimizer should really try all possible algorithms.

@abergeron
Member

I really feel that this algorithm selection should be done inside cuDNN. The meta-optimizer should select between FFT, dnn, and other approaches.

However as long as you do the work, I won't oppose adding all the possible configurations to the meta-optimizer.

@nouiz
Member
nouiz commented Nov 19, 2014

Thanks for the info. It confirms that the meta-optimizer will be really useful.

@abergeron, I agree that ideally cuDNN should select its internal code path. But I'm not sure they will do this. Also, they would use a heuristic, not real timing, and heuristics do not always find the best configuration. So even if Nvidia decides to do it, the meta-optimizer can do a better job (timing for the exact shapes, no heuristic).


@nouiz
Member
nouiz commented Nov 24, 2014

This is done, so I'm closing it.

@nouiz nouiz closed this Nov 24, 2014