
Faster algorithms and gradients for GpuCorrMM #2033

Merged
merged 22 commits into Theano:master on Sep 1, 2014

Conversation

f0k
Contributor

@f0k f0k commented Aug 12, 2014

As a follow-up to #2023, this PR adds caffe's backward pass wrt. inputs to GpuCorrMM to implement border_mode="full". It passes all the tests and it's about 2-4x faster than simulating a full convolution with a padded valid convolution. On the way, it cleans up the code and adds some more elaborate documentation.

There are some caveats, though, all stemming from the fact that the implementation in caffe is meant as a backward pass for a valid correlation:

  • With border_mode="valid", GpuCorrMM doesn't flip the kernels, but with border_mode="full", it does.
  • With border_mode="valid", subsampling is for subsampling the output, but with border_mode="full", it is for upsampling the input.
  • With border_mode="valid", pad is for padding the input, but with border_mode="full", it is for cropping the output.
  • With border_mode="full", it needs a different memory layout for the kernels.

Currently, GpuCorrMM directly wraps the underlying algorithm, and local_conv_gemm() copes with the different peculiarities, because I wasn't sure whether the GpuCorrMM Op should be inserting dimshuffles, kernel flips and gpu_contiguouses on its own.

Looking at the end result, although it's quite fast, local_conv_gemm() now basically undoes everything that ConvOp.grad() does. It might be possible to write an optimizer that replaces the gradient of a valid convolution with a properly parameterized GpuCorrMM Op, instead of just replacing the full convolution Op that is part of the gradient (introducing redundant dimshuffles etc. on the way). This way we would leverage the caffe implementation better.
The alternative would be to modify the CUDA code to perform a subsampled, padded, full correlation, as would be expected from a Gpu Correlation Op. This would be cleaner, but it would also mean a lot more work and we wouldn't profit as much from caffe's implementation for the case of subsampling != (1,1) or padding != (0,0) (granted, this may be an uninteresting case in practice anyway).
/edit: Another alternative would be splitting this into two Ops: GpuCorrMM for valid correlation with padding and subsampling (the caffe forward pass), and GpuConvMM for full convolution with upsampling and cropping (the caffe backward pass). This way it would be obvious how the operations differ, but the memory layout required for the second Op would still be unintuitive.
Yet another alternative would be splitting it into GpuCorrMM and GpuCorrMM_gradInput. This way it would be obvious how to use GpuCorrMM_gradInput for computing the gradient of GpuCorrMM. We could even add GpuCorrMM_gradKernel for the gradient wrt. weights; it seems caffe does things slightly differently there as well.
In both cases, GpuCorrMM should get a grad() method to define its gradient (so it can be used directly in a CNN to avoid kernel flipping and whatnot) and local_conv_gemm() should still be able to replace any GpuConv instances with a gemm-based Op (so it can be used with existing CNN implementations).
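
For reference, the baseline mentioned above (simulating a full correlation with a padded valid one) can be sketched in plain NumPy/SciPy for a single channel; this is only an illustration of the identity, not the actual Op code:

import numpy as np
from scipy.signal import correlate2d

img = np.random.rand(8, 8).astype('float32')
kern = np.random.rand(3, 3).astype('float32')

# a "full" correlation equals a "valid" correlation on an input that was
# zero-padded by (kernel size - 1) on each border
pad_h, pad_w = kern.shape[0] - 1, kern.shape[1] - 1
padded = np.pad(img, ((pad_h, pad_h), (pad_w, pad_w)), mode='constant')

full_direct = correlate2d(img, kern, mode='full')
full_via_valid = correlate2d(padded, kern, mode='valid')
assert np.allclose(full_direct, full_via_valid)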

@f0k
Contributor Author

f0k commented Aug 12, 2014

PS: This is on top of #2023, which wasn't merged yet. If you merge #2023, I can rebase this PR on top of master and solve the merge conflicts.

@f0k f0k mentioned this pull request Aug 12, 2014
@abergeron
Member

It already says that it has conflicts. But I think you should wait until #2023 is merged before doing a rebase.

Apart from that, I think you have the right approach in the optimization. The redundant dimshuffles should get merged or eliminated by the optimizer.

@stencilman
Contributor

Thanks a lot @f0k!! These are the timing speed-ups that I get:
after your changes:

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  89.9%    89.9%      80.250s       2.79e-01s     C      288       18   GpuCorrMM{valid, (1, 1), pad=0}
   5.6%    95.6%       5.022s       2.24e-02s     C      224       14   GpuCorrMM{full, (1, 1), pad=0}

before your changes:

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  57.1%    57.1%      83.547s       2.90e-01s     C      288       18   GpuCorrMM{valid, (1, 1), pad=(0, 0)}
  39.0%    96.1%      57.025s       1.02e-01s     C      560       28   GpuCorrMM{full, (1, 1), pad=(0, 0)}

Maybe there are redundant ops in my graph or I am doing something very wrong. Is there a way to fine-tune my graph? For me, the fprop is still very good, but the bprop is now about 3x slower than torch7. @f0k: did you try the convnet-benchmarks (https://github.com/soumith/convnet-benchmarks)?

@f0k
Contributor Author

f0k commented Aug 12, 2014

@stencilman: Hmm, so after my change it is about 5 times faster and only called half as often (upper table)? What's the problem then?
I already extended the convnet benchmark to include corrmm in a branch on my fork: https://github.com/f0k/convnet-benchmarks/tree/theano-benchmark-better (I'm postponing the pull request until GpuCorrMM is somewhat more mature). The bprop wrt. inputs was 2-4x faster with the new version. What do you get?
@abergeron: Yes, I'll wait for the merge. And I'll think about splitting the Op as detailed above.

@benanne
Contributor

benanne commented Aug 12, 2014

I have nothing of value to contribute here, but I'd just like to say that this is awesome :)

@madisonmay
Contributor

Seconded -- awesome work so far @f0k, @stencilman, @abergeron.

@abergeron
Member

I barely did anything.

@stencilman
Contributor

So @f0k, in your fork (https://github.com/f0k/convnet-benchmarks/tree/theano-benchmark-better), how does this compare to torch mm? My bprop is 3x slower than Torch for some reason...

@f0k f0k mentioned this pull request Aug 13, 2014
@f0k
Contributor Author

f0k commented Aug 15, 2014

So I've rebased the PR onto master (the latest and greatest 9b3ea9e). @stencilman: I did a force push to my corrmm-faster-fullconv branch; if you still have it checked out, you should probably check out master, delete the branch and check it out anew.

As discussed in #2023, I will refactor this implementation to have a GpuCorrMM op with two gradient ops, similar to the cuda-convnet wrapper in pylearn2. This will allow us to have a fast bprop wrt. weights in addition to the fast bprop wrt. inputs enabled by 121c1d4. So I'd suggest to not merge it yet.

@stencilman
Contributor

@f0k: you are my hero!! Thank you so much, looking forward to hearing from you on this. Really grateful for this :-)

@stencilman
Contributor

@f0k: like I said earlier, do let me know if I can help you in any way to get this in quicker. Actually, my project depends on this :-). Thanks!

@f0k
Contributor Author

f0k commented Aug 17, 2014

@stencilman: Thanks for your offer; I plan to do it tomorrow (Monday). You can help by benchmarking against Torch again, and possibly with writing the tests... I'll let you know!

@stencilman
Contributor

@f0k: I would love to run it against Torch and write any needed tests. Thanks! 👍

@abergeron
Member

So is this considered finished?

Even if it's a bit slower than Torch, if it's faster than what we have and works correctly, it's good for merging. Unless you want to work more on it.

@f0k
Contributor Author

f0k commented Aug 18, 2014

@abergeron: Doesn't matter to me -- either you merge it and I'll file a new PR that refactors everything, or you don't merge it yet and I'll add a commit to this PR that refactors everything.

(I have the refactored version almost finished, there's just something flipped again...)

@abergeron
Member

I'll wait for the refactor then.

@f0k
Contributor Author

f0k commented Aug 18, 2014

OK, the refactored version is up. It passes all the previous tests (test_gemm_valid, test_gemm_full, test_gemm_subsample, test_gemm_directly). The optimizer doesn't use the new backprop wrt. weights yet, but the GpuCorrMM op now has gradients defined so it can be used instead of conv2d (and then the faster backprop wrt. weights will be used).

TODO:

  • Write tests for the gradients of GpuCorrMM. I didn't have time to test them at all yet.
  • Modify the conv_gemm optimizer to choose between GpuCorrMM and GpuCorrMM_gradWeights depending on the input shapes of any valid convolutions to be replaced. (Any suggestions?)

I'll be gone for today, but I'll happily accept pull requests to my pull request for any of the TODO items.

@stencilman
Contributor

@f0k: Wowowow!! I will take a stab at testing the gradients of GpuCorrMM. I will leave TODO 2 for someone with better knowledge of Theano (perhaps @nouiz). I will also report the speedup.

Thanks again, I am really grateful to you!

# call GpuCorrMM
# TODO: call GpuCorrMM_gradWeights instead if appropriate
return [GpuCorrMM('valid', subsample, pad)(
        gpu_contiguous(img), gpu_contiguous(kern))]
Contributor Author

Any suggestions on when to use GpuCorrMM_gradWeights instead?

Member

I'm not convinced we needed to split that into 2 ops. Here is some info to help find which one to use:
fprop: img shape 32x32, kernel 5x5, output 28x28.
In that case, if we reuse the fprop code for the gradient, the input img is 32x32, the kernel is 28x28 and the output is 5x5.

What about this: if the kernel is bigger than the output, we use the gradWeights case? The only good way to know this is at run time, which is why I think we should keep both valid cases in the same op. But with your code, it should be easy. Before doing this change, though, we should do some benchmarking of my choice of threshold. @stencilman, can you do that? Time GpuCorrMM vs. GpuCorr_gradWeights for different input sizes, including my examples?
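
For a stride-1, unpadded valid correlation, those shapes follow from output = input - kernel + 1; a quick, purely illustrative check of the numbers above:

def valid_out(in_size, kern_size):
    # output size of a valid correlation at stride 1 without padding
    return in_size - kern_size + 1

print(valid_out(32, 5))   # 28: the fprop output in the example
print(valid_out(32, 28))  # 5: reusing the fprop code with the 28x28 "kernel" gives the 5x5 output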

Contributor

Yes, I can do this now. @nouiz: what exactly do you mean by GpuCorr_gradWeights?

Contributor

@nouiz:
torch:

CONFIG: input = 64x32x32 * ker = 64x128x5x5 (bs = 32, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.0090797543525696)
SpatialConvoltionMM:accGradParameters(): (tm = 0.011068224906921)
SpatialConvolutionMM:updateGradInput(): (tm = 0.0082367658615112)

CONFIG: input = 64x32x32 * ker = 64x128x28x28 (bs = 32, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.032178223133087)
SpatialConvoltionMM:accGradParameters(): (tm = 0.032377779483795)
SpatialConvolutionMM:updateGradInput(): (tm = 0.02130252122879)

and theano:

CONFIG: input = 64 x 32 x 32 * ker = 64 x 128 x 5 x 5 ( bs = 32 , stride = 1 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 1323.00504803 GFLOP/s ( tm = 0.00776719999313 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0100939035416 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.00898323154449 )

CONFIG: input = 64 x 32 x 32 * ker = 64 x 128 x 28 x 28 ( bs = 32 , stride = 1 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 308.976205726 GFLOP/s ( tm = 0.0332583694458 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0303143196106 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0210659198761 )

Is this what you were looking for?

Member

No, a Theano GpuCorrMM vs. GpuCorrMM_gradWeights. Both do the same computation, just with a different algorithm.

Contributor Author

I think the batch size will also play a role, because in one case (fprop), the gemm is used to accumulate over the channels and the for loop to iterate over the minibatch, and in the other case (bprop weights), the for loop is used to accumulate over the channels (= minibatch of the forward pass) and gemm to iterate over the minibatch (= input channels of the forward pass).

I had a benchmark of the two implementations running over night, I will try to figure out a formula to choose between them. A nice aspect is that this may allow us to be faster than caffe/Torch for certain configurations, because in caffe/Torch it's hard-coded which implementation to use in the forward and backward pass.

The problem I see with joining both implementations into one Op is that they require very different memory layouts for the input and output. To compare, these three perform the same computation (copied from test_gemm_directly()):

# (setup as in test_gemm_directly(): i and k are symbolic 4D GPU variables,
#  npy_img and npy_kern the corresponding float32 numpy arrays, and py_conv
#  the CPU reference implementation)

# reference: valid convolution on the CPU
cpuval = py_conv(npy_img, npy_kern, 'valid', subsample)

# same result via GpuCorrMM: it computes a correlation, so the kernel is
# flipped explicitly
op = theano.sandbox.cuda.blas.GpuCorrMM(border_mode='valid',
        subsample=subsample)(i, k)
f = theano.function([i, k], op, mode=theano_mode)
gpuval1 = f(npy_img, npy_kern[:,:,::-1,::-1])

# same result via GpuCorrMM_gradWeights: batch and channel axes are swapped
# on both inputs, and swapped back on the output
op = theano.sandbox.cuda.blas.GpuCorrMM_gradWeights(border_mode='valid',
        subsample=subsample)(i, k)
f = theano.function([i, k], op, mode=theano_mode)
gpuval2 = numpy.array(f(npy_img.transpose(1, 0, 2, 3),
        npy_kern.transpose(1, 0, 2, 3)[:,:,::-1,::-1])).transpose(1, 0, 2, 3)

# cpuval, gpuval1 and gpuval2 are now approximately equal

If we do the necessary dimshuffles, kernel flippings and gpu_contiguous calls inside the C code of a merged Op, we will probably do unneeded copies. If we do them in Python, we might still do unneeded copies unless we create an optimizer that can merge chains of dimshuffles with gpu_contiguous calls in between. Besides, keeping the code similar to caffe makes it easier to port upstream changes to Theano in case there are any.

So I'd prefer to keep the Ops separate for now, and leave merging them for another PR if that's okay with you.

Contributor Author

Okay, from my benchmark the magic formula seems to be:

if batchsize * kernelHeight * kernelWidth < inputChannels * outputHeight * outputWidth:
    use GpuCorrMM
else:
    use GpuCorrMM_gradWeights

Unfortunately, even if ConvOp knows about the shapes, GpuConv only knows about image and kernel width and height and the number of input channels, but not about the batchsize. So for now I'll have the optimizer decide based on kernel and output size, following your suggestion -- this gets most cases correct already.
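
As a sketch, the selection rule above could look like the following standalone helper (the function name and the exact decision it encodes are illustrative only, not the code that went into the optimizer):

def choose_gemm_op(batchsize, input_channels, kernel_shape, output_shape):
    # pick the gemm-based implementation expected to be faster for a
    # stride-1 valid correlation, following the benchmark above
    kH, kW = kernel_shape
    oH, oW = output_shape
    if batchsize * kH * kW < input_channels * oH * oW:
        return 'GpuCorrMM'             # gemm accumulates over the channels
    return 'GpuCorrMM_gradWeights'     # gemm iterates over the minibatch

# e.g. choose_gemm_op(32, 64, (5, 5), (28, 28)) -> 'GpuCorrMM'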

Member

We use c_code_helper in a few other ops, so I would go with that name.

I'm good with the idea of merging the ops later. If this PR is stable and gives more performance than the current master, it can be merged. But I think we need to figure out how to fix @stencilman's bug.

@stencilman, I understand your script is complicated. Can you dump your Theano function and make a small Python script that loads it and feeds good input values to the function that generates the error?

http://www.deeplearning.net/software/theano/tutorial/debug_faq.html?highlight=dump%20function#dumping-a-function-to-help-debug

I suppose the problem happens in the first call to the Theano function. Is that the case?

@f0k, it is easy to add the batch size to the GpuConv op. You just need to add a parameter to GpuConv.__init__, store it as a member, modify __eq__ and __hash__ to check it, and modify the optimization in cuda/opt.py to pass it in. My sentence is long, but it is about 5-10 lines of code. If you want me to do it, just ask and I'll do it. That way, you can take the batch size into account when it is set. Just make sure the opt doesn't crash if only some of the shapes are known. It is fine to do it only if we know all shapes.

thanks


@f0k
Contributor Author

f0k commented Aug 18, 2014

Added tests for GpuCorrMM_gradWeights and GpuCorrMM_gradInputs to test_gemm_directly, and fixed a mistake in GpuCorrMM_gradWeights. Still missing: a test_gemm_grads test that checks whether the gradients of GpuCorrMM are actually the correct gradients. I'll see if I can do it before I really leave; otherwise I'll leave it open for tomorrow or for somebody else!

@f0k
Contributor Author

f0k commented Aug 18, 2014

I added a test for the gradients (test_gemm_grads). It passes for mode="valid", subsample=(1,1) and pad=(0,0), so if you don't use subsampling or padding, the GpuCorrMM()(img, gpu_contiguous(kern[:,:,::-1,::-1])) from this PR can already serve as a replacement for conv2d(img, kern). Note that this is currently required to get a speedup over the automatic optimization.
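
For clarity, a minimal sketch of that drop-in usage (assuming a CUDA-enabled Theano setup; the variable names are placeholders):

import theano
import theano.tensor as T
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.blas import GpuCorrMM

img = T.ftensor4('img')    # (batch, in_channels, height, width)
kern = T.ftensor4('kern')  # (out_channels, in_channels, kH, kW)

# conv2d(img, kern) in "valid" mode: GpuCorrMM computes a correlation,
# so the kernel is flipped explicitly before the call
out = GpuCorrMM(border_mode='valid')(
    gpu_contiguous(img), gpu_contiguous(kern[:, :, ::-1, ::-1]))
f = theano.function([img, kern], out)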

Additional TODOs:

  • GpuCorrMM needs to pass the input shape to GpuCorrMM_gradInputs and the weight shape to GpuCorrMM_gradWeights to make it work for all subsampling values. Currently, the shape is inferred incorrectly if the output image size is not divisible by the subsampling size.
  • At least one of the gradient ops fails for pad="auto".
  • GpuCorrMM_gradInputs and GpuCorrMM_gradWeights should probably get a grad() function as well.

@stencilman
Contributor

|  | input = 64x64x64, ker = 64x128x9x9 (bs = 128, stride = 1) | input = 128x32x32, ker = 128x128x9x9 (bs = 128, stride = 1) | input = 3x128x128, ker = 3x96x11x11 (bs = 128, stride = 1) | input = 384x13x13, ker = 384x384x3x3 (bs = 128, stride = 1) | input = 128x16x16, ker = 128x128x7x7 (bs = 128, stride = 1) |
| --- | --- | --- | --- | --- | --- |
| torch fprop | 0.24305301904678 | 0.16899198293686 | 0.10639077425003 | 0.055703997612 | 0.041813731193 |
| GpuCorrMM fprop | 0.225996734619 | 0.159330703735 | 0.0936857147217 | 0.0532916145325 | 0.04246251297 |
| torch grad wrt input | 0.30804550647736 | 0.13152647018433 | 0.091511249542 | 0.02750152349472 | 0.02301424741745 |
| GpuCorrMM bprop wrt inputs | 0.288951873779 | 0.121228462219 | 0.086506401062 | 0.0258434085846 | 0.0209931201935 |
| torch grad wrt weights | 0.37771952152252 | 0.1442049741745 | 0.18967247009277 | 0.03774124383 | 0.023773729801 |
| GpuCorrMM bprop wrt weights | 0.352622833252 | 0.132992980957 | 0.169927322388 | 0.0339831695557 | 0.0214208164215 |


@benanne
Contributor

benanne commented Aug 18, 2014

Very nice :) @f0k, out of curiosity, have you tried using this for 1D convolutions? If so, how's the performance? Would it make sense to make a specialized 1D version of this, or is the 1D-using-2D approach good enough?

@f0k
Contributor Author

f0k commented Aug 18, 2014

@stencilman: Cool, thanks for testing! Seems we're well on our way here!
@benanne: I haven't tried, but the theano benchmark in convnet-benchmarks accepts a custom config on the command line, so you can time the 1D-using-2D approach (but the bprop wrt. weights will not be fast yet). If by 1D convolution you refer to your approach of shifting a 2D patch over a spectrogram in the time dimension only, then I don't see a way to make it faster. If you refer to convolving a vector with another vector, then... hmm... do you still have batches? Multiple input and output channels?

@benanne
Contributor

benanne commented Aug 18, 2014

Yeah, I meant the former. I have a suspicion that for the CorrMM-approach to computing convolutions, the 1D-using-2D approach is automatically the best thing you can do, which is why I'm interested to find out how it performs :) If I find some time I guess I'll try it for myself using the benchmark.

@stencilman
Contributor

During backprop, I get the error below when I use GpuCorrMM directly. However, when I use the optimizer=fast_compile option, I do not get this error.

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1116d80000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total 
CudaNdarray_uninit: error freeing self->devdata. (self=0xd7614f0, self->devata=0x1116d80000)
Traceback (most recent call last):
  File "deep_nets.py", line 71, in <module>
    model.trainEpoch(data.tr, conf, epoch)
  File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
    err = self.train_model()
  File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ajain/Theano/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
RuntimeError: Cuda error: GpuElemwise node_600ed41afec07a555b0c58a139e2f767_0 Mul: an illegal memory access was encountered.
    n_blocks=30 threads_per_block=256
   Call: kernel_Mul_node_600ed41afec07a555b0c58a139e2f767_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, i0_data, i1_data, o0_data)

Apply node that caused the error: GpuElemwise{Mul}[(0, 1)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]},no_inplace}.0, GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)}.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(16, 256, 90, 90), (16, 256, 90, 90)]
Inputs strides: [(2073600, 8100, 90, 1), (2073600, 8100, 90, 1)]
Inputs scalar values: ['not scalar', 'not scalar']

HINT: Re-running with most Theano optimization disabled could give you a back-traces when this node was created. This can be done with by setting the Theano flags optimizer=fast_compile
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.

When I run with exception_verbosity=high, I get the error below.

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1116d80000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total 
CudaNdarray_uninit: error freeing self->devdata. (self=0xcfd32f0, self->devata=0x1116d80000)
Traceback (most recent call last):
  File "deep_nets.py", line 71, in <module>
    model.trainEpoch(data.tr, conf, epoch)
  File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
    err = self.train_model()
  File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ajain/Theano/theano/gof/link.py", line 164, in raise_with_op
    print_type=True)
  File "/home/ajain/Theano/theano/printing.py", line 104, in debugprint
    stop_on_name=stop_on_name)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 613, in debugprint
    print >> file, '%s%s %s%s' % (prefix, r, id_str, type_str)
  File "/home/ajain/Theano/theano/sandbox/cuda/var.py", line 52, in __str__
    return "CudaNdarrayConstant{"+str(numpy.asarray(self.data))+"}"
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
RuntimeError: error copying data to host

Any idea what might be wrong?

@f0k
Contributor Author

f0k commented Aug 27, 2014

Added documentation about the CUBLAS bug and rebased onto master. The tests pass, Travis passes, so this should finally be good for merging. Thanks to everyone involved in reviewing and testing!

prod2 *= node.op.imshp[0]
# compare to decide
if prod1 > prod2:
    return [gpu_contiguous(GpuCorrMM_gradWeights('valid', subsample, pad)(
Member

Why the gpu_contiguous on that line? I think it isn't needed and can be removed.

Contributor Author

Hmm, you're right. I thought it should have the same memory layout as before the replacement. Removed it.

@nouiz
Member

nouiz commented Aug 29, 2014

I did a global review. There are just two things left before merging:

  • The return NULL
  • Optional but welcome: remove the unneeded gpu_contiguous.

Thanks for the good work!

@f0k
Contributor Author

f0k commented Aug 30, 2014

Thanks for the review, @nouiz! Fixed everything you noticed.
/edit: please hold off on the merge; removing the gpu_contiguous was wrong, I'm just figuring it out...

@f0k
Contributor Author

f0k commented Sep 1, 2014

Removing the gpu_contiguous as per @nouiz's suggestion introduced a problem: The dimshuffle() call on the output inserts a DimShuffle op, which returns a CPU output. This changes the type of the output, which is not allowed and results in:

TypeError: ('The type of the replacement must be the same as the type of the original Variable.', GpuConv{valid, (1, 1), None, (6, 6), True, (1, 287, 29), (6, 6)}.0, DimShuffle{1,0,2,3}.0, CudaNdarrayType(float32, 4D), TensorType(float32, 4D), 'local_conv_gemm')

We could use GpuDimShuffle instead, or wrap it in an as_cuda_ndarray_variable call to promise that it's going to be moved to the GPU. I opted for the latter.
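
The pattern looks roughly like this (a sketch with a hypothetical helper name, not the exact optimizer code):

from theano.sandbox.cuda.basic_ops import as_cuda_ndarray_variable

def dimshuffle_on_gpu(gpu_var):
    # the dimshuffle may come back as a host-side variable; wrapping it in
    # as_cuda_ndarray_variable promises it will live on the GPU, so the
    # replacement keeps the CudaNdarrayType of the variable it replaces
    return as_cuda_ndarray_variable(gpu_var.dimshuffle(1, 0, 2, 3))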

Interestingly, this wasn't triggered in the tests, I only noticed when running the Theano benchmark script from convnet-benchmarks. I guess that's a general problem with the tests which should be addressed in a separate issue.

@f0k
Contributor Author

f0k commented Sep 1, 2014

I've got another optimization question. This is not actually part of this PR, but related to it, so I'll just ask here.

When using GpuCorrMM directly (on a flipped kernel), I get the following graph and timing for the gradient wrt. inputs:

GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)} [@A] ''   3
 |GpuContiguous [@B] ''   2
 | |GpuSubtensor{::, ::, ::int64, ::int64} [@C] ''   1
 |   |<CudaNdarrayType(float32, 4D)> [@D]
 |   |Constant{-1} [@E]
 |   |Constant{-1} [@E]
 |GpuContiguous [@F] ''   0
   |<CudaNdarrayType(float32, 4D)> [@G]
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.012757648468 )

When using the standard conv2d and letting the optimizer replace it, I get:

GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)} [@A] ''   5
 |GpuContiguous [@B] ''   4
 | |GpuDimShuffle{1,0,2,3} [@C] ''   3
 |   |GpuSubtensor{::, ::, ::int64, ::int64} [@D] ''   2
 |     |GpuDimShuffle{1,0,2,3} [@E] ''   1
 |     | |<CudaNdarrayType(float32, 4D)> [@F]
 |     |Constant{-1} [@G]
 |     |Constant{-1} [@G]
 |GpuContiguous [@H] ''   0
   |<CudaNdarrayType(float32, 4D)> [@I]
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0244923362732 )

It's 100% slower than the former, seemingly because of the two dimshuffles that are actually redundant. The original conv2d graph for this case looks like this:

GpuConv{full, (1, 1), None, (6, 6), True, (32, 282, 24), (6, 6)} [@A] ''   2
 |<CudaNdarrayType(float32, 4D)> [@B]
 |GpuSubtensor{::, ::, ::int64, ::int64} [@C] ''   1
   |GpuDimShuffle{1,0,2,3} [@D] ''   0
   | |<CudaNdarrayType(float32, 4D)> [@E]
   |Constant{-1} [@F]
   |Constant{-1} [@F]
theano.tensor.nnet.conv.conv2d bprop inputs: 0.0 GFLOP/s ( tm = 0.0263266887665 )

Any idea on how to fix that? Swap the order of dimshuffle and [:,:,::-1,::-1] in the ConvOp gradient and hope they get merged? Introduce an optimizer that swaps dimshuffle and subtensor (shuffling the subtensor slices as needed)?
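
As a sanity check that swapping the two steps is safe: the dimshuffle of the leading axes and the flip of the trailing axes commute, e.g. in NumPy:

import numpy as np

k = np.random.rand(4, 3, 5, 5).astype('float32')
a = k.transpose(1, 0, 2, 3)[:, :, ::-1, ::-1]   # dimshuffle first, then flip
b = k[:, :, ::-1, ::-1].transpose(1, 0, 2, 3)   # flip first, then dimshuffle
assert np.array_equal(a, b)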

/edit: Swapping the dimshuffle and flipping in the ConvOp gradient results in:

GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)} [@A] ''   5
 |GpuContiguous [@B] ''   4
 | |GpuDimShuffle{1,0,2,3} [@C] ''   3
 |   |GpuDimShuffle{1,0,2,3} [@D] ''   2
 |     |GpuSubtensor{::, ::, ::int64, ::int64} [@E] ''   1
 |       |<CudaNdarrayType(float32, 4D)> [@F]
 |       |Constant{-1} [@G]
 |       |Constant{-1} [@G]
 |GpuContiguous [@H] ''   0
   |<CudaNdarrayType(float32, 4D)> [@I]
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0244253120422 )

There's something wrong with the dimshuffles. a) They should have been eliminated in optimization, and b) they should be nearly zero-cost operations, shouldn't they? I guess they are not eliminated in optimization because the first one has already been moved to GPU when the second one is added to the graph, and there is no optimization for GpuDimShuffles.

Anyway, I'll open a separate issue when this PR has been merged.

nouiz added a commit that referenced this pull request Sep 1, 2014
Faster algorithms and gradients for GpuCorrMM
@nouiz nouiz merged commit cfc493d into Theano:master Sep 1, 2014
@f0k f0k deleted the corrmm-faster-fullconv branch September 1, 2014 17:12
@f0k
Contributor Author

f0k commented Sep 1, 2014

Yay, thanks for merging, @nouiz!

@stencilman
Contributor

Thanks a lot @nouiz! I can tell people to use master and not to merge with @f0k's branch any more! 👍

@nouiz
Member

nouiz commented Sep 2, 2014

Can you time the gpu_contiguous in the fast and slow case with the profile=True and profile_memory=True Theano flags? My guess is that the -1 in the subtensor causes the gpu_contiguous to be slower. This causes a different memory layout of the input, and we didn't optimize for that case.
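
If it helps, one way to get that profile (the flags can just as well be set via THEANO_FLAGS on the command line):

import theano
theano.config.profile = True
theano.config.profile_memory = True
# build and call the theano.function as usual; the per-Op profile
# (including GpuContiguous) is printed when the process exits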

@nouiz
Member

nouiz commented Sep 2, 2014

Should we update the GpuCorrMM doc to say that it is faster when called directly?

What about doing a corr2d() method that calls conv2d on the CPU with the flipping, so that the flips can cancel each other out after substitution? Maybe we will need to make the merge-subtensor optimization work for the GPU ops. What do you think of this as a user interface? @lamblin @abergeron, your input on that interface would be welcome.

@f0k
Contributor Author

f0k commented Sep 2, 2014

My guess is that the -1 in the subtensor causes the gpu_contiguous to be slower.

That's the same in both the slow and fast case, because I explicitly did a [:,:,::-1,::-1] on the weights when using GpuCorrMM directly. So the only difference is the dimshuffle. But I tried with some other shapes and then both cases perform the same... maybe we shouldn't worry too much.

Oh, maybe another thing that I observed: When I tried calling GpuCorrMM directly with some SharedVariable instances, they were not C-contiguous. Is this to be expected?

Should we update the GpuCorrMM doc to say that it is faster when called directly?

No, because depending on the shapes, it can actually be slower when called directly. The goal should be to have the automatic graph optimization from conv2d be on par or better than a manual GpuCorrMM in all cases.

What about doing a corr2d() method

I think it's not needed for now. Let's first see if the kernel flipping really is a bottleneck and if so, we can still introduce a corr2d or update the conv2d documentation to say that flipping the kernels may improve performance in some cases.

@abergeron
Member

I would go for making merge subtensor work on GPU ops. I don't think a proliferation of mostly similar interfaces is a good idea.

@ballasn ballasn mentioned this pull request Sep 4, 2014