
Conv gemm non-square kernel support #2023

Merged
merged 11 commits into Theano:master on Aug 14, 2014

Conversation

stencilman
Contributor

This adds support for non-square kernel and stride sizes. It passes all tests.

Features:

  • All filter sizes (including non-square) supported.
  • All stride sizes supported.

This finishes some TODOs in gh-2015.

…sample (x, y) values and all kernel, batch, and image sizes compatible on GPU. @nouiz: Can you please run the tests once again, just to be doubly sure? I think it works correctly.
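For illustration, here is a minimal sketch of what this enables; it assumes the GpuCorrMM op from theano.sandbox.cuda.blas with the border_mode/subsample constructor arguments discussed below, and the shapes are made up:

import theano
import theano.tensor as T
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.blas import GpuCorrMM

inputs = T.ftensor4('inputs')    # (batch, channels, rows, cols)
filters = T.ftensor4('filters')  # e.g. (nfilters, channels, 5, 3): non-square
# subsample=(2, 1): stride 2 vertically, stride 1 horizontally
op = GpuCorrMM(border_mode='valid', subsample=(2, 1))
# gpu_contiguous guards against non-contiguous inputs
out = op(gpu_contiguous(inputs), gpu_contiguous(filters))
f = theano.function([inputs, filters], out)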
@f0k
Contributor

f0k commented Aug 8, 2014

Looks good to me, and the tests pass on my office computer:

git fetch upstream pull/2023/head:corrmm
git checkout corrmm
cd theano/sandbox/cuda/tests
nosetests test_conv_cuda_ndarray.py
Using gpu device 0: GeForce GT 640
.............
----------------------------------------------------------------------
Ran 13 tests in 839.078s

OK

You should just change the docstring of GpuCorrMM.__init__(); it still says strides are unsupported.
/Edit: And the local_conv_gemm() optimizer should skip the node.op.subsample == (1, 1) test.
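For clarity, the optimizer change I mean looks roughly like this (a sketch only; the real local_conv_gemm in theano/sandbox/cuda/opt.py may differ in details, and I only show the valid case):

from theano.gof import local_optimizer
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.blas import GpuConv, GpuCorrMM

@local_optimizer([GpuConv])
def local_conv_gemm(node):
    if isinstance(node.op, GpuConv) and node.op.border_mode == 'valid':
        # the old guard `node.op.subsample == (1, 1)` is gone:
        # GpuCorrMM now handles arbitrary strides itself
        img, kern = node.inputs
        return [GpuCorrMM('valid', node.op.subsample)(
            gpu_contiguous(img), gpu_contiguous(kern))]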

@nouiz
Member

nouiz commented Aug 8, 2014

I did a PR to this PR with some doc fixes. Can you review it? If you are good with that, merge it and I'll merge this PR.


@stencilman
Contributor Author

Thanks for the doc changes @nouiz, and thanks for testing it @f0k! The doc changes look good to me.

@f0k
Contributor

f0k commented Aug 8, 2014

I've also sent you a PR for this PR to have the conv_gemm optimizer support strided convolution.

@nouiz
Member

nouiz commented Aug 8, 2014

Jan's commit indicates that we didn't get it right; otherwise, the tests would have failed. The problem is that test_valid and test_full don't try subsampling. Can you modify test_subsample like test_valid to test those cases?

@f0k
Contributor

f0k commented Aug 11, 2014

Ah, I see, test_valid and test_full were always meant to test the new convolution only? They should probably be named differently then.

@stencilman: I've sent you a pull request to address Fred's suggestion.

@f0k
Contributor

f0k commented Aug 11, 2014

@nouiz: Everything passes now!

@@ -7,6 +7,7 @@


import numpy
import scipy
This is not OK. SciPy is an optional dependency for Theano, so the tests still need to work (i.e., not crash) when it isn't there.
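The usual guard for that looks like this (a sketch; the test name is a placeholder):

try:
    import scipy
    imported_scipy = True
except ImportError:
    imported_scipy = False

from nose.plugins.skip import SkipTest

def test_something_using_scipy():
    if not imported_scipy:
        raise SkipTest('scipy is not available, skipping this test')
    # ... the scipy-dependent checks go here ...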

@f0k
Contributor

f0k commented Aug 13, 2014

@stencilman: I've sent you a PR, hopefully the last one, to clean up the tests: test_valid, test_full and test_subsample now test the original convolution code; test_gemm_valid, test_gemm_full and test_gemm_subsample reuse these test suites to test the gemm-based convolution code (inserted via graph optimization); and test_gemm_directly tests the gemm-based convolution by manually constructing a graph with it. test_subsample fails on my office machine, but the other six test suites pass. I'll have a look at it...
/Edit: Okay, fixed it. You can merge my PR now, and then hopefully your PR can be merged into Theano.
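Roughly, the reuse works like this (a sketch; test_valid here stands in for the real suite, and I assume the gemm optimization is registered under the 'conv_gemm' tag):

import theano

def test_valid(mode=None):
    # stand-in: the real suite runs the valid-convolution checks,
    # compiling with the given mode
    pass

def gemm_mode():
    # a mode whose optimizer replaces GpuConv with the gemm-based op
    return theano.compile.get_default_mode().including('conv_gemm')

def test_gemm_valid():
    test_valid(mode=gemm_mode())  # same checks, gemm ops inserted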

@stencilman
Contributor Author

@f0k: Did I merge it right, or did I screw it up?

@f0k
Contributor

f0k commented Aug 13, 2014

@stencilman: Hmm, you merged the wrong branch of mine. I'll instruct you how to fix it in a minute.

@f0k
Contributor

f0k commented Aug 13, 2014

So the pull request I meant was this one: https://github.com/stencilman/Theano-1/pull/7
In your checkout, please do the following:

git checkout conv_gemm
git reset --hard 4d4c928
git push --force origin conv_gemm

This way the PR will be reset to the state of this morning. Afterwards, go to https://github.com/stencilman/Theano-1/pull/7 and click the green merge button to accept my PR.

Cleanup CUDA convolution tests
@stencilman
Contributor Author

Thanks a lot @f0k!! :-)

Also, do you mind sharing your findings regarding speed comparison with torch7? I would be very grateful. Thank you!

@f0k
Contributor

f0k commented Aug 13, 2014

I don't have torch7 installed, and the installation instructions scare me. My updated Theano benchmark is merged into convnet-benchmarks, though; please feel free to try it and report back!

@stencilman
Contributor Author

Here are the results of the benchmark I get on a Titan Black. As you can see, we are very competitive, except that for some reason our bprop wrt. weights is slow (up to 3x). Do you have any clue why?

Torch7:

CONFIG: input = 3x128x128 * ker = 3x96x11x11 (bs = 128, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.11015701293945)
SpatialConvolutionMM:updateGradInput(): (tm = 0.099327743053436)
SpatialConvoltionMM:accGradParameters(): (tm = 0.22154402732849)

CONFIG: input = 64x64x64 * ker = 64x128x9x9 (bs = 128, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.24804699420929)
SpatialConvolutionMM:updateGradInput(): (tm = 0.3508819937706)
SpatialConvoltionMM:accGradParameters(): (tm = 0.39958328008652)

CONFIG: input = 128x32x32 * ker = 128x128x9x9 (bs = 128, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.1720854640007)
SpatialConvolutionMM:updateGradInput(): (tm = 0.14392179250717)
SpatialConvoltionMM:accGradParameters(): (tm = 0.15553104877472)

Ours

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
(experimental) theano...blas.CorrMM fprop: 1171.54006131 GFLOP/s ( tm = 0.106029006958 )
(experimental) theano...blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0922187194824 )
(experimental) theano...blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.278021148682 )

CONFIG: input = 64 x 64 x 64 * ker = 64 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
(experimental) theano...blas.CorrMM fprop: 2329.91436963 GFLOP/s ( tm = 0.228639373779 )
(experimental) theano...blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.351494049072 )
(experimental) theano...blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.958748046875 )

CONFIG: input = 128 x 32 x 32 * ker = 128 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
(experimental) theano...blas.CorrMM fprop: 1035.75263494 GFLOP/s ( tm = 0.188934539795 )
(experimental) theano...blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.152022628784 )
(experimental) theano...blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.40099987793 )

@f0k
Contributor

f0k commented Aug 13, 2014

Cool, thanks for the direct comparison! Our bprop wrt. weights uses the same algorithm as the fprop (both do a valid convolution, and we currently only have a single gemm-based algorithm for that). Caffe uses a slightly different variant for the bprop wrt. weights, and it seems this is faster.

The results indicate that we should really split GpuCorrMM into three ops (answering my question in #2033): the forward pass for valid correlation, GpuCorrMM_gradInput for the gradient wrt. inputs (a full convolution), and GpuCorrMM_gradWeights for the gradient wrt. weights (a valid... convolution, if I see correctly). If we then give GpuCorrMM a grad() method, it should perform similarly to Torch. Adapting the optimizer to choose the best replacement for any GpuConv ops it stumbles upon (to have it work for models using the standard conv2d() instead of directly using GpuCorrMM) might be tricky. My ultimate goal would be a meta-optimizer that tries the different variants we have, including the FFT-based ops, and then chooses the best-performing replacement for each individual GpuConv.
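To be explicit about the three computations (a NumPy sketch, single channel and single filter; the op names above are only the proposal):

import numpy as np

def corr_valid(img, w):
    # valid 2-D cross-correlation: what the forward pass computes
    kh, kw = w.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * w)
    return out

rng = np.random.RandomState(0)
img, w = rng.randn(8, 9), rng.randn(3, 4)  # non-square on purpose
top = rng.randn(6, 6)                      # gradient wrt the output

# gradient wrt weights: a valid correlation of the input with top grad
d_w = corr_valid(img, top)                 # shape (3, 4), like w

# gradient wrt inputs: a full convolution of top grad with the kernel,
# i.e. valid correlation of the padded top grad with the flipped kernel
padded = np.pad(top, ((2, 2), (3, 3)), mode='constant')
d_img = corr_valid(padded, w[::-1, ::-1])  # shape (8, 9), like img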

PS: Let's hope someone has both mercy with us and time to merge this PR soon so we can go on :)

@stencilman
Contributor Author

I see, thanks for your explanation.

Yes, please merge this PR! @nouiz is perhaps away until Friday, and then things should move faster.

Hmm, so what do you think is the best and fastest way to make it that fast? I am happy to write code for it. I need the conv to be at least as fast as torch7 to be able to use Theano.

@f0k
Contributor

f0k commented Aug 13, 2014

Hmm, so what do you think is the best and fastest way to make it that fast?

Well, when this PR is merged, I will rebase my other PR and then we can either work from there or start a second attempt. I would redefine the mode parameter in the CUDA code not to switch between valid and full convolution, but to switch between forward pass, bprop wrt. inputs and bprop wrt. weights, with the three matrix arguments always referring to the layer input, the filters and the layer output (instead of swapping input and output for full convolution, which made everything harder). There would be three corresponding Theano ops, the first of which would use the other two to define its gradient (similar to the cuda-convnet wrapper in pylearn2). The three ops should share as much of their C code as possible.
I'd rather like to do this tomorrow than next week... if @nouiz is away, maybe @abergeron can merge your PR?
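Structurally, I picture something like this (a pure-Python skeleton with placeholder bodies, not the real op machinery; all names follow the plan above):

class BaseGpuCorrMM(object):
    # shared setup; the three ops would also share most of their C code
    def __init__(self, border_mode='valid', subsample=(1, 1)):
        self.border_mode = border_mode
        self.subsample = subsample

class GpuCorrMM_gradInputs(BaseGpuCorrMM):
    def __call__(self, weights, top):    # a full convolution
        raise NotImplementedError('CUDA mode: bprop wrt. inputs')

class GpuCorrMM_gradWeights(BaseGpuCorrMM):
    def __call__(self, bottom, top):     # a valid convolution
        raise NotImplementedError('CUDA mode: bprop wrt. weights')

class GpuCorrMM(BaseGpuCorrMM):
    # forward pass: valid correlation of bottom with weights
    def grad(self, inputs, output_grads):
        bottom, weights = inputs
        top, = output_grads
        d_bottom = GpuCorrMM_gradInputs(self.border_mode,
                                        self.subsample)(weights, top)
        d_weights = GpuCorrMM_gradWeights(self.border_mode,
                                          self.subsample)(bottom, top)
        return d_bottom, d_weights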

I am happy to write code for it.

Thanks, I'll let you know how you can help!

I need the conv to be at least as fast as torch7 to be able to use Theano.

What about the FFT-based convolution then? In convnet-benchmarks, it is faster than Torch7 for all configurations except L1.
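Trying it looks roughly like this (the module path is the one from the benchmarks; treat the exact signature as an assumption):

import theano
import theano.tensor as T
from theano.sandbox.cuda.fftconv import conv2d_fft

x = T.ftensor4('x')
w = T.ftensor4('w')
y = conv2d_fft(x, w, border_mode='valid')  # no subsampling support
f = theano.function([x, w], y)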

@abergeron
Member

My GPU is occupied with other tests right now. I'll give it a spin after that and merge if the tests pass (should be later tonight or tomorrow).

@stencilman
Contributor Author

@f0k or @benanne: I cannot get fft to work; somehow scikits.cuda.cublas.cublasCgemmBatched seems to be missing from the scikits.cuda.cublas module (all batched versions seem to be missing, while e.g. cublasCgemm exists). Do you have any idea why? Thanks.

@abergeron
Member

You have to install the development version. The last release doesn't have the necessary bindings.
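You can quickly check whether your installed scikits.cuda has the batched bindings:

import scikits.cuda.cublas as cublas

for name in ('cublasCgemmBatched', 'cublasSgemmBatched'):
    # releases before the development version lack these attributes
    print('%s: %s' % (name, hasattr(cublas, name)))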

@stencilman
Contributor Author

Any updates on merging this PR? Thanks!

abergeron added a commit that referenced this pull request Aug 14, 2014
Conv gemm non-square kernel support
@abergeron merged commit 0037c72 into Theano:master on Aug 14, 2014
@abergeron
Member

Just noticed that the tests have been successful.

@stencilman
Contributor Author

Thanks a lot @abergeron!

@f0k
Contributor

f0k commented Aug 14, 2014

Great, thank you!

@stencilman
Contributor Author

Below I attach the convnet benchmark results for fft vs corrMM. However, when I try to run it for my project, it throws a scikits.cuda.cufft.cufftAllocFailed, so fft is not practical to use. And corrMM is still slow compared to torch7 :-(

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 720.042110896 GFLOP/s ( tm = 0.172513839722 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.190046844482 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 1.12835046387 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 1025.87552679 GFLOP/s ( tm = 0.121084114075 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.264007446289 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 1.65133056641 )

CONFIG: input = 64 x 64 x 64 * ker = 64 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 6408.74976884 GFLOP/s ( tm = 0.0831223220825 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.104528160095 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 0.446905944824 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 1839.06468606 GFLOP/s ( tm = 0.289663635254 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.887582336426 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.677083068848 )

CONFIG: input = 128 x 32 x 32 * ker = 128 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 6634.08404822 GFLOP/s ( tm = 0.0294975833893 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.0319159526825 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 0.151375915527 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 1041.26429915 GFLOP/s ( tm = 0.187934463501 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.327285858154 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.281016693115 )

CONFIG: input = 128 x 16 x 16 * ker = 128 x 128 x 7 x 7 ( bs = 128 , stride = 1 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 2118.82111964 GFLOP/s ( tm = 0.0096997756958 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.00985289573669 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 0.0386824493408 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 454.019960479 GFLOP/s ( tm = 0.0452669296265 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0540057106018 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.076849357605 )

CONFIG: input = 384 x 13 x 13 * ker = 384 x 384 x 3 x 3 ( bs = 128 , stride = 1 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 586.160629223 GFLOP/s ( tm = 0.0701315841675 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.0719367828369 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 0.0903026046753 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 716.315122147 GFLOP/s ( tm = 0.057388671875 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.215068939209 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0558946075439 )

@nouiz
Member

nouiz commented Aug 18, 2014

I didn't catch up on everything that was done while I was away.

I have one question. Does one of you know what the difference is between the implementation of the bprop wrt. weights and the valid implementation we have? Both are valid convolutions. From the profile against torch7, and what was written, it seems the algorithm isn't the same.

As both implementations are valid convolutions, we could make multiple ops, but we could also select different code paths within one Op depending on the input shape. That would be lighter and more newbie-proof, in case someone wants to use the convolution with different input shapes.


@stencilman
Contributor Author

Hi @nouiz: Torch does an updateOutput (which is a valid conv) for the fprop, but does an accGradParameters when updating the weights (also a valid conv). If you look at the code (https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu) you will see that they are different, and perhaps that's why the grad wrt. the weights is faster for Torch. Does that answer your question?

Yes, perhaps you can choose a different code path depending on the sizes, but you guys (@f0k) will know this better.

@f0k
Contributor

f0k commented Aug 18, 2014

@nouiz: The difference is that for the bprop wrt. weights, caffe iterates over the batch and computes a number of dot products that are accumulated into a weight gradient (by setting both alpha and beta to 1 in the gemm call). I have it almost working now, will push to #2033 soon.
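In NumPy terms, the scheme looks like this (a sketch with an explicit im2col; the real code accumulates inside the gemm call with alpha = beta = 1 instead of a Python +=):

import numpy as np

def im2col(x, kh, kw):
    # unfold all valid (kh, kw) patches of x with shape (channels, H, W)
    # into a matrix of shape (channels*kh*kw, oh*ow)
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[ci, i:i + oh, j:j + ow].ravel()
                row += 1
    return cols

def grad_weights(inputs, top_grad, kh, kw):
    # accumulate the weight gradient over the batch: one gemm per
    # sample, with beta=1 so the products add up (the Caffe variant)
    batch, channels = inputs.shape[:2]
    nfilters = top_grad.shape[1]
    d_w = np.zeros((nfilters, channels * kh * kw), dtype=inputs.dtype)
    for b in range(batch):
        col = im2col(inputs[b], kh, kw)          # (c*kh*kw, oh*ow)
        top = top_grad[b].reshape(nfilters, -1)  # (nfilters, oh*ow)
        d_w += top.dot(col.T)                    # the accumulated gemm
    return d_w.reshape(nfilters, channels, kh, kw)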

My plan was to first have a GpuCorrMM with gradients, then update the optimizer to choose the best op depending on the input shapes. /Edit: This will only work for replacing GpuConv ops with fully-defined shape information, of course. But one can always use GpuCorrMM directly if it has a grad() function.

@stencilman
Contributor Author

@f0k: Awesome!! Can't wait, thank you so much!! 👍

@ballasn mentioned this pull request Sep 4, 2014