Faster algorithms and gradients for GpuCorrMM #2033
Conversation
It already says that it has conflicts. But I think you should wait until #2023 is merged before doing a rebase. Apart from that, I think you have the right approach in the optimization. The redundant dimshuffles should get merged or eliminated by the optimizer.
Thanks a lot @f0k !! These are the timing speed-ups that I get:
before your changes:
Maybe there are redundant ops in my graph or I am doing something very wrong... Is there a way to fine-tune my graph? For me, the fprop is still very good, but the bprop is now like 3x slower than torch7. @f0k: did you try the convnet-benchmarks (https://github.com/soumith/convnet-benchmarks)?
@stencilman: Hmm, so after my change it is about 5 times faster and only called half as often (upper table)? What's the problem then?
I have nothing of value to contribute here, but I'd just like to say that this is awesome :)
Seconded -- awesome work so far @f0k, @stencilman, @abergeron.
I barely did anything.
So @f0k, in your fork (https://github.com/f0k/convnet-benchmarks/tree/theano-benchmark-better), how does this compare to torch mm? My bprop is 3x slower than torch for some reason...
So I've rebased the PR onto master (the latest and greatest 9b3ea9e). @stencilman: I did a force push to my corrmm-faster-fullconv branch; if you still have it checked out, you should probably check out master, delete the branch and check it out anew. As discussed in #2023, I will refactor this implementation to have a
@f0k: you are my hero!! Thank you so much, looking forward to hearing from you on this. Really grateful for this :-)
@f0k: like I said earlier, do let me know if I can help you in any way to get this in quicker... actually my project depends on this :-). Thanks!
@stencilman: Thanks for your offer, I planned to do it tomorrow (Monday). You can help by benchmarking against Torch again, and possibly with writing the tests... I'll let you know!
@f0k: I would love to run it against Torch and write any needed tests. Thanks! 👍
So is this considered finished? Even if it's a bit slower than Torch, if it's faster than what we have and works correctly, it's good for merging. Unless you want to work more on it.
@abergeron: Doesn't matter to me -- either you merge it and I'll file a new PR that refactors everything, or you don't merge it yet and I'll add a commit to this PR that refactors everything. (I have the refactored version almost finished, there's just something flipped again...)
I'll wait for the refactor then.
OK, the refactored version is up. It passes all the previous tests. TODO:
I'll be gone for today, but I'll happily accept pull requests to my pull request for any of the TODO items.
# call GpuCorrMM
# TODO: call GpuCorrMM_gradWeights instead if appropriate
return [GpuCorrMM('valid', subsample, pad)(
    gpu_contiguous(img), gpu_contiguous(kern))]
Any suggestions on when to use GpuCorrMM_gradWeights instead?
I'm not convinced we needed to split that into 2 ops. Here is some info to help find which one to use:
fprop: img shape 32x32, kernel 5x5, output 28x28.
In that case, if we reuse the fprop code for the gradient, the inputs will be an img of 32x32 and a kernel of 28x28, with an output of 5x5.
What about: if the kernel is bigger than the output, we use the gradWeights case? The only good way to know this is at run time. That is why I think we should keep both valid cases in the same op. But with your code, it should be easy. Before doing this change, though, we should do some benchmarking to validate my choice of threshold. @stencilman, can you do that? Time GpuCorrMM vs GpuCorr_gradWeights for different input sizes, including my examples?
Yes, I can do this now. @nouiz: what exactly do you mean by GpuCorr_gradWeights?
@nouiz:
torch:
CONFIG: input = 64x32x32 * ker = 64x128x5x5 (bs = 32, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.0090797543525696)
SpatialConvoltionMM:accGradParameters(): (tm = 0.011068224906921)
SpatialConvolutionMM:updateGradInput(): (tm = 0.0082367658615112)
CONFIG: input = 64x32x32 * ker = 64x128x28x28 (bs = 32, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.032178223133087)
SpatialConvoltionMM:accGradParameters(): (tm = 0.032377779483795)
SpatialConvolutionMM:updateGradInput(): (tm = 0.02130252122879)
and theano:
CONFIG: input = 64 x 32 x 32 * ker = 64 x 128 x 5 x 5 ( bs = 32 , stride = 1 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 1323.00504803 GFLOP/s ( tm = 0.00776719999313 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0100939035416 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.00898323154449 )
CONFIG: input = 64 x 32 x 32 * ker = 64 x 128 x 28 x 28 ( bs = 32 , stride = 1 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 308.976205726 GFLOP/s ( tm = 0.0332583694458 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0303143196106 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0210659198761 )
Is this what you were looking for?
No: Theano's GpuCorrMM vs GpuCorrMM_gradWeights. Both do the same computation, just with a different algorithm.
I think the batch size will also play a role, because in one case (fprop), the gemm is used to accumulate over the channels and the for loop to iterate over the minibatch, and in the other case (bprop weights), the for loop is used to accumulate over the channels (= minibatch of the forward pass) and gemm to iterate over the minibatch (= input channels of the forward pass).
I had a benchmark of the two implementations running overnight, I will try to figure out a formula to choose between them. A nice aspect is that this may allow us to be faster than caffe/Torch for certain configurations, because in caffe/Torch it's hard-coded which implementation to use in the forward and backward pass.
The problem I see with joining both implementations into one Op is that they require very different memory layouts for the input and output. To compare, these three perform the same computation (copied from test_gemm_directly()):
cpuval = py_conv(npy_img, npy_kern, 'valid', subsample)

# forward-pass op, applied to flipped kernels to get a convolution:
op = theano.sandbox.cuda.blas.GpuCorrMM(border_mode='valid',
                                        subsample=subsample)(i, k)
f = theano.function([i, k], op, mode=theano_mode)
gpuval1 = f(npy_img, npy_kern[:, :, ::-1, ::-1])

# gradient-weights op: same result, but with the batch and channel
# axes of both arguments (and of the result) swapped:
op = theano.sandbox.cuda.blas.GpuCorrMM_gradWeights(border_mode='valid',
                                                    subsample=subsample)(i, k)
f = theano.function([i, k], op, mode=theano_mode)
gpuval2 = numpy.array(f(npy_img.transpose(1, 0, 2, 3),
                        npy_kern.transpose(1, 0, 2, 3)[:, :, ::-1, ::-1])
                      ).transpose(1, 0, 2, 3)
# cpuval, gpuval1 and gpuval2 are now approximately equal
If we do the necessary dimshuffles, kernel flippings and gpu_contiguous calls inside the C code of a merged Op, we will probably do unneeded copies. If we do them in Python, we might still do unneeded copies unless we create an optimizer that can merge chains of dimshuffles with gpu_contiguous calls in between. Besides, keeping the code similar to caffe makes it easier to port upstream changes to Theano in case there are any.
So I'd prefer to keep the Ops separate for now, and leave merging them for another PR if that's okay with you.
Okay, from my benchmark the magic formula seems to be:
if batchsize * kernelHeight * kernelWidth < inputChannels * outputHeight * outputWidth:
    use GpuCorrMM
else:
    use GpuCorrMM_gradWeights
Unfortunately, even if ConvOp knows about the shapes, GpuConv only knows about image and kernel width and height and the number of input channels, but not about the batchsize. So for now I'll have the optimizer decide based on kernel and output size, following your suggestion -- this gets most cases correct already.
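For illustration, here is that heuristic as a small Python helper (hypothetical code, not the PR's actual optimizer), checked against the two example configurations from this thread (batch size 32, 64 input channels, 32x32 images):

def faster_valid_op(batchsize, in_channels, kern_h, kern_w, out_h, out_w):
    """Pick the implementation expected to be faster for a valid correlation."""
    if batchsize * kern_h * kern_w < in_channels * out_h * out_w:
        return 'GpuCorrMM'              # gemm accumulates over the channels
    return 'GpuCorrMM_gradWeights'      # gemm iterates over the minibatch

# 5x5 kernels give 28x28 outputs; 28x28 kernels give 5x5 outputs:
assert faster_valid_op(32, 64, 5, 5, 28, 28) == 'GpuCorrMM'
assert faster_valid_op(32, 64, 28, 28, 5, 5) == 'GpuCorrMM_gradWeights'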
We use c_code_helper in a few other ops, so I would go with that name.
I'm good with the idea of merging the ops later. If this PR is stable and gives more performance than the current master, it can be merged. But I think we need to figure out how to fix @stencilman's bug.
@stencilman, I understand your script is complicated. Can you dump your Theano function and make a small Python script that loads it and feeds good input values to the function that generates the error?
I suppose the problem happens in the first call to the Theano function. Is that the case?
@f0k, it is easy to add the batch size to the GpuConv op. You just need to add a parameter to GpuConv.__init__, store it as a member, modify __eq__ and __hash__ to check it, and modify the optimization in cuda/opt.py to pass it to the op. My sentence is long, but it is about 5-10 lines of code. If you want me to do it, just ask and I'll do it. That way, you can take the batch size into account when it is set. Just make sure the opt doesn't crash if only some of the shapes are known. It is fine to do it only if we know all the shapes.
thanks
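A minimal sketch of that pattern (class and names simplified and hypothetical, not the actual Theano diff): the new constructor argument has to become part of the Op's identity via __eq__ and __hash__, so that Ops configured with different batch sizes are not treated as interchangeable:

class MyGpuConv(object):  # stand-in for theano.sandbox.cuda.blas.GpuConv
    def __init__(self, border_mode, subsample, bsize=None):
        self.border_mode = border_mode
        self.subsample = subsample
        self.bsize = bsize  # new member; None if the batch size is unknown

    def __eq__(self, other):
        return (type(self) == type(other) and
                self.border_mode == other.border_mode and
                self.subsample == other.subsample and
                self.bsize == other.bsize)  # include it in equality

    def __hash__(self):
        return hash((type(self), self.border_mode,
                     self.subsample, self.bsize))  # and in the hash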
Added tests for
I added a test for the gradients. Additional TODOs:
very nice :) @f0k out of curiosity, have you tried using this for 1D convolutions? If so, how's the performance? Would it make sense to make a specialized 1D version of this, or is the 1D-using-2D approach good enough?
@stencilman: Cool, thanks for testing! Seems we're on the right track here!
Yeah, I meant the former. I have a suspicion that for the CorrMM approach to computing convolutions, the 1D-using-2D approach is automatically the best thing you can do, which is why I'm interested to find out how it performs :) If I find some time I guess I'll try it for myself using the benchmark.
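For reference, the '1D-using-2D' trick being discussed can be written like this (a minimal sketch with assumed 3D tensor shapes; it uses the plain CPU conv2d so the gemm optimizer can later substitute the GpuCorrMM path):

import theano
import theano.tensor as T
from theano.tensor.nnet import conv

seqs = T.tensor3('seqs')                # (batch, channels, length)
filts = T.tensor3('filts')              # (n_filters, channels, filter_length)
img = seqs.dimshuffle(0, 1, 'x', 2)     # add a dummy height axis -> 4D
kern = filts.dimshuffle(0, 1, 'x', 2)   # same for the filters
out = conv.conv2d(img, kern)[:, :, 0, :]  # 2D valid conv, drop dummy axis
f = theano.function([seqs, filts], out)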
During backprop, I get the error below when I use GpuCorrMM directly. However, when I use
when I run with
Any idea what might be wrong?
Added documentation about the CUBLAS bug and rebased onto master. The tests pass, Travis passes, it should finally be good for merging. Thanks to everyone involved in reviewing and testing!
prod2 *= node.op.imshp[0]
# compare to decide
if prod1 > prod2:
    return [gpu_contiguous(GpuCorrMM_gradWeights('valid', subsample, pad)(
Why the gpu_contiguous on that line? I think it isn't needed and can be removed.
Hmm, you're right. I thought it should have the same memory layout as before the replacement. Removed it.
I did a global review. There are just 2 things left before merging:
thanks for the good work!
Thanks for the review, @nouiz! Fixed everything you noticed.
Removing the
We could use
Interestingly, this wasn't triggered in the tests; I only noticed it when running the Theano benchmark script from convnet-benchmarks. I guess that's a general problem with the tests, which should be addressed in a separate issue.
I've got another optimization question. This is not actually part of this PR, but related to it, so I'll just ask here. When using
When using the standard
It's 100% slower than the former, seemingly because of the two dimshuffles that are actually redundant. The original conv2d graph for this case looks like this:
Any idea on how to fix that? Swap the order of
/edit: Swapping the dimshuffle and flipping in the
There's something wrong with the dimshuffles: a) they should have been eliminated in optimization, and b) they should be nearly zero-cost operations, shouldn't they? I guess they are not eliminated in optimization because the first one has already been moved to the GPU when the second one is added to the graph, and there is no optimization for GpuDimShuffles. Anyway, I'll open a separate issue when this PR has been merged.
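Regarding b): on the host side, a dimshuffle of this kind is indeed just a view with rearranged strides, as the quick NumPy check below illustrates (whether the GpuDimShuffle nodes in this graph avoid a copy is exactly what the follow-up issue would need to verify):

import numpy
a = numpy.zeros((64, 128, 9, 9), dtype='float32')
b = a.transpose(1, 0, 2, 3)   # the equivalent of dimshuffle(1, 0, 2, 3)
assert b.base is a            # a view sharing a's memory, not a copy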
Faster algorithms and gradients for GpuCorrMM
Yay, thanks for merging, @nouiz!
Can you time the gpu_contiguous in the fast and slow case with the profile=True and profile_memory=True Theano flags? My guess is that the -1 in the subtensor causes the gpu_contiguous to be slower. It causes a different memory layout of the input, and we didn't optimize for that case.
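For reference, a minimal sketch of such a profiling run (profile=True and f.profile.summary() are Theano's standard profiling facilities; the toy graph with a -1-step subtensor is only illustrative, and profile_memory=True can be set via THEANO_FLAGS):

import numpy
import theano
import theano.tensor as T

x = T.matrix('x')
y = x[:, ::-1].sum()          # a -1-step subtensor, as suspected above
f = theano.function([x], y, profile=True)
f(numpy.ones((64, 64), dtype=theano.config.floatX))
f.profile.summary()           # prints per-Op timings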
Should we update the GpuCorrMM doc to say that it is faster when called directly? What about adding a corr2d() function that calls conv2d on the CPU with the flipping; that way the two flips could cancel each other out after substitution. Maybe we will need to make the merge-subtensor optimization work for the GPU ops. What do you think of this as a user interface? @lamblin @abergeron, your input on that interface would be welcome.
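A minimal sketch of that corr2d() idea (hypothetical, not part of this PR): express correlation as a convolution with explicitly flipped kernels, so the flip can later cancel against the flip that local_conv_gemm() inserts when substituting GpuCorrMM:

from theano.tensor.nnet import conv

def corr2d(input, filters, border_mode='valid', subsample=(1, 1)):
    """2D correlation, written as a convolution with flipped filters."""
    return conv.conv2d(input, filters[:, :, ::-1, ::-1],
                       border_mode=border_mode, subsample=subsample)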
That's the same in both the slow and fast case, because I explicitly did a
Oh, maybe another thing that I observed: when I tried calling
No, because depending on the shapes, it can actually be slower when called directly. The goal should be to have the automatic graph optimization from
I think it's not needed for now. Let's first see if the kernel flipping really is a bottleneck and if so, we can still introduce a
I would go for making merge subtensor work on GPU ops. I don't think a proliferation of mostly similar interfaces is a good idea.
As a follow-up to #2023, this PR adds caffe's backward pass wrt. inputs to GpuCorrMM to implement border_mode="full". It passes all the tests and it's about 2-4x faster than simulating a full convolution with a padded valid convolution. On the way, it cleans up the code and adds some more elaborate documentation.

There are some caveats, though, all stemming from the fact that the implementation in caffe is meant as a backward pass for a valid correlation:

- With border_mode="valid", GpuCorrMM doesn't flip the kernels, but with border_mode="full", it does.
- With border_mode="valid", subsampling is for subsampling the output, but with border_mode="full", it is for upsampling the input.
- With border_mode="valid", pad is for padding the input, but with border_mode="full", it is for cropping the output.
- With border_mode="full", it needs a different memory layout for the kernels.

Currently, GpuCorrMM directly wraps the underlying algorithm, and local_conv_gemm() copes with the different peculiarities, because I wasn't sure whether the GpuCorrMM Op should be inserting dimshuffles, kernel flips and gpu_contiguous calls on its own.

Looking at the end result, although it's quite fast, local_conv_gemm() now basically undoes everything that ConvOp.grad() does. It might be possible to write an optimizer that replaces the gradient of a valid convolution with a properly parameterized GpuCorrMM Op, instead of just replacing the full convolution Op that is part of the gradient (introducing redundant dimshuffles etc. on the way). This way we would leverage the caffe implementation better.

The alternative would be to modify the CUDA code to perform a subsampled, padded, full correlation, as would be expected from a GPU correlation Op. This would be cleaner, but it would also mean a lot more work, and we wouldn't profit as much from caffe's implementation for the case of subsampling != (1,1) or padding != (0,0) (granted, this may be an uninteresting case in practice anyway).

/edit: Another alternative would be splitting this into two Ops: GpuCorrMM for valid correlation with padding and subsampling (the caffe forward pass), and GpuConvMM for full convolution with upsampling and cropping (the caffe backward pass). This way it would be obvious how the operations differ, but the memory layout required for the second Op would still be unintuitive.

Yet another alternative would be splitting it into GpuCorrMM and GpuCorrMM_gradInput. This way it would be obvious how to use GpuCorrMM_gradInput for computing the gradient of GpuCorrMM. We could even add GpuCorrMM_gradKernel for the gradient wrt. weights; it seems caffe does things slightly differently there as well.

In both cases, GpuCorrMM should get a grad() method to define its gradient (so it can be used directly in a CNN to avoid kernel flipping and whatnot), and local_conv_gemm() should still be able to replace any GpuConv instances with a gemm-based Op (so it can be used with existing CNN implementations).
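To make the last paragraph concrete, a hypothetical sketch of such a grad() method, written with the Op names proposed above (GpuCorrMM_gradInput and GpuCorrMM_gradKernel are the proposed Ops, not code that exists in this PR; the body is an assumption about how they would be wired up):

class GpuCorrMM(object):
    def __init__(self, border_mode, subsample=(1, 1)):
        self.border_mode = border_mode
        self.subsample = subsample

    def grad(self, inputs, output_grads):
        img, kern = inputs
        top, = output_grads
        # backward pass wrt. inputs (the caffe bprop wrt. bottom):
        d_img = GpuCorrMM_gradInput(self.border_mode, self.subsample)(
            kern, top)
        # backward pass wrt. weights (the caffe bprop wrt. filters):
        d_kern = GpuCorrMM_gradKernel(self.border_mode, self.subsample)(
            img, top)
        return d_img, d_kern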