Faster algorithms and gradients for GpuCorrMM #2033

Merged
merged 22 commits into from Sep 1, 2014

Contributor

f0k commented Aug 12, 2014

As a follow-up to #2023, this PR adds caffe's backward pass wrt. inputs to GpuCorrMM to implement border_mode="full". It passes all the tests and it's about 2-4x faster than simulating a full convolution with a padded valid convolution. On the way, it cleans up the code and adds some more elaborate documentation.

There are some caveats, though, all stemming from the fact that the implementation in caffe is meant as a backward pass for a valid correlation:

  • With border_mode="valid", GpuCorrMM doesn't flip the kernels, but with border_mode="full", it does.
  • With border_mode="valid", subsampling is for subsampling the output, but with border_mode="full", it is for upsampling the input.
  • With border_mode="valid", pad is for padding the input, but with border_mode="full", it is for cropping the output.
  • With border_mode="full", it needs a different memory layout for the kernels.
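
The asymmetry between the two modes comes down to standard convolution identities. As a minimal NumPy sketch (corr_valid is an illustrative helper, not Theano code), the kernel flip and the trick of simulating a full convolution with a padded valid one look like this:

```python
import numpy as np

def corr_valid(x, k):
    # plain "valid" correlation: slide k over x without flipping it
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.RandomState(0)
x = rng.randn(5, 5)
k = rng.randn(3, 3)
kh, kw = k.shape

# a valid *convolution* is a valid correlation with a flipped kernel
conv_valid = corr_valid(x, k[::-1, ::-1])

# a full convolution can be simulated by zero-padding the input by
# (kh-1, kw-1) on each side and running a valid convolution on that
padded = np.pad(x, ((kh - 1, kh - 1), (kw - 1, kw - 1)), mode='constant')
conv_full = corr_valid(padded, k[::-1, ::-1])
assert conv_full.shape == (x.shape[0] + kh - 1, x.shape[1] + kw - 1)
```

The caffe backward-pass kernel computes the padded-valid form directly, which is why GpuCorrMM's "full" mode inherits the kernel flip.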

Currently, GpuCorrMM directly wraps the underlying algorithm, and local_conv_gemm() copes with the different peculiarities, because I wasn't sure whether the GpuCorrMM Op should be inserting dimshuffles, kernel flips and gpu_contiguouses on its own.

Looking at the end result, although it's quite fast, local_conv_gemm() now basically undoes everything that ConvOp.grad() does. It might be possible to write an optimizer that replaces the gradient of a valid convolution with a properly parameterized GpuCorrMM Op, instead of just replacing the full convolution Op that is part of the gradient (introducing redundant dimshuffles etc. on the way). This way we would leverage the caffe implementation better.
The alternative would be to modify the CUDA code to perform a subsampled, padded, full correlation, as would be expected from a Gpu Correlation Op. This would be cleaner, but it would also mean a lot more work and we wouldn't profit as much from caffe's implementation for the case of subsampling != (1,1) or padding != (0,0) (granted, this may be an uninteresting case in practice anyway).
/edit: Another alternative would be splitting this into two Ops: GpuCorrMM for valid correlation with padding and subsampling (the caffe forward pass), and GpuConvMM for full convolution with upsampling and cropping (the caffe backward pass). This way it would be obvious how the operations differ, but the memory layout required for the second Op would still be unintuitive.
Yet another alternative would be splitting it into GpuCorrMM and GpuCorrMM_gradInput. This way it would be obvious how to use GpuCorrMM_gradInput for computing the gradient of GpuCorrMM. We could even add GpuCorrMM_gradKernel for the gradient wrt. weights; it seems caffe does things slightly differently there as well.
In both cases, GpuCorrMM should get a grad() method to define its gradient (so it can be used directly in a CNN to avoid kernel flipping and whatnot) and local_conv_gemm() should still be able to replace any GpuConv instances with a gemm-based Op (so it can be used with existing CNN implementations).
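
The identity the gradient ops rely on — the gradient of a valid correlation wrt. its input is a full convolution of the output gradient with the kernel — can be checked numerically. Everything below is an illustrative NumPy sketch, not Theano API:

```python
import numpy as np

def corr_valid(x, k):
    # plain "valid" correlation, as in the forward pass
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.RandomState(0)
x = rng.randn(6, 6)
k = rng.randn(3, 3)
kh, kw = k.shape
gy = rng.randn(4, 4)  # upstream gradient dL/dy for y = corr_valid(x, k)

# accumulate dL/dx directly from the definition of valid correlation:
# each output position (i, j) touches the input window x[i:i+kh, j:j+kw]
ref = np.zeros_like(x)
for i in range(gy.shape[0]):
    for j in range(gy.shape[1]):
        ref[i:i + kh, j:j + kw] += gy[i, j] * k

# the same result as one full convolution: pad gy by (kh-1, kw-1) and
# correlate with the flipped kernel
padded = np.pad(gy, ((kh - 1, kh - 1), (kw - 1, kw - 1)), mode='constant')
grad_x = corr_valid(padded, k[::-1, ::-1])
assert np.allclose(ref, grad_x)
```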

Contributor

f0k commented Aug 12, 2014

PS: This is on top of #2023, which wasn't merged yet. If you merge #2023, I can rebase this PR on top of master and solve the merge conflicts.

@f0k f0k referenced this pull request Aug 12, 2014

Closed

Continue gemm convolution #2015

5 of 8 tasks complete
Member

abergeron commented Aug 12, 2014

It already says that it has conflicts. But I think you should wait until #2023 is merged before doing a rebase.

Apart from that, I think you have the right approach in the optimization. The redundant dimshuffles should get merged or eliminated by the optimizer.

Contributor

stencilman commented Aug 12, 2014

Thanks a lot @f0k!! These are the timing speed-ups that I get:
after your changes:

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  89.9%    89.9%      80.250s       2.79e-01s     C      288       18   GpuCorrMM{valid, (1, 1), pad=0}
   5.6%    95.6%       5.022s       2.24e-02s     C      224       14   GpuCorrMM{full, (1, 1), pad=0}

before your changes:

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  57.1%    57.1%      83.547s       2.90e-01s     C      288       18   GpuCorrMM{valid, (1, 1), pad=(0, 0)}
  39.0%    96.1%      57.025s       1.02e-01s     C      560       28   GpuCorrMM{full, (1, 1), pad=(0, 0)}

Maybe there are redundant ops in my graph or I am doing something very wrong... Is there a way to fine-tune my graph? For me, the fprop is still very good, but the bprop is now about 3x slower than torch7. @f0k: did you try the convnet-benchmarks (https://github.com/soumith/convnet-benchmarks)?
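
Per-Op tables like the ones above come from Theano's profiler; one way to get them (the script name is a placeholder for your own training script) is to enable the profile flag for a run:

```shell
# print Theano's per-Op timing table when the process exits
THEANO_FLAGS=profile=True python train.py
```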

Contributor

f0k commented Aug 12, 2014

@stencilman: Hmm, so after my change it is about 5 times faster and only called half as often (upper table)? What's the problem then?
I already extended the convnet benchmark to include corrmm in a branch on my fork: https://github.com/f0k/convnet-benchmarks/tree/theano-benchmark-better (I'm postponing the pull request until GpuCorrMM is somewhat more mature). The bprop wrt. inputs was 2-4x faster with the new version. What do you get?
@abergeron: Yes, I'll wait for the merge. And I'll think about splitting the Op as detailed above.

Contributor

benanne commented Aug 12, 2014

I have nothing of value to contribute here, but I'd just like to say that this is awesome :)

Contributor

madisonmay commented Aug 12, 2014

Seconded -- awesome work so far @f0k, @stencilman, @abergeron.

Member

abergeron commented Aug 12, 2014

I barely did anything.

Contributor

stencilman commented Aug 13, 2014

So @f0k, in your fork (https://github.com/f0k/convnet-benchmarks/tree/theano-benchmark-better), how does this compare to torch mm? My bprop is 3x slower than torch for some reason...

@f0k f0k referenced this pull request Aug 13, 2014

Merged

Conv gemm non-square kernel support #2023

2 of 2 tasks complete
Contributor

f0k commented Aug 15, 2014

So I've rebased the PR onto master (the latest and greatest 9b3ea9e). @stencilman: I did a force push to my corrmm-faster-fullconv branch, if you still have it checked out, you should probably checkout master, delete the branch and check it out anew.

As discussed in #2023, I will refactor this implementation to have a GpuCorrMM op with two gradient ops, similar to the cuda-convnet wrapper in pylearn2. This will allow us to have a fast bprop wrt. weights in addition to the fast bprop wrt. inputs enabled by 121c1d4. So I'd suggest to not merge it yet.

Contributor

stencilman commented Aug 15, 2014

@f0k: you are my hero!! Thank you so much, looking forward to hearing from you on this. Really grateful for this :-)

Contributor

stencilman commented Aug 16, 2014

@f0k: like I said earlier, do let me know if I can help you in any way to get this in quicker... my project actually depends on this :-). Thanks!

Contributor

f0k commented Aug 17, 2014

@stencilman: Thanks for your offer, I planned to do it tomorrow (Monday). You can help by benchmarking against Torch again, and possibly with writing the tests... I'll let you know!

Contributor

stencilman commented Aug 18, 2014

@f0k: I would love to run it against Torch and write any needed tests. Thanks! 👍

Member

abergeron commented Aug 18, 2014

So is this considered finished?

Even if it's a bit slower than Torch, if it's faster than what we have and works correctly, it's good for merging. Unless you want to work more on it.

Contributor

f0k commented Aug 18, 2014

@abergeron: Doesn't matter to me -- either you merge it and I'll file a new PR that refactors everything, or you don't merge it yet and I'll add a commit to this PR that refactors everything.

(I have the refactored version almost finished, there's just something flipped again...)

Member

abergeron commented Aug 18, 2014

I'll wait for the refactor then.

Contributor

f0k commented Aug 18, 2014

OK, the refactored version is up. It passes all the previous tests (test_gemm_valid, test_gemm_full, test_gemm_subsample, test_gemm_directly). The optimizer doesn't use the new backprop wrt. weights yet, but the GpuCorrMM op now has gradients defined so it can be used instead of conv2d (and then the faster backprop wrt. weights will be used).

TODO:

  • Write tests for the gradients of GpuCorrMM. I didn't have time to test them at all yet.
  • Modify the conv_gemm optimizer to choose between GpuCorrMM and GpuCorrMM_gradWeights depending on the input shapes of any valid convolutions to be replaced. (Any suggestions?)

I'll be gone for today, but I'll happily accept pull requests to my pull request for any of the TODO items.

Contributor

stencilman commented Aug 18, 2014

@f0k: Wowowow!! I will take a stab at testing the gradients of GpuCorrMM. I will leave TODO 2 for someone with better knowledge of Theano (perhaps @nouiz). I will also report the speedup.

Thanks again, I am really grateful to you!

theano/sandbox/cuda/opt.py
# call GpuCorrMM
# TODO: call GpuCorrMM_gradWeights instead if appropriate
return [GpuCorrMM('valid', subsample, pad)(
gpu_contiguous(img), gpu_contiguous(kern))]

@f0k

f0k Aug 18, 2014

Contributor

Any suggestions on when to use GpuCorrMM_gradWeights instead?

@nouiz

nouiz Aug 19, 2014

Member

I'm not convinced we need to split that into 2 ops. Here is some info to help find which one to use:
fprop: img shape 32x32, kernel 5x5, output 28x28.
In that case, if we reuse the fprop code, the input will be: img 32x32, kernel 28x28 and output 5x5.

What about: if the kernel is bigger than the output, we use the gradWeights case? The only good way to know this is at run time. That is why I think we should keep both valid cases in the same op. But with your code, it should be easy. Before doing this change, though, we should do some benchmarking of my selection threshold. @stencilman, can you do that? Time GpuCorrMM vs GpuCorrMM_gradWeights for different input sizes, including my examples?

@stencilman

stencilman Aug 19, 2014

Contributor

Yes, I can do this now. @nouiz: what exactly do you mean by GpuCorr_gradWeights?

@stencilman

stencilman Aug 19, 2014

Contributor

@nouiz:
torch:

CONFIG: input = 64x32x32 * ker = 64x128x5x5 (bs = 32, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.0090797543525696)
SpatialConvoltionMM:accGradParameters(): (tm = 0.011068224906921)
SpatialConvolutionMM:updateGradInput(): (tm = 0.0082367658615112)

CONFIG: input = 64x32x32 * ker = 64x128x28x28 (bs = 32, stride = 1)
SpatialConvolutionMM:updateOutput(): (tm = 0.032178223133087)
SpatialConvoltionMM:accGradParameters(): (tm = 0.032377779483795)
SpatialConvolutionMM:updateGradInput(): (tm = 0.02130252122879)

and theano:

CONFIG: input = 64 x 32 x 32 * ker = 64 x 128 x 5 x 5 ( bs = 32 , stride = 1 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 1323.00504803 GFLOP/s ( tm = 0.00776719999313 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0100939035416 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.00898323154449 )

CONFIG: input = 64 x 32 x 32 * ker = 64 x 128 x 28 x 28 ( bs = 32 , stride = 1 )
(experimental) theano.sandbox.cuda.blas.CorrMM fprop: 308.976205726 GFLOP/s ( tm = 0.0332583694458 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.0303143196106 )
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0210659198761 )

is this what you were looking for?

@nouiz

nouiz Aug 19, 2014

Member

No, a Theano GpuCorrMM vs GpuCorrMM_gradWeights. Both do the same computation, just with a different algorithm.

@f0k

f0k Aug 19, 2014

Contributor

I think the batch size will also play a role, because in one case (fprop), the gemm is used to accumulate over the channels and the for loop to iterate over the minibatch, and in the other case (bprop weights), the for loop is used to accumulate over the channels (= minibatch of the forward pass) and gemm to iterate over the minibatch (= input channels of the forward pass).
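
The gemm-vs-loop structure described above can be made concrete with a NumPy sketch of the im2col + gemm forward pass (im2col here is an illustrative helper, not the caffe/Theano kernel):

```python
import numpy as np

def im2col(x, kh, kw):
    # unroll all kh x kw patches of a (C, H, W) image into a matrix
    # with one column per output position and one row per (c, u, v)
    C, H, W = x.shape
    OH, OW = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, OH * OW))
    row = 0
    for c in range(C):
        for u in range(kh):
            for v in range(kw):
                cols[row] = x[c, u:u + OH, v:v + OW].ravel()
                row += 1
    return cols

rng = np.random.RandomState(0)
B, C, F, H, W, kh, kw = 2, 3, 4, 6, 6, 3, 3
imgs = rng.randn(B, C, H, W)
kerns = rng.randn(F, C, kh, kw)

# forward pass: one gemm per minibatch item; the gemm reduction runs
# over the C*kh*kw axis, i.e. it accumulates over the input channels,
# while the Python loop iterates over the minibatch
OH, OW = H - kh + 1, W - kw + 1
K = kerns.reshape(F, C * kh * kw)
out = np.empty((B, F, OH, OW))
for b in range(B):
    out[b] = (K @ im2col(imgs[b], kh, kw)).reshape(F, OH, OW)

# spot-check one entry against the definition of valid correlation
b, f, i, j = 1, 2, 0, 1
direct = (imgs[b, :, i:i + kh, j:j + kw] * kerns[f]).sum()
assert np.isclose(out[b, f, i, j], direct)
```

In the bprop-weights case the roles swap: the loop accumulates over what was the minibatch and the gemm runs over what were the channels, which is why the batch size matters for choosing between the two.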

I had a benchmark of the two implementations running over night, I will try to figure out a formula to choose between them. A nice aspect is that this may allow us to be faster than caffe/Torch for certain configurations, because in caffe/Torch it's hard-coded which implementation to use in the forward and backward pass.

The problem I see with joining both implementations into one Op is that they require very different memory layouts for the input and output. To compare, these three perform the same computation (copied from test_gemm_directly()):

cpuval = py_conv(npy_img, npy_kern, 'valid', subsample)

op = theano.sandbox.cuda.blas.GpuCorrMM(border_mode='valid',
        subsample=subsample)(i, k)
f = theano.function([i, k], op, mode=theano_mode)
gpuval1 = f(npy_img, npy_kern[:,:,::-1,::-1])

op = theano.sandbox.cuda.blas.GpuCorrMM_gradWeights(border_mode='valid',
        subsample=subsample)(i, k)
f = theano.function([i, k], op, mode=theano_mode)
gpuval2 = numpy.array(f(npy_img.transpose(1, 0, 2, 3),
        npy_kern.transpose(1, 0, 2, 3)[:,:,::-1,::-1])).transpose(1, 0, 2, 3)

# cpuval, gpuval1 and gpuval2 are now approximately equal

If we do the necessary dimshuffles, kernel flippings and gpu_contiguous calls inside the C code of a merged Op, we will probably do unneeded copies. If we do them in Python, we might still do unneeded copies unless we create an optimizer that can merge chains of dimshuffles with gpu_contiguous calls in between. Besides, keeping the code similar to caffe makes it easier to port upstream changes to Theano in case there are any.

So I'd prefer to keep the Ops separate for now, and leave merging them for another PR if that's okay with you.

@f0k

f0k Aug 19, 2014

Contributor

Okay, from my benchmark the magic formula seems to be:

if batchsize * kernelHeight * kernelWidth < inputChannels * outputHeight * outputWidth:
    use GpuCorrMM
else:
    use GpuCorrMM_gradWeights

Unfortunately, even if ConvOp knows about the shapes, GpuConv only knows about image and kernel width and height and the number of input channels, but not about the batchsize. So for now I'll have the optimizer decide based on kernel and output size, following your suggestion -- this gets most cases correct already.
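
As a runnable form of the magic formula (prefer_gradweights is a hypothetical helper name; the example shapes follow nouiz's 32x32-image / 5x5-kernel case):

```python
def prefer_gradweights(batchsize, channels, kh, kw, oh, ow):
    # benchmark-derived rule: plain GpuCorrMM wins while the
    # batch-times-kernel volume is smaller than the
    # channels-times-output volume; otherwise the
    # GpuCorrMM_gradWeights algorithm wins
    return batchsize * kh * kw >= channels * oh * ow

# fprop-shaped problem: 5x5 kernel, 28x28 output -> plain GpuCorrMM
assert not prefer_gradweights(32, 64, 5, 5, 28, 28)
# bprop-weights-shaped problem: 28x28 "kernel", 5x5 output -> gradWeights
assert prefer_gradweights(32, 64, 28, 28, 5, 5)
```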

@nouiz

nouiz Aug 19, 2014

Member

We use c_code_helper in a few other ops, so I would go with that name.

I'm good with the idea of merging ops later. If this PR is stable and gives better performance than the current master, it can be merged. But I think we need to figure out how to fix @stencilman's bug.

@stencilman, I understand your script is complicated. Can you dump your Theano function and make a small Python script that loads it and initializes good input values to the function that generates the error?

http://www.deeplearning.net/software/theano/tutorial/debug_faq.html?highlight=dump%20function#dumping-a-function-to-help-debug

I suppose the problem happens in the first call to the Theano function. Is that the case?

@f0k, it is easy to add the batch size to the GpuConv op. You just need to add a parameter to GpuConv.__init__, store it as a member, modify __eq__ and __hash__ to check it, and modify the opt in cuda/opt.py to pass it to the op. My sentence is long, but it is about 5-10 lines of code. If you want me to do it, just ask and I'll do it. That way, you can take the batch size into account when it is set. Just make sure the opt doesn't crash if only some of the shapes are known; it is fine to do this only if we know all shapes.

thanks
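
The __eq__/__hash__ recipe sketched above looks roughly like this with a stand-in class (GpuConvLike and bsize are illustrative names, not the actual Theano op):

```python
class GpuConvLike:
    """Minimal stand-in for an op whose identity must include a new
    hyper-parameter (here a hypothetical batch size, bsize)."""

    def __init__(self, border_mode, subsample, bsize=None):
        self.border_mode = border_mode
        self.subsample = subsample
        self.bsize = bsize  # hypothetical new parameter; None if unknown

    def __eq__(self, other):
        return (type(self) is type(other)
                and self.border_mode == other.border_mode
                and self.subsample == other.subsample
                and self.bsize == other.bsize)

    def __hash__(self):
        return hash((type(self), self.border_mode,
                     self.subsample, self.bsize))

# ops with identical parameters compare (and hash) equal, so graph
# optimizations can merge them; a different bsize keeps them distinct
assert GpuConvLike('valid', (1, 1), 32) == GpuConvLike('valid', (1, 1), 32)
assert GpuConvLike('valid', (1, 1), 32) != GpuConvLike('valid', (1, 1), 64)
```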

On Tue, Aug 19, 2014 at 10:15 AM, Jan Schlüter notifications@github.com
wrote:

In theano/sandbox/cuda/opt.py:

    kern = gpu_contiguous(kern)
    return [GpuCorrMM(node.op.border_mode, node.op.subsample)(img, kern)]
    border_mode = node.op.border_mode
    subsample = node.op.subsample
    pad = (0,0)
    if (border_mode == 'full') and (subsample != (1,1)):
        # need to simulate this via a padded valid convolution
        pad = 'auto'
        border_mode = 'valid'
    if (border_mode == 'valid'):
        # need to flip the kernel for valid convolution
        kern = kern[:, :, ::-1, ::-1]
        # call GpuCorrMM
        # TODO: call GpuCorrMM_gradWeights instead if appropriate
        return [GpuCorrMM('valid', subsample, pad)(
            gpu_contiguous(img), gpu_contiguous(kern))]

Okay, from my benchmark the magic formula seems to be:

    if batchsize * kernelHeight * kernelWidth < inputChannels * outputHeight * outputWidth:
        use GpuCorrMM
    else:
        use GpuCorrMM_gradWeights

Unfortunately, even if ConvOp knows about the shapes, GpuConv only knows
about image and kernel width and height and the number of input channels,
but not about the batchsize. So for now I'll have the optimizer decide
based on kernel and output size, following your suggestion -- this gets
most cases correct already.
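The decision rule could be sketched as a helper like this (illustrative names only, not the actual Theano optimizer code; as noted above, the batch size may be unknown at optimization time, in which case only the kernel and output sizes can be compared):

```python
def pick_corrmm_op(batchsize, kernel_h, kernel_w,
                   in_channels, out_h, out_w):
    """Apply the benchmark's magic formula to choose the faster kernel:
    the left term is the GEMM work per output pixel for GpuCorrMM, the
    right term the work for GpuCorrMM_gradWeights."""
    if batchsize * kernel_h * kernel_w < in_channels * out_h * out_w:
        return "GpuCorrMM"
    return "GpuCorrMM_gradWeights"
```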


Reply to this email directly or view it on GitHub
https://github.com/Theano/Theano/pull/2033/files#r16416429.

@@ -186,7 +186,7 @@ def _params_allgood(ishape, kshape, mode, subsample=(1, 1), img_stride=(1, 1),
f = theano.function([i, k], op, mode=theano_mode)
if cls is not None:
assert any([isinstance(node.op, cls)
-                for node in f.maker.fgraph.toposort()]), f.maker.fgraph.toposort()
+                for node in f.maker.fgraph.toposort()]), "Cannot find class %r in %r" % (cls, f.maker.fgraph.toposort())

@f0k

f0k Aug 18, 2014

Contributor

More enlightening printout on why the test failed.

@@ -284,7 +284,7 @@ def exec_conv(version, shapes, verbose, random, mode,
cls=cls)
except Exception, e:
print ver, id, (ishape, kshape, subshape, istride, kstride)
-    print e
+    print "Exception", type(e), e

@f0k

f0k Aug 18, 2014

Contributor

Again, a bit more verbose information on why the test failed.

@f0k

f0k Aug 18, 2014

Contributor

Added tests for GpuCorrMM_gradWeights and GpuCorrMM_gradInputs to test_gemm_directly, and fixed a mistake in GpuCorrMM_gradWeights. Still missing: A test_gemm_grads test which tests if the gradients of GpuCorrMM are actually the correct gradients. I'll see if I can do it before I really leave, otherwise I'll leave it open for tomorrow or for somebody else!

@f0k

f0k Aug 18, 2014

Contributor

I added a test for the gradients (test_gemm_grads). It passes for mode="valid", subsample=(1,1) and pad=(0,0), so if you don't use subsampling or padding, the GpuCorrMM()(img, gpu_contiguous(kern[:,:,::-1,::-1])) from this PR can already serve as a replacement for conv2d(img, kern). Note that this is currently required to get a speedup over the automatic optimization.

Additional TODOs:

  • GpuCorrMM needs to pass the input shape to GpuCorrMM_gradInputs and the weight shape to GpuCorrMM_gradWeights to make it work for all subsampling values. Currently, the shape is inferred incorrectly if the output image size is not divisible by the subsampling size.
  • At least one of the gradient ops fails for pad="auto".
  • GpuCorrMM_gradInputs and GpuCorrMM_gradWeights should probably get a grad() function as well.
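The first TODO stems from the forward shape formula not being invertible; a minimal illustration (my own helper, not code from the PR):

```python
def valid_corr_out_size(in_size, kernel_size, stride):
    # output size of a strided valid correlation (forward pass)
    return (in_size - kernel_size) // stride + 1

# Several input sizes can map to the same output size, so the gradient op
# cannot recover the input shape from the output shape alone:
assert valid_corr_out_size(9, 3, 2) == valid_corr_out_size(10, 3, 2) == 4
```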
@stencilman

stencilman Aug 18, 2014

Contributor
CONFIG: input = 64x64x64 * ker = 64x128x9x9 (bs = 128, stride = 1)
CONFIG: input = 128x32x32 * ker = 128x128x9x9 (bs = 128, stride = 1)
CONFIG: input = 3x128x128 * ker = 3x96x11x11 (bs = 128, stride = 1)
CONFIG: input = 384x13x13 * ker = 384x384x3x3 (bs = 128, stride = 1)
CONFIG: input = 128x16x16 * ker = 128x128x7x7 (bs = 128, stride = 1)

torch fprop                  0.24305301904678  0.16899198293686  0.10639077425003  0.055703997612    0.041813731193
GpuCorrMM fprop              0.225996734619    0.159330703735    0.0936857147217   0.0532916145325   0.04246251297
torch grad wrt input         0.30804550647736  0.13152647018433  0.091511249542    0.02750152349472  0.02301424741745
GpuCorrMM bprop wrt inputs   0.288951873779    0.121228462219    0.086506401062    0.0258434085846   0.0209931201935
torch grad wrt weights       0.37771952152252  0.1442049741745   0.18967247009277  0.03774124383     0.023773729801
GpuCorrMM bprop wrt weights  0.352622833252    0.132992980957    0.169927322388    0.0339831695557   0.0214208164215


@benanne

benanne Aug 18, 2014

Contributor

very nice :) @f0k out of curiosity, have you tried using this for 1D convolutions? If so, how's the performance? Would it make sense to make a specialized 1D version of this, or is the 1D-using-2D approach good enough?

@f0k

f0k Aug 18, 2014

Contributor

@stencilman: Cool, thanks for testing! Seems we're on a good way here!
@benanne: I haven't tried, but the theano benchmark in convnet-benchmarks accepts a custom config on the command line, so you can time the 1D-using-2D approach (but the bprop wrt. weights will not be fast yet). If by 1D convolution you refer to your approach of shifting a 2D patch over a spectrogram in the time dimension only, then I don't see a way to make it faster. If you refer to convolving a vector with another vector, then... hmm... do you still have batches? Multiple input and output channels?

@benanne

benanne Aug 18, 2014

Contributor

Yeah, I meant the former. I have a suspicion that for the CorrMM-approach to computing convolutions, the 1D-using-2D approach is automatically the best thing you can do, which is why I'm interested to find out how it performs :) If I find some time I guess I'll try it for myself using the benchmark.

@stencilman

stencilman Aug 18, 2014

Contributor

During backprop, I get the error below when I use GpuCorrMM directly. However, when I use the optimizer=fast_compile option, I do not get this error.

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1116d80000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total 
CudaNdarray_uninit: error freeing self->devdata. (self=0xd7614f0, self->devata=0x1116d80000)
Traceback (most recent call last):
  File "deep_nets.py", line 71, in <module>
    model.trainEpoch(data.tr, conf, epoch)
  File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
    err = self.train_model()
  File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ajain/Theano/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
RuntimeError: Cuda error: GpuElemwise node_600ed41afec07a555b0c58a139e2f767_0 Mul: an illegal memory access was encountered.
    n_blocks=30 threads_per_block=256
   Call: kernel_Mul_node_600ed41afec07a555b0c58a139e2f767_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, i0_data, i1_data, o0_data)

Apply node that caused the error: GpuElemwise{Mul}[(0, 1)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]},no_inplace}.0, GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)}.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(16, 256, 90, 90), (16, 256, 90, 90)]
Inputs strides: [(2073600, 8100, 90, 1), (2073600, 8100, 90, 1)]
Inputs scalar values: ['not scalar', 'not scalar']

HINT: Re-running with most Theano optimization disabled could give you a back-traces when this node was created. This can be done with by setting the Theano flags optimizer=fast_compile
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.

When I run with exception_verbosity=high, I get the error below.

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1116d80000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total 
CudaNdarray_uninit: error freeing self->devdata. (self=0xcfd32f0, self->devata=0x1116d80000)
Traceback (most recent call last):
  File "deep_nets.py", line 71, in <module>
    model.trainEpoch(data.tr, conf, epoch)
  File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
    err = self.train_model()
  File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ajain/Theano/theano/gof/link.py", line 164, in raise_with_op
    print_type=True)
  File "/home/ajain/Theano/theano/printing.py", line 104, in debugprint
    stop_on_name=stop_on_name)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 613, in debugprint
    print >> file, '%s%s %s%s' % (prefix, r, id_str, type_str)
  File "/home/ajain/Theano/theano/sandbox/cuda/var.py", line 52, in __str__
    return "CudaNdarrayConstant{"+str(numpy.asarray(self.data))+"}"
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
RuntimeError: error copying data to host

Any idea what might be wrong?

@stencilman

stencilman Aug 19, 2014

Contributor

@nouiz @f0k: is it because I am missing a gpu_contiguous? I added it everywhere but I still seem to have this problem. Any help would be great, thanks!

Update: It works when I use a smaller batch size.

@nouiz

nouiz Aug 19, 2014

Member

Can you provide code that reproduces the problem? I have a presentation this
Friday and just started to prepare it, so I won't have much time.

I would be surprised if gpu_contiguous caused the problems. But I didn't
check the code of this PR yet. I'll try to check it tonight.

Can you report the error when you use the environment variable
CUDA_LAUNCH_BLOCKING=1? By default, GPU kernels run asynchronously, so it is
highly possible that the error isn't reported at the place where it happens.
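The suggestion above amounts to re-running the failing script with the variable set; a sketch of how one might do that from Python ("deep_nets.py" stands in for the user's script; the child process here only demonstrates that the variable is visible):

```python
import os
import subprocess
import sys

# With CUDA_LAUNCH_BLOCKING=1, kernel launches are synchronous, so the
# Apply node reported in the traceback is the one that actually failed.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
# subprocess.call(["python", "deep_nets.py"], env=env)   # real usage
check = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    env=env, capture_output=True, text=True)
print(check.stdout.strip())  # -> 1
```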


In theano/sandbox/cuda/blas.py:
pad = self.pad
if self.border_mode == "valid":
bmode = 1
def c_code(self, bottom, weights, top, direction, sub):

@nouiz

nouiz Aug 19, 2014

Member

This function doesn't use the same signature as the normal c_code. Can you rename it to make this clear?

@nouiz

nouiz Aug 19, 2014

Member

I have difficulty remembering what bottom and top mean. Can you document that? bottom and top can have different meanings depending on how you view the graph.

@f0k

f0k Aug 19, 2014

Contributor

Would renaming it to shared_c_code or c_code_helper be fine?

About the names, I took them from caffe as they are less ambiguous than inputs and outputs. I'll document them.
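My understanding of the caffe convention the ops borrow, as a quick summary (a paraphrase for illustration, not the ops' actual docstrings):

```python
# caffe's layer-stack view of a convolution: data flows from "bottom"
# (below the layer) through the "weights" to "top" (above the layer).
CAFFE_CONV_NAMES = {
    "bottom": "the images / inputs of the convolution (layer below)",
    "weights": "the convolution filters",
    "top": "the output feature maps (layer above)",
}
```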

@stencilman

stencilman Aug 19, 2014

Contributor

If I use allow_gc=False, it works, but then it uses a lot of memory and I can't use a big model.

Please find below the error with CUDA_LAUNCH_BLOCKING=1.

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1162160000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total
device_free: cudaFree() returned an error, but there is already an Python error set. This happen during the clean up when there is a first error and the CUDA dr
iver is in a so bad state that it don't work anymore. We keep the previous error set to help debugging it.CudaNdarray_uninit: error freeing self->devdata. (self
=0xda12730, self->devata=0x1162160000)
Traceback (most recent call last):
  File "deep_nets.py", line 71, in <module>
    model.trainEpoch(data.tr, conf, epoch)
  File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
    err = self.train_model()
  File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ajain/Theano/theano/gof/link.py", line 164, in raise_with_op
    print_type=True)
  File "/home/ajain/Theano/theano/printing.py", line 104, in debugprint
    stop_on_name=stop_on_name)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
    prefix_child=new_prefix_child)
  File "/home/ajain/Theano/theano/compile/debugmode.py", line 613, in debugprint
    print >> file, '%s%s %s%s' % (prefix, r, id_str, type_str)
  File "/home/ajain/Theano/theano/sandbox/cuda/var.py", line 52, in __str__
    return "CudaNdarrayConstant{"+str(numpy.asarray(self.data))+"}"
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
RuntimeError: error copying data to host
@nouiz

nouiz Aug 19, 2014

Member

Remove the exception_verbosity flag. It causes a second error that hides the
real error.


Member

nouiz commented Aug 19, 2014

Remove the exception_verbosity setting. It causes a second error that hides the
real error.

On Mon, Aug 18, 2014 at 9:32 PM, Arjun Jain notifications@github.com
wrote:

If I use allow_gc=False, it works, but then it uses a lot of memory and I
cant use a big model.

Please find bellow the error with CUDA_LAUNCH_BLOCKING=1.

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1162160000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total
device_free: cudaFree() returned an error, but there is already an Python error set. This happen during the clean up when there is a first error and the CUDA dr
iver is in a so bad state that it don't work anymore. We keep the previous error set to help debugging it.CudaNdarray_uninit: error freeing self->devdata. (self
=0xda12730, self->devata=0x1162160000)
Traceback (most recent call last):
File "deep_nets.py", line 71, in
model.trainEpoch(data.tr, conf, epoch)
File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
err = self.train_model()
File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in call
self.fn.thunks[self.fn.position_of_error])
File "/home/ajain/Theano/theano/gof/link.py", line 164, in raise_with_op
print_type=True)
File "/home/ajain/Theano/theano/printing.py", line 104, in debugprint
stop_on_name=stop_on_name)
File "/home/ajain/Theano/theano/compile/debugmode.py", line 609, in debugprint
prefix_child=new_prefix_child)
[... the previous two lines repeated 17 more times ...]
File "/home/ajain/Theano/theano/compile/debugmode.py", line 613, in debugprint
print >> file, '%s%s %s%s' % (prefix, r, id_str, type_str)
File "/home/ajain/Theano/theano/sandbox/cuda/var.py", line 52, in str
return "CudaNdarrayConstant{"+str(numpy.asarray(self.data))+"}"
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
return array(a, dtype, copy=False, order=order)
RuntimeError: error copying data to host



Member

nouiz commented Aug 19, 2014

Can you give me the script that generates the error, and how to use it?

Contributor

stencilman commented Aug 19, 2014

Ah, a much better error! It is a big project; you would need my data etc. to run my script. I will try to generate a minimal example if needed. Do you know what the error might mean?

Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0x1162160000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total 
device_free: cudaFree() returned an error, but there is already an Python error set. This happen during the clean up when there is a first error and the CUDA driver is in a so bad state that it don't work anymore. We keep the previous error set to help debugging it.CudaNdarray_uninit: error freeing self->devdata. (self=0xd10f1f0, self->devata=0x1162160000)
Traceback (most recent call last):
  File "deep_nets.py", line 71, in <module>
    model.trainEpoch(data.tr, conf, epoch)
  File "/home/ajain/Projects/deep_nets/python/lib/machine.py", line 344, in trainEpoch
    err = self.train_model()
  File "/home/ajain/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ajain/Theano/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
RuntimeError: GpuCorrMM encountered a CUBLAS error: the function failed to launch on the GPU

Apply node that caused the error: GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)}(W_18, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(4, 256, 1, 1), (16, 4, 60, 90)]
Inputs strides: [(256, 1, 0, 0), (21600, 5400, 90, 1)]
Inputs scalar values: ['not scalar', 'not scalar']

Backtrace when the node is created:
  File "/home/ajain/Theano/theano/gradient.py", line 895, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/home/ajain/Theano/theano/gradient.py", line 1173, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/home/ajain/Theano/theano/gradient.py", line 1034, in access_term_cache
    input_grads = node.op.grad(inputs, new_output_grads)
  File "/home/ajain/Theano/theano/sandbox/cuda/blas.py", line 746, in grad
    weights, top)

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
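As a side note, the failing node uses 1x1 kernels (shape 4x256x1x1). A valid correlation with 1x1 kernels is just a per-pixel channel-mixing matrix multiply, which makes it easy to sanity-check shapes and values on the CPU. A NumPy sketch of the math only, not the GpuCorrMM code path (`corr1x1` is a made-up name):

```python
import numpy as np

def corr1x1(inputs, kernels):
    """Valid cross-correlation with 1x1 kernels as a channel-mixing matmul.

    inputs:  (batch, in_channels, height, width)
    kernels: (out_channels, in_channels, 1, 1)
    returns: (batch, out_channels, height, width)
    """
    b, c, h, w = inputs.shape
    n = kernels.shape[0]
    flat = inputs.reshape(b, c, h * w)      # flatten the spatial dimensions
    mixed = kernels.reshape(n, c) @ flat    # (n, c) @ (b, c, h*w) -> (b, n, h*w)
    return mixed.reshape(b, n, h, w)

rng = np.random.RandomState(0)
x = rng.randn(16, 256, 60, 90).astype('float32')
k = rng.randn(4, 256, 1, 1).astype('float32')
print(corr1x1(x, k).shape)  # (16, 4, 60, 90), matching the output shape above
```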
Member

nouiz commented Aug 19, 2014

@stencilman the new error message seems to indicate that the grad() method that introduces the GpuCorrMM_gradInputs op doesn't do it correctly, or that this case isn't well handled in the C code. @f0k can you check the code that calls cublas in that case? I don't know that code well enough to do this quickly.

Contributor

stencilman commented Aug 19, 2014

@f0k @nouiz: Do let me know if you know why I have the error. Thank you.

Contributor

f0k commented Aug 19, 2014

@stencilman: From your error message, it seems you have a kernel shape of 4x256x1x1 and an output shape of 16x4x60x90. This would be an input shape of 16x256x60x90. In other words, a batchsize of 16, 256 input channels, image size 60x90, 4 output channels (filters), kernel size 1x1. Is this correct?
When I run pylearn2_benchmark.py with this, it works fine, and much faster than both the standard and fft convolution: python pylearn2_benchmark.py i256x60x90,k4x1x1,b16
Maybe it really was a memory problem not tracked correctly because of missing error checks. Please try again with the current version to see if the error message changes.

@nouiz: Thanks for reviewing, I extended the documentation and added error checks as suggested. Also the optimizer can insert all three ops now depending on border mode and shapes (if shapes were provided to conv2d by the user). The gradients of GpuCorrMM are still wrong if subsampling or padding is used; I'll fix that later.
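The shape reasoning in the first paragraph can be written down once and for all. A pure-Python sketch of the valid-correlation shape arithmetic the gradInputs op has to invert (the function name and the subsample/pad handling are my own illustration, not Theano's API):

```python
def bprop_inputs_shape(kern_shape, top_shape, subsample=(1, 1), pad=(0, 0)):
    """Infer the inputs-gradient shape of a valid correlation.

    kern_shape: (out_channels, in_channels, kern_h, kern_w)
    top_shape:  (batch, out_channels, top_h, top_w)
    Inverts top = (bottom + 2*pad - kern) // subsample + 1 per spatial dim.
    """
    n_out, n_in, kh, kw = kern_shape
    batch, n_out_top, th, tw = top_shape
    assert n_out == n_out_top, "kernel/output channel mismatch"
    bh = (th - 1) * subsample[0] + kh - 2 * pad[0]
    bw = (tw - 1) * subsample[1] + kw - 2 * pad[1]
    return (batch, n_in, bh, bw)

# The case from the error report: 4x256x1x1 kernels, 16x4x60x90 output.
print(bprop_inputs_shape((4, 256, 1, 1), (16, 4, 60, 90)))  # (16, 256, 60, 90)
```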

Contributor

f0k commented Aug 27, 2014

Added documentation about the CUBLAS bug and rebased onto master. The tests pass, Travis passes, should be finally good for merging. Thanks to everyone involved in reviewing and testing!

nouiz Aug 29, 2014

Can you mention that this is shape-specific, so that changing the shapes can work around that bug?

Owner

f0k replied Aug 29, 2014

But it already says "for some input and filter shapes", how would you formulate it?

nouiz replied Aug 29, 2014

You are right, but I missed it, so some people will miss it. Why not just add a sentence like: "Changing the shapes could also work around this CUBLAS bug"?

Anyway, I'll let you decide whether to do it.

Owner

f0k replied Aug 30, 2014

Okay, added a note telling that changing the batch size or channel count may also help.

nouiz Aug 29, 2014

You should return NULL after setting the Python error. Do this for all changes in this PR; there are also some other PyErr calls in the source that need this.

Owner

f0k replied Aug 30, 2014

Ooops. Yes, the "4D" checks were missing it, so I missed it as well. Fixed.

On an outdated diff of theano/sandbox/cuda/opt.py:
    prod2 *= node.op.imshp[0]
    # compare to decide
    if prod1 > prod2:
        return [gpu_contiguous(GpuCorrMM_gradWeights('valid', subsample, pad)(

@nouiz

nouiz Aug 29, 2014

Member

Why the gpu_contiguous on that line? I think it isn't needed and can be removed.

@f0k

f0k Aug 30, 2014

Contributor

Hmm, you're right. I thought it should have the same memory layout as before the replacement. Removed it.
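(Background on why gpu_contiguous tends to appear around dimshuffles at all: like a NumPy transpose, a dimshuffle only permutes strides and returns a non-contiguous view, while the GEMM-based ops need C-contiguous memory. The NumPy analogue:)

```python
import numpy as np

a = np.zeros((16, 4, 60, 90), dtype='float32')
b = a.transpose(1, 0, 2, 3)   # analogue of dimshuffle{1,0,2,3}: a strided view
c = np.ascontiguousarray(b)   # analogue of gpu_contiguous: makes a real copy

print(a.flags['C_CONTIGUOUS'])  # True
print(b.flags['C_CONTIGUOUS'])  # False: only the strides changed
print(c.flags['C_CONTIGUOUS'])  # True
```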

@@ -1,47 +0,0 @@
/*

@nouiz

nouiz Aug 29, 2014

Member

This is good; on Windows, a user reported a problem that this file wasn't found.

Member

nouiz commented Aug 29, 2014

I did a global review. There are just two things left before merging:

  • The return NULL
  • Optional but welcome: remove the unneeded gpu_contiguous.

thanks for the good work!

Contributor

f0k commented Aug 30, 2014

Thanks for the review, @nouiz! Fixed everything you noticed.
/edit: please hold off on the merge; removing the gpu_contiguous was wrong, I'm still figuring it out...

Contributor

f0k commented Sep 1, 2014

Removing the gpu_contiguous as per @nouiz's suggestion introduced a problem: The dimshuffle() call on the output inserts a DimShuffle op, which returns a CPU output. This changes the type of the output, which is not allowed and results in:

TypeError: ('The type of the replacement must be the same as the type of the original Variable.', GpuConv{valid, (1, 1), None, (6, 6), True, (1, 287, 29), (6, 6)}.0, DimShuffle{1,0,2,3}.0, CudaNdarrayType(float32, 4D), TensorType(float32, 4D), 'local_conv_gemm')

We could use GpuDimShuffle instead, or wrap it in an as_cuda_ndarray_variable call to promise that it's going to be moved to the GPU. I opted for the latter.

Interestingly, this wasn't triggered in the tests, I only noticed when running the Theano benchmark script from convnet-benchmarks. I guess that's a general problem with the tests which should be addressed in a separate issue.

Contributor

f0k commented Sep 1, 2014

I've got another optimization question. This is not actually part of this PR, but related to it, so I'll just ask here.

When using GpuCorrMM directly (on a flipped kernel), I get the following graph and timing for the gradient wrt. inputs:

GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)} [@A] ''   3
 |GpuContiguous [@B] ''   2
 | |GpuSubtensor{::, ::, ::int64, ::int64} [@C] ''   1
 |   |<CudaNdarrayType(float32, 4D)> [@D]
 |   |Constant{-1} [@E]
 |   |Constant{-1} [@E]
 |GpuContiguous [@F] ''   0
   |<CudaNdarrayType(float32, 4D)> [@G]
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.012757648468 )

When using the standard conv2d and letting the optimizer replace it, I get:

GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)} [@A] ''   5
 |GpuContiguous [@B] ''   4
 | |GpuDimShuffle{1,0,2,3} [@C] ''   3
 |   |GpuSubtensor{::, ::, ::int64, ::int64} [@D] ''   2
 |     |GpuDimShuffle{1,0,2,3} [@E] ''   1
 |     | |<CudaNdarrayType(float32, 4D)> [@F]
 |     |Constant{-1} [@G]
 |     |Constant{-1} [@G]
 |GpuContiguous [@H] ''   0
   |<CudaNdarrayType(float32, 4D)> [@I]
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0244923362732 )

It's 100% slower than the former, seemingly because of the two dimshuffles that are actually redundant. The original conv2d graph for this case looks like this:

GpuConv{full, (1, 1), None, (6, 6), True, (32, 282, 24), (6, 6)} [@A] ''   2
 |<CudaNdarrayType(float32, 4D)> [@B]
 |GpuSubtensor{::, ::, ::int64, ::int64} [@C] ''   1
   |GpuDimShuffle{1,0,2,3} [@D] ''   0
   | |<CudaNdarrayType(float32, 4D)> [@E]
   |Constant{-1} [@F]
   |Constant{-1} [@F]
theano.tensor.nnet.conv.conv2d bprop inputs: 0.0 GFLOP/s ( tm = 0.0263266887665 )

Any idea on how to fix that? Swap the order of dimshuffle and [:,:,::-1,::-1] in the ConvOp gradient and hope they get merged? Introduce an optimizer that swaps dimshuffle and subtensor (shuffling the subtensor slices as needed)?

/edit: Swapping the dimshuffle and flipping in the ConvOp gradient results in:

GpuCorrMM_gradInputs{valid, (1, 1), pad=(0, 0)} [@A] ''   5
 |GpuContiguous [@B] ''   4
 | |GpuDimShuffle{1,0,2,3} [@C] ''   3
 |   |GpuDimShuffle{1,0,2,3} [@D] ''   2
 |     |GpuSubtensor{::, ::, ::int64, ::int64} [@E] ''   1
 |       |<CudaNdarrayType(float32, 4D)> [@F]
 |       |Constant{-1} [@G]
 |       |Constant{-1} [@G]
 |GpuContiguous [@H] ''   0
   |<CudaNdarrayType(float32, 4D)> [@I]
(experimental) theano.sandbox.cuda.blas.CorrMM bprop inputs: 0.0 GFLOP/s ( tm = 0.0244253120422 )

There's something wrong with the dimshuffles. a) They should have been eliminated in optimization, and b) they should be nearly zero-cost operations, shouldn't they? I guess they are not eliminated in optimization because the first one has already been moved to GPU when the second one is added to the graph, and there is no optimization for GpuDimShuffles.

Anyway, I'll open a separate issue when this PR has been merged.
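For the record, the redundancy claim is easy to verify with NumPy transposes standing in for GpuDimShuffle: the two {1,0,2,3} shuffles are mutual inverses, and the kernel flip commutes with the shuffle, so an optimizer could legally swap and then merge them:

```python
import numpy as np

w = np.random.RandomState(42).randn(4, 3, 5, 5)

# Two successive {1,0,2,3} shuffles cancel out...
assert np.array_equal(w.transpose(1, 0, 2, 3).transpose(1, 0, 2, 3), w)

# ...and flipping the two trailing axes commutes with swapping the leading two.
a = w[:, :, ::-1, ::-1].transpose(1, 0, 2, 3)
b = w.transpose(1, 0, 2, 3)[:, :, ::-1, ::-1]
assert np.array_equal(a, b)
print("dimshuffle pair is redundant; flip and shuffle commute")
```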

nouiz added a commit that referenced this pull request Sep 1, 2014

Merge pull request #2033 from f0k/corrmm-faster-fullconv
Faster algorithms and gradients for GpuCorrMM

@nouiz nouiz merged commit cfc493d into Theano:master Sep 1, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed

@f0k f0k deleted the f0k:corrmm-faster-fullconv branch Sep 1, 2014

Contributor

f0k commented Sep 1, 2014

Yay, thanks for merging, @nouiz!

Contributor

stencilman commented Sep 1, 2014

Thanks a lot @nouiz! I can tell people to use the master and not to merge with @f0k's branch any more! 👍

Member

nouiz commented Sep 2, 2014

Can you time the gpu_contiguous in the fast and slow case with the profile=True and profile_memory=True Theano flags? My guess is that the -1 in the subtensor causes the gpu_contiguous to be slower; it gives the input a different memory layout, and we didn't optimize for that case.

Member

nouiz commented Sep 2, 2014

Should we update the GpuCorrMM doc to say that it is faster when called directly?

What about adding a corr2d() method that calls conv2d on the CPU with the flipping, so that the two flips could cancel each other after substitution? Maybe we will need to make the merge-subtensor optimization work for the GPU ops. What do you think of this as a user interface? @lamblin @abergeron, your input on that interface would be welcome.

Contributor

f0k commented Sep 2, 2014

My guess is the -1 in the subtensor, cause the gpucontiguous to be slower.

That's the same in both the slow and fast case, because I explicitly did a [:,:,::-1,::-1] on the weights when using GpuCorrMM directly. So the only difference is the dimshuffle. But I tried with some other shapes and then both cases perform the same... maybe we shouldn't worry too much.

Oh, maybe another thing that I observed: When I tried calling GpuCorrMM directly with some SharedVariable instances, they were not C-contiguous. Is this to be expected?

Should we update the GpuCorrMM doc to tell that it is faster when called directly?

No, because depending on the shapes, it can actually be slower when called directly. The goal should be to have the automatic graph optimization from conv2d be on par or better than a manual GpuCorrMM in all cases.

What about doing a corr2d() method

I think it's not needed for now. Let's first see if the kernel flipping really is a bottleneck and if so, we can still introduce a corr2d or update the conv2d documentation to say that flipping the kernels may improve performance in some cases.
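For reference, the identity exploited by the explicit [:,:,::-1,::-1]: convolution equals cross-correlation with a flipped kernel. A tiny single-channel NumPy reference (illustrative only, nothing like the MM-based implementation):

```python
import numpy as np

def corr2d_valid(img, kern):
    """Naive 2-D valid cross-correlation (reference only, O(n^2 * k^2))."""
    ih, iw = img.shape
    kh, kw = kern.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

def conv2d_valid(img, kern):
    """Valid convolution = cross-correlation with a flipped kernel."""
    return corr2d_valid(img, kern[::-1, ::-1])
```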

Member

abergeron commented Sep 2, 2014

I would go for making merge subtensor work on GPU ops. I don't think a proliferation of mostly similar interfaces is a good idea.

@ballasn ballasn referenced this pull request Sep 4, 2014

Closed

GPUCorr3Dmm #2077
