Cuda fftconv #1870

Merged

20 commits merged into Theano:master on May 28, 2014


6 participants

@abergeron
Member

Optional FFT convolution on the GPU, controlled by the config flag enable_conv2d_fft (which defaults to False).

It can also be used directly if you want to use it only in parts of the graph, but there is no support for the automatic gradient in that case.
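
A minimal usage sketch of the direct-use path (assuming the names discussed in this PR, conv2d_fft and its pad_last_dim keyword; the final signature may differ):

import numpy
import theano
from theano import shared
from theano.sandbox.cuda.fftconv import conv2d_fft

inputs = shared(numpy.random.random((4, 3, 32, 32)).astype('float32'))
filters = shared(numpy.random.random((8, 3, 5, 5)).astype('float32'))

out = conv2d_fft(inputs, filters)                       # 'valid' FFT convolution
# out = conv2d_fft(inputs, filters, pad_last_dim=True)  # if the last dim is odd
f = theano.function([], out)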

NEWS.txt: this contains code from Sander, and this PR contains commits from Arnaud and gyom.

@nouiz nouiz commented on the diff May 22, 2014
theano/sandbox/cuda/fftconv.py
@@ -0,0 +1,495 @@
+import string
+
+import numpy as np
@nouiz
nouiz May 22, 2014 Member

Add this file to doc/library/tensor/nnet/conv.txt, in the list of autofunction entries. Make sure to mention that it only works on the GPU for now.

@abergeron
abergeron May 22, 2014 Member

Since it's part of the cuda subpackage, it would feel weird adding it to nnet/conv.txt.

@nouiz
nouiz May 22, 2014 Member

Not entirely: from memory there isn't much GPU-specific documentation. If you find a suitable page, add it there, or create one. Either way, be sure to link to it from tensor/nnet/conv.txt, so people who read the conv documentation can find it.

@abergeron
abergeron May 22, 2014 Member

This is painful to do since sphinx can't import theano.sandbox.cuda.fftconv (and it doesn't matter where the autofunction directive is).

I can wrap a giant try..except around all of the imports to make the file importable, but that seems wrong (and the file previously had circular-import problems that could creep back unnoticed if I do that).

@nouiz nouiz and 1 other commented on an outdated diff May 22, 2014
theano/sandbox/cuda/opt.py
@@ -1120,6 +1121,16 @@ def values_eq_approx(a, b):
# differently then the gpu ConvOp
return [out]
+
+@register_opt()
+@local_optimizer([GpuConv])
+def local_conv_fft(node):
+ if (theano.config.enable_conv2d_fft and
+ isinstance(node.op, GpuConv) and
+ node.op.border_mode == 'valid' and
+ node.op.subsample == (1, 1)):
+ return [conv2d_fft(node.inputs[0], node.inputs[1])]
@nouiz
nouiz May 22, 2014 Member

Try passing op.kshp and op.imshp.

@abergeron
abergeron May 22, 2014 Member

I can't, since they are not in the right format: len(kshp) == 2 and len(imshp) == 3, while conv2d_fft expects length 4 for both.

@nouiz
nouiz May 22, 2014 Member

You can fill in the missing entries with the symbolic shape[i] that corresponds to the missing information. Do the same wherever kshp and imshp contain None values.

@abergeron
abergeron May 22, 2014 Member

There is also other trickery, since imshp and kshp are massaged according to padding and other factors, which just confuses the issue. I am open to revisiting this in a later PR, but it might take a while (as in maybe another day, just for that).
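
A sketch of what this could look like (illustrative only; the attribute layouts follow the lengths mentioned above, and the final code may differ):

# Build 4-element shapes from GpuConv's partial imshp (3 entries) and
# kshp (2 entries), falling back to the symbolic shapes where information
# is missing or None.
img, kern = node.inputs
imshp = node.op.imshp   # (channels, rows, cols); entries may be None
kshp = node.op.kshp     # (rows, cols); entries may be None

image_shape = [img.shape[0]] + [
    s if s is not None else img.shape[i + 1] for i, s in enumerate(imshp)]
filter_shape = [kern.shape[0], kern.shape[1]] + [
    s if s is not None else kern.shape[i + 2] for i, s in enumerate(kshp)]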

@nouiz nouiz and 2 others commented on an outdated diff May 22, 2014
theano/sandbox/cuda/tests/test_fftconv.py
+
+ inputs_val = numpy.random.random(inputs_shape).astype('float32')
+ filters_val = numpy.random.random(filters_shape).astype('float32')
+
+ inputs = shared(inputs_val)
+ filters = shared(filters_val)
+
+ conv = theano.tensor.nnet.conv.conv2d(inputs, filters)
+
+ f_ref = theano.function([], conv)
+ f_fft = theano.function([], conv, mode=mode_with_gpu)
+
+ res_ref = f_ref()
+ res_fft = f_fft()
+
+ numpy.testing.assert_allclose(res_ref, res_fft)
@nouiz
nouiz May 22, 2014 Member

Test a case with padding of the inputs, both when the last input dimension is odd and when it is even.

@abergeron
abergeron May 22, 2014 Member

I can test the padding (as in pad_last_dim=True) when the last dim is odd, but I can't test it when it is already even; that would just crash at runtime with an obscure error.

@nouiz
nouiz May 22, 2014 Member

OK for this PR, but instead of hardcoding +1, we could compute a shape that is even.

@abergeron
abergeron May 22, 2014 Member

No, we can't. At least not without introducing an ifelse or a switch, which you told us was too slow.

@dwf
dwf May 22, 2014 Member
shape % 2?


@dwf
dwf May 22, 2014 Member

Or rather + (1 - shape % 2)


@abergeron
abergeron May 22, 2014 Member

Maybe that could work to get the right shape, but it would mean introducing a copy of the input data even when no padding is necessary. I don't think that is such a good idea from a performance point of view.
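
For reference, the arithmetic being discussed (a sketch with plain Python ints; in the graph the same expression would be symbolic):

# Round a possibly odd size up to the next even value: odd sizes get one
# element of padding, even sizes are unchanged.
def next_even(n):
    return n + (n % 2)

assert next_even(7) == 8
assert next_even(8) == 8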

@nouiz nouiz commented on the diff May 22, 2014
theano/tensor/nnet/tests/test_conv.py
@@ -67,11 +69,11 @@ def sym_conv2d(input, filters):
output = sym_conv2d(input, filters)
output.name = 'conv2d(%s,%s)' % (input.name, filters.name)
- theano_conv = theano.function([input, filters], output)
+ theano_conv = theano.function([input, filters], output, mode=self.mode)
@nouiz
nouiz May 22, 2014 Member

self.mode and self.dtype don't exist. So this will fail.

@abergeron
abergeron May 22, 2014 Member

Yes they do; see the changes on lines 15 and 16.

@dwf
Member
dwf commented May 22, 2014

It can also be used directly if you want to use it only in parts of the graph, but there is no support for the automatic gradient in that case.

What does this mean?

@nouiz
Member
nouiz commented May 22, 2014

It means that you can manually introduce that version of the convolution, but in that case the gradient won't be implemented.

Fred


@nouiz
Member
nouiz commented May 22, 2014

Guillaume told me that the full case should work. Can you add a test for it? You could reuse the test cases from TestConv2d that you started to generalize for reuse here.

@abergeron
Member

I can't reuse TestConv2d since it tests odd-last-dim shapes, which don't work.

Also, @dwf, I mean that there is no grad() on several of the ops used.

@dwf
Member
dwf commented May 22, 2014

I see. Is the use intended to be that it is substituted in after graph creation, hence grad is unneeded? Isn't the grad fairly simple if you have both full and valid mode convolution implemented?

@abergeron
Member

Its use is intended to replace GpuConv as an optimization.

But the problem with the grad is that it's not a single op, but rather a graph that has an FFT transform, a complex dot product and an inverse FFT transform. I have no idea what the gradient of an FFT looks like and I don't want to spend a lot of time figuring it out.
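
For readers unfamiliar with the approach, a rough NumPy sketch of that structure (this is not the PR's GPU code; batching over examples and channels is omitted):

import numpy as np

# FFT-based 'valid' convolution of one 2D image with one kernel: forward FFTs,
# an elementwise complex product (the batched dot in the GPU version), an
# inverse FFT, then cropping to the 'valid' output region.
def fft_conv2d_valid(img, kern):
    H, W = img.shape
    kh, kw = kern.shape
    fh, fw = H + kh - 1, W + kw - 1             # size of the full linear convolution
    fimg = np.fft.rfftn(img, (fh, fw))
    fkern = np.fft.rfftn(kern, (fh, fw))
    full = np.fft.irfftn(fimg * fkern, (fh, fw))
    return full[kh - 1:H, kw - 1:W]             # shape (H - kh + 1, W - kw + 1)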

@nouiz nouiz and 1 other commented on an outdated diff May 22, 2014
theano/sandbox/cuda/opt.py
@@ -1120,6 +1121,16 @@ def values_eq_approx(a, b):
# differently then the gpu ConvOp
return [out]
+
+@register_opt()
@nouiz
nouiz May 22, 2014 Member

This will register it by default. Remove that decorator and add this line after the opt:

gpu_optimizer.register("local_conv_fft", local_conv_fft)

As there will be two opts, I think we should remove the Theano flag and ask the user to use optimizer_including=local_conv_fft:local_conv_full_fft to enable both of them.

@abergeron
abergeron May 22, 2014 Member

So should I just remove the register_opt() decorator, or do I also need to add your call afterwards for the optimizer_including trick to work?

@nouiz
nouiz May 22, 2014 Member

You need to do both: remove the decorator and register manually with the right parameters.
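
Something like this, then (a sketch mirroring the register line given above and the existing opt in the diff; the final code may differ):

# No @register_opt() decorator, so the rewrite is not applied by default.
@local_optimizer([GpuConv])
def local_conv_fft(node):
    if (isinstance(node.op, GpuConv) and
            node.op.border_mode == 'valid' and
            node.op.subsample == (1, 1)):
        return [conv2d_fft(node.inputs[0], node.inputs[1])]

# Register it under a name so users can enable it with the Theano flag
# optimizer_including=local_conv_fft (and later local_conv_full_fft).
gpu_optimizer.register("local_conv_fft", local_conv_fft)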

@nouiz
Member
nouiz commented May 22, 2014

One user just showed me a case where he used GpuConv2d to do a 1D convolution, and in the full-mode case it was very slow (the profile spends essentially 100% of the time there, after rounding).

As the full case works and is now tested, can you add an opt for this?

@abergeron
Member

Yes, I can, but I was under the impression that this was somewhat urgent. With all the things that keep piling onto this PR, it will take another week to complete (and I still can't get the docs to work).

@nouiz
Member
nouiz commented May 22, 2014

We discussed many things here, and not all of them need to be done for this PR, but I think we should not forget them. Here is the list of what I think should be done in this PR:

  • use unittest_tools.assert_allclose() (very fast)
  • fix how we register the opt
  • add user documentation, so that I can direct people to it when I suggest they use this. This can be manual documentation in tensor/nnet/conv.txt. I don't understand your import problem: I made sure we can import some of the cuda files for the docs, which means we should be able to import the file on a computer without a GPU and nvcc. But this can come later.

To do in this PR or just after:

  • make the opt and associated test for the full case and update the doc for it.

To do later:

  • Should we make the pad parameter work when the image size is even?
  • make the file importable without a GPU and without nvcc
  • Should we pass the user-specified shapes to the FFT conv?

Did I forget something? Do you agree with the priorities?

@abergeron
Member

The import problem is that import theano.sandbox.cuda.fftconv fails because fftconv tries to import pycuda and scikits.cuda. Therefore autofunction and/or automodule can't work on it.
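
For reference, the kind of guard being discussed earlier (a sketch only; as noted above, abergeron considers this approach wrong):

# Let the module import (e.g. for sphinx autodoc) on machines without the GPU
# stack, deferring the hard failure until the ops are actually used.
try:
    import pycuda.gpuarray
    import scikits.cuda.fft
    import scikits.cuda.misc
    scikits_cuda_available = True
except ImportError:
    scikits_cuda_available = False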

abergeron and others added some commits May 12, 2014
@abergeron abergeron Import fftconv.py with some pep8 modifications.
This comes from Sander Dieleman <sanderdieleman@gmail.com> (@benanne
on github).
9aa6518
@abergeron abergeron Use the fft version of convolution for the valid mode. 48a87d2
@abergeron abergeron Fix import loop involving fftconv. 2d1e254
@abergeron abergeron Last fix for the import problem. a50a798
@abergeron abergeron Attempt at adding support for the 'full' border mode. 7ec0c01
@abergeron abergeron Add some tests (that all fail due to shape errors at the moment). dbe9c96
@gyom @abergeron gyom Fixed some typos and a dimension that was probably wrong. This doesn'…
…t solve the issue with the boder_mode='full' because we still have other errors.
b93cc86
@gyom @abergeron gyom Added support for images whose last dimension is odd. This is more ex…
…pensive computationally, and it costs more in memory, but it's optional.
750f6c0
@abergeron abergeron Add a config flags to enable/disable the conv2d_fft opt. 558ad02
@abergeron abergeron Fix the test to actually work. b0e830f
@gyom @abergeron gyom Renamed a variable because I gave it a ridiculously long name. 5183197
@gyom @abergeron gyom border_mode 'full' now working and tested. 215962f
@abergeron abergeron Also test the full mode. 77471f6
@abergeron abergeron Add forgotten comma. bb53388
@abergeron abergeron use unittest_tools.assert_allclose() instead of numpy.testing.assert_…
…allclose()
c6cafd0
@abergeron abergeron Test mistake. 5c61c43
@abergeron abergeron Change the way to enable the optimization from a theano flags to a re…
…gistered, but disabled opt.
0c41c43
@abergeron abergeron Add optimization for the full mode and update the tests to work with …
…the new way of enabling them.
7a10c46
@abergeron abergeron Fix tests so that they pass and test what we want. d1f66a2
@abergeron abergeron Add a mention of conv2d_fft in the docs and write a better docstring …
…for it.
4cf06d2
@abergeron
Member

I added a mention in the docs. I also had to rebase because of it, since that area of the docs had changed recently and the PR would not merge.

@nouiz
Member
nouiz commented May 24, 2014

On oolong I get this error:

Traceback (most recent call last):
  File "convolutional_mlp.py", line 32, in <module>
    import theano
  File "/u/bastienf/repos/theano/__init__.py", line 85, in <module>
    import theano.sandbox.cuda
  File "/u/bastienf/repos/theano/sandbox/cuda/__init__.py", line 248, in <module>
    import theano.sandbox.cuda.fftconv
  File "/u/bastienf/repos/theano/sandbox/cuda/fftconv.py", line 18, in <module>
    misc.init()
  File "/opt/lisa/os/lib/python2.7/site-packages/scikits/cuda/misc.py", line 154, in init
    cula.culaInitialize()
  File "/opt/lisa/os/lib/python2.7/site-packages/scikits/cuda/cula.py", line 388, in culaInitialize
    culaCheckStatus(status)
  File "/opt/lisa/os/lib/python2.7/site-packages/scikits/cuda/cula.py", line 264, in culaCheckStatus
    raise culaExceptions[status](error)
scikits.cuda.cula.culaRuntimeError: 3
@abergeron
Member

From what I could find, it means that the CULA initialization code hit this CUDA error:

cudaErrorInitializationError = 3
The API call failed because the CUDA driver and runtime could not be initialized.

I do get that same error on that machine, but I notice that you're also running the scan testsuite on that GPU (which may or may not be the cause).

@bhack
bhack commented May 25, 2014

Could this be used for the OpenCL (CPU/GPU) counterpart? https://github.com/geggo/gpyfft

@benanne
Contributor
benanne commented May 25, 2014

It looks like gpyfft does batched FFTs, so it should be possible to implement gpyfft-based versions of at least the CuFFTOp and the CuIFFTOp. A batched complex dot product implementation is also needed though. In CUDA this is achieved with cublasCgemmBatched, I don't know if there is a good OpenCL alternative for this (and a corresponding Python wrapper).

@bhack
bhack commented May 25, 2014

I think that could be covered by clBLAS, but the Python wrapper is just a starter and only covers sgemm as an example: https://github.com/clMathLibraries/clBLAS/blob/develop/src/wrappers/python/README.txt

@benanne
Contributor
benanne commented May 25, 2014

If clBLAS also comes with batched gemm operations that could work. I can't find any reference to clblasCgemmBatched (only clblasCgemm), but maybe it is named differently because it is not a standard level 3 BLAS operation afaik.

It is of course possible to just do separate clblasCgemm calls for each dot product, but this seems to negate much of the performance advantage of using an FFT-based convolution in the first place, at least in the CUDA / cuBLAS setting.

@abergeron
Member

XgemmBatched is indeed not a standard BLAS operation, and I haven't seen it in libraries other than cuBLAS.

We already have a Python wrapper of sorts for clBLAS in libgpuarray.

Adding support for https://github.com/clMathLibraries/clFFT and cuFFT to provide Fourier transform capabilities is in the plans, but not right now.

@bhack
bhack commented May 25, 2014

Yes, gpyfft is a clFFT wrapper. So wouldn't it be more useful to put the effort into libgpuarray instead of pushing new ops into Theano?

@abergeron
Member

Yes we plan to add that support to libgpuarray.

@nouiz
Member
nouiz commented May 26, 2014

@abergeron,

I have a crash when I try this PR with the convolution in the DLT. Can you check what is going on?


@abergeron
Member

Are you sure it isn't using odd inputs somewhere? What does the crash look like? If it's something about a reshape, then it is the odd-dim problem.

@benanne
Contributor
benanne commented May 27, 2014

Out of curiosity, is there a reason why padding for odd input sizes can't be done automatically? I feel like I'm missing something.

I understand this introduces an extra copy operation, and the associated slowdown, but as long as this is documented I don't think this is a problem. Maybe it could be controlled with a keyword argument that has "pad if necessary" semantics, instead of "always pad".

If I understand correctly, the optimization that inserts the FFT-based convolution in the graph will always insert an op with pad_last_dim=False (which is the default), so using convolutions with odd input sizes will always fail, unless the user manually inserts the conv2d_fft op into the graph and remembers to set pad_last_dim=True.

If pad_last_dim added the padding only if necessary, the FFT-based op inserted by the optimization would always work (just a bit slower if the input size is odd). I think this would greatly increase usability. The user can then decide themselves whether they find the slowdown acceptable, and the code doesn't need to change when input and filter sizes are changed.

@abergeron
Member

Because the sizes are symbolic, we can't know if they are odd or even at graph creation time.

So for "pad if necessary" we either have to always copy the input data into a "padding buffer" that may be completely useless or introduce an ifelse that slows down the graph. Not only would that slow down the graph, but it would require even more memory on the GPU rendering large graphs impossible.

The other option might be to add a ConvPad op that would have access to the sizes in play and pad the buffer to be even or just return a view. But that wouldn't solve the memory problem.

Finally, since the optimization has to be manually enabled, such restrictions are acceptable.
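
A sketch of the runtime branch that "pad if necessary" would need (the padding helper here is hypothetical):

# With symbolic shapes, the parity of the last dimension is unknown at
# graph-construction time, so the choice has to be a branch in the graph.
import theano.tensor as T
from theano.ifelse import ifelse

last = inputs.shape[3]
padded = ifelse(T.eq(last % 2, 1),
                pad_last_dim_to_even(inputs),   # hypothetical padding helper
                inputs)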

@benanne
Contributor
benanne commented May 27, 2014

That makes sense, I forgot that the sizes aren't necessarily available at compile time. I have made a habit of always specifying the image_shape and filter_shape. Thanks for the clarification.

@nouiz
Member
nouiz commented May 27, 2014

It crashes with a CULA error, not a reshape error. Can you try it here and see if you also have this problem?


@abergeron
Member

I ran the convolutional_mlp.py example from the DLT on my machine with both optimizations enabled and got no errors (and even a slight speedup).

@nouiz
Member
nouiz commented May 28, 2014

It is a problem with the installation here, so I will merge the PR.

@nouiz nouiz merged commit 7555005 into Theano:master May 28, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed