Cuda fftconv #1870

Merged
20 commits merged into Theano:master on May 28, 2014

Conversation

@abergeron (Member) commented May 22, 2014

Optional FFT convolution on the GPU, controlled by the config flag enable_conv2d_fft (which defaults to False).

It can also be used directly if you only want it in parts of the graph, but there is no support for the automatic gradient in that case.

NEWS.txt: this contains code from Sander, and this PR contains commits from Arnaud and gyom.
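
For readers, a minimal usage sketch of the two paths described above. Only the flag name (enable_conv2d_fft), the module theano.sandbox.cuda.fftconv, the conv2d_fft function and its pad_last_dim keyword come from this PR; the shapes and variable names are illustrative, and the direct path has no automatic gradient.

    # Option 1: let the optimizer rewrite conv2d nodes, e.g.
    #   THEANO_FLAGS=device=gpu,floatX=float32,enable_conv2d_fft=True python train.py
    # Option 2: insert the FFT convolution manually in part of the graph.
    import theano
    import theano.tensor as T
    from theano.sandbox.cuda.fftconv import conv2d_fft

    images = T.tensor4('images')    # (batch, channels, rows, cols), float32
    filters = T.tensor4('filters')  # (n_filters, channels, rows, cols), float32

    # pad_last_dim=True handles an odd last dimension (see the review below);
    # no grad() is defined through this path.
    out = conv2d_fft(images, filters, pad_last_dim=True)
    f = theano.function([images, filters], out)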

@@ -0,0 +1,495 @@
import string
import numpy as np

@nouiz (Member) May 22, 2014

Add this file to doc/library/tensor/nnet/conv.txt, in the autofunction list of functions. Make sure to note that it only works on the GPU for now.

@abergeron (Member) May 22, 2014

Since it's part of the cuda subpackage, it would feel weird adding it to nnet/conv.txt.

@nouiz (Member) May 22, 2014

Not necessarily; from memory there isn't much GPU-specific doc. But if you find a place for it, add it there, or create one. Either way, be sure to link to it from tensor/nnet/conv.txt, so people who read the conv doc can find it.

@abergeron (Member) May 22, 2014

This is painful to do since sphinx can't import theano.sandbox.cuda.fftconv (and it doesn't matter where the autofunction directive is).

I can wrap a giant try..except around all of the imports to let the file be importable, but that seems wrong (and there were problems of circular imports with the file before that could creep back unnoticed if I do that).
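
The guard being described here would look roughly like this (an illustrative sketch, assuming the optional dependencies are pycuda and scikits.cuda as mentioned later in this thread; it is not what the PR ships):

    # Defer the optional GPU dependencies so that Sphinx, or a machine without a
    # GPU, can still import theano.sandbox.cuda.fftconv for autodoc purposes.
    try:
        import pycuda.gpuarray
        import scikits.cuda.fft
        import scikits.cuda.cublas
        scikits_cuda_available = True
    except ImportError:
        scikits_cuda_available = False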

In theano/sandbox/cuda/opt.py:
isinstance(node.op, GpuConv) and
node.op.border_mode == 'valid' and
node.op.subsample == (1, 1)):
return [conv2d_fft(node.inputs[0], node.inputs[1])]

@nouiz (Member) May 22, 2014

Try to pass op.kshp and op.imshp.

@abergeron (Member) May 22, 2014

I can't, since they are not in the right format: len(kshp) == 2 and len(imshp) == 3, while conv2d_fft expects len() == 4 for both.

@nouiz (Member) May 22, 2014

You can fill in the missing entries with the corresponding shape[i]. Do that when parts of kshp and imshp have None values.
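
A hypothetical helper illustrating that suggestion (the name fill_shape and its exact use here are assumptions, not code from this PR):

    # Complete a partially known static shape with the matching symbolic sizes,
    # so whatever entries the user did provide are still passed along.
    def fill_shape(static_shape, var):
        return tuple(s if s is not None else var.shape[i]
                     for i, s in enumerate(static_shape))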

@abergeron (Member) May 22, 2014

There is also other trickery since imshp and kshp are massaged according to padding and other factors that just confuse the issue. I am open to revisiting this in a later PR, but it might take a while (as in maybe another day, just for that).

In theano/sandbox/cuda/tests/test_fftconv.py:
res_ref = f_ref()
res_fft = f_fft()
numpy.testing.assert_allclose(res_ref, res_fft)

@nouiz (Member) May 22, 2014

Test a case with padding of the inputs both when the input size is odd and when it is even.

@abergeron (Member) May 22, 2014

I can test the padding (as in pad_last_dim=True) when the last dim is odd, but I can't test it when it is already even; that would just crash at runtime with an obscure error.

@nouiz (Member) May 22, 2014

OK for this PR, but instead of hardcoding +1, we could compute the shape so that it is even.

@abergeron (Member) May 22, 2014

No we can't. At least not without introducing an ifelse or a switch which you told us was too slow.

@dwf (Member) May 22, 2014

shape % 2?

@dwf (Member) May 22, 2014

Or rather + (1 - shape % 2)

@abergeron (Member) May 22, 2014

Maybe that could work, to get the right shape, but that would mean that we introduce a copy of the input data even when no padding is necessary. I don't think this is such a good idea from a performance point of view.

@@ -67,11 +69,11 @@ def sym_conv2d(input, filters):
output = sym_conv2d(input, filters)
output.name = 'conv2d(%s,%s)' % (input.name, filters.name)
theano_conv = theano.function([input, filters], output)
theano_conv = theano.function([input, filters], output, mode=self.mode)

@nouiz (Member) May 22, 2014

self.mode and self.dtype don't exist. So this will fail.

@abergeron (Member) May 22, 2014

Yes they do; see the changes on lines 15 and 16.

@nouiz (Member) May 22, 2014

ok.

@dwf (Member) commented May 22, 2014

"It can also be used directly if you want to use it only in parts of the graph, but there is no support for the automatic gradient in that case."

What does this mean?

@nouiz (Member) commented May 22, 2014

It means that you can manually introduce that version of the convolution, but in that case the gradient won't be implemented.

@nouiz (Member) commented May 22, 2014

Guillaume told me that the full case should work. Can you add a test for it? You can reuse the test cases from the TestConv2d that you started to generalize for reuse here.

@abergeron (Member) commented May 22, 2014

I can't reuse TestConv2d since it tests odd-last-dim shapes, which don't work.

Also, @dwf, I mean that there is no grad() on a bunch of the ops used.

@dwf (Member) commented May 22, 2014

I see. Is the use intended to be that it is substituted in after graph creation, hence grad is unneeded? Isn't the grad fairly simple if you have both full and valid mode convolution implemented?

@abergeron (Member) commented May 22, 2014

Its use is intended to replace GpuConv as an optimisation.

But the problem with the grad is that it's not a single op, but rather a graph that has an FFT transform, a complex dot product and an inverse FFT transform. I have no idea what the gradient of an FFT looks like and I don't want to spend a lot of time figuring it out.
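
For context, the structure described above (FFT, a complex product accumulated over channels, inverse FFT, then cropping) can be sketched in NumPy as follows. This only mirrors the idea, not the PR's GPU ops, and the helper name is made up:

    import numpy as np

    def fftconv2d_valid_reference(images, filters):
        # images: (batch, channels, H, W); filters: (n_filters, channels, h, w)
        b, c, H, W = images.shape
        n, _, h, w = filters.shape
        F_img = np.fft.rfft2(images, s=(H, W))    # FFT of the images
        F_flt = np.fft.rfft2(filters, s=(H, W))   # FFT of the zero-padded filters
        # the "complex dot product": accumulate over the channel axis
        F_out = np.einsum('bcxy,ncxy->bnxy', F_img, F_flt)
        out = np.fft.irfft2(F_out, s=(H, W))      # inverse FFT (circular convolution)
        # keep the region that equals the 'valid' linear convolution
        return out[:, :, h - 1:, w - 1:]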

In theano/sandbox/cuda/opt.py:
@@ -1120,6 +1121,16 @@ def values_eq_approx(a, b):
# differently then the gpu ConvOp
return [out]
@register_opt()

@nouiz (Member) May 22, 2014

This will register it by default. Remove that decorator and add this line after the opt:

gpu_optimizer.register("local_conv_fft", local_conv_fft)

As there will be 2 opts, I think we should remove the Theano flag and ask the user to use optimizer_including=local_conv_fft:local_conv_full_fft to enable both of them.

@abergeron (Member) May 22, 2014

So should I just remove the register_opt() decorator, or do I need to add your call after it for the optimizer_including trick to work?

@nouiz (Member) May 22, 2014

You need to do both: remove the decorator and register manually with the right param.
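
Putting the two steps together, the intended pattern looks roughly like this (a sketch assuming it sits in theano/sandbox/cuda/opt.py, where gpu_optimizer, GpuConv and local_optimizer are already in scope; the decorator argument and the conv2d_fft import are assumptions):

    from theano.sandbox.cuda.fftconv import conv2d_fft  # assumed import

    # Note: no @register_opt() decorator, so the rewrite is off by default.
    @local_optimizer([GpuConv])
    def local_conv_fft(node):
        if (isinstance(node.op, GpuConv) and
                node.op.border_mode == 'valid' and
                node.op.subsample == (1, 1)):
            return [conv2d_fft(node.inputs[0], node.inputs[1])]

    # Registered by name, so users opt in with
    # THEANO_FLAGS=optimizer_including=local_conv_fft
    gpu_optimizer.register("local_conv_fft", local_conv_fft)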

@nouiz (Member) commented May 22, 2014

A user just showed me a case where he used GpuConv2d to do 1d convolution in the full mode, and it was very slow (the profile shows essentially 100% of the time spent there, with rounding).

As the full case works and is now tested, can you add an opt for this?

@abergeron (Member) commented May 22, 2014

Yes I can, but I was under the impression that this was somewhat urgent. With all the things that keep piling on this PR, it'll take another week to complete (and I can't even get the doc to work).

@nouiz (Member) commented May 22, 2014

We discussed many things here, and not all of them need to be done for this PR, but I think we should not forget them. Here is the list of what I think should be done in this PR:

  • use unittest_tools.assert_allclose() (very fast)
  • fix how we register the opt
  • add user doc, so that I can direct people to it when I suggest they use this. This can be manual doc in tensor/nnet/conv.txt. I don't understand your import problem. I made sure we can import some of the cuda files for the docs. This means we should be able to import the file on a computer without a GPU and nvcc. But this can be done later.

To do in this PR or just after:

  • make the opt and associated test for the full case and update the doc for this.

To do later:

  • Should we make the pad parameter work when the image size is even?
  • make the file importable without a GPU and without nvcc
  • Should we pass the user-specified shape to the fft conv?

Did I forget something? Do you agree with the priorities?

@abergeron (Member) commented May 23, 2014

The import problem is that import theano.sandbox.cuda.fftconv fails because fftconv tries to import pycuda and scikits.cuda. Therefore autofunction and/or automodule can't work on it.

abergeron and others added some commits May 12, 2014

Import fftconv.py with some pep8 modifications. This comes from Sander Dieleman <sanderdieleman@gmail.com> (@benanne on github).

Fixed some typos and a dimension that was probably wrong. This doesn't solve the issue with border_mode='full' because we still have other errors.

Added support for images whose last dimension is odd. This is more expensive computationally, and it costs more in memory, but it's optional.
@abergeron (Member) commented May 23, 2014

I added a mention in the docs. I also had to rebase because of it, since that area of the docs had changed recently and the PR would not merge.

@nouiz (Member) commented May 24, 2014

On oolong I have this error:

Traceback (most recent call last):
  File "convolutional_mlp.py", line 32, in <module>
    import theano
  File "/u/bastienf/repos/theano/__init__.py", line 85, in <module>
    import theano.sandbox.cuda
  File "/u/bastienf/repos/theano/sandbox/cuda/__init__.py", line 248, in <module>
    import theano.sandbox.cuda.fftconv
  File "/u/bastienf/repos/theano/sandbox/cuda/fftconv.py", line 18, in <module>
    misc.init()
  File "/opt/lisa/os/lib/python2.7/site-packages/scikits/cuda/misc.py", line 154, in init
    cula.culaInitialize()
  File "/opt/lisa/os/lib/python2.7/site-packages/scikits/cuda/cula.py", line 388, in culaInitialize
    culaCheckStatus(status)
  File "/opt/lisa/os/lib/python2.7/site-packages/scikits/cuda/cula.py", line 264, in culaCheckStatus
    raise culaExceptions[status](error)
scikits.cuda.cula.culaRuntimeError: 3
@abergeron (Member) commented May 24, 2014

According to research it means that the cula initialization code encounters this cuda error:

cudaErrorInitializationError = 3
The API call failed because the CUDA driver and runtime could not be initialized.

I do get that same error on that machine, but I notice that you're also running the scan testsuite on that GPU (which may or may not be the cause).

@bhack commented May 25, 2014

Could this be used for the opencl (CPU/GPU) counterpart? https://github.com/geggo/gpyfft

@benanne (Contributor) commented May 25, 2014

It looks like gpyfft does batched FFTs, so it should be possible to implement gpyfft-based versions of at least the CuFFTOp and the CuIFFTOp. A batched complex dot product implementation is also needed though. In CUDA this is achieved with cublasCgemmBatched; I don't know if there is a good OpenCL alternative for this (and a corresponding Python wrapper).

@bhack commented May 25, 2014

I think that could be covered by clBLAS, but the Python wrapper is just a starter and covers only sgemm as an example: https://github.com/clMathLibraries/clBLAS/blob/develop/src/wrappers/python/README.txt

@benanne (Contributor) commented May 25, 2014

If clBLAS also comes with batched gemm operations that could work. I can't find any reference to clblasCgemmBatched (only clblasCgemm), but maybe it is named differently because it is not a standard level 3 BLAS operation afaik.

It is of course possible to just do separate clblasCgemm calls for each dot product, but this seems to negate much of the performance advantage of using an FFT-based convolution in the first place, at least in the CUDA / cuBLAS setting.

@abergeron (Member) commented May 25, 2014

XgemmBatched is indeed not a standard BLAS operation and I haven't seen it in libraries other than cuBLAS.

We already have a Python wrapper of sorts for clBLAS in libgpuarray.

The plan is to add support for https://github.com/clMathLibraries/clFFT and cuFFT to provide Fourier transform capabilities, but not right now.

@bhack commented May 25, 2014

Yes, gpyfft is a clFFT wrapper. So isn't it more useful to put effort into libgpuarray instead of pushing new ops into Theano?

@abergeron (Member) commented May 25, 2014

Yes we plan to add that support to libgpuarray.

@nouiz (Member) commented May 26, 2014

@abergeron, I have a crash when I try this PR with the convolution in the DLT. Can you check what is going on?

@abergeron (Member) commented May 26, 2014

Are you sure it isn't using odd inputs somewhere? What does the crash look like? If it's something about a reshape, then it is the odd dim problem.

@benanne (Contributor) commented May 27, 2014

Out of curiosity, is there a reason why padding for odd input sizes can't be done automatically? I feel like I'm missing something.

I understand this introduces an extra copy operation, and the associated slowdown, but as long as this is documented I don't think this is a problem. Maybe it could be controlled with a keyword argument that has "pad if necessary" semantics, instead of "always pad".

If I understand correctly, the optimization that inserts the FFT-based convolution in the graph will always insert an op with pad_last_dim=False (which is the default), so using convolutions with odd input sizes will always fail, unless the user manually inserts the conv2d_fft op into the graph and remembers to set pad_last_dim=True.

If pad_last_dim added the padding only if necessary, the FFT-based op inserted by the optimization would always work (just a bit slower if the input size is odd). I think this would greatly increase usability. The user can then decide themselves whether they find the slowdown acceptable, and the code doesn't need to change when input and filter sizes are changed.

@abergeron (Member) commented May 27, 2014

Because the sizes are symbolic, we can't know if they are odd or even at graph creation time.

So for "pad if necessary" we either have to always copy the input data into a "padding buffer" that may be completely useless or introduce an ifelse that slows down the graph. Not only would that slow down the graph, but it would require even more memory on the GPU, rendering large graphs impossible.

The other option might be to add a ConvPad op that would have access to the sizes in play and pad the buffer to be even or just return a view. But that wouldn't solve the memory problem.

Finally, since the optimization has to be manually enabled, such restrictions are acceptable.
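
To make the trade-off concrete, a symbolic "pad if necessary" would look roughly like the sketch below (variable names are made up, and this is not code from the PR). Both branches exist in the graph, which is exactly the ifelse cost described above:

    import theano.tensor as T
    from theano.ifelse import ifelse

    x = T.tensor4('x')  # (batch, channels, rows, cols)
    zeros_col = T.zeros((x.shape[0], x.shape[1], x.shape[2], 1), dtype=x.dtype)
    padded = T.concatenate([x, zeros_col], axis=3)   # a full copy of x plus one zero column
    # branch on the (symbolic) parity of the last dimension
    maybe_padded = ifelse(T.eq(x.shape[3] % 2, 1), padded, x)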

@benanne (Contributor) commented May 27, 2014

That makes sense, I forgot that the sizes aren't necessarily available at compile time. I have made a habit of always specifying the image_shape and filter_shape. Thanks for the clarification.

@nouiz (Member) commented May 27, 2014

It crashes with a cula error, not with a reshape error. Can you try it here and see if you also have this problem?

@abergeron (Member) commented May 27, 2014

I ran the convolutional_mlp.py example from the DLT on my machine with both optimizations enabled and I get no errors (and even a slight speedup).

@nouiz (Member) commented May 28, 2014

It is a problem with the installation here. So I am merging the PR.

@nouiz merged commit 7555005 into Theano:master on May 28, 2014

1 check passed: continuous-integration/travis-ci (The Travis CI build passed)