Meta Optimizers #2072

Closed
f0k opened this Issue Sep 2, 2014 · 25 comments

@f0k
Contributor

f0k commented Sep 2, 2014

Theano now provides access to several different convolution implementations. They perform very differently depending on image and kernel shapes and on the direction when used in training a CNN (forward pass, gradient wrt. weights, gradient wrt. inputs), and they also have different restrictions (filter shapes, memory usage).
A lot could be gained by picking the best-performing implementation for each individual convolution in a CNN: https://github.com/soumith/convnet-benchmarks
Because the relative performance of the different implementations depends heavily on the GPU used, it is impossible to select the optimal implementation at compile time via some clever heuristic.

So I would like to start a discussion on adding "meta optimizers" to the Theano graph optimizers. A meta optimizer would collect a set of possible replacements for an op, compile each of them, time their performance and then select the best-performing one.

This might be implemented in the form of a regular graph optimizer, but it should have the following features:

  • It should operate on a list of op replacement functions that each return a replacement for a given op or a falseish value to indicate that they are not applicable -- similar to what the current graph optimizers do anyway.
  • It should be possible to register additional op replacement functions with a given meta optimizer. For example, user code may want to register a wrapper for cuda-convnet, cuda-convnet2 or some other Theano-external implementation with Theano's conv2d meta optimizer.
  • It should be able to cache its decisions, so that it does not need to re-run any benchmarks if it is repeatedly asked to find a replacement for a given op and device with a given set of registered replacement functions. (This cache should be as permanent as the compilation cache.)

I'm currently interested in GPU-based conv2d only. In this case the meta-optimizer could spring into action for any GpuConv op whose shapes are fully specified at compile time, by compiling and running a set of proposed replacements on random dummy data. Other use cases for meta-optimization may require more realistic data -- they could hijack the test value computation, for example.
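
To make the intended interface more concrete, here is a minimal, framework-agnostic sketch of the idea (all names are illustrative and none of this is existing Theano API): a registry of replacement functions, each candidate compiled and timed on dummy data, with the winning choice cached per op signature and device:

```python
import time


class MetaOptimizer(object):
    """Illustrative sketch only: pick the fastest replacement for an op."""

    def __init__(self):
        self.replacements = []  # callables: node -> candidate graph or None
        self.cache = {}         # (op signature, device) -> winning callable

    def register(self, replacement_fn):
        """User code (e.g. a cuda-convnet wrapper) can add candidates here."""
        self.replacements.append(replacement_fn)

    def optimize(self, node, signature, device, compile_fn, dummy_inputs):
        """Return the best replacement for `node`, or None to keep it."""
        key = (signature, device)
        if key in self.cache:               # reuse an earlier benchmark
            return self.cache[key](node)
        best, best_time = None, float('inf')
        for replace in self.replacements:
            candidate = replace(node)
            if not candidate:               # not applicable to this node
                continue
            try:
                fn = compile_fn(candidate)  # compile just this candidate
                fn(*dummy_inputs)           # warm-up run
                start = time.time()
                fn(*dummy_inputs)
                elapsed = time.time() - start
            except Exception:               # e.g. out of GPU memory
                continue
            if elapsed < best_time:
                best, best_time = replace, elapsed
        if best is None:
            return None                     # keep the original op
        self.cache[key] = best
        return best(node)
```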

Sooo...

  1. What do you think of the idea in general?
  2. What would it take to implement the features listed above for a GpuConv meta-optimizer?
  3. Should the implementation plan ahead for possible future meta-optimizers, or not worry about generality/reusability for now?

/Update Nov 24: A first implementation was done in #2266. TODOs left:

  • The meta-optimizer currently tries all available convolutions; it should only try the ones selected via optimizer_including and optimizer_excluding.
  • For gemm, it should try both GpuCorrMM and GpuCorrMM_gradWeights if both are applicable, and not rely on the heuristic in local_conv_gemm.
  • Timings should be cached on disk to speed up repeated invocations (see the sketch after this list).
  • The configuration flag metaopt.verbose should become an int: 0 for silent mode, 1 to only warn if we cannot meta-optimize some op (missing shape information), 2 for full output of the separate timings and the selected implementation.
  • Update it to try the 3 cuDNN forward convolution algorithms.
  • For 3d convolution, try conv3d() vs. conv3d2d() (the latter is sometimes faster: gh-5713).
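
Regarding the on-disk caching item above, a minimal sketch of what such a cache could look like (purely illustrative; the file location and key format are assumptions, and this is not how Theano's compilation cache actually works):

```python
import json
import os

CACHE_PATH = os.path.expanduser('~/.theano_metaopt_timings.json')  # assumed path


def load_timing_cache():
    """Load previously benchmarked choices, keyed by op/shape/device strings."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}


def store_timing(cache, op_signature, device, best_impl, best_time):
    """Remember which implementation won for this op signature on this device."""
    cache['%s@%s' % (op_signature, device)] = {'impl': best_impl,
                                               'time': best_time}
    with open(CACHE_PATH, 'w') as f:
        json.dump(cache, f, indent=2)
```
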
@benanne

Contributor

benanne commented Sep 2, 2014

I really like the idea of automating this procedure. Perhaps it would be useful to mimic the FFTW approach, where different optimization modes can be set depending on your patience (in this case for computing FFTs). (See http://www.fftw.org/doc/Planner-Flags.html )

The exhaustive mode will try every (applicable) approach and pick the fastest one. The 'heuristic' mode will use a heuristic to determine the best algorithm based on shape information. And then there are a few modes in between that enable the user to tune the optimization time / performance trade-off.

@f0k

Contributor

f0k commented Sep 3, 2014

Perhaps it would be useful to mimic the FFTW approach, where different optimization modes can be set depending on your patience

This would be nice, but the premise was that it's nearly impossible to find good heuristics, so a 'heuristic' mode will be hard... For a start, I would suggest providing only a single version of the convolution meta-optimizer, "enableable" ;) via optimizer_including=conv_metaopt, and with no notion of different optimization strategies, because that would make it difficult to generalize. At a later stage there could be two or three differently expensive conv_metaopt variants under different names, to give the user a choice without complicating matters too much.
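
For reference, enabling such an optimizer from user code could look like this (a usage sketch assuming the flag name proposed above; Theano reads THEANO_FLAGS once at import time, so it has to be set before the import):

```python
import os

# Append the proposed tag to any flags that are already set.
flags = os.environ.get('THEANO_FLAGS', '')
os.environ['THEANO_FLAGS'] = (flags + ',' if flags else '') + \
    'optimizer_including=conv_metaopt'

import theano  # noqa: E402 -- imported only after the flag is in place
```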

@nouiz


Member

nouiz commented Sep 4, 2014

The idea is good. It isn't a new idea, but we need someone to work on it. :)

I would keep it simple at first: just compare GpuConv vs. GpuCorrMM vs. the FFT convolution.

If it works, we can add the callback to register extra ops to compare against, caching, generalization...

I don't see a need for other meta-optimizers in the short term, so keeping the goal of generality in mind is good, but don't spend too much time on it. It can be refactored later.


@f0k f0k referenced this issue Sep 11, 2014

Merged

Convolution layer using cudnn. #2096

2 of 4 tasks complete
@cancan101

cancan101 commented Nov 20, 2014

One of the complexities of a meta optimizer would be dealing with the memory vs. performance tradeoff. At some point, as memory usage grows, you will have to decrease the batch size in order to fit the memory used in the function evaluation within GPU RAM. Reducing the batch size will presumably lead to slower function evaluation. For example, FFT conv might run faster than the alternatives, but it uses so much memory that on certain devices with certain argument sizes you are forced to reduce the batch size, leading to slower performance.

Are you thinking of having the meta optimizer choose optimal values for batch size as well?

@f0k

Contributor

f0k commented Nov 20, 2014

Are you thinking of having the meta optimizer choose optimal values for batch size as well?

No, it will just use the batch size you gave and test which implementation is the fastest. It does not monitor memory usage. It will not use an implementation that throws a compile-time or run-time exception when trying it, so it will not use the FFT convolution if it exceeds GPU memory. It will not try to split the convolution into two batches of half the size or anything like that, but you can easily register additional optimizers with the meta-optimizer at runtime if you think that would be worth it.
By the way, there's a potential problem in that during the meta-optimization, each convolution will be compiled, run and benchmarked separately. If you use allow_gc=0, it might happen that while the separate convolution instances run fine, a function containing all of the convolutions will run out of GPU memory. As long as you use allow_gc=1 and also allocate all your shared variables before compiling the function, it should be fine, though.

@nouiz


Member

nouiz commented Nov 21, 2014

As each convolution will be timed separately, if the optimizer introduces an FFT convolution that fits on the GPU by itself, it can still crash at run time due to other intermediate memory usage, I think.

During meta-optimization, you can set allow_gc=True yourself to avoid this problem.

I haven't looked at the code yet...


@f0k


Contributor

f0k commented Nov 21, 2014

As each convolution will be timed separately, if the optimizer introduces an FFT convolution that fits on the GPU by itself, it can still crash at run time due to other intermediate memory usage, I think.

Probably yes. But I think the meta-optimizer should not try to account for this. It should be enough if, in the future, we allow fine control over which optimizers are used by the meta-optimizer, so users can disable the FFT optimizer if needed. Furthermore, users can still flag some of their convolutions with version='no_fft'; this works as before.

I haven't looked at the code yet...

OK, don't worry, I can use it already :) But please try to merge it before it needs a rebase.

@rubenvereecken

rubenvereecken commented Jan 4, 2016

Quite an old discussion it seems. I got here through the documentation page and this article in the hope of squeezing some extra performance out of my program.

I got none of the fancy output @f0k displayed in the post. Before I dig all the way into the code to see if this is still somehow supported, does anyone know if it is? This issue being open for so long scared me.

@benanne

Contributor

benanne commented Jan 4, 2016

Note that cuDNN itself now has multiple convolution implementations (including FFT-based) and also has something like a metaoptimizer built in these days. From Theano, you can control which implementations are used with the Theano flags dnn.conv.algo_fwd, dnn.conv.algo_bwd_filter and dnn.conv.algo_bwd_data. For the different options, have a look at the 'notes' section in the docs: http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html (note that this is slightly outdated, the option dnn.conv.algo_bwd no longer exists.)

By setting all of them to 'time_once', you get similar behaviour to the metaoptimizer: it will time all implementations and use the fastest ones. If you have a recent GPU (Maxwell-based), I thoroughly recommend playing with these settings as I've seen speedups of more than 50%.
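
As a usage sketch (the flag names and the 'time_once' value are the ones cited in this comment), the three flags could be set together like this, or equivalently via THEANO_FLAGS on the command line or in .theanorc:

```python
import os

# Ask cuDNN to time its algorithms once per function and keep the fastest,
# for the forward pass and for both gradient directions.
os.environ['THEANO_FLAGS'] = ','.join([
    'dnn.conv.algo_fwd=time_once',
    'dnn.conv.algo_bwd_filter=time_once',
    'dnn.conv.algo_bwd_data=time_once',
])

import theano  # noqa: E402 -- import only after the flags are set
```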

@f0k

Contributor

f0k commented Jan 4, 2016

Before I dig all the way into the code to see if this is still somehow supported, does anyone know if it is? This issue being open for so long scared me.

It's possible that it needs some work to resurrect since the conv2d interface was recently changed. The issue being open for so long is just because the original post on the top still has several suggestions/TODOs for improving meta-optimization over what I initially implemented.

That being said, if you have cuDNN, then the meta-optimizer probably won't buy you anything over what @benanne suggested -- it was more interesting in the early days, when cuDNN was neither widely supported nor mature. It would become interesting again if we added support for alternative implementations such as Facebook's or Nervana's.

@rubenvereecken

rubenvereecken commented Jan 4, 2016

Thanks @benanne and @f0k. I managed to get my hands on cuDNN, but was trying to see if there were any other gains to be had, depending on the problem. So far, with manual exploration, I've gotten the best runtimes using cuDNN (through Lasagne's interface), so I'll stick with that.

@nouiz

Member

nouiz commented Jan 25, 2017

This was also discussed in gh-3557. It is still needed, as the cuDNN timer doesn't cover our own code and doesn't cover mixing the forward/backward implementations when possible.

@f0k

Contributor

f0k commented Jan 25, 2017

It is still needed, as the cuDNN timer doesn't cover our own code and doesn't cover mixing the forward/backward implementations when possible.

A nagging problem is still that it requires the shapes to be known at compile time. This makes it unusable for fully-convolutional models, for example. We could do everything at runtime (like "time_once" and "time_on_shape_change") if we had a way to directly call and time the cuDNN functions and the caffe-based implementation. This would require them to reside in precompiled modules rather than being compiled on demand, which in turn could greatly reduce compile time for some models. That's more of a long-term idea, though :)

@nouiz

Member

nouiz commented Jan 25, 2017

If we implement shape inference when we build the graph, then we could do that at compile time.

@f0k

Contributor

f0k commented Jan 27, 2017

If we implement shape inference when we build the graph, then we could do that at compile time.

No, I was saying there are models for which the shape is only known at runtime. E.g., for a fully-convolutional model, the spatial dimensions may vary, and for many models, the batch size may vary. So it makes sense to do the timing at runtime, as done for cuDNN. (Of course, if every call has a different shape, it doesn't make sense to do a time_on_shape_change, but the time_once could still help. And if we implement memoization, after a while there will be a database of which algorithm to use when.)
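
To illustrate the memoization idea at the end of this comment, a small sketch of a per-shape cache populated at runtime (purely illustrative; the candidate implementations are hypothetical stand-ins, not Theano code):

```python
import time

# Maps an (input shape, filter shape) pair to the implementation that won a
# one-off timing for that shape ("time_on_shape_change"-style memoization).
_best_impl_for_shape = {}


def convolve(inputs, filters, candidates):
    """Run the convolution with the fastest known implementation.

    `candidates` maps a name to a callable(inputs, filters); both the names
    and the callables are hypothetical stand-ins for real implementations.
    """
    key = (inputs.shape, filters.shape)
    if key not in _best_impl_for_shape:
        timings = {}
        for name, impl in candidates.items():
            start = time.time()
            impl(inputs, filters)          # one timing run per candidate
            timings[name] = time.time() - start
        _best_impl_for_shape[key] = min(timings, key=timings.get)
    return candidates[_best_impl_for_shape[key]](inputs, filters)
```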

@affanv14

Contributor

affanv14 commented Mar 4, 2017

For gemm, it should try both GpuCorrMM and GpuCorrMM_gradWeights if both are applicable, and not rely on the heuristic in local_conv_gemm

Can you please elaborate on this?

@f0k

Contributor

f0k commented Mar 5, 2017

Can you please elaborate on this?

Instead of only conv_gemm, we should have conv_gemm and conv_gemm_alternative, similar to how it's done for cuDNN. I.e., always try two versions as described here: #2266 (comment). The current conv_gemm includes a heuristic that selects between different code paths depending on the input and filter tensor shapes; this should be circumvented in the meta-optimizer so that it relies only on timing instead.

Note that the meta-optimizer is not integrated in the new gpu backend at all, so this would need to be done first. The new backend also has convolution placeholder Ops that explicitly encode the direction (forward, backward wrt. input, backward wrt. weights), so the set of optimizers participating in the meta-optimization needs to be changed a bit. (The old backend also has these placeholder Ops, but they are replaced by a GpuConv Op first and then optimized with meta-optimization; the new backend directly works with the placeholders.)

@vikramnitin9

Contributor

vikramnitin9 commented Mar 29, 2017

The meta-optimizer currently tries all available convolutions; it should only try the ones selected via optimizer_including and optimizer_excluding

Regarding this, how do you access (from the meta-optimizer) the tag passed to 'optimizer_including=...' to check whether or not to include a particular optimizer?

@nouiz nouiz added the Convolution label Apr 10, 2017

@affanv14

Contributor

affanv14 commented Jul 11, 2017

Does the gpuarray backend require device synchronisation? I'm a little confused about this.

@f0k

Contributor

f0k commented Jul 12, 2017

Does the gpuarray backend require device synchronisation? I'm a little confused about this.

Is this question about the meta-optimizer? Are you working on this? In the meta-optimizer I wrote for the old backend, I specifically compile the function such that the input and output stay on the GPU. In this case, we need device synchronization to measure the computation time -- only if the result is copied to the CPU will there be an implicit synchronization. I would assume it's the same for the new backend, but you can try: compile a function that computes on the GPU without any gpu-to-host or host-to-gpu transfers, run it ten times and time the result, then run it ten times plus a final synchronization and time that.
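
A rough sketch of the suggested experiment (shapes and the graph are arbitrary examples; the host transfer of the result is used to force synchronization, as described above):

```python
import time

import numpy as np
import theano
import theano.tensor as T

# Arbitrary example shapes; the point is only that all data stays on the GPU.
x = theano.shared(np.random.randn(64, 32, 64, 64).astype('float32'))
w = theano.shared(np.random.randn(32, 32, 5, 5).astype('float32'))
out = theano.shared(np.zeros((64, 32, 60, 60), dtype='float32'))

# With `updates`, the result is written into a shared variable, so there is
# no GPU-to-host transfer (and hence no implicit synchronization).
f = theano.function([], [], updates=[(out, T.nnet.conv2d(x, w))])

f()  # warm-up call (kernel compilation, memory allocation)

start = time.time()
for _ in range(10):
    f()
print('without sync: %.4fs' % (time.time() - start))

start = time.time()
for _ in range(10):
    f()
out.get_value()  # copying to the host forces the device to finish its work
print('with sync:    %.4fs' % (time.time() - start))
```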

@affanv14 affanv14 referenced this issue Jul 12, 2017

Merged

Implement Meta-optimizer on new backend #6143

9 of 9 tasks complete
@affanv14

Contributor

affanv14 commented Jul 17, 2017

For gemm, it should try both GpuCorrMM and GpuCorrMM_gradWeights if both are applicable, and not rely on the heuristic in local_conv_gemm

From what I understand, you mean that GpuCorrMM_gradWeights (and, looking at the GpuCorrMM optimization function in opt.py, GpuCorrMM_gradInputs as well?) can compute the forward pass faster in some situations. So new optimization functions need to be created and passed to the meta-optimizer, which in turn will be used to find the fastest implementation. Am I right?

Update it to try the 3 cuDNN forward convolution algorithms

Referring to #2072 (comment): does this still need to be done?

@f0k

Contributor

f0k commented Jul 17, 2017

So new optimization functions need to be created and passed to the meta-optimizer, which in turn will be used to find the fastest implementation. Am I right?

Correct. In the old backend, there were optimization functions like local_dnn_conv_alternative (or some permutation of these words, I don't recall exactly) which tried to swap in a different implementation. See my writeup at http://benanne.github.io/2014/12/09/theano-metaopt.html for a more slow-paced explanation.

Update it to try the 3 cuDNN forward convolution algorithms

Does this still need to be done?

Yes. For the old backend, I only tried two algorithms for each convolution (the "natural" one, and an alternative), but depending on the convolution, all three might be applicable.

@affanv14

Contributor

affanv14 commented Jul 17, 2017

Yes. For the old backend, I only tried two algorithms for each convolution (the "natural" one, and an alternative), but depending on the convolution, all three might be applicable.

Oh, OK. I confused "algo" with the algo argument in dnn_conv; I see that you meant the forward algorithm and the gradients wrt. weights and inputs. I understand now, thanks!

@f0k

Contributor

f0k commented Jul 17, 2017

I confused "algo" with the algo argument in dnn_conv

Ah... I see. The "Update it to try the 3 cuDNN forward convolution algorithms" item was added to the TODO list by @nouiz; I actually don't know whether it referred to the different cuDNN implementations or to the directions. Both are possible. Maybe this was before cuDNN provided a way to time the implementations. In any case, I would rely on cuDNN for the implementations. The meta-optimization base class times the function three times and takes the minimum, so it's okay if the first call uses cuDNN's internal meta-optimization.

@lamblin

Member

lamblin commented Sep 22, 2017

I think this has been done by @affanv14 during the summer. Should we close this?

@nouiz nouiz closed this Oct 2, 2017
