Meta Optimizers #2072

Open
f0k opened this Issue Sep 2, 2014 · 15 comments


@f0k
Contributor
f0k commented Sep 2, 2014

Theano now provides access to several different convolution implementations that perform very differently depending on image and kernel shapes, and depending on the direction when used in training a CNN (forward pass, gradient wrt. weights, gradient wrt. inputs), but also have different restrictions (filter shapes, memory usage).
A lot could be gained by picking the best-performing implementation for each individual convolution in a CNN: https://github.com/soumith/convnet-benchmarks
Because the relative performance of different implementations heavily depends on the GPU used, it is impossible to select the optimal implementation at compile time via some clever heuristic.

So I would like to start a discussion on adding "meta optimizers" to the Theano graph optimizers. A meta optimizer would collect a set of possible replacements for an op, compile each of them, time their performance and then select the best-performing one.

This might be implemented in the form of a regular graph optimizer, but it should have the following features:

  • It should operate on a list of op replacement functions that each return a replacement for a given op or a falseish value to indicate that they are not applicable -- similar to what the current graph optimizers do anyway.
  • It should be possible to register additional op replacement functions with a given meta optimizer. For example, user code may want to register a wrapper for cuda-convnet, cuda-convnet2 or some other Theano-external implementation with Theano's conv2d meta optimizer.
  • It should be able to cache its decisions, so that it does not need to re-run any benchmarks if it is repeatedly asked to find a replacement for a given op and device with a given set of registered replacement functions. (This cache should be as permanent as the compilation cache.)

I'm currently interested in GPU-based conv2d only. In this case, the meta optimizer could spring into action for any GpuConv op whose shapes are fully specified at compile time, by compiling and running a set of proposed replacements on random dummy data. Other use cases for meta-optimization may require more realistic data -- they could hijack the test value computation, for example.
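To make the benchmarking step concrete, here is a minimal sketch of such a timing loop. It is only an illustration, not existing Theano code: the candidates mapping and the pick_fastest helper are hypothetical, and a real graph optimizer would operate on op nodes rather than on output expressions.

    import time
    import numpy as np
    import theano

    def pick_fastest(candidates, input_vars, input_shapes, n_calls=3):
        """Compile each candidate replacement, run it on random dummy data
        of the given shapes, and return the name of the fastest one.
        `candidates` maps a name to a symbolic output expression built
        from `input_vars` (hypothetical, for illustration only)."""
        dummy_data = [np.random.rand(*shape).astype(theano.config.floatX)
                      for shape in input_shapes]
        best_name, best_time = None, float('inf')
        for name, output in candidates.items():
            try:
                fn = theano.function(input_vars, output)
                fn(*dummy_data)  # warm-up; also catches run-time failures
                start = time.time()
                for _ in range(n_calls):
                    fn(*dummy_data)
                elapsed = (time.time() - start) / n_calls
            except Exception:
                continue  # skip candidates that fail to compile or run
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        return best_name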

Sooo...

  1. What do you think of the idea in general?
  2. What would be needed to implement the features listed above for a GpuConv meta optimizer?
  3. Should the implementation plan ahead for possible future meta optimizers, or not care about generality/reusability for now?

/Update Nov 24: A first implementation was done in #2266. TODOs left:

  • The meta-optimizer currently tries all available convolutions, it should only try the ones selected via optimizer_including and optimizer_excluding
  • For gemm, it should try both GpuCorrMM and GpuCorrMM_gradWeights if both are applicable, not rely on the heuristic in local_conv_gemm
  • Timings should be cached on disk to speed up repeated invocations.
  • The configuration flag metaopt.verbose should become an int: 0 for silent mode, 1 to only warn if we cannot meta-optimize some op (missing shape information), 2 for full output of separate timings and selected implementation
  • Update it to try the three cuDNN forward convolution algorithms.
@benanne
Contributor
benanne commented Sep 2, 2014

I really like the idea of automating this procedure. Perhaps it would be useful to mimic the FFTW approach, where different optimization modes can be set depending on your patience (in this case for computing FFTs). (See http://www.fftw.org/doc/Planner-Flags.html )

The exhaustive mode will try every (applicable) approach and pick the fastest one. The 'heuristic' mode will use a heuristic to determine the best algorithm based on shape information. And then there are a few modes in between that enable the user to tune the optimization time / performance trade-off.

@f0k
Contributor
f0k commented Sep 3, 2014

Perhaps it would be useful to mimic the FFTW approach, where different optimization modes can be set depending on your patience

This would be nice, but the premise was that it's nearly impossible to find good heuristics, so a 'heuristic' mode will be hard... For a start, I would suggest providing only a single version of the convolution meta-optimizer, "enableable" ;) via optimizer_including=conv_metaopt, with no notion of different optimization strategies, because that would make it difficult to generalize. At a later stage there could be two or three differently expensive conv_metaopt variants under different names, to give the user a choice without complicating matters too much.
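For reference, opting in could then look like this from user code (a sketch assuming the flag value proposed above; THEANO_FLAGS must be set before Theano is first imported):

    import os
    # Enable the proposed convolution meta-optimizer; the flag has to be
    # in place before the first `import theano` to take effect.
    os.environ['THEANO_FLAGS'] = 'optimizer_including=conv_metaopt'
    import theano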

@nouiz
Member
nouiz commented Sep 4, 2014

The idea is good. It isn't a new idea, but we need someone to work on it :)

I would keep it simple at first. Just compare GpuConv vs. GpuCorrMM vs. the FFT convolution.

If it works, we can then add the callback to register extra ops to compare against, add caching, generalize it, and so on.

I don't see a need for other meta optimizers in the short term, so keeping the goal of generality in the back of your head is good, but don't spend too much time on it. It can be refactored later.


@f0k f0k referenced this issue Sep 11, 2014
Merged

Convolution layer using cudnn. #2096

@cancan101

One of the complexities of a meta optimizer would be dealing with the memory vs. performance tradeoff. At some point, as memory usage grows, you will have to decrease the batch size in order to fit the memory used in the function evaluation within GPU RAM. Reducing the batch size will presumably lead to slower function evaluation. For example, FFT conv might run faster than the alternatives, but it uses so much memory that on certain devices with certain argument sizes, you are forced to reduce the batch size, leading to slower performance.

Are you thinking of having the meta optimizer choose optimal values for batch size as well?

@f0k
Contributor
f0k commented Nov 20, 2014

Are you thinking of having the meta optimizer choose optimal values for batch size as well?

No, it will just use the batch size you gave and try which implementation is the fastest. It does not monitor memory usage. It will not use an implementation that throws a compile time or run time exception when trying it, so it will not use the FFT convolution if it exceeds GPU memory. It will not try to split the convolution into two batches of half the size or something like that, but you can easily register additional optimizers with the meta-optimizer at runtime if you think this would be worth it.
By the way, there's a potential problem in that during the meta-optimization, each convolution will be compiled, run and benchmarked separately. If you use allow_gc=0, it might happen that while the separate convolution instances run fine, a function having all of the convolutions will run out of GPU memory. As long as you use allow_gc=1 and also allocate all your shared variables before compiling the function, it should be fine, though.
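For illustration, a minimal sketch of that setup (the weight shape below is arbitrary; allow_gc=True is already the default and is only shown explicitly):

    import numpy as np
    import theano

    # Keep Theano's garbage collection enabled while the meta-optimizer
    # benchmarks candidate convolutions (this is the default).
    theano.config.allow_gc = True

    # Allocate shared variables (e.g. all network weights) before compiling,
    # so their memory is already claimed while candidates are benchmarked.
    W = theano.shared(np.zeros((96, 3, 7, 7), dtype=theano.config.floatX),
                      name='W')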

@nouiz
Member
nouiz commented Nov 21, 2014

As each convolution will be timed separately, if it introduces an FFT convolution that fits on the GPU by itself, it can still crash at run time due to the intermediate memory usage of other parts of the graph, I think.

During the meta-optimization, you can set allow_gc=True yourself so you do not have this problem.

I haven't looked at the code yet...


@f0k
Contributor
f0k commented Nov 21, 2014

As each convolution will be timed separately, if it introduces an FFT convolution that fits on the GPU by itself, it can still crash at run time due to the intermediate memory usage of other parts of the graph, I think.

Probably yes. But I think the meta-optimizer should not try to account for this. It should be enough if, in the future, we allow fine control over which optimizers are used by the meta-optimizer, so users can disable the FFT optimizer if needed. Furthermore, users can still flag individual convolutions with version='no_fft'; that works as before.

I haven't looked at the code yet...

OK, don't worry, I can use it already :) But please try to merge it before it needs a rebase.

@rubenvereecken

Quite an old discussion it seems. I got here through the documentation page and this article in the hope of squeezing some extra performance out of my program.

I got none of the fancy output @f0k displayed in the post. Before I dig all the way into the code to see if this is still somehow supported, does anyone know if it is? This issue being open for so long scared me.

@benanne
Contributor
benanne commented Jan 4, 2016

Note that cuDNN itself now has multiple convolution implementations (including FFT-based) and also has something like a meta-optimizer built in these days. From Theano, you can control which implementations are used with the Theano flags dnn.conv.algo_fwd, dnn.conv.algo_bwd_filter and dnn.conv.algo_bwd_data. For the different options, have a look at the 'notes' section in the docs: http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html (note that this page is slightly outdated; the option dnn.conv.algo_bwd no longer exists).

By setting all of them to 'time_once', you get similar behaviour to the metaoptimizer: it will time all implementations and use the fastest ones. If you have a recent GPU (Maxwell-based), I thoroughly recommend playing with these settings as I've seen speedups of more than 50%.
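For example, one way to set all three flags (the names and the 'time_once' value are as documented above; THEANO_FLAGS has to be set before Theano is first imported):

    import os
    # Ask cuDNN to time its available algorithms once and keep the fastest.
    os.environ['THEANO_FLAGS'] = ('dnn.conv.algo_fwd=time_once,'
                                  'dnn.conv.algo_bwd_filter=time_once,'
                                  'dnn.conv.algo_bwd_data=time_once')
    import theano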

@f0k
Contributor
f0k commented Jan 4, 2016

Before I dig all the way into the code to see if this is still somehow supported, does anyone know if it is? This issue being open for so long scared me.

It's possible that it needs some work to resurrect since the conv2d interface was recently changed. The issue being open for so long is just because the original post on the top still has several suggestions/TODOs for improving meta-optimization over what I initially implemented.

This being said, if you have cuDNN, then the meta-optimizer probably won't buy you anything over what @benanne suggested -- it was more interesting in the early days when cuDNN was neither widely supported nor mature. And it would become interesting again if we added support for alternative implementations such as facebook's or Nervana's.

@rubenvereecken

Thanks @benanne and @f0k, I managed to get my hands on cuDNN but was trying to see if there were any other gains to be had depending on the problem maybe. So far, with manual exploration, I've managed to get the best runtimes using cuDNN (through Lasagne's interface), so I'll stick to that.

@nouiz
Member
nouiz commented Jan 25, 2017

This was also discussed in gh-3557. It is still needed, as the cuDNN timer doesn't cover our own code and doesn't cover mixing the forward/backward implementations when possible.

@f0k
Contributor
f0k commented Jan 25, 2017

It is still needed, as the cuDNN timer doesn't cover our own code and doesn't cover mixing the forward/backward implementations when possible.

A nagging problem that remains is that it requires the shapes to be known at compile time. This makes it unusable for fully-convolutional models, for example. We could do everything at runtime (like "time_once" and "time_on_shape_change") if we had a way to directly call and time the cuDNN functions and the caffe-based implementation. That in turn would require them to reside in precompiled modules rather than being compiled on demand, which could also greatly reduce compile time for some models. That's more of a long-term idea, though :)

@nouiz
Member
nouiz commented Jan 25, 2017

If we implement shape inference when we build the graph, then we could do that at compile time.
@f0k
Contributor
f0k commented Jan 27, 2017

If we implement shape inference when we build the graph, then we could do that at compile time.

No, I was saying there are models for which the shape is only known at runtime. E.g., for a fully-convolutional model, the spatial dimensions may vary, and for many models, the batch size may vary. So it makes sense to do the timing at runtime, as done for cuDNN. (Of course, if every call has a different shape, it doesn't make sense to use time_on_shape_change, but time_once could still help. And if we implement memoization, after a while there'll be a database of which algorithm to use when.)
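A rough sketch of what such a memoization could look like (purely hypothetical, not existing code): a mapping from runtime shapes to the algorithm that won the timing, persisted like the compilation cache.

    # Hypothetical shape-keyed memoization: the first call with a new shape
    # benchmarks all algorithms, later calls reuse the recorded choice.
    _best_algo = {}  # (input_shape, filter_shape) -> algorithm name

    def algo_for(input_shape, filter_shape, benchmark):
        key = (tuple(input_shape), tuple(filter_shape))
        if key not in _best_algo:
            _best_algo[key] = benchmark(input_shape, filter_shape)
        return _best_algo[key]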
