
Offer a way for users to access the lower-order values that end up being calculated by FAD methods #37

Closed
jrevels opened this issue Aug 14, 2015 · 8 comments · Fixed by #40

Comments

@jrevels
Member

jrevels commented Aug 14, 2015

Previously discussed here.

For example, if somebody calls hessian(f,x) they should be able to tell the function to return ∇f(x) and/or f(x) as well, since those values are already calculated by way of the Hessian calculation.
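For context, the "already calculated" point can be illustrated with a minimal dual-number sketch in Python (illustrative only; ForwardDiff.jl's actual number types and API differ). Evaluating f on a dual number propagates the value and the derivative together through every operation, so the lower-order values come out as free byproducts:

```python
# Minimal forward-mode AD sketch (illustrative, not ForwardDiff.jl code).
# A Dual carries a value and a derivative; arithmetic on Duals computes
# f(x) and f'(x) in the same pass.

class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._wrap(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2

d = f(Dual(2.0, 1.0))  # seed dx/dx = 1
print(d.val)  # 17.0 -> f(2)
print(d.der)  # 14.0 -> f'(2)
```

Nesting duals inside duals extends this to second derivatives in the same way, which is why a Hessian pass also produces the gradient and the value.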

Here are some ideas of what this might look like:

  1. Have a keyword argument that specifies that the function should return the calculated ForwardDiffNumber, then provide methods in the API to extract the data you want from it. Something like this:
julia> data = hessian(f, x, alldata=true) # returns ForwardDiffNumber
julia> gradient(data) # extract gradient from data
julia> hessian(data) # extract hessian from data

I don't really like alldata as a name for the keyword argument, but you get my drift.

  2. Pass symbols to the method saying what you want up front (once again, via a keyword argument):
# returns in the same order as it's given, with Hessian first 
julia> hess, val, grad = hessian(f, x, also=(:value, :gradient)) 

Thoughts?

@KristofferC
Collaborator

Just a thought, but one way would be, instead of returning the values directly, to give the user the ability to create a tailor-made function that does exactly what they want.

Some examples of how it would work:

g = makefunction(f, returns=(:value, :gradient))
gives a function that does
value, gradient = g(x)

g = makefunction(f, returns=(:value,), mutates=(:gradient,))
gives a function that does
value = g(x, gradient)
and mutates the gradient argument

g = makefunction(f, returns=(:value,), mutates=(:gradient, :hessian))
gives a function that does
value = g(x, gradient, hessian)
and mutates the gradient and hessian argument.

g = makefunction(f, returns=(:gradient,))
gives a function that does
gradient = g(x)
which is the same as what gradient(f) does now.

This basically provides an API to let the user define exactly what the function should return and what should be mutated. That would also remove the type instabilities that suggestion 2 seems to imply.

One problem: what do you do if the user puts :value in the mutates field and the function returns a scalar? Force the user to wrap the scalar in an array?
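In plain Python, the factory idea might be sketched like this (hypothetical names; make_function here is an invented stand-in, not an actual ForwardDiff.jl function):

```python
# Hypothetical sketch of the proposed factory (invented names; not a real
# ForwardDiff.jl API). The caller configures up front which quantities the
# generated function returns, and in what order.
import math

def make_function(f, grad_f, returns=("value",)):
    """Return g(x) yielding the requested quantities, in the order given."""
    def g(x):
        available = {"value": f(x), "gradient": grad_f(x)}
        out = tuple(available[name] for name in returns)
        return out[0] if len(out) == 1 else out
    return g

g = make_function(math.sin, math.cos, returns=("value", "gradient"))
val, grad = g(0.0)
print(val, grad)  # 0.0 1.0
```

The point of the design is that the shape of g's return value is fixed at construction time, rather than varying per call.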

@mlubin
Contributor

mlubin commented Aug 17, 2015

Since this is an expert API designed to save one function evaluation out of many, I think we should provide the fast mutating versions as a starting point. Everything else can be built on top of that.

@jrevels
Member Author

jrevels commented Aug 17, 2015

Here are my thoughts on the two main strategies proposed so far:


  1. Return the ForwardDiffNumber and provide API methods to work with it

This option would be less work to implement, and it would fit in better with the existing implementation. We wouldn't have to handle as many configuration options just to let users get to the data they need, and it's up to the user to grab what they need from the results themselves.

@mlubin The way mutation would be handled would look like this:

julia> data = hessian(f, x, return_all=true); # still looking for the best keyword...

julia> gradient!(gradout, data); # load gradient from data into gradout

julia> hessian!(hessout, data); # load hessian from data into hessout

julia> gradient(data) == gradout
true

julia> hessian(data) == hessout
true

The API would then be composable in pretty predictable ways:

julia> gradient!(output1, f, x);

julia> gradient!(output2, gradient(f, x, return_all=true));

julia> output1 == output2
true

The biggest downside to this approach is that I don't immediately see a way to support a call like

hessian(f, x, return_all=true, chunk_size=10)

It seems to me that this approach could only be allowed when a chunk_size isn't specified.

Pros

  • composability > configuration
  • much easier to implement
  • might actually make the existing fad_api cleaner/more "separable"

Cons

  • I don't know how chunk_size could be used in tandem with this

  2. Configure output to be exactly what you want

Riffing off of @KristofferC's idea/type stability concerns, I think the best way to do this would be to pass in a Tuple type instead of an instance:

julia> fad! = forwarddiff(f, returns=Tuple{:value, :gradient, :hessian, :tensor}, mutates=Tuple{:gradient, :hessian});

Then just pass a tuple of the objects you want to mutate to fad! as the first argument:

julia> val, gradout, hessout, tens = fad!((gradout, hessout), x);

One problem: what do you do if the user puts :value in the mutates field and the function returns a scalar?

I'd just throw an error.

Pros

  • It's not impossible to support simultaneous chunk_size configuration (though it might be super tough to get right)

Cons

  • Implementation of something this general could be really messy, and would be a pretty big change from what's currently on master
  • Things might get rough for codegen/JIT, as the space of allowable function signatures is huge

@KristofferC
Collaborator

@mlubin Why do you say that it saves one function evaluation out of many? Isn't it true that currently you need two function calls to get the value and the gradient, while this could be reduced to one? That will be the case in every iteration, so you'd save half of the evaluations. I am not very familiar with automatic differentiation, so I might have misunderstood something.

Suggestion 1 is pretty good. It exposes the DualNumbers directly to the user, but since this is an advanced feature that is likely not a problem, and once you have the dual numbers you can do whatever you want.

Regarding the function generators: I don't think that codegen is a problem, because even though there are many different possible signatures, any given user is likely to use at most a few of them. The code wouldn't be generated until the user requests the function.

I also don't think the implementation has to be that messy.

@mlubin
Contributor

mlubin commented Aug 17, 2015

@KristofferC, it's a bit tricky to measure exactly, but the complexity of evaluating the gradient increases proportionally with the dimension of the input. So as the dimension of the input increases, the time saved by avoiding the extra function evaluation becomes relatively smaller. The factor of 2 improvement is only really the case for functions of one or two variables, definitely not for functions of 100s of variables.
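A back-of-the-envelope way to see this (a simple cost model under the assumption that a forward-mode gradient costs roughly n function-evaluation equivalents for n inputs; a sketch, not a measurement):

```python
# Rough cost model for the point above: if computing the gradient costs
# ~n function-evaluation equivalents for n inputs, then reusing f(x) from
# that same pass saves a fraction 1/(n + 1) of the combined work of
# computing f(x) and the gradient separately.

def savings_fraction(n):
    """Fraction of total work saved by not re-evaluating f(x)."""
    return 1.0 / (n + 1)

print(savings_fraction(1))    # 0.5   -> the "factor of 2" regime
print(savings_fraction(10))   # ~0.09 -> an order-of-10-variables problem
print(savings_fraction(100))  # ~0.01 -> negligible for 100s of variables
```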

@KristofferC
Collaborator

@mlubin My use case is structural optimization, where you don't need that many variables simultaneously (on the order of 10), but the function cost is large because you need to solve the assembled sparse system. I believe that for these types of applications, avoiding extra function calls is quite significant.

@jrevels
Member Author

jrevels commented Aug 18, 2015

So as the dimension of the input increases, the time saved by avoiding the extra function evaluation becomes relatively smaller. The factor of 2 improvement is only really the case for functions of one or two variables, definitely not for functions of 100s of variables.

My use case is structural optimization, where you don't need that many variables simultaneously (on the order of 10), but the function cost is large because you need to solve the assembled sparse system. I believe that for these types of applications, avoiding extra function calls is quite significant.

With both of these in mind, it seems to me that the problems for which this feature would be really useful are, in general, going to be the problems in which chunk-mode is not useful. If you're specifying a chunk_size, then you are specifying that you want to trade a greater number of function calls for less memory usage. Otherwise, the current implementation assumes memory can be freely spent to minimize the number of function calls.

If that's true, then that makes the single con of option 1 (the inability to simultaneously specify chunk_size) a bit less scary. However, I'm sure there are many cases in which you'd want to find a middle ground (e.g. when your chunk_size is only off from your input dimension by a small factor).

I think next week I'm going to get ForwardDiff.jl back on readthedocs and have a comprehensive section on what chunk_size actually does, so that users can realize upfront the tradeoff @mlubin is talking about.

I don't think that codegen is a problem, because even though there are many different possible signatures, any given user is likely to use at most a few of them. The code wouldn't be generated until the user requests the function.

JIT overhead and poor codegen are two related, but distinct, issues. JIT overhead is (as you say) the compilation cost incurred by the initial call of a method with a specific signature. "Poor codegen", however, ultimately means "the LLVM bitcode compiled from this Julia expression isn't as performant as it could be", which is tied to the compiler's ability to prove assumptions about the behavior of the code (the most prominent example of this is type inference). My intuition is that the more we offload "configuration-like" things onto the type system (which ForwardDiff already does a lot of through @generated functions), the harder our code could be on Julia's codegen. That intuition is at least partly fueled by ignorance, though, since I haven't dug deep into the codegen stuff - a benchmark against an actual implementation would always be best...

@jrevels
Member Author

jrevels commented Aug 21, 2015

I just realized that chunk-mode can fairly easily be supported in conjunction with strategy 1 by loading the chunks into a result ForwardDiffNumber instead of loading them into result arrays. With that problem conceptually out of the way, I'm going to go for strategy 1 (at least for an initial implementation).
