
Perturbation Confusion (Nested Differentiation Bug) #83

Closed
jrevels opened this issue Dec 18, 2015 · 13 comments · Fixed by #213

Comments

@jrevels
Member

jrevels commented Dec 18, 2015

Nested differentiation should not be trusted until this issue is resolved. Not all nested calculations will be affected, but I think it's better to play it safe. Non-nested differentiation should be unaffected.

EDIT:

This bug is restricted to cases where the inner function closes over the argument at which an outer function is being differentiated.

Here are some examples:

const D = ForwardDiff.derivative

# inner function is a closure over x -> WILL GIVE WRONG RESULT
D(x -> x * D(y -> x + y, 1), 1) 

# inner function is a closure over v -> WILL GIVE WRONG RESULT
ForwardDiff.gradient(v -> sum(v) * D(y -> y * norm(v), 1), [1])

# inner function doesn't use x at all -> WILL GIVE CORRECT RESULT
D(x -> x * D(y -> y^2, 1), 1)

# inner function is a closure, but over z, not x -> WILL GIVE CORRECT RESULT
z = 3
D(x -> x / D(y -> y^z, 1), 1)

I discovered today that ForwardDiff suffers from perturbation confusion.

I noticed this when I was looking at the test suite earlier today and realized we had a test missing. The missing test should've been verifying that the following bug didn't occur:

julia> const D = ForwardDiff.derivative
derivative (generic function with 7 methods)

julia> D(x -> x * D(y -> x + y, 1), 1)
2 # this should obviously be 1
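To make the failure mode concrete, here is a minimal, self-contained dual-number sketch (`TinyDual`/`tinyD` are illustrative stand-ins, not ForwardDiff's actual implementation). Because the single perturbation component carries no tag, the outer call's perturbation on `x` leaks into the inner derivative:

```julia
# Minimal untagged dual number: one perturbation slot shared by ALL
# derivative calls, which is exactly what causes the confusion.
struct TinyDual <: Number
    val::Float64
    eps::Float64
end
TinyDual(x::Real) = TinyDual(Float64(x), 0.0)
Base.:+(a::TinyDual, b::TinyDual) = TinyDual(a.val + b.val, a.eps + b.eps)
Base.:*(a::TinyDual, b::TinyDual) =
    TinyDual(a.val * b.val, a.val * b.eps + a.eps * b.val)
Base.promote_rule(::Type{TinyDual}, ::Type{<:Real}) = TinyDual
Base.convert(::Type{TinyDual}, x::Real) = TinyDual(x)

# seed a unit perturbation, evaluate, extract the perturbation
tinyD(f, x) = f(TinyDual(Float64(x), 1.0)).eps

# The inner derivative sees x's perturbation plus y's, so it returns
# 2 instead of 1, and the outer derivative comes out as 2:
tinyD(x -> x * tinyD(y -> x + y, 1), 1)  # gives 2.0; the true answer is 1
```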

I thought I had committed this test a while ago (at the same time that I wrote the other nested differentiation tests), but apparently I screwed up somewhere, because I can't find it in the commit history.

I won't have time to really dig into this for a week or two, and a fix might require significant changes to the package (likely implementing a tagging system). I won't be sure of the best path forward until I do some thorough testing/code review. I'll go ahead and update the docs/README to make it clear that this issue exists.

@mlubin I bet this is also an issue for a lot of nested differentiation that people do using DualNumbers.jl, unless they were wise enough to code around it. I don't think DualNumbers.jl has a responsibility to provide a tagging system or anything, but it might be worth mentioning in that package's README.

@mlubin
Copy link
Contributor

mlubin commented Dec 18, 2015

I wonder what a tagging system would look like, an extra integer field for every forward diff number?

@jrevels
Copy link
Member Author

jrevels commented Dec 18, 2015

Possibly. I'm curious about the performance burden a tagging system might impose on non-nested differentiation. Correctness is the first priority, of course, but it would be worth it to figure out if we could resolve this issue without incurring a performance penalty on already correct cases.

Maybe there would be a way to lift some of the work into the compilation phase? I'm not really sure yet.

@jrevels
Copy link
Member Author

jrevels commented Jan 7, 2016

So, after some discussion with @mlubin, I think we might've figured out a semi-reasonable strategy for solving this issue. TL;DR: The v0.5 closure rewrite (JuliaLang/julia#13412) might allow us to reach in and strip out any problematic closed-over perturbations before the closure gets called.

Essentially, the problem is that perturbations attached to arguments of the target function could leak into inner functions that close over these arguments. When a differentiation operation on the closure introduces a separate set of perturbations, these new perturbations are vulnerable to erroneous interaction with the closed-over perturbations.

Most forward-mode implementations solve this problem by assigning matching tags to differentiation operations and their associated perturbations, so that the system can prevent interactions between perturbations that belong to separate differentiation operations. Then, when gathering results, each differentiation operation extracts only the perturbations that share its tags.
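For illustration, the tag-matching idea could be sketched like this (hypothetical `TaggedDual`/`taggedD` written in modern Julia syntax, not a proposal for ForwardDiff's actual representation): each derivative call draws a fresh tag, seeds its perturbation under that tag, and extracts only that tag's component afterwards, so closed-over perturbations from an outer call pass through inner calls untouched.

```julia
# Hypothetical tag-matched dual: perturbation components are keyed by the
# tag of the derivative call that created them. (Uses mergewith, Julia ≥1.5.)
struct TaggedDual <: Number
    val::Float64
    eps::Dict{Int,Float64}  # tag => perturbation component
end
Base.:+(a::TaggedDual, b::TaggedDual) =
    TaggedDual(a.val + b.val, mergewith(+, a.eps, b.eps))
Base.:*(a::TaggedDual, b::TaggedDual) =
    TaggedDual(a.val * b.val,
               mergewith(+, Dict(t => a.val * e for (t, e) in b.eps),
                            Dict(t => b.val * e for (t, e) in a.eps)))
Base.promote_rule(::Type{TaggedDual}, ::Type{<:Real}) = TaggedDual
Base.convert(::Type{TaggedDual}, x::Real) =
    TaggedDual(Float64(x), Dict{Int,Float64}())

const NEXT_TAG = Ref(0)
function taggedD(f, x)
    tag = (NEXT_TAG[] += 1)              # unique tag per derivative call
    r = f(TaggedDual(Float64(x), Dict(tag => 1.0)))
    return r isa TaggedDual ? get(r.eps, tag, 0.0) : 0.0
end

# The closed-over perturbation on x carries the outer tag, so the inner
# extraction ignores it:
taggedD(x -> x * taggedD(y -> x + y, 1), 1)  # gives 1.0, the correct answer
```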

I've been playing around with implementing a tagging system, but every idea I've come up with would either cause big performance regressions (mostly due to loss of type stability or excessive code generation), or force us to sacrifice some of ForwardDiff's unique features (like the ability to use tunable, stack-allocated perturbation containers).

However, we may not need to utilize a tagging system if we can simply remove the closed-over perturbations before the closure ever gets called by the differentiation operation. JuliaLang/julia#13412 could make intercepting and stripping external perturbations possible, since the closed-over data is explicitly stored as a field in the function type. This would allow us to recursively remove perturbations from the child elements of the closed-over data. With this strategy, for example, the following should work:

function f(x::Number)
    foo = [Foo(x), Foo(x + 1)]
    g(y) = dot(foo, y)
    # ForwardDiff.gradient would create and then operate on a
    # new instance of g where foo's elements are stripped of 
    # problematic perturbations
    v =  ForwardDiff.gradient(g, [2.0, 3.0])
    return sum(v)
end

ForwardDiff.derivative(f, 1)
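For concreteness, the recursive stripping might look roughly like the sketch below. Everything here is hypothetical: `strip_duals` is an invented helper, `ForwardDiff.value` is assumed as the accessor for a dual's primal value, and the generic struct-rebuilding fallback glosses over the hard part (intercepting and reconstructing the closure itself).

```julia
# Hypothetical sketch: recursively replace closed-over dual numbers with
# their primal values so no outer perturbations enter the closure.
strip_duals(x::Real) = x
strip_duals(d::ForwardDiff.Dual) = strip_duals(ForwardDiff.value(d))
strip_duals(a::AbstractArray) = map(strip_duals, a)
# generic fallback: rebuild a struct with stripped fields (sketch only)
strip_duals(x::T) where {T} =
    isstructtype(T) ? T(map(n -> strip_duals(getfield(x, n)), fieldnames(T))...) : x

# In the example above, the proposal would effectively rewrite g so that
# it behaves like: gstripped(y) = dot(strip_duals(foo), y)
```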

We'd have to be careful not to perform unnecessary copies, but there's no reason this couldn't be accomplished in theory. The main restriction we'd have to expose to the user is that closures could not rely on mutating closed-over data. For example, this code wouldn't work properly:

function f(x::Number)
    foo = [x]
    g(y) = (foo[1] += y; return foo[1]) # the behavior of f relies on g mutating foo
    r = zero(x)
    for i in 1:10
        r += ForwardDiff.derivative(g, i)
    end 
    return r
end

ForwardDiff.derivative(f, 1)

If anybody has any thoughts on this kind of strategy (and potential limitations), I'd be interested to hear them.

@KristofferC
Collaborator

closures could not rely on mutating closed-over data

That would be quite bad for me, at least for how I am doing things now.

In my case, I solve a set of nonlinear equations where the gradient at a point depends on results from computing the residual. In other words, the gradient and residual equations need to "talk" to each other. The way I do this is by making a closure over a parameter Y for the gradient and having the residual equation update Y. That way the gradient has access to Y.

Maybe there is some other way for me to solve it, but it feels like a quite significant limitation.

@jrevels
Member Author

jrevels commented Jan 8, 2016

Technically, the restriction is more specific than my earlier statement - the proposed mutation restriction would only be on closed-over objects that sneak external perturbations into a closure that is being differentiated.

In other words, the proposed restriction is actually less restrictive than the one already imposed by this bug. If the code you're talking about currently relies on these objects entering the closure scope at all, then it's quite likely that it already suffers from perturbation confusion. Can you link to the code you're talking about?

@KristofferC
Collaborator

Your last post clarified things. There are no perturbations in the data being shared; they can be seen pretty much as constants when the gradient is being evaluated. So it should be fine then.

@aaronsheldon

I was wondering, would you be able to use type parameterization of the dual numbers to implement a lightweight tagging system?

Specifically, a third type parameter could be added to dual numbers that takes any tag. The forward differentiation operator would then map to zero any dual numbers whose tag type parameter does not match. By default, the tag parameter could be set to zero, so that differentiation treats all variables inside a function as the same variable (which it currently does). The tag parameter could then be used to mark two dual numbers as carrying perturbations from two distinct differentiation operations.

@jrevels
Member Author

jrevels commented Aug 15, 2016

@aaronsheldon If you're talking about removing the erroneous perturbations before executing the target function, you don't need a tagging system. See my comment here.

If you want to remove the erroneous perturbations during execution of the target function, there are some tough decisions you'd have to make. I've played around with a couple different tagging implementations but none of them were fully satisfactory. I would love it if somebody came up with a good one.

For example, you have to have some mechanism for choosing a tag within elementary function definitions. Take this simple function:

Base.:+{tag1,tag2}(a::Dual{tag1}, b::Dual{tag2}) = # ?

These tags could be any valid type parameter - function types, integers, symbols, etc. But which tag do you give preference to in this operation? In other words, which of these Duals contains the "good" perturbations, and which contains the "bad" perturbations?

My solution for this was a tagging system where each tag is an integer corresponding to the "closure level" of the dual number. Dual numbers in non-closure scope have tag 0, dual numbers in a closure have tag 1, dual numbers in a closure within a closure have tag 2, etc. Then, the higher-tagged dual number is always the one with "good" perturbations, and you can ignore the lower-tagged dual number's perturbations.
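As a sketch (hypothetical `LDual` type, not actual ForwardDiff code), the level-based selection rule amounts to dispatch like this, where the branch on the tag parameters is the selection logic that would ideally be resolved at compile time:

```julia
# Hypothetical level-tagged dual: the integer tag is the nesting depth
# of the differentiation operation that created the perturbation.
struct LDual{tag} <: Number
    val::Float64
    eps::Float64
end

# same level: ordinary dual addition
Base.:+(a::LDual{T}, b::LDual{T}) where {T} =
    LDual{T}(a.val + b.val, a.eps + b.eps)

# mixed levels: the higher-tagged (inner) dual keeps its perturbation;
# the lower-tagged dual is treated as a constant
Base.:+(a::LDual{T1}, b::LDual{T2}) where {T1,T2} =
    T1 > T2 ? LDual{T1}(a.val + b.val, a.eps) :
              LDual{T2}(a.val + b.val, b.eps)
```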

The big problem with this approach is that we'd have to make basically every non-unary function on dual numbers a @generated function in order to do the selection logic at compile time.

@jrevels
Member Author

jrevels commented Jan 13, 2017

Okay, I've settled on what I think is a feasible "solution" to this issue, and is way less hacky than closure interception (which seems like it would be very tricky/fragile if possible at all).

My proposal is for ForwardDiff to error-out explicitly rather than fail silently. To accomplish this, we can add an extra tag type parameter to the Dual type (like all the other proposals in this thread). However, instead of attempting to choose which tag to propagate in method definition, we'll just restrict method signatures to force all Dual arguments to have the same tag i.e. perturbation confusion would result in a MethodError, which we could catch and rethrow with a better message.

The key question is how to generate the tags. We want a unique compile-time tag for every differentiation operation, but we don't want to recompile for every call. We can't just use the target function's type as the tag, since we'd like recursively differentiated functions to work. Thus, I believe the best option is to generate a unique tag at parse-time. In other words, we'll have to change the current API functions to macros (which is something I've played around with in the past anyway).
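Sketched with a hypothetical `SDual` type (modern syntax, not ForwardDiff's actual definitions), the restriction is simply that no cross-tag method signatures exist:

```julia
# Hypothetical same-tag-only dual: mixing tags hits no applicable method.
struct SDual{tag,V<:Real} <: Number
    val::V
    eps::V
end

# arithmetic is defined ONLY for matching tags:
Base.:+(a::SDual{T,V}, b::SDual{T,V}) where {T,V} =
    SDual{T,V}(a.val + b.val, a.eps + b.eps)

# SDual{:outer,Float64}(1.0, 1.0) + SDual{:inner,Float64}(1.0, 1.0)
# throws a MethodError, which the API could catch and rethrow as an
# explicit "perturbation confusion detected" error
```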

@KristofferC
Collaborator

While it may be necessary to fix this issue, I typically find that macros feel a lot more obscure to use than normal functions. Could there also be an option to keep the functional form and have the tag be an optional argument or something (leaving it up to the user to ensure uniqueness)?

@jrevels
Member Author

jrevels commented Jan 13, 2017

Could there also be an option to keep the functional form and have the tag be an optional argument or something (leaving it up to the user to ensure uniqueness)?

Definitely. That was basically my plan for the functions underlying the "safe API". For example, I was thinking something like

immutable gradient{tag} end

macro gradient(args...)
    return :(gradient{$(Expr(:quote, gensym()))}($(args...)))
end

function (::Type{gradient{tag}}){tag}(args...)
    # basically has the same definition that `gradient` has now, but takes the tag into account
end

The tricky part will be ensuring the *Config types are still reusable for different calls, but I think it might be possible to just reinterpret the type with the relevant tag once it is passed in to the gradient{tag} "constructor".

EDIT: Argh, this doesn't seem to work (Foo here is analogous to Config, and Bar is analogous to the duals within the Config instance):

julia> immutable Bar{tag} end

julia> immutable Foo{tag}
           bars::Vector{Bar{tag}}
       end

julia> f = Foo([Bar{1}(), Bar{1}()])
Foo{1}(Bar{1}[Bar{1}(),Bar{1}()])

julia> reinterpret(Foo{2}, f)
ERROR: reinterpret: target type not a leaf bitstype
 in reinterpret(::Type{Foo{2}}, ::Foo{1}) at ./essentials.jl:86

Guess this will be more work than I thought...

EDIT2: Note that this will all still work out, I'll just have to define explicit tag conversion/reinterpretation methods for *Config types.

@jrevels
Member Author

jrevels commented Jan 16, 2017

Hmm...just realized one could pretty trivially break the parse-time-tag approach by defining D(f, x) = @derivative(f, x) and then using D everywhere instead of @derivative. We can't simply convert the API to macros, then. We can still use the underlying idea of derivative{tag}(args...), and just pick a default for tag and tell users to provide a unique tag if they need it.

This solution is definitely incomplete, though - users generally won't have enough contextual information to properly generate unique tags. However, at this point, I'm not sure it's possible to correctly generate disjoint compile-time tags without either gensyming every call or relying on some hacky reflection/introspection process.

@rdeits

rdeits commented May 12, 2017

Should the readme be updated to reflect the fact that this bug is closed?
