
Inconsistent Behaviour in Broadcasting #1036

Open

willtebbutt opened this issue Jul 20, 2021 · 20 comments

Comments

@willtebbutt
Member

willtebbutt commented Jul 20, 2021

As of version 0.6.15, the following:

julia> Zygote.gradient(^, 0.0, 0.9)
(0.0, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> f(x, y) = x ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [NaN, NaN, NaN])

The same code in version 0.6.14:

julia> Zygote.gradient(^, 0.0, 0.9)
(0.0, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> f(x, y) = x ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

the crucial difference being the result of the last call in each case.

If you do this with ForwardDiff, you get

julia> ForwardDiff.gradient(x -> x[1]^x[2], [0.0, 0.9])
2-element Vector{Float64}:
  Inf
 NaN

so this smells to me like ForwardDiff is being invoked for f, whereas individual chain rules are used for sum(x .^ y) because Zygote un-fuses broadcasting where it can (it can't un-fuse f).
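If that's right, a quick way to see where the fused-broadcast numbers come from is to push ForwardDiff duals through f by hand (a sketch of the suspicion above, not how Zygote literally spells it):

using ForwardDiff: Dual, partials

x = Dual(0.0, 1.0, 0.0)  # seed the partial slot for x
y = Dual(0.9, 0.0, 1.0)  # seed the partial slot for y
partials(x ^ y)          # (Inf, NaN) -- matching the fused-broadcast gradients above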

@oxinabox @mcabbott @DhairyaLGandhi I'm assuming that this is related to our recent efficiency upgrades?

This is breaking the build in KernelFunctions.jl. We can of course allow the tests to fail for the time being, but it essentially means that we can't use Zygote with one of our kernels.

@willtebbutt
Member Author

willtebbutt commented Jul 20, 2021

I guess the culprit is the discrepancy between the version of the rule in DiffRules and the version in ChainRules.

In the latter we adopt a different convention from the former.
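For anyone who wants to check, here is a rough way to compare the two rule sources directly at this point (a sketch; the exact numbers depend on which DiffRules / ChainRules versions are installed):

using DiffRules, ChainRules

# DiffRules stores the partial derivatives of ^ as expressions in :x and :y.
dfdx, dfdy = DiffRules.diffrule(:Base, :^, :x, :y)
x, y = 0.0, 0.9
eval(dfdx), eval(dfdy)  # DiffRules convention: (Inf, NaN) here

# ChainRules exposes the same derivatives through rrule's pullback.
_, back = ChainRules.rrule(^, x, y)
back(1.0)               # ChainRules convention at the time of this issue: (NoTangent(), 0.0, 0.0)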

@DhairyaLGandhi
Member

Hmm, using DiffRules seems to be more "correct". Thoughts?

@willtebbutt
Member Author

Unclear to me -- the "subgradient" convention adopted in ChainRules works perfectly in the KernelFunctions use-case. Also, it's what we do for lots of other things (ReLUs, abs, etc.). My guess would be that it's a useful thing to do?

@willtebbutt
Member Author

Hmm, using DiffRules seems to be more "correct". Thoughts?

Hmm, I interpreted this as meaning that you believe the convention adopted in DiffRules is more appropriate than the one in ChainRules. Was I correct in this interpretation?

@oxinabox
Member

oxinabox commented Jul 20, 2021

Returning NaN or Inf for this is just causing pain with little gain, for no real reason.
0.0 is fine.

Consider x^y for x=0 and y=0.9.
The derivative wrt x from below is not defined, since the function errors if we make a small negative change, so we don't care about that case.
The derivative wrt x from above is zero.
Thus we might as well say the derivative is zero.

The derivative wrt y from below is Inf.
The derivative wrt y from above is zero.
I would argue that around things like if a<=b we are largely free to choose which branch we want for a==b (with some provisos on that statement).
And here choosing zero is the choice that leads to more useful behaviour.
No one wants to work with Inf.

This also has the attractive property that if you enumerate all points where the derivative is zero,
i.e. all possible non-boundary local minima, this point gets included,
so if you then check those for the smallest you will find the global minimum.
And if you hit this point during gradient descent, you do in fact want to stop, so a derivative of 0 is useful here also.

Which is where we get the subgradient convention (which I will write up one day; it comes from a discussion I had at a conference).
We say "If the derivative is not defined, but the subgradient contains zero, then just say the derivative is 0".
Or, more directly: if the derivatives taken from all directions of approach form an interval that encloses 0, say the derivative is 0.
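To make that concrete, a toy sketch of the convention (my own illustration, not the actual ChainRules code): take the one-sided derivatives, and report 0 whenever the interval between them contains 0.

# d_below, d_above: one-sided derivatives (may be ±Inf, or NaN when undefined).
function subgradient_convention(d_below, d_above)
    lo, hi = minmax(d_below, d_above)
    if lo <= 0 <= hi
        return 0.0    # the "subgradient contains zero" case
    elseif lo == hi
        return lo     # both sides agree: ordinary derivative
    else
        return NaN    # genuinely ambiguous
    end
end

subgradient_convention(-1.0, 1.0)  # abs at 0                                  -> 0.0
subgradient_convention(Inf, 0.0)   # x^y wrt y at x=0, y=0.9 (as argued above) -> 0.0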

@mcabbott
Member

I think this thread urgently needs a graph:

[plot "power09": x^0.9 and its gradient near x = 0]

As x -> 0 from above, the gradient does go to positive real infinity. Whether to return Inf is another question, you could argue that a large finite value from say x = eps() might be nicer.

A few more cases to check, while I was at it. We should be sure to get all of these right.

[plot "power6": the same check for several more powers of x]

@antoine-levitt
Contributor

How did you obtain these plots? x^y for y non-integer and x negative is impossible to define in a consistent way

@antoine-levitt
Contributor

(eg see your 3.14 plot, which does not look like x^3)

@mcabbott
Member

They allow complex numbers, the dashed lines are the imaginary part when this is nonzero. To get one answer you have to pick a branch, and Julia does. x^(1//3) takes one branch, cbrt(x) takes a different one which is real at negative x.

@antoine-levitt
Contributor

It does not pick a branch? (nor should it)

julia> (-2.0)^(1//3)
ERROR: DomainError with -2.0:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.

@antoine-levitt
Contributor

Ah, you gave it complex numbers, I see. This is different, then. In that case the choice is inferred from the fact that zeros are signed:

julia> (-2.0+0.0im)^(1//3)
0.6299605249474366 + 1.0911236359717214im

julia> (-2.0-0.0im)^(1//3)
0.6299605249474366 - 1.0911236359717214im

@mcabbott
Member

Yes, there's a cut on the negative real axis in Julia's conventions. The 3rd choice is:

julia> cbrt(-2.0)
-1.2599210498948732

julia> cbrt(-2.0+0.0im)
ERROR: MethodError: no method matching cbrt(::ComplexF64)

julia> (-2.0+0.0im)^(1//3) |> abs
1.2599210498948732

julia> (-2.0-0.0im)^(1//3) |> abs
1.2599210498948732

@antoine-levitt
Contributor

antoine-levitt commented Jul 30, 2021

Here's my opinion of what should happen here

  • pow(x::Real,y::Int): do the normal thing
  • pow(x::Real, y::Real): do the normal thing for x > 0, error for x < 0. For x=0, return the one-sided derivative: 0 for y >= 1, +Inf for y < 1
  • pow(x::Complex, y::Any): do the normal thing for all x not on the negative real axis. For x on the negative real axis, the derivative of the imaginary part of the result wrt the imaginary part of x is undefined (so NaN, probably). Technically, the other derivatives (real of result wrt both real and imag of x, as well as imag of result wrt real of x) are well-defined, so the right thing to do is to return a 2x2 matrix with one entry as NaN.

The crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives at the boundary of the domain, where the function is undefined on the other side, but not where the function is discontinuous. Does that make sense?
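As a sketch, the x = 0 branch of case 2 exactly as written above (purely illustrative; the helper name is made up, and the replies below refine the y = 1 and integer-y cases):

# One-sided d/dx of x^y at x = 0, per case 2 as stated.
dpow_dx_at_zero(y::Real) = y >= 1 ? 0.0 : Inf

dpow_dx_at_zero(0.9)  # Inf -- the case from the top of this thread
dpow_dx_at_zero(2.0)  # 0.0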

@mcabbott
Member

For x=0, return the one-sided derivative: 0 for y >= 1, +Inf for y < 1

Except at y=1, where it's 1 -- the 2nd panel of my graph. And at y=0 it's zero. And negative y is allowed too.

But what I'm not sure about is whether returning Inf is a good idea. Maybe it would be better to regularise everything: for x=0 (in this one-sided case), always give the gradients for, say, x = eps(). This will change smoothly between the different cases.

@antoine-levitt
Contributor

julia> (-1.0)^(1.0)
-1.0

Oooh that's evil. Then yeah I agree with you: when y is an integer (positive or negative), the rule should be that pow(x::Real, y::Real) follows pow(x::Real, y::Int). The derivative wrt y when y is an integer and x <= 0 should be NaN.

I don't see the point of regularizing: the one-sided derivative is infinite.

@mcabbott
Member

mcabbott commented Jul 30, 2021

crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives when at the boundary

Yes, I think I agree. When at a boundary, we implicitly assume that this is the boundary of the domain of your parameter too, so of course you care about the one-sided derivative.

When at a singularity in the interior there's no obvious answer mathematically. Although sometimes, like 1/x and 1/x^2 at x=0, we should blithely declare the two one-sided infinities equal and return that. The cut on the -ve real axis I'd need to think about a little more; it might be NaN.

the point of regularizing: the one-sided derivative is infinite.

The point would be to try to save people from NaNs downstream. What you are ultimately doing (in my head at least) is gradient descent, and if your parameter is bounded to [0,1] sometimes you need an arrow telling you to move away from the wall. A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.

That said I'm far from 100% sure that this wouldn't have weird consequences elsewhere. It just seems worth some thought, before jumping from an infinite limit in the pure mathematics to the concrete Inf::Float64.

@antoine-levitt
Contributor

A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.

With what you suggest, the arrow would be 1/eps(). If you're doing things like that, it will break; I would imagine you want to make sure it breaks fast, since that's easier to debug.

@mcabbott
Member

The regulated arrow would depend on the function. So it goes smoothly from "infinite" to 0 for powers near 1:

julia> ForwardDiff.derivative(x -> x^0.9, 0 + eps())
33.08251262391458

julia> ForwardDiff.derivative(x -> x^0.999, 0 + eps())
1.0356643999187012

julia> ForwardDiff.derivative(x -> x^1.001, 0 + eps())
0.9655627827687262

julia> ForwardDiff.derivative(x -> x^1.1, 0 + eps())
0.029925175613304173

That was my thinking in writing x + eps() as the regulator. But perhaps this is far too big, and eps(0) too small... There may well be much better choices than perturbing x.

it will break

How obvious is this? If I have some parameter x in [0,1], and choose to re-parameterise to work with x2 = x^2 say, also in [0,1], then I will probably have many f(sqrt(x2)). It would be nice if these didn't cause problems. I agree it's entirely possible that attempting to regulate this sqrt's gradient may cause far more problems than it solves.
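For instance, something like this (a made-up toy, not from KernelFunctions; the exact output depends on the sqrt rule in use):

using Zygote

g(x) = 3x  # perfectly smooth in its argument
Zygote.gradient(x2 -> g(sqrt(x2)), 0.0)  # (Inf,): the chain rule picks up d(sqrt)/d(x2) = 1/(2*sqrt(0)) = Inf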

@mcabbott
Member

mcabbott commented Sep 5, 2021

Behaviour with ChainRules v1.11.4 now matches ForwardDiff, and the graph, for the gradient with respect to x:

julia> Zygote.gradient(^, 0.0, 0.9)
(Inf, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [0.0, 0.0, 0.0])

julia> f(x, y) = @show(x) ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
([Inf, Inf, Inf], [NaN, NaN, NaN])

ForwardDiff still has NaN for the gradient with respect to the power, where I think it should have zero.

@gaspardbb

Hi,
I guess this is a related issue, but let me know if I should open a new issue.

I discovered inconsistent behavior when using abs and broadcasting. Here is an MRE:

using Zygote

x = [1.0, -1.0]
w = [1.0, 0.0]

g1 = gradient(x) do x
    sum(abs.(x .- w))
end  # ([1.0, -1.0],)

g2 = gradient(x) do x
    abs(x[1] - w[1]) + abs(x[2] - w[2])
end  # ([0.0, -1.0],)

Basically, both are correct in the sense that both provide a valid subgradient for abs at 0: g1 takes x -> 1 and g2 takes the more obvious x -> 0. But it took me a while to debug, and I find the fact that the results are inconsistent really bothersome.

Based on the first post in this thread, the issue might come from different definitions of the subgradient (DiffRules vs ChainRules), but I don't know enough about Zygote's internals to be sure.
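A quick way to check that suspicion (a sketch, assuming the broadcast path goes through ForwardDiff and the scalar path through the ChainRules rule):

using ForwardDiff, ChainRules

ForwardDiff.derivative(abs, 0.0)   # 1.0 -- matches the first entry of g1
_, back = ChainRules.rrule(abs, 0.0)
back(1.0)                          # (NoTangent(), 0.0) -- matches the first entry of g2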

Let me know if I can provide more info!
