
Inconsistent Behaviour in Broadcasting #1036

Open

willtebbutt opened this issue Jul 20, 2021 · 20 comments

Comments

@willtebbutt
Member

willtebbutt commented Jul 20, 2021

As of version 0.6.15, the following:

julia> Zygote.gradient(^, 0.0, 0.9)
(0.0, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> f(x, y) = x ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [NaN, NaN, NaN])

The same code in version 0.6.14:

julia> Zygote.gradient(^, 0.0, 0.9)
(0.0, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> f(x, y) = x ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

the crucial difference being the result of the last call in each case.

If you do this with ForwardDiff, you get

julia> ForwardDiff.gradient(x -> x[1]^x[2], [0.0, 0.9])
2-element Vector{Float64}:
  Inf
 NaN

so this smells to me like ForwardDiff is being invoked for f, whereas individual chain rules are used for sum(x .^ y) because Zygote un-fuses broadcasting where it can (it can't un-fuse f).
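If that's right, a quick way to see where the fused-broadcast numbers come from is to push ForwardDiff duals through f by hand (a sketch of the suspicion above, not how Zygote literally spells it):

using ForwardDiff: Dual, partials

x = Dual(0.0, 1.0, 0.0)  # seed the partial slot for x
y = Dual(0.9, 0.0, 1.0)  # seed the partial slot for y
partials(x ^ y)          # (Inf, NaN) -- matching the fused-broadcast gradients above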

@oxinabox @mcabbott @DhairyaLGandhi I'm assuming that this is related to our recent efficiency upgrades?

This is breaking the build in KernelFunctions.jl. We can of course allow the tests to fail for the time being, but it essentially means that we can't use Zygote with one of our kernels.

@willtebbutt
Member Author

willtebbutt commented Jul 20, 2021

I guess the culprit is the discrepancy between the version of the rule in DiffRules and the version in ChainRules.

In the latter we adopt a different convention from the former.
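For anyone who wants to check, here is a rough way to compare the two rule sources directly at this point (a sketch; the exact numbers depend on which DiffRules / ChainRules versions are installed):

using DiffRules, ChainRules

# DiffRules stores the partial derivatives of ^ as expressions in :x and :y.
dfdx, dfdy = DiffRules.diffrule(:Base, :^, :x, :y)
x, y = 0.0, 0.9
eval(dfdx), eval(dfdy)  # DiffRules convention: (Inf, NaN) here

# ChainRules exposes the same derivatives through rrule's pullback.
_, back = ChainRules.rrule(^, x, y)
back(1.0)               # ChainRules convention at the time of this issue: (NoTangent(), 0.0, 0.0)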

@DhairyaLGandhi
Member

Hmm, using DiffRules seems to be more "correct". Thoughts?

@willtebbutt
Member Author

Unclear to me -- the "subgradient" convention adopted in ChainRules works perfectly in the KernelFunctions use-case. Also, it's what we do for lots of other things (ReLUs, abs, etc.). My guess would be that it's a useful thing to do?

@willtebbutt
Member Author

Hmm, using DiffRules seems to be more "correct". Thoughts?

Hmm, I interpreted this as meaning that you believe the convention adopted in DiffRules is more appropriate than the one in ChainRules. Was I correct in this interpretation?

@oxinabox
Member

oxinabox commented Jul 20, 2021

Returning NaN or Inf for this is just causing pain with little gain, for no real reason.
0.0 is fine.

Consider x^y for x=0 and y=0.9.
The derivative wrt x from below is not defined, since the function errors if we make a small negative change, so we don't care about that case.
The derivative wrt x from above is zero.
Thus we might as well say the derivative is zero.

The derivative wrt y from below is Inf.
The derivative wrt y from above is zero.
I would argue that around things like if a<=b we are largely free to choose which branch we want for a==b (with some provisos on that statement).
And here choosing zero is the choice that leads to more useful behaviour.
No one wants to work with Inf.

This also has the attractive property that if you enumerate all points where the derivative is zero,
i.e. all possible non-boundary local minima, this point gets included,
so if you then check those for the smallest you will find the global minimum.
And if you hit this point during gradient descent, you do in fact want to stop, so a derivative of 0 is useful here also.

Which is where we get the subgradient convention (which I will write up one day; it comes from a discussion I had at a conference).
We say "If the derivative is not defined, but the subgradient contains zero, then just say the derivative is 0".
Or, more directly: if the derivatives taken from all directions of approach form an interval that encloses 0, say the derivative is 0.
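To make that concrete, a toy sketch of the convention (my own illustration, not the actual ChainRules code): take the one-sided derivatives, and report 0 whenever the interval between them contains 0.

# d_below, d_above: one-sided derivatives (may be ±Inf, or NaN when undefined).
function subgradient_convention(d_below, d_above)
    lo, hi = minmax(d_below, d_above)
    if lo <= 0 <= hi
        return 0.0    # the "subgradient contains zero" case
    elseif lo == hi
        return lo     # both sides agree: ordinary derivative
    else
        return NaN    # genuinely ambiguous
    end
end

subgradient_convention(-1.0, 1.0)  # abs at 0                                  -> 0.0
subgradient_convention(Inf, 0.0)   # x^y wrt y at x=0, y=0.9 (as argued above) -> 0.0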

@mcabbott
Member

I think this thread urgently needs a graph:

[plot "power09": x^0.9 and its gradient near x = 0]

As x -> 0 from above, the gradient does go to positive real infinity. Whether to return Inf is another question, you could argue that a large finite value from say x = eps() might be nicer.

A few more cases to check, while I was at it. We should be sure to get all of these right.

[plot "power6": the same check for several more powers of x]

@antoine-levitt
Contributor

How did you obtain these plots? x^y for y non-integer and x negative is impossible to define in a consistent way

@antoine-levitt
Contributor

(eg see your 3.14 plot, which does not look like x^3)

@mcabbott
Member

They allow complex numbers, the dashed lines are the imaginary part when this is nonzero. To get one answer you have to pick a branch, and Julia does. x^(1//3) takes one branch, cbrt(x) takes a different one which is real at negative x.

@antoine-levitt
Contributor

It does not pick a branch? (nor should it)

julia> (-2.0)^(1//3)
ERROR: DomainError with -2.0:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.

@antoine-levitt
Contributor

Ah, you gave it complex numbers, I see. This is different, then. In that case the choice is inferred from the fact that zeros are signed:

julia> (-2.0+0.0im)^(1//3)
0.6299605249474366 + 1.0911236359717214im

julia> (-2.0-0.0im)^(1//3)
0.6299605249474366 - 1.0911236359717214im

@mcabbott
Member

Yes, there's a cut on the negative real axis in Julia's conventions. The 3rd choice is:

julia> cbrt(-2.0)
-1.2599210498948732

julia> cbrt(-2.0+0.0im)
ERROR: MethodError: no method matching cbrt(::ComplexF64)

julia> (-2.0+0.0im)^(1//3) |> abs
1.2599210498948732

julia> (-2.0-0.0im)^(1//3) |> abs
1.2599210498948732

@antoine-levitt
Contributor

antoine-levitt commented Jul 30, 2021

Here's my opinion of what should happen here

  • pow(x::Real,y::Int): do the normal thing
  • pow(x::Real, y::Real): do the normal thing for x > 0, error for x < 0. For x=0, return the one-sided derivative: 0 for y >= 1, +Inf for y < 1
  • pow(x::Complex, y::Any): do the normal thing for all x not on the negative real axis. For x on the negative real axis, the derivative of the imaginary part of the result wrt the imaginary part of x is undefined (so NaN, probably). Technically, the other derivatives (real of result wrt both real and imag of x, as well as imag of result wrt real of x) are well-defined, so the right thing to do is to return a 2x2 matrix with one entry as NaN.

The crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives at the boundary of the domain, where the function is undefined on the other side, but not where the function is discontinuous. Does that make sense?
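As a sketch, the x = 0 branch of case 2 exactly as written above (purely illustrative; the helper name is made up, and the replies below refine the y = 1 and integer-y cases):

# One-sided d/dx of x^y at x = 0, per case 2 as stated.
dpow_dx_at_zero(y::Real) = y >= 1 ? 0.0 : Inf

dpow_dx_at_zero(0.9)  # Inf -- the case from the top of this thread
dpow_dx_at_zero(2.0)  # 0.0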

@mcabbott
Member

For x=0, return the one-sided derivative: 0 for y >= 1, +Inf for y < 1

Except at y=1, where it's 1 -- the 2nd panel of my graph. And at y=0 it's zero. And negative y is allowed too.

But what I'm not sure about is whether returning Inf is a good idea. Maybe it would be better to regularise everything: for x=0 (in this one-sided case), always give the gradients for, say, x = eps(). This will change smoothly between the different cases.

@antoine-levitt
Contributor

julia> (-1.0)^(1.0)
-1.0

Oooh that's evil. Then yeah I agree with you: when y is an integer (positive or negative), the rule should be that pow(x::Real, y::Real) follows pow(x::Real, y::Int). The derivative wrt y when y is an integer and x <= 0 should be NaN.

I don't see the point of regularizing: the one-sided derivative is infinite.

@mcabbott
Member

mcabbott commented Jul 30, 2021

crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives when at the boundary

Yes, I think I agree. When at a boundary, we implicitly assume that this is the boundary of the domain of your parameter too, so of course you care about the one-sided derivative.

When at a singularity in the interior there's no obvious answer mathematically. Although sometimes, like 1/x and 1/x^2 at x=0, we should blithely declare the two one-sided infinities equal and return that. The cut on the -ve real axis I'd need to think about a little more; it might be NaN.

the point of regularizing: the one-sided derivative is infinite.

The point would be to try to save people from NaNs downstream. What you are ultimately doing (in my head at least) is gradient descent, and if your parameter is bounded to [0,1] sometimes you need an arrow telling you to move away from the wall. A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.

That said I'm far from 100% sure that this wouldn't have weird consequences elsewhere. It just seems worth some thought, before jumping from an infinite limit in the pure mathematics to the concrete Inf::Float64.

@antoine-levitt
Contributor

A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.

With what you suggest, the arrow would be 1/eps(). If you're doing things like that, it will break; I would imagine you want to make sure it breaks fast, since that's easier to debug.

@mcabbott
Member

The regulated arrow would depend on the function. So it goes smoothly from "infinite" to 0 for powers near 1:

julia> ForwardDiff.derivative(x -> x^0.9, 0 + eps())
33.08251262391458

julia> ForwardDiff.derivative(x -> x^0.999, 0 + eps())
1.0356643999187012

julia> ForwardDiff.derivative(x -> x^1.001, 0 + eps())
0.9655627827687262

julia> ForwardDiff.derivative(x -> x^1.1, 0 + eps())
0.029925175613304173

That was my thinking in writing x + eps() as the regulator. But perhaps this is far too big, and eps(0) too small... There may well be much better choices than perturbing x.

it will break

How obvious is this? If I have some parameter x in [0,1], and choose to re-parameterise to work with x2 = x^2 say, also in [0,1], then I will probably have many f(sqrt(x2)). It would be nice if these didn't cause problems. I agree it's entirely possible that attempting to regulate this sqrt's gradient may cause far more problems than it solves.
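For instance, something like this (a made-up toy, not from KernelFunctions; the exact output depends on the sqrt rule in use):

using Zygote

g(x) = 3x  # perfectly smooth in its argument
Zygote.gradient(x2 -> g(sqrt(x2)), 0.0)  # (Inf,): the chain rule picks up d(sqrt)/d(x2) = 1/(2*sqrt(0)) = Inf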

@mcabbott
Member

mcabbott commented Sep 5, 2021

Behaviour with ChainRules v1.11.4 now matches ForwardDiff, and the graph, for the gradient with respect to x:

julia> Zygote.gradient(^, 0.0, 0.9)
(Inf, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [0.0, 0.0, 0.0])

julia> f(x, y) = @show(x) ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
([Inf, Inf, Inf], [NaN, NaN, NaN])

ForwardDiff still has NaN for the gradient with respect to the power, where I think it should have zero.

@gaspardbb

Hi,
I guess this is a related issue, but let me know if I should open a new issue.

I discovered inconsistent behavior when using abs and broadcasting. Here is an MRE:

using Zygote

x = [1.0, -1.0]
w = [1.0, 0.0]

g1 = gradient(x) do x
    sum(abs.(x .- w))
end  # ([1.0, -1.0],)

g2 = gradient(x) do x
    abs(x[1] - w[1]) + abs(x[2] - w[2])
end  # ([0.0, -1.0],)

Basically, both are correct in the sense that both provide a valid subgradient for abs at 0: g1 takes x -> 1 and g2 takes the more obvious x -> 0. But it took me a while to debug, and I find the fact that the results are inconsistent really bothersome.

Based on the first post in this thread, the issue might come from different definitions of the subgradient (DiffRules vs ChainRules), but I don't know enough about Zygote's internals to be sure.
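A quick way to check that suspicion (a sketch, assuming the broadcast path goes through ForwardDiff and the scalar path through the ChainRules rule):

using ForwardDiff, ChainRules

ForwardDiff.derivative(abs, 0.0)   # 1.0 -- matches the first entry of g1
_, back = ChainRules.rrule(abs, 0.0)
back(1.0)                          # (NoTangent(), 0.0) -- matches the first entry of g2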

Let me know if I can provide more info!
