NaN gradients for sqrt #1101

Open
JordiBolibar opened this issue Oct 13, 2021 · 7 comments · May be fixed by JuliaDiff/ChainRules.jl#599

Comments

@JordiBolibar
Contributor

After a long time hunting a bug with @facusapienza21, we have realized that Zygote fails to provide a gradient for the basic sqrt function. This has been discussed at length in this Discourse thread.

Here's a MWE to reproduce the issue:

using Zygote 
using Flux

A₀ = [[1,0] [0,3]]
A₁ = [[0,0] [0,0]]

function loss(θ)
    A = A₀.^θ
    A = sqrt.(A)
    return sqrt(Flux.Losses.mse(A, A₀; agg=sum))
end

θ = 4.0
loss_θ, back_θ = Zygote.pullback(loss, θ) 

In this case, the value of back_θ(1.0) is NaN. However, if we avoid the use of sqrt() by defining the loss function as

function loss(θ)
    A = A₀.^(θ/2)
    return sqrt(Flux.Losses.mse(A, A₀; agg=sum))
end

then Zygote provides the right gradient.
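
One way to sanity-check that claim is a central finite difference (a sketch; the step size h and the tolerance are arbitrary choices, not from the issue):

h = 1e-6                                  # arbitrary step size
fd = (loss(θ + h) - loss(θ - h)) / (2h)   # central finite difference at θ = 4.0
grad, = Zygote.gradient(loss, θ)
isapprox(grad, fd; rtol=1e-4)             # should be true if the gradient is right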

According to @mcabbott, "the reason we get NaN is that the slope of sqrt at zero is infinite. That infinity multiplies the slope of 0^x at 4, which is zero. Whereas with the 0^(x/2) version, the slope is simply zero".
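
A minimal sketch of that product (assuming Zygote's usual rule for sqrt, i.e. a slope of 1/(2√x)):

using Zygote

gradient(sqrt, 0.0)   # (Inf,): the slope of sqrt at zero
Inf * 0.0             # NaN: the product the chain rule forms for the zero entries of A₀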

Since sqrt is such a basic function, this bug can potentially impact a large number of users.

@mcabbott
Member

I don't think this is really a bug in sqrt that can be solved. But avoiding such issues is one reason to define gradient rules for larger functions -- this is something like norm, and we can choose to give that smoother behaviour.

It's a little like sin(x)/x, which has an obvious definition at zero to make it continuous, but the computer does not know this and gives you NaN. That is a reason to wrap it in a function sinc(x) = iszero(x) ? zero(x) : sin(x)/x to help out. But the derivative still goes wrong: abs(ForwardDiff.derivative(sinc, 1e-40)) > 1e20, which we could smooth out with further rules.
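
For reference, a runnable form of that example (renamed mysinc here so it does not clash with Base.sinc):

using ForwardDiff

mysinc(x) = iszero(x) ? zero(x) : sin(x) / x       # patches the value at zero, not the derivative

abs(ForwardDiff.derivative(mysinc, 1e-40)) > 1e20  # true: the derivative still blows up near zero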

@mcabbott
Member

Xref also discussion here: #1036 . It might be possible to regularise all Inf gradients; this is likely to sometimes lead to wrong finite gradients, but perhaps they are acceptable, and perhaps such smoothed gradients would be more useful?
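
As an illustration of what such a regularisation could look like, here is a hedged sketch of a custom ChainRules rule that replaces the Inf slope of sqrt at zero with a finite (zero) subgradient. The function name reg_sqrt and the choice of zero are made up for this example; this is not what Zygote or ChainRules actually ships:

using ChainRulesCore, Zygote

reg_sqrt(x) = sqrt(x)   # hypothetical wrapper with a regularised gradient

function ChainRulesCore.rrule(::typeof(reg_sqrt), x)
    Ω = sqrt(x)
    function reg_sqrt_pullback(ΔΩ)
        # replace the Inf slope at x == 0 with an arbitrary finite subgradient (here: zero)
        slope = iszero(x) ? zero(Ω) : inv(2 * Ω)
        return NoTangent(), slope * ΔΩ
    end
    return Ω, reg_sqrt_pullback
end

Zygote.gradient(reg_sqrt, 0.0)   # (0.0,) instead of (Inf,)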

@JordiBolibar
Contributor Author

That's a good point. To be honest, I'm not sure wrong gradients are better than an error. If no perfect solution is available, the best option might be an informative error: something pointing out where the issue comes from and proposing a workaround (e.g. use ^(1/2) instead of sqrt), like the one I used.

@mcabbott
Member

I agree that an error is often better than a NaN. Looks like this has been discussed a bit:

https://discourse.julialang.org/t/treating-nan-as-error-helping-debugging/36933

JuliaLang/julia#27705

Less ambitiously, something like this could potentially be added only to AD. For instance, inserting a function which is by default check_nan() = false into @scalar_rule would, I think, let you recompile all rules to have a check in them, for debugging.
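
As a much simpler stop-gap for debugging, one could also check gradients after the fact (check_nan_gradient below is purely illustrative and not an existing Zygote API):

using Zygote

function check_nan_gradient(f, args...)
    grads = Zygote.gradient(f, args...)
    for g in grads
        g === nothing && continue                         # skip non-differentiable arguments
        any(isnan, g) && error("NaN gradient for $f at $args")
    end
    return grads
end

check_nan_gradient(x -> x * sqrt(x), 0.0)   # throws instead of silently returning (NaN,)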

@Alexander-Barth

Zygote and PyTorch seem to behave similarly in these cases:

gradient(x -> x * sqrt(x), 0)
# (NaN,)
gradient(x -> x^(1.5), 0)
# (0.0,)

PyTorch:

import torch

x = torch.tensor([0.], requires_grad=True); f = x*torch.sqrt(x); f.backward(); x.grad
# tensor([nan])
x = torch.tensor([0.], requires_grad=True); f = x**(1.5); f.backward(); x.grad
# tensor([0.])

To me (and to my math professors, as far as I remember :-)), √x and x / √x are just two different functions: √x = 0 for x = 0, but x / √x is undefined at x = 0 (they are equal almost everywhere but still different).

Given that the derivative of sqrt is undefined (in the mathematical sense) at x = 0, having NaN as a result seems quite logical to me. I would not expect any symbolic transformation from Zygote (or PyTorch) to lift this pathological case.

@mcabbott
Member

There might be more clever ways; ForwardDiff's NaN-safe mode works around some cases where the simple conclusion would be NaN. Today's discussion here: JuliaDiff/ChainRules.jl#576

@MariusDrulea

The derivative of sqrt(x) is 1/(2*sqrt(x)), so it has to be Inf at 0, since 1/0 returns Inf in Julia. Zygote, ForwardDiff and ReverseDiff are right here. It would be a terrible mistake if these AD tools returned something else.

Possible solutions to avoid Inf gradients for sqrt (both sketched below):

  1. sqrte(x) = sqrt(x+e), where e is a small positive number.
  2. sqrt_(x) = x > e ? sqrt(x) : sqrt(x+e), for the case where you want to keep the exact sqrt behaviour for most x values.
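
A short sketch of both workarounds (the constant's name and its value 1e-12 are arbitrary choices):

using Zygote

const e = 1e-12                           # small positive shift; the value is arbitrary

sqrte(x) = sqrt(x + e)                    # 1. always shift
sqrt_(x) = x > e ? sqrt(x) : sqrt(x + e)  # 2. shift only near zero

Zygote.gradient(sqrte, 0.0)   # finite (≈ 5e5 for this e) instead of Inf
Zygote.gradient(sqrt_, 0.0)   # finite as well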
