Incorrect gradient when using Flux.params #1251

Closed
raphaelchinchilla opened this issue Jun 24, 2022 · 5 comments
Labels
implicit using Params, Grads

Comments

@raphaelchinchilla

I have encountered some weird behavior when taking gradients using Flux.params. Here is a minimal reproducible example:

using Flux
using Statistics: mean

# Toy model parameterised by ω₀ and W, applied to a 2×N batch x
function Fp(x, p)
    ω₀, W = p
    [x[1:1, :]; -ω₀ * x[2:2, :]] + W * x
end

function get_derivatives(x, p̂ᵢₙᵢₜ)
    p̂ = deepcopy(p̂ᵢₙᵢₜ)
    nu = ones(size(x))
    ps = Flux.params(p̂)

    # Gradient with implicit parameters
    g_ps = Flux.gradient(ps) do
        mean(abs2, p̂[2]*x) + 10*mean(abs2, Fp(x, p̂)) + mean(nu .* Fp(x, p̂))
    end

    # Gradient with explicit parameters, same objective
    g_p̂ = Flux.gradient(p̂) do p̂
        mean(abs2, p̂[2]*x) + 10*mean(abs2, Fp(x, p̂)) + mean(nu .* Fp(x, p̂))
    end
    return g_ps, g_p̂

    # Unreachable, but kept on purpose: removing it changes which variables the
    # implicit-params closure captures (see the discussion of g_ps.grads below)
    return p̂, nu
end

p̂ᵢₙᵢₜ = [[1.0], [2. 3.; 4. 5.]]
x = randn(2, 10)
g_ps, g_p̂ = get_derivatives(x, p̂ᵢₙᵢₜ)

If I query each of the derivatives, I obtain

julia> g_ps.grads
IdDict{Any, Any} with 4 entries:
  [2.0 3.0; 4.0 5.0]                             => [40.038 26.2856; 53.326 35.0164]
  Box([1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0])    => RefValue{Any}((contents = [0.117313 0.00144669 … -0.166186 0.442534; 0.156417 0.00192892 … -0.221581 0.590045],))
  [1.0]                                          => [-35.0164]
  Box(Array{Float64}[[1.0], [2.0 3.0; 4.0 5.0]]) => RefValue{Any}((contents = Union{Nothing, Matrix{Float64}}[nothing, [2.54043 2.73642; 5.19804 4.48258]],))

and

 julia> g_p̂
(Array{Float64}[[-35.01637148376182], [42.578438152955385 29.02198799901979; 58.52408623552277 39.49894829943962]],)

I verified with Symbolics.jl, and the correct derivative is the one given by g_p̂. So it is obvious with this simple example that Zygote is ignoring the term mean(abs2,p̂[2]*x), presumably because the parameter is only reached through the indexing expression p̂[2]. What is not obvious to me is why the rest of the derivative appears in

  Box(Array{Float64}[[1.0], [2.0 3.0; 4.0 5.0]]) => RefValue{Any}((contents = Union{Nothing, Matrix{Float64}}[nothing, [2.54043 2.73642; 5.19804 4.48258]],))

(by "the rest" I mean that adding the two entries together gives the correct result), and why a derivative with respect to nu is being taken at all. If I remove return p̂,nu, the two related entries in g_ps.grads disappear.

My issue:

  • While it is obvious with this MRE that I get the wrong gradient because of p̂[2], it took me two days to figure it out. Maybe it is not a bug, and to a developer it is obvious that this should not work, but it would be great if Zygote could either emit a warning or (better yet) handle this case correctly.

  • It would also be great if someone could explain why Zygote is taking derivatives with respect to nu even though it is not a parameter, and even better if there is a way to avoid it. In this example nu is not used for anything, but in my real code I use it for something else and need to return its value.

@ToucheSir ToucheSir transferred this issue from FluxML/Flux.jl Jun 25, 2022
@ToucheSir
Member

Dupe of #1232.

@ToucheSir ToucheSir closed this as not planned Jun 25, 2022
@raphaelchinchilla
Author

@ToucheSir I had not seen #1232, which addresses my first question; I apologize for that.

However, my second question refers to what seems to me to be another bug: Zygote is taking derivatives of more things than it is asked to, which (I imagine) hurts its performance.

@ToucheSir
Member

ToucheSir commented Jun 27, 2022

When you use implicit params, Zygote will also try to be smart about caching gradients of variables captured in the callback passed to gradient. This is yet another reason we recommend using "explicit" params wherever possible and avoiding implicit ones like the plague.
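
As a minimal sketch of that contrast (assuming Fp, x, nu, and p̂ are in scope, defined as in the example above), the same objective can be written as a plain function and differentiated with explicit parameters only; nothing from the enclosing scope is tracked, so neither nu nor the closure boxes show up in the result:

# Loss as a plain function of explicit arguments; reuses Fp from the MRE above.
loss(p, x, nu) = mean(abs2, p[2]*x) + 10*mean(abs2, Fp(x, p)) + mean(nu .* Fp(x, p))

# Gradient w.r.t. the explicit argument p̂ only; x and nu are just data here.
g, = Flux.gradient(p -> loss(p, x, nu), p̂)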

@raphaelchinchilla
Author

Thanks!

Quick question: if I have a neural network (created using Flux's APIs like Dense and Chain), how can I take derivatives using explicit parameters? In my (real and more complicated) problem I have to take derivatives of both regular functions and neural networks.

I was going to take the gradients twice, once with explicit parameters and once with implicit parameters for the neural network, but it would be great if there were another way to do this.

@ToucheSir
Member

You can differentiate through Flux models the same way you would any other callable struct. The main difference is that Flux's current built-in optimisers won't be able to handle the gradients Zygote spits out for explicit parameters. Instead, use https://github.com/FluxML/Optimisers.jl, and see specifically this section of the docs for an example of Flux integration. Using Optimisers.jl will also future-proof your code, as it will replace the current Flux optimisers as the default in the next major Flux version.
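
A rough sketch of that workflow (the model, data, and learning rate below are made up for illustration, not taken from this issue):

using Flux, Optimisers
using Statistics: mean

# Toy model and data, purely illustrative
model = Chain(Dense(2 => 16, relu), Dense(16 => 1))
x, y = randn(Float32, 2, 10), randn(Float32, 1, 10)

# Explicit-parameter gradient: differentiate with respect to the model struct itself
grads, = Flux.gradient(m -> mean(abs2, m(x) .- y), model)

# Optimisers.jl keeps the optimiser state alongside (not inside) the model
state = Optimisers.setup(Optimisers.Adam(0.001), model)
state, model = Optimisers.update(state, model, grads)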

@mcabbott mcabbott added the implicit using Params, Grads label Jul 4, 2022