Second order for vector to vector functions #206

Open
gdalle opened this issue Apr 24, 2024 · 7 comments
Labels
core Related to the core utilities of the package

Comments

gdalle (Owner) commented Apr 24, 2024

At the moment I don't know if there is wide user interest, but feel free to contribute here.

See initial discussion with @timholy in JuliaDiff/AbstractDifferentiation.jl#134

timholy commented Apr 24, 2024

xref JuliaDiff/ForwardDiff.jl#61 (and quite possibly others, but I haven't searched)

gdalle added the core label (Related to the core utilities of the package) on Apr 24, 2024
timholy commented May 10, 2024

As far as naming goes, perhaps this could still be called a Hessian. It's a little inconsistent that we give the gradient of a vector-valued function a special name, "Jacobian," but that's basically a historical artifact. The right way to think about it is that the Jacobian is not really a matrix: it's a 2-tensor that is both contravariant (for the "vector-valued" portion) and covariant (for the gradient portion). There's really no confusion between these, except that in Julia we handle this by adding an extra dimension and we don't use any encoding to specify which "dimension" is covariant and which is contravariant.

The second-order derivative of a vector-valued function is, of course, a rank-3 tensor with one contravariant and two covariant "dimensions." Again, there doesn't have to be any actual ambiguity about these things if we encode the meaning of the dimensions. Heck, one could even define Hessian-(contravariant)vector products just fine with no ambiguity, because there are only two candidates for the contraction (the two covariant "dimensions") and they are symmetric.
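
As a concrete illustration of that symmetry (a minimal dense sketch using nested ForwardDiff calls, not any operator from this package; f, x, and v are made up):

```julia
using ForwardDiff

f(x) = [x[1]^3 * x[2], exp(x[2]) + x[1]]
x = [1.0, 2.0]
m, n = length(f(x)), length(x)

# Rank-3 second derivative: H[i, j, k] = ∂²fᵢ / ∂xⱼ∂xₖ
H = reshape(ForwardDiff.jacobian(z -> ForwardDiff.jacobian(f, z), x), m, n, n)

v = [0.5, -1.0]
# Contracting v along either covariant "dimension" gives the same result
hvp1 = [sum(H[i, j, k] * v[k] for k in 1:n) for i in 1:m, j in 1:n]
hvp2 = [sum(H[i, j, k] * v[j] for j in 1:n) for i in 1:m, k in 1:n]
hvp1 ≈ hvp2  # true, because ∂²fᵢ/∂xⱼ∂xₖ is symmetric in (j, k)
```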

adrhill (Collaborator) commented May 10, 2024

we give the gradient of a vector-valued function a special name, "Jacobian," but that's basically a historical artifact. The right way to think about it is that the jacobian is not a matrix: it's a 2-tensor that is both contravariant (for the "vector-valued" portion) and covariant (for the gradient portion).

To add to what you are saying, Jacobians are the matrix representation of two linear maps: pushforwards and (transposed) pullbacks, which are at the heart of forward- and reverse-mode AD (and DI).
For $f: \mathcal{M} \rightarrow \mathcal{N}$, the pushforward maps each tangent space of $\mathcal{M}$ onto the corresponding tangent space of $\mathcal{N}$.
The pullback maps each cotangent space of $\mathcal{N}$ onto the corresponding cotangent space of $\mathcal{M}$.

I'm guessing the popularity of Jacobians comes from frameworks like JAX calling their pushforwards Jacobian-vector products and pullbacks vector-Jacobian products, even though they are implemented as matrix-free linear maps. To some degree, this is just a question of semantics.
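
A matrix-free mental model, shown densely here for clarity (a hedged sketch, not DI's API; f, x, and the seed vectors are made up):

```julia
using ForwardDiff

f(x) = [x[1]^2 + x[2], sin(x[2])]
x  = [1.0, 2.0]
dx = [1.0, 0.0]   # tangent (contravariant) seed
dy = [0.0, 1.0]   # cotangent (covariant) seed

J = ForwardDiff.jacobian(f, x)
jvp = J * dx      # pushforward: tangent on M ↦ tangent on N (forward mode)
vjp = J' * dy     # transposed pullback: cotangent on N ↦ cotangent on M (reverse mode)
```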

timholy commented May 25, 2024

@gdalle, are you now satisfied that this isn't "weird"? I could resubmit JuliaDiff/AbstractDifferentiation.jl#134 here.

gdalle (Owner, Author) commented May 25, 2024

I understand the concept, but I've been thinking more about your specific application to optimization with a Lagrangian:

$$L: (x, \lambda) \in \mathbb{R}^n \times \mathbb{R}^m \longmapsto f(x) + \lambda^\top c(x)$$

I'm still unsure which option is more wasteful:

  • computing the joint scalar Hessian $\nabla_{(x,\lambda)}^2 L \in \mathbb{R}^{(n+m) \times (n+m)}$ and keeping only the upper left corner $\nabla_x^2 L \in \mathbb{R}^{n \times n}$ (a rough sketch of this option is given below)
  • computing the vector Hessian of $\phi: x \longmapsto (f(x), c_1(x), \dots, c_m(x))$, let's call it $\overset{\text{vec}}{\nabla^2} \phi \in \mathbb{R}^{(1+m) \times n \times n}$, and then summing along the first dimension

Perhaps @amontoison has a clue; we've been talking about this too.
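
For concreteness, here is a rough sketch of the first option (illustrative only, with a made-up objective, constraint, and point; plain ForwardDiff, not a DI operator):

```julia
using ForwardDiff

f(x) = sum(abs2, x)                          # made-up objective
c(x) = [x[1] * x[2] - 1.0]                   # made-up constraints
n, m = 2, 1

L(z) = f(z[1:n]) + z[n+1:n+m]' * c(z[1:n])   # joint variable z = vcat(x, λ)

z = vcat([1.0, 2.0], [0.5])
H_joint = ForwardDiff.hessian(L, z)          # (n+m) × (n+m) joint Hessian
H_xx = H_joint[1:n, 1:n]                     # ∇²ₓL, the upper-left block
```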

gdalle (Owner, Author) commented May 25, 2024

Either way, if you want the vector Hessian, you can get it very easily as follows:

using DifferentiationInterface
import ForwardDiff

function value_jacobian_and_vector_hessian(f, backend, x)
    y = f(x)
    jac(x) = jacobian(f, backend, x)                  # the Jacobian of f, as a function of x
    J, dJ = value_and_jacobian(jac, backend, x)       # J is m×n; dJ is the (m*n)×n Jacobian of vec(J)
    H = reshape(dJ, length(y), length(x), length(x))  # rank-3 tensor: H[i, j, k] = ∂²fᵢ/∂xⱼ∂xₖ
    return y, J, H
end
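
A hypothetical usage example (the AutoForwardDiff backend type comes from ADTypes.jl, which DI builds on; the function and point are made up):

```julia
using ADTypes: AutoForwardDiff

g(x) = [x[1]^2 * x[2], sin(x[1]) + x[2]^3]
x = [1.0, 2.0]

y, J, H = value_jacobian_and_vector_hessian(g, AutoForwardDiff(), x)
size(H)  # (2, 2, 2), with H[i, j, k] == ∂²gᵢ/∂xⱼ∂xₖ
```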

I'm reluctant to add it to DifferentiationInterface because:

  1. It's very easy to code
  2. It's very expensive to test and document. Our test suite is the true added value of the package, and adding one more operator would require at least 500 LOCs off the top of my head.
  3. No backend natively supports this, so the implementation above is essentially optimal...
  4. ... up to preparation. That is where we could possibly improve performance. Unfortunately, preparation for second-order operators is tricky, and I haven't figured it out fully yet.

Since you've already been bitten by the last item in #252, you can take a look at the other issues discussing it.

timholy commented May 25, 2024

computing the joint scalar Hessian ... and keeping only the upper left corner

This would be viable, but you probably wouldn't want to throw the rest away: Newton's method for constrained optimization uses a first-order expansion of $\nabla_x L$ in the increments $\Delta x$ and $\Delta \lambda$. So the $(x, \lambda)$ cross terms are useful. The $(\lambda, \lambda)$ ones are not, because they are zero (the Lagrangian is linear in $\lambda$).
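
For reference (this is the standard equality-constrained Newton/KKT step, not something specific to this thread), writing $J_c(x)$ for the Jacobian of the constraints, each iteration solves

$$\begin{bmatrix} \nabla_x^2 L(x, \lambda) & J_c(x)^\top \\ J_c(x) & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \end{bmatrix} = - \begin{bmatrix} \nabla_x L(x, \lambda) \\ c(x) \end{bmatrix}$$

so the $(x, x)$ and $(x, \lambda)$ blocks of the joint Hessian are used directly, while the $(\lambda, \lambda)$ block is identically zero.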

That said, there are cases where you might still want the individual Hessians (your second proposal). For unconstrained optimization, trust-region methods are very popular, and they essentially ask "was the quadratic expansion fairly accurate?" You could imagine asking that about each individual constraint as well, in which case you might want to have them disentangled from $L$. But that's a bit more speculative.

Anyway, it's unfortunate that 2nd order is so hard. Presumably Julia is not the only ecosystem grappling with this?
