Dealing with non-array/scalar structured values #8

ararslan · 2019-04-22T19:45:54Z

Consider the case of an eigenvalue decomposition from the eigen function, which produces both eigenvalues and eigenvectors in an Eigen object. Giles provides forward- and reverse-mode sensitivities for the decomposition, which depend on both the eigenvalues and vectors. That begs the question of how this should be expressed in ChainRules.

In a conversation with @jrevels, he said that his vision for this was to use named tuples in cases such as this, which involve structures other than arrays and scalars. However, we weren't able to come to a concrete conclusion on whether it makes more sense for a Rule for e.g. eigen to produce a named tuple upon application of the rule, or if frule/rrule should themselves should produce a named tuple of Rules.

It seems that the author of a rule can reuse computations for the eigenvalues and vectors if applying the rule yields a single named tuple. This would look something along the lines of (untested):

function frule(::typeof(eigen), X::AbstractMatrix)
    E = eigen(X)
    λ, U = E
    n = size(X, 1)
    R = Rule() do ΔX
        Y = U' * ΔX * U
        values = copy(diag(Y))
        @inbounds for j = 1:n, i = 1:n
            if i == j
                Y[i,j] = 0
            else
                Y[i,j] /= λ[j] - λ[i]
            end
        end
        vectors = U * Y
        (values=values, vectors=vectors)
    end
    return E, R
end

However, if frule/rrule returns a named tuple of rules, one can specify accessor functions much more conveniently, something like:

function frule(::typeof(eigvals), X::AbstractMatrix)
    E, R = frule(eigen, X)
    return E.values, R.values
end

To quote Jarrett directly, at what stage do we want the caller to know that they should deal with a named tuple?

The text was updated successfully, but these errors were encountered:

willtebbutt · 2019-04-22T20:51:24Z

I'm also not entirely sure what the correct answer is here.

The other thing we should sort out while making these choices is whether, when dealing with non-Array AbstractArray types (e.g. Diagonal, Symmetric etc), the rules should also be AbstractArray types, named tuples (in whatever way we decide), or whether it's acceptable to have both conventions coexist.

ararslan · 2019-05-02T00:01:51Z

Something else to consider here is objects with "virtual fields," for which in some cases the actual fields may be considered "private." For example, getproperty on a Cholesky permits accessing U or L, which are not actually fields, but are computed as simple functions of the fields factors and uplo. Would the named tuple returned in this case, be it from frule/rrule or from application of a rule, reflect the fields or the properties?

oxinabox · 2019-05-10T18:37:18Z

So my attempt to summarize the discussion right now:
We should return a Rule
that when evaluated return s a NamedTuple,
where that NamedTuple contains for each element a Differentiable (or a Value)

The normal case: (Without NamedTuples)
We have say a function and 3 inputs to a rrule
its returns the value for the function and a tuple containing 3 Rules.
Each for those rules can then be given a a sensitivity for the souronding operation.
And at which point they return a Differentiable (or a value).
(Which could be a Thunk, but Thunks have no inputs, though they do close over the input to the Rule)

Now the new case is if one of the inputs was a Struct.
Now the rrule still returns the same: ie,. the function value and a tuple containing 3 Rules.
But now the struct's rule when given a sentivity return not a single Differentiable,
(or value), but rather a NamedTuple of Differentiables (or values) matching the property (fields) of the struct

This makes sense as at the end of the day a Differntiable is basically a representation of how the input can change.
So it is of semantically the same type as the input.

Coming off this, to make things nice we might want to add add and mul.
for NamedTuples
because it won't be type piracy. (since those are not + and *) so it is probably fine.
And at least in the case when they are the same NamedTuple clear.
At which point they are basically themselves Differentiables.

This also maxises our expressive power,
as logic things can be done in:

the top level rrule function (which can contol all lower levels via closing)
or in the the Rule that returns a named tuple of differnetiables
the individual Differnetiables of the NamedTuple can be Thunks themsevles.

Also we want to do the same for Tuples as for NamedTuple.
NamedTuples stand in for all structs,
Tuple stand in for tuples.

ararslan · 2019-05-10T19:00:44Z

Also note that we might not actually need add and mul for NamedTuples if our goal is accumulation; we can extend accumulate for that.

oxinabox · 2019-08-27T15:55:49Z

@willtebbutt do you remember why we decided on a Rule returning a NamedTuple (of Differentials) rather than a NamedTuple of Rules?

That latter feels better to me now.

willtebbutt · 2019-08-27T16:06:14Z

I think it was probably something to do with a named tuple of differentials being something that can reasonably participate immediately in whatever computations are required downstream, whereas a named tuple of rules would require further unwrapping etc. For example, if you have a heavily nested namedtuple (e.g. corresponding to the adjoint w.r.t. a struct inside a struct inside a struct etc in Zygote) you'll wind up having to repeat a two-step procedure to recursively evaluate everything as required as you work through the nested named tuple, rather than just having every already evaluated.

I think, in short, it's just simpler to have a named tuple of differentials than a named tuple of rules.

oxinabox · 2019-08-27T17:32:02Z

The annoying case is in particular the fact that most function derviative W.R.T self
will be
Rule(i->NamedTuple()) rather than NamedTuple()

Which is byfar the most common case.
Even though it is an edge case.
And for things that do have fields (functors/closures)
then one would need

(a=Rule(i->x), b=Rule(i->x)

Also means all deferring of work now hass to be done via Thunk rather than Rule
but that is much less of a problem and is probably better.

I guess I will define const nil_structure_rule=Rule(i->NamedTuple())

oxinabox · 2019-09-11T21:46:18Z

So in the post #30 world we do not have Rules anymore so everything we do in this space is more elegant. At least in terms of having resolves fully the question of Rule returning named tuple, vs named tuple of Rules.
It is now clearly pullback/pushforward returning a namedtuple.

Right now for svd, our only example of returning a NamedTuple,
it is quiet limitted:

does not support use with any operation other than accumlate! (not even accumulate)
Can't contain AbstractDifferentials (see Audit use of zeros vs Zero ChainRules.jl#102 (comment))

Also note that we might not actually need add and mul for NamedTuples if our goal is accumulation; we can extend accumulate for that.

Ever since #16 we don't have add and mul, just + and *
so overloading those is not an option without terrible type-piracy.
So we would have to create out own-types,
Say DNamedTuple and DTuple
but they would only need to support the differential operations,
which seems reasonable really.
They would just apply them recursively.
(recursive extern was already bought up in #47 (comment) about Arrays)

To go with that, we should I think add getindex and getproperty overloads to all abstract differentials. getindex(::One, ::Any) = One() etc
so that if something was expecting a DNamedTuple but that got multiplied away by Zero()*DNamedTuple(...)=Zero() it can still have its fields looked up in accumulate.

Also we should have some macro or function to convert NamedTuples and Tuples to the Differiential type them.
Maybe we can reuse use the differential function for that.

willtebbutt · 2019-09-12T08:33:23Z

Say DNamedTuple and DTuple but they would only need to support the differential operations, which seems reasonable really.

This seems reasonable to me. It is kind of sad that we don't have add and mul for precisely these kinds of situations though. If we start encountering lots of types that don't have + and * defined on them in precisely the manner that we need for ChainRules, we're going to have to start creating a lot of wrapper types. Or am I over-estimating the problem?

oxinabox · 2019-09-12T08:55:53Z

I don't think we will encounter more (touch wood),
because addition and multiplication have a mathematical definition that we want to follow.

There is an interesting case around things like DateTime. If you say have a function of DateTime, like say the value of the pound stirling, the gradient (fromn the pullback) will be in terms of a Period like milliseconds. You can't add/subtract DateTimes, but you can add/subtract a Period from a DateTime.

oxinabox · 2019-09-21T21:22:13Z

The DNamedTuple might want to take in some extra information about what type it was supposed to be (like what struct it is based on),
so that when Externing it, it can maybe get back into that struct?

oxinabox · 2019-10-17T20:42:23Z

@willtebbutt and I have concluded That we only need 1 composite type.
and it is Composite and it does both NamedTuples and Tuples.
(for tuples it has numeric keys).

Also @willtebbutt is keen on the idea of it being parameterized byt the original object type,
e.g. SVD

oxinabox · 2019-10-18T09:38:56Z

I am thinking
We have 3 kinds of objects*: (these do not correspond to julia types)
primal values, and their paired differential,
and scale-factors.
The core rule is that for
s a scale-factor
v a primal value of kind V
d a differential (of kind V')
Then v + sd must be a primal value in V
In the simple case:
scalar factors are scalars, subtyping Real,
primals and differentials are matrixes of the same size.
Zero() is both a scale factor and a differential for All Kinds of primals.
One() is a scalar factor and a differential for scalar primals only.
The main reason to care about One() when acting s a differential is for doing computation of partial derivatives via forward, or for the initial seed in reverse when you have gone down to a single value like a neural network loss.
So One is not like LinearAlgebra.I because when you multiply a matrix by One(), that matrix is a differential, d in core formula above,
and the core formula only has scalefactor multiplication with a differential.
So it is scalar in that case.
We don’t need to multiply two differentials by each other.
Core formula could be wrong though
Thunk is a differential that is equivelent to the differential being thunked.
(if you thunk something else then it is invalid)
Think of this like the deletation pattern, Wrapped{T<:AbstractT} <:AbstractT
*(plus and maybe differentials that escape their pairings like Wirtinger, but that hopefully will come home)

The interesting one in this framework is the Composite
which is the named-tuple like differential that matches to all stucts.
Because it is nontrival to make the addition of it to its primal (which could be any struct) work
because constructing primals its hard.
But we can do the thing that mostly works and fails when someone makes a primal that can’t be constructed,
and we can make it easy to overload that

oxinabox · 2019-10-18T18:52:30Z

Turns out one of the cases we need this for is sum(::Tuple)
because the pullback will return a Tuple.
but you can't add Tuples,
So it should be returnings Composite

oxinabox · 2019-10-18T18:55:22Z

I think we do want refine_differential(xs) to be a thing,
for when you are not sure on the type you are producing.
E.g. the naive implementation of sum already almost produces a Tuple or an Array
depending on the type of the input.
But actually if it is an Array then we can leave it as is,
but if it is a Tuple we should convert it the a Composite.

Master behavior ```julia julia> @scalar_rule(one(x), Zero()) julia> frule(one, 1, Zero(), [1, 2]) (1, Zero()) julia> frule(one, 1, Zero(), One()) (1, Zero()) ``` Desirable behavior ```julia julia> @scalar_rule(one(x), Zero()) julia> frule(one, 1, Zero(), [1, 2]) (1, [0, 0]) julia> frule(one, 1, Zero(), One()) (1, Thunk(var"#8#10"()) ) ```

* Eagerly evaluate scalers rules Master behavior ```julia julia> @scalar_rule(one(x), Zero()) julia> frule(one, 1, Zero(), [1, 2]) (1, Zero()) julia> frule(one, 1, Zero(), One()) (1, Zero()) ``` Desirable behavior ```julia julia> @scalar_rule(one(x), Zero()) julia> frule(one, 1, Zero(), [1, 2]) (1, [0, 0]) julia> frule(one, 1, Zero(), One()) (1, Thunk(var"#8#10"()) ) ``` * New release * Add tests * Revert "Eagerly evaluate scalers rules" This reverts commit dbe7765. * Redefine * between ::Zero and ::Any * Make it nicer * Add tests and move zero(::AbstractDifferential) to the right folder

oxinabox · 2020-12-15T17:23:18Z

closed by #59

oxinabox transferred this issue from JuliaDiff/ChainRules.jl Aug 2, 2019

tkf mentioned this issue Sep 3, 2019

Slow backward pass when the forward pass touches a large array FluxML/Zygote.jl#323

Open

tkf mentioned this issue Sep 11, 2019

Overhaul Rules #30

Merged

This was referenced Sep 17, 2019

Some thoughts on Wirtinger #40

Open

The Extensibility Problem, for propagator closures #53

Closed

oxinabox mentioned this issue Oct 18, 2019

Get rid of extern? / Add unthunk, use extern only for testing? #56

Closed

oxinabox mentioned this issue Oct 18, 2019

Don't allow sum of anything in rrule JuliaDiff/ChainRules.jl#124

Merged

This was referenced Oct 29, 2019

Introduce Composite for structured deriviatives #59

Merged

Rules for functions / structs etc #4

Closed

Differentiating with respect to a function #22

Closed

nickrobinson251 added the Structural Tangent Related to the `Tangent` type for structured (composite) values label Jan 25, 2020

oxinabox closed this as completed Dec 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dealing with non-array/scalar structured values #8

Dealing with non-array/scalar structured values #8

ararslan commented Apr 22, 2019

willtebbutt commented Apr 22, 2019

ararslan commented May 2, 2019 •

edited

Loading

oxinabox commented May 10, 2019 •

edited

Loading

ararslan commented May 10, 2019

oxinabox commented Aug 27, 2019 •

edited

Loading

willtebbutt commented Aug 27, 2019 •

edited

Loading

oxinabox commented Aug 27, 2019

oxinabox commented Sep 11, 2019

willtebbutt commented Sep 12, 2019

oxinabox commented Sep 12, 2019

oxinabox commented Sep 21, 2019

oxinabox commented Oct 17, 2019

oxinabox commented Oct 18, 2019

oxinabox commented Oct 18, 2019

oxinabox commented Oct 18, 2019

oxinabox commented Dec 15, 2020

Dealing with non-array/scalar structured values #8

Dealing with non-array/scalar structured values #8

Comments

ararslan commented Apr 22, 2019

willtebbutt commented Apr 22, 2019

ararslan commented May 2, 2019 • edited Loading

oxinabox commented May 10, 2019 • edited Loading

ararslan commented May 10, 2019

oxinabox commented Aug 27, 2019 • edited Loading

willtebbutt commented Aug 27, 2019 • edited Loading

oxinabox commented Aug 27, 2019

oxinabox commented Sep 11, 2019

willtebbutt commented Sep 12, 2019

oxinabox commented Sep 12, 2019

oxinabox commented Sep 21, 2019

oxinabox commented Oct 17, 2019

oxinabox commented Oct 18, 2019

oxinabox commented Oct 18, 2019

oxinabox commented Oct 18, 2019

oxinabox commented Dec 15, 2020

ararslan commented May 2, 2019 •

edited

Loading

oxinabox commented May 10, 2019 •

edited

Loading

oxinabox commented Aug 27, 2019 •

edited

Loading

willtebbutt commented Aug 27, 2019 •

edited

Loading