Gradients for prod, cumsum, cumprod #524

Closed

Conversation

mcabbott
Member

This is a new version of #334, aiming to solve these problems:

  • prod(xs) has a gradient defined as prod(xs) ./ xs .* Δ, which fails when any x is zero (see the small example after this list).
  • prod(x, dims=1), cumprod(x) and cumsum(x) all return an Array{<:TrackedReal}.
  • prod(x, 1), in Julia 0.6 notation, has a custom gradient with various problems.
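
To illustrate the first point (a small example, not code from the PR), a single zero entry already produces NaN from the naive rule:

julia> xs = [1.0, 0.0, 3.0];

julia> prod(xs) ./ xs          # naive rule; the correct gradient is [0.0, 3.0, 0.0]
3-element Array{Float64,1}:
   0.0
 NaN
   0.0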

This custom gradient, defined using circshift, is I believe correct for the case prod(x); a rough sketch of the idea is shown below. In #334 I wrote a variant for the case prod(x, dims=1) which depends on dims. However, both are very slow, and crash Julia if called on arrays with thousands of elements.
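
For reference, the circshift idea (a minimal sketch for a vector, not the PR's literal code) is that the gradient of prod(x) with respect to x[i] is the product of every other entry, which handles zeros exactly:

function ∇prod_circshift(x::AbstractVector, Δ = 1)
  n = length(x)
  # circshift(x, -i) puts x[i] last; dropping it leaves the other n-1 entries
  [prod(circshift(x, -i)[1:n-1]) for i in 1:n] .* Δ
end

Building n shifted copies and taking n products is what makes this O(n²) and slow on large arrays.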

Compared to the previous attempt:

  • The function ∇prod correctly treats cases where the product is zero due to rounding, even though no x[i] was zero. It is much faster in the case where exactly one x[i] is zero, thanks to an idea from @Drvi.
  • This is now applied using the built-in mapslices, which is a bit slow but will eventually improve.
  • ∇cumprod needs to be applied with a variadic mapslices, here implemented with some ntuple-ing. I fixed a stupid bug in this, so now all tests pass, and made it more compact.
  • For more generic arrays, the fallback is to call ForwardDiff.gradient instead of circshift-based functions (a sketch follows this list).
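
For one slice, that fallback might look roughly like this (a sketch only; it assumes Δ is the scalar sensitivity for that slice's product, and the PR's actual wrapper may differ):

using ForwardDiff

# forward mode gives the whole gradient of prod in one sweep, regardless of how many zeros appear
∇prod_forward(x::AbstractVector, Δ = 1) = ForwardDiff.gradient(prod, x) .* Δ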

ForwardDiff does work on CuArrays, but slowly via the CPU I think, for now. However, this is only called by ∇prod when exactly one entry is zero, so other cases should be fast. (Ideally ∇prod_one and ∇cumprod could be written with CUDAnative, perhaps, but I'm not sure where they would live.) The gradient for cumsum also goes via the CPU I think; because reverse(cu(rand(2,2)), dims=1) is an error, I wrote a mapslices workaround for that case.

A variadic mapslices (JuliaLang/julia#23306) would replace my ∇cumprod_d function. You could also write this using Julia 1.1's eachslice() (JuliaLang/julia#29749), since cumprod only works for dims::Int. But prod accepts dims::Tuple, so that is no help there.
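
A matrix-only sketch of the eachslice route (this assumes a per-column rule ∇cumprod(xcol, Δcol) like the one in this PR; for cumprod(x, dims=1) each slice is a column):

function ∇cumprod_eachslice(x::AbstractMatrix, Δ::AbstractMatrix)
  out = similar(x)
  # eachslice(..., dims=2) iterates over columns, which is what cumprod(x, dims=1) acts along
  for (j, (xc, Δc)) in enumerate(zip(eachslice(x; dims=2), eachslice(Δ; dims=2)))
    out[:, j] .= ∇cumprod(xc, Δc)   # ∇cumprod is assumed to be the PR's vector rule
  end
  return out
end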

@MikeInnes
Member

Thanks a lot for the patch here. Some of these seem a bit strange to me though, e.g. taking a jacobian through forward diff seems like it'd be slower than just tracing operations with reverse mode (and a lot of this definitely won't be GPU compatible, which isn't a deal breaker but worth considering).

It might be worth splitting some of this out into smaller pieces and doing one thing at a time; this would make it easier to review the pieces.

@mcabbott
Member Author

mcabbott commented Feb 6, 2019

Thanks for taking a look. Let's just think about prod to start -- I'm sorry about the mess, it would indeed be nice to simplify.

When nothing is zero, Δ .* prod(xs, dims=1) ./ xs would be ideal: fast and generic. If there are many zeros (along the product direction) then there is little point in attempting gradient descent. But, in between, I have cases where each x ∈ [0,1] and some columns will contain a zero. I wrote a function for the case of exactly one zero, which mapslices applies to each column when necessary; roughly it works like the sketch below.
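
Something in this spirit (a sketch with a hypothetical name, not the PR's exact ∇prod_one): only the zero entry gets a nonzero gradient, namely the product of all the other entries:

function ∇prod_one_zero(x::AbstractVector, Δ = 1)
  i = findfirst(iszero, x)       # assumes exactly one entry of x is zero
  ∇ = zero(x)
  ∇[i] = prod(x[1:i-1]) * prod(x[i+1:end]) * Δ   # product of every other entry
  return ∇
end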

My computer with a GPU isn't answering the phone right now, but I think my idea was to handle the case of no zeros at least, and, if CuArrays.allowscalar(true) is set, to have a fallback in case you randomly hit one.

Would "tracing operations with reverse mode" mean using TrackedReal? Perhaps this could be used as a fall-back option (i.e. when you hit a zero), if it will work on the GPU. I'm not sure I tried that.

Right now, using TrackedReal for everything seems very slow:

julia> using Flux, ForwardDiff

julia> r = rand(10,100);

julia> prod(param(r), dims=1)
1×100 Array{Flux.Tracker.TrackedReal{Float64},2}:
 6.49254e-7  3.12325e-5  4.41833e-9  0.000188122  …  6.07351e-6  1.42578e-5  0.00236577

julia> @btime Flux.gradient(x -> sum(sin, prod(x, dims=1)), $r)
  1.413 ms (15013 allocations: 8.16 MiB)

and with PR:

julia> @btime Flux.gradient(x -> sum(sin, prod(x, dims=1)), $r)
  8.160 μs (51 allocations: 29.41 KiB)

The other possible generic fallback is the circshift-based thing. There is one in the current source which I believe is correct for prod(x) with no dims. I had problems with it crashing above some size, about length = 400. And it's slow -- for prod(rand(50)), circshift takes 2 ms, TrackedReal 12 μs, ForwardDiff 1.5 μs, and simple division 80 ns.

Finally, I haven't thought about second derivatives, perhaps I should.

@MikeInnes
Member

No problem, looking forward to getting some of this stuff in.

I suggest just doing the simplest gradient for now even if it's not so good with zeros. Then we can post test cases for anything that still isn't ideal, and discuss what would make a good solution for that case.

@mcabbott
Member Author

mcabbott commented Feb 8, 2019

OK, if you like I'll make a few-line PR which just uses Δ .* prod ./ xs for everything, as a start.

I haven't made any progress, but have a question: How do I explicitly tell Flux to use TrackedReal for something? (To try using this within the function which gets mapslices-ed.)

@MikeInnes
Member

You shouldn't generally need to; if there's no gradient it'll just fall back to the scalar version. But you can also use map(identity, xs) to get an array of tracked reals.
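
For example (just illustrating the suggestion, with Flux's Tracker):

using Flux

xs = param(rand(10, 100))   # TrackedArray: uses whole-array gradient definitions where they exist
ys = map(identity, xs)      # Array of TrackedReal: each element is tracked separately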

@mcabbott
Member Author

I took another look at this story, and collected the steps (and benchmarks) here:

https://gist.github.com/mcabbott/ecb9a7756c0530e8fae0ef444761ffcd

I would quite like prod(xs; dims=2) to handle xs containing zeros correctly, but if we want this, then I still don't see a simpler approach than this mapslices(∇prod,...) thing. This case is necessarily slower, but by checking for zeros first, the case of all-nonzero xs can be almost as fast as the naive Δ .* prod ./ xs gradient.

I don't however see an elegant way to do that for CuArrays. But what occurs to me today is that perhaps it would be OK to give up on handling zeros correctly there, and just dispatch to Δ .* prod ./ xs. This won't silently give you wrong answers; you should get NaN to warn you. Might that be acceptable?

mcabbott added a commit to mcabbott/Tracker.jl that referenced this pull request Mar 10, 2019
Will not treat zeros correctly, see FluxML/Flux.jl#524
@mcabbott
Member Author

I thought of a more generic way to compute the prod gradient, allowing zeros. Instead of calling circshift, you can create something similar by reshaping x .* ones' to have length(x)-1 rows... the simplest version looks like this:

function ∇prod_one(x, Δ)
  n = length(x) - 1
  # lay out n scaled copies of x so that each column of m contains all entries of x except one
  m = reshape(vec(x) .* trues(n)' .* Δ, (n,:))
  # the product down each column is then the gradient for the omitted entry, in reverse order
  v = reverse(vec(prod(m, dims=1)))
  reshape(v, size(x))
end

This is only 10x slower than directly indexing, instead of 200x. I've added this to the end of the above-linked gist. It's not done yet, partly because reverse doesn't seem to exist for CuArrays. But that's the news.
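
A quick sanity check of that sketch against ForwardDiff (illustrative only, CPU, with the ∇prod_one above in scope):

using ForwardDiff

x = [0.3, 0.0, 1.7, 0.9]                            # exactly one zero
∇prod_one(x, 1.0) ≈ ForwardDiff.gradient(prod, x)   # true: the leave-one-out products agree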

@jburroni

jburroni commented Mar 19, 2019

@mcabbott Is it true that p is always a Real (p::Real) in ∇prod? If that is the case, you could write this:

function ∇prod(x, p::Real=prod(x), Δ=1)
  if !iszero(p)
    ∇ = p ./ x .* Δ
  elseif count(iszero, x) > 1
    ∇ = zero(x)
  else
    ∇ = ∇prod_one(x, Δ)
  end
end

@mcabbott
Member Author

I ran into a subtle bug with that (with !iszero(p) as the first test): if x contains several very small numbers, then the product can be zero without any individual zeros, due to floating-point rounding. And then findfirst in the PR's ∇prod_one returns nothing and it fails.

The current version returns zero(x) in this case. Maybe ideally one could treat it better. The new idea for ∇prod_one with reshape(..., (length(x)-1,:)) may be better, in fact.

@jburroni

Good catch (the floating-point rounding)!
I do still believe that trying to short-circuit the presumably most common case, a product different from zero, is important.

function ∇prod(x, p::Real=prod(x), Δ=1)
  !iszero(p) && return ∇ = p ./ x .* Δ
  numzero = count(iszero, x)
  if numzero == 0
    ∇ = p ./ x .* Δ
  elseif numzero > 1
    ∇ = zero(x)
  else
    ∇ = ∇prod_one(x, Δ)
  end
end

@mcabbott
Member Author

mcabbott commented Mar 19, 2019

I don't have all the numbers in my head, but when I timed things I think count(iszero, x) turned out not to matter much, a few percent of the quickest gradient. But I could be wrong. Note that you could combine the second and third cases here into numzero != 1, since p == 0 implies that p ./ x .* Δ == zero(x); a sketch of that follows.
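
That combined version might look something like this (a sketch, assuming a one-zero helper such as ∇prod_one above):

function ∇prod(x, p::Real = prod(x), Δ = 1)
  !iszero(p) && return p ./ x .* Δ    # common case: no zeros (up to rounding)
  # p == 0 here; with no exact zeros, p ./ x .* Δ is zero(x) anyway, so only one zero needs work
  count(iszero, x) == 1 ? ∇prod_one(x, Δ) : zero(x)
end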

This whole PR seems to be about trade-offs between complication and speed -- the fastest variant involved writing my own mapslices (worth a factor of 2) but that started to sound like too much complication.

Another concern worth some thought is whether this can be made correct for second derivatives. Right now the PR has a nobacksies which explicitly prevents this, and the logic of the 3-option ∇prod is I think only for first derivatives. But p ./ x .* Δ should be correct (when there are no zeros), and this new idea, reshape(..., (length(x)-1,:)), is perhaps also correct?

Also, somehow I must never have checked this, but mapslices also turns out not to exist for CuArrays:

julia> CuArrays.allowscalar(false)
julia> mapslices(sum, rand(2,3) |> cu, dims=1)
ERROR: scalar setindex! is disallowed

So I don't really see a way to do the right thing for CuArrays. eachslice works, but not for dims=(2,3). And since CuArrays might not be loaded, you can't even directly dispatch on its type. Perhaps just accept that the case of a CuArray containing zeros is going to use scalar indexing, and will be very slow, until CuArrays learns to understand mapslices and reverse.

@MikeInnes
Member

Can this be closed in favour of FluxML/Tracker.jl#1? Seems like that doesn't have everything here.

@mcabbott
Member Author

mcabbott commented Apr 5, 2019

I guess this is a discussion thread now, not aiming to be merged.

The Tracker PR was the simplest prod case, as suggested; I can pull out cumsum equally simply.

For prod, the tl;dr version is that I still think we ought to treat zero entries correctly. I don't see a great way to do this that includes CuArrays; on the CPU there are several options (trading speed vs complication) which I can tidy up if that would help.

bors bot added a commit to FluxML/Zygote.jl that referenced this pull request Feb 26, 2020
112: Simplest prod(x; dims) gradient r=dhairyagandhi96 a=mcabbott

The current gradient for `prod(x; dims)` gives incorrect results; this PR fixes it (parallel to FluxML/Tracker.jl#1):
```
julia> using Zygote, ForwardDiff

julia> r = rand(2,3,2);

julia> ForwardDiff.gradient(w->sum(prod(w, dims=(2,3))), r)
2×3×2 Array{Float64,3}:
[:, :, 1] =
 0.00131643  0.000954347  0.0051387 
 0.0177437   0.0354628    0.00934587

[:, :, 2] =
 0.00434307  0.0140455   0.00152818
 0.0151417   0.00464615  0.00451601

julia> Zygote.gradient(w->sum(prod(w, dims=(2,3))), r)[1] # wrong answer!
2×3×2 Array{Float64,3}:
[:, :, 1] =
 5.93867e-6  4.30525e-6  2.31817e-5
 1.60301e-5  3.2038e-5   8.44331e-6

[:, :, 2] =
 1.95925e-5  6.33622e-5  6.89391e-6
 1.36795e-5  4.19746e-6  4.07989e-6

julia> Zygote.@adjoint function prod(xs; dims = :) # as in this PR
         p = prod(xs; dims = dims)
         p, Δ -> (p ./ xs .* Δ,)
       end

julia> Zygote.refresh()

julia> Zygote.gradient(w->sum(prod(w, dims=(2,3))), r)[1] # now matches ForwardDiff
2×3×2 Array{Float64,3}:
[:, :, 1] =
 0.00131643  0.000954347  0.0051387 
 0.0177437   0.0354628    0.00934587

[:, :, 2] =
 0.00434307  0.0140455   0.00152818
 0.0151417   0.00464615  0.00451601
```
This does not handle zeros in the array correctly -- see FluxML/Flux.jl#524 for attempts to do that. The `circshift(...` operation deleted here was a correct (but slow) gradient for `prod(x)`, but is clearly independent of `dims`. 

The example above is almost the same as the one in the tests, which strangely passes even without this PR. Perhaps something is wrong with `gradtest`?
```
julia> @test gradtest(x -> prod(x, dims = (2, 3)), (3,4,5))
Test Passed

julia> @test gradtest(x -> prod(x), (3,4,5))
Test Passed
```

Co-authored-by: Michael Abbott <me@pseudomac>
mcabbott added a commit to mcabbott/Tracker.jl that referenced this pull request Aug 10, 2020
Will not treat zeros correctly, see FluxML/Flux.jl#524
@CarloLucibello
Member

If there are any missing gradients, they should be added to ChainRules
