
Replace unrolled foldl used to evaluate Chain with a better one #1809

Merged · 8 commits · Feb 5, 2022

Conversation

@mcabbott (Member) commented Dec 13, 2021

This uses Base's foldl to feed data through a Chain, instead of a hand-written version. The reason to do so is that it greatly reduced time-to-first-gradient in FluxML/Zygote.jl#1126. It's possible that Flux ought to add a compat bound on ChainRules for this, or at least I should check whether Zygote's bound is high enough to guarantee it will work. But first, let's see whether CI likes this.

The latest version does not call either foldl or Base.afoldl; instead it replaces one hand-written version with another, as this seems to perform better.
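For context, a minimal sketch of the foldl style, using a hypothetical MyChain stand-in rather than Flux's own Chain (illustrative only, not the code merged here):

using Flux  # Dense, relu for the usage example below

# Hypothetical stand-in for Flux.Chain, just to illustrate the folding idea:
struct MyChain{T<:Tuple}
    layers::T
end

# foldl-based evaluation: thread x through each layer in turn,
# folding over the tuple (x, layers...) so no init keyword is needed.
(c::MyChain)(x) = foldl((y, f) -> f(y), (x, c.layers...))

m = MyChain((Dense(2, 3, relu), Dense(3, 1)))
m(rand(Float32, 2))  # 1-element Vector{Float32}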

@ToucheSir (Member) commented Dec 14, 2021

Would there be any value in defining a foldl wrapper or foldl-like function that has init as a positional arg? That would allow for creating an equivalent (possibly simplified) version of https://github.com/JuliaDiff/ChainRules.jl/blob/a75193768775975fac5578c89d1e5f50d7f358c2/src/rulesets/Base/mapreduce.jl#L342-L377.

(the above need not hold up this PR, just thinking about next steps and how to bring something like VChain into Flux)
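For concreteness, a minimal sketch of such a wrapper (the foldl_init name and the implementation are hypothetical, not part of Flux or ChainRules):

# Hypothetical helper: a foldl-like function with init as a positional argument,
# so an rrule could dispatch on it without the keyword-argument limitation.
function foldl_init(op, init, itr)
    acc = init
    for x in itr
        acc = op(acc, x)
    end
    return acc
end

foldl_init(|>, [1.0, 2.0], (x -> 2 .* x, sum))  # == 6.0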

@mcabbott (Member Author):

The obvious candidate there is Base.mapfoldl_impl(f, op, init, itr), which is two hops down from foldl. I've thought about it but haven't quite got there yet.

Agree the unstable path would be nice to have, too. This is just the minimum-effort PR to get started.
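(For orientation, the relevant call chain looks roughly like this; these are Base internals, so the details may shift between Julia versions:)

# foldl(op, itr; init)                                 # public entry point
#   -> mapfoldl(identity, op, itr; init)               # one hop down
#     -> Base.mapfoldl_impl(identity, op, init, itr)   # two hops down; init is positional here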

@CarloLucibello (Member):

Should we have a second-derivative test, since you expressed concerns about that in FluxML/Zygote.jl#1126?

@CarloLucibello added this to the v0.13 milestone on Dec 14, 2021.
@ToucheSir (Member) commented Dec 14, 2021 via email

@@ -43,11 +43,18 @@ end

functor(::Type{<:Chain}, c) = c.layers, ls -> Chain(ls...)

function (c::Chain)(x)
  if order() < 2
    foldl((y,f) -> f(y), (x, c.layers...))
Member:
Would it be too cheeky to do

Suggested change:
- foldl((y,f) -> f(y), (x, c.layers...))
+ Base.afoldl(|>, x, c.layers...)

ref. https://github.com/JuliaLang/julia/blob/dba8f0344ab6f1a63ea4f7928610feb35f014d4f/base/operators.jl#L592

@mcabbott (Member Author):

This was slow when I tried it :( although I did not think of |>

@mcabbott (Member Author):

Tidied up, here are times of some options on some very small networks:

julia> using Flux

julia> m1 = Chain(x->vcat(x,x), Dense(2,2,relu));

julia> m2 = Chain(Dense(20,30,relu), Dense(30,10), softmax);

# Tagged version

julia> @btime gradient(sum$m1, 3.14)
  16.584 μs (253 allocations: 9.52 KiB)
(0.33772213757038116,)

julia> @btime gradient(sum$m2, rand(20,20));
  42.792 μs (353 allocations: 68.00 KiB)

# foldl

julia> @eval Flux (c::Chain)(x) = foldl(|>, (x, c.layers...));

julia> Flux.Zygote.refresh()

julia> @btime gradient(sum$m1, 3.14)
  18.625 μs (277 allocations: 11.62 KiB)
(0.33772213757038116,)

julia> @btime gradient(sum$m2, rand(20,20));
  45.500 μs (385 allocations: 72.03 KiB)

# afoldl

julia> @eval Flux (c::Chain)(x) = Base.afoldl(|>, x, c.layers...);

julia> Flux.Zygote.refresh()

julia> @btime gradient(sum$m1, 3.14)
  84.167 μs (496 allocations: 19.03 KiB)
(0.33772213757038116,)

julia> @btime gradient(sum$m2, rand(20,20));
  117.917 μs (614 allocations: 77.62 KiB)

# Unstable for loop

julia> @eval Flux begin
       function (c::Chain)(x)
         y = identity(x)
         for f in c.layers
           y = f(y)
         end
         y
       end
       end

julia> Flux.Zygote.refresh()

julia> @btime gradient(sum$m1, 3.14)
  25.250 μs (357 allocations: 13.33 KiB)
(0.33772213757038116,)

julia> @btime gradient(sum$m2, rand(20,20));
  58.792 μs (512 allocations: 72.89 KiB)

Member:

Thanks. Looks like afoldl would need an rrule like foldl to be remotely competitive.

@mcabbott (Member Author) commented Jan 4, 2022:

I guess it's a sad reflection on Zygote that something so simple should benefit from a complicated hand-written rule. Diffractor:

julia> fs = (sin, cos, x -> vcat(x,x), sum, exp);

julia> @btime Zygote.gradient(x -> foldl(|>, (x, $fs...)), 3)
  6.900 μs (130 allocations: 3.88 KiB)
(2.017262629514019,)

julia> @btime Zygote.gradient(x -> Base.afoldl(|>, x, $fs...), 3)
  72.083 μs (306 allocations: 11.81 KiB)
(2.017262629514019,)

julia> @btime Diffractor.gradient(x -> foldl(|>, (x, $fs...)), 3)
  9.375 μs (205 allocations: 6.52 KiB)
(2.017262629514019,)

julia> @btime Diffractor.gradient(x -> Base.afoldl(|>, x, $fs...), 3)
  3.812 μs (90 allocations: 5.56 KiB)
(2.017262629514019,)

And now with ForwardDiff, Tracker, and ReverseDiff too:

julia> @btime ForwardDiff.derivative(x -> foldl(|>, (x, $fs...)), 3)
  689.597 ns (11 allocations: 416 bytes)
2.0172626295140184

julia> @btime ForwardDiff.derivative(x -> Base.afoldl(|>, x, $fs...), 3)
  181.812 ns (6 allocations: 256 bytes)
2.0172626295140184

julia> fs2 = (sin, cos, x -> [x,x], sum, exp);

julia> @btime Tracker.gradient(x -> foldl(|>, (x, $fs2...)), 3)
  850.134 ns (22 allocations: 768 bytes)
(2.017262629514019 (tracked),)

julia> @btime Tracker.gradient(x -> Base.afoldl(|>, x, $fs2...), 3)
  294.457 ns (22 allocations: 768 bytes)
(2.017262629514019 (tracked),)

julia> @btime ReverseDiff.gradient(x -> foldl(|>, (x[1], $fs2...)), [3.0])
  700.290 ns (24 allocations: 1.02 KiB)
1-element Vector{Float64}:
 2.017262629514019

julia> @btime ReverseDiff.gradient(x -> Base.afoldl(|>, x[1], $fs2...), [3.0])
  448.652 ns (24 allocations: 1.02 KiB)
1-element Vector{Float64}:
 2.017262629514019

@mcabbott (Member Author) commented Jan 7, 2022:

Oh right, indeed. Yota does understand structs, but it fails on my foldl tests above; not sure why:

julia> Yota.grad(x -> sum(x.a), (a=[1,2], b=[3,4]))
(3, (ChainRulesCore.ZeroTangent(), Tangent{NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Int64}}}}(a = [1, 1],)))

julia> Yota.grad(x -> foldl(|>, (x, fs2...)), 3.0)
ERROR: No deriative rule found for op %7 = foldl(|>, %5)::Float64, try defining it using 
	ChainRulesCore.rrule(::typeof(foldl), ::typeof(|>), ::Tuple{Float64, typeof(sin), typeof(cos), var"#13#14", typeof(sum), typeof(exp)}) = ...

julia> Yota.grad(x -> Base.afoldl(|>, x, fs2...), 3.0)
ERROR: MethodError: no method matching length(::typeof(sin))

julia> m2 = Chain(Dense(2,3,sigmoid));

julia> Yota.grad(m -> sum(m([0.1, 0.2])), m2)
ERROR: No deriative rule found for op %15 = broadcasted(%12, %14)::Broadcasted{}, try defining it using ...

Cc @dfdx in case he'd like to join this hallway discussion.

@DhairyaLGandhi (Member):

The performance difference is really only due to Zygote unrolling chains, which isn't per se necessary. It also seems foldl wouldn't work generically, especially when composing layers, chains of chains, etc.

@dfdx commented Jan 8, 2022:

@mcabbott The rule for foldl exists in ChainRules, so it must be a bug. I'm looking into it. Thanks for pinging me!

@mcabbott (Member Author):

The latest version of this PR replaces foldl with a @generated applychain. Applied to the tiny examples above (#1809 (comment)), this is faster, and it has shorter compile times (shown elsewhere).

julia> using Flux

julia> m1 = Chain(x->vcat(x,x), Dense(2,2,relu));

julia> m2 = Chain(Dense(20,30,relu), Dense(30,10), softmax);

julia> @btime gradient(sum$m1, 3.14)
  min 14.917 μs, mean 16.347 μs (246 allocations, 9.42 KiB)
(0.310660183429718,)

julia> @btime gradient(sum$m2, rand(20,20));
  min 41.417 μs, mean 50.995 μs (353 allocations, 68.27 KiB)

julia> m1 = Chain([x->vcat(x,x), Dense(2,2,relu)]);

julia> m2 = Chain([Dense(20,30,relu), Dense(30,10), softmax]);

julia> @btime gradient(sum$m1, 3.14)
  min 34.333 μs, mean 35.980 μs (459 allocations, 15.06 KiB)
(0.0,)

julia> @btime gradient(sum$m2, rand(20,20));
  min 73.500 μs, mean 85.917 μs (681 allocations, 76.66 KiB)

The last times here show a chain with layers stored as a vector. This is not type-stable, but helps with compile times. On very small networks like this, it is significantly slower, so it probably cannot be the default.
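To make the @generated idea concrete, here is a rough sketch of an explicitly unrolled applychain (an assumed form for illustration, not necessarily the exact code in this PR):

# Sketch only: unroll the layer applications at compile time for a Tuple of layers.
@generated function applychain(layers::Tuple{Vararg{Any,N}}, x) where {N}
    symbols = vcat(:x, [gensym() for _ in 1:N])
    calls = [:($(symbols[i+1]) = layers[$i]($(symbols[i]))) for i in 1:N]
    return Expr(:block, calls..., symbols[end])
end

# A Chain-like wrapper would then simply call:
# (c::SomeChain)(x) = applychain(c.layers, x)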

@dfdx:

FYI: the latest master of Yota now correctly handles foldl. I haven't tested it with the latest version of this PR though, since I'm currently busy making the underlying tracer faster, so I'm going to wait till the PR is done and catch up later.

@DhairyaLGandhi (Member) left a review:

We should simplify this, because there are now different rules for how composed chains would work.


@mcabbott (Member Author):

The latest version of this PR replaces foldl with an @generated function which explicitly unrolls. This is not only faster to compile, it is also faster to run. With the example from FluxML/Zygote.jl#1126 (comment), the final gradient is 30% faster, and the first one is 2x faster:

julia> @time loss(lr_images);  # first Flux.Chain call, cold start, no Zygote
  4.490344 seconds (30.96 M allocations: 1.634 GiB, 7.28% gc time, 99.81% compilation time)  # Flux v0.12.8
  3.608894 seconds (34.26 M allocations: 1.801 GiB, 8.63% gc time, 99.84% compilation time)  # @generated
  3.699612 seconds (34.25 M allocations: 1.800 GiB, 8.80% gc time, 99.82% compilation time)

julia> @time loss(lr_images);
  0.005024 seconds (2.46 k allocations: 242.812 KiB)

julia> @btime loss($lr_images);
  min 7.202 ms, mean 7.648 ms (2443 allocations, 256.98 KiB)  # Flux v0.12.8
  min 5.533 ms, mean 5.696 ms (2461 allocations, 242.78 KiB)  # @generated
  min 5.639 ms, mean 5.881 ms (2368 allocations, 233.55 KiB)  # Vector chain

julia> loss_grad(x, ps) = gradient(() -> loss(x), ps);

julia> ps = Flux.params(model);

julia> @time loss_grad(lr_images, ps);  # first Zygote.gradient, cold start
 62.895168 seconds (67.77 M allocations: 3.523 GiB, 0.99% gc time, 99.92% compilation time)  # Flux v0.12.8
 29.646482 seconds (64.41 M allocations: 3.373 GiB, 2.02% gc time, 99.88% compilation time)  # @generated
 13.677340 seconds (63.15 M allocations: 3.308 GiB, 3.99% gc time, 99.74% compilation time)  # Vector chain

julia> @time loss_grad(lr_images, ps);  # second
  0.028015 seconds (12.95 k allocations: 2.275 MiB)

julia> @btime $loss_grad($lr_images, $ps);
  min 37.933 ms, mean 38.574 ms (13052 allocations, 2.93 MiB)  # Flux v0.12.8
  min 29.283 ms, mean 29.730 ms (12941 allocations, 2.28 MiB)  # @generated
  min 30.516 ms, mean 30.957 ms (13436 allocations, 1.62 MiB)  # Vector chain

julia> length(model.layers[1].layers.layers)  # count of resblock(channels) layers
15

Also included in this comparison is a Chain backed by a Vector. This is type-unstable, and not as fast for very small networks. But for larger ones, it is no slower, and improves compile time by another factor of 2. It's included in this PR for now, but obviously needs tests if it is to stay.
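For reference, a rough sketch of what the Vector-backed idea looks like (assumed form; in the PR itself this is Chain constructed as Chain([...]) as above, not a separate type):

# Sketch only: layers stored in a Vector{Any} and applied in a plain loop.
# Each call is a dynamic dispatch, so tiny networks pay a runtime cost,
# but the compiler no longer specializes on the full tuple of layer types.
struct VectorChain
    layers::Vector{Any}
end

function (c::VectorChain)(x)
    for f in c.layers
        x = f(x)
    end
    return x
end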

@mcabbott changed the title from "Use foldl to evaluate Chain" to "Replace unrolled foldl used to evaluate Chain with a better one" on Jan 11, 2022.
Comment on lines 81 to 82
# This is a temporary and naive implementation
# it might be replaced in the future for better performance
@mcabbott (Member Author):

BTW, in addition to a hand-written foldl, the function Flux.activations is just accumulate(|>, m1.layers; init=x1). Since we don't support Julia < 1.5, we could just replace it.

Member:

I thought accumulate would face the same issue as foldl, namely that the rrule doesn't consider init? This PR need not be concerned with activations either way, we can kick that can down the road until rrules are tweaked or someone complains about performance.

@mcabbott (Member Author):

Indeed, you could likewise do accumulate(|>, (x, m.layers...)). But happy to leave it alone for now.
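A tiny illustration of that equivalence, with hypothetical layers:

layers = (x -> 2 .* x, x -> x .+ 1)
x1 = [1.0, 2.0]

accumulate(|>, layers; init = x1)  # ([2.0, 4.0], [3.0, 5.0]) — one entry per layer, like Flux.activations
accumulate(|>, (x1, layers...))    # ([1.0, 2.0], [2.0, 4.0], [3.0, 5.0]) — same, but also keeps the input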

@DhairyaLGandhi (Member) commented Jan 11, 2022

For SciML we might have regressions in higher-order chains, or regressions in smaller chains.

@ToucheSir (Member):

@ChrisRackauckas would you be able to offer a thumbs up/down on whether this is something SciML could tolerate? Also @pxl-th I don't remember if you've tested this, but if not it may be interesting for your benchmark in FluxML/Zygote.jl#1119.

@ChrisRackauckas (Member):

This is fine. We use FastChain for small chains anyway, and resort to Flux when things get big, so we should be fine. The gap at the small end will grow after SciML/DiffEqFlux.jl#671, and I'm not sure what to do after that divergence, but it'll be an interesting discussion to have.

@mcabbott marked this pull request as ready for review on February 5, 2022, 02:40.
@codecov-commenter commented Feb 5, 2022

Codecov Report

Merging #1809 (f60da1a) into master (9b21e2c) will increase coverage by 0.07%.
The diff coverage is 83.33%.


@@            Coverage Diff             @@
##           master    #1809      +/-   ##
==========================================
+ Coverage   74.50%   74.57%   +0.07%     
==========================================
  Files          28       28              
  Lines        1706     1715       +9     
==========================================
+ Hits         1271     1279       +8     
- Misses        435      436       +1     
Impacted Files Coverage Δ
src/layers/show.jl 72.72% <71.42%> (+0.35%) ⬆️
src/layers/basic.jl 77.53% <90.90%> (+0.61%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

src/layers/show.jl: review thread (outdated, resolved)
@mcabbott (Member Author) commented Feb 5, 2022

Good to go?
