
Improve type stability of LayerNorm and Dropout #2005

Open · wants to merge 3 commits into master

Conversation

@ToucheSir (Member) commented on Jun 23, 2022

These two layers made use of explicit or implicit control flow (e.g. default keyword argument values), which Zygote does not like. This PR is essentially a set of small hacks to work around that.

Any ideas on how to avoid return_type in _dropout would be much appreciated, but for now it seems to work.
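
For context, one quick way to probe the kind of inference failure being targeted here (an illustrative check, not part of this PR's diff):

using Flux, Zygote, Test

d = Dropout(0.5)
x = randn(Float32, 4, 4)

# Test.@inferred throws if the return type of the call cannot be inferred
# concretely; the control-flow issues described above are what make the
# pullback widen to an abstract type.
@inferred gradient(x -> sum(d(x)), x)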

TODO benchmarks.

PR Checklist

  • Entry in NEWS.md

@ToucheSir (Member, Author) commented:

TTFG (time-to-first-gradient) timings using the following snippet:

Test code
using Metalhead, Flux, Zygote
using Metalhead: ChannelLayerNorm

model = ConvNeXt(:tiny; inchannels=1, nclasses=1).layers
# ChannelLayerNorm isn't type stable yet (for the same reason LayerNorm wasn't),
# so replace it with identity for this demo
model = fmap(Returns(identity), model; exclude=Base.Fix2(isa, ChannelLayerNorm))

# display(model); println()

loss(m, x) = sum(m(x))

inputs = randn(Float32, 32, 32, 1, 1)
# @time loss(model, inputs)
# @time loss(model, inputs)

loss_grad(m, x) = gradient((m, x) -> loss(m, x), m, x)

@time loss_grad(model, inputs)
# @time loss_grad(model, inputs)
julia> @time loss_grad(model, inputs)
 34.835647 seconds (87.12 M allocations: 4.701 GiB, 3.14% gc time, 99.38% compilation time) # 0.13.3
 30.679322 seconds (78.88 M allocations: 4.300 GiB, 3.46% gc time, 98.96% compilation time) # this PR

Replacing the Chain{Vector} with a Chain{Tuple} creates a larger gap:

julia> @time loss_grad(model, inputs)
 79.846248 seconds (98.87 M allocations: 5.243 GiB, 1.68% gc time, 99.67% compilation time) # 0.13.3
 63.024710 seconds (79.23 M allocations: 4.245 GiB, 1.92% gc time, 99.45% compilation time) # this PR
 52.838056 seconds (70.81 M allocations: 3.745 GiB, 1.98% gc time, 99.60% compilation time) # this PR + Zygote#1248
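
For reference, the vector-to-tuple conversion used for these numbers can be written with a small helper along these lines (an illustrative sketch, not necessarily how it was done here; it only recurses through plain Chains and drops any layer names):

# Rebuild vector-backed Chains (what Metalhead constructs) as tuple-backed ones,
# so the compiler specializes on every layer type; the extra specialization is
# what widens the compilation-time gap above.
to_tuple_chain(c::Chain) = Chain(map(to_tuple_chain, Tuple(c.layers))...)
to_tuple_chain(x) = x

tuple_model = to_tuple_chain(model)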

@ToucheSir (Member, Author) commented on Aug 1, 2022

For kicks, here is Diffractor with JuliaDiff/ChainRules.jl#644:

julia> @time loss_grad(model, inputs)
 30.442982 seconds (92.61 M allocations: 4.148 GiB, 3.18% gc time, 89.07% compilation time) # tuple chain
 23.051121 seconds (88.06 M allocations: 3.920 GiB, 3.81% gc time, 85.11% compilation time) # vector chain, requires https://github.com/JuliaDiff/Diffractor.jl/pull/82

Re-enabling ChannelLayerNorm adds only ~1s to the total. Note that even the tuple Chain here is faster than any tested Zygote configuration.

Edit: added times for vector chains using a patched Diffractor.

@theabhirath (Member) commented:

Does Diffractor already work with most Flux models (or at least those with built-in layers)? I was under the impression that it wasn't there yet 😅

@ToucheSir (Member, Author) commented:

Not out of the box, which is why that ChainRules PR is required.

@chengchingwen (Member) commented:

@ToucheSir Could you try running the layer norm gradient on GPU? I have tried that manual broadcast fusion before, but CUDA.@time said it actually allocated more GPU memory.

@ToucheSir (Member, Author) commented:

You're right, it allocates one extra time, with over 2x the memory overhead. I also found this out the hard way recently while trying to fuse the RNN cell kernels for #2023, but forgot about the change here.
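
For anyone wanting to reproduce that check, a minimal sketch (assuming a working CUDA.jl setup; the layer size and batch size are illustrative):

using Flux, CUDA, Zygote

ln = LayerNorm(64) |> gpu
x = CUDA.randn(Float32, 64, 128)

# CUDA.@time reports GPU allocations alongside the timing, which is what
# exposed the extra allocation from the manually fused broadcast.
CUDA.@time gradient(m -> sum(m(x)), ln)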

@codecov-commenter commented:

Codecov Report

Merging #2005 (29ef2ff) into master (d66d2c4) will increase coverage by 0.27%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2005      +/-   ##
==========================================
+ Coverage   87.10%   87.37%   +0.27%     
==========================================
  Files          20       20              
  Lines        1528     1553      +25     
==========================================
+ Hits         1331     1357      +26     
+ Misses        197      196       -1     
Impacted Files            Coverage Δ
src/Flux.jl               0.00% <ø> (ø)
src/layers/normalise.jl   90.28% <100.00%> (+1.46%) ⬆️
src/layers/stateless.jl   100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@darsnack (Member) commented:

Any updates on this (like benchmarks after unfusing)?
