Use `NNlib.bias_act!` #2327

mcabbott · 2023-09-04T22:58:09Z

Uses FluxML/NNlib.jl#457 to speed things up and save memory: up to half the memory for a forward pass. The largest savings in the gradient will be for large batch sizes, and for activation functions like `identity`, `relu` and `tanh`, whose input need not be stored.
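A minimal sketch of where the forward-pass saving comes from, in plain Julia without NNlib. The helper names `dense_alloc` and `dense_inplace` are hypothetical, and this is not how NNlib implements `bias_act!`. It only contrasts allocating a fresh output for `σ.(W*x .+ b)` with overwriting the array that `W * x` already returned.

```julia
# Hypothetical helpers contrasting the two patterns; not NNlib's implementation.

# Allocating: `W * x` creates one array, then the broadcast creates a second.
dense_alloc(σ, W, b, x) = σ.(W * x .+ b)

# In-place: overwrite the array returned by `W * x` instead of allocating another.
function dense_inplace(σ, W, b, x)
    y = W * x          # the only output-sized allocation
    y .= σ.(y .+ b)    # fused broadcast writes the result back into y
    return y
end

W = randn(Float32, 120, 256)
b = randn(Float32, 120)
x = randn(Float32, 256, 128)

y1 = dense_alloc(tanh, W, b, x)
y2 = dense_inplace(tanh, W, b, x)
println(y1 ≈ y2)

# Compare allocations now that both functions have been compiled by the calls above.
println(@allocated dense_alloc(tanh, W, b, x))
println(@allocated dense_inplace(tanh, W, b, x))
```

On a typical run the second allocation count should be roughly half the first, since the in-place version skips one output-sized array.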

```julia
julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu),
           Dense(84 => 10),
       );

julia> img = rand32(28, 28, 1, 128);

julia> @btime $lenet($img);
  min 867.875 μs, mean 1.434 ms (160 allocations, 5.60 MiB)  # before
  min 831.500 μs, mean 1.100 ms (149 allocations, 3.31 MiB)  # after

julia> @btime gradient(m -> sum(abs2, m($img)), $lenet);
  min 7.128 ms, mean 10.280 ms (567 allocations, 14.19 MiB)  # before
  min 6.296 ms, mean 6.930 ms (546 allocations, 9.61 MiB)  # after
```
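Why the gradient saving depends on the activation: for `identity`, `relu` and `tanh`, the derivative can be written in terms of the output `y = σ(x)` alone, so a backward pass can reuse the output instead of keeping the pre-activation input. A small sketch in plain Julia; `deriv_from_output` and the local `relu` are hypothetical helpers, not NNlib's API, and a central finite difference is used only as a check.

```julia
# Local definition so the sketch has no dependencies (NNlib has its own relu).
relu(x) = max(zero(x), x)

# Hypothetical helper: derivative of σ at x, written using only y = σ(x).
deriv_from_output(::typeof(identity), y) = one(y)
deriv_from_output(::typeof(tanh), y) = 1 - y^2                   # tanh'(x) = 1 - tanh(x)^2
deriv_from_output(::typeof(relu), y) = y > 0 ? one(y) : zero(y)  # y > 0 exactly when x > 0

# Compare with a central finite difference, at points away from relu's kink at 0.
function check_derivs(xs; h = 1e-6)
    for σ in (identity, tanh, relu), x in xs
        y = σ(x)                              # only the output is kept
        fd = (σ(x + h) - σ(x - h)) / (2h)
        isapprox(deriv_from_output(σ, y), fd; atol = 1e-6) || return false
    end
    return true
end

println(check_derivs([-2.0, -0.5, 0.3, 1.7]))
```

An activation such as `gelu` has no such closed form in terms of its output, which is consistent with the savings being largest for the functions listed.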

Closes #2151, which I forgot about.

ToucheSir · 2023-09-05T03:39:51Z

src/layers/conv.jl

```diff
   cdims = conv_dims(c, x)
   xT = _match_eltype(c, x)
-  σ.(conv(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+  NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))
```
GPUCompiler doesn't like this when `c.σ === sigmoid` and a bias is set: https://buildkite.com/julialang/flux-dot-jl/builds/4240#018a62b9-4aa7-4a4a-80fe-661494ca9939/351-799. It's not clear to me why `Dense` would be fine, given that it uses the same machinery.

mcabbott · 2023-09-05

Thanks for digging. The error is on

```
broadcast!(::ComposedFunction{typeof(sigmoid_fast), typeof(+)}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
```

where the `ComposedFunction` comes from here:

https://github.com/FluxML/NNlib.jl/blob/1b30040fabadd41efa0d9dde5841b90f9f85cf2d/src/bias_act.jl#L32-L33
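The failing call is `broadcast!` with a `ComposedFunction{typeof(sigmoid_fast), typeof(+)}`, i.e. the activation composed with `+`, so that adding the bias and applying the activation happen in one fused pass over the data. A CPU-only sketch of that pattern in plain Julia, with `tanh` standing in for `sigmoid_fast` (an assumption made to avoid depending on NNlib). It shows the shape of the call, not the GPU compilation failure itself.

```julia
# `σ ∘ +` is a Base.ComposedFunction: calling f(x, b) computes σ(x + b).
f = tanh ∘ +

x = randn(Float32, 4, 3)   # stands in for the conv / matmul output
b = randn(Float32, 4)      # bias, broadcast along the second dimension
expected = tanh.(x .+ b)

# Writing the result into x itself mirrors the in-place style of bias_act!.
broadcast!(f, x, x, b)     # x[i, j] = tanh(x[i, j] + b[i])

println(f isa Base.ComposedFunction)
println(x ≈ expected)
```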

Agreed, it's odd that `Dense` doesn't hit the same error.

ToucheSir · 2023-09-06

I can replicate this issue with just CUDA.jl and NNlib, so we should consider adding some GPU tests for `bias_act!` on the NNlib side. Interestingly enough, plain `sigmoid` works just fine, so something is strange with `sigmoid_fast` in particular.

ToucheSir · 2023-09-06

I have a theory now, based on more testing. `sigmoid_fast` also works if one removes the `@inline`. I think that with the `@inline`, it's being inlined into the body of `ComposedFunction` too early, which prevents `ComposedFunction` itself from being inlined because its body is now too complex.

Edit: confirmed with Cthulhu. Not sure what the best course of action here would be. Do we rely heavily on the `@inline` for CPU performance?

darsnack · 2023-09-06

We could always override `fast_act` for GPU arrays. It's uglier, but it preserves CPU performance if there is some gain there.

darsnack · 2023-09-06

This might be a good PR to test the new benchmarking tool too.

mcabbott · 2023-09-06

> Could always override `fast_act` for GPU arrays

Good point. Allowing this is precisely why `fast_act` takes a second argument.
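A sketch of the dispatch trick being discussed: because the array is passed as a second argument, a method for a GPU array type can return the plain activation while CPU arrays keep the `@inline` fast variant. Everything here is hypothetical: `choose_act` stands in for `fast_act`, `my_sigmoid`/`my_sigmoid_fast` for `sigmoid`/`sigmoid_fast`, and `StandInGPUArray` for a real GPU array type such as `CuArray`, so that the sketch runs without CUDA.jl or NNlib.

```julia
# Hypothetical stand-ins for sigmoid / sigmoid_fast.
my_sigmoid(x) = 1 / (1 + exp(-x))
@inline my_sigmoid_fast(x) = (tanh(x / 2) + 1) / 2   # same function, different formula

# Minimal stand-in for a GPU array type, so that dispatch has a target.
struct StandInGPUArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::StandInGPUArray) = size(a.data)
Base.getindex(a::StandInGPUArray, i::Int...) = a.data[i...]

# The second argument exists only so that methods can dispatch on the array type.
choose_act(f, ::AbstractArray) = f                                   # default: unchanged
choose_act(::typeof(my_sigmoid), ::AbstractArray) = my_sigmoid_fast  # CPU: fast variant
choose_act(::typeof(my_sigmoid), ::StandInGPUArray) = my_sigmoid     # "GPU": plain variant

x_cpu = randn(Float32, 3)
x_gpu = StandInGPUArray(copy(x_cpu))

println(choose_act(my_sigmoid, x_cpu) === my_sigmoid_fast)  # CPU keeps the fast path
println(choose_act(my_sigmoid, x_gpu) === my_sigmoid)       # GPU falls back
println(my_sigmoid.(x_cpu) ≈ my_sigmoid_fast.(x_cpu))       # both agree numerically
```

The CPU methods are left untouched; the cost is one extra, more specific method per GPU array type.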

ToucheSir · 2024-04-02

Unfortunately, it looks like this error still persists :(

mcabbott · 2024-04-02

Rebased to see how it works with Enzyme etc., but I still haven't got around to fixing this error.

It can save a lot of memory, but I haven't seen much of a speedup from it.

CarloLucibello · 2023-09-05T06:39:08Z

src/layers/basic.jl

```diff
   xT = _match_eltype(a, x)  # fixes Float64 input, etc.
-  return σ.(a.weight * xT .+ a.bias)
+  NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
```
Suggested change:

```diff
-  NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
+  return NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
```

mcabbott · 2023-09-05T13:37:11Z

src/layers/normalise.jl

```diff
   scale = γ ./ sqrt.(σ² .+ eps)
-  bias = -scale .* μ .+ β
+  bias = .-scale .* μ .+ β
   l.λ.(scale .* x .+ bias)
 end
```
Unrelated change, but surely a typo?

I considered using `bias_act!` here, but maybe that's more confusing than helpful, given how much other allocation there is.

ToucheSir · 2023-09-06

If anything, I would've expected it on the line below (248).

mcabbott · 2023-09-06

Yes, that's what I meant, sorry. But while I was there, I spotted the missing dot.
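The missing dot affects allocation, not the result: `-scale` applies ordinary (non-broadcast) unary minus to the whole array, which allocates a temporary before the rest of the expression is broadcast, whereas `.-scale` folds the negation into the single fused broadcast. A small sketch with random inputs; the variable names mirror the diff, and the two helper functions are hypothetical.

```julia
γ  = rand(Float32, 8); β  = rand(Float32, 8)
μ  = rand(Float32, 8); σ² = rand(Float32, 8)
ϵ  = 1f-5
scale = γ ./ sqrt.(σ² .+ ϵ)

bias_unfused(scale, μ, β) = -scale .* μ .+ β   # `-scale` allocates a temporary array first
bias_fused(scale, μ, β)   = .-scale .* μ .+ β  # negation is part of one fused broadcast

println(bias_unfused(scale, μ, β) ≈ bias_fused(scale, μ, β))

# After the calls above have compiled both functions, compare allocations.
println(@allocated bias_unfused(scale, μ, β))
println(@allocated bias_fused(scale, μ, β))
```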


ToucheSir added performance run downstream test labels Sep 4, 2023

ToucheSir reviewed Sep 5, 2023

CarloLucibello reviewed Sep 5, 2023

mcabbott commented Sep 5, 2023

mcabbott force-pushed the bias_act branch from 48d5e45 to 1a3e33e on March 19, 2024, 19:14

use NNlib.bias_act

4ab8343

rm comments

mcabbott force-pushed the bias_act branch from 1a3e33e to 4ab8343 on March 30, 2024, 19:25
