Use NNlib.bias_act! #2327

Open
wants to merge 1 commit into
base: master

@mcabbott (Member) commented Sep 4, 2023

Uses FluxML/NNlib.jl#457 to speed up layers and save memory, up to half the memory of a forward pass. The largest savings in the gradient will be for large batch sizes, and for activation functions like identity, relu, tanh, whose input need not be stored.

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu), 
           Dense(84 => 10),
       );

julia> img = rand32(28, 28, 1, 128);

julia> @btime $lenet($img);
  min 867.875 μs, mean 1.434 ms (160 allocations, 5.60 MiB)  # before
  min 831.500 μs, mean 1.100 ms (149 allocations, 3.31 MiB)  # after

julia> @btime gradient(m -> sum(abs2, m($img)), $lenet);
  min 7.128 ms, mean 10.280 ms (567 allocations, 14.19 MiB)  # before
  min 6.296 ms, mean 6.930 ms (546 allocations, 9.61 MiB)  # after

Closes #2151 which I forgot about.
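
For readers new to the function, the idea can be sketched in plain Julia. This is a minimal sketch, not NNlib's implementation: `toy_bias_act!` and the local `relu` are hypothetical stand-ins, and the real `NNlib.bias_act!` adds fast paths and gradient rules that are not shown here. The point is only that the output of `W * v` is a fresh array, so it can be overwritten with the activated result instead of allocating a second array.

```julia
# Hypothetical stand-in for the idea behind NNlib.bias_act!; not NNlib's code.
# Overwrites x with σ.(x .+ b) and returns it, so no second array is allocated.
function toy_bias_act!(σ, x::AbstractArray, b)
    x .= σ.(x .+ b)   # one fused, in-place broadcast
    return x
end

relu(x) = max(x, zero(x))   # local definition, to keep the sketch self-contained

W = randn(Float32, 4, 3)
v = randn(Float32, 3, 5)
b = randn(Float32, 4)

y_old = relu.(W * v .+ b)            # allocates W * v, then a second array for the result
Wv = W * v                           # fresh array that nothing else refers to
y_new = toy_bias_act!(relu, Wv, b)   # reuses the memory of Wv for the result

y_new === Wv     # same array, nothing new was allocated for the output
y_new ≈ y_old    # same values as the out-of-place version
```

For relu, tanh and identity the derivative can be written in terms of the output alone, which is why overwriting the input does not get in the way of computing the gradient.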

  cdims = conv_dims(c, x)
  xT = _match_eltype(c, x)
- σ.(conv(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+ NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))

Member commented:

GPUCompiler doesn't like this when c.σ === sigmoid and a bias is set, https://buildkite.com/julialang/flux-dot-jl/builds/4240#018a62b9-4aa7-4a4a-80fe-661494ca9939/351-799. It's not clear to me why Dense would be fine given it uses the same machinery.

Author (Member) commented:

Thanks for digging. Error is on

broadcast!(::ComposedFunction{typeof(sigmoid_fast), typeof(+)}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})

where ComposedFunction comes from here:

https://github.com/FluxML/NNlib.jl/blob/1b30040fabadd41efa0d9dde5841b90f9f85cf2d/src/bias_act.jl#L32-L33

Agree it's odd that Dense doesn't hit the same.
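
The shape of the failing call can be reproduced on the CPU in plain Julia. This is a sketch under stated assumptions: `toy_sigmoid` is a hypothetical stand-in rather than NNlib's `sigmoid_fast`, and ordinary `Array`s replace the `CuArray`s, so it shows only how an activation composed with `+` is broadcast in place over the array and the bias. It does not reproduce the GPU compilation error.

```julia
# Stand-in for an activation function; NOT NNlib's sigmoid_fast.
toy_sigmoid(x) = 1 / (1 + exp(-x))

x = randn(Float32, 5, 5, 2, 3)   # 4-D array, like a conv output
b = randn(Float32, 1, 1, 2, 1)   # bias reshaped to broadcast over the channel dimension

f = toy_sigmoid ∘ +              # ComposedFunction: f(a, c) == toy_sigmoid(a + c)
f isa ComposedFunction

y = copy(x)
broadcast!(f, y, y, b)           # in place: each y[i] becomes toy_sigmoid(y[i] + bias)

y ≈ toy_sigmoid.(x .+ b)         # matches the out-of-place broadcast
```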

Member commented:

I can replicate this issue with just CUDA.jl and NNlib, so we should consider adding some GPU tests for bias_act! on the NNlib side. Interestingly enough normal sigmoid works just fine, so something is strange with sigmoid_fast in particular.

@ToucheSir (Member) commented Sep 6, 2023:

Have a theory now based on more testing. sigmoid_fast also works if one removes the @inline. I think what's happening is that with the @inline, it's being inlined into the body of ComposedFunction too early and preventing ComposedFunction itself from being inlined because its body is now too complex.

Edit: confirmed with Cthulhu. Not sure what the best course of action here would be. Do we rely heavily on the @inline for CPU perf?

Member commented:

Could always override fast_act for GPU arrays. Uglier but preserves CPU performance if there is some gain there.

Member commented:

This might be a good PR to test the new benchmarking tool too.

Author (Member) commented:

> Could always override fast_act for GPU arrays

Good point. Allowing this is precisely why fast_act takes a second argument.
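
The dispatch pattern being discussed can be sketched without NNlib or CUDA. Everything here is hypothetical and for illustration only: `pick_act` plays the role of `fast_act`, `DeviceArray` stands in for a GPU array type, and `slow_sigmoid` / `fast_sigmoid` stand in for the two sigmoid variants. The second argument exists purely so that a method can be specialised on the array type.

```julia
# Hypothetical array type standing in for a GPU array.
struct DeviceArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::DeviceArray) = size(a.data)
Base.getindex(a::DeviceArray{T,N}, i::Vararg{Int,N}) where {T,N} = a.data[i...]

# Hypothetical activation functions: a plain one and an @inline "fast" variant.
slow_sigmoid(x) = 1 / (1 + exp(-x))
@inline function fast_sigmoid(x)
    t = exp(-abs(x))
    return ifelse(x ≥ 0, inv(1 + t), t / (1 + t))
end

# Default: keep the function as given.
pick_act(f, ::AbstractArray) = f
# On ordinary arrays, swap in the fast variant.
pick_act(::typeof(slow_sigmoid), ::AbstractArray) = fast_sigmoid
# On the device array type, opt out and keep the plain variant.
pick_act(::typeof(slow_sigmoid), ::DeviceArray) = slow_sigmoid

cpu = rand(Float32, 4)
dev = DeviceArray(rand(Float32, 4))

pick_act(slow_sigmoid, cpu) === fast_sigmoid   # fast path on the CPU
pick_act(slow_sigmoid, dev) === slow_sigmoid   # overridden for the device type
pick_act(tanh, dev) === tanh                   # other functions pass through
```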

Member commented:

Unfortunately, it looks like this error still persists :(

Author (Member) commented:

Rebased to see how it worked with Enzyme etc, but still didn't get around to fixing this error.

Can save a lot of memory but haven't seen much of a speedup out of it.

  xT = _match_eltype(a, x) # fixes Float64 input, etc.
- return σ.(a.weight * xT .+ a.bias)
+ NNlib.bias_act!(a.σ, a.weight * xT, a.bias) # does σ.(W*x .+ b), with fast paths

Member commented:

Suggested change
- NNlib.bias_act!(a.σ, a.weight * xT, a.bias) # does σ.(W*x .+ b), with fast paths
+ return NNlib.bias_act!(a.σ, a.weight * xT, a.bias) # does σ.(W*x .+ b), with fast paths

Comment on lines 246 to 251
  scale = γ ./ sqrt.(σ² .+ eps)
- bias = -scale .* μ .+ β
+ bias = .-scale .* μ .+ β
  l.λ.(scale .* x .+ bias)
  end

Author (Member) commented:

Unrelated change, but surely a typo?

I considered using bias_act! here, but maybe that's more confusing than helpful, given how much other allocation there is.

Member commented:

If anything I would've expected it on the line below (248).

Author (Member) commented:

Yes that's what I meant, sorry. But while there, I spotted the missing dot.
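
The effect of the missing dot can be checked in isolation. A minimal sketch with made-up inputs (the array sizes and the name `ϵ` are assumptions for illustration, not taken from the layer): the un-dotted `-scale` is an ordinary call that builds a temporary array before the fused broadcast runs, while `.-scale` joins the same fused broadcast. Both give the same values, and `scale .* x .+ bias` agrees with the usual normalisation formula.

```julia
γ  = rand(Float32, 3)
β  = rand(Float32, 3)
μ  = rand(Float32, 3)
σ² = rand(Float32, 3) .+ 0.1f0   # keep the variance strictly positive
ϵ  = 1f-5
x  = randn(Float32, 3, 8)

scale = γ ./ sqrt.(σ² .+ ϵ)

bias_undotted = -scale .* μ .+ β    # `-scale` allocates a temporary array first
bias_dotted   = .-scale .* μ .+ β   # every operation is part of one fused broadcast

bias_undotted ≈ bias_dotted

# Same result as normalising x directly:
scale .* x .+ bias_dotted ≈ γ .* (x .- μ) ./ sqrt.(σ² .+ ϵ) .+ β
```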

rm comments

4 participants