
Restructure makes a copy #146

Open
linusheck opened this issue Apr 24, 2023 · 4 comments · May be fixed by #165
Labels
enhancement New feature or request

Comments

@linusheck

linusheck commented Apr 24, 2023

In some situations you have to restructure a lot when using Flux, for instance if you want to run your batches as separate solves in DiffEqFlux using an EnsembleProblem. You have to use something like a ComponentArray to pass the parameters through the solver and to let the adjoint methods do their work in differentiating the solve. But restructuring via a ComponentArray is unreasonably(?) slow in Flux. Switching to Lux eliminates those problems, but this seems like something that could be handled better in Flux or ComponentArrays.

Example code:

using Flux, Profile, PProf, Random, ComponentArrays

layer_size = 256
layer = Flux.Dense(layer_size => layer_size)
params_, re = Flux.destructure(layer)
params = ComponentArray(params_)

# Rebuild the layer from the flat parameter vector on every step, as an
# EnsembleProblem-style workload would. (Renamed from `eval`, which cannot
# be redefined in Main.)
function apply_steps(steps, input)
    v = input
    for _ in 1:steps
        v = re(params)(v)
    end
    return v
end

Profile.clear()
@profile apply_steps(100_000, rand(Float32, layer_size))
pprof(; web=true)

Here we spend about 10% of the time in sgemv matrix multiplication, another 10% in the rest of the Dense call, and about 75% in the Restructure. This gets worse as the networks get smaller. As far as I can read the flame graph, the restructure spends a lot of its time in the GC:

[flame graph screenshot: roughly 75% of the time in Restructure, much of it in GC]

Could there be a way to mitigate this specific problem, in particular when the same parameters are reused? I think this would make some example code a lot faster too.
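
A minimal workaround sketch for the fixed-parameter case (reusing re, params, and layer_size from the snippet above; not part of the original report) is to pay the reconstruction cost once, outside the loop:

rebuilt = re(params)  # one reconstruction, one copy
function apply_steps_fast(layer, steps, input)
    v = input
    for _ in 1:steps
        v = layer(v)  # no per-step Restructure
    end
    return v
end
apply_steps_fast(rebuilt, 100_000, rand(Float32, layer_size))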

I also opened a Discourse thread about this because I'm not sure whether it's an issue with Flux specifically: https://discourse.julialang.org/t/flux-restructure-for-componentarrays-jl-unreasonably-slow/97849/3

@mcabbott
Member

I think the story here is that Restructure makes a new copy of the model, which allocates, and this costs about 30μs:

julia> @btime $re($params);  # This is the reconstruction cost
  min 11.125 μs, mean 29.891 μs (24 allocations, 257.95 KiB)

julia> @btime copy($params);  # ... and it's mostly allocation, same mean:
  min 6.007 μs, mean 29.209 μs (2 allocations, 257.05 KiB)

Normally the allocation of the forward pass dwarfs this, but if you don't make batches and do this for every sample, then the model allocation dominates.

julia> input = rand(Float32, layer_size);  # As above, batch size of 1

julia> @btime $layer($input);  # ... hence alloc much smaller here, unusual.
  min 6.800 μs, mean 8.970 μs (2 allocations, 2.12 KiB)

julia> @btime $re($params)($input);  # Complete, as above
  min 20.458 μs, mean 51.587 μs (26 allocations, 260.08 KiB)

This isn't a regime anyone got around to optimising. It would probably be easy to make Restructure either (1) copy into the existing model, or (2) create a new model out of views not copies.

Note that ComponentArrays is doing nothing here, params isa Vector. I think you should think of ComponentArrays as another idea in the same direction as destructure, "instead" not "and". They differ in lots of ways, of course.

Note also that allocations in the gradient will typically be larger than in the forward pass. When I last checked, destructure handled this more efficiently than ComponentArrays, which repeatedly made a copy of the full parameter vector. This will matter (for larger models) in exactly the regime you have selected.
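
A minimal sketch of option (2) for a single Dense layer, building the model out of views into the flat vector instead of copies (re_view is a hypothetical name, not Flux's API; it assumes destructure's weight-before-bias ordering and ignores non-trainable fields and tied arrays):

using Flux

function re_view(flat::AbstractVector, template::Flux.Dense)
    W, b = template.weight, template.bias
    nW, nb = length(W), length(b)
    Wv = reshape(view(flat, 1:nW), size(W))  # view, no copy
    bv = view(flat, nW+1:nW+nb)              # view, no copy
    Flux.Dense(Wv, bv, template.σ)           # wraps the views directly
end

For this layer, re_view(params_, layer)(input) should match re(params_)(input), minus the per-call allocation.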

@linusheck
Author

linusheck commented Apr 24, 2023

Thanks for your thorough reply! Good to know that ComponentArrays isn't doing anything. The offending line seems to be this one:

ProjectTo(y)(reshape(flat[o .+ (1:length(y))], axes(y))) # ProjectTo is just correcting eltypes

I wonder whether using ProjectTo a bit more cleverly might fix the allocations and allow views to be used.
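
A view-based variant of that line might look like this (same flat, o, and y as in the quoted code; ProjectTo is from ChainRulesCore; an untested sketch):

ProjectTo(y)(reshape(view(flat, o .+ (1:length(y))), axes(y)))  # view instead of copy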

@linusheck linusheck changed the title Restructure for ComponentArrays.jl unreasonably slow Restructure unreasonably slow Apr 24, 2023
@mcabbott mcabbott changed the title Restructure unreasonably slow Restructure makes a copy Apr 24, 2023
@mcabbott
Member

Changing that line to use a view would probably work fine. ProjectTo won't mind, but I'm not sure whether all operations downstream will be happy with a reshaped view (e.g. will CuArray dispatch still work?).

Maybe the bigger question is what the interface should be, e.g. whether this should be a keyword option.

Or whether it should be a separate destructure! which copies into the existing model. That might solve both questions.
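
A hypothetical sketch of such a destructure! (no function of this name exists in Flux or Optimisers at the time of writing): copy the flat vector back into the model's arrays in place and return the same model. It treats every numeric array as a parameter, so it ignores the trainable/non-trainable distinction and tied arrays:

using Flux, Functors

function destructure!(model, flat::AbstractVector)
    offset = 0
    fmap(model) do x
        x isa AbstractArray{<:Number} || return x
        copyto!(x, 1, flat, offset + 1, length(x))  # overwrite in place
        offset += length(x)
        return x
    end
    return model
end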

@mcabbott mcabbott transferred this issue from FluxML/Flux.jl May 1, 2023
@mcabbott mcabbott added the enhancement New feature or request label Oct 9, 2023
@mcabbott mcabbott linked a pull request Nov 2, 2023 that will close this issue
@CarloLucibello
Member

I think we should have a copyless alternative to restructure, with a possible name of ps, re = trainable_components(x), that returns a ComponentArray ps and a reconstruction function re such that re(ps) == x.
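
Hypothetical usage of the proposed API, just to illustrate the intended contract (trainable_components does not exist yet):

using Flux, ComponentArrays

model = Dense(4 => 4)
ps, re = trainable_components(model)  # proposed, not implemented
ps isa ComponentArray                 # flat but structured parameter view
re(ps) == model                       # reconstructs x, ideally without copying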
