
More complex models: multiple inputs, multiple outputs, and parallel paths inside a model #1289

Closed
wants to merge 2 commits

Conversation

sebastian-engel

In coordination with @DhairyaLGandhi, I added new sections to the docs and new layers for a feature that I have been missing for some time now:

  • Multiple Input Layer ("Join")
  • Multiple Output Layer ("Split")
  • Multiple Internal Path Layer ("Parallel")

The documentation has been extended with examples for custom implementations of these layers (see "advanced model building"), and the new layers Join, Split and Parallel have been added to the src.

The new layers run on GPU (not yet tested with more complex layers, e.g. conv). Any number of inputs, outputs, or parallel paths should be possible. Basic tests and documentation are also included. Maybe a real Julia pro can take a look at the layers to improve the performance or replace vcat with something better.

Additionally, a "No-OPeration" layer ("Nop") is added, so that paths without an action are also possible without having to write x -> x.

I tried to keep the design simple. I would welcome ideas and comments, and I hope these changes (or a similar implementation) can be accepted so that we can use them soon.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

… custom layer description, and in the src as layers 'Join', 'Split', and 'Parallel'
@DhairyaLGandhi
Member

DhairyaLGandhi commented Jul 21, 2020

I'd forgo the layers for now and focus on the design aspect for this PR; we can revisit the layers in a separate PR.

@HenriDeh
Contributor

HenriDeh commented Jul 28, 2020

Is it not a performance problem that the Customxxx structs are not parametric? How about defining the struct like

struct CustomLayer{T <: Tuple}
    fs::T
end

so that the compiler can generate optimized code?

My bad, I just saw that the code in your local branch has parametric types.

@HenriDeh
Contributor

I would also suggest changing vcat([w.fs[i](t[i]) for i in 1:length(w.fs)]...) to a generator expression: vcat((w.fs[i](t[i]) for i in 1:length(w.fs))...).

This makes the function type-stable and therefore probably much faster :)

@DrChainsaw
Contributor

Can all types of connections be expressed with these types of layers, or will there still be 'entangled' cases which can't? What's the advantage vs. just going full static computation graph like Keras/TorchScript/ONNX?

@sebastian-engel
Author

@HenriDeh
Great improvement ideas! Thank you! ;)

@DrChainsaw

Can all types of connections be expressed with these types of layers, or will there still be 'entangled' cases which can't? What's the advantage vs. just going full static computation graph like Keras/TorchScript/ONNX?

I have only tested this with Dense layers and CUDA. Because of vcat, I am unsure whether conv or other layers will work. Concerning your second point: I am not sure why Flux.jl shouldn't have these features?

@HenriDeh
Contributor

One more, then: I just read in the Flux documentation's performance tips that using

reduce(vcat, (w.fs[i](t[i]) for i in 1:length(w.fs)))

is better optimized than splatting

vcat((w.fs[i](t[i]) for i in 1:length(w.fs))...)
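
Putting the parametric struct and the reduce form together, the Join forward pass could look roughly like this (a sketch only, not the code in this PR):

struct Join{T<:Tuple}
    fs::T
end
Join(fs...) = Join(fs)

function (w::Join)(t::Tuple)
    @assert length(w.fs) == length(t)
    # reduce over a generator avoids both the temporary array and the splat
    reduce(vcat, (w.fs[i](t[i]) for i in 1:length(w.fs)))
end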

@DhairyaLGandhi
Member

Bump

@DrChainsaw
Contributor

I’d be happy to help out here, but I don’t want to get in the way or cause divergence.

What are the design aspects which need to be discussed?

I have a framework-agnostic implementation of a standard static computation graph here. Its current home package comes with a lot of functionality for programmatically modifying the graph, but it could easily be moved to its own package with very few (if any) dependencies if there is interest. Flux could then either depend on it or (perhaps better) just direct people to that package.

There is nothing stopping anyone from trying it out today though. The API is a bit clunky as it was not designed to be used by humans, but one can easily make their own leaner version like I have done below:

using NaiveNASflux

# Works for all Flux layers, but let's make a simple dense example here
julia> dense(insize::Integer, outsize, act) = dense(inputvertex("in", insize), outsize, act)

julia> dense(input::AbstractVertex, outsize, act) = mutable(Dense(nout(input), outsize, act), input)
dense (generic function with 2 methods)

julia> v1 = dense(3, 4, relu);

julia> v2 = dense(v1, 3, elu);

julia> v3 = concat(v1, v2);

julia> v4 = dense(v3, 4, relu);

julia> v5 = v1 + v4;

julia> graph = CompGraph(inputs(v1), v5);

julia> graph(ones(3, 2))
4×2 Array{Float32,2}:
 0.289443  0.289443
 1.15558   1.15558
 0.852968  0.852968
 0.0       0.0

# Vertices can ofc also be called as functions

julia> v1(ones(3,2))
4×2 Array{Float32,2}:
 0.0       0.0
 1.15558   1.15558
 0.612048  0.612048
 0.0       0.0

@DhairyaLGandhi
Member

Making such layers composable, allowing parallelization over tasks and reduction via user-defined functions, and performance considerations would be some of the things to get a grip on.
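
For concreteness, "parallelization over tasks" could eventually look something like the sketch below; threaded_parallel is a made-up helper for illustration, not something proposed in this PR:

# Illustrative only: evaluate independent branches on separate tasks.
function threaded_parallel(fs::Tuple, x)
    tasks = map(f -> Threads.@spawn(f(x)), fs)
    map(fetch, tasks)
end

# e.g. threaded_parallel((Dense(3, 4), Dense(3, 2)), rand(Float32, 3))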

@DrChainsaw
Contributor

Those are indeed important issues. I see it as one advantage of the traditional static computation graph structure (e.g. TensorFlow, ONNX, Dagger, etc.) that its properties are well known, so assessing those issues ought to be easier. Are there any particular drawbacks which make it something to stay clear of?

On a higher level, what are the missing things in Flux which one seeks to address here? For example, one drawback of just using raw Julia functions (which is a very appealing property of frameworks like Flux and PyTorch imo) is that things like params and gpu do not "just work". Would the issues go away if they did work?

I'd be happy to do some analysis of the mentioned issues for the standard static graph case, but I'm thinking it will be more than a couple of lines, so maybe an issue tracker is not the right place to post it.

Oh, and if these questions have already been discussed and decided upon in some other forum then just let me know and I'll stop pestering you guys about this matter. No feelings hurt, I promise :)

Member

@darsnack left a comment


I added some comments about performance changes.

I think I might be missing something, but I don't see the purpose of the docs addition. I agree there should be a section about using these advanced layers, but as far as I can tell, the current docs addition just walks through reimplementing the layers already in structure.jl. It would be more useful to walk through how these layers can be used for things like inception modules.

On that last point, can we update these layers to support arbitrary reduce operators? If so, then Parallel could replace SkipConnection. The existence of both seems redundant to me.
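
For illustration, a reduction-parameterized Parallel could be as small as the sketch below (field and constructor names are placeholders, not this PR's code):

using Flux

struct Parallel{F, T<:Tuple}
    connection::F   # user-supplied reduction, e.g. vcat, +, tuple
    layers::T
end
Parallel(connection, layers...) = Parallel(connection, layers)
Flux.@functor Parallel

(m::Parallel)(x) = m.connection(map(f -> f(x), m.layers)...)

# SkipConnection(layer, +) would then just be Parallel(+, layer, identity).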


function (w::Split)(x::AbstractArray)
# may have performance issues ... some julia/flux pro should maybe rework this
tuple([w.fs[i](x) for i in 1:length(w.fs)]...)
Member


This can be improved by using map:

julia> using BenchmarkTools

julia> fs = [sum for _ in 1:10];

julia> @benchmark tuple([fs[i]($(rand(1000))) for i in 1:length(fs)]...)
BenchmarkTools.Trial:
  memory estimate:  720 bytes
  allocs estimate:  27
  --------------
  minimum time:     2.177 μs (0.00% GC)
  median time:      2.292 μs (0.00% GC)
  mean time:        2.437 μs (0.77% GC)
  maximum time:     192.556 μs (97.19% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark tuple(map(f -> f($(rand(1000))), fs)...)
BenchmarkTools.Trial:
  memory estimate:  464 bytes
  allocs estimate:  14
  --------------
  minimum time:     1.658 μs (0.00% GC)
  median time:      1.717 μs (0.00% GC)
  mean time:        1.832 μs (0.92% GC)
  maximum time:     172.004 μs (97.52% GC)
  --------------
  samples:          10000
  evals/sample:     10

Comment on lines +180 to +225
"""
Nop() - No-OPeration

Create a new 'Nop' layer that does nothing and just passes the value on.
This can be useful if you want to pass the output of a split layer directly
to the output.

The input 'x' must be in a common input format. The output 'y' has the same
format as the input.

The internal action is `x->x`

# Example
```
julia> using Flux

julia> using CUDA

julia> model = Chain(
           Dense(1, 2),
           Split(
               Dense(2, 1),
               Nop()
           )
       )

julia> model(cu(rand(1)))
(Float32[0.12], Float32[0.52, 0.23])
```
"""

struct Nop
empt # workaround?, 'empt' ~ empty
end

function Nop()
Nop(nothing)
end

function (w::Nop)(x::AbstractArray)
return x
end

function Base.show(io::IO, j::Nop)
print(io, "Nop")
end
Member


Can't identity serve as Nop? Then there is no need to create an additional type.
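
For example (a sketch, reusing the Split example from the docstring above):

model = Chain(
    Dense(1, 2),
    Split(
        Dense(2, 1),
        identity,   # passes its branch input through unchanged, like Nop()
    ),
)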


function (w::Parallel)(x::AbstractArray)
# may have performance issues ... some julia/flux pro should maybe rework this
vcat([w.fs[i](x) for i in 1:length(w.fs)]...)
Member


Same thing here. It would be faster as:

mapreduce(f -> f(x), vcat, w.fs)

Member


I agree on this point. Of course, we need the reduction operator to be user defined, but that's separate.

function (w::Join)(t::Tuple)
# may have performance issues ... some julia/flux pro should maybe rework this
@assert length(w.fs) == length(t)
vcat([w.fs[i](t[i]) for i in 1:length(w.fs)]...)
Member


Same again, but map over both the functions and the input:

mapreduce((f, x) -> f(x), vcat, w.fs, t)

@sebastian-engel
Author

@darsnack Yes, the docs duplicate the implementation. When I created this, the idea was that the docs might be accepted more quickly, so users wouldn't have to search for everything themselves. Nice optimizations, btw.

@DrChainsaw
Contributor

I guess the silence around my questions about a more traditional DAG format means "no" then :)

For the proposal here, I think you need to let the user supply the reduction method. For example, conv outputs need something like (outs...) -> cat(outs...; dims=3) as the reduction function to concatenate along the channel dimension. Also, if someone instead wants to combine using + or *, should they then need to create new layers that are identical except for the combine method?
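
Hypothetical usage, assuming the combining function became a constructor argument (the Parallel(combine, layers...) signature here is just for illustration):

# Concatenate two conv branches along the channel dimension (dim 3 in Flux's WHCN layout).
concat_channels(outs...) = cat(outs...; dims=3)

block = Parallel(concat_channels,
    Conv((1, 1), 16 => 8, relu),
    Conv((3, 3), 16 => 8, relu; pad=1),
)
# block(rand(Float32, 32, 32, 16, 1)) would then have 8 + 8 = 16 output channels.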

@darsnack
Member

Yeah, I think once this PR adds user-supplied reduce operators and the optimizations, it should be ready to merge (we probably want to refactor the docs additions first).

I guess the silence around my questions about a more traditional DAG format means "no" then :)

😄 I think I posted some longer thoughts here already, but I feel that DAG-style execution should be in packages built on top of Flux. Similar to your comments on the linked issue, I think that the simplicity of Flux's layer API is a good thing. But the brilliance of Flux/Zygote is that CompGraph can interoperate with any other "Flux model" since all Julia functions are first-class models. So keeping the DAG execution stuff in a separate package:

  • keeps Flux.jl lean and simple
  • allows the DAG package developers to iterate more frequently (there is a lot of interesting research on this topic, and imo it would be a shame if the ability to do cool things were held back by the need for stability in Flux.jl)

@DrChainsaw
Contributor

So keeping the DAG execution stuff in a separate package: keeps Flux.jl lean and simple

Definitely agree, but why should the layers proposed here be held to a different standard, as they are clearly a way to express the DAG and contribute to making Flux less lean and simple?

One risk I see with this approach is death by a thousand cuts: just as SkipConnection was used as the motivation for these layers, these layers may be used as motivation to push further convenience stuff in. I don't really have a horse in this race though, as I'm not a maintainer.

I basically just blurted out a suggestion that doing something like Keras' Sequential vs. Model (assuming that is still a thing there, I haven't used it in a long time) might end up being more maintainable in the long run, since the latter is supported by a few generic building blocks. I don't think this has anything to do with execution of the DAG, which can be seen as a separate issue imo.

@CarloLucibello
Member

Maybe we should just document some generic pattern for creating arbitrarily complex models. For instance, this would be very familiar to PyTorch users:

using Flux 

struct NN
    layers
end

Flux.@functor NN

function NN()
    layers = [Dense(5,5), Dense(5,3)]
    return NN(layers)
end

function (m::NN)(x)
    x = relu.(x + m.layers[1](x))
    return m.layers[2](x)
end

model = NN()

@assert length(Flux.params(model)) == 4

ŷ = model(randn(5, 3))
@assert size(ŷ) == (3, 3)

Additionally, we could implement a Model abstract type or a @model macro (in place of @functor) to provide some additional basic functionality.

@darsnack
Member

darsnack commented Nov 16, 2020

Definitely agree, but why should the layers proposed here be held to a different standard, as they are clearly a way to express the DAG and contribute to making Flux less lean and simple?

This is a great point, and I agree that it is generally better not to introduce new layers (or types in general) unless the benefit crosses some threshold. For me, that criterion is: does a new layer collect parameters in a useful way (e.g. Dense), or does dispatching on the type introduce a benefit? The second part of the criterion is why I think this PR should be more than just documentation on complex models. I apologize in advance for the long example, but I will give a TL;DR: I think Parallel should deprecate SkipConnection as it generalizes it, but Join/Split can be removed from the PR.

The ability to dispatch is crucial for writing functions in code of the form f(m) -> m' (where f transforms a model m to m'). A very practical example of this is pruning though there are many others. There are several vision models that have Parallel-like structures (e.g. ResNet and GoogLeNet). Like you mentioned, many mainstream frameworks don't provide a specific layer for these structures (or any standard at all), and it is effectively user implemented per model type. This ends up being a PITA for writing code in the form I mentioned, but there is little point in fixing it for Python-based frameworks because there is no multiple dispatch. I think we have the opportunity to do better with Flux. Let me walk you through what I mean. In PyTorch (my chosen victim), I might have the following function:

def f(m):
    if isinstance(m, BasicBlock):
        # code
    elif isinstance(m, Bottleneck):
        # code
    elif isinstance(m, InceptionModule):
        # code
    elif isinstance(m, Conv2d):
        # code
    elif isinstance(m, Linear):
        # code...you get the idea

    return m

The # code sections are long, and the number of cases is not small, leading to a mega-function that is annoying. One thing you can do is separate each # code block into its own function:

def f_basicblock(m):
    # code
def f_bottleneck(m):
    # code
def f_inception(m):
    # code
def f_conv(m):
   # code
def f_linear(m):
   # code

def f(m):
    if isinstance(m, BasicBlock):
        return f_basicblock(m)
    elif isinstance(m, Bottleneck):
        return f_bottleneck(m)
    elif isinstance(m, InceptionModule):
        return f_inception(m)
    elif isinstance(m, Conv2d):
        return f_conv(m)
    elif isinstance(m, Linear):
        return f_linear(m)

At this point you've basically implemented f as manual multiple dispatch. So, we could do the same in Julia as:

f(m::BasicBlock) = # code
f(m::Bottleneck) = # code
f(m::InceptionModule) = # code
f(m::Conv2d) = # code
f(m::Linear) = # code

But the real problem is that BasicBlock, Bottleneck, and InceptionModule are all the same general type of layer with the specifics changed. But since this is PyTorch and not Flux, you can't use Python functions like map, reduce, or splatting to generalize without a performance cost (or maybe there is no cost; either way PyTorch doesn't have a general layer). Thus, even though # code is effectively the same for all those layers, you have different methods for each one. So you might consider a solution with a "de-sugared" function like:

def f_parallel(branches, reduce_op):
    # shared code
def f_basicblock(m):
   # de-sugars the fields of m to call the generic f_parallel
# ... omitted to be concise
f(branches::AbstractVector, reduce_op) = # shared code
f(m::BasicBlock) = # de-sugar
# ...

Still, if all three layers were the same type, we could write just one generic function in Julia that worked for all layers instead of a generic function + 3 routing functions:

f(m::Parallel) = # shared code

Okay, so hopefully you can see that having a Parallel type is useful, because it lets you dispatch on a common structure in vision models to write generic code once. I feel that Join/Split are natural partners to Parallel, but I wouldn't be upset if they were excluded. I don't think I've ever encountered them personally in a model. At the start of this year, I would have agreed that Parallel should not be a standard layer, and that we should only guide users on how to implement such layers. But these layer structures are inevitably going to appear in models, and if we have a standard type that all those models use, then the burden on maintainers of higher-level ecosystem packages is reduced. It also makes end-user code easier to write.

At the end of the day, a neural network is a graph by definition. So I don't think a certain layer should be excluded from Flux just because it can be seen as representing a subgraph. Taking that to one extreme, we would have no layers and just closures over calls to NNlib. At the other extreme, we would have every model represented by a graph data structure instead of functions and structs. Clearly, a happy medium is probably going to be the most pleasant to use, and hopefully I have convinced everyone that Parallel is useful enough to be part of that happy medium 😄. Also, importantly, there is no reason the two extremes can't exist in conjunction with whatever we settle on (and have full interop with Flux).
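
As a toy illustration of the kind of generic code this enables, assuming a Parallel type with connection and layers fields (count_parallel_branches is made up for this example, not a Flux function):

using Flux

count_parallel_branches(m::Parallel) = length(m.layers) + sum(count_parallel_branches, m.layers)
count_parallel_branches(m::Chain) = sum(count_parallel_branches, m.layers)
count_parallel_branches(m) = 0   # Dense, Conv, activation functions, ...

model = Chain(Dense(4, 4), Parallel(+, Dense(4, 4), identity))
count_parallel_branches(model)   # == 2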

@DrChainsaw mentioned this pull request Dec 19, 2020
@darsnack mentioned this pull request Jan 12, 2021
bors bot added a commit that referenced this pull request Jan 14, 2021
1462: Add Parallel layer r=DhairyaLGandhi a=darsnack

Since #1289 stalled, I have added an implementation of `Parallel` with some of the changes we discussed during ML calls. This version excludes most of the structural layers in #1289 like `Join`, `Split`, and `Nop`. I also added the ability for the user to specify the reduction operator. If it is acceptable, I would like to remap `SkipConnection` to `Parallel` (not a deprecation exactly).

The reason for submitting this PR now is because I am creating pre-trained weights for the networks in FluxML/Metalhead.jl#70, and there is a lot of code that can be replaced with a `Parallel`. So, I'd like to have `Parallel` in Flux before continuing with training to make the process easier.

### PR Checklist

- [x] Tests are added
- [x] Entry in NEWS.md
- [x] Documentation, if applicable
- [x] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello

Co-authored-by: Kyle Daruwalla <daruwalla@wisc.edu>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
@ToucheSir mentioned this pull request Aug 24, 2021
@bgroenks96

I'd like to bump this. It looks like there are still some outstanding fixes that need to be made to this PR, right?

I'm having to write my own Split layer for what I am currently working on... it's simple enough, but it would be nice if this were just provided by Flux.

@darsnack
Member

A Parallel layer has been added since this PR. You can use Parallel with tuple as the reduction operator to create a Split.

Split(branch, branches...) = Parallel(tuple, branch, branches...)

Calling this with a single input will pass the same input to each branch.

@HenriDeh
Contributor

I think his point is that it should be pre-built in Flux, not because it's hard to do yourself (it is not), but because it is used frequently enough in ML to justify it. And I agree: many models predict Gaussian means and standard deviations, and these have two heads.

Pre-building this makes Flux more accessible to newcomers and ML practitioners who are not that proficient in Julia.
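
For reference, such a two-headed model can already be written as a small custom layer (a sketch; GaussianHead is an illustrative name, not a Flux layer):

using Flux

struct GaussianHead{M, S}
    μ::M
    σ::S
end
Flux.@functor GaussianHead

GaussianHead(in::Integer, out::Integer) =
    GaussianHead(Dense(in, out), Dense(in, out, softplus))

(h::GaussianHead)(x) = (h.μ(x), h.σ(x))   # two heads: mean and (positive) std

model = Chain(Dense(4, 16, relu), GaussianHead(16, 2))
μ, σ = model(rand(Float32, 4))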

@ToucheSir
Member

I agree with @darsnack that Flux should not (except for very specific circumstances) be in the business of adding aliases purely for convenience. IMO Split does not pass muster because

  1. It's not a named layer in the literature like e.g. Maxout or SkipConnection.
  2. The name doesn't describe the actual behaviour that well (coming back to this issue after some time, I expected to see an implementation that partitioned the input array into chunks). I'd argue Parallel is a better name, because it highlights the independence of each function application. FanOut would be the most semantically descriptive, but that might confuse users without a CS background.
  3. The accessibility argument doesn't make sense to me, as no other framework has an equivalent layer. For the mean and stddev example, x -> (mean(x), std(x)) is arguably clearer and requires no Flux-specific functionality.

That said, Parallel(tuple, branch, branches...) does not work either, because the reduction there is done pairwise.¹ We've discussed making the reduction (optionally) non-pairwise, so if that's figured out then the need for a separate Split layer disappears entirely (see the short illustration after the footnote below).

Footnotes

  ¹ That works just fine for Join, however.
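
To make the pairwise point concrete (a minimal REPL illustration, not Flux code):

julia> foldl(tuple, (1, 2, 3))   # pairwise reduction nests the result
((1, 2), 3)

julia> tuple(1, 2, 3)            # what a Split would actually want
(1, 2, 3)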

@DhairyaLGandhi
Member

Reusing code from the right place is important, and I agree Split is too ambiguous about its behavior to be reliably used in a wide enough range of cases. We face the same issue with Parallel, but since it's something that is very widespread, we basically need to choose how we want it to behave and go with it.

@darsnack
Member

I'm closing this as most of the work has been merged into Flux for quite some time, and the original author does not appear to be responsive. As for Split and Join, these are documented in Flux, but the maintainers seem to agree that they don't meet the threshold for inclusion in the source code.

I would suggest opening an issue (or filing a PR) if there's some piece here that anyone still wants to see completed.
