
More complex models: multiple inputs, multiple outputs, and parallel paths inside a model #1289

Closed
wants to merge 2 commits

Conversation

sebastian-engel

In coordination with @DhairyaLGandhi, I added new sections to the docs and new layers for a feature that I have been missing for some time now:

  • Multiple Input Layer ("Join")
  • Multiple Output Layer ("Split")
  • Multiple Internal Path Layer ("Parallel")

The documentation has been extended with examples for custom implementations of these layers (see "advanced model building"), and the new layers Join, Split and Parallel have been added to the src.

The new layers run on GPU (not yet tested with more complex layers, e.g. conv). Any number of inputs, outputs, or parallel paths should be possible. Basic tests and documentation are also included. Maybe a real Julia pro can take a look at the layers to improve the performance or replace vcat with something better.

Additionally, a "No-OPeration" layer ("Nop") is added, so that paths without an action are also possible without having to write x -> x.

I tried to keep the design simple. I would welcome ideas and comments, and I hope these changes (or a similar implementation) can be accepted so that we can use them soon.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

… custom layer description, and in the src as layers 'Join', 'Split', and 'Parallel'
@DhairyaLGandhi
Member

DhairyaLGandhi commented Jul 21, 2020

I'd forgo the layers for now and focus on the design aspect for this PR; we can revisit the layers in a separate PR.

@HenriDeh
Contributor

HenriDeh commented Jul 28, 2020

Is it not a performance problem that the Customxxx structs are not parametric? How about defining the struct like

struct CustomLayer{T <: Tuple}
    fs::T
end

so that the compiler can generate optimized code?

My bad, I just saw that the code in your local branch has parametric types.

@HenriDeh
Contributor

I would also suggest changing vcat([w.fs[i](t[i]) for i in 1:length(w.fs)]...) to a generator expression: vcat((w.fs[i](t[i]) for i in 1:length(w.fs))...).

This makes the function type-stable and therefore probably much faster :)

@DrChainsaw
Contributor

Can all types of connections be expressed with these types of layers, or will there still be 'entangled' cases which can't? What's the advantage vs. just going full static computation graph like Keras/TorchScript/ONNX?

@sebastian-engel
Author

@HenriDeh
Great improvement ideas! Thank you! ;)

@DrChainsaw

Can all types of connections be expressed with these types of layers, or will there still be 'entangled' cases which can't? What's the advantage vs. just going full static computation graph like Keras/TorchScript/ONNX?

I have only tested this with Dense layers and CUDA. Because of vcat, I am unsure whether conv or other layers will work. Concerning your second point: I am not sure why Flux.jl shouldn't have these features?

@HenriDeh
Contributor

One more, then: I just read in the Flux documentation's performance tips that using

reduce(vcat, (w.fs[i](t[i]) for i in 1:length(w.fs)))

is better optimized than splatting

vcat((w.fs[i](t[i]) for i in 1:length(w.fs))...)
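
Putting the parametric struct and the reduce form together, the Join forward pass could look roughly like this (a sketch only, not the code in this PR):

struct Join{T<:Tuple}
    fs::T
end
Join(fs...) = Join(fs)

function (w::Join)(t::Tuple)
    @assert length(w.fs) == length(t)
    # reduce over a generator avoids both the temporary array and the splat
    reduce(vcat, (w.fs[i](t[i]) for i in 1:length(w.fs)))
end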

@DhairyaLGandhi
Member

Bump

@DrChainsaw
Contributor

I’d be happy to help out here, but I don’t want to get in the way or cause divergence.

What are the design aspects which need to be discussed?

I have a framework-agnostic implementation of a standard static computation graph here. Its current home package comes with a lot of functionality for programmatically modifying the graph, but it could easily be moved to its own package with very few (if any) dependencies if there is interest. Flux could then either depend on it or (perhaps better) just direct people to that package.

There is nothing stopping anyone from trying it out today though. The API is a bit clunky as it was not designed to be used by humans, but one can easily make their own leaner version like I have done below:

using NaiveNASflux

# Works for all Flux layers, but let's make a simple dense example here
julia> dense(insize::Integer, outsize, act) = dense(inputvertex("in", insize), outsize, act)

julia> dense(input::AbstractVertex, outsize, act) = mutable(Dense(nout(input), outsize, act), input)
dense (generic function with 2 methods)

julia> v1 = dense(3, 4, relu);

julia> v2 = dense(v1, 3, elu);

julia> v3 = concat(v1, v2);

julia> v4 = dense(v3, 4, relu);

julia> v5 = v1 + v4;

julia> graph = CompGraph(inputs(v1), v5);

julia> graph(ones(3, 2))
4×2 Array{Float32,2}:
 0.289443  0.289443
 1.15558   1.15558
 0.852968  0.852968
 0.0       0.0

# Vertices can ofc also be called as functions

julia> v1(ones(3,2))
4×2 Array{Float32,2}:
 0.0       0.0
 1.15558   1.15558
 0.612048  0.612048
 0.0       0.0

@DhairyaLGandhi
Member

Making such layers composable, allowing parallelization over tasks and reduction via user-defined functions, and performance considerations would be some of the things to get a grip on.
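
For concreteness, "parallelization over tasks" could eventually look something like the sketch below; threaded_parallel is a made-up helper for illustration, not something proposed in this PR:

# Illustrative only: evaluate independent branches on separate tasks.
function threaded_parallel(fs::Tuple, x)
    tasks = map(f -> Threads.@spawn(f(x)), fs)
    map(fetch, tasks)
end

# e.g. threaded_parallel((Dense(3, 4), Dense(3, 2)), rand(Float32, 3))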

@DrChainsaw
Contributor

Those are indeed important issues. I see it as one advantage of the traditional static computation graph structure (e.g. TensorFlow, ONNX, Dagger, etc.) that its properties are well known, so assessing those issues ought to be easier. Are there any particular drawbacks which make it something to stay clear of?

On a higher level, what are the missing things in Flux which one seeks to address here? For example, one drawback of just using raw Julia functions (which is a very appealing property of frameworks like Flux and PyTorch imo) is that things like params and gpu do not "just work". Would the issues go away if they did work?

I'd be happy to do some analysis of the mentioned issues for the standard static graph case, but I'm thinking it will be more than a couple of lines, so maybe an issue tracker is not the right place to post it.

Oh, and if these questions have already been discussed and decided upon in some other forum then just let me know and I'll stop pestering you guys about this matter. No feelings hurt, I promise :)

Member

@darsnack left a comment


I added some comments about performance changes.

I think I might be missing something, but I don't see the purpose of the docs addition. I agree there should be a section about using these advanced layers, but as far as I can tell, the current docs addition just walks through reimplementing the layers already in structure.jl. It would be more useful to walk through how these layers can be used for things like inception modules.

On that last point, can we update these layers to support arbitrary reduce operators? If so, then Parallel could replace SkipConnection. The existence of both seems redundant to me.
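
For illustration, a reduction-parameterized Parallel could be as small as the sketch below (field and constructor names are placeholders, not this PR's code):

using Flux

struct Parallel{F, T<:Tuple}
    connection::F   # user-supplied reduction, e.g. vcat, +, tuple
    layers::T
end
Parallel(connection, layers...) = Parallel(connection, layers)
Flux.@functor Parallel

(m::Parallel)(x) = m.connection(map(f -> f(x), m.layers)...)

# SkipConnection(layer, +) would then just be Parallel(+, layer, identity).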


function (w::Split)(x::AbstractArray)
# may have performance issues ... some julia/flux pro should maybe rework this
tuple([w.fs[i](x) for i in 1:length(w.fs)]...)
Member


This can be improved by using map:

julia> using BenchmarkTools

julia> fs = [sum for _ in 1:10];

julia> @benchmark tuple([fs[i]($(rand(1000))) for i in 1:length(fs)]...)
BenchmarkTools.Trial:
  memory estimate:  720 bytes
  allocs estimate:  27
  --------------
  minimum time:     2.177 μs (0.00% GC)
  median time:      2.292 μs (0.00% GC)
  mean time:        2.437 μs (0.77% GC)
  maximum time:     192.556 μs (97.19% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark tuple(map(f -> f($(rand(1000))), fs)...)
BenchmarkTools.Trial:
  memory estimate:  464 bytes
  allocs estimate:  14
  --------------
  minimum time:     1.658 μs (0.00% GC)
  median time:      1.717 μs (0.00% GC)
  mean time:        1.832 μs (0.92% GC)
  maximum time:     172.004 μs (97.52% GC)
  --------------
  samples:          10000
  evals/sample:     10

Comment on lines +180 to +225
"""
Nop() - No-OPeration

Create a new 'Nop' layer that does nothing and just passes the value on.
This can be useful if you want to pass the output of a split layer directly
to the output.

The input 'x' must be in a common input format. The output 'y' has the same
format as the input.

The internal action is `x->x`

# Example
```
julia> using Flux

julia> using CUDA

julia> model = Chain(
           Dense(1, 2),
           Split(
               Dense(2, 1),
               Nop()
           )
       )

julia> model(cu(rand(1)))
(Float32[0.12], Float32[0.52, 0.23])
```
"""

struct Nop
empt # workaround?, 'empt' ~ empty
end

function Nop()
Nop(nothing)
end

function (w::Nop)(x::AbstractArray)
return x
end

function Base.show(io::IO, j::Nop)
print(io, "Nop")
end
Member


Can't identity serve as Nop? Then there is no need to create an additional type.
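
For example (a sketch, reusing the Split example from the docstring above):

model = Chain(
    Dense(1, 2),
    Split(
        Dense(2, 1),
        identity,   # passes its branch input through unchanged, like Nop()
    ),
)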


function (w::Parallel)(x::AbstractArray)
# may have performance issues ... some julia/flux pro should maybe rework this
vcat([w.fs[i](x) for i in 1:length(w.fs)]...)
Member


Same thing here. It would be faster as:

mapreduce(f -> f(x), vcat, w.fs)

Member


I agree on this point. Of course, we need the reduction operator to be user defined, but that's separate.

function (w::Join)(t::Tuple)
# may have performance issues ... some julia/flux pro should maybe rework this
@assert length(w.fs) == length(t)
vcat([w.fs[i](t[i]) for i in 1:length(w.fs)]...)
Member


Same again, but map over both the functions and the input:

mapreduce((f, x) -> f(x), vcat, w.fs, t)

@sebastian-engel
Author

@darsnack Yes, the docs duplicate the implementation. When I created this, the idea was that the docs might be accepted more quickly, so users wouldn't have to search for everything themselves. Nice optimizations, btw.

@DrChainsaw
Contributor

I guess the silence around my questions about a more traditional DAG format means "no" then :)

For the proposal here, I think you need to let the user supply the reduction method. For example, conv outputs need something like (outs...) -> cat(outs...; dims=3) as the reduction function to concatenate along the channel dimension. Also, if someone instead wants to combine using + or *, should they then need to create new layers that are identical except for the combine method?
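
Hypothetical usage, assuming the combining function became a constructor argument (the Parallel(combine, layers...) signature here is just for illustration):

# Concatenate two conv branches along the channel dimension (dim 3 in Flux's WHCN layout).
concat_channels(outs...) = cat(outs...; dims=3)

block = Parallel(concat_channels,
    Conv((1, 1), 16 => 8, relu),
    Conv((3, 3), 16 => 8, relu; pad=1),
)
# block(rand(Float32, 32, 32, 16, 1)) would then have 8 + 8 = 16 output channels.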

@darsnack
Member

Yeah, I think once this PR adds user-supplied reduce operators and the optimizations, it should be ready to merge (we probably want to refactor the docs additions first).

I guess the silence around my questions about a more traditional DAG format means "no" then :)

😄 I think I posted some longer thoughts here already, but I feel that DAG-style execution should be in packages built on top of Flux. Similar to your comments on the linked issue, I think that the simplicity of Flux's layer API is a good thing. But the brilliance of Flux/Zygote is that CompGraph can interoperate with any other "Flux model" since all Julia functions are first-class models. So keeping the DAG execution stuff in a separate package:

  • keeps Flux.jl lean and simple
  • allows the DAG package developers to iterate more frequently (there is a lot of interesting research on this topic, and imo it would be a shame if the ability to do cool things were held back by the need for stability in Flux.jl)

@DrChainsaw
Contributor

So keeping the DAG execution stuff in a separate package: keeps Flux.jl lean and simple

Definitely agree, but why should the layers proposed here be held to a different standard, as they are clearly a way to express the DAG and contribute to making Flux less lean and simple?

One risk I see with this approach is death by a thousand cuts: just as SkipConnection was used as the motivation for these layers, these layers may be used as motivation to push further convenience stuff in. I don't really have a horse in this race though, as I'm not a maintainer.

I basically just blurted out a suggestion that doing something like Keras' Sequential vs. Model (assuming that is still a thing there, I haven't used it in a long time) might end up being more maintainable in the long run, since the latter is supported by a few generic building blocks. I don't think this has anything to do with execution of the DAG, which can be seen as a separate issue imo.

@CarloLucibello
Member

Maybe we should just document some generic pattern for creating arbitrarily complex models. For instance, this would be very familiar to PyTorch users:

using Flux 

struct NN
    layers
end

Flux.@functor NN

function NN()
    layers = [Dense(5,5), Dense(5,3)]
    return NN(layers)
end

function (m::NN)(x)
    x = relu.(x + m.layers[1](x))
    return m.layers[2](x)
end

model = NN()

@assert length(Flux.params(model)) == 4

ŷ = model(randn(5, 3))
@assert size(ŷ) == (3, 3)

Additionally, we could implement a Model abstract type or a @model macro (in place of @functor) to provide some additional basic functionality.

@darsnack
Member

darsnack commented Nov 16, 2020

Definitely agree, but why should the layers proposed here be held to a different standard, as they are clearly a way to express the DAG and contribute to making Flux less lean and simple?

This is a great point, and I agree that it is generally better not to introduce new layers (or types in general) unless the benefit crosses some threshold. For me, that criterion is: does a new layer collect parameters in a useful way (e.g. Dense), or does dispatching on the type introduce a benefit? The second part of the criterion is why I think this PR should be more than just documentation on complex models. I apologize in advance for the long example, but I will give a TL;DR: I think Parallel should deprecate SkipConnection as it generalizes it, but Join/Split can be removed from the PR.

The ability to dispatch is crucial for writing functions in code of the form f(m) -> m' (where f transforms a model m to m'). A very practical example of this is pruning though there are many others. There are several vision models that have Parallel-like structures (e.g. ResNet and GoogLeNet). Like you mentioned, many mainstream frameworks don't provide a specific layer for these structures (or any standard at all), and it is effectively user implemented per model type. This ends up being a PITA for writing code in the form I mentioned, but there is little point in fixing it for Python-based frameworks because there is no multiple dispatch. I think we have the opportunity to do better with Flux. Let me walk you through what I mean. In PyTorch (my chosen victim), I might have the following function:

def f(m):
    if isinstance(m, BasicBlock):
        # code
    elif isinstance(m, Bottleneck):
        # code
    elif isinstance(m, InceptionModule):
        # code
    elif isinstance(m, Conv2d):
        # code
    elif isinstance(m, Linear):
        # code...you get the idea

    return m

The # code sections are long, and the number of cases is not small, leading to a mega-function that is annoying. One thing you can do is separate each # code block into its own function:

def f_basicblock(m):
    # code
def f_bottleneck(m):
    # code
def f_inception(m):
    # code
def f_conv(m):
   # code
def f_linear(m):
   # code

def f(m):
    if isinstance(m, BasicBlock):
        return f_basicblock(m)
    elif isinstance(m, Bottleneck):
        return f_bottleneck(m)
    elif isinstance(m, InceptionModule):
        return f_inception(m)
    elif isinstance(m, Conv2d):
        return f_conv(m)
    elif isinstance(m, Linear):
        return f_linear(m)

At this point you've basically implemented f as manual multiple dispatch. So, we could do the same in Julia as:

f(m::BasicBlock) = # code
f(m::Bottleneck) = # code
f(m::InceptionModule) = # code
f(m::Conv2d) = # code
f(m::Linear) = # code

But the real problem is that BasicBlock, Bottleneck, and InceptionModule are all the same general type of layer with the specifics changed. But since this is PyTorch and not Flux, you can't use Python functions like map, reduce, or splatting to generalize without a performance cost (or maybe there is no cost; either way PyTorch doesn't have a general layer). Thus, even though # code is effectively the same for all those layers, you have different methods for each one. So you might consider a solution with a "de-sugared" function like:

def f_parallel(branches, reduce_op):
    # shared code
def f_basicblock(m):
   # de-sugars the fields of m to call the generic f_parallel
# ... omitted to be concise
f(branches::AbstractVector, reduce_op) = # shared code
f(m::BasicBlock) = # de-sugar
# ...

Still, if all three layers were the same type, we could write just one generic function in Julia that worked for all layers instead of a generic function + 3 routing functions:

f(m::Parallel) = # shared code

Okay, so hopefully you can see that having a Parallel type is useful, because it lets you dispatch on a common structure in vision models to write generic code once. I feel that Join/Split are natural partners to Parallel, but I wouldn't be upset if they were excluded. I don't think I've ever encountered them personally in a model. At the start of this year, I would have agreed that Parallel should not be a standard layer, and that we should only guide users on how to implement such layers. But these layer structures are inevitably going to appear in models, and if we have a standard type that all those models use, then the burden on maintainers of higher-level ecosystem packages is reduced. It also makes end-user code easier to write.

At the end of the day, a neural network is a graph by definition. So I don't think a certain layer should be excluded from Flux just because it can be seen as representing a subgraph. Taking that to one extreme, we would have no layers and just closures over calls to NNlib. At the other extreme, we would have every model represented by a graph data structure instead of functions and structs. Clearly, a happy medium is probably going to be the most pleasant to use, and hopefully I have convinced everyone that Parallel is useful enough to be part of that happy medium 😄. Also, importantly, there is no reason the two extremes can't exist in conjunction with whatever we settle on (and have full interop with Flux).
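
As a toy illustration of the kind of generic code this enables, assuming a Parallel type with connection and layers fields (count_parallel_branches is made up for this example, not a Flux function):

using Flux

count_parallel_branches(m::Parallel) = length(m.layers) + sum(count_parallel_branches, m.layers)
count_parallel_branches(m::Chain) = sum(count_parallel_branches, m.layers)
count_parallel_branches(m) = 0   # Dense, Conv, activation functions, ...

model = Chain(Dense(4, 4), Parallel(+, Dense(4, 4), identity))
count_parallel_branches(model)   # == 2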

@DrChainsaw mentioned this pull request Dec 19, 2020
@darsnack mentioned this pull request Jan 12, 2021
bors bot added a commit that referenced this pull request Jan 14, 2021
1462: Add Parallel layer r=DhairyaLGandhi a=darsnack

Since #1289 stalled, I have added an implementation of `Parallel` with some of the changes we discussed during ML calls. This version excludes most of the structural layers in #1289 like `Join`, `Split`, and `Nop`. I also added the ability for the user to specify the reduction operator. If it is acceptable, I would like to remap `SkipConnection` to `Parallel` (not a deprecation exactly).

The reason for submitting this PR now is because I am creating pre-trained weights for the networks in FluxML/Metalhead.jl#70, and there is a lot of code that can be replaced with a `Parallel`. So, I'd like to have `Parallel` in Flux before continuing with training to make the process easier.

### PR Checklist

- [x] Tests are added
- [x] Entry in NEWS.md
- [x] Documentation, if applicable
- [x] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello

Co-authored-by: Kyle Daruwalla <daruwalla@wisc.edu>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
@ToucheSir mentioned this pull request Aug 24, 2021
@bgroenks96

I'd like to bump this. It looks like there are still some outstanding fixes that need to be made to this PR, right?

I'm having to write my own Split layer for what I am currently working on... it's simple enough, but it would be nice if this were just provided by Flux.

@darsnack
Member

A Parallel layer has been added since this PR. You can use Parallel with tuple as the reduction operator to create a Split.

Split(branch, branches...) = Parallel(tuple, branch, branches...)

Calling this with a single input will pass the same input to each branch.

@HenriDeh
Contributor

I think his point is that it should be pre-built in Flux, not because it's hard to do yourself (it is not), but because it is used frequently enough in ML to justify it. And I agree: many models predict Gaussian means and standard deviations, and these have two heads.

Pre-building this makes Flux more accessible to newcomers and ML practitioners who are not that proficient in Julia.
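
For reference, such a two-headed model can already be written as a small custom layer (a sketch; GaussianHead is an illustrative name, not a Flux layer):

using Flux

struct GaussianHead{M, S}
    μ::M
    σ::S
end
Flux.@functor GaussianHead

GaussianHead(in::Integer, out::Integer) =
    GaussianHead(Dense(in, out), Dense(in, out, softplus))

(h::GaussianHead)(x) = (h.μ(x), h.σ(x))   # two heads: mean and (positive) std

model = Chain(Dense(4, 16, relu), GaussianHead(16, 2))
μ, σ = model(rand(Float32, 4))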

@ToucheSir
Member

I agree with @darsnack that Flux should not (except for very specific circumstances) be in the business of adding aliases purely for convenience. IMO Split does not pass muster because

  1. It's not a named layer in the literature like e.g. Maxout or SkipConnection.
  2. The name doesn't describe the actual behaviour that well (coming back to this issue after some time, I expected to see an implementation that partitioned the input array into chunks). I'd argue Parallel is a better name, because it highlights the independence of each function application. FanOut would be the most semantically descriptive, but that might confuse users without a CS background.
  3. The accessibility argument doesn't make sense to me, as no other framework has an equivalent layer. For the mean and stddev example, x -> (mean(x), std(x)) is arguably clearer and requires no Flux-specific functionality.

That said, Parallel(tuple, branch, branches...) does not work either, because the reduction there is done pairwise.¹ We've discussed making the reduction (optionally) non-pairwise, so if that's figured out then the need for a separate Split layer disappears entirely (see the short illustration after the footnote below).

Footnotes

  ¹ That works just fine for Join, however.
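
To make the pairwise point concrete (a minimal REPL illustration, not Flux code):

julia> foldl(tuple, (1, 2, 3))   # pairwise reduction nests the result
((1, 2), 3)

julia> tuple(1, 2, 3)            # what a Split would actually want
(1, 2, 3)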

@DhairyaLGandhi
Member

Reusing code from the right place is important, and I agree Split is too ambiguous about its behavior to be reliably used in a wide enough range of cases. We face the same issue with Parallel, but since it's something that is very widespread, we basically need to choose how we want it to behave and go with it.

@darsnack
Member

I'm closing this as most of the work has been merged into Flux for quite some time, and the original author does not appear to be responsive. As for Split and Join, these are documented in Flux, but the maintainers seem to agree that they don't meet the threshold for inclusion in the source code.

I would suggest opening an issue (or filing a PR) if there's some piece here that anyone still wants to see completed.
