
Allow Zeros with dimensions #1402

Closed
wants to merge 17 commits

Conversation

DhairyaLGandhi
Member

DhairyaLGandhi commented Nov 19, 2020

With apologies to @mcabbott and @DrChainsaw :)

This PR relates to the changes in #1379 and moves some important things about. I feel uncomfortable allowing something that naturally translates as a vector or an array to become a special type without being consistent with the behaviour of the type it replaces (i.e. an array).

The contentions against the <: AbstractArray approach were that it complicated the implementation and didn't work well in some cases when moving to and from CUDA.

It turns out it doesn't add too much complexity either.

Some improvements from the current implementation:

  • Correct broadcasting behaviour (see tests)
  • Works with loadparams! as expected while preserving the dims of the various arguments (can add tests)
  • Formalized implementation
  • Allows biases with dimensions for different layers (sketched below)
  • CUDA compatibility (see tests)
  • Manages to keep the fast paths anyway

What it currently doesn't do:

  • printing the array - easily done
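
To make the intended usage concrete, here is a rough sketch of the constructors as they appear later in this thread (hedged: the exact API is still under review in this PR):

using Flux

b0 = Flux.Zeros()                     # dimensionless: behaves like the bias-free case on master
b1 = Flux.Zeros(3)                    # fixed, never-trained zeros that still carry a shape
c  = Conv((3, 3), 1 => 3, bias = b0)  # keyword usage, as in the loadparams! example further down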

cc @mcabbott @DrChainsaw for comments

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@DhairyaLGandhi changed the title from Dg/zeros2 to Allow Zeros with dimensions on Nov 19, 2020
@oxinabox
Member

oxinabox commented Nov 19, 2020

If we are doing this, why don't we just use FillArrays.jl?

@oxinabox
Member

oxinabox commented Nov 19, 2020

If we are doing this, why don't we just use FillArrays.jl?

answer: because these also don't have any gradients

An alternative is to use Fill(False, dims) which should always have zero gradients, being boolean
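
For illustration, a minimal sketch of that alternative (assuming FillArrays is loaded; not part of this PR):

using FillArrays

b = Fill(false, 3)         # constant Bool array; false is the additive identity
W, x = rand(3, 4), rand(4)
W * x .+ b == W * x        # true: broadcasting the Bool zeros is a numerical no-op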

@mcabbott
Member

An alternative is to use Fill(False, dims) which should always have zero gradients, being boolean

Or just false, as I keep saying.

The current implementation is more or less False, which is also good.

In layers with bias, the bias is an array, which has shape, and this has to be preserved, etc. But in layers without bias, the absence of bias doesn't have to be "consistent with the behaviour of the type it replaces" because it's, well, not there. I mean, I don't have a dog. It doesn't have a name, and it also doesn't behave like other dogs when taken for a walk...

@DrChainsaw
Contributor

Fwiw, the comment by @mcabbott about "absence of bias" is what made me prefer the current implementation over #1374.

I think the code in the PR looks like an improvement over #1374, but it seems like a couple of tests fail due to getindex not being supported. It also looked as if a gradient slipped through somewhere.

It works with loadparams! as expected while preserving the dims of the various arguments (can add tests)

Is it possible to get an ELI5 version of this? I found it easy to reason about what loadparams! (and destructure) should do when there is no bias, but I still can't wrap my head around what the expected behaviour is in all cases when it is a fixed array. I'm thinking it could still be treated as "no bias", but the quoted line makes it seem like there is something more to it.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Nov 20, 2020

But changing the behaviour because of "absence of bias" may make it diverge from what is expected from the model. The expectation of what the model is best suited to do is left to the user.

Basically, I'm saying that not training the bias is not the same as not having the bias term at all in some modelling situations.

Granted, with no bias there is nothing to compute, but in cases where the shape information is necessary, that isn't so trivial without also defining a separate implementation of the layer without the bias. With this, we can retain the fast paths of pretending that the bias doesn't exist, while giving the option of always holding it at zeros - i.e. it exists but is never trained on.

Consider a contrived version of a layer

function layer(w, b, x)
  y = w .* x
  y .+ b
end

Here the shape of the output depends on the shape of the bias.

W = rand(3,3,2) 
x = rand(3,3,2) # last dim is batch

layer(W, Flux.Zeros(), x) # mirrors current implementation
size(layer(W, Flux.Zeros(3,3,2,2,2), x )) == size(layer(W, zeros(3,3,2,2,2), x ))

This way, we can have the flexibility of choosing how Zeros behaves - both as a traditional bias (with one value per neuron) or made to respect the implementation of a layer.

@mcabbott
Member

I agree it's an important property of broadcasting that it extends dimensions based on all arguments. And it's neat that you can use arbitrary broadcasting in functions you intend to use with Flux, and arbitrary array types, including Fill. It's easy to specify which ones you consider to be trainable parameters. If you change the shape or number of arrays appearing in some expression, the output shape may well change, absolutely. It's all entirely up to you.

But the problem we are trying to solve here is much narrower. Some common ML layers exist in a form like y = σ.(W*x .+ b) and a form like y = σ.(W*x), and for these ones, this b is what people mean when they say bias. When present, it doesn't change the shape of the output, its size fits inside that of y.

For reasons of code re-use & shortness of the manual, we'd like to encode both situations as one layer type. They are, however, different models: they have different numbers of trainable parameters, either counting scalars (length(W)+length(b) vs length(W)) or counting arrays (length(params(m))). You have to choose what model you want to work with.
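
For example, a quick way to see that the two forms are different models (a sketch; parameter counts as on master at the time, where an absent bias is not collected in params):

using Flux

m_bias   = Dense(4, 3)                # y = σ.(W*x .+ b): W and b trainable
m_nobias = Dense(4, 3, bias = false)  # y = σ.(W*x): only W trainable

length(Flux.params(m_bias))       # 2 trainable arrays
length(Flux.params(m_nobias))     # 1 trainable array
sum(length, Flux.params(m_bias))  # trainable scalars: length(W) + length(b)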

There is no general expectation that pre-trained parameters saved from one model should make sense when loaded into a different model. I mean you are welcome to try, it's all just numbers, but the default behaviour of calling loadparams! with data from a clearly different model should be at least a warning, if not an error.

not training on the bias is not the same as not having the bias term

Sure, if you want a model with bias, but want to exclude it from training (or from some stages of training), that is an orthogonal concern. There are lots of reasons you may wish to freeze some of the parameters present in your model, and we have tools for that.

@CarloLucibello
Member

@mcabbott maybe you should turn the work you have done on the bias == false option into a PR, so that it doesn't get lost

@DhairyaLGandhi
Member Author

But the problem we are trying to solve here is much narrower. Some common ML layers exist in a form like y = σ.(Wx .+ b) and a form like y = σ.(Wx), and for these ones, this b is what people mean when they say bias. When present, it doesn't change the shape of the output, its size fits inside that of y.

It is not about the narrow scale of the problem per se, and that is what I implied by the traditional use of bias (apologies if I wasn't clear earlier). It is about having the flexibility of treating it as if the bias doesn't exist, treating it as a fixed array, or more generally as a transformation based on broadcasting. In this approach we can have both, basically, and that makes it strictly superior to assuming that the bias term will always act in one particular fashion.

Notice also in the example above, it isn't a traditional Dense layer we are considering - it is a learnable transformation.

You have to choose what model you want to work with.

Agree! Giving users the ability to choose between a fixed array and no bias at all within a single model seems to be in line with that. To be clear - calling Zeros() mimics the same behaviour as on master, albeit being collected in the params - an important distinction, since I'd expect you want these properties to be known while saving/loading the model.

There is no general expectation that pre-trained parameters saved from one model should make sense when loaded into a different model

That is correct, and I don't think this approach changes that in any way. Loading into a different model with an incompatible parameter structure would error; however, having multiple layers with bias turned off would not. (ref #1332 and #1277)

c = Conv((3,3), 1=>3, bias = Flux.Zeros())
ps = Flux.params(c)
c2 =  Conv((3,3), 1=>3)
Flux.loadparams!(c2, ps) # errors as expected
c3 = Conv((3,3), 1 => 3, bias = false)
Flux.loadparams!(c3, ps) # works

I haven't exactly seen anything against this approach, which is encouraging.

@DrChainsaw
Contributor

DrChainsaw commented Nov 23, 2020

It is about having the flexibility of treating it as if bias doesn't exist, or treating it as a fixed array or a more generally as a transformation based off of broadcasting.

I wouldn't mind having this flexibility as a user, although I can see how maintainers might consider it an unnecessary requirement.

To be clear - calling Zeros() mimics the same behaviour as in master albeit being collected in the params - an important distinction since I'd expect you want these properties to be known while saving/ loading the model.

Just a question for my own understanding: aren't there a lot of things which would be required to load/save a model which are not included in params (activation functions, pad, dilation, stride)?

I somehow have the mental model that users can rely on things returned in params to be mutable, but maybe this is just because they often happen to be rather than because it is a formal contract.

As an example, won't returning them in params throw a wrench in the machinery if someone uses de/restructure in the neural ODE case? I might remember it wrong, but I think that use case means that the gradient is computed not by Zygote but by an ODE solver which sees the whole concatenated array. Perhaps this use case is fixable by using trainable instead of params though...

@mcabbott
Member

Maybe we are tripping over terminology here? Bias is a constant shift applied before the nonlinearity. Constant meaning it's independent of which input image etc, of the batch index, but may vary per pixel. It shifts the kink in relu to a different point. The term is borrowed from electronics, where instead you'd shift the voltage at which some transistor switches on, etc.
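
A tiny numeric illustration of that shift:

using Flux: relu

x = [-0.4, 0.1, 0.6]
relu.(x)          # kink at zero: [0.0, 0.1, 0.6]
relu.(x .+ 0.3)   # the constant shift changes which inputs clear the kink: [0.0, 0.4, 0.9]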

Common layers with & without trainable bias are often used, and make sense for Flux to support. Ideally in a simple, easy-to-understand, easy-to-maintain way. Flux is not interested in elaborate contortions for 1% performance changes.

The example above with Flux.Zeros(3,3,2,2,2) is something else entirely, but it sheds a little light. Let me attempt to summarise the motivation, so far as I can understand it.

In some models, people may wish to replace some other array of trainable parameters with a constant array of zeros. Not just freeze them during some training, but fix permanently to a constant. Such a constant array can often be avoided, as broadcasting tends to extend a single scalar just fine, but when it is needed (e.g. because it affects the output shape), there's a well-tested & maintained package called FillArrays which provides such a thing (and is already loaded by Zygote). If this has particular bugs, then we should fix them.
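
For concreteness, a small sketch of what that FillArrays route looks like (assuming one really does want a shaped constant):

using FillArrays

b = Zeros(3, 3, 2, 2, 2)     # immutable, allocation-free constant zeros with a shape
W = rand(3, 3, 2); x = rand(3, 3, 2)
size(W .* x .+ b)            # (3, 3, 2, 2, 2): broadcasting extends dims just as a dense zeros array would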

However, the claim is that this is not acceptable, or too inconvenient, because switching back and forth between the trainable array and the constant one is something people are going to do all the time. (Perhaps so often, even, that you wish to make it easy to load parameters from the model with the trainable array, into the model without it, and make all the other parameters still line up? I'm not sure.)

Is this a common use case? If so, it should be easy to find some links to papers using such things; there are of course many papers I haven't read. If it is common, then we can discuss how best to support it, whether the constant is always zero, and whether it should share any code with the handling of bias, etc. We're much more likely to get the design right with some actual real-life examples to look at. Or if it's not, then we can move on.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Nov 27, 2020

because switching back and forth between the trainable array and the constant one is something people are going to do all the time.

I can see the constant advocacy for FillArrays, which has been noted and discussed plenty, and not that it merits repeating, but FillArrays.Fill/Zeros/Ones/False isn't something that Flux/Zygote owns, and therefore we don't want to take on the responsibility of specialising its behaviour at this time. The fact that we use FillArrays inside Zygote has to do with the efficiency of implementing reverse-mode AD. Sadly, it is not in the jurisdiction of Flux/Zygote to specialise behaviour for array types it cannot control, and doing so can open a can of worms later on - this is why Flux.Zeros is a type at all. This is also a reason why we don't make liberal use of LazyArrays either, where it might be useful. Also, if it wasn't obvious, I want to say that one design goal is to implement this feature without touching any of the layer code - constructors, forward passes etc. This way we have a better chance of making it usable with layers written elsewhere.

Using FillArrays used to be my go-to after false, but we've moved on since. Further, the fact that this comes up often seems to suggest a general desire to have an array-like construct for such an operation. Also, most code is written to accept an array for a bias, so keeping Flux.Zeros <: AbstractArray is consistent with that assumption and makes sense.

I can see the fundamental disagreement about using Zeros to represent a bias at all, and that's alright. I happen to think that this approach (in the PR) would be able to replicate most if not all of what we get from a Bool/singleton-based approach and still be general enough for extending in the future.

(Perhaps so often, even, that you wish to make it easy to load parameters from the model with the trainable array, into the model without it, and make all the other parameters still line up? I'm not sure.)

This is false, and has been said in the past. Not being able to loadparams! was an oversight earlier, and this is a bug fix. It makes no attempt to force switching out an array for Zeros or vice versa while loading parameters. In fact, both cases error, and both errors point to hitting Zeros, making them meaningful and actionable. On the contrary, this is what I see on master:

julia> d1 = Dense(1,3, bias = false)
Dense(1, 3)

julia> d2 = Dense(1,3)
Dense(1, 3)

julia> d1.b
Flux.Zeros()

julia> d2.b
3-element Array{Float32,1}:
 0.0
 0.0
 0.0

julia> Flux.loadparams!(d1, Flux.params(d2)) # should error -> can't (shouldn't) load array into Zeros/ loaded model doesn't contain the parameters from `d2`

julia> d1.b
Flux.Zeros()

julia> d2.b
3-element Array{Float32,1}:
 0.0
 0.0
 0.0

This should clearly error, since you are either not actually loading parameters properly (d1.b isn't changed after the fact), not doing the intended thing (the second argument is arguably the intent - but isn't compatible with the model the data is being loaded into), or, in scarier circumstances, causing silent incorrectness.

@ToucheSir good point! Models with Zeros could potentially follow what we have been doing for regular models, just with a zeroed-out bias (it's an error currently, because I have disabled getindex to catch unwanted errors early). It merits a discussion with people who would use it, but that is comfortably a lower priority for this feature since I haven't heard anyone come up with cases involving Zeros.

@ToucheSir
Member

ToucheSir commented Nov 27, 2020

I assume you meant to tag @DrChainsaw, but I'm happy to weigh in as well :)

Documenting the "user-space" API for bias

Regardless of what approach we use to represent bias under the hood, I think it's good to step back and consider how users are going to learn about this feature (especially since intuitiveness/ease of use was a major point in starting this discussion).

Currently, there are exactly 0 references to a Dense constructor that takes a bias on https://fluxml.ai/Flux.jl/stable/models/layers/. The Conv section does have a sentence about it, but no examples of how one might use the keyword argument or how it interacts with the positional ones (i.e. "don't do this: ..."). There's also a point to be made here about the richness and superior structure of the PyTorch and TF references over Flux, but that's a discussion for another day.

Maybe it's not bias that's the problem, but loadparams!

I'd completely forgotten that loadparams! just does a pairwise copyto! across both sets of parameters. This has plenty of failure modes. For that reason, I think using loadparams! to justify one representation of bias over another (or bias/no bias) is extremely problematic because there are other APIs that don't have the same problem!

Digging into the last point a bit more, I'm of the opinion that loadparams! and co. should mandate a properly-shaped structure or a plain Dict/mapping instead of a wrapped IdDict like Params (which can only be iterated like a vector and has no useful key information). In fact, anything that doesn't flatten the parameter hierarchy away without keeping positional metadata (names and indices) would do. With PyTorch's load_state_dict(..., strict=False), Dense(bias) -> Dense(no bias) and vice versa are a non-issue.
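
Purely as an illustration of that direction (hypothetical helper, not Flux API): a structural walk can copy what lines up and skip what doesn't, in the spirit of load_state_dict(strict=False).

# Hypothetical sketch: recurse through two models field-by-field and copy
# same-sized arrays, silently skipping anything that doesn't line up.
function load_relaxed!(dst, src)
  if dst isa AbstractArray && src isa AbstractArray
    size(dst) == size(src) && copyto!(dst, src)
  elseif nameof(typeof(dst)) == nameof(typeof(src)) && fieldcount(typeof(dst)) == fieldcount(typeof(src))
    for i in 1:fieldcount(typeof(dst))
      load_relaxed!(getfield(dst, i), getfield(src, i))
    end
  end
  return dst
end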

And yes, this does tie into #628. Given this is now the 3rd or 4th discussion in which structural traversal of models has come up (see also FluxML/Optimisers.jl#1, #1073 / FluxML/Zygote.jl#535 and FluxML/FluxML-Community-Call-Minutes#10), I think it's worth a re-visit.

@DhairyaLGandhi
Member Author

@DrChainsaw yes, indeed

Thanks!

@mcabbott
Member

mcabbott commented Nov 27, 2020

Currently, there are exactly 0 references to a Dense constructor that takes a bias on https://fluxml.ai/Flux.jl/stable/models/layers/. The Conv section does have a sentence about it

I think this has the tagged version's docstrings. The ones for Conv were re-written in #1391, and can be seen here: https://fluxml.ai/Flux.jl/dev/models/layers/#Flux.Conv . There may still be improvements to be had, of course, but I hope it's clear that it shouldn't be possible to use bias as a keyword & positional argument in the same method, and that wherever the argument appears, it always accepts the same things.

Edit -- actually only the first method is shown in the docs, the docstring for Conv(W::Array, ... is missing. Not sure why.

The Dense docstring could use some polish, there are PRs in the works.

What do you mean re PyTorch etc? They use a similar bias = False API, I think. Note that their layers broadcast, i.e. treat extra dimensions as batch dims, which #1405 will add here, too. At last.

loadparams! just does a pairwise copyto! across both sets of parameters.

It just iterates all parameters, of any treelike structure. I haven't followed all those issues, and perhaps something fancier is needed there. The bug of #1277 was precisely about an IdDict storing just one copy of the earlier Zeros() fake-array, which was one of the reasons for re-visiting and simplifying the design. Layers without bias do not have bias and so there is no bias parameter in their params, at present, logically enough.

Edit -- If we think about designing other Params-like containers, they will always have to deal with struct fields which aren't parameters, like d.σ::Function and c.strides::Tuple{Int,Int}. How layers with bias turned off fit into that is, surely, that it should be treated just like these other non-parameter fields. At the moment, this seems to be done in a few places by testing x isa AbstractArray, very simple.
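
Illustrative only (not Flux's literal code), the kind of check described here:

isparamfield(x) = x isa AbstractArray
isparamfield(rand(3))    # true  -> a parameter array
isparamfield(identity)   # false -> an activation function
isparamfield((1, 1))     # false -> a stride-like tuple
isparamfield(false)      # false -> bias turned off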

@mcabbott
Member

This should clearly error

Yes, I agree. When last I looked there was a zip which didn't check the lengths, but that should be easy to fix. I made an issue, #1408, and a fix, #1409.

This is false

OK, thank you, that bit is clearer now.

FillArrays ... isn't something that Flux/ Zygote owns

Seems OK to use a package to solve some problem for exotic models, breaking new ground. If they aren't so exotic, then let's gather some examples to guide our thinking. (Of models with a constant array which affects the size of the output, to be clear.) And find out what, if anything, other frameworks do about such models, I guess.

@ToucheSir
Member

I think this has the tagged version's docstrings. The ones for Conv were re-written in #1391, and can be seen here: https://fluxml.ai/Flux.jl/dev/models/layers/#Flux.Conv . There may still be improvements to be had, of course, but I hope it's clear that it shouldn't be possible to use bias as a keyword & positional argument in the same method, and that wherever the argument appears, it always accepts the same things.

That's exactly what I was referring to. Great to see the layer reference getting some TLC.

What do you mean re PyTorch etc?

Not the API, but the structure of the docs. Note the distinct section separators, field layout (in nice tables for TF and large bullets for PyTorch vs non-existent for Flux), prominent and consistent sections for in/output shape, etc. These are all more Documenter.jl issues though, just wanted to stick a pin in it for later.

It just iterates all parameters, of any treelike structure. I haven't followed all those issues, and perhaps something fancier is needed there. The bug of #1277 was precisely about an IdDict storing just one copy of the earlier Zeros() fake-array, which was one of the reasons for re-visiting and simplifying the design. Layers without bias do not have bias and so there is no bias parameter in their params, at present, logically enough.

I wasn't thinking about #1277, but in that context the fix makes sense.

However, what if you wanted to loadparams! from a 3-layer MLP onto a 2-layer one or vice versa? Assuming the shapes of the first two layers are the same, this should be supported in Flux. That is what strict=False in load_state_dict gives you: the ability to partially load params into a network without having to slice it up and reconstitute it afterwards (if indeed that's even possible in, e.g., the presence of custom container layers).

The issues I linked at the end were about eliminating the need for IdDict and all of the footguns that come with it. I included them because loadparams! seems to be a "wedge issue" in arguing for/against different ways to represent bias (e.g. #1402 (comment)). If loadparams! were "smarter" about not trying to load in params that don't exist, #1277 wouldn't have happened in the first place. AFAICT that requires structural information that is not present in a "bag of arrays" structure like Params, but it can be done in current Flux with other Functor-based APIs (e.g. Flux.trainable), as sketched below.
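
For reference, a hedged sketch of the Functor-based route (the layer name here is made up; the trainable override is the point):

using Flux

struct ScaleShift
  W
  b
end
Flux.@functor ScaleShift
Flux.trainable(m::ScaleShift) = (m.W,)  # b stays in the struct but is excluded from params

m = ScaleShift(rand(3), zeros(3))
length(Flux.params(m))  # 1: only W is collected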

Edit -- If we think about designing other Params-like containers, they will always have to deal with struct fields which aren't parameters, like d.σ::Function and c.strides::Tuple{Int,Int}. How layers with bias turned off fit into that is, surely, that they should be treated just like these other non-parameter fields. At the moment, this seems to be done in a few places by testing x isa AbstractArray, very simple.

I agree with your assessment, but all these are equally true for the current Params container!

@mcabbott
Member

However, what if you wanted to loadparams! from a 3-layer MLP onto a 2-layer one or vice versa?

OK, that's not crazy, but then #1408 / #1409 are too strict. Perhaps we take this there?

but all these are equally true for the current Params container!

Sure. I just mean that absent bias seems like a solved problem (which we should be careful not to un-solve) rather than a driver of new designs.

@DhairyaLGandhi
Member Author

I hardly find it solved, considering it has unsafe behaviour in a variety of scenarios as shown earlier. Even without the rest of the benefits, a more sane and predictable behaviour set seems attractive enough to me.

@darsnack
Member

darsnack commented Jan 22, 2021

AFAICT, #1407 (false approach) and Zeros are functionally the same with only two differences:

  1. Zeros lets you have an untrainable, absent bias that changes the size of the output.
  2. You get an error when loading trainable weights into a model with absent bias and vice-versa.

My thoughts:

  1. This is surprising behavior to me, and I don't know why anyone would want it, but as has been mentioned already, FillArrays.jl addresses this use-case. The argument that we shouldn't use FillArrays.jl because Flux doesn't control it is distinctly un-Julian IMO. We should be relying on and pointing users to well-maintained packages in the ecosystem. If at some point FillArrays.jl changes in a way that no longer serves our needs, then create a Zeros. But to add so much complexity for a hypothetical scenario seems like a foot gun.
  2. Couldn't this be addressed with a simple check on the number of parameters before attempting to load them (see the sketch after this list)? As @ToucheSir pointed out, loadparams! doesn't encode any structural information, but if it did, then that would also solve this. At the same time, it would improve the functionality of loadparams! in a meaningful way. If this is the only compelling reason to prefer Zeros, then a more appropriate solution would be #1407 (Represent the absence of bias by bias=false) + an improved loadparams!.
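
A minimal sketch of the kind of check point 2 suggests (hypothetical helper, not existing API):

using Flux

function checked_loadparams!(m, ps)
  np = length(Flux.params(m))
  np == length(ps) || error("expected $np parameter arrays, got $(length(ps))")
  Flux.loadparams!(m, ps)
end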

Apologies if I am jumping into this discussion with imperfect information and conclusions. Please correct me where I am wrong.

@DhairyaLGandhi
Member Author

  1. It doesn't necessarily change the size of outputs, but it has the option of having dimensions. That is an important distinction, as it can be used in place of something that needs to be marked constant but may be part of a different mechanism where the dimensions come in handy.
  2. Error where, sorry? If you are referring to my comment from earlier, that kind of behaviour is what we currently have on master, and while we can and should have the option in loadparams to handle it better, the approach in this PR adds an extra layer of checks implicitly and does the correct loading by default, which is an added benefit. Doing the correct thing in the first place pays off imo.
    The correct answer to this whole debacle is simply having a Zygote.Const with which you could mark any object as constant, but that is slightly separate.
    For the comments:
  1. FillArrays fills a different hole in the story and, as mentioned multiple times previously, its behaviour should not be messed with for something that is a very simple utility. The maintenance status of the package has nothing to do with it.
  2. The checks with loadparams are a slightly different tangent, where we would have to encode more information about the model. Currently one could also store the entire model, which is far better for reproducibility in any circumstance.

Re simplicity - I have been a proponent, and this is a pretty straightforward design.

@ToucheSir
Member

ToucheSir commented Jul 11, 2021

Since this topic has come up a couple times recently, I think a good way to get a sense of where everyone stands is to run a quick poll. The choices are as follows:

  • S = should work silently
  • W = should warn (either via @warn or by returning something like PyTorch's state_dict does)
  • E = should error

To start, I've marked out my preferences for all the scenarios I could feasibly think of us running into below:

Input | Output | Intended Behaviour
layer with mutable bias | layer without bias | W
layer without bias | layer with mutable bias | W
layer with mutable bias | layer with fixed/const (e.g. Fill) bias | W
layer with no bias | layer with const bias | W
layer with const bias | layer with mutable bias | S
layer with const bias | layer with no bias | W

Alternatively, if we wanted to have a "strict" mode like load_state_dict does:

Input | Output | Intended Behaviour
layer with mutable bias | layer without bias | E
layer without bias | layer with mutable bias | E
layer with mutable bias | layer with fixed/const (e.g. Fill) bias | E
layer with no bias | layer with const bias | E
layer with const bias | layer with mutable bias | E
layer with const bias | layer with no bias | E

@ToucheSir
Member

With respect to using Fill vs having something in Flux, I had a read through FillArrays this morning and couldn't find anything obvious that would require us to write code specializing for it here. Perhaps the adjoints, but if those are a problem then we should add ChainRules support to FillArrays rather than throwing out the baby with the bathwater.

@mcabbott
Copy link
Member

This table might be better under #1408 perhaps? Doesn't seem tied to the idea of this PR.

@ToucheSir
Member

I included it here because not everyone participating in this thread has commented there, and because it appears to be a sticking point for arguments over how to represent "sometimes params" like bias in general. You could extend a similar argument for the β and γ of the norm layers as well.

@mcabbott
Member

mcabbott commented Jul 11, 2021

OK. But I feel like there's a missing word between "Input" & "Output" -- you're discussing how to represent these, with (I think) an eye towards loading / copying / destructure operations. Can you say more precisely which such operations you have in mind? loadparams! and destructure both appeared to be designed only for the simple case of the same model, and fail (or fail to fail) in various ways if not. But maybe the functions which update parameters using a gradient also need to worry about such mismatches...

Still think it would be clearer to discuss such loading / re-structuring / missing thing somewhere else than this thread.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Jul 11, 2021

The reason for having something specific to the Flux case is that FillArrays (like other specialised arrays) is not something whose behaviour Flux/Zygote should own. Another factor is breaking pretty crucial behaviour such as:

 julia> using FillArrays, Zygote

 julia> f = FillArrays.Ones(3,3)
 3×3 Ones{Float64}
 
 julia> gradient(f) do x
          # do something memory intensive
          # that can avoid materializing arrays
          sum(x .^ 2)
        end
 (3×3 Fill{Float64}: entries equal to 2.0,)

@ToucheSir
Member

Ah, I think I understand now. I expected

gradient((m, x) -> sum(m(x)), Dense(10, 2, bias=FillArrays.Zeros(2)), rand(Float32, 10, 2))[1].bias == nothing

to be true OOTB, but it isn't.

@mcabbott I was purely focused on (theoretically smarter versions of) loadparams! and params that would be able to handle all of the cases listed. Gradients didn't seem to be much of an issue since both using Zeros and not referencing .bias at all give a gradient of nothing.

@mcabbott
Member

There's a whole discussion of better AD for sparse & structured arrays, which includes FillArrays. This example is much like gradient(x -> sum(abs, x .+ 1), Diagonal([1,2,3]))[1] == ones(3,3), which should be solved soon.

@ToucheSir
Member

Of course, but those aren't used as model parameter values by default. Currently users must explicitly choose to use sparse and structured arrays with Flux, whereas anyone who calls Dense(..., bias=false) will be using Zeros regardless of whether they know how it works or not.

@darsnack
Member

Whatever happened to just using false? @mcabbott, you had a PR for that.

@mcabbott
Member

mcabbott commented Jul 12, 2021

Yes, for bias=false specifically, I still think the answer of what to store is obvious. All possible alternatives have been discussed at pretty generous length. Am happy to rebase the PR if someone will approve it.

For the broader question of what happens if your model has structured arrays, one answer for how to fill in the table above comes from broadcasting. Maybe loadparams! should just follow .=, which roughly allows writing an object with fewer mutable parameters into one with more, but never changes the structure of the target:

Diagonal(rand(3)) .= rand(3,3)  # error
Diagonal(rand(3)) .= zeros(3,3) # ok, because the values match
Diagonal(rand(3)) .= false      # ok
rand(3,3) .= Diagonal(rand(3))  # ok
rand(3,3) .= Fill(rand(), 3,3)  # ok
rand(3,3) .= false              # ok
MutableFill(rand(), 3,3) .= rand(3,3)  # hypothetical, but an error
MutableFill(rand(), 3,3) .= zeros(3,3) # hypothetical, but ok?
MutableFill(rand(), 3,3) .= false      # hypothetical, but ok?
Fill(1, 3,3) .= rand(3,3)       # error
Fill(1, 3,3) .= ones(3,3)       # ok, because the values match
Fill(1, 3,3) .= 1               # ok

The last two lines here I messed up at first, they do work.

Filling that in on the table, 2/3 depend on the values:

Input (src) | Output (dst) | Intended Behaviour
layer with mutable bias | layer without bias | Error unless values match, then Safe
layer without bias | layer with mutable bias | Safe
layer with mutable bias | layer with fixed/const (e.g. Fill) bias | Error unless values match, then Safe
layer with no bias | layer with const bias | Safe if constant matches, else Error
layer with const bias | layer with mutable bias | Safe
layer with const bias | layer with no bias | Safe if constant matches, else Error

The plan on the AD side, BTW, is to always project gradients down to the number of free, but not necessarily mutable, parameters. Thus the cotangent associated with Fill will only be one real number (possibly stored as another Fill) while the gradient of a FillArrays.Ones, or true, will be nothing (or some zero type).

@DhairyaLGandhi
Member Author

false is also no different from the current behaviour, and wouldn't exactly flow naturally. It would need manual handling of cases for loadparams.

This PR is needed anyway besides, and already handles some of the intended loadparams behaviour. To relax the constraints on what should error, we can also define setindex later.

@ToucheSir
Member

ToucheSir commented Jul 12, 2021

This is precisely why I mentioned loadparams! earlier. If making it work is the wedge issue here (which AFAICT it is), then let's fix loadparams! first and then hold a referendum on Zeros. It's not like we don't know how to do that either: 90% of the puzzle is already filled in with FluxML/Functors.jl#1 and FluxML/Functors.jl#14.

@DhairyaLGandhi
Member Author

To me, the loadparams behavior is a benefit of this PR, but not the main driver. I wouldn't block this PR over loadparams.

@ToucheSir
Member

ToucheSir commented Jul 12, 2021

No intention to block this PR; the changes themselves seem pretty uncontroversial. My point was that if 80%+ of the purpose of Zeros (over, say, dispatch or an explicit branch) is to placate an underpowered loadparams!, then making loadparams! more robust would give us a good idea of how useful that last <20% is.

@DhairyaLGandhi
Member Author

Well we already gain a bunch of consistency in this PR, so I think it's going in the direction of making loadparams more consistent too.

@mcabbott
Member

mcabbott commented Jul 12, 2021

the changes themselves seem pretty uncontroversial.

Not really. This PR lacks a motivation for adding back quite a lot of complexity. No examples of ML models which need fixed arrays have been provided yet. And actual arguments against using the well-tested ones provided by an existing dependency, should someone want them, are also conspicuously absent.

loadparams!, as you say, can happily adapt to suit such things. Once we decide what behaviour to fill in on your table. And, perhaps, once someone has a use case.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Jul 12, 2021

Actually, we've had plenty of debate over these in this and connected issues/PRs. We've also wanted a means to keep gamma and such constant for normalisation layers. Plus it's usable in places outside of biases as well, much like an array. Even the current implementation defines the math ops, broadcasting and their gradients, so I am unsure if there are any real concerns there. We've also gone over the FillArrays case, which I think you're referring to in your comment. Fast conv_bias_act paths also fall out neatly.

@mcabbott
Member

Yes, this thread has certainly gone around in circles a lot. On that everyone can agree. By "use case" I mean the arxiv numbers of papers that need this thing, and yes "outside of biases" since bias, by definition really, doesn't involve broadcasting which changes the shape. If "current implementation" means master, then yes, this is also needlessly complex.

@ToucheSir
Member

ToucheSir commented Jul 12, 2021

Yeah, these issues have become increasingly unwieldy to the point where I keep forgetting about previous context (e.g. the "adding back" part, I thought these changes were new).

I also assumed that not treating Zeros as an array would lead to something like #1601 (hence the 20% comment), but after some testing that appears to not be the case. With that done and dusted, is loadparams! literally the only remaining concrete motivation here? The OP lists "CUDA compat" and "Correctness of the broadcasting" which I assumed were related to previous issues, but after digging around in the previous discussion I couldn't find any mention of either.

And actual arguments against using the well-tested ones provided by an existing dependency, should someone want them, are also conspicuously absent.

Can #1402 (comment) be solved without any code changes to Flux (such that using Zygote with and without Flux has consistent behaviour, for example)? For FillArrays at least, I feel like you'd have conflicts between the default behaviour of Zeros as an array:

julia> using Flux, FillArrays   # Zeros/Ones here come from FillArrays; gradient is re-exported by Flux

julia> gradient((x, y) -> sum(x .+ y), Zeros(10), Ones(10))
(10-element Fill{Float64}: entries equal to 1.0, 10-element Fill{Float64}: entries equal to 1.0)

And Zeros as a null type:

julia> gradient((x, y) -> sum(x .+ y), Flux.Zeros(10), Ones(10))
(nothing, 10-element Fill{Float64}: entries equal to 1.0)

Maybe there's a way to reconcile the two, but I haven't been able to think of one that doesn't introduce surprising behaviour.

@DhairyaLGandhi
Member Author

There's also a perfectly valid API-compatible Zeros() which would do what you want. And this is pretty straightforward code.

Arxiv numbers aren't really relevant here. The goal is to generalise the existing notion of how layers are written and to have more flexibility for cases that may not conform to the current state. (We can extend this so that we don't need to store gamma if it wouldn't be trained on in norm layers, for example.) Re master: I don't remember seeing popular instances of scalar biases either.

If you would be happy with an example of practicality, conv_bias_act comes to mind.

@mcabbott mentioned this pull request Feb 19, 2022