
Fix some issues with Zeros #1374

Closed
wants to merge 10 commits into from

Conversation

DrChainsaw
Contributor

@DrChainsaw DrChainsaw commented Oct 28, 2020

Fixes #1332, #1277

Two things were causing trouble:

  1. adapt (used to change the type of parameters) turning Zeros() into Ts
  2. reshape (used in conv layers) turning Zeros into Base.ReshapedArrays

Both are addressed by specializations.
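Roughly, the idea is the following (the adapt method is the one added in the diff below; the reshape signature is only illustrative, the PR defines a handful of variants):

using Flux: Zeros
import Adapt: adapt

adapt(::Any, x::Zeros) = x  # keep Zeros through paramtype/f32/f64 instead of turning it into a scalar
Base.reshape(x::Zeros{T}, dims::Int...) where T = Zeros(T, dims...)  # stay a Zeros rather than a Base.ReshapedArray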

I'm not sure what the right way to overload reshape is. Leaving the type out for the second argument leads to ambiguity, so there are still more than a handful of ways to reshape a Zeros, which is what leads to issue 2 above. Are there any packages which do this that I could look at?

I was too lazy to come up with a nice way to check dimensions of parameters when creating a layer.

Furthermore, a number of adjoints are defined to ensure that the gradient of any Zeros is nothing.
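For example (this particular adjoint is quoted again further down in this thread), the pattern is simply to return nothing from the pullback:

using Flux: Zeros
using Zygote: @adjoint

@adjoint reshape(xs::Zeros{T}, dims...) where T =
    reshape(xs, dims...), _ -> nothing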

Also fixes a few issues with loadparams! and destructure in the presence of Zeros.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

Add reshape implementation for Zeros
Member

@DhairyaLGandhi DhairyaLGandhi left a comment


Love seeing the green CI!

src/functor.jl Outdated
@@ -78,6 +78,7 @@ gpu(x) = use_cuda[] ? fmap(CUDA.cu, x) : x
adapt_storage(T::Type{<:Real}, xs::AbstractArray{<:Real}) = convert.(T, xs)

paramtype(T::Type{<:Real}, m) = fmap(x -> adapt(T, x), m)
adapt(::Any, x::Zeros) = x # So that we for example don't accidentally end up with layers having a scalar for bias
Member


Is the Any correct here?

Member


Also, does this actually materialise? You would want to use adapt_structure here. Thanks to Tim for the pointer!

Contributor Author


Yes, will change to adapt_structure following Tim's advice!

::Any instead of just T or x is basically just because VS Code marks unused variables. Should I change it to T, or did you mean that there is a more constrained type to use here?

test/utils.jl Outdated
@@ -129,6 +129,12 @@ end
@test eltype(f32(f64(m))[1].W) == Float32
end

@testset "Zeros" begin
m = Dense(randn(2,3), Zeros())
@test f64(m).b === m.b
Member


Is this due to the Varargs case being caught in the method above?

Contributor Author


Not sure I follow.

It is a test that Zeros does not materialize from f64. I'll add === Zeros() at the end for now to make the intention clear.
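That is, roughly:

@test f64(m).b === Zeros()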


op = bias(ip)
@test sum(op) ≈ 0.f0
gs = gradient(() -> sum(bias(ip)), Flux.params(bias))
@test gs[bias.bias] === nothing
Member


Need to test this with CUDA as well

Contributor Author


I guess you are referring to the test that the gradient w.r.t. Zeros is nothing, right? Will add it right away!
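A rough sketch of what I have in mind, reusing bias and ip from the snippet above and moving them to the GPU first:

gbias, gip = gpu(bias), gpu(ip)
gs = gradient(() -> sum(gbias(gip)), Flux.params(gbias))
@test gs[gbias.bias] === nothing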

@DrChainsaw
Contributor Author

Btw, I had a look at how FillArrays.jl does reshape. Quite a few methods, but nothing insurmountable, and maybe not all of them are needed just for Zeros.

Was there ever a discussion about using (or recommending users to use) FillArrays.Zeros instead of rolling our own version?

@DrChainsaw
Contributor Author

Another thought: Should e.g. f32(Zeros()) return a Zeros{Float32}? I guess something like this:

adapt_structure(::Type{<:AbstractArray{T}}, x::Flux.Zeros) where T = adapt_structure(T, x)
adapt_structure(::Type{T}, x::Flux.Zeros{T}) where T = x 
adapt_structure(T, x::Flux.Zeros) = Flux.Zeros(T, size(x)...)

The drawback is that there will then be some inconsistency between layers created through e.g. Conv(...; bias=Zeros()) vs. the ones which are mapped. So far everything seems to work fine with Zeros having eltype Bool, so maybe don't touch it now.

@DhairyaLGandhi
Member

The reason we have our own Zeros type is that Flux needs to own a type that it can define a bunch of these routines for; otherwise it's very unexpected for FillArrays' types to not have gradients.

@DrChainsaw
Contributor Author

The reason we have our own Zeros type is that Flux needs to own a type that it can define a bunch of these routines for

Yeah, I thought it had something to do with type-piracy-ish stuff.

I'm almost done, but I'm trying to squeeze in some testing with optimizers as well and am running into some issues with zero, so I might need to add that too.

@DrChainsaw
Contributor Author

DrChainsaw commented Oct 30, 2020

Hmmm, somehow it is only Conv layers which get no gradient for Zeros; all other ways they are involved seem to result in gradients being created:

julia> ll = Dense(ones(Float32, 1,1), Flux.Zeros())
Dense(1, 1)

julia> gs = gradient(() -> sum(ll(ones(Float32, 1,1))), params(ll))
Grads(...)

julia> gs[ll.b]
0-dimensional FillArrays.Fill{Float32,0,Tuple{}} = 1.0

julia> ll.b
0-dimensional Flux.Zeros{Bool,0}:
0

Edit: It's because of this adjoint:

@adjoint reshape(xs::Zeros{T}, dims...) where T =
    reshape(xs, dims...), _ -> nothing

which is only used with conv layers.

I guess we need adjoints for +, - and * too, right?
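Something naive along these lines, I suppose (this ignores the method-ambiguity problems described in the next comment; the pullback returns one entry per argument of broadcasted, with the first slot for the function itself):

using Flux
using Zygote: @adjoint
using Base.Broadcast: broadcasted

@adjoint broadcasted(::typeof(+), a::AbstractArray, b::Flux.Zeros) =
    broadcast(+, a, b), Δ -> (nothing, Δ, nothing)  # no gradient for the Zeros bias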

@DrChainsaw
Contributor Author

DrChainsaw commented Oct 31, 2020

Ugh, avoiding ambiguity while still catching all permutations of Zeros turned out to be a fair bit more painful than I thought. I could not come up with a simpler way than more or less implementing every permutation.

I still need to implement the non-broadcasted ops and test the / case, but I'm too tired for that now.

@CarloLucibello
Member

We should move this discussion elsewhere, but I think it would be quite natural for FillArrays.Zeros and FillArrays.Ones to have zero gradients since they are meant to be constants. We should really consider switching to FillArrays types given the amount of extra bloat having our own Zeros produces.

@mcabbott
Member

mcabbott commented Oct 31, 2020

Another thought is: might it be simpler not to do this at all? Just store bias=false (or bias=False()) instead of a fake array, remove Zeros() completely. This behaves correctly under broadcast already, and I think scalars are ignored by params.
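For instance, false is already a perfectly good additive zero, so broadcasting it away costs nothing and preserves the eltype:

julia> W, x = ones(Float32, 2, 3), ones(Float32, 3);

julia> W * x .+ false
2-element Array{Float32,1}:
 3.0
 3.0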

@DrChainsaw
Contributor Author

@CarloLucibello I guess Zygote could define the adjoints for it without it being piracy. I can't wrap my head around whether it will cause problems for someone somewhere if there are never gradients for FillArrays. I can think of cases where there is inconvenience, but I'm not sure there are any real showstoppers. If the new AD means Zygote is reaching EOL, then perhaps it can be done as an experiment to see what complaints come in :)

@mcabbott I was thinking along similar lines. It might be easier to just make something which only tries to solve the bias on/off case. Another option is to have some Off struct which is not an AbstractArray, but using false is perhaps a bit more elegant as a lot of things are already implemented for it. I think one issue is that the conv layers for some reason need to reshape the bias. A dedicated type could just short-circuit reshape though, just like Zeros does in this commit.

I'm not sure whether it is desirable to have a turned-off bias in params or not. Otherwise the same thing can perhaps be done more explicitly with Zeros as well. I remember being a bit surprised to see it there, but I have it in the back of my head that there was a good reason for it. If it needs to be in params then I think Zygote will return gradients for false too (see below).

About the adjoint overloading in the last commit: the core issue afaik is that Zygote has defined adjoints for broadcasting with numerical arrays (and scalars), so even if the primal op does not use a variable, a gradient for it will be created. Could this be fixed in Zygote with reasonable effort? A hack like checking whether the result is identical to one of the arguments is one way, of course...

Also, after getting some sleep on it, there is a chance I can get rid of the adjoints and instead specialize Zygote.unbroadcast so it returns nothing if one of the arguments is a Zeros. For multiplication this would then only work if conj(::Zeros) is sacrificed so that it returns a real array of zeros. Not sure if this will work out with e.g. CuArrays though. The solution would then also become reliant on what I guess should be an implementation detail of Zygote.
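Purely as a hypothetical sketch of that last idea (Zygote.unbroadcast is an internal, so the exact name and argument order here are an assumption and may not match the real thing):

using Flux, Zygote

# assumed internal hook: whatever would be accumulated onto a Zeros is dropped
Zygote.unbroadcast(x::Flux.Zeros, x̄) = nothing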

@mcabbott
Member

I think one issue is that the conv-layers for some reason need to reshape the bias. An own type could just short-circuit reshape though, just like Zeros does in this commit.

My foggy recollection was that the original reason for Zeros was to avoid needing a special case in one or two conv methods. But that doesn't sound so bad compared to all this.

I have only skimmed the issues on loading/saving & params, but it sounds like the goal is to ignore absent biases, and Zeros <: AbstractArray means it needs all kinds of special-casing there.

@DrChainsaw
Contributor Author

But that doesn't sound so bad compared to all this.

No disagreement on my side there. This endeavour turned out to be quite the rabbit hole :)

One option is that I just open another PR where I try the bias=false approach and see where that takes us.

@DhairyaLGandhi
Member

Definitely shouldn't change the behaviour of FillArrays types, and we should keep the forward passes untouched since this is orthogonal to how those are supposed to function.

Seems like this needs the correct adapt rules (check out CUDA.Adaptor) and a way to tell CUDA not to materialize it until there's a kernel call to cuDNN, perhaps.

@DhairyaLGandhi
Member

We had considered the false approach while originally discussing the bias switch; I think we need to be careful not to do something which isn't generic enough. I agree that adjoints with Zeros can help here.

@DrChainsaw
Contributor Author

DrChainsaw commented Oct 31, 2020

@DhairyaLGandhi I will look into CUDA.Adaptor. I could not find it when browsing the code on GitHub, but I will search once I'm back in front of the computer. My thinking right now is that the specialized adapt_structure prevents them from turning into CuArrays, and the specialized broadcasts will be used even if one argument is a CuArray. Is this about more generic things (like cu([1]) .+ 0 .+ Flux.Zeros(), which currently fails)?

I looked through the issue for the param loading case: #1277

It seems that at least the issue with Zeros turning into Arrays is solved by this PR.

I will look at the tests for loadparams! and destructure, drop some Zeros in there and see what happens now. It seems as though those methods rely on params being AbstractArrays. Right now I can't wrap my head around whether things will just work if non-existent biases are simply not added to the set of params.

One obvious failure case is if someone tries to add a bias to a non-bias layer with loadparams!. This should throw an error as it is outside the scope of just modifying the parameters, right? Restructure, on the other hand, should perhaps work...

@DrChainsaw
Contributor Author

@DhairyaLGandhi I looked up CUDA.Adaptor in the code, but I still could not determine whether it has any purpose other than moving stuff to the GPU, which is what we want to avoid altogether here, right?

Is this sufficient to tell if it is handled correctly?

julia> cu(Flux.Zeros())
0-dimensional Flux.Zeros{Bool,0}:
0

julia> cu(Flux.Zeros(2))
2-element Flux.Zeros{Bool,1}:
 0
 0

@DrChainsaw
Contributor Author

Adding fixes and tests for loadparams! and destructure made it seem slightly more attractive in my eyes to explore the path where bias is not an AbstractArray.

There are two ways one can view the act of disabling bias:

  1. It is immutable (and zero)
  2. It is simply just not there

I think the current approach with Zeros being an AbstractArray tries to do a little bit of both, or at least it sends that message.

For destructure, having the params as an AbstractArray was actively unhelpful, as the contract is that the returned array is something which can be changed. I guess there are packages for exotic arrays which have different types for different parts of the array, but I doubt that we want to go there. What's on the branch now just assumes that bias=Zeros() shall be treated as if the bias does not exist, meaning it is not included in the destructured array.
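As an illustration of the intended behaviour (this assumes the semantics described above, i.e. a disabled bias simply does not show up in the flat parameter vector):

julia> m = Dense(randn(Float32, 1, 3), Flux.Zeros());

julia> θ, re = Flux.destructure(m);

julia> length(θ)  # only the 1×3 weight matrix; no bias entries
3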

loadparams!, on the other hand, is a bit of a mixed bag, where immutable structs being singletons makes it awkward to e.g. load between models where one has a bias and the other does not. I'm not sure this is how it is supposed to be used though, and for cases when the params to replace with are created manually I can't see how it matters.

@DrChainsaw
Contributor Author

Ok, I think functionality wise this PR is now ready as far as I can tell.

I will start drafting another version where Zeros is not an AbstractArray and submit that as a separate PR.

bors bot added a commit that referenced this pull request Nov 9, 2020
1379: Fix some issues with Zeros option 2 r=CarloLucibello a=DrChainsaw

Fixes #1332, #1277

This is take two on fixing the above issues and is intended to be mutually exclusive to #1374.

In this version Zeros is no longer an AbstractArray and is thus more like a there-is-no-such-parameter marker instead of this-parameter-is-an-immutable-array-of-zeros type. 

I made the change so that the advertised way of disabling bias is through `bias=false` rather than `bias=Zeros()`. 

`Zeros` itself is a lot less capable here compared to the previous attempt, but that also keeps the number of moving parts down, i.e. no need to test that both the 0-dim version and the full-size version work the same. All in all there are fewer specializations compared to #1374. 

I think that a rename could definitely be in place. Better names I can think of are `bias=Off`, `bias=None` or as suggested by @mcabbott `bias=False()`. 

The best argument against renaming I can think of is to be a little bit less breaking.

### PR Checklist

- [X] Tests are added
- [X] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).


Co-authored-by: DrChainsaw <Christian.kyril.skarby@gmail.com>
@DrChainsaw
Contributor Author

Redundant after #1379

@DrChainsaw DrChainsaw closed this Nov 9, 2020
@DrChainsaw DrChainsaw mentioned this pull request Nov 20, 2020
Successfully merging this pull request may close these issues.

GPU error when using Zeros() as bias in Conv layer