Allow user to change the channel axis for BatchNorm function and the likes #1664

Open
vboussange opened this issue Jul 14, 2021 · 22 comments

@vboussange

Hi all,
Currently, the BatchNorm layer uses dimension N-1 of an N-dimensional input array as the channel dimension by default. I think it would be useful to let users choose which dimension corresponds to the channel dimension in their particular setting.

I already have a proposal, which I am happy to submit as a pull request.

Example

using Flux
channel_size = 3
channel_axis = 1
BN = BatchNorm(channel_size, dim = channel_axis)
x = randn(channel_size, 10, 10)
BN(x)
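
For comparison, a minimal sketch of the current API, where the channel axis is fixed at dimension N-1:

using Flux
channel_size = 3
BN = BatchNorm(channel_size)               # channels expected at dimension N-1
x = randn(Float32, 10, channel_size, 10)   # width × channels × batch
BN(x)
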
@DhairyaLGandhi
Member

Thanks! We already support multiple dims in the norm layers for different data types. The real question is whether we want one set of layers to have a channel axis different from all the others in Flux; it might cause confusion and increase the complexity of the package more generally.

@ToucheSir
Member

ToucheSir commented Jul 14, 2021

Sounds like a reasonable (if somewhat uncommon) proposal to me if it is extended to GroupNorm and InstanceNorm as well. The main consideration I can think of would be how GPU-friendly this is.

Edit: do you have publications or other concrete examples of where this flexibility would be helpful? I had a look through other libraries and none of them appear to support this.

@vboussange
Author

Thanks for your answers.
@ToucheSir This option is available in TensorFlow via the axis argument of tf.keras.layers.BatchNormalization. It can be useful for scientific machine learning.
I am writing a package (paper in prep.) that relies on Flux.jl for solving high-dimensional partial differential equations. In this setting, I need to evaluate BatchNorm on an array x where ndims(x) is 2 or 3, but the "feature" axis is always the first one.
Regarding GPU performance, it works very well in my use case.

I agree that this option should also be implemented for GroupNorm and InstanceNorm, which is what I did. Let me know if you want me to open a PR.

@darsnack
Member

If you have code, a PR is always the better place to discuss a possible change.

This is not meant as pushback, but is using permutedims before the BN not an option? Is it just too cumbersome to write?

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jul 15, 2021

Well, something like this would need to be consistent across Conv*, Dense*, etc. too, not just *Norm, for it to be used consistently in a model. I agree an x -> permutedims(x, (2, 1, 3)) before a BN would be a better solution. It's more in line with Flux style and is one of the benefits of the way we do things, imo.
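
For concreteness, a minimal sketch of that workaround on the CPU; (2, 1, 3) is its own inverse, so the same permutation restores the original layout:

using Flux

channel_size = 3
BN = BatchNorm(channel_size)

x = randn(Float32, channel_size, 10, 16)                  # channels × time × batch
y = permutedims(BN(permutedims(x, (2, 1, 3))), (2, 1, 3)) # channels back on axis 1
size(y) == size(x)                                        # true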

@ToucheSir
Member

Well, this might require two permutedims to restore the original format. Dense doesn't even have a concept of channels, so it hardly applies there. For conv, my main concern (as with this proposal) is whether our cuDNN integration can be made to support changing the channel dimension.

@darsnack
Member

Yeah, if we did do this, I would prefer to do it across the board rather than have some layers fixed to features × channels × samples and some flexible. Being able to see a proposed implementation would be helpful for gauging how easily we can spread that API across all the layers.

Well, this might require two permutedims to restore the original format.

Yeah, this is what I was thinking w.r.t. "cumbersome". It might make sense to have a layer that reshapes/permutes, applies a wrapped layer, then reshapes/permutes back. Just an alternative idea.
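
A rough sketch of what such a wrapper layer could look like (the name ChannelPermute is made up for illustration; this eager version pays for two permutedims copies per call):

using Flux

struct ChannelPermute{L,N}
    layer::L
    perm::NTuple{N,Int}
    iperm::NTuple{N,Int}
end

ChannelPermute(layer, perm) =
    ChannelPermute(layer, Tuple(perm), Tuple(invperm(collect(perm))))

Flux.@functor ChannelPermute (layer,)

# Permute so the wrapped layer sees channels at dimension N-1, then permute back.
(c::ChannelPermute)(x) = permutedims(c.layer(permutedims(x, c.perm)), c.iperm)

m = ChannelPermute(BatchNorm(3), (2, 1, 3))   # channels on the first input axis
m(randn(Float32, 3, 10, 16))                  # output has the same layout as the input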

@vboussange
Author

I have opened a PR (#1666).

using permutedims before the BN not an option? Is it just too cumbersome to write?

Correct me if I am wrong, but using permutedims necessarily allocates a new array (in fact two, since, as mentioned, one would need to apply permutedims twice). This can slow down execution for large arrays, or for someone like me who needs to do this transformation many times in a run.

@darsnack
Member

darsnack commented Jul 16, 2021

That's true. PermutedDimsArray is a zero-copy version of permutedims.
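
A quick illustration of the difference:

x = rand(3, 4)
v = PermutedDimsArray(x, (2, 1))   # zero-copy view; shares memory with x
x[1, 2] = 42.0
v[2, 1] == 42.0                    # true, the view sees the update
p = permutedims(x, (2, 1))         # eager: allocates and copies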

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jul 16, 2021

Here's a proposal: what if we check whether PermutedDimsArray works with Zygote, and write a constructor adjoint if not? Then we wouldn't need copies (does permutedims even copy? I doubt it), and downstream packages could write a simple helper utility that does the right thing by permuting twice with a layer in the middle, without complicating the semantics of Flux layers.

function f(...; dims)
  Chain(x -> PermutedDimsArray(x, dims),
        BN,
        x -> PermutedDimsArray(x, invperm(dims)))
end
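
A runnable reading of that sketch (forward pass only; whether Zygote can differentiate through the PermutedDimsArray constructor is exactly what still needs checking). The helper name permuted_norm is made up for illustration:

using Flux

permuted_norm(norm_layer, perm) =
    Chain(x -> PermutedDimsArray(x, perm),   # lazily move channels to dim N-1
          norm_layer,
          x -> PermutedDimsArray(x, invperm(perm)))  # lazily restore the layout

m = permuted_norm(BatchNorm(3), (2, 1, 3))   # channels on the first input axis
m(randn(Float32, 3, 10, 16))                 # forward pass on the CPU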

@darsnack
Member

permutedims does copy, AFAIK; it isn't no-copy by default like reshape. Sounds like a good proposal, though.

@ToucheSir
Member

How GPU-friendly is this double permute? Would the second one restore the original dense array for downstream operations?

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jul 16, 2021

Some layers may copy, so no, it wouldn't restore the original (hence it shouldn't be in Flux, because it can't be generic). Note that the dimensionality of the output would match that of the input (the channel dim would be N-1).

It should be GPU-friendly, as PermutedDimsArray is handled in CUDA.jl (via Adapt.jl), which I think is fine.

@vboussange
Author

@DhairyaLGandhi your proposal is appealing. Nonetheless, I am not sure how to interpret your piece of code.

I still gave the double permute a try with a GPU array, and it throws a scalar indexing error:

using CUDA; CUDA.allowscalar(false)
import Base.PermutedDimsArray

x = CUDA.randn(10, 5, 2);
x = PermutedDimsArray(x, [3, 2, 1]);
x = PermutedDimsArray(x, invperm([3, 2, 1]));
x * 2

@DhairyaLGandhi
Member

Could you check whether PermutedDimsArray has an appropriate rule in Adapt.jl?

@vboussange
Author

vboussange commented Aug 5, 2021

@DhairyaLGandhi Yes, it looks like there is something.

@darsnack
Member

darsnack commented Aug 5, 2021

You are not using Adapt.jl here, because x starts off on the GPU before being passed to PermutedDimsArray. You would be using Adapt.jl if you started with a PermutedDimsArray{Array} and then called gpu/adapt.

The issue here is likely from invperm, which is probably not a GPU-friendly implementation in Base. (Never mind, that call doesn't even involve the GPU.) It would be helpful to post complete stack traces so that we can easily spot where the error is occurring and why.
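
For illustration, a minimal sketch of that direction, relying on the Adapt.jl rule for PermutedDimsArray mentioned above:

using CUDA, Adapt

x  = rand(Float32, 10, 5, 2)           # CPU array
xp = PermutedDimsArray(x, (3, 2, 1))   # lazy wrapper around the CPU array
xp_gpu = adapt(CuArray, xp)            # parent becomes a CuArray; the
                                       # PermutedDimsArray wrapper is preserved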

@vboussange
Author

Sure, this is what it throws:

julia> x * 2
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] assertscalar(op::String)
    @ GPUArrays ~/.julia/packages/GPUArrays/8dzSJ/src/host/indexing.jl:53
  [3] getindex(::CuArray{Float32, 3}, ::Int64, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/8dzSJ/src/host/indexing.jl:86
  [4] getindex (repeats 2 times)
    @ ./permuteddimsarray.jl:71 [inlined]
  [5] _getindex
    @ ./abstractarray.jl:1214 [inlined]
  [6] getindex
    @ ./abstractarray.jl:1170 [inlined]
  [7] _broadcast_getindex
    @ ./broadcast.jl:614 [inlined]
  [8] _getindex
    @ ./broadcast.jl:644 [inlined]
  [9] _broadcast_getindex
    @ ./broadcast.jl:620 [inlined]
 [10] getindex
    @ ./broadcast.jl:575 [inlined]
 [11] macro expansion
    @ ./broadcast.jl:984 [inlined]
 [12] macro expansion
    @ ./simdloop.jl:77 [inlined]
 [13] copyto!
    @ ./broadcast.jl:983 [inlined]
 [14] copyto!
    @ ./broadcast.jl:936 [inlined]
 [15] copy
    @ ./broadcast.jl:908 [inlined]
 [16] materialize
    @ ./broadcast.jl:883 [inlined]
 [17] broadcast_preserving_zero_d
    @ ./broadcast.jl:872 [inlined]
 [18] *(A::PermutedDimsArray{Float32, 3, (3, 2, 1), (3, 2, 1), PermutedDimsArray{Float32, 3, (3, 2, 1), (3, 2, 1), CuArray{Float32, 3}}}, B::Int64)
    @ Base ./arraymath.jl:55
 [19] top-level scope
    @ REPL[7]:1

@darsnack
Member

darsnack commented Aug 5, 2021

Okay, that's because broadcasting with a PermutedDimsArray triggers getindex, which uses scalar indexing.

@DhairyaLGandhi
Member

Cc @maleadt

@maleadt
Collaborator

maleadt commented Aug 5, 2021

Broadcast generally only works with plain CuArrays, see JuliaGPU/CUDA.jl#228.
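
For anyone hitting the same error, one possible workaround (a sketch, assuming the eager permutedims kernel for CuArray, and giving up the zero-copy goal discussed earlier) is to materialize the permutation instead of wrapping:

using CUDA; CUDA.allowscalar(false)

x = CUDA.randn(10, 5, 2)
y = permutedims(x, (3, 2, 1))           # eager copy, stays a plain CuArray
y = permutedims(y, invperm((3, 2, 1)))  # permute back
y * 2                                   # broadcasts on the GPU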

@vargonis

Any updates here? The channel dimension is naturally N-1 in space-like models, but for sequential or time-like models it is natural to have it first, or more generally at N-2 if one also has space-like dimensions.
