
CuDNN: Support for asymmetric padding? #128

Open
Sleort opened this issue Jun 11, 2019 · 9 comments · Fixed by FluxML/NNlibCUDA.jl#34
Labels
cuda libraries (Stuff about CUDA library wrappers) · enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@Sleort
Contributor

Sleort commented Jun 11, 2019

Problem
Currently, asymmetric padding doesn't work in Flux when using CuArrays:

using Flux, CuArrays
m = Conv((3,3), 1=>1, pad=(0,1,0,0)) |> gpu
x = rand(100,100,1,1) |> gpu
m(x)

throws an error when run on a GPU, CUDNNError(code 3, CUDNN_STATUS_BAD_PARAM), and gives the warning Warning: CuDNN does not support asymmetric padding; defaulting to symmetric choice. This means that

  1. my current model (which runs on the CPU) cannot be trained using a GPU (without modifying the code)
  2. there is no longer a one-to-one correspondence between CPU-valid and GPU-valid code (i.e., the above code runs just fine on the CPU if there is no GPU available).

Desired solution
Well, CuDNN support for asymmetric padding would be great. Then I wouldn't have to worry (as much) about whether code developed on a CPU will work on a GPU.

Alternative solution I've considered
One can obviously use symmetric padding and then crop the output to get the desired result. This is, however, not very elegant, and presumably not as efficient.
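
For concreteness, a rough sketch of what I mean (WHCN arrays; with the same weights, the cropped output should match what pad=(0,1,0,0) would produce):

using Flux, CuArrays

# Over-pad symmetrically, then crop the output. The desired padding is
# (0,1,0,0); the symmetric stand-in pad=(1,0) adds one unwanted row of padding
# at the start of dim 1, so the corresponding first output row is dropped.
m_sym = Conv((3,3), 1=>1, pad=(1,0)) |> gpu
x = rand(Float32, 100, 100, 1, 1) |> gpu
y = m_sym(x)                   # size (100, 98, 1, 1)
y_cropped = y[2:end, :, :, :]  # size (99, 98, 1, 1)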

@denizyuret
Contributor

We recently had the same problem implementing Yolo in Knet. The workaround @Ybakman and I ended up using is an efficient implementation of your alternative solution: symmetric padding followed by cropping using a fixed convolution filter. See https://github.com/Ybakman/YoloV2 for an implementation. This turned out to be fairly efficient (one extra conv), and can be added as a standard solution to Knet/Flux/CuArrays, I think. Given that CuDNN does not support asymmetric padding this may be the best we can do other than writing new kernels.
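
For illustration, a minimal single-channel sketch (not the actual YoloV2 code) of what the fixed cropping convolution amounts to:

using Flux

# A fixed 2×1 kernel containing a single 1 simply drops one row from its input;
# which position holds the 1 (together with the conv convention) decides which
# edge is dropped. Per-channel identity weights generalise this to many channels.
wcrop = zeros(Float32, 2, 1, 1, 1)
wcrop[1, 1, 1, 1] = 1
crop = Conv(wcrop, zeros(Float32, 1), identity)

x = rand(Float32, 5, 5, 1, 1)
size(crop(x))  # (4, 5, 1, 1): one row cropped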

@Sleort
Contributor Author

Sleort commented Aug 28, 2019

That's clever. Did you compare the performance of this solution to just cropping the output?

@denizyuret
Contributor

denizyuret commented Aug 28, 2019 via email

@Sleort
Contributor Author

Sleort commented Aug 28, 2019

Makes sense. I'll try it out. Thanks!

@DrChainsaw

DrChainsaw commented Oct 20, 2019

Hi,

Why not just pad the input manually in the case of asymmetric padding? That is what other DL libraries (e.g. TensorFlow) do.

At least when using Flux it seems to be faster than an additional conv and almost as fast as symmetric padding.

I can volunteer to add this should it be desirable (e.g. if the benchmarking below is not flawed). I'm not sure this is the right package for it, as the API (at least the one used by Flux) seems to require that both the output and the input are passed, which in turn would mean that one needs to resize two arrays instead of one (or does it?). Following the call chain from Flux, it seems like NNlib is the lowest level where one can apply the "pad input manually" method. @denizyuret: Does Knet use NNlib?

Here is an example where AsConv1 pads the input "manually", while AsConv2 over-pads symmetrically and crops using another Conv layer:

using Flux
using BenchmarkTools
using CuArrays

struct AsConv1{T, N}
    c::T
    pad::NTuple{N,Int}
end
function AsConv1(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
    init = Flux.glorot_uniform,  stride = 1, pad = 0, dilation = 1) where N
    # Fall back to a plain Conv if the requested padding is already symmetric
    if length(pad) < 2N || all(i -> pad[i] == pad[i+1], 1:2:2N)
        return Conv(k, ch, σ, init=init, stride=stride, pad=pad, dilation=dilation)
    end

    pad_manual = Tuple(map(i -> abs(pad[i] - pad[i+1]), 1:2:2N)) # extra zeros to add by hand
    pad_auto = Tuple(map(i -> minimum(pad[i:i+1]), 1:2:2N))      # symmetric part left to cuDNN
    return AsConv1(Conv(k, ch, σ, init=init, stride=stride, pad=pad_auto, dilation=dilation), pad_manual)
end
function (c::AsConv1)(x::AbstractArray)
    # Maybe there are faster ways to do this as well...
    padding = similar(x, c.pad..., size(x)[end-1:end]...)
    fill!(padding, 0)
    # cat along dims 1:N places x and the zero block "block-diagonally",
    # i.e. it appends the zero padding at the end of each padded spatial dim
    c.c(cat(x, padding, dims=1:length(c.pad)))
end

Flux.@treelike AsConv1

struct AsConv2{T,V}
    cfun::T
    cpad::V
end
function AsConv2(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
    init = Flux.glorot_uniform,  stride = 1, pad = 0, dilation = 1) where N
    # Fall back to a plain Conv if the requested padding is already symmetric
    if length(pad) < 2N || all(i -> pad[i] == pad[i+1], 1:2:2N)
        return Conv(k, ch, σ, init=init, stride=stride, pad=pad, dilation=dilation)
    end

    pad_manual = Tuple(map(i -> abs(pad[i] - pad[i+1]), 1:2:2N)) # rows/columns to crop afterwards
    pad_auto = Tuple(map(i -> maximum(pad[i:i+1]), 1:2:2N))      # symmetric over-padding

    # Fixed cropping kernel: a single 1 per channel (identity across channels),
    # so this conv only drops the over-padded rows/columns without mixing channels
    wpad = zeros(Float32, (pad_manual .+ 1)..., ch[2], ch[2])
    inds = map(pp -> pp > 0 ? 1 : Colon(), pad_manual)
    for i in 1:ch[2]
        wpad[inds..., i, i] .= 1
    end
    cpad = Conv(wpad, zeros(Float32, ch[2]), identity)
    cfun = Conv(k, ch, σ, init=init, stride=stride, pad=pad_auto, dilation=dilation)
    return AsConv2(cfun, cpad)
end
function (c::AsConv2)(x::AbstractArray)
    x′ = c.cfun(x)
    return c.cpad(x′)
end

Flux.@treelike AsConv2

cc0 = Conv((3,5), 64 => 256, pad=(1,1,2,2)) |> gpu;
cc1 = AsConv1((3,5), 64 => 256, pad=(1,0,2,2)) |> gpu;
cc2 = AsConv2((3,5), 64 => 256, pad=(1,0,2,2)) |> gpu;

data = ones(Float32, 32,32,64,32) |> gpu;

julia> size(cc0(data)) # Note: Does not do asymmetric padding as it would fail
(32, 32, 256, 32)

julia> size(cc1(data)) # Pads one less than cc0, same as cc2
(31, 32, 256, 32)

julia> size(cc2(data)) # Pads one less than cc0, same as cc1
(31, 32, 256, 32)

julia> @benchmark CuArrays.@sync cc0($data)
BenchmarkTools.Trial:
  memory estimate:  9.20 KiB
  allocs estimate:  188
  --------------
  minimum time:     5.925 ms (0.00% GC)
  median time:      6.199 ms (0.00% GC)
  mean time:        6.377 ms (1.02% GC)
  maximum time:     13.290 ms (42.58% GC)
  --------------
  samples:          781
  evals/sample:     1

julia> @benchmark CuArrays.@sync cc1($data)
BenchmarkTools.Trial:
  memory estimate:  23.05 KiB
  allocs estimate:  382
  --------------
  minimum time:     6.255 ms (0.00% GC)
  median time:      6.723 ms (0.00% GC)
  mean time:        6.933 ms (1.09% GC)
  maximum time:     15.048 ms (43.82% GC)
  --------------
  samples:          718
  evals/sample:     1

julia> @benchmark CuArrays.@sync cc2($data)
BenchmarkTools.Trial:
  memory estimate:  18.39 KiB
  allocs estimate:  371
  --------------
  minimum time:     9.720 ms (0.00% GC)
  median time:      10.655 ms (0.00% GC)
  mean time:        22.898 ms (0.44% GC)
  maximum time:     131.763 ms (0.00% GC)
  --------------
  samples:          218
  evals/sample:     1

@denizyuret
Contributor

That is certainly an option. In our experiments with Knet the two gave roughly the same speed. Memory cost is pretty much the same: padded input takes up extra space, but so does our extra convolution operation.

@maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
@maleadt added the cuda libraries and enhancement labels on May 27, 2020
@tknopp

tknopp commented Dec 18, 2021

I ran into this issue as well. Is there any downside to AsConv1 shown above by @DrChainsaw? Couldn't that just be integrated into Conv?

@maleadt added the help wanted label on Dec 20, 2021
@DrChainsaw

I ran into this issue as well. Is there any downside to AsConv1 shown above by @DrChainsaw? Couldn't that just be integrated into Conv?

The downside would be that Conv is implemented in Flux, and Flux tries to be agnostic to the backend. Other backends (e.g. the CPU backend) support asymmetric padding. Having AsConv1 as a special type would be inconvenient for the user (e.g. gpu(model) will not always produce a working model).

I'm thinking that the right course of action is to just do the manual padding in the function that currently warns in CUDA.jl. I suppose this is a breaking change in theory, but at least Flux does not assume the current behaviour and instead passes an output array sized for asymmetric padding, so I think one should only need to copy and resize the input.

I can try to look into it tonight and see if it is doable without any larger restructuring.
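
Roughly what I have in mind (a hypothetical sketch with a made-up helper name, not the actual CUDA.jl code), assuming pad is given as (lo1, hi1, lo2, hi2):

using CUDA

# Hypothetical helper: make asymmetric padding symmetric by copying the input
# into a zero buffer that is enlarged on the "extra" side only.
function symmetrize_padding(x::AbstractArray{T,4}, pad::NTuple{4,Int}) where T
    lo, hi = (pad[1], pad[3]), (pad[2], pad[4])
    sym = min.(lo, hi)                        # part cuDNN handles symmetrically
    extra_lo, extra_hi = lo .- sym, hi .- sym
    xp = CUDA.zeros(T, size(x,1) + extra_lo[1] + extra_hi[1],
                       size(x,2) + extra_lo[2] + extra_hi[2],
                       size(x,3), size(x,4))
    xp[extra_lo[1]+1:extra_lo[1]+size(x,1),
       extra_lo[2]+1:extra_lo[2]+size(x,2), :, :] .= x
    return xp, sym
end

The convolution would then be run on xp with pad=sym, which is the symmetric case cuDNN already handles.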

@DrChainsaw

Turns out this functionality has moved to NNlibCUDA. Either close now or when/if the above PR is accepted.
