
CuDNN: Support for asymmetric padding? #128

Open
Sleort opened this issue Jun 11, 2019 · 9 comments · Fixed by FluxML/NNlibCUDA.jl#34
Labels
cuda libraries (Stuff about CUDA library wrappers) · enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@Sleort
Contributor

Sleort commented Jun 11, 2019

Problem
Currently, asymmetric padding doesn't work in Flux when using CuArrays:

using Flux, CuArrays
m = Conv((3,3), 1=>1, pad=(0,1,0,0)) |> gpu
x = rand(100,100,1,1) |> gpu
m(x)

throws an error when run on a GPU, CUDNNError(code 3, CUDNN_STATUS_BAD_PARAM), and gives the warning Warning: CuDNN does not support asymmetric padding; defaulting to symmetric choice. This means that

  1. my current model (which runs on the CPU) cannot be trained using a GPU (without modifying the code)
  2. there is no longer a one-to-one correspondence between CPU-valid and GPU-valid code (i.e., the above code runs just fine on the CPU if there is no GPU available).

Desired solution
Well, CuDNN support for asymmetric padding would be great. Then I wouldn't have to worry (as much) about whether code developed on a CPU will work on a GPU.

Alternative solution I've considered
One can obviously use symmetric padding and then crop the output to get the desired result. This is, however, not very elegant, and presumably not as efficient.
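
For concreteness, a rough sketch of what I mean (WHCN arrays; with the same weights, the cropped output should match what pad=(0,1,0,0) would produce):

using Flux, CuArrays

# Over-pad symmetrically, then crop the output. The desired padding is
# (0,1,0,0); the symmetric stand-in pad=(1,0) adds one unwanted row of padding
# at the start of dim 1, so the corresponding first output row is dropped.
m_sym = Conv((3,3), 1=>1, pad=(1,0)) |> gpu
x = rand(Float32, 100, 100, 1, 1) |> gpu
y = m_sym(x)                   # size (100, 98, 1, 1)
y_cropped = y[2:end, :, :, :]  # size (99, 98, 1, 1)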

@denizyuret
Contributor

We recently had the same problem implementing Yolo in Knet. The workaround @Ybakman and I ended up using is an efficient implementation of your alternative solution: symmetric padding followed by cropping using a fixed convolution filter. See https://github.com/Ybakman/YoloV2 for an implementation. This turned out to be fairly efficient (one extra conv), and can be added as a standard solution to Knet/Flux/CuArrays, I think. Given that CuDNN does not support asymmetric padding this may be the best we can do other than writing new kernels.
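
For illustration, a minimal single-channel sketch (not the actual YoloV2 code) of what the fixed cropping convolution amounts to:

using Flux

# A fixed 2×1 kernel containing a single 1 simply drops one row from its input;
# which position holds the 1 (together with the conv convention) decides which
# edge is dropped. Per-channel identity weights generalise this to many channels.
wcrop = zeros(Float32, 2, 1, 1, 1)
wcrop[1, 1, 1, 1] = 1
crop = Conv(wcrop, zeros(Float32, 1), identity)

x = rand(Float32, 5, 5, 1, 1)
size(crop(x))  # (4, 5, 1, 1): one row cropped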

@Sleort
Contributor Author

Sleort commented Aug 28, 2019

That's clever. Did you compare the performance of this solution to just cropping the output?

@denizyuret
Contributor

denizyuret commented Aug 28, 2019 via email

@Sleort
Contributor Author

Sleort commented Aug 28, 2019

Makes sense. I'll try it out. Thanks!

@DrChainsaw

DrChainsaw commented Oct 20, 2019

Hi,

Why not just pad the input manually in the case of asymmetric padding? That is what other DL libraries (e.g. TensorFlow) do.

At least when using Flux it seems to be faster than an additional conv and almost as fast as symmetric padding.

I can volunteer to add this should it be desirable (e.g. if the benchmarking below is not flawed). I'm not sure this is the right package for it, as the API (at least the one used by Flux) seems to require that both the output and the input are passed, which in turn would mean that one needs to resize two arrays instead of one (or does it?). Following the call chain from Flux, it seems like NNlib is the lowest level where one can apply the "pad input manually" method. @denizyuret: Does Knet use NNlib?

Here is an example where AsConv1 pads the input "manually", while AsConv2 over-pads symmetrically and crops using another Conv layer:

using Flux
using BenchmarkTools
using CuArrays

struct AsConv1{T, N}
    c::T
    pad::NTuple{N,Int}
end
function AsConv1(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
    init = Flux.glorot_uniform,  stride = 1, pad = 0, dilation = 1) where N
    # Fall back to a plain Conv if the requested padding is already symmetric
    if length(pad) < 2N || all(i -> pad[i] == pad[i+1], 1:2:2N)
        return Conv(k, ch, σ, init=init, stride=stride, pad=pad, dilation=dilation)
    end

    pad_manual = Tuple(map(i -> abs(pad[i] - pad[i+1]), 1:2:2N)) # extra zeros to add by hand
    pad_auto = Tuple(map(i -> minimum(pad[i:i+1]), 1:2:2N))      # symmetric part left to cuDNN
    return AsConv1(Conv(k, ch, σ, init=init, stride=stride, pad=pad_auto, dilation=dilation), pad_manual)
end
function (c::AsConv1)(x::AbstractArray)
    # Maybe there are faster ways to do this as well...
    padding = similar(x, c.pad..., size(x)[end-1:end]...)
    fill!(padding, 0)
    # cat along dims 1:N places x and the zero block "block-diagonally",
    # i.e. it appends the zero padding at the end of each padded spatial dim
    c.c(cat(x, padding, dims=1:length(c.pad)))
end

Flux.@treelike AsConv1

struct AsConv2{T,V}
    cfun::T
    cpad::V
end
function AsConv2(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
    init = Flux.glorot_uniform,  stride = 1, pad = 0, dilation = 1) where N
    # Fall back to a plain Conv if the requested padding is already symmetric
    if length(pad) < 2N || all(i -> pad[i] == pad[i+1], 1:2:2N)
        return Conv(k, ch, σ, init=init, stride=stride, pad=pad, dilation=dilation)
    end

    pad_manual = Tuple(map(i -> abs(pad[i] - pad[i+1]), 1:2:2N)) # rows/columns to crop afterwards
    pad_auto = Tuple(map(i -> maximum(pad[i:i+1]), 1:2:2N))      # symmetric over-padding

    # Fixed cropping kernel: a single 1 per channel (identity across channels),
    # so this conv only drops the over-padded rows/columns without mixing channels
    wpad = zeros(Float32, (pad_manual .+ 1)..., ch[2], ch[2])
    inds = map(pp -> pp > 0 ? 1 : Colon(), pad_manual)
    for i in 1:ch[2]
        wpad[inds..., i, i] .= 1
    end
    cpad = Conv(wpad, zeros(Float32, ch[2]), identity)
    cfun = Conv(k, ch, σ, init=init, stride=stride, pad=pad_auto, dilation=dilation)
    return AsConv2(cfun, cpad)
end
function (c::AsConv2)(x::AbstractArray)
    x′ = c.cfun(x)
    return c.cpad(x′)
end

Flux.@treelike AsConv2

cc0 = Conv((3,5), 64 => 256, pad=(1,1,2,2)) |> gpu;
cc1 = AsConv1((3,5), 64 => 256, pad=(1,0,2,2)) |> gpu;
cc2 = AsConv2((3,5), 64 => 256, pad=(1,0,2,2)) |> gpu;

data = ones(Float32, 32,32,64,32) |> gpu;

julia> size(cc0(data)) # Note: Does not do asymmetric padding as it would fail
(32, 32, 256, 32)

julia> size(cc1(data)) # Pads one less than cc0, same as cc2
(31, 32, 256, 32)

julia> size(cc2(data)) # Pads one less than cc0, same as cc1
(31, 32, 256, 32)

julia> @benchmark CuArrays.@sync cc0($data)
BenchmarkTools.Trial:
  memory estimate:  9.20 KiB
  allocs estimate:  188
  --------------
  minimum time:     5.925 ms (0.00% GC)
  median time:      6.199 ms (0.00% GC)
  mean time:        6.377 ms (1.02% GC)
  maximum time:     13.290 ms (42.58% GC)
  --------------
  samples:          781
  evals/sample:     1

julia> @benchmark CuArrays.@sync cc1($data)
BenchmarkTools.Trial:
  memory estimate:  23.05 KiB
  allocs estimate:  382
  --------------
  minimum time:     6.255 ms (0.00% GC)
  median time:      6.723 ms (0.00% GC)
  mean time:        6.933 ms (1.09% GC)
  maximum time:     15.048 ms (43.82% GC)
  --------------
  samples:          718
  evals/sample:     1

julia> @benchmark CuArrays.@sync cc2($data)
BenchmarkTools.Trial:
  memory estimate:  18.39 KiB
  allocs estimate:  371
  --------------
  minimum time:     9.720 ms (0.00% GC)
  median time:      10.655 ms (0.00% GC)
  mean time:        22.898 ms (0.44% GC)
  maximum time:     131.763 ms (0.00% GC)
  --------------
  samples:          218
  evals/sample:     1

@denizyuret
Contributor

That is certainly an option. In our experiments with Knet the two gave roughly the same speed. Memory cost is pretty much the same: padded input takes up extra space, but so does our extra convolution operation.

@maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
@maleadt added the cuda libraries and enhancement labels on May 27, 2020
@tknopp

tknopp commented Dec 18, 2021

I ran into this issue as well. Is there any downside to AsConv1 shown above by @DrChainsaw? Couldn't that just be integrated into Conv?

@maleadt added the help wanted label on Dec 20, 2021
@DrChainsaw

I ran into this issue as well. Is there any downside to AsConv1 shown above by @DrChainsaw? Couldn't that just be integrated into Conv?

The downside would be that Conv is implemented in Flux, and Flux tries to be agnostic to the backend. Other backends (e.g. the CPU backend) support asymmetric padding. Having AsConv1 as a special type would be inconvenient for the user (e.g. gpu(model) will not always produce a working model).

I'm thinking that the right course of action is to just do the manual padding in the function that currently warns in CUDA.jl. I suppose this is a breaking change in theory, but at least Flux does not assume the current behaviour and instead passes an output array sized for asymmetric padding, so I think one should only need to copy and resize the input.

I can try to look into it tonight and see if it is doable without any larger restructuring.
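
Roughly what I have in mind (a hypothetical sketch with a made-up helper name, not the actual CUDA.jl code), assuming pad is given as (lo1, hi1, lo2, hi2):

using CUDA

# Hypothetical helper: make asymmetric padding symmetric by copying the input
# into a zero buffer that is enlarged on the "extra" side only.
function symmetrize_padding(x::AbstractArray{T,4}, pad::NTuple{4,Int}) where T
    lo, hi = (pad[1], pad[3]), (pad[2], pad[4])
    sym = min.(lo, hi)                        # part cuDNN handles symmetrically
    extra_lo, extra_hi = lo .- sym, hi .- sym
    xp = CUDA.zeros(T, size(x,1) + extra_lo[1] + extra_hi[1],
                       size(x,2) + extra_lo[2] + extra_hi[2],
                       size(x,3), size(x,4))
    xp[extra_lo[1]+1:extra_lo[1]+size(x,1),
       extra_lo[2]+1:extra_lo[2]+size(x,2), :, :] .= x
    return xp, sym
end

The convolution would then be run on xp with pad=sym, which is the symmetric case cuDNN already handles.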

@DrChainsaw

Turns out this functionality has moved to NNlibCUDA. Either close now or when/if the above PR is accepted.
