CuDNN: Support for asymmetric padding? #128
We recently had the same problem implementing Yolo in Knet. The workaround @Ybakman and I ended up using is an efficient implementation of your alternative solution: symmetric padding followed by cropping using a fixed convolution filter. See https://github.com/Ybakman/YoloV2 for an implementation. This turned out to be fairly efficient (one extra conv), and could be added as a standard solution to Knet/Flux/CuArrays, I think. Given that CuDNN does not support asymmetric padding, this may be the best we can do short of writing new kernels.
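The crop-with-a-fixed-convolution trick can be sketched on the CPU with NNlib; the sizes and the kernel below are illustrative assumptions, not the Knet/YoloV2 code:

```julia
using NNlib  # CPU reference implementation of conv

# A fixed 2×1 kernel with a single non-zero tap acts as a "crop one row"
# operation: convolving with it yields an output one row shorter along the
# first (W) dimension, passing the surviving rows through unchanged.
x = reshape(Float32.(1:12), 4, 3, 1, 1)  # W×H×C×N input (illustrative)
w = zeros(Float32, 2, 1, 1, 1)
w[1, 1, 1, 1] = 1f0                      # pass-through tap
y = NNlib.conv(x, w)                     # spatial size shrinks by kernel - 1
size(y)                                  # (3, 3, 1, 1)
```

Because the cropping is expressed as a convolution, cuDNN can execute it on the GPU without any non-contiguous copies on the Julia side.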
That's clever. Did you compare the performance of this solution to just cropping the output?
Cropping is slow because it is non-contiguous in this case. There is a lot of copying and pasting. Unless you write a specialized kernel, it is hopeless.
Makes sense. I'll try it out. Thanks!
Hi, why not just pad the input manually in the case of asymmetric padding? That is what other DL libraries (e.g. TensorFlow) do. At least when using Flux it seems to be faster than an additional conv and almost as fast as symmetric padding.
I can volunteer to add this should it be desirable (e.g. if the benchmarking below is not flawed). I'm not sure this is the right package for it, as the API (at least the one used by Flux) seems to require that both output and input are passed, and this in turn would mean that one needs to resize two arrays instead of one (or?). Following the call chain from Flux, it seems like NNlib is the lowest level where one can apply the "pad input manually" method. @denizyuret: does Knet use NNlib?
Here is an example, with AsConv1 doing the "manual" padding of the input while AsConv2 does padding followed by cropping using another Conv layer:
using Flux
using BenchmarkTools
using CuArrays
struct AsConv1{T, N}
c::T
pad::NTuple{N,Int}
end
function AsConv1(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
init = Flux.glorot_uniform, stride = 1, pad = 0, dilation = 1) where N
# Symmetric padding: a plain Conv suffices
(length(pad) < 2N || all(i -> pad[i] == pad[i+1], 1:2:2N)) && return Conv(k, ch, σ, init=init, stride=stride, pad=pad, dilation=dilation)
pad_manual = Tuple(map(i -> abs(pad[i] - pad[i+1]), 1:2:2N))
pad_auto = Tuple(map(i -> minimum(pad[i:i+1]), 1:2:2N))
return AsConv1(Conv(k, ch, σ, init=init, stride=stride, pad=pad_auto, dilation=dilation), pad_manual)
end
function (c::AsConv1)(x::AbstractArray)
# Maybe there are faster ways to do this as well...
padding = similar(x, c.pad..., size(x)[end-1:end]...)
fill!(padding, 0)
c.c(cat(x, padding, dims=1:length(c.pad)))
end
Flux.@treelike AsConv1
struct AsConv2{T,V}
cfun::T
cpad::V
end
function AsConv2(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ = identity;
init = Flux.glorot_uniform, stride = 1, pad = 0, dilation = 1) where N
# Symmetric padding: a plain Conv suffices
(length(pad) < 2N || all(i -> pad[i] == pad[i+1], 1:2:2N)) && return Conv(k, ch, σ, init=init, stride=stride, pad=pad, dilation=dilation)
pad_manual = Tuple(map(i -> abs(pad[i] - pad[i+1]), 1:2:2N))
pad_auto = Tuple(map(i -> maximum(pad[i:i+1]), 1:2:2N))
wpad = zeros(Float32, (pad_manual .+ 1)..., ch[2], ch[2])
inds = map(pp -> pp > 0 ? 1 : Colon(), pad_manual)
wpad[inds..., :, :] .= 1
cpad = Conv(wpad, zeros(Float32, ch[2]), identity)
cfun = Conv(k, ch, σ, init=init, stride=stride, pad=pad_auto, dilation=dilation)
return AsConv2(cfun, cpad)
end
function (c::AsConv2)(x::AbstractArray)
x′ = c.cfun(x)
return c.cpad(x′)
end
Flux.@treelike AsConv2
cc0 = Conv((3,5), 64 => 256, pad=(1,1,2,2)) |> gpu;
cc1 = AsConv1((3,5), 64 => 256, pad=(1,0,2,2)) |> gpu;
cc2 = AsConv2((3,5), 64 => 256, pad=(1,0,2,2)) |> gpu;
data = ones(Float32, 32,32,64,32) |> gpu;
julia> size(cc0(data)) # Note: does not do asymmetric padding, as it would fail
(32, 32, 256, 32)
julia> size(cc1(data)) # Pads one less than cc0, same as cc2
(31, 32, 256, 32)
julia> size(cc2(data)) # Pads one less than cc0, same as cc1
(31, 32, 256, 32)
julia> @benchmark CuArrays.@sync cc0($data)
BenchmarkTools.Trial:
memory estimate: 9.20 KiB
allocs estimate: 188
--------------
minimum time: 5.925 ms (0.00% GC)
median time: 6.199 ms (0.00% GC)
mean time: 6.377 ms (1.02% GC)
maximum time: 13.290 ms (42.58% GC)
--------------
samples: 781
evals/sample: 1
julia> @benchmark CuArrays.@sync cc1($data)
BenchmarkTools.Trial:
memory estimate: 23.05 KiB
allocs estimate: 382
--------------
minimum time: 6.255 ms (0.00% GC)
median time: 6.723 ms (0.00% GC)
mean time: 6.933 ms (1.09% GC)
maximum time: 15.048 ms (43.82% GC)
--------------
samples: 718
evals/sample: 1
julia> @benchmark CuArrays.@sync cc2($data)
BenchmarkTools.Trial:
memory estimate: 18.39 KiB
allocs estimate: 371
--------------
minimum time: 9.720 ms (0.00% GC)
median time: 10.655 ms (0.00% GC)
mean time: 22.898 ms (0.44% GC)
maximum time: 131.763 ms (0.00% GC)
--------------
samples: 218
evals/sample: 1
That is certainly an option. In our experiments with Knet the two gave roughly the same speed. Memory cost is pretty much the same: padded input takes up extra space, but so does our extra convolution operation.
I ran into this issue as well. Is there any downside of this approach?
The downside would be that …
I'm thinking that the right course of action is to just do the manual padding in the function which currently warns in CUDA.jl. I suppose this is a breaking change in theory, but at least Flux does not assume the current behaviour and instead passes an array which assumes asymmetric padding, so I think one should only need to copy+resize the input. I can try to look into it tonight and see if it is doable without any larger restructuring.
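For reference, the "pad the input manually" step could look roughly like this. This is a sketch under my own assumptions (WHCN layout, a hypothetical `pad_input` helper, pad given as left/right/top/bottom), not the actual NNlib/CUDA.jl code:

```julia
# Sketch: copy x into a zero-filled array enlarged by the asymmetric amounts,
# after which the convolution itself can run with pad = 0 (or the symmetric part).
function pad_input(x::AbstractArray{T,4}, pad::NTuple{4,Int}) where T
    l, r, t, b = pad  # (left, right, top, bottom)
    y = fill!(similar(x, size(x, 1) + l + r, size(x, 2) + t + b,
                      size(x, 3), size(x, 4)), zero(T))
    y[l+1:l+size(x, 1), t+1:t+size(x, 2), :, :] .= x  # one contiguous-ish copy
    return y
end

x = ones(Float32, 32, 32, 64, 1)
size(pad_input(x, (1, 0, 2, 2)))  # (33, 36, 64, 1)
```

The key point is that only the input is copied once into a larger buffer; the convolution call itself then sees symmetric (or zero) padding, which cuDNN supports.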
Turns out this functionality has moved to NNlibCUDA. Either close now or when/if the above PR is accepted.
Problem
Currently, asymmetric padding doesn't work in Flux when using CuArrays. A Conv layer with an asymmetric pad throws an error (when run on a GPU):
CUDNNError(code 3, CUDNN_STATUS_BAD_PARAM)
and gives the warning
Warning: CuDNN does not support asymmetric padding; defaulting to symmetric choice
This means that …
Desired solution
Well, CuDNN support for asymmetric padding would be great. Then I wouldn't have to worry (as much) about whether code developed on a CPU will work on a GPU.
Alternative solution I've considered
One can obviously use symmetric padding and then crop the output to get the desired result. This is, however, not very elegant, nor as efficient (?).
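For concreteness, the crop-the-output alternative is just an index into the symmetrically padded result; the sizes here are illustrative (matching the benchmark above), and the non-contiguous copy along the first dimension is where the inefficiency comes from:

```julia
# Output of a conv run with symmetric padding one larger than needed,
# then dropping the surplus row to emulate asymmetric padding.
y = rand(Float32, 32, 32, 256, 32)  # W×H×C×N (illustrative)
y_cropped = y[2:end, :, :, :]       # non-contiguous copy along the first dim
size(y_cropped)                     # (31, 32, 256, 32)
```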