added fold/unfold and gpu tests #59
Conversation
Let's get FluxML/NNlib.jl#444 merged so we can see what CI thinks of this :)
@ToucheSir do I have to add another commit to trigger tests, since #56 was merged?
Not entirely sure, but it looks like CI uses the registered version of NNlib. (Ideally it would grab master, I think; no idea if that's easy to change.) Once JuliaRegistries/General#73026 is merged, this PR should probably raise the compat bound on NNlib to 0.8.11.
I think a similar thing is happening again, but this time NNlibCUDA is looking at its registered version, not master, so #56 being merged doesn't change the fact that the test is still looking for Julia 1.7 nightly.
If it's not seeing #56, it may help to rebase on master? Surely it ought to test the result of merging, not this branch, but...
We technically have bors for this (on Flux and a couple of other repos; it was never set up here IIRC), but heisenbugs and the overwhelming convenience of a rapidly-improving GH UI have meant it's never used these days.
I've rewritten the fold/unfold kernels -- the original ones were not taking full advantage of parallelization and as a result turned out to be super slow. I didn't realize because I was testing them incorrectly (without CUDA.@sync). I'm providing some test metrics below, run on my school's A100 GPU. For reference, I'm also showing the timings of a plain conv call on the same input. I'm not too familiar with how/why the timing spreads look as they do, but the mean and median timings look reasonable to me. Any feedback/suggestions are greatly appreciated! Unless I've made some newb error in these kernels (entirely possible), I think they're ready.
julia> using NNlib, CUDA, NNlibCUDA, BenchmarkTools
julia> x = CUDA.randn(Float32, 128, 128, 32, 10);
julia> cdims = DenseConvDims(x, CUDA.randn(32,32,32,1); stride=32);
julia> y = NNlib.unfold(x, cdims);
julia> z = NNlib.fold(y, size(x), cdims);
julia> w = CUDA.randn(7,7,32,1);
julia> y = conv(x, w);
julia> @benchmark CUDA.@sync conv($x, $w)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 334.307 μs … 3.502 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 346.275 μs ┊ GC (median): 0.00%
Time (mean ± σ): 346.932 μs ± 31.799 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▅▇██▇▅▄▂▁
▂▁▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▇████████████▆▆▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂ ▄
334 μs Histogram: frequency by time 355 μs <
Memory estimate: 2.83 KiB, allocs estimate: 78.
julia> @benchmark CUDA.@sync NNlib.unfold($x, $cdims)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 181.499 μs … 48.214 ms ┊ GC (min … max): 0.00% … 27.28%
Time (median): 184.726 μs ┊ GC (median): 0.00%
Time (mean ± σ): 196.088 μs ± 569.057 μs ┊ GC (mean ± σ): 1.10% ± 0.39%
▄█▆▄▂▁▁▁▁ ▁ ▁
█████████▇███▆▅▅▄▄▃▄▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▃▃▄▄▃▁▁▁▃▃▁▁▃▄▄▄▄▃ █
181 μs Histogram: log(frequency) by time 291 μs <
Memory estimate: 3.11 KiB, allocs estimate: 53.
julia> @benchmark CUDA.@sync NNlib.fold($y, $(size(x)), $cdims)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 189.996 μs … 33.561 ms ┊ GC (min … max): 0.00% … 30.59%
Time (median): 192.325 μs ┊ GC (median): 0.00%
Time (mean ± σ): 208.299 μs ± 533.227 μs ┊ GC (mean ± σ): 0.93% ± 0.42%
▇█▅▃▂▁▂ ▁ ▁
████████▇██▇▇▇▅▅▄▃▄▃▁▁▄▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▁▄▃▄▄▄▄▅▅▁▄▄▃▅▃▅▅▄ █
190 μs Histogram: log(frequency) by time 300 μs <
Memory estimate: 3.14 KiB, allocs estimate: 55.
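A quick sanity check on the stride-32 setup above: the 32×32 windows tile the 128×128 input exactly with no overlap, so fold should exactly invert unfold (with overlapping windows, fold instead accumulates each element once per window containing it):

julia> z ≈ x   # expect true: non-overlapping windows, so fold inverts unfold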
julia> cdims = DenseConvDims(x, w; stride=1);
julia> y = NNlib.unfold(x, cdims);
julia> z = NNlib.fold(y, size(x), cdims);
julia> @benchmark CUDA.@sync NNlib.unfold($x, $cdims)
BenchmarkTools.Trial: 607 samples with 1 evaluation.
Range (min … max): 7.536 ms … 79.206 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.722 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.232 ms ± 4.209 ms ┊ GC (mean ± σ): 0.52% ± 1.11%
█▁
██▇▆▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
7.54 ms Histogram: log(frequency) by time 37.5 ms <
Memory estimate: 6.16 KiB, allocs estimate: 104.
julia> @benchmark CUDA.@sync NNlib.fold($y, $(size(x)), $cdims)
BenchmarkTools.Trial: 633 samples with 1 evaluation.
Range (min … max): 7.537 ms … 14.109 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.735 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.890 ms ± 532.117 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
██▁
▅█████▇▇▅▇▅▄▄▄▃▃▃▃▃▂▁▂▂▁▂▂▁▂▂▂▃▃▂▂▂▂▂▂▂▂▁▂▁▁▂▁▁▁▂▂▂▂▂▁▁▂▁▂▂ ▃
7.54 ms Histogram: frequency by time 9.82 ms <
Memory estimate: 6.22 KiB, allocs estimate: 107.
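One note on methodology, since mistimed kernels came up earlier: CUDA kernel launches are asynchronous, so the CUDA.@sync wrapper in the calls above is what makes @benchmark measure the completed GPU work rather than just the launch overhead. A minimal illustration, reusing the same x and w:

julia> @benchmark conv($x, $w)             # no @sync: times only the asynchronous launch
julia> @benchmark CUDA.@sync conv($x, $w)  # with @sync: times the full GPU execution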
src/fold.jl (Outdated)

end

# check out of bounds
if any((w, h, d) .<= 0 .|| (w, h, d) .> input_size)
Buildkite is complaining this syntax isn't 1.6 compatible. Maybe try:

- if any((w, h, d) .<= 0 .|| (w, h, d) .> input_size)
+ if any((w, h, d) .<= 0 .| (w, h, d) .> input_size)
Ah sorry, looks like | might bind more tightly. The good news is that I found that the stdlib has Base.checkbounds and Base.checkindex, so the current conditional logic could be simplified.
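For reference, a sketch of both options (assuming w, h, and d are integer indices and input_size is a 3-tuple of dimension sizes; not necessarily the PR's final code):

# .| binds more tightly than .<= and .>, so explicit parentheses are needed:
out_of_bounds = any(((w, h, d) .<= 0) .| ((w, h, d) .> input_size))

# or, using the stdlib helper:
out_of_bounds = !checkbounds(Bool, CartesianIndices(input_size), w, h, d)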
nice find, tyty.
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
Let me know if there's anything I can do to help your review, such as timing tests, etc.
I missed that the last commit was passing on CI. Thanks for the great contribution!
One more thing before I merge: can you bump the NNlib compat bound in Project.toml to 0.8.11, so we ensure FluxML/NNlib.jl#444 is present to overload?
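For reference, the resulting compat entry would look something like this (other entries elided):

[compat]
NNlib = "0.8.11"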
GPU kernels for the fold/unfold functions added in NNlib pull #444.
Fold/unfold kernels are adapted from NNlib.im2col! and NNlib.col2im!.
This is my first Julia kernel; feedback is greatly appreciated!
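To give a flavor of the approach, here is a deliberately simplified, hypothetical sketch of the parallelization strategy: one GPU thread per element of the unfolded output. The PR's actual kernels additionally handle channels, padding, dilation, kernel flipping, and three spatial dimensions.

using CUDA

# Hypothetical 1-D unfold: col[k, w, b] = x[(w-1)*stride + k, b],
# computed with one GPU thread per element of col.
function unfold1d_kernel!(col, x, stride)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(col)
        k, w, b = Tuple(CartesianIndices(col)[i])
        @inbounds col[k, w, b] = x[(w - 1) * stride + k, b]
    end
    return nothing
end

function unfold1d(x::CuMatrix{T}, kernel::Int, stride::Int) where {T}
    n_windows = (size(x, 1) - kernel) ÷ stride + 1
    col = CUDA.zeros(T, kernel, n_windows, size(x, 2))
    threads = 256
    blocks = cld(length(col), threads)
    @cuda threads=threads blocks=blocks unfold1d_kernel!(col, x, stride)
    return col
end

For example, unfold1d(CUDA.randn(Float32, 128, 10), 32, 32) would produce a 32×4×10 array of non-overlapping windows.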
PR Checklist