added fold/unfold and gpu tests #59

Merged: 10 commits into FluxML:master on Dec 7, 2022

Conversation

@nikopj (Contributor) commented on Nov 28, 2022

GPU kernels for the fold/unfold functions added in NNlib PR #444 (FluxML/NNlib.jl#444).

The fold/unfold kernels are adapted from NNlib.im2col! and NNlib.col2im!.

This is my first Julia kernel; feedback is greatly appreciated!
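A rough usage sketch of what these functions do on the GPU (the output shapes are my reading of the NNlib docstrings; the GPU path is hit simply by passing CuArrays):

    using NNlib, CUDA, NNlibCUDA

    x = CUDA.randn(Float32, 16, 16, 3, 2)   # W × H × C × N image batch
    w = CUDA.randn(Float32, 5, 5, 3, 4)     # only used to construct the conv dims
    cdims = DenseConvDims(x, w)             # stride 1, no padding by default

    y = NNlib.unfold(x, cdims)              # (#sliding windows, 5*5*3, N)
    z = NNlib.fold(y, size(x), cdims)       # back to size(x); overlapping window
                                            # entries are summed, so fold is not an
                                            # exact inverse of unfold when windows overlap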

PR Checklist

  • Tests are added

@ToucheSir (Member)

Let's get FluxML/NNlib.jl#444 merged so we can see what CI thinks of this :)

@nikopj (Contributor, Author) commented on Nov 28, 2022

@ToucheSir do I have to add another commit to trigger tests since #56 was merged?

@mcabbott (Member)

Not entirely sure, but it looks like CI uses the registered version of NNlib. (Ideally it would grab master, I think, no idea if that's easy to change.)

Once JuliaRegistries/General#73026 is merged, this PR should probably raise the compat bound on NNlib to 0.8.11

@nikopj (Contributor, Author) commented on Nov 28, 2022

I think a similar thing is happening again, but this time NNlibCUDA is looking at its registered version rather than master, so #56 being merged doesn't change the fact that the tests are still looking for Julia 1.7 nightly.

@mcabbott (Member)

If it's not seeing #56, it may help to rebase on master? Surely it ought to test the result of merging, not this branch, but...

@ToucheSir (Member) commented on Nov 29, 2022

> Surely it ought to test the result of merging, not this branch, but...

We technically have bors for this (on Flux and a couple of other repos; it was never set up here, IIRC), but heisenbugs and the overwhelming convenience of a rapidly improving GH UI have meant it's never used these days.

@nikopj (Contributor, Author) commented on Dec 3, 2022

I've rewritten the fold/unfold kernels: the original ones were not taking full advantage of parallelization and, as a result, turned out to be super slow. I didn't realize this because I was benchmarking them incorrectly (without CUDA.@sync, so only the asynchronous kernel launch was being timed).

I'm providing some test metrics below from my school's A100 GPU. For reference, I show the timings of a conv first. The timings after that are for fold/unfold with a big window + big stride, then a small window + small stride. The numbers are a huge improvement over the previous commit, which took around 200 ms for the big-window unfold (now ~200 μs).

I'm not too familiar with how/why the timing spreads look as they do, but the mean and median timings look reasonable to me.

Any feedback/suggestion is greatly appreciated! Unless I've made some newb error in these kernels (entirely possible), I think they're ready.

julia> using NNlib, CUDA, NNlibCUDA, BenchmarkTools

julia> x = CUDA.randn(Float32, 128, 128, 32, 10);
      
julia> cdims = DenseConvDims(x, CUDA.randn(32,32,32,1); stride=32);

julia> y = NNlib.unfold(x, cdims);
      
julia> z = NNlib.fold(y, size(x), cdims);

julia> w = CUDA.randn(7,7,32,1);

julia> y = conv(x, w);

julia> @benchmark CUDA.@sync conv($x, $w)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  334.307 μs   3.502 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     346.275 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   346.932 μs ± 31.799 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▁▃▅▇██▇▅▄▂▁
  ▂▁▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▇████████████▆▆▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂ ▄
  334 μs          Histogram: frequency by time          355 μs <

 Memory estimate: 2.83 KiB, allocs estimate: 78.

julia> @benchmark CUDA.@sync NNlib.unfold($x, $cdims)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  181.499 μs   48.214 ms  ┊ GC (min  max): 0.00%  27.28%
 Time  (median):     184.726 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.088 μs ± 569.057 μs  ┊ GC (mean ± σ):  1.10% ±  0.39%

  ▄█▆▄▂▁▁▁▁  ▁                                                  ▁
  █████████▇███▆▅▅▄▄▃▄▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▃▃▄▄▃▁▁▁▃▃▁▁▃▄▄▄▄▃ █
  181 μs        Histogram: log(frequency) by time        291 μs <

 Memory estimate: 3.11 KiB, allocs estimate: 53.

julia> @benchmark CUDA.@sync NNlib.fold($y, $(size(x)), $cdims)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  189.996 μs   33.561 ms  ┊ GC (min  max): 0.00%  30.59%
 Time  (median):     192.325 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   208.299 μs ± 533.227 μs  ┊ GC (mean ± σ):  0.93% ±  0.42%

  ▇█▅▃▂▁▂   ▁                                                   ▁
  ████████▇██▇▇▇▅▅▄▃▄▃▁▁▄▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▁▄▃▄▄▄▄▅▅▁▄▄▃▅▃▅▅▄ █
  190 μs        Histogram: log(frequency) by time        300 μs <

 Memory estimate: 3.14 KiB, allocs estimate: 55.
       
julia> cdims = DenseConvDims(x, w; stride=1);

julia> y = NNlib.unfold(x, cdims);
       
julia> z = NNlib.fold(y, size(x), cdims);

julia> @benchmark CUDA.@sync NNlib.unfold($x, $cdims)
BenchmarkTools.Trial: 607 samples with 1 evaluation.
 Range (min  max):  7.536 ms  79.206 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     7.722 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.232 ms ±  4.209 ms  ┊ GC (mean ± σ):  0.52% ± 1.11%

  █▁
  ██▇▆▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
  7.54 ms      Histogram: log(frequency) by time     37.5 ms <

 Memory estimate: 6.16 KiB, allocs estimate: 104.

julia> @benchmark CUDA.@sync NNlib.fold($y, $(size(x)), $cdims)
BenchmarkTools.Trial: 633 samples with 1 evaluation.
 Range (min  max):  7.537 ms   14.109 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     7.735 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.890 ms ± 532.117 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ██▁
  ▅█████▇▇▅▇▅▄▄▄▃▃▃▃▃▂▁▂▂▁▂▂▁▂▂▂▃▃▂▂▂▂▂▂▂▂▁▂▁▁▂▁▁▁▂▂▂▂▂▁▁▂▁▂▂ ▃
  7.54 ms         Histogram: frequency by time        9.82 ms <

 Memory estimate: 6.22 KiB, allocs estimate: 107.
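For anyone curious, here's a heavily simplified sketch of the one-thread-per-output-element pattern the rewritten kernels follow (2D input, single channel, single batch, stride 1, no padding; illustrative only, not the actual kernel in this PR):

    using CUDA

    # Each thread writes exactly one element of the unfolded (im2col) output.
    function unfold_kernel!(col, x, k)
        W, H = size(x)
        ow, oh = W - k + 1, H - k + 1             # number of window positions per axis
        n = ow * oh * k * k                       # total output elements
        idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if idx <= n
            pos, off = divrem(idx - 1, k * k)     # window index / offset within window (0-based)
            pw, ph = divrem(pos, oh)              # window position along each axis
            kw, kh = divrem(off, k)               # offset within the window along each axis
            @inbounds col[pos + 1, off + 1] = x[pw + kw + 1, ph + kh + 1]
        end
        return nothing
    end

    x   = CUDA.rand(Float32, 8, 8)
    k   = 3
    col = CUDA.zeros(Float32, (8 - k + 1)^2, k * k)
    threads = 256
    blocks  = cld(length(col), threads)
    CUDA.@sync @cuda threads=threads blocks=blocks unfold_kernel!(col, x, k)

The actual kernels additionally handle channels, batches, stride, padding, and dilation, but the launch/indexing pattern is the same idea: one linear thread index decoded into one output position.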

src/fold.jl Outdated
end

# check out of bounds
if any((w, h, d) .<= 0 .|| (w, h, d) .> input_size)
Review comment (Member):

Buildkite is complaining this syntax isn't 1.6 compatible. Maybe try

Suggested change
- if any((w, h, d) .<= 0 .|| (w, h, d) .> input_size)
+ if any((w, h, d) .<= 0 .| (w, h, d) .> input_size)

Review comment (Member):

Ah sorry, it looks like | binds more tightly than the comparisons, so that suggestion would parse incorrectly. The good news is that the stdlib has Base.checkbounds and Base.checkindex, so the current conditional logic could be simplified.
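For example, something along these lines should be 1.6-compatible (untested sketch with made-up values; the kernel's actual variables are w, h, d and input_size):

    # hypothetical example values, not from the PR
    input_size = (128, 128, 32)
    w, h, d = 3, 130, 10                    # h is out of bounds here

    # Base.checkbounds can treat the dims tuple as an index space:
    ok = checkbounds(Bool, CartesianIndices(input_size), w, h, d)   # false

    # or, keeping the broadcast style, chained comparisons avoid the | precedence issue:
    ok = all(1 .<= (w, h, d) .<= input_size)                        # false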

Reply (Contributor, Author):

nice find, tyty.

nikopj and others added 3 commits December 3, 2022 20:03
@nikopj (Contributor, Author) commented on Dec 7, 2022

Let me know if there's anything I can do to help your review, such as timing tests, etc.

@ToucheSir (Member) left a comment

I missed that the last commit was passing on CI. Thanks for the great contribution!

One more thing before I merge: can you bump the NNlib compat in Project.toml to 0.8.11 so we ensure FluxML/NNlib.jl#444 is present to overload?
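That is, something like this in the [compat] section of Project.toml (other entries unchanged):

    [compat]
    NNlib = "0.8.11"

With Julia's default caret semantics this allows any NNlib 0.8.x release at or above 0.8.11, but not 0.9.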

@ToucheSir merged commit 6f910bd into FluxML:master on Dec 7, 2022