added fold/unfold and gpu tests #59

Merged: 10 commits into FluxML:master on Dec 7, 2022

Conversation

@nikopj (Contributor) commented on Nov 28, 2022

GPU kernels for the fold/unfold functions added in NNlib PR #444 (FluxML/NNlib.jl#444).

The fold/unfold kernels are adapted from NNlib.im2col! and NNlib.col2im!.

This is my first Julia kernel; feedback is greatly appreciated!
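A rough usage sketch of what these functions do on the GPU (the output shapes are my reading of the NNlib docstrings; the GPU path is hit simply by passing CuArrays):

    using NNlib, CUDA, NNlibCUDA

    x = CUDA.randn(Float32, 16, 16, 3, 2)   # W × H × C × N image batch
    w = CUDA.randn(Float32, 5, 5, 3, 4)     # only used to construct the conv dims
    cdims = DenseConvDims(x, w)             # stride 1, no padding by default

    y = NNlib.unfold(x, cdims)              # (#sliding windows, 5*5*3, N)
    z = NNlib.fold(y, size(x), cdims)       # back to size(x); overlapping window
                                            # entries are summed, so fold is not an
                                            # exact inverse of unfold when windows overlap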

PR Checklist

  • Tests are added

@ToucheSir (Member)

Let's get FluxML/NNlib.jl#444 merged so we can see what CI thinks of this :)

@nikopj (Contributor, Author) commented on Nov 28, 2022

@ToucheSir do I have to add another commit to trigger tests since #56 was merged?

@mcabbott (Member)

Not entirely sure, but it looks like CI uses the registered version of NNlib. (Ideally it would grab master, I think, no idea if that's easy to change.)

Once JuliaRegistries/General#73026 is merged, this PR should probably raise the compat bound on NNlib to 0.8.11

@nikopj (Contributor, Author) commented on Nov 28, 2022

I think a similar thing is happening again, but this time NNlibCUDA is looking at its registered version rather than master, so #56 being merged doesn't change the fact that the tests are still looking for Julia 1.7 nightly.

@mcabbott (Member)

If it's not seeing #56, it may help to rebase on master? Surely it ought to test the result of merging, not this branch, but...

@ToucheSir (Member) commented on Nov 29, 2022

> Surely it ought to test the result of merging, not this branch, but...

We technically have bors for this (on Flux and a couple of other repos; it was never set up here, IIRC), but heisenbugs and the overwhelming convenience of a rapidly improving GH UI have meant it's never used these days.

@nikopj (Contributor, Author) commented on Dec 3, 2022

I've rewritten the fold/unfold kernels: the original ones were not taking full advantage of parallelization and, as a result, turned out to be super slow. I didn't realize this because I was benchmarking them incorrectly (without CUDA.@sync, so only the asynchronous kernel launch was being timed).

I'm providing some test metrics below from my school's A100 GPU. For reference, I show the timings of a conv first. The timings after that are for fold/unfold with a big window + big stride, then a small window + small stride. The numbers are a huge improvement over the previous commit, which took around 200 ms for the big-window unfold (now ~200 μs).

I'm not too familiar with how/why the timing spreads look as they do, but the mean and median timings look reasonable to me.

Any feedback/suggestion is greatly appreciated! Unless I've made some newb error in these kernels (entirely possible), I think they're ready.

julia> using NNlib, CUDA, NNlibCUDA, BenchmarkTools

julia> x = CUDA.randn(Float32, 128, 128, 32, 10);
      
julia> cdims = DenseConvDims(x, CUDA.randn(32,32,32,1); stride=32);

julia> y = NNlib.unfold(x, cdims);
      
julia> z = NNlib.fold(y, size(x), cdims);

julia> w = CUDA.randn(7,7,32,1);

julia> y = conv(x, w);

julia> @benchmark CUDA.@sync conv($x, $w)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  334.307 μs   3.502 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     346.275 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   346.932 μs ± 31.799 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▁▃▅▇██▇▅▄▂▁
  ▂▁▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▇████████████▆▆▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂ ▄
  334 μs          Histogram: frequency by time          355 μs <

 Memory estimate: 2.83 KiB, allocs estimate: 78.

julia> @benchmark CUDA.@sync NNlib.unfold($x, $cdims)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  181.499 μs   48.214 ms  ┊ GC (min  max): 0.00%  27.28%
 Time  (median):     184.726 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.088 μs ± 569.057 μs  ┊ GC (mean ± σ):  1.10% ±  0.39%

  ▄█▆▄▂▁▁▁▁  ▁                                                  ▁
  █████████▇███▆▅▅▄▄▃▄▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▃▃▄▄▃▁▁▁▃▃▁▁▃▄▄▄▄▃ █
  181 μs        Histogram: log(frequency) by time        291 μs <

 Memory estimate: 3.11 KiB, allocs estimate: 53.

julia> @benchmark CUDA.@sync NNlib.fold($y, $(size(x)), $cdims)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  189.996 μs   33.561 ms  ┊ GC (min  max): 0.00%  30.59%
 Time  (median):     192.325 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   208.299 μs ± 533.227 μs  ┊ GC (mean ± σ):  0.93% ±  0.42%

  ▇█▅▃▂▁▂   ▁                                                   ▁
  ████████▇██▇▇▇▅▅▄▃▄▃▁▁▄▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▁▄▃▄▄▄▄▅▅▁▄▄▃▅▃▅▅▄ █
  190 μs        Histogram: log(frequency) by time        300 μs <

 Memory estimate: 3.14 KiB, allocs estimate: 55.
       
julia> cdims = DenseConvDims(x, w; stride=1);

julia> y = NNlib.unfold(x, cdims);
       
julia> z = NNlib.fold(y, size(x), cdims);

julia> @benchmark CUDA.@sync NNlib.unfold($x, $cdims)
BenchmarkTools.Trial: 607 samples with 1 evaluation.
 Range (min  max):  7.536 ms  79.206 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     7.722 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.232 ms ±  4.209 ms  ┊ GC (mean ± σ):  0.52% ± 1.11%

  █▁
  ██▇▆▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
  7.54 ms      Histogram: log(frequency) by time     37.5 ms <

 Memory estimate: 6.16 KiB, allocs estimate: 104.

julia> @benchmark CUDA.@sync NNlib.fold($y, $(size(x)), $cdims)
BenchmarkTools.Trial: 633 samples with 1 evaluation.
 Range (min  max):  7.537 ms   14.109 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     7.735 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.890 ms ± 532.117 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ██▁
  ▅█████▇▇▅▇▅▄▄▄▃▃▃▃▃▂▁▂▂▁▂▂▁▂▂▂▃▃▂▂▂▂▂▂▂▂▁▂▁▁▂▁▁▁▂▂▂▂▂▁▁▂▁▂▂ ▃
  7.54 ms         Histogram: frequency by time        9.82 ms <

 Memory estimate: 6.22 KiB, allocs estimate: 107.
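For anyone curious, here's a heavily simplified sketch of the one-thread-per-output-element pattern the rewritten kernels follow (2D input, single channel, single batch, stride 1, no padding; illustrative only, not the actual kernel in this PR):

    using CUDA

    # Each thread writes exactly one element of the unfolded (im2col) output.
    function unfold_kernel!(col, x, k)
        W, H = size(x)
        ow, oh = W - k + 1, H - k + 1             # number of window positions per axis
        n = ow * oh * k * k                       # total output elements
        idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if idx <= n
            pos, off = divrem(idx - 1, k * k)     # window index / offset within window (0-based)
            pw, ph = divrem(pos, oh)              # window position along each axis
            kw, kh = divrem(off, k)               # offset within the window along each axis
            @inbounds col[pos + 1, off + 1] = x[pw + kw + 1, ph + kh + 1]
        end
        return nothing
    end

    x   = CUDA.rand(Float32, 8, 8)
    k   = 3
    col = CUDA.zeros(Float32, (8 - k + 1)^2, k * k)
    threads = 256
    blocks  = cld(length(col), threads)
    CUDA.@sync @cuda threads=threads blocks=blocks unfold_kernel!(col, x, k)

The actual kernels additionally handle channels, batches, stride, padding, and dilation, but the launch/indexing pattern is the same idea: one linear thread index decoded into one output position.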

src/fold.jl Outdated
end

# check out of bounds
if any((w, h, d) .<= 0 .|| (w, h, d) .> input_size)
Review comment (Member):

Buildkite is complaining this syntax isn't 1.6 compatible. Maybe try

Suggested change
- if any((w, h, d) .<= 0 .|| (w, h, d) .> input_size)
+ if any((w, h, d) .<= 0 .| (w, h, d) .> input_size)

Review comment (Member):

Ah sorry, it looks like | binds more tightly than the comparisons, so that suggestion would parse incorrectly. The good news is that the stdlib has Base.checkbounds and Base.checkindex, so the current conditional logic could be simplified.
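For example, something along these lines should be 1.6-compatible (untested sketch with made-up values; the kernel's actual variables are w, h, d and input_size):

    # hypothetical example values, not from the PR
    input_size = (128, 128, 32)
    w, h, d = 3, 130, 10                    # h is out of bounds here

    # Base.checkbounds can treat the dims tuple as an index space:
    ok = checkbounds(Bool, CartesianIndices(input_size), w, h, d)   # false

    # or, keeping the broadcast style, chained comparisons avoid the | precedence issue:
    ok = all(1 .<= (w, h, d) .<= input_size)                        # false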

Reply (Contributor, Author):

nice find, tyty.

nikopj and others added 3 commits December 3, 2022 20:03
@nikopj (Contributor, Author) commented on Dec 7, 2022

Let me know if there's anything I can do to help your review, such as timing tests, etc.

@ToucheSir (Member) left a comment

I missed that the last commit was passing on CI. Thanks for the great contribution!

One more thing before I merge: can you bump the NNlib compat in Project.toml to 0.8.11 so we ensure FluxML/NNlib.jl#444 is present to overload?
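That is, something like this in the [compat] section of Project.toml (other entries unchanged):

    [compat]
    NNlib = "0.8.11"

With Julia's default caret semantics this allows any NNlib 0.8.x release at or above 0.8.11, but not 0.9.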

@ToucheSir merged commit 6f910bd into FluxML:master on Dec 7, 2022