Freeing large buffers takes a while #594

Closed
denizyuret opened this issue Dec 7, 2020 · 2 comments
Labels: cuda array (Stuff about CuArray.), performance (How fast can we go?)

Comments

@denizyuret
Contributor

This is the issue mentioned in denizyuret/Knet.jl#624 (comment)

Here is an MWE: https://gist.github.com/denizyuret/e4155b9e2aeae5e19af6e1fdd2f6716b

The size argument to main is chosen to push GPU memory to its limits: 500 works on a T4, but you may have to lower it if you have less memory.

I used the latest CUDA master and Knet#dy/fix624.

gcnode calls unsafe_free! at https://github.com/denizyuret/Knet.jl/blob/8a944bf853ea7555bfa33909a2d2be4d0ddf0473/src/autograd_gpu/gcnode.jl#L87

Here is the output from TimerOutputs with instrumented code. The top line (free) shows about 32 ms per call for unsafe_free!:

 ─────────────────────────────────────────────────────────────────────
                              Time                   Allocations
                      ──────────────────────   ───────────────────────
   Tot / % measured:       15.7s / 11.3%           1.47GiB / 0.01%

 Section      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────
 free             55    1.77s   100%  32.1ms   4.06KiB  2.00%    75.6B
 n.outgrad        90    703μs  0.04%  7.81μs   44.3KiB  21.8%     504B
 p.outgrad1      105    486μs  0.03%  4.63μs   52.0KiB  25.6%     507B
 init              5    455μs  0.03%  91.0μs   91.2KiB  44.9%  18.2KiB
 haskey           90   88.1μs  0.00%   978ns     0.00B  0.00%    0.00B
 dequeue          55   76.7μs  0.00%  1.39μs   3.44KiB  1.69%    64.0B
 p.outgrad2       15   62.8μs  0.00%  4.19μs   8.28KiB  4.07%     565B
 ni               90   60.8μs  0.00%   676ns     0.00B  0.00%    0.00B
 peek            145   40.3μs  0.00%   278ns     0.00B  0.00%    0.00B
 pi              105   37.7μs  0.00%   359ns     0.00B  0.00%    0.00B
 dict             55   23.8μs  0.00%   433ns     0.00B  0.00%    0.00B
 erase            90   21.9μs  0.00%   243ns     0.00B  0.00%    0.00B
 ─────────────────────────────────────────────────────────────────────
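For readers who want to reproduce a table like the one above, here is a minimal sketch of instrumenting a call with TimerOutputs. The `release!` function is a hypothetical stand-in so the example runs without a GPU; in the setup described in this issue the timed call would be `CUDA.unsafe_free!` on a CuArray. The label "free" and the count of 55 calls are only chosen to mirror the table.

```julia
using TimerOutputs

# Shared timer that accumulates all measurements.
const to = TimerOutput()

# Hypothetical stand-in for the call being measured (an assumption for
# illustration); the issue times CUDA.unsafe_free!(x) instead.
release!(buf::Vector{Float32}) = empty!(buf)

function run_frees(n::Integer)
    for _ in 1:n
        buf = zeros(Float32, 1024)
        # Every pass through this line adds one call to the "free" section.
        @timeit to "free" release!(buf)
    end
    return to
end

run_frees(55)
print_timer(to)  # prints ncalls, total time, and average time per section
```

Wrapping only the call of interest in `@timeit` keeps the measured section free of setup cost, which is what makes the per-call average meaningful.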
@maleadt
Member

maleadt commented Jan 6, 2021

unsafe_free! itself is fast:

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CUDA.rand(2048,2048,8))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     100.415 ns (0.00% GC)
  median time:      116.278 ns (0.00% GC)
  mean time:        116.580 ns (0.00% GC)
  maximum time:     164.305 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     927

The problem here is that your buffers exceed the MAX_POOL threshold, which means they are freed immediately instead of being returned to the pool:

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CUDA.rand(2048,2048,16))
BenchmarkTools.Trial: 
  memory estimate:  8 bytes
  allocs estimate:  0
  --------------
  minimum time:     92.617 μs (0.00% GC)
  median time:      131.811 μs (0.00% GC)
  mean time:        141.102 μs (0.00% GC)
  maximum time:     355.577 μs (0.00% GC)
  --------------
  samples:          1067
  evals/sample:     10
 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            75.4s / 1.58%           5.65GiB / 0.08%    

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 gcnode unsafe_free!       55    871ms  73.1%  15.8ms   4.50KiB  0.10%    83.8B
 kptr unsafe_free!        208    321ms  26.9%  1.54ms   4.38MiB  100%   21.6KiB
 ──────────────────────────────────────────────────────────────────────────────

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            73.6s / 1.33%           5.25GiB / 0.00%    

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 unsafe_free              471    982ms   100%  2.08ms   20.0KiB  100%     43.5B
   free                   218    982ms   100%  4.50ms   19.2KiB  95.9%    90.2B
     try                  195    982ms   100%  5.03ms   12.3KiB  61.6%    64.8B
       pool               195    981ms   100%  5.03ms   9.86KiB  49.2%    51.8B
         actual_free      195    981ms   100%  5.03ms   8.69KiB  43.4%    45.6B
       allocated          195    126μs  0.01%   648ns     0.00B  0.00%    0.00B
       requested          195   19.6μs  0.00%   100ns     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────────

The split pool doesn't suffer from this, but has other issues.
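To make the difference between the two benchmarks concrete, the sketch below works out the buffer sizes involved. It assumes that `CUDA.rand` produces Float32 elements by default and uses an illustrative threshold of 2^27 bytes (128 MiB); the thread does not state the actual MAX_POOL value, so `ASSUMED_MAX_POOL` and the helper names are hypothetical and should be checked against the pool implementation in use.

```julia
# Size in bytes of a dense array with the given element type and dimensions.
buffer_bytes(::Type{T}, dims::Integer...) where {T} = sizeof(T) * prod(dims)

# ASSUMPTION: illustrative threshold, not taken from the thread.
const ASSUMED_MAX_POOL = 2^27  # 128 MiB

# A buffer larger than the threshold would skip the pool and be freed directly.
bypasses_pool(nbytes::Integer) = nbytes > ASSUMED_MAX_POOL

small = buffer_bytes(Float32, 2048, 2048, 8)   # first benchmark
large = buffer_bytes(Float32, 2048, 2048, 16)  # second benchmark

println(div(small, 2^20), " MiB, bypasses pool: ", bypasses_pool(small))
println(div(large, 2^20), " MiB, bypasses pool: ", bypasses_pool(large))
```

Under these assumptions the first buffer is 128 MiB and stays within the pool, while the second is 256 MiB and does not, which would be consistent with the nanosecond versus microsecond timings reported above.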

maleadt changed the title from "unsafe_free! takes 30ms?" to "Freeing large buffers takes a while" on Jan 6, 2021
maleadt added the "cuda array" and "performance" labels on Jan 6, 2021
@maleadt
Member

maleadt commented Mar 5, 2021

Fixed on latest CUDA.jl with CUDA 11.2:

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CUDA.rand(2048,2048,16))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.707 ns (0.00% GC)
  median time:      15.067 ns (0.00% GC)
  mean time:        12.046 ns (0.00% GC)
  maximum time:     87.059 ns (0.00% GC)
  --------------
  samples:          7965
  evals/sample:     1000

maleadt closed this as completed on Mar 5, 2021