Freeing large buffers takes a while #594

denizyuret · 2020-12-07T08:58:00Z

This is the issue mentioned in a comment on denizyuret/Knet.jl#624.

Here is an MWE: https://gist.github.com/denizyuret/e4155b9e2aeae5e19af6e1fdd2f6716b

The size argument to main is chosen to push GPU memory to its limits: 500 works on a T4, but you may have to lower it if you have less memory.

I used the latest CUDA master and Knet#dy/fix624.

gcnode calls unsafe_free! at https://github.com/denizyuret/Knet.jl/blob/8a944bf853ea7555bfa33909a2d2be4d0ddf0473/src/autograd_gpu/gcnode.jl#L87

Here is the output from TimerOutputs with instrumented code; the top line (free) shows 32 ms per call for unsafe_free!:

 ─────────────────────────────────────────────────────────────────────
                              Time                   Allocations
                      ──────────────────────   ───────────────────────
   Tot / % measured:       15.7s / 11.3%           1.47GiB / 0.01%

 Section      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────
 free             55    1.77s   100%  32.1ms   4.06KiB  2.00%    75.6B
 n.outgrad        90    703μs  0.04%  7.81μs   44.3KiB  21.8%     504B
 p.outgrad1      105    486μs  0.03%  4.63μs   52.0KiB  25.6%     507B
 init              5    455μs  0.03%  91.0μs   91.2KiB  44.9%  18.2KiB
 haskey           90   88.1μs  0.00%   978ns     0.00B  0.00%    0.00B
 dequeue          55   76.7μs  0.00%  1.39μs   3.44KiB  1.69%    64.0B
 p.outgrad2       15   62.8μs  0.00%  4.19μs   8.28KiB  4.07%     565B
 ni               90   60.8μs  0.00%   676ns     0.00B  0.00%    0.00B
 peek            145   40.3μs  0.00%   278ns     0.00B  0.00%    0.00B
 pi              105   37.7μs  0.00%   359ns     0.00B  0.00%    0.00B
 dict             55   23.8μs  0.00%   433ns     0.00B  0.00%    0.00B
 erase            90   21.9μs  0.00%   243ns     0.00B  0.00%    0.00B
 ─────────────────────────────────────────────────────────────────────


maleadt · 2021-01-06T10:55:45Z

unsafe_free! itself is fast:

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CUDA.rand(2048,2048,8))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     100.415 ns (0.00% GC)
  median time:      116.278 ns (0.00% GC)
  mean time:        116.580 ns (0.00% GC)
  maximum time:     164.305 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     927

The problem here is that your buffers exceed the MAX_POOL threshold, which means they are freed immediately:

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CUDA.rand(2048,2048,16))
BenchmarkTools.Trial: 
  memory estimate:  8 bytes
  allocs estimate:  0
  --------------
  minimum time:     92.617 μs (0.00% GC)
  median time:      131.811 μs (0.00% GC)
  mean time:        141.102 μs (0.00% GC)
  maximum time:     355.577 μs (0.00% GC)
  --------------
  samples:          1067
  evals/sample:     10
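
To make the difference between the two benchmarks concrete, the buffer sizes can be worked out by hand. The sketch below is only an illustration under stated assumptions: it assumes `CUDA.rand` produces `Float32` elements, and it uses a hypothetical constant `ASSUMED_MAX_POOL` of 2^27 bytes (128 MiB) as a stand-in for MAX_POOL, whose actual value is not stated in this thread. The helper names `buffer_bytes` and `is_pooled` are made up for the example and are not part of CUDA.jl. It runs without a GPU.

```julia
# Assumption: CUDA.rand(dims...) yields Float32 elements (4 bytes each).
const ELTYPE = Float32

# Assumption, not taken from this thread: a pooling threshold of 2^27 bytes (128 MiB).
const ASSUMED_MAX_POOL = 2^27

# Hypothetical helper: size in bytes of a dense buffer with the given dimensions.
buffer_bytes(dims...) = prod(dims) * sizeof(ELTYPE)

# Hypothetical helper: would a buffer of this size stay at or below the assumed threshold?
is_pooled(dims...) = buffer_bytes(dims...) <= ASSUMED_MAX_POOL

for dims in ((2048, 2048, 8), (2048, 2048, 16))
    mib = buffer_bytes(dims...) / 2^20
    println(dims, ": ", mib, " MiB, within assumed threshold: ", is_pooled(dims...))
end
```

Under these assumptions the first buffer sits exactly at the threshold while the second is twice as large, which would be consistent with the jump from nanoseconds to microseconds between the two benchmarks.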

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            75.4s / 1.58%           5.65GiB / 0.08%    

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 gcnode unsafe_free!       55    871ms  73.1%  15.8ms   4.50KiB  0.10%    83.8B
 kptr unsafe_free!        208    321ms  26.9%  1.54ms   4.38MiB  100%   21.6KiB
 ──────────────────────────────────────────────────────────────────────────────
 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            73.6s / 1.33%           5.25GiB / 0.00%    

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 unsafe_free              471    982ms   100%  2.08ms   20.0KiB  100%     43.5B
   free                   218    982ms   100%  4.50ms   19.2KiB  95.9%    90.2B
     try                  195    982ms   100%  5.03ms   12.3KiB  61.6%    64.8B
       pool               195    981ms   100%  5.03ms   9.86KiB  49.2%    51.8B
         actual_free      195    981ms   100%  5.03ms   8.69KiB  43.4%    45.6B
       allocated          195    126μs  0.01%   648ns     0.00B  0.00%    0.00B
       requested          195   19.6μs  0.00%   100ns     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────────

The split pool doesn't suffer from this, but has other issues.

maleadt · 2021-03-05T15:16:58Z

Fixed on latest CUDA.jl with CUDA 11.2:

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CUDA.rand(2048,2048,16))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.707 ns (0.00% GC)
  median time:      15.067 ns (0.00% GC)
  mean time:        12.046 ns (0.00% GC)
  maximum time:     87.059 ns (0.00% GC)
  --------------
  samples:          7965
  evals/sample:     1000

maleadt changed the title from "unsafe_free! takes 30ms?" to "Freeing large buffers takes a while" Jan 6, 2021

maleadt added the "cuda array" and "performance" labels Jan 6, 2021

maleadt closed this as completed Mar 5, 2021
