Fixes to split memory pool #427

maleadt · 2019-09-20T11:48:13Z

No description provided.

maleadt · 2019-09-20T14:21:53Z

Also pushed some optimizations. On an artificial alloc-heavy workload by Deniz, before:

                                              Time          
                                      ──────────────────────
           Tot / % measured:                124s / 78.4%    

 Section                      ncalls     time   %tot     avg
 ───────────────────────────────────────────────────────────
 pooled alloc                  1.01M    97.3s   100%  96.3μs
   1.1 repopulate + compact     184k    24.7s  25.4%   134μs
     pooled free                228k   82.5ms  0.08%   362ns
   1.2 scan                    1.01M    43.9s  45.0%  43.5μs
     pooled free               81.7k   29.8ms  0.03%   365ns
   1.3 alloc                   19.7k    10.5s  10.8%   533μs
   1.4 reclaim + alloc             7    5.06s  5.19%   723ms
   2.0 gc(false)                   2    602ms  0.62%   301ms
     pooled free               33.6k   12.8ms  0.01%   380ns
   2.1 repopulate + compact        2    799ms  0.82%   400ms
   2.2 scan                        2    477μs  0.00%   238μs
   pooled free                  236k   86.4ms  0.09%   366ns
 pooled free                    431k    213ms  0.22%   496ns
 ───────────────────────────────────────────────────────────

After:

                                              Time          
                                      ──────────────────────
           Tot / % measured:               74.6s / 65.1%    

 Section                      ncalls     time   %tot     avg
 ───────────────────────────────────────────────────────────
 pooled alloc                  1.01M    48.3s   100%  47.8μs
   1.1 repopulate + compact     184k    24.0s  49.4%   130μs
     pooled free                149k   54.8ms  0.11%   368ns
   1.2 scan                    1.01M    8.38s  17.3%  8.29μs
     pooled free                175k   64.6ms  0.13%   369ns
   1.3 alloc                   7.81k    4.44s  9.15%   568μs
   1.4 reclaim + alloc             2   61.2ms  0.13%  30.6ms
   2.0 gc(false)                   1    282ms  0.58%   282ms
     pooled free               17.2k   6.44ms  0.01%   375ns
   2.1 repopulate + compact        1    426ms  0.88%   426ms
   2.2 scan                        1   8.65μs  0.00%  8.65μs
   pooled free                  227k   83.3ms  0.17%   367ns
 pooled free                    440k    190ms  0.39%   433ns
 ───────────────────────────────────────────────────────────

Old (but still default) binned allocator:

                                   Time          
                           ──────────────────────
     Tot / % measured:          34.1s / 26.2%    

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       10    396ms  4.44%  39.6ms
   pooled free      55.9k   30.1ms  0.34%   540ns
   reclaim             10   25.9μs  0.00%  2.59μs
   scan                10   13.4μs  0.00%  1.34μs
 pooled alloc       1.01M    8.30s  93.0%  8.21μs
   1. try alloc     16.7k    5.33s  59.7%   319μs
   2. gc(false)        44    1.83s  20.5%  41.5ms
     pooled free     598k    319ms  3.57%   533ns
 pooled free         357k    284ms  3.19%   796ns
 ────────────────────────────────────────────────

Basically, on this artificial workload the new allocator is still quite a bit slower than the binned one (48.3s vs 8.30s in pooled alloc), but it doesn't lean on the GC as much, which should be a (strong) positive in more realistic workloads like https://github.com/JuliaGPU/CuArrays.jl/issues/273. Proof (also convincing myself I'm doing something useful here):

resnet50 with binned memory allocator, no memory cap (32GB GPU):

 ────────────────────────────────────────────────
                                   Time          
                           ──────────────────────
     Tot / % measured:          18.3s / 16.6%    

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task        9    209ms  6.89%  23.2ms
   pooled free        870    494μs  0.02%   568ns
   reclaim              9    109ms  3.59%  12.1ms
   scan                 9   13.5μs  0.00%  1.50μs
 pooled alloc       65.1k    2.82s  92.8%  43.3μs
   1. try alloc     2.00k    1.58s  52.2%   793μs
   2. gc(false)       133    1.10s  36.1%  8.25ms
     pooled free    57.0k   29.4ms  0.97%   517ns
 pooled free        6.78k   7.91ms  0.26%  1.17μs
 ────────────────────────────────────────────────

resnet50 with split memory allocator, no memory cap (32GB GPU):

 ───────────────────────────────────────────────────────────
                                              Time          
                                      ──────────────────────
           Tot / % measured:               18.6s / 25.3%    

 Section                      ncalls     time   %tot     avg
 ───────────────────────────────────────────────────────────
 pooled alloc                  65.1k    4.69s   100%  72.0μs
   1.1 repopulate + compact    6.79k    267ms  5.69%  39.4μs
   1.2 scan                    65.1k    393ms  8.35%  6.03μs
   1.3 alloc                   2.13k    1.54s  32.7%   722μs
   1.4 reclaim + alloc            32    563ms  12.0%  17.6ms
   2.0 gc(false)                  31    651ms  13.8%  21.0ms
     pooled free               56.3k   20.1ms  0.43%   356ns
   2.1 repopulate + compact       31    772ms  16.4%  24.9ms
   2.2 scan                       31    212μs  0.00%  6.85μs
 pooled free                   6.79k   10.7ms  0.23%  1.57μs
 ───────────────────────────────────────────────────────────

Similar situation to the artificial benchmark: without memory pressure, the split allocator spends more time in pooled alloc (4.69s vs 2.82s). However, adding some memory pressure to the mix:

resnet50 with binned memory allocator, 8GB memory cap:

 ────────────────────────────────────────────────────
                                       Time          
                               ──────────────────────
       Tot / % measured:            82.2s / 84.0%    

 Section               ncalls     time   %tot     avg
 ────────────────────────────────────────────────────
 background task           34    153ms  0.22%  4.49ms
   pooled free            940    533μs  0.00%   567ns
   reclaim                 34   77.2μs  0.00%  2.27μs
   scan                    34   50.4μs  0.00%  1.48μs
 pooled alloc           65.1k    68.9s   100%  1.06ms
   1. try alloc         23.7k    5.66s  8.19%   239μs
   2. gc(false)         4.31k    15.6s  22.6%  3.62ms
     pooled free        33.1k   19.4ms  0.03%   586ns
   3. reclaim unused    3.32k    7.25s  10.5%  2.19ms
     reclaim            3.32k    7.24s  10.5%  2.18ms
     scan               3.32k   2.86ms  0.00%   863ns
   4. try alloc         3.32k    3.32s  4.81%  1.00ms
   5. gc(true)            127    36.9s  53.4%   291ms
     pooled free        24.2k   13.5ms  0.02%   557ns
 pooled free            6.78k   7.53ms  0.01%  1.11μs
 ────────────────────────────────────────────────────

resnet50 with split memory allocator, 8GB memory cap:

 ───────────────────────────────────────────────────────────
                                              Time          
                                      ──────────────────────
           Tot / % measured:               48.5s / 72.6%    

 Section                      ncalls     time   %tot     avg
 ───────────────────────────────────────────────────────────
 pooled alloc                  65.1k    35.2s   100%   540μs
   1.1 repopulate + compact    6.79k    262ms  0.74%  38.7μs
   1.2 scan                    65.1k    262ms  0.74%  4.03μs
   1.3 alloc                   20.1k    5.96s  16.9%   296μs
   1.4 reclaim + alloc           439    5.08s  14.4%  11.6ms
   2.0 gc(false)                 273    2.04s  5.80%  7.48ms
     pooled free               44.1k   15.6ms  0.04%   354ns
   2.1 repopulate + compact      218    625ms  1.78%  2.87ms
   2.2 scan                      273   1.29ms  0.00%  4.71μs
   2.3 alloc                      81   22.0μs  0.00%   271ns
   2.4 reclaim + alloc            81    1.29s  3.68%  16.0ms
   3.0 gc(true)                   64    18.9s  53.7%   295ms
     pooled free               13.5k   5.76ms  0.02%   427ns
   3.1 repopulate + compact       64    183ms  0.52%  2.86ms
   3.2 scan                       64    355μs  0.00%  5.55μs
 pooled free                   6.79k   10.2ms  0.03%  1.50μs
 ───────────────────────────────────────────────────────────

That's a ~50% reduction in alloc time (68.9s -> 35.2s), and gc(false+true) time drops from ~50s to ~20s. There are still some optimizations to be made before the new allocator outperforms the old one in all cases, but I'm confident that's possible.
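The figures quoted here can be rechecked from the two 8GB-cap tables. A quick sketch in Python, with the values copied by hand from the `pooled alloc`, `gc(false)` and `gc(true)` rows (the variable and function names are mine, not part of CuArrays):

```python
# Times in seconds, copied from the two 8GB-cap resnet50 tables above.
binned = {"pooled alloc": 68.9, "gc(false)": 15.6, "gc(true)": 36.9}
split = {"pooled alloc": 35.2, "gc(false)": 2.04, "gc(true)": 18.9}


def gc_total(timings):
    """Combined time spent in gc(false) and gc(true)."""
    return timings["gc(false)"] + timings["gc(true)"]


# Fraction of pooled alloc time saved by the split allocator.
alloc_reduction = 1 - split["pooled alloc"] / binned["pooled alloc"]

print(f"alloc time reduction: {alloc_reduction:.0%}")
print(f"gc time: {gc_total(binned):.1f}s -> {gc_total(split):.1f}s")
```

This gives roughly 49% and 52.5s -> 20.9s, which round to the numbers in the comment.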

maleadt · 2019-09-25T17:21:19Z

Summary of improvements with Flux.jl (still using Tracker.jl) + @KristofferC's resnet50:

default allocator, 16GB GPU: ~25s
default allocator, 8GB cap: ~50s
new allocator, 16GB GPU: ~20s
new allocator, 8GB cap: ~27s

So solid improvements across the board. Even stronger improvements on workloads that are less GC-heavy (e.g. Knet.jl models, which free memory much more aggressively, before it ever reaches the GC).
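To put numbers on "across the board", here is a small Python sketch that turns the four approximate timings from the summary into ratios. The dictionary layout and names are only for illustration.

```python
# End-to-end times in seconds, as quoted in the summary above (approximate).
times = {
    "default": {"16GB GPU": 25, "8GB cap": 50},
    "new": {"16GB GPU": 20, "8GB cap": 27},
}

# How much faster the new allocator is in each configuration.
speedups = {
    config: times["default"][config] / times["new"][config]
    for config in ("16GB GPU", "8GB cap")
}

# How much each allocator slows down when memory is capped at 8GB.
slowdowns = {
    allocator: t["8GB cap"] / t["16GB GPU"] for allocator, t in times.items()
}

for config, s in speedups.items():
    print(f"{config}: new allocator is {s:.2f}x faster")
for allocator, s in slowdowns.items():
    print(f"{allocator} allocator: the 8GB cap costs {s:.2f}x")
```

With these inputs the new allocator comes out about 1.25x faster on the 16GB GPU and about 1.85x faster under the 8GB cap, while the cap costs the default allocator about 2x but the new one only about 1.35x.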

I propose to tag and release a version of CuArrays with the new allocator available but not enabled by default (CUARRAYS_MEMORY_POOL=split).
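For anyone who wants to try this from a benchmark driver, a minimal Python sketch of launching a process with that variable set. It assumes the variable has to be in the environment before Julia starts and CuArrays is loaded; the comment above only confirms the name and value. `run_with_split_pool` and `train_resnet50.jl` are hypothetical names.

```python
import os
import subprocess


def run_with_split_pool(cmd):
    """Run `cmd` with CUARRAYS_MEMORY_POOL=split added to its environment."""
    # Copy the current environment and opt in to the split allocator.
    env = dict(os.environ, CUARRAYS_MEMORY_POOL="split")
    return subprocess.run(cmd, env=env, check=True)


# Hypothetical usage; train_resnet50.jl stands in for your own script:
# run_with_split_pool(["julia", "train_resnet50.jl"])
```

Leaving the variable unset keeps the default binned allocator, so the same driver can run both configurations for comparison.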

Detailed timings for those interested:

Binned pool, 16GB GPU:

 ────────────────────────────────────────────────────
                                       Time
                               ──────────────────────
       Tot / % measured:            25.2s / 59.5%

 Section               ncalls     time   %tot     avg
 ────────────────────────────────────────────────────
 background task           17   59.8ms  0.40%  3.52ms
   pooled free          2.29k    806μs  0.01%   351ns
   reclaim                 17    277μs  0.00%  16.3μs
   scan                    17   11.8μs  0.00%   697ns
 pooled alloc           58.3k    15.0s   100%   256μs
   1. try alloc         4.56k    2.08s  13.9%   457μs
   2. gc(false)         1.15k    3.90s  26.0%  3.41ms
     pooled free        42.0k   16.7ms  0.11%   399ns
   3. reclaim unused      692    1.44s  9.63%  2.09ms
     reclaim              692    1.44s  9.61%  2.09ms
     scan                 692    872μs  0.01%  1.26μs
   4. try alloc           692    2.48s  16.5%  3.58ms
   5. gc(true)             27    4.97s  33.1%   184ms
     pooled free        14.2k   4.41ms  0.03%   310ns
 ────────────────────────────────────────────────────

Binned pool, 8GB cap:

 ────────────────────────────────────────────────────
                                       Time
                               ──────────────────────
       Tot / % measured:            49.0s / 85.6%

 Section               ncalls     time   %tot     avg
 ────────────────────────────────────────────────────
 background task           33    115ms  0.28%  3.49ms
   pooled free            930    398μs  0.00%   427ns
   reclaim                 33   39.8μs  0.00%  1.21μs
   scan                    33   19.0μs  0.00%   575ns
 pooled alloc           58.3k    41.8s   100%   716μs
   1. try alloc         23.7k    1.08s  2.59%  45.8μs
   2. gc(false)         4.31k    12.4s  29.7%  2.89ms
     pooled free        33.2k   15.1ms  0.04%   455ns
   3. reclaim unused    3.32k    4.00s  9.56%  1.21ms
     reclaim            3.32k    4.00s  9.54%  1.20ms
     scan               3.32k   3.47ms  0.01%  1.05μs
   4. try alloc         3.32k    1.98s  4.73%   598μs
   5. gc(true)            127    22.2s  52.9%   175ms
     pooled free        24.2k   8.03ms  0.02%   332ns
 ────────────────────────────────────────────────────

Split allocator, 16GB GPU:

 ──────────────────────────────────────────────────
                                     Time
                             ──────────────────────
      Tot / % measured:           19.4s / 48.6%

 Section             ncalls     time   %tot     avg
 ──────────────────────────────────────────────────
 pooled alloc         58.3k    9.42s   100%   161μs
   1.1a repopulate        1   6.23μs  0.00%  6.23μs
   1.1b compact           1   1.68μs  0.00%  1.68μs
   1.2 scan           58.3k   60.0ms  0.64%  1.03μs
   1.3 alloc          2.79k    1.64s  17.4%   588μs
     alloc            2.79k    1.64s  17.4%   588μs
   1.4a reclaim       1.09k    2.39s  25.4%  2.20ms
     free             2.57k    2.37s  25.2%   925μs
   1.4b alloc         1.09k    4.20s  44.6%  3.85ms
     alloc            1.09k    4.20s  44.6%  3.85ms
   1.5a reclaim          56   1.82ms  0.02%  32.6μs
     free                10    867μs  0.01%  86.7μs
   1.5b alloc            56   72.9ms  0.77%  1.30ms
     alloc               56   72.9ms  0.77%  1.30ms
   2.0 gc(false)         56    496ms  5.26%  8.85ms
     pooled free      56.7k   21.1ms  0.22%   372ns
   2.1a repopulate       56   22.6ms  0.24%   404μs
   2.1b compact          56   34.6ms  0.37%   618μs
   2.2 scan              56   68.1μs  0.00%  1.22μs
   2.3 alloc              1   1.55ms  0.02%  1.55ms
     alloc                1   1.54ms  0.02%  1.54ms
   2.4a reclaim           3    361μs  0.00%   120μs
     free                23    299μs  0.00%  13.0μs
   2.4b alloc             3   9.35ms  0.10%  3.12ms
     alloc                3   9.34ms  0.10%  3.11ms
   2.5a reclaim           1   25.7μs  0.00%  25.7μs
   2.5b alloc             1    835μs  0.01%   835μs
     alloc                1    834μs  0.01%   834μs
   3.0 gc(true)           1    260ms  2.76%   260ms
     pooled free      1.12k    513μs  0.01%   459ns
   3.1a repopulate        1    475μs  0.01%   475μs
   3.1b compact           1    531μs  0.01%   531μs
   3.2 scan               1   1.00μs  0.00%  1.00μs
 pooled free              2   3.85μs  0.00%  1.93μs
 ──────────────────────────────────────────────────

Split allocator, 8GB cap:

 ──────────────────────────────────────────────────
                                     Time
                             ──────────────────────
      Tot / % measured:           26.8s / 70.6%

 Section             ncalls     time   %tot     avg
 ──────────────────────────────────────────────────
 pooled alloc         58.3k    18.9s   100%   324μs
   1.1a repopulate        1   6.38μs  0.00%  6.38μs
   1.1b compact           1   1.57μs  0.00%  1.57μs
   1.2 scan           58.3k   41.1ms  0.22%   704ns
   1.3 alloc          13.2k    321ms  1.70%  24.4μs
     alloc            13.2k    317ms  1.68%  24.1μs
   1.4a reclaim       3.77k    5.43s  28.7%  1.44ms
     free             12.6k    5.41s  28.6%   430μs
   1.4b alloc         3.77k    1.08s  5.70%   286μs
     alloc            3.77k    1.08s  5.69%   285μs
   1.5a reclaim         307   1.24ms  0.01%  4.04μs
   1.5b alloc           307    193μs  0.00%   629ns
     alloc              307   60.0μs  0.00%   196ns
   2.0 gc(false)        307    1.19s  6.28%  3.87ms
     pooled free      44.2k   16.4ms  0.09%   371ns
   2.1a repopulate      246   15.9ms  0.08%  64.7μs
   2.1b compact         246   29.3ms  0.16%   119μs
   2.2 scan             307    297μs  0.00%   967ns
   2.3 alloc             77   53.7μs  0.00%   697ns
     alloc               77   17.9μs  0.00%   232ns
   2.4a reclaim         217   17.9ms  0.09%  82.5μs
     free                82   16.8ms  0.09%   205μs
   2.4b alloc           217   3.44ms  0.02%  15.8μs
     alloc              217   3.38ms  0.02%  15.6μs
   2.5a reclaim          64    229μs  0.00%  3.57μs
   2.5b alloc            64   37.2μs  0.00%   582ns
     alloc               64   10.1μs  0.00%   157ns
   3.0 gc(true)          64    10.6s  55.9%   165ms
     pooled free      13.4k   4.82ms  0.03%   361ns
   3.1a repopulate       64   4.03ms  0.02%  63.0μs
   3.1b compact          64   8.30ms  0.04%   130μs
   3.2 scan              64   60.6μs  0.00%   946ns
   3.3 alloc              1    771ns  0.00%   771ns
     alloc                1    353ns  0.00%   353ns
   3.4a reclaim           1    186μs  0.00%   186μs
     free                 1    183μs  0.00%   183μs
   3.4b alloc             1   15.9μs  0.00%  15.9μs
     alloc                1   15.3μs  0.00%  15.3μs
 pooled free              2   3.10μs  0.00%  1.55μs
 ──────────────────────────────────────────────────

KristofferC · 2019-09-25T18:42:59Z

Exciting stuff!

maleadt added the bug label Sep 20, 2019

maleadt force-pushed the tb/split_bugs branch from 34d680c to 163ae91 Compare September 20, 2019 11:48

maleadt added the performance label Sep 20, 2019

maleadt added 11 commits September 24, 2019 18:36

Fix concurrent access of block state in finalizer.

a92f17c

Use the FREED blockstate for blocks in the freed set.

a77641d

Introduce an INVALID state for initial and actually freed blocks.

Test the split memory pool.

702b80c

Optimize scan by using the balanced tree.

070867e

Protect against allocating before module initialization.

321ec68

Time calls to the free function.

40080e8

Separately time repopulate and compact.

b8d5505

Type stuff.

954247c

Don't reclaim all memory at once.

f59a046

Simplify alloc calls.

0322d84

Do partial reclaim.

237d34d

maleadt force-pushed the tb/split_bugs branch 3 times, most recently from d27e323 to a6b814a Compare September 25, 2019 07:53

Compat for Base.at-lock.

a47b827

maleadt force-pushed the tb/split_bugs branch from a6b814a to a47b827 Compare September 25, 2019 08:06

maleadt added 2 commits September 25, 2019 16:20

Verbose assertions for debugging purposes.

0dddbbc

Bypass 0-byte allocations.

80f6104

maleadt force-pushed the tb/split_bugs branch from 08838ec to 80f6104 Compare September 25, 2019 17:12

maleadt merged commit a6fc8af into master Sep 25, 2019

bors bot deleted the tb/split_bugs branch September 25, 2019 18:10
