
High allocations and getindex #150

Closed
Drvi opened this issue Jan 27, 2019 · 5 comments
Labels: cuda array (Stuff about CuArray.), performance (How fast can we go?)


Drvi commented Jan 27, 2019

Hi, and thanks for your work!
I have a problem that effectively prevents me from training my Flux model with multiple embedding matrices:
it's hard to avoid allocations when indexing into (Flux-tracked) CuArrays, and my training time mostly becomes GC time.
Views won't work as a workaround in the following examples, which I think illustrate the problem:

using CuArrays

a = UInt16.(collect(1:65535));
W = cu(randn(64, maximum(a))); # These are parameters
f(t, a) = (A = W[:, rand(a, 5000)]; sum([A for _ in 1:t]))

function g(n, a)
    for _ in 1:n
        CuArrays.@time f(1000, a)
    end
end

g(10, a)
  0.486 sec (280.39 k CPU allocs: 14.736 MiB, 11.66% gc time) (1.00 k GPU allocs: 1.192 GiB, 54.35% gc time of which 100.00% spent allocating)
  0.341 sec ( 50.05 k CPU allocs: 1.768 MiB)                  (1.00 k GPU allocs: 1.192 GiB, 95.19% gc time of which 100.00% spent allocating)
  0.398 sec ( 50.07 k CPU allocs: 1.909 MiB)                  (1.00 k GPU allocs: 1.192 GiB, 95.64% gc time of which 100.00% spent allocating)
  0.198 sec ( 79.98 k CPU allocs: 2.314 MiB, 34.47% gc time)  (1.00 k GPU allocs: 1.192 GiB, 58.37% gc time of which 100.00% spent allocating)
  0.016 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.019 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.089 sec ( 78.19 k CPU allocs: 2.282 MiB, 73.77% gc time)  (1.00 k GPU allocs: 1.192 GiB)
  0.012 sec ( 44.92 k CPU allocs: 1.675 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.018 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.086 sec ( 78.75 k CPU allocs: 2.291 MiB, 75.18% gc time)  (1.00 k GPU allocs: 1.192 GiB)

I think getindex has some issues with Flux backprop as well, since the following script allocates a lot of memory in the Flux.back! calls:

using CuArrays
using Flux

Ea = gpu(param(randn(64, 1_000_000)));
Eb = gpu(param(randn(64, 65_535)));
i = UInt16.(collect(1:5_000));
loss(i,n) = sum(sum(Eb[:, i] .+ Ea[:, rand(1:size(Ea,2), 1)]) for _ in 1:n)

function g(n, t, i)
   for _ in 1:t
      print("loss ")
      CuArrays.@time l = loss(i, n)
      print("back ")
      CuArrays.@time Flux.back!(l)
   end
end

g(100, 10, i)
loss | 0.029 sec | (33.99 k CPU allocs: 2.097 MiB)                 | (600 GPU allocs: 245.150 MiB, 55.07% gc)
back | 0.837 sec | (63.93 k CPU allocs: 2.997 MiB, 31.91% gc time) | (900 GPU allocs:  25.882 GiB, 28.31% gc)
loss | 0.083 sec | (35.31 k CPU allocs: 2.111 MiB)                 | (600 GPU allocs: 245.150 MiB, 78.41% gc)
back | 0.828 sec | (63.90 k CPU allocs: 3.004 MiB, 31.35% gc time) | (900 GPU allocs:  25.882 GiB, 28.60% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.103 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.20% gc)
back | 0.840 sec | (64.20 k CPU allocs: 3.006 MiB, 31.64% gc time) | (900 GPU allocs:  25.882 GiB, 28.13% gc)
loss | 0.076 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.21% gc)
back | 0.830 sec | (63.91 k CPU allocs: 3.005 MiB, 31.28% gc time) | (900 GPU allocs:  25.882 GiB, 28.84% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.106 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.26% gc)
back | 0.835 sec | (64.20 k CPU allocs: 3.004 MiB, 31.16% gc time) | (900 GPU allocs:  25.882 GiB, 28.69% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.39% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.004 MiB, 31.52% gc time) | (900 GPU allocs:  25.882 GiB, 28.82% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 72.99% gc)
back | 0.846 sec | (64.20 k CPU allocs: 3.004 MiB, 31.89% gc time) | (900 GPU allocs:  25.882 GiB, 28.35% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.28% gc)
back | 0.834 sec | (63.91 k CPU allocs: 3.004 MiB, 31.21% gc time) | (900 GPU allocs:  25.882 GiB, 28.64% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.00% gc)
back | 0.844 sec | (64.20 k CPU allocs: 3.004 MiB, 30.82% gc time) | (900 GPU allocs:  25.882 GiB, 28.65% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.24% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.003 MiB, 30.95% gc time) | (900 GPU allocs:  25.882 GiB, 28.52% gc)

Are there any workarounds I can use?
Please let me know if you need more info.

Thank you so much for looking into this!

Drvi commented Jan 28, 2019

Ok, now I think that the problem might be related to the definition of the gradient for getindex, https://github.com/FluxML/Flux.jl/blob/ca1c73ed352d0557a72e12b663fff20332d9aaff/src/tracker/lib/array.jl#L100, where a new zeroed array is allocated every time (zero(xs)).
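
For reference, here is roughly what that definition does (a paraphrased sketch with illustrative names, not the actual Flux source): the pullback allocates a dense zeroed array the size of the whole parameter and scatters the incoming gradient into the indexed positions. If Ea is stored as Float32, zero(Ea) alone is about 244 MiB, so allocating it on each of the ~100 pullbacks per back! call roughly accounts for most of the ~26 GiB of GPU allocations shown above.

# Paraphrased sketch (illustrative names, not the Flux source):
function getindex_grad_sketch(xs::AbstractMatrix, cols::AbstractVector, Δ::AbstractMatrix)
    Δxs = zero(xs)      # a dense zeroed array the size of the full parameter, on every call
    Δxs[:, cols] = Δ    # only the indexed columns actually carry gradient
    return Δxs
end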

It would be nice to be able to create a custom sparse version of this, but sparse CuArrays and views throw errors.


maleadt commented Feb 8, 2019

The output of CuArrays.@time has changed (it contained a bug), and there have been some optimizations to the memory allocator, so it might be worth checking your use case again.

We now also have support for pushing objects into the GC pool again; see JuliaGPU/CuArrays.jl#275. This could be used in Flux to early-free those zero arrays before returning. It would need some abstraction to support different kinds of arrays, but that should be straightforward.
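
A minimal sketch of what that abstraction could look like (hypothetical names, not an existing Flux or CuArrays API): a no-op for ordinary arrays and an eager free for CuArrays, so the dense gradient buffers go back to the pool as soon as they have been consumed.

using CuArrays

# Hypothetical helper: eagerly release GPU buffers, leave CPU arrays to the GC.
maybe_free!(x::AbstractArray) = nothing
maybe_free!(x::CuArrays.CuArray) = CuArrays.unsafe_free!(x)

Flux's backward pass could then call maybe_free! on each temporary gradient array once it has been accumulated, without having to know which array type it is dealing with.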


Drvi commented Feb 11, 2019

Thanks for looking into it. Should I use the do syntax or put some unsafe_free! calls in the code, or is it expected that the code snippet above should be faster as is?


maleadt commented Feb 11, 2019

Both. The GC allocator has improved, so please try the master branch of CuArrays. But that didn't change the last example you posted much, so it might also be worth adding some unsafe_free! calls / do-block syntax.
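
For the user-side part, a rough sketch of the unsafe_free! pattern applied to a slice-and-reduce step like the ones in your snippets (untested, just to show the shape of it):

using CuArrays

function slice_sum(W, idx)
    A = W[:, idx]               # getindex allocates a dense GPU temporary
    s = sum(A)                  # reduce it to a scalar
    CuArrays.unsafe_free!(A)    # hand the buffer back to the allocator right away
    return s
end

This is only safe because nothing else keeps a reference to A; the gradient temporaries created inside Flux.back! would still need the in-Flux change discussed above.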

maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
maleadt added the cuda array (Stuff about CuArray.) and performance (How fast can we go?) labels on May 27, 2020

maleadt commented Mar 2, 2021

Let's close this, see #137 (comment). If the issue persists, please open a new issue with an updated MWE.
