
High allocations and getindex #150

Closed
Drvi opened this issue Jan 27, 2019 · 5 comments
Labels: cuda array (Stuff about CuArray.), performance (How fast can we go?)


Drvi commented Jan 27, 2019

Hi, and thanks for your work!
I have a problem that effectively prevents me from training my Flux model with multiple embedding matrices:
it's hard to avoid allocations when indexing into (Flux-tracked) CuArrays, and my training time mostly becomes GC time.
Views won't work as a workaround in the following examples, which I think illustrate the problem:

using CuArrays

a = UInt16.(collect(1:65535));
W = cu(randn(64, maximum(a))); # These are parameters
f(t, a) = (A = W[:, rand(a, 5000)]; sum([A for _ in 1:t]))

function g(n, a)
    for _ in 1:n
        CuArrays.@time f(1000, a)
    end
end

g(10, a)
  0.486 sec (280.39 k CPU allocs: 14.736 MiB, 11.66% gc time) (1.00 k GPU allocs: 1.192 GiB, 54.35% gc time of which 100.00% spent allocating)
  0.341 sec ( 50.05 k CPU allocs: 1.768 MiB)                  (1.00 k GPU allocs: 1.192 GiB, 95.19% gc time of which 100.00% spent allocating)
  0.398 sec ( 50.07 k CPU allocs: 1.909 MiB)                  (1.00 k GPU allocs: 1.192 GiB, 95.64% gc time of which 100.00% spent allocating)
  0.198 sec ( 79.98 k CPU allocs: 2.314 MiB, 34.47% gc time)  (1.00 k GPU allocs: 1.192 GiB, 58.37% gc time of which 100.00% spent allocating)
  0.016 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.019 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.089 sec ( 78.19 k CPU allocs: 2.282 MiB, 73.77% gc time)  (1.00 k GPU allocs: 1.192 GiB)
  0.012 sec ( 44.92 k CPU allocs: 1.675 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.018 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.086 sec ( 78.75 k CPU allocs: 2.291 MiB, 75.18% gc time)  (1.00 k GPU allocs: 1.192 GiB)

I think getindex has some issues with Flux backprop as well, since the following script allocates a lot of memory in the Flux.back! calls:

using CuArrays
using Flux

Ea = gpu(param(randn(64, 1_000_000)));
Eb = gpu(param(randn(64, 65_535)));
i = UInt16.(collect(1:5_000));
loss(i,n) = sum(sum(Eb[:, i] .+ Ea[:, rand(1:size(Ea,2), 1)]) for _ in 1:n)

function g(n, t, i)
   for _ in 1:t
      print("loss ")
      CuArrays.@time l = loss(i, n)
      print("back ")
      CuArrays.@time Flux.back!(l)
   end
end

g(100, 10, i)
loss | 0.029 sec | (33.99 k CPU allocs: 2.097 MiB)                 | (600 GPU allocs: 245.150 MiB, 55.07% gc)
back | 0.837 sec | (63.93 k CPU allocs: 2.997 MiB, 31.91% gc time) | (900 GPU allocs:  25.882 GiB, 28.31% gc)
loss | 0.083 sec | (35.31 k CPU allocs: 2.111 MiB)                 | (600 GPU allocs: 245.150 MiB, 78.41% gc)
back | 0.828 sec | (63.90 k CPU allocs: 3.004 MiB, 31.35% gc time) | (900 GPU allocs:  25.882 GiB, 28.60% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.103 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.20% gc)
back | 0.840 sec | (64.20 k CPU allocs: 3.006 MiB, 31.64% gc time) | (900 GPU allocs:  25.882 GiB, 28.13% gc)
loss | 0.076 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.21% gc)
back | 0.830 sec | (63.91 k CPU allocs: 3.005 MiB, 31.28% gc time) | (900 GPU allocs:  25.882 GiB, 28.84% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.106 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.26% gc)
back | 0.835 sec | (64.20 k CPU allocs: 3.004 MiB, 31.16% gc time) | (900 GPU allocs:  25.882 GiB, 28.69% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.39% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.004 MiB, 31.52% gc time) | (900 GPU allocs:  25.882 GiB, 28.82% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 72.99% gc)
back | 0.846 sec | (64.20 k CPU allocs: 3.004 MiB, 31.89% gc time) | (900 GPU allocs:  25.882 GiB, 28.35% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.28% gc)
back | 0.834 sec | (63.91 k CPU allocs: 3.004 MiB, 31.21% gc time) | (900 GPU allocs:  25.882 GiB, 28.64% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.00% gc)
back | 0.844 sec | (64.20 k CPU allocs: 3.004 MiB, 30.82% gc time) | (900 GPU allocs:  25.882 GiB, 28.65% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.24% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.003 MiB, 30.95% gc time) | (900 GPU allocs:  25.882 GiB, 28.52% gc)

Are there any workarounds I can use?
Please let me know if you need more info.

Thank you so much for looking into this!

Drvi commented Jan 28, 2019

Ok, now I think that the problem might be related to the definition of the gradient for getindex, https://github.com/FluxML/Flux.jl/blob/ca1c73ed352d0557a72e12b663fff20332d9aaff/src/tracker/lib/array.jl#L100, where a new zeroed array is allocated every time (zero(xs)).
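
For reference, here is roughly what that definition does (a paraphrased sketch with illustrative names, not the actual Flux source): the pullback allocates a dense zeroed array the size of the whole parameter and scatters the incoming gradient into the indexed positions. If Ea is stored as Float32, zero(Ea) alone is about 244 MiB, so allocating it on each of the ~100 pullbacks per back! call roughly accounts for most of the ~26 GiB of GPU allocations shown above.

# Paraphrased sketch (illustrative names, not the Flux source):
function getindex_grad_sketch(xs::AbstractMatrix, cols::AbstractVector, Δ::AbstractMatrix)
    Δxs = zero(xs)      # a dense zeroed array the size of the full parameter, on every call
    Δxs[:, cols] = Δ    # only the indexed columns actually carry gradient
    return Δxs
end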

It would be nice to be able to create a custom sparse version of this, but sparse CuArrays and views throw errors.


maleadt commented Feb 8, 2019

The output of CuArrays.@time has changed (it contained a bug), and there have been some optimizations to the memory allocator, so it might be worth checking your use case again.

We now also have support for pushing objects into the GC pool again; see JuliaGPU/CuArrays.jl#275. This could be used in Flux to early-free those zero arrays before returning. It would need some abstraction to support different kinds of arrays, but that should be straightforward.
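
A minimal sketch of what that abstraction could look like (hypothetical names, not an existing Flux or CuArrays API): a no-op for ordinary arrays and an eager free for CuArrays, so the dense gradient buffers go back to the pool as soon as they have been consumed.

using CuArrays

# Hypothetical helper: eagerly release GPU buffers, leave CPU arrays to the GC.
maybe_free!(x::AbstractArray) = nothing
maybe_free!(x::CuArrays.CuArray) = CuArrays.unsafe_free!(x)

Flux's backward pass could then call maybe_free! on each temporary gradient array once it has been accumulated, without having to know which array type it is dealing with.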


Drvi commented Feb 11, 2019

Thanks for looking into it. Should I use the do syntax or put some unsafe_free! calls in the code, or is it expected that the code snippet above should be faster as is?


maleadt commented Feb 11, 2019

Both. The GC allocator has improved, so please try the master branch of CuArrays. But that didn't change the last example you posted much, so it might also be worth adding some unsafe_free! calls / do-block syntax.
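
For the user-side part, a rough sketch of the unsafe_free! pattern applied to a slice-and-reduce step like the ones in your snippets (untested, just to show the shape of it):

using CuArrays

function slice_sum(W, idx)
    A = W[:, idx]               # getindex allocates a dense GPU temporary
    s = sum(A)                  # reduce it to a scalar
    CuArrays.unsafe_free!(A)    # hand the buffer back to the allocator right away
    return s
end

This is only safe because nothing else keeps a reference to A; the gradient temporaries created inside Flux.back! would still need the in-Flux change discussed above.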

maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
maleadt added the cuda array (Stuff about CuArray.) and performance (How fast can we go?) labels on May 27, 2020

maleadt commented Mar 2, 2021

Let's close this, see #137 (comment). If the issue persists, please open a new issue with an updated MWE.
