This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Memory allocation tracing and bug fixes. #212

Merged: maleadt merged 1 commit into master from tb/trace_pool on Nov 29, 2018

Conversation

maleadt (Member) commented Nov 22, 2018

Developed when working on https://github.com/JuliaGPU/CuArrays.jl/issues/210:

$ CUARRAYS_MANAGED_POOL=true CUARRAYS_TRACE_POOL=true jj --compiled-modules=no
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.2 (2018-11-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using CuArrays

julia> As = []
0-element Array{Any,1}

julia> push!(As, CuArray{Float32}(undef, 1000, 1000, 1000));

julia> push!(As, CuArray{Float32}(undef, 1000, 1000, 1000));
┌ Error: Failed to allocate 3.725 GiB (requires 3.725 GiB buffer)
└ @ CuArrays ~/Julia/CuArrays/src/memory.jl:273
┌ Warning: Outstanding allocation of 3.725 GiB (requires 3.725 GiB buffer)
│   exception =
│    CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
│    Stacktrace:
│     [1] macro expansion at ./util.jl:213 [inlined]
│     [2] alloc(::Int64) at /home/tbesard/Julia/CuArrays/src/memory.jl:225
│     [3] CuArray{Float32,3}(::UndefInitializer, ::Tuple{Int64,Int64,Int64}) at /home/tbesard/Julia/CuArrays/src/array.jl:29
│     [4] CuArray{Float32,N} where N(::UndefInitializer, ::Tuple{Int64,Int64,Int64}) at /home/tbesard/Julia/CuArrays/src/array.jl:36
│     [5] CuArray{Float32,N} where N(::UndefInitializer, ::Int64, ::Vararg{Int64,N} where N) at /home/tbesard/Julia/CuArrays/src/array.jl:37
│     [6] top-level scope at none:0
│     [7] eval(::Module, ::Any) at ./boot.jl:319
│     [8] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
│     [9] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
│     [10] (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at ./task.jl:259
└ @ CuArrays ~/Julia/CuArrays/src/memory.jl:276
ERROR: CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] macro expansion at /home/tbesard/Julia/CUDAdrv/src/base.jl:147 [inlined]
 [2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at /home/tbesard/Julia/CUDAdrv/src/memory.jl:161
 [3] alloc at /home/tbesard/Julia/CUDAdrv/src/memory.jl:157 [inlined] (repeats 2 times)
 [4] (::getfield(CuArrays, Symbol("##17#18")))() at /home/tbesard/Julia/CuArrays/src/memory.jl:263
 [5] lock(::getfield(CuArrays, Symbol("##17#18")), ::ReentrantLock) at ./lock.jl:101
 [6] macro expansion at ./util.jl:213 [inlined]
 [7] alloc(::Int64) at /home/tbesard/Julia/CuArrays/src/memory.jl:225
 [8] CuArray{Float32,3}(::UndefInitializer, ::Tuple{Int64,Int64,Int64}) at /home/tbesard/Julia/CuArrays/src/array.jl:29
 [9] CuArray{Float32,N} where N(::UndefInitializer, ::Tuple{Int64,Int64,Int64}) at /home/tbesard/Julia/CuArrays/src/array.jl:36
 [10] CuArray{Float32,N} where N(::UndefInitializer, ::Int64, ::Vararg{Int64,N} where N) at /home/tbesard/Julia/CuArrays/src/array.jl:37
 [11] top-level scope at none:0

maleadt (Member, Author) commented Nov 22, 2018

bors try

bors bot added a commit that referenced this pull request Nov 22, 2018
maleadt (Member, Author) commented Nov 22, 2018

@staticfloat: IIRC you developed something similar while debugging FluxML; are there any features missing here? Also, since you looked at the profiler code recently: any suggestions for improving how the call stacks are saved? push!(..., stacktrace()) doesn't sound very efficient.
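
Concretely, the tracing amounts to something like the sketch below. This is a simplified illustration, not the actual memory.jl code; `alloc`/`free` here just stand in for the pool's real entry points:

```julia
# Simplified sketch: remember where each live buffer was allocated,
# and dump those stack traces when a later allocation fails.
const alloc_sites = Dict{Any,Vector{Base.StackTraces.StackFrame}}()

function traced_alloc(bytes)
    buf = alloc(bytes)               # the pool's real allocator
    alloc_sites[buf] = stacktrace()  # the potentially expensive part
    return buf
end

function traced_free(buf)
    delete!(alloc_sites, buf)
    free(buf)
end

function report_outstanding()
    for (buf, st) in alloc_sites
        @warn "Outstanding allocation" buffer=buf stacktrace=st
    end
end
```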

bors bot (Contributor) commented Nov 22, 2018

> try

Build succeeded

staticfloat (Contributor) commented

I started looking at GPU usage, but it was all manual and didn't get me very far. I'm currently working on something a little more fundamental for Julia: something that can build graphs similar to what Google gives you in the TPU cloud console.

The basic idea is that you'll do something like Profile.@memprofile foo() and it will capture relevant information about all allocations and deallocations within the given expression. The simplest output would be a cumulative "memory versus time" graph that looks like a sawtooth wave (memory usage grows until a GC run trims it back down), and the user could then strategically insert manual GC.gc() calls to debug where large objects are staying alive for a long time. Something more sophisticated could try to analyze the allocation sites and give an idea of which modules/functions are responsible for the most allocations. I'm going to build just enough of this to get a handle on our Metalhead issues, but it would be neat if we could do similar things for GPUs as well.
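
As a rough sketch of the kind of event log I have in mind (every name here is a placeholder, not the actual API):

```julia
# Hypothetical event log for "memory versus time"; all names are made up.
struct MemEvent
    time::Float64   # seconds, e.g. from time()
    delta::Int      # +bytes for an allocation, -bytes for a free
end

const events = MemEvent[]

record_alloc!(bytes::Int) = push!(events, MemEvent(time(), bytes))
record_free!(bytes::Int)  = push!(events, MemEvent(time(), -bytes))

# Cumulative usage over time: the sawtooth curve described above, growing
# with allocations and dropping whenever the GC (or a manual GC.gc()) frees.
function usage_curve(evs::Vector{MemEvent})
    ts  = [e.time for e in evs]
    mem = cumsum([e.delta for e in evs])
    return ts, mem
end
```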

You can follow along with my work on this branch, but note that I'm mostly just hooking into the GC in C-land, then using Julia code to pull the results out, just like the profiler does.

As far as efficiency goes, stacktrace() calls backtrace(), which calls jl_backtrace_from_here(), which does an awful lot of work compared to rec_backtrace(). My guess is that's because it does all the Julia object conversion voodoo live, at that moment. You'd probably do better to call current_offset += rec_backtrace(pointer_to_array + current_offset, max_size - current_offset) over and over again, and only after the fact convert the giant array of backtrace pointers into actual readable Julia objects. But I wouldn't do that until you've checked what the time cost actually is.
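
At the Julia level, the cheap half of that idea is just deferring the symbol lookup. A minimal sketch (this still pays for jl_backtrace_from_here on every call; the rec_backtrace buffer trick would have to happen in C):

```julia
# Capture raw instruction pointers now; convert them to readable frames
# only when a report is actually needed.
const raw_traces = Vector{Any}()

record_site!() = push!(raw_traces, backtrace())  # raw pointers, no lookup yet
readable(i)    = stacktrace(raw_traces[i])       # symbolicate lazily
```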

maleadt (Member, Author) commented Nov 26, 2018

> You can follow along with my work on this branch, but note that I'm mostly just hooking into the GC in C-land, then using Julia code to pull the results out, just like the profiler does.

Cool, that looks very promising. Would be trivial to add a kind type and expose the allocation tracking. Will keep an eye on that, thanks for chiming in!

maleadt (Member, Author) commented Nov 29, 2018

Looks like rec_backtrace is exported but local, so this wouldn't be 1.0 compatible. Let's go with the slower approach for now; it's only meant for debugging anyway.

maleadt merged commit 4987303 into master on Nov 29, 2018
bors bot deleted the tb/trace_pool branch on Nov 29, 2018 at 09:39
staticfloat (Contributor) commented Dec 1, 2018

> Would be trivial to add a kind type and expose the allocation tracking.

I've done something like that, thanks for the suggestion. I'm currently drowning in backtrace swampland, but once I get things all working together, I hope to be able to support your use case as well.
