Switching devices causes GC errors #731

marius311 · 2021-02-23T07:48:41Z

Allocating an array, switching devices, and then triggering GC seems to cause errors. It's unclear to me to what extent this is supposed to work or whether it is still too experimental, but it certainly hampers single-process multi-GPU work quite a bit (which otherwise seems very doable), so if there's an easy fix it'd be great to have one.

Here's a MWE (Julia 1.6, CUDA 2.6.1):

julia> using CUDA

julia> device!(0)

julia> x = CUDA.rand(2,2)
2×2 CuArray{Float32, 2}:
 0.386771  0.448549
 0.419093  0.383297

julia> device!(1)

julia> x = nothing

julia> GC.gc(true)
WARNING: Error while freeing CuPtr{Nothing}(0x00002aab9fe30000):
Base.KeyError(key=CUDA.CuPtr{Nothing}(0x00002aab9fe30000))

The bug is easy enough to understand: the lookup in free (src/pool.jl:347 in the stack trace) searches the pool for the current device rather than the one on which the pointer was allocated, so it's not there.

Stacktrace:
  [1] getindex
    @ ./dict.jl:482 [inlined]
  [2] free
    @ ~/.julia/packages/CUDA/Zmd60/src/pool.jl:347 [inlined]
  [3] unsafe_free!(xs::CuArray{Float32, 2})
    @ CUDA ~/.julia/packages/CUDA/Zmd60/src/array.jl:42
  [4] gc(full::Bool)
    @ Base.GC ./gcutils.jl:94
  [5] top-level scope
    @ REPL[6]:1
  [6] eval(m::Module, e::Any)
    @ Core ./boot.jl:360
  [7] eval_user_input(ast::Any, backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
  [8] repl_backend_loop(backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
  [9] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
 [10] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
 [11] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
 [12] (::Base.var"#875#877"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:387
 [13] #invokelatest#2
    @ ./essentials.jl:707 [inlined]
 [14] invokelatest
    @ ./essentials.jl:706 [inlined]
 [15] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:372
 [16] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:302
 [17] _start()
    @ Base ./client.jl:485
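
To make the failure mode described above concrete, here is a minimal sketch that needs no GPU. It models each device's pool as a plain Dict keyed by a fake pointer. The names pools, current_device, toy_alloc and toy_free are hypothetical and are not CUDA.jl internals; the only thing taken from the report is the pattern of looking a pointer up in the current device's pool.

```julia
# Toy model only: one Dict-based pool per device, keyed by a fake pointer.
# None of these names come from CUDA.jl.
const pools = Dict(0 => Dict{UInt64,Int}(), 1 => Dict{UInt64,Int}())
const current_device = Ref(0)

# Record an allocation in the pool of whichever device is current.
function toy_alloc(ptr::UInt64, bytes::Int)
    pools[current_device[]][ptr] = bytes
    return ptr
end

# Mirrors the failing pattern: look the pointer up in the *current* pool.
function toy_free(ptr::UInt64)
    pool = pools[current_device[]]
    bytes = pool[ptr]      # throws KeyError if ptr was allocated on another device
    delete!(pool, ptr)
    return bytes
end

ptr = toy_alloc(0x00002aab9fe30000, 16)  # allocated while device 0 is current
current_device[] = 1                     # stands in for device!(1)

# Freeing now consults device 1's pool, which never saw this pointer.
err = try
    toy_free(ptr)
    nothing
catch e
    e
end
```

In this toy model err ends up holding a KeyError for the pointer, which is the same kind of error the GC finalizer reports in the MWE, while the entry is still sitting in device 0's pool.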

maleadt · 2021-02-23T19:45:55Z

Thanks for the clear bug report and MWE!

marius311 added the "bug" label Feb 23, 2021

maleadt mentioned this issue Feb 23, 2021

Perform pool operations in the correct context. #732

Merged

maleadt closed this as completed in #732 Feb 23, 2021

maleadt added the "cuda array" label Feb 23, 2021
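
For readers wondering what "perform pool operations in the correct context" can mean in practice, the following is one possible reading, sketched without a GPU. It is not the actual patch in #732. ToyBuffer, toy_pools, active_device, toy_alloc! and toy_free! are hypothetical names; the assumed idea is simply that each allocation remembers its owning device and the free path uses that device's pool instead of the current one.

```julia
# Toy model only; an assumption about the general approach, not the code from #732.
struct ToyBuffer
    ptr::UInt64
    device::Int    # device that was current when the buffer was allocated
end

const toy_pools = Dict(0 => Dict{UInt64,Int}(), 1 => Dict{UInt64,Int}())
const active_device = Ref(0)

# Record the allocation in the current pool and remember which device owns it.
function toy_alloc!(ptr::UInt64, bytes::Int)
    toy_pools[active_device[]][ptr] = bytes
    return ToyBuffer(ptr, active_device[])
end

# Free against the owning device's pool, regardless of which device is current.
function toy_free!(buf::ToyBuffer)
    return pop!(toy_pools[buf.device], buf.ptr)
end

buf = toy_alloc!(0x00002aab9fe30000, 16)  # allocated while device 0 is current
active_device[] = 1                       # stands in for device!(1)
freed = toy_free!(buf)                    # succeeds even though device 1 is current
```

The design choice here is to carry ownership with the buffer so that a finalizer run by the GC at an arbitrary time does not depend on whatever device the task happens to have selected.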
