Memory free error with CUDA 11.2 and multi threads/GPUs #737

Closed
marius311 opened this issue Feb 26, 2021 · 6 comments · Fixed by #857
Labels
bug Something isn't working

Comments

@marius311 (Contributor)

The following prints (not throws) an error with CUDA 11.2, julia --threads=2, and 2 GPUs:

using CUDA, Adapt, AbstractFFTs  # imports this snippet assumes; unified_gpu is defined below

x = unified_gpu(rand(256,256));

N = 1000 # needs to be about this high to trigger; might take a few tries
Threads.@threads for dev in collect(devices())
    device!(dev)
    for i=1:N
        fft(x)
    end
end

The stack trace is below, as well as the source for unified_gpu, which just puts the array in unified memory.

Stack trace
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:695
  [2] wait
    @ ./task.jl:764 [inlined]
  [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
    @ Base ./condition.jl:106
  [4] lock(rl::ReentrantLock)
    @ Base ./lock.jl:100
  [5] lock
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool/utils.jl:29 [inlined]
  [6] macro expansion
    @ ./lock.jl:207 [inlined]
  [7] free
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:359 [inlined]
  [8] (::CUDA.var"#290#291"{CuArray{ComplexF64, 2}})()
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:43
  [9] context!(f::CUDA.var"#290#291"{CuArray{ComplexF64, 2}}, ctx::CuContext)
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/state.jl:196
 [10] unsafe_free!(xs::CuArray{ComplexF64, 2})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:42
 [11] backtrace()
    @ Base ./error.jl:112
 [12] alloc
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:286 [inlined]
 [13] CuArray{Int8, 1}(#unused#::UndefInitializer, dims::Tuple{Int64})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:20
 [14] CuArray
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:78 [inlined]
 [15] CuArray
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:79 [inlined]
 [16] create_plan(xtype::CUDA.CUFFT.cufftType_t, xdims::Tuple{Int64, Int64}, region::UnitRange{Int64})
    @ CUDA.CUFFT /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:227
 [17] plan_fft
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:277 [inlined]
 [18] fft(x::CuArray{ComplexF64, 2}, region::UnitRange{Int64})
    @ AbstractFFTs ~/.julia/packages/AbstractFFTs/JAxy0/src/definitions.jl:51
 [19] fft
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:238 [inlined]
 [20] macro expansion
    @ ./In[6]:7 [inlined]
 [21] (::var"#61#threadsfor_fun#3"{Vector{CuDevice}})(onethread::Bool)
    @ Main ./threadingconstructs.jl:81
 [22] (::var"#61#threadsfor_fun#3"{Vector{CuDevice}})()
    @ Main ./threadingconstructs.jl:48
unified_gpu
function Adapt.adapt_structure(::Type{Mem.Unified}, x::Union{Array{T,N},CuArray{T,N}}) where {T,N}
    buf = Mem.alloc(Mem.Unified, sizeof(T) * prod(size(x)))
    y = unsafe_wrap(CuArray{T,N}, convert(CuPtr{T}, buf), size(x); own=true)
    copyto!(y, x)
    return y
end
unified_gpu(x) = adapt(Mem.Unified, x)
versioninfo
  • CUDA ac1f52f
  • Julia 1.6 backports-rc2 branch as per your suggestion on Slack
CUDA toolkit 11.2.0, local installation
CUDA driver 11.2.0
NVIDIA driver 450.102.4

Libraries: 
- CUBLAS: 11.3.1
- CURAND: 10.2.3
- CUFFT: 10.4.0
- CUSOLVER: 11.0.2
- CUSPARSE: 11.3.1
- CUPTI: 14.0.0
- NVML: 11.0.0+450.102.4
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: missing

Toolchain:
- Julia: 1.6.0-rc1.42
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Preferences:
- Memory pool: None
- Async allocation: true

Environment:
- JULIA_CUDA_MEMORY_LIMIT: 15032385536
- JULIA_CUDA_USE_BINARYBUILDER: false

2 devices:
  0: Tesla V100-SXM2-16GB (sm_70, 12.802 GiB / 15.782 GiB available)
  1: Tesla V100-SXM2-16GB (sm_70, 14.019 GiB / 15.782 GiB available)
marius311 added the bug label Feb 26, 2021
@marius311 (Contributor, Author)

Ah sorry, the above is because I had the debug level set to 2. Even with it back down to 1, the same code (with N=5000 to make it trigger more reliably) still produces a similar error, this time with this stack trace:

  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:695
  [2] wait
    @ ./task.jl:764 [inlined]
  [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
    @ Base ./condition.jl:106
  [4] lock(rl::ReentrantLock)
    @ Base ./lock.jl:100
  [5] macro expansion
    @ ./lock.jl:207 [inlined]
  [6] CuContext(handle::Ptr{Nothing})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/lib/cudadrv/context.jl:27
  [7] CuCurrentContext
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cudadrv/context.jl:98 [inlined]
  [8] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/state.jl:209 [inlined]
  [9] actual_free(dev::CuDevice, block::CUDA.PoolUtils.Block)
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:158
 [10] free
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool/none.jl:31 [inlined]
 [11] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:205 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [13] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:366 [inlined]
 [14] macro expansion
    @ ./timing.jl:279 [inlined]
 [15] free
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:365 [inlined]
 [16] (::CUDA.var"#290#291"{CuArray{ComplexF64, 2}})()
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:43
 [17] context!(f::CUDA.var"#290#291"{CuArray{ComplexF64, 2}}, ctx::CuContext)
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/state.jl:196
 [18] unsafe_free!(xs::CuArray{ComplexF64, 2})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:42
 [19] gc
    @ ./gcutils.jl:94 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [21] alloc(dev::CuDevice, sz::Int64)
    @ CUDA.NoPool /global/u1/m/marius/work/clem/dev/CUDA/src/pool/none.jl:16
 [22] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:205 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [24] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:272 [inlined]
 [25] macro expansion
    @ ./timing.jl:279 [inlined]
 [26] alloc
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:271 [inlined]
 [27] CuArray{ComplexF64, 2}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:20
 [28] similar
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:99 [inlined]
 [29] copy(a::CuArray{ComplexF64, 2})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:104
 [30] unsafe_execute!(plan::CUDA.CUFFT.cCuFFTPlan{ComplexF64, -1, false, 2}, x::CuArray{ComplexF64, 2}, y::CuArray{ComplexF64, 2})
    @ CUDA.CUFFT /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:435
 [31] mul!
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:460 [inlined]
 [32] *
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:488 [inlined]
 [33] fft(x::CuArray{ComplexF64, 2}, region::UnitRange{Int64})
    @ AbstractFFTs ~/.julia/packages/AbstractFFTs/JAxy0/src/definitions.jl:51
 [34] fft
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:238 [inlined]
 [35] macro expansion
    @ ./In[11]:7 [inlined]
 [36] (::var"#136#threadsfor_fun#8"{Vector{CuDevice}})(onethread::Bool)
    @ Main ./threadingconstructs.jl:81
 [37] (::var"#136#threadsfor_fun#8"{Vector{CuDevice}})()
    @ Main ./threadingconstructs.jl:48

@maleadt (Member) commented Feb 26, 2021

What's the actual error you're seeing? You only included the stack trace.
FWIW, I'm seeing a free after the context is destroyed, and occasionally a segfault; both are hard to explain since we never really destroy the context before that happens...

@maleadt (Member) commented Feb 26, 2021

Ah, I'm a dummy. Of course this crashes: you're allocating a unified buffer (using cuMemAllocManaged), but instructing CUDA.jl to free it (by passing own=true), which calls cuMemFree assuming a device buffer. So unsafe_wrap is only intended to be used with buffers that CUDA.jl knows how to free, which this isn't. Instead, pass own=false and register your own finalizer.
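
A minimal sketch of that suggestion, reusing the unified_gpu helper from the issue description; the exact finalizer call here is an assumption, not code from this thread:

function unified_gpu(x::AbstractArray{T,N}) where {T,N}
    # allocate the unified buffer ourselves, so we are also responsible for freeing it
    buf = Mem.alloc(Mem.Unified, sizeof(T) * length(x))
    # own=false: CUDA.jl will not try to cuMemFree this pointer on our behalf
    y = unsafe_wrap(CuArray{T,N}, convert(CuPtr{T}, buf), size(x); own=false)
    # free the unified buffer (not a device buffer) once the wrapper is collected
    finalizer(_ -> Mem.free(buf), y)
    copyto!(y, x)
    return y
end

As the follow-up comments below show, even this variant can run into the finalizer/context-switch problem.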

maleadt closed this as completed Feb 26, 2021
@marius311 (Contributor, Author)

> What's the actual error you're seeing? You only included the stack trace.

That's all that's printed. I guess I'm not sure where it's coming from; shouldn't this have printed the error itself?

> Instead, pass own=false and register your own finalizer

Ah, I see, thanks. I just followed the example here. Should that be changed?

@marius311 (Contributor, Author)

I tried setting own=false but still get sporadic printed errors/segfaults with pretty much the same stack trace. I think the origin of the problem is the context! switches in the finalizers. Maybe this should work, but it doesn't seem to. With the small changes in master...marius311:no_gc_ctx_switch, though, you can get rid of the context switches. With that, the unified-memory multi-GPU/thread stuff I'm trying to do is actually working really well!
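
A generic illustration of that failure mode, assuming nothing about CUDA.jl internals (the names pool_lock, Handle, and release are made up for the example): the stack traces above end up in lock/wait from inside a finalizer, and a finalizer that blocks on a ReentrantLock during GC is exactly the situation the Julia manual recommends handling with trylock plus re-registration:

const pool_lock = ReentrantLock()

mutable struct Handle
    ptr::Ptr{Cvoid}
end

function release(h::Handle)
    if trylock(pool_lock)
        try
            # the lock is free: release the resource immediately
            h.ptr == C_NULL || Libc.free(h.ptr)
            h.ptr = C_NULL
        finally
            unlock(pool_lock)
        end
    else
        # the lock is held: don't block inside GC, retry at a later collection
        finalizer(release, h)
    end
    return nothing
end

h = Handle(Libc.malloc(256))
finalizer(release, h)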

@maleadt (Member) commented Feb 27, 2021

There's definitely a context switch missing in the unsafe_wrap finalizer; I'll tackle that next week. And yes, the docs are out of date...
