Memory free error with CUDA 11.2 and multi threads/GPUs #737

Closed
marius311 opened this issue Feb 26, 2021 · 6 comments · Fixed by #857
Labels
bug Something isn't working

Comments

@marius311 (Contributor)

The following prints (not throws) an error with CUDA 11.2, julia --threads=2, and 2 GPUs:

using CUDA, Adapt, AbstractFFTs  # imports this snippet assumes; unified_gpu is defined below

x = unified_gpu(rand(256,256));

N = 1000 # needs to be about this high to trigger; might take a few tries
Threads.@threads for dev in collect(devices())
    device!(dev)
    for i=1:N
        fft(x)
    end
end

The stack trace is below, as well as the source for unified_gpu, which just puts the array in unified memory.

Stack trace
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:695
  [2] wait
    @ ./task.jl:764 [inlined]
  [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
    @ Base ./condition.jl:106
  [4] lock(rl::ReentrantLock)
    @ Base ./lock.jl:100
  [5] lock
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool/utils.jl:29 [inlined]
  [6] macro expansion
    @ ./lock.jl:207 [inlined]
  [7] free
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:359 [inlined]
  [8] (::CUDA.var"#290#291"{CuArray{ComplexF64, 2}})()
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:43
  [9] context!(f::CUDA.var"#290#291"{CuArray{ComplexF64, 2}}, ctx::CuContext)
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/state.jl:196
 [10] unsafe_free!(xs::CuArray{ComplexF64, 2})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:42
 [11] backtrace()
    @ Base ./error.jl:112
 [12] alloc
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:286 [inlined]
 [13] CuArray{Int8, 1}(#unused#::UndefInitializer, dims::Tuple{Int64})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:20
 [14] CuArray
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:78 [inlined]
 [15] CuArray
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:79 [inlined]
 [16] create_plan(xtype::CUDA.CUFFT.cufftType_t, xdims::Tuple{Int64, Int64}, region::UnitRange{Int64})
    @ CUDA.CUFFT /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:227
 [17] plan_fft
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:277 [inlined]
 [18] fft(x::CuArray{ComplexF64, 2}, region::UnitRange{Int64})
    @ AbstractFFTs ~/.julia/packages/AbstractFFTs/JAxy0/src/definitions.jl:51
 [19] fft
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:238 [inlined]
 [20] macro expansion
    @ ./In[6]:7 [inlined]
 [21] (::var"#61#threadsfor_fun#3"{Vector{CuDevice}})(onethread::Bool)
    @ Main ./threadingconstructs.jl:81
 [22] (::var"#61#threadsfor_fun#3"{Vector{CuDevice}})()
    @ Main ./threadingconstructs.jl:48
unified_gpu
function Adapt.adapt_structure(::Type{Mem.Unified}, x::Union{Array{T,N},CuArray{T,N}}) where {T,N}
    buf = Mem.alloc(Mem.Unified, sizeof(T) * prod(size(x)))
    y = unsafe_wrap(CuArray{T,N}, convert(CuPtr{T}, buf), size(x); own=true)
    copyto!(y, x)
    return y
end
unified_gpu(x) = adapt(Mem.Unified, x)
versioninfo
  • CUDA ac1f52f
  • Julia 1.6 backports-rc2 branch as per your suggestion on Slack
CUDA toolkit 11.2.0, local installation
CUDA driver 11.2.0
NVIDIA driver 450.102.4

Libraries: 
- CUBLAS: 11.3.1
- CURAND: 10.2.3
- CUFFT: 10.4.0
- CUSOLVER: 11.0.2
- CUSPARSE: 11.3.1
- CUPTI: 14.0.0
- NVML: 11.0.0+450.102.4
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: missing

Toolchain:
- Julia: 1.6.0-rc1.42
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Preferences:
- Memory pool: None
- Async allocation: true

Environment:
- JULIA_CUDA_MEMORY_LIMIT: 15032385536
- JULIA_CUDA_USE_BINARYBUILDER: false

2 devices:
  0: Tesla V100-SXM2-16GB (sm_70, 12.802 GiB / 15.782 GiB available)
  1: Tesla V100-SXM2-16GB (sm_70, 14.019 GiB / 15.782 GiB available)
marius311 added the bug label Feb 26, 2021
@marius311 (Contributor, Author)

Ah sorry, the above is because I had the debug level set to 2. Even with it back down to 1, the same code (with N=5000 to make it trigger more reliably) still produces a similar error, this time with this stack trace:

  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:695
  [2] wait
    @ ./task.jl:764 [inlined]
  [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
    @ Base ./condition.jl:106
  [4] lock(rl::ReentrantLock)
    @ Base ./lock.jl:100
  [5] macro expansion
    @ ./lock.jl:207 [inlined]
  [6] CuContext(handle::Ptr{Nothing})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/lib/cudadrv/context.jl:27
  [7] CuCurrentContext
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cudadrv/context.jl:98 [inlined]
  [8] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/state.jl:209 [inlined]
  [9] actual_free(dev::CuDevice, block::CUDA.PoolUtils.Block)
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:158
 [10] free
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool/none.jl:31 [inlined]
 [11] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:205 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [13] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:366 [inlined]
 [14] macro expansion
    @ ./timing.jl:279 [inlined]
 [15] free
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:365 [inlined]
 [16] (::CUDA.var"#290#291"{CuArray{ComplexF64, 2}})()
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:43
 [17] context!(f::CUDA.var"#290#291"{CuArray{ComplexF64, 2}}, ctx::CuContext)
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/state.jl:196
 [18] unsafe_free!(xs::CuArray{ComplexF64, 2})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:42
 [19] gc
    @ ./gcutils.jl:94 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [21] alloc(dev::CuDevice, sz::Int64)
    @ CUDA.NoPool /global/u1/m/marius/work/clem/dev/CUDA/src/pool/none.jl:16
 [22] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:205 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [24] macro expansion
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:272 [inlined]
 [25] macro expansion
    @ ./timing.jl:279 [inlined]
 [26] alloc
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/pool.jl:271 [inlined]
 [27] CuArray{ComplexF64, 2}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:20
 [28] similar
    @ /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:99 [inlined]
 [29] copy(a::CuArray{ComplexF64, 2})
    @ CUDA /global/u1/m/marius/work/clem/dev/CUDA/src/array.jl:104
 [30] unsafe_execute!(plan::CUDA.CUFFT.cCuFFTPlan{ComplexF64, -1, false, 2}, x::CuArray{ComplexF64, 2}, y::CuArray{ComplexF64, 2})
    @ CUDA.CUFFT /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:435
 [31] mul!
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:460 [inlined]
 [32] *
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:488 [inlined]
 [33] fft(x::CuArray{ComplexF64, 2}, region::UnitRange{Int64})
    @ AbstractFFTs ~/.julia/packages/AbstractFFTs/JAxy0/src/definitions.jl:51
 [34] fft
    @ /global/u1/m/marius/work/clem/dev/CUDA/lib/cufft/fft.jl:238 [inlined]
 [35] macro expansion
    @ ./In[11]:7 [inlined]
 [36] (::var"#136#threadsfor_fun#8"{Vector{CuDevice}})(onethread::Bool)
    @ Main ./threadingconstructs.jl:81
 [37] (::var"#136#threadsfor_fun#8"{Vector{CuDevice}})()
    @ Main ./threadingconstructs.jl:48

@maleadt (Member) commented Feb 26, 2021

What's the actual error you're seeing? You only included the stack trace.
FWIW, I'm seeing a free after the context is destroyed, and occasionally a segfault; both are hard to explain since we never really destroy the context before that happens...

@maleadt (Member) commented Feb 26, 2021

Ah, I'm a dummy. Of course this crashes: you're allocating a unified buffer (using cuMemAllocManaged), but instructing CUDA.jl to free it (by passing own=true), which calls cuMemFree assuming a device buffer. So unsafe_wrap is only intended to be used with buffers that CUDA.jl knows how to free, which this isn't. Instead, pass own=false and register your own finalizer.
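
A minimal sketch of that suggestion, reusing the unified_gpu helper from the issue description; the exact finalizer call here is an assumption, not code from this thread:

function unified_gpu(x::AbstractArray{T,N}) where {T,N}
    # allocate the unified buffer ourselves, so we are also responsible for freeing it
    buf = Mem.alloc(Mem.Unified, sizeof(T) * length(x))
    # own=false: CUDA.jl will not try to cuMemFree this pointer on our behalf
    y = unsafe_wrap(CuArray{T,N}, convert(CuPtr{T}, buf), size(x); own=false)
    # free the unified buffer (not a device buffer) once the wrapper is collected
    finalizer(_ -> Mem.free(buf), y)
    copyto!(y, x)
    return y
end

As the follow-up comments below show, even this variant can run into the finalizer/context-switch problem.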

maleadt closed this as completed Feb 26, 2021
@marius311 (Contributor, Author)

> What's the actual error you're seeing? You only included the stack trace.

That's all that's printed. I guess I'm not sure where it's coming from; shouldn't this have printed the error itself?

> Instead, pass own=false and register your own finalizer

Ah, I see, thanks. I just followed the example here. Should that be changed?

@marius311 (Contributor, Author)

I tried setting own=false but still get sporadic printed errors/segfaults with pretty much the same stack trace. I think the origin of the problem is the context! switches in the finalizers. Maybe this should work, but it doesn't seem to. With the small changes in master...marius311:no_gc_ctx_switch, though, you can get rid of the context switches. With that, the unified-memory multi-GPU/thread stuff I'm trying to do is actually working really well!
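
A generic illustration of that failure mode, assuming nothing about CUDA.jl internals (the names pool_lock, Handle, and release are made up for the example): the stack traces above end up in lock/wait from inside a finalizer, and a finalizer that blocks on a ReentrantLock during GC is exactly the situation the Julia manual recommends handling with trylock plus re-registration:

const pool_lock = ReentrantLock()

mutable struct Handle
    ptr::Ptr{Cvoid}
end

function release(h::Handle)
    if trylock(pool_lock)
        try
            # the lock is free: release the resource immediately
            h.ptr == C_NULL || Libc.free(h.ptr)
            h.ptr = C_NULL
        finally
            unlock(pool_lock)
        end
    else
        # the lock is held: don't block inside GC, retry at a later collection
        finalizer(release, h)
    end
    return nothing
end

h = Handle(Libc.malloc(256))
finalizer(release, h)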

@maleadt (Member) commented Feb 27, 2021

There's definitely a context switch missing in the unsafe_wrap finalizer; I'll tackle that next week. And yes, the docs are out of date...
