
Dynamic allocations cannot be freed #38

Open

NTimmons opened this issue May 12, 2020 · 5 comments
Labels: bug (Something isn't working) · cuda kernels (Stuff about writing CUDA kernels)

Comments

@NTimmons commented May 12, 2020

Hi,

I am getting a dynamic memory crash when using Ref and custom structs.
This is occurring across multiple calls to the same kernel, which suggests the memory allocated inside the kernel is not being deallocated when the kernel returns, even after forcing garbage collection.

I am running a mobile NVIDIA 2060 (6 GB), if that helps.

I have put together a minimal example here:

# Running once is fine. Running multiple times causes a crash:
# it seems dynamic memory allocations persist across runs.


using CUDA

struct Point
    x::Float32
    y::Float32
    z::Float32
    Point() = new(0.0f0, 0.0f0, 0.0f0)
    Point(a,b,c) = new(a,b,c)
end

function GetThreePoint(mem, offset, sz)
    a = GetValue(mem, offset, sz)
    b = GetValue(mem, offset, sz)   ## Comment this line out to avoid the crash.
    return Point(a,b,a)
end

function GetValue(mem, offset, sz)
    offset[] = offset[]+1
    r = mem[][1+(offset[] % sz-2)]
    return r
end

function AllocatingKernel(memIn, memOut, size)
    i = 1 #(blockIdx().x - 1) * blockDim().x + threadIdx().x

    memRef = Ref(memIn)
    a = Ref(1)
    memOut[i] = GetThreePoint(memRef, a, size).x
    return
end

# Running

memInGpu  = CuArray(ones(Float32, 256))
memOutGpu = CuArray(ones(Float32, 256))
sz        = length(memInGpu)

for i in 1:1000
     @cuda blocks = (16,1) threads=(64,1) AllocatingKernel(memInGpu, memOutGpu, sz)
end

which results in many pages of

ERROR: Out of dynamic GPU memory (trying to allocate 16 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 16 bytes)
(… same line repeated many times …)
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.

From my limited testing, this crash only happens with my Point type; if we don't use the reference object with it, it is fine.

@maleadt maleadt changed the title Dynamic Memory Crash Dynamic allocations cannot be freed May 12, 2020
@maleadt (Member) commented May 12, 2020

As I mentioned on Slack, this is kind of expected. We don't have a device-side GC, so you'll quickly run out of memory when doing allocations in the hot path (i.e. using a Ref). This is exacerbated by a CUDA bug, where it's not possible to free the device-side heap from the host. We'll need to reimplement malloc to fix that, so I'm repurposing this issue to track that.
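One way to keep the original repro's logic out of the allocating hot path is to thread the offset through by value instead of mutating a Ref. This is only a sketch, untested on-device; the names `get_value` and `get_three_point` are hypothetical renamings of the helpers in the repro above, and the `Point` struct mirrors the one defined there:

```julia
# Minimal Point mirroring the struct from the repro above.
struct Point
    x::Float32
    y::Float32
    z::Float32
end

# Ref-free variant: state is passed and returned by value, so nothing
# needs to be heap-allocated inside the kernel.
function get_value(mem, offset, sz)
    offset += 1
    r = mem[1 + (offset % sz - 2)]   # same indexing as the original repro
    return r, offset
end

function get_three_point(mem, offset, sz)
    a, offset = get_value(mem, offset, sz)
    b, offset = get_value(mem, offset, sz)
    return Point(a, b, a), offset
end
```

Because everything here is isbits, the compiler can keep the state in registers instead of emitting device-side malloc calls.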

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda kernels Stuff about writing CUDA kernels. labels May 27, 2020
@cdsousa (Contributor) commented Aug 5, 2021

Hi, I think I'm hitting this bug -- I'm getting that error -- but I'm not sure. This is happening in code that used to work 4 months ago.
How can one know whether this is happening, and what can be done to work around it?

@maleadt (Member) commented Aug 5, 2021

Check @device_code_llvm (if needed, passing dump_module=true) and look for calls to alloc (or gpu_malloc). It's possible a change in GPUCompiler.jl/CUDA.jl/Julia introduced allocations where there didn't use to be any (generally you don't want any, except possibly in exception paths).
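For reference, a minimal way to do that inspection might look like the following. This requires a CUDA-capable device, and `demo_kernel` is just a placeholder for your own kernel and launch configuration:

```julia
using CUDA

# Placeholder kernel standing in for the code under investigation.
function demo_kernel(a)
    i = threadIdx().x
    a[i] += 1f0
    return
end

a = CUDA.zeros(Float32, 32)

# dump_module=true prints the whole LLVM module, including runtime helper
# functions, so device-side allocation calls show up; search the output
# for "alloc" (or "gpu_malloc").
@device_code_llvm dump_module=true @cuda threads=32 demo_kernel(a)
```

If the kernel is allocation-free, no such calls should appear outside of exception paths.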

@cdsousa (Contributor) commented Aug 6, 2021

I must say that I'm hitting this bug indeed, and it is related to the usage of an MVector (StaticArrays.jl).
I've tried this exact code:

using CUDA
using StaticArrays

function kernel(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    STACKSIZE = 15
    stack = ones(MVector{15, UInt32})
    for (j,x) in enumerate(stack)
        a[i] += stack[j]
    end
    return
end

a = CuArray(ones(UInt32, 1024))

for i in 1:1000
     @cuda blocks=(16,1) threads=(64,1) kernel(a)
     CUDA.synchronize()
end

with these exact package versions:

  [052768ef] CUDA v3.3.4
  [61eb1bfa] GPUCompiler v0.12.8
  [90137ffa] StaticArrays v1.2.11

and it works with Julia 1.6.2 but errors with version 1.7.0-beta3.0 ...

@maleadt (Member) commented Aug 6, 2021

Looks like a regression in Julia:

julia> using StaticArrays

julia> function cpu_kernel(a, i)
           STACKSIZE = 15
           stack = ones(MVector{15, UInt32})
           for (j,x) in enumerate(stack)
               a[i] += stack[j]
           end
           return
       end

julia> VERSION
v"1.7.0-beta3.0"

julia> code_llvm(cpu_kernel, Tuple{Array{UInt32, 1}, Int})
;  @ REPL[7]:1 within `cpu_kernel`
define void @julia_cpu_kernel_3340({}* nonnull align 16 dereferenceable(40) %0, i64 signext %1) #0 {
top:
  %gcframe62 = alloca [3 x {}*], align 16
  %gcframe62.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe62, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe62 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = alloca [1 x i64], align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #9
  %ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
  %ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
  %pgcstack = load {}***, {}**** %ppgcstack, align 8
;  @ REPL[7]:3 within `cpu_kernel`
; ┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/arraymath.jl:14 within `ones`
; │┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/arraymath.jl:15 within `_ones`
; ││┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/arraymath.jl:23 within `macro expansion`
; │││┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/MArray.jl:25 within `MArray`
      %4 = bitcast [3 x {}*]* %gcframe62 to i64*
      store i64 4, i64* %4, align 16
      %5 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe62, i64 0, i64 1
      %6 = bitcast {}** %5 to {}***
      %7 = load {}**, {}*** %pgcstack, align 8
      store {}** %7, {}*** %6, align 8
      %8 = bitcast {}*** %pgcstack to {}***
      store {}** %gcframe62.sub, {}*** %8, align 8
      %ptls_field20 = getelementptr inbounds {}**, {}*** %pgcstack, i64 2305843009213693954
      %9 = bitcast {}*** %ptls_field20 to i8**
      %ptls_load2122 = load i8*, i8** %9, align 8
      %10 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load2122, i32 1488, i32 80) #2
      %11 = bitcast {}* %10 to i64*
      %12 = getelementptr inbounds i64, i64* %11, i64 -1
      store atomic i64 139999965732448, i64* %12 unordered, align 8
      %13 = bitcast {}* %10 to [15 x i32]*
      %14 = bitcast {}* %10 to <8 x i32>*
      store <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, <8 x i32>* %14, align 4
      %.repack30 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 8
      %15 = bitcast i32* %.repack30 to <4 x i32>*
      store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %15, align 4
      %.repack34 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 12
      store i32 1, i32* %.repack34, align 4
      %.repack35 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 13
      store i32 1, i32* %.repack35, align 4
      %.repack36 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 14
      store i32 1, i32* %.repack36, align 4
; └└└└
julia> VERSION
v"1.6.2"

julia> code_llvm(cpu_kernel, Tuple{Array{UInt32, 1}, Int})
;  @ REPL[14]:1 within `cpu_kernel'
define void @julia_cpu_kernel_4773({}* nonnull align 16 dereferenceable(40) %0, i64 signext %1) {
top:
  %2 = alloca [15 x i32], align 16
  %.sub = bitcast [15 x i32]* %2 to i8*
  call void @llvm.lifetime.start.p0i8(i64 60, i8* nonnull %.sub)
;  @ REPL[14]:3 within `cpu_kernel'
; ┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/arraymath.jl:14 within `ones'
; │┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/arraymath.jl:15 within `_ones'
; ││┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/arraymath.jl:23 within `macro expansion'
; │││┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/MArray.jl:25 within `MArray'
      %3 = bitcast [15 x i32]* %2 to <8 x i32>*
      store <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, <8 x i32>* %3, align 16
      %.repack28 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 8
      %4 = bitcast i32* %.repack28 to <4 x i32>*
      store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %4, align 16
      %.repack32 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 12
      store i32 1, i32* %.repack32, align 16
      %.repack33 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 13
      store i32 1, i32* %.repack33, align 4
      %.repack34 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 14
      store i32 1, i32* %.repack34, align 8
; └└└└

See the jl_gc_pool_alloc call, which is only there on 1.7. Probably worth filing an upstream bug.
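In the meantime, a possible workaround sketch (not verified against the 1.7 beta) is to use an immutable SVector instead of an MVector when the buffer is never mutated, as in this repro. SVector values are plain isbits data, so no jl_gc_pool_alloc call should be emitted:

```julia
using StaticArrays

# Same loop as the repro, but with an immutable SVector. Since the value
# is isbits, it stays in registers/on the stack and no GC allocation is
# needed. This assumes the code never writes into `stack`.
function cpu_kernel_svec(a, i)
    stack = ones(SVector{15, UInt32})
    for j in eachindex(stack)
        a[i] += stack[j]
    end
    return
end

a = ones(UInt32, 4)
cpu_kernel_svec(a, 1)   # a[1] becomes 1 + 15 = 16
```

When mutation is genuinely required, keeping the MVector usage inside a pattern the compiler can stack-allocate (or waiting for the upstream fix) would be the alternatives.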
