
Dynamic allocations cannot be freed #38

Open

NTimmons opened this issue May 12, 2020 · 5 comments
Labels: bug (Something isn't working) · cuda kernels (Stuff about writing CUDA kernels)

Comments

@NTimmons commented May 12, 2020

Hi,

I am getting a dynamic memory crash when using Ref and custom structs.
This is occurring across multiple calls to the same kernel, which suggests the memory allocated inside the kernel is not being deallocated when the kernel returns, even after forcing garbage collection.

I am running a mobile NVIDIA 2060 (6 GB), if that helps.

I have put together a minimal example here:

# Running once is fine. Running multiple times causes a crash:
# it seems dynamic memory allocations persist across runs.


using CUDA

struct Point
    x::Float32
    y::Float32
    z::Float32
    Point() = new(0.0f0, 0.0f0, 0.0f0)
    Point(a,b,c) = new(a,b,c)
end

function GetThreePoint(mem, offset, sz)
    a = GetValue(mem, offset, sz)
    b = GetValue(mem, offset, sz)   ## Comment this line out to avoid the crash.
    return Point(a,b,a)
end

function GetValue(mem, offset, sz)
    offset[] = offset[]+1
    r = mem[][1+(offset[] % sz-2)]
    return r
end

function AllocatingKernel(memIn, memOut, size)
    i = 1 #(blockIdx().x - 1) * blockDim().x + threadIdx().x

    memRef = Ref(memIn)
    a = Ref(1)
    memOut[i] = GetThreePoint(memRef, a, size).x
    return
end

# Running

memInGpu  = CuArray(ones(Float32, 256))
memOutGpu = CuArray(ones(Float32, 256))
sz        = length(memInGpu)

for i in 1:1000
     @cuda blocks = (16,1) threads=(64,1) AllocatingKernel(memInGpu, memOutGpu, sz)
end

which results in many pages of

ERROR: Out of dynamic GPU memory (trying to allocate 16 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 16 bytes)
(… same line repeated many times …)
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.

From my limited testing, this crash only happens with my Point type; if we don't use the reference object with it, it is fine.

@maleadt maleadt changed the title Dynamic Memory Crash Dynamic allocations cannot be freed May 12, 2020
@maleadt (Member) commented May 12, 2020

As I mentioned on Slack, this is kind of expected. We don't have a device-side GC, so you'll quickly run out of memory when doing allocations in the hot path (i.e. using a Ref). This is exacerbated by a CUDA bug, where it's not possible to free the device-side heap from the host. We'll need to reimplement malloc to fix that, so I'm repurposing this issue to track that.
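One way to keep the original repro's logic out of the allocating hot path is to thread the offset through by value instead of mutating a Ref. This is only a sketch, untested on-device; the names `get_value` and `get_three_point` are hypothetical renamings of the helpers in the repro above, and the `Point` struct mirrors the one defined there:

```julia
# Minimal Point mirroring the struct from the repro above.
struct Point
    x::Float32
    y::Float32
    z::Float32
end

# Ref-free variant: state is passed and returned by value, so nothing
# needs to be heap-allocated inside the kernel.
function get_value(mem, offset, sz)
    offset += 1
    r = mem[1 + (offset % sz - 2)]   # same indexing as the original repro
    return r, offset
end

function get_three_point(mem, offset, sz)
    a, offset = get_value(mem, offset, sz)
    b, offset = get_value(mem, offset, sz)
    return Point(a, b, a), offset
end
```

Because everything here is isbits, the compiler can keep the state in registers instead of emitting device-side malloc calls.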

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda kernels Stuff about writing CUDA kernels. labels May 27, 2020
@cdsousa (Contributor) commented Aug 5, 2021

Hi, I think I'm hitting this bug -- I'm getting that error -- but I'm not sure. This is happening in code that used to work 4 months ago.
How can one know whether this is happening, and what can be done to work around it?

@maleadt (Member) commented Aug 5, 2021

Check @device_code_llvm (if needed, passing dump_module=true) and look for calls to alloc (or gpu_malloc). It's possible a change in GPUCompiler.jl/CUDA.jl/Julia introduced allocations where there didn't use to be any (generally you don't want any, except possibly in exception paths).
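For reference, a minimal way to do that inspection might look like the following. This requires a CUDA-capable device, and `demo_kernel` is just a placeholder for your own kernel and launch configuration:

```julia
using CUDA

# Placeholder kernel standing in for the code under investigation.
function demo_kernel(a)
    i = threadIdx().x
    a[i] += 1f0
    return
end

a = CUDA.zeros(Float32, 32)

# dump_module=true prints the whole LLVM module, including runtime helper
# functions, so device-side allocation calls show up; search the output
# for "alloc" (or "gpu_malloc").
@device_code_llvm dump_module=true @cuda threads=32 demo_kernel(a)
```

If the kernel is allocation-free, no such calls should appear outside of exception paths.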

@cdsousa (Contributor) commented Aug 6, 2021

I must say that I'm hitting this bug indeed, and it is related to the usage of an MVector (StaticArrays.jl).
I've tried this exact code:

using CUDA
using StaticArrays

function kernel(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    STACKSIZE = 15
    stack = ones(MVector{15, UInt32})
    for (j,x) in enumerate(stack)
        a[i] += stack[j]
    end
    return
end

a = CuArray(ones(UInt32, 1024))

for i in 1:1000
     @cuda blocks=(16,1) threads=(64,1) kernel(a)
     CUDA.synchronize()
end

with these exact package versions:

  [052768ef] CUDA v3.3.4
  [61eb1bfa] GPUCompiler v0.12.8
  [90137ffa] StaticArrays v1.2.11

and it works with Julia 1.6.2 but errors with version 1.7.0-beta3.0 ...

@maleadt (Member) commented Aug 6, 2021

Looks like a regression in Julia:

julia> using StaticArrays

julia> function cpu_kernel(a, i)
           STACKSIZE = 15
           stack = ones(MVector{15, UInt32})
           for (j,x) in enumerate(stack)
               a[i] += stack[j]
           end
           return
       end

julia> VERSION
v"1.7.0-beta3.0"

julia> code_llvm(cpu_kernel, Tuple{Array{UInt32, 1}, Int})
;  @ REPL[7]:1 within `cpu_kernel`
define void @julia_cpu_kernel_3340({}* nonnull align 16 dereferenceable(40) %0, i64 signext %1) #0 {
top:
  %gcframe62 = alloca [3 x {}*], align 16
  %gcframe62.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe62, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe62 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = alloca [1 x i64], align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #9
  %ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
  %ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
  %pgcstack = load {}***, {}**** %ppgcstack, align 8
;  @ REPL[7]:3 within `cpu_kernel`
; ┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/arraymath.jl:14 within `ones`
; │┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/arraymath.jl:15 within `_ones`
; ││┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/arraymath.jl:23 within `macro expansion`
; │││┌ @ /home/tim/Julia/depot/packages/StaticArrays/rIymU/src/MArray.jl:25 within `MArray`
      %4 = bitcast [3 x {}*]* %gcframe62 to i64*
      store i64 4, i64* %4, align 16
      %5 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe62, i64 0, i64 1
      %6 = bitcast {}** %5 to {}***
      %7 = load {}**, {}*** %pgcstack, align 8
      store {}** %7, {}*** %6, align 8
      %8 = bitcast {}*** %pgcstack to {}***
      store {}** %gcframe62.sub, {}*** %8, align 8
      %ptls_field20 = getelementptr inbounds {}**, {}*** %pgcstack, i64 2305843009213693954
      %9 = bitcast {}*** %ptls_field20 to i8**
      %ptls_load2122 = load i8*, i8** %9, align 8
      %10 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load2122, i32 1488, i32 80) #2
      %11 = bitcast {}* %10 to i64*
      %12 = getelementptr inbounds i64, i64* %11, i64 -1
      store atomic i64 139999965732448, i64* %12 unordered, align 8
      %13 = bitcast {}* %10 to [15 x i32]*
      %14 = bitcast {}* %10 to <8 x i32>*
      store <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, <8 x i32>* %14, align 4
      %.repack30 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 8
      %15 = bitcast i32* %.repack30 to <4 x i32>*
      store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %15, align 4
      %.repack34 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 12
      store i32 1, i32* %.repack34, align 4
      %.repack35 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 13
      store i32 1, i32* %.repack35, align 4
      %.repack36 = getelementptr inbounds [15 x i32], [15 x i32]* %13, i64 0, i64 14
      store i32 1, i32* %.repack36, align 4
; └└└└
julia> VERSION
v"1.6.2"

julia> code_llvm(cpu_kernel, Tuple{Array{UInt32, 1}, Int})
;  @ REPL[14]:1 within `cpu_kernel'
define void @julia_cpu_kernel_4773({}* nonnull align 16 dereferenceable(40) %0, i64 signext %1) {
top:
  %2 = alloca [15 x i32], align 16
  %.sub = bitcast [15 x i32]* %2 to i8*
  call void @llvm.lifetime.start.p0i8(i64 60, i8* nonnull %.sub)
;  @ REPL[14]:3 within `cpu_kernel'
; ┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/arraymath.jl:14 within `ones'
; │┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/arraymath.jl:15 within `_ones'
; ││┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/arraymath.jl:23 within `macro expansion'
; │││┌ @ /home/tim/Julia/depot/packages/StaticArrays/lpEdQ/src/MArray.jl:25 within `MArray'
      %3 = bitcast [15 x i32]* %2 to <8 x i32>*
      store <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, <8 x i32>* %3, align 16
      %.repack28 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 8
      %4 = bitcast i32* %.repack28 to <4 x i32>*
      store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %4, align 16
      %.repack32 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 12
      store i32 1, i32* %.repack32, align 16
      %.repack33 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 13
      store i32 1, i32* %.repack33, align 4
      %.repack34 = getelementptr inbounds [15 x i32], [15 x i32]* %2, i64 0, i64 14
      store i32 1, i32* %.repack34, align 8
; └└└└

See the jl_gc_pool_alloc call, which is only there on 1.7. Probably worth filing an upstream bug.
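In the meantime, a possible workaround sketch (not verified against the 1.7 beta) is to use an immutable SVector instead of an MVector when the buffer is never mutated, as in this repro. SVector values are plain isbits data, so no jl_gc_pool_alloc call should be emitted:

```julia
using StaticArrays

# Same loop as the repro, but with an immutable SVector. Since the value
# is isbits, it stays in registers/on the stack and no GC allocation is
# needed. This assumes the code never writes into `stack`.
function cpu_kernel_svec(a, i)
    stack = ones(SVector{15, UInt32})
    for j in eachindex(stack)
        a[i] += stack[j]
    end
    return
end

a = ones(UInt32, 4)
cpu_kernel_svec(a, 1)   # a[1] becomes 1 + 15 = 16
```

When mutation is genuinely required, keeping the MVector usage inside a pattern the compiler can stack-allocate (or waiting for the upstream fix) would be the alternatives.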
