CUDA Error 717: Invalid Address Space -- Atomic operation not supported on stack memory. #428

Open
jgreener64 opened this issue Aug 25, 2022 · 34 comments

@jgreener64
Contributor

I am on Julia 1.8.0, Enzyme main (71ebce9) and CUDA 3.12.0. The following CUDA kernel runs but the gradient kernel with Enzyme errors.

using CUDA, Enzyme, Molly, LinearAlgebra

function find_neighbors(coords)
    n_atoms = length(coords)
    neighbors = Tuple{Int, Int}[]
    for i in 1:n_atoms
        for j in (i + 1):n_atoms
            if norm(vector(coords[i], coords[j], boundary)) <= 1.0
                push!(neighbors, (i, j))
            end
        end
    end
    return neighbors
end

n_atoms = 1024
boundary = CubicBoundary(2.7f0, 2.7f0, 2.7f0)
coords = place_atoms(n_atoms, boundary; min_dist=0.05f0)
atoms = [Atom(charge=0.0f0, mass=1.0f0, σ=Float32(0.02 + rand() * 0.01),
              ϵ=Float32(0.2 + rand() * 0.1)) for _ in 1:n_atoms]
cu_coords = CuArray(coords)
cu_atoms = CuArray(atoms)
neighbors = find_neighbors(coords)
cu_neighbors = CuArray(neighbors)

function force(c1, c2, a1, a2)
    dr = c2 - c1
    invr2 = inv(sum(abs2, dr))
    σ = (a1.σ + a2.σ) / 2
    ϵ = sqrt(a1.ϵ * a2.ϵ)
    six_term = (σ^2 * invr2) ^ 3
    f = (24 * ϵ * invr2) * (2 * six_term ^ 2 - six_term)
    return f * dr
end

function kernel!(forces::CuDeviceMatrix{T}, coords_var, atoms_var, neighbors_var, ::Val{M}) where {T, M}
    coords = CUDA.Const(coords_var)
    atoms = CUDA.Const(atoms_var)
    neighbors = CUDA.Const(neighbors_var)

    tidx = threadIdx().x
    inter_ig = (blockIdx().x - 1) * blockDim().x + tidx
    stride = gridDim().x * blockDim().x
    shared_fs = CuStaticSharedArray(T, (3, M))
    shared_is = CuStaticSharedArray(Int32, M)
    shared_js = CuStaticSharedArray(Int32, M)

    if tidx == 1
        for si in 1:M
            shared_is[si] = zero(Int32)
        end
    end
    sync_threads()

    for (thread_i, inter_i) in enumerate(inter_ig:stride:length(neighbors))
        si = (thread_i - 1) * blockDim().x + tidx
        i, j = neighbors[inter_i]
        f = force(coords[i], coords[j], atoms[i], atoms[j])
        shared_fs[1, si] = f[1]
        shared_fs[2, si] = f[2]
        shared_fs[3, si] = f[3]
        shared_is[si] = i
        shared_js[si] = j
    end
    sync_threads()

    if tidx == 1
        for si in 1:M
            i = shared_is[si]
            if iszero(i)
                break
            end
            j = shared_js[si]
            dx, dy, dz = shared_fs[1, si], shared_fs[2, si], shared_fs[3, si]
            forces[1, i] -= dx
            forces[2, i] -= dy
            forces[3, i] -= dz
            forces[1, j] += dx
            forces[2, j] += dy
            forces[3, j] += dz
        end
    end
    return
end

function grad_kernel!(forces, dforces, coords, dcoords, atoms, datoms, neighbors, shared_mem_size)
    Enzyme.autodiff_deferred(
        kernel!,
        Duplicated(forces, dforces),
        Duplicated(coords, dcoords),
        Duplicated(atoms, datoms),
        Const(neighbors),
        Const(shared_mem_size),
    )
    return
end

cu_forces_mat = CuArray(zeros(Float32, 3, n_atoms))
d_cu_forces_mat = zero(cu_forces_mat)
d_cu_coords = zero(cu_coords)
d_cu_atoms = zero(cu_atoms)
n_threads = 256
n_blocks = 800
shared_mem_size = 2000

CUDA.@sync @cuda threads=n_threads blocks=n_blocks kernel!(cu_forces_mat, cu_coords, cu_atoms,
        cu_neighbors, Val(shared_mem_size))
println("Kernel worked")

CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(cu_forces_mat, d_cu_forces_mat,
        cu_coords, d_cu_coords, cu_atoms, d_cu_atoms, cu_neighbors, Val(shared_mem_size))
println("Grad kernel worked")

The output is:

Kernel worked
warning found shared memory
ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 255)
ptxas error   : Entry function '_Z23julia_grad_kernel__579813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_I6SArrayI5TupleILi3EES0_Li1ELi3EELi1ELi1EES_IS1_IS2_ILi3EES0_Li1ELi3EELi1ELi1EES_I4AtomIS0_S0_S0_S0_ELi1ELi1EES_IS3_IS0_S0_S0_S0_ELi1ELi1EES_IS2_I5Int64S4_ELi1ELi1EE3ValILi2000EE' uses too much shared data (0xfa00 bytes, 0xc000 max)
ptxas info    : 292 bytes gmem
ptxas info    : Compiling entry function '_Z23julia_grad_kernel__579813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_I6SArrayI5TupleILi3EES0_Li1ELi3EELi1ELi1EES_IS1_IS2_ILi3EES0_Li1ELi3EELi1ELi1EES_I4AtomIS0_S0_S0_S0_ELi1ELi1EES_IS3_IS0_S0_S0_S0_ELi1ELi1EES_IS2_I5Int64S4_ELi1ELi1EE3ValILi2000EE' for 'sm_86'
ptxas info    : Function properties for _Z23julia_grad_kernel__579813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_I6SArrayI5TupleILi3EES0_Li1ELi3EELi1ELi1EES_IS1_IS2_ILi3EES0_Li1ELi3EELi1ELi1EES_I4AtomIS0_S0_S0_S0_ELi1ELi1EES_IS3_IS0_S0_S0_S0_ELi1ELi1EES_IS2_I5Int64S4_ELi1ELi1EE3ValILi2000EE
    208 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 96 registers, 64000 bytes smem, 600 bytes cmem[0], 24 bytes cmem[2]
ptxas info    : Function properties for augmented_julia_force_6190
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for diffejulia_force_6190
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for gpu_report_exception
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for gpu_signal_exception
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for julia__throw_boundserror_6169
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for julia__throw_inexacterror_6184
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for julia_steprange_last_6192
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
If you think this is a bug, please file an issue and attach /tmp/jl_vo7Mbazqzg.ptx
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:427
  [3] #224
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
  [4] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(grad_kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/jVY4I/src/driver.jl:76
  [5] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
  [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/jVY4I/src/cache.jl:90
  [7] cufunction(f::typeof(grad_kernel!), tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
  [8] cufunction(f::typeof(grad_kernel!), tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
 [10] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25

Changing the dx, dy, dz = shared_fs[1, si], shared_fs[2, si], shared_fs[3, si] line to dx, dy, dz = 1.0f0, 1.0f0, 1.0f0 avoids the error, suggesting a problem with updating an array using values read from shared memory.

I realise that I should be using atomics for the forces[1, i] -= dx updates, but that errored due to #421; a sketch of the atomic form follows the test below. This may be a separate error though, since the following simple test works, suggesting that the above should also work.

using Enzyme, CUDA

function mul_kernel(A)
    shared = CuStaticSharedArray(Float32, 64)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
    end
    sync_threads()
    if i == 1
        for j in 1:length(A)
            A[j] = shared[j]
        end
    end
    return nothing
end

function grad_mul_kernel(A, dA)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA))
    return nothing
end

A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) grad_mul_kernel(A, dA)
dA
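
For reference, the atomic form of the accumulation that #421 currently blocks would look roughly like this. This is a sketch, not something tested here: CUDA.@atomic is CUDA.jl's macro for device-side atomic read-modify-write, dropped in as a replacement for the plain updates in the tidx == 1 loop of kernel! above.

    if tidx == 1
        for si in 1:M
            i = shared_is[si]
            if iszero(i)
                break
            end
            j = shared_js[si]
            dx, dy, dz = shared_fs[1, si], shared_fs[2, si], shared_fs[3, si]
            # CUDA.@atomic lowers to atomic subtract/add on the device
            CUDA.@atomic forces[1, i] -= dx
            CUDA.@atomic forces[2, i] -= dy
            CUDA.@atomic forces[3, i] -= dz
            CUDA.@atomic forces[1, j] += dx
            CUDA.@atomic forces[2, j] += dy
            CUDA.@atomic forces[3, j] += dz
        end
    end

With atomics the tidx == 1 serialisation could in principle be dropped so every thread writes its own contribution, but the guard is kept here to stay close to the original kernel.
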
@wsmoses
Member

wsmoses commented Sep 26, 2022

Shared memory is fine, the warning (which we should probably make optional or disable) is that you now need twice the shared memory for the derivatives as well.
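
For concreteness, a back-of-the-envelope check of that doubling against the ptxas report above (a sketch; the inference that only the active Float32 array gets a shadow comes from matching the 0xfa00 figure):

M = 2000
fs = 3 * sizeof(Float32) * M   # shared_fs: 24_000 bytes
is = sizeof(Int32) * M         # shared_is:  8_000 bytes
js = sizeof(Int32) * M         # shared_js:  8_000 bytes
primal = fs + is + js          # 40_000 bytes in the original kernel!
total = primal + fs            # plus a shadow of the active Float32 array
@assert total == 0xfa00        # 64_000 bytes, the figure ptxas reports
@assert 0xc000 == 49_152       # the static shared memory limit it exceeds
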

@leios

leios commented Sep 26, 2022

Great! Thanks for the clarification!

@leios

leios commented Sep 26, 2022

Alright, working with the code now, I found 3 errors:

  1. There was a problem with the sqrt function for some reason <-- Not really sure how to fix this one...
  2. The provided kernels use too much shmem. <-- Assuming we need 2x the shmem, I'll just play around with the allocated size until it runs (see the sketch after this list)
  3. An apparent issue with sync_threads()
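
For item 2, a minimal sketch of that adjustment: shrink the Val parameter at launch so that primal plus shadow fits under the 0xc000-byte limit. The value 1000 here is a guess at halving the allocation, not a tuned number:

shared_mem_size = 1000   # assumed: half the original 2000
CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(cu_forces_mat,
        d_cu_forces_mat, cu_coords, d_cu_coords, cu_atoms, d_cu_atoms,
        cu_neighbors, Val(shared_mem_size))
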

Stacktrace for 1:

Kernel worked
ERROR: LoadError: Enzyme compilation failed.
Current scope: 
; Function Attrs: mustprogress willreturn
define internal fastcc void @preprocess_julia_force_6809([1 x [3 x float]]* noalias nocapture nofree noundef nonnull writeonly sret([1 x [3 x float]]) align 4 dereferenceable(12) %0, [1 x [3 x float]] addrspace(11)* nocapture noundef nonnull readonly align 4 dereferenceable(12) %1, [1 x [3 x float]] addrspace(11)* nocapture noundef nonnull readonly align 4 dereferenceable(12) %2, { i64, float, float, float, float, i8 } addrspace(11)* nocapture noundef nonnull readonly align 8 dereferenceable(32) %3, { i64, float, float, float, float, i8 } addrspace(11)* nocapture noundef nonnull readonly align 8 dereferenceable(32) %4) unnamed_addr #13 !dbg !1847 {
top:
  %5 = call {}*** @julia.get_pgcstack() #16
  %6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 0, !dbg !1848
  %7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 0, !dbg !1848
  %8 = load float, float addrspace(11)* %6, align 4, !dbg !1855, !tbaa !91, !invariant.load !4
  %9 = load float, float addrspace(11)* %7, align 4, !dbg !1855, !tbaa !91, !invariant.load !4
  %10 = fsub float %8, %9, !dbg !1855
  %11 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 1, !dbg !1848
  %12 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 1, !dbg !1848
  %13 = load float, float addrspace(11)* %11, align 4, !dbg !1855, !tbaa !91, !invariant.load !4
  %14 = load float, float addrspace(11)* %12, align 4, !dbg !1855, !tbaa !91, !invariant.load !4
  %15 = fsub float %13, %14, !dbg !1855
  %16 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 2, !dbg !1848
  %17 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 2, !dbg !1848
  %18 = load float, float addrspace(11)* %16, align 4, !dbg !1855, !tbaa !91, !invariant.load !4
  %19 = load float, float addrspace(11)* %17, align 4, !dbg !1855, !tbaa !91, !invariant.load !4
  %20 = fsub float %18, %19, !dbg !1855
  %21 = fmul float %10, %10, !dbg !1856
  %22 = fmul float %15, %15, !dbg !1856
  %23 = fadd float %21, %22, !dbg !1864
  %24 = fmul float %20, %20, !dbg !1856
  %25 = fadd float %23, %24, !dbg !1864
  %26 = fdiv float 1.000000e+00, %25, !dbg !1865
  %27 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %3, i64 0, i32 3, !dbg !1867
  %28 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %4, i64 0, i32 3, !dbg !1867
  %29 = load float, float addrspace(11)* %27, align 8, !dbg !1869, !tbaa !91, !invariant.load !4
  %30 = load float, float addrspace(11)* %28, align 8, !dbg !1869, !tbaa !91, !invariant.load !4
  %31 = fadd float %29, %30, !dbg !1869
  %32 = fmul float %31, 5.000000e-01, !dbg !1870
  %33 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %3, i64 0, i32 4, !dbg !1872
  %34 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %4, i64 0, i32 4, !dbg !1872
  %35 = load float, float addrspace(11)* %33, align 4, !dbg !1874, !tbaa !91, !invariant.load !4
  %36 = load float, float addrspace(11)* %34, align 4, !dbg !1874, !tbaa !91, !invariant.load !4
  %37 = fmul float %35, %36, !dbg !1874
  %38 = call i32 @__nvvm_reflect(i8* noundef getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #17, !dbg !1875
  %.not = icmp eq i32 %38, 0, !dbg !1875
  br i1 %.not, label %45, label %39, !dbg !1875

39:                                               ; preds = %top
  %40 = call i32 @__nvvm_reflect(i8* noundef getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i64 0, i64 0)) #17, !dbg !1875
  %.not9 = icmp eq i32 %40, 0, !dbg !1875
  br i1 %.not9, label %43, label %41, !dbg !1875

41:                                               ; preds = %39
  %42 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %37) #17, !dbg !1875
  br label %__nv_sqrtf.exit, !dbg !1875

43:                                               ; preds = %39
  %44 = call float @llvm.nvvm.sqrt.approx.ftz.f(float %37) #17, !dbg !1875
  br label %__nv_sqrtf.exit, !dbg !1875

45:                                               ; preds = %top
  %46 = call i32 @__nvvm_reflect(i8* noundef getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i64 0, i64 0)) #17, !dbg !1875
  %.not8 = icmp eq i32 %46, 0, !dbg !1875
  br i1 %.not8, label %49, label %47, !dbg !1875

47:                                               ; preds = %45
  %48 = call float @llvm.sqrt.f32(float %37) #16, !dbg !1875
  br label %__nv_sqrtf.exit, !dbg !1875

49:                                               ; preds = %45
  %50 = call float @llvm.nvvm.sqrt.approx.f(float %37) #17, !dbg !1875
  br label %__nv_sqrtf.exit, !dbg !1875

__nv_sqrtf.exit:                                  ; preds = %49, %47, %43, %41
  %.0.i = phi float [ %42, %41 ], [ %44, %43 ], [ %48, %47 ], [ %50, %49 ], !dbg !1875
  %51 = fmul float %32, %32, !dbg !1876
  %52 = fmul float %26, %51, !dbg !1879
  %53 = fmul float %52, %52, !dbg !1880
  %54 = fmul float %52, %53, !dbg !1880
  %55 = fmul float %.0.i, 2.400000e+01, !dbg !1883
  %56 = fmul float %26, %55, !dbg !1887
  %57 = fmul float %54, %54, !dbg !1888
  %58 = fmul float %57, 2.000000e+00, !dbg !1890
  %59 = fsub float %58, %54, !dbg !1892
  %60 = fmul float %59, %56, !dbg !1893
  %61 = fmul float %10, %60, !dbg !1894
  %62 = fmul float %15, %60, !dbg !1894
  %63 = fmul float %20, %60, !dbg !1894
  %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 0, !dbg !1900
  store float %61, float* %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx, align 4, !dbg !1900
  %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 1, !dbg !1900
  store float %62, float* %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6, align 4, !dbg !1900
  %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 2, !dbg !1900
  store float %63, float* %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7, align 4, !dbg !1900
  ret void, !dbg !1900
}

; Function Attrs: mustprogress willreturn
define internal fastcc { i8* } @fakeaugmented_julia_force_6809([1 x [3 x float]]* noalias nocapture nofree noundef nonnull writeonly sret([1 x [3 x float]]) align 4 dereferenceable(12) %0, [1 x [3 x float]]* nocapture %"'", [1 x [3 x float]] addrspace(11)* nocapture noundef nonnull readonly align 4 dereferenceable(12) %1, [1 x [3 x float]] addrspace(11)* nocapture %"'1", [1 x [3 x float]] addrspace(11)* nocapture noundef nonnull readonly align 4 dereferenceable(12) %2, [1 x [3 x float]] addrspace(11)* nocapture %"'2", { i64, float, float, float, float, i8 } addrspace(11)* nocapture noundef nonnull readonly align 8 dereferenceable(32) %3, { i64, float, float, float, float, i8 } addrspace(11)* nocapture %"'3", { i64, float, float, float, float, i8 } addrspace(11)* nocapture noundef nonnull readonly align 8 dereferenceable(32) %4, { i64, float, float, float, float, i8 } addrspace(11)* nocapture %"'4") unnamed_addr #13 !dbg !1901 {
top:
  %_replacementA = phi {}*** 
  %5 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 0, !dbg !1902
  %6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 0, !dbg !1902
  %7 = load float, float addrspace(11)* %5, align 4, !dbg !1909, !tbaa !91, !invariant.load !4
  %8 = load float, float addrspace(11)* %6, align 4, !dbg !1909, !tbaa !91, !invariant.load !4
  %9 = fsub float %7, %8, !dbg !1909
  %10 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 1, !dbg !1902
  %11 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 1, !dbg !1902
  %12 = load float, float addrspace(11)* %10, align 4, !dbg !1909, !tbaa !91, !invariant.load !4
  %13 = load float, float addrspace(11)* %11, align 4, !dbg !1909, !tbaa !91, !invariant.load !4
  %14 = fsub float %12, %13, !dbg !1909
  %15 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 2, !dbg !1902
  %16 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 2, !dbg !1902
  %17 = load float, float addrspace(11)* %15, align 4, !dbg !1909, !tbaa !91, !invariant.load !4
  %18 = load float, float addrspace(11)* %16, align 4, !dbg !1909, !tbaa !91, !invariant.load !4
  %19 = fsub float %17, %18, !dbg !1909
  %20 = fmul float %9, %9, !dbg !1910
  %21 = fmul float %14, %14, !dbg !1910
  %22 = fadd float %20, %21, !dbg !1918
  %23 = fmul float %19, %19, !dbg !1910
  %24 = fadd float %22, %23, !dbg !1918
  %25 = fdiv float 1.000000e+00, %24, !dbg !1919
  %26 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %3, i64 0, i32 3, !dbg !1921
  %27 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %4, i64 0, i32 3, !dbg !1921
  %28 = load float, float addrspace(11)* %26, align 8, !dbg !1923, !tbaa !91, !invariant.load !4
  %29 = load float, float addrspace(11)* %27, align 8, !dbg !1923, !tbaa !91, !invariant.load !4
  %30 = fadd float %28, %29, !dbg !1923
  %31 = fmul float %30, 5.000000e-01, !dbg !1924
  %32 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %3, i64 0, i32 4, !dbg !1926
  %33 = getelementptr inbounds { i64, float, float, float, float, i8 }, { i64, float, float, float, float, i8 } addrspace(11)* %4, i64 0, i32 4, !dbg !1926
  %34 = load float, float addrspace(11)* %32, align 4, !dbg !1928, !tbaa !91, !invariant.load !4
  %35 = load float, float addrspace(11)* %33, align 4, !dbg !1928, !tbaa !91, !invariant.load !4
  %36 = fmul float %34, %35, !dbg !1928
  %37 = call i32 @__nvvm_reflect(i8* noundef getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #16, !dbg !1929
  %.not = icmp eq i32 %37, 0, !dbg !1929
  br i1 %.not, label %44, label %38, !dbg !1929

38:                                               ; preds = %top
  %39 = call i32 @__nvvm_reflect(i8* noundef getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i64 0, i64 0)) #16, !dbg !1929
  %.not9 = icmp eq i32 %39, 0, !dbg !1929
  br i1 %.not9, label %42, label %40, !dbg !1929

40:                                               ; preds = %38
  %41 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %36) #16, !dbg !1929
  br label %__nv_sqrtf.exit, !dbg !1929

42:                                               ; preds = %38
  %43 = call float @llvm.nvvm.sqrt.approx.ftz.f(float %36) #16, !dbg !1929
  br label %__nv_sqrtf.exit, !dbg !1929

44:                                               ; preds = %top
  %45 = call i32 @__nvvm_reflect(i8* noundef getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i64 0, i64 0)) #16, !dbg !1929
  %.not8 = icmp eq i32 %45, 0, !dbg !1929
  br i1 %.not8, label %48, label %46, !dbg !1929

46:                                               ; preds = %44
  %47 = call float @llvm.sqrt.f32(float %36) #17, !dbg !1929
  br label %__nv_sqrtf.exit, !dbg !1929

48:                                               ; preds = %44
  %49 = call float @llvm.nvvm.sqrt.approx.f(float %36) #16, !dbg !1929
  br label %__nv_sqrtf.exit, !dbg !1929

__nv_sqrtf.exit:                                  ; preds = %48, %46, %42, %40
  %.0.i = phi float [ %41, %40 ], [ %43, %42 ], [ %47, %46 ], [ %49, %48 ], !dbg !1929
  %50 = fmul float %31, %31, !dbg !1930
  %51 = fmul float %25, %50, !dbg !1933
  %52 = fmul float %51, %51, !dbg !1934
  %53 = fmul float %51, %52, !dbg !1934
  %54 = fmul float %.0.i, 2.400000e+01, !dbg !1937
  %55 = fmul float %25, %54, !dbg !1941
  %56 = fmul float %53, %53, !dbg !1942
  %57 = fmul float %56, 2.000000e+00, !dbg !1944
  %58 = fsub float %57, %53, !dbg !1946
  %59 = fmul float %58, %55, !dbg !1947
  %60 = fmul float %9, %59, !dbg !1948
  %61 = fmul float %14, %59, !dbg !1948
  %62 = fmul float %19, %59, !dbg !1948
  %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 0, !dbg !1954
  store float %60, float* %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx, align 4, !dbg !1954
  %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 1, !dbg !1954
  store float %61, float* %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6, align 4, !dbg !1954
  %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 2, !dbg !1954
  store float %62, float* %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7, align 4, !dbg !1954
  ret { i8* } undef, !dbg !1954

allocsForInversion:                               ; No predecessors!
}

cannot handle (augmented) unknown intrinsic
  %42 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %37) #17, !dbg !110

Stacktrace:
 [1] #sqrt
   @ ~/.julia/packages/CUDA/DfvRa/src/device/intrinsics/math.jl:220
 [2] force
   @ ~/projects/CESMIX/tests/cuda_enzyme_2.jl:30

Stacktrace:
  [1] julia_error(cstr::Cstring, val::Ptr{LLVM.API.LLVMOpaqueValue}, errtype::Enzyme.API.ErrorType, data::Ptr{Nothing})
    @ Enzyme.Compiler ~/.julia/packages/Enzyme/di3zM/src/compiler.jl:2636
  [2] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{Enzyme.API.CDIFFE_TYPE}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{Nothing}, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{Bool}, augmented::Ptr{Nothing}, atomicAdd::Bool)
    @ Enzyme.API ~/.julia/packages/Enzyme/di3zM/src/api.jl:111
  [3] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}, mod::LLVM.Module, primalf::LLVM.Function, adjoint::GPUCompiler.FunctionSpec{typeof(kernel!), Tuple{Duplicated{CuDeviceMatrix{Float32, 1}}, Duplicated{CuDeviceVector{SVector{3, Float32}, 1}}, Duplicated{CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}}, Const{CuDeviceVector{Tuple{Int64, Int64}, 1}}, Const{Val{2000}}}}, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, dupClosure::Bool, wrap::Bool, modifiedBetween::Bool, returnPrimal::Bool)
    @ Enzyme.Compiler ~/.julia/packages/Enzyme/di3zM/src/compiler.jl:3271
  [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, ctx::LLVM.Context, strip::Bool, validate::Bool, only_entry::Bool, parent_job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(grad_kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}})
    @ Enzyme.Compiler ~/.julia/packages/Enzyme/di3zM/src/compiler.jl:4158
  [5] (::GPUCompiler.var"#114#117"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(grad_kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}, GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}})()
    @ GPUCompiler ~/.julia/packages/GPUCompiler/07qaN/src/driver.jl:296
  [6] get!(default::GPUCompiler.var"#114#117"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(grad_kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}, GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}}, h::Dict{GPUCompiler.CompilerJob, String}, key::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}})
    @ Base ./dict.jl:481
  [7] macro expansion
    @ ~/.julia/packages/GPUCompiler/07qaN/src/driver.jl:295 [inlined]
  [8] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, ctx::LLVM.Context)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/07qaN/src/utils.jl:68
  [9] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:353
 [10] #224
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
 [11] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(grad_kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/07qaN/src/driver.jl:76
 [12] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
 [13] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/07qaN/src/cache.jl:90
 [14] cufunction(f::typeof(grad_kernel!), tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
 [15] cufunction(f::typeof(grad_kernel!), tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
 [16] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
 [17] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25
 [18] include(fname::String)
    @ Base.MainInclude ./client.jl:476
 [19] top-level scope
    @ REPL[10]:1
 [20] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
in expression starting at /home/leios/projects/CESMIX/tests/cuda_enzyme_2.jl:110

Removing the sqrt gives the following error:

Kernel worked
warning found shared memory
ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 255)
ptxas error   : Entry function '_Z23julia_grad_kernel__736813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_I6SArrayI5TupleILi3EES0_Li1ELi3EELi1ELi1EES_IS1_IS2_ILi3EES0_Li1ELi3EELi1ELi1EES_I4AtomIS0_S0_S0_S0_ELi1ELi1EES_IS3_IS0_S0_S0_S0_ELi1ELi1EES_IS2_I5Int64S4_ELi1ELi1EE3ValILi2000EE' uses too much shared data (0xfa00 bytes, 0xc000 max)
ptxas info    : 292 bytes gmem
ptxas info    : Compiling entry function '_Z23julia_grad_kernel__736813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_I6SArrayI5TupleILi3EES0_Li1ELi3EELi1ELi1EES_IS1_IS2_ILi3EES0_Li1ELi3EELi1ELi1EES_I4AtomIS0_S0_S0_S0_ELi1ELi1EES_IS3_IS0_S0_S0_S0_ELi1ELi1EES_IS2_I5Int64S4_ELi1ELi1EE3ValILi2000EE' for 'sm_60'
ptxas info    : Function properties for _Z23julia_grad_kernel__736813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_I6SArrayI5TupleILi3EES0_Li1ELi3EELi1ELi1EES_IS1_IS2_ILi3EES0_Li1ELi3EELi1ELi1EES_I4AtomIS0_S0_S0_S0_ELi1ELi1EES_IS3_IS0_S0_S0_S0_ELi1ELi1EES_IS2_I5Int64S4_ELi1ELi1EE3ValILi2000EE
    208 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 136 registers, 64000 bytes smem, 568 bytes cmem[0], 24 bytes cmem[2]
ptxas info    : Function properties for augmented_julia_force_7475
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for diffejulia_force_7475
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for gpu_report_exception
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for gpu_signal_exception
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for julia__throw_boundserror_7454
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for julia__throw_inexacterror_7469
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for julia_steprange_last_7477
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
If you think this is a bug, please file an issue and attach /tmp/jl_dEdYJeRwVe.ptx
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:427
  [3] #224
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
  [4] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(grad_kernel!), Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/07qaN/src/driver.jl:76
  [5] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
  [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/07qaN/src/cache.jl:90
  [7] cufunction(f::typeof(grad_kernel!), tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
  [8] cufunction(f::typeof(grad_kernel!), tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Atom{Float32, Float32, Float32, Float32}, 1}, CuDeviceVector{Tuple{Int64, Int64}, 1}, Val{2000}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
 [10] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25
 [11] include(fname::String)
    @ Base.MainInclude ./client.jl:476
 [12] top-level scope
    @ REPL[12]:1
 [13] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
in expression starting at /home/leios/projects/CESMIX/tests/cuda_enzyme_2.jl:110

After halving the shmem size, I then got an issue with sync_threads():

Kernel worked
warning found shared memory
ERROR: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] isdone
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:109 [inlined]
 [3] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:139 [inlined]
 [4] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:325 [inlined]
 [5] device_synchronize()
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:319
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:54

caused by: LoadError: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuStreamSynchronize(hStream::CuStream)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
 [4] synchronize(stream::CuStream; blocking::Nothing)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:132
 [5] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:121 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:26
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [8] top-level scope
   @ REPL[3]:1
 [9] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
in expression starting at /home/leios/projects/CESMIX/tests/cuda_enzyme_2.jl:110

Just tracking my progress a bit.

@jgreener64
Contributor Author

Thanks for looking at this. The sqrt issue may be related to #426, which was fixed on main.
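
For anyone following along, switching to Enzyme main is a standard Pkg invocation, nothing Enzyme-specific:

using Pkg
Pkg.add(url="https://github.com/EnzymeAD/Enzyme.jl", rev="main")
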

Shared memory is fine, the warning (which we should probably make optional or disable) is that you now need twice the shared memory for the derivatives as well.

I guess this means that shared memory should be allocated outside the differentiated function and passed across the function boundary, like the following (which works).

using CUDA, Enzyme, Test

function mul_kernel(A, shared)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA)
    shared = CuStaticSharedArray(Float32, 64)
    d_shared = CuStaticSharedArray(Float32, 64)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA), Duplicated(shared, d_shared))
    return nothing
end

A = CUDA.ones(64,)
@cuda threads=length(A) mul_kernel(A, similar(A))
A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) grad_mul_kernel(A, dA)
@test all(dA .== 2)

@wsmoses
Member

wsmoses commented Sep 26, 2022

Shared memory should also work if declared within the kernel.

@vchuravy
Member

Yeah, static shared memory should be fine; dynamic shared memory is problematic since its size is a launch parameter
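
A minimal illustration of that distinction (a sketch, not Enzyme-specific; CuStaticSharedArray, CuDynamicSharedArray and the shmem launch keyword are standard CUDA.jl):

using CUDA

# Static: the size is a compile-time constant baked into the kernel image,
# so a differentiated kernel can allocate a matching shadow array itself.
function static_kernel(A)
    shared = CuStaticSharedArray(Float32, 64)
    i = threadIdx().x
    shared[i] = A[i]
    sync_threads()
    A[i] = shared[i]
    return nothing
end

# Dynamic: the size comes from the shmem launch parameter, which the
# differentiated kernel cannot see or double on its own.
function dynamic_kernel(A)
    shared = CuDynamicSharedArray(Float32, length(A))
    i = threadIdx().x
    shared[i] = A[i]
    sync_threads()
    A[i] = shared[i]
    return nothing
end

A = CUDA.ones(64)
@cuda threads=64 static_kernel(A)
@cuda threads=64 shmem=64 * sizeof(Float32) dynamic_kernel(A)
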

@leios

leios commented Sep 27, 2022

After messing around with this some more, I don't really think the first 2 errors I listed are all that important. sqrt works on main, and the second error does not appear if we shrink the shmem size. The third error seems to be a bit harder for me to diagnose. Here is the message again along with the segfault:

julia> include("cuda_enzyme_2.jl")
Kernel worked
warning found shared memory
ERROR: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] isdone
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:109 [inlined]
 [3] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:139 [inlined]
 [4] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:325 [inlined]
 [5] device_synchronize()
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:319
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:54

caused by: LoadError: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuStreamSynchronize(hStream::CuStream)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
 [4] synchronize(stream::CuStream; blocking::Nothing)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:132
 [5] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:121 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:26
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [8] top-level scope
   @ REPL[4]:1
 [9] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
in expression starting at /home/leios/projects/CESMIX/tests/cuda_enzyme_2.jl:111

julia> 
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f0fd0f27b1b)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f0fd0f276e2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f0fd0f27b1b)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f0fd0f276e2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f0fd0f27b1b)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f0fd0f276e2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
WARNING: Error while freeing DeviceBuffer(32.000 KiB at 0x00000008021c6600):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Atom{Float32, Float32, Float32, Float32}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Atom{Float32, Float32, Float32, Float32}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Atom{Float32, Float32, Float32, Float32}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f0fd0f27b1b)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f0fd0f276e2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
WARNING: Error while freeing DeviceBuffer(12.000 KiB at 0x00000008021c3600):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(12.000 KiB at 0x00000008021c0600):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(12.000 KiB at 0x00000008021bd600):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(1.696 MiB at 0x000000080200b000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Tuple{Int64, Int64}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Tuple{Int64, Int64}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Tuple{Int64, Int64}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(32.000 KiB at 0x0000000802003000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Atom{Float32, Float32, Float32, Float32}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Atom{Float32, Float32, Float32, Float32}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Atom{Float32, Float32, Float32, Float32}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(12.000 KiB at 0x0000000802000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] #actual_free#155
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57 [inlined]
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002cd), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuStreamDestroy_v2 at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
#10 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:86 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f0fd0f3e1ab)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_destroy! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:85
unknown function (ip: 0x7f0fd0f3dd72)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)

After messing with the kernel a bit more, I found the following lines to be problematic:

        shared_fs[1, si] = f[1]
        shared_fs[2, si] = f[2]
        shared_fs[3, si] = f[3]

When commenting out everything past these lines in the kernel, I get the following error:

Kernel worked
Grad kernel worked
ERROR: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuCtxSynchronize()
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
 [4] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:329 [inlined]
 [5] device_synchronize()
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:319
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:54

This has the same segfault as before, so it might be the cause of the other error.

Changing the lines as below fixes the issue:

        shared_fs[1, si] = 0
        shared_fs[2, si] = 0
        shared_fs[3, si] = 0

Still working on it / trying to make a Molly-free MWE.

@leios
Copy link

leios commented Sep 28, 2022

Addressing the ERROR: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE) issue: as far as I can tell, the problem comes down to the complexity of the expression for f.

These work:

f = σ
f = σ^2
f = 2*σ^2 - σ
f = ϵ * (2*σ^2 - σ)
f = (24 * ϵ * invr2) * 2

These do not:

f = (24 * ϵ * invr2) * (2 * six_term ^ 2 - six_term)
f = invr2 * (2*σ^2 - σ)

Reminder, this is the original code block:

    dr = c2 - c1
    invr2 = inv(sum(abs2, dr))
    σ = (a1.σ + a2.σ) / 2
    ϵ = sqrt(a1.ϵ * a2.ϵ)
    six_term = (σ^2 * invr2) ^ 3
    f = (24 * ϵ * invr2) * (2 * six_term ^ 2 - six_term)
    return f * dr

Note that the kernels work with no problems without Enzyme, and also work if f is not computed. Maybe there is some form of register spilling causing some variables to overflow into global memory?
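
One way to probe the register-spilling hypothesis is to compile a kernel without launching it and ask CUDA.jl how many registers and how much local memory it uses (a minimal sketch on a hypothetical toy kernel, assuming CUDA.registers and CUDA.memory behave as documented):

using CUDA

# Hypothetical stand-in kernel; the point is the two introspection calls below.
function toy_kernel!(xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        @inbounds xs[i] = sqrt(xs[i]^2 + 1f0)
    end
    return nothing
end

xs = CUDA.rand(Float32, 1024)
k = @cuda launch=false toy_kernel!(xs)  # compile only, do not launch
CUDA.registers(k)  # registers per thread
CUDA.memory(k)     # local/shared/constant usage; nonzero local memory suggests spills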

@vchuravy
Copy link
Member

Yeah, this sounds like an LLVM codegen bug where we synthesize an instruction that is not valid in this context. Can you post the MWE with the minimized example?

@leios
Copy link

leios commented Sep 28, 2022

Still working on that... I'll post it as soon as possible

@leios
Copy link

leios commented Sep 29, 2022

Ok, it's not perfect, but I stripped the code down a bit to generate a similar (but different) error:

using CUDA, Enzyme, LinearAlgebra, StaticArrays

struct Atom
    σ::Float32
    ϵ::Float32
end

@inline function force(c1, c2, a1, a2)
    dr = c2 - c1
    invr2 = 1/(sum(abs2, dr))
    σ = (a1.σ + a2.σ) / 2
    ϵ = sqrt(a1.ϵ * a2.ϵ)
    six_term = (σ^2 * invr2) ^ 3
    f = (24 * ϵ * invr2) * (2 * six_term ^ 2 - six_term)
    return f * dr
end

function kernel!(coords_var, atoms_var)
    coords = CUDA.Const(coords_var)
    atoms = CUDA.Const(atoms_var)

    tid = (blockIdx().x-1) * blockDim().x + threadIdx().x

    for j = 1:length(atoms)
        f = force(coords[tid], coords[j], atoms[tid], atoms[j])
    end

    sync_threads()

    return nothing
end

function grad_kernel!(coords, dcoords, atoms, datoms)
    Enzyme.autodiff_deferred(
        kernel!,
        Duplicated(coords, dcoords),
        Duplicated(atoms, datoms),
    )
    return
end

n_atoms = 1024
coords = [SVector((rand(Float32), rand(Float32), rand(Float32))) for i = 1:n_atoms]
atoms = [Atom(Float32(0.02 + rand() * 0.01),
              Float32(0.2 + rand() * 0.1)) for _ in 1:n_atoms]
cu_coords = CuArray(coords)
cu_atoms = CuArray(atoms)

d_cu_coords = zero(cu_coords)
d_cu_atoms = copy(cu_atoms)
n_threads = 256
n_blocks = 4

CUDA.@sync @cuda threads=n_threads blocks=n_blocks kernel!(cu_coords, cu_atoms)
println("Kernel worked")

CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(
        cu_coords, d_cu_coords, cu_atoms, d_cu_atoms)
println("Grad kernel worked")

Note that:

  1. We don't need to output / use shmem for this example
  2. Both the for loop and sync_threads() are necessary for the error to appear
  3. If we use a less complex function, the error does not trigger (a sketch follows below).
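
For example, a less complex force along these lines (a sketch for illustration, not the exact function tested) does not trigger it:

# Sketch: simplified force consistent with the working expressions listed earlier.
@inline function force_simple(c1, c2, a1, a2)
    σ = (a1.σ + a2.σ) / 2
    ϵ = sqrt(a1.ϵ * a2.ϵ)
    f = ϵ * (2 * σ^2 - σ)  # drops the invr2 / six_term chain
    return f * (c2 - c1)
end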

The error this time is:

julia> include("mwe_attempt_no_output.jl")
Kernel worked
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] isdone
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:109 [inlined]
 [3] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:139 [inlined]
 [4] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:325 [inlined]
 [5] device_synchronize()
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:319
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:54

caused by: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuStreamSynchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
 [4] synchronize(stream::CuStream; blocking::Nothing)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:132
 [5] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:121 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:26
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [8] top-level scope
   @ REPL[3]:1
 [9] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
in expression starting at /home/leios/projects/CESMIX/tests/mwe_attempt_no_output.jl:57

With the following stacktrace upon leaving Julia (CUDA finalization):

error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f2add2cc5ab)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f2add2cc2f2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f2add2cc5ab)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f2add2cc2f2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
WARNING: Error while freeing DeviceBuffer(8.000 KiB at 0x0000000802008000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] actual_free(buf::CUDA.Mem.DeviceBuffer; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#27 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f2add2cc5ab)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f2add2cc2f2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
WARNING: Error while freeing DeviceBuffer(12.000 KiB at 0x0000000802005000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] actual_free(buf::CUDA.Mem.DeviceBuffer; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(8.000 KiB at 0x0000000802003000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] actual_free(buf::CUDA.Mem.DeviceBuffer; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
WARNING: Error while freeing DeviceBuffer(12.000 KiB at 0x0000000802000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] actual_free(buf::CUDA.Mem.DeviceBuffer; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuStreamDestroy_v2 at /home/leios/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#10 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:86 [inlined]
#context!#63 at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f2add2e1d4b)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_destroy! at /home/leios/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:85
unknown function (ip: 0x7f2add2e1aa2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:720
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)

@wsmoses
Copy link
Member

wsmoses commented Sep 30, 2022

using CUDA, Enzyme

Enzyme.API.printall!(true)

struct Atom
    ϵ::Float32
end

function kernel!(atoms_var, n)
    atoms = atoms_var # CUDA.Const(atoms_var)

# Core.Intrinsics.llvmcall(("declare void @llvm.assume(i1)
# define void @f(i64 %a) alwaysinline {
#      %z = icmp sgt i64 %a, 0
#      call void @llvm.assume(i1 %z)
#      ret void
# }", "f"), Nothing, Tuple{Int64}, n)
    j = 0::Int64
    #for j = 1:length(atoms)
    while j < n
        a1 = @inbounds atoms[1].ϵ
        a2 = @inbounds atoms[j+1].ϵ

        o = a1 * a2
        @inbounds atoms[j+1] = Atom(o)
        j += 1
    end

    # sync_threads()

    return nothing
end

function grad_kernel!(atoms, datoms)
    Enzyme.autodiff_deferred(
        kernel!,
        Duplicated(atoms, datoms),
        Const(length(atoms)::Int64)
    )
    return
end

n_atoms = 1024
atoms = [Atom(Float32(0.2 + rand() * 0.1)) for _ in 1:n_atoms]
cu_atoms = CuArray(atoms)

d_cu_atoms = copy(cu_atoms)
n_threads = 256
n_blocks = 4

CUDA.@sync @cuda threads=n_threads blocks=n_blocks kernel!(cu_atoms, length(atoms)::Int64)
println("Kernel worked")

CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(
        cu_atoms, d_cu_atoms)
println("Grad kernel worked")
Kernel worked
after simplification :
; Function Attrs: mustprogress nosync willreturn
define void @preprocess_julia_kernel__4134_inner1({ i8 addrspace(1)*, i64, [1 x i64], i64 } %0, i64 signext %1) local_unnamed_addr #3 !dbg !65 {
entry:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0, !dbg !66
  %2 = call {}*** @julia.get_pgcstack() #4
  %.not12 = icmp sgt i64 %1, 0, !dbg !67
  br i1 %.not12, label %L5.i.lr.ph, label %julia_kernel__4134_inner.exit, !dbg !70

L5.i.lr.ph:                                       ; preds = %entry
  %3 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to [1 x float] addrspace(1)*
  br label %L5.i, !dbg !70

L5.i:                                             ; preds = %L5.i, %L5.i.lr.ph
  %iv = phi i64 [ %iv.next, %L5.i ], [ 0, %L5.i.lr.ph ]
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !71
  %.unpack = load float, float addrspace(1)* %3, align 4, !dbg !71, !tbaa !34
  %5 = getelementptr inbounds [1 x float], [1 x float] addrspace(1)* %4, i64 %iv, i64 0, !dbg !79
  %.unpack11 = load float, float addrspace(1)* %5, align 4, !dbg !79, !tbaa !34
  %6 = fmul float %.unpack, %.unpack11, !dbg !87
  store float %6, float addrspace(1)* %5, align 4, !dbg !89, !tbaa !34
  %exitcond.not = icmp eq i64 %iv.next, %1, !dbg !67
  br i1 %exitcond.not, label %julia_kernel__4134_inner.exit.loopexit, label %L5.i, !dbg !70

julia_kernel__4134_inner.exit.loopexit:           ; preds = %L5.i
  br label %julia_kernel__4134_inner.exit, !dbg !66

julia_kernel__4134_inner.exit:                    ; preds = %julia_kernel__4134_inner.exit.loopexit, %entry
  ret void, !dbg !66
}

; Function Attrs: mustprogress nosync willreturn
define internal void @diffejulia_kernel__4134_inner1({ i8 addrspace(1)*, i64, [1 x i64], i64 } %0, { i8 addrspace(1)*, i64, [1 x i64], i64 } %"'", i64 signext %1) local_unnamed_addr #3 !dbg !97 {
entry:
  %".fca.0.extract'ipev" = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %"'", 0, !dbg !98
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0, !dbg !98
  %.fca.0.extract1 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*, !dbg !99
  %.not12 = icmp sgt i64 %1, 0, !dbg !99
  br i1 %.not12, label %L5.i.lr.ph, label %julia_kernel__4134_inner.exit, !dbg !102

L5.i.lr.ph:                                       ; preds = %entry
  %"'ipc5" = bitcast i8 addrspace(1)* %".fca.0.extract'ipev" to float addrspace(1)*
  %2 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %"'ipc" = bitcast i8 addrspace(1)* %".fca.0.extract'ipev" to [1 x float] addrspace(1)*
  %3 = bitcast i8 addrspace(1)* %.fca.0.extract to [1 x float] addrspace(1)*
  %4 = add i64 %1, -1, !dbg !102
  %5 = add nuw nsw i64 %4, 1, !dbg !102
  %6 = add nuw i64 %4, 1, !dbg !102
  %7 = mul nuw i64 %6, 4, !dbg !102
  %8 = call noalias nonnull i8* @malloc(i64 %7), !dbg !102
  %.unpack11_malloccache = bitcast i8* %8 to float*, !dbg !102
  %9 = mul nuw nsw i64 4, %6, !dbg !102
  call void @llvm.memset.p0i8.i64(i8* %8, i8 0, i64 %9, i1 false), !dbg !102
  %10 = mul nuw i64 %6, 4, !dbg !102
  %11 = call noalias nonnull i8* @malloc(i64 %10), !dbg !102
  %.unpack_malloccache = bitcast i8* %11 to float*, !dbg !102
  %12 = mul nuw nsw i64 4, %6, !dbg !102
  call void @llvm.memset.p0i8.i64(i8* %11, i8 0, i64 %12, i1 false), !dbg !102
  br label %L5.i, !dbg !102

L5.i:                                             ; preds = %L5.i, %L5.i.lr.ph
  %iv = phi i64 [ %iv.next, %L5.i ], [ 0, %L5.i.lr.ph ]
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !103
  %.unpack = load float, float addrspace(1)* %2, align 4, !dbg !103, !tbaa !34
  %"'ipg" = getelementptr inbounds [1 x float], [1 x float] addrspace(1)* %"'ipc", i64 %iv, i64 0, !dbg !111
  %13 = getelementptr inbounds [1 x float], [1 x float] addrspace(1)* %3, i64 %iv, i64 0, !dbg !111
  %.unpack11 = load float, float addrspace(1)* %13, align 4, !dbg !111, !tbaa !34
  %14 = fmul float %.unpack, %.unpack11, !dbg !119
  store float %14, float addrspace(1)* %13, align 4, !dbg !121, !tbaa !34, !alias.scope !129, !noalias !132
  %15 = getelementptr inbounds float, float* %.unpack_malloccache, i64 %iv, !dbg !99
  store float %.unpack, float* %15, align 4, !dbg !99, !tbaa !34, !invariant.group !134
  %16 = getelementptr inbounds float, float* %.unpack11_malloccache, i64 %iv, !dbg !99
  store float %.unpack11, float* %16, align 4, !dbg !99, !tbaa !34, !invariant.group !135
  %exitcond.not = icmp eq i64 %iv.next, %1, !dbg !99
  br i1 %exitcond.not, label %julia_kernel__4134_inner.exit.loopexit, label %L5.i, !dbg !102

julia_kernel__4134_inner.exit.loopexit:           ; preds = %L5.i
  br label %julia_kernel__4134_inner.exit, !dbg !98

julia_kernel__4134_inner.exit:                    ; preds = %julia_kernel__4134_inner.exit.loopexit, %entry
  %.unpack_cache.0 = phi float* [ %.unpack_malloccache, %julia_kernel__4134_inner.exit.loopexit ], [ undef, %entry ]
  %.unpack11_cache.0 = phi float* [ %.unpack11_malloccache, %julia_kernel__4134_inner.exit.loopexit ], [ undef, %entry ]
  br label %invertjulia_kernel__4134_inner.exit, !dbg !98

invertentry:                                      ; preds = %invertjulia_kernel__4134_inner.exit, %invertL5.i.lr.ph
  ret void

invertL5.i.lr.ph:                                 ; preds = %invertL5.i
  %17 = bitcast float* %.unpack11_cache.0 to i8*
  call void @free(i8* nonnull %17), !dbg !98
  %18 = bitcast float* %.unpack_cache.0 to i8*
  call void @free(i8* nonnull %18), !dbg !98
  br label %invertentry

invertL5.i:                                       ; preds = %mergeinvertL5.i_julia_kernel__4134_inner.exit.loopexit, %incinvertL5.i
  %"iv'ac.0" = phi i64 [ %_unwrap6, %mergeinvertL5.i_julia_kernel__4134_inner.exit.loopexit ], [ %33, %incinvertL5.i ]
  %"'ipc_unwrap" = bitcast i8 addrspace(1)* %".fca.0.extract'ipev" to [1 x float] addrspace(1)*
  %"'ipg_unwrap" = getelementptr inbounds [1 x float], [1 x float] addrspace(1)* %"'ipc_unwrap", i64 %"iv'ac.0", i64 0
  %19 = load float, float addrspace(1)* %"'ipg_unwrap", align 4
  store float 0.000000e+00, float addrspace(1)* %"'ipg_unwrap", align 4, !dbg !121, !tbaa !34, !alias.scope !132, !noalias !129
  %20 = fadd fast float 0.000000e+00, %19
  %_unwrap = add i64 %1, -1
  %21 = add nuw i64 %_unwrap, 1
  %22 = getelementptr inbounds float, float* %.unpack11_cache.0, i64 %"iv'ac.0"
  %23 = load float, float* %22, align 4, !dbg !111, !tbaa !34, !invariant.group !135
  %m0diffe.unpack = fmul fast float %20, %23
  %24 = add nuw i64 %_unwrap, 1
  %25 = getelementptr inbounds float, float* %.unpack_cache.0, i64 %"iv'ac.0"
  %26 = load float, float* %25, align 4, !dbg !103, !tbaa !34, !invariant.group !134
  %m1diffe.unpack11 = fmul fast float %20, %26
  %27 = fadd fast float 0.000000e+00, %m0diffe.unpack
  %28 = fadd fast float 0.000000e+00, %m1diffe.unpack11
  %29 = atomicrmw fadd float addrspace(1)* %"'ipg_unwrap", float %28 monotonic, align 4
  %"'ipc5_unwrap" = bitcast i8 addrspace(1)* %".fca.0.extract'ipev" to float addrspace(1)*
  %30 = atomicrmw fadd float addrspace(1)* %"'ipc5_unwrap", float %27 monotonic, align 4
  %31 = icmp eq i64 %"iv'ac.0", 0
  %32 = xor i1 %31, true
  br i1 %31, label %invertL5.i.lr.ph, label %incinvertL5.i

incinvertL5.i:                                    ; preds = %invertL5.i
  %33 = add nsw i64 %"iv'ac.0", -1
  br label %invertL5.i

invertjulia_kernel__4134_inner.exit.loopexit:     ; preds = %invertjulia_kernel__4134_inner.exit
  %_unwrap6 = add i64 %1, -1
  br label %mergeinvertL5.i_julia_kernel__4134_inner.exit.loopexit

mergeinvertL5.i_julia_kernel__4134_inner.exit.loopexit: ; preds = %invertjulia_kernel__4134_inner.exit.loopexit
  br label %invertL5.i

invertjulia_kernel__4134_inner.exit:              ; preds = %julia_kernel__4134_inner.exit
  br i1 %.not12, label %invertjulia_kernel__4134_inner.exit.loopexit, label %invertentry
}

ERROR: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuStreamSynchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
 [4] synchronize(stream::CuStream; blocking::Nothing)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:132
 [5] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:121 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:26
in expression starting at /home/wmoses/git/Enzyme.jl/cuerr.jl:54
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#27 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f1c565aabab)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f1c565aa8f2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
ijl_exit at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jl_uv.c:640
exit at ./initdefs.jl:28 [inlined]
exec_options at ./client.jl:308
_start at ./client.jl:522
jfptr__start_61720.clone_1 at /home/wmoses/git/Enzyme.jl/julia-1.8.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
true_main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:575
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:719
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
unknown function (ip: 0x7f1e8959ed8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuModuleUnload at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#27 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:82 [inlined]
#context!#63 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f1c565aabab)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_unload! at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:81
unknown function (ip: 0x7f1c565aa8f2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
ijl_exit at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jl_uv.c:640
exit at ./initdefs.jl:28 [inlined]
exec_options at ./client.jl:308
_start at ./client.jl:522
jfptr__start_61720.clone_1 at /home/wmoses/git/Enzyme.jl/julia-1.8.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
true_main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:575
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:719
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
unknown function (ip: 0x7f1e8959ed8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
WARNING: Error while freeing DeviceBuffer(4.000 KiB at 0x0000000402001000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] actual_free(buf::CUDA.Mem.DeviceBuffer; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
 [16] exit
    @ ./initdefs.jl:28 [inlined]
 [17] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:308
 [18] _start()
    @ Base ./client.jl:522
WARNING: Error while freeing DeviceBuffer(4.000 KiB at 0x0000000402000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] #free#2
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/memory.jl:97 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:58 [inlined]
  [6] macro expansion
    @ ./timing.jl:382 [inlined]
  [7] actual_free(buf::CUDA.Mem.DeviceBuffer; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/pool.jl:57
  [8] #_free#173
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:375 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:348 [inlined]
 [10] macro expansion
    @ ./timing.jl:382 [inlined]
 [11] #free#172
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:347 [inlined]
 [12] #178
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:79 [inlined]
 [13] context!(f::CUDA.var"#178#179"{CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
 [14] unsafe_free!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:78
 [15] unsafe_finalize!(xs::CuArray{Atom, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:99
 [16] exit
    @ ./initdefs.jl:28 [inlined]
 [17] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:308
 [18] _start()
    @ Base ./client.jl:522
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
macro expansion at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
cuStreamDestroy_v2 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
#10 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:86 [inlined]
#context!#63 at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:164
unknown function (ip: 0x7f1c565bfc0b)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
context!##kw at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:159
unsafe_destroy! at /home/wmoses/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:85
unknown function (ip: 0x7f1c565bf962)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
run_finalizer at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:280
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:367
run_finalizers at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gc.c:410
ijl_atexit_hook at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/init.c:236
ijl_exit at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jl_uv.c:640
exit at ./initdefs.jl:28 [inlined]
exec_options at ./client.jl:308
_start at ./client.jl:522
jfptr__start_61720.clone_1 at /home/wmoses/git/Enzyme.jl/julia-1.8.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
true_main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:575
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:719
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
unknown function (ip: 0x7f1e8959ed8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)

@wsmoses
Copy link
Member

wsmoses commented Sep 30, 2022

ptr=0x7fa9f2cf2e20
ptr=0x7fa9f2cf3f60
ptr=0x7fa9f2cf50a0
ptr=0x7fa9f2cf61e0
ptr=0x7fa9f2cdb0a0
ptr=0x7fa9f2cdc1e0
ptr=0x7fa9f2ddbfe0
ptr=0x7fa9f2ddf3a0
ptr=0x7fa9f2dd0120
ptr=0x7fa9f2dbec20
ptr=0x7fa9f2dc1fe0
ptr=0x7fa9f2dc4260
ptr=0x7fa9f2daf9a0
ptr=0x7fa9f2db1c20
ptr=0x7fa9f2db2d60
ptr=0x7fa9f2db4fe0
ptr=(nil)
ptr=0x7fa9f2cea420
ptr=0x7fa9f30ad020
ptr=0x7fa9f30bc2a0
ptr=0x7fa9f30be520
ptr=0x7fa9f30c07a0
ptr=0x7fa9f30c2a20
ptr=0x7fa9f30c4ca0

Looks like you're exhausting the ability of malloc on the GPU to provide cache memory for your code; note the ptr=(nil) entry above, where a device-side malloc returned NULL.

@leios
Copy link

leios commented Oct 3, 2022

Brief update from last week. As it turns out, I created an MWE for the wrong error. To fix error 700, you need to increase the malloc heap size with:

CUDA.limit!(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE, 1*1024^3)

Putting this in your code before the kernel launch will fix the MWE posted before.
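
To double-check that the new limit took effect, the getter can be queried before launching anything (a small sketch; CUDA.limit is, to my knowledge, the read counterpart of CUDA.limit!, and the reverse pass needs the heap because the generated code calls malloc on the device, as seen in the IR above):

using CUDA

CUDA.limit!(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE, 1 * 1024^3)           # 1 GiB heap for device-side malloc
@assert CUDA.limit(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE) >= 1 * 1024^3  # confirm before any kernel launch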

The more concerning issue is error 717, which I am still attempting to create a MWE for.

@leios
Copy link

leios commented Oct 4, 2022

Well, good news: 717 will actually trigger on the previous MWE once the line from the previous comment is added to it:

using CUDA, Enzyme, LinearAlgebra, StaticArrays

CUDA.limit!(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE, 1*1024^3)
#Enzyme.API.printall!(true)

struct Atom
    σ::Float32
    ϵ::Float32
end

@noinline function force(c1, c2, a1, a2)
    dr = c2 - c1
    invr2 = 1/(sum(abs2, dr))
    σ = (a1.σ + a2.σ) / 2
    ϵ = sqrt(a1.ϵ * a2.ϵ)
    six_term = (σ^2 * invr2) ^ 3
    f = (24 * ϵ * invr2) * (2 * six_term ^ 2 - six_term)
    return f * dr
end

function kernel!(coords_var, atoms_var)
    coords = CUDA.Const(coords_var)
    atoms = CUDA.Const(atoms_var)

    tid = (blockIdx().x-1) * blockDim().x + threadIdx().x

    for j = 1:length(atoms)
        f = force(coords[tid], coords[j], atoms[tid], atoms[j])
    end

    return nothing
end

function grad_kernel!(coords, dcoords, atoms, datoms)
    Enzyme.autodiff_deferred(
        kernel!,
        Duplicated(coords, dcoords),
        Duplicated(atoms, datoms),
    )
    return
end

n_atoms = 1024
coords = [SVector((rand(Float32), rand(Float32), rand(Float32))) for i = 1:n_atoms]
atoms = [Atom(Float32(0.02 + rand() * 0.01),
              Float32(0.2 + rand() * 0.1)) for _ in 1:n_atoms]
cu_coords = CuArray(coords)
cu_atoms = CuArray(atoms)

d_cu_coords = zero(cu_coords)
d_cu_atoms = copy(cu_atoms)
n_threads = 256
n_blocks = 4

CUDA.@sync @cuda threads=n_threads blocks=n_blocks kernel!(cu_coords, cu_atoms)
println("Kernel worked")

CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(
        cu_coords, d_cu_coords, cu_atoms, d_cu_atoms)
println("Grad kernel worked")

@wsmoses
Member

wsmoses commented Oct 5, 2022

@leios I cannot reproduce the error with that code locally.

What version of Julia and Enzyme are you on (commit as relevant)?

Furthermore, would you be able to simplify the code in force to be as minimal as possible, and reduce Atom to only one parameter, if possible?

Perhaps also get rid of const?

Finally, for debugging's sake, can you post the log when you add Enzyme.API.printall!(true)?

@leios

leios commented Oct 5, 2022

Right, sorry for the missing information.

GPUs I have tested on:

Tesla V100-PCIE-16GB # Note: errors sometimes, but I cannot find an MWE for it
Tesla P100-PCIE-16GB
NVIDIA GeForce GTX 1080 Ti
(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [052768ef] CUDA v3.12.0
  [7da242da] Enzyme v0.10.6 `https://github.com/EnzymeAD/Enzyme.jl.git#main`
  [aa0f7f06] Molly v0.13.0
  [90137ffa] StaticArrays v1.5.9

I'll focus on the P100 error for now, or maybe try to find a case that also errors on the V100?

I was able to get consistent errors on the P100 and GTX 1080 with the following example:

using CUDA, Enzyme, LinearAlgebra, StaticArrays

#CUDA.device!(0) # V100 does not consistently error
CUDA.device!(2) # P100
#CUDA.device!(4) # GTX 1080

CUDA.limit!(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE, 1*1024^3)
Enzyme.API.printall!(true)

@noinline function force(c1, c2)
    dr = c2 - c1
    a = dr .- dr

    return dr
end

function kernel!(coords)
    tid = (blockIdx().x-1) * blockDim().x + threadIdx().x

    for j = 1:length(coords)
        f = force(coords[tid], coords[j])
    end

    return nothing
end

function grad_kernel!(coords, dcoords)
    Enzyme.autodiff_deferred(
        kernel!,
        Duplicated(coords, dcoords),
    )
    return
end

n_atoms = 1024
coords = [SVector((rand(Float32), rand(Float32), rand(Float32))) for i = 1:n_atoms]
cu_coords = CuArray(coords)

d_cu_coords = zero(cu_coords)
n_threads = 256
n_blocks = 4

CUDA.@sync @cuda threads=n_threads blocks=n_blocks kernel!(cu_coords)
println("Kernel worked")

CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(cu_coords, d_cu_coords)
println("Grad kernel worked")

Here is the Enzyme print and error for the P100:

julia> include("check_test_3.jl")
Kernel worked
after simplification :
; Function Attrs: mustprogress nosync willreturn
define void @preprocess_julia_kernel__5175_inner1({ i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr #6 !dbg !130 {
entry:
  %1 = alloca [1 x [3 x float]], align 4
  %2 = alloca [1 x [3 x float]], align 4
  %3 = alloca [1 x [3 x float]], align 4
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0, !dbg !131
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3, !dbg !131
  %4 = bitcast [1 x [3 x float]]* %1 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 12, i8* noundef nonnull align 4 dereferenceable(12) %4) #7
  %5 = bitcast [1 x [3 x float]]* %2 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 12, i8* noundef nonnull align 4 dereferenceable(12) %5) #7
  %6 = bitcast [1 x [3 x float]]* %3 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 12, i8* noundef nonnull align 4 dereferenceable(12) %6) #7
  %7 = call {}*** @julia.get_pgcstack() #7
  %.inv = icmp sgt i64 %.fca.3.extract, 0, !dbg !132
  %value_phi.i = select i1 %.inv, i64 %.fca.3.extract, i64 0, !dbg !132
  %8 = icmp slt i64 %value_phi.i, 1, !dbg !137
  %9 = bitcast i8 addrspace(1)* %.fca.0.extract to [1 x [3 x float]] addrspace(1)*, !dbg !141
  br i1 %8, label %julia_kernel__5175_inner.exit, label %L54.i.preheader, !dbg !141

L54.i.preheader:                                  ; preds = %entry
  %10 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #7, !dbg !142, !range !70
  %11 = add nuw nsw i32 %10, 1, !dbg !148
  %12 = zext i32 %11 to i64, !dbg !149
  %13 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #7, !dbg !151, !range !83
  %14 = zext i32 %13 to i64, !dbg !156
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #7, !dbg !158, !range !94
  %16 = zext i32 %15 to i64, !dbg !163
  %17 = mul nuw nsw i64 %14, %16, !dbg !165
  %18 = add nsw i64 %17, -1
  %19 = add nsw i64 %18, %12
  %.unpack.elt = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %19, i64 0, i64 0
  %.unpack.elt7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %19, i64 0, i64 1
  %.unpack.elt9 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %19, i64 0, i64 2
  %.fca.0.0.gep2 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 0
  %.fca.0.1.gep4 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 1
  %.fca.0.2.gep6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 2
  %.fca.0.0.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 0
  %.fca.0.1.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 1
  %.fca.0.2.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 2
  %20 = addrspacecast [1 x [3 x float]]* %1 to [1 x [3 x float]] addrspace(11)*
  %21 = addrspacecast [1 x [3 x float]]* %2 to [1 x [3 x float]] addrspace(11)*
  br label %L54.i, !dbg !167

L54.i:                                            ; preds = %L54.i, %L54.i.preheader
  %iv = phi i64 [ %iv.next, %L54.i ], [ 0, %L54.i.preheader ]
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !168
  %.unpack.unpack = load float, float addrspace(1)* %.unpack.elt, align 4, !dbg !168, !tbaa !117
  %.unpack.unpack8 = load float, float addrspace(1)* %.unpack.elt7, align 4, !dbg !168, !tbaa !117
  %.unpack.unpack10 = load float, float addrspace(1)* %.unpack.elt9, align 4, !dbg !168, !tbaa !117
  store float %.unpack.unpack, float* %.fca.0.0.gep2, align 4, !dbg !168
  store float %.unpack.unpack8, float* %.fca.0.1.gep4, align 4, !dbg !168
  store float %.unpack.unpack10, float* %.fca.0.2.gep6, align 4, !dbg !168
  %22 = add nsw i64 %iv.next, -1, !dbg !176
  %.unpack.elt12 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %22, i64 0, i64 0, !dbg !168
  %.unpack.unpack13 = load float, float addrspace(1)* %.unpack.elt12, align 4, !dbg !168, !tbaa !117
  %.unpack.elt14 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %22, i64 0, i64 1, !dbg !168
  %.unpack.unpack15 = load float, float addrspace(1)* %.unpack.elt14, align 4, !dbg !168, !tbaa !117
  %.unpack.elt16 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %22, i64 0, i64 2, !dbg !168
  %.unpack.unpack17 = load float, float addrspace(1)* %.unpack.elt16, align 4, !dbg !168, !tbaa !117
  store float %.unpack.unpack13, float* %.fca.0.0.gep, align 4, !dbg !168
  store float %.unpack.unpack15, float* %.fca.0.1.gep, align 4, !dbg !168
  store float %.unpack.unpack17, float* %.fca.0.2.gep, align 4, !dbg !168
  call fastcc void @julia_force_5194([1 x [3 x float]]* noalias nocapture nofree noundef nonnull writeonly sret([1 x [3 x float]]) align 4 dereferenceable(12) %3, [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %20, [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %21) #8, !dbg !178
  %.not19.not = icmp eq i64 %iv.next, %value_phi.i, !dbg !179
  %23 = add nuw i64 %iv.next, 1, !dbg !182
  br i1 %.not19.not, label %julia_kernel__5175_inner.exit.loopexit, label %L54.i, !dbg !167

julia_kernel__5175_inner.exit.loopexit:           ; preds = %L54.i
  br label %julia_kernel__5175_inner.exit, !dbg !183

julia_kernel__5175_inner.exit:                    ; preds = %julia_kernel__5175_inner.exit.loopexit, %entry
  call void @llvm.lifetime.end.p0i8(i64 noundef 12, i8* noundef nonnull %4) #7, !dbg !183
  call void @llvm.lifetime.end.p0i8(i64 noundef 12, i8* noundef nonnull %5) #7, !dbg !183
  call void @llvm.lifetime.end.p0i8(i64 noundef 12, i8* noundef nonnull align 4 dereferenceable(12) %6) #7, !dbg !183
  ret void, !dbg !131
}

after simplification :
; Function Attrs: argmemonly mustprogress nofree noinline nosync willreturn
define internal fastcc void @preprocess_julia_force_5194([1 x [3 x float]]* noalias nocapture nofree noundef nonnull writeonly sret([1 x [3 x float]]) align 4 dereferenceable(12) %0, [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %1, [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %2) unnamed_addr #7 !dbg !238 {
top:
  %3 = call {}*** @julia.get_pgcstack() #8
  %4 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 0, !dbg !239
  %5 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 0, !dbg !239
  %6 = load float, float addrspace(11)* %4, align 4, !dbg !246, !tbaa !30, !invariant.load !4
  %7 = load float, float addrspace(11)* %5, align 4, !dbg !246, !tbaa !30, !invariant.load !4
  %8 = fsub float %6, %7, !dbg !246
  %9 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 1, !dbg !239
  %10 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 1, !dbg !239
  %11 = load float, float addrspace(11)* %9, align 4, !dbg !246, !tbaa !30, !invariant.load !4
  %12 = load float, float addrspace(11)* %10, align 4, !dbg !246, !tbaa !30, !invariant.load !4
  %13 = fsub float %11, %12, !dbg !246
  %14 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 2, !dbg !239
  %15 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 2, !dbg !239
  %16 = load float, float addrspace(11)* %14, align 4, !dbg !246, !tbaa !30, !invariant.load !4
  %17 = load float, float addrspace(11)* %15, align 4, !dbg !246, !tbaa !30, !invariant.load !4
  %18 = fsub float %16, %17, !dbg !246
  %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 0, !dbg !247
  store float %8, float* %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx, align 4, !dbg !247
  %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 1, !dbg !247
  store float %13, float* %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6, align 4, !dbg !247
  %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 2, !dbg !247
  store float %18, float* %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7, align 4, !dbg !247
  ret void, !dbg !247
}

; Function Attrs: argmemonly mustprogress nofree noinline nosync willreturn
define internal fastcc void @augmented_julia_force_5194([1 x [3 x float]]* noalias nocapture nofree noundef nonnull writeonly align 4 dereferenceable(12) "enzyme_sret" %0, [1 x [3 x float]]* nocapture "enzyme_sret" %"'", [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %1, [1 x [3 x float]] addrspace(11)* nocapture %"'1", [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %2, [1 x [3 x float]] addrspace(11)* nocapture %"'2") unnamed_addr #7 !dbg !248 {
top:
  %3 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 0, !dbg !249
  %4 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 0, !dbg !249
  %5 = load float, float addrspace(11)* %3, align 4, !dbg !256, !tbaa !30, !invariant.load !4
  %6 = load float, float addrspace(11)* %4, align 4, !dbg !256, !tbaa !30, !invariant.load !4
  %7 = fsub float %5, %6, !dbg !256
  %8 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 1, !dbg !249
  %9 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 1, !dbg !249
  %10 = load float, float addrspace(11)* %8, align 4, !dbg !256, !tbaa !30, !invariant.load !4
  %11 = load float, float addrspace(11)* %9, align 4, !dbg !256, !tbaa !30, !invariant.load !4
  %12 = fsub float %10, %11, !dbg !256
  %13 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %2, i64 0, i64 0, i64 2, !dbg !249
  %14 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %1, i64 0, i64 0, i64 2, !dbg !249
  %15 = load float, float addrspace(11)* %13, align 4, !dbg !256, !tbaa !30, !invariant.load !4
  %16 = load float, float addrspace(11)* %14, align 4, !dbg !256, !tbaa !30, !invariant.load !4
  %17 = fsub float %15, %16, !dbg !256
  %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 0, !dbg !257
  store float %7, float* %.sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx, align 4, !dbg !257, !alias.scope !258, !noalias !261
  %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 1, !dbg !257
  store float %12, float* %.sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6, align 4, !dbg !257, !alias.scope !258, !noalias !261
  %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %0, i64 0, i64 0, i64 2, !dbg !257
  store float %17, float* %.sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7, align 4, !dbg !257, !alias.scope !258, !noalias !261
  ret void, !dbg !257
}

; Function Attrs: argmemonly mustprogress nofree noinline nosync willreturn
define internal fastcc void @diffejulia_force_5194([1 x [3 x float]]* noalias nocapture nofree noundef nonnull writeonly align 4 dereferenceable(12) "enzyme_sret" %0, [1 x [3 x float]]* nocapture "enzyme_sret" %"'", [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %1, [1 x [3 x float]] addrspace(11)* nocapture %"'1", [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %2, [1 x [3 x float]] addrspace(11)* nocapture %"'2") unnamed_addr #7 !dbg !263 {
top:
  %"'ipg26" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %"'2", i64 0, i64 0, i64 0, !dbg !264
  %"'ipg24" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %"'1", i64 0, i64 0, i64 0, !dbg !264
  %"'ipg17" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %"'2", i64 0, i64 0, i64 1, !dbg !264
  %"'ipg15" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %"'1", i64 0, i64 0, i64 1, !dbg !264
  %"'ipg8" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %"'2", i64 0, i64 0, i64 2, !dbg !264
  %"'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(11)* %"'1", i64 0, i64 0, i64 2, !dbg !264
  %".sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'", i64 0, i64 0, i64 0, !dbg !271
  %".sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'", i64 0, i64 0, i64 1, !dbg !271
  %".sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'", i64 0, i64 0, i64 2, !dbg !271
  br label %inverttop, !dbg !271

inverttop:                                        ; preds = %top
  %3 = load float, float* %".sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7'ipg", align 4
  store float 0.000000e+00, float* %".sroa.0.sroa.3.0..sroa.0.0..sroa_cast1.sroa_idx7'ipg", align 4, !dbg !271, !alias.scope !272, !noalias !275
  %4 = fadd fast float 0.000000e+00, %3
  %5 = load float, float* %".sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6'ipg", align 4
  store float 0.000000e+00, float* %".sroa.0.sroa.2.0..sroa.0.0..sroa_cast1.sroa_idx6'ipg", align 4, !dbg !271, !alias.scope !272, !noalias !275
  %6 = fadd fast float 0.000000e+00, %5
  %7 = load float, float* %".sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx'ipg", align 4
  store float 0.000000e+00, float* %".sroa.0.sroa.0.0..sroa.0.0..sroa_cast1.sroa_idx'ipg", align 4, !dbg !271, !alias.scope !272, !noalias !275
  %8 = fadd fast float 0.000000e+00, %7
  %9 = fneg fast float %4
  %10 = fadd fast float 0.000000e+00, %4
  %11 = fadd fast float 0.000000e+00, %9
  %12 = atomicrmw fadd float addrspace(11)* %"'ipg", float %11 monotonic, align 4
  %13 = atomicrmw fadd float addrspace(11)* %"'ipg8", float %10 monotonic, align 4
  %14 = fneg fast float %6
  %15 = fadd fast float 0.000000e+00, %6
  %16 = fadd fast float 0.000000e+00, %14
  %17 = atomicrmw fadd float addrspace(11)* %"'ipg15", float %16 monotonic, align 4
  %18 = atomicrmw fadd float addrspace(11)* %"'ipg17", float %15 monotonic, align 4
  %19 = fneg fast float %8
  %20 = fadd fast float 0.000000e+00, %8
  %21 = fadd fast float 0.000000e+00, %19
  %22 = atomicrmw fadd float addrspace(11)* %"'ipg24", float %21 monotonic, align 4
  %23 = atomicrmw fadd float addrspace(11)* %"'ipg26", float %20 monotonic, align 4
  ret void
}

; Function Attrs: mustprogress nosync willreturn
define internal void @diffejulia_kernel__5175_inner1({ i8 addrspace(1)*, i64, [1 x i64], i64 } %0, { i8 addrspace(1)*, i64, [1 x i64], i64 } %"'") local_unnamed_addr #6 !dbg !184 {
entry:
  %"'ipa5" = alloca [1 x [3 x float]], align 4
  store [1 x [3 x float]] zeroinitializer, [1 x [3 x float]]* %"'ipa5", align 4
  %1 = alloca [1 x [3 x float]], align 4
  %"'ipa7" = alloca [1 x [3 x float]], align 4
  store [1 x [3 x float]] zeroinitializer, [1 x [3 x float]]* %"'ipa7", align 4
  %2 = alloca [1 x [3 x float]], align 4
  %"'ipa" = alloca [1 x [3 x float]], align 4
  store [1 x [3 x float]] zeroinitializer, [1 x [3 x float]]* %"'ipa", align 4
  %3 = alloca [1 x [3 x float]], align 4
  %".fca.0.extract'ipev" = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %"'", 0, !dbg !185
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0, !dbg !185
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3, !dbg !185
  %.inv = icmp sgt i64 %.fca.3.extract, 0, !dbg !186
  %value_phi.i = select i1 %.inv, i64 %.fca.3.extract, i64 0, !dbg !186
  %4 = icmp slt i64 %value_phi.i, 1, !dbg !191
  %"'ipc9" = bitcast i8 addrspace(1)* %".fca.0.extract'ipev" to [1 x [3 x float]] addrspace(1)*, !dbg !195
  %5 = bitcast i8 addrspace(1)* %.fca.0.extract to [1 x [3 x float]] addrspace(1)*, !dbg !195
  br i1 %4, label %julia_kernel__5175_inner.exit, label %L54.i.preheader, !dbg !195

L54.i.preheader:                                  ; preds = %entry
  %6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #8, !dbg !196, !range !70
  %7 = add nuw nsw i32 %6, 1, !dbg !202
  %8 = zext i32 %7 to i64, !dbg !203
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #8, !dbg !205, !range !83
  %10 = zext i32 %9 to i64, !dbg !210
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #8, !dbg !212, !range !94
  %12 = zext i32 %11 to i64, !dbg !217
  %13 = mul nuw nsw i64 %10, %12, !dbg !219
  %14 = add nsw i64 %13, -1
  %15 = add nsw i64 %14, %8
  %".unpack.elt'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %15, i64 0, i64 0
  %.unpack.elt = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %5, i64 %15, i64 0, i64 0
  %".unpack.elt7'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %15, i64 0, i64 1
  %.unpack.elt7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %5, i64 %15, i64 0, i64 1
  %".unpack.elt9'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %15, i64 0, i64 2
  %.unpack.elt9 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %5, i64 %15, i64 0, i64 2
  %".fca.0.0.gep2'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa5", i64 0, i64 0, i64 0
  %.fca.0.0.gep2 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 0
  %".fca.0.1.gep4'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa5", i64 0, i64 0, i64 1
  %.fca.0.1.gep4 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 1
  %".fca.0.2.gep6'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa5", i64 0, i64 0, i64 2
  %.fca.0.2.gep6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 2
  %".fca.0.0.gep'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa7", i64 0, i64 0, i64 0
  %.fca.0.0.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 0
  %".fca.0.1.gep'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa7", i64 0, i64 0, i64 1
  %.fca.0.1.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 1
  %".fca.0.2.gep'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa7", i64 0, i64 0, i64 2
  %.fca.0.2.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 2
  %"'ipc" = addrspacecast [1 x [3 x float]]* %"'ipa5" to [1 x [3 x float]] addrspace(11)*
  %16 = addrspacecast [1 x [3 x float]]* %1 to [1 x [3 x float]] addrspace(11)*
  %"'ipc8" = addrspacecast [1 x [3 x float]]* %"'ipa7" to [1 x [3 x float]] addrspace(11)*
  %17 = addrspacecast [1 x [3 x float]]* %2 to [1 x [3 x float]] addrspace(11)*
  %18 = add nsw i64 %value_phi.i, -1, !dbg !221
  br label %L54.i, !dbg !221

L54.i:                                            ; preds = %L54.i, %L54.i.preheader
  %iv = phi i64 [ %iv.next, %L54.i ], [ 0, %L54.i.preheader ]
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !222
  %.unpack.unpack = load float, float addrspace(1)* %.unpack.elt, align 4, !dbg !222, !tbaa !117
  %.unpack.unpack8 = load float, float addrspace(1)* %.unpack.elt7, align 4, !dbg !222, !tbaa !117
  %.unpack.unpack10 = load float, float addrspace(1)* %.unpack.elt9, align 4, !dbg !222, !tbaa !117
  store float %.unpack.unpack, float* %.fca.0.0.gep2, align 4, !dbg !222, !alias.scope !230, !noalias !233
  store float %.unpack.unpack8, float* %.fca.0.1.gep4, align 4, !dbg !222, !alias.scope !230, !noalias !233
  store float %.unpack.unpack10, float* %.fca.0.2.gep6, align 4, !dbg !222, !alias.scope !230, !noalias !233
  %19 = add nsw i64 %iv.next, -1, !dbg !235
  %".unpack.elt12'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %19, i64 0, i64 0, !dbg !222
  %.unpack.elt12 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %5, i64 %19, i64 0, i64 0, !dbg !222
  %.unpack.unpack13 = load float, float addrspace(1)* %.unpack.elt12, align 4, !dbg !222, !tbaa !117
  %".unpack.elt14'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %19, i64 0, i64 1, !dbg !222
  %.unpack.elt14 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %5, i64 %19, i64 0, i64 1, !dbg !222
  %.unpack.unpack15 = load float, float addrspace(1)* %.unpack.elt14, align 4, !dbg !222, !tbaa !117
  %".unpack.elt16'ipg" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %19, i64 0, i64 2, !dbg !222
  %.unpack.elt16 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %5, i64 %19, i64 0, i64 2, !dbg !222
  %.unpack.unpack17 = load float, float addrspace(1)* %.unpack.elt16, align 4, !dbg !222, !tbaa !117
  store float %.unpack.unpack13, float* %.fca.0.0.gep, align 4, !dbg !222, !alias.scope !230, !noalias !233
  store float %.unpack.unpack15, float* %.fca.0.1.gep, align 4, !dbg !222, !alias.scope !230, !noalias !233
  store float %.unpack.unpack17, float* %.fca.0.2.gep, align 4, !dbg !222, !alias.scope !230, !noalias !233
  call fastcc void @augmented_julia_force_5194([1 x [3 x float]]* "enzyme_sret" %3, [1 x [3 x float]]* "enzyme_sret" %"'ipa", [1 x [3 x float]] addrspace(11)* %16, [1 x [3 x float]] addrspace(11)* %"'ipc", [1 x [3 x float]] addrspace(11)* %17, [1 x [3 x float]] addrspace(11)* %"'ipc8"), !dbg !237
  %.not19.not = icmp eq i64 %iv.next, %value_phi.i, !dbg !238
  br i1 %.not19.not, label %julia_kernel__5175_inner.exit.loopexit, label %L54.i, !dbg !221

julia_kernel__5175_inner.exit.loopexit:           ; preds = %L54.i
  br label %julia_kernel__5175_inner.exit, !dbg !241

julia_kernel__5175_inner.exit:                    ; preds = %julia_kernel__5175_inner.exit.loopexit, %entry
  br label %invertjulia_kernel__5175_inner.exit, !dbg !185

invertentry:                                      ; preds = %invertjulia_kernel__5175_inner.exit, %invertL54.i.preheader
  ret void

invertL54.i.preheader:                            ; preds = %invertL54.i
  br label %invertentry

invertL54.i:                                      ; preds = %mergeinvertL54.i_julia_kernel__5175_inner.exit.loopexit, %incinvertL54.i
  %"iv'ac.0" = phi i64 [ %_unwrap18, %mergeinvertL54.i_julia_kernel__5175_inner.exit.loopexit ], [ %43, %incinvertL54.i ]
  %_unwrap = addrspacecast [1 x [3 x float]]* %1 to [1 x [3 x float]] addrspace(11)*
  %"'ipc_unwrap" = addrspacecast [1 x [3 x float]]* %"'ipa5" to [1 x [3 x float]] addrspace(11)*
  %_unwrap6 = addrspacecast [1 x [3 x float]]* %2 to [1 x [3 x float]] addrspace(11)*
  %"'ipc8_unwrap" = addrspacecast [1 x [3 x float]]* %"'ipa7" to [1 x [3 x float]] addrspace(11)*
  call fastcc void @diffejulia_force_5194([1 x [3 x float]]* "enzyme_sret" null, [1 x [3 x float]]* "enzyme_sret" %"'ipa", [1 x [3 x float]] addrspace(11)* %_unwrap, [1 x [3 x float]] addrspace(11)* %"'ipc_unwrap", [1 x [3 x float]] addrspace(11)* %_unwrap6, [1 x [3 x float]] addrspace(11)* %"'ipc8_unwrap"), !dbg !237
  %".fca.0.2.gep'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa7", i64 0, i64 0, i64 2
  %20 = load float, float* %".fca.0.2.gep'ipg_unwrap", align 4
  store float 0.000000e+00, float* %".fca.0.2.gep'ipg_unwrap", align 4, !dbg !222, !alias.scope !233, !noalias !230
  %21 = fadd fast float 0.000000e+00, %20
  %".fca.0.1.gep'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa7", i64 0, i64 0, i64 1
  %22 = load float, float* %".fca.0.1.gep'ipg_unwrap", align 4
  store float 0.000000e+00, float* %".fca.0.1.gep'ipg_unwrap", align 4, !dbg !222, !alias.scope !233, !noalias !230
  %23 = fadd fast float 0.000000e+00, %22
  %".fca.0.0.gep'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa7", i64 0, i64 0, i64 0
  %24 = load float, float* %".fca.0.0.gep'ipg_unwrap", align 4
  store float 0.000000e+00, float* %".fca.0.0.gep'ipg_unwrap", align 4, !dbg !222, !alias.scope !233, !noalias !230
  %25 = fadd fast float 0.000000e+00, %24
  %iv.next_unwrap = add nuw nsw i64 %"iv'ac.0", 1
  %_unwrap10 = add nsw i64 %iv.next_unwrap, -1
  %".unpack.elt16'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %_unwrap10, i64 0, i64 2
  %26 = atomicrmw fadd float addrspace(1)* %".unpack.elt16'ipg_unwrap", float %21 monotonic, align 4
  %".unpack.elt14'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %_unwrap10, i64 0, i64 1
  %27 = atomicrmw fadd float addrspace(1)* %".unpack.elt14'ipg_unwrap", float %23 monotonic, align 4
  %".unpack.elt12'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %_unwrap10, i64 0, i64 0
  %28 = atomicrmw fadd float addrspace(1)* %".unpack.elt12'ipg_unwrap", float %25 monotonic, align 4
  %".fca.0.2.gep6'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa5", i64 0, i64 0, i64 2
  %29 = load float, float* %".fca.0.2.gep6'ipg_unwrap", align 4
  store float 0.000000e+00, float* %".fca.0.2.gep6'ipg_unwrap", align 4, !dbg !222, !alias.scope !233, !noalias !230
  %30 = fadd fast float 0.000000e+00, %29
  %".fca.0.1.gep4'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa5", i64 0, i64 0, i64 1
  %31 = load float, float* %".fca.0.1.gep4'ipg_unwrap", align 4
  store float 0.000000e+00, float* %".fca.0.1.gep4'ipg_unwrap", align 4, !dbg !222, !alias.scope !233, !noalias !230
  %32 = fadd fast float 0.000000e+00, %31
  %".fca.0.0.gep2'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %"'ipa5", i64 0, i64 0, i64 0
  %33 = load float, float* %".fca.0.0.gep2'ipg_unwrap", align 4
  store float 0.000000e+00, float* %".fca.0.0.gep2'ipg_unwrap", align 4, !dbg !222, !alias.scope !233, !noalias !230
  %34 = fadd fast float 0.000000e+00, %33
  %35 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #8, !dbg !205
  %_unwrap11 = zext i32 %35 to i64
  %36 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #8, !dbg !212
  %_unwrap12 = zext i32 %36 to i64
  %_unwrap13 = mul nuw nsw i64 %_unwrap11, %_unwrap12
  %_unwrap14 = add nsw i64 %_unwrap13, -1
  %37 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #8, !dbg !196
  %_unwrap15 = add nuw nsw i32 %37, 1
  %_unwrap16 = zext i32 %_unwrap15 to i64
  %_unwrap17 = add nsw i64 %_unwrap14, %_unwrap16
  %".unpack.elt9'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %_unwrap17, i64 0, i64 2
  %38 = atomicrmw fadd float addrspace(1)* %".unpack.elt9'ipg_unwrap", float %30 monotonic, align 4
  %".unpack.elt7'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %_unwrap17, i64 0, i64 1
  %39 = atomicrmw fadd float addrspace(1)* %".unpack.elt7'ipg_unwrap", float %32 monotonic, align 4
  %".unpack.elt'ipg_unwrap" = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %"'ipc9", i64 %_unwrap17, i64 0, i64 0
  %40 = atomicrmw fadd float addrspace(1)* %".unpack.elt'ipg_unwrap", float %34 monotonic, align 4
  %41 = icmp eq i64 %"iv'ac.0", 0
  %42 = xor i1 %41, true
  br i1 %41, label %invertL54.i.preheader, label %incinvertL54.i

incinvertL54.i:                                   ; preds = %invertL54.i
  %43 = add nsw i64 %"iv'ac.0", -1
  br label %invertL54.i

invertjulia_kernel__5175_inner.exit.loopexit:     ; preds = %invertjulia_kernel__5175_inner.exit
  %_unwrap18 = add nsw i64 %value_phi.i, -1
  br label %mergeinvertL54.i_julia_kernel__5175_inner.exit.loopexit

mergeinvertL54.i_julia_kernel__5175_inner.exit.loopexit: ; preds = %invertjulia_kernel__5175_inner.exit.loopexit
  br label %invertL54.i

invertjulia_kernel__5175_inner.exit:              ; preds = %julia_kernel__5175_inner.exit
  br i1 %4, label %invertentry, label %invertjulia_kernel__5175_inner.exit.loopexit
}

ERROR: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] isdone
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:109 [inlined]
 [3] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:139 [inlined]
 [4] nonblocking_synchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:325 [inlined]
 [5] device_synchronize()
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:319
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:54

caused by: LoadError: CUDA error: operation not supported on global/shared address space (code 717, ERROR_INVALID_ADDRESS_SPACE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuStreamSynchronize
   @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
 [4] synchronize(stream::CuStream; blocking::Nothing)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:132
 [5] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/stream.jl:121 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:26
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [8] top-level scope
   @ REPL[1]:1
 [9] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
in expression starting at /home/leios/projects/CESMIX/tests/check_test_3.jl:46

@wsmoses
Member

wsmoses commented Oct 5, 2022

Can you attempt to get rid of the broadcasts? (Those hit recursive and allocating functions.)

@leios

leios commented Oct 6, 2022

Ah, it seems to be the broadcast causing the error.

This kernel errors:

@noinline function force(c1, c2)
    dr = c2 .- c1

    return dr
end

This one does not:

@noinline function force(c1, c2)
    dr = c2 - c1

    return dr
end

So I guess the short-term fix here is to just avoid broadcasts and set the malloc heap size to be high?

@wsmoses
Member

wsmoses commented Oct 6, 2022

I mean, we still want to figure out the root cause (avoiding broadcasts isn't necessarily a fix, though since broadcasting allocates, avoiding it can be good practice). What if you manually create a new array of the appropriate size and use a for loop to set it with the elements of c2 - c1?
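
Something like the following, for the 3-element case (a sketch of that suggestion, not from the thread; MVector is StaticArrays' mutable fixed-size type):

@noinline function force(c1, c2)
    dr = MVector{3, Float32}(undef)  # mutable scratch instead of broadcast
    for k in 1:3
        dr[k] = c2[k] - c1[k]
    end
    return SVector(Tuple(dr))        # convert back to the immutable SVector
end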

@wsmoses
Member

wsmoses commented Oct 6, 2022

Alternatively, if that doesn't work, you can perhaps inline the definitions of the various broadcast calls by hand (I find @edit, @which, etc. very useful in those circumstances).
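
For instance, a CPU-side starting point for that peeling (a sketch; the names mirror the failing kernel):

julia> using StaticArrays, InteractiveUtils

julia> a = SVector(0f0, 0f0, 0f0);

julia> bc = Base.broadcasted(-, a, a);

julia> @which Base.Broadcast.materialize(bc)

Each @which / @edit hop points at the next method whose body can be copied into force by hand.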

@leios

leios commented Oct 6, 2022

Ok, working on it... From what I can tell, it's breaking on _bcsm(a, b) in Base.Broadcast.

Relevant functions in Base.Broadcast:

broadcast(f::Tf, As...) where {Tf} = materialize(broadcasted(f, As...))

materialize(bc::Broadcasted) = copy(instantiate(bc)) # instantiate in StaticArrays below...

broadcast_shape(shape::Tuple, shape1::Tuple, shapes::Tuple...) = broadcast_shape(_bcs(shape, shape1), shapes...)

function _bcs(shape::Tuple, newshape::Tuple)
    return (_bcs1(shape[1], newshape[1]), _bcs(tail(shape), tail(newshape))...)
end

_bcs1(a, b) = _bcsm(b, a) ? axistype(b, a) : (_bcsm(a, b) ? axistype(a, b) : throw(DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths $(length(a)) and $(length(b))")))

_bcsm(a, b) = a == b || length(b) == 1

axistype(a::T, b::T) where T = a

Relevant functions in StaticArrays/src/broadcast.jl

# instantiate overload
@inline function instantiate(B::Broadcasted{StaticArrayStyle{M}}) where M
    if B.axes isa Tuple{Vararg{SOneTo}} || B.axes isa Tuple && length(B.axes) > M
        return invoke(instantiate, Tuple{Broadcasted}, B)
    elseif B.axes isa Nothing
        ax = static_combine_axes(B.args...)
        return Broadcasted{StaticArrayStyle{M}}(B.f, B.args, ax)
    else
        # We need to update B.axes for `broadcast!` if it's not static and `ndims(dest) < M`.
        ax = static_check_broadcast_shape(B.axes, static_combine_axes(B.args...))
        return Broadcasted{StaticArrayStyle{M}}(B.f, B.args, ax)
    end
end

@inline static_combine_axes(A, B...) = broadcast_shape(static_axes(A), static_combine_axes(B...)) # broadcast shape in Base above...

Attempted functions:

@noinline function force(c1, c2)
    # works
    #dr = SVector((c1[1]-c2[1], c1[2] - c2[2], c1[3] - c2[3]))
    #dr_2 = copy(dr)

    # works
    dr = Base.broadcasted(-, c2, c1)

    # doesn't work
    #dr_2 = Base.Broadcast.materialize(dr)

    # doesn't work
    #dr_instantiated = Base.Broadcast.instantiate(dr)

    # doesn't work
    #dr_static_combine = StaticArrays.static_combine_axes(dr.args...)

    # works
    ax1 = StaticArrays.static_axes(dr.args[1])

    # also works
    ax2 = StaticArrays.static_axes(dr.args[2])

    # works
    dr_axis_type = Base.Broadcast.axistype(ax1, ax2)

    # doesn't work
    #dr_bcsm = Base.Broadcast._bcsm(ax1, ax2)

    # doesn't work
    #dr_bcsm = ax1 == ax2 || length(ax2) == 1

    # doesn't work
    dr_bcsm = ax1 == ax2
    # all likely fail as they rely on _bcsm
    #dr_bcs1 = Base.Broadcast._bcs1(ax1, ax2)
    #dr_bcs = Base.Broadcast._bcs(ax1, ax2)
    #dr_broadcast_shape = Base.Broadcast.broadcast_shape(ax1, ax2)

    return dr
end

Smallest force function:

@noinline function force(c1, c2)
    dr = Base.broadcasted(-, c2, c1)

    ax1 = StaticArrays.static_axes(dr.args[1])
    ax2 = StaticArrays.static_axes(dr.args[2])

    dr_bcsm = ax1 == ax2

    return dr
end

Enzyme print:

julia> include("check_test_5.jl")
Kernel worked
after simplification :
; Function Attrs: mustprogress nosync willreturn
define void @preprocess_julia_kernel__4400_inner1({ i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr #6 !dbg !120 {
entry:
  %1 = alloca [1 x [3 x float]], align 4
  %2 = alloca [1 x [3 x float]], align 4
  %3 = alloca { [2 x [1 x [3 x float]]] }, align 8
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0, !dbg !121
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3, !dbg !121
  %4 = bitcast [1 x [3 x float]]* %1 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 12, i8* noundef nonnull align 4 dereferenceable(12) %4) #7
  %5 = bitcast [1 x [3 x float]]* %2 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 12, i8* noundef nonnull align 4 dereferenceable(12) %5) #7
  %6 = bitcast { [2 x [1 x [3 x float]]] }* %3 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 24, i8* noundef nonnull align 8 dereferenceable(24) %6) #7
  %7 = call {}*** @julia.get_pgcstack() #7
  %.inv = icmp sgt i64 %.fca.3.extract, 0, !dbg !122
  %value_phi.i = select i1 %.inv, i64 %.fca.3.extract, i64 0, !dbg !122
  %8 = icmp slt i64 %value_phi.i, 1, !dbg !127
  %9 = bitcast i8 addrspace(1)* %.fca.0.extract to [1 x [3 x float]] addrspace(1)*, !dbg !131
  br i1 %8, label %julia_kernel__4400_inner.exit, label %L54.i.preheader, !dbg !131

L54.i.preheader:                                  ; preds = %entry
  %10 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #7, !dbg !132, !range !60
  %11 = add nuw nsw i32 %10, 1, !dbg !138
  %12 = zext i32 %11 to i64, !dbg !139
  %13 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #7, !dbg !141, !range !73
  %14 = zext i32 %13 to i64, !dbg !146
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #7, !dbg !148, !range !84
  %16 = zext i32 %15 to i64, !dbg !153
  %17 = mul nuw nsw i64 %14, %16, !dbg !155
  %18 = add nsw i64 %17, -1
  %19 = add nsw i64 %18, %12
  %.unpack.elt = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %19, i64 0, i64 0
  %.unpack.elt7 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %19, i64 0, i64 1
  %.unpack.elt9 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %19, i64 0, i64 2
  %.fca.0.0.gep2 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 0
  %.fca.0.1.gep4 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 1
  %.fca.0.2.gep6 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %1, i64 0, i64 0, i64 2
  %.fca.0.0.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 0
  %.fca.0.1.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 1
  %.fca.0.2.gep = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]]* %2, i64 0, i64 0, i64 2
  %20 = addrspacecast [1 x [3 x float]]* %1 to [1 x [3 x float]] addrspace(11)*
  %21 = addrspacecast [1 x [3 x float]]* %2 to [1 x [3 x float]] addrspace(11)*
  br label %L54.i, !dbg !157

L54.i:                                            ; preds = %L54.i, %L54.i.preheader
  %iv = phi i64 [ %iv.next, %L54.i ], [ 0, %L54.i.preheader ]
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !158
  %.unpack.unpack = load float, float addrspace(1)* %.unpack.elt, align 4, !dbg !158, !tbaa !107
  %.unpack.unpack8 = load float, float addrspace(1)* %.unpack.elt7, align 4, !dbg !158, !tbaa !107
  %.unpack.unpack10 = load float, float addrspace(1)* %.unpack.elt9, align 4, !dbg !158, !tbaa !107
  store float %.unpack.unpack, float* %.fca.0.0.gep2, align 4, !dbg !158
  store float %.unpack.unpack8, float* %.fca.0.1.gep4, align 4, !dbg !158
  store float %.unpack.unpack10, float* %.fca.0.2.gep6, align 4, !dbg !158
  %22 = add nsw i64 %iv.next, -1, !dbg !166
  %.unpack.elt12 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %22, i64 0, i64 0, !dbg !158
  %.unpack.unpack13 = load float, float addrspace(1)* %.unpack.elt12, align 4, !dbg !158, !tbaa !107
  %.unpack.elt14 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %22, i64 0, i64 1, !dbg !158
  %.unpack.unpack15 = load float, float addrspace(1)* %.unpack.elt14, align 4, !dbg !158, !tbaa !107
  %.unpack.elt16 = getelementptr inbounds [1 x [3 x float]], [1 x [3 x float]] addrspace(1)* %9, i64 %22, i64 0, i64 2, !dbg !158
  %.unpack.unpack17 = load float, float addrspace(1)* %.unpack.elt16, align 4, !dbg !158, !tbaa !107
  store float %.unpack.unpack13, float* %.fca.0.0.gep, align 4, !dbg !158
  store float %.unpack.unpack15, float* %.fca.0.1.gep, align 4, !dbg !158
  store float %.unpack.unpack17, float* %.fca.0.2.gep, align 4, !dbg !158
  call fastcc void @julia_force_4419({ [2 x [1 x [3 x float]]] }* noalias nocapture nofree noundef nonnull writeonly sret({ [2 x [1 x [3 x float]]] }) align 8 dereferenceable(24) %3, [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %20, [1 x [3 x float]] addrspace(11)* nocapture nofree noundef nonnull readonly align 4 dereferenceable(12) %21) #6, !dbg !168
  %.not19.not = icmp eq i64 %iv.next, %value_phi.i, !dbg !169
  %23 = add nuw i64 %iv.next, 1, !dbg !172
  br i1 %.not19.not, label %julia_kernel__4400_inner.exit.loopexit, label %L54.i, !dbg !157

julia_kernel__4400_inner.exit.loopexit:           ; preds = %L54.i
  br label %julia_kernel__4400_inner.exit, !dbg !173

julia_kernel__4400_inner.exit:                    ; preds = %julia_kernel__4400_inner.exit.loopexit, %entry
  call void @llvm.lifetime.end.p0i8(i64 noundef 12, i8* noundef nonnull %4) #7, !dbg !173
  call void @llvm.lifetime.end.p0i8(i64 noundef 12, i8* noundef nonnull %5) #7, !dbg !173
  call void @llvm.lifetime.end.p0i8(i64 noundef 24, i8* noundef nonnull align 8 dereferenceable(24) %6) #7, !dbg !173
  ret void, !dbg !121
}

Segfault:

signal (11): Segmentation fault
in expression starting at /home/leios/projects/CESMIX/tests/check_test_5.jl:86
unsafe_store! at ./pointer.jl:118 [inlined]
unsafe_store! at ./pointer.jl:118
unknown function (ip: 0x7fd1ecd8945c)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
julia_allocator at /home/leios/.julia/packages/Enzyme/wYJeS/src/compiler.jl:3676
unknown function (ip: 0x7fd1ecd88616)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
julia_allocator at /home/leios/.julia/packages/Enzyme/wYJeS/src/compiler.jl:3535
unknown function (ip: 0x7fd20bdad3fc)
CreateAllocation at /workspace/srcdir/Enzyme/enzyme/Enzyme/Utils.cpp:258
UpgradeAllocasToMallocs at /workspace/srcdir/Enzyme/enzyme/Enzyme/FunctionUtils.cpp:431
preprocessForClone at /workspace/srcdir/Enzyme/enzyme/Enzyme/FunctionUtils.cpp:1710
CloneFunctionWithReturns at /workspace/srcdir/Enzyme/enzyme/Enzyme/FunctionUtils.cpp:1998
CreateFromClone at /workspace/srcdir/Enzyme/enzyme/Enzyme/GradientUtils.cpp:3447
CreateAugmentedPrimal at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:2097
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:12114
delegateCallInst at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:302 [inlined]
visitCall at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/Instruction.def:209 [inlined]
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/Instruction.def:209
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:112 [inlined]
CreatePrimalAndGradient at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:3947
EnzymeCreatePrimalAndGradient at /workspace/srcdir/Enzyme/enzyme/Enzyme/CApi.cpp:473
EnzymeCreatePrimalAndGradient at /home/leios/.julia/packages/Enzyme/wYJeS/src/api.jl:118
enzyme! at /home/leios/.julia/packages/Enzyme/wYJeS/src/compiler.jl:4563
unknown function (ip: 0x7fd1ecd64e62)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
#codegen#120 at /home/leios/.julia/packages/Enzyme/wYJeS/src/compiler.jl:5655
codegen##kw at /home/leios/.julia/packages/Enzyme/wYJeS/src/compiler.jl:5325 [inlined]
#114 at /home/leios/.julia/packages/GPUCompiler/07qaN/src/driver.jl:296
get! at ./dict.jl:481
unknown function (ip: 0x7fd1ecd2815f)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
macro expansion at /home/leios/.julia/packages/GPUCompiler/07qaN/src/driver.jl:295 [inlined]
#emit_llvm#111 at /home/leios/.julia/packages/GPUCompiler/07qaN/src/utils.jl:68
unknown function (ip: 0x7fd20be3c283)
unknown function (ip: 0x7fd20be06d19)
unknown function (ip: 0x7fd20be06ce1)
emit_llvm##kw at /home/leios/.julia/packages/GPUCompiler/07qaN/src/utils.jl:62 [inlined]
cufunction_compile at /home/leios/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:353
#224 at /home/leios/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
JuliaContext at /home/leios/.julia/packages/GPUCompiler/07qaN/src/driver.jl:76
unknown function (ip: 0x7fd1eccf7eea)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
cufunction_compile at /home/leios/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
cached_compilation at /home/leios/.julia/packages/GPUCompiler/07qaN/src/cache.jl:90
#cufunction#221 at /home/leios/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
cufunction at /home/leios/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
unknown function (ip: 0x7fd1eccf784f)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
do_call at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:215
eval_body at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:467
jl_interpret_toplevel_thunk at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:750
jl_toplevel_eval_flex at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/toplevel.c:906
jl_toplevel_eval_flex at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/toplevel.c:850
ijl_toplevel_eval_in at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
include_string at ./loading.jl:1428
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
_include at ./loading.jl:1488
include at ./client.jl:476
unknown function (ip: 0x7fd20bd5f0d2)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
do_call at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:215
eval_stmt_value at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:166 [inlined]
eval_body at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:612
jl_interpret_toplevel_thunk at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/interpreter.c:750
jl_toplevel_eval_flex at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/toplevel.c:906
jl_toplevel_eval_flex at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/toplevel.c:850
ijl_toplevel_eval_in at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
eval_user_input at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:151
repl_backend_loop at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:247
start_repl_backend at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:232
#run_repl#47 at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:369
run_repl at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:355
jfptr_run_repl_66557.clone_1 at /home/leios/builds/julia-1.8.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
#967 at ./client.jl:419
jfptr_YY.967_49700.clone_1 at /home/leios/builds/julia-1.8.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
jl_f__call_latest at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/builtins.c:774
#invokelatest#2 at ./essentials.jl:729 [inlined]
invokelatest at ./essentials.jl:726 [inlined]
run_main_repl at ./client.jl:404
exec_options at ./client.jl:318
_start at ./client.jl:522
jfptr__start_61720.clone_1 at /home/leios/builds/julia-1.8.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/julia.h:1838 [inlined]
true_main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:575
jl_repl_entrypoint at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/src/jlapi.c:719
main at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
Allocations: 64898559 (Pool: 64841540; Big: 57019); GC: 59
Segmentation fault (core dumped)

What happens on the CPU:

julia> using StaticArrays

julia> a = SVector((0,0,0))
3-element SVector{3, Int64} with indices SOneTo(3):
 0
 0
 0

julia> dr = Base.broadcasted(-, a, a)
Base.Broadcast.Broadcasted(-, ([0, 0, 0], [0, 0, 0]))

julia> dr.args
([0, 0, 0], [0, 0, 0])

julia> ax1 = StaticArrays.static_axes(dr.args[1])
(SOneTo(3),)

julia> ax2 = StaticArrays.static_axes(dr.args[2])
(SOneTo(3),)

julia> dr_bcsm = ax1 == ax2
true

@wsmoses
Member

wsmoses commented Oct 6, 2022

What version of Enzyme/Julia are you on? Can you retry with main?

@leios

leios commented Oct 6, 2022

Well, just updated to the new main (again), and now things work for the MWE posted above. My current st:

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [052768ef] CUDA v3.12.0
  [7da242da] Enzyme v0.10.8 `https://github.com/EnzymeAD/Enzyme.jl.git#main`
  [90137ffa] StaticArrays v1.5.9

Sorry for the spam. I'll keep messing with it. Thanks for the patience here!

@vchuravy
Member

vchuravy commented Oct 6, 2022

Thank you for your patience!

@vchuravy
Member

vchuravy commented Oct 7, 2022

Ok I finally managed to reproduce... This didn't occur on the A100 I was trying earlier.

@vchuravy
Member

vchuravy commented Oct 7, 2022

julia> CUDA.run_compute_sanitizer()
Re-starting your active Julia session...
========= COMPUTE-SANITIZER

julia> include("enz.jl")
Kernel worked

signal (11): Segmentation fault
in expression starting at /home/vchuravy/enz.jl:46
Allocations: 70190705 (Pool: 70129654; Big: 61051); GC: 64
========= Error: process didn't terminate successfully
=========     The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or a host debugger to catch host side errors.
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

Rather weird...

@vchuravy
Member

vchuravy commented Oct 7, 2022

Ah, without -g2:

========= Invalid Address Space
=========     at 0x678 in /home/vchuravy/.julia/packages/LLVM/WjSQG/src/interop/base.jl:40:julia_grad_kernel__4354(CuDeviceArray<SArray<Tuple<(int)3>, Float32, (int)1, (int)3>, (int)1, (int)1>, CuDeviceArray<SArray<Tuple<(int)3>, Tuple<(int)3>, (int)1, (int)3>, (int)1, (int)1>)
=========     by thread (96,0,0) in block (0,0,0)

@vchuravy
Member

vchuravy commented Oct 7, 2022

Today I learned about:

CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
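
A sketch of the usage, assuming the same enz.jl reproducer (the variable has to be set before the failing launch, and the dump filename is generated per run):

CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 julia enz.jl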

Running with it gives a device core dump:

(cuda-gdb) target cudacore core_1665152103_cyclops_165521.nvcudmp
Opening GPU coredump: core_1665152103_cyclops_165521.nvcudmp

CUDA Exception: Warp Invalid Address Space
The exception was triggered at PC 0x5ff3bb8 (enz.jl:14)
[Current focus set to CUDA kernel 0, grid 3, block (0,0,0), thread (32,0,0), device 2, sm 0, warp 1, lane 0]
#0  0x0000000005ff3d30 in julia_grad_kernel__4323 () at /home/vchuravy/enz.jl:14
14	    return dr
(cuda-gdb) disassemble 
Dump of assembler code for function $_Z23julia_grad_kernel__432313CuDeviceArrayI6SArrayI5TupleILi3EE7Float32Li1ELi3EELi1ELi1EES_IS0_IS1_ILi3EES2_Li1ELi3EELi1ELi1EE$diffejulia_force_4596:
   0x0000000005ff3af8 <+0>:	{         LEA R6.CC, R12.reuse, 0x8 
   0x0000000005ff3b00 <+8>:	
   0x0000000005ff3b08 <+16>:	
   0x0000000005ff3b10 <+24>:	LEA.HI.X P0, R7, R12, RZ, R15 
   0x0000000005ff3b18 <+32>:	LEA RZ.CC, R14, RZ 
   0x0000000005ff3b20 <+40>:	
   0x0000000005ff3b28 <+48>:	{         LEA.HI.X P1, RZ, R14, RZ, R13 
   0x0000000005ff3b30 <+56>:	
   0x0000000005ff3b38 <+64>:	{         MOV R2, R14 
   0x0000000005ff3b40 <+72>:	
   0x0000000005ff3b48 <+80>:	
   0x0000000005ff3b50 <+88>:	{         MOV R3, R13 
   0x0000000005ff3b58 <+96>:	
   0x0000000005ff3b60 <+104>:	
   0x0000000005ff3b68 <+112>:	{         MOV R4, R17 
   0x0000000005ff3b70 <+120>:	
   0x0000000005ff3b78 <+128>:	{         MOV R5, R16 
   0x0000000005ff3b80 <+136>:	
   0x0000000005ff3b88 <+144>:	
   0x0000000005ff3b90 <+152>:	ST.E [R6+-0x8], RZ, P0 
   0x0000000005ff3b98 <+160>:	@!P1 BRA 0x990 
   0x0000000005ff3ba0 <+168>:	
   0x0000000005ff3ba8 <+176>:	DEPBAR.LE SB5, 0x2 
   0x0000000005ff3bb0 <+184>:	FADD R7, -R19, -RZ 
*> 0x0000000005ff3bb8 <+192>:	RED.E.ADD.F32.FTZ.RN [R2+0x8], R7 
   0x0000000005ff3bc0 <+200>:	
   0x0000000005ff3bc8 <+208>:	SYNC 
   0x0000000005ff3bd0 <+216>:	IADD32I R6, R2, 0x8 
   0x0000000005ff3bd8 <+224>:	LOP32I.AND R20, R6, 0xffffff 
   0x0000000005ff3be0 <+232>:	
   0x0000000005ff3be8 <+240>:	LDS.U.32 R6, [R20] 
   0x0000000005ff3bf0 <+248>:	DEPBAR.LE SB5, 0x2 
   0x0000000005ff3bf8 <+256>:	FADD R7, -R19, R6 
   0x0000000005ff3c00 <+264>:	
   0x0000000005ff3c08 <+272>:	ATOMS.CAS R21, [R20], R6, R7 
   0x0000000005ff3c10 <+280>:	ISETP.EQ.U32.AND P0, PT, R6, R21, PT 
   0x0000000005ff3c18 <+288>:	@!P0 MOV R6, R21 
   0x0000000005ff3c20 <+296>:	
   0x0000000005ff3c28 <+304>:	{    @!P0 FADD R7, -R19, R21 
   0x0000000005ff3c30 <+312>:	
   0x0000000005ff3c38 <+320>:	SYNC 
   0x0000000005ff3c40 <+328>:	
   0x0000000005ff3c48 <+336>:	{         LEA RZ.CC, R4.reuse, RZ 
   0x0000000005ff3c50 <+344>:	
   0x0000000005ff3c58 <+352>:	LEA.HI.X P0, RZ, R4, RZ, R5 
   0x0000000005ff3c60 <+360>:	
   0x0000000005ff3c68 <+368>:	@!P0 BRA 0xa40 
   0x0000000005ff3c70 <+376>:	RED.E.ADD.F32.FTZ.RN [R4+0x8], R19 
   0x0000000005ff3c78 <+384>:	SYNC 
   0x0000000005ff3c80 <+392>:	
   0x0000000005ff3c88 <+400>:	IADD32I R6, R4, 0x8 
   0x0000000005ff3c90 <+408>:	LOP32I.AND R20, R6, 0xffffff 
   0x0000000005ff3c98 <+416>:	LDS.U.32 R6, [R20] 
   0x0000000005ff3ca0 <+424>:	
   0x0000000005ff3ca8 <+432>:	FADD R7, R19, R6 
   0x0000000005ff3cb0 <+440>:	NOP 
   0x0000000005ff3cb8 <+448>:	NOP 
   0x0000000005ff3cc0 <+456>:	
   0x0000000005ff3cc8 <+464>:	ATOMS.CAS R21, [R20], R6, R7 
   0x0000000005ff3cd0 <+472>:	ISETP.EQ.U32.AND P0, PT, R6, R21, PT 
   0x0000000005ff3cd8 <+480>:	@!P0 MOV R6, R21 
   0x0000000005ff3ce0 <+488>:	
   0x0000000005ff3ce8 <+496>:	{    @!P0 FADD R7, R19, R21 
   0x0000000005ff3cf0 <+504>:	
   0x0000000005ff3cf8 <+512>:	SYNC 
   0x0000000005ff3d00 <+520>:	
   0x0000000005ff3d08 <+528>:	{         LEA RZ.CC, R2.reuse, RZ 
   0x0000000005ff3d10 <+536>:	
   0x0000000005ff3d18 <+544>:	LEA.HI.X P0, RZ, R2, RZ, R3 
   0x0000000005ff3d20 <+552>:	
   0x0000000005ff3d28 <+560>:	@!P0 BRA 0xb18 
=> 0x0000000005ff3d30 <+568>:	DEPBAR.LE SB5, 0x1 
   0x0000000005ff3d38 <+576>:	FADD R7, -R18, -RZ 
   0x0000000005ff3d40 <+584>:	
   0x0000000005ff3d48 <+592>:	RED.E.ADD.F32.FTZ.RN [R2+0x4], R7 
   0x0000000005ff3d50 <+600>:	SYNC 
   0x0000000005ff3d58 <+608>:	IADD32I R6, R2, 0x4 
   0x0000000005ff3d60 <+616>:	
   0x0000000005ff3d68 <+624>:	LOP32I.AND R6, R6, 0xffffff 
   0x0000000005ff3d70 <+632>:	LDS.U.32 R20, [R6] 
   0x0000000005ff3d78 <+640>:	DEPBAR.LE SB5, 0x1 
   0x0000000005ff3d80 <+648>:	
   0x0000000005ff3d88 <+656>:	FADD R21, -R18, R20 
   0x0000000005ff3d90 <+664>:	NOP 
   0x0000000005ff3d98 <+672>:	NOP 
   0x0000000005ff3da0 <+680>:	
   0x0000000005ff3da8 <+688>:	NOP 
   0x0000000005ff3db0 <+696>:	NOP 
   0x0000000005ff3db8 <+704>:	NOP 
   0x0000000005ff3dc0 <+712>:	
   0x0000000005ff3dc8 <+720>:	ATOMS.CAS R21, [R6], R20, R21 
   0x0000000005ff3dd0 <+728>:	ISETP.EQ.U32.AND P0, PT, R20, R21, PT 
   0x0000000005ff3dd8 <+736>:	@!P0 MOV R20, R21 
   0x0000000005ff3de0 <+744>:	
   0x0000000005ff3de8 <+752>:	{    @!P0 FADD R21, -R18, R21 
   0x0000000005ff3df0 <+760>:	
   0x0000000005ff3df8 <+768>:	SYNC 
   0x0000000005ff3e00 <+776>:	
   0x0000000005ff3e08 <+784>:	{         LEA RZ.CC, R4.reuse, RZ 
   0x0000000005ff3e10 <+792>:	
   0x0000000005ff3e18 <+800>:	LEA.HI.X P0, RZ, R4, RZ, R5 
   0x0000000005ff3e20 <+808>:	
   0x0000000005ff3e28 <+816>:	@!P0 BRA 0xc00 
   0x0000000005ff3e30 <+824>:	RED.E.ADD.F32.FTZ.RN [R4+0x4], R18 
   0x0000000005ff3e38 <+832>:	SYNC 
   0x0000000005ff3e40 <+840>:	
   0x0000000005ff3e48 <+848>:	IADD32I R6, R4, 0x4 
   0x0000000005ff3e50 <+856>:	LOP32I.AND R6, R6, 0xffffff 
   0x0000000005ff3e58 <+864>:	LDS.U.32 R20, [R6] 
   0x0000000005ff3e60 <+872>:	
   0x0000000005ff3e68 <+880>:	FADD R21, R18, R20 
   0x0000000005ff3e70 <+888>:	NOP 
   0x0000000005ff3e78 <+896>:	NOP 
   0x0000000005ff3e80 <+904>:	
   0x0000000005ff3e88 <+912>:	ATOMS.CAS R21, [R6], R20, R21 
   0x0000000005ff3e90 <+920>:	ISETP.EQ.U32.AND P0, PT, R20, R21, PT 
   0x0000000005ff3e98 <+928>:	@!P0 MOV R20, R21 
   0x0000000005ff3ea0 <+936>:	
   0x0000000005ff3ea8 <+944>:	{    @!P0 FADD R21, R18, R21 
   0x0000000005ff3eb0 <+952>:	
   0x0000000005ff3eb8 <+960>:	SYNC 
   0x0000000005ff3ec0 <+968>:	
   0x0000000005ff3ec8 <+976>:	{         LEA RZ.CC, R2.reuse, RZ 
   0x0000000005ff3ed0 <+984>:	
   0x0000000005ff3ed8 <+992>:	LEA.HI.X P0, RZ, R2, RZ, R3 
   0x0000000005ff3ee0 <+1000>:	
   0x0000000005ff3ee8 <+1008>:	@!P0 BRA 0xcd0 
   0x0000000005ff3ef0 <+1016>:	FADD R7, -R0, -RZ 
   0x0000000005ff3ef8 <+1024>:	RED.E.ADD.F32.FTZ.RN [R2], R7 
   0x0000000005ff3f00 <+1032>:	
   0x0000000005ff3f08 <+1040>:	SYNC 
   0x0000000005ff3f10 <+1048>:	LOP32I.AND R2, R2, 0xffffff 
   0x0000000005ff3f18 <+1056>:	LDS.U.32 R6, [R2] 
   0x0000000005ff3f20 <+1064>:	
   0x0000000005ff3f28 <+1072>:	FADD R7, -R0, R6 
   0x0000000005ff3f30 <+1080>:	NOP 
   0x0000000005ff3f38 <+1088>:	NOP 
   0x0000000005ff3f40 <+1096>:	
   0x0000000005ff3f48 <+1104>:	ATOMS.CAS R7, [R2], R6, R7 
   0x0000000005ff3f50 <+1112>:	ISETP.EQ.U32.AND P0, PT, R6, R7, PT 
   0x0000000005ff3f58 <+1120>:	@!P0 MOV R6, R7 
   0x0000000005ff3f60 <+1128>:	
   0x0000000005ff3f68 <+1136>:	{    @!P0 FADD R7, -R0, R7 
   0x0000000005ff3f70 <+1144>:	
   0x0000000005ff3f78 <+1152>:	SYNC 
   0x0000000005ff3f80 <+1160>:	
   0x0000000005ff3f88 <+1168>:	LEA RZ.CC, R4.reuse, RZ 
   0x0000000005ff3f90 <+1176>:	LEA.HI.X P0, RZ, R4, RZ, R5 
   0x0000000005ff3f98 <+1184>:	@!P0 BRA 0xd90 
   0x0000000005ff3fa0 <+1192>:	
   0x0000000005ff3fa8 <+1200>:	MOV R2, R4 
   0x0000000005ff3fb0 <+1208>:	MOV R3, R5 
   0x0000000005ff3fb8 <+1216>:	RED.E.ADD.F32.FTZ.RN [R2], R0 
   0x0000000005ff3fc0 <+1224>:	
   0x0000000005ff3fc8 <+1232>:	RET 
   0x0000000005ff3fd0 <+1240>:	LOP32I.AND R4, R4, 0xffffff 
   0x0000000005ff3fd8 <+1248>:	LDS.U.32 R2, [R4] 
   0x0000000005ff3fe0 <+1256>:	
   0x0000000005ff3fe8 <+1264>:	FADD R3, R0, R2 
   0x0000000005ff3ff0 <+1272>:	ATOMS.CAS R3, [R4], R2, R3 
   0x0000000005ff3ff8 <+1280>:	ISETP.EQ.U32.AND P0, PT, R2, R3, PT 
   0x0000000005ff4000 <+1288>:	
   0x0000000005ff4008 <+1296>:	@P0 RET 
   0x0000000005ff4010 <+1304>:	MOV R2, R3 
   0x0000000005ff4018 <+1312>:	{         FADD R3, R0, R3 
   0x0000000005ff4020 <+1320>:	
   0x0000000005ff4028 <+1328>:	

So:

*> 0x0000000005ff3bb8 <+192>:	RED.E.ADD.F32.FTZ.RN [R2+0x8], R7 
(cuda-gdb) info cuda lanes
  Ln   State       Physical PC    ThreadIdx          Exception         
Device 2 SM 0 Warp 1
*  0   active  0x0000000000000af0  (32,0,0) Warp Invalid Address Space 
   1   active  0x0000000000000af0  (33,0,0) Warp Invalid Address Space 
   2   active  0x0000000000000af0  (34,0,0) Warp Invalid Address Space 
   3   active  0x0000000000000af0  (35,0,0) Warp Invalid Address Space 
   4   active  0x0000000000000af0  (36,0,0) Warp Invalid Address Space 
   5   active  0x0000000000000af0  (37,0,0) Warp Invalid Address Space 
   6   active  0x0000000000000af0  (38,0,0) Warp Invalid Address Space 
   7   active  0x0000000000000af0  (39,0,0) Warp Invalid Address Space 
   8   active  0x0000000000000af0  (40,0,0) Warp Invalid Address Space 
   9   active  0x0000000000000af0  (41,0,0) Warp Invalid Address Space 
  10   active  0x0000000000000af0  (42,0,0) Warp Invalid Address Space 
  11   active  0x0000000000000af0  (43,0,0) Warp Invalid Address Space 
  12   active  0x0000000000000af0  (44,0,0) Warp Invalid Address Space 
  13   active  0x0000000000000af0  (45,0,0) Warp Invalid Address Space 
  14   active  0x0000000000000af0  (46,0,0) Warp Invalid Address Space 
  15   active  0x0000000000000af0  (47,0,0) Warp Invalid Address Space 
  16   active  0x0000000000000af0  (48,0,0) Warp Invalid Address Space 
  17   active  0x0000000000000af0  (49,0,0) Warp Invalid Address Space 
  18   active  0x0000000000000af0  (50,0,0) Warp Invalid Address Space 
  19   active  0x0000000000000af0  (51,0,0) Warp Invalid Address Space 
  20   active  0x0000000000000af0  (52,0,0) Warp Invalid Address Space 
  21   active  0x0000000000000af0  (53,0,0) Warp Invalid Address Space 
  22   active  0x0000000000000af0  (54,0,0) Warp Invalid Address Space 
  23   active  0x0000000000000af0  (55,0,0) Warp Invalid Address Space 
  24   active  0x0000000000000af0  (56,0,0) Warp Invalid Address Space 
  25   active  0x0000000000000af0  (57,0,0) Warp Invalid Address Space 
  26   active  0x0000000000000af0  (58,0,0) Warp Invalid Address Space 
  27   active  0x0000000000000af0  (59,0,0) Warp Invalid Address Space 
  28   active  0x0000000000000af0  (60,0,0) Warp Invalid Address Space 
  29   active  0x0000000000000af0  (61,0,0) Warp Invalid Address Space 
  30   active  0x0000000000000af0  (62,0,0) Warp Invalid Address Space 
  31   active  0x0000000000000af0  (63,0,0) Warp Invalid Address Space 
(cuda-gdb) 

@vchuravy
Copy link
Member

vchuravy commented Oct 7, 2022

According to https://www.olcf.ornl.gov/wp-content/uploads/2019/12/05_Atomics_Reductions_Warp_Shuffle.pdf (page 7),
that RED instruction is an atomic add (RED is the "fire-and-forget" reduction form of an atomic, which does not return the old value). The PTX agrees:

.func diffejulia_force_6144(
    .param .b64 diffejulia_force_6144_param_0,
    .param .b64 diffejulia_force_6144_param_1,
    .param .b64 diffejulia_force_6144_param_2
)                                       // -- Begin function diffejulia_force_6144
                                        // @diffejulia_force_6144
{
    .reg .f32   %f<13>;
    .reg .b32   %r<2>;
    .reg .b64   %rd<8>;
    .loc    1 10 0                          // /home/vchuravy/enz.jl:10:0
Lfunc_begin1:
    .loc    1 10 0                          // /home/vchuravy/enz.jl:10:0

// %bb.0:                               // %top
    ld.param.u64    %rd1, [diffejulia_force_6144_param_0];
    ld.param.u64    %rd2, [diffejulia_force_6144_param_1];
    ld.param.u64    %rd3, [diffejulia_force_6144_param_2];
Ltmp3:
    .loc    5 29 0                          // tuple.jl:29:0
    add.s64     %rd4, %rd3, 4;
    add.s64     %rd5, %rd2, 4;
    add.s64     %rd6, %rd3, 8;
    add.s64     %rd7, %rd2, 8;
    ld.f32  %f1, [%rd1+8];
    mov.u32     %r1, 0;
Ltmp4:
    .loc    1 14 0                          // /home/vchuravy/enz.jl:14:0
    st.u32  [%rd1+8], %r1;
    ld.f32  %f2, [%rd1+4];
    st.u32  [%rd1+4], %r1;
    ld.f32  %f3, [%rd1];
    st.u32  [%rd1], %r1;
    neg.f32     %f4, %f1;
    atom.add.f32    %f5, [%rd7], %f4;
    atom.add.f32    %f6, [%rd6], %f1;
    neg.f32     %f7, %f2;
    atom.add.f32    %f8, [%rd5], %f7;
    atom.add.f32    %f9, [%rd4], %f2;
    neg.f32     %f10, %f3;
    atom.add.f32    %f11, [%rd2], %f10;
    atom.add.f32    %f12, [%rd3], %f3;
    ret;
Ltmp5:
Lfunc_end1:

So the atomic adds target offsets from a base pointer passed in through the function arguments. At the call site:

    st.param.b64    [param0+0], %rd33;
    .param .b64 param1;
    st.param.b64    [param1+0], %rd29;
    .param .b64 param2;
    st.param.b64    [param2+0], %rd31;
    call.uni
    diffejulia_force_6144,
    (
    param0,
    param1,
    param2
    );

So what is %rd33?

    add.u64     %rd33, %SP, 48;

So my hypothesis here is that CUDA does not allow atomic_add on stack pointers on Pascal (but it may be fine on later generations). Enzyme unconditionally emits an atomic add, since we don't know whether the value is a stack pointer. When everything is inlined, all is good.
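
Under that hypothesis, the failing shape can be sketched as below. This is a hypothetical reduction, not the reporter's kernel: the names and bodies are made up, and whether it actually reproduces depends on the compiler's inlining decisions. The key ingredient is a @noinline device helper taking SVector arguments, whose shadows the Enzyme-generated pullback receives as generic addrspace(0) pointers into the caller's stack frame and accumulates into with atomic adds.

using CUDA, Enzyme, StaticArrays

# Illustrative only: because `pair` stays a real call, differentiating a
# kernel that uses it (via autodiff_deferred, as in this thread) makes the
# generated pullback atomically accumulate into shadow values that live in
# the caller's stack frame, which Pascal rejects for local memory.
@noinline function pair(a::SVector{3,Float32}, b::SVector{3,Float32})
    dr = b - a
    return sum(abs2, dr)
end

function kern!(out, xs)
    i = threadIdx().x
    @inbounds out[i] = pair(xs[i], xs[mod1(i + 1, length(xs))])
    return
end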

Side note: diffejulia_kernel__6125_inner7wrap is not removed from the module and is thus compiled all the way down to assembly.

@vchuravy
Copy link
Member

vchuravy commented Oct 7, 2022

Billy and I confirmed the hypothesis. We alloca some stack memory, which is passed to a function (the argument types are AddressSpace 0, not AddressSpace 3). Because of that, Enzyme proper can't recognize the address as thread-local and must emit an atomic write. In theory @wsmoses added a pass to fix up these kinds of situations (llvm/llvm-project@7aa3cad), but the InferAddressSpaces pass is function-local and does not operate interprocedurally, so it is also unable to fix it.

On newer hardware generations this operation is legal...

The conclusion here is that this is very hard to fix in Enzyme. There is a potential fix, propagating information about whether pointers are thread-local, but that is rather speculative and a large change.

Workarounds are:

  • Inline the function
  • Use a GPU newer than Pascal, i.e. Volta and above

Inlining the function is probably the best option anyway, since that will drastically improve performance.
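
To make the first workaround concrete, here is a minimal sketch, reusing the hypothetical pair helper from the sketch above; the annotation is the entire change:

# With the helper inlined, Enzyme differentiates it as part of the
# enclosing kernel, can see that the shadow accumulation happens in a
# stack slot it owns, and never needs to emit the offending atomic add.
@inline function pair(a::SVector{3,Float32}, b::SVector{3,Float32})
    dr = b - a
    return sum(abs2, dr)
end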

@vchuravy vchuravy changed the title CUDA kernel error when updating array using shared memory CUDA Error 717: Invalid Address Space -- Atomic operation not supported on stack memory. Oct 7, 2022
@leios
Copy link

leios commented Oct 8, 2022

Ok, I think we have partial solutions to everything.

  1. If people see Error 700, it is probably a shmem issue, which you can fix by raising the malloc heap limit: CUDA.limit!(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE, 1*1024^3) (see the usage sketch after this list)
  2. If people see Error 717, you need to either inline the function or use hardware newer than Pascal
  3. We need to be careful about the amount of shmem generated, as Enzyme will likely use about double that amount
  4. Enzyme.jl needs to be on the main branch for certain operations for now.
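
A usage sketch for item 1 (the limit must be set from the host before launching the kernel; 1 GiB is just the value quoted above, not a universal recommendation):

using CUDA

# Raise the device-side malloc heap limit to 1 GiB before the launch;
# the default heap is far smaller and can surface as CUDA Error 700 here.
CUDA.limit!(CUDA.CU_LIMIT_MALLOC_HEAP_SIZE, 1 * 1024^3)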

If I am missing any highlights, be sure to add them, but I think this issue can be closed for now?

@vchuravy
Copy link
Member

vchuravy commented Oct 8, 2022

Point 4 should not be the case anymore, since we released current main as a patch release, but everything else sounds about right.

Let's leave the issue open though, to remind us and in case we can implement a proper fix eventually.

@vchuravy vchuravy added this to the backlog milestone Oct 8, 2022
@jgreener64
Copy link
Contributor Author

Thanks everyone for your work on this. I have the kernel working with the suggestions above.
