Pullback on mean() gives illegal memory access code 700 #1473

Closed
koenvos opened this issue Nov 24, 2023 · 32 comments · Fixed by JuliaGPU/CUDA.jl#2206
Labels
bug (Something isn't working) · CUDA (All things GPU)

@koenvos

koenvos commented Nov 24, 2023

Minimal working example:

using Statistics, CUDA, Zygote
x = CUDA.randn(100)
y = CUDA.randn(100)
loss, back = pullback(y -> mean(x .- y), y)
grads = back(one(loss))

The same error happens when using Flux.mse(x, y) instead of mean(x .- y) as the loss function.

The error goes away when replacing mean() with sum(), or when running on the CPU.

Running latest Julia and packages:
Julia 1.9.4
Statistics v1.9.0
CUDA v5.1.1
Zygote v0.6.67
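
Since sum() works where mean() fails, the mean can be expressed through sum() as a stopgap. A sketch of that workaround (not verified on an affected setup; it reuses the MWE variables):

```julia
using Statistics, CUDA, Zygote

x = CUDA.randn(100)
y = CUDA.randn(100)

# Same loss as mean(x .- y), but phrased via sum(), which reportedly
# does not trigger the illegal memory access:
loss, back = pullback(y -> sum(x .- y) / length(y), y)
grads = back(one(loss))
```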

The first ~20% of the stack trace:

ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] nonblocking_synchronize(val::CuContext)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:163
  [3] device_synchronize(; blocking::Bool, spin::Bool)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:174
  [4] device_synchronize
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:169 [inlined]
  [5] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\module.jl:40
  [6] CuModule
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\module.jl:23 [inlined]
  [7] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\compiler\compilation.jl:365
  [8] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler C:\Users\koenv\.julia\packages\GPUCompiler\U36Ed\src\execution.jl:132
  [9] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler C:\Users\koenv\.julia\packages\GPUCompiler\U36Ed\src\execution.jl:103
 [10] macro expansion
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:382 [inlined]
 [11] macro expansion
    @ .\lock.jl:267 [inlined]
 [12] cufunction(f::GPUArrays.var"#broadcast_kernel#38", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(*), Tuple{Int64, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(real), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}}}, Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:377
 [13] cufunction(f::GPUArrays.var"#broadcast_kernel#38", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(*), Tuple{Int64, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(real), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}}}, Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:374
 [14] macro expansion
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:104 [inlined]
 [15] #launch_heuristic#1120
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:17 [inlined]
 [16] launch_heuristic
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:15 [inlined]
 [17] _copyto!
    @ C:\Users\koenv\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:70 [inlined]
 [18] copyto!
    @ C:\Users\koenv\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:51 [inlined]
 [19] copy
    @ C:\Users\koenv\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:42 [inlined]
 [20] materialize
    @ .\broadcast.jl:873 [inlined]
 [21] (::Zygote.var"#1283#1286"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})(z̄::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote C:\Users\koenv\.julia\packages\Zygote\YYT6v\src\lib\broadcast.jl:128
 [22] #3978#back
    @ C:\Users\koenv\.julia\packages\ZygoteRules\4nXuu\src\adjoint.jl:71 [inlined]
 [23] Pullback
    @ C:\Users\koenv\.julia\packages\Flux\u7QSl\src\losses\functions.jl:47 [inlined]
 [24] WARNING: Error while freeing DeviceBuffer(400 bytes at 0x0000000204601600):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), details=CUDA.Optional{String}(data=nothing))
@dorn-gerhard

I encounter the same error when trying the above example:

ERROR: WARNING: Error while freeing DeviceBuffer(400 bytes at 0x0000000205200a00):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), details=CUDA.Optional{String}(data=nothing))

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] check
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:34 [inlined]
  [3] cuMemFreeAsync
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\utils\call.jl:26 [inlined]
  [4] #free#2
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:97 [inlined]
  [5] free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:92 [inlined]
  [6] #actual_free#1001
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:78 [inlined]
  [7] actual_free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:75 [inlined]
  [8] #_free#1026
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:506 [inlined]
  [9] _free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:493 [inlined]
 [10] macro expansion
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:478 [inlined]
 [11] macro expansion
    @ .\timing.jl:393 [inlined]
 [12] #free#1025
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:477 [inlined]
 [13] free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:466 [inlined]
 [14] (::CUDA.var"#1032#1033"{CUDA.Mem.DeviceBuffer, Bool})()
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\array.jl:101
 [15] #context!#915
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:170 [inlined]
 [16] context!
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:165 [inlined]
 [17] _free_buffer(buf::CUDA.Mem.DeviceBuffer, early::Bool)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\array.jl:89
 [18] release(rc::GPUArrays.RefCounted{CUDA.Mem.DeviceBuffer}, args::Bool)
    @ GPUArrays C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\abstractarray.jl:42
 [19] unsafe_free!
    @ C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\abstractarray.jl:90 [inlined]
 [20] unsafe_finalize!(xs::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\array.jl:113
 [21] show_exception_stack(io::IOContext{Base.TTY}, stack::Base.ExceptionStack)
    @ Base .\errorshow.jl:895
 [22] display_error(io::IOContext{Base.TTY}, stack::Base.ExceptionStack)
    @ Base .\client.jl:111
 [23] #invokelatest#2
    @ .\essentials.jl:819 [inlined]
 [24] invokelatest
    @ .\essentials.jl:816 [inlined]
 [25] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:300
 [26] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:287
 [27] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:557
 [28] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:285
 [29] (::REPL.var"#do_respond#80"{Bool, Bool, REPL.var"#93#103"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:899
 [30] #invokelatest#2
    @ .\essentials.jl:819 [inlined]
 [31] invokelatest
    @ .\essentials.jl:816 [inlined]
 [32] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\LineEdit.jl:2647
 [33] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:1300
 [34] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL .\task.jl:514
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), details=CUDA.Optional{String}(data=nothing))
throw_api_error at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
check at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:34 [inlined]
cuMemHostUnregister at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\utils\call.jl:26 [inlined]
unregister at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:193 [inlined]
#21 at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:701 [inlined]
#context!#915 at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:170 [inlined]
context! at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:165 [inlined]
macro expansion at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:700 [inlined]
macro expansion at .\lock.jl:267 [inlined]
__unpin at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:694
#1081 at C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:180
unknown function (ip: 0000029d478cdc66)
run_finalizer at C:/workdir/src\gc.c:417
jl_gc_run_finalizers_in_list at C:/workdir/src\gc.c:507
run_finalizers at C:/workdir/src\gc.c:553
jl_mutex_unlock at C:/workdir/src\julia_locks.h:81 [inlined]
jl_generate_fptr_impl at C:/workdir/src\jitlayers.cpp:467
jl_compile_method_internal at C:/workdir/src\gf.c:2348
jl_compile_method_internal at C:/workdir/src\gf.c:2241 [inlined]
_jl_invoke at C:/workdir/src\gf.c:2750 [inlined]
ijl_apply_generic at C:/workdir/src\gf.c:2940
show_exception_stack at .\errorshow.jl:895
display_error at .\client.jl:111
jfptr_display_error_47944.clone_1 at C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:774
#invokelatest#2 at .\essentials.jl:819 [inlined]
invokelatest at .\essentials.jl:816 [inlined]
print_response at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\REPL.jl:300
#57 at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\REPL.jl:287
jfptr_YY.57_61235.clone_1 at C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll (unknown line)
with_repl_linfo at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\REPL.jl:557
jfptr_with_repl_linfo_61240.clone_1 at C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll (unknown line)
print_response at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\REPL.jl:285
do_respond at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\REPL.jl:899
jfptr_do_respond_61066.clone_1 at C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:774
#invokelatest#2 at .\essentials.jl:819 [inlined]
invokelatest at .\essentials.jl:816 [inlined]
run_interface at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\LineEdit.jl:2647
jfptr_run_interface_61442.clone_1 at C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll (unknown line)
run_frontend at C:\workdir\usr\share\julia\stdlib\v1.9\REPL\src\REPL.jl:1300
#62 at .\task.jl:514
jfptr_YY.62_61043.clone_1 at C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
start_task at C:/workdir/src\task.c:1092
WARNING: Error while freeing DeviceBuffer(400 bytes at 0x0000000205200800):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), details=CUDA.Optional{String}(data=nothing))

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] check
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:34 [inlined]
  [3] cuMemFreeAsync
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\utils\call.jl:26 [inlined]
  [4] #free#2
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:97 [inlined]
  [5] free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:92 [inlined]
  [6] #actual_free#1001
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:78 [inlined]
  [7] actual_free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:75 [inlined]
  [8] #_free#1026
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:506 [inlined]
  [9] _free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:493 [inlined]
 [10] macro expansion
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:478 [inlined]
 [11] macro expansion
    @ .\timing.jl:393 [inlined]
 [12] #free#1025
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:477 [inlined]
 [13] free
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\pool.jl:466 [inlined]
 [14] (::CUDA.var"#1032#1033"{CUDA.Mem.DeviceBuffer, Bool})()
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\array.jl:101
 [15] #context!#915
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:170 [inlined]
 [16] context!
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:165 [inlined]
 [17] _free_buffer(buf::CUDA.Mem.DeviceBuffer, early::Bool)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\array.jl:89
 [18] release(rc::GPUArrays.RefCounted{CUDA.Mem.DeviceBuffer}, args::Bool)
    @ GPUArrays C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\abstractarray.jl:42
 [19] unsafe_free!
    @ C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\abstractarray.jl:90 [inlined]
 [20] unsafe_finalize!(xs::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\array.jl:113
 [21] show_exception_stack(io::IOContext{Base.TTY}, stack::Base.ExceptionStack)
    @ Base .\errorshow.jl:895
 [22] display_error(io::IOContext{Base.TTY}, stack::Base.ExceptionStack)
    @ Base .\client.jl:111
 [23] #invokelatest#2
    @ .\essentials.jl:819 [inlined]
 [24] invokelatest
    @ .\essentials.jl:816 [inlined]
 [25] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:300
 [26] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:287
 [27] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:557
 [28] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:285
 [29] (::REPL.var"#do_respond#80"{Bool, Bool, REPL.var"#93#103"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:899
 [30] #invokelatest#2
    @ .\essentials.jl:819 [inlined]
 [31] invokelatest
    @ .\essentials.jl:816 [inlined]
 [32] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\LineEdit.jl:2647
 [33] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL C:\Users\gerharddorn\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\share\julia\stdlib\v1.9\REPL\src\REPL.jl:1300
 [34] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL .\task.jl:514
CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
 [2] isdone
   @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\stream.jl:111 [inlined]
 [3] spinning_synchronization(f::typeof(CUDA.isdone), obj::CuStream)
   @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:79
 [4] device_synchronize(; blocking::Bool, spin::Bool)
   @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:171
 [5] device_synchronize()
   @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:169
 [6] top-level scope
   @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\initialization.jl:210

caused by: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] nonblocking_synchronize(val::CuContext)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:163
  [3] device_synchronize(; blocking::Bool, spin::Bool)
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:174
  [4] device_synchronize
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:169 [inlined]
  [5] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\module.jl:40
  [6] CuModule
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\lib\cudadrv\module.jl:23 [inlined]
  [7] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\compiler\compilation.jl:365
  [8] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler C:\Users\gerharddorn\.julia\packages\GPUCompiler\U36Ed\src\execution.jl:132
  [9] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler C:\Users\gerharddorn\.julia\packages\GPUCompiler\U36Ed\src\execution.jl:103
 [10] macro expansion
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:382 [inlined]
 [11] macro expansion
    @ .\lock.jl:267 [inlined]
 [12] cufunction(f::GPUArrays.var"#broadcast_kernel#38", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:377
 [13] cufunction
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:374 [inlined]
 [14] macro expansion
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:104 [inlined]
 [15] #launch_heuristic#1120
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:17 [inlined]
 [16] launch_heuristic
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:15 [inlined]
 [17] _copyto!
    @ C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:70 [inlined]
 [18] copyto!
    @ C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:51 [inlined]
 [19] copy
    @ C:\Users\gerharddorn\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:42 [inlined]
 [20] materialize
    @ .\broadcast.jl:873 [inlined]
 [21] broadcast_preserving_zero_d
    @ .\broadcast.jl:862 [inlined]
 [22] -(A::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Base .\abstractarraymath.jl:218
 [23] _minus
    @ C:\Users\gerharddorn\.julia\packages\Zygote\YYT6v\src\lib\broadcast.jl:89 [inlined]
 [24] #1185
    @ C:\Users\gerharddorn\.julia\packages\Zygote\YYT6v\src\lib\broadcast.jl:86 [inlined]
 [25] #3770#back
    @ C:\Users\gerharddorn\.julia\packages\ZygoteRules\4nXuu\src\adjoint.jl:71 [inlined]
 [26] Pullback
    @ .\REPL[6]:1 [inlined]
 [27] (::Zygote.Pullback{Tuple{var"#3#4", CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.var"#1990#back#194"{Zygote.var"#190#193"{Zygote.Context{false}, GlobalRef, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.materialize), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{}}, Zygote.var"#3770#back#1189"{Zygote.var"#1185#1188"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.ZBack{ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}}}}}}}}}})(Δ::Float32)
    @ Zygote C:\Users\gerharddorn\.julia\packages\Zygote\YYT6v\src\compiler\interface2.jl:0
 [28] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{var"#3#4", CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.var"#1990#back#194"{Zygote.var"#190#193"{Zygote.Context{false}, GlobalRef, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.materialize), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{}}, Zygote.var"#3770#back#1189"{Zygote.var"#1185#1188"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.ZBack{ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}}}}}}}}}}})(Δ::Float32)
    @ Zygote C:\Users\gerharddorn\.julia\packages\Zygote\YYT6v\src\compiler\interface.jl:45
 [29] top-level scope
    @ REPL[7]:1
 [30] top-level scope
    @ C:\Users\gerharddorn\.julia\packages\CUDA\YIj5X\src\initialization.jl:208

@ToucheSir
Member

This is tricky because I'm not able to reproduce the error on my end. Can either of you post the output of versioninfo() and CUDA.versioninfo()? Also, can you confirm that normal GPU array operations such as -x, x .- y and sum(x .- y) all work?

@dorn-gerhard

Sure ;)

julia> -x
100-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
  1.1735989
  ⋮
 -0.33174348

julia> x .- y
100-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 -0.7638358
  ⋮
  2.1407743

julia> sum(x .- y)
-4.3421297f0

julia> mean(x)
0.03391623f0
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa29 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 10 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = auto
julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.0
NVIDIA driver 528.49.0

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+528.49

Julia packages:
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA T500 (sm_75, 3.777 GiB / 4.000 GiB available)

@ToucheSir
Member

ToucheSir commented Dec 6, 2023

Hmm, nothing looks off to me. @maleadt would this be enough for you to work with MWE-wise? Or would you like me to try to make one without Zygote? That may take a couple rounds of back-and-forth since I'd need others to run further MWEs to see if they throw the same error.

@koenvos
Author

koenvos commented Dec 7, 2023

More details from my side:

All basic operations work fine.

julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa29 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
  Threads: 28 on 32 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 24
julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 546.12.0

CUDA libraries:
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+546.12

Julia packages:
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 23.124 GiB / 23.988 GiB available)

@maleadt
Contributor

maleadt commented Dec 11, 2023

I can't reproduce this either. A couple of things you could try:

  • run the grad call under @device_code_llvm dump_module=true and try to spot CPU pointers getting loaded (which typically involves an inttoptr from a large literal integer; see e.g. JuliaGPU/GPUCompiler.jl#232, "Validator does not catch global host loads")
  • try running under compute-sanitizer (using CUDA.run_compute_sanitizer())
  • try a different CUDA toolkit (by calling CUDA.set_runtime_version!(v"12.2")); there's one illegal memory issue I know that got introduced in CUDA 12.3.1
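
For concreteness, the three suggestions might be run roughly as follows (a sketch only; exact invocations can differ between CUDA.jl versions, and loss/back refer to the MWE above):

```julia
using Statistics, CUDA, Zygote

# 1. Dump the device IR and look for an inttoptr from a large literal
#    integer, i.e. a CPU pointer being loaded inside the kernel:
@device_code_llvm dump_module=true back(one(loss))

# 2. Relaunch the current Julia session under compute-sanitizer:
CUDA.run_compute_sanitizer()

# 3. Pin an older CUDA toolkit, then restart Julia and retry:
CUDA.set_runtime_version!(v"12.2")
```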

@dorn-gerhard

I tried the macro and got the following output:

julia> @device_code_llvm dump_module=true grads = back(one(loss))
; PTX CompilerJob of MethodInstance for (::GPUArrays.var"#broadcast_kernel#38")(::CUDA.CuKernelContext, ::CuDeviceVector{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, ComposedFunction{typeof(last), typeof(tuple)}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}, CUDA.CuRefPointer{Float32}}}, ::Int64) for sm_61
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #0

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #0

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #0

;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:59 within `broadcast_kernel`
; Function Attrs: uwtable
define ptx_kernel void @_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE16ComposedFunctionI4last5tupleES4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EE12CuRefPointerIS1_EEES6_({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, { { { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }, [1 x i64] }, [1 x [1 x i64]] } %1, i64 signext %2) local_unnamed_addr #1 {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:63 within `broadcast_kernel`
; ┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:66 within `macro expansion`
; │┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:44 within `linear_index`
; ││┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:20 within `global_index`
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:40 within `threadidx`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:92 within `#threadIdx`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:46 within `threadIdx_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
; ││││││└└
; ││││││┌ @ int.jl:87 within `+`
         %4 = add nuw nsw i32 %3, 1
; │││└└└└
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:38 within `blockidx`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:78 within `#blockIdx`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:56 within `blockIdx_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %5 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
; │││└└└└└
; │││┌ @ int.jl:1042 within `-` @ int.jl:86
      %6 = zext i32 %5 to i64
; │││└
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:39 within `blockdim`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:85 within `#blockDim`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:51 within `blockDim_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %7 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
; │││└└└└└
; │││┌ @ int.jl:1040 within `*`
; ││││┌ @ int.jl:523 within `rem`
       %8 = zext i32 %7 to i64
; ││││└
; ││││ @ int.jl:1042 within `*` @ int.jl:88
      %9 = mul nuw nsw i64 %6, %8
; │││└
; │││┌ @ int.jl:1040 within `+`
; ││││┌ @ int.jl:523 within `rem`
       %10 = zext i32 %4 to i64
; ││││└
; ││││ @ int.jl:1042 within `+` @ int.jl:87
      %11 = add nuw nsw i64 %9, %10
; ││└└
; ││┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:29 within `global_size`
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:41 within `griddim`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:71 within `#gridDim`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:61 within `gridDim_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %12 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
; │││└└└└└
; │││┌ @ int.jl:88 within `*`
      %13 = mul i32 %12, %7
; ││└└
; ││┌ @ int.jl:1040 within `*`
; │││┌ @ int.jl:523 within `rem`
      %14 = sext i32 %13 to i64
; └└└└
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:61 within `broadcast_kernel`
; ┌ @ int.jl:83 within `<`
   %.not11 = icmp sgt i64 %2, 0
; └
  br i1 %.not11, label %L5.lr.ph, label %common.ret

L5.lr.ph:                                         ; preds = %conversion
  %.fca.0.1.0.extract = extractvalue { { { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }, [1 x i64] }, [1 x [1 x i64]] } %1, 0, 1, 0
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %15 = inttoptr i64 %.fca.0.1.0.extract to float*
  %16 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  br label %L5

L5:                                               ; preds = %L53, %L5.lr.ph
  %value_phi12 = phi i64 [ 0, %L5.lr.ph ], [ %19, %L53 ]
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:63 within `broadcast_kernel`
; ┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:66 within `macro expansion`
; │┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:44 within `linear_index`
; ││┌ @ int.jl:1042 within `*` @ int.jl:88
     %17 = mul i64 %value_phi12, %14
; ││└
; ││┌ @ int.jl:87 within `+`
     %18 = add i64 %11, %17
; │└└
; │ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:67 within `macro expansion`
; │┌ @ operators.jl:369 within `>`
; ││┌ @ int.jl:83 within `<`
     %.not10 = icmp slt i64 %.fca.3.extract, %18
; │└└
   br i1 %.not10, label %common.ret, label %L53

common.ret:                                       ; preds = %L53, %L5, %conversion
; └
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl within `broadcast_kernel`
  ret void

L53:                                              ; preds = %L5
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:62 within `broadcast_kernel`
; ┌ @ int.jl:87 within `+`
   %19 = add nuw nsw i64 %value_phi12, 1
; └
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:64 within `broadcast_kernel`
; ┌ @ broadcast.jl:610 within `getindex`
; │┌ @ broadcast.jl:655 within `_broadcast_getindex`
; ││┌ @ broadcast.jl:679 within `_getindex` @ broadcast.jl:680
; │││┌ @ broadcast.jl:630 within `_broadcast_getindex`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:169 within `getindex`
; │││││┌ @ pointer.jl:111 within `unsafe_load` @ pointer.jl:111
        %20 = load float, float* %15, align 1
; └└└└└└
; ┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:179 within `setindex!` @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:166
; │┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:127 within `#arrayset`
; ││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:134 within `arrayset_bits`
; │││┌ @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\pointer.jl:88 within `unsafe_store!`
; ││││┌ @ none within `pointerset`
; │││││┌ @ none within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
; ││││││┌ @ int.jl:86 within `-`
         %21 = add i64 %18, -1
; ││││││└
        %22 = getelementptr inbounds float, float addrspace(1)* %16, i64 %21
        store float %20, float addrspace(1)* %22, align 4
; └└└└└└
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:61 within `broadcast_kernel`
; ┌ @ int.jl:83 within `<`
   %exitcond.not = icmp eq i64 %19, %2
; └
  br i1 %exitcond.not, label %common.ret, label %L5
}

attributes #0 = { nounwind readnone speculatable }
attributes #1 = { uwtable "frame-pointer"="all" }

!llvm.module.flags = !{!0, !1}
!julia.kernel = !{!2}
!nvvm.annotations = !{!3}

!0 = !{i32 2, !"Dwarf Version", i32 2}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{void ({ i64, i32 }, { i8 addrspace(1)*, i64, [1 x i64], i64 }, { { { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }, [1 x i64] }, [1 x [1 x i64]] }, i64)* @_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE16ComposedFunctionI4last5tupleES4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EE12CuRefPointerIS1_EEES6_}
!3 = !{void ({ i64, i32 }, { i8 addrspace(1)*, i64, [1 x i64], i64 }, { { { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }, [1 x i64] }, [1 x [1 x i64]] }, i64)* @_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE16ComposedFunctionI4last5tupleES4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EE12CuRefPointerIS1_EEES6_, !"kernel", i32 1}
; PTX CompilerJob of MethodInstance for (::GPUArrays.var"#broadcast_kernel#38")(::CUDA.CuKernelContext, ::CuDeviceVector{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) for sm_61
; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #0

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #0

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #0

;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:59 within `broadcast_kernel`
; Function Attrs: uwtable
define ptx_kernel void @_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE1_S4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EEEES6_({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, { [1 x { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }], [1 x [1 x i64]] } %1, i64 signext %2) local_unnamed_addr #1 {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
  %.fca.0.0.2.0.extract = extractvalue { [1 x { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }], [1 x [1 x i64]] } %1, 0, 0, 2, 0
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:63 within `broadcast_kernel`
; ┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:66 within `macro expansion`
; │┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:44 within `linear_index`
; ││┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:20 within `global_index`
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:40 within `threadidx`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:92 within `#threadIdx`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:46 within `threadIdx_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
; ││││││└└
; ││││││┌ @ int.jl:87 within `+`
         %4 = add nuw nsw i32 %3, 1
; │││└└└└
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:38 within `blockidx`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:78 within `#blockIdx`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:56 within `blockIdx_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %5 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
; │││└└└└└
; │││┌ @ int.jl:1042 within `-` @ int.jl:86
      %6 = zext i32 %5 to i64
; │││└
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:39 within `blockdim`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:85 within `#blockDim`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:51 within `blockDim_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %7 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
; │││└└└└└
; │││┌ @ int.jl:1040 within `*`
; ││││┌ @ int.jl:523 within `rem`
       %8 = zext i32 %7 to i64
; ││││└
; ││││ @ int.jl:1042 within `*` @ int.jl:88
      %9 = mul nuw nsw i64 %6, %8
; │││└
; │││┌ @ int.jl:1040 within `+`
; ││││┌ @ int.jl:523 within `rem`
       %10 = zext i32 %4 to i64
; ││││└
; ││││ @ int.jl:1042 within `+` @ int.jl:87
      %11 = add nuw nsw i64 %9, %10
; ││└└
; ││┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:29 within `global_size`
; │││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:41 within `griddim`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:71 within `#gridDim`
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:61 within `gridDim_x`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `_index`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\intrinsics\indexing.jl:7 within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
          %12 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
; │││└└└└└
; │││┌ @ int.jl:88 within `*`
      %13 = mul i32 %12, %7
; ││└└
; ││┌ @ int.jl:1040 within `*`
; │││┌ @ int.jl:523 within `rem`
      %14 = sext i32 %13 to i64
; └└└└
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:61 within `broadcast_kernel`
; ┌ @ int.jl:83 within `<`
   %.not11 = icmp sgt i64 %2, 0
; └
  br i1 %.not11, label %L5.lr.ph, label %common.ret

L5.lr.ph:                                         ; preds = %conversion
  %.fca.0.0.1.0.extract = extractvalue { [1 x { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }], [1 x [1 x i64]] } %1, 0, 0, 1, 0
  %.fca.0.0.0.0.extract = extractvalue { [1 x { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }], [1 x [1 x i64]] } %1, 0, 0, 0, 0
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %15 = and i8 %.fca.0.0.1.0.extract, 1
  %.not10 = icmp eq i8 %15, 0
  %16 = bitcast i8 addrspace(1)* %.fca.0.0.0.0.extract to float addrspace(1)*
  %17 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  br i1 %.not10, label %L5.lr.ph.split.us, label %L5

L5.lr.ph.split.us:                                ; preds = %L5.lr.ph
  %18 = add i64 %.fca.0.0.2.0.extract, -1
  %19 = getelementptr inbounds float, float addrspace(1)* %16, i64 %18
  br label %L5.us

L5.us:                                            ; preds = %L53.us, %L5.lr.ph.split.us
  %value_phi12.us = phi i64 [ 0, %L5.lr.ph.split.us ], [ %22, %L53.us ]
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:63 within `broadcast_kernel`
; ┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:66 within `macro expansion`
; │┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:44 within `linear_index`
; ││┌ @ int.jl:1042 within `*` @ int.jl:88
     %20 = mul i64 %value_phi12.us, %14
; ││└
; ││┌ @ int.jl:87 within `+`
     %21 = add i64 %11, %20
; │└└
; │ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:67 within `macro expansion`
; │┌ @ operators.jl:369 within `>`
; ││┌ @ int.jl:83 within `<`
     %.not9.us = icmp slt i64 %.fca.3.extract, %21
; │└└
   br i1 %.not9.us, label %common.ret, label %L53.us

L53.us:                                           ; preds = %L5.us
; └
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:62 within `broadcast_kernel`
; ┌ @ int.jl:87 within `+`
   %22 = add nuw nsw i64 %value_phi12.us, 1
; └
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:64 within `broadcast_kernel`
; ┌ @ broadcast.jl:610 within `getindex`
; │┌ @ broadcast.jl:655 within `_broadcast_getindex`
; ││┌ @ broadcast.jl:680 within `_getindex`
; │││┌ @ broadcast.jl:649 within `_broadcast_getindex`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:176 within `getindex` @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:164
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:85 within `#arrayref`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:91 within `arrayref_bits`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\pointer.jl:85 within `unsafe_load`
; ││││││││┌ @ none within `pointerref`
; │││││││││┌ @ none within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
            %23 = load float, float addrspace(1)* %19, align 4
; ││└└└└└└└└
; ││ @ broadcast.jl:656 within `_broadcast_getindex`
; ││┌ @ broadcast.jl:683 within `_broadcast_getindex_evalf`
; │││┌ @ float.jl:406 within `-`
      %24 = fneg float %23
; └└└└
; ┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:179 within `setindex!` @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:166
; │┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:127 within `#arrayset`
; ││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:134 within `arrayset_bits`
; │││┌ @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\pointer.jl:88 within `unsafe_store!`
; ││││┌ @ none within `pointerset`
; │││││┌ @ none within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
; ││││││┌ @ int.jl:86 within `-`
         %25 = add i64 %21, -1
; ││││││└
        %26 = getelementptr inbounds float, float addrspace(1)* %17, i64 %25
        store float %24, float addrspace(1)* %26, align 4
; └└└└└└
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:61 within `broadcast_kernel`
; ┌ @ int.jl:83 within `<`
   %exitcond13.not = icmp eq i64 %22, %2
; └
  br i1 %exitcond13.not, label %common.ret, label %L5.us

L5:                                               ; preds = %L53, %L5.lr.ph
  %value_phi12 = phi i64 [ %29, %L53 ], [ 0, %L5.lr.ph ]
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:63 within `broadcast_kernel`
; ┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:66 within `macro expansion`
; │┌ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:44 within `linear_index`
; ││┌ @ int.jl:1042 within `*` @ int.jl:88
     %27 = mul i64 %value_phi12, %14
; ││└
; ││┌ @ int.jl:87 within `+`
     %28 = add i64 %11, %27
; │└└
; │ @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:67 within `macro expansion`
; │┌ @ operators.jl:369 within `>`
; ││┌ @ int.jl:83 within `<`
     %.not9 = icmp slt i64 %.fca.3.extract, %28
; │└└
   br i1 %.not9, label %common.ret, label %L53

common.ret:                                       ; preds = %L53, %L5, %L53.us, %L5.us, %conversion
; └
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl within `broadcast_kernel`
  ret void

L53:                                              ; preds = %L5
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:62 within `broadcast_kernel`
; ┌ @ int.jl:87 within `+`
   %29 = add nuw nsw i64 %value_phi12, 1
; └
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:64 within `broadcast_kernel`
; ┌ @ broadcast.jl:610 within `getindex`
; │┌ @ broadcast.jl:655 within `_broadcast_getindex`
; ││┌ @ broadcast.jl:680 within `_getindex`
; │││┌ @ broadcast.jl:649 within `_broadcast_getindex`
; ││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:176 within `getindex` @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:164
; │││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:85 within `#arrayref`
; ││││││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:91 within `arrayref_bits`
; │││││││┌ @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\pointer.jl:85 within `unsafe_load`
; ││││││││┌ @ none within `pointerref`
; │││││││││┌ @ none within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
; ││││││││││┌ @ int.jl:86 within `-`
             %30 = add i64 %28, -1
; ││││││││││└
            %31 = getelementptr inbounds float, float addrspace(1)* %16, i64 %30
            %32 = load float, float addrspace(1)* %31, align 4
; ││└└└└└└└└
; ││ @ broadcast.jl:656 within `_broadcast_getindex`
; ││┌ @ broadcast.jl:683 within `_broadcast_getindex_evalf`
; │││┌ @ float.jl:406 within `-`
      %33 = fneg float %32
; └└└└
; ┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:179 within `setindex!` @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:166
; │┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:127 within `#arrayset`
; ││┌ @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\device\array.jl:134 within `arrayset_bits`
; │││┌ @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\pointer.jl:88 within `unsafe_store!`
; ││││┌ @ none within `pointerset`
; │││││┌ @ none within `macro expansion` @ C:\Users\Gerhard\.julia\packages\LLVM\RpBog\src\interop\base.jl:38
        %34 = getelementptr inbounds float, float addrspace(1)* %17, i64 %30
        store float %33, float addrspace(1)* %34, align 4
; └└└└└└
;  @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:61 within `broadcast_kernel`
; ┌ @ int.jl:83 within `<`
   %exitcond.not = icmp eq i64 %29, %2
; └
  br i1 %exitcond.not, label %common.ret, label %L5
}

attributes #0 = { nounwind readnone speculatable }
attributes #1 = { uwtable "frame-pointer"="all" }

!llvm.module.flags = !{!0, !1}
!julia.kernel = !{!2}
!nvvm.annotations = !{!3}

!0 = !{i32 2, !"Dwarf Version", i32 2}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{void ({ i64, i32 }, { i8 addrspace(1)*, i64, [1 x i64], i64 }, { [1 x { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }], [1 x [1 x i64]] }, i64)* @_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE1_S4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EEEES6_}
!3 = !{void ({ i64, i32 }, { i8 addrspace(1)*, i64, [1 x i64], i64 }, { [1 x { { i8 addrspace(1)*, i64, [1 x i64], i64 }, [1 x i8], [1 x i64] }], [1 x [1 x i64]] }, i64)* @_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE1_S4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EEEES6_, !"kernel", i32 1}
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
 [2] isdone
   @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\stream.jl:111 [inlined]
 [3] spinning_synchronization(f::typeof(CUDA.isdone), obj::CuStream)
   @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:79
 [4] device_synchronize(; blocking::Bool, spin::Bool)
   @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:171
 [5] device_synchronize()
   @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:169
 [6] top-level scope
   @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\initialization.jl:210

caused by: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] nonblocking_synchronize(val::CuContext)
    @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:163
  [3] device_synchronize(; blocking::Bool, spin::Bool)
    @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:174
  [4] device_synchronize
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:169 [inlined]
  [5] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\module.jl:40
  [6] CuModule
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\lib\cudadrv\module.jl:23 [inlined]
  [7] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\compiler\compilation.jl:365
  [8] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler C:\Users\Gerhard\.julia\packages\GPUCompiler\U36Ed\src\execution.jl:132
  [9] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler C:\Users\Gerhard\.julia\packages\GPUCompiler\U36Ed\src\execution.jl:103
 [10] macro expansion
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:382 [inlined]
 [11] macro expansion
    @ .\lock.jl:267 [inlined]
 [12] cufunction(f::GPUArrays.var"#broadcast_kernel#38", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:377
 [13] cufunction
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:374 [inlined]
 [14] macro expansion
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:104 [inlined]
 [15] #launch_heuristic#1120
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:17 [inlined]
 [16] launch_heuristic
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\gpuarrays.jl:15 [inlined]
 [17] _copyto!
    @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:70 [inlined]
 [18] copyto!
    @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:51 [inlined]
 [19] copy
    @ C:\Users\Gerhard\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:42 [inlined]
 [20] materialize
    @ .\broadcast.jl:873 [inlined]
 [21] broadcast_preserving_zero_d
    @ .\broadcast.jl:862 [inlined]
 [22] -(A::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Base .\abstractarraymath.jl:218
 [23] _minus
    @ C:\Users\Gerhard\.julia\packages\Zygote\YYT6v\src\lib\broadcast.jl:89 [inlined]
 [24] #1185
    @ C:\Users\Gerhard\.julia\packages\Zygote\YYT6v\src\lib\broadcast.jl:86 [inlined]
 [25] #3770#back
    @ C:\Users\Gerhard\.julia\packages\ZygoteRules\4nXuu\src\adjoint.jl:71 [inlined]
 [26] Pullback
    @ .\REPL[6]:1 [inlined]
 [27] (::Zygote.Pullback{Tuple{var"#5#6", CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.ZBack{ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}}}}}}}}, Zygote.var"#3770#back#1189"{Zygote.var"#1185#1188"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.var"#1990#back#194"{Zygote.var"#190#193"{Zygote.Context{false}, GlobalRef, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.materialize), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{}}}})(Δ::Float32)
    @ Zygote C:\Users\Gerhard\.julia\packages\Zygote\YYT6v\src\compiler\interface2.jl:0
 [28] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{var"#5#6", CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.ZBack{ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}}}}}}}}, Zygote.var"#3770#back#1189"{Zygote.var"#1185#1188"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.var"#1990#back#194"{Zygote.var"#190#193"{Zygote.Context{false}, GlobalRef, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.materialize), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Tuple{}}}}})(Δ::Float32)
    @ Zygote C:\Users\Gerhard\.julia\packages\Zygote\YYT6v\src\compiler\interface.jl:45
 [29] top-level scope
    @ C:\Users\Gerhard\.julia\packages\GPUCompiler\U36Ed\src\reflection.jl:206
 [30] top-level scope
    @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\initialization.jl:208

@dorn-gerhard
dorn-gerhard commented Dec 14, 2023

I also switched to CUDA runtime version 12.2 and tried the compute-sanitizer. The line `using CUDA` inside the sanitizer session triggered the following error:

julia> using CUDA

julia> CUDA.versioninfo()
CUDA runtime 12.2, artifact installation
CUDA driver 12.0
NVIDIA driver 528.79.0

CUDA libraries:
- CUBLAS: 12.2.5
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.2
- CUSPARSE: 12.1.2
- CUPTI: 20.0.0
- NVML: 12.0.0+528.79

Julia packages:
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

Preferences:
- CUDA_Runtime_jll.version: 12.2

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.926 GiB / 4.000 GiB available)

julia> CUDA.run_compute_sanitizer()
Re-starting your active Julia session...
========= COMPUTE-SANITIZER
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.4 (2023-11-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using CUDA

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x6be45c40 -- nvtxGlobals_v3 at C:\Users\Gerhard\.julia\artifacts\b4eeaf094ffb6aacf1b20ee5d2ac9aa1818fc732\bin\libnvToolsExt.dll (unknown line)
in expression starting at REPL[1]:1
nvtxGlobals_v3 at C:\Users\Gerhard\.julia\artifacts\b4eeaf094ffb6aacf1b20ee5d2ac9aa1818fc732\bin\libnvToolsExt.dll (unknown line)
Allocations: 703421 (Pool: 702539; Big: 882); GC: 1
========= Error: Target application terminated before first instrumented API call
ERROR: failed process: Process(setenv(`'C:\Users\Gerhard\.julia\artifacts\0cdffaf70d865a7149744c4c5670ea6b2145e80d\bin\compute-sanitizer.exe' --tool memcheck --launch-timeout=0 --target-processes=all --report-api-errors=no 'C:\Users\Gerhard\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\bin\julia.exe' -Cnative '-JC:\Users\Gerhard\.julia\juliaup\julia-1.9.4+0.x64.w64.mingw32\lib\julia\sys.dll' -g1 '--project=C:\Users\Gerhard\.julia\environments\v1.9\Project.toml'`,["WINDIR=C:\\WINDOWS", "PATH=C:\\Users\\Gerhard\\.julia\\artifacts\\0cdffaf70d865a7149744c4c5670ea6b2145e80d\\bin;C:\\Users\\Gerhard\\.julia\\juliaup\\julia-1.9.4+0.x64.w64.mingw32\\bin\\..\\lib\\julia;C:\\Users\\Gerhard\\.julia\\juliaup\\julia-1.9.4+0.x64.w64.mingw32\\bin\\..\\lib;C:\\Users\\Gerhard\\.julia\\juliaup\\julia-1.9.4+0.x64.w64.mingw32\\bin;E:\\Programs\\VM Ware\\bin\\;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Razer Chroma SDK\\bin;C:\\Program Files\\Razer Chroma SDK\\bin;C:\\Program Files (x86)\\Razer\\ChromaBroadcast\\bin;C:\\Program Files\\Razer\\ChromaBroadcast\\bin;C:\\Program Files\\ImageMagick-6.9.10-Q16;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\ProgramData\\Oracle\\Java\\javapath;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files\\MiKTeX 2.9\\miktex\\bin\\x64\\;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\Android;C:\\Program Files\\MATLAB\\R2019a\\bin;C:\\Program Files\\MATLAB\\R2018b\\bin;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\gs\\gs9.20\\bin;C:\\ad;C:\\Program Files (x86)\\Intel\\iCLS Client\\;C:\\Program Files\\Intel\\iCLS Client\\;C:\\Program Files (x86)\\GNU\\GnuPG\\pub;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Int;C:\\WINDOWS\\system32\\config\\systemprofile\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program 
Files\\PuTTY\\;C:\\Program Files (x86)\\Gpg4win\\..\\GnuPG\\bin;C:\\platform-tools\\;C:\\Program Files\\Git\\cmd;C:\\Users\\Gerhard\\AppData\\Roaming\\Python\\Python39\\Scripts;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files (x86)\\PDFtk\\bin\\;C:\\Program Files (x86)\\PDFtk Server\\bin\\;C:\\Program Files (x86)\\GitExtensions\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\ArangoDB3 3.9.3\\usr\\bin;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git LFS;C:\\Program Files\\Java\\jdk-20\\bin;C:\\Program Files\\nodejs\\;C:\\Users\\Gerhard\\AppData\\Local\\Microsoft\\WindowsApps;C;C:\\Users\\Gerhard\\scoop\\shims;C:\\Users\\Gerhard\\AppData\\Local\\Programs\\Python\\Python39\\Scripts\\;C:\\Users\\Gerhard\\AppData\\Local\\Programs\\Python\\Python39\\;C:\\Program Files\\MATLAB\\R2018a\\bin;C:\\Program Files (x86)\\Intel\\Intel(R) Management E;C:\\Users\\Gerhard\\AppData\\Local\\Programs\\Microsoft VS Code\\bin", "USERDOMAIN_ROAMINGPROFILE=DESKTOP-BTMG2IL", "ZES_ENABLE_SYSMAN=1", "LOCALAPPDATA=C:\\Users\\Gerhard\\AppData\\Local", "HOMEPATH=\\Users\\Gerhard", "PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 158 Stepping 9, GenuineIntel", "NUMBER_OF_PROCESSORS=8", "PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC", "CYGWIN=nodosfilewarning"  …  "USERPROFILE=C:\\Users\\Gerhard", "DRIVERDATA=C:\\Windows\\System32\\Drivers\\DriverData", "ANDROID_SDK_HOME=C:\\Android", "PROCESSOR_LEVEL=6", "SYSTEMDRIVE=C:", "PROGRAMW6432=C:\\Program Files", "TEMP=C:\\Users\\Gerhard\\AppData\\Local\\Temp", "HOMEDRIVE=C:", "OPENBLAS_MAIN_FREE=1", "PROCESSOR_ARCHITECTURE=AMD64"]), ProcessExited(4294967295)) [4294967295]

Stacktrace:
 [1] pipeline_error
   @ .\process.jl:565 [inlined]
 [2] run(::Cmd; wait::Bool)
   @ Base .\process.jl:480
 [3] run
   @ .\process.jl:477 [inlined]
 [4] run_compute_sanitizer(julia_args::Cmd; tool::String, sanitizer_args::Cmd)
   @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\utilities.jl:200
 [5] run_compute_sanitizer
   @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\utilities.jl:196 [inlined]
 [6] run_compute_sanitizer()
   @ CUDA C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\utilities.jl:196
 [7] top-level scope
   @ REPL[3]:1
 [8] top-level scope
   @ C:\Users\Gerhard\.julia\packages\CUDA\YIj5X\src\initialization.jl:208

@maleadt
Contributor

maleadt commented Dec 15, 2023

> I also switched to CUDA runtime version 12.2 and tried the compute-sanitizer,

Does just using 12.2 work, i.e., not running under compute-sanitizer?

@dorn-gerhard

no, just using 12.2 also throws an error

@yolhan83

yolhan83 commented Dec 18, 2023

Hello, I have the same issue,
using Statistics, CUDA, Tracker, Zygote
f(x) = mean(x)
x = rand(Float32, 2, 1000) |> cu
f(x)
Tracker.gradient(f, x)
Zygote.gradient(f, x)

Tracker works fine, but Zygote throws ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS).
The Flux losses mse and crossentropy also end up throwing.
Otherwise everything works on my GPU.

I'm running a fresh installation of Julia (juliaup +release):
Julia Version 1.9.4
Commit 8e5136fa29 (2023-11-14 08:46 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
Threads: 1 on 20 virtual cores

with the following CUDA :
CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 546.33.0

CUDA libraries:

  • CUBLAS: 12.3.4
  • CURAND: 10.3.4
  • CUFFT: 11.0.12
  • CUSOLVER: 11.5.4
  • CUSPARSE: 12.2.0
  • CUPTI: 21.0.0
  • NVML: 12.0.0+546.33

Julia packages:

  • CUDA: 5.1.1
  • CUDA_Driver_jll: 0.7.0+0
  • CUDA_Runtime_jll: 0.10.1+0

Toolchain:

  • Julia: 1.9.4
  • LLVM: 14.0.6

1 device:
0: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89, 7.304 GiB / 7.996 GiB available)

and the following versions for packages:
[052768ef] CUDA v5.1.1
[9f7883ad] Tracker v0.2.30
[e88e6eb3] Zygote v0.6.67
[10745b16] Statistics v1.9.0

Also, here is the failing block when I run @device_code_llvm dump_module=true grads = Zygote.gradient(f,x). I also have a lot of LLVM code; if you think it's needed I will post it.

fail: ; preds = %L5.lr.ph.L5.lr.ph.split_crit_edge
; @ C:\Users\yolha\.julia\packages\GPUArrays\dAUOE\src\host\broadcast.jl:63 within broadcast_kernel
; ┌ @ C:\Users\yolha\.julia\packages\GPUArrays\dAUOE\src\device\indexing.jl:81 within macro expansion
; │┌ @ abstractarray.jl:1296 within getindex
; ││┌ @ abstractarray.jl:1341 within _getindex
; │││┌ @ abstractarray.jl:1348 within _to_subscript_indices
; ││││┌ @ abstractarray.jl:1370 within _unsafe_ind2sub
; │││││┌ @ abstractarray.jl:2940 within _ind2sub @ abstractarray.jl:2978
; ││││││┌ @ abstractarray.jl:2991 within _ind2sub_recurse
; │││││││┌ @ abstractarray.jl:2998 within _div
; ││││││││┌ @ int.jl:295 within div
call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception1 to i64))
call fastcc void @gpu_signal_exception({ i64, i32 } %state)
call void @llvm.trap()
call void @llvm.trap()
call void asm sideeffect "exit;", ""()
unreachable
; └└└└└└└└└
}

Also, the weirdest thing: this works:
using CUDA,Zygote,StatsBase

x = rand(1,1000) |> cu

StatsBase.Statistics._mean(identity,x)

gradient(x-> StatsBase.Statistics._mean(identity,x),x)

mymean(x::AbstractArray;dims=:) = StatsBase.Statistics._mean(identity,x,dims)

gradient(mymean,x)

mymean is essentially identical to StatsBase.Statistics.mean, yet it doesn't throw during gradient calculation.

Finally, it's working with:

⌅ [052768ef] CUDA v4.4.1
⌃ [587475ba] Flux v0.14.5
⌃ [02a925ec] cuDNN v1.1.1

If you need anything else, don't hesitate to ask. Thank you all.

@ToucheSir
Member

ToucheSir commented Dec 20, 2023

Great, I think that does narrow this down quite a bit! Can someone test the following Zygote-less example? You'll need to install ChainRulesCore and ChainRules into your test environment.

using CUDA, ChainRulesCore, ChainRules, Statistics
x = CUDA.rand(Float32,2,1000)
_, back = rrule(mean, x)
_, dx = back(1.0f0)
unthunk(dx)

@koenvos
Author

koenvos commented Dec 21, 2023

That works without error and gives me

(NoTangent(), InplaceableThunk(ChainRules.var"#..., Thunk(ChainRules.var"#...)))

@ToucheSir
Member

I missed that the pullback returned a thunk. Can you try the edited version above?

@koenvos
Author

koenvos commented Dec 21, 2023

If I copy-paste that into a newly-opened REPL, the code works and returns a result. But as soon as I do a CUDA operation afterwards it crashes with (eventually) the 700 error:

julia> using CUDA, ChainRulesCore, ChainRules, Statistics

julia> x = CUDA.rand(Float32,2,1000)
2×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.205128  0.344927  0.593451  0.40473   0.826242  0.348554  0.278663  0.0688475  0.481032   0.380625  0.0614015  0.686919  …  0.924595  0.888197  0.0193105  0.242006  0.0873997  0.4841    0.745524  0.159897  0.457518  0.454996  0.280042  0.192795
 0.942227  0.835056  0.717844  0.131631  0.1714    0.969516  0.356804  0.62123    0.0475908  0.554652  0.0157555  0.671628     0.822441  0.548564  0.260347   0.518379  0.706009   0.475598  0.204736  0.129832  0.881828  0.444721  0.802818  0.8167       

julia> _, back = rrule(mean, x)
(0.49776343f0, ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}}}(2000, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}}(Colon(), Float32[0.20512752 0.3449274 … 0.28004175 0.19279478; 0.9422275 0.8350562 … 0.80281806 0.81670034], ProjectTo{AbstractArray}(element = ProjectTo{Float32}(), axes = (Base.OneTo(2), Base.OneTo(1000))))))

julia> _, dx = back(1.0f0)
(NoTangent(), InplaceableThunk(ChainRules.var"#..., Thunk(ChainRules.var"#...)))

julia> unthunk(dx)
2×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 288.498  295.661  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0       
 283.767  127.602  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0       

julia> a=cu([1])
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), details=CUDA.Optional{String}(data=nothing))
throw_api_error at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
check at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:34 [inlined]
cuMemHostUnregister at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\utils\call.jl:26 [inlined]
unregister at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:193 [inlined]
#21 at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:701 [inlined]
#context!#915 at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:170 [inlined]
context! at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:165 [inlined]
macro expansion at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:700 [inlined]
macro expansion at .\lock.jl:267 [inlined]
__unpin at C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:694
#1081 at C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\compiler\execution.jl:180
unknown function (ip: 000002a9491e9d46)
run_finalizer at C:/workdir/src\gc.c:417
jl_gc_run_finalizers_in_list at C:/workdir/src\gc.c:507
run_finalizers at C:/workdir/src\gc.c:553
jl_mutex_unlock at C:/workdir/src\julia_locks.h:81 [inlined]
jl_generate_fptr_impl at C:/workdir/src\jitlayers.cpp:467
jl_compile_method_internal at C:/workdir/src\gf.c:2348
jl_compile_method_internal at C:/workdir/src\gf.c:2241 [inlined]
_jl_invoke at C:/workdir/src\gf.c:2750 [inlined]
ijl_apply_generic at C:/workdir/src\gf.c:2940
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:226
eval_stmt_value at C:/workdir/src\interpreter.c:177 [inlined]
eval_body at C:/workdir/src\interpreter.c:624
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:762
top-level scope at REPL[101]:1
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:912
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:856
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:856
ijl_toplevel_eval at C:/workdir/src\toplevel.c:921 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:971
eval at .\boot.jl:370 [inlined]
eval at .\Base.jl:68 [inlined]
repleval at c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\repl.jl:229
#110 at c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\repl.jl:192
unknown function (ip: 000002a90d1d8073)
with_logstate at .\logging.jl:514
with_logger at .\logging.jl:626 [inlined]
#109 at c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\repl.jl:193
unknown function (ip: 000002a90d1d7f03)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:774
#invokelatest#2 at .\essentials.jl:819 [inlined]
invokelatest at .\essentials.jl:816
unknown function (ip: 000002a90d1a69a6)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
do_apply at C:/workdir/src\builtins.c:730
macro expansion at c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\eval.jl:34 [inlined]
#62 at .\task.jl:514
unknown function (ip: 000002a90d18b163)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
start_task at C:/workdir/src\task.c:1092
WARNING: Error while freeing DeviceBuffer(7.812 KiB at 0x0000000204802200):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), details=CUDA.Optional{String}(data=nothing))

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] check
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:34 [inlined]
  [3] cuMemFreeAsync
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\utils\call.jl:26 [inlined]
  [4] #free#2
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:97 [inlined]
  [5] free
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:92 [inlined]
  [6] #actual_free#1001
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:78 [inlined]
  [7] actual_free
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:75 [inlined]
  [8] #_free#1026
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:506 [inlined]
  [9] _free
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:493 [inlined]
 [10] macro expansion
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:478 [inlined]
 [11] macro expansion
    @ .\timing.jl:393 [inlined]
 [12] #free#1025
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:477 [inlined]
 [13] free
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:466 [inlined]
 [14] (::CUDA.var"#1032#1033"{CUDA.Mem.DeviceBuffer, Bool})()
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:101
 [15] #context!#915
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:170 [inlined]
 [16] context!
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\state.jl:165 [inlined]
 [17] _free_buffer(buf::CUDA.Mem.DeviceBuffer, early::Bool)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:89
 [18] release(rc::GPUArrays.RefCounted{CUDA.Mem.DeviceBuffer}, args::Bool)
    @ GPUArrays C:\Users\koenv\.julia\packages\GPUArrays\dAUOE\src\host\abstractarray.jl:42
 [19] unsafe_free!
    @ C:\Users\koenv\.julia\packages\GPUArrays\dAUOE\src\host\abstractarray.jl:90 [inlined]
 [20] unsafe_finalize!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:113
 [21] top-level scope
    @ REPL[101]:1
 [22] eval
    @ .\boot.jl:370 [inlined]
 [23] eval
    @ .\Base.jl:68 [inlined]
 [24] repleval(m::Module, code::Expr, #unused#::String)
    @ VSCodeServer c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\repl.jl:229
 [25] (::VSCodeServer.var"#110#112"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
    @ VSCodeServer c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\repl.jl:192
 [26] with_logstate(f::Function, logstate::Any)
    @ Base.CoreLogging .\logging.jl:514
 [27] with_logger
    @ .\logging.jl:626 [inlined]
 [28] (::VSCodeServer.var"#109#111"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
    @ VSCodeServer c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\repl.jl:193
 [29] #invokelatest#2
    @ .\essentials.jl:819 [inlined]
 [30] invokelatest(::Any)
    @ Base .\essentials.jl:816
 [31] macro expansion
    @ c:\Users\koenv\.vscode\extensions\julialang.language-julia-1.61.2\scripts\packages\VSCodeServer\src\eval.jl:34 [inlined]
 [32] (::VSCodeServer.var"#62#63")()
    @ VSCodeServer .\task.jl:514
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
  [2] check
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:34 [inlined]
  [3] cuMemAllocFromPoolAsync
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\utils\call.jl:26 [inlined]
  [4] alloc(::Type{CUDA.Mem.DeviceBuffer}, bytesize::Int64; async::Bool, stream::CuStream, pool::CuMemoryPool)
    @ CUDA.Mem C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:81
  [5] alloc
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\lib\cudadrv\memory.jl:71 [inlined]
  [6] actual_alloc(bytes::Int64; async::Bool, stream::CuStream, pool::CuMemoryPool)
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:66
  [7] actual_alloc
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:59 [inlined]
  [8] #1019
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:436 [inlined]
  [9] retry_reclaim(f::CUDA.var"#1019#1021"{CuStream, Int64, CuMemoryPool}, isfailed::typeof(isnothing))
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:359
 [10] macro expansion
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:435 [inlined]
 [11] macro expansion
    @ .\timing.jl:393 [inlined]
 [12] #_alloc#1018
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:431 [inlined]
 [13] _alloc
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:427 [inlined]
 [14] #alloc#1017
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:417 [inlined]
 [15] alloc
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\pool.jl:411 [inlined]
 [16] CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}(#unused#::UndefInitializer, dims::Tuple{Int64})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:74
 [17] CuArray
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:412 [inlined]
 [18] adapt_storage
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:724 [inlined]
 [19] adapt_structure
    @ C:\Users\koenv\.julia\packages\Adapt\cdaEv\src\Adapt.jl:57 [inlined]
 [20] adapt
    @ C:\Users\koenv\.julia\packages\Adapt\cdaEv\src\Adapt.jl:40 [inlined]
 [21] #cu#1065
    @ C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:792 [inlined]
 [22] cu(xs::Vector{Int64})
    @ CUDA C:\Users\koenv\.julia\packages\CUDA\YIj5X\src\array.jl:779
 [23] top-level scope
    @ REPL[101]:1

@yolhan83

yolhan83 commented Dec 21, 2023

Same for me with
[052768ef] CUDA v5.1.1
[082447d4] ChainRules v1.58.1
[d360d2e6] ChainRulesCore v1.19.0

It's throwing at:

julia> using CUDA, ChainRulesCore, ChainRules,Statistics

julia> x = CUDA.rand(Float32,2,1000)
2×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.374244  0.402295  0.0273562  0.949032  0.496665  0.447222  …  0.204372  0.895638  0.893051  0.685116  0.653696
 0.664527  0.33093   0.562389   0.881603  0.215851  0.793411     0.577452  0.575083  0.199129  0.868674  0.726484

julia> _, back = rrule(mean, x)
(0.48213166f0, ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}}}(2000, ChainRules.var"#sum_pullback#1633"{Colon, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}}(Colon(), Float32[0.37424353 0.4022953 … 0.6851163 0.6536963; 0.66452676 0.3309297 … 0.86867446 0.7264841], ProjectTo{AbstractArray}(element = ProjectTo{Float32}(), axes = (Base.OneTo(2), Base.OneTo(1000))))))

julia> _, dx = back(1.0f0)
(NoTangent(), InplaceableThunk(ChainRules.var"#..., Thunk(ChainRules.var"#...)))

julia> unthunk(dx)
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA C:\Users\yolha\.julia\packages\CUDA\YIj5X\lib\cudadrv\libcuda.jl:27
 [2] nonblocking_synchronize(val::CuContext)
   @ CUDA C:\Users\yolha\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:163
 [3] device_synchronize(; blocking::Bool, spin::Bool)
   @ CUDA C:\Users\yolha\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:174
 [4] device_synchronize()
   @ CUDA C:\Users\yolha\.julia\packages\CUDA\YIj5X\lib\cudadrv\synchronization.jl:169
 [5] top-level scope
   @ C:\Users\yolha\.julia\packages\CUDA\YIj5X\src\initialization.jl:210

working fine with

⌃ [052768ef] CUDA v5.0.0
  [082447d4] ChainRules v1.58.1
  [d360d2e6] ChainRulesCore v1.19.0

I've also tried different versions of ChainRules and ChainRulesCore; it's not working as long as I keep CUDA v5.1.0 or v5.1.1.
I guess one of the changes in https://github.com/JuliaGPU/CUDA.jl/releases/tag/v5.1.0 did that.

@maleadt
Contributor

maleadt commented Dec 21, 2023

Ah interesting, seeing the pin/register in there makes me suspect this is due to unified memory behaving differently on Windows. It's likely caused by JuliaGPU/CUDA.jl#2109; I'll investigate further.

@maleadt
Contributor

maleadt commented Dec 21, 2023

Looks like broadcast uses Ref boxes to pass scalar inputs (that's not what Refs are for...). Those ephemeral objects get freed immediately, so accessing them from the GPU results in an illegal memory access. I'll revert the change; if people care greatly about the ability to mutate Refs from the GPU, we could root those objects to prevent them getting freed, but that would make launching kernels more expensive.
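To illustrate the failure mode described above, here is a hedged CPU-side analogue (a hypothetical illustration, not CUDA.jl's actual internals): taking a raw pointer to an object does not keep that object alive, so an ephemeral Ref can be collected before the pointer is used — the same lifetime issue as a kernel reading a Ref box that was freed after launch. Rooting the object with GC.@preserve is the kind of fix maleadt alludes to.

```julia
# Hypothetical illustration of the lifetime hazard; not CUDA.jl code.

function bad_read()
    # The Ref is ephemeral: nothing keeps it alive once the pointer is taken,
    # so dereferencing `p` may read freed memory (undefined behavior).
    p = Base.unsafe_convert(Ptr{Float64}, Ref(42.0))
    unsafe_load(p)
end

function good_read()
    r = Ref(42.0)
    GC.@preserve r begin
        # `r` is rooted for the whole block, so the pointer stays valid.
        p = Base.unsafe_convert(Ptr{Float64}, r)
        unsafe_load(p)
    end
end
```

The trade-off mentioned above is exactly this: rooting every scalar argument for the lifetime of a kernel launch would add bookkeeping to each launch.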

@yolhan83

Working on master, thanks a lot

julia> unthunk(dx)
2×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0005  0.0005  0.0005  0.0005  0.0005  0.0005  0.0005  …  0.0005  0.0005  0.0005  0.0005  0.0005  0.0005  0.0005
 0.0005  0.0005  0.0005  0.0005  0.0005  0.0005  0.0005     0.0005  0.0005  0.0005  0.0005  0.0005  0.0005  0.0005

(@v1.9) pkg> st
Status `C:\Users\yolha\.julia\environments\v1.9\Project.toml`
  [052768ef] CUDA v5.1.1 `https://github.com/JuliaGPU/CUDA.jl.git#master`
⌃ [082447d4] ChainRules v1.42.0
  [d360d2e6] ChainRulesCore v1.19.0
Info Packages marked with ⌃ have new versions available and may be upgradable.

@koenvos
Author

koenvos commented Dec 22, 2023

Yep, master works for me too now - awesome!

@ToucheSir
Member

Wonderful. @dorn-gerhard can you confirm this works for you too? Then I'll close out the issue.

@dorn-gerhard

@ToucheSir yes the problem seems to be fixed with current CUDA#master branch.
Thanks for fixing :)

@tom-plaa

tom-plaa commented Jan 3, 2024

I'm still getting the exact same error with Flux 0.14.8, but on Ubuntu 20.04 rather than Windows.
It also points to the line where the loss function is defined.

I think the lines of interest in the stacktrace might be

[32] (::Zygote.Pullback{Tuple{Flux.Losses.var"##mae#13", typeof(Statistics.mean), typeof(Flux.Losses.mae), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.Pullback{Tuple{typeof(Base.Broadcast.materialize), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{}}, Zygote.ZBack{ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}}}}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.broadcasted), typeof(abs), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.var"#2849#back#672"{Zygote.var"#map_back#666"{typeof(Base.Broadcast.broadcastable), 1, Tuple{Tuple{}}, Tuple{Val{0}}, Tuple{}}}, Zygote.var"#2173#back#293"{Zygote.var"#291#292"{Tuple{Tuple{Nothing}, Tuple{}}, Zygote.var"#combine_styles_pullback#1166"{Tuple{Nothing, Nothing}}}}, Zygote.var"#2173#back#293"{Zygote.var"#291#292"{Tuple{Tuple{Nothing, Nothing, Nothing}, Tuple{}}, Zygote.var"#4165#back#1424"{Zygote.var"#bc_fwd_back#1398"{CUDA.CuArray{ForwardDiff.Dual{Nothing, Float32, 1}, 2, CUDA.Mem.DeviceBuffer}, Tuple{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Val{1}}}}}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.broadcastable), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{}}}}, Zygote.ZBack{Flux.Losses.var"#_check_sizes_pullback#12"}, Zygote.var"#3770#back#1189"{Zygote.var"#1185#1188"{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}}})(Δ::Float64)
    @ Zygote ~/.julia/packages/Zygote/WOy6z/src/compiler/interface2.jl:0
 [33] Pullback
    @ ~/.julia/packages/Flux/PpGmk/src/losses/functions.jl:21 [inlined]
 [34] (::Zygote.Pullback{Tuple{typeof(Flux.Losses.mae), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.Pullback{Tuple{Flux.Losses.var"##mae#13", typeof(Statistics.mean), typeof(Flux.Losses.mae), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.Pullback{Tuple{typeof(Base.Broadcast.materialize), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{}}, Zygote.ZBack{ChainRules.var"#mean_pullback#1821"{Int64, ChainRules.var"#sum_pullback#1633"{Colon, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}}}}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.broadcasted), typeof(abs), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{Zygote.var"#2849#back#672"{Zygote.var"#map_back#666"{typeof(Base.Broadcast.broadcastable), 1, Tuple{Tuple{}}, Tuple{Val{0}}, Tuple{}}}, Zygote.var"#2173#back#293"{Zygote.var"#291#292"{Tuple{Tuple{Nothing}, Tuple{}}, Zygote.var"#combine_styles_pullback#1166"{Tuple{Nothing, Nothing}}}}, Zygote.var"#2173#back#293"{Zygote.var"#291#292"{Tuple{Tuple{Nothing, Nothing, Nothing}, Tuple{}}, Zygote.var"#4165#back#1424"{Zygote.var"#bc_fwd_back#1398"{CUDA.CuArray{ForwardDiff.Dual{Nothing, Float32, 1}, 2, CUDA.Mem.DeviceBuffer}, Tuple{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Val{1}}}}}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.var"#2017#back#204"{typeof(identity)}, Zygote.Pullback{Tuple{typeof(Base.Broadcast.broadcastable), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Tuple{}}}}, Zygote.ZBack{Flux.Losses.var"#_check_sizes_pullback#12"}, Zygote.var"#3770#back#1189"{Zygote.var"#1185#1188"{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}}}}})(Δ::Float64)
    @ Zygote ~/.julia/packages/Zygote/WOy6z/src/compiler/interface2.jl:0
 [35] Pullback
    @ ~/mydir/MyProject/src/mycomponent/mycode.jl:39 [inlined]
 [36] (::Zygote.Pullback{Tuple{MyProject.mycode.var"#L1loss#16"{Float64}, CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Any})(Δ::Float64)

And the line ~/mydir/MyProject/src/mycomponent/mycode.jl:39 is pointed out below:

(...)
opt = Flux.Descent(learning_rate)

ps = Flux.params(model)

L2penalty() = regularization * 0.5 * sum(x -> sum(abs2, x), Flux.params(model))

L1loss(x, y) = Flux.mae(model(x), y) + L2penalty()  # this is the relevant line

# Batching
train_data = Flux.Data.DataLoader((inputs, labels), batchsize=batchsize, shuffle=false)

loss_history = zeros(Float32, epochs)

# Training
for epoch in 1:epochs
    Flux.train!(L1loss, ps, train_data, opt)
    loss_history[epoch] = L1loss(inputs, labels)
    show_trace ? (@info "Epoch = $epoch : Training Loss = $(loss_history[epoch])") : nothing
end
(...)

I can't post the full code but I might try to get an MWE later.

@maleadt
Contributor

maleadt commented Jan 3, 2024

Are you using CUDA.jl#master?

@tom-plaa

tom-plaa commented Jan 3, 2024

I'm sorry, I was assuming this had already been merged on CUDA.jl since then. Let me try it out and report back.

@maleadt
Contributor

maleadt commented Jan 3, 2024

It's been merged on CUDA.jl, but not tagged yet. I hope to release a new version this week.

@tom-plaa

tom-plaa commented Jan 3, 2024

This is probably not the right place to post this, but I can't manage to install the master branch of CUDA.jl because its requirements are incompatible with Flux:

pkg> add https://github.com/JuliaGPU/CUDA.jl.git#master
    Updating git-repo `https://github.com/JuliaGPU/CUDA.jl.git`
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Adapt [79e6a3ab]:
 Adapt [79e6a3ab] log:
 ├─possible versions are: 0.3.0-4.0.0 or uninstalled
 ├─restricted to versions 4 by CUDA [052768ef], leaving only versions: 4.0.0
 │ └─CUDA [052768ef] log:
 │   ├─possible versions are: 5.1.1 or uninstalled
 │   └─CUDA [052768ef] is fixed to version 5.1.1
 └─restricted by compatibility requirements with Flux [587475ba] to versions: 3.0.0-3.7.2 — no versions left
   └─Flux [587475ba] log:
     ├─possible versions are: 0.4.1-0.14.8 or uninstalled
     └─restricted to versions 0.14 by MyProject [4b9087f2], leaving only versions: 0.14.0-0.14.8
       └─MyProject [4b9087f2] log:
         ├─possible versions are: 0.3.0 or uninstalled
         └─MyProject [4b9087f2] is fixed to version 0.3.0

@maleadt
Contributor

maleadt commented Jan 3, 2024

FluxML/Flux.jl#2362

@tom-plaa

tom-plaa commented Jan 3, 2024

All right. For now I will roll back to CUDA.jl v5.0.0, and once both these changes are released I'll test again.

@maleadt
Contributor

maleadt commented Jan 7, 2024

I just tagged CUDA.jl v5.1.2, which should include the fix without requiring the latest Adapt.jl.
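For anyone updating by hand, resolving to the fixed release via the Pkg API might look like this (a sketch; the version number is the v5.1.2 tag mentioned above):

```julia
using Pkg

# Request the first tagged release containing the fix.
Pkg.add(name = "CUDA", version = "5.1.2")

# Confirm which version the resolver actually picked.
Pkg.status("CUDA")
```

If the resolver still picks an older version, a compat entry in your project (or Flux's, per FluxML/Flux.jl#2362) is usually what's holding it back.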

@ToucheSir
Member

Thanks Tim!

@tom-plaa

Just reporting back: now on Julia v1.10.1 with CUDA.jl v5.2.0 and cuDNN.jl v1.3.0, our model is training. Thanks!
