CUDA.jl initialisation fails after suspending Ubuntu 20.04 with CUDA 11.2 #605

qin-yu opened this issue Dec 23, 2020 · 2 comments
qin-yu commented Dec 23, 2020

Describe the bug

CUDA.jl initialisation fails after suspending Ubuntu 20.04 with CUDA 11.2

Additional context

You may also see an unrelated error:

Error: Exception while generating log record in module CUDA at 
│   exception =
│    UndefVarError: ex not defined
│    Stacktrace:

This is described in #603 and fixed by #604.

To reproduce

The Minimal Working Example (MWE) for this bug:

Launch Juno in Atom and run the following:

using CUDA

# do some random stuff
W = cu(rand(2, 5)) # a 2×5 CuArray
b = cu(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = cu(rand(5)), cu(rand(2)) # Dummy data
loss(x, y) # ~ 3

# Suspend the machine

To suspend the machine:

  1. click the top-right of the screen
  2. click Power Off / Log Out
  3. click Suspend

Now wake the machine. The existing Julia session no longer works with CUDA.jl. After restarting Atom/Juno, or just Julia in a terminal, the new session fails with "ERROR: CUDA.jl did not successfully initialize, and is not usable." as soon as a CUDA call such as cu(rand(2)) runs.

Press Enter to start a new session.
Starting Julia...
   _       _ _(_)_     |  Documentation:
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.0-DEV.136 (2020-12-22)
 _/ |\__'_|_|_|\__'_|  |  Commit 549a73b99d (1 day old master)
|__/                   |
┌ Error: Recursion during initialization of CUDA.jl
└ @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:38
┌ Error: Error during initialization of CUDA.jl
│   exception =
│    CUDA error (code 999, CUDA_ERROR_UNKNOWN)
│    Stacktrace:
│      [1] throw_api_error(res::CUDA.cudaError_enum)
│        @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/error.jl:97
│      [2] __configure__()
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:93
│      [3] macro expansion
│        @ ~/.julia/dev/CUDA/src/initialization.jl:30 [inlined]
│      [4] macro expansion
│        @ ./lock.jl:209 [inlined]
│      [5] _functional(show_reason::Bool)
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:26
│      [6] functional(show_reason::Bool)
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:19
│      [7] libcuda()
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:47
│      [8] macro expansion
│        @ ~/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:29 [inlined]
│      [9] macro expansion
│        @ ~/.julia/dev/CUDA/lib/cudadrv/error.jl:102 [inlined]
│     [10] cuDeviceGet
│        @ ~/.julia/dev/CUDA/lib/utils/call.jl:26 [inlined]
│     [11] CuDevice
│        @ ~/.julia/dev/CUDA/lib/cudadrv/devices.jl:25 [inlined]
│     [12] initialize_thread(tid::Int64)
│        @ CUDA ~/.julia/dev/CUDA/src/state.jl:121
│     [13] prepare_cuda_call()
│        @ CUDA ~/.julia/dev/CUDA/src/state.jl:80
│     [14] device
│        @ ~/.julia/dev/CUDA/src/state.jl:227 [inlined]
│     [15] alloc
│        @ ~/.julia/dev/CUDA/src/pool.jl:293 [inlined]
│     [16] CuArray{Float32, 2}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
│        @ CUDA ~/.julia/dev/CUDA/src/array.jl:20
│     [17] CuArray
│        @ ~/.julia/dev/CUDA/src/array.jl:76 [inlined]
│     [18] similar
│        @ ./abstractarray.jl:779 [inlined]
│     [19] convert(AT::Type{CuArray{Float32, N} where N}, A::Matrix{Float64})
│        @ GPUArrays ~/.julia/packages/GPUArrays/jhRU7/src/host/construction.jl:82
│     [20] adapt_storage
│        @ ~/.julia/dev/CUDA/src/array.jl:330 [inlined]
│     [21] adapt_structure
│        @ ~/.julia/packages/Adapt/8kQMV/src/Adapt.jl:42 [inlined]
│     [22] adapt
│        @ ~/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
│     [23] cu(xs::Matrix{Float64})
│        @ CUDA ~/.julia/dev/CUDA/src/array.jl:342
│     [24] top-level scope
│        @ ~/workspace/3dunet/test-cuda.jl:3
│     [25] eval
│        @ ./boot.jl:369 [inlined]
│     [26] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
│        @ Base ./loading.jl:1090
│     [27] include_string
│        @ ~/.julia/packages/Atom/kFuIK/src/utils.jl:286 [inlined]
│     [28] (::Atom.var"#202#207"{String, Int64, String, Module, Bool})()
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:121
│     [29] withpath(f::Atom.var"#202#207"{String, Int64, String, Module, Bool}, path::String)
│        @ CodeTools ~/.julia/packages/CodeTools/VsjEq/src/utils.jl:30
│     [30] withpath(f::Function, path::String)
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:9
│     [31] (::Atom.var"#201#206"{String, Int64, String, Module, Bool})()
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:119
│     [32] with_logstate(f::Function, logstate::Any)
│        @ Base.CoreLogging ./logging.jl:491
│     [33] with_logger
│        @ ./logging.jl:603 [inlined]
│     [34] #200
│        @ ~/.julia/packages/Atom/kFuIK/src/eval.jl:118 [inlined]
│     [35] hideprompt(f::Atom.var"#200#205"{String, Int64, String, Module, Bool})
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/repl.jl:127
│     [36] macro expansion
│        @ ~/.julia/packages/Atom/kFuIK/src/eval.jl:117 [inlined]
│     [37] macro expansion
│        @ ~/.julia/packages/Media/ItEPc/src/dynamic.jl:24 [inlined]
│     [38] eval(text::String, line::Int64, path::String, mod::String, errorinrepl::Bool)
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:114
│     [39] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
│        @ Base ./essentials.jl:710
│     [40] invokelatest(::Any, ::Any, ::Vararg{Any})
│        @ Base ./essentials.jl:708
│     [41] macro expansion
│        @ ~/.julia/packages/Atom/kFuIK/src/eval.jl:41 [inlined]
│     [42] (::Atom.var"#184#185")()
│        @ Atom ./task.jl:406
└ @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:34
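
A quick way to confirm the state of a fresh session, before running any array code, might be to ask CUDA.jl directly. The sketch below assumes CUDA.jl is installed. It relies on CUDA.functional, which appears in the stack trace above as functional(show_reason::Bool); passing true asks it to log the reason when initialisation has failed. The variable name cuda_ok is only illustrative.

```julia
using CUDA

# `true` asks CUDA.jl to log why initialisation failed, if it did.
cuda_ok = CUDA.functional(true)

if cuda_ok
    # Only touch the GPU once the package reports itself usable.
    x = cu(rand(2))
    println("CUDA.jl is usable; sum on device = ", sum(x))
else
    println("CUDA.jl is not usable in this session; see the logged reason.")
end
```

Running this once before suspending and once after waking should show whether the failure is tied to the suspend cycle or was already present.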

Version info

Details on Julia:
I also tried the current stable 1.5 version.

julia> versioninfo()
Julia Version 1.7.0-DEV.136
Commit 549a73b99d (2020-12-22 08:49 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)
  JULIA_EDITOR = atom  -a

Details on CUDA:


Driver Version: 460.27.04 CUDA Version: 11.2

maleadt commented Jan 4, 2021

Error 999 is your driver being messed up. Nothing we can do about that.

maleadt closed this as completed Jan 4, 2021

qin-yu commented Jan 4, 2021

> Error 999 is your driver being messed up. Nothing we can do about that.

Ah, I forgot to close this issue. I started managing multiple CUDA environments when I began experimenting with CUDA.jl, so I deleted the lines that add paths automatically. Somehow Julia could find the path before I suspended the system, but not after.
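
Since the cause here turned out to be path-related, it can help to check what the Julia process itself sees. This is a minimal sketch, assuming the removed lines set PATH and LD_LIBRARY_PATH (the issue does not say which variables were involved) and that the relevant directories have "cuda" in their name. cuda_entries is a hypothetical helper, not part of CUDA.jl.

```julia
# Hypothetical helper: return the entries of a colon-separated environment
# variable whose path mentions "cuda" (case-insensitive).
function cuda_entries(var::AbstractString; env = ENV)
    value = get(env, var, "")  # an unset variable is treated as empty
    parts = split(value, ':'; keepempty = false)
    return [String(p) for p in parts if occursin("cuda", lowercase(p))]
end

# Assumption: these are the variables the removed profile lines used to set.
for var in ("PATH", "LD_LIBRARY_PATH")
    println(var, " => ", cuda_entries(var))
end
```

Comparing the output of a session started before the suspend with one started after it would show whether the two sessions inherit different environments.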

