CUDA.jl initialisation fails after suspending Ubuntu 20.04 with CUDA 11.2 #605

Closed
qin-yu opened this issue Dec 23, 2020 · 2 comments
Labels
bug Something isn't working


@qin-yu
Contributor

qin-yu commented Dec 23, 2020

Describe the bug

CUDA.jl initialisation fails after suspending Ubuntu 20.04 with CUDA 11.2

Additional context

You may first see an unrelated logging error:

Error: Exception while generating log record in module CUDA at /home/qyu/.julia/dev/CUDA/src/initialization.jl:34
│   exception =
│    UndefVarError: ex not defined
│    Stacktrace:
...

This is described in #603 and fixed by #604.

To reproduce

The Minimal Working Example (MWE) for this bug:

Launch Juno in Atom

using CUDA

# do some random stuff
W = cu(rand(2, 5)) # a 2×5 CuArray
b = cu(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = cu(rand(5)), cu(rand(2)) # Dummy data
loss(x, y) # ~ 3

# Suspend the machine

To suspend the machine:

  1. click the top-right corner of the screen
  2. click Power Off / Log Out
  3. click Suspend

Now wake the machine up. The existing Julia session no longer works with CUDA.jl. After restarting Atom/Juno, or just Julia in a terminal, Julia reports "ERROR: CUDA.jl did not successfully initialize, and is not usable." when trying to do e.g. cu(rand(2)).

Press Enter to start a new session.
Starting Julia...
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.0-DEV.136 (2020-12-22)
 _/ |\__'_|_|_|\__'_|  |  Commit 549a73b99d (1 day old master)
|__/                   |
┌ Error: Recursion during initialization of CUDA.jl
└ @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:38
┌ Error: Error during initialization of CUDA.jl
│   exception =
│    CUDA error (code 999, CUDA_ERROR_UNKNOWN)
│    Stacktrace:
│      [1] throw_api_error(res::CUDA.cudaError_enum)
│        @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/error.jl:97
│      [2] __configure__()
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:93
│      [3] macro expansion
│        @ ~/.julia/dev/CUDA/src/initialization.jl:30 [inlined]
│      [4] macro expansion
│        @ ./lock.jl:209 [inlined]
│      [5] _functional(show_reason::Bool)
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:26
│      [6] functional(show_reason::Bool)
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:19
│      [7] libcuda()
│        @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:47
│      [8] macro expansion
│        @ ~/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:29 [inlined]
│      [9] macro expansion
│        @ ~/.julia/dev/CUDA/lib/cudadrv/error.jl:102 [inlined]
│     [10] cuDeviceGet
│        @ ~/.julia/dev/CUDA/lib/utils/call.jl:26 [inlined]
│     [11] CuDevice
│        @ ~/.julia/dev/CUDA/lib/cudadrv/devices.jl:25 [inlined]
│     [12] initialize_thread(tid::Int64)
│        @ CUDA ~/.julia/dev/CUDA/src/state.jl:121
│     [13] prepare_cuda_call()
│        @ CUDA ~/.julia/dev/CUDA/src/state.jl:80
│     [14] device
│        @ ~/.julia/dev/CUDA/src/state.jl:227 [inlined]
│     [15] alloc
│        @ ~/.julia/dev/CUDA/src/pool.jl:293 [inlined]
│     [16] CuArray{Float32, 2}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
│        @ CUDA ~/.julia/dev/CUDA/src/array.jl:20
│     [17] CuArray
│        @ ~/.julia/dev/CUDA/src/array.jl:76 [inlined]
│     [18] similar
│        @ ./abstractarray.jl:779 [inlined]
│     [19] convert(AT::Type{CuArray{Float32, N} where N}, A::Matrix{Float64})
│        @ GPUArrays ~/.julia/packages/GPUArrays/jhRU7/src/host/construction.jl:82
│     [20] adapt_storage
│        @ ~/.julia/dev/CUDA/src/array.jl:330 [inlined]
│     [21] adapt_structure
│        @ ~/.julia/packages/Adapt/8kQMV/src/Adapt.jl:42 [inlined]
│     [22] adapt
│        @ ~/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
│     [23] cu(xs::Matrix{Float64})
│        @ CUDA ~/.julia/dev/CUDA/src/array.jl:342
│     [24] top-level scope
│        @ ~/workspace/3dunet/test-cuda.jl:3
│     [25] eval
│        @ ./boot.jl:369 [inlined]
│     [26] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
│        @ Base ./loading.jl:1090
│     [27] include_string
│        @ ~/.julia/packages/Atom/kFuIK/src/utils.jl:286 [inlined]
│     [28] (::Atom.var"#202#207"{String, Int64, String, Module, Bool})()
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:121
│     [29] withpath(f::Atom.var"#202#207"{String, Int64, String, Module, Bool}, path::String)
│        @ CodeTools ~/.julia/packages/CodeTools/VsjEq/src/utils.jl:30
│     [30] withpath(f::Function, path::String)
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:9
│     [31] (::Atom.var"#201#206"{String, Int64, String, Module, Bool})()
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:119
│     [32] with_logstate(f::Function, logstate::Any)
│        @ Base.CoreLogging ./logging.jl:491
│     [33] with_logger
│        @ ./logging.jl:603 [inlined]
│     [34] #200
│        @ ~/.julia/packages/Atom/kFuIK/src/eval.jl:118 [inlined]
│     [35] hideprompt(f::Atom.var"#200#205"{String, Int64, String, Module, Bool})
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/repl.jl:127
│     [36] macro expansion
│        @ ~/.julia/packages/Atom/kFuIK/src/eval.jl:117 [inlined]
│     [37] macro expansion
│        @ ~/.julia/packages/Media/ItEPc/src/dynamic.jl:24 [inlined]
│     [38] eval(text::String, line::Int64, path::String, mod::String, errorinrepl::Bool)
│        @ Atom ~/.julia/packages/Atom/kFuIK/src/eval.jl:114
│     [39] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
│        @ Base ./essentials.jl:710
│     [40] invokelatest(::Any, ::Any, ::Vararg{Any})
│        @ Base ./essentials.jl:708
│     [41] macro expansion
│        @ ~/.julia/packages/Atom/kFuIK/src/eval.jl:41 [inlined]
│     [42] (::Atom.var"#184#185")()
│        @ Atom ./task.jl:406
└ @ CUDA ~/.julia/dev/CUDA/src/initialization.jl:34
Manifest.toml

# This file is machine-generated - editing it directly is not advised

[[AbstractFFTs]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "051c95d6836228d120f5f4b984dd5aba1624f716"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "0.5.0"

[[Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "42c42f2221906892ceb765dbcb1a51deeffd86d7"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "2.3.0"

[[ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"

[[Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[BFloat16s]]
deps = ["LinearAlgebra", "Test"]
git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.1.0"

[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[CEnum]]
git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.1"

[[Compat]]
deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"]
git-tree-sha1 = "a706ff10f1cd8dab94f59fd09c0e657db8e77ff0"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "3.23.0"

[[CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"

[[DataStructures]]
deps = ["Compat", "InteractiveUtils", "OrderedCollections"]
git-tree-sha1 = "fb0aa371da91c1ff9dc7fbed6122d3e411420b9c"
uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
version = "0.18.8"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[DelimitedFiles]]
deps = ["Mmap"]
uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab"

[[Distributed]]
deps = ["Random", "Serialization", "Sockets"]
uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b"

[[Downloads]]
deps = ["ArgTools", "LibCURL", "NetworkOptions"]
uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6"

[[ExprTools]]
git-tree-sha1 = "10407a39b87f29d47ebaca8edbc75d7c302ff93e"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.3"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "2c1dd57bca7ba0b3b4bf81d9332aeb81b154ef4c"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.1.2"

[[GPUCompiler]]
deps = ["DataStructures", "InteractiveUtils", "LLVM", "Libdl", "Logging", "Scratch", "Serialization", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "e282a914b54455dfc26be049a3911ac0d9ff48a3"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.9.0"

[[InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "a2101830a761d592b113129887fda626387f68d4"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.5.1"

[[LazyArtifacts]]
deps = ["Artifacts", "Pkg"]
uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3"

[[LibCURL]]
deps = ["LibCURL_jll", "MozillaCACerts_jll"]
uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21"

[[LibCURL_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"]
uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0"

[[LibGit2]]
deps = ["Base64", "NetworkOptions", "Printf", "SHA"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[LibSSH2_jll]]
deps = ["Artifacts", "Libdl", "MbedTLS_jll"]
uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[MacroTools]]
deps = ["Markdown", "Random"]
git-tree-sha1 = "6a8a2a625ab0dea913aba95c11370589e0239ff0"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.5.6"

[[Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[MbedTLS_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[MozillaCACerts_jll]]
uuid = "14a3606d-f60d-562e-9121-12d972cd8159"

[[NNlib]]
deps = ["Compat", "Libdl", "LinearAlgebra", "Pkg", "Requires", "Statistics"]
git-tree-sha1 = "1ae42464fea5258fd2ff49f1c4a40fc41cba3860"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.7.7"

[[NetworkOptions]]
uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908"

[[OrderedCollections]]
git-tree-sha1 = "cf59cfed2e2c12e8a2ff0a4f1e9b2cd8650da6db"
uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
version = "1.3.2"

[[Pkg]]
deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[Reexport]]
deps = ["Pkg"]
git-tree-sha1 = "7b1d07f411bc8ddb7977ec7f377b97b158514fe0"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "0.2.0"

[[Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "e05c53ebc86933601d36212a93b39144a2733493"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.1.1"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Scratch]]
deps = ["Dates"]
git-tree-sha1 = "ad4b278adb62d185bbcb6864dc24959ab0627bf6"
uuid = "6c6a2e73-6563-6170-7368-637461726353"
version = "1.0.3"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SharedArrays]]
deps = ["Distributed", "Mmap", "Random", "Serialization"]
uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383"

[[Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"

[[Tar]]
deps = ["ArgTools", "SHA"]
uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"

[[Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[TimerOutputs]]
deps = ["Printf"]
git-tree-sha1 = "3318281dd4121ecf9713ce1383b9ace7d7476fdd"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.7"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[Zlib_jll]]
deps = ["Libdl"]
uuid = "83775a58-1f1d-513f-b197-d71354ab007a"

[[nghttp2_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d"

Version info

Details on Julia:
Also tried with the current stable 1.5 release.

julia> versioninfo()
Julia Version 1.7.0-DEV.136
Commit 549a73b99d (2020-12-22 08:49 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = atom  -a
  JULIA_NUM_THREADS = 6

Details on CUDA:

Driver Version: 460.27.04 CUDA Version: 11.2

@qin-yu qin-yu added the bug Something isn't working label Dec 23, 2020
@maleadt
Member

maleadt commented Jan 4, 2021

Error 999 is your driver being messed up. Nothing we can do about that.

@maleadt maleadt closed this as completed Jan 4, 2021
@qin-yu
Contributor Author

qin-yu commented Jan 4, 2021

> Error 999 is your driver being messed up. Nothing we can do about that.

Ah, I forgot to close this issue. When I started experimenting with CUDA.jl, I began managing multiple CUDA environments, so I deleted the lines that add paths automatically. Somehow Julia can find the path before I suspend the system, but not after.
