CURAND handles are collected early #699

Closed
norci opened this issue Feb 10, 2021 · 3 comments · Fixed by #704

Labels
bug (Something isn't working), cuda libraries (Stuff about CUDA library wrappers.)

Comments

Contributor

norci commented Feb 10, 2021

Describe the bug
Using CUDA.jl on the master branch.

Random.seed! is not thread-safe when called on CUDA.CURAND.default_rng().

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA, Random
map(1:20) do _
    Threads.@spawn NVTX.@range Random.randstring() CUDA.stream!(CuStream()) do
        rng = CUDA.CURAND.default_rng()
        Random.seed!(rng, 999)
    end
end .|> fetch

Log:

    nested task error: CURANDError: internal library error (code 999, CURAND_STATUS_INTERNAL_ERROR)
    Stacktrace:
      [1] throw_api_error(res::CUDA.CURAND.curandStatus)
        @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/error.jl:53
      [2] seed!(rng::CUDA.CURAND.RNG, seed::UInt64, offset::Int64)
        @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/random.jl:45
      [3] seed!(rng::CUDA.CURAND.RNG)
        @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/random.jl:38
      [4] (::CUDA.CURAND.var"#46#48"{CuContext})()
        @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/CURAND.jl:54
      [5] get!
        @ ./iddict.jl:163 [inlined]
      [6] default_rng()
        @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/CURAND.jl:40
      [7] #9
        @ ./REPL[3]:3 [inlined]
      [8] #63
        @ ~/.julia/dev/CUDA/src/state.jl:540 [inlined]
      [9] task_local_storage(body::CUDA.var"#63#64"{var"#9#12", CuStream, Int64}, key::Symbol, val::CuStream)
        @ Base ./task.jl:276
     [10] stream!(f::var"#9#12", s::CuStream)
        @ CUDA ~/.julia/dev/CUDA/src/state.jl:537
     [11] macro expansion
        @ ~/.julia/dev/CUDA/lib/nvtx/highlevel.jl:73 [inlined]
     [12] (::var"#8#11")()
        @ Main ./threadingconstructs.jl:169

Expected behavior
The MWE should run to completion without throwing an error.

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver2)
Environment:
  JULIA_PATH = /usr/local/julia
  JULIA_NUM_THREADS = 12
  JULIA_PKG_SERVER = https://mirrors.sjtug.sjtu.edu.cn/julia
  JULIA_CUDA_USE_BINARYBUILDER = false

Additional context

Should we add this code to the test suite?
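For reference, a minimal sketch of what such a regression test could look like. It only reuses the calls from the MWE above; the testset name, the `CUDA.functional()` guard, and the `true` return value are my own additions for illustration, not anything from the CUDA.jl test suite.

```julia
using CUDA, Random, Test

# Sketch only: skip quietly on machines without a usable GPU.
if CUDA.functional()
    @testset "default_rng from concurrent tasks" begin
        tasks = map(1:20) do _
            Threads.@spawn CUDA.stream!(CuStream()) do
                rng = CUDA.CURAND.default_rng()
                Random.seed!(rng, 999)
                true   # reached only if seeding did not throw
            end
        end
        # `fetch` rethrows a failed task's error, which fails the testset.
        @test all(fetch, tasks)
    end
end
```

The NVTX range from the MWE is left out because it does not seem to be needed to trigger the error path in default_rng().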

norci added the bug label on Feb 10, 2021
norci changed the title from "Random.seed! is not thread safe, for CUDA.CURAND.default_rng()" to "CURAND functions are not thread safe, for CUDA.CURAND.default_rng()" on Feb 10, 2021
norci changed the title from "CURAND functions are not thread safe, for CUDA.CURAND.default_rng()" to "CURAND functions are not thread safe" on Feb 10, 2021
Contributor Author

norci commented Feb 10, 2021

After I got the internal library error, all subsequent CURAND calls fail, but only when they are executed in a thread and with an explicit rng.

So I think the bug is in default_rng().

julia> randn(CUDA.CURAND.default_rng(), 2)
2-element CuArray{Float64, 1}:
 -0.026281558163828614
  0.3863620119156326

julia> fetch(Threads.@spawn randn(CUDA.CURAND.default_rng(), 2))
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:332
 [3] top-level scope
   @ threadingconstructs.jl:179

    nested task error: CURANDError: internal library error (code 999, CURAND_STATUS_INTERNAL_ERROR)
    Stacktrace:
     [1] throw_api_error(res::CUDA.CURAND.curandStatus)
       @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/error.jl:53
     [2] seed!(rng::CUDA.CURAND.RNG, seed::UInt64, offset::Int64)
       @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/random.jl:45
     [3] seed!(rng::CUDA.CURAND.RNG)
       @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/random.jl:38
     [4] (::CUDA.CURAND.var"#46#48"{CuContext})()
       @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/CURAND.jl:54
     [5] get!
       @ ./iddict.jl:163 [inlined]
     [6] default_rng()
       @ CUDA.CURAND ~/.julia/dev/CUDA/lib/curand/CURAND.jl:40
     [7] (::var"#5#6")()
       @ Main ./threadingconstructs.jl:169

julia> fetch(Threads.@spawn CUDA.randn(2))
2-element CuArray{Float32, 1}:
 -0.23838419
  1.6223384

Member

maleadt commented Feb 10, 2021

Also reproduces with JULIA_NUM_THREADS=1, so this looks like a multi-tasking issue.

Member

maleadt commented Feb 10, 2021

Looks like the RNG handle is getting destroyed, even though I keep it alive from the task finalizer. The reproducer is very finicky, though; maybe it depends on the GC's ability to mark both the task and the RNG as dead at the same time?
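To make that hypothesis easier to poke at without a GPU, here is a small pure-Julia sketch. Everything in it (FakeHandle, FakeOwner, EVENTS, make_pair!) is a hypothetical stand-in and not CUDA.jl code: one object's finalizer closes over a second object that has its own finalizer, both become unreachable together, and the script records which finalizer ran first. It only reports what was observed on a given run; it does not claim to reproduce the CURAND failure.

```julia
# Hypothetical stand-ins; none of these names come from CUDA.jl.
mutable struct FakeHandle
    id::Int
end

mutable struct FakeOwner
    id::Int
end

# Finalizers should not do I/O, so they only record events here.
const EVENTS = Symbol[]

@noinline function make_pair!()
    handle = FakeHandle(1)
    finalizer(h -> push!(EVENTS, :handle_finalized), handle)

    owner = FakeOwner(1)
    # The owner's finalizer closes over `handle`, mimicking
    # "keeping the handle alive from the task finalizer".
    finalizer(owner) do _
        push!(EVENTS, :owner_finalized)
        handle.id   # touch the handle from inside the owner's finalizer
    end
    return nothing
end

make_pair!()          # both objects are unreachable once this returns
GC.gc(); GC.gc()      # run a couple of full collections
@show EVENTS          # order (and completeness) depends on the GC
```

If :handle_finalized ever shows up before :owner_finalized, that would be consistent with the handle being collected early; if it never does, the real problem is probably elsewhere.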

maleadt changed the title from "CURAND functions are not thread safe" to "CURAND handles are collected early" on Feb 10, 2021
maleadt added the cuda libraries label on Feb 10, 2021