Deadlock during OOM #706

Closed
norci opened this issue Feb 11, 2021 · 14 comments · Fixed by #734
Labels: bug Something isn't working

Comments

@norci (Contributor) commented Feb 11, 2021

On the master branch, the following code appears to deadlock.

I guess this is related to GC: the GPU memory was full and the CPU usage was at 100%.

using CUDA, Random

for _ = 1:10
    @info "|"
    @sync Threads.@threads for i = 1:200
        NVTX.@range Random.randstring() CUDA.stream!(CuStream()) do
            x = CUDA.rand(100, 100, 100)
            x ./= sum(x; dims = 2)
        end
    end
    @info "-" CUDA.memory_status()
end

output:

# We can see the allocated memory increases...

[ Info: |
Effective GPU memory usage: 95.61% (7.421 GiB/7.762 GiB)
CUDA allocator usage: 5.649 GiB
binned usage: 5.649 GiB (5.649 GiB allocated, 0 bytes cached)
┌ Info: -
└   CUDA.memory_status() = nothing
[ Info: |
^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^CSegmentation fault (core dumped)

@maleadt, could you have a look? Thanks.

I have some knowledge of CUDA development, but I have no idea about Julia internals, so I'm sorry to say I'm not able to help.

@norci norci added the bug Something isn't working label Feb 11, 2021
@norci norci changed the title CUDA.stream! is not thread safe CUDA.stream! is not thread safe, or memory leak? Feb 11, 2021
@maleadt (Member) commented Feb 11, 2021

I'll have a look. Note that you don't generally need to CUDA.stream!(CuStream()); every task gets its own stream already.
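
For illustration, a minimal sketch of that simplification (my reading of the advice, not code taken from the issue):

using CUDA

# Each task created by @threads already runs on its own task-local stream,
# so the explicit CUDA.stream!(CuStream()) wrapper can be dropped.
Threads.@threads for i in 1:200
    x = CUDA.rand(100, 100, 100)
    x ./= sum(x; dims = 2)
end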

@maleadt maleadt self-assigned this Feb 11, 2021
@norci (Contributor, Author) commented Feb 11, 2021

I'll have a look. Note that you don't generally need to CUDA.stream!(CuStream()); every task gets its own stream already.

does "task" mean a task created by @async or Threads.@spawn ?

I think most Julia functions are synchronized.
if I use @async to create a few tasks, will the scheduler switch to another one, when a function is blocked by a cuda function call?

@maleadt (Member) commented Feb 11, 2021

does "task" mean a task created by @async or Threads.@spawn ?

Both. Each task also gets its own NVTX identifier, so you don't need to do the NVTX.@range randstring().

I think most Julia functions are synchronized.

In the CUDA sense of synchronization? I've been changing that (this also answers your third question), by (1) changing blocking calls to be asynchronous on the task-local stream + an explicit call to synchronize(stream()), and (2) making synchronize() yield back to the Julia scheduler to allow other tasks to run.
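
A rough sketch of that execution model (an illustration based on the description above, not the actual implementation):

using CUDA

@sync for i in 1:4
    Threads.@spawn begin
        # Library calls and kernels are queued asynchronously on this
        # task's own stream (the one returned by stream()).
        x = CUDA.rand(100, 100, 100)
        x ./= sum(x; dims = 2)
        # Blocking only happens at the explicit synchronization, and
        # synchronize() yields to the Julia scheduler so other tasks
        # can run in the meantime.
        synchronize(stream())
    end
end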

@maleadt maleadt changed the title CUDA.stream! is not thread safe, or memory leak? Initial compilation appears to deadlock Feb 11, 2021
@maleadt (Member) commented Feb 11, 2021

Can you try JuliaGPU/GPUCompiler.jl#150? I'm not seeing the deadlock, but the initial compilation was very slow due to the runtime being generated multiple times over.

@norci (Contributor, Author) commented Feb 12, 2021

Can you try JuliaGPU/GPUCompiler.jl#150? I'm not seeing the deadlock, but the initial compilation was very slow due to the runtime being generated multiple times over.

No, I still got the same error with that patch.

@maleadt (Member) commented Feb 12, 2021

Which error? You don't report an error in this issue, but a deadlock.

@norci (Contributor, Author) commented Feb 12, 2021

It's the memory leak:

Effective GPU memory usage: 95.61% (7.421 GiB/7.762 GiB)
binned usage: 5.649 GiB (5.649 GiB allocated, 0 bytes cached)

and the deadlock after the GPU runs out of memory.

@maleadt (Member) commented Feb 12, 2021

Is this with the modified Julia from #707? I can't reproduce this.

@norci (Contributor, Author) commented Feb 12, 2021

This happens on Julia Version 1.6.0-rc1.

@maleadt (Member) commented Feb 12, 2021

Yeah, so please test with that modified build.

@norci (Contributor, Author) commented Feb 12, 2021

Yeah, so please test with that modified build.

The modified build has the same problem.

I'm wondering why this error cannot be reproduced in your environment.

In my test environment, I only added CUDA:

(testcuda) pkg> st --manifest
      Status `/code/testcuda/Manifest.toml`
  [621f4979] AbstractFFTs v1.0.0
  [79e6a3ab] Adapt v3.2.0
  [ab4f0b2a] BFloat16s v0.1.0
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v2.6.0 `/julia_depot/dev/CUDA`
  [d360d2e6] ChainRulesCore v0.9.28
  [34da2185] Compat v3.25.0
  [864edb3b] DataStructures v0.18.9
  [e2ba6199] ExprTools v0.1.3
  [0c68f7d7] GPUArrays v6.2.0
  [61eb1bfa] GPUCompiler v0.10.0
  [929cbde3] LLVM v3.6.0
  [1914dd2f] MacroTools v0.5.6
  [c03570c3] Memoize v0.4.4
  [872c559c] NNlib v0.7.14
  [bac558e1] OrderedCollections v1.3.3
  [189a3867] Reexport v1.0.0
  [ae029012] Requires v1.1.2
  [6c6a2e73] Scratch v1.0.3
  [a759f4b9] TimerOutputs v0.5.7
...

I got a different error message after I reduced the thread count:

begin
    using CUDA, Random
    for _ = 1:100
        @info "-"^30
        @sync Threads.@threads for i = 1:20
            x = CUDA.rand(100, 100, 100)
            x ./= sum(x; dims = 2)
        end
        CUDA.memory_status()
    end
end

Log:

[ Info: ------------------------------
Effective GPU memory usage: 98.92% (7.678 GiB/7.762 GiB)
CUDA allocator usage: 5.237 GiB
binned usage: 5.237 GiB (5.237 GiB allocated, 0 bytes cached)
[ Info: ------------------------------
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] threading_run(func::Function)
   @ Base.Threads ./threadingconstructs.jl:34
 [3] macro expansion
   @ ./threadingconstructs.jl:93 [inlined]
 [4] macro expansion
   @ ./task.jl:382 [inlined]
 [5] top-level scope
   @ ./REPL[2]:5

    nested task error: CURANDError: memory allocation failed (code 102, CURAND_STATUS_ALLOCATION_FAILED)
    Stacktrace:
      [1] throw_api_error(res::CUDA.CURAND.curandStatus)
        @ CUDA.CURAND /julia_depot/dev/CUDA/lib/curand/error.jl:53
      [2] seed!(rng::CUDA.CURAND.RNG, seed::UInt64, offset::Int64)
        @ CUDA.CURAND /julia_depot/dev/CUDA/lib/curand/random.jl:45
      [3] seed!(rng::CUDA.CURAND.RNG)
        @ CUDA.CURAND /julia_depot/dev/CUDA/lib/curand/random.jl:38
      [4] (::CUDA.CURAND.var"#46#50"{CuContext})()
        @ CUDA.CURAND /julia_depot/dev/CUDA/lib/curand/CURAND.jl:65
      [5] get!
        @ ./iddict.jl:163 [inlined]
      [6] default_rng()
        @ CUDA.CURAND /julia_depot/dev/CUDA/lib/curand/CURAND.jl:43
      [7] rand
        @ /julia_depot/dev/CUDA/src/random.jl:70 [inlined]

Could you try increasing the loop count to make the GPU OOM?

So I think the main problem is a memory leak. But then why the deadlock after OOM?

@maleadt (Member) commented Feb 12, 2021

[61eb1bfa] GPUCompiler v0.10.0

That's not using the proposed fix?

@norci (Contributor, Author) commented Feb 12, 2021

Sorry, I had reverted GPUCompiler before. With GPUCompiler on the tb/lock_runtime branch, the result is the same.

Can we add this code as a test case and run it on the CI system?

This was referenced Feb 23, 2021
@maleadt (Member) commented Feb 23, 2021

I could finally reproduce this :-) A fix is up in #734; could you verify?

@maleadt maleadt changed the title Initial compilation appears to deadlock Deadlock during OOM Feb 23, 2021