RNN test failures with CUDNN #267

Closed
CarloLucibello opened this issue May 20, 2018 · 7 comments

@CarloLucibello
Member

For the last month, CI on the GPU has been failing consistently with the following error:

 batch_size = 5: Test Failed
  Expression: rnn.cell.h.grad ≈ collect(curnn.cell.h.grad)
   Evaluated: [-0.152661, -0.251366, -0.159411, -0.00937606, -0.264564] ≈ Float32[-0.163419, -0.602388, 0.0816057, -0.495094, 0.20643]
Stacktrace:
 [1] macro expansion at /var/lib/buildbot/workers/julia/Flux-julia06-x86-64bit/packages/v0.6/Flux/test/cuda/cudnn.jl:31 [inlined]
 [2] macro expansion at ./test.jl:921 [inlined]
 [3] macro expansion at /var/lib/buildbot/workers/julia/Flux-julia06-x86-64bit/packages/v0.6/Flux/test/cuda/cudnn.jl:9 [inlined]
 [4] macro expansion at ./test.jl:921 [inlined]
 [5] macro expansion at /var/lib/buildbot/workers/julia/Flux-julia06-x86-64bit/packages/v0.6/Flux/test/cuda/cudnn.jl:6 [inlined]
 [6] macro expansion at ./test.jl:860 [inlined]
 [7] anonymous at ./<missing>:?
Test Summary:        | Pass  Fail  Error  Total
Flux                 |  179     4      1    184
  Throttle           |   11                  11
  Jacobian           |    1                   1
  Initialization     |   14                  14
  Params             |    2                   2
  Tracker            |   62                  62
  Dropout            |    8                   8
  BatchNorm          |   13                  13
  losses             |   11                  11
  Optimise           |    8                   8
  Training Loop      |    1                   1
  CuArrays           |    5                   5
  RNN                |   40     4      1     45
    R = Flux.RNN     |   16                  16
    R = Flux.GRU     |    6     4      1     11
      batch_size = 1 |    2            1      3
      batch_size = 5 |    4     4             8
    R = Flux.LSTM    |   18                  18
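
For reference, the failing test in test/cuda/cudnn.jl essentially builds the same GRU on the CPU and on the GPU, takes gradients through both, and checks that they agree. Below is a rough sketch of that comparison, assuming the Tracker-era Flux API (Flux 0.6); the sizes and names are illustrative, not the exact test:

    using Flux, CuArrays

    rnn   = Flux.GRU(10, 5)         # CPU model
    curnn = mapleaves(gpu, rnn)     # same weights, dispatched to the CUDNN RNN wrappers

    x  = rand(Float32, 10)          # one time step, batch size 1
    cx = gpu(x)

    y, cy = rnn(x), curnn(cx)
    Flux.back!(sum(y))              # Tracker backward pass on the CPU
    Flux.back!(sum(cy))             # backward pass through the CUDNN RNN path

    # The assertions that fail intermittently: CPU and CUDNN gradients should match.
    @assert rnn.cell.Wi.grad ≈ collect(curnn.cell.Wi.grad)
    @assert rnn.cell.h.grad ≈ collect(curnn.cell.h.grad)
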
@MikeInnes
Member

This is an issue with the CUDNN RNN APIs; oddly enough I've never seen it when actually using them, but it comes up regularly in the tests.

As CUDNN has just added API logging, I'm hoping that will help me debug this when I get round to it.
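
For anyone who wants to try that: cuDNN 7.1 and later can log each API call and its parameters via environment variables, which must be set before the library is loaded. A small sketch of how that might be wired up from Julia (the environment variable names are from NVIDIA's cuDNN documentation; the surrounding workflow is just an assumption):

    # Set before CuArrays initializes cuDNN, i.e. at the very top of the session.
    ENV["CUDNN_LOGINFO_DBG"] = "1"           # enable cuDNN API logging
    ENV["CUDNN_LOGDEST_DBG"] = "cudnn.log"   # or "stdout" / "stderr"

    using Flux, CuArrays
    # Now run the failing RNN test; every cuDNN call is logged together with its
    # arguments, which makes it easier to see what the RNN wrappers actually pass in.
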

@maleadt
Collaborator

maleadt commented Jun 19, 2018

FYI, GPU CI has moved, and somebody with Flux.jl ownership permissions should add it to the JuliaGPU GitLab group. See https://github.com/JuliaGPU/gitlab-ci

@jcreinhold

Just thought I'd add that I'm seeing the same error as @CarloLucibello. The (truncated) stack trace is below:

[ Info: Testing Flux/CUDNN
batch_size = 1: Error During Test at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9
  Got exception outside of a @test
  CUDNNError(code 3, CUDNN_STATUS_BAD_PARAM)
  Stacktrace:
   [1] macro expansion at /home/jacobr/.julia/packages/CuArrays/f4Eke/src/dnn/error.jl:19 [inlined]
   [2] cudnnRNNBackwardData(::Flux.CUDA.RNNDesc{Float32}, ::Int64, ::Array{CuArrays.CUDNN.TensorDesc,1}, ::CuArray{Float32,1}, ::Array{CuArrays.CUDNN.TensorDesc,1}, ::CuArray{Float32,1}, ::CuArrays.CUDNN.TensorDesc, ::CuArray{Float32,1}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::CuArrays.CUDNN.FilterDesc, ::CuArray{Float32,1}, ::CuArrays.CUDNN.TensorDesc, ::CuArray{Float32,1}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Array{CuArrays.CUDNN.TensorDesc,1}, ::CuArray{Float32,1}, ::CuArrays.CUDNN.TensorDesc, ::CuArray{Float32,1}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::CuArray{UInt8,1}, ::CuArray{UInt8,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:193
   [3] backwardData(::Flux.CUDA.RNNDesc{Float32}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::Nothing, ::CuArray{Float32,1}, ::Nothing, ::CuArray{UInt8,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:210
   [4] backwardData(::Flux.CUDA.RNNDesc{Float32}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{UInt8,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:218
   [5] (::getfield(Flux.CUDA, Symbol("##11#12")){Flux.GRUCell{TrackedArray{,CuArray{Float32,2}},TrackedArray{,CuArray{Float32,1}}},TrackedArray{,CuArray{Float32,1}},TrackedArray{,CuArray{Float32,1}},CuArray{UInt8,1},Tuple{CuArray{Float32,1},CuArray{Float32,1}}})(::Tuple{CuArray{Float32,1},CuArray{Float32,1}}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:329
   [6] back_(::Flux.Tracker.Call{getfield(Flux.CUDA, Symbol("##11#12")){Flux.GRUCell{TrackedArray{,CuArray{Float32,2}},TrackedArray{,CuArray{Float32,1}}},TrackedArray{,CuArray{Float32,1}},TrackedArray{,CuArray{Float32,1}},CuArray{UInt8,1},Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}},Flux.Tracker.Tracked{CuArray{Float32,1}},Flux.Tracker.Tracked{CuArray{Float32,2}},Flux.Tracker.Tracked{CuArray{Float32,2}},Flux.Tracker.Tracked{CuArray{Float32,1}}}}, ::Tuple{CuArray{Float32,1},CuArray{Float32,1}}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:23
   [7] back(::Flux.Tracker.Tracked{Tuple{CuArray{Float32,1},CuArray{Float32,1}}}, ::Tuple{CuArray{Float32,1},Int64}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:43
   [8] foreach(::Function, ::Tuple{Flux.Tracker.Tracked{Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Nothing}, ::Tuple{Tuple{CuArray{Float32,1},Int64},Nothing}) at ./abstractarray.jl:1836
   [9] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##328#330")){Flux.Tracker.TrackedTuple{Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Int64},Tuple{Flux.Tracker.Tracked{Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Nothing}}, ::CuArray{Float32,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:26
   [10] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:45
   [11] back!(::TrackedArray{…,CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:62
   [12] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:25 [inlined]
   [13] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
   [14] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
   [15] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
   [16] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
   [17] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
   [18] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
   [19] include at ./boot.jl:317 [inlined]
   [20] include_relative(::Module, ::String) at ./loading.jl:1044
   [21] include(::Module, ::String) at ./sysimg.jl:29
   [22] include(::String) at ./client.jl:392
   [23] top-level scope at none:0
   [24] include at ./boot.jl:317 [inlined]
   [25] include_relative(::Module, ::String) at ./loading.jl:1044
   [26] include(::Module, ::String) at ./sysimg.jl:29
   [27] include(::String) at ./client.jl:392
   [28] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:45 [inlined]
   [29] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
   [30] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:26
   [31] include at ./boot.jl:317 [inlined]
   [32] include_relative(::Module, ::String) at ./loading.jl:1044
   [33] include(::Module, ::String) at ./sysimg.jl:29
   [34] exec_options(::Base.JLOptions) at ./client.jl:266
   [35] _start() at ./client.jl:425
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:28
  Expression: ((rnn.cell).Wi).grad ≈ collect(((curnn.cell).Wi).grad)
   Evaluated: [-0.00367632 -0.00351105 … -0.00199258 -0.00363324; 0.0218888 0.0180698 … 0.0237639 0.0238772; … ; -1.5432 -1.33318 … -1.58462 -1.74017; -1.05911 -1.06296 … -1.05237 -1.35042] ≈ Float32[-0.00157259 -0.00118586 … -0.00106842 -0.0011475; 0.0174023 0.0131109 … 0.021793 0.018576; … ; -1.82545 -1.64515 … -1.70861 -2.07368; -0.937059 -0.928057 … -0.998756 -1.2062]
Stacktrace:
 [1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:28 [inlined]
 [2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
 [4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
 [6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
 [7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:29
  Expression: ((rnn.cell).Wh).grad ≈ collect(((curnn.cell).Wh).grad)
   Evaluated: [-0.00167179 -0.000571634 … -0.00737623 -0.00230114; 0.00472922 -0.000953073 … -0.00443951 0.0117585; … ; -0.0160675 0.0515638 … 0.208142 -0.385053; 0.0282091 0.044212 … 0.154527 -0.288217] ≈ Float32[-0.00680693 -0.00125408 … -0.0138485 0.00544874; 0.00551455 -0.000585967 … -0.0031398 0.0100578; … ; -0.00432964 0.0610827 … 0.232323 -0.418382; 0.0216776 0.0394996 … 0.14176 -0.270818]
Stacktrace:
 [1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:29 [inlined]
 [2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
 [4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
 [6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
 [7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:30
  Expression: ((rnn.cell).b).grad ≈ collect(((curnn.cell).b).grad)
   Evaluated: [-0.00485622, 0.0356765, 0.0566717, -0.0967002, -0.0736521, 0.415462, 0.169964, -0.183906, 1.05687, -0.911806, 0.00403009, -1.50994, -0.834565, -2.4546, -1.56953] ≈ Float32[-0.00230207, 0.0302294, 0.0633711, -0.100256, -0.0690399, 0.270476, 0.153396, -0.207993, 1.27673, -0.803211, 0.344462, -0.880549, -1.12154, -2.79729, -1.42135]
Stacktrace:
 [1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:30 [inlined]
 [2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
 [4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
 [6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
 [7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:31
  Expression: ((rnn.cell).h).grad ≈ collect(((curnn.cell).h).grad)
   Evaluated: [-0.236697, -0.686411, -0.373723, -0.637998, -1.26944] ≈ Float32[-0.0674129, -0.592256, -0.351478, -0.822845, -1.20205]
Stacktrace:
 [1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:31 [inlined]
 [2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
 [4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
 [6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
 [7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
Test Summary:        | Pass  Fail  Error  Total
Flux                 |  399     4      1    404
  Throttle           |   11                  11
  Jacobian           |    1                   1
  Initialization     |   14                  14
  Params             |    2                   2
  onecold            |    4                   4
  Optimise           |   10                  10
  Training Loop      |    1                   1
  basic              |   17                  17
  Dropout            |    8                   8
  BatchNorm          |   13                  13
  losses             |   12                  12
  Pooling            |    2                   2
  CNN                |    1                   1
  Tracker            |  248                 248
  CuArrays           |    7                   7
  RNN                |   40     4      1     45
    R = Flux.RNN     |   16                  16
    R = Flux.GRU     |    6     4      1     11
      batch_size = 1 |    2            1      3
      batch_size = 5 |    4     4             8
    R = Flux.LSTM    |   18                  18
ERROR: LoadError: Some tests did not pass: 399 passed, 4 failed, 1 errored, 0 broken.
in expression starting at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:24
ERROR: LoadError: failed process: Process(`/home/jacobr/code/julia-1.0.3/bin/julia -Cnative -J/home/jacobr/code/julia-1.0.3/lib/julia/sys.so --compile=yes --depwarn=yes --color=yes --compiled-modules=yes --startup-file=no --code-coverage=none /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl`, ProcessExited(1)) [1]
Stacktrace:
 [1] error(::String, ::Base.Process, ::String, ::Int64, ::String) at ./error.jl:42
 [2] pipeline_error at ./process.jl:705 [inlined]
 [3] #run#503(::Bool, ::Function, ::Cmd) at ./process.jl:663
 [4] run(::Cmd) at ./process.jl:661
 [5] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:5
 [6] include at ./boot.jl:317 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1044
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] include(::String) at ./client.jl:392
 [10] top-level scope at none:0
in expression starting at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:3
ERROR: Package Flux errored during testing

I'm using Julia v1.0.3 and these package versions:

    Status `~/.julia/environments/v1.0/Project.toml`
  [3895d2a7] CUDAapi v0.5.3+ #master (https://github.com/JuliaGPU/CUDAapi.jl.git)
  [3a865a2d] CuArrays v0.8.1
  [587475ba] Flux v0.6.10

Is this an actual problem with some part of the Flux implementation, or just an issue with the unit tests?

@maleadt
Collaborator

maleadt commented Jan 6, 2019

That test failing with CUDNN_STATUS_BAD_PARAM seems different from the original issue here, although it is also RNN-related. EDIT: ah, the failure count is identical, so that error was probably just not included in the original post.

@maleadt changed the title from "CI on GPU test error" to "RNN test failures with CUDNN" on Jan 6, 2019
@MikeInnes
Member

Unfortunately this error (or family of errors) is really difficult to debug; it's not deterministic, doesn't show up in interactive sessions, and CUDA API logging doesn't reveal anything insightful.

I'm vaguely hoping that once we unify the Knet and Flux RNN wrappers into CuArrays, this will magically go away.

@tanhevg

tanhevg commented Jun 25, 2019

I got this error not in CI but when running a real model. I had to manually edit packages/CuArrays/.../deps/ext.jl to forcibly disable CUDNN to make my model work.

https://discourse.julialang.org/t/flux-rnns-leak-into-curnns/25661
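
The contents of deps/ext.jl are generated by CuArrays' build step and differ between versions, so the snippet below is only a hypothetical illustration of the workaround: making CUDNN look unavailable so that Flux falls back to its generic GPU RNN code instead of the CUDNN wrappers.

    # deps/ext.jl (generated file; the constants and paths shown here are hypothetical)
    const libcudart = "/usr/local/cuda/lib64/libcudart.so"
    const libcudnn  = "/usr/local/cuda/lib64/libcudnn.so"

    # The workaround amounts to reporting cuDNN as missing, e.g.
    # const libcudnn = nothing
    # so that the CUDNN-specific RNN dispatch is never installed.
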

@tanhevg

tanhevg commented Sep 30, 2019

Should this be closed as well?
