RNN on GPU fails on first backward call #1114

Closed
jeremiedb opened this issue Apr 8, 2020 · 7 comments · May be fixed by JuliaGPU/CuArrays.jl#706

Comments

@jeremiedb
Contributor

In the following minimal RNN example, the first call to Flux.train! fails with CUDNN_STATUS_BAD_PARAM. The same error is raised whether the cell is GRU, RNN, or LSTM.

using Flux
using Statistics

rnn = Chain(GRU(16, 8),
  Dense(8,1, σ),
  x -> reshape(x,:))

X = [rand(16,10) for i in 1:20]
Y = rand(10,20) ./ 10

rnn = rnn |> gpu
X = gpu(X)
Y = gpu(Y)

θ = Flux.params(rnn)
loss(x,y) = mean((Flux.stack(rnn.(X),2) .- y) .^ 2f0)
opt = ADAM(1e-3)
size(rnn[1].state)
Flux.reset!(rnn)
size(rnn[1].state)
Flux.train!(loss, θ, [(X,Y)], opt)
size(rnn[1].state)
loss(X,Y)

Both before and after reset!, the rnn state has size (8,), while after a call to train! on the GPU the state takes the expected size (8, 10). After each call to reset!, the CUDNN_STATUS_BAD_PARAM error appears on the first call to train!, but subsequent calls are fine since the state size stays (8, 10). I can't confirm whether the state size is the root cause, but it appears closely tied to the bug. A call to loss(X, Y) also results in the proper state dimension of (8, 10). Running on the CPU doesn't produce any error or warning.
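For reference, the state-size behaviour described above can be checked on the CPU alone. A minimal sketch, assuming Flux's usual Recur semantics (the state starts out as the 1-D init vector and broadcasts up to the batch size on the first forward pass):

using Flux

m = GRU(16, 8)
size(m.state)              # (8,)    state is the 1-D init vector
m(rand(Float32, 16, 10))   # forward pass on a batch of 10 columns
size(m.state)              # (8, 10) state now matches the batch size
Flux.reset!(m)
size(m.state)              # (8,)    reset! rebinds the state to the 1-D init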

Pkg info (the same error is also raised with the latest Zygote release):

  [4f1ea46c] AWSCore v0.6.9
  [1c724243] AWSS3 v0.6.10
  [fbb218c0] BSON v0.2.5
  [336ed68f] CSV v0.6.1
  [3a865a2d] CuArrays v2.0.1
  [a93c6f00] DataFrames v0.20.2
  [587475ba] Flux v0.10.4
  [28b8d3ca] GR v0.48.0
  [91a5bcdd] Plots v1.0.5
  [2913bbd2] StatsBase v0.32.2
  [e88e6eb3] Zygote v0.4.15 #master (https://github.com/FluxML/Zygote.jl.git)
  [10745b16] Statistics

CUDA: 10.1.168
CUDNN: 7.6.5

@DhairyaLGandhi
Member

It seems like we're sending something incorrect in our parameters in the backward pass.

@maleadt
Collaborator

maleadt commented May 5, 2020

Smaller repro:

using Flux
using Statistics
using Random

function main()
    Random.seed!(1)
    rnn = GRU(1, 1)

    X = [rand(1,2) for i in 1:2]
    Y = rand(2,2) ./ 10

    drnn = rnn |> gpu
    dX = gpu(X)
    dY = gpu(Y)

    θ = Flux.params(drnn)
    loss(x,y) = mean((Flux.stack(drnn.(dX),2) .- y) .^ 2f0)
    opt = ADAM(1e-3)

    try
        Flux.train!(loss, θ, [(dX,dY)], opt)
    catch ex
        @error "First training failed" exception=(ex, catch_backtrace())
    end

    Flux.train!(loss, θ, [(dX,dY)], opt)
    @info "Second training succeeded"
end

isinteractive() || main()

Looks like the dimensionality of the initial state is wrong. Initially, rnn.init is a 1-dimensional array [0], while after the first (failed) training the state becomes a 2-dimensional array with 2 elements. Indeed, fixing the initial state to be a 2-dimensional array of 2 zeros makes the first training work too:

drnn.state = CuArrays.zeros(Float32, (1,2))
Flux.train!(loss, θ, [(dX,dY)], opt)
# works at the first invocation

I'm not sure how/where that data is set and used, so I'll leave that to Flux experts 🙂
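For the original example at the top of the thread (hidden size 8, batch size 10), the analogous workaround would presumably be to pre-expand the state of the wrapped cell before the first call to train!; an untested sketch:

rnn[1].state = CuArrays.zeros(Float32, (8, 10))   # rnn[1] is the Recur wrapping the GRUCell
Flux.train!(loss, θ, [(X, Y)], opt)               # should now succeed on the first call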

@DhairyaLGandhi
Member

The issue seems to be more that our init now breaks with the shape that we have. The repro should probably add a call to Flux.reset!, since this breakage will happen every time we call reset!, which sets the state back to zeros(8) instead of zeros(8,10).

@DhairyaLGandhi
Member

DhairyaLGandhi commented May 5, 2020

(Or zeros(1, 2), in the case of the repro.)
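In other words, extending the repro above with a reset! between two training calls should trigger the same failure again; a sketch of that addition (untested):

Flux.train!(loss, θ, [(dX, dY)], opt)   # second call works, state has size (1, 2)
Flux.reset!(drnn)                       # state is rebound to the 1-D init vector
Flux.train!(loss, θ, [(dX, dY)], opt)   # expected to hit CUDNN_STATUS_BAD_PARAM again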

@jeremiedb
Contributor Author

The CUDNNError: CUDNN_STATUS_BAD_PARAM (code 3) bug continues to show up on the latest Flux and Zygote, but only for LSTM:

(Flux) pkg> st
Project Flux v0.11.1
Status `D:\Github\Flux.jl\Project.toml`
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v1.3.3
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.5
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.9
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test

MRE:

using Flux
using Statistics

rnn = Chain(LSTM(3, 4),
  Dense(4,1, σ),
  x -> reshape(x,:))

X = [rand(3,5) for i in 1:10]
Y = rand(5,10) ./ 10

rnn = rnn |> gpu
X = gpu(X)
Y = gpu(Y)

θ = Flux.params(rnn)
loss(x,y) = mean((Flux.stack(rnn.(x),2) .- y) .^ 2f0)
opt = ADAM(1e-3)
Flux.train!(loss, θ, [(X,Y)], opt)
Flux.train!(loss, θ, [(X,Y)], opt)
Flux.reset!(rnn)
Flux.train!(loss, θ, [(X,Y)], opt)

The error trace indicates that the failure occurs in the pullback step:

= back(dy, dho, dco)

This relates to the following line in CUDA.jl:
https://github.com/JuliaGPU/CUDA.jl/blob/31f67d6caf7fca566a9c7b36b850ce31df450ac6/lib/cudnn/rnn.jl#L195

I couldn't figure out what's different between RNN/GRU and LSTM that would cause a bug only in the latter. I'm also not clear about this line, where the cell state c_ for the LSTM isn't used: (dWi, dWh), db = CUDNN.backwardWeights(rnn, x, h_, y, reserve). Is that normal?

bors bot added a commit that referenced this issue Nov 7, 2020
1367: RNN update to drop CUDNN, fix LSTM bug and output type stability r=CarloLucibello a=jeremiedb

PR related to #1114 #1360 #1365 

Some experimentation with RNN handling.

The hidden state of each cell structure was dropped, as it wasn't needed (AFAIK it was only needed for size inference in CUDNN, and the bias size could be used as a substitute for the cells' `h` there as well).

Looked to drop the dependence on CUDNN entirely, so it's pure Flux/CUDA.jl; the file `src/cuda/curnn.jl` is no longer used. No modifications were made to the cell computations. Initial tests seem to show decent performance, but it has yet to be benchmarked.

Pending issue: despite having dropped the CUDNN dependency completely, there still seems to be an instability when running on the GPU. This is illustrated by the test at lines 1-50 of `test\rnn-test-jdb.jl`. If that test runs on the CPU, it goes through the 100 iterations fine. However, the same test on the GPU throws NaNs after a couple dozen iterations.
My only hypothesis so far: when iterating over the sequence through `m.(x)` or `map(rnn, x)`, is the order of execution safe? That is, is it possible that there isn't a `sync()` on the CUDA side between those sequence steps, which may mess up the state?
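One way to probe that hypothesis (an untested diagnostic sketch for the forward pass only, using a hypothetical run_sequential helper): replace the broadcast with an explicit left-to-right loop so that each step necessarily consumes the state produced by the previous one. If the NaNs disappear, ordering is a plausible culprit; otherwise the instability lies elsewhere.

function run_sequential(m, xs)
    ys = Vector{Any}(undef, length(xs))
    for i in eachindex(xs)
        ys[i] = m(xs[i])    # each step updates m.state before the next one runs
    end
    return ys
end

# compare Flux.stack(run_sequential(rnn, x), 2) against Flux.stack(rnn.(x), 2)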

### PR Checklist

- [x] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).


Co-authored-by: jeremiedb <jeremie_db@hotmail.com>
Co-authored-by: jeremie.db <jeremie.db@evovest.com>
@CarloLucibello
Member

@jeremiedb is this fixed by your PR?

@jeremiedb
Contributor Author

Yes, it's now working on master. Closing the issue.
