RNN on GPU fails on first backward call #1114

Closed
jeremiedb opened this issue Apr 8, 2020 · 7 comments · May be fixed by JuliaGPU/CuArrays.jl#706

Comments

@jeremiedb
Contributor

In the following minimal RNN example, the first call to Flux.train! fails with CUDNN_STATUS_BAD_PARAM. The same error is raised whether the cell is GRU, RNN, or LSTM.

using Flux
using Statistics

rnn = Chain(GRU(16, 8),
  Dense(8,1, σ),
  x -> reshape(x,:))

X = [rand(16,10) for i in 1:20]
Y = rand(10,20) ./ 10

rnn = rnn |> gpu
X = gpu(X)
Y = gpu(Y)

θ = Flux.params(rnn)
loss(x,y) = mean((Flux.stack(rnn.(X),2) .- y) .^ 2f0)
opt = ADAM(1e-3)
size(rnn[1].state)
Flux.reset!(rnn)
size(rnn[1].state)
Flux.train!(loss, θ, [(X,Y)], opt)
size(rnn[1].state)
loss(X,Y)

Both before and after reset!, the rnn state has size (8,), while after a call to train! on the GPU the state takes the expected size (8, 10). After each call to reset!, the CUDNN_STATUS_BAD_PARAM error appears on the first call to train!, but subsequent calls are fine since the state size stays (8, 10). I can't confirm whether the state size is the root cause, but it appears closely tied to the bug. A call to loss(X, Y) also results in the proper state dimension of (8, 10). Running on the CPU doesn't produce any error or warning.
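For reference, the state-size behaviour described above can be checked on the CPU alone. A minimal sketch, assuming Flux's usual Recur semantics (the state starts out as the 1-D init vector and broadcasts up to the batch size on the first forward pass):

using Flux

m = GRU(16, 8)
size(m.state)              # (8,)    state is the 1-D init vector
m(rand(Float32, 16, 10))   # forward pass on a batch of 10 columns
size(m.state)              # (8, 10) state now matches the batch size
Flux.reset!(m)
size(m.state)              # (8,)    reset! rebinds the state to the 1-D init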

Pkg info (the same error is also raised with the latest Zygote release):

  [4f1ea46c] AWSCore v0.6.9
  [1c724243] AWSS3 v0.6.10
  [fbb218c0] BSON v0.2.5
  [336ed68f] CSV v0.6.1
  [3a865a2d] CuArrays v2.0.1
  [a93c6f00] DataFrames v0.20.2
  [587475ba] Flux v0.10.4
  [28b8d3ca] GR v0.48.0
  [91a5bcdd] Plots v1.0.5
  [2913bbd2] StatsBase v0.32.2
  [e88e6eb3] Zygote v0.4.15 #master (https://github.com/FluxML/Zygote.jl.git)
  [10745b16] Statistics

CUDA: 10.1.168
CUDNN: 7.6.5

@DhairyaLGandhi
Member

It seems like we're sending something incorrect in our parameters in the backward pass.

@maleadt
Collaborator

maleadt commented May 5, 2020

Smaller repro:

using Flux
using Statistics
using Random

function main()
    Random.seed!(1)
    rnn = GRU(1, 1)

    X = [rand(1,2) for i in 1:2]
    Y = rand(2,2) ./ 10

    drnn = rnn |> gpu
    dX = gpu(X)
    dY = gpu(Y)

    θ = Flux.params(drnn)
    loss(x,y) = mean((Flux.stack(drnn.(dX),2) .- y) .^ 2f0)
    opt = ADAM(1e-3)

    try
        Flux.train!(loss, θ, [(dX,dY)], opt)
    catch ex
        @error "First training failed" exception=(ex, catch_backtrace())
    end

    Flux.train!(loss, θ, [(dX,dY)], opt)
    @info "Second training succeeded"
end

isinteractive() || main()

Looks like the dimensionality of the initial state is wrong. Initially, rnn.init is a 1-dimensional array [0], while after the first (failed) training the state becomes a 2-dimensional array with 2 elements. Indeed, fixing the initial state to be a 2-dimensional array of 2 zeros makes the first training work too:

drnn.state = CuArrays.zeros(Float32, (1,2))
Flux.train!(loss, θ, [(dX,dY)], opt)
# works at the first invocation

I'm not sure how/where that data is set and used, so I'll leave that to Flux experts 🙂
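For the original example at the top of the thread (hidden size 8, batch size 10), the analogous workaround would presumably be to pre-expand the state of the wrapped cell before the first call to train!; an untested sketch:

rnn[1].state = CuArrays.zeros(Float32, (8, 10))   # rnn[1] is the Recur wrapping the GRUCell
Flux.train!(loss, θ, [(X, Y)], opt)               # should now succeed on the first call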

@DhairyaLGandhi
Member

The issue seems to be more that our init now breaks with the shape that we have. The repro should probably add a call to Flux.reset!, since this breakage will happen every time we call reset!, which sets the state back to zeros(8) instead of zeros(8,10).

@DhairyaLGandhi
Member

DhairyaLGandhi commented May 5, 2020

(Or zeros(1, 2), in the case of the repro.)
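In other words, extending the repro above with a reset! between two training calls should trigger the same failure again; a sketch of that addition (untested):

Flux.train!(loss, θ, [(dX, dY)], opt)   # second call works, state has size (1, 2)
Flux.reset!(drnn)                       # state is rebound to the 1-D init vector
Flux.train!(loss, θ, [(dX, dY)], opt)   # expected to hit CUDNN_STATUS_BAD_PARAM again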

@jeremiedb
Contributor Author

The CUDNNError: CUDNN_STATUS_BAD_PARAM (code 3) bug continues to show up on the latest Flux and Zygote, but only for LSTM:

(Flux) pkg> st
Project Flux v0.11.1
Status `D:\Github\Flux.jl\Project.toml`
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v1.3.3
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.5
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.9
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test

MRE:

using Flux
using Statistics

rnn = Chain(LSTM(3, 4),
  Dense(4,1, σ),
  x -> reshape(x,:))

X = [rand(3,5) for i in 1:10]
Y = rand(5,10) ./ 10

rnn = rnn |> gpu
X = gpu(X)
Y = gpu(Y)

θ = Flux.params(rnn)
loss(x,y) = mean((Flux.stack(rnn.(x),2) .- y) .^ 2f0)
opt = ADAM(1e-3)
Flux.train!(loss, θ, [(X,Y)], opt)
Flux.train!(loss, θ, [(X,Y)], opt)
Flux.reset!(rnn)
Flux.train!(loss, θ, [(X,Y)], opt)

The error trace indicates that the failure occurs in the pullback step:

= back(dy, dho, dco)

This relates to the following line in CUDA.jl:
https://github.com/JuliaGPU/CUDA.jl/blob/31f67d6caf7fca566a9c7b36b850ce31df450ac6/lib/cudnn/rnn.jl#L195

I couldn't figure out what's different between RNN/GRU and LSTM that would cause a bug only in the latter. I'm also not clear about this line, where the cell state c_ for the LSTM isn't used: (dWi, dWh), db = CUDNN.backwardWeights(rnn, x, h_, y, reserve). Is that normal?

bors bot added a commit that referenced this issue Nov 7, 2020
1367: RNN update to drop CUDNN, fix LSTM bug and output type stability r=CarloLucibello a=jeremiedb

PR related to #1114 #1360 #1365 

Some experimentation with RNN handling.

The hidden state of each cell structure was dropped, as it wasn't needed (AFAIK it was only needed for size inference in CUDNN, and the bias size could be used as a substitute for the cells' `h` there as well).

Looked to drop the dependence on CUDNN entirely, so it's pure Flux/CUDA.jl; the file `src/cuda/curnn.jl` is no longer used. No modifications were made to the cell computations. Initial tests seem to show decent performance, but it has yet to be benchmarked.

Pending issue: despite having dropped the CUDNN dependency completely, there still seems to be an instability when running on the GPU. This is illustrated by the test at lines 1-50 of `test\rnn-test-jdb.jl`. If that test runs on the CPU, it goes through the 100 iterations fine. However, the same test on the GPU throws NaNs after a couple dozen iterations.
My only hypothesis so far: when iterating over the sequence through `m.(x)` or `map(rnn, x)`, is the order of execution safe? That is, is it possible that there isn't a `sync()` on the CUDA side between those sequence steps, which may mess up the state?
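One way to probe that hypothesis (an untested diagnostic sketch for the forward pass only, using a hypothetical run_sequential helper): replace the broadcast with an explicit left-to-right loop so that each step necessarily consumes the state produced by the previous one. If the NaNs disappear, ordering is a plausible culprit; otherwise the instability lies elsewhere.

function run_sequential(m, xs)
    ys = Vector{Any}(undef, length(xs))
    for i in eachindex(xs)
        ys[i] = m(xs[i])    # each step updates m.state before the next one runs
    end
    return ys
end

# compare Flux.stack(run_sequential(rnn, x), 2) against Flux.stack(rnn.(x), 2)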

### PR Checklist

- [x] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).


Co-authored-by: jeremiedb <jeremie_db@hotmail.com>
Co-authored-by: jeremie.db <jeremie.db@evovest.com>
@CarloLucibello
Member

@jeremiedb is this fixed by your PR?

@jeremiedb
Contributor Author

Yes, it's now working on master. Closing the issue.
