This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Add correct batch size for RNN hidden layer #706

Open
wants to merge 2 commits into master

Conversation

DhairyaLGandhi
Member

Fixes FluxML/Flux.jl#1114

The context here is that on the first call to the layer (also simulated by calling Flux.reset! on the structure), the gradient for the hidden state comes back with the same size as the state at that time (a vector), whereas after stepping through time the state gains a batch dimension and its size changes. This needs to be accounted for when sending back a gradient of the correct size. We handle it automatically on subsequent calls; it is only on the first run that we see this issue.
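For context, here is a minimal sketch of the kind of size adjustment involved, in plain Julia. expand_hidden is a made-up stand-in for the hBatch-style expansion, not the CuArrays API:

```julia
# Illustrative sketch only, not the CuArrays implementation: expand a
# hidden-state vector so its trailing dimension matches the batch size of x.
function expand_hidden(x::AbstractMatrix, h::AbstractVector)
    batch = size(x, 2)                         # columns of x = batch size
    return repeat(reshape(h, :, 1), 1, batch)  # (hidden,) -> (hidden, batch)
end

x = rand(Float32, 8, 10)    # 8 features, batch of 10
h = zeros(Float32, 8)       # freshly reset hidden state is a vector
size(expand_hidden(x, h))   # (8, 10): the size the gradient should come back as
```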

@MikeInnes @maleadt, would this be an alright fix?

@maleadt
Member

maleadt commented May 5, 2020

Maybe add a test?

@maleadt maleadt added the bugfix label May 5, 2020
@maleadt maleadt requested a review from MikeInnes May 5, 2020 11:26
@MikeInnes
Collaborator

As I understand it, ho should already have been expanded (via hBatch) during the forward pass. Its gradient dho should then have the same size as ho, in which case AFAICT this patch shouldn't be needed.

I don't think this PR would do any harm, but that seems fishy and it may be worth understanding why it happens in case something else is going wrong.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented May 5, 2020

h_ = CUDNN.hBatch(x, h)
and similar lines are where that expansion happens (for h), but it doesn't happen in the forward pass...

This is just to apply to dho the same treatment as is applied to h.

@MikeInnes
Collaborator

I believe that line mirrors this one which does happen in the forward pass (we just do it again in the backwards pass because we don't have access to the expanded h).

It would be worth printing out the size of ho and dho at the start of the pullback. Pullbacks are supposed to be able to rely on them being the same size (and if they are then there's no issue).
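As a sketch of that check against the public API (the layer and sizes here are made up, and on the GPU the arrays would be CuArrays):

```julia
# Sketch of the suggested check: the gradient seeded into the pullback should
# have the same size as the output of the forward pass.
using Flux, Zygote

m = Flux.RNN(8, 8)
x = rand(Float32, 8, 10)

ho, back = Zygote.pullback(m, x)
dho = ones(Float32, size(ho))
@show size(ho) size(dho)   # expected to match
back(dho)
```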

@DhairyaLGandhi
Member Author

Found this:

(size(ho), size(dho)) = ((8, 10), (8,))

This was right after the model had been reset! (via Flux.reset!).

All subsequent calls show matching sizes. I am assuming it's because h is a vector that we get a vector back? Here:

function pullback(rnn::RNNDesc{T}, x::CuArray{T}, h::CuArray{T}) where T <: Union{Float32,Float64}
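For reference, a rough reproduction of that situation, paraphrasing the MWE in FluxML/Flux.jl#1114 (the layer sizes are illustrative, not taken from the issue):

```julia
# Rough reproduction sketch: after Flux.reset! the hidden state is a vector
# again, and the first backward call on the GPU hit the size mismatch above.
using Flux

m = RNN(8, 8) |> gpu
x = gpu(rand(Float32, 8, 10))   # batch of 10

Flux.reset!(m)                  # state goes back to a length-8 vector
gs = gradient(() -> sum(m(x)), Flux.params(m))  # this first call failed
```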

@DhairyaLGandhi
Member Author

What should the test actually be? Just calling into the backwardData function?

@MikeInnes
Collaborator

If h is a vector then dh should also be a vector; I'm not sure if we handle that correctly. However, it seems likely to be independent of the current issue.

The fact that size(dho) != size(ho) implies that some other function is getting called with a matrix (ho) but producing a vector (dho) as a gradient, which is wrong. Ideally we'd fix that function, whatever it is.
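To illustrate just that contract (with a made-up function, unrelated to the RNN code): the gradient a pullback returns for an argument should have the same shape as that argument.

```julia
# Illustration of the shape contract only: the gradient returned for x should
# match size(x), whether x is a vector or a matrix.
using Zygote

double(x) = 2 .* x
Zygote.@adjoint double(x) = double(x), dy -> (2 .* dy,)

x = rand(Float32, 8, 10)
y, back = Zygote.pullback(double, x)
size(back(ones(Float32, size(y)))[1]) == size(x)   # true
```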

@jeremiedb

@DhairyaLGandhi @MikeInnes Have you had a moment to look at the above?

Using the latest releases below, the issue no longer seems to be reproducible:

  [3a865a2d] CuArrays v2.2.2
  [a93c6f00] DataFrames v0.21.2
  [587475ba] Flux v0.10.4
  [28b8d3ca] GR v0.50.1
  [91a5bcdd] Plots v1.4.3
  [2913bbd2] StatsBase v0.33.0
  [e88e6eb3] Zygote v0.4.22

However, I couldn't identify what change to either Flux, CuArrays or Zygote may have solved the issue. Any pointer would be welcome.

@DhairyaLGandhi
Member Author

Am I right in understanding that the MWE in FluxML/Flux.jl#1114 works as expected now?

@jeremiedb

Yes!

@DhairyaLGandhi
Member Author

I wonder if the Adapt issue is related, then; an incorrect size in the backward pass could result from ndims of a Transpose or similar.

@DhairyaLGandhi
Member Author

Ref JuliaGPU/Adapt.jl#24, I suppose

Successfully merging this pull request may close these issues.

RNN on GPU fails on first backward call