CUDNNError: CUDNN_STATUS_BAD_PARAM (code 3) while training LSTM neural network on GPU #1360
This seems related to the same issue as here: #1114. The following PR to Adapt.jl, JuliaGPU/Adapt.jl#24, was presumed to have fixed the problem, but at the time I had only performed checks on vanilla RNN and GRU, not LSTM. Maybe the patch proposed in JuliaGPU/CuArrays.jl#706 could still be brought over to CUDA.jl.
1367: RNN update to drop CUDNN, fix LSTM bug and output type stability r=CarloLucibello a=jeremiedb

PR related to #1114 #1360 #1365

Some experimentation with RNN handling. The hidden state of each cell structure was dropped, as it wasn't needed (AFAIK it was only used for size inference for CUDNN, but the bias size could serve as a substitute for the cells' `h` there as well). Looked to drop the dependence on CUDNN entirely, so it's pure Flux/CUDA.jl; the file `src/cuda/curnn.jl` is no longer used. No modifications were made to the cell computations. Initial tests seem to show decent performance, but this has yet to be benchmarked.

Pending issue: despite having dropped the CUDNN dependency completely, there still seems to be an instability issue when running on GPU. This is illustrated in the test at lines 1-50 of file `test\rnn-test-jdb.jl`. If that test runs on CPU, it goes well through the 100 iterations. However, the same test on GPU will throw NaNs after a couple dozen iterations. My only hypothesis so far: when performing the iteration over the sequence through `m.(x)` or `map(rnn, x)`, is the order of execution safe? That is, is it possible that there isn't a `sync()` on the CUDA side between those sequence steps, which may mess up the state?

### PR Checklist

- [x] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).

Co-authored-by: jeremiedb <jeremie_db@hotmail.com>
Co-authored-by: jeremie.db <jeremie.db@evovest.com>
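For context on the `m.(x)` / `map(rnn, x)` point above, here is a minimal sketch (my own illustration, not code from the PR) of why the execution order matters for a stateful layer: each step must see the hidden state left by the previous one.

```julia
using Flux

rnn = RNN(6, 2)                          # stateful recurrent layer
xs  = [rand(Float32, 6) for _ in 1:10]   # a 10-step sequence

out_broadcast = rnn.(xs)                 # broadcast the model over the sequence

Flux.reset!(rnn)                         # back to the initial hidden state
out_loop = [rnn(x) for x in xs]          # explicit, clearly ordered iteration

out_broadcast == out_loop                # true, provided the steps run in order
```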
Could you test from master? It should now be fixed.
I trained one epoch and it worked. The loss decreased and no error occurred, but the training was quite slow.
What do you mean by slow? Maybe validate the data shapes. In the example below: 100 batches, batch size of 256, sequence length of 20, and 6 data features as per your model definition. A single epoch takes 2.8 sec on a GTX 1660 GPU.

```julia
feat = 6
batch_size = 256
num_batches = 100
seq_len = 20

X = [[rand(Float32, feat, batch_size) for i in 1:seq_len] for batch in 1:num_batches];
Y = [rand(Float32, batch_size, seq_len) ./ 10 for batch in 1:num_batches];
X = X |> gpu;
Y = Y |> gpu;
data = zip(X, Y);

opt = ADAM(0.001, (0.9, 0.999))

function loss(X, Y)
    Flux.reset!(model)
    mse_val = sum(abs2.(Y .- Flux.stack(model.(X), 2)))
    return mse_val
end

model = Chain(LSTM(6, 70), LSTM(70, 70), LSTM(70, 70), Dense(70, 1, relu), x -> reshape(x, :)) |> gpu
ps = Flux.params(model)
Flux.reset!(model)
@time Flux.train!(loss, ps, data, opt)
```
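As a quick sanity check of the shapes in that example (a sketch; these are the sizes the code above should produce):

```julia
length(X)        # 100 batches
length(X[1])     # 20 timesteps per batch
size(X[1][1])    # (6, 256): features × batch_size at each timestep
size(Y[1])       # (256, 20): batch_size × seq_len
```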
For the data shapes, I was referring to ensuring the training data is organized as a proper iterator. You can refer to the docs about that: https://fluxml.ai/Flux.jl/stable/training/training/.
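For illustration, this is roughly the iterator contract `train!` relies on, using the `data = zip(X, Y)` from the example above (a sketch, not the actual `train!` implementation):

```julia
x1, y1 = first(data)    # one (input, target) pair, i.e. one batch

# Flux.train! essentially iterates like this, adding the gradient
# computation and the optimiser update at each step:
for (x, y) in data
    loss(x, y)
end
```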
As regards the original issue, I have closed it.
Yes, it is fine that you have closed the issue. Thanks for your explanation of zip; I will try it.

I think the batch size is confusing me. In my scenario I would like to evaluate the model over the last 50 time steps, to bring the state of the model and the model output to my target value after those 50 time steps. So the last 50 time steps are my sequence length, and together with the features they define my batch (shape). Then I would do this for all elements of my target starting at 51.
Flux's RNN design seems to have been the cause of frequent confusion, so I'll try to provide some clarification that will hopefully be reusable. Starting back from the basics: recall the classic RNN depiction from Colah's blog (diagram omitted), where a cell `A` is applied to each timestep's input, carrying a hidden state from one step to the next. In Flux, we represent such a model with, for example:

```julia
m = Chain(LSTM(6, 70), Dense(70, 1), x -> reshape(x, :))
```

We can apply a single step from a given sequence comprising 6 features with:

```julia
x = rand(6)

julia> m(x)
1-element Array{Float32,1}:
 0.028398542
```

Each call also updates the model's hidden state, so a second step (here with a new random input) produces a different output:

```julia
x = rand(6)

julia> m(x)
1-element Array{Float32,1}:
 0.07381232
```

Now, instead of processing a single timestep at a time, we can get the output for the full sequence by broadcasting the model over it:

```julia
seq = [rand(6) for i = 1:5]

julia> m.(seq)
5-element Array{Array{Float32,1},1}:
 [-0.17945863]
 [-0.20863166]
 [-0.20693761]
 [-0.21654114]
 [-0.18843849]
```

If for some reason one wants to exclude the first 3 timesteps of the chain from the computation of the loss, that can be handled through:

```julia
function loss(seq, y)
    sum((Flux.stack(m.(seq)[4:end], 1) .- y) .^ 2)
end

y = rand(2)

julia> loss(seq, y)
1.7021208968648693
```

Such a model would mean that only the last two timesteps (4 and 5) contribute to the loss, which is why `y` has length 2.

Note that in your use case, if the first 50 timesteps are to be ignored, there's nothing stopping you from applying the model over 60 timesteps, for example, so that the gradient is calculated over 10 data points (51 to 60).
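To make that concrete, here is a minimal sketch of that idea (sequence length and target size are illustrative, reusing the `m` defined above):

```julia
# Run the model over all 60 timesteps so the state builds up,
# but only let steps 51 to 60 enter the loss.
function loss_51_60(seq, y)
    sum((Flux.stack(m.(seq)[51:end], 1) .- y) .^ 2)
end

seq60 = [rand(6) for t in 1:60]
y10   = rand(10)               # targets for steps 51 to 60

loss_51_60(seq60, y10)         # scalar loss over the last 10 steps
```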
Alternatively, the "warmup" of the sequence could be performed once, followed by regular training where every step of the sequence is considered for the gradient update:

```julia
function loss(seq, y)
    sum((Flux.stack(m.(seq), 1) .- y) .^ 2)
end

seq_init = [rand(6) for i = 1:3]
seq_1 = [rand(6) for i = 1:5]
seq_2 = [rand(6) for i = 1:5]

y1 = rand(5)
y2 = rand(5)

X = [seq_1, seq_2]
Y = [y1, y2]
dt = zip(X, Y)

Flux.reset!(m)
m.(seq_init)

ps = params(m)
opt = ADAM(1e-3)
Flux.train!(loss, ps, dt, opt)
```

In this example, a warmup period of length 3 has been applied. The model's state is first reset, then the warmup sequence goes into the model, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (`seq_1` and `seq_2`) and all of the timestep outputs are considered for the loss.

In this scenario, it is important to note that a single continuous sequence is considered. Since the model state is not reset between the 2 batches, the state of the model is maintained, which only makes sense in the context where `seq_1` is the continuation of `seq_init` and `seq_2` is the continuation of `seq_1`.

Batch size would be 1 here, as there's only a single sequence within each batch. If the model were to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such a scenario, a single batch would be a vector of `seq_len` matrices of size `features × samples`, for example 5 matrices of size 6×4.

That would mean we have 4 sentences (or samples), each with 6 features (let's say a very small embedding) and each with a length of 5 (5 words per sentence). Since those sequences are independent of each other, the state should be reset before each new batch, which can be done within the loss:

```julia
function loss(seq, y)
    Flux.reset!(m)
    sum((Flux.stack(m.(seq), 1) .- y) .^ 2)
end
```

Hope these bring some clarification. I think an aspect of ambiguity with RNNs is that the data is 3-dimensional: features, sequence length and samples. In Flux, those 3 dimensions are provided through a vector of length `seq_len` whose elements are matrices of size `features × samples`. I think a language model with multiple sentences being trained simultaneously best illustrates the relevance of having both multiple timesteps and multiple samples: in a given mini-batch training step, the gradients are updated over both multiple timesteps and multiple samples.
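To make that batched layout concrete, here is a minimal sketch (sizes are illustrative, and `m2` is a hypothetical model defined just for this example, without the final reshape):

```julia
using Flux

m2 = Chain(LSTM(6, 70), Dense(70, 1))   # 6 features in, 1 prediction out

# One batch: 4 independent sentences, 6 features each, sequence length 5,
# given as a 5-element vector of 6×4 matrices (features × samples).
batch = [rand(Float32, 6, 4) for t in 1:5]

Flux.reset!(m2)
out = m2.(batch)           # 5-element vector of 1×4 matrices: one prediction per sample per timestep
size(out[1])               # (1, 4)
size(Flux.stack(out, 2))   # (1, 5, 4): output dim × seq_len × samples
```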
Thank you very much for this comprehensive answer and example. This helps me a lot.
Hi, the last example is nice and illustrative. I tried to go from it to the model in the earlier post. @jeremiedb uses `Flux.stack(model.(X), 2)` in the `loss` function there, but the shapes don't seem to line up for me:

```julia
julia> Y[1]
256×20 Array{Float32,2}:
 0.0884711  0.00465298  0.0696806  …  0.0348138  0.0749678  0.0218575
 ...

julia> Flux.stack(model.(X[1]), 2)
1×20×256 Array{Float32,3}:
[:, :, 1] =
 0.0674857  0.0597369  0.0714865  …  0.0679729  0.057861  0.0686841
 ...

julia> Y[1] .- Flux.stack(model.(X[1]), 2)
256×20×256 Array{Float32,3}:
[:, :, 1] =
 0.0209854  -0.0550839  -0.00180598  …  0.0171068  -0.0468266
```

Here I don't see a proper element-wise difference: the broadcast produces a 256×20×256 array rather than something with the shape of `Y[1]`. So my question is: how should the model output be arranged so that it matches `Y` in the loss? Thanks a lot for the clarification.
In the first model, there was a missing ingredient: the final `x -> reshape(x, :)` in the chain:

```julia
model = Chain(LSTM(6, 70), LSTM(70, 70), LSTM(70, 70), Dense(70, 1, relu), x -> reshape(x, :)) |> gpu
```

You should then get 256×20 dimensions, which matches `Y`:

```julia
julia> Flux.stack(model.(X[1]), 2)
256×20 CUDA.CuArray{Float32,2}:
```
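For a bit more detail on what that reshape changes, a small sketch using the shapes from your example:

```julia
# Dense(70, 1) produces a 1×256 matrix per timestep; stacking 20 of those along
# dim 2 gives 1×20×256. Reshaping each output to a plain vector gives 256×20.
h = rand(Float32, 1, 256)                                # stand-in for one timestep's output
size(Flux.stack([h for t in 1:20], 2))                   # (1, 20, 256)
size(Flux.stack([reshape(h, :) for t in 1:20], 2))       # (256, 20)
```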
@jeremiedb you should consider adding your comment from ten days ago somewhere in the docs; it's a very nice explanation.
@jeremiedb Thanks for the explanation about the loss and the data shape, I got it. I would like to ask about the batches you are describing. Let's assume I have only one realization of a time series (one sensor), so my time series has a given length and a single sample. Is it then just a single sequence with batch size 1?

I'm just trying to figure out how to properly prepare time-series data, such as macroeconomic data, for an estimation similar to a vector autoregression. I want to capture the time effect using an LSTM, but what should the batch size be to capture the propagation/dependence from the past? Thanks a lot.
I think you skipped a key notion in the explanation about the data format: consecutive timesteps are handled through a Vector of length `seq_len`, whose elements are matrices of size `features × batch_size`. If the data has that shape, then calling `m.(X)` is essentially equivalent to:

```julia
for i in 1:seq_len
    m(X[i])
end
```

At each step of the iteration, the model's hidden state is updated, so each timestep sees the state left by the previous one.
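As an illustration of preparing a single long series (one sensor) in that format, here is my own sketch with made-up sizes, not something from this thread:

```julia
feat, T, seq_len = 6, 1000, 50
series = rand(Float32, feat, T)                  # one sensor: feat features over T timesteps

starts = 1:seq_len:(T - seq_len)                 # non-overlapping windows of length seq_len
# Each training sequence is a Vector of seq_len matrices of size (feat, 1):
# batch_size is 1 because there is only one realization of the series.
X = [[reshape(series[:, s + t - 1], feat, 1) for t in 1:seq_len] for s in starts]
Y = [rand(Float32, 1, seq_len) for _ in starts]  # placeholder targets, one per timestep
data = zip(X, Y)
```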
I am also getting this CUDNN_STATUS_BAD_PARAM error.
Different cuDNN calls are used on the backwards pass, so validating with inference only is unfortunately not enough. Just to clarify, are these consistent or sporadic errors? Could you put together an MWE and open an issue?
`CUDNNError: CUDNN_STATUS_BAD_PARAM` occurs during training on the GPU. A single evaluation of the loss works, and training on the CPU works as well. `X` is a vector of `Array{Float32,2}(6,50)` and `Y` is `Array{Float32,2}(1,1)`.

Julia version and packages:

- Julia v1.5.2
- Flux v0.11.1
- CUDA v1.3.3

FluxML/NNlib.jl#237