
Use of function "calculateInputSizes(sizes)" in DeepSpeechModel.lua? #54

Closed
NightFury13 opened this issue Oct 7, 2016 · 17 comments

@NightFury13
Contributor

NightFury13 commented Oct 7, 2016

@SeanNaren I would like to know what exactly the function calculateInputSizes is used for. I am using my own image data for a scene-text task (I have updated the spatial-conv params accordingly).

It looks like it calculates the size of the tensors obtained after passing the inputs through the two spatial-conv layers. However, this function is called just before the forward-backward passes (here), and that 'sizes' parameter is passed to the CTC criterion.

AFAIK, the sizes passed to the CTC criterion are the sizes of the target labels (as shown here) [NOTE: I might be getting this wrong. I posted the PR to update the documentation in the CTC readme, so if I have this all wrong I need to update that readme too :P]. So shouldn't the size-calculation code be something like the one below? (Note that I take 'targets' as the input instead of 'sizes' as before.)

local function calculateInputSizes(targets)
    -- my variant: use the length of each TARGET label sequence
    local sizes = torch.Tensor(#targets)
    for i = 1, #targets do
        sizes[i] = #targets[i]
    end
    return sizes
end

Please let me know what is going wrong here. I get an error saying...

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-8514/cutorch/lib/THC/THCGeneral.c line=676 error=11 : invalid argument
/users/mohit.jain/torch/install/bin/luajit: ...it.jain/torch/install/share/lua/5.1/nnx/CTCCriterion.lua:74: cuda runtime error (11) : invalid argument at /tmp/luarocks_cutorch-scm-1-8514/cutorch/lib/THC/THCGeneral.c:676
stack traceback:
[C]: in function 'resize'
...it.jain/torch/install/share/lua/5.1/nnx/CTCCriterion.lua:77: in function 'inverseInterleave'
...it.jain/torch/install/share/lua/5.1/nnx/CTCCriterion.lua:53: in function 'backward'
./Network.lua:147: in function 'opfunc'
/users/mohit.jain/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
./Network.lua:166: in function 'trainNetwork'
AN4CTCTrain.lua:42: in main chunk
[C]: in function 'dofile'
...jain/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

If I go to CTCCriterion.lua at line 74, I see that it's simply creating a new tensor: local result = tensor.new():resize(sizes):zero(). With the original calculateInputSizes function, my sizes tensor has negative values and hence CUDA out-of-memory errors are thrown. If I instead use my variation of calculateInputSizes, I get the invalid-argument error stated above. Please help.

@SeanNaren
Owner

calculateInputSizes calculates the real size of each sample in the audio tensor so we can ignore the padding in the gradient cost calculation (found in the CTCCriterion).
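For reference, this is roughly what it does (the kernel width of 11 and time strides of 2 and 1 below are placeholders; substitute whatever your spatial-conv params actually are):

local function calculateInputSizes(sizes)
    -- map each sample's raw width/timestep count through the first conv layer (kW = 11, dW = 2)
    sizes = torch.floor((sizes - 11) / 2 + 1)
    -- and through the second conv layer (kW = 11, dW = 1)
    sizes = torch.floor((sizes - 11) / 1 + 1)
    return sizes
end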

If each sample of your output has the same length, a way around this would be to do something like:

sizes:resize(outputs:size(1)):fill(outputs:size(2))

Hopefully this helps!

@NightFury13
Contributor Author

I have variable-length image samples (same height, varying widths), so the alternate trick won't work. What should be passed in the sizes parameter to the CTCCriterion for loss calculation? (here) From what you suggest, it is the sequence length of the input samples. Can you please confirm?

So, in my case of images, since I pass a column-strip of the image at each time-step, sizes would be the width of each image in the batch after it has been passed through the SpatialConv layers?

@SeanNaren
Owner

Sorry for the late response!

From what I can tell you will not need to touch calculateInputSizes. It calculates the sizes with respect to the convolution layers, not the raw input. So as long as the input is given in a format similar to how the audio data is currently given, it should automatically calculate the sizes to pass to the gradient calculation.

And just to confirm, it is the true length of the input samples AFTER going through the convolutional layers (which reduce the number of timesteps; that's why this is necessary).
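To make that concrete with made-up numbers (the widths here are purely illustrative, and the resulting values depend on your conv params):

-- a batch of three images with widths 120, 90 and 150 pixels
local inputWidths = torch.Tensor({120, 90, 150})
-- per-sample timestep counts that survive the conv layers; these are what get
-- handed to the CTCCriterion so the padded timesteps are ignored
local sizes = calculateInputSizes(inputWidths)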

@NightFury13
Contributor Author

NightFury13 commented Oct 13, 2016

Thanks @SeanNaren :)

calculateInputSizes is a really neat hack! It turns out my problem was the noisy samples in my dataset whose image width was less than the width of the convolution kernels I was using. Simply removing these corrupted samples from the dataset did the job for me. Thanks again!
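(For anyone hitting the same thing, a rough sketch of the filtering I did; the dataset layout and field names here are just illustrative, not the repo's loader API:)

-- drop samples whose width is too small to survive the conv kernels;
-- minWidth is whatever width still gives a positive value from calculateInputSizes
local function removeNarrowSamples(dataset, minWidth)
    local kept = {}
    for _, sample in ipairs(dataset) do
        if sample.img:size(3) >= minWidth then -- assumes CHW image tensors
            table.insert(kept, sample)
        end
    end
    return kept
end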

@SeanNaren
Owner

Ah, that is a good point! I think it would be nice to add this to the documentation where appropriate; I ran into the same issue a lot when training these models!

@NightFury13
Contributor Author

@SeanNaren I can send you a PR once I get the code working properly myself. The model currently trains in a weird fashion for me: the training loss keeps fluctuating between really small values and inf :/ (take a look at the train-logs below). Any tips on what might be going wrong? I am checking whether this is indeed exploding gradients (not hopeful about exploding grads, as the loss shouldn't come back to non-inf values once it has exploded, right?).

Training Epoch: 3 Average Loss: 3.046032 Average Validation WER: inf Average Validation CER: inf
[======= 5892/5892 ==================================>] Tot: 1h13m | Step: 729ms
Training Epoch: 4 Average Loss: 2.324838 Average Validation WER: inf Average Validation CER: inf
[======= 5892/5892 ==================================>] Tot: 1h16m | Step: 698ms
Training Epoch: 5 Average Loss: 1.797586 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h9m | Step: 693ms
Training Epoch: 6 Average Loss: -inf Average Validation WER: inf Average Validation CER: inf
[======= 5892/5892 ==================================>] Tot: 1h9m | Step: 760ms
Training Epoch: 7 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h9m | Step: 719ms
Training Epoch: 8 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 749ms
Training Epoch: 9 Average Loss: 0.579901 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h11m | Step: 705ms
Training Epoch: 10 Average Loss: 0.420499 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 766ms
Training Epoch: 11 Average Loss: 0.287849 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 706ms
Training Epoch: 12 Average Loss: 0.192960 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h13m | Step: 834ms
Training Epoch: 13 Average Loss: 0.122787 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 710ms
Training Epoch: 14 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h7m | Step: 506ms
Training Epoch: 15 Average Loss: 0.042043 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 43m38s | Step: 481ms
Training Epoch: 16 Average Loss: 0.023819 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 43m41s | Step: 464ms
Training Epoch: 17 Average Loss: 0.010227 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 43m51s | Step: 418ms
Training Epoch: 18 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 44m14s | Step: 484ms
Training Epoch: 19 Average Loss: 0.005311 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 46m42s | Step: 493ms
Training Epoch: 20 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan

..
..
..

Training Epoch: 33 Average Loss: 0.000093 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 53m33s | Step: 530ms
Training Epoch: 34 Average Loss: 0.000966 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 53m15s | Step: 570ms
Training Epoch: 35 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[=============== 5892/5892 ==================================>] Tot: 53m49s | Step: 521ms
Training Epoch: 36 Average Loss: 0.000915 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 54m17s | Step: 530ms
Training Epoch: 37 Average Loss: -0.000312 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 54m24s | Step: 552ms
Training Epoch: 38 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 49m14s | Step: 509ms
Training Epoch: 39 Average Loss: -0.000470 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m59s | Step: 599ms
Training Epoch: 40 Average Loss: 0.000786 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 57m7s | Step: 504ms
Training Epoch: 41 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m26s | Step: 457ms
Training Epoch: 42 Average Loss: -0.000240 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 50m47s | Step: 539ms
Training Epoch: 43 Average Loss: 0.000231 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 51m42s | Step: 558ms
Training Epoch: 44 Average Loss: 0.000756 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 51m53s | Step: 599ms
Training Epoch: 45 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 1h39s | Step: 852ms
Training Epoch: 46 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 1h22m | Step: 1s105ms
Training Epoch: 47 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m24s | Step: 533ms
Training Epoch: 48 Average Loss: -0.000156 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 1h2m | Step: 486ms
Training Epoch: 49 Average Loss: 0.000695 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 51m34s | Step: 469ms
Training Epoch: 50 Average Loss: 0.000689 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m41s | Step: 493ms
Training Epoch: 51 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m34s | Step: 613ms
Training Epoch: 52 Average Loss: 0.000671 Average Validation WER: nan Average Validation CER: nan
[=============== 5892/5892 ==================================>] Tot: 53m42s | Step: 512ms
Training Epoch: 53 Average Loss: 0.000359 Average Validation WER: nan Average Validation CER: nan

@SeanNaren
Owner

Those are some fun losses! Have you tried changing the cutoff to a lower value like 100?

@NightFury13
Contributor Author

@SeanNaren I haven't tried that yet. On it. Btw, by cutoff you mean the MaxNorm right? For normalizing gradients?

@SeanNaren
Owner

SeanNaren commented Oct 14, 2016

Sorry, yes exactly! That is what I meant :)

From the tests I've done, lowering the maxNorm helps prevent gradients from exploding!
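For reference, max-norm clipping on the flattened gradients looks roughly like this (a sketch, not the exact code in Network.lua):

-- scale the gradient vector down whenever its norm exceeds maxNorm
local function clipGradients(gradParameters, maxNorm)
    local norm = gradParameters:norm()
    if norm > maxNorm then
        gradParameters:mul(maxNorm / norm)
    end
end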

@NightFury13
Contributor Author

NightFury13 commented Oct 15, 2016

@SeanNaren I've tested running the code while linearly bringing the maxNorm down to a value as low as 10, but I still face the nan losses and inf WER/CER issue. From your experience, should I keep going lower, or is this probably not the parameter I should be tuning? Please help.

@NightFury13
Contributor Author

Also, I have tried reducing the number of RNN hidden layers to something like 3 instead of the original 7. Still no positive signs though.

@SeanNaren
Owner

This goes against the grain of DS2, but could you try using cudnn.LSTMs instead of RNNs? Try to keep the number of parameters around 80 million. LSTMs might help out since they include a lot of improvements over the standard recurrent net!

@NightFury13
Contributor Author

@SeanNaren will simply changing this line do the trick, replacing it with self.rnn = cudnn.LSTM(outputDim, outputDim, 1)?

I see that there are BLSTM implementations also available, so just confirming.

@SeanNaren
Owner

SeanNaren commented Oct 16, 2016

Ah, my apologies, that would be a bit strange. I'd suggest doing this in the DeepSpeechModel.lua class:

Change:

local function RNNModule(inputDim, hiddenDim, opt)
    if opt.nGPU > 0 then
        require 'BatchBRNNReLU'
        return cudnn.BatchBRNNReLU(inputDim, hiddenDim)
    else
        require 'rnn'
        return nn.SeqBRNN(inputDim, hiddenDim)
    end
end

to something like:

local function RNNModule(inputDim, hiddenDim, opt)
    require 'cudnn'
    local rnn = nn.Sequential()
    rnn:add(cudnn.BLSTM(inputDim, hiddenDim, 1))
    rnn:add(nn.View(-1, 2, outputDim):setNumInputDims(2)) -- have to sum activations
    rnn:add(nn.Sum(3))
    return rnn
end

I would suggest changing the hidden size to around 700, as the default would make an LSTM pretty large!
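A quick way to sanity-check the model size after changing the hidden dimension (a sketch; assumes model is the constructed network):

-- count the trainable parameters so you can stay near the ~80 million target
local params = model:getParameters()
print(('Number of parameters: %d'):format(params:nElement()))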

@NightFury13
Contributor Author

Thanks a lot for the clarification! Will update with results 😄

@SeanNaren
Owner

Open a new issue (maybe something like "better convergence with a custom dataset"); I think people will find this useful!

@NightFury13
Contributor Author

NightFury13 commented Oct 16, 2016

@SeanNaren Can you tell me what role outputDim plays in rnn:add(nn.View(-1, 2, outputDim):setNumInputDims(2))? What value does it signify?
