Why does the gradient not scale linearly with CTCLoss? #250
I am trying to use CTCLoss together with another loss, and want the coefficient on the CTCLoss term to be smaller.
This is the extreme case of a small CTCLoss contribution (CTCLoss coefficient = 0).
In the example above, the loss is always 0, yet the gradients are not scaled down accordingly.
Can anyone explain why the gradient does not scale linearly with CTCLoss?
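For concreteness, a minimal sketch of this kind of setup, assuming SeanNaren's `warpctc_pytorch` binding (the shapes, labels, and the second loss below are made up for illustration):

```python
import torch
from warpctc_pytorch import CTCLoss  # warp-ctc binding used by deepspeech.pytorch

ctc_loss_fn = CTCLoss()

# Dummy unnormalized activations: (T, N, C) = (sequence length, batch, classes).
acts = torch.randn(50, 2, 29, requires_grad=True)
labels = torch.IntTensor([1, 2, 3, 1, 2])   # flattened targets for the whole batch
act_lens = torch.IntTensor([50, 50])
label_lens = torch.IntTensor([3, 2])

ctc_coeff = 0.0                              # the extreme case described above
other_loss = (acts ** 2).mean()              # stand-in for the other loss

loss = ctc_coeff * ctc_loss_fn(acts, labels, act_lens, label_lens) + other_loss
loss.backward()

# Expectation: with ctc_coeff == 0 the CTC term should contribute nothing to
# acts.grad. In practice (the point of this issue) the warp-ctc backward ignores
# the scale factor, so the CTC gradient is applied at full strength.
```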
I can point people in the right direction, however I'm unsure of the fix.
If you look here, we don't take grad_outputs into consideration when calculating the gradients, so any scale factor isn't being applied to the gradients.
There was a commit added to the repo to fix this, but I reverted it (see the commit here). The reason I reverted is that convergence became significantly slower when the fix was added.
Any help on this would be great :)
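For context, the fix under discussion amounts to multiplying the saved CTC gradient by `grad_output` in the backward pass. A rough sketch of the pattern, with `compute_ctc` as a dummy stand-in for the actual warp-ctc call (not the binding's real code):

```python
import torch
from torch.autograd import Function

def compute_ctc(acts, labels, act_lens, label_lens):
    """Stand-in for the warp-ctc C call: returns per-sample costs and the
    gradient of the summed cost w.r.t. the activations (dummy values here)."""
    costs = acts.logsumexp(-1).sum(0)     # dummy per-sample cost, shape (N,)
    grads = torch.softmax(acts, dim=-1)   # dummy gradient, shape (T, N, C)
    return costs, grads

class _CTC(Function):
    @staticmethod
    def forward(ctx, acts, labels, act_lens, label_lens):
        costs, grads = compute_ctc(acts, labels, act_lens, label_lens)
        ctx.save_for_backward(grads)
        return costs.sum()

    @staticmethod
    def backward(ctx, grad_output):
        grads, = ctx.saved_tensors
        # Buggy behaviour described above: `return grads, None, None, None`
        # ignores grad_output, so scaling the loss has no effect on the gradient.
        # The fix: multiply the stored gradient by the incoming grad_output.
        return grads * grad_output, None, None, None

# With the fix, scaling the loss scales acts.grad by the same factor.
acts = torch.randn(50, 2, 29, requires_grad=True)
loss = 0.1 * _CTC.apply(acts, None, None, None)
loss.backward()
```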
Hi @SeanNaren, I'm very interested in your explanation that convergence was affected by connecting the loss scale to the gradients. I really wonder why they are related, and why passing the loss scale through to the gradients makes convergence slower, since I use PyTorch 1.0.0's official CTCLoss.
If I want to do the same with the official CTCLoss, how should I go about it?
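For reference, the official `torch.nn.CTCLoss` in PyTorch >= 1.0 is an ordinary autograd op, so multiplying the loss by a coefficient does scale its gradient. A minimal sketch:

```python
import torch
from torch import nn

ctc = nn.CTCLoss(blank=0)

# nn.CTCLoss expects log-probabilities of shape (T, N, C).
log_probs = torch.randn(50, 2, 29).log_softmax(2).requires_grad_()
targets = torch.randint(1, 29, (2, 10), dtype=torch.long)
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

loss = 0.1 * ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
# log_probs.grad here is 0.1 times the gradient of the unscaled loss,
# i.e. the coefficient is respected by autograd.
```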
@jinserk sorry for being away, was lurking on the forums nonetheless and saw your post
We've been running into the same issue internally; the gradients not being scaled has caused an issue for us.
The issue right now is that convergence seems to be better when the gradients are not scaled; however, this is less a feature than a bug. When we fix it, this line ends up scaling the gradient by the batch size, which seems to slow convergence dramatically.
The proper thing to do is to scale the gradients by the batch size. That will require adjusting the learning rates to match, and this is something I haven't looked at so far! The PaddlePaddle DeepSpeech repo swapped to Adam, which might help since no learning rate really needs to be specified.
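As a rough illustration of the idea (toy stand-ins, not the repo's actual training loop): divide the summed CTC cost by the batch size and use Adam so the step size is less sensitive to the gradient scale.

```python
import torch
from torch import nn

# Toy stand-ins so the snippet runs; the real code would use the acoustic model
# and the warp-ctc loss from this repo.
model = nn.Linear(13, 29)
inputs = torch.randn(50, 8, 13)                        # (T, N, features)
summed_cost = lambda acts: acts.logsumexp(-1).sum()    # placeholder for the summed CTC cost

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # Adam, as in PaddlePaddle's DeepSpeech

acts = model(inputs)                                   # (T, N, C)
batch_size = acts.size(1)

loss = summed_cost(acts) / batch_size                  # normalize the cost per sample
optimizer.zero_grad()
loss.backward()                                        # gradients are scaled down by 1/N as well
optimizer.step()
```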
Thank you for replying, @SeanNaren! When I used the official CTCLoss…
And is the 'slow convergence' only a problem for DeepSpeech, or does any project using CTC have the same issue?
One more thing I would like to ask: is the InferenceSoftmax in this implementation correct? During training the input is simply passed straight through, which means there is no bound on the linear-scale output of the network model. When the output goes negative, this creates unstable points in the log-probability calculation inside the CTCLoss. Typically the output should go through a Softmax so that it is guaranteed to lie between 0 and 1. Could this be one reason for the faster convergence when there is no Softmax on the output, since the positive linear-scale values can be much larger than 1?
EDIT: Sorry, I looked at the internal code of warp-ctc and now see that a softmax step already exists inside the CTC loss computation.
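For reference, the pattern under discussion looks roughly like this (a sketch, not necessarily the repo's exact class): pass the activations straight through during training, because warp-ctc applies its own softmax inside the loss, and only apply softmax at inference time for decoding.

```python
import torch
from torch import nn
import torch.nn.functional as F

class InferenceSoftmax(nn.Module):
    """Softmax applied only at inference time.

    During training the raw activations pass straight through, because warp-ctc
    applies a softmax internally when computing the loss.
    """
    def forward(self, x):
        if self.training:
            return x                     # unnormalized activations for the loss
        return F.softmax(x, dim=-1)      # probabilities for decoding

# Usage sketch:
layer = InferenceSoftmax()
acts = torch.randn(50, 2, 29)
layer.train(); out_train = layer(acts)   # identity
layer.eval();  out_eval = layer(acts)    # rows now sum to 1 over the class dim
```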