Why does the gradient not scale linearly with CTCLoss? #250

Open
gmkim90 opened this Issue Feb 14, 2018 · 7 comments


gmkim90 commented Feb 14, 2018

I am trying to use CTCLoss together with another loss, and I want the coefficient on the CTCLoss term to be smaller. But when I tried different coefficients, the gradients did not scale with them.

This is the extreme case of a small CTCLoss contribution (CTCLoss coefficient = 0):

criterion = CTCLoss()
loss = 0 * criterion(out, targets, sizes, target_sizes)
model.zero_grad()
loss.backward()
optimizer.step()

# accumulate the total gradient norm across all parameters
grad_norm = 0
for param in model.parameters():
    if param.grad is not None:
        grad_norm += param.grad.norm().data[0]

In the above example, the loss is always 0. However, a fairly large gradient still exists even though the CTCLoss coefficient is 0.

Can anyone explain why the gradient does not scale linearly with the CTCLoss?
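
For context, one way to pin this down is to compare the total gradient norm under two different scale factors. A minimal sketch, assuming a hypothetical forward pass out = model(inputs) (not shown above) and the same criterion and tensors as in the snippet:

def total_grad_norm(model):
    # sum of parameter-gradient norms, mirroring the loop above
    total = 0.0
    for param in model.parameters():
        if param.grad is not None:
            total += float(param.grad.norm())
    return total

model.zero_grad()
out = model(inputs)  # hypothetical forward pass, recomputed so each backward has a fresh graph
criterion(out, targets, sizes, target_sizes).backward()
norm_full = total_grad_norm(model)

model.zero_grad()
out = model(inputs)
(0.5 * criterion(out, targets, sizes, target_sizes)).backward()
norm_half = total_grad_norm(model)

# if backward honoured the scale factor, norm_half would be ~0.5 * norm_full
print(norm_full, norm_half)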

SeanNaren (Owner) commented Jul 13, 2018

I can point people in the right direction; however, I'm unsure of the fix.

If you look here, we don't take grad_outputs into consideration when calculating the gradients, so any scale factor isn't being applied to the gradients.

There was a commit added to the repo to fix this, but I reverted it (see the commit here). The reason I reverted it is that convergence became significantly slower when this fix was added.
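
For illustration only (this is not the actual binding code), the missing piece is roughly a custom autograd Function whose backward multiplies the CTC gradients saved in forward by grad_output, so that an external scale factor propagates through:

import torch
from torch.autograd import Function

class ScaledCTCSketch(Function):
    # Hypothetical simplification: `ctc_grads` and `cost` stand in for the values
    # the real CTC kernel computes; only the backward scaling matters here.
    @staticmethod
    def forward(ctx, acts, ctc_grads, cost):
        ctx.save_for_backward(ctc_grads)
        return cost  # assumed to be a scalar summed over the batch

    @staticmethod
    def backward(ctx, grad_output):
        ctc_grads, = ctx.saved_tensors
        # multiplying by grad_output is what makes `0.5 * loss` halve the gradients;
        # returning ctc_grads unscaled reproduces the behaviour described above
        return ctc_grads * grad_output, None, None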

Any help on this would be great :)

cc @weedwind

jinserk (Contributor) commented Oct 20, 2018

Hi @SeanNaren, I'm very interested in your explanation that convergence was influenced by connecting the loss to the gradients. I really wonder why they are related, and why passing the loss scale to the gradients makes convergence slower. I use PyTorch 1.0.0's official nn.CTCLoss, but I think I have the same convergence problem.

If I want to do the same with the official nn.CTCLoss, what can I do? How do I pass 1 as the grad_outputs of CTCLoss's backward?
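
For what it's worth, with a scalar loss PyTorch's backward() already uses an implicit grad_output of 1, so passing it explicitly is equivalent. A small sketch, assuming ctc is an nn.CTCLoss instance and the input tensors are already defined:

loss = ctc(log_probs, targets, input_lengths, target_lengths)  # scalar with the default reduction
loss.backward(torch.ones_like(loss))  # equivalent to loss.backward() for a scalar loss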

SeanNaren (Owner) commented Oct 20, 2018

@jinserk sorry for being away, was lurking on the forums nonetheless and saw your post

We've been running into the same issue internally; the gradients not being scaled has caused an issue for us.

The issue right now is that convergence seems to be better when the gradients are not scaled; however, this is less a feature and more a bug. When we fix this, it causes this line to scale the gradient by the batch size, which seems to reduce the speed of convergence dramatically.

The proper thing to do is to scale the gradients by the batch size. The learning rates will have to be altered to match this, which is something I haven't looked at so far! The PaddlePaddle DeepSpeech repo swapped to Adam, which might help here since no learning rate really needs to be specified.
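
As a rough sketch of that idea (an assumption about how it could be wired up, not the repo's actual fix): average the CTC loss over the batch before calling backward and switch the optimizer to Adam, e.g.:

import torch
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=3e-4)  # hypothetical learning rate

out = model(inputs)  # warp-ctc layout: (seq_len, batch, num_classes)
batch_size = out.size(1)
loss = criterion(out, targets, sizes, target_sizes) / batch_size  # average instead of sum

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 400)  # gradient clipping, DeepSpeech-style
optimizer.step()

Of course, dividing the loss only affects the gradients once grad_outputs is actually applied in backward, as discussed above.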

jinserk (Contributor) commented Oct 20, 2018

Thank you for replying, @SeanNaren! When I used the official nn.CTCLoss, the loss itself decreased very fast, but the accuracy didn't improve at all. I tried Adam too, and the result was almost the same: the loss decreased more quickly, but the accuracy didn't get better.
If I pass the argument size_average=True, does that already scale the gradient by the batch size? Assuming, of course, that the gradient outputs are connected to the gradients in backward.

And is the 'slow convergence' only a problem for DeepSpeech, or does any project using CTC have the same issue?

jinserk (Contributor) commented Oct 22, 2018

One more thing I would like to ask: is the InferenceSoftmax in this implementation correct or not? During training, the input is just passed straight through to the output, which means there is no bound on the linear-scale output of the network. In the negative range, this creates unstable points in the log-probability calculation inside the CTCLoss. Typically the output should go through a softmax so that it is guaranteed to lie between 0 and 1. Could this be one reason for the fast convergence without a softmax output, since the linear-scale positive values could be much larger than 1?

EDIT: Sorry, I looked at the internal code of warp-ctc and now see that a softmax step exists in the cost_and_grad() function. 😄

SeanNaren (Owner) commented Oct 23, 2018

@jinserk oh god, any idea if the softmax happens internally in the torch version? This would mess things up a bit!

jinserk (Contributor) commented Oct 23, 2018

nn.CTCLoss doesn't have the softmax built in; we need to give it an explicit log softmax of the inputs. I've tested this, but it isn't what makes the difference. I guess the biggest difference comes from the gradient connection in backward().
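
For reference, a minimal nn.CTCLoss example (PyTorch >= 1.0) with the explicit log_softmax, using made-up shapes:

import torch
import torch.nn as nn
import torch.nn.functional as F

T, N, C = 50, 4, 20                        # time steps, batch size, classes (index 0 = blank)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = F.log_softmax(logits, dim=-1)  # nn.CTCLoss expects log-probabilities

targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, reduction='mean')
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()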
