Why is noise scaled by Ntrain in RMSProp #2

Open
jaak-s opened this issue Mar 5, 2017 · 7 comments

@jaak-s

jaak-s commented Mar 5, 2017

In SGLD_RMSprop.m, the noise is scaled by opts.N, which is set to Ntrain in the DNN experiments:
https://github.com/ChunyuanLI/pSGLD/blob/master/pSGLD_DNN/algorithms/SGLD_RMSprop.m#L51

Why is this the case? In the paper (https://arxiv.org/pdf/1512.07666v1.pdf) there is no such scaling.

I also checked SGLD_Adagrad.m, and there the noise is not scaled by Ntrain.
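
For reference, my reading of the pSGLD update in the paper is roughly

theta_{t+1} = theta_t + (eps_t/2) * [ G(theta_t) * ( grad log p(theta_t) + (N/n) * sum_i grad log p(x_i | theta_t) ) + Gamma(theta_t) ] + N(0, eps_t * G(theta_t)),

so the full-data factor N/n multiplies only the gradient term, while the injected noise has covariance eps_t * G(theta_t), which depends on the stepsize and preconditioner but not on Ntrain.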

@ChunyuanLI
Owner

The noise is scaled this way for faster convergence in practice; otherwise, according to the theory, we would need to train the model for a long time.

@jaak-s
Author

jaak-s commented Mar 7, 2017

Is the choice of Ntrain for the scaling arbitrary? Or do you think it will work in general for almost any dataset?

@ChunyuanLI
Owner

Ntrain is the number of data points in the training dataset.

@jaak-s
Author

jaak-s commented Mar 7, 2017

Yes, but other values could be used for the scaling, e.g. a fixed constant (say 100) or the batch size. So my question is whether you expect Ntrain to be a good choice in practice that works well for almost any dataset, or whether we should try several scaling values and choose the best one.

@ChunyuanLI
Owner

I expect that Ntrain is a good choice in practice.

The "grad" is mean of the gradients computed in the mini-batch. We should use opts.N*grad to approximate the true gradient of the full dataset.

Instead, we consider the scaling issue in the stepsize "lr", and come to the update as following:

grad = lr* grad ./ pcder + sqrt(2*lr./pcder/opts.N).*randn(size(grad)) ;

However, this would take a long time to converge. In practice, I recommend:

grad = lr* grad ./ pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N ;
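
To spell this out, here is a minimal sketch of one update step, assuming lr absorbs the factor (eps_t/2)*opts.N of the theoretical stepsize; it is not the exact repository code, and theta, noise_theory, noise_practice are hypothetical names:

% grad: mini-batch mean gradient; pcder: RMSprop preconditioner denominator
% (roughly lambda + sqrt of the running average of squared gradients); opts.N: Ntrain.
noise_theory   = sqrt(2*lr ./ pcder / opts.N) .* randn(size(grad));    % theory-consistent noise level
noise_practice = (sqrt(2*lr ./ pcder) .* randn(size(grad))) / opts.N;  % smaller noise, faster in practice
update = lr * grad ./ pcder + noise_practice;   % swap in noise_theory for the theoretical variant
theta  = theta + update;   % sign depends on whether grad is a log-posterior (ascent) or loss (descent) gradient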

@jaak-s
Author

jaak-s commented Mar 8, 2017

Thank you for the explanation. I saw that SGLD.m also uses the same scaling by opts.N. So in your experience, does the same slow convergence hold for the SGLD method too?

@ChunyuanLI
Owner

Yes, the same slow convergence also holds for SGLD, so the same practical scaling is used there.
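
For completeness, the analogous plain SGLD step (no preconditioner) with the same practical 1/opts.N noise shrinking would look roughly as below; again a sketch rather than the exact SGLD.m code, with theta a hypothetical name:

% Plain SGLD step with the injected noise shrunk by opts.N, mirroring the practical pSGLD variant above.
update = lr * grad + (sqrt(2*lr) * randn(size(grad))) / opts.N;
theta  = theta + update;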
