Why is noise scaled by Ntrain in RMSProp #2

Open
jaak-s opened this issue Mar 5, 2017 · 7 comments

@jaak-s

jaak-s commented Mar 5, 2017

In SGLD_RMSprop.m, the noise is scaled by opts.N, which is set to Ntrain in the DNN experiments:
https://github.com/ChunyuanLI/pSGLD/blob/master/pSGLD_DNN/algorithms/SGLD_RMSprop.m#L51

Why is this the case? In the paper (https://arxiv.org/pdf/1512.07666v1.pdf) there is no such scaling.

I also checked SGLD_Adagrad.m, and there the noise is not scaled by Ntrain.
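
For reference, my reading of the pSGLD update in the paper is roughly

theta_{t+1} = theta_t + (eps_t/2) * [ G(theta_t) * ( grad log p(theta_t) + (N/n) * sum_i grad log p(x_i | theta_t) ) + Gamma(theta_t) ] + N(0, eps_t * G(theta_t)),

so the full-data factor N/n multiplies only the gradient term, while the injected noise has covariance eps_t * G(theta_t), which depends on the stepsize and preconditioner but not on Ntrain.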

@ChunyuanLI
Owner

The noise is scaled this way for faster convergence in practice; otherwise, according to the theory, we would need to train the model for a long time.

@jaak-s
Author

jaak-s commented Mar 7, 2017

Is the choice of Ntrain for the scaling arbitrary? Or do you think it will work in general for almost any dataset?

@ChunyuanLI
Owner

Ntrain is the number of data points in the training dataset.

@jaak-s
Author

jaak-s commented Mar 7, 2017

Yes, but other values could be used for the scaling, e.g. a fixed constant (say 100) or the batch size. So my question is whether you expect Ntrain to be a good choice in practice that works well for almost any dataset, or whether we should try several scaling values and choose the best one.

@ChunyuanLI
Owner

I expect that Ntrain is a good choice in practice.

The "grad" is mean of the gradients computed in the mini-batch. We should use opts.N*grad to approximate the true gradient of the full dataset.

Instead, we consider the scaling issue in the stepsize "lr", and come to the update as following:

grad = lr* grad ./ pcder + sqrt(2*lr./pcder/opts.N).*randn(size(grad)) ;

However, this would take a long time to converge. In practice, I recommend:

grad = lr* grad ./ pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N ;
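
To spell this out, here is a minimal sketch of one update step, assuming lr absorbs the factor (eps_t/2)*opts.N of the theoretical stepsize; it is not the exact repository code, and theta, noise_theory, noise_practice are hypothetical names:

% grad: mini-batch mean gradient; pcder: RMSprop preconditioner denominator
% (roughly lambda + sqrt of the running average of squared gradients); opts.N: Ntrain.
noise_theory   = sqrt(2*lr ./ pcder / opts.N) .* randn(size(grad));    % theory-consistent noise level
noise_practice = (sqrt(2*lr ./ pcder) .* randn(size(grad))) / opts.N;  % smaller noise, faster in practice
update = lr * grad ./ pcder + noise_practice;   % swap in noise_theory for the theoretical variant
theta  = theta + update;   % sign depends on whether grad is a log-posterior (ascent) or loss (descent) gradient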

@jaak-s
Author

jaak-s commented Mar 8, 2017

Thank you for the explanation. I saw that SGLD.m also uses the same scaling by opts.N. So in your experience, does the same slow convergence hold for the SGLD method too?

@ChunyuanLI
Owner

Yes, the same slow convergence also holds for SGLD, so the same practical scaling is used there.
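
For completeness, the analogous plain SGLD step (no preconditioner) with the same practical 1/opts.N noise shrinking would look roughly as below; again a sketch rather than the exact SGLD.m code, with theta a hypothetical name:

% Plain SGLD step with the injected noise shrunk by opts.N, mirroring the practical pSGLD variant above.
update = lr * grad + (sqrt(2*lr) * randn(size(grad))) / opts.N;
theta  = theta + update;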
