What is up with Epoch 150 #7

Closed
kootenpv opened this issue Feb 27, 2019 · 8 comments
Comments

@kootenpv

I'm wondering what is happening at epoch 150 in all the visualizations? I would like to introduce that into all my models ;-)

https://github.com/Luolc/AdaBound/blob/master/demos/cifar10/visualization.ipynb

Luolc commented Feb 27, 2019

As stated in the notebook:
We employ the fixed budget of 200 epochs and reduce the learning rates by 10 after 150 epochs.
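For illustration, that kind of schedule can be set up in a PyTorch training loop roughly like this (just a sketch, not the demo's exact code; the model, optimizer, and initial lr below are placeholders):

```python
# Sketch only: 200-epoch budget, lr divided by 10 once epoch 150 is reached.
import torch
import torchvision.models as models

model = models.resnet18(num_classes=10)   # placeholder CIFAR-10 model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# MultiStepLR multiplies the lr by `gamma` at every milestone epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150], gamma=0.1)

for epoch in range(200):
    # ... one full pass over the training set goes here ...
    scheduler.step()   # lr stays at 0.1 until epoch 150, then becomes 0.01
```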

kootenpv commented Feb 27, 2019

That doesn't explain to me why ALL models make such incredibly huge improvements in a single epoch. To be honest... it just looks wrong to me.

Luolc commented Feb 27, 2019

Well, no offense, but I think this is more a matter of basic knowledge in the field of machine learning, and we don't really need to have a discussion here.

You may refer to this video by Andrew Ng to gain quick insight into lr decay. Or just search for learning rate decay on Google; there are already many great posts introducing this technique.

It is broadly used in many machine learning papers/projects nowadays.

kootenpv commented Feb 28, 2019

@Luolc I am aware of learning rate decay. This is why I think it is extremely weird that in all approaches you get a huge improvement at exactly epoch 150.

This seems to me to indicate a bad initial learning rate (converging to a local optimum)?

Usually, when a huge improvement is suddenly made, it indicates that the optimization before it was perhaps useless, or don't you agree?

I just wanted to warn you that it seems very odd to have such a huge jump relatively late in optimization and I was hoping there was an explanation for it other than a bad initial learning rate.

Thanks.

Luolc commented Feb 28, 2019

Ok I get what you mean.

Regarding the initial lr: for each optimizer, we conducted a grid search to find the best hyperparameters. Each independent setting was tested 3~5 times. Indeed, hundreds of runs were done before we arrived at the final visualization you see now. I am sure that we've already set the best lr we could find (at least the best in the grid). More details can be found in the experiment section of the paper.
As mentioned in the demo, the training code is heavily based on this widely used code base for testing deep CNNs on CIFAR-10. Since our best result for SGD even achieves a higher number than the one reported in the original repo (~0.4%), I think the training was successful.
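To make the procedure concrete, the grid search described above looks roughly like the sketch below; the lr grid, repeat count, and the train_and_eval helper are purely illustrative placeholders, not the actual values or code used for the paper:

```python
# Purely illustrative grid search over initial learning rates.
import statistics

def train_and_eval(lr, seed):
    """Placeholder: run a full CIFAR-10 training and return test accuracy."""
    return 0.0  # replace with a real training run

lr_grid = [1e-1, 3e-2, 1e-2, 3e-3, 1e-3]   # hypothetical candidate lrs
n_repeats = 3                               # 3~5 independent runs per setting

best_lr, best_acc = None, float("-inf")
for lr in lr_grid:
    accs = [train_and_eval(lr=lr, seed=s) for s in range(n_repeats)]
    mean_acc = statistics.mean(accs)
    if mean_acc > best_acc:
        best_lr, best_acc = lr, mean_acc

print(f"best lr on the grid: {best_lr} (mean acc {best_acc:.4f})")
```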

Usually, when a huge improvement is suddenly made, it indicates that the optimization before it was perhaps useless, or don't you agree?

I don't think it's appropriate to say whether it's useful or useless. If you refer to Figure 6(a) in the paper, the learning curves of SGD with other initial lrs are even much worse than what we see in the notebook. So could we say it is the least useless one we could find?

There might be a better decay strategy, like making the decay happen earlier. I totally agree, but that's not what we are concerned about. What we need is to guarantee that the same decay strategy is applied to all the optimizers to make the comparison fair, rather than to find the best decay strategy.

Finally, I don't think it is a huge jump or odd behavior. I've seen many similar figures in plenty of papers. For example, the SWATS paper.

siaimes commented May 17, 2019

@kootenpv Perhaps you don't understand that the model parameters are updated many times in one epoch.
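For a rough sense of scale (assuming a batch size of 128, which is just a common choice for this kind of CIFAR-10 setup, not necessarily the demo's value):

```python
# Back-of-the-envelope count of parameter updates per epoch on CIFAR-10.
import math

train_size = 50_000        # CIFAR-10 training set size
batch_size = 128           # assumed batch size, for illustration only
updates_per_epoch = math.ceil(train_size / batch_size)
print(updates_per_epoch)   # -> 391 updates every epoch
```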

kootenpv commented May 17, 2019

@siaimes Obviously I understand that, but why would there be such a steep change exactly at the 150th epoch? It looks to me like something is just wrong (bad parameters before the 150th). It makes more sense with @Luolc's explanation that these settings turn out to be "less than optimal" for this particular dataset.

Luolc commented May 18, 2019

@kootenpv I've recently done some more toy experiments on CIFAR-10 and have gained deeper insights now.

FYI: we can apply the lr decay earlier, at ~epoch 75, and achieve similar results after ~epoch 100.

Decaying at epoch 150 is not the best setting considering the time cost, but it does not affect the final results. Since the purpose of the paper is not to find SoTA results, it's OK as long as fairness among the different optimizers is maintained.
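For reference, moving the drop earlier is just a matter of changing the milestone; a minimal sketch with a placeholder model and optimizer:

```python
# Sketch: same one-step schedule, with the 10x drop at epoch 75 instead of 150.
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 10)   # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[75], gamma=0.1)
```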

p.s. I'd like to close this issue if there's no further doubt.
