
AdaMax solver #6263

Open · wants to merge 2 commits into master

Conversation

Noiredd (Member) commented Feb 27, 2018

An implementation of a variant of Adam with the L2 norm replaced by the infinity norm (see section 7.1 of Kingma & Lei Ba, 2014).
The code is in large part taken from the Adam solver, so there's a lot of duplication. If that is an issue, there's another way of implementing this: AdamSolver could have a protected method GradientNorm, called by ComputeUpdateValue, and AdaMaxSolver would derive from it and override this method. For the GPU version, the whole adam_update_gpu() would become a virtual method of the class, overridden in AdaMaxSolver.
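For reference, the per-element update the solver has to perform (section 7.1 of the paper) looks roughly like the sketch below; the function name, argument list, and the write-back into the gradient buffer are illustrative assumptions, not the exact code in this PR:

```cpp
#include <algorithm>  // std::max
#include <cmath>      // std::abs

// Hypothetical helper: applies the AdaMax update to n parameters.
// g: gradient buffer, m: first-moment history, u: infinity-norm history.
template <typename Dtype>
void adamax_update_cpu(int n, Dtype* g, Dtype* m, Dtype* u,
                       Dtype beta1, Dtype beta2, Dtype delta,
                       Dtype corrected_lr) {
  for (int i = 0; i < n; ++i) {
    // First moment: the same exponentially decayed average as in Adam.
    m[i] = beta1 * m[i] + (Dtype(1) - beta1) * g[i];
    // Infinity norm: a running max replaces Adam's squared-gradient average.
    u[i] = std::max(beta2 * u[i], std::abs(g[i]));
    // Write the step back into the gradient buffer (Caffe solvers apply the
    // update from there); corrected_lr = lr / (1 - beta1^t).
    g[i] = corrected_lr * m[i] / (u[i] + delta);
  }
}
```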

I considered having both variants in one class (and then just switching between them at runtime), but this raises issues with instantiating the solver (due to how SolverFactory works).
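(For context on the factory point: Caffe's SolverRegistry binds one type string to one solver class through a registration macro, so a separate class fits it naturally. Assuming the new solver registers the same way the existing ones do in solver_factory.hpp, it would look like:)

```cpp
// Sketch: binds the "AdaMax" type string to AdaMaxSolver<Dtype> in Caffe's
// SolverRegistry, the same way Adam, AdaGrad, etc. are registered.
REGISTER_SOLVER_CLASS(AdaMax);
```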

Thoughts?

naibaf7 (Member) commented Feb 27, 2018

@Noiredd Do you have any data from applying this solver to an existing problem with Caffe? I.e., training something from scratch with Adam vs. AdaMax using this exact code?

Noiredd (Member, Author) commented Feb 27, 2018

@naibaf7 So far I have only run a basic MNIST + LeNet, but tomorrow I could run it on something more challenging (either one of my own datasets or the PASCAL segmentation task).
Attachments: adamax.log, adamax.proto.txt

This is more of a "do you think we might need this" kind of PR - open for discussion.

naibaf7 (Member) commented Feb 27, 2018

@Noiredd I agree, we need this ;) I like new solvers for hyperparameter search.

Noiredd (Member, Author) commented Feb 28, 2018

Some results on my own datasets: an AlexNet-based classification net (fine-tuned from an ImageNet model for 5 epochs (245 iterations, batch 128) on ~6000 images in 7 classes) and a custom segmentation net with the first two conv layers from AlexNet (20k images (3750 iterations, batch 16), 2 underrepresented object classes + background). AdaMax was tested against Adam and SGD under exactly the same conditions (fixed base_lr, the same momenta, etc.); results are given for the respective validation sets.

| Solver | Classification accuracy (loss) | Segmentation IoU per class (loss) |
| ------ | ------------------------------ | --------------------------------- |
| SGD    | 95.59% (0.1370) | 0.6657, 0.8421 (0.0633) |
| Adam   | 96.37% (0.1470) | 0.7318, 0.8600 (0.0584) |
| AdaMax | 96.55% (0.1393) | 0.7491, 0.9031 (0.0525) |

Those are on my own nets and data, but I'll try to prepare a more reproducible example (e.g. on the PASCAL VOC segmentation dataset).

Noiredd (Member, Author) commented Feb 28, 2018

I refactored the code a bit, making AdaMax derive from Adam. Code duplication is now limited to ComputeUpdateValue and *_update_gpu, which I'd say is acceptable.
Additionally, I added the delta parameter to control numerical stability - in the initial commit I somehow forgot about it and put a constant Dtype(1e-7) instead.
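
For reviewers, the inheritance looks roughly like the sketch below, assuming a class body in the style of the existing Caffe solver headers (exact member details may differ from the commit):

```cpp
// Sketch of the class described above (details may differ from the commit).
// AdaMaxSolver reuses AdamSolver's history blobs and only overrides the
// update computation; this->param_.delta() supplies the stability term.
template <typename Dtype>
class AdaMaxSolver : public AdamSolver<Dtype> {
 public:
  explicit AdaMaxSolver(const SolverParameter& param)
      : AdamSolver<Dtype>(param) {}
  virtual inline const char* type() const { return "AdaMax"; }

 protected:
  // Same signature as AdamSolver::ComputeUpdateValue; replaces Adam's
  // second-moment estimate with a running infinity norm of the gradient.
  virtual void ComputeUpdateValue(int param_id, Dtype rate);
};
```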

naibaf7 (Member) commented Feb 28, 2018

Looks good to me now. @shelhamer?

Noiredd (Member, Author) commented Mar 2, 2018

I should have mentioned that the results in the previous post were all obtained by fine-tuning existing models - I have edited it to reflect that. Below is one more batch of models, some fine-tuned and others trained from scratch.

A custom segmentation network, FCN-style. Two classes + background (+ an ignore class), 3750 iterations in batches of 16 over 20k images. The first two convolutional layers are transferred from an ImageNet model for fine-tuning; MSRA initialization is used for training from scratch. Results on the validation set:

| Solver | From scratch: IoU per class (loss) | Fine-tuned: IoU per class (loss) |
| ------ | ---------------------------------- | -------------------------------- |
| SGD    | 0.0000, 0.3987 (0.2251) | 0.6676, 0.8473 (0.0619) |
| Adam   | 0.6952, 0.8697 (0.0568) | 0.7414, 0.8885 (0.0602) |
| AdaMax | 0.7365, 0.8832 (0.0523) | 0.7501, 0.9045 (0.0470) |

An AlexNet derivative for classification of 7 classes, 245 iterations with batch 128 over 6k images. Convolutional layers transferred from an ImageNet model for fine-tuning, MSRA initialization for training from scratch. Learning rate fixed at 0.001 for fine-tuning and 0.01 from scratch.

| Solver | From scratch: Accuracy (loss) | Fine-tuned: Accuracy (loss) |
| ------ | ----------------------------- | --------------------------- |
| SGD    | 69.73% (0.8689) | 95.86% (0.1299) |
| Adam   | 68.44% (0.9471) | 96.57% (0.1542) |
| AdaMax | 69.81% (0.7978) | 96.66% (0.1253) |

Bonus: FCN-32s on the PASCAL VOC segmentation dataset. No augmentations, just VGG pre-trained for classification, converted to FCN, and fine-tuned for ~15k iterations (10 epochs) with batch 1, lr 1e-4.

| Solver | Average accuracy (loss) |
| ------ | ----------------------- |
| SGD    | 88.32% (0.3957) |
| Adam   | 79.83% (0.7583) |
| AdaMax | 87.88% (0.4967) |

Noiredd requested review from ronghanghu and removed the review request for ronghanghu on March 6, 2018 at 08:12
wk910930 (Contributor) commented Nov 7, 2018

The results look nice. Will this pull request be accepted and merged in the near future?
