
AdaMax solver #6263

Open · wants to merge 2 commits into master

Conversation

Noiredd (Member) commented Feb 27, 2018

An implementation of a variant of Adam with the L2 norm replaced by the infinity norm (see section 7.1 of Kingma & Lei Ba, 2014).
The code is in large part taken from the Adam solver, so there's a lot of duplication. If that is an issue, there's another way of implementing this: AdamSolver could have a protected method GradientNorm, called by ComputeUpdateValue, and AdaMaxSolver would derive from it and override this method. For the GPU version, the whole adam_update_gpu() would become a virtual method of the class, overridden in AdaMaxSolver.
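For reference, the per-element update the solver has to perform (section 7.1 of the paper) looks roughly like the sketch below; the function name, argument list, and the write-back into the gradient buffer are illustrative assumptions, not the exact code in this PR:

```cpp
#include <algorithm>  // std::max
#include <cmath>      // std::abs

// Hypothetical helper: applies the AdaMax update to n parameters.
// g: gradient buffer, m: first-moment history, u: infinity-norm history.
template <typename Dtype>
void adamax_update_cpu(int n, Dtype* g, Dtype* m, Dtype* u,
                       Dtype beta1, Dtype beta2, Dtype delta,
                       Dtype corrected_lr) {
  for (int i = 0; i < n; ++i) {
    // First moment: the same exponentially decayed average as in Adam.
    m[i] = beta1 * m[i] + (Dtype(1) - beta1) * g[i];
    // Infinity norm: a running max replaces Adam's squared-gradient average.
    u[i] = std::max(beta2 * u[i], std::abs(g[i]));
    // Write the step back into the gradient buffer (Caffe solvers apply the
    // update from there); corrected_lr = lr / (1 - beta1^t).
    g[i] = corrected_lr * m[i] / (u[i] + delta);
  }
}
```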

I considered having both variants in one class (and then just switching between them at runtime), but this raises issues with instantiating the solver (due to how SolverFactory works).
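(For context on the factory point: Caffe's SolverRegistry binds one type string to one solver class through a registration macro, so a separate class fits it naturally. Assuming the new solver registers the same way the existing ones do in solver_factory.hpp, it would look like:)

```cpp
// Sketch: binds the "AdaMax" type string to AdaMaxSolver<Dtype> in Caffe's
// SolverRegistry, the same way Adam, AdaGrad, etc. are registered.
REGISTER_SOLVER_CLASS(AdaMax);
```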

Thoughts?

naibaf7 (Member) commented Feb 27, 2018

@Noiredd Do you have any data from applying this solver to an existing problem with Caffe? I.e., training something from scratch with Adam vs. AdaMax using this exact code?

Noiredd (Member, Author) commented Feb 27, 2018

@naibaf7 So far I have only run a basic MNIST + LeNet, but tomorrow I could run it on something more challenging (either one of my own datasets or the PASCAL segmentation task).
Attachments: adamax.log, adamax.proto.txt

This is more of a "do you think we might need this" kind of PR - open for discussion.

naibaf7 (Member) commented Feb 27, 2018

@Noiredd I agree, we need this ;) I like new solvers for hyperparameter search.

Noiredd (Member, Author) commented Feb 28, 2018

Some results on my own datasets: an AlexNet-based classification net (fine-tuned from an ImageNet model for 5 epochs (245 iterations, batch 128) on ~6000 images in 7 classes) and a custom segmentation net with the first two conv layers from AlexNet (20k images (3750 iterations, batch 16), 2 underrepresented object classes + background). AdaMax was tested against Adam and SGD under exactly the same conditions (fixed base_lr, the same momenta, etc.); results are given for the respective validation sets.

| Solver | Classification accuracy (loss) | Segmentation IoU per class (loss) |
| ------ | ------------------------------ | --------------------------------- |
| SGD    | 95.59% (0.1370) | 0.6657, 0.8421 (0.0633) |
| Adam   | 96.37% (0.1470) | 0.7318, 0.8600 (0.0584) |
| AdaMax | 96.55% (0.1393) | 0.7491, 0.9031 (0.0525) |

Those are on my own nets and data, but I'll try to prepare a more reproducible example (e.g. on the PASCAL VOC segmentation dataset).

Noiredd (Member, Author) commented Feb 28, 2018

I refactored the code a bit, making AdaMax derive from Adam. Code duplication is now limited to ComputeUpdateValue and *_update_gpu, which I'd say is acceptable.
Additionally, I added the delta parameter to control numerical stability - in the initial commit I somehow forgot about it and put a constant Dtype(1e-7) instead.
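
For reviewers, the inheritance looks roughly like the sketch below, assuming a class body in the style of the existing Caffe solver headers (exact member details may differ from the commit):

```cpp
// Sketch of the class described above (details may differ from the commit).
// AdaMaxSolver reuses AdamSolver's history blobs and only overrides the
// update computation; this->param_.delta() supplies the stability term.
template <typename Dtype>
class AdaMaxSolver : public AdamSolver<Dtype> {
 public:
  explicit AdaMaxSolver(const SolverParameter& param)
      : AdamSolver<Dtype>(param) {}
  virtual inline const char* type() const { return "AdaMax"; }

 protected:
  // Same signature as AdamSolver::ComputeUpdateValue; replaces Adam's
  // second-moment estimate with a running infinity norm of the gradient.
  virtual void ComputeUpdateValue(int param_id, Dtype rate);
};
```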

naibaf7 (Member) commented Feb 28, 2018

Looks good to me now. @shelhamer?

Noiredd (Member, Author) commented Mar 2, 2018

I should have mentioned that the results in the previous post were all obtained by fine-tuning existing models - I have edited it to reflect that. Below is one more batch of models, some fine-tuned and others trained from scratch.

A custom segmentation network, FCN-style. Two classes + background (+ an ignore class), 3750 iterations in batches of 16 over 20k images. The first two convolutional layers are transferred from an ImageNet model for fine-tuning; MSRA initialization is used for training from scratch. Results on the validation set:

| Solver | From scratch: IoU per class (loss) | Fine-tuned: IoU per class (loss) |
| ------ | ---------------------------------- | -------------------------------- |
| SGD    | 0.0000, 0.3987 (0.2251) | 0.6676, 0.8473 (0.0619) |
| Adam   | 0.6952, 0.8697 (0.0568) | 0.7414, 0.8885 (0.0602) |
| AdaMax | 0.7365, 0.8832 (0.0523) | 0.7501, 0.9045 (0.0470) |

An AlexNet derivative for classification of 7 classes, 245 iterations with batch 128 over 6k images. Convolutional layers transferred from an ImageNet model for fine-tuning, MSRA initialization for training from scratch. Learning rate fixed at 0.001 for fine-tuning and 0.01 from scratch.

| Solver | From scratch: Accuracy (loss) | Fine-tuned: Accuracy (loss) |
| ------ | ----------------------------- | --------------------------- |
| SGD    | 69.73% (0.8689) | 95.86% (0.1299) |
| Adam   | 68.44% (0.9471) | 96.57% (0.1542) |
| AdaMax | 69.81% (0.7978) | 96.66% (0.1253) |

Bonus: FCN-32s on the PASCAL VOC segmentation dataset. No augmentations, just VGG pre-trained for classification, converted to FCN, and fine-tuned for ~15k iterations (10 epochs) with batch 1, lr 1e-4.

| Solver | Average accuracy (loss) |
| ------ | ----------------------- |
| SGD    | 88.32% (0.3957) |
| Adam   | 79.83% (0.7583) |
| AdaMax | 87.88% (0.4967) |

Noiredd requested review from ronghanghu and removed the review request for ronghanghu on March 6, 2018 at 08:12
wk910930 (Contributor) commented Nov 7, 2018

The results look nice. Will this pull request be accepted and merged in the near future?
