AdaMax solver #6263
Conversation
@Noiredd Do you have any data from applying this solver to an existing problem in Caffe, i.e. training something from scratch with Adam vs. AdaMax using this exact code?
@naibaf7 So far I have only run a basic MNIST + LeNet experiment, but tomorrow I could run it on something more challenging (either one of my own datasets or the PASCAL segmentation task). This is more of a "do you think we might need this" kind of PR - open for discussion.
@Noiredd I agree, we need this ;) I like new solvers for hyperparameter search.
Some results on my own datasets: an AlexNet-based classification net (fine-tuned from an ImageNet model for 5 epochs, i.e. 245 iterations at batch size 128, on ~6000 images in 7 classes) and a custom segmentation net with the first two conv layers taken from AlexNet (20k images, 3750 iterations at batch size 16, 2 underrepresented object classes + background). AdaMax was tested against Adam and SGD under exactly the same conditions (fixed base_lr, the same momenta, etc.); results are given for the respective validation sets.
Those are on my own nets and data, but I'll try to prepare a more reproducible example (e.g. on the PASCAL VOC segmentation dataset); a sketch of the kind of solver definition used for such a comparison follows below.
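For illustration only, a solver definition of this kind might look like the following. The paths and hyperparameter values are placeholders, and the `type` string `"AdaMax"` is an assumption about how this PR registers the solver; switching `type` to `"Adam"` or `"SGD"` while keeping everything else fixed gives the baselines:

```
net: "models/mynet/train_val.prototxt"  # placeholder path
base_lr: 0.001        # fixed for all solvers compared
lr_policy: "fixed"
momentum: 0.9         # beta1
momentum2: 0.999      # beta2 (used by Adam and AdaMax)
max_iter: 3750
display: 100
snapshot: 1000
snapshot_prefix: "snapshots/mynet"      # placeholder path
type: "AdaMax"        # assumed registration string; "Adam" / "SGD" for baselines
solver_mode: GPU
```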
I refactored the code a bit, making AdaMax derive from Adam. Code duplication is now limited to
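For reference, a minimal sketch of what the derived class can look like, following the declaration style of Caffe's existing solvers in `include/caffe/sgd_solvers.hpp`; everything beyond the `AdamSolver` base and Caffe's usual conventions is an assumption, not this PR's actual code:

```cpp
// Sketch of AdaMaxSolver deriving from AdamSolver, in the style of
// caffe/sgd_solvers.hpp. The type() string and the exact set of overridden
// members are assumptions about this PR, not its actual code.
template <typename Dtype>
class AdaMaxSolver : public AdamSolver<Dtype> {
 public:
  explicit AdaMaxSolver(const SolverParameter& param)
      : AdamSolver<Dtype>(param) {}
  virtual inline const char* type() const { return "AdaMax"; }

 protected:
  // Only the update itself differs from Adam; the history blobs set up by
  // AdamPreSolve() can be reused for the first moment and the infinity norm.
  virtual void ComputeUpdateValue(int param_id, Dtype rate);

  DISABLE_COPY_AND_ASSIGN(AdaMaxSolver);
};
```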
Looks good to me now. @shelhamer?
I should have mentioned that the results from the previous post were all obtained by fine-tuning existing models - I have edited it to reflect that. Below is another batch of models, some fine-tuned and others trained from scratch.
A custom segmentation network, FCN-style: two classes + background (plus an ignore class), 3750 iterations in batches of 16 over 20k images. The first two convolutional layers are transferred from an ImageNet model for fine-tuning; MSRA initialization is used when training from scratch. Results on the validation set:
An AlexNet derivative for classification of 7 classes: 245 iterations with batch size 128 over 6k images. Convolutional layers transferred from an ImageNet model for fine-tuning, MSRA initialization for training from scratch. Learning rate fixed: 0.001 for fine-tuning, 0.01 from scratch.
Bonus: FCN-32s on the PASCAL VOC segmentation dataset. No augmentations, just VGG pre-trained for classification, converted to FCN and fine-tuned for ~15k iterations (10 epochs) with batch size 1, lr
The results look nice. Will this pull request be accepted and merged in the near future?
An implementation of AdaMax, a variant of Adam with the L2 norm replaced by the infinity norm (see section 7.1 of Kingma & Ba, 2014).
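For reference, the AdaMax update from section 7.1 of the paper, with $g_t$ the gradient, $m_t$ the (biased) first-moment estimate, $u_t$ the exponentially weighted infinity norm, and $\alpha$ the step size:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t\\
u_t &= \max(\beta_2 \cdot u_{t-1},\ |g_t|)\\
\theta_t &= \theta_{t-1} - \frac{\alpha}{1-\beta_1^t}\cdot\frac{m_t}{u_t}
\end{aligned}
$$

Adam computes the same $m_t$ but divides by $\sqrt{\hat{v}_t}+\epsilon$, where $v_t$ is an exponential moving average of $g_t^2$; AdaMax swaps that L2-style accumulator for the running infinity norm $u_t$, which also removes the need for a bias correction on the second moment.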
The code is in large part taken from the Adam solver, so there is a lot of duplication. If that is an issue, there is another way of implementing this: `AdamSolver` could have a protected method `GradientNorm`, which would be called by `ComputeUpdateValue`; `AdaMaxSolver` would then derive from it and override this method. For the GPU version, the whole `adam_update_gpu()` would become a virtual method of the class, overridden in `AdaMaxSolver`.

I considered having both variants in one class (and then just `switch`ing between them at runtime), but this raises issues with instantiating the solver (due to how the `SolverFactory` works).

Thoughts?
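To make the difference the `GradientNorm` hook would capture concrete, here is a hypothetical CPU-side sketch of the element-wise AdaMax update (`adamax_update_cpu` is an illustrative helper, not code from this PR; Caffe's real solvers operate on blob diff buffers and pair this with a CUDA kernel):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical element-wise AdaMax update -- a sketch, not the PR's code.
// n: number of parameters; g: gradient buffer (overwritten with the step);
// m: first-moment buffer; u: infinity-norm buffer.
template <typename Dtype>
void adamax_update_cpu(int n, Dtype* g, Dtype* m, Dtype* u,
                       Dtype beta1, Dtype beta2, Dtype corrected_rate) {
  for (int i = 0; i < n; ++i) {
    // First moment: the same exponential moving average as in Adam.
    m[i] = beta1 * m[i] + (Dtype(1) - beta1) * g[i];
    // Second moment: running infinity norm instead of Adam's mean of squares.
    u[i] = std::max(beta2 * u[i], std::abs(g[i]));
    // corrected_rate = base_lr / (1 - beta1^t); the small constant guards
    // against division by zero (the paper omits it).
    g[i] = corrected_rate * m[i] / (u[i] + Dtype(1e-8));
  }
}
```

Only the `u[i]` line differs from Adam's update; everything else, including the bias-corrected rate, is shared, which is what makes deriving `AdaMaxSolver` from `AdamSolver` attractive.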