## Summer School on Deep Learning Surathkal, Karnataka - 2019
## Optimizers
* https://keras.io/optimizers/

### Variations of Gradient Descent
* Stochastic Gradient Descent
* RMSprop
* Adagrad
* Adadelta
* Adam
* Adamax
* Nadam

### Stochastic Gradient Descent
#### Arguments
* lr: float >= 0. Learning rate.
* momentum: float >= 0. Parameter that accelerates SGD in the relevant direction and dampens oscillations.
* decay: float >= 0. Learning rate decay over each update.
* nesterov: boolean. Whether to apply Nesterov momentum.
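
To make the four arguments concrete, here is a minimal sketch of one SGD update on a single scalar weight. It assumes the time-based decay `lr / (1 + decay * iterations)` and the momentum formulation used by Keras 2.x; `sgd_step` is a hypothetical helper written for illustration, not part of the Keras API.

```python
def sgd_step(w, grad, velocity, iteration,
             lr=0.01, momentum=0.9, decay=1e-6, nesterov=True):
    """One SGD update for a scalar weight (hypothetical helper, not Keras API)."""
    # Assumed time-based decay: the step size shrinks as iterations accumulate.
    lr_t = lr / (1.0 + decay * iteration)
    # The velocity is a decaying sum of past gradient steps.
    velocity = momentum * velocity - lr_t * grad
    if nesterov:
        # Nesterov variant: move by the momentum-scaled new velocity
        # plus the plain gradient step.
        w = w + momentum * velocity - lr_t * grad
    else:
        # Classical momentum: move by the velocity itself.
        w = w + velocity
    return w, velocity


# Minimize f(w) = w**2, whose gradient is 2 * w.
w, velocity = 5.0, 0.0
for iteration in range(200):
    w, velocity = sgd_step(w, 2.0 * w, velocity, iteration)
print(w)  # close to the minimum at 0
```

With `momentum=0.0` and `decay=0.0` the update reduces to plain gradient descent, `w - lr * grad`.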


In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________


### RMSprop
#### Arguments
* lr: float >= 0. Learning rate.
* rho: float >= 0. Decay rate of the moving average of squared gradients.
* epsilon: float >= 0. Fuzz factor. If None, defaults to K.epsilon().
* decay: float >= 0. Learning rate decay over each update.
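
The role of `rho` is easiest to see in the update itself. The following is a minimal sketch of one RMSprop step on a scalar weight, assuming the standard formulation and `epsilon = 1e-7` (the usual `K.epsilon()` default); `rmsprop_step` is a hypothetical helper, not a Keras function.

```python
import math


def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, epsilon=1e-7):
    """One RMSprop update for a scalar weight (hypothetical helper)."""
    # rho controls how much of the old squared-gradient average is kept.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Dividing by the root of that average normalizes the step size,
    # so large gradients do not produce proportionally large steps.
    w = w - lr * grad / (math.sqrt(avg_sq) + epsilon)
    return w, avg_sq


# Minimize f(w) = w**2, whose gradient is 2 * w.
w, avg_sq = 5.0, 0.0
for _ in range(3):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
    print(w)
```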

In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

rms = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0)
model.compile(loss='mean_squared_error', optimizer=rms)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________


### Adagrad
#### Arguments
* lr: float >= 0. Initial learning rate.
* epsilon: float >= 0. If None, defaults to K.epsilon().
* decay: float >= 0. Learning rate decay over each update.
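
Adagrad's `lr` is called the initial learning rate because the effective step shrinks as squared gradients accumulate. Below is a minimal sketch of that behaviour on a scalar weight, assuming the standard Adagrad update and `epsilon = 1e-7`; `adagrad_step` is a hypothetical helper, not a Keras function.

```python
import math


def adagrad_step(w, grad, accum, lr=0.01, epsilon=1e-7):
    """One Adagrad update for a scalar weight (hypothetical helper)."""
    # The accumulator only grows: it sums every squared gradient seen so far.
    accum = accum + grad ** 2
    # The step is divided by the root of the accumulator, so it keeps shrinking.
    w = w - lr * grad / (math.sqrt(accum) + epsilon)
    return w, accum


# Feed a constant gradient to isolate the shrinking step size.
w, accum = 5.0, 0.0
step_sizes = []
for _ in range(4):
    new_w, accum = adagrad_step(w, 10.0, accum)
    step_sizes.append(w - new_w)
    w = new_w
print(step_sizes)  # each step is smaller than the one before
```

With a constant gradient, step number `t` has size close to `lr / sqrt(t)`, so the first step equals `lr` regardless of the gradient's magnitude.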


In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

adagrad = keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0)
model.compile(loss='mean_squared_error', optimizer=adagrad)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_6 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________


### Adadelta
#### Arguments
* lr: float >= 0. Initial learning rate, defaults to 1. It is recommended to leave it at the default value.
* rho: float >= 0. Adadelta decay factor, corresponding to fraction of gradient to keep at each time step.
* epsilon: float >= 0. Fuzz factor. If None, defaults to K.epsilon().
* decay: float >= 0. Initial learning rate decay.
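
Adadelta can leave `lr` at 1 because it scales each step by a running average of its own past updates. Here is a minimal sketch of one step on a scalar weight, assuming the standard Adadelta formulation and `epsilon = 1e-7`; `adadelta_step` is a hypothetical helper, not a Keras function.

```python
import math


def adadelta_step(w, grad, avg_sq_grad, avg_sq_delta,
                  lr=1.0, rho=0.95, epsilon=1e-7):
    """One Adadelta update for a scalar weight (hypothetical helper)."""
    # Running average of squared gradients; rho is the fraction kept.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Scale the gradient by the ratio of the past update size
    # to the current gradient size.
    delta = grad * math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
    w = w - lr * delta
    # Running average of squared updates, used by the next step.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    return w, avg_sq_grad, avg_sq_delta


# Minimize f(w) = w**2, whose gradient is 2 * w.
w, avg_sq_grad, avg_sq_delta = 5.0, 0.0, 0.0
for _ in range(3):
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, 2.0 * w, avg_sq_grad, avg_sq_delta)
    print(w)
```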

In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

adadelta = keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0)
model.compile(loss='mean_squared_error', optimizer=adadelta)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_4 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________


### Adam
#### Arguments
* lr: float >= 0. Learning rate.
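* beta_1: float, 0 < beta < 1. Exponential decay rate for the first-moment (mean) estimate of the gradient. Generally close to 1.
* beta_2: float, 0 < beta < 1. Exponential decay rate for the second-moment (uncentered variance) estimate of the gradient. Generally close to 1.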
* epsilon: float >= 0. Fuzz factor. If None, defaults to K.epsilon().
* decay: float >= 0. Learning rate decay over each update.
* amsgrad: boolean. Whether to apply the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond".
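
The sketch below shows how `beta_1`, `beta_2`, `epsilon`, and `amsgrad` enter one Adam update on a scalar weight. It assumes the standard formulation with bias correction folded into the step size and `epsilon = 1e-7`; `adam_step` is a hypothetical helper, not a Keras function, and `t` is the 1-based update count.

```python
import math


def adam_step(w, grad, m, v, v_max, t, lr=0.001, beta_1=0.9,
              beta_2=0.999, epsilon=1e-7, amsgrad=False):
    """One Adam update for a scalar weight (hypothetical helper)."""
    # Moving averages of the gradient and of the squared gradient.
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    # Bias correction: both averages start at 0 and are too small early on.
    lr_t = lr * math.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
    if amsgrad:
        # AMSGrad divides by the largest second-moment estimate seen so far.
        v_max = max(v_max, v)
        denom = math.sqrt(v_max) + epsilon
    else:
        denom = math.sqrt(v) + epsilon
    w = w - lr_t * m / denom
    return w, m, v, v_max


# Minimize f(w) = w**2, whose gradient is 2 * w.
w, m, v, v_max = 5.0, 0.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v, v_max = adam_step(w, 2.0 * w, m, v, v_max, t)
    print(w)
```

On the first update the bias correction cancels the gradient's magnitude, so the weight moves by almost exactly `lr`.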

In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(loss='mean_squared_error', optimizer=adam)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_7 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________


### Adamax (Adam variant based on the infinity norm)
#### Arguments
* lr: float >= 0. Learning rate.
* beta_1: float, 0 < beta < 1. Exponential decay rate for the first-moment estimate. Generally close to 1.
* beta_2: float, 0 < beta < 1. Exponential decay rate for the exponentially weighted infinity norm. Generally close to 1.
* epsilon: float >= 0. Fuzz factor. If None, defaults to K.epsilon().
* decay: float >= 0. Learning rate decay over each update.
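
Adamax differs from Adam in how it scales the step: it tracks a decayed maximum of absolute gradients instead of an average of squared gradients. The following is a minimal sketch on a scalar weight, assuming the standard formulation and `epsilon = 1e-7`; `adamax_step` is a hypothetical helper, not a Keras function, and `t` is the 1-based update count.

```python
def adamax_step(w, grad, m, u, t, lr=0.002, beta_1=0.9,
                beta_2=0.999, epsilon=1e-7):
    """One Adamax update for a scalar weight (hypothetical helper)."""
    # Moving average of the gradient, as in Adam.
    m = beta_1 * m + (1.0 - beta_1) * grad
    # Exponentially weighted infinity norm: a decayed running maximum.
    u = max(beta_2 * u, abs(grad))
    # Only the first moment needs bias correction.
    lr_t = lr / (1.0 - beta_1 ** t)
    w = w - lr_t * m / (u + epsilon)
    return w, m, u


# Minimize f(w) = w**2, whose gradient is 2 * w.
w, m, u = 5.0, 0.0, 0.0
for t in range(1, 4):
    w, m, u = adamax_step(w, 2.0 * w, m, u, t)
    print(w)
```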

In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

adamx = keras.optimizers.Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
model.compile(loss='mean_squared_error', optimizer=adamx)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_8 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________


### Nadam (Nesterov Adam optimizer)
#### Arguments
* lr: float >= 0. Learning rate.
* beta_1: float, 0 < beta < 1. Generally close to 1.
* beta_2: float, 0 < beta < 1. Generally close to 1.
* epsilon: float >= 0. Fuzz factor. If None, defaults to K.epsilon().
* schedule_decay: float, 0 < schedule_decay < 1. Controls how quickly the momentum schedule approaches beta_1.
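
The effect of `schedule_decay` can be illustrated with the momentum schedule alone. The sketch below assumes the schedule `beta_1 * (1 - 0.5 * 0.96 ** (t * schedule_decay))`, recalled from the Keras 2.x Nadam implementation, so treat the constants `0.5` and `0.96` as assumptions; `nadam_momentum` is a hypothetical helper, not a Keras function.

```python
def nadam_momentum(t, beta_1=0.9, schedule_decay=0.004):
    """Momentum coefficient at update t under the assumed Nadam schedule."""
    # Starts near beta_1 / 2 and rises toward beta_1 as t grows.
    return beta_1 * (1.0 - 0.5 * 0.96 ** (t * schedule_decay))


for t in (1, 100, 1000, 10000):
    print(t, nadam_momentum(t))
```

Under this assumption, a larger `schedule_decay` makes the momentum coefficient reach `beta_1` in fewer updates.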

In [0]:
import keras
from keras import optimizers
from keras.models import Sequential
from keras.layers import *

model = Sequential()
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(Activation('softmax'))

nadam = keras.optimizers.Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='mean_squared_error', optimizer=nadam)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 64)                704       
_________________________________________________________________
activation_9 (Activation)    (None, 64)                0         
Total params: 704
Trainable params: 704
Non-trainable params: 0
_________________________________________________________________
