
# 5-7 Optimizers

There is a group of magical cooks in machine learning. Their daily routine looks like this:

- Grab some raw ingredients (data)
- Put them into a pot (model)
- Light a fire (optimization algorithm)
- Wait until the dish is ready

However, anyone with cooking experience knows that controlling the fire is the key part. Even with the same ingredients and the same recipe, different heat levels lead to totally different results: medium well, burnt, or still raw.

This cooking theory also applies to machine learning. The choice of optimization algorithm strongly affects the final performance of the model. Unsatisfactory performance is not necessarily caused by feature engineering or model design; it may instead be attributable to the choice of optimization algorithm.

The evolution of optimization algorithms for deep learning is: SGD -> SGDM -> NAG -> Adagrad -> Adadelta (RMSprop) -> Adam -> Nadam

For beginners, choosing Adam as the optimizer and keeping its default parameters is usually good enough.

Some researchers who are chasing better metrics for publications use Adam as the initial optimizer and switch to SGD later to fine-tune the parameters for better performance.
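
A minimal sketch of this two-phase strategy, assuming a Keras model: recompiling a model with a new optimizer keeps its current weights, so training can continue with SGD from where Adam left off. The toy data, the single `Dense` layer, and the learning rates and epoch counts are illustrative assumptions, not recommended settings.

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

# Hypothetical toy regression data: targets = 3 * features + 2
features = np.linspace(-1.0, 1.0, 64, dtype="float32").reshape(-1, 1)
targets = 3.0 * features + 2.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])

# Phase 1: start training with Adam
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05), loss="mse")
adam_history = model.fit(features, targets, batch_size=16, epochs=30, verbose=0)

# Phase 2: recompile with SGD; the weights learned so far are kept
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
sgd_history = model.fit(features, targets, batch_size=16, epochs=30, verbose=0)

print("loss after Adam phase:", adam_history.history["loss"][-1])
print("loss after SGD phase: ", sgd_history.history["loss"][-1])
```

Whether the SGD phase actually improves the final metric depends on the problem, so both losses are printed for comparison rather than assumed.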

There are also some cutting-edge optimization algorithms that claim better performance, e.g. LazyAdam, Look-ahead, RAdam, Ranger, etc.

## 1. How to Use the Optimizer

An optimizer updates variables through its `apply_gradients` method, which takes pairs of gradients and the corresponding variables. Alternatively, the `minimize` method can be called to optimize a target function iteratively.

Another common approach is to pass the optimizer to a Keras `Model` and call `model.fit` to optimize the loss function.

A variable named `optimizer.iterations` is created when the optimizer is initialized, to record the number of iterations. For this reason, the optimizer should be created outside the function decorated with `@tf.function`, just like a `tf.Variable`.
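
As a small illustration of the step counter, the sketch below creates the optimizer once at module level and only calls it inside the `@tf.function`-decorated step. It assumes the TF 2.x `tf.keras` optimizers used in this notebook; the function name `train_step` and the learning rate are arbitrary choices for the example.

```python
import tensorflow as tf

x = tf.Variable(0.0, name="x")
# Created once, outside the decorated function, like any other tf.Variable
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        y = (x - 1.0) ** 2
    grads = tape.gradient(y, [x])
    optimizer.apply_gradients(zip(grads, [x]))

for _ in range(3):
    train_step()

# The counter increases by one for every apply_gradients call
steps_taken = int(optimizer.iterations.numpy())
print("steps taken:", steps_taken)
print("x:", float(x.numpy()))
```

Each call applies `x <- x - 0.1 * 2 * (x - 1)`, so after three steps `x` has moved from 0 to about 0.488.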

In [4]:
import tensorflow as tf
import numpy as np

In [5]:
# Print a separator line followed by the current time (HH:MM:SS)
@tf.function
def printbar():
  ts = tf.timestamp()
  today_ts = ts%(24*60*60)

  # tf.timestamp() is in UTC; the +8 shifts the hour to UTC+8
  hour = tf.cast(today_ts//3600+8, tf.int32)%tf.constant(24)
  minute = tf.cast((today_ts%3600)//60, tf.int32)
  second = tf.cast(tf.floor(today_ts%60), tf.int32)

  # Left-pad single-digit values with a zero
  def timeformat(m):
    if tf.strings.length(tf.strings.format("{}", m)) == 1:
      return(tf.strings.format("0{}",m))
    else:
      return(tf.strings.format("{}",m))

  timestring = tf.strings.join([timeformat(hour),timeformat(minute),
                                timeformat(second)],separator = ":")
  tf.print("=========="*8, end="")
  tf.print(timestring)

In [6]:
# Find the minimum of f(x) = a*x**2 + b*x + c
# using optimizer.apply_gradients

x = tf.Variable(0.0, name="x", dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

@tf.function
def minimizef():
  a = tf.constant(1.0)
  b = tf.constant(-2.0)
  c = tf.constant(1.0)

  while tf.constant(True):
    with tf.GradientTape() as tape:
      y = a*tf.pow(x,2) + b*x + c
    dy_dx = tape.gradient(y,x)
    optimizer.apply_gradients(grads_and_vars=[(dy_dx,x)])

    # Condition for terminating the iteration
    if tf.abs(dy_dx) < tf.constant(0.00001):
      break

    if tf.math.mod(optimizer.iterations, 100) == 0:
      printbar()
      tf.print("step = ", optimizer.iterations)
      tf.print("x = ", x)
      tf.print("")

  y = a*tf.pow(x,2) + b*x + c
  return y

tf.print("y =", minimizef())
tf.print("x =", x)

step =  100
x =  0.867380381

step =  200
x =  0.98241204

step =  300
x =  0.997667611

step =  400
x =  0.999690711

step =  500
x =  0.999959

step =  600
x =  0.999994516

y = 0
x = 0.999995232


In [7]:
# Find the minimum of f(x) = a*x**2 + b*x + c
# using optimizer.minimize

x = tf.Variable(0.0, name='x', dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def f():
  a = tf.constant(1.0)
  b = tf.constant(-2.0)
  c = tf.constant(1.0)
  y = a*tf.pow(x,2) + b*x + c
  return y

@tf.function
def train(epoch=1000):
  for _ in tf.range(epoch):
    optimizer.minimize(f, [x])
  tf.print("epoch = ", optimizer.iterations)
  return (f())

train(1000)
tf.print("y = ", f())
tf.print("x = ", x)

epoch =  1000
y =  0
x =  0.999998569


In [10]:
# Find the minimum of f(x) = a*x**2 + b*x + c
# using model.fit

tf.keras.backend.clear_session()

class FakeModel(tf.keras.models.Model):
  def __init__(self, a, b, c):
    super(FakeModel, self).__init__()
    self.a = a
    self.b = b
    self.c = c

  def build(self):
    self.x = tf.Variable(0.0, name='x')
    self.built = True

  def call(self, inputs):
    # The input values are ignored; only their shape is used to broadcast the loss
    loss = self.a*(self.x)**2 + self.b*(self.x) + self.c
    return (tf.ones_like(inputs) * loss)

def myloss(y_true, y_pred):
  return tf.reduce_mean(y_pred)

model = FakeModel(tf.constant(1.0), tf.constant(-2.0), tf.constant(1.0))

model.build()
model.summary()

model.compile(optimizer = tf.keras.optimizers.SGD(learning_rate=0.01), loss=myloss)
history = model.fit(tf.zeros((100,2)),
                    tf.ones(100),
                    batch_size=1,
                    epochs=10)

Model: "fake_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1 (4.00 Byte)
Trainable params: 1 (4.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [11]:
tf.print("x=",model.x)
tf.print("loss=",model(tf.constant(0.0)))

x= 0.999998569
loss= 0


## 2. Pre-defined Optimizers

The `keras.optimizers` sub-module provides classes that implement these optimizers.
- `SGD`, with the default parameters it is plain SGD. With a non-zero `momentum` parameter it becomes SGDM, since it takes the first-order momentum (an accumulation of past gradients) into account. With `nesterov=True` it becomes NAG (Nesterov Accelerated Gradient), which evaluates the gradient one step ahead.
- `Adagrad`, considers the second-order momentum (an accumulation of squared gradients) and has a self-adaptive learning rate; the drawback is that the learning rate decreases monotonically, so learning becomes slow in later stages and may stop too early.
- `RMSprop`, considers the second-order momentum and has a self-adaptive learning rate; it improves on `Adagrad` through exponential smoothing, so that the second-order momentum only reflects a recent window of gradients.
- `Adadelta`, considers the second-order momentum; similar to `RMSprop` but more complicated, with improved self-adaptation.
- `Adam`, considers both the first-order and the second-order momentum; it improves on `RMSprop` by including the first-order momentum.
- `Nadam`, improves on `Adam` by including Nesterov acceleration.
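
To make the terms "first-order momentum" and "second-order momentum" concrete, here is a plain-Python sketch of the textbook update rules applied to the same quadratic f(x) = x**2 - 2*x + 1 used earlier. These are simplified forms for illustration only; the Keras classes differ in details such as initial accumulator values and where epsilon is added, and the function names and hyperparameter values below are assumptions made for this example.

```python
import math

def grad(x):
    """Gradient of f(x) = x**2 - 2*x + 1, whose minimum is at x = 1."""
    return 2.0 * x - 2.0

def sgd(steps, lr=0.01, momentum=0.0, nesterov=False):
    x, velocity = 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        # First-order momentum: a decaying accumulation of past gradients
        velocity = momentum * velocity - lr * g
        if nesterov:
            x += momentum * velocity - lr * g  # look one step ahead
        else:
            x += velocity
    return x

def adagrad(steps, lr=0.1, eps=1e-7):
    x, accum = 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        accum += g * g  # second-order term: sum of all squared gradients
        x -= lr * g / (math.sqrt(accum) + eps)
    return x

def rmsprop(steps, lr=0.01, rho=0.9, eps=1e-7):
    x, avg_sq = 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        # Exponential smoothing replaces Adagrad's ever-growing sum
        avg_sq = rho * avg_sq + (1.0 - rho) * g * g
        x -= lr * g / (math.sqrt(avg_sq) + eps)
    return x

def adam(steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    x, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g      # first-order term
        v = beta2 * v + (1.0 - beta2) * g * g  # second-order term
        m_hat = m / (1.0 - beta1 ** t)         # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

results = {
    "SGD": sgd(1000),
    "SGDM": sgd(1000, momentum=0.9),
    "NAG": sgd(1000, momentum=0.9, nesterov=True),
    "Adagrad": adagrad(1000),
    "RMSprop": rmsprop(1000),
    "Adam": adam(1000),
}
for name, value in results.items():
    print(f"{name:8s} x = {value:.6f}")
```

All variants should end up close to x = 1 on this simple problem. RMSprop with a fixed learning rate tends to hover within roughly one step size of the minimum rather than settling on it exactly.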
