
### Let's start by creating mini-batches.

Two steps:

1.   Shuffle the examples, applying the same permutation to X and Y.
2.   Partition them into mini-batches of equal size; the last mini-batch may be smaller.



In [1]:
import math
import numpy as np

def random_mini_batches(X,Y,mini_batch_size=64):
  m=X.shape[1]  # number of examples (one example per column)
  mini_batches=[]
  # Step 1: shuffle X and Y with the same permutation so each label stays with its example
  permutation=list(np.random.permutation(m))
  shuffled_X=X[:,permutation]
  shuffled_Y=Y[:,permutation].reshape((1,m))  # assumes Y has a single row of labels
  # Step 2: slice out the full-size mini-batches
  num_complete_mini_batches=math.floor(m/mini_batch_size)
  for k in range(0,num_complete_mini_batches):
    mini_batch_X=shuffled_X[:,k*mini_batch_size:(k+1)*mini_batch_size]
    mini_batch_Y=shuffled_Y[:,k*mini_batch_size:(k+1)*mini_batch_size]
    mini_batch=(mini_batch_X,mini_batch_Y)
    mini_batches.append(mini_batch)

  # Leftover examples form a last, smaller mini-batch
  if m%mini_batch_size!=0:
    mini_batch_X=shuffled_X[:,num_complete_mini_batches*mini_batch_size:m]
    mini_batch_Y=shuffled_Y[:,num_complete_mini_batches*mini_batch_size:m]
    mini_batch=(mini_batch_X,mini_batch_Y)
    mini_batches.append(mini_batch)

  return mini_batches
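
As a quick check of the shuffle-and-partition idea, here is a minimal self-contained sketch. The helper `make_mini_batches_sketch`, the seed, and the dataset size of 148 examples with 3 features are all made up for illustration and are not part of the notebook; the helper is a compact variant of the same two steps rather than the function above.

```python
import numpy as np

def make_mini_batches_sketch(X, Y, mini_batch_size=64, seed=0):
    """Hypothetical helper: shuffle the columns, then slice them into mini-batches."""
    m = X.shape[1]                        # examples are stored as columns
    rng = np.random.default_rng(seed)     # seeded so the shuffle is reproducible
    permutation = rng.permutation(m)
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    # Slicing past the end of an array is safe in NumPy, so the last
    # mini-batch simply comes out smaller when m is not a multiple of the size.
    return [
        (X_shuf[:, start:start + mini_batch_size],
         Y_shuf[:, start:start + mini_batch_size])
        for start in range(0, m, mini_batch_size)
    ]

# Made-up data: row 0 of X and the single row of Y both hold the column index,
# which makes it easy to verify that examples and labels stay paired.
X_demo = np.arange(3 * 148, dtype=float).reshape(3, 148)
Y_demo = np.arange(148).reshape(1, 148)

batches = make_mini_batches_sketch(X_demo, Y_demo)
sizes = [bx.shape[1] for bx, _ in batches]
print(sizes)  # [64, 64, 20]
```

With 148 examples and a mini-batch size of 64, there are two full mini-batches and one leftover mini-batch of 20.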

Because mini-batch gradient descent computes each update from only a fraction of the examples, the update direction is noisy. It helps to use some form of momentum so that the updates do not stray too far from the path to the bottom of the loss curve.

### Mini-batch gradient descent with momentum: implementation

All the V's (velocities) must be initialized at the beginning. Each one starts as an array of zeros with the same shape as its parameter.

In [2]:
import numpy as np

In [3]:
def initialize_velocity(parameters):
  L=len(parameters)//2
  v={}
  for l in range(L):
    v["dW"+str(l+1)]=np.zeros((parameters["W"+str(l+1)].shape[0],parameters["W"+str(l+1)].shape[1]))
    v["db"+str(l+1)]=np.zeros((parameters["b"+str(l+1)].shape[0],parameters["b"+str(l+1)].shape[1]))
  return v

Now that the V's are ready, let's update them with momentum and use them to update the parameters.

In [4]:
def update_parameters_with_momentum(parameters,grads,v,beta,learning_rate):
  L=len(parameters)//2
  for l in range(L):
    v["dW"+str(l+1)]=beta*v["dW"+str(l+1)]+(1-beta)*grads["dW"+str(l+1)]
    v["db"+str(l+1)]=beta*v["db"+str(l+1)]+(1-beta)*grads["db"+str(l+1)]

    parameters["W"+str(l+1)]=parameters["W"+str(l+1)]-learning_rate*v["dW"+str(l+1)]
    parameters["b"+str(l+1)]=parameters["b"+str(l+1)]-learning_rate*v["db"+str(l+1)]

  return parameters,v
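
To see what the velocity does, here is a small self-contained sketch of the same update rule applied to a single weight. The values of `beta`, `learning_rate`, the starting weight, and the constant gradient are made up for illustration.

```python
import numpy as np

beta, learning_rate = 0.9, 0.1   # illustrative values only
w = np.array([[1.0]])            # a single made-up weight
v = np.zeros_like(w)             # velocity starts at zero
grad = np.array([[1.0]])         # pretend the gradient stays constant

history = []
for step in range(3):
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    w = w - learning_rate * v          # step along the velocity, not the raw gradient
    history.append((float(v[0, 0]), float(w[0, 0])))

for step, (v_val, w_val) in enumerate(history, start=1):
    print(f"step {step}: v={v_val:.3f}, w={w_val:.4f}")
```

The velocity climbs gradually (roughly 0.1, 0.19, 0.271) toward the gradient value of 1 instead of jumping there at once. That slow build-up is what damps the effect of any single noisy mini-batch gradient.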

Here beta can be anywhere from about 0.8 to 0.999, but 0.9 is generally used if we don't want to tune it. Lazy fellas.
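
One way to build intuition for beta is a common rule of thumb: the velocity averages over roughly 1/(1-beta) recent gradients. The sketch below is only an illustration; the helper name `gradient_weights` and the horizon of 5000 steps are made up, and it ignores the zero initialization of the velocity.

```python
import numpy as np

def gradient_weights(beta, num_steps=5000):
    """Weight the velocity gives to the gradient from k steps ago: (1 - beta) * beta**k."""
    k = np.arange(num_steps)
    return (1 - beta) * beta ** k

for beta in (0.8, 0.9, 0.999):
    weights = gradient_weights(beta)
    window = 1 / (1 - beta)   # rough number of recent gradients being averaged
    print(f"beta={beta}: window ~ {window:.0f} steps, "
          f"weight on newest gradient = {weights[0]:.3f}")
```

A larger beta spreads the weight over more past gradients, which gives smoother but slower-reacting updates.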

Let's move on to Adam now.

It combines RMSprop and momentum, so working through it also shows how RMSprop works.
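
For reference, the RMSprop half on its own can be sketched as follows. This is a hypothetical helper, `update_parameters_rmsprop_sketch`, written in the same dictionary style as the rest of the notebook but not part of it. The default `beta2=0.999` is borrowed from the Adam defaults used here and is an assumption; RMSprop is often run with a smaller decay rate.

```python
import numpy as np

def update_parameters_rmsprop_sketch(parameters, grads, s,
                                     learning_rate=0.01, beta2=0.999, epsilon=1e-8):
    """Hypothetical RMSprop step: scale each gradient by a running RMS of itself."""
    L = len(parameters) // 2   # number of layers (one W and one b per layer)
    for l in range(1, L + 1):
        for p_name, g_name in (("W", "dW"), ("b", "db")):
            p_key, g_key = p_name + str(l), g_name + str(l)
            # Running average of the squared gradients
            s[g_key] = beta2 * s[g_key] + (1 - beta2) * np.square(grads[g_key])
            # Divide the gradient by the root of that average
            parameters[p_key] = (parameters[p_key]
                                 - learning_rate * grads[g_key] / (np.sqrt(s[g_key]) + epsilon))
    return parameters, s

# Made-up example: two weights whose gradients differ by a factor of 100
params = {"W1": np.array([[1.0, 1.0]]), "b1": np.array([[0.0]])}
grads = {"dW1": np.array([[10.0, 0.1]]), "db1": np.array([[1.0]])}
s = {"dW1": np.zeros((1, 2)), "db1": np.zeros((1, 1))}

params, s = update_parameters_rmsprop_sketch(params, grads, s)
print(params["W1"])
```

Even though the two gradients differ by a factor of 100, both weights move by nearly the same amount, because each gradient is divided by its own running magnitude. Adam applies the same scaling to the momentum-averaged gradient instead of the raw one.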

First, let's initialize the v and s values for Adam.

In [10]:
def initialize_parameters_Adam(parameters):
  L=len(parameters)//2
  v={}
  s={}
  for l in range(L):
    v["dW"+str(l+1)]=np.zeros((parameters["W"+str(l+1)].shape[0],parameters["W"+str(l+1)].shape[1]))
    v["db"+str(l+1)]=np.zeros((parameters["b"+str(l+1)].shape[0],parameters["b"+str(l+1)].shape[1]))
    s["dW"+str(l+1)]=np.zeros((parameters["W"+str(l+1)].shape[0],parameters["W"+str(l+1)].shape[1]))
    s["db"+str(l+1)]=np.zeros((parameters["b"+str(l+1)].shape[0],parameters["b"+str(l+1)].shape[1]))
  return v,s

In [12]:
def update_parameters_Adam(parameters,grads,v,s,t,learning_rate=0.01,beta1=0.9,beta2=0.999,epsilon=1e-8):
  L=len(parameters)//2
  v_corrected={}
  s_corrected={}
  for l in range(L):
    v["dW"+str(l+1)]=beta1*v["dW"+str(l+1)]+(1-beta1)*grads["dW"+str(l+1)]
    v["db"+str(l+1)]=beta1*v["db"+str(l+1)]+(1-beta1)*grads["db"+str(l+1)]
    s["dW"+str(l+1)]=beta2*s["dW"+str(l+1)]+(1-beta2)*np.square(grads["dW"+str(l+1)])
    s["db"+str(l+1)]=beta2*s["db"+str(l+1)]+(1-beta2)*np.square(grads["db"+str(l+1)])

    v_corrected["dW"+str(l+1)]=v["dW"+str(l+1)]/(1-np.power(beta1,t))
    v_corrected["db"+str(l+1)]=v["db"+str(l+1)]/(1-np.power(beta1,t))
    s_corrected["dW"+str(l+1)]=s["dW"+str(l+1)]/(1-np.power(beta2,t))
    s_corrected["db"+str(l+1)]=s["db"+str(l+1)]/(1-np.power(beta2,t))

    # epsilon is added outside the square root, as in the standard Adam update
    parameters["W"+str(l+1)]=parameters["W"+str(l+1)]-learning_rate*v_corrected["dW"+str(l+1)]/(np.sqrt(s_corrected["dW"+str(l+1)])+epsilon)
    parameters["b"+str(l+1)]=parameters["b"+str(l+1)]-learning_rate*v_corrected["db"+str(l+1)]/(np.sqrt(s_corrected["db"+str(l+1)])+epsilon)

  return parameters,v,s
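
To illustrate why the bias correction matters, here is a self-contained sketch of the very first Adam step on a single made-up gradient value. It uses the standard form of the update, with epsilon added outside the square root; the gradient value of 4.0 is an assumption chosen only to keep the arithmetic easy.

```python
import numpy as np

beta1, beta2, epsilon, learning_rate = 0.9, 0.999, 1e-8, 0.01
grad = np.array([[4.0]])   # made-up gradient
v = np.zeros_like(grad)    # first moment (momentum part)
s = np.zeros_like(grad)    # second moment (RMSprop part)
t = 1                      # step counter must start at 1; t=0 would divide by zero

v = beta1 * v + (1 - beta1) * grad
s = beta2 * s + (1 - beta2) * np.square(grad)

# Without correction, both averages are biased toward their zero initialization
print("raw v:", v[0, 0], "raw s:", s[0, 0])

v_corrected = v / (1 - beta1 ** t)
s_corrected = s / (1 - beta2 ** t)
print("corrected v:", v_corrected[0, 0], "corrected s:", s_corrected[0, 0])

step = learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
print("first step size:", step[0, 0])
```

After one step the raw averages are only about 0.4 and 0.016, far below the gradient (4) and its square (16). Dividing by 1-beta^t recovers those values, so the first update has a magnitude close to the learning rate.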