# Adam, RMSProp, and Momentum Mini-Batch Gradient Descent Optimizer for Breast Cancer Prediction

The purpose of this project is to apply Adam, RMSProp, and Momentum mini-batch gradient descent for the breast cancer prediction. Then, the F1 score of the test set using these optimizers will be compared with one another.

The data used in this project was taken from the Wisconsin Breast Cancer datasets, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. Overall, there are 569 examples of tumor with different size of radius, texture, concavity, smoothness, and compactness available.

First, let's import all of the necessary libraries for this project.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy
import pandas as pd
import math
from main import *

In order to use mini-batch gradient descent, the first step that we need to do is to partition the training set as well as test set into several mini-batches. In this project, each mini-batch consists of 64 training and test examples.

However, since each mini-batch consists of 64 elements, then it is natural that the last mini-batch will have elements less than 64. We can compute the number of elements in the last mini-batch with the following formula:

$$m-mini_\_batch_\_size \times \lfloor \frac{m}{mini\_batch\_size}\rfloor$$

where $m$ is the total examples of training set, $mini_\_batch_\_size$ is the size of each mini-batch, which is 64, and $\lfloor \rfloor$ is the sign to rounding down float number to closest integer.

Let's implement the mini batch size in a function.

In [2]:
def random_mini_batches(X, Y, mini_batch_size, seed =0):
    
    np.random.seed(seed)            
    m = X.shape[1]                  
    mini_batches = []
        
    # Shuffling the training set and test set
    permutation = list(np.random.permutation(m))

    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))
  
    # Partition to mini batches
    num_minibatches = math.floor(m/mini_batch_size) # number of generated mini-batches
    
    for k in range(0, num_minibatches):
        
        mini_batch_X = shuffled_X[:,k*mini_batch_size:(k+1)*mini_batch_size]
        mini_batch_Y = shuffled_Y[:,k*mini_batch_size:(k+1)*mini_batch_size]
        
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
  
    # Handling the end tail of mini batches which have elements < 64
    
    if m % mini_batch_size != 0:
       
        mini_batch_X = shuffled_X[:, 0:m - mini_batch_size*math.floor(m/mini_batch_size)]
        mini_batch_Y = shuffled_Y[:, 0:m - mini_batch_size*math.floor(m/mini_batch_size)]
        
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches

The code above consists of two steps: 
- First, the training set and the corresponding response value $Y$ is shuffled randomly to generate random mini-batches.
- Second, the shuffled training set and the corresponding response value $Y$ is partitioned into mini-batches of 64 elements.

## Momentum Optimizer

The first optimizer that will be used in this project is Momentum optimizer. One of the biggest challenge in applying pure mini-batch gradient descent is that the parameters update will occur only after seeing just a subset of training set (i.e each mini-batch), hence there will be oscillations in the process of reaching the optimum value.

In order to overcome this, let's apply the momentum mini-batch gradient descent. The advantage of applying momentum is that it reduces the oscillations that will typically be found in the mini-batch gradient descent during parameter updates and thus, speeds up the learning rate.

Equations for momentum optimizer:

$$v_{d_w} = \beta*v_{d_w} + (1-\beta) dW$$
$$v_{d_b} = \beta*v_{d_b} + (1-\beta) db$$
$$W = W - \alpha v_{d_W}, b = b - \alpha v_{d_b}$$

where $v_{d_w}$ is the direction of previous gradient for the weight $W$, $v_{d_b}$ is the direction of previous gradient for the bias term $b$, and $\beta$ is momentum parameter.

Let's implement mathematical equations above in a function.

In [3]:
def initializeVelocity_momentum(parameters):
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    
    for l in range(L):
    
        v["dW" + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape))
        v["db" + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape))
        
        
    return v

In [4]:
 def updateParameters_with_momentum(parameters, grads, v, beta, learningRate):
   

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):

        v["dW" + str(l+1)] = beta*v["dW" + str(l+1)] + ((1-beta)*grads['dW' + str(l+1)])
        v["db" + str(l+1)] = beta*v["db" + str(l+1)] + ((1-beta)*grads['db' + str(l+1)])
    
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learningRate*grads['dW' + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learningRate*grads['db' + str(l+1)]
     
        
    return parameters, v

From the equation above, we can see that if we set $\beta$ = 0, then we have the same parameters updating as if we are not applying any Momentum. The bigger the $\beta$, the smoother the updates will be.


## RMSProp Optimizer 

The intuition behind RMSProp is basically the same as momentum mini-batch gradient descent, in which we try to dampen the oscillations of the path in each parameter updates of mini-batch gradient descent. The difference between RMSProp and Momentum is the term to update the weight $W$ and bias term $b$. In RMSProp, the root mean squared technique is applied with the formula:

$$S_{d_w} = \beta*S_{d_w} + (1-\beta) dW^2$$
$$S_{d_b} = \beta*S_{d_b} + (1-\beta) db^2$$
$$W = W - \alpha \frac{dW}{\sqrt{S_{d_w}}}$$
$$b = b - \alpha \frac{db}{\sqrt{S_{d_b}}}$$

Now let's apply the formula above in a function.

In [5]:
def initializeVelocity_RMSProp(parameters):
    
    L = len(parameters) // 2 # number of layers in the neural networks
    s = {}
    
    # Initialize velocity
    
    for l in range(L):
    
        s["dW" + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape))
        s["db" + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape))
        
        
    return s

In [6]:
 def updateParameters_with_RMSProp(parameters, grads, s, beta, learningRate, epsilon):
   

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):

        s["dW" + str(l+1)] = beta*s["dW" + str(l+1)] + ((1-beta)*grads['dW' + str(l+1)]**2)
        s["db" + str(l+1)] = beta*s["db" + str(l+1)] + ((1-beta)*grads['db' + str(l+1)]**2)
    
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learningRate*grads['dW' + str(l+1)]/np.sqrt(s["dW" + str(l+1)]+epsilon) 
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learningRate*grads['db' + str(l+1)]/np.sqrt(s["db" + str(l+1)]+epsilon)
     
        
    return parameters, s

## Adam Optimizer

Adam optimizer is the optimizer that combines both of the intuition from Momentum and RMSProp mini-batch gradient descent. The formula for Adam optimizer is as follows:

$$ v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } $$
$$v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t}$$
$$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2$$
$$s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t}$$
$$W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}$$

where:
- $t$ is the number of steps taken.
- $\beta_1$ and $\beta_2$ are the hyperparameters controlling the exponentially weighted averages. 
- $\varepsilon$ is the small number to avoid the division by zero.

As we can see from the mathematical equation above, adam optimizer uses the combination of formula that are already applied in Momentum and RMSProp optimizer.

In [7]:
def initializeAdam(parameters) :
 
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}
    
    for l in range(L):
   
        v["dW" + str(l+1)] = np.zeros((parameters["W" + str(l+1)].shape))
        v["db" + str(l+1)] = np.zeros((parameters["b" + str(l+1)].shape))
        s["dW" + str(l+1)] = np.zeros((parameters["W" + str(l+1)].shape))
        s["db" + str(l+1)] = np.zeros((parameters["b" + str(l+1)].shape))
    
    return v, s

In [8]:
def updateParameters_with_adam(parameters, grads, v, s, t, learningRate, beta1, beta2, epsilon):

    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         
    s_corrected = {}                        
    
    
    for l in range(L):
      
        v["dW" + str(l+1)] = beta1*v["dW" + str(l+1)]+((1-beta1)*grads['dW' + str(l+1)])
        v["db" + str(l+1)] = beta1*v["db" + str(l+1)]+((1-beta1)*grads['db' + str(l+1)])

        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)]/(1-(beta1)**t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)]/(1-(beta1)**t)

        s["dW" + str(l+1)] = beta2*s["dW" + str(l+1)]+((1-beta2)*(grads['dW' + str(l+1)])**2)
        s["db" + str(l+1)] = beta2*s["db" + str(l+1)]+((1-beta2)*(grads['db' + str(l+1)])**2)
      
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)]/(1-(beta2)**t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)]/(1-(beta2)**t)

        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - (learningRate*v_corrected["dW" + str(l+1)]/((np.sqrt(s_corrected["dW" + str(l+1)]))+ epsilon))
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - (learningRate*v_corrected["db" + str(l+1)]/((np.sqrt(s_corrected["db" + str(l+1)]))+ epsilon))
    

    return parameters, v, s

## Mini-Batch Gradient Descent

Now we already define the optimizer functions necessary for the purpose of this project. Next, we need to build a function to wrap-up all of the steps that we defined so far. This function will take all of the steps from deep neural networks and apply the algorithm according to the optimizer of our choose.

The algorithm for deep neural networks has already been done in another project called "Deep Neural Networks with Different Activation Functions for Breast Cancer Prediction". So in this project, all we need to do is just call the corresponding function from that project.

In [None]:
def model(X, Y, layersDimension, activation, optimizer, learningRate, mini_batch_size, beta, beta1, 
          beta2,  epsilon, num_epochs):


    L = len(layersDimension)         # number of layers in the neural networks
    costs = []                       
    t = 0                            # initializing the counter required for Adam update
    seed = 10                      
    m = X.shape[1]                   # number of training examples
    
    # Initialize parameters
    parameters = initializeParameters(layersDimension)

    # Initialize the optimizer
    if optimizer == "rmsprop":
        s = initializeVelocity_RMSProp(parameters)
    elif optimizer == "momentum":
        v = initializeVelocity_momentum(parameters)
    elif optimizer == "adam":
        v, s = initializeAdam(parameters)
    
    # Optimization loop
    for i in range(num_epochs):
        
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        cost_total = 0
        
        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            AL, caches = modelForwardProp(minibatch_X, parameters, activation)

            # Compute cost and add to the cost total
            cost_total += computeCost_mini_batch(AL, minibatch_Y)

            # Backward propagation
            grads = modelBackProp(AL, minibatch_Y, caches, activation)

            # Update parameters
            if optimizer == "rmsprop":
                parameters, s = updateParameters_with_RMSProp(parameters, grads, s, beta2, learningRate, epsilon)
            elif optimizer == "momentum":
                parameters, v = updateParameters_with_momentum(parameters, grads, v, beta, learningRate)
            elif optimizer == "adam":
                t = t + 1 # Adam counter
                parameters, v, s = updateParameters_with_adam(parameters, grads, v, s,
                                                               t, learningRate, beta1, beta2,  epsilon)
        cost_avg = cost_total / m
        
       
        
        # Print the cost every 1000 epoch
        if i % 10000 == 0:
            print ("Cost after epoch %i: %f" %(i, cost_avg))
            
    return parameters

## Loading Dataset

Now we already defined all of the function necessary to run mini-batch gradient descent with different optimizers. Next, let's load the datasets.

In [12]:
np.random.seed(1)
trainX = pd.read_csv('trainX.csv')
trainY = pd.read_csv('trainY.csv', names = ['diagnosis'])
testX = pd.read_csv('testX.csv')
testY = pd.read_csv('testY.csv', names = ['diagnosis'])

Here the training and test data are already splitted into the proportion of 70%-30%. Each feature of the data has been normalized as well using min-max approach. Hence, the data now is ready to be used for classification study.

Let's take a look first at the size of the training data and test data.

In [13]:
print('Shape of the training set: '+str(trainX.shape))
print('Shape of the test set: '+str(testX.shape))

Shape of the training set: (398, 30)
Shape of the test set: (171, 30)


Next, we can take a look at the first five rows of the training set.

In [14]:
trainX.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,0.607175,0.420697,0.595743,0.473595,0.412386,0.255567,0.346532,0.472068,0.263636,0.084035,...,0.68979,0.502665,0.679267,0.543846,0.528495,0.279138,0.429073,0.820619,0.237138,0.138463
1,0.49406,0.536016,0.488632,0.341251,0.433059,0.292068,0.394096,0.327883,0.125253,0.183235,...,0.360726,0.427772,0.348573,0.205417,0.350855,0.147481,0.223882,0.377663,0.007491,0.086187
2,0.317052,0.223876,0.303849,0.183245,0.362372,0.163088,0.04105,0.093439,0.288384,0.244103,...,0.28175,0.218017,0.254943,0.144564,0.364723,0.125263,0.096326,0.299107,0.244628,0.149416
3,0.273984,0.395671,0.264184,0.154358,0.314706,0.143028,0.072915,0.142346,0.320202,0.271904,...,0.207044,0.30597,0.19239,0.096908,0.14997,0.060628,0.041422,0.164021,0.121033,0.089663
4,0.333617,0.39026,0.317877,0.19508,0.343685,0.15358,0.034255,0.094235,0.230808,0.176706,...,0.263252,0.486674,0.238358,0.130333,0.379912,0.120315,0.049768,0.273643,0.130298,0.138594


And first five rows of response variables.

In [15]:
trainY.head()

Unnamed: 0,diagnosis
0,1
1,1
2,0
3,0
4,0


Next, we can define all of the constant parameters that we will need in order to run the gradient descent like the learning rate $\alpha$, the number of iterations, as well as the number of hidden layers and the hidden inputs.

In [16]:
trainX =trainX.values

trainY = trainY.values
trainY = trainY.reshape((len(trainY),1))

testY = testY.values
testY = testY.reshape((len(testY),1))

In [17]:
layerDimension = [30, 5, 3, 1] #4 layers with 2 hidden layers
learningRate = 0.0001
num_epochs = 100010
beta = 0.9
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
activation = 'relu'
mini_batch_size = 64

## About the Metrics

Since the classification that will be conducted is about the severity of the tumor: whether the tumor is benign ot malignant, then the accuracy metrics would not be the best metrics for this condition. The emphasize should be focused in supressing the amount of false negative in the algorithm because the cost of having miss-classified an actual positive will be huge. 

Just imagine where there is a patient that has a breast tumor and we make a decision that the tumor is benign when in fact it is malignant. This situation will endanger the patient life and thus, creating a machine learning algorithm that can supress the amount of false negative should be a priority.

Because of this, instead of accuracy or Jaccard index, the metrics that will be used for this problem is F1-score, which is the 'average' of precision and recall.

In [18]:
def predict(X, y, parameters, activation):
    
    m = X.shape[1] 
    p = np.zeros((1,m))
    
    # Forward propagation
    probas, caches = modelForwardProp(X, parameters, activation)

    # convert probas to 0/1 predictions
    for i in range(0, probas.shape[1]):
        if probas[0,i] > 0.5:
            p[0,i] = 1
        else:
            p[0,i] = 0
    
    truePositive = []
    trueNegative = []
    falsePositive = []
    falseNegative = []
    
    for i in range (0, probas.shape[1]):
        truePositive.append((p[0,i] == 1 and y[0,i] == 1))
        trueNegative.append(p[0,i] == 0 and y[0,i] == 0)
        falsePositive.append(p[0,i] == 1 and y[0,i] == 0)
        falseNegative.append(p[0,i] == 0 and y[0,i] == 1)
    
    epsilon = 10e-8
    recall = truePositive.count(True)/(truePositive.count(True)+falseNegative.count(True)+epsilon)
    precision = truePositive.count(True)/(truePositive.count(True)+falsePositive.count(True)+epsilon)
    
    F1_score = 2*(precision*recall/(precision+recall+epsilon))
        
    return F1_score

## Momentum Optimizer Mini-Batch Gradient Descent

The first optimizer that will be implemented here is the Momentum optimizer. First, the functions that already defined in the previous sections will be called. Then, the metrics F1 score can be used to quantify how well the model is predicting the training set and test set using Momentum optimizer.

In [19]:
parameters = model(trainX.T, trainY.T, layerDimension, activation, 'momentum', learningRate, mini_batch_size, 
                   beta, beta1, beta2,  epsilon, num_epochs)

Cost after epoch 0: 0.695905
Cost after epoch 10000: 0.654073
Cost after epoch 20000: 0.643471
Cost after epoch 30000: 0.534853
Cost after epoch 40000: 0.169943
Cost after epoch 50000: 0.103632
Cost after epoch 60000: 0.087595
Cost after epoch 70000: 0.077848
Cost after epoch 80000: 0.071317
Cost after epoch 90000: 0.066766
Cost after epoch 100000: 0.063319


Let's predict the F1 score from the training set.

In [20]:
predictTrain = predict(trainX.T, trainY.T, parameters, 'relu')
predictTrain

0.97508891729335

And F1-score from the test set.

In [21]:
predictTest = predict(testX.T, testY.T, parameters, 'relu')
predictTest

0.9701492023279153

As we can see above, the F1 score from training set is 97,5%, which is very good, while the F1 score from the training set is 97%. This means that the Momentum optimizer with seleced constant parameters is able to generalize the test data really well.

Next let's see how RMSProp optimizer performs for this particular problem.

## RMSProp Optimizer

For RMSProp optimizer, the learning rate $\alpha$ needs to be re-adjusted to the smaller steps because if we run the optimizer with the default learning rate, then the gradient descent would diverge into a nonsense solution. Hence, the smaller learning rate, which is $10^{-5}$, will be used for this optimizer in order to ensure that the gradient descent will be updated in a small step in each epochs.

In [27]:
learningRate = 0.00001

In [28]:
parameters = model(trainX.T, trainY.T, layerDimension, activation, 'rmsprop', learningRate, mini_batch_size, 
                   beta, beta1, beta2,  epsilon, num_epochs)

Cost after epoch 0: 0.695697
Cost after epoch 10000: 0.328326
Cost after epoch 20000: 0.181345
Cost after epoch 30000: 0.115237
Cost after epoch 40000: 0.079268
Cost after epoch 50000: 0.056441
Cost after epoch 60000: 0.052054
Cost after epoch 70000: 0.047193
Cost after epoch 80000: 0.044094
Cost after epoch 90000: 0.040674
Cost after epoch 100000: 0.036392


Next, we can observe the F1-score of RMSProp optimizer for training set

In [29]:
predictTrain = predict(trainX.T, trainY.T, parameters, 'relu')
predictTrain

0.9823321047884256

And the F1-score from the test set.

In [30]:
predictTest = predict(testX.T, testY.T, parameters, 'relu')
predictTest

0.9411764192149681

From the result above, the F1-score from the prediction set is about 98,2%, which is even better than the one with Momentum optimizer. However, the F1-score from the test set is just 94.1%, which means it performs worse than Momentum optimizer. From the differnce between training test and the test set, we can conclude that the model is overfitting the training data. The possible solution for this is by applying regularization methods such as L2 or dropout.

## Adam Optimizer

The last optimizer that will be investigated is Adam optimizer. The learning rate that will be used for this optimizer is the same as the one used in RMSProp, since if we apply learning rate the same as Momentum, then the gradient descent will diverge.

Let's look how this optimizer performs.

In [33]:
parameters = model(trainX.T, trainY.T, layerDimension, activation, 'adam', learningRate, mini_batch_size, 
                   beta, beta1, beta2,  epsilon, num_epochs)

Cost after epoch 0: 0.695902
Cost after epoch 10000: 0.332563
Cost after epoch 20000: 0.183415
Cost after epoch 30000: 0.116519
Cost after epoch 40000: 0.079799
Cost after epoch 50000: 0.056686
Cost after epoch 60000: 0.052223
Cost after epoch 70000: 0.047291
Cost after epoch 80000: 0.044181
Cost after epoch 90000: 0.040723
Cost after epoch 100000: 0.036438


Next, let's see the F1-score generated by the model from the training set.

In [34]:
predictTrain = predict(trainX.T, trainY.T, parameters, 'relu')
predictTrain

0.9823321047884256

And the F1-score from the test set.

In [35]:
predictTest = predict(testX.T, testY.T, parameters, 'relu')
predictTest

0.9411764192149681

From the F1-score result from both training and the test set, we can see that Adam optimizer performs similarly with RMSProp for particular setting of hyperparameters. The conclusion is the same, that there is an overfitting in the training set and we have a variance problem. The solution to remove this problem is by applying regularization algorithm like L2 regularization or dropout in some of the hidden units in the hidden layers of neural networks.

## Summary

The mini-batch gradient descent with different optimizers has been performed to predict the severity of cancer prediction. Below is the summary of the result.

| Optimizer          | Learning Rate | Training | Test      |
|-----------------------|-----------|----------|-----------|
| Momentum              | $10^{-4}$ | 97.5%    | 97%       |
| RMSProp               | $10^{-5}$ | 98.2%    | 94.1%     | 
| Adam                  | $10^{-5}$ | 98.2%    | 94.1%     | 

It can be seen from summary above that there is a variance problem in RMSProp and Adam optimizer, meaning that we overfit the training data. The solution for this is to apply regularization method for the parameters in the hidden units in the hidden layers of neural networks.