My project for this course analyzes breast cancer data from an ICAR breast cancer history competition, using a DenseNet CNN developed by my lab colleague Haben Berhane, and tries to optimize its training further.

This project looks at the different methods available in TensorFlow that build on the basics we have learned in the course.  These include:

Momentum-based learning: Nesterov momentum, ADAM, and ADAGRAD

Step-based learning: step sizes of .1, .01, and .001 were investigated

Regularization: regularization constants of .1, .01, and .001 were investigated

Batch learning: full and half batch learning were investigated

These are explained and discussed below.

## Nesterov Momentum

In class we talked about a standard momentum update, which modifies the direction of gradient descent by also weighing in the previous direction.  That is

$$w^k = w^{k-1} + \alpha d^{k-1} $$
$$ d^{k-1} = \beta d^{k-2} -  (1 - \beta) \nabla g(w^{k-1})$$

This serves to fix the zig-zag nature of gradient descent, especially in narrow quadratic functions.  Nesterov momentum applies the same idea, but in a different fashion.
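As a minimal sketch of the classical momentum update above, the following pure-Python snippet runs it on a narrow quadratic. The cost function, the starting point, and the values of `alpha` and `beta` are illustrative assumptions, not settings from the project.

```python
def grad(w):
    """Gradient of the assumed cost g(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def momentum_descent(w_start, alpha, beta, steps):
    """Classical momentum: d = beta*d - (1 - beta)*grad(w), then w = w + alpha*d."""
    w = list(w_start)
    d = [0.0] * len(w)  # there is no previous direction before the first step
    for _ in range(steps):
        g = grad(w)
        # Blend the previous direction with the new negative gradient.
        d = [beta * d_i - (1.0 - beta) * g_i for d_i, g_i in zip(d, g)]
        # Step along the blended direction.
        w = [w_i + alpha * d_i for w_i, d_i in zip(w, d)]
    return w


w_momentum = momentum_descent([10.0, 1.0], alpha=0.1, beta=0.7, steps=200)
print(w_momentum)  # both coordinates end up close to the minimum at the origin
```

The second coordinate is 25 times steeper than the first, which is the kind of narrow valley where plain gradient descent zig-zags.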

Nesterov momentum is instead:
$$w^k = w^{k-1} + d^{k-1} $$
$$ d^{k-1} = \beta d^{k-2} -  \alpha \nabla g(w^{k-1} + \beta d^{k-2}) $$

Nesterov momentum tends to converge faster than classical momentum for complex cost functions.  By first taking the "big" jump along the previous direction and then correcting it with the gradient evaluated at that look-ahead point, we can travel further in the cost function space than with regular momentum, while the correction also ameliorates the zig-zagging.
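The look-ahead step can be sketched in a few lines of pure Python. This is only an illustration under assumed values: the quadratic cost, the starting point, and the choices of `alpha` and `beta` are hypothetical and are not the project's settings.

```python
def grad(w):
    """Gradient of the assumed cost g(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def nesterov_descent(w_start, alpha, beta, steps):
    """Nesterov momentum: evaluate the gradient at the look-ahead point w + beta*d."""
    w = list(w_start)
    d = [0.0] * len(w)
    for _ in range(steps):
        # The "big" jump: where the previous direction alone would take us.
        look_ahead = [w_i + beta * d_i for w_i, d_i in zip(w, d)]
        # The correction: gradient measured at the look-ahead point.
        g = grad(look_ahead)
        d = [beta * d_i - alpha * g_i for d_i, g_i in zip(d, g)]
        w = [w_i + d_i for w_i, d_i in zip(w, d)]
    return w


w_nesterov = nesterov_descent([10.0, 1.0], alpha=0.03, beta=0.7, steps=200)
print(w_nesterov)  # both coordinates end up close to the minimum at the origin
```

The only difference from classical momentum is where the gradient is evaluated, which is why the two methods are usually exposed as one optimizer with a flag.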

[Reference]: 

https://stats.stackexchange.com/questions/179915/whats-the-difference-between-momentum-based-gradient-descent-and-nesterovs-acc

```python
from IPython.display import Image

# Render the figure at the URL below inline in the notebook.
Image(url="https://i.stack.imgur.com/wBIpz.png")
```

## ADAM Momentum

ADAM is another momentum-style optimizer. Formally, with $m^0 = v^0 = 0$, it functions as:

$$m^{k} = \beta_1 m^{k-1} + (1- \beta_1) \nabla g(w^{k-1}) $$

$$v^k = \beta_2 v^{k-1} + (1-\beta_2) \left(\nabla g(w^{k-1})\right)^2 $$

$$\hat{m}^k = \frac{m^k}{1-\beta_1^k}, \qquad \hat{v}^k = \frac{v^k}{1-\beta_2^k} $$

$$w^k = w^{k-1} - \frac{\alpha}{\sqrt{\hat{v}^{k}} + \epsilon} \hat{m}^{k}  $$

Here the square and square root are taken elementwise. The factors $\frac{1}{1-\beta_1^k}$ and $\frac{1}{1-\beta_2^k}$ are bias corrections that tend to 1 as iterations continue.  Because $m$ and $v$ start at zero, their early values are biased toward zero, and the correction compensates for this during the first iterations.  ADAM combines information from previous gradients and previous squared gradients (through $m$ and $v$ respectively), augmenting classical momentum.  The overview referenced below describes it as behaving like a heavy ball with friction that prefers flat minima in the error surface.

http://ruder.io/optimizing-gradient-descent/index.html#adam
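As a hedged sketch, the snippet below implements the standard form of ADAM with bias-corrected moment estimates in pure Python. The quadratic cost, the starting point, and `alpha` are illustrative assumptions; the defaults for `beta1`, `beta2`, and `eps` mirror the values listed later under the initial conditions.

```python
import math


def cost(w):
    """Assumed cost g(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def grad(w):
    """Gradient of the assumed cost."""
    return [w[0], 25.0 * w[1]]


def adam(w_start, alpha, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    w = list(w_start)
    m = [0.0] * len(w)  # running average of gradients
    v = [0.0] * len(w)  # running average of squared gradients
    for k in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * m_i + (1.0 - beta1) * g_i for m_i, g_i in zip(m, g)]
        v = [beta2 * v_i + (1.0 - beta2) * g_i ** 2 for v_i, g_i in zip(v, g)]
        # Bias correction: m and v start at zero, so early values are too small.
        m_hat = [m_i / (1.0 - beta1 ** k) for m_i in m]
        v_hat = [v_i / (1.0 - beta2 ** k) for v_i in v]
        w = [w_i - alpha * mh / (math.sqrt(vh) + eps)
             for w_i, mh, vh in zip(w, m_hat, v_hat)]
    return w


w_one_step = adam([10.0, 1.0], alpha=0.1, steps=1)
w_adam = adam([10.0, 1.0], alpha=0.1, steps=500)
print(w_one_step, cost(w_adam))
```

With the bias correction, the very first step moves every coordinate by approximately `alpha`, regardless of how large its gradient is.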

## ADAGRAD Momentum

ADAGRAD is the third optimizer I looked at.  It adapts the learning rate of each parameter according to how often the corresponding feature is active, making smaller updates for frequently occurring features and larger updates for infrequent ones. It thus works well with sparse data (which may be useful here, as these images tend to be fairly uniform with low contrast and could be sparse in that regard).

The update becomes:

$$w^k_i = w^{k-1}_i - \frac{\alpha}{\sqrt{G^{k-1}_{ii} + \epsilon}} \nabla_i g(w^{k-1})$$

Here $\nabla_i g$ is the $i$-th component of the gradient and $G^{k-1}_{ii} = \sum_{j=0}^{k-1} \left(\nabla_i g(w^{j})\right)^2$. So each parameter gets its own step length, based chiefly on the sum of its squared gradients over the previous time steps.
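The per-parameter scaling can be sketched as follows in pure Python. The quadratic cost, the starting point, and `alpha` are illustrative assumptions, and the accumulator starts at zero here rather than at a positive initial value.

```python
import math


def cost(w):
    """Assumed cost g(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def grad(w):
    """Gradient of the assumed cost."""
    return [w[0], 25.0 * w[1]]


def adagrad(w_start, alpha, steps, eps=1e-8):
    w = list(w_start)
    G = [0.0] * len(w)  # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G = [G_i + g_i ** 2 for G_i, g_i in zip(G, g)]
        # Each parameter is scaled by its own accumulated gradient history.
        w = [w_i - alpha / math.sqrt(G_i + eps) * g_i
             for w_i, G_i, g_i in zip(w, G, g)]
    # Effective per-parameter learning rates after the last step.
    rates = [alpha / math.sqrt(G_i + eps) for G_i in G]
    return w, rates


w_first, _ = adagrad([10.0, 1.0], alpha=0.5, steps=1)
w_adagrad, effective_rates = adagrad([10.0, 1.0], alpha=0.5, steps=100)
print(w_first, w_adagrad, effective_rates)
```

Because the accumulator only grows, the effective learning rates only shrink over time, which is the main known drawback of this scheme.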



## Project

For this project, I used the above optimizers with different learning rates on full batch data, training for 2 epochs each and using the loss to determine how well the neural network was performing.

I then ran half and single batch training with the best learning rate for each optimizer for 2 epochs. I also tested different regularization constants with the best learning rate on full batch.

I used 2 epochs because each training session could take 1 to 1.5 hours on my laptop. Even though I started the project early, I couldn't have my laptop out of commission that often, so I made the practical decision to run many different optimization schemes over a short duration to get a flavor for how they performed.

## Initial Conditions

The initial conditions for each optimizer were:

$\textbf{Nesterov}$:

init_learning_rate = 0.1

dropout_rate = 0.2

nesterov_momentum = 0.9

weight_decay = 10

exp_decay_steps = 6000

exp_decay_rate = 1

regularization = .0001


$\textbf{ADAM}$:

same as above but

 beta1=0.9, 
 
 beta2=0.999, 
 
 epsilon=1e-8


$\textbf{ADAGRAD}$:

same as above but

initial_accumulator_value=0.1
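One way to keep these settings organized is a shared dictionary plus per-optimizer overrides, sketched below. The names `BASE`, `OPTIMIZER_OVERRIDES`, and `build_config` are hypothetical, and the split is an assumption: "same as above" is read here as sharing every setting except the optimizer-specific ones.

```python
# Settings assumed to be shared by all three optimizers (values from the lists above).
BASE = {
    "init_learning_rate": 0.1,
    "dropout_rate": 0.2,
    "weight_decay": 10,
    "exp_decay_steps": 6000,
    "exp_decay_rate": 1,
    "regularization": 0.0001,
}

# Optimizer-specific settings (assumed not to carry over between optimizers).
OPTIMIZER_OVERRIDES = {
    "Nesterov": {"nesterov_momentum": 0.9},
    "ADAM": {"beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8},
    "ADAGRAD": {"initial_accumulator_value": 0.1},
}


def build_config(optimizer_name):
    """Merge the shared settings with the overrides for one optimizer."""
    return {**BASE, **OPTIMIZER_OVERRIDES[optimizer_name]}


configs = {name: build_config(name) for name in OPTIMIZER_OVERRIDES}
for name, config in configs.items():
    print(name, config)
```

Keeping the shared values in one place makes it easier to change a single setting, such as the learning rate, across all runs.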


## Learning Parameter

For a learning rate of 1.0, the optimizers performed as follows:

ADAGRAD: 29.09683 , 33.38682 , 33.34616 , 33.33678

ADAM: 20.70812 , 48.04075 , 134.737 , 76.29205

Nesterov: 101.6043 , 1.5E+10 , nan , nan

For a learning rate of .01:

ADAGRAD: 41.20542 , 35.09968 , 36.40806 , 33.60392

ADAM: 13.20862 , 26.06066 , 22.15719 , 23.21645

Nesterov: 38.48826 , 71.19455 , 52.33072 , $\textbf{25.16515}$

For a learning rate of .001:

ADAGRAD: 35.0094 , 33.44678 , 35.32433 , $\textbf{31.89558}$

ADAM: 19.09145 , 16.81667 , 13.19629 , $\textbf{18.54706}$

Nesterov: 38.1938, 15.957069, 21.710157, 26.298061

The bolded values mark the lowest final loss for each optimizer, and the corresponding learning rates were used when determining the regularization.
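The selection rule can be written out as a short sketch. It assumes that the last of the four values in each row is the final loss and that the lowest final loss wins, which matches the bolded entries; `nan` results are skipped. The names `final_loss` and `best_setting` are hypothetical.

```python
import math

# Final (fourth) loss value for each optimizer, keyed by learning rate.
final_loss = {
    "ADAGRAD": {1.0: 33.33678, 0.01: 33.60392, 0.001: 31.89558},
    "ADAM": {1.0: 76.29205, 0.01: 23.21645, 0.001: 18.54706},
    "Nesterov": {1.0: float("nan"), 0.01: 25.16515, 0.001: 26.298061},
}


def best_setting(losses):
    """Return the setting with the lowest final loss, ignoring diverged (nan) runs."""
    valid = {setting: loss for setting, loss in losses.items()
             if not math.isnan(loss)}
    return min(valid, key=valid.get)


best_learning_rate = {name: best_setting(losses)
                      for name, losses in final_loss.items()}
print(best_learning_rate)
```

Filtering out `nan` first matters because comparisons with `nan` are always false, so a diverged run could otherwise be picked by accident.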

## Regularization

Using the best learning rate for each optimizer, I then moved on to determining the best regularization.

.0001 regularization:

ADAGRAD: 35.0094 , 33.44678 , 35.32433 , 31.89558

ADAM: 19.09145 , 16.81667 , 13.19629 , 18.54706
 
Nesterov: 38.48826 , 71.19455 , 52.33072 , 25.16515

.001:

ADAGRAD: 38.40235 , 24.26172 , 24.5208 , 22.04238

ADAM: 50.26275 , 24.89122 , 19.50465 ,20.66568

Nesterov: 33.01615 , 22.5851 , 10.08508 , 27.59055

.01:

ADAGRAD: 32.16628 , 21.32779 , 18.086 , $\textbf{15.82132}$

ADAM: 58.83173 , 19.95081 , 23.7641 , 17.11846

Nesterov: 33.28561 , 30.74485 , 24.63548 ,  $\textbf{16.47052}$

.1:

ADAGRAD: 35.02432 , 21.66205 , 25.99742 , 18.49324

ADAM: 45.23226 , 13.65605 , 22.18794 , $\textbf{15.33831}$

Nesterov: 37.52027 , 20.29804 , 24.76661 , 29.77377

Again, the bolded values are the best.

## Batch Learning

With these best learning rates and regularization constants, I then compared a full batch of 48 to a half batch of 24.

Full batch:

ADAGRAD: 32.16628 , 21.32779 , 18.086 , $\textbf{15.82132}$

ADAM: 45.23226 , 13.65605 , 22.18794 , $\textbf{15.33831}$

Nesterov: 33.28561 , 30.74485 , 24.63548 ,  $\textbf{16.47052}$

Half batch:

ADAGRAD: 43.01466 , 25.42512 , 19.26644 , 18.02733

ADAM: 30.30286 , 24.31583 , 17.27347 , 21.92849

Nesterov: 49.44444 , 24.63725 , 21.94733 , 21.19691
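To put the gap between the two batch sizes in relative terms, the final loss values above can be compared directly. This is a small sketch that again assumes the last value in each row is the final loss; the variable names are hypothetical.

```python
# Final loss per optimizer, taken from the last value in each row above.
full_batch = {"ADAGRAD": 15.82132, "ADAM": 15.33831, "Nesterov": 16.47052}
half_batch = {"ADAGRAD": 18.02733, "ADAM": 21.92849, "Nesterov": 21.19691}

# Relative increase in final loss when moving from full batch to half batch.
relative_increase = {
    name: (half_batch[name] - full_batch[name]) / full_batch[name]
    for name in full_batch
}

for name, increase in relative_increase.items():
    print(f"{name}: {increase:.1%} higher final loss with half batch")
```

With only 2 epochs per run, these differences are a rough indication rather than a stable estimate.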


## Conclusion

Surprisingly, full batch outperformed half batch, but this could be due to the limited number of epochs.  Less surprisingly, stronger regularization (more smoothing) helped compared to weaker regularization, and smaller step lengths were better than larger ones.