> # Gradient Problem #

During training we may run into vanishing or exploding gradients: the gradients get smaller and smaller, or larger and larger, as backpropagation moves down to the lower layers. Either problem makes the model hard to train well.

One cause is the activation function. For example, the derivative of the sigmoid activation function lies between 0 and 0.25, and it approaches 0 as the input becomes large in either the positive or the negative direction. So when backpropagation runs, there is almost no gradient left to pass down through the network, and the lower layers barely train.

![logistic.png](attachment:logistic.png)
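The saturation described above can be checked numerically. The sketch below is only an illustration: it looks at the sigmoid derivative alone and ignores the weight terms that also appear in the chain rule, and the choice of 10 layers is an arbitrary assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(z)
print(grads)  # largest at z = 0, tiny for large |z|

# Best case for the activation alone: every layer sits exactly at z = 0,
# so each layer contributes a factor of 0.25 to the backpropagated gradient.
n_layers = 10  # arbitrary depth chosen for illustration
best_case_factor = sigmoid_grad(0.0) ** n_layers
print(best_case_factor)
```

Even in this best case the factor contributed by the activations shrinks geometrically with depth, which is why the lower layers receive so little gradient.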

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import warnings; warnings.filterwarnings("ignore")

- ### Glorot and He Initialization. ###

A paper by Glorot and Bengio proposes a way to alleviate the gradient problem. The signal must flow properly in both directions: forward when making predictions and backward when backpropagating gradients. The authors argue that the variance of each layer's outputs should equal the variance of its inputs, and that the gradients should have the same variance before and after flowing through a layer. Strictly speaking, both conditions cannot be guaranteed unless the layer has the same number of input and output connections (called fan-in and fan-out). But there is a good compromise: randomly initialize the connection weights of each layer according to the formulas below. This is called Glorot initialization.

$ fan_{avg} = (fan_{in}+fan_{out})\ / \ 2 $

$ \mathsf{A\ normal\ distribution\ with} \quad mean=0,\ variance=\frac{1}{fan_{avg}} $
$ \mathsf{Or\ a\ uniform\ distribution\ between} \quad -r\ and\ +r, \ r=\sqrt{\frac{3}{fan_{avg}}}$

If we replace $fan_{avg}$ with $fan_{in}$, it is called LeCun initialization.

Other papers proposed similar strategies for other activation functions. The initialization strategy for the ReLU function is called He initialization, and LeCun initialization should be used with the SELU function.
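To see why the scale of the initial weights matters, here is a minimal NumPy sketch that pushes standardized inputs through a stack of tanh layers and measures how many outputs end up saturated. The Glorot and LeCun variances follow the formulas above; the He variance of $2/fan_{in}$ is not stated in the text and is included as the commonly cited value. The `naive` scheme, the layer width, and the 0.99 saturation threshold are hypothetical choices made only for this comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_normal(fan_in, fan_out, scheme):
    """Sample a (fan_in, fan_out) weight matrix from a zero-mean normal."""
    if scheme == "glorot":
        variance = 1.0 / ((fan_in + fan_out) / 2.0)
    elif scheme == "lecun":
        variance = 1.0 / fan_in
    elif scheme == "he":
        variance = 2.0 / fan_in   # assumption: commonly cited He variance
    elif scheme == "naive":
        variance = 1.0            # hypothetical baseline: unit variance
    else:
        raise ValueError(scheme)
    return rng.normal(0.0, np.sqrt(variance), size=(fan_in, fan_out))

def saturated_fraction(scheme, n_layers=10, width=256, n_samples=512):
    """Fraction of tanh outputs with |a| > 0.99 in the last layer."""
    a = rng.normal(size=(n_samples, width))  # standardized inputs
    for _ in range(n_layers):
        a = np.tanh(a @ init_normal(width, width, scheme))
    return float(np.mean(np.abs(a) > 0.99))

naive = saturated_fraction("naive")
glorot = saturated_fraction("glorot")
print(f"naive: {naive:.2f}, glorot: {glorot:.2f}")
```

With unit-variance weights most units sit in the flat part of tanh, where the gradient is close to 0, while Glorot initialization keeps almost all of them in the responsive range.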

By default, Keras uses Glorot initialization with a uniform distribution. We can switch to He initialization by setting `kernel_initializer` to **he_uniform** or **he_normal**.

In [2]:
keras.layers.Dense(10, activation='relu', kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x1c4f874ee88>

If we want He initialization with a uniform distribution based on $fan_{avg}$ rather than $fan_{in}$, we can use the VarianceScaling initializer as below.

In [3]:
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)

<tensorflow.python.keras.layers.core.Dense at 0x1c4ff110b08>
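The arithmetic behind the `scale`, `mode`, and `distribution` arguments can be sketched in plain NumPy. This is not Keras's implementation; it is a sketch that assumes the uniform range is chosen so that the weight variance equals `scale / n`, where `n` is the fan selected by `mode`. The function name `variance_scaling_uniform` and the layer sizes are hypothetical.

```python
import numpy as np

def variance_scaling_uniform(fan_in, fan_out, scale=1.0, mode="fan_avg", rng=None):
    """Uniform weights on [-limit, limit] whose variance is scale / n."""
    rng = np.random.default_rng() if rng is None else rng
    n = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2.0,
    }[mode]
    # A uniform distribution on [-r, r] has variance r**2 / 3,
    # so r = sqrt(3 * scale / n) gives variance scale / n.
    limit = np.sqrt(3.0 * scale / n)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
w = variance_scaling_uniform(300, 100, scale=2.0, mode="fan_avg", rng=rng)
target_variance = 2.0 / ((300 + 100) / 2.0)
print(w.var(), target_variance)
```

Under this assumption, `scale=1` with `fan_avg` corresponds to the Glorot formula, `scale=1` with `fan_in` to LeCun, and `scale=2` to the He-style variants.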

In the past, the sigmoid function was thought to be the best choice. But other activation functions work better in deep neural networks. In particular, ReLU has the big advantage that it does not saturate for positive values. However, ReLU is not perfect: it suffers from the dying ReLU problem, in which some neurons output nothing but 0. In some cases, especially with a large learning rate, half of the neurons may die. A neuron dies when its weights are updated in such a way that the weighted sum of its inputs is negative for every sample in the training set. The gradient of ReLU is 0 for negative inputs, so gradient descent can no longer change that neuron.
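A dead neuron can be reproduced with a few lines of NumPy. The weights and the strongly negative bias below are hypothetical values standing in for the result of a bad update; the inputs are random standardized samples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 standardized training samples

# Hypothetical parameters after a bad update: the bias is strongly negative.
w = np.array([0.1, -0.2, 0.05, 0.3, -0.1])
b = -10.0

z = X @ w + b                             # weighted sum for every sample
output = np.maximum(0.0, z)               # ReLU
relu_grad = (z > 0).astype(float)         # dReLU/dz: 1 if z > 0, else 0

# Any upstream gradient is multiplied by relu_grad before it reaches w,
# so if relu_grad is 0 for every sample the weights receive no update.
grad_w = relu_grad @ X

print(output.max(), np.abs(grad_w).max())
```

Because the weighted sum is negative for every sample, both the output and the gradient with respect to the weights are exactly 0, and the neuron cannot recover through gradient descent.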

To solve this problem, we can use a variant of ReLU such as LeakyReLU.

$ \mathsf{LeakyReLU_\alpha}(z)=\max(\alpha z,z)$

Generally $\alpha$ is set to 0.01. This small slope keeps LeakyReLU neurons from dying: a neuron may fall into a long coma, but it always has a chance of waking up again. RReLU (where $\alpha$ is picked randomly within a given range) also works well. PReLU (where $\alpha$ is learned during training) works well on large datasets, but it risks overfitting on small ones.

![leaky.png](attachment:leaky.png)
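The formula translates directly into NumPy. A minimal sketch, assuming $\alpha < 1$ (so that $\max(\alpha z, z)$ selects $\alpha z$ for negative inputs) and taking the gradient at exactly $z = 0$ to be $\alpha$ by convention:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # For alpha < 1, max(alpha * z, z) is z when z >= 0 and alpha * z otherwise.
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # Slope is 1 for positive inputs and alpha otherwise, so it is never 0.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-100.0, -1.0, 0.5, 3.0])
print(leaky_relu(z))
print(leaky_relu_grad(z))
```

The small but nonzero slope on the negative side is what gives a silent neuron a chance to receive updates again.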

The ELU function was proposed later. In the authors' experiments, it outperformed the other ReLU variants.

$ \mathsf{ELU_\alpha}(z) =  \alpha(\exp(z)-1) \quad \mathsf{if} \ z<0$

$ \mathsf{ELU_\alpha}(z) =  z \quad \mathsf{if} \ z\geq 0$

![elu.png](attachment:elu.png)

Because the function takes negative values when z < 0, the mean output of the activation function is closer to 0, which mitigates the vanishing gradient problem. The hyperparameter $\alpha$ defines the value that ELU approaches when z is a large negative number: the output tends to $-\alpha$. And the gradient is nonzero for z < 0, so ELU does not produce dead neurons. The major disadvantage of ELU is that it is slower to compute than ReLU and its variants. Faster convergence during training can compensate for this, but at test time an ELU network is still slower than a ReLU network.
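These properties are easy to check numerically. Below is a minimal NumPy sketch of ELU and its derivative; $\alpha = 1$ is an assumed default, and the `np.minimum` call is only there to avoid overflow warnings from `exp` on large positive inputs.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # exp is only meaningful on the negative branch; clip to avoid overflow.
    negative_branch = alpha * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return np.where(z >= 0, z, negative_branch)

def elu_grad(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # Derivative: 1 for z >= 0, alpha * exp(z) for z < 0.
    return np.where(z >= 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.array([-20.0, -1.0, 0.0, 2.0])
print(elu(z))        # approaches -alpha for large negative z
print(elu_grad(z))   # small but positive for z < 0
```

The extra `exp` call on the negative branch is also where the additional computation cost compared with ReLU comes from.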

There is also the SELU function, a scaled variant of ELU. Its authors showed that if we build a neural network from fully connected layers only and use the SELU function, the network self-normalizes: the output of each layer tends to keep a mean of 0 and a variance of 1, which prevents vanishing and exploding gradients. But some conditions must hold for self-normalization to happen:

1. The input features must be standardized (mean 0, standard deviation 1).
2. The weights of all hidden layers must be initialized with LeCun initialization.
3. The network architecture must be sequential, that is, a plain stack of layers.
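Self-normalization can be observed in a small NumPy experiment that satisfies the three conditions. This is a sketch under stated assumptions: the two SELU constants are the values published with the SELU paper, biases are left at zero, and the sample count, layer width, and depth are arbitrary choices.

```python
import numpy as np

# Constants published with the SELU paper (assumed here, not derived).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # Scaled ELU; np.minimum avoids overflow in exp for large positive z.
    negative_branch = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z > 0, z, negative_branch)

rng = np.random.default_rng(0)
n_samples, width, n_layers = 500, 100, 50

a = rng.normal(size=(n_samples, width))          # condition 1: standardized inputs
for _ in range(n_layers):                        # condition 3: plain stack of layers
    # condition 2: LeCun normal initialization, variance 1 / fan_in
    W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
    a = selu(a @ W)

print(f"mean={a.mean():.2f}, std={a.std():.2f}")
```

After 50 layers the activations should still have a mean near 0 and a standard deviation near 1, instead of shrinking toward 0 or blowing up.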

To use LeakyReLU, create a LeakyReLU layer and add it right after the layer we want to apply it to.

In [5]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2)
])

To use SELU, set the activation to selu and the kernel initializer to lecun_normal when creating the layer.

In [6]:
layer = keras.layers.Dense(10, activation='selu', kernel_initializer='lecun_normal')

- ### Batch Normalization ###

Some activation functions reduce the risk of vanishing or exploding gradients at the beginning of training. But there is no guarantee that the problems will not come back later in training.

To address this, researchers proposed batch normalization. This method adds an operation just before or after the activation function of each layer. The operation zero-centers and normalizes its inputs, then scales and shifts the result using two new parameter vectors per layer: one for scaling and the other for shifting. If we add a batch normalization layer as the first layer of the model, we don't need to standardize the training set ourselves.

To zero-center and normalize the inputs, the algorithm has to estimate the mean and standard deviation of each input. It does so over the current mini-batch:

- $\mu_B=\frac{1}{m_B}\sum_{i=1}^{m_B}\mathbf{x}^{i}$

- $\sigma_B ^2 = \frac{1}{m_B}\sum_{i=1}^{m_B}(\mathbf{x}^{i}-\mu_B)^2$

- $\hat{\mathbf{x}}^i = \frac{\mathbf{x}^i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}$

- $z^i=\gamma\otimes\hat{\mathbf{x}}^i+\beta$

$\mu_B$ is the vector of input means over mini-batch B. $\sigma_B$ is the vector of input standard deviations over mini-batch B. $m_B$ is the number of samples in the mini-batch. $\hat{\mathbf{x}}^i$ is the zero-centered and normalized input vector for sample i. $\gamma$ is the output scale parameter vector of the layer. $\otimes$ is element-wise multiplication. $\beta$ is the output shift parameter vector of the layer; each input is shifted by its corresponding parameter. $\epsilon$ is a tiny number that keeps the denominator from becoming 0. $z^i$ is the output of the batch normalization operation.
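The four equations can be sketched as a single NumPy function for the training-time forward pass. The function name `batch_norm_train`, the value `eps=1e-5`, and the example batch are assumptions made for illustration; rows of `X` are samples and columns are inputs.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Batch normalization forward pass over one mini-batch."""
    mu = X.mean(axis=0)                      # mu_B: mean of each input
    var = X.var(axis=0)                      # sigma_B^2, with the 1/m_B factor
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center and normalize
    Z = gamma * X_hat + beta                 # element-wise scale and shift
    return Z, mu, var

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # mini-batch of 64 samples
gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)

Z, mu, var = batch_norm_train(X, gamma, beta)
print(Z.mean(axis=0), Z.std(axis=0))
```

Whatever the mean and spread of the incoming batch, the outputs end up with mean $\beta$ and standard deviation close to $\gamma$, which is what lets the layer learn the best scale and offset for each input.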

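At test time, we may have to make predictions for a single sample or a very small batch, so there is no reliable way to compute the mean and standard deviation of the inputs. One option is to wait until training is finished, pass the whole training set through the network, and compute the mean and standard deviation of each input of every batch normalization layer.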
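That option can be sketched as follows. The array `layer_inputs` is a random stand-in for the inputs that one batch normalization layer would see when the whole training set is passed through the network; the function name, `eps`, and the parameter values are hypothetical.

```python
import numpy as np

def batch_norm_inference(x, mu, var, gamma, beta, eps=1e-5):
    """Apply batch normalization with fixed, precomputed statistics."""
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
# Stand-in for the layer's inputs over the whole training set.
layer_inputs = rng.normal(loc=5.0, scale=3.0, size=(10_000, 4))

# Statistics computed once, after training, over all training samples.
final_mu = layer_inputs.mean(axis=0)
final_var = layer_inputs.var(axis=0)

gamma, beta = np.ones(4), np.zeros(4)
single_sample = np.array([5.0, 5.0, 5.0, 5.0])

z = batch_norm_inference(single_sample, final_mu, final_var, gamma, beta)
print(z)
```

Because the statistics are fixed, the same sample always gives the same output, no matter which other samples are predicted alongside it.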