# Lecture VIII: Two Major Problems in Deep Learning: Regularization and Vanishing Gradient


## Regularization

Sometimes a model has high variance: it fits the training data closely but generalizes poorly to new data. In this case, the neural network is probably overfitting. To reduce overfitting (high variance), we add one extra term to the definition of the cost function. The additional term penalizes large weights, so the fitted function fluctuates less and the coefficients do not take extreme values. For example, consider the cost function of logistic regression, which can be viewed as a neural network with a single unit:

\begin{equation}
J(\omega, b) = \frac{1}{m} \sum_{i=1}^{m} L(a, y^i) \tag{1}
\end{equation}

To avoid high variance, we re-define the cost function by adding a new term such that
\begin{equation}
J(\omega, b) = \frac{1}{m} \sum_{i=1}^{m} L(a, y^i) + \frac{\lambda}{2m} |\omega|^2 . \tag{2}
\end{equation}

This extra term is called **$L_2$-regularization**, and $\lambda$ is called the **regularization parameter**. It is a hyperparameter and must be chosen before training the neural network. In $L_2$-regularization, the term $|\omega|^2$ is the squared *Euclidean norm*:
\begin{equation}
|\omega|^2 = \sum_{j = 1}^{n_x} \omega_j^2 \tag{3}
\end{equation}

where $n_x$ is the number of components of $\omega$ (in logistic regression, the number of input features). In general, for a network with several layers, the redefined cost function is
\begin{equation}
J(\omega^{[1]}, b^{[1]}, \dots, \omega^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(a, y^i) + \frac{\lambda}{2m}\sum_{l = 1}^{L} |\omega^{[l]}|^2 , \tag{4}
\end{equation}

where $L$ is the number of layers and $|\omega^{[l]}|^2$ is the squared **Frobenius norm**:
\begin{equation}
|\omega^{[l]}|^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} (\omega_{ij}^{[l]})^2 . \tag{5}
\end{equation}
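As a minimal sketch of Eqs. (4) and (5), the snippet below adds the $L_2$ penalty to a data loss that has already been averaged over the $m$ examples. The function name `l2_regularized_cost`, the argument names, and the toy weight values are assumptions made for illustration; they are not part of the lecture.

```python
import numpy as np

def l2_regularized_cost(data_loss, weights, lam, m):
    """Add the L2 penalty of Eq. (4) to an already averaged data loss.

    data_loss: (1/m) * sum of the per-example losses
    weights:   list of weight matrices, one per layer
    lam:       regularization parameter lambda
    m:         number of training examples
    """
    # Squared Frobenius norm of each layer, Eq. (5): sum of squared entries.
    penalty = sum(np.sum(W ** 2) for W in weights)
    return data_loss + lam / (2 * m) * penalty

# Toy two-layer example (hypothetical values, chosen only for illustration).
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
cost = l2_regularized_cost(data_loss=0.4, weights=[W1, W2], lam=0.1, m=10)
```

Note that only the weights enter the penalty, as in Eq. (4); the biases are left out.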


There are also other ways to regularize the cost function. For example, another less common way is **$L_1$-regularization**. Here the extra term added to the cost function is $\frac{\lambda}{2m} |\omega|_1$, where $|\omega|_1 = \sum_{j = 1}^{n_x} |\omega_j|$ is the sum of the absolute values of the weights.
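The $L_1$ penalty can be sketched in the same way. This example assumes the standard definition of the $L_1$ norm as a sum of absolute values, and the helper name `l1_penalty` and the sample weights are hypothetical.

```python
import numpy as np

def l1_penalty(w, lam, m):
    """L1 penalty: lambda / (2m) times the sum of absolute weights."""
    # Absolute values are essential: without them, positive and
    # negative weights would cancel each other in the sum.
    return lam / (2 * m) * np.sum(np.abs(w))

# Hypothetical weight vector with mixed signs.
w = np.array([1.5, -1.5, 0.0, 2.0])
penalty = l1_penalty(w, lam=0.2, m=10)
```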

## Vanishing Gradient

In a very deep neural network, the gradient is sometimes so small (or extremely large) that the network becomes hard to train. The problem of small gradients is related to the activation functions applied at every layer of a deep neural network: saturating activation functions map a large input range into a small output range (for example, $\tanh(z)$ maps the real line to $(-1, 1)$). Therefore, away from zero, a large change in the input of the activation function causes only a small change in the output, so the derivative is small. Backpropagation multiplies these derivatives layer by layer, so the gradient reaching the early layers can shrink rapidly with depth.

If the neural network has a few layers, this is not a major problem. However, when more layers are used, it can be a major barrier to training the neural network.
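The effect of depth can be illustrated with a small sketch. It multiplies the $\tanh$ derivatives along a single path through the network and ignores the weight matrices, which is a simplification. The pre-activation value $z = 2.0$ at every layer and the function names are assumptions chosen for illustration.

```python
import numpy as np

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2: at most 1, and close to 0 for large |z|.
    return 1.0 - np.tanh(z) ** 2

def backprop_factor(z_values):
    """Product of the activation derivatives along one path through the layers."""
    return float(np.prod(tanh_derivative(np.asarray(z_values))))

# Hypothetical pre-activation z = 2.0 repeated at every layer.
shallow = backprop_factor([2.0] * 3)   # 3 layers
deep = backprop_factor([2.0] * 30)     # 30 layers
```

With three layers the factor is small but usable, while with thirty layers it is many orders of magnitude smaller, which is why the early layers of a deep network can receive almost no gradient signal.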

One way to mitigate the vanishing gradient problem is to initialize the weights randomly and scale them by the square root of a chosen variance. Let the variance of $\omega^{[l]}$ be

\begin{equation}
var(\omega^{[l]}) = \frac{1}{n^{[l-1]}} . \tag{6}
\end{equation}

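Thus, writing $random(\omega^{[l]})$ for a matrix of zero-mean, unit-variance random samples with the same shape as $\omega^{[l]}$, the random initial value of $\omega^{[l]}$ is defined as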

\begin{equation}
\omega^{[l]} = random(\omega^{[l]}) \cdot \sqrt{\frac{1}{n^{[l - 1]}}} . \tag{7}
\end{equation}
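Eq. (7) can be sketched in NumPy as follows. The sketch assumes that the random samples are standard normal and that $\omega^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$; neither choice is stated in the lecture. The function name `init_weights` and the layer sizes are hypothetical.

```python
import numpy as np

def init_weights(n_out, n_in, rng):
    """Draw standard-normal samples and rescale them as in Eq. (7).

    n_in plays the role of n^{[l-1]} and n_out the role of n^{[l]}.
    """
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

rng = np.random.default_rng(0)          # fixed seed for reproducibility
W = init_weights(n_out=400, n_in=500, rng=rng)
empirical_var = W.var()                 # should be close to 1 / 500
```

Multiplying unit-variance samples by $\sqrt{1/n^{[l-1]}}$ gives them variance $1/n^{[l-1]}$, which is exactly Eq. (6).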

If you use the ReLU activation function, it has been shown that the variance should instead be defined as
\begin{equation}
var(\omega^{[l]}) = \frac{2}{n^{[l-1]}} . \tag{8}
\end{equation}

Therefore, for the ReLU activation function, the initial value of $\omega^{[l]}$ is better defined as
\begin{equation}
\omega^{[l]} = random(\omega^{[l]}) \cdot \sqrt{\frac{2}{n^{[l - 1]}}} . \tag{9}
\end{equation}
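The difference between the two scalings can be checked with a rough experiment. The sketch below pushes random inputs through a stack of ReLU layers and measures the mean squared activation at the end. It assumes standard-normal samples, zero biases, and equal layer widths; the function name, width, depth, and batch size are hypothetical choices.

```python
import numpy as np

def mean_square_after_relu_layers(numerator, n=512, depth=10, seed=0):
    """Mean squared activation after `depth` ReLU layers of width n.

    numerator = 1.0 gives the scaling of Eq. (7), 2.0 that of Eq. (9).
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, 100))   # 100 random inputs with n features
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(numerator / n)
        a = np.maximum(0.0, W @ a)      # ReLU activation, biases set to zero
    return float(np.mean(a ** 2))

with_eq9 = mean_square_after_relu_layers(2.0)   # variance 2 / n
with_eq7 = mean_square_after_relu_layers(1.0)   # variance 1 / n
```

Under these assumptions the activations should keep roughly their original scale with the factor $2$, while with the factor $1$ their mean square is roughly halved at every layer, because ReLU sets about half of the pre-activations to zero.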

# Homework

Find other kinds of commonly used regularization terms.