# Training Neural Networks


## Types of Errors

1) Underfitting: the model oversimplifies the problem (error due to bias)

2) Overfitting: the model is overly complicated and fits noise in the training data (error due to variance)

<img src='typesoferrors.png'>

In practice, we can err on the side of an overly complex model and then apply techniques that compensate for the overfitting.

## Early Stopping

<img src='epocherror.png'>

<img src='modelcompgraph.png'>

We run gradient descent for as long as the testing (validation) error keeps decreasing. Once it starts to increase, the model has begun to overfit, so we stop training.
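The stopping rule can be sketched in a few lines. This is only an illustration: the function name, the `patience` parameter (how many epochs without improvement we tolerate), and the error values are assumptions made up for the example, not part of the notes above.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the index of the epoch whose weights we would keep.

    val_errors: testing (validation) error measured after each epoch.
    patience:   epochs without improvement tolerated before stopping (assumed knob).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Made-up errors: they fall, then rise once the model starts to overfit.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
print(early_stopping_epoch(errors))  # index of the lowest error before the rise
```

Waiting a few epochs before stopping is a design choice that avoids reacting to a single noisy measurement.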

## Regularization

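Regularization adds a penalty on large weights to the error function, so that gradient descent favors simpler models even when a more complex one would reduce the training error further.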

<img src='regularization.png'>

**General guidelines for deciding between L1 and L2 regularization:**

L1: we get sparse vectors (small weights go to zero), which means fewer features and a form of dimensionality reduction

L2: we get vectors with small, homogeneous weights

<img src='l1l2.png'>
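A small numeric sketch of why the two penalties behave differently. It assumes the penalty is a constant `lam` (a hypothetical name for the regularization strength) times the sum of absolute weights for L1, or the sum of squared weights for L2.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


sparse = [1.0, 0.0]   # one large weight, one zero
spread = [0.5, 0.5]   # same total, spread evenly

# L1 charges both vectors the same, so it does not discourage the sparse one.
print(l1_penalty(sparse, 1.0), l1_penalty(spread, 1.0))
# L2 charges the sparse vector more, so it pushes toward small, even weights.
print(l2_penalty(sparse, 1.0), l2_penalty(spread, 1.0))
```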

## Dropout

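During training, each node (activation) is left out of the neural net with some probability, for both forward and backward propagation. A different random set of nodes is dropped at each epoch, so the network cannot rely too heavily on any single node.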
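A minimal sketch of dropping activations at random. The function name and the rescaling of the surviving activations (often called "inverted dropout") are assumptions added for illustration; the notes above only describe the dropping itself.

```python
import random


def dropout(activations, drop_prob, rng=random):
    """Zero each activation with probability drop_prob.

    Returns the dropped activations and the mask, so the same mask
    can be reused when propagating gradients backward.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    # Assumption: rescale survivors so the expected activation is unchanged.
    dropped = [a * m / keep_prob for a, m in zip(activations, mask)]
    return dropped, mask


layer_output = [0.3, 1.2, 0.7, 0.9, 0.1]
dropped, mask = dropout(layer_output, drop_prob=0.2, rng=random.Random(0))
print(mask)
print(dropped)
```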

## Random Restart

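When using gradient descent, we may end up at a **local minimum**, depending on parameters such as the step size and the starting point.

We can avoid this to some degree, and increase the likelihood that we find the **global minimum**, by employing *Random Restart*: run gradient descent from several random starting points and keep the best result.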

<img src='RandomRestart.png'>
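As a rough sketch, here is random restart on a one-dimensional function chosen for illustration (it is not from the notes): `f(x) = x**4 - 3*x**2 + x` has a local minimum near x ≈ 1.13 and a global minimum near x ≈ -1.30. All names and parameter values are assumptions.

```python
import random


def f(x):
    return x**4 - 3 * x**2 + x


def grad_f(x):
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, lr=0.01, steps=1000):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


def random_restart(n_restarts=10, low=-2.0, high=2.0, seed=0):
    """Run gradient descent from several random starts and keep the best."""
    rng = random.Random(seed)
    starts = [rng.uniform(low, high) for _ in range(n_restarts)]
    candidates = [gradient_descent(x0) for x0 in starts]
    return min(candidates, key=f)


print(gradient_descent(1.5))  # a single run from here gets stuck in the local minimum
print(random_restart())       # the best of several runs
```

More restarts raise the chance that at least one start falls in the basin of the global minimum, at the cost of proportionally more computation.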

## Vanishing Gradient

With the sigmoid activation function, the derivative is small: at most 0.25, and close to zero far from the origin. Backpropagation multiplies these derivatives together layer by layer, so the gradient that reaches the early layers can be a tiny value. Gradient descent then takes tiny steps and, essentially, we may never arrive at the bottom.

<img src='VanGrad.png'>
<img src='VanDes.png'>
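The shrinking can be checked numerically. The sketch below multiplies sigmoid derivatives along a single path through the layers; it deliberately ignores the weights, which is a simplifying assumption, and the function names are made up for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along one path through the layers."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product


# Best case: every input to the sigmoid is 0, where its derivative peaks at 0.25.
for depth in (1, 5, 10):
    print(depth, chained_gradient([0.0] * depth))
```

Even in this best case the product is 0.25 raised to the number of layers, so it drops below one millionth by ten layers.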

## Other Activation Functions
**Note on Images: Origin is 0, not 0.5**

1) Hyperbolic Tangent:
$$\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$
<img src='tanh.png'>

2) ReLU (Rectified Linear Unit):
$$ \mathrm{relu}(x)= \begin{cases} 
      x & x \ge 0 \\
      0 & x < 0
   \end{cases}$$
<img src='relu.png'>

Now, with these different activation functions, we get larger derivatives (up to 1 for tanh, and exactly 1 for ReLU on positive inputs, versus at most 0.25 for the sigmoid), which lets gradient descent make progress:
<img src='BackPwDiffAFct.png'>

Here we can see a multi-layer perceptron with different activation functions indicated in the nodes, in particular ReLU in the hidden layers. Notice that the sigmoid is still used at the output, because we need a probability between zero and one at the end.
<img src='MultLayerwDiffAct.png'>
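To compare the derivatives directly, here is a small sketch using only the standard library. Setting the ReLU derivative to 0 at exactly `x = 0` is a common convention assumed here; the function names are illustrative.

```python
import math


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # The derivative is undefined at 0; using 0 there is an assumed convention.
    return 1.0 if x > 0 else 0.0


for x in (0.0, 0.5, 1.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

Near the origin the tanh derivative is several times larger than the sigmoid's, and the ReLU derivative stays at 1 for every positive input, no matter how large.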

## Batch vs Stochastic Gradient Descent

1) Batch Gradient Descent: run all of the data through the entire NN (forward prop), find the predictions (output), calculate the error (y-yhat), and then update the weights (backprop)
* Slow and computationally intensive, since every single step uses the whole dataset

2) Stochastic Gradient Descent: split the data into small subsets. For each subset, run it through the entire NN (forward prop), find the predictions (output), calculate the error (y-yhat), calculate the gradient of the error function based on those points only, and update the weights (backprop), moving one step in that direction. Then iterate over the remaining subsets.
* Each step is less accurate, so this depends on the data being well distributed, so that each subset gives a good idea of the overall gradient
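The two variants differ only in how much data feeds each weight update. The sketch below uses a one-weight linear model instead of a neural network to keep it short; the model, the toy data, and all names and parameter values are assumptions for illustration.

```python
import random


def gradient(w, b, batch):
    """Gradient of the mean squared error for y_hat = w * x + b on one batch."""
    n = len(batch)
    dw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2 * (w * x + b - y) for x, y in batch) / n
    return dw, db


def train(data, batch_size, epochs=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the subsets in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            dw, db = gradient(w, b, batch)
            w -= lr * dw
            b -= lr * db
    return w, b


# Toy data lying exactly on the line y = 2x + 1 (made up for illustration).
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
print(train(data, batch_size=len(data)))  # batch: one update per epoch
print(train(data, batch_size=4))          # small subsets: several updates per epoch
```

With `batch_size=len(data)` every update sees the whole dataset; with a small batch size the model gets many cheaper, noisier updates in the same number of epochs.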

## Learning Rate Decay

A large learning rate may overshoot the minimum, while a small learning rate may make the model take a very long time to converge.

<img src='BigSmallLearnRate.png'>

We can also update the learning rate as training goes forward, so that if the gradient is steep we take big steps, but if it is flat we take small steps.
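One simple way to change the learning rate over time is to multiply it by a constant factor at every step. This exponential schedule is just one possible choice, assumed here for illustration, and the names `initial_lr` and `decay_rate` are hypothetical. The example minimizes `f(x) = x**2`, whose gradient is `2x`.

```python
def descend(x, initial_lr, decay_rate, steps):
    """Gradient descent on f(x) = x**2 with an exponentially decayed learning rate."""
    for step in range(steps):
        lr = initial_lr * decay_rate ** step  # shrinks when decay_rate < 1
        x -= lr * 2 * x                       # gradient of x**2 is 2x
    return x


# Fixed rate of 1.0: every step jumps to the mirror point and never settles.
print(descend(5.0, initial_lr=1.0, decay_rate=1.0, steps=30))
# Same starting rate with decay: the steps shrink and x approaches 0.
print(descend(5.0, initial_lr=1.0, decay_rate=0.9, steps=30))
```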

## Momentum

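Instead of taking each step based only on the current gradient, we also add a fraction of the previous steps, weighted by a parameter $\beta$ between 0 and 1. The descent keeps some "momentum" from earlier steps, which can carry it past small humps instead of stopping at a local minimum.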

<img src='Momentum.png'>
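A minimal sketch of a momentum update, assuming the common form in which a running `velocity` accumulates past gradients and each older gradient is weighted by one more factor of `beta`. The function name and default values are illustrative.

```python
def momentum_descent(grad, x, lr=0.1, beta=0.9, steps=300):
    """Gradient descent where each step also carries a fraction of earlier steps."""
    velocity = 0.0
    for _ in range(steps):
        # The newest gradient counts fully; older ones fade by a factor beta each step.
        velocity = beta * velocity + grad(x)
        x -= lr * velocity
    return x


# Minimize f(x) = x**2, whose gradient is 2x, starting from x = 5.
print(momentum_descent(lambda x: 2 * x, 5.0))
```

With `beta=0` this reduces to plain gradient descent; values closer to 1 make earlier steps count for longer.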