# Training Neural Networks

## Testing
- Split the data into a training set and a testing set.
- Evaluate results on the testing set.
- Prefer a simple model over a complicated model that is only marginally better.
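
A minimal sketch of the split, using only the standard library. The helper name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions for illustration, not something the notes prescribe.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled items are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

points = list(range(10))
train, test = train_test_split(points)
print(len(train), len(test))  # 8 2
```

The model is fitted on `train` only, and `test` is touched only to evaluate the result.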

## Overfitting and underfitting
- Types of errors
  - Underfitting
    - The model is too simple for a complicated problem.
    - Error due to bias.
  - Overfitting
    - The model is overly specific and fails to generalize.
    - Error due to variance.
- It is hard to find the right network structure.
- Prefer to err on the side of a more complicated model, and use various techniques to prevent overfitting.

### Early stopping
- Run gradient descent until the testing error stops decreasing and starts to increase, then stop training.

<img src="images/early-stopping.png">
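
The stopping rule can be sketched on a list of recorded errors. The error values below are hypothetical, and the function name `early_stopping_epoch` is an assumption. Real error curves are noisy, so practical implementations often wait a few epochs before stopping; this sketch stops at the first increase.

```python
def early_stopping_epoch(test_errors):
    """Return the index of the last epoch before the testing error first increases."""
    for epoch in range(1, len(test_errors)):
        if test_errors[epoch] > test_errors[epoch - 1]:
            return epoch - 1
    # The error never increased, so the last epoch is the best one seen.
    return len(test_errors) - 1

# Hypothetical testing errors recorded after each epoch.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.50]
best = early_stopping_epoch(errors)
print(best, errors[best])  # 3 0.4
```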

### Regularization
- Large coefficients lead to overfitting.

<img src="images/active-function-too-certain.png">

<img src="images/russell-quote.png">

- Penalize large weights by adding a term that depends on the weights to the error function.
- Original error function
$$ E = - \dfrac{1}{m} \sum_{i=1}^{m}{\left[ (1-y_i)\ln(1-\hat{y}_i) + y_i \ln(\hat{y}_i) \right]} $$
- L1 regularization
$$ \lambda(\sum_{i=1}^{n}{|w_i|})$$
  - Add the sum of the absolute values of the weights.
  - Tends to produce sparse weight vectors.
  - Small weights tend to go to zero.
  - Example: $(1, 0, 0, 1, 0)$
  - Good for feature selection (reduces the number of nonzero weights).

- L2 regularization
$$ \lambda(\sum_{i=1}^{n}{w_i^2})$$
  - Add the sum of the squared values of the weights.
  - Tends to keep all the weights homogeneously small.
  - Example: $(0.5, 0.3, -0.2, -0.4, 0.1)$
  - Usually the better choice for training models.

- $\lambda$ determines how much to penalize the coefficients.
- Compare two vectors: $(1, 0)$ and $(0.5, 0.5)$. For $(1, 0)$, both the L1 and L2 penalties equal $1$ (taking $\lambda = 1$). For $(0.5, 0.5)$, the L1 penalty is $0.5 + 0.5 = 1$, but the L2 penalty is $0.5^2 + 0.5^2 = 0.5$. So L2 prefers the latter vector, while L1 is indifferent between the two.
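
The comparison of the two vectors can be checked numerically. This is a small sketch of the two penalty terms only (not the full error function); the names `l1_penalty`, `l2_penalty`, and the default `lam=1.0` are assumptions for illustration.

```python
def l1_penalty(weights, lam=1.0):
    """lambda times the sum of absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=1.0):
    """lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

for w in [(1, 0), (0.5, 0.5)]:
    print(w, l1_penalty(w), l2_penalty(w))
# (1, 0) 1.0 1.0
# (0.5, 0.5) 1.0 0.5
```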

### Dropout
- Sometimes one part of the network has very large weights, dominating all the training.
- In each epoch, randomly turn off some of the nodes. The other parts of the network have to pick up the slack and take a larger part in the training.
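
Turning nodes off can be sketched as zeroing each activation with some probability. The function name `dropout`, the `drop_prob` parameter, and the sample activations are assumptions for illustration. Many implementations also rescale the surviving activations, which this sketch leaves out.

```python
import random

def dropout(activations, drop_prob=0.2, rng=random):
    """Zero each activation independently with probability `drop_prob` (training only)."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

rng = random.Random(0)
hidden = [0.7, 1.3, 0.2, 2.1, 0.9]  # hypothetical hidden-layer outputs
dropped = dropout(hidden, drop_prob=0.4, rng=rng)
print(dropped)
```

At evaluation time no nodes are turned off, which corresponds to `drop_prob=0`.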

## Local minima
- Gradient descent can get stuck at a local minimum.

### Random restart
- Perform gradient descent from multiple random starting points, and keep the best result.
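
A small sketch of the idea on a one-dimensional function. The function `f`, its starting range $[-2, 2]$, the learning rate, and the number of restarts are all assumptions chosen for illustration; `f` has a local minimum near $x = 1.13$ and a lower one near $x = -1.30$.

```python
import random

def f(x):
    # Hypothetical 1-D error surface with one local and one global minimum.
    return x**4 - 3 * x**2 + x

def gradient(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

def random_restart(n_restarts=20, seed=0):
    """Run gradient descent from several random starts and keep the lowest result."""
    rng = random.Random(seed)
    candidates = [gradient_descent(rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    return min(candidates, key=f)

stuck = gradient_descent(2.0)  # a single run from x = 2 ends in the local minimum
best = random_restart()        # several runs; the lowest end point is kept
print(round(stuck, 2), round(best, 2))
```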

### Momentum
- When stuck at a local minimum, move with momentum to get over the hump to look for a lower minimum.
- Use a weighted sum of the previous steps, so that in practice only the last 3 or 4 contribute noticeably.
- Momentum
  - $\beta$: a constant in the range $[0, 1]$.
  - $\text{step}(n) \rightarrow \text{step}(n) + \beta \, \text{step}(n-1) + \beta^2 \, \text{step}(n-2) + \dots$
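
The weighted sum does not need the whole history: keeping one running value and multiplying it by $\beta$ at every step gives the same result. A minimal sketch, where the function name `momentum_steps`, the value $\beta = 0.5$, and the sample steps are assumptions for illustration:

```python
def momentum_steps(steps, beta=0.5):
    """Replace each step with step(n) + beta*step(n-1) + beta^2*step(n-2) + ..."""
    velocity = 0.0
    result = []
    for step in steps:
        # Multiplying the running value by beta ages every earlier step by one power.
        velocity = beta * velocity + step
        result.append(velocity)
    return result

print(momentum_steps([1.0, 1.0, 0.0, 0.0]))  # [1.0, 1.5, 0.75, 0.375]
```

The last two plain steps are zero, as they would be at a flat spot, yet the momentum steps are still nonzero, which is what carries the search over the hump.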

## Vanishing gradient
- The derivative tells us in which direction to move.
- When the gradient is too small, each epoch makes very little progress.
  - The sigmoid function gets flat at the far left and far right.
  - Its derivative there is close to zero.

<img src="images/flat-sigmoid-function.png">

- Solution: use other activation functions.
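
To see how quickly the sigmoid flattens, its derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be evaluated at a few points. The sample points are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 5.0, 10.0):
    print(x, sigmoid_derivative(x))
```

The derivative peaks at $0.25$ for $x = 0$ and is already below $0.01$ at $x = 5$.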

### Hyperbolic tangent function
  - $ \tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}} $
  - The range is $(-1, 1)$, and the derivative is larger than the sigmoid's (up to $1$ at $x = 0$, compared with $0.25$).
  <img src="images/hyperbolic-tangent-function.png">

### Rectified linear unit (ReLU)
  - $ \text{relu}(x) = \max(0, x) $, that is, $x$ if $x \geq 0$ and $0$ otherwise.
  - Can improve the training significantly without sacrificing much accuracy.
  <img src="images/relu-function.png">
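
The derivatives of the two alternatives can be compared directly. This is a sketch; the helper names are assumptions, and the value of the ReLU derivative at exactly $x = 0$ is a convention (set to $0$ here).

```python
import math

def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return x if x >= 0 else 0.0

def relu_derivative(x):
    # 1 for positive inputs, 0 for negative inputs; 0 at x == 0 by convention here.
    return 1.0 if x > 0 else 0.0

for x in (0.5, 5.0, 10.0):
    print(x, tanh_derivative(x), relu_derivative(x))
```

For large positive inputs the tanh derivative still shrinks toward zero, while the ReLU derivative stays at $1$.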

## Batch vs stochastic gradient descent
- Batch: run through all data points in each epoch.
- Stochastic: run only a subset of data points.
  - Split the data into several batches.
  - Each gradient descent step runs through only one batch, so the weights are updated several times per pass over the data.
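
A minimal sketch of the stochastic variant, fitting a single slope $w$ in $y = wx$ with a squared-error loss. The toy data, the helper names `make_batches` and `sgd_fit_slope`, and all hyperparameters are assumptions for illustration.

```python
import random

def make_batches(data, batch_size, seed=0):
    """Shuffle `data` and split it into batches of at most `batch_size` items."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def sgd_fit_slope(xs, ys, batch_size=2, learning_rate=0.05, epochs=50):
    w = 0.0
    pairs = list(zip(xs, ys))
    for epoch in range(epochs):
        # Reshuffle every epoch, then update the weight once per batch.
        for batch in make_batches(pairs, batch_size, seed=epoch):
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

batches = make_batches(range(10), batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]

w = sgd_fit_slope([1, 2, 3, 4], [2, 4, 6, 8])
print(round(w, 3))  # 2.0
```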

## Learning rate decay
- With a high learning rate, each step is large and may overshoot the minimum, making training chaotic.
- With a low learning rate, each step is small, so training is slow.
- If the model is not working, decrease the learning rate.
- Rule
  - If steep: large steps.
  - If flat: small steps.
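
One simple way to decrease the learning rate over time is to multiply it by a constant factor every epoch. This exponential schedule is an assumption chosen for illustration (the notes do not prescribe a particular schedule), as are the function name and the sample values.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Exponential decay: shrink the rate by the factor `decay` every epoch."""
    return initial_rate * (decay ** epoch)

for epoch in (0, 1, 2, 10):
    print(epoch, decayed_learning_rate(0.1, 0.5, epoch))
```

Early epochs take large steps, and later epochs take small steps near the minimum.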