# Train, Dev and Test sets

As previously mentioned, training a neural network is a highly iterative process. There are many hyperparameters that need to be chosen and adjusted iteratively. To make the most of the data and find good values for the hyperparameters, the dataset can be split into three sets:
- The training set. This is used to fit the algorithm's parameters (weights and biases) for a given choice of hyperparameters.
- The hold-out/cross-validation/development set. This is used to evaluate the algorithm while searching for the best hyperparameters. It is used many times throughout the iterative process of network training.
- The test set. Once the hyperparameters are chosen and the algorithm stabilizes, it is evaluated on this completely separate set.

So, the workflow is to train candidate algorithms on the training set, then evaluate and compare them on the development set. Once the final model is chosen, it is evaluated on the test set to get an unbiased estimate of its performance.
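As a rough illustration of this workflow, the sketch below uses polynomial regression on synthetic data instead of a neural network, with the polynomial degree standing in for a hyperparameter. The data, the 60/20/20 split, the candidate degrees, and the helper `mse` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (purely illustrative).
x = rng.uniform(-3.0, 3.0, size=600)
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)

# Shuffle once, then carve out train/dev/test (60/20/20 for this small set).
idx = rng.permutation(len(x))
train_idx, dev_idx, test_idx = idx[:360], idx[360:480], idx[480:]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Hyperparameter search: fit on the training set, compare on the dev set.
candidate_degrees = [1, 2, 3, 4, 5]
fitted = {}
dev_errors = {}
for degree in candidate_degrees:
    fitted[degree] = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    dev_errors[degree] = mse(fitted[degree], x[dev_idx], y[dev_idx])

best_degree = min(dev_errors, key=dev_errors.get)

# The test set is touched exactly once, for the final model only.
test_error = mse(fitted[best_degree], x[test_idx], y[test_idx])
print(f"best degree: {best_degree}, test MSE: {test_error:.4f}")
```

The dev set is consulted once per candidate, while the test set is used a single time at the end, which is what keeps the final estimate unbiased.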

## Train, dev, and test set sizes

Previously, a 70%/30% train/test split, or a 60%/20%/20% train/development/test split, was widely considered best practice. However, these ratios were applied to small datasets.

In the modern deep learning era, datasets are often extremely large. The development and test sets only need to be large enough to compare models and estimate performance reliably, so reserving a large fraction for them takes away data the model could learn from. Hence, for very large datasets, the split is usually closer to 98% for training, 1% for development, and 1% for testing.
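A minimal sketch of such a split with NumPy, assuming the data fits in memory as arrays and that 1% each goes to the development and test sets, with the remainder used for training. The function name `split_dataset` and its parameters are hypothetical.

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle and split arrays into train/dev/test parts.

    X has shape (m, n_features); y has shape (m,).
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)  # shuffle so each part follows the same distribution

    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    n_train = m - n_dev - n_test

    train_idx = perm[:n_train]
    dev_idx = perm[n_train:n_train + n_dev]
    test_idx = perm[n_train + n_dev:]
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])

# Toy data: 100,000 examples with 2 features each.
X = np.arange(200_000, dtype=np.float64).reshape(100_000, 2)
y = np.arange(100_000) % 2
(train_X, train_y), (dev_X, dev_y), (test_X, test_y) = split_dataset(X, y)
print(train_X.shape[0], dev_X.shape[0], test_X.shape[0])  # 98000 1000 1000
```

Shuffling before slicing is a simple way to make the three parts share a distribution when the examples were collected in some order.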

One important remark is to keep an eye on **the train/test set distributions**. Both sets should come from the same distribution. For example, if the network is trained on high-quality images, the test set should consist of high-quality images as well.

Finally, it may be sufficient to have only a training set and a development set (the latter is often loosely called a "test set" in this setup). In that case, however, the reported performance is biased, because the algorithm is evaluated on the same set that was used to compare it with other algorithms and hyperparameters.

# Bias/Variance

Bias and variance are rough measures that describe the performance of a learning algorithm. To illustrate, consider the following graphs from Andrew Ng's class:

![bias and variance](images/bias_variance.png)

- If the algorithm fits the data too simply, say with a linear fit, it will perform poorly even on the training set. In this case, it is said to have high bias. Another way to say this is that the algorithm underfits the data.
- If the algorithm learns an overly complex classifier to fit the data, it may fail to generalize. It will perform well on the training set but poorly on the development set. In this case, the algorithm is said to have high variance. Another way to say this is that the algorithm overfits the data.
- If the algorithm performs well on both the training and development sets, it is said to have low bias and low variance.
- In some cases, the algorithm may underfit most of the data and overfit some portions of it. The training error will then be quite high and the development error higher still. In this case, the algorithm is said to have high bias and high variance.

The above analysis assumes that the optimal error, for example, human classification error, is close to zero.
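The four cases above can be sketched as a comparison of the training and development errors. The function `diagnose`, the `tolerance` threshold, and the example error values are assumptions chosen for illustration; what counts as a "large" gap depends on the problem.

```python
def diagnose(train_error, dev_error, optimal_error=0.0, tolerance=0.02):
    """Label a model by comparing errors expressed as fractions in [0, 1].

    tolerance is a hypothetical threshold for what counts as a large gap.
    """
    # Bias: how far the training error is from the best achievable error.
    high_bias = (train_error - optimal_error) > tolerance
    # Variance: how much worse the dev error is than the training error.
    high_variance = (dev_error - train_error) > tolerance

    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias and low variance
```

Passing a nonzero `optimal_error` covers problems where even the best possible classifier makes mistakes, so a high training error is not automatically read as high bias.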

## Bias and variance analysis: a basic recipe for machine learning

This subsection provides a basic recipe to deal with the variance/bias problems.

- High bias:
  - Try a bigger network.
  - Train for longer.
  - Try a different network architecture.
- High variance:
  - Get more data.
  - Apply regularization.
  - Try a different network architecture.

### Bias/Variance tradeoff

Previously, there was a lot of discussion about the so-called bias/variance tradeoff: it was very hard to reduce one of them without hurting the other. In modern deep learning, however, plenty of data and a variety of off-the-shelf algorithms are available for many problems. A bigger network can usually reduce bias and more data can usually reduce variance, so the two are no longer strictly tied together.

# Regularization

One useful technique for reducing variance (overfitting) is to apply regularization to the neural network model. Regularization adds a penalty term to the cost function. This penalty then carries through to the weight updates. The penalty grows with the magnitude of the weights, so large weights are discouraged.

## Types of regularization

Regularization comes in several types, depending on the term added to the cost function. Each type introduces a hyperparameter $\lambda$ that controls the strength of the penalty. The term is usually scaled by $\frac{\lambda}{2m}$, where $m$ is the number of training examples.

- L2 norm regularization

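The added term is the squared $L_2$ norm of the weights, scaled by $\frac{\lambda}{2m}$. For a weight vector $w$:

$$\frac{\lambda}{2m} ||w||^2_2 = \frac{\lambda}{2m} \sum^{n_x}_{j=1} w_j^2 = \frac{\lambda}{2m} w^Tw$$

For a neural network, where each layer $l$ has a weights matrix $w^{[l]}$, the same penalty is applied to every element of each matrix. Although this type is widely known as $L_2$ regularization, by convention in linear algebra the matrix version is called the Frobenius norm. Its square is just the sum of the squared elements of the matrix.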

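This penalty adds $\frac{\lambda}{m}w^{[l]}$ to the gradient of each layer's weights. With $\partial w$ denoting the gradient obtained from backpropagation on the unregularized cost, the regularized gradient is $$\partial w + \frac{\lambda}{m}w^{[l]}$$

Hence, the final update of the weights matrix is: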

$$w^{[l]} = w^{[l]} - \alpha [\partial w + \frac{\lambda}{m} w^{[l]}]$$
$$w^{[l]} = w^{[l]} - \frac{\alpha \lambda}{m} w^{[l]} - \alpha (\partial w)$$
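The penalty and the update rule above can be sketched in NumPy as follows. The function names `l2_penalty` and `l2_update` and the toy values are assumptions for illustration; `dW` stands for the backpropagation gradient of the unregularized cost.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum of squared entries over all weight matrices."""
    return (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_update(W, dW, alpha, lam, m):
    """One gradient step: W - alpha * (dW + (lambda / m) * W)."""
    return W - alpha * (dW + (lam / m) * W)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
dW = np.zeros_like(W)  # pretend the data gradient is zero to isolate the penalty
alpha, lam, m = 0.1, 5.0, 10

print(l2_penalty([W], lam, m))          # 0.25 * 5.25 = 1.3125
print(l2_update(W, dW, alpha, lam, m))  # equals 0.95 * W
```

With a zero data gradient, the step simply multiplies the weights by $(1 - \frac{\alpha \lambda}{m})$, which is why $L_2$ regularization is often called weight decay.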

- L1 norm regularization

The added term is:

$$\frac{\lambda}{2m} ||w||_1 = \frac{\lambda}{2m} \sum^{n_x}_{j=1} |w_j|$$

This type of regularization makes the weights matrix sparse, meaning many weights end up exactly zero. It is claimed that such sparsity helps compress the model, since less memory is consumed. From a practical point of view, however, this benefit is often small.
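A minimal NumPy sketch of the $L_1$ term and its contribution to the gradient, assuming the same $\frac{\lambda}{2m}$ scaling as above. The function names are hypothetical, and `np.sign` is used as a subgradient because $|w|$ is not differentiable at zero.

```python
import numpy as np

def l1_penalty(weights, lam, m):
    """(lambda / 2m) * sum of absolute entries over all weight matrices."""
    return (lam / (2 * m)) * sum(np.sum(np.abs(W)) for W in weights)

def l1_subgradient(W, lam, m):
    """Subgradient of the L1 term; np.sign returns 0 where W is exactly 0."""
    return (lam / (2 * m)) * np.sign(W)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
lam, m = 5.0, 10

print(l1_penalty([W], lam, m))     # 0.25 * 3.5 = 0.875
print(l1_subgradient(W, lam, m))   # entries 0.25, -0.25, 0.25, 0.0
```

Unlike the $L_2$ case, the push toward zero has the same size for large and small weights, which gives some intuition for why small weights tend to be driven all the way to zero.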

Why do we only regularize the weights matrix but not the biases?

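Technically, the biases can be regularized as well. However, each bias is only a single parameter per unit, so the biases make up a small fraction of the parameters compared with the weights matrices, and regularizing them makes little practical difference.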