
# How to avoid over-fitting

## Number of units?
In a neural network, the numbers of input and output units are generally determined by the dimensionality of the data set, whereas the number $M$ of hidden units is a free parameter. We might expect an optimum value of $M$ that gives the best generalization performance, corresponding to the optimum balance between under-fitting and over-fitting.

The generalization error, however, is not a simple function of $M$, because of the presence of local minima in the error function.

## Regularizer
A general approach to avoiding over-fitting is to choose a relatively large value for $M$ and then control complexity by adding a regularization term to the error function. The simplest regularizer is the quadratic

$$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \tag{5.112}$$

This regularizer is also known as *weight decay*. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector $\mathbf{w}$.
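
The name "weight decay" can be made concrete with a small sketch. The gradient of (5.112) is $\nabla E(\mathbf{w}) + \lambda\mathbf{w}$, so each gradient-descent step shrinks the weights towards zero in addition to following the data gradient. The linear toy model, the helper names, and the values of `lam` and `eta` below are assumptions made for illustration only; they are not part of the original notes.

```python
import numpy as np

def regularized_error(w, error_fn, grad_fn, lam):
    """Return (E~(w), grad E~(w)) for E~(w) = E(w) + (lam / 2) * w^T w, as in (5.112)."""
    e_tilde = error_fn(w) + 0.5 * lam * (w @ w)
    grad_tilde = grad_fn(w) + lam * w   # the extra lam * w term "decays" the weights
    return e_tilde, grad_tilde

# Hypothetical toy problem: sum-of-squares error for a linear model y = X w.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

error_fn = lambda w: 0.5 * np.sum((X @ w - t) ** 2)
grad_fn = lambda w: X.T @ (X @ w - t)

def gradient_descent(lam, eta=0.01, steps=2000):
    """Plain gradient descent on the regularized error, starting from zero weights."""
    w = np.zeros(3)
    for _ in range(steps):
        _, g = regularized_error(w, error_fn, grad_fn, lam)
        w = w - eta * g
    return w

w_plain = gradient_descent(lam=0.0)
w_decay = gradient_descent(lam=10.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

With `lam > 0` the fitted weight vector has a smaller norm than the unregularized fit, which is the complexity control the regularizer is meant to provide.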


---------------
# Consistent Gaussian priors

One of the limitations of simple weight decay is that it is inconsistent with certain scaling properties of network mappings.

Linear rescaling of the input and target data is common when training neural networks. Here we consider a two-layer network for regression.

$$\begin{align*}
&\text{Hidden layer:} & z_j &=h\left(\sum_i w_{ji}x_i + w_{j0}\right) \tag{5.113}\\
&\text{Output layer:} & y_k &=\sum_j w_{kj}z_j + w_{k0} \tag{5.114}
\end{align*}$$

Data scaling:

$$\begin{align*}
&\text{Input data:} & x_i\rightarrow \tilde{x}_i &=ax_i+b \tag{5.115}\\
&\text{Output data:} & y_k\rightarrow \tilde{y}_k &=cy_k+d \tag{5.118}
\end{align*}$$

<font color='red'>If we want the network for the transformed data to represent the same mapping as the network for the original data</font>, we can apply a corresponding linear transformation to the weights and biases connected to the inputs and outputs, as follows.

$$\begin{align*}
&\text{Hidden layer:} & w_{ji}\rightarrow \tilde{w}_{ji} &=\frac{1}{a}w_{ji} \tag{5.116}\\
& &w_{j0}\rightarrow \tilde{w}_{j0} &= w_{j0} - \frac{b}{a}\sum_i w_{ji} \tag{5.117}\\
&\text{Output layer:} & w_{kj}\rightarrow \tilde{w}_{kj} &=cw_{kj} \tag{5.119}\\
& &w_{k0}\rightarrow \tilde{w}_{k0} &= cw_{k0} + d \tag{5.120}
\end{align*}$$
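
The weight transformations above can be checked numerically. The sketch below builds a small random two-layer network, applies (5.116), (5.117), (5.119) and (5.120), and compares the output on the transformed input with the transformed output of the original network. The layer sizes, the choice $h = \tanh$, and the constants `a`, `b`, `c`, `d` are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                                       # input, hidden, output sizes (arbitrary)
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)    # w_ji and w_j0
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)    # w_kj and w_k0

def forward(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)    # hidden layer (5.113), assuming h = tanh
    return W2 @ z + b2          # output layer (5.114)

a, b, c, d = 2.5, -1.0, 0.3, 4.0    # arbitrary scaling constants

# Transformed weights and biases.
W1_t = W1 / a                           # (5.116)
b1_t = b1 - (b / a) * W1.sum(axis=1)    # (5.117)
W2_t = c * W2                           # (5.119)
b2_t = c * b2 + d                       # (5.120)

x = rng.normal(size=D)
y = forward(x, W1, b1, W2, b2)
y_t = forward(a * x + b, W1_t, b1_t, W2_t, b2_t)

# The transformed network on the transformed input should give c * y + d.
print(np.allclose(y_t, c * y + d))
```

The check works because the first-layer transformation leaves the hidden activations $z_j$ unchanged, so only the output layer has to absorb the scaling by $c$ and the shift by $d$.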

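With no regularizer, this works well: if the original network minimizes the error on the original data, the transformed network minimizes the error on the transformed data. Simple weight decay (5.112), however, does not satisfy this consistency, because it treats all weights and biases on an equal footing.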

Here is a regularizer that is <font color='orange'>invariant to re-scaling</font> of the weights

$$\frac{\lambda_1}{2}\sum_{w\in \mathcal{W}_1}w^2 + \frac{\lambda_2}{2}\sum_{w\in\mathcal{W}_2} w^2 \tag{5.121}$$

where $\mathcal{W}_1$ denotes the set of weights in the first layer, $\mathcal{W}_2$ denotes the set of weights in the second layer, and <font color='blue'>biases are excluded from the summations</font>. Under the weight transformations (5.116) and (5.119), the first-layer sum of squares is multiplied by $1/a^2$ and the second-layer sum by $c^2$, so the regularizer remains unchanged provided the regularization parameters are rescaled using $\lambda_1\rightarrow a^{2}\lambda_1$ and $\lambda_2\rightarrow c^{-2}\lambda_2$.
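
A quick numerical check of this invariance, as a sketch under stated assumptions: dividing the first-layer weights by $a$ scales their squared sum by $1/a^2$, and multiplying the second-layer weights by $c$ scales theirs by $c^2$, so the check below compensates with $\lambda_1 \rightarrow a^2\lambda_1$ and $\lambda_2 \rightarrow \lambda_2/c^2$. The function name `layer_regularizer`, the weight shapes, and all numeric values are illustrative choices.

```python
import numpy as np

def layer_regularizer(W1, W2, lam1, lam2):
    """Evaluate (5.121). W1 and W2 hold weights only; biases are excluded."""
    return 0.5 * lam1 * np.sum(W1 ** 2) + 0.5 * lam2 * np.sum(W2 ** 2)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))    # first-layer weights w_ji
W2 = rng.normal(size=(2, 4))    # second-layer weights w_kj
lam1, lam2 = 0.1, 0.05
a, c = 2.5, 0.3                 # arbitrary input and output scale factors

W1_t, W2_t = W1 / a, c * W2     # weight transformations (5.116) and (5.119)

original = layer_regularizer(W1, W2, lam1, lam2)
rescaled = layer_regularizer(W1_t, W2_t, a ** 2 * lam1, lam2 / c ** 2)
unscaled = layer_regularizer(W1_t, W2_t, lam1, lam2)

# `rescaled` should match `original`; `unscaled` in general should not.
print(original, rescaled, unscaled)
```

Keeping the biases out of the sums matters here: the bias transformations (5.117) and (5.120) involve shifts, which no rescaling of $\lambda_1$ or $\lambda_2$ could compensate.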

The regularizer (5.121) corresponds to a prior of the form

$$p(\mathbf{w}|a_1, a_2)\propto \exp\left(-\frac{a_1}{2}\sum_{w\in\mathcal{W}_1}w^2 - \frac{a_2}{2}\sum_{w\in\mathcal{W}_2}w^2\right)\tag{5.122}$$

Here $a_1$ and $a_2$ are hyperparameters, not to be confused with the input scale factor $a$ above. Note that priors of this form are improper (they cannot be normalized) because the bias parameters are unconstrained. <font color='orange'>It is common to include separate priors for the biases having their own hyperparameters.</font>

More generally, we can consider priors in which the weights are divided into any number of groups $\mathcal{W}_k$ so that

$$p(\mathbf{w}) \propto \exp\left(-\frac{1}{2}\sum_{k} a_k\|\mathbf{w}\|_k^2\right) \qquad \text{where } \|\mathbf{w}\|_k^2 = \sum_{w_j\in\mathcal{W}_k} w_j^2 \tag{5.123,5.124}$$
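
As a minimal sketch, the negative logarithm of the grouped prior (5.123) is, up to an additive constant, a regularizer with one coefficient per group. The function name `neg_log_group_prior` and the example groups are hypothetical; each group is simply an array of the weights belonging to $\mathcal{W}_k$.

```python
import numpy as np

def neg_log_group_prior(groups, a):
    """Return (1/2) * sum_k a_k * ||w||_k^2, the negative log of (5.123) up to a constant.

    groups: list of arrays, one per weight group W_k
    a:      list of hyperparameters a_k, one per group
    """
    return 0.5 * sum(a_k * np.sum(np.square(w_k)) for w_k, a_k in zip(groups, a))

# Two illustrative groups with different hyperparameters.
groups = [np.array([1.0, 2.0]), np.array([[3.0]])]
a = [2.0, 0.5]
penalty = neg_log_group_prior(groups, a)
print(penalty)
```

Setting every $a_k$ to the same value $\lambda$ and including all weights recovers simple weight decay (5.112).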

---------------
# Early stopping

The error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set in order to obtain a network having good generalization performance.
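
The procedure can be sketched as a training loop that tracks the validation error and keeps the weights from the step where it was smallest. The `patience` parameter (how many steps to wait without improvement before giving up) is a practical assumption added here, not something the text prescribes; the polynomial toy problem and all numeric settings are likewise illustrative.

```python
import numpy as np

def train_with_early_stopping(w, grad_fn, val_error_fn,
                              eta=0.1, max_steps=5000, patience=200):
    """Gradient descent that returns the weights with the lowest validation error seen."""
    best_w, best_err, best_step = w.copy(), val_error_fn(w), 0
    for step in range(1, max_steps + 1):
        w = w - eta * grad_fn(w)               # one training step on the training error
        err = val_error_fn(w)                  # monitor the error on held-out data
        if err < best_err:
            best_w, best_err, best_step = w.copy(), err, step
        elif step - best_step >= patience:     # no improvement for `patience` steps
            break
    return best_w, best_err, best_step

# Hypothetical toy problem: degree-9 polynomial fit to noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
    return np.vander(x, 10, increasing=True), t   # columns 1, x, ..., x^9

X_train, t_train = make_data(15)
X_val, t_val = make_data(15)

grad_fn = lambda w: X_train.T @ (X_train @ w - t_train) / len(t_train)
val_error_fn = lambda w: 0.5 * np.mean((X_val @ w - t_val) ** 2)

w0 = np.zeros(10)
best_w, best_err, best_step = train_with_early_stopping(w0, grad_fn, val_error_fn)
print(best_step, best_err)
```

Returning `best_w` rather than the final weights is the design choice that matters: the loop may run past the minimum of the validation error before it can tell that the error has started to rise.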

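Early stopping exhibits behaviour similar to regularization using a simple weight-decay term. For a quadratic error function, the product of the number of training iterations and the learning rate plays the role of the reciprocal of the regularization parameter $\lambda$, so stopping earlier corresponds to stronger regularization.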
