# Avoiding Overfitting through Regularization

Deep neural networks have many parameters, which gives them incredible freedom and flexibility. However, this flexibility also makes them prone to overfitting the training set, so we need regularization techniques.

We have already studied early stopping and Batch Normalization (which was not designed to fight overfitting, but also acts as a regularizer).

## $l_1$ and $l_2$ Regularization

You can use $l_2$ regularization to constrain a network's connection weights, or $l_1$ regularization if you want a sparse model (one with many weights equal to 0).

```python
layer = keras.layers.Dense(100, activation='elu',
                          kernel_initializer='he_normal',
                          kernel_regularizer=keras.regularizers.l2(0.01))
```

The regularizer computes a regularization loss at every training step, which is then added to the final loss.

Use `keras.regularizers.l1()` for $l_1$ regularization, or `keras.regularizers.l1_l2()` if you want both $l_1$ and $l_2$ regularization.
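
As a rough sketch of what these penalties add to the loss, the snippet below computes them by hand in plain Python: the $l_1$ penalty is the regularization factor times the sum of absolute weights, and the $l_2$ penalty is the factor times the sum of squared weights. The helper names `l1_penalty` and `l2_penalty` and the tiny `kernel` matrix are made up for illustration and are not part of Keras.

```python
def l1_penalty(weights, factor):
    """Sum of absolute weight values, scaled by the regularization factor."""
    return factor * sum(abs(w) for row in weights for w in row)


def l2_penalty(weights, factor):
    """Sum of squared weight values, scaled by the regularization factor."""
    return factor * sum(w * w for row in weights for w in row)


# Hypothetical 2x2 kernel, chosen only for illustration.
kernel = [[0.5, -1.0],
          [2.0, 0.0]]

l1_loss = l1_penalty(kernel, 0.01)   # 0.01 * (0.5 + 1.0 + 2.0 + 0.0)
l2_loss = l2_penalty(kernel, 0.01)   # 0.01 * (0.25 + 1.0 + 4.0 + 0.0)
print(l1_loss, l2_loss)
```

Note how the large weight (2.0) dominates the $l_2$ penalty, while the $l_1$ penalty grows linearly with every weight, which is what pushes small weights all the way to 0.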

If you want to apply the same activation function, regularizer, and initializer to every layer, you can avoid repeating the arguments by using Python's `functools.partial()` function:

```python
from functools import partial

RegularizedDense = partial(keras.layers.Dense, 
                           activation='elu',
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation='softmax',
                    kernel_initializer='glorot_uniform')
])
```

## Dropout

*Dropout* is one of the most popular regularization techniques and is used by default in many recent deep neural networks.

It is a fairly simple algorithm: at every training step, every neuron (including input neurons, but excluding output neurons) has a probability $p$ of being temporarily "dropped out", meaning it is ignored during this training step but may be active during the next one. The hyperparameter $p$ is called the *dropout rate*, and it is usually set between 10% and 50%: closer to 40-50% in CNNs, and 20-30% in RNNs. After training, neurons are no longer dropped.

It works because neurons trained with dropout cannot rely on their neighboring neurons: each one has to be as useful as possible on its own. They also cannot depend too heavily on a few input neurons and must pay attention to all of their inputs, which makes them less sensitive to slight changes in the inputs. The result is a more robust network that generalizes better.

Another way to understand the power of dropout is that a different network is generated at each training step. So if you train for 10,000 steps, you have effectively trained 10,000 different networks. These networks are not independent, since they share many of their weights, but they are all different. The final network can be seen as an averaging ensemble of all of them.

One technical detail: after training, when dropout is off, each neuron receives input from more neurons (on average) than it did during training, so its total input signal is larger than what it was trained on. To compensate, either multiply each input connection weight by the *keep probability* $(1 - p)$ after training, or divide each neuron's output by the keep probability during training.
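
The second option (dividing by the keep probability during training) is often called inverted dropout. Here is a minimal sketch of it in plain Python, under the assumption that a layer's outputs are given as a flat list of floats; the function name `inverted_dropout` and the all-ones `activations` list are hypothetical and only serve the illustration.

```python
import random


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0
            for a in activations]


rng = random.Random(42)          # fixed seed so the run is repeatable
activations = [1.0] * 10_000     # hypothetical layer outputs
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

# Survivors are scaled up to 1 / 0.8 = 1.25, so the average stays close to 1.0.
mean_after = sum(dropped) / len(dropped)
print(mean_after)
```

Because the scaling happens during training, nothing needs to be adjusted at inference time: the layer simply passes its inputs through unchanged.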

To implement dropout in Keras, you can use `keras.layers.Dropout`:

```python
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(100, activation='elu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])
```

> Since dropout is active only during training, comparing the training loss with the validation loss can be misleading. Make sure to evaluate the training loss without dropout (e.g., after training).

Dropout does tend to slow down convergence significantly, but it usually results in a much better model.

> If you want to regularize a self-normalizing network based on the SELU activation function, use *alpha dropout*: it is a variant of dropout that preserves the mean and variance of its inputs.

## Monte Carlo Dropout

*MC Dropout* is a powerful technique that can boost the performance of any trained dropout model without retraining it. Let's look at the implementation:

```python
y_probas = np.stack([model(X_test_scaled, training=True)
                    for sample in range(100)]) # [100, 10000, 10]
y_probas = y_probas.mean(axis=0) # [10000, 10]
```

We set `training=True` so that dropout stays active. As a result, every prediction is slightly different, so we stack 100 predictions over the test set and average them to get a single prediction per instance.

When the model makes a single prediction with dropout off, it can look very confident in its top class even when it is wrong. Averaging many predictions with dropout on gives a more reliable estimate and exposes the model's uncertainty. So when you are building a risk-sensitive system (e.g., medical or financial), MC Dropout can be very helpful.

Moreover, the model's accuracy can also improve slightly with MC Dropout, without any retraining.

> The number of MC samples (100 in this example) is a hyperparameter. The higher it is, the more accurate the predictions and their uncertainty estimates will be. However, if you double it, prediction time also doubles. So it's a trade-off between latency and accuracy.

If your model contains other layers that behave differently during training, such as `BatchNormalization` layers, you should not force `training=True`. Instead, replace the `Dropout` layers with the `MCDropout` class given below:
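
To make the uncertainty aspect concrete, here is a small plain-Python sketch that summarizes a set of MC samples for a single instance: the per-class mean is the averaged prediction, and the per-class standard deviation is one simple way to gauge how much the sampled predictions disagree. The function `mc_summary` and the `samples` values are made up for illustration; they are not outputs of the model above.

```python
from statistics import fmean, pstdev


def mc_summary(sampled_probas):
    """Per-class mean and standard deviation across MC samples.

    `sampled_probas` holds one probability vector per stochastic forward pass.
    """
    per_class = list(zip(*sampled_probas))   # regroup values by class
    means = [fmean(values) for values in per_class]
    stds = [pstdev(values) for values in per_class]
    return means, stds


# Made-up predictions for one instance: 4 passes, 3 classes.
samples = [
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
]
means, stds = mc_summary(samples)
print(means, stds)
```

With the stacked NumPy array from the earlier snippet (before averaging), the same quantities could presumably be obtained with `mean(axis=0)` and `std(axis=0)`.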

```python
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)
```

## Max-Norm Regularization

*Max-norm regularization* constrains the weights $w$ of each neuron's incoming connections such that $||w||_2 \le r$, where $r$ is the max-norm hyperparameter and $||\cdot||_2$ is the $l_2$ norm.

It is typically implemented by computing $||w||_2$ after each training step and rescaling $w$ if needed by $w \gets w\frac{r}{||w||_2}$.
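
The rescaling rule can be sketched in a few lines of plain Python. This is only an illustration of the formula above for a single neuron's weight vector, not how Keras implements its constraint; the function name `clip_to_max_norm` and the example weights are hypothetical.

```python
import math


def clip_to_max_norm(w, r):
    """Rescale w to have L2 norm r if its norm exceeds r; otherwise return a copy."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= r:
        return list(w)
    return [x * r / norm for x in w]


# Hypothetical incoming weights of one neuron, with L2 norm 5.
w = [3.0, 4.0]
clipped = clip_to_max_norm(w, r=1.0)      # rescaled by a factor of 1/5
unchanged = clip_to_max_norm(w, r=10.0)   # already within the limit
print(clipped, unchanged)
```

The direction of the weight vector is preserved; only its length is capped at $r$.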

Reducing $r$ increases regularization.

```python
keras.layers.Dense(100, activation='elu',
                  kernel_constraint=keras.constraints.max_norm(1.))
```

You can also constrain the bias terms by setting `bias_constraint`.

The `max_norm()` function has an `axis` argument that defaults to 0. When using convolutional layers, make sure to set the axis appropriately (usually `axis=[0, 1, 2]`).

