%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create anaconda environment
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Regularizations

<b>
Any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
</b>

There have been many experiments on the regularization of deep neural networks. To be more precise, we split regularization techniques into two major approaches:
- Cost functions penalty (soft influence)
- Parameters regularization directly (hard penalization)

For cost function norm regularization:

$$
\tilde C(X, y, W, b) = C(X, y, W, b) + \lambda \Omega(W)
$$

Biases need less data than the weights to fit accurately and have a much lower dimension, so they are generally not regularized. Regularizing the biases often causes underfitting because of their low dimensionality.

In some cases a separate $\lambda$ is used per layer, but to avoid too many hyperparameters we set the same value for every layer.

## Weight decay - L2 regularization

Let's consider:
$$
\tilde C(X, y, W, b) = C(X, y, W, b) + \lambda \Omega(W)
$$
with
$$
\Omega(W) = \frac{1}{2}||W||_2^2
$$
Here $W$ is considered as a vector of all weights from all layers.

In the literature it is called <b>Ridge regression</b> or <b>Tikhonov</b> regularization.
<br>
In deep learning it's called <b>Weight decay</b>.

Let's study the gradient of the regularized function (assume that we don't have biases):
$$
\tilde C(X, y, W) = \frac{\lambda}{2}W^TW + C(X, y, W)
$$

With the corresponding parameter gradient:
$$
\nabla_w \tilde C(X, y, W) = \lambda w + \nabla_w C(X, y, W)
$$

Then for the gradient descent step with learning rate $\alpha$:
$$
w \leftarrow w - \alpha (\lambda w + \nabla_w C(X, y, W)) = w - \alpha \lambda w - \alpha \nabla_w C(X, y, W)
$$
or
$$
w \leftarrow (1 - \alpha \lambda)w - \alpha \nabla_w C(X, y, W)
$$

As can be observed, the weights shrink on each step by the constant factor $1 - \alpha \lambda$ before the gradient update.
<br>
So the weights are pulled toward zero. In principle they could be pulled toward any other value instead (by the linear nature of neural networks), but zero is the usual choice.
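
The equivalence of the two update rules above can be checked numerically. This is a minimal sketch in NumPy: the gradient `grad_C` is a random stand-in for $\nabla_w C$, and the values of `alpha` and `lam` are illustrative assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # current weights
grad_C = rng.normal(size=5)     # stand-in for the gradient of the unregularized cost C
alpha, lam = 0.1, 0.01          # learning rate and regularization strength (illustrative)

# Form 1: gradient step on the regularized cost C + (lam / 2) * ||w||^2
w_penalty = w - alpha * (lam * w + grad_C)

# Form 2: shrink the weights by (1 - alpha * lam), then take the plain gradient step
w_decay = (1 - alpha * lam) * w - alpha * grad_C

print(np.allclose(w_penalty, w_decay))
```

Both forms give the same weights for plain gradient descent, which is why the L2 penalty is called weight decay.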

#### Analysis of weight decay (optional)

<div>
<img src="images/memes/technical.jpeg"  height="600" width="800" />
</div>

For a quadratic cost $C$ with $w^* = \arg\min_{w} C(w)$, the second-order Taylor expansion around $w^*$ gives:
$$
\hat C(w) = C(w^*) + \frac{1}{2}(w - w^*)^TH(w - w^*)
$$
where $H$ is the Hessian matrix of $C$ evaluated at $w^*$. The first-order term of the series vanishes because $\nabla C(w^*) = 0$ at the minimum.

The gradient of $\hat C$ is:
$$
\nabla_w \hat C(w) = H(w - w^*)
$$
which is $0$ at $w = w^*$, the minimum of the unregularized cost.

If we add weight decay to $\hat C(w)$, the gradient must vanish at the regularized minimum $\tilde w$:
$$
\lambda \tilde w + H(\tilde w - w^*) = 0
$$
<br>
$$
(H + \lambda I)\tilde w = H w^*
$$
<br>
$$
\tilde w = (H + \lambda I)^{-1}H w^*
$$

Because the Hessian matrix of a convex function is positive semi-definite and (by Schwarz's theorem) symmetric, we can decompose it as $H = Q \Lambda Q^T$, where $Q$ is an orthonormal matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues:
$$
\tilde w = (Q \Lambda Q^T + I \lambda)^{-1} Q \Lambda Q^T w^* = (Q(\Lambda + \lambda I)Q^T)^{-1}Q \Lambda Q^T w^* = \\
Q(\Lambda + \lambda I)^{-1}\Lambda Q^T w^*
$$


This means that for 
$$
\Lambda = \begin{bmatrix}
  l_1 & 0 & \cdots & 0 \\
  0 & l_2 & \cdots & 0 \\
  \vdots & \vdots & \ddots & \vdots \\
  0 & 0 & \cdots & l_n
\end{bmatrix}
$$

Thus:
$$
(\Lambda + \lambda I)^{-1}\Lambda = 
\begin{bmatrix}
  \frac{l_1}{\lambda + l_1} & 0 & \cdots & 0 \\
  0 & \frac{l_2}{\lambda + l_2} & \cdots & 0 \\
  \vdots & \vdots & \ddots & \vdots \\
  0 & 0 & \cdots & \frac{l_n}{\lambda + l_n}
\end{bmatrix}
$$

Each component of $w^*$ along the $i$-th eigenvector of $H$ is rescaled by the factor $\frac{l_i}{\lambda + l_i}$.
Along the directions where the eigenvalues of $H$ are relatively large, for example where $l_i \gg \lambda$, the effect of regularization is relatively small. Yet components with $l_i \ll \lambda$ will be shrunk to have nearly zero magnitude (Goodfellow et al.)
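
To make the rescaling concrete, here is a small numerical sketch. The Hessian `H`, the minimum `w_star` and the value of `lam` are made-up values chosen so that one eigenvalue is much larger than $\lambda$ and the other much smaller.

```python
import numpy as np

# Hypothetical quadratic cost: one large and one small eigenvalue
H = np.diag([100.0, 0.01])
w_star = np.array([1.0, 1.0])   # minimum of the unregularized cost
lam = 1.0                       # weight decay strength

# Regularized minimum: w_tilde = (H + lam * I)^{-1} H w_star
w_tilde = np.linalg.solve(H + lam * np.eye(2), H @ w_star)

# Same result through the eigendecomposition H = Q diag(l) Q^T
eigvals, Q = np.linalg.eigh(H)
factors = eigvals / (eigvals + lam)        # l_i / (lam + l_i)
w_eig = Q @ (factors * (Q.T @ w_star))

print(w_tilde)   # first component barely changes, second is shrunk close to zero
print(factors)
```

The component along the direction with eigenvalue $100$ keeps about $99\%$ of its value, while the one with eigenvalue $0.01$ keeps about $1\%$.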

## Label smoothing (add noise to target, target augmentation)

We will see how data augmentation can help a model generalize better, but target augmentation techniques exist as well. When the target is an image (segmentation mask), bounding box, text, graph, etc., this might be obvious. But sometimes augmenting the target probabilities also helps.

When training a classifier with softmax, instead of training on hard one-hot targets $l_i \in \{0, 1\}$ we can use:
$$
l_i^{'} = (1- \epsilon)l_i + \frac{\epsilon}{k}
$$
for some "small" $\epsilon$, where $k$ is the total number of classes.

When the labels are hard, softmax, which returns probabilities, almost never predicts exactly $0$ or $1$, so the model keeps training and keeps increasing the weights. In contrast, soft labels can reduce training time and can serve as a regularizer of the weights.
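
The smoothing formula can be sketched in a few lines of NumPy. The helper name `smooth_labels` and the choice of $\epsilon = 0.1$ with $k = 4$ classes are assumptions made for illustration only.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Return (1 - eps) * one_hot + eps / k, where k is the number of classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

labels = np.array([2, 0])            # class indices for two examples
one_hot = np.eye(4)[labels]          # hard targets, k = 4 classes
soft = smooth_labels(one_hot, eps=0.1)

print(soft)                          # true class gets 0.925, every other class 0.025
```

Each row still sums to $1$, so the smoothed targets remain a valid probability distribution.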

A model trained with label smoothing is good for classification tasks but performs poorly as a teacher model (for example, in knowledge distillation).

## Dropout

Recall ensemble methods such as bagging: they perform better because different models usually do not make the same errors. On the other hand, training an ensemble of deep neural networks is computationally expensive and might need a significantly larger dataset.

Consider a similar approach within a single neural network. Switch off activations in each layer with probability $p^l$ for every mini-batch, which can be considered as random sampling (bootstrapping), and then average the cost function at the end.

Dropout:
<div>
<img src="images/regs/drpt_1.png"  height="600" width="800" />
</div>

When we switch off an activation, we influence all the input weights in the previous layers (less so when the layer is far away).

Dropout:
<div>
<img src="images/regs/drpt_2.png"  height="600" width="800" />
</div>

At the end we should rescale the surviving activations by the inverse of the keep probability $p$ (the probability that a unit stays active, that is, one minus the switch-off probability):
$$
a^l = \frac{1}{p}a^l
$$
For instance if (as is often done) half of the units were switched off, then keep-prob is $p = 0.5$ and the activations should be:
$$
a^l = \frac{a^l}{0.5} = 2 a^l
$$

The most common implementation of dropout is inverted dropout:
<br>
for keep-prob $p = 0.5$ we keep only this fraction of the activations alive. We create a mask vector with the same dimension as the activation, filled with randomly selected zeros and ones (each entry is one with probability $p$), multiply the activation element by element by this mask, and divide the result by keep-prob during training, so that nothing has to be rescaled at test time.

For instance:
$$
z^l = W^l a^{l-1} + b^l
$$
In order to keep the expected value of $z^l$ unchanged, we divide the activations $a^{l-1}$ by keep-prob.

Note that we eliminate activation elements randomly, and differently per iteration or per mini-batch, which makes dropout a one-step ensemble method.
<br>
We don't use dropout during testing or validation.
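
A minimal sketch of inverted dropout in NumPy follows. The function name `inverted_dropout` and the all-ones activations are assumptions for illustration; a real layer would apply this to its own activations $a^l$.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob and rescale the rest."""
    if not training:
        return a                                  # no dropout at validation or test time
    mask = rng.random(a.shape) < keep_prob        # True with probability keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                          # stand-in activations
a_drop = inverted_dropout(a, keep_prob=0.5, rng=rng)

print(a_drop.mean())                              # close to 1.0: expected value preserved
print(np.unique(a_drop))                          # only 0.0 and 2.0 remain
```

Because the rescaling happens during training, the same function with `training=False` returns the activations unchanged.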

Note: We often vary keep-prob per layer, depending on the layer and its number of units. For instance, we would not apply dropout to a layer with a single unit, and keep-prob 0.2 on a layer with two units might not be wise.

Dropout can be considered as giving the model a denoising quality (random noise is added directly to the hidden units).

It is important to understand that a large portion of the power of dropout arises from the fact that the masking noise is applied to the hidden units. This can be seen as a form of highly intelligent, adaptive destruction of the information content of the input rather than destruction of the raw values of the input

For example, if the model learns a hidden unit $h_i$ that detects a face by finding the nose, then dropping $h_i$ corresponds to erasing the information that there is a nose in the image

The model must learn another $h_j$, that either redundantly encodes the presence of a nose or detects the face by another feature, such as the mouth

Traditional noise injection techniques that add unstructured noise at the input are not able to randomly erase the information about a nose from an image of a face unless the magnitude of the noise is so great that nearly all the information in the image is removed

Destroying extracted features rather than original values allows the destruction process to make use of all the knowledge about the input distribution that the model has acquired so far

Goodfellow et al.

<a href="https://www.youtube.com/watch?v=kAwF--GJ-ek">Geoffrey Hinton on dropout</a>

## Data augmentation

Data augmentation helps the model generalize better. More, and more balanced, data usually gives better results in statistical models.

Augmentation should differ between tasks. For instance, for images there are random noise, lighting changes, blurring, rotation, random cropping, center cropping, channel flipping, etc.
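
As a rough illustration of a few of these image augmentations, here is a NumPy sketch. The function name `augment`, the crop size and the noise level are hypothetical choices, and whether each transform is safe depends on the task.

```python
import numpy as np

def augment(image, rng, crop=24, noise_std=0.05):
    """Apply a random horizontal flip, a random crop and additive Gaussian noise."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                     # horizontal flip
    h, w = image.shape
    top = rng.integers(0, h - crop + 1)            # random crop position
    left = rng.integers(0, w - crop + 1)
    image = image[top:top + crop, left:left + crop]
    return image + rng.normal(0.0, noise_std, size=image.shape)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                       # stand-in for a grayscale image
augmented = augment(image, rng)

print(augmented.shape)                             # (24, 24)
```

Calling `augment` again on the same image gives a different sample, which is how one image turns into many training examples.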

Augmentations:
<div>
<img src="images/regs/aug_1.jpeg"  height="600" width="800" />
</div>

Augmentations with labels:
<div>
<img src="images/regs/aug_2.jpeg"  height="600" width="800" />
</div>

Augmentations big picture:
<div>
<img src="images/regs/aug_3.jpeg"  height="600" width="800" />
</div>

Augmentations cropping:
<div>
<img src="images/regs/aug_4.png"  height="600" width="800" />
</div>

Sometimes the choice of augmentation depends on the task and data even for images. For instance, for OCR we cannot flip
<br>
<b>d</b>
because we would get
<br>
<b>b</b>
and vice versa

Or we cannot rotate digits in the familiar MNIST dataset, such as
<br>
<b>6</b>
because we would get
<br>
<b>9</b>
and vice versa

For sound recognition, noise augmentation is used. For images, there are denoising autoencoders, which learn to reconstruct the original image from the augmented image (an example of a self-supervised generative model).

Denoising autoencoders:
<div>
<img src="images/regs/dae_1.png"  height="600" width="800" />
</div>

Denoising autoencoder:
<div>
<img src="images/regs/dae_2.png"  height="600" width="800" />
</div>

For text, replacing words with synonyms is used, and masking (which can be considered as a denoising autoencoder for language models) is the state of the art for language model pre-training.

Text masking language model:
<div>
<img src="images/regs/mae_1.png"  height="600" width="800" />
</div>

Text masking:
<div>
<img src="images/regs/mae_2.png"  height="600" width="800" />
</div>

## Early stopping

During training we use a training set and a validation set, the first for training and the second for observation. When both the training and validation errors go down, even at different rates, training is going well. But if the training error goes down while the validation error starts to increase, this is a sign of overfitting.

Example of overfitting:
<div>
<img src="images/regs/ovft_1.png"  height="600" width="800" />
</div>

In this case we can stop the training and use the last parameters stored while the validation error was still going down. For this we need to store the parameters at regular intervals.

Early stopping:
<div>
<img src="images/regs/ovft_2.png"  height="600" width="800" />
</div>

It has the downside that we should save the weights periodically (checkpoints), but we don't use these saved weights during training, so this procedure has minimal impact on training time.
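
The stopping rule can be sketched in plain Python. The function name `early_stopping`, the `patience` parameter and the list of validation errors are all made up for illustration; in a real training loop the checkpoint would be written where the comment indicates.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_epoch, best_error) for a sequence of validation errors.

    Stops once the validation error has not improved for `patience` epochs.
    """
    best_epoch, best_error = -1, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error   # here we would save a checkpoint
        elif epoch - best_epoch >= patience:
            break                                   # stop and restore the checkpoint
    return best_epoch, best_error

# Made-up validation errors: improving at first, then overfitting
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(early_stopping(val_errors))                   # (3, 0.5)
```

The `patience` window keeps a single noisy epoch from stopping the training too early.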

On the other hand we can consider early stopping as regularization with minimal effort.

Early stopping can also be considered as a way to prevent the weights from growing. The number of steps is limited, and so is the trajectory of the weights.

Early stopping might trigger while the error is still unsatisfactory. In that case we should change the model architecture, augment the data more, add more data, or use label smoothing or other regularization, and the model might perform better.

Early stopping might be used with other regularization techniques.

Often many regularization techniques are used together in order to achieve the best result.

One example of early stopping is the learning rate finder.

## Weights initialization

There are several weight initialization techniques.

Consider the model:

<div>
<img src="images/regs/wi_1.png"  height="600" width="800" />
</div>

Vanishing or exploding gradients:
$$
dw^l = dw^{l-1}dw^{l-2} \dots dw^{1} 
$$
so if each factor is small ($dw^l \ll 1$) the gradient becomes nearly zero, and if each factor is large ($dw^l \gt 1$) the gradient might overflow.

Another case is mean and variance with random initialization:
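```python
import torch

# x_valid: flattened validation images of shape (N, 784), e.g. MNIST,
# assumed to be loaded earlier. Stand-in if no data is loaded
# (the printed numbers will then differ from the ones below):
# x_valid = torch.randn(10000, 784)

# random init
w1 = torch.randn(784, 50)
b1 = torch.randn(50)
w2 = torch.randn(50, 10)
b2 = torch.randn(10)
w3 = torch.randn(10, 1)
b3 = torch.randn(1)

def linear(x, w, b):
    return x @ w + b

def relu(x):
    return x.clamp_min(0.)

# forward pass: mean and std grow from layer to layer
t1 = relu(linear(x_valid, w1, b1))
t2 = relu(linear(t1, w2, b2))
t3 = relu(linear(t2, w3, b3))
print(t1.mean(), t1.std())
print(t2.mean(), t2.std())
print(t3.mean(), t3.std())

# Output of the original run:
# tensor(13.0542) tensor(17.9457)
# tensor(93.5488) tensor(113.1659)
# tensor(336.6660) tensor(208.7496)
```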

Xavier initialization:
$$
w^l = randn(N^l\times M^l) \cdot \sqrt{\frac{1}{N^{l-1}}}
$$
<br>
and
$$
b^l = 0
$$

Initialization depends on the activation function and the layer architecture.

For the ReLU activation function, "Kaiming initialization" works well:
$$
w^l = randn(N^l\times M^l) \cdot \sqrt{\frac{2}{N^{l-1}}}
$$

For the tanh activation function it is better to use $1$ instead of $2$ (Xavier initialization):
$$
w^l = randn(N^l\times M^l) \cdot \sqrt{\frac{1}{N^{l-1}}}
$$

Bengio et al.:
$$
w^l = randn(N^l\times M^l) \cdot \sqrt{\frac{2}{N^l + N^{l-1}}}
$$
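
The effect of the scaling factor can be checked with a small experiment. This is a sketch under assumptions: random Gaussian inputs instead of real data, ten ReLU layers of width 256 with zero biases, and a hypothetical helper `forward_rms` that reports the root mean square of the last layer's activations.

```python
import numpy as np

def forward_rms(scale_fn, depth=10, width=256, n=512, seed=0):
    """Push random inputs through `depth` ReLU layers and return the final RMS activation."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n, width))
    for _ in range(depth):
        w = rng.normal(size=(width, width)) * scale_fn(width)
        a = np.maximum(a @ w, 0.0)     # linear layer (zero bias) followed by ReLU
    return np.sqrt(np.mean(a ** 2))

rms_naive = forward_rms(lambda fan_in: 1.0)                       # plain randn
rms_kaiming = forward_rms(lambda fan_in: np.sqrt(2.0 / fan_in))   # Kaiming scaling

print(rms_naive, rms_kaiming)
```

With plain `randn` weights the activations grow by roughly a constant factor per layer, while with the $\sqrt{2 / N^{l-1}}$ scaling they stay on the order of $1$.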

<a href="https://arxiv.org/abs/1706.02515">Self-Normalizing Neural Networks</a>
<br>
<a href="https://twitter.com/SELUAppendix/status/873882218774528003" >Self-normalizing neural networks paper</a>

## Features (data) normalization and standardization

When features have different ranges:
- the loss function might be asymmetric, which slows down training.
- one feature with a large scale might dominate the whole prediction.

We normalize the input features with their mean and variance:
$$
\mu = \frac{1}{m}\sum_{i = 1}^{m}X^{(i)}
$$
<br>
and
$$
\sigma^2 = \frac{1}{m}\sum_{i = 1}^{m}(X^{(i)} - \mu)^2
$$

Normalized features:
$$
\tilde X^{(i)} = \frac{X^{(i)} - \mu}{\sigma}
$$

<div>
<img src="images/regs/fn_1.png"  height="600" width="800" />
</div>

Then the cost function will be more symmetric, which reduces training time significantly.

<div>
<img src="images/regs/fn_2.jpg"  height="600" width="800" />
</div>

Recall the gradient descent:
<div>
<img src="images/regs/fn_3.png"  height="600" width="800" />
</div>

On the other hand, changes in a particular feature do not harm the prediction.

<div>
<img src="images/regs/fn_4.png"  height="600" width="800" />
</div>

Note: Use $\mu$ and $\sigma^2$ calculated on training data for validation data and inference

If the features already have a similar scale, then the cost function will be symmetric, but normalizing in this case won't harm it at all.
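
A minimal sketch of feature standardization in NumPy, using made-up data with two features on very different scales. Note that it divides by the standard deviation $\sigma$ (the square root of $\sigma^2$) and reuses the training statistics for the validation data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: two features on very different scales
X_train = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(1000, 2))
X_valid = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(200, 2))

mu = X_train.mean(axis=0)        # per-feature mean, training data only
sigma = X_train.std(axis=0)      # per-feature standard deviation, training data only

X_train_norm = (X_train - mu) / sigma
X_valid_norm = (X_valid - mu) / sigma    # reuse the training statistics

print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```

After this step both training features have mean $0$ and standard deviation $1$; the validation features are close to that but not exact, because they use the training statistics.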

## Batch normalization

In the case of input variables, we normalize features with mean and variance in order to make training faster and the cost function more symmetric. We can normalize the outputs of each layer as well to reduce training time further.

Batch normalization gives us a wider range for hyperparameter choice and works with "bigger" learning rates.

In many models batch normalization is applied before the activation functions.

Let's calculate the mean and variance for each layer per batch:
$$
\mu^l = \frac{1}{m}\sum_{i = 1}^{m}(Z^l)^{(i)}
$$
and
$$
(\sigma^l)^2 = \frac{1}{m}\sum_{i = 1}^{m}((Z^l)^{(i)} - \mu^l)^2
$$

Now normalize each $Z^l$ as:
$$
(Z^l_{norm})^{(i)} = \frac{(Z^l)^{(i)} - \mu^l}{\sqrt{(\sigma^l)^2 + \epsilon}}
$$
where $\epsilon$ is a "small" number (for example $10^{-8}$) that avoids division by zero.

Now introduce additional trainable parameters $\gamma^l$ and $\beta^l$, which set the new standard deviation and mean:
$$
(\tilde Z^l)^{(i)} = \gamma^l (Z^l_{norm})^{(i)} + \beta^l
$$
where $\gamma$ and $\beta$ are learnable parameters.

If $\gamma^l = \sqrt{(\sigma^l)^2 + \epsilon}$ and $\beta^l = \mu^l$ then:
$$
 (\tilde Z^l)^{(i)} = (Z^l)^{(i)}
$$

Why do we need these additional parameters?
<br>
If we have an activation function such as sigmoid, then inputs with mean $0$ and variance $1$ cover only the near-linear part of the sigmoid.

<div>
<img src="images/regs/bn_1.png"  height="600" width="800" />
</div>

We can consider batch normalization as an addition to the activation function:
$$
{bn}^l : Z^l \to \tilde Z^l
$$
<br>
$$
\tilde a^l = a^l \circ {bn}^l
$$

So the $\gamma^l$ and $\beta^l$ might be learned as well:
$$
\gamma^l = \gamma^l - \alpha d\gamma^l 
$$
<br>
and
$$
\beta^l = \beta^l - \alpha d\beta^l 
$$
or with other optimization algorithms (momentum, RMSProp, Adam) as well

Note: In practice we use mini-batches, and mean and variance for batch norms are calculated for each mini-batch

The mean calculation cancels the biases:
$$
Z^{(i)} - \mu = W a^{(i)} + b - \frac{1}{m}\sum_{j=1}^{m}(W a^{(j)} + b) = W a^{(i)} + b - b - \frac{1}{m}\sum_{j=1}^{m}W a^{(j)} = \\ 
W a^{(i)} - \frac{1}{m}\sum_{j=1}^{m}W a^{(j)}
$$
<br>
so if we use batch normalization we can remove the biases ($b^l = 0$) and rely on the $\beta^l$ parameters.

Batch normalization during inference (one example instead of a batch):
- During training, calculate the exponential moving average of the mean and variance over all the batches for each layer
- Use these averages to normalize $Z^l$ (calculate $Z_{norm}^l$) for each layer
- Use the learned $\gamma^l$ and $\beta^l$ for batch normalization for each layer
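
The training and inference behaviour described above can be sketched as a small NumPy class. This is a forward-pass-only illustration: the class name `BatchNorm`, the `momentum` value and the random inputs are assumptions, and the updates of $\gamma$ and $\beta$ by gradient descent are left out.

```python
import numpy as np

class BatchNorm:
    """Minimal batch normalization for a (batch, features) array; forward pass only."""

    def __init__(self, n_features, momentum=0.1, eps=1e-8):
        self.gamma = np.ones(n_features)          # learnable scale
        self.beta = np.zeros(n_features)          # learnable shift
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, z, training=True):
        if training:
            mu, var = z.mean(axis=0), z.var(axis=0)
            # exponential moving averages, used later at inference
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mu, var = self.running_mean, self.running_var
        z_norm = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_norm + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm(n_features=3)
z = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
out = bn(z)                                   # training mode: batch statistics
single = bn(z[:1], training=False)            # inference on one example: running statistics

# Adding a constant bias does not change the training-mode output
out_biased = BatchNorm(n_features=3)(z + 10.0)
print(np.allclose(out, out_biased))
```

The last check illustrates why the bias $b^l$ can be dropped: any constant added to $Z^l$ is removed by the mean subtraction.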

Why does batch normalization work?

Covariate shift:
<div>
<img src="images/regs/bn_2.png"  height="600" width="800" />
</div>

It reduces the shift of the activation distribution per layer, and the model generalizes better.

Regularization effect:
- Batch norm is calculated on mini-batches
- Adds noise per layer as dropout if mini-batches are "small" enough
- If mini-batches are "large" then regularization effect is small
- Can be used with other regularizations

With batch normalization:
<div>
<img src="images/regs/bn_3.png"  height="600" width="800" />
</div>

Without batch normalization:
<div>
<img src="images/regs/bn_4.png"  height="600" width="800" />
</div>

Without batch normalization:
<div>
<img src="images/regs/bn_5.png"  height="600" width="800" />
</div>

Problems with small batches can be mitigated by $\epsilon$ or by keeping moving averages of the statistics, as in FastAI's running batch normalization, which uses exponential moving averages with de-biasing and reduces the variance squeeze.

Why the batch normalization works anyway:
<div>
<img src="images/regs/bn_6.png"  height="600" width="800" />
</div>

Training with batch normalization:
<div>
<img src="images/regs/bn_7.png"  height="600" width="800" />
</div>

Covariate shift:

<div>
<img src="images/regs/bn_8.png"  height="600" width="800" />
</div>

Persistence against noise:

<div>
<img src="images/regs/bn_9.png"  height="600" width="800" />
</div>

Loss landscape smoothness:

<div>
<img src="images/regs/bn_10.png"  height="600" width="800" />
</div>

In "big" models trained with "small" batch sizes, especially when training starts from scratch, batch normalization tends not to work well. It is also hard to apply to RNN models.

## Weight normalization

Instead of batch normalization, which has limitations with smaller batch sizes, it is sometimes practical to use weight normalization.

The weights normalization:
$$
W = u \frac{g}{||u||}
$$
<br>
where $u$ and $g$ are learnable parameters, as in the case of batch normalization

The purpose is to separate the magnitude of the weights from their direction.
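
The reparameterization can be sketched in NumPy as follows. One assumption goes beyond the formula above: the norm $||u||$ is taken per row, so each output unit gets its own magnitude `g`; the shapes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=(4, 3))             # direction parameters, one row per output unit
g = np.array([0.5, 1.0, 2.0, 3.0])      # magnitude parameters, one per output unit

# W = u * g / ||u||, with the norm taken per row (assumption)
W = u * (g / np.linalg.norm(u, axis=1))[:, None]

print(np.linalg.norm(W, axis=1))        # equals g up to floating point error
```

Changing `g` only rescales the rows of `W`, and changing `u` only rotates them, which is the separation of magnitude and direction.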

The use case familiar to me is super-resolution based on an augmented UNet encoder and decoder.

Layer normalization is mostly used in segmentation or detection tasks. When we use a pre-trained encoder, layer normalization is used mostly in the decoder, because with such "big" models we can only fit "small" batches, such as $2$ or even $1$.

## Layer normalization

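For layer normalization, we normalize the input layer-wise: the statistics are computed for each example across the units of a layer (in height), rather than for each unit across the batch (in width) as in batch normalization.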

<div>
<img src="images/regs/ln_1.png"  height="600" width="800" />
</div>

Layer normalization can also be considered as an improvement over batch normalization, but batch normalization is still widely used.
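
A minimal NumPy sketch of layer normalization follows, assuming a `(batch, features)` input and leaving out the learnable gain and shift. The function name `layer_norm` and the tiny batch of two examples are illustrative choices.

```python
import numpy as np

def layer_norm(z, eps=1e-8):
    """Normalize each example (row) across its features."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(2, 8))   # tiny batch of 2 examples
out = layer_norm(z)

# The result for one example does not depend on the rest of the batch
print(np.allclose(out[0], layer_norm(z[:1])[0]))
```

Since the statistics never mix examples, the same computation works with a batch size of $1$ and is identical during training and inference.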

<div>
<img src="images/regs/ln_2.png"  height="600" width="800" />
</div>

## Other normalizations

There are other normalizations, such as instance normalization, group normalization, etc.

<div>
<img src="images/regs/on_1.png"  height="600" width="800" />
</div>

These normalizations are used in some architectures, but they are still not as popular as batch normalization.

<div>
<img src="images/regs/on_2.png"  height="600" width="800" />
</div>

Many of them have a concrete domain where they work: layer norm in RNN models, instance norm for style transfer, etc.

## Questions?