<a id="dlmove"></a>
# Deep Learning as a movement

Breakthroughs in Deep Learning brought about the next "AI spring", leading to an exponential rise in the number of AI publications - keeping pace with "Moore's Law" in computation.

<a href="https://cdn-images-1.medium.com/max/1000/0*znJS1Aygd_B-u9rA"><img src="https://drive.google.com/uc?export=view&id=1pBg72VF_MyyVLaaStmGw5aRiAqXrCT9z" width=600 heigth=600></a>

[Source](https://medium.com/syncedreview/google-ai-chief-jeff-deans-ml-system-architecture-blueprint-a358e53c68a5)

Not just the technology, but also the funding of AI changed dramatically, shifting from public to private.

<a href="https://fabernovel-main-www.cdn.prismic.io/fabernovel-main-www/992eae50f2f0d6c2ccbeadcfc7d5860df8fec000_darpa-nsf-funding-of-ai-1024x406.png"><img src="https://drive.google.com/uc?export=view&id=1mAyePM5TA2TczEZBWCoo3feMKkpIjDJE" width=60%></a>

(Though military funding is still there...)

[Source](https://en.fabernovel.com/insights/economy/8-facts-about-ai-research-funding)

Counterintuitively, since "open science" and "open access" (pioneered by e.g. [ArXiv](https://arxiv.org/)) gained traction at the same time, the number of openly available papers also exploded.

<a href="https://arxiv.org/help/stats/2017_by_area/cumsubs.png"><img src="https://drive.google.com/uc?export=view&id=18F4FLxGGTryURPNCpmyR4Moej7l5__e3" width=600 heigth=600></a>

We can safely say that "multitudes" of unsung heroes are working on AI papers right now.

<a id="initialization"></a>
# Initialization


We incrementally minimize the loss function by optimizing the weights and biases. Convergence properties are strongly influenced by the initial values of the weights, and with a bad choice convergence may be lost altogether.

- If weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
- If weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
- For early deep learning applications vanishing and exploding gradients were problematic, and this hindered progress.
- Summation drives the feed-forward process, so we can only calculate the contributions of individual weights to the error - and thus modify them - if the weights are different. If they are all the same, there is "symmetry": we have essentially reduced our network's capacity to one neuron, since every neuron gets the same gradient and thus the same modification.
- ... An even more extreme case is initializing everything with zero, since that prevents any learning.

<a href="https://raw.githubusercontent.com/ritchieng/machine-learning-stanford/master/w5_neural_networks_learning/zerotheta_initialisation.png"><img src="https://drive.google.com/uc?export=view&id=1fb9qx-6W04eoHlwo4EmgSvhucBl6Twsb" width=600 heigth=600></a>

- We need a **non-zero** and **symmetry breaking** initialization.
- Even using small random values is a suboptimal choice, since we can still get slow convergence. (These problems were solved by Xavier initialization - see below.)
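The symmetry problem can be checked numerically. The following is a minimal NumPy sketch, assuming a hypothetical toy network with one tanh hidden layer and squared error; the sizes, the seed and the helper name `hidden_weight_grad` are illustrative and do not come from the cited sources.

```python
import numpy as np

def hidden_weight_grad(W1, w2, X, y):
    """Gradient of the (half) mean squared error w.r.t. the hidden weights W1
    for the toy network  y_hat = tanh(X @ W1) @ w2."""
    H = np.tanh(X @ W1)                    # hidden activations, shape (n, h)
    err = H @ w2 - y                       # prediction error, shape (n,)
    dH = np.outer(err, w2) * (1.0 - H**2)  # backprop through tanh
    return X.T @ dH / len(X)               # shape (d, h)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)

# Constant initialization: every hidden unit starts out identical.
g_const = hidden_weight_grad(np.full((5, 4), 0.3), np.full(4, 0.3), X, y)
# All-zero initialization: no gradient reaches the hidden weights at all.
g_zero = hidden_weight_grad(np.zeros((5, 4)), np.zeros(4), X, y)
# Small random initialization breaks the symmetry.
g_rand = hidden_weight_grad(rng.normal(scale=0.1, size=(5, 4)),
                            rng.normal(scale=0.1, size=4), X, y)

same_columns = np.allclose(g_const, g_const[:, :1])      # one shared update
zero_grad = np.allclose(g_zero, 0.0)                      # nothing to learn from
distinct_columns = not np.allclose(g_rand, g_rand[:, :1])
```

With constant weights every column of the gradient (one column per hidden unit) is the same, so the units can never become different from each other; with zeros the gradient vanishes completely.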

Source for examples comes from [here](https://intoli.com/blog/neural-network-initialization/). 

1. Sample net:

(From [Keras examples](https://web.archive.org/web/20200710053207/https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py).) A Convolutional Neural Net architecture (elaborated in later classes), with 2 convolutional layers, using max pooling, dropout, and ReLU as activation.


<a href="https://intoli.com/blog/neural-network-initialization/img/training-losses.png"><img src="https://drive.google.com/uc?export=view&id=18jjBdKS9VkNxDkzCuJ4wQJV5_1N2r7mp"></a>
Left: All weights 0, only oscillation in cost, no learning

Middle: Slow convergence

Right: Weights from a "good" distribution, inverse proportion to the number of input weights

### Pre-training as initialization

For a short period of time around 2006 - 2012, the problem of initializing a deeper network was solved by layerwise pre-training (which has a strong connection to autoencoders - we will discuss them later on). 

<a href="http://drive.google.com/uc?export=view&id=105D-cYATtqHXXZnR4ssm-j37RVTWr_XE"><img src="https://drive.google.com/uc?export=view&id=1EOIQZK8td3nnryxM-COecGVGA2GFr_f-"  width=600 heigth=600></a>

See paper [here](https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf).

The unsupervised pre-training was thought to be necessary to get the weights roughly into a realistic region based on the data. Later on it turned out that the distribution of weights only had to satisfy some basic conditions for the network to be practically trainable.

Autoencoders, though, had a much bigger future ahead of them. We come back to this later in the context of representation learning.

### Xavier / He initialization

<a href="https://t1.daumcdn.net/cfile/tistory/2777CD4E57A0077436"><img src="https://drive.google.com/uc?export=view&id=1F54pyLyNeKfxbPfW6V94yEbTea_6Whvv" width=400 heigth=400></a>

Where `fan_in` is the width of the input and `fan_out` is the width of the output of the given layer. 

Original paper for Xavier-algorithm can be found [here](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). Simple explanation [here](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization).

Later on came the ["He" version](https://arxiv.org/pdf/1502.01852.pdf) of initialization: the same idea, but with `fan_in` divided by a factor of 2 (that is, the variance of the weights is doubled).

Xavier initialization works well with activation functions that are symmetric around 0 (e.g. tanh). For ReLU the He method is to be used, which enables the training of really deep networks.
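The difference between the two schemes can be made visible with a small experiment. This is a sketch under stated assumptions: it uses the commonly quoted Gaussian variants (variance `2 / (fan_in + fan_out)` for Xavier, `2 / fan_in` for He), and the function names, depth and width are chosen only for illustration.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Assumed Xavier/Glorot variant: Var(W) = 2 / (fan_in + fan_out)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # Assumed He variant: Var(W) = 2 / fan_in (twice the Xavier value for square layers)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def last_layer_std(init, depth=20, width=256, n=512, seed=0):
    """Push random inputs through `depth` ReLU layers and return the output std."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, width))
    for _ in range(depth):
        h = np.maximum(h @ init(width, width, rng), 0.0)   # linear layer + ReLU
    return h.std()

std_xavier = last_layer_std(xavier_normal)   # shrinks layer by layer under ReLU
std_he = last_layer_std(he_normal)           # stays on the scale of the input
```

In this setting the ReLU throws away roughly half of the signal in every layer, so the Xavier-scaled activations fade out with depth, while the doubled variance of the He scheme keeps them at a usable scale.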


Example: 

- 5 layer multi-layer neural network, uniform number of weights in the layers
- Can observe the shifting of scales because of the three initialization schemes
- First case: "upper" layers become "zeroed out"
- Second case: distribution of weights is rather uniform
- Third case: upper layer's weights "explode", since they are getting too big.

<a href="https://intoli.com/blog/neural-network-initialization/img/linear-output-progression-violinplot.png"><img src="https://drive.google.com/uc?export=view&id=1OiUlCLpljSl55YTtrHFOX8QGZk-__1qt" width=600 heigth=600></a>

<a href="https://intoli.com/blog/neural-network-initialization/img/relu-output-progression-violinplot.png"><img src="https://drive.google.com/uc?export=view&id=18m6j6D1vG6jvN3LJ-Mz9vmq2-KE5cVzw" width=600 heigth=600></a>

Another source [here](https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-bfcdd806bf9e)


## Initialization based on data

The above mentioned methods rely on the architectural properties of networks, but the question naturally arises whether we can fit the initialization scheme to the data in some sense.

This is what [this paper](https://arxiv.org/abs/1710.10570) investigated in 2017.


1. PCA based initialization

<a href="http://drive.google.com/uc?export=view&id=1X13X3XsUNRHwOC0RCoTjmS0Gk7H01mLW"><img src="https://drive.google.com/uc?export=view&id=1onVpPpEePIvOxLxHunn7sCv2LGfladEl" width=60%></a>


2. Init based on data statistics

<a href="http://drive.google.com/uc?export=view&id=1I9ezXPYaAi-uZQ70lulBcF_BlYis9rg2"><img src="https://drive.google.com/uc?export=view&id=1nNtx11AYHyrA0M7eudQVEYZtZuWmzREv" width=60%></a>


This idea in a sense echoes the "original" deep learning solution of [Bengio et al.](https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf), which was layerwise pre-training of neural nets with "autoencoders" or "restricted Boltzmann machines", and it strongly reinforces the ideas we have discussed about representation learning. (We will return to this topic again.)

<a id="regularize"></a>
# Regularization / overfitting 



"Empirical risk minimization" situation

"Holdout set", "test error", we would like to measure generalization.


<a href="http://drive.google.com/uc?export=view&id=1dgGMzI4cEwVeyQsSJKcIJSBcTrEimc9v"><img src="https://drive.google.com/uc?export=view&id=1PCJMk4mXJTxPjfVmPhYjpbgh3MjyMsqD"  width=900 heigth=900></a>


<a href="https://media.nature.com/m685/nature-assets/nmeth/journal/v13/n9/images/nmeth.3968-F1.jpg"><img src="https://drive.google.com/uc?export=view&id=1S1cL3LEsY9j8QA7iH1VkEZIjU8UpKJ_Z"></a>



## Neural networks should not work...

- Neural networks even with two layers are universal approximators, and massively overparametrized
- Their degrees of freedom, capacity, or ability to have an absurd amount of "curvature" as functions is huge
- It just gets worse if the number of layers is greater than 2.

<a href="https://cs231n.github.io/assets/nn1/layer_sizes.jpeg"><img src="https://drive.google.com/uc?export=view&id=1GrjQNd-IBdJ0NYhfQk71ju5yolUr5Vak"></a>


## ...And did in fact not work for a while

"... neural network should be able to overfit training data given sufficient training iterations and a legitimate learning algorithm, especially considering that Brady et al. (1989) showed that an inferior algorithm was able to overfit the data. Therefore, this phenomenon should have played a critical role
in the research of improving the optimization techniques. Recently, the studying of cost surfaces of neural networks have indicated the existence of saddle points (Choromanska et al., 2015; Dauphin et al., 2014; Pascanu et al., 2014), which may explain the findings of Brady et al back in the late 80s."

[Source](https://arxiv.org/pdf/1702.07800.pdf)


## But now it works nonetheless. WHY?

As mentioned many times: 

- "Some technical developments"
    - Part of this is the material for this class
- Hardware + amount of data...


## I. "Capacity" usage as the progressive growth of weights during training

Each change of the decision surface corresponds to a downward "step" along the negative direction of the error gradients.

<a href="https://iamtrask.github.io/img/sgd_optimal.png"><img src="https://drive.google.com/uc?export=view&id=1Yq73R3HMAz2afnw1TUcH4w0M15MKLYo2"></a>
<a href="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/23104835/Qc281.jpg"><img src="https://drive.google.com/uc?export=view&id=1C5VsVjrzBmBMtLZjot4eo02ChUu0Q6qH"></a>

- If data is not linearly separable, the curvature of the decision surface increases progressively.
- With some simplification, we can assume that the weights will increase to produce a higher order polynomial.

Training can be understood to **increase the absolute value of weights** to utilize more and more capacity.

Small SGD based NN training on MNIST: charted the sum of absolute values of weights during the epochs.

<a href="http://drive.google.com/uc?export=view&id=1mbWqsUFHX-suGkJ15ZxNQcZvchBJprPf"><img src="https://drive.google.com/uc?export=view&id=1JlYqjXJsKVCq2hka6gmRfvXPtjfa0l9_"></a>



## Early stopping

Why do we check validation data in every epoch?

<a href="https://deeplearning4j.org/images/guide/earlystopping.png"><img src="https://drive.google.com/uc?export=view&id=1XaVTakoOBgvVnNutu0JQlHoKCggG-MiO" width=50%></a>

<a href="https://elitedatascience.com/wp-content/uploads/2017/09/early-stopping-graphic.jpg"><img src="https://drive.google.com/uc?export=view&id=1d3wAac34gh4fwlBvtS0v8ZIi1-oHRmEi"></a>

**We stop where validation error starts to increase.**

### "Starts to"?

An example of a validation error curve might look like this:

<a href="http://drive.google.com/uc?export=view&id=1tfV_C9pS8QZUKqVbXxY47tXEa1hDOMzK"><img src="https://drive.google.com/uc?export=view&id=1m0pBVq3oDZQ48LaUzxMRulu3o2HYifuE" width="70%"></a>

- If we get a constantly updating, noisy series, it might not be the best idea to stop the process just because *one* epoch caused an increase in validation error.
- We should "wait a bit" and do some smoothing. (Moving average, which is the default guess of TensorBoard? Continuous increase for some epochs? ...)
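One common way to "wait a bit" is a patience counter. The sketch below is a hypothetical helper (the name `early_stopping_epoch`, the `patience` and `min_delta` parameters and the error values are illustrative assumptions), not the specific heuristic of any of the cited sources.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors.

    Training stops once the error has failed to improve by more than `min_delta`
    for `patience` consecutive epochs; the weights of `best_epoch` would be kept.
    """
    best_err, best_epoch, waited = float("inf"), -1, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_err, best_epoch, waited = err, epoch, 0   # new best, reset counter
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_errors) - 1, best_epoch                 # never triggered

# Made-up noisy validation curve with small bumps at epochs 2 and 4.
noisy_curve = [0.90, 0.70, 0.72, 0.60, 0.61, 0.55, 0.57, 0.58, 0.60, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(noisy_curve, patience=3)
impatient = early_stopping_epoch(noisy_curve, patience=1)
```

With `patience=1` the first small bump already ends the training at epoch 2, although the error keeps improving until epoch 5; with `patience=3` the run continues through the noise and stops only after three epochs without improvement.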

#### Early stopping meta-algorithm

On the one hand we have to "get a feel" for it, on the other hand there are attempts to come up with heuristics.

<a href="http://drive.google.com/uc?export=view&id=1XhKuVB6mqj5ZMxsUBK8VgWF8uceP0ra-"><img src="https://drive.google.com/uc?export=view&id=1KEkPzOgiiNzN6gZJmPgUS-9hkXbSO5A4" width="70%"></a>

([Source: Goodfellow, Bengio & Courville: Deep Learning (MIT, 2016)](http://www.deeplearningbook.org/contents/regularization.html))

There are several approaches for the systematic timing of early stopping.
**A good summary well worth reading [here](http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf).**


## Classic regularization - weight norm

We would like to achieve the increase in performance with the smallest possible increase of the weights during training. "Do your best with the least capacity".

### Added terms to the cost function

How can we achieve this?
We should add some regularizing terms to the cost. These are the "additional constraints" mentioned in the beginning, and - no surprise - they can be the same as in the case of linear models.

<a href="https://image.slidesharecdn.com/10-170929054651/95/neural-network-part2-7-638.jpg?cb=1506665197"><img src="https://drive.google.com/uc?export=view&id=1Ifjhs2ecIMTZoWwGnWaHV6tLLoGvAk7s"></a>

"Eror" :-D

[Source](https://www.slideshare.net/21_venkat/neural-network-part2)



### L1-L2 penalty
No change from the linear models, same concept.

<a href="http://slideplayer.com/slide/2511721/9/images/3/Cost+function+Logistic+regression:+Neural+network:.jpg"><img src="https://drive.google.com/uc?export=view&id=18SptdB3bdawjhy8IlTYLbBoh1LJyVGNU"></a>

#### L1
**If only a few coefficients (weights) have a "non zero" value at the same time, the degrees of freedom of the function are constrained.**

<a href="https://jamesmccaffrey.files.wordpress.com/2017/06/l1_regularization_1.jpg?w=640&h=361"><img src="https://drive.google.com/uc?export=view&id=1P_ASa0pkCG-jOHFfJkIfKODbGavf0OuS"></a>

[Source](https://jamesmccaffrey.wordpress.com/2017/06/27/implementing-neural-network-l1-regularization/)


#### L2
**If the magnitude of the coefficients (weights) is small, the degrees of freedom of the function are constrained.**

<a href="https://jamesmccaffrey.files.wordpress.com/2017/06/l2_regularization_equations.jpg?w=640&h=379"><img src="https://drive.google.com/uc?export=view&id=1RH7dj3XhAnFeM9p4XdE-IdTa8guhpN4P"></a>

[Source](https://jamesmccaffrey.wordpress.com/2017/06/29/implementing-neural-network-l2-regularization/)
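Both penalties only add a term to the cost and a corresponding term to the gradient. A minimal sketch on a linear model, assuming plain gradient descent and made-up data; the function names and the coefficients `l1`, `l2` are illustrative, and for a neural network the same terms would be added for every weight matrix.

```python
import numpy as np

def penalized_loss_and_grad(w, X, y, l1=0.0, l2=0.0):
    """Half mean squared error of a linear model plus L1 and L2 weight penalties."""
    err = X @ w - y
    data_loss = 0.5 * np.mean(err**2)
    penalty = l1 * np.sum(np.abs(w)) + 0.5 * l2 * np.sum(w**2)
    # np.sign(w) is a subgradient of |w| (it is 0 where w == 0)
    grad = X.T @ err / len(y) + l1 * np.sign(w) + l2 * w
    return data_loss + penalty, grad

def fit(X, y, l1=0.0, l2=0.0, lr=0.1, steps=500):
    """Plain gradient descent on the penalized loss, starting from zero weights."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        _, grad = penalized_loss_and_grad(w, X, y, l1, l2)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters

w_plain = fit(X, y)
w_l2 = fit(X, y, l2=1.0)
norm_plain, norm_l2 = np.linalg.norm(w_plain), np.linalg.norm(w_l2)
```

The L2 term pulls every weight towards zero in proportion to its size, so the penalized fit ends up with a visibly smaller weight norm than the unpenalized one. The L1 term pushes with a constant force instead, which is what tends to drive unimportant weights all the way to (near) zero.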

#### Result
<a href="https://cs231n.github.io/assets/nn1/reg_strengths.jpeg"><img src="https://drive.google.com/uc?export=view&id=161UKSLmLb2WTeCbxRRGChRuaobjAqYhH"></a>



## II. Increase of cross-coupling during training (“anti-robust” model)

### Dropout

#### What is it?

The basic idea is to **randomly choose** some neurons (typically with 0.5 probability, a "coin flip") that we "switch off": we regard them as temporarily not being part of the network and set their activation to 0.

We calculate gradients accordingly: since the contribution of the switched-off neurons to the error is zero, their weights are not updated in this step.

In the next forward pass, we "flip the coins" anew.

<a href="https://everglory99.github.io/Intro_DL_TCC/intro_dl_images/dropout1.png"><img src="https://drive.google.com/uc?export=view&id=13ONCe_Q0cbJvhbQcUhWj72YXEa-OvLVo"></a>

**When we finish the training, we use THE WHOLE NETWORK as one**, but we have to rescale the outputs with a normalization term.
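A minimal NumPy sketch of the mechanism. Note the assumption: this is the "inverted dropout" variant, which does the rescaling already during training (dividing by the keep probability), so the full network can be used unchanged at inference. The function name and parameters are illustrative.

```python
import numpy as np

def dropout_forward(a, drop_prob, training, rng):
    """Inverted dropout: rescale at training time so inference needs no correction."""
    if not training or drop_prob == 0.0:
        return a                                  # whole network, untouched
    keep_prob = 1.0 - drop_prob
    mask = rng.random(a.shape) < keep_prob        # fresh "coin flips" on every call
    return a * mask / keep_prob                   # dropped units output exactly 0

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                          # dummy activations
train_out = dropout_forward(a, drop_prob=0.5, training=True, rng=rng)
test_out = dropout_forward(a, drop_prob=0.5, training=False, rng=rng)
```

During training about half of the entries are zero and the rest are doubled, so the expected activation stays the same as at inference time. The variant described in the text (rescaling the outputs after training) gives the same expectation, only the place of the correction differs.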

**[Original paper](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)**


#### WHYYYY?

1. First of all: we don't know exactly. There are some ideas, and an emerging consensus.
2. It is plausible that the "interdependency" ("coupling", "entanglement of hidden factors") also increases during training (quasi complexity, more like dependency), which causes more "brittle" or arbitrary decision boundaries (not large margin ones). This is a nice metaphor, but the literature does not fully back it up.
3. It is in practice a form of model averaging, that is, a self contained "ensemble model" - see below.

More: [here](https://arxiv.org/abs/1512.05287)


#### Observations

1. Dropout makes training more difficult, **decreases training performance**
2. As a rule of thumb, using dropout **_approximately_ doubles required training time**, but epoch time usually decreases (remember, you are zeroing out half of the gradients...)
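3. In case of $h$ hidden neurons, when we apply dropout with drop probability $p$, we implicitly train up to $2^h$ "thinned" sub-networks with shared weights, and we do the prediction with their "ensemble", scaling the activations by the keep probability $1-p$ (with the typical $p = 0.5$ the two are the same)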
4. Dropout forces the network to learn *robust* features, which are useful for many other random neuron groups in their decisions

#### Connection with ensemble methods

According to the most widespread opinion, during dropout, we create a weakly coupled group or *ensemble* of learners.

"Dropout training is similar to bagging (Breiman, 1994), where many different models are trained on different subsets of the data. Dropout training differs from bagging in that each model is trained for only one step and all of the models share parameters. For this training procedure (dropout) to behave as if it is training an ensemble rather than a single model, each update must have a large effect, so that it makes the sub-model induced by that µ fit the current input v well."

[Source](http://proceedings.mlr.press/v28/goodfellow13.pdf)

It is interesting to think about the connection of deep neural nets with *"boosted trees"* and "traditional" models like *RandomForest* and *XGBoost*.

<a href="https://i0.wp.com/dimensionless.in/wp-content/uploads/RandomForest_blog_files/figure-html/voting.png?w=1080&ssl=1"><img src="https://drive.google.com/uc?export=view&id=1bXh9NOoMtOfXwKD9DIjZ1keKJHddXPgd"></a>

#### "Dropout" wrapper

It became such a standard practice that there is a "wrapper" function for it in TensorFlow (and all other frameworks) as a standard operation.

It is considered here as a separate layer that "wraps" the prior one and executes the "switch off" (technically by multiplying with zero, or by other tricks).

```python
import tensorflow as tf  # TensorFlow 1.x API; `tf.layers` is not part of TF 2.x

# Fully connected layer on top of the previous activations `fc1`
fc1 = tf.layers.dense(fc1, 1024)
# Apply Dropout (if is_training is False, dropout is not applied)
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
```
        

Details in the [documentation](https://www.tensorflow.org/api_docs/python/tf/nn/dropout)

(WARNING! `tf.layers.dropout` is a wrapper around `tf.nn.dropout` - at least until the random "lords" of TF decide otherwise!)

#### OK, but where to use it? 

A non-trivial question is where to place the dropout "layer".

<a href="http://drive.google.com/uc?export=view&id=1zvL2s2gXB4ymfcnOXXraR8QFSm3Gy54Z"><img src="https://drive.google.com/uc?export=view&id=14fY8b02R40szsza9p6Wb_DthggzIrZPK"></a>

This does not yet have a settled universal answer; only partial aspects have been investigated.

For example in case of handwriting recognition the result is [this](http://ieeexplore.ieee.org/document/7333848/?reload=true).

This strongly depends on the architecture and represents an even more complex problem in case of recurrent models. A three part summary of the problem can be found [here](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307).

#### Sidenote: Dropout may be more important than we think

There is also work (e.g. by the group of [René Vidal](http://openaccess.thecvf.com/content_cvpr_2017/papers/Haeffele_Global_Optimality_in_CVPR_2017_paper.pdf)) which tries to characterize - under certain constraints - the supposedly non-convex optimization challenge of training a neural network as a convex problem. This can lead to a deeper understanding of how deep networks can - in theory, not just in practice - reach their high performance.

<a href="http://drive.google.com/uc?export=view&id=1lzDovsqObw6RizIEx5duSBX0G5Cx2EOf"><img src="https://drive.google.com/uc?export=view&id=1QUiQWybahjTpcEfvk5YkdUFgOSXc9Ud-" width=600 heigth=600></a>

In this framework, they propose that dropout - together with structural properties of deep NNs and the optimization process - essentially causes the problem to become convex. This line of research is pretty novel and not yet well understood, but definitely worth following.


## III. Robustness

### "Information dropout"

#### What? 

Achille & Soatto (2016)

What if we don't "switch off" nodes, as in dropout, but we consciously add noise from a known distribution to the inputs?

"We call “representation” any function of the data that is useful for a task. An optimal representation is most useful (sufficient), parsimonious (minimal), and minimally affected by nuisance factors (invariant). Do deep neural networks approximate such invariants sufficiently?
The cross-entropy loss most commonly used in deep learning does indeed enforce the creation of sufficient representations, but the other defining properties of optimal representations do not seem to be explicitly enforced by the commonly used training procedures.
However, **we show that this can be done by adding a regularizer, which is related to the injection of multiplicative noise in the activations, with the surprising result that noisy computation facilitates the approximation of optimal representations.** In this paper we establish connections between the theory of optimal representations for classification tasks, variational inference, dropout and disentangling in deep neural networks."

**"(3) We show that, counter-intuitively, injecting multiplicative noise to the computation improves the properties of a representation and results in better approximation of an optimal one (Section 6).**

(4) We relate such a multiplicative noise to the regularizer, and show that in the special case of Bernoulli noise, regularization reduces to dropout [3], thus establishing a connection to information theoretic principles. We also provide a more efficient alternative, called Information Dropout, that makes better use of limited capacity, adapts to the data, and is related to Variational Dropout"

<a href="http://drive.google.com/uc?export=view&id=1b0gwNXYvJ7J2ZlF26dP4a4iM3iSI2U6D"><img src="https://drive.google.com/uc?export=view&id=1Wp44C_bGBpZsWs3GWJguPIitQG7NtTgQ" width=500 heigth=500></a>

Original paper [here](https://arxiv.org/abs/1611.01353), implementation is not yet "standard", but can be found [here](https://github.com/ucla-vision/information-dropout)

**Learning: Noise is _not_ necessarily your enemy in Deep learning.**

### Connections to other methods

#### Variational autoencoders

Even the original paper on information dropout makes an explicit reference to variational autoencoders, so we have to keep this in mind when we get back to those later on.

#### Data augmentation

What if we have too little data, and "make some more" - by applying transformations to the original data, which are plausible in the given domain - like rotation and scaling in image domains?

<a href="http://drive.google.com/uc?export=view&id=17zfUo3UoD0AqwnEgCkROIQc_2X8cZx51"><img src="https://drive.google.com/uc?export=view&id=1dTTVfpEoLgETfKMFh_zcBsTshxRBfD-y"></a>

Detailed analysis in case of images can be found [here](http://cs231n.stanford.edu/reports/2017/pdfs/300.pdf)

And additionally if we do such "plausible" transformations, **we greatly enhance the generalization ability of our models**.

Remember: More data is better! (Mozer)

Also, augmenting data is a form of injecting domain specific knowledge.
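A minimal sketch of such "plausible" transformations, assuming a single-channel image stored as a 2-D NumPy array. The helper name `augment` and the choice of transformations are illustrative: horizontal flips are plausible for natural photos but not for digits, which is exactly where the domain knowledge enters.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2-D image of shape (H, W)."""
    out = image[:, ::-1] if rng.random() < 0.5 else image      # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)   # small translation
    # np.roll wraps pixels around the border; zero padding would be another choice
    return np.roll(out, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(28 * 28, dtype=float).reshape(28, 28)        # dummy "image"
batch = np.stack([augment(image, rng) for _ in range(8)])      # 8 new variants
```

Every call produces a new variant of the same example with the same label, so the model sees more data than was originally collected.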

Source: Lecture series of Michael C. Mozer at DeepLearn2017 Bilbao

<a href="http://drive.google.com/uc?export=view&id=17IyTGGXib9u9cXDbADGXyDsLUwUIuwOc"><img src="https://drive.google.com/uc?export=view&id=1FNQXY-_vD7NlqPQ9yQng0g4uJ1zcPmsW"></a>

And again it is notable that even Hinton admitted that deep learning did not work earlier because of the lack of data.

##### Again connection:

This topic is in strong connection with "data imputation" whereby we try to mitigate the effect of missing data by "imputing" missing variables. There are also additional approaches, which we can not discuss in detail now, but you can find [here](http://www.stat.columbia.edu/~gelman/arm/missing.pdf).

Even one step further is something like "Cluttered MNIST", where the database is saved in an altered form. Downloadable from [here](https://github.com/deepmind/mnist-cluttered).



#### Adversarial examples

What if we add _targeted_ noise to the dataset?

For the human eye the noise can be absolutely imperceptible (in extreme cases it can be **one pixel**), and yet we can force the network to classify the example into another class with high confidence.
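To get a feeling for how little noise is needed, here is a sketch of one well-known way to construct such a perturbation, the fast gradient sign method. It is not necessarily the method behind the examples shown here, and the "model" is only a hypothetical logistic regression with random weights standing in for a trained network.

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One fast-gradient-sign step against a logistic-regression classifier."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))    # predicted P(class 1)
    grad_x = (p - y) * w                      # d(cross-entropy) / d(input)
    return x + eps * np.sign(grad_x)          # every feature moves by at most eps

rng = np.random.default_rng(0)
w, b = rng.normal(size=784), 0.0              # hypothetical "trained" weights
x = 0.01 * np.sign(w)                         # an input confidently in class 1
x_adv = fgsm(x, y=1.0, w=w, b=b, eps=0.02)    # targeted noise, tiny per feature

score_clean = x @ w + b                       # > 0 means class 1
score_adv = x_adv @ w + b                     # < 0 means class 0
max_change = np.max(np.abs(x_adv - x))
```

No single input feature changes by more than `eps`, but because all the small changes push in the direction that increases the loss, their effect adds up over the 784 dimensions and flips the decision.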

<a href="https://cdn-images-1.medium.com/max/1400/1*Nj_toOwx_Hc5NLn97Jv-ww.png"><img src="https://drive.google.com/uc?export=view&id=1P2G0F-3OvQhLIcc6h7twyYQD61x0noYM"></a>

**This is one of the most serious challenges for neural models presently!!! Just think about the security implications!!!**

Naturally the experiments started to test out this attack vector in physical settings by adding minimal optical noise (in form of colorful stickers) to stop signs. Well, it succeeded.

<a href="http://bair.berkeley.edu/blog/assets/yolo/image1.png"><img src="https://drive.google.com/uc?export=view&id=1Y5tLCrQbY0R_sjI0yUvqoBbXVfIX9j_o" width=600 heigth=600></a>

Original [here](http://bair.berkeley.edu/blog/2017/12/30/yolo-attack/)

**This shows that the decision boundary between the classes is absolutely _not robust_!**

On the other hand, nothing prevents us from **regarding adversarial examples as a form of data augmentation**, and adding them to the datasets.

This method was recently taken to the extreme, when Google researchers utilized it to "reprogram" a trained network to carry out a different task than the original.

[Google Brain researchers demo method to hijack neural networks](https://venturebeat.com/2018/07/02/google-brain-researchers-demo-method-to-hijack-neural-networks/)

<a href="https://venturebeat.com/wp-content/uploads/2018/07/Capture-boring.png?fit=578%2C451&strip=all"><img src="https://drive.google.com/uc?export=view&id=1nia2Awq4jkDLaYWVD-SWNzFQ6zzrCkAf" width=600 heigth=600></a>

(This has some relevance for transfer learning also, since it can be considered as domain mapping - see later.)

#### Mixup - “Vicinal Risk Minimization” 

So getting back to the question of "large margins": what if we attack the base paradigm of empirical risk minimization, and move over to "vicinal risk minimization"?

This means that we do not just try to fit a model based on the error on the observed data ("empirical risk"), but use a method to force the model to learn something "between the classes", so we in essence try to constrain the behavior of the model in the space "in between".
(Just think about embedding and representation learning!)

<a href="http://drive.google.com/uc?export=view&id=1xXp9PUoKg5TpsEO5v_q62D32bTZaF00o"><img src="https://drive.google.com/uc?export=view&id=1xdCE-lr3kq6fLHIQeEaB4zn96g9zR3ur" width=800 heigth=800></a>

<a href="https://www.inference.vc/content/images/2017/11/download-87.png"><img src="https://drive.google.com/uc?export=view&id=1L6tM5SNlUXC6i9wdAuxHjZDGQPu3GdbS" width=60%></a>

Original paper [here](https://arxiv.org/abs/1710.09412), a very good in-detail analysis [here](https://www.inference.vc/mixup-data-dependent-data-augmentation/).
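The core of mixup fits into a few lines. A hedged sketch, assuming one-hot labels and one mixing coefficient per batch drawn from a Beta(`alpha`, `alpha`) distribution; the function name and the value of `alpha` are illustrative choices.

```python
import numpy as np

def mixup_batch(X, Y, alpha, rng):
    """Blend each example (and its one-hot label) with a randomly chosen partner."""
    lam = rng.beta(alpha, alpha)              # mixing coefficient in [0, 1]
    perm = rng.permutation(len(X))            # partner index for every example
    X_mix = lam * X + (1.0 - lam) * X[perm]
    Y_mix = lam * Y + (1.0 - lam) * Y[perm]   # labels are interpolated as well
    return X_mix, Y_mix

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 784))                # dummy inputs
Y = np.eye(10)[rng.integers(0, 10, size=32)]  # dummy one-hot labels
X_mix, Y_mix = mixup_batch(X, Y, alpha=0.2, rng=rng)
```

The network is then trained on `(X_mix, Y_mix)` instead of `(X, Y)`, so it is explicitly told what to output on the line segment between two training examples, not only at the examples themselves.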

This is still a "fringe" movement in deep learning, but it can have great potential!


## IV. “Covariate shift”

### The problem

"Covariate shift" is a general and rather nasty problem, when the distribution of your input shifts in a subtle way, implying that the learned model is no longer appropriate.

<a href="http://slideplayer.com/slide/5237435/16/images/5/Covariate+Shift+Adaptation.jpg"><img src="https://drive.google.com/uc?export=view&id=1K8nzgHkRO1MSpenw2iYnu4_zgYsD1H2D" width=700 heigth=700></a>

If we can realize this - eg. with testing in time, or with performance monitoring "a bit late" - we can try to re-train the model. (See eg. ["importance sampling"](https://en.wikipedia.org/wiki/Importance_sampling), or some kind of transfer learning, as elaborated later).

More about detecting covariate shift can be found [here](https://datascience.stackexchange.com/questions/8278/covariate-shift-detection).

For our purposes it is important that Ioffe and Szegedy argued that even **during training, inside backprop, covariate shift is happening** if the network is deep enough: when we update the weights, we change the output distribution of a layer, thus modifying the input distribution for the next layer - obviously, since we want to learn something. With this we are effectively "shooting at a moving target" during training.

<a href="https://image.slidesharecdn.com/dlmmdcud1l06optimization-170427160940/95/optimizing-deep-networks-d1l6-insightdcu-machine-learning-workshop-2017-8-638.jpg?cb=1493309658"><img src="https://drive.google.com/uc?export=view&id=1t99pHcGAJLbOkV5Mv4Mzn-mUOPZk532k" width=600 heigth=600></a>

### The solution: Batchnorm


"Batch normalization", that is, normalizing the activations of the nodes during training with the minibatch mean and variance.

<a href="http://drive.google.com/uc?export=view&id=1S55NF6zKwnOp8WEE6woWyDlw4LuF1ZaR"><img src="https://drive.google.com/uc?export=view&id=1qbOwyXIOjomvQMgIwkXqCn5N5zPnc1dC" width=400 heigth=400></a>

A nice analysis of the original method can be found [here](https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b), original paper [here](https://arxiv.org/abs/1502.03167). 

### Why does it work?

The normalization inside a batch ensures that even with changing weights, which change the input distribution, its mean and variance remain the same. It cannot "creep away" in the numeric range.
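The training-time forward pass can be sketched in a few lines of NumPy. This is only an illustration with hypothetical names; a full implementation would, as an additional step not shown here, also keep running averages of the statistics for use at inference time.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch (axis 0), then scale and shift."""
    mean = x.mean(axis=0)                      # one mean per feature
    var = x.var(axis=0)                        # one variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta                # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 16))   # "drifted" activations
out = batchnorm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
```

Whatever mean and scale the incoming activations have drifted to, the next layer sees features with (approximately) zero mean and unit variance, modified only by the learned `gamma` and `beta`.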

Detailed explanation by the famous Andrew Ng [here](https://www.coursera.org/learn/deep-neural-network/lecture/81oTm/why-does-batch-norm-work).

Recently - as of June 2018 - some doubt came up whether batchnorm is counteracting covariate shift at all (see [this](https://arxiv.org/pdf/1805.11604.pdf) paper) - though it is still considered beneficial (by smoothing the error surface).

## Newer normalization techniques

Though the application of batchnorm layers proved to be effective in speeding up convergence, some important shortcomings gradually became visible:

1. In case of "online" (example by example) learning or small batch sizes it does not work, since the mean and variance statistics of such a batch say little about the rest of the training data.

2. In case of recurrent networks (see later) it would be very cumbersome and a huge waste of resources.

Reacting to the above problems, multiple new normalization techniques have been introduced, which - in the end - also try to normalize the output of a neuron, but without relying on the batch statistics.

### Weight normalization

This method - similarly to batchnorm - uses a simple reparametrization trick: it stabilizes the output by decomposing the $\mathbf w$ weight vector for the neuron into the product of a scalar length parameter $g$ and a direction parameter $\mathbf v$:

$$
\mathbf w = \frac{g}{||\mathbf v||} \mathbf v
$$

With this trick we can "fix" the length of the neuron's weight vector in $g$, thus mitigating the covariate shift phenomenon.
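The reparametrization from the formula above, written out as a small NumPy sketch (the function name and the numbers are illustrative):

```python
import numpy as np

def weight_norm(v, g):
    """Reparametrize w = g * v / ||v||, so that ||w|| == |g| whatever v is."""
    return g * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
v = rng.normal(size=64)
w_before = weight_norm(v, g=1.5)
# Simulate a large update of the direction parameter v; g is left unchanged.
w_after = weight_norm(10.0 * v + rng.normal(size=64), g=1.5)

len_before = np.linalg.norm(w_before)
len_after = np.linalg.norm(w_after)
```

However much $\mathbf v$ changes during training, the length of the effective weight vector stays at $g$, so the scale of the neuron's pre-activation can only change through the single parameter $g$.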

### Layer normalization

Similarly to batchnorm, we would like to fix the input distribution to the neuron, not in the "batch dimension", but in the dimensions of the representations of individual datapoints: we standardize every individual input vector so that the whole vector has 0 mean and unit variance. After this (very similarly to batchnorm) we scale and shift all values with learned parameters for every neuron.

<a href="https://i1.wp.com/mlexplained.com/wp-content/uploads/2018/01/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-01-11-11.48.12.png?resize=768%2C448"><img src="https://drive.google.com/uc?export=view&id=1ks6yiXCPy0nrGYzqjPhCurmaN5S2XK71"></a>

Similarly to the other methods, here we also stabilize one of the components defining the output of the neuron, but unlike weight normalization, we don't manipulate the weights, but directly the distribution of the inputs.
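Compared with a batchnorm implementation, the only real change is the axis over which the statistics are computed. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize every example over its own features (last axis), then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)      # one mean per example
    var = x.var(axis=-1, keepdims=True)        # one variance per example
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
single_example = rng.normal(loc=5.0, scale=3.0, size=(1, 16))   # batch size 1
out = layernorm_forward(single_example, gamma=np.ones(16), beta=np.zeros(16))
```

Since no statistic is shared between examples, the result is well defined even for a batch of size one, which is exactly the case where batchnorm breaks down.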

### Further reading

More information about the normalization methods:

Original papers:

[Layer normalization](https://arxiv.org/pdf/1607.06450.pdf), [Weight normalization](https://arxiv.org/pdf/1602.07868.pdf)

### Maybe normalization is not necessary?

In some recent works, e.g. [FIXUP initialization: Residual learning without normalization](https://arxiv.org/pdf/1901.09321.pdf), the authors argue that with a more careful initialization scheme the need for batchnorm falls away even in case of very deep residual convolutional networks (more on those later), and the networks remain easily and efficiently trainable.

<a href="http://drive.google.com/uc?export=view&id=1-ti0idMx9vFbyLjpqzEfrCBW1w_m8ule"><img src="https://drive.google.com/uc?export=view&id=1GhW0a6ML3uiOSA9TdwJRxifaK0Ye6y1H" width=75%></a>

These are quite new results, "handle with care"!

## A sidenote again: Synthetic gradients

Research on the parallelizability of neural network training led Google DeepMind to reflect upon the fact that backprop on large enough NNs is inherently limited, because the gradients of the layers have to be calculated in sequential order. It is telling that at the scale of problems on DeepMind's horizon this is already considered a real constraint, so they try to parallelize the calculation of the layer updates.

For this to be achievable, they have introduced the concept of *"synthetic gradients"*. (Details can be found [here](https://deepmind.com/blog/decoupled-neural-networks-using-synthetic-gradients/).)

<a href="https://storage.googleapis.com/deepmind-live-cms/images/3-3.width-1500_qGOdtUS.png"><img src="https://drive.google.com/uc?export=view&id=1R1F1JuarEWMgEiZUC0MlMu8isXWGDURK" width=400 heigth=400></a>

"So, how can one decouple neural interfaces - that is decouple the connections between network modules - and still allow the modules to learn to interact? In this paper, we remove the reliance on backpropagation to get error gradients, and instead learn a parametric model which predicts what the gradients will be based upon only local information. We call these predicted gradients *synthetic gradients*."

<a href="https://storage.googleapis.com/deepmind-live-cms/images/3-5.width-1500_EZm1qhu.png"><img src="https://drive.google.com/uc?export=view&id=1HcqIPdC8nKe2D1j5nAAIOQD-ksybfmET" width=600 heigth=600></a>
"The synthetic gradient model takes in the activations from a module and produces what it predicts will be the error gradients - the gradient of the loss of the network with respect to the activations.

... and use the synthetic gradients (blue) to *update Layer 1 before the rest of the network has even been executed.*"

This is again a paradigmatically interesting concept, since it goes back to the roots of Hebbian learning, whereby only local information is necessary for the update of weights, but with a twist: sooner or later *some* global information is necessary, but much less. 

This view is shared by other scholars (e.g. Pierre Baldi in his talk ["Deep Learning: Theory, Algorithms and Applications to Natural Sciences"](https://drive.google.com/file/d/1lbqi2EB24dhm5lHGdmRhGWrwsyRvOcrf/view?usp=sharing) at the DeepLearn 2018 Summer School, Genova), who state that purely Hebbian learning is not possible: at least some partial "distant" supervision is necessary.

<a id="gdparams"></a>
# GD training hyperparameters - Learning rate and co.
<a href="https://i1.wp.com/theaviationist.com/wp-content/uploads/2017/12/XB-70-cockpit.jpg"><img src="https://drive.google.com/uc?export=view&id=1YoUsLfCxV4bISRv5rYOq3aOQz4E3DnAY" width="700px"></a>


## What control points do we have over our model?

- Structure
    - Architecture (Fully connected layers - till this point, but we will see more)   
        - Selection from "general" or standard architectures 
        - Custom "wiring"
    - Layer number
    - Layer size
- Learning parameters
    - Optimization method (Adam vs SGD vs…)
    - Parameters of these  
        - E.g. learning rate
        - In case of Adam $\beta_1$, $\beta_2$
        ...
- Epoch number 
    - Should always be considered together with “early stopping”
- Regularization parameters
    - Cost function regularization parameters
    - Dropout rate
    - Inf dropout noise level
   
If we consider all these "control points" and their combinations, we still have a *huge* space we can choose from. 

Since we do not have any analytic insights about the appropriate settings of these, we have to rely on "expert intuition" or "experience" (so much so, that some researchers [criticize](https://medium.com/@Synced/lecun-vs-rahimi-has-machine-learning-become-alchemy-21cb1557920d) their own field as "alchemy"), or we choose to explore the space of these settings, which can itself be cast as a search or optimization problem.

Naturally enough, researchers tried to attack this optimization problem with well known machine learning methods, specifically:

- Grid search (standard practice for other ML methods, implemented in [ScikitLearn](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), usable [for DL](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/))
- Random search: [Bergstra and Bengio 2012](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)
- [Bayesian optimization](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f)
- [Genetic algorithms](https://ai.googleblog.com/2018/03/using-evolutionary-automl-to-discover.html)
- [Reinforcement learning](https://arxiv.org/abs/1602.04062)
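To make the difference between the first two approaches concrete, here is a small sketch comparing grid search and random search under the same budget of 9 trials. The function `validation_loss` is a purely hypothetical stand-in for "train a model and return its validation loss", and the search ranges are arbitrary assumptions.

```python
import itertools
import numpy as np

def validation_loss(log_lr, dropout):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (log_lr + 2.3) ** 2 + 0.1 * (dropout - 0.3) ** 2

# Grid search: 3 x 3 = 9 trials, but only 3 distinct values per hyperparameter.
grid_lr = [-4.0, -3.0, -1.0]
grid_dropout = [0.0, 0.25, 0.5]
grid_trials = [(validation_loss(l, d), l, d)
               for l, d in itertools.product(grid_lr, grid_dropout)]
best_grid = min(grid_trials)          # tuple of (loss, log_lr, dropout)

# Random search: also 9 trials, but 9 distinct values per hyperparameter.
rng = np.random.default_rng(0)
random_trials = []
for _ in range(9):
    l = rng.uniform(-4.0, -1.0)
    d = rng.uniform(0.0, 0.5)
    random_trials.append((validation_loss(l, d), l, d))
best_random = min(random_trials)
```

The point made by Bergstra and Bengio is visible in the setup itself: when only a few hyperparameters really matter (here the learning rate dominates), random search tries many more distinct values of them for the same number of trials.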

The usage of these techniques is out of scope of this course, but it is worth mentioning that "AutoML" as a trend is the product of this approach. Its proponents argue that it is a perfect automatic solution for data science problems; its opponents tend to point out that it is exorbitantly costly to train, and that it acts as a kind of "replacement" for well understood solutions.


From the above parameter space, we restrict ourselves to discussing the learning rate, because of its peculiar properties.

## Learning rate and its "schedules"

Although, as we have already seen, there are GD variants that adaptively set the learning rate during training (Adadelta etc.), non-adaptive GD variants, e.g. vanilla SGD and Momentum are still frequently used for optimization and require manual setting.

For these variants, after setting an initial learning rate (which is a crucial hyperparameter), the rate is typically kept constant during an epoch, but is changed (typically decreased) after every epoch (or every $n$ epochs with a fixed $n$) according to a schedule. The most frequently used schedules are

+ **Step decay**: The learning rate is reduced by a factor $\varrho$ after every epoch or every $n$ epochs, that is, an
$$\text{lr} = \varrho \cdot \text{lr}$$ 
update is performed where $0<\varrho<1$. 

+ **Exponential decay**: The learning rate is set to be
$$\text{lr} = \text{lr}_0\cdot e^{-k t}
$$
where $k$ is a hyperparameter and $t$ is the number of already trained epochs (or $n$ epochs). It is easy to see that this is equivalent to a step decay with $\varrho = e^{-k}$.
+ **${1}/{t}$ decay**: The learning rate is set to
$$\text{lr} = \frac{\text{lr}_0 }{1 + k t}$$ where $k$ is a hyperparameter and $t$ is the number of already trained epochs (or $n$ epochs).
+ **Constant learning rate** across epochs.
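The decaying schedules from the list above can be written as plain functions of the epoch index. This is a minimal sketch; the function names and the default values of `rho`, `n` and `k` are assumptions chosen for illustration.

```python
import math

def step_decay(lr0, epoch, rho=0.5, n=10):
    """Multiply the rate by rho after every n epochs."""
    return lr0 * rho ** (epoch // n)

def exponential_decay(lr0, epoch, k=0.1):
    """lr = lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * epoch)

def inverse_time_decay(lr0, epoch, k=0.1):
    """lr = lr0 / (1 + k * t), the 1/t decay."""
    return lr0 / (1.0 + k * epoch)

lr0 = 0.1
for epoch in (0, 10, 20, 50):
    print(epoch,
          step_decay(lr0, epoch),
          exponential_decay(lr0, epoch),
          inverse_time_decay(lr0, epoch))
```

Calling `step_decay` with `n=1` and `rho=math.exp(-k)` gives the same values as `exponential_decay`, which is the equivalence mentioned in the list.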

These schedules can be combined into more complex ones, e.g. a schedule may keep the learning rate constant for a number of epochs and then start a step decay. The switch between the simple schedules can happen simply at a predetermined epoch, but it is frequently connected to a validation metric, e.g. can be triggered when a metric has stopped improving.
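A metric-triggered switch of this kind could look like the following sketch. The class name `PlateauDecay` and its parameters `rho` and `patience` are hypothetical; deep learning libraries ship their own, more complete variants of this idea.

```python
class PlateauDecay:
    """Keep the rate constant until the metric stalls, then apply one decay step."""

    def __init__(self, lr, rho=0.5, patience=2):
        self.lr, self.rho, self.patience = lr, rho, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss          # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.rho       # metric has stalled: decay the rate
                self.bad_epochs = 0
        return self.lr

schedule = PlateauDecay(lr=0.1)
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.65, 0.66]
history = [schedule.step(loss) for loss in val_losses]
```

In this toy run the rate is halved once, after the validation loss failed to improve for two consecutive epochs.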

<a href="https://cdn-images-1.medium.com/max/1600/1*VQkTnjr2VJOz0R2m4hDucQ.jpeg"><img src="https://drive.google.com/uc?export=view&id=1Q2IPPPzH2-pGdtR83U8nCKOPAFvzIeVU" width=600 heigth=600></a>
<a href="https://cdn-images-1.medium.com/max/1200/1*iSZv0xuVCsCCK7Z4UiXf2g.jpeg"><img src="https://drive.google.com/uc?export=view&id=1RewxptgGfQeB28XpsgpXPhnpkJ52dqyU" width=600 heigth=600></a>

[source](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)

## Systematic thinking about LR

[Setting the learning rate of your neural network](https://www.jeremyjordan.me/nn-learning-rate/)

<a href="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png"><img src="https://drive.google.com/uc?export=view&id=1j8ZG4mABW8IyT36iN56ERkrq94PjcaDC" width=55%></a>
<a href="https://www.jeremyjordan.me/content/images/2018/02/lr_finder.png"><img src="https://drive.google.com/uc?export=view&id=1shs2UPZZjTDtgN3P8fM0Ay9vnwoRXuFb" width=55%></a>
    


## Connection of LR and batch size 

Furthermore, a recently published paper draws attention to the fact that decreasing the learning rate according to a schedule can be substituted by a **gradual increase of the batch size**.

[Don’t decay the learning rate, increase batch size](https://arxiv.org/abs/1711.00489)

<a href="http://drive.google.com/uc?export=view&id=1ka9PN4zecdNVT6uE05VXEvbY_8EygEAv"><img src="https://drive.google.com/uc?export=view&id=1TJJthWLZOdxJQ_86687GJRicf_oGDD0x" width=700 heigth=700></a>

A potential cause of this phenomenon is that by increasing the batch size we compute each gradient update from more data, so we get a less and less noisy, more "regularized", "smooth" gradient.

If this is true, we can benefit, since the larger batch size speeds up training considerably.
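The "smoothing" effect of larger batches can be checked on a toy problem. The sketch below is an illustration under assumed settings (a one-parameter linear model on synthetic data): it estimates how much the minibatch gradient fluctuates for different batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=10_000)
w = 0.0  # current weight of a one-parameter linear model

def minibatch_gradient(batch_size):
    """Gradient of the mean squared error on one randomly drawn minibatch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    residual = w * X[idx, 0] - y[idx]
    return 2.0 * np.mean(residual * X[idx, 0])

def gradient_std(batch_size, repeats=500):
    """Spread of the gradient estimate across many random minibatches."""
    return np.std([minibatch_gradient(batch_size) for _ in range(repeats)])

noise = {b: gradient_std(b) for b in (8, 32, 128, 512)}
for b, s in noise.items():
    print(f"batch size {b:4d}: gradient std {s:.3f}")
```

The spread shrinks roughly with the square root of the batch size, so increasing the batch reduces the noise of the updates in a similar way as shrinking the learning rate would.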


## Caution with big batches!

Somewhat contrary to the statements above, it is worth mentioning that there is emerging evidence that small batch training has its own distinctive advantages, which are worth capitalizing on.

In the paper [Revisiting Small Batch Training for Deep Neural Networks](https://arxiv.org/abs/1804.07612) the authors conduct large scale training runs to determine the effect of batch size, and find that large batch sizes actually decrease the generalization performance of models, and that there is a rather small optimal batch size for training.

<a href="https://www.graphcore.ai/hs-fs/hubfs/ResNet32_CIFAR100_Aug_VB_val_dist.png?t=1536059639530&width=600&height=521&name=ResNet32_CIFAR100_Aug_VB_val_dist.png"><img src="https://drive.google.com/uc?export=view&id=17X1QiH8rNbqzNXmrMFaF0ny9xCOkHdhg" width=400 heigth=400></a>

This directly counteracts the trend towards larger batches for parallelism, which gained traction with the advent of large-scale distributed training methods.

**We should note that this shows how crucial batch size is, even though historically it was often dictated by technical limitations rather than chosen on principle (or was chosen well by luck, as in the case of the original SGD).**

A possible explanation is that, under a constant learning rate, one update with a large batch only approximates the sequence of updates that several smaller batches would have made, since all of its gradients are computed at the same, not yet updated weights; this may prevent fine-grained convergence.

<a href="https://www.graphcore.ai/hs-fs/hubfs/SGD_fig.png?t=1536059639530&width=900&name=SGD_fig.png"><img src="https://drive.google.com/uc?export=view&id=18JeRkwucZkXpd8QgxWHC76fuDnwOYl8B" width=400 heigh=400></a>

A detailed exploration of the paper can be found [here](https://www.graphcore.ai/posts/revisiting-small-batch-training-for-deep-neural-networks).

**Conclusion: Use small batch sizes, it might help - though it will be slower.**

## Loss topology and architecture

"The loss landscape of a neural network (visualized below) is a function of the network's parameter values quantifying the "error" associated with using a specific configuration of parameter values when performing inference (prediction) on a given dataset. This loss landscape can look quite different, even for very similar network architectures. The images below are from a paper, Visualizing the Loss Landscape of Neural Nets, which shows how residual connections in a network can yield an smoother loss topology."

<a href="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-26-at-10.50.53-PM.png"><img src="https://drive.google.com/uc?export=view&id=1_54gU3_A8z22Ncwdk3CBs4y2SmOePiDs" width=75%></a>

[Setting the learning rate of your neural network](https://www.jeremyjordan.me/nn-learning-rate/)

## Only decreasing?

**If it is true that the loss topology is "bumpy", does it make sense to ONLY decrease the LR?**

What if sometimes, **when we are stuck, we increase it a bit, and then decrease it again**?

What if we anticipate this in advance?

<a href="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-25-at-8.44.49-PM.png"><img src="https://drive.google.com/uc?export=view&id=1sIoyscsvAc8PVYJLu71pMiLaEniBkiCM" width=55%></a>

## Warm restarts

Original paper:
["Stochastic gradient descent with warm restarts"](https://arxiv.org/abs/1608.03983) 

"In this paper, we propose to periodically simulate warm restarts of SGD, where in each restart the
learning rate is initialized to some value and is scheduled to decrease. Four different instantiations
of this new learning rate schedule are visualized in Figure 1. Our empirical results suggest that SGD
with warm restarts requires 2× to 4× fewer epochs than the currently-used learning rate schedule
schemes to achieve comparable or even better results."

<a href="http://drive.google.com/uc?export=view&id=12a2yIch8Nnf27g8UEQ9ydJy9MhuAzysR"><img src="https://drive.google.com/uc?export=view&id=17tm7ekUg_6at0Q0FKTehLl2v9T5-4Hz9" width=700 heigth=700></a>
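The schedule of the warm restarts paper is cosine annealing within each cycle, $\text{lr} = \text{lr}_{min} + \frac{1}{2}(\text{lr}_{max} - \text{lr}_{min})(1 + \cos(\pi \, T_{cur} / T_i))$, where $T_{cur}$ counts the epochs since the last restart and $T_i$ is the length of the current cycle. A minimal sketch follows; the function name `sgdr_lr` and the default values are assumptions.

```python
import math

def sgdr_lr(epoch, lr_max=0.1, lr_min=0.0, t0=10, t_mult=2):
    """Cosine annealing with warm restarts; cycle i lasts t0 * t_mult**i epochs."""
    t_cur, t_i = epoch, t0
    while t_cur >= t_i:      # skip over the cycles that are already finished
        t_cur -= t_i
        t_i *= t_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))

lrs = [sgdr_lr(e) for e in range(70)]   # restarts happen at epochs 10 and 30
```

With `t_mult=2` every cycle is twice as long as the previous one, so restarts become rarer as training progresses.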

## "Superconvergence"

The usage of cyclic LR also has some interesting side effects, which made it very popular. Enter **"superconvergence"**.

"we describe a phenomenon, which we named “super-convergence”, where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate. A primary insight that allows super-convergence training is that large learning rates regularize the training, hence requiring a reduction of all other forms of regularization in order to preserve an optimal regularization balance."

<a href="https://cdn-images-1.medium.com/max/1200/1*cmYfSsmlXm8XdjNddsQqZw.jpeg"><img src="https://drive.google.com/uc?export=view&id=1-N-FgcHib5hK6V91s7Stq4aV14zYxhMl" width=65%></a>

Original source [here](https://arxiv.org/pdf/1708.07120.pdf)

More approachable summary [here](https://towardsdatascience.com/https-medium-com-super-convergence-very-fast-training-of-neural-networks-using-large-learning-rates-decb689b9eb0)

The "why" for cyclic LR:

"The motivation for the “One Cycle” policy was the following: The learning rate initially starts small to allow convergence to begin but as the network traverses the flat valley, the learning rate becomes large, allowing for faster progress through the valley. In the final stages of the training, when the training needs to settle into the local minimum, the learning rate is once again reduced to a small value."

<a href="https://cdn-images-1.medium.com/max/1200/1*Y3Xnw8qxOH6zdGmTlCMDbA.jpeg"><img src="https://drive.google.com/uc?export=view&id=1XEkoAKAFNA3riD5xStEqXH3CeIvISqjL" width=65%></a>
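The shape of such a schedule can be sketched as a piecewise-linear function. This is a simplified illustration and not the exact recipe of the paper: the function name `one_cycle_lr`, the three-phase layout and all default values (`div`, `final_div`, `phase_frac`) are assumptions.

```python
import numpy as np

def one_cycle_lr(step, total_steps, lr_max=1.0, div=10.0,
                 final_div=1000.0, phase_frac=0.45):
    """Warm up to lr_max, cool down symmetrically, then decay far below the start."""
    lr_start = lr_max / div
    lr_final = lr_max / final_div
    up_end = phase_frac * total_steps
    down_end = 2 * phase_frac * total_steps
    # piecewise-linear interpolation through the four corner points
    return float(np.interp(step,
                           [0.0, up_end, down_end, total_steps],
                           [lr_start, lr_max, lr_start, lr_final]))

schedule = [one_cycle_lr(s, 1000) for s in range(1001)]
```

The single large peak in the middle is what distinguishes this policy from the repeated cycles of warm restarts.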

### LR as regularizer

Another interesting point is that the original paper argues that **large learning rates act as regularizers**, which, by the way, is consistent with the observations above about the connection with batch sizes: small batches and large LR are both regularizers.


### Sidenote: how to find the learning rate?

One of the key questions of the "1cycle" policy is, what is the maximally tolerable learning rate, that can help in training?

Whilst investigating the superconvergence phenomenon, the authors of the paper above rely on [these results](https://arxiv.org/abs/1506.01186), which describe an empirical procedure for setting the learning rate. As Sylvain Gugger writes:

"Over an epoch begin your SGD with a very low learning rate (like $10^{−8}$) but change it (by multiplying it by a certain factor for instance) at each mini-batch until it reaches a very high value (like 1 or 10). Record the loss each time at each iteration and once you're finished, plot those losses against the learning rate. You'll find something like this:"

<a href="https://sgugger.github.io/images/art2_courbe_lr.png"><img src="http://drive.google.com/uc?export=view&id=1LX3TMSQ-yqszAzENI_-eKAfxtWKHGtQP"></a>

"The loss decreases at the beginning, then it stops and it goes back increasing, usually extremely quickly. That's because with very low learning rates, we get better and better, especially since we increase them. Then comes a point where we reach a value that's too high and the phenomenon shown before happens. Looking at this graph, what is the best learning rate to choose? Not the one corresponding to the minimum.

Why? Well the learning rate that corresponds to the minimum value is already a bit too high, since we are at the edge between improving and getting all over the place. We want to go one order of magnitude before, a value that's still aggressive (so that we train quickly) but still on the safe side from an explosion. In the example described by the picture above, for instance, we don't want to pick $10^{−1}$ but rather $10^{−2}$.

This method can be applied on top of every variant of SGD, and any kind of network. We just have to go through one epoch (usually less) and record the values of our loss to get the data for our plot."

[source](https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html)
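The procedure quoted above can be tried on a toy problem. The sketch below is a hedged illustration on synthetic linear regression data; the function name `lr_range_test`, the stopping rule (stop when the loss exceeds four times the best value so far) and all constants are assumptions. For a smooth curve it records the loss on the whole toy dataset, whereas the quoted recipe records the minibatch loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=2048)

def lr_range_test(lr_start=1e-6, lr_end=10.0, n_steps=128, batch_size=16):
    """Run one pass of SGD while growing the learning rate exponentially."""
    w = np.zeros(10)
    factor = (lr_end / lr_start) ** (1.0 / (n_steps - 1))
    lr, lrs, losses = lr_start, [], []
    for step in range(n_steps):
        loss = np.mean((X @ w - y) ** 2)        # recorded on all data (see above)
        lrs.append(lr)
        losses.append(loss)
        if not np.isfinite(loss) or loss > 4 * min(losses):
            break                                # the loss has exploded, stop the test
        xb = X[step * batch_size:(step + 1) * batch_size]
        yb = y[step * batch_size:(step + 1) * batch_size]
        grad = 2.0 * xb.T @ (xb @ w - yb) / batch_size
        w -= lr * grad
        lr *= factor                             # increase the rate at each mini-batch
    return np.array(lrs), np.array(losses)

lrs, losses = lr_range_test()
lr_at_min = lrs[np.argmin(losses)]
suggested_lr = lr_at_min / 10.0   # one order of magnitude below the minimum
```

Plotting `losses` against `lrs` on a logarithmic axis should give a curve of the kind shown in the figure above; `suggested_lr` follows the "one order of magnitude before the minimum" rule of thumb from the quote.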


### Snapshot ensembles

Warm restarts were originally a separate line of research, but they were soon combined with ["Snapshot ensembles"](https://arxiv.org/abs/1704.00109). This technique is connected - even in name - to ensemble models.

During this procedure, we save all "good" model states - typically before warm restarts - and finally do an ensemble of them in the classical sense.

Its biggest (proposed) advantage is that we hope to collect solutions from similarly powerful but distinct optima which were visited during training, so their ensemble will be more powerful than any of them alone.

<a href="http://ruder.io/content/images/2017/11/snapshot_ensembles.png"><img src="https://drive.google.com/uc?export=view&id=1ALXu1G4Zpxqw7P41j95tHDbmwk2wbYdX" width=700 heigth=700></a>
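The ensembling step itself is simple: the class probabilities predicted by the saved snapshots are averaged. A minimal sketch follows, in which the logits of the three snapshots are made-up numbers and the helper names are assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(snapshot_logits):
    """Average the class probabilities predicted by each saved snapshot."""
    probs = np.stack([softmax(l) for l in snapshot_logits])
    return probs.mean(axis=0)

# Hypothetical logits of three snapshots for 2 examples and 3 classes.
snapshots = [
    np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 0.5]]),
    np.array([[1.5, 1.6, 0.0], [0.0, 2.0, 0.1]]),
    np.array([[2.5, 0.5, 0.0], [0.2, 0.3, 0.4]]),
]
avg_probs = ensemble_predict(snapshots)
predicted_class = avg_probs.argmax(axis=-1)
```

Note that individual snapshots may disagree (the second snapshot slightly prefers class 1 for the first example), while the averaged prediction follows the majority of the probability mass.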

This also gives an answer to the question - posed in early stopping context - if we should store and use the final model. No, not necessarily.

Moreover if we accept the fact that a number of "global" optima are present for a given model, we are inclined to "sample" from more "basins", which even connects us to boosting methods.

For opinions on this, [this](http://www.argmin.net/2018/02/05/linearization/) post is worth reading. It can point towards a more unified view.

Another very interesting summary of the developments in this regard is [this](http://ruder.io/deep-learning-optimization-2017/).


(Sebastian and the team at AYLIEN are very up to date in NLP, worth following [here](http://ruder.io/#open).)

## An unexpected side-effect: Better optimizers!

### RAdam

The "one cycle" policy, especially its first part, the **"warmup" phase**, where we **start with a small initial learning rate and gradually increase it**, can have unexpected side effects in case of adaptive learning rate optimizers, especially **Adam**, since **warmup mitigates the generalization problem of adaptive LR methods**.

The crucial insight of the paper ["On the Variance of the Adaptive Learning Rate and Beyond"](https://arxiv.org/abs/1908.03265) is that much of the loss in generalization performance in case of Adam comes from its naive over-reliance on the **initial variance in gradients**, thus the **weight distribution gets quickly distorted** and never fully finds its way back to fruitful, more global optima.

<a href="http://drive.google.com/uc?export=view&id=1mYfms1HJeL7O_OuvPcDg3HZpdVpyh2ow"><img src="https://drive.google.com/uc?export=view&id=1lzYQSLn8AhJU1OjKjbyuvhic_-Qx1-fv" width=50%></a>

So the newest, state of the art optimization method seems to be a form of Adam, namely **rectified Adam (or RAdam)** that incorporates some learnings from the warm-up method and variable learning rates.

_It is basically trying to set up an adaptive regularization scheme with which it balances the amount of Adam's adaptive LR properties, thus in extreme case it can act as SGD, disregarding the aggregated variance data, and only if appropriate does it start to behave like Adam._

Or with the words of the authors:

"Comparing these two strategies (warmup and RAdam), RAdam deactivates the adaptive learning rate when its variance is divergent, thus avoiding undesired instability in the first few updates."
 
A nice introduction can be found [here](https://medium.com/@lessw/new-state-of-the-art-ai-optimizer-rectified-adam-radam-5d854730807b).

It promises to be fast, but more importantly **robust across a wide selection of learning rates**. 
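The "deactivation" of the adaptive learning rate mentioned in the quote is controlled by a rectification term that depends only on the step count $t$ and on $\beta_2$. The sketch below computes this term following the formulas of the paper as far as I can reproduce them; the function name and the choice to return `None` for the non-adaptive phase are assumptions made for illustration.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return the rectification factor r_t, or None while the variance is not tractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None   # adaptive term switched off: momentum-SGD style update
    return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

factors = {t: radam_rectification(t) for t in (1, 4, 5, 100, 10_000)}
```

With the common default $\beta_2 = 0.999$ the adaptive term is switched off for the first four steps, then the factor grows from a small value towards 1, which acts like a built-in warmup.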

<a href="https://miro.medium.com/max/700/1*BMwu8Km-CtPsvaH8OM5_-g.jpeg"><img src="https://drive.google.com/uc?export=view&id=1DCnEETLqnoQaiRkvc4wZzepCNBv9dGzT" width=65%></a>

Since the [paper](https://arxiv.org/abs/1908.03265) describing the method is still pretty fresh, the verdict is still out, but looks very promising! (Implementations are not yet mainstream...)

### Lookahead

Though the inspiration is not that direct and obvious, the idea of storing some weights during training had some "spin-off" ideas in optimization. In their recent paper [LookAhead optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610) Zhang et al. proposed an optimization method where they **keep a copy of the weights, and use two optimization regimes, one "slow" and one "fast" for the network.**

<a href="http://drive.google.com/uc?export=view&id=1EhpRMvpgKowimKMnuVrkcXtd6hB79iIo"><img src="https://drive.google.com/uc?export=view&id=1ttOcZKzax1Zh5Ar5M39ooAwy7XrYkobg" width=75%></a>

**After a short period (some iterations, e.g. 5) they then "synchronize" the weights.**

Or with the words of the authors: 

"Lookahead maintains a set of slow weights $φ$ and fast weights $θ$, which get synced with the fast weights every $k$ updates. The fast weights are updated through applying $A$, any standard optimization algorithm, to batches of training examples sampled from the dataset $D$. After $k$ inner optimizer updates using $A$, the slow weights are updated towards the fast weights by linearly interpolating in weight space, $θ − φ$. We denote the slow weights learning rate as $α$. After each slow weights update, the fast weights are reset to the current slow weights value."

Why is this good?

As this [excellent description](https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d) puts it:

"...in effect it allows a faster set of weights to ‘look ahead’ or explore while the slower weights stay behind to provide longer term stability. The result is reduced variance during training, and much less sensitivity to sub-optimal hyper-parameters and reduces the need for extensive hyper-parameter tuning... By way of simple analogy, LookAhead can be thought of as the following. Imagine you are at the top of a mountain range, with various dropoff’s all around. One of them leads to the bottom and success, but others are simply crevasses with no good ending. To explore by yourself would be hard because you’d have to drop down each one, and assuming it was a dead end, find your way back out. But, if you had a buddy who would stay at or near the top and help pull you back up if things didn’t look good, you’d probably make a lot more progress towards finding the best way down because exploring the full terrain would proceed much more quickly and with far less likelihood of being stuck in a bad crevasse."

<a href="http://drive.google.com/uc?export=view&id=1A43MSBp0s-zKO8H8EtUxY_B1rcTLJnRC"><img src="https://drive.google.com/uc?export=view&id=1Ugm-nexj9KsNuJZ5NLFEmPRVw5Uwy0TO" width=85%></a>

The thing seems to actually work!
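The Lookahead update described above fits in a few lines. The following sketch uses plain gradient descent as the inner optimizer on a toy quadratic loss; the loss, the function names and the values of `k`, `alpha` and `inner_lr` are assumptions for illustration.

```python
import numpy as np

def grad(theta):
    """Gradient of a toy quadratic loss 0.5 * theta^T H theta (hypothetical problem)."""
    H = np.diag([1.0, 10.0])
    return H @ theta

def lookahead_sgd(phi0, k=5, alpha=0.5, inner_lr=0.05, outer_steps=50):
    phi = np.array(phi0, dtype=float)       # slow weights
    for _ in range(outer_steps):
        theta = phi.copy()                  # fast weights start from the slow weights
        for _ in range(k):                  # k updates of the inner optimizer
            theta -= inner_lr * grad(theta)
        phi += alpha * (theta - phi)        # move slow weights towards the fast weights
    return phi

final = lookahead_sgd([3.0, -2.0])
```

Any standard optimizer could take the place of the inner gradient descent loop; the slow weights only see the net displacement after every `k` fast steps.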

### Surprise: a combination, Ranger

Well, if both RAdam and Lookahead achieved impressive new state-of-the-art results, why not combine the two?

This is exactly what happened, thus the new optimization method [Ranger](https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d) was born.

**The results were impressive: the normalization effects of RAdam at the beginning and the overall stabilization of Lookahead combine seamlessly!**

As of September 2019, this was the state-of-the-art, but the saga continues... 


# Epilogue

## Did we understand generalization well enough?

We are by now all too familiar with the drawbacks of empirical risk minimization, that is: we are very much afraid of learning the "quirks of the dataset" (Hinton) and overfitting. Furthermore, we wholeheartedly subscribed to the understanding of **statistical learning theory** about the **relationship between complexity and overfitting / generalization** as follows:

<a href="http://drive.google.com/uc?export=view&id=1aPYe74krTI3RY1Nf2rYO5G06qZVbI3Sc"><img src="https://drive.google.com/uc?export=view&id=12cxR79w_Kv_OEPcbRzsrLY7VIPTcLGXi" width=35%></a>

But what if this is not the whole picture? What if the **connection between complexity and generalization is not that simple?**

Some recent observations (like in the recent paper [Reconciling modern machine learning and the bias-variance trade-off](https://arxiv.org/abs/1812.11118)) point in the other direction. Maybe the effect of more capacity is detrimental **only in case of models smaller than the "memorization capacity"**, and maybe **we should have gone even bigger!**

<a href="http://drive.google.com/uc?export=view&id=1xvtWpiUxYkiYqPzrCx7BwFe1_YYh3eo5"><img src="https://drive.google.com/uc?export=view&id=1lFXEXRxZBU9uDFzUXyaBI0xlld1MFEJp" width=85%></a>

Much is still unknown.

There are some interesting results pointing in a bit of the opposite direction also, raising the question:


## Do we need training and big networks at all?

To say that the field of deep learning is in flux is a major understatement. At least we assumed that we need large networks and have to train them for an extensive period of time with sophisticated methods.

Well, maybe not so, in two ways:

### The "Lottery Ticket Hypothesis"

"...after training a network, **set all weights smaller than some threshold to zero (prune them), rewind the rest of the weights to their initial configuration, and then retrain the network from this starting configuration keeping the pruned weights frozen (not trained).** Using this approach, they obtained two intriguing results.

First, they showed that the pruned networks performed well. **Aggressively pruned networks (with 99.5 percent to 95 percent of weights pruned) showed no drop in performance compared to the much larger, unpruned network. Moreover, networks only moderately pruned (with 50 percent to 90 percent of weights pruned) often outperformed their unpruned counterparts.**

Second, as compelling as these results were, the characteristics of the remaining network structure and weights were just as interesting. Normally, if you take a trained network, re-initialize it with random weights, and then re-train it, its performance will be about the same as before. But with the skeletal Lottery Ticket (LT) networks, this property does not hold. The network trains well only if it is rewound to its initial state, including the specific initial weights that were used. Reinitializing it with new weights causes it to train poorly. As pointed out in Frankle and Carbin’s study, it would appear that the **specific combination of pruning mask** (a per-weight binary value indicating whether or not to delete the weight) **and weights underlying the mask form a lucky sub-network** found within the larger network, or, as named by the original study, a winning “Lottery Ticket.”"

<a href="https://1fykyq3mdn5r21tpna3wkdyi-wpengine.netdna-ssl.com/wp-content/uploads/2019/05/blog_header_2-1068x458.png"><img src="https://drive.google.com/uc?export=view&id=1zVaznTkOUe6_iDpc5LT1xfNY2IZm9ESh" width=85%></a>

[Original paper](https://arxiv.org/pdf/1803.03635.pdf)

A [more thorough analysis](https://eng.uber.com/deconstructing-lottery-tickets/) 
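The "prune, then rewind" step from the quote can be sketched with NumPy arrays standing in for one layer of a network. Everything here is illustrative: the function name, the use of a quantile as threshold and the fake "trained" weights are assumptions, and no real training is performed.

```python
import numpy as np

def lottery_ticket_reset(initial_weights, trained_weights, prune_fraction):
    """Prune the smallest-magnitude trained weights and rewind the survivors."""
    threshold = np.quantile(np.abs(trained_weights), prune_fraction)
    mask = np.abs(trained_weights) > threshold      # True = weight is kept
    rewound = initial_weights * mask                # pruned weights are set to zero
    return rewound, mask

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(20, 10))
w_trained = w_init + rng.normal(scale=0.5, size=(20, 10))   # stand-in for real training
w_ticket, mask = lottery_ticket_reset(w_init, w_trained, prune_fraction=0.8)
```

When retraining from `w_ticket`, the gradients would also be multiplied by `mask`, so that the pruned weights stay frozen at zero.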

**Takeaways:**
- Much of the capacity of deep models and the associated training time is wasted
- Initialization is a dominant factor; maybe the size of networks only matters because it gives randomness enough room to come up with "lottery tickets"
- There is a very interesting interplay between structure, learning and performance in deep networks
    
### Weight agnostic networks

"...Schmidhuber et al. have shown that a randomly-initialized LSTM [13] with a learned linear output layer can predict time series... we aim to search for **weight agnostic neural networks**, architectures with strong inductive biases that **can already perform various tasks with random weights.**"

<a href="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/mnist_cover.png"><img src="https://drive.google.com/uc?export=view&id=1uDX-0HZ8iih7kHwGeWFbsz2bP9fBfCdc" width=85%></a>


[Original](https://weightagnostic.github.io)

**Takeaways:**
- Maybe we do not need (that much) training at all?
- Structure can be key, as well as the right inductive bias!

# Strategic advice: How to debug ML models?

With so many "knobs" one can get lost pretty easily, and the chance of making a mistake is high. The question is then: How to start?

Though there are multiple strategies, one of the interesting ones was presented by **Josh Tobin**, at the time a research scientist at OpenAI:

[Troubleshooting Deep Neural Networks](http://josh-tobin.com/troubleshooting-deep-neural-networks.html)

<a href="http://josh-tobin.com/assets/debugging_overview.jpg"><img src="https://drive.google.com/uc?export=view&id=1mOFalWJOS4-_Zlo-xYGBpKQSHZEDfWLg" width=55%></a>

```python
from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/XtCNNwDi9xg", width="560", height="315")
```

A very detailed guide with a lot of good advice, worth reading and / or watching!

Another good source is **Andrej Karpathy's
[A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)**

**Final remark:**
    
A recent work carried out a thorough investigation of hyperparameters, and came up with "sensible default values".

["Rethinking Default Values: a Low Cost and Efficient Strategy to Define Hyperparameters"](https://arxiv.org/abs/2008.00025) is definitely worth a read!