<a id="howmanyneurons"></a>
# Architectures, depth, width


## Depth or width?
Since Cybenko proved (and others later extended the result) that a neural network with only one hidden layer can approximate arbitrary continuous functions, it is rather surprising that we are not talking about "wide learning", that is, models with one, albeit very wide, hidden layer. Practice shows otherwise, and theory is starting to catch up:

(For details see the answer [here](https://stats.stackexchange.com/questions/182734/what-is-the-difference-between-a-neural-network-and-a-deep-neural-network-and-w). Most probably we would need exponentially more width in a network to match the result achievable with depth, but this is far from proven.)

Empirically, depth in itself is useful, since we suppose and observe (like [here](http://cvlab.cse.msu.edu/pdfs/DoConvolutionalNeuralNetworksLearnClassHierarchy.pdf)) that networks **learn a hierarchy of features** about the data. Increasing depth is thus substantially beneficial, and not just as an increase of "raw capacity".

<img src="https://qph.fs.quoracdn.net/main-qimg-94b020eda6d3894258166dd38a1f6255" width=600 heigth=600>

### Depth as successive application of hierarchical kernels

As we have seen in the case of Universal Approximation, we can construct complex functions piecewise from neurons.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQRs8Auv4obVSRNCE-qWgEy96NGSY9oZlRtC_No3wPE_uuRVfLU" width=45%>

<img src="https://i.stack.imgur.com/xcdwn.png" width=60%>
<img src="https://i.stack.imgur.com/blIBz.png" width=60%>
<img src="https://ars.els-cdn.com/content/image/1-s2.0-S089360809700097X-gr3.gif" width=55%>

This strongly resembles the effect that we can observe in the case of ensemble models.

<img src="https://i.stack.imgur.com/HXGXw.png" width=60%>

However, what we can observe in the case of neural networks is a more powerful effect: in a sense, a combination of the piecewise construction of decision boundaries with the learning of successive, hierarchic "kernel operations" that "embed" the data space into a meaningful representation, enabling easy separability.

<img src="http://drive.google.com/uc?export=view&id=1tQu8JagtQKjd7xVbB5uDBA0CebjQcZ2B" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1q6TEXhcZ0hU9nv4CycGcNJyUb9RqC_Xy" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1UFV35b84geZTymaTKQBpLXva8efafloW" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1jAyFn9iKhjSADG-YViN73goVic1x8iu5" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1XSrsBdnan08LVjcVjiJwRwn6_u3HyzvA" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1Aqx6qLy9pVt1-p2IG_CM0SKLnh7-Y2cI" width=50%>

[source](https://github.com/random-forests/applied-dl/blob/master/examples/twist-and-fold-moons.ipynb)

For some interactive / video illustrations see [here](https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html) and [here](https://srome.github.io/Visualizing-the-Learning-of-a-Neural-Network-Geometrically/).

**One could argue that this in itself is the decisive feature behind the success of deep learning.**


Recently a counter-intuitive line of reasoning has also arisen, which proposes that the depth of networks, that is, their "overparametrization", is in itself not harmful and may even be beneficial for training; see for example [here](http://www.offconvex.org/2018/03/02/acceleration-overparameterization/).


## Invariances

The other big factor behind the push for depth is strongly connected to the use of complex "architectures" for neural networks (instead of just a fully connected feedforward setup), whereby the network "design" explicitly contains some "wiring", in most cases motivated by invariances and other special properties of the learning domain.

In the later part of the course we'll investigate these architectural possibilities (CNNs, LSTMs and beyond).

<a id="dlmove"></a>
# Deep Learning as a movement

Breakthroughs in Deep Learning brought about the next "AI spring", leading to an exponential rise in the number of publications in AI - keeping up with "Moore's Law" in computation.

<img src="https://cdn-images-1.medium.com/max/1000/0*znJS1Aygd_B-u9rA" width=600 heigth=600>

[Source](https://medium.com/syncedreview/google-ai-chief-jeff-deans-ml-system-architecture-blueprint-a358e53c68a5)

Not just the technology but also the funding of AI changed dramatically, shifting from public to private.

<img src="https://www.fabernovel.com/content/uploads/2017/03/DARPA-NSF-funding-of-AI-1024x406.png" width=600 heigth=600>

(Though military funding is still there...)

[Source](https://en.fabernovel.com/insights/economy/8-facts-about-ai-research-funding)

Counterintuitively, the number of openly available papers also exploded, since at the same time "open science" and "open access" (pioneered by e.g. [ArXiv](https://arxiv.org/)) gained traction.

<img src="https://arxiv.org/help/stats/2017_by_area/cumsubs.png" width=600 heigth=600>

We can surely say that "multitudes" of unsung heroes are working on AI papers right now.

<a id="initialization"></a>
# Initialization


We incrementally minimize the loss function by optimizing the weights and biases. Convergence properties are strongly influenced by the initial values of the weights (and convergence may even be lost).

- If weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
- If weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
- For early deep learning applications vanishing and exploding gradients were problematic, which hindered progress.
- Summation drives the feedforward process, so we can only calculate the contributions of individual weights to the error - and thus modify them - if they are different. If they are all the same, there is "symmetry": we have essentially reduced our network's capacity to that of one neuron, with the same gradients and the same modifications for every one of them.
- ... An even more extreme case arises if we initialize with zero, since that prevents any learning.

<img src="https://raw.githubusercontent.com/ritchieng/machine-learning-stanford/master/w5_neural_networks_learning/zerotheta_initialisation.png" width=600 heigth=600>

- We need **non-zero** and **symmetry breaking** initialization.
- Even using small random values is a suboptimal choice, since we can still get slow convergence. (These issues were solved by Xavier initialization - see below.)
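The symmetry argument can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-hidden-layer tanh network with a squared-error loss (the function name, sizes and constants are illustrative only): with constant weights every hidden unit receives exactly the same gradient, with all-zero weights the hidden-layer gradient vanishes, and random weights break the symmetry.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. W1 for a tiny 1-hidden-layer tanh net."""
    h = np.tanh(W1 @ x)              # hidden activations, shape (n_hidden,)
    y_hat = W2 @ h                   # scalar output
    err = y_hat - y
    dh = err * W2 * (1.0 - h ** 2)   # backpropagate through the tanh units
    return np.outer(dh, x)           # shape (n_hidden, n_inputs)

x = np.array([0.5, -1.0, 2.0])
y = 1.0

# Constant initialization: every hidden unit is an exact copy of the others.
g_const = hidden_weight_gradient(np.full((4, 3), 0.3), np.full(4, 0.3), x, y)

# All-zero initialization: no gradient reaches the hidden weights at all.
g_zero = hidden_weight_gradient(np.zeros((4, 3)), np.zeros(4), x, y)

# Random initialization breaks the symmetry.
rng = np.random.default_rng(0)
g_rand = hidden_weight_gradient(rng.normal(0, 0.5, (4, 3)), rng.normal(0, 0.5, 4), x, y)

rows_identical = np.allclose(g_const, g_const[0])   # same gradient for every hidden unit
zero_gradient = np.allclose(g_zero, 0.0)            # nothing to learn from
rows_differ = not np.allclose(g_rand, g_rand[0])    # units can now specialize
print(rows_identical, zero_gradient, rows_differ)
```

Since identical units also receive identical updates, they stay identical after every step, which is why the constant case never recovers on its own.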

Source for examples comes from [here](https://intoli.com/blog/neural-network-initialization/). 

1. Sample net:

(From [Keras](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py); we will discuss that toolset in later lectures.) A Convolutional Neural Net architecture (also elaborated in later classes) with 2 convolutional layers, using max pooling, dropout, and ReLU as activation.


<img src="https://intoli.com/blog/neural-network-initialization/img/training-losses.png">
Left: All weights 0, only oscillation in cost, no learning

Middle: Slow convergence

Right: Weights from a "good" distribution, inverse proportion to the number of input weights

### Pre-training as initialization

For a short period of time around 2006 - 2012, the problem of initializing a deeper network was solved by layerwise pre-training (which has a strong connection to autoencoders - we will discuss them later on). 

<img src="http://drive.google.com/uc?export=view&id=105D-cYATtqHXXZnR4ssm-j37RVTWr_XE"  width=600 heigth=600>

See paper [here](https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf).

The unsupervised pre-training was thought to be necessary to get the weights roughly into a realistic region based on the data. Later on it turned out that the distribution of weights only had to satisfy some basic conditions for the network to be practically trainable.


### Xavier / He initialization

<img src="https://t1.daumcdn.net/cfile/tistory/2777CD4E57A0077436" width=400 heigth=400>

Where `fan_in` is the width of the input and `fan_out` is the width of the output of the given layer. 

Original paper for Xavier-algorithm can be found [here](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). Simple explanation [here](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization).

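Later on the ["He" version](https://arxiv.org/pdf/1502.01852.pdf) of initialization was introduced. It follows the same idea, but `fan_in` is divided by a factor of 2 (that is, the variance of the weights is `2 / fan_in`), which compensates for ReLU zeroing out roughly half of the activations.

Xavier initialization works well with activation functions whose expected output is 0 (e.g. tanh), so for ReLU the He method should be used, which enables the training of really deep networks.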
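As a rough illustration, the two schemes can be sketched in NumPy as follows. The formulas used here are the commonly cited ones (Glorot uniform with variance `2 / (fan_in + fan_out)` and He normal with variance `2 / fan_in`); they are an assumption and may differ in detail from the variant shown in the figure above. The function names are hypothetical.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init: Var(w) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He init for ReLU layers: Var(w) = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_xavier = xavier_uniform(400, 200, rng)
W_he = he_normal(400, 200, rng)

# Empirical variances next to the target values.
print(W_xavier.var(), 2.0 / (400 + 200))
print(W_he.var(), 2.0 / 400)
```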


Example: 

- 5-layer multilayer neural network, uniform number of weights in each layer
- Can observe the shifting of scales because of the three initialization schemes
- First case: "upper" layers become "zeroed out"
- Second case: distribution of weights is rather uniform
- Third case: upper layer's weights "explode", since they are getting too big.

<img src="https://intoli.com/blog/neural-network-initialization/img/linear-output-progression-violinplot.png" width=600 heigth=600>

<img src="https://intoli.com/blog/neural-network-initialization/img/relu-output-progression-violinplot.png" width=600 heigth=600>

Another source [here](https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-bfcdd806bf9e)

Intuitive explanation:
- In each layer the distribution of the weights should have a roughly constant variance with not too extreme of a distribution
- Strongly uneven distribution of weights indicates that gradients are either exploding or vanishing

Suppose we have a linear activation. Then the outputs of the next layer are:

$$ x_k^{(i+1)} \approx \sum_{j = 1}^{n_i}  x_j^{(i)} w_{jk}^{(i)}\,.$$


Assuming that the weights and activations of each layer are independent and identically distributed and that their means are zero, we can use basic properties of variance to express the variance of the $(i+1)$-th layer’s outputs in terms of the variances of the $i$-th layer’s weights and outputs:

$$\begin{align*}
\text{Var}(x^{(i+1)})
& \approx \text{Var} \left( \sum_{j = 1}^{n_i}  x_j^{(i)} w_{jk}^{(i)} \right) \\
& \approx \sum_{j=1}^{n_i} \text{Var}( x^{(i)} ) \text{Var}( w^{(i)} )\,.
\end{align*}$$

This then simplifies to:

$$ \text{Var}(x^{(i+1)}) = n_i \text{Var}(x^{(i)}) \text{Var}(w^{(i)})$$

In order to achieve $\text{Var}(x^{(i+1)}) = \text{Var}(x^{(i)})$ we therefore have to impose the condition


$$\text{Var}( w^{(i)}) = \frac{1}{n_i}\,$$
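The condition can be checked with a small simulation. This is only a sketch under the assumptions of the derivation (linear activation, zero-mean Gaussian weights and inputs); the layer width, depth and sample count are arbitrary illustrative choices.

```python
import numpy as np

def output_variances(weight_std, n=256, n_layers=5, n_samples=2000, seed=0):
    """Push random inputs through n_layers linear layers and record Var(x) per layer."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(n_samples, n))
    variances = [x.var()]
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(n, n))
        x = x @ W                      # linear activation: no nonlinearity applied
        variances.append(x.var())
    return variances

n = 256
too_small = output_variances(0.5 / np.sqrt(n))   # Var(w) = 0.25 / n -> signal shrinks
balanced = output_variances(1.0 / np.sqrt(n))    # Var(w) = 1 / n    -> signal preserved
too_large = output_variances(2.0 / np.sqrt(n))   # Var(w) = 4 / n    -> signal explodes

for name, v in [("too small", too_small), ("balanced", balanced), ("too large", too_large)]:
    print(name, ["%.3g" % val for val in v])
```

With $\text{Var}(w) = c / n_i$ the variance is multiplied by roughly $c$ in every layer, so after five layers the three cases differ by several orders of magnitude.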



## Initialization based on data

The above-mentioned methods rely on the architectural properties of networks, but the question naturally arises whether we can fit the initialization scheme to the data in some sense.

This is what [this paper](https://arxiv.org/abs/1710.10570) investigated in 2017.


1. PCA based initialization

<img src="http://drive.google.com/uc?export=view&id=1X13X3XsUNRHwOC0RCoTjmS0Gk7H01mLW"  width=600 heigth=600>


2. Init based on data statistics

<img src="http://drive.google.com/uc?export=view&id=1I9ezXPYaAi-uZQ70lulBcF_BlYis9rg2"  width=600 heigth=600>


This idea in a sense echoes the "original" deep learning solution of [Bengio et al.](https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf), which was layerwise pre-training of neural nets with "autoencoders" or "restricted Boltzmann machines", and it strongly reinforces the ideas we have discussed about representation learning. (We will return to this topic again.)

<a id="sgdandco"></a>
# SGD and variants

## Three basic forms of gradient descent

### Vanilla/Batch/Full Gradient Descent

Calculates the full gradient for the loss on the whole training data set and updates the weights with it $\Rightarrow$ update is only once per epoch.

#### Properties

- **+** Updates with the true, full gradient for the whole dataset
- **+** The individual gradients can be computed in parallel -- useful for concurrent implementations
- **+** The full gradient updates can lead to a more stable convergence
- **-** For large data sets doing only one update per epoch leads to very slow convergence.
- **-** Requires accumulating a huge amount of gradients for larger data sets which can be computationally intensive
- **-** Works with the full training gradient without any type of regularization.

### Stochastic Gradient Descent

Approximates the full gradient by calculating the gradient for a single example. Updates weights with the computed gradient after each example. The "stochastic" label comes from the fact that (in contrast to full GD) the (typically random) order of the examples influences the trajectory of the parameter evolution. 

#### Properties

- **+** The frequent updates can lead to very fast learning in certain cases.
- **+** Noisy update process can help avoid local minima.
- **+** Approximating the gradient is a type of regularization.
- **-** The very frequent updates are computationally expensive.
- **-** Cannot be parallelized.
- **-** The noisy update process can make the convergence difficult.

### Minibatch Stochastic Gradient Descent

A balanced middle ground between full and stochastic gradient descent: it divides the training set into smaller sets, the so called "minibatches". For each minibatch, the gradient for the data points in the minibatch is calculated and an update is made. Minibatch size varies by application -- it is typically between 32 and 256.

- **+** Relatively frequent updates help fast convergence.
- **+** Using an approximation of the full gradient has a regularizing effect and helps avoid local minima.
- **+** Can be parallelized, utilizes fast, hardware supported tensor operations.
- **-** Introduces an additional hyperparameter, the minibatch size.
- **-** Requires the summation of individual gradients in the minibatch.
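The three forms differ only in how many examples contribute to one update, so they can be sketched with a single training loop and a `batch_size` parameter. The example below is a minimal illustration on a hypothetical synthetic linear regression problem (the data, learning rate and number of epochs are arbitrary assumptions), not a reference implementation.

```python
import numpy as np

def train_linear(X, y, batch_size, lr=0.05, epochs=20, seed=0):
    """Fit y ~ X @ w with a mean-squared-error loss; returns w and the number of updates."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle the examples every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # MSE gradient on this batch only
            w -= lr * grad
            n_updates += 1
    return w, n_updates

rng = np.random.default_rng(42)
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w_full, updates_full = train_linear(X, y, batch_size=256)  # full/batch GD: 1 update per epoch
w_mini, updates_mini = train_linear(X, y, batch_size=32)   # minibatch SGD: 8 updates per epoch
w_sgd, updates_sgd = train_linear(X, y, batch_size=1)      # "pure" SGD: 256 updates per epoch

for name, w, k in [("full", w_full, updates_full), ("mini", w_mini, updates_mini), ("sgd", w_sgd, updates_sgd)]:
    print(name, k, np.linalg.norm(w - true_w))
```

With the same number of epochs, the full-batch variant has made far fewer updates and is still further from `true_w`, which illustrates the "one update per epoch" drawback listed above.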

<font color='red'>Important:</font> Somewhat confusingly, in the AI/ML community by SGD people commonly mean Minibatch Stochastic Gradient Descent... 

A nice introduction to the GD variants, on which the present discussion was based, can be found here: [A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size)

See also [Ruder's discussion](http://ruder.io/optimizing-gradient-descent).




## GD optimization algorithms

### Problems/Challenges for vanilla GD

- It is difficult to find the proper learning rate. With a wrong value the loss can be divergent or convergence can be very slow.
- Learning rate regimes are general and rigid, the rate is not adapted to the data set.
- It might be advisable to update different parameters with different learning rates, e.g. we might want to have larger updates for parameters associated with very rare features.
- The problem of **saddle points**, where the gradients are 0 but it is not a local extremum, e.g. because the point is a local minimum in one dimension and a local maximum in another:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Saddle_point.svg/300px-Saddle_point.svg.png" width="600px">

- In general, vanilla minibatch GD does not make use of the earlier gradients, that is, it ignores  information about the curve that is contained in the past trajectory.

The most well known improved GD variants in use are the following:

### Momentum

This method calculates the new update vector at time $t$, $\mathbf v_t$  by adding the weighted previous $\mathbf v_{t-1}$ update vector to the actual gradient:

\begin{align}
\mathbf v_t &= \gamma \mathbf v_{t-1} + \eta \nabla_\theta J( \theta_t) \\  
\theta_{t+1} &= \theta_t - \mathbf v_t
\end{align}

Where the $\gamma$ weight is typically set to a value close to but smaller than 1, e.g. 0.9, and $\eta$ is the learning rate.

Consequences:

- Momentum can move over saddle points because the inherited momentum term can change the parameters even if the local gradient is zero.
- Compared to vanilla (S)GD the update component increases for dimensions whose gradient's sign remains the same across updates, while it decreases for directions whose gradient's sign oscillates. The consequences are reduced oscillation and faster convergence.

Without momentum:

<img src="http://ruder.io/content/images/2015/12/without_momentum.gif">

With momentum:

<img src="http://ruder.io/content/images/2015/12/with_momentum.gif">

(Both figures from [Ruder's GD optimization blog entry](http://ruder.io/optimizing-gradient-descent/))
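The momentum update rule can be sketched directly from the two equations above. The test function here, an elongated quadratic bowl $J(\theta) = \frac{1}{2}(\theta_1^2 + 25\,\theta_2^2)$, and the values of $\eta$ and the step count are hypothetical choices for illustration; setting $\gamma = 0$ recovers vanilla gradient descent.

```python
import numpy as np

def grad(theta):
    """Gradient of J(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return np.array([1.0, 25.0]) * theta

def loss(theta):
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def run_momentum(theta0, eta, gamma, n_steps):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = gamma * v + eta * grad(theta)   # v_t = gamma * v_{t-1} + eta * grad J(theta_t)
        theta = theta - v                   # theta_{t+1} = theta_t - v_t
    return theta

theta_plain = run_momentum([1.0, 1.0], eta=0.03, gamma=0.0, n_steps=200)  # vanilla GD
theta_mom = run_momentum([1.0, 1.0], eta=0.03, gamma=0.9, n_steps=200)    # with momentum

print(loss(theta_plain), loss(theta_mom))
```

On this particular problem the flat direction ($\theta_1$) limits vanilla GD, while the accumulated velocity lets momentum make faster progress along it.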

### Nesterov momentum (NAG)

A small modification of momentum, which tries to "look ahead" when calculating the gradient. Since momentum adds the momentum term $\gamma \mathbf v_{t-1}$ to the computed gradient at $\theta_t$, we  can use this term to look ahead and calculate the gradient for a point which is a bit further in the direction we are heading,  namely, at $\theta_t -  \gamma \mathbf v_{t-1}$. The equations with this modification are 

\begin{align}
\mathbf v_t &= \gamma \mathbf v_{t-1} + \eta \nabla_\theta J( \theta_t-\gamma \mathbf v_{t-1}) \\  
\theta_{t+1} &= \theta_t - \mathbf v_t
\end{align}

<img src="http://cs231n.github.io/assets/nn3/nesterov.jpeg">

(The source of the figure is Karpathy's discussion at http://cs231n.github.io/neural-networks-3/)
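A minimal sketch of the Nesterov update follows; the only change relative to plain momentum is that the gradient is evaluated at the look-ahead point $\theta_t - \gamma \mathbf v_{t-1}$. The quadratic test function and the hyperparameter values are again hypothetical choices for illustration.

```python
import numpy as np

def grad(theta):
    """Gradient of J(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return np.array([1.0, 25.0]) * theta

def loss(theta):
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def run_nesterov(theta0, eta, gamma, n_steps):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta - gamma * v            # where the momentum term alone would take us
        v = gamma * v + eta * grad(lookahead)    # gradient is taken at the look-ahead point
        theta = theta - v
    return theta

theta_nag = run_nesterov([1.0, 1.0], eta=0.03, gamma=0.9, n_steps=200)
theta_gd = run_nesterov([1.0, 1.0], eta=0.03, gamma=0.0, n_steps=200)   # gamma = 0: vanilla GD

print(loss(theta_gd), loss(theta_nag))
```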






### Adagrad

In contrast to the previous methods, this is not concerned with the direction of the update vectors, but with adapting the learning rate for the individual parameters based on their past gradients.

Concretely, Adagrad maintains the $s^p$ running sum of squares of the previous subgradients for each $p$ parameter (weight or bias), and uses it to adapt the (separate) learning rate for the parameters at each step in the following way:

$$
p_{t+1} = p_{t} - \frac{\eta}{\sqrt {s^p_{t} + \epsilon}} \Delta_pJ(\theta_t)
$$

$$
s^p_{t}=s^p_{t-1} + (\Delta_pJ(\theta_{t}))^2
$$

$\epsilon$ here is a small (typically $\sim 10^{-8}$) smoothing term to avoid division by zero.

Using the $\odot$ elementwise (Hadamard) product operation between vectors we can write the update rules in a vectorized form:

$$
\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt {\mathbf s_{t} + \epsilon}} \odot \Delta_\theta J(\theta_t)
$$

$$
\mathbf s_{t}=\mathbf s_{t-1} + \Delta_\theta J(\theta_t)\odot \Delta_\theta J(\theta_t)
$$

where all scalar operations on $\mathbf s$ in $\frac{\eta}{\sqrt {\mathbf s_{t} + \epsilon}}$ in the first equation are to be interpreted elementwise.

**Advantages:**
- The different learning rates are very useful for convergence, especially since Adagrad can selectively make larger updates to weights that are associated with sparse data (many 0 subgradients), while smaller ones to weights for more frequent features.
- Requires no learning rate tuning/regime -- $\eta$ is typically set to the default (0.01) at the beginning and remains unchanged.

**Problems:**
- The running sum of squared gradients is constantly growing, so the learning rates monotonically decrease and eventually become very small, basically stopping the learning process.
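Both the advantage and the problem can be seen in a small sketch of the vectorized update rule. The gradient stream below is artificial (a hypothetical "frequent" parameter that receives a gradient at every step and a "rare" one that receives a gradient only at every tenth step); it is meant only to show how the per-parameter rates $\eta / \sqrt{\mathbf s_t + \epsilon}$ evolve.

```python
import numpy as np

def adagrad_step(theta, s, g, eta=0.01, eps=1e-8):
    """One Adagrad update: accumulate squared gradients, then scale the step elementwise."""
    s = s + g * g
    theta = theta - eta / np.sqrt(s + eps) * g
    return theta, s

eta = 0.01
theta = np.zeros(2)
s = np.zeros(2)
rates = []
for t in range(100):
    # Parameter 0: gradient at every step; parameter 1: gradient only at every 10th step.
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, s = adagrad_step(theta, s, g, eta=eta)
    rates.append(eta / np.sqrt(s + 1e-8))   # effective per-parameter learning rates
rates = np.array(rates)

print(s)          # accumulated squared gradients
print(rates[-1])  # the rarely updated parameter keeps a larger learning rate
```

The effective rates never increase, which is exactly the monotonic decay listed under "Problems".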



### RMSProp

Tries to solve the main problem of Adagrad: the aggressive, monotonic decrease of the adaptive learning rates.  The RMSprop solution is to apply a _decay_ term $\rho$ in the maintained running sum of the squared subgradients, so the new value is

$$
\mathbf s_{t} = \rho \mathbf s_{t-1} + (1-\rho) (\Delta_\theta J(\theta_t)\odot\Delta_\theta J(\theta_t))
$$
The result is that the effect of older subgradients diminishes over time. ($\rho$ is typically set to 0.9, like the momentum term.)
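A minimal sketch of the difference, assuming that the decayed average simply replaces the Adagrad sum in the Adagrad update rule (the usual formulation) and using an artificial constant gradient stream: the RMSprop accumulator saturates, while the Adagrad sum keeps growing.

```python
import numpy as np

def rmsprop_step(theta, s, g, eta=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update: decayed average of squared gradients, Adagrad-style scaling."""
    s = rho * s + (1.0 - rho) * g * g
    theta = theta - eta / np.sqrt(s + eps) * g
    return theta, s

theta = np.array([1.0])
s_rms = np.zeros(1)
s_ada = np.zeros(1)
g = np.array([1.0])          # artificial constant gradient, for illustration only
for _ in range(200):
    theta, s_rms = rmsprop_step(theta, s_rms, g)
    s_ada = s_ada + g * g    # Adagrad accumulator on the same gradients, for comparison

rate_rms = 0.001 / np.sqrt(s_rms + 1e-8)
rate_ada = 0.001 / np.sqrt(s_ada + 1e-8)
print(s_rms, s_ada)
print(rate_rms, rate_ada)
```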



### Adam 

Adam (Adaptive Moment Estimation) is similar to RMSprop in that it adapts the learning rates for parameters by maintaining running sums of the squares of subgradients but -- somewhat similarly to momentum -- it also relies on the previous gradient vectors in the form of a decayed sum:

$$
\mathbf s_{t} = \beta_2 \mathbf {s}_{t-1} + (1-\beta_2) \Delta_\theta J(\theta_t)\odot\Delta_\theta J(\theta_t)
$$

$$ \mathbf m_{t} = \beta_1 \mathbf m_{t-1} + (1-\beta_1)\Delta_\theta J(\theta_t)$$

Since with $\mathbf 0$ initialization and the usually slow decay ($\beta_1$ and $\beta_2$ are very close to 1) these sums tend to be biased towards $\mathbf 0$, the update rule uses unbiased (bias-corrected) values, where $\beta^t$ denotes $\beta$ raised to the power of the step count $t$:

$$ \hat{\mathbf s}_t = \frac{\mathbf s_t}{1-\beta_2^t} $$
$$ \hat{\mathbf m}_t = \frac{\mathbf m_t}{1-\beta_1^t} $$

With these unbiased values, the update rule is a combination of the Momentum and the Adagrad rule:


$$
\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt {\hat{\mathbf s_{t}} + \epsilon}} \odot \hat{\mathbf m_t}
$$

The recommended settings are 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.
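A single Adam step can be sketched as follows. The sketch assumes the convention of the original Adam paper, in which $\beta_1$ decays the gradient average $\mathbf m$, $\beta_2$ decays the squared-gradient average $\mathbf s$, and the bias correction divides by $1-\beta^t$ with $t$ the step count. It keeps $\epsilon$ inside the square root as in the update rule above, whereas many implementations add it outside. The gradient values are arbitrary.

```python
import numpy as np

def adam_step(theta, m, s, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * g          # decayed average of gradients
    s = beta2 * s + (1.0 - beta2) * g * g      # decayed average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected estimates
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - eta / np.sqrt(s_hat + eps) * m_hat
    return theta, m, s

theta = np.zeros(2)
m = np.zeros(2)
s = np.zeros(2)
g = np.array([10.0, -0.1])     # gradients of very different scale
theta, m, s = adam_step(theta, m, s, g, t=1)
print(theta)
```

After bias correction the very first step has a magnitude of roughly $\eta$ for every parameter, regardless of the scale of its gradient; only the sign of the gradient matters.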

### AdamW

An interesting development for Adam is that after the initial hype it was not used much, because researchers found that it performed worse than the alternatives (e.g. simple Momentum). Now it seems that the bad performance might have been due to the complicated interaction between Adam and regularization, specifically L2 weight decay: in contrast to vanilla SGD, there is a huge difference between adding an L2 penalty to the loss function and applying weight decay directly during the updates. The latter, called **AdamW**, seems to lead to far superior results. See the blog entry [AdamW and Super-convergence is now the fastest way to train neural nets](http://www.fast.ai/2018/07/02/adam-weight-decay/) for details.
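The difference between the two ways of regularizing can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `weight_decay` parameter are hypothetical, the decay term is scaled by $\eta$ (conventions differ between implementations), and the loss gradient is set to zero so that only the regularization acts.

```python
import numpy as np

def _adam_direction(m, s, g, t, beta1, beta2, eps):
    """Shared Adam machinery: returns the bias-corrected, rescaled update direction."""
    m = beta1 * m + (1.0 - beta1) * g
    s = beta2 * s + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    s_hat = s / (1.0 - beta2 ** t)
    return m_hat / np.sqrt(s_hat + eps), m, s

def adam_l2_step(theta, m, s, g, t, eta=0.001, weight_decay=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with an L2 penalty: the penalty gradient is rescaled like any other gradient."""
    g = g + weight_decay * theta
    direction, m, s = _adam_direction(m, s, g, t, beta1, beta2, eps)
    return theta - eta * direction, m, s

def adamw_step(theta, m, s, g, t, eta=0.001, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled weight decay: the decay bypasses the adaptive rescaling."""
    direction, m, s = _adam_direction(m, s, g, t, beta1, beta2, eps)
    return theta - eta * (direction + weight_decay * theta), m, s

theta0 = np.array([1.0, 100.0])   # one small and one large weight
zeros = np.zeros(2)
g = np.zeros(2)                   # zero loss gradient: only the regularization acts

theta_l2, _, _ = adam_l2_step(theta0, zeros, zeros, g, t=1)
theta_w, _, _ = adamw_step(theta0, zeros, zeros, g, t=1)
shrink_l2 = theta0 - theta_l2     # how much each weight was shrunk
shrink_w = theta0 - theta_w
print(shrink_l2, shrink_w)
```

With the L2 penalty inside the gradient, Adam's normalization shrinks both weights by about the same absolute amount, whereas decoupled weight decay shrinks each weight in proportion to its size.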

### Adadelta

Similarly to Adam, Adadelta can also be considered a combination of RMSprop and Momentum. In addition to accumulating the usual $\mathbf s$ running sum of the squared gradients, Adadelta also keeps track of a $\mathbf v$ decayed sum of previous parameter updates:

$$\mathbf s_{t} = \rho \mathbf s_{t-1} + (1-\rho) \Delta_\theta J(\theta_t)\odot\Delta_\theta J(\theta_t)
$$

$$\mathbf v_{t} = \rho\mathbf v_{t-1}  + (1-\rho)(\theta_{t+1}-\theta_t)\odot(\theta_{t+1}-\theta_t)  $$

The Adadelta update rule using these running sums is 

$$
\theta_{t+1} = \theta_{t} - \frac{\sqrt {\mathbf v_{t-1} + \epsilon}}{\sqrt {\mathbf s_{t} + \epsilon}} \odot \Delta_\theta J(\theta_t)
$$

($\mathbf v _{t-1}$ is used because we do not yet know $\mathbf v _{t}$ at the time of the update.)

Notice that in contrast to all methods we have seen, here the $\eta$ learning rate has been completely eliminated. The only parameter to be set is the decay term $\rho$, which is typically set to a value close to 1, say 0.9.
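The Adadelta rule can be sketched in a few lines. The example below follows the three equations above; the value $\epsilon = 10^{-6}$ and the toy objective $J(\theta) = \frac{1}{2}\theta^2$ are assumptions made for illustration. Note that no learning rate appears anywhere in the code.

```python
import numpy as np

def adadelta_step(theta, s, v, g, rho=0.9, eps=1e-6):
    """One Adadelta update; s tracks squared gradients, v tracks squared parameter updates."""
    s = rho * s + (1.0 - rho) * g * g
    delta = -np.sqrt(v + eps) / np.sqrt(s + eps) * g   # uses v from the previous step
    v = rho * v + (1.0 - rho) * delta * delta
    return theta + delta, s, v

theta = np.array([1.0])
s = np.zeros(1)
v = np.zeros(1)
history = [theta[0]]
for _ in range(50):
    g = theta.copy()                  # gradient of J(theta) = 0.5 * theta**2
    theta, s, v = adadelta_step(theta, s, v, g)
    history.append(theta[0])

print(history[0], history[1], history[-1])
```

The first steps are tiny because $\mathbf v$ starts at zero and only $\epsilon$ drives the numerator; the step size then grows as $\mathbf v$ accumulates.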




### Demonstration

(source: http://ruder.io/optimizing-gradient-descent/index.html, original source: https://twitter.com/alecrad)

<img src="https://cdn-images-1.medium.com/max/800/1*XVFmo9NxLnwDr3SxzKy-rA.gif">


<img src="https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1*SjtKOauOXFVjWRR7iCtHiA.gif&resize=w704">

### Recommended readings about GD optimization algorithms

- [Ruder:  An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)
- [Karpathy: Stanford cs231n  part 3](http://cs231n.github.io/neural-networks-3/)

### Tensorflow implementations

- [Momentum](https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer)
- [Adagrad](https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer)
- [Adam](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer)
- [AdamW](https://www.tensorflow.org/api_docs/python/tf/contrib/opt/AdamWOptimizer)
- [RMSprop](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer)
- [Adadelta](https://www.tensorflow.org/api_docs/python/tf/train/AdadeltaOptimizer)
