In [None]:
# structure of ANN
# equations for back-prop
# typical error functions (cross entropy, MSE loss)
# why optim in high dimension works

<br>

# What is Deep Learning?
---

<br>

### Motivation: limitations of manually selected basis functions

Many Machine Learning techniques (such as linear regression or support vector machines) are based on linear models. To deal with inputs that are not linearly separable, we need to transform each input $x$ into $\phi(x) = (\phi_1(x) \dots \phi_n(x))$, so that the transformed inputs become linearly separable. The $\phi_i$ are a fixed set of **basis functions** that define this new space.

The pipeline of the learning algorithm is therefore:

&emsp; $x \mapsto \phi(x) \mapsto f(\phi(x), \theta)$
&emsp; where $f$ is our hypothesis set and $\theta$ the parameters to learn

While powerful, this approach is limited by the need to identify a good set of basis functions for our problem, by the limited number of basis functions (the more there are, the more computation is needed), and by the limited form of the basis functions (SVMs allow an infinite number of basis functions through kernels, but all of the same shape).
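
As a minimal sketch of the pipeline above, consider the XOR problem, which is not linearly separable in $(x_1, x_2)$. The basis $\phi(x) = (x_1, x_2, x_1 x_2)$, the weights and the helper names below are hand-picked for illustration and are not taken from any library:

```python
# XOR is not linearly separable in (x1, x2), but it becomes separable
# after a hand-picked (hypothetical) basis expansion phi.

def phi(x):
    """Map a 2D input to the basis (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_classifier(features, theta, bias):
    """f(phi(x), theta): threshold a linear score in the feature space."""
    score = sum(w * f for w, f in zip(theta, features)) + bias
    return 1 if score > 0 else 0

# Hand-chosen parameters (an assumption): score = x1 + x2 - 2 * x1 * x2 - 0.5
theta, bias = (1.0, 1.0, -2.0), -0.5

xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
predictions = {x: linear_classifier(phi(x), theta, bias) for x in xor_data}
print(predictions)
```

Here the basis function was found by hand, which is exactly the step that becomes difficult on real problems.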

<br>

### Deep Learning

Deep Learning is an attempt at removing the need to handcraft and manually select the basis functions, by making the learning algorithm able to find the basis functions by itself. Instead of making the learning happen only at the last stage of the pipeline described above, **learning is done at every stage of the pipeline**:

&emsp; $x \longmapsto h_1 = f_1(x,\theta_1) \longmapsto h_2 = f_2(h_1, \theta_2) \longmapsto \; \cdots \; \longmapsto h_n = f_n(h_{n-1}, \theta_n)$

During the training period, all the parameters $\theta_i$ are typically **learned jointly**. An alternative approach would be to learn the parameters one at a time (for instance by cycling between them).

This separate optimization is in fact sometimes used in Deep Learning as well, when doing **transfer learning**. We use a bigger task (or many small tasks, which we call **multi-task learning**) to learn the parameters of the lower levels, before transferring them to a smaller task that only tunes the higher level parameters.

<br>

### Why not learn one big function instead?

This can actually be done. The universal approximation theorem states that a neural network with just two layers can approximate any continuous function arbitrarily well, but there are many practical and theoretical advantages in learning a combination of small functions rather than one big function.

The first advantage has to do with the size of the model. A 2-layer neural network can approximate any such function, but may require a number of parameters that grows exponentially to do so. Shallow and wide networks are also often associated with overfitting, and it is **more robust to learn a composition of simple functions rather than one big function**.

The second advantage has to do with parameter sharing and reuse. If we learn small functions, there might be ways to recompose the lower layer functions into new neural networks to perform similar tasks. In fact, one way to make a model learn more general functions is indeed to **share the simple functions between several tasks**.

Now, the big drawback of increasing depth rather than width (or the complexity of the composed functions) is that it presents some challenges in terms of numerical optimization. Many mechanisms help with this (batch normalization, residual connections, rectified linear units), but in general, the deeper the network, the more difficult it is to train.

<br>

# Back propagation
---

The typical way to train a Deep Learning model is to use gradient descent. Inputs are fed into the model (a pipeline of functions, all of which must be differentiable) and the outputs are compared to the expected values via a **loss function** (which represents our objective), giving us an error that we want to minimize. We call the computation of the loss the **forward pass**:

&emsp; $x$
&emsp; $\mapsto$
&emsp; $h_1 = f_1(x,\theta_1)$
&emsp; $\mapsto$
&emsp; $h_2 = f_2(h_1,\theta_2)$
&emsp; $\mapsto$
&emsp; $h_3 = f_3(h_2,\theta_3)$
&emsp; $\mapsto$
&emsp; $e = loss(h_3)$

We then tune the parameters of our model using gradient descent:

&emsp; $\displaystyle \theta_i \leftarrow \theta_i - \alpha \, \frac{\partial e}{\partial \theta_i}$
&emsp; where $e$ is the error

To do so, we have to compute the partial derivatives of the loss with respect to all the parameters:

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_3} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial \theta_3}$

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_2} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial \theta_2}$

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_1} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta_1}$

We can see that the prefixes of the derivatives match up, and so the most efficient way to compute them is to do a **backward pass**, going in the opposite direction to the forward pass (starting with the higher layers), and memoizing the intermediate results so that the next set of parameter gradients can be computed efficiently.

**Note**: The notation used above is that of partial derivatives. It is typically used when the parameters $\theta_i$ are scalars (not vectors) and when the outputs of each function $f_i$ are real numbers. The algorithm still works with higher dimensional outputs; we just need to use gradients and Jacobians instead of partial derivatives:
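
The backward pass can be sketched in the scalar case as follows. This is only an illustration under assumptions that are not in the text above: each layer is taken to be $f_i(h, \theta_i) = \tanh(\theta_i h)$, the loss is a squared error against a hypothetical `target`, and a finite difference is used purely as a sanity check:

```python
import math

def forward(x, thetas):
    """Forward pass h_i = tanh(theta_i * h_{i-1}); returns [x, h_1, ..., h_n]."""
    hs = [x]
    for theta in thetas:
        hs.append(math.tanh(theta * hs[-1]))
    return hs

def loss(h_last, target):
    return 0.5 * (h_last - target) ** 2

def backward(hs, thetas, target):
    """Backward pass: the shared prefix of each derivative is computed once and reused."""
    grads = [0.0] * len(thetas)
    upstream = hs[-1] - target                    # de/dh_n
    for i in reversed(range(len(thetas))):
        local = 1.0 - hs[i + 1] ** 2              # tanh' evaluated at this layer
        grads[i] = upstream * local * hs[i]       # de/dtheta for this layer
        upstream = upstream * local * thetas[i]   # de/dh of the layer below, memoized
    return grads

def numeric_grad(x, thetas, target, i, eps=1e-6):
    """Central finite difference, used only to check the backward pass."""
    up, down = list(thetas), list(thetas)
    up[i] += eps
    down[i] -= eps
    return (loss(forward(x, up)[-1], target)
            - loss(forward(x, down)[-1], target)) / (2 * eps)

x, target, alpha = 0.7, 0.2, 0.1
thetas = [0.5, -1.3, 0.8]
grads = backward(forward(x, thetas), thetas, target)
checks = [numeric_grad(x, thetas, target, i) for i in range(len(thetas))]

# One gradient descent step: theta_i <- theta_i - alpha * de/dtheta_i
new_thetas = [t - alpha * g for t, g in zip(thetas, grads)]
loss_before = loss(forward(x, thetas)[-1], target)
loss_after = loss(forward(x, new_thetas)[-1], target)
print(grads, loss_before, loss_after)
```

The variable `upstream` plays the role of the memoized prefix: it is updated once per layer instead of being recomputed for each parameter.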

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_i} \in \mathbb{R}^N$
&emsp; becomes
&emsp; $\displaystyle \nabla_{\theta_i} e$
&emsp;
&emsp; and
&emsp; 
&emsp; $\displaystyle \frac{\partial h_n}{\partial h_{n-1}} \in \mathbb{R}^{M \times N}$
&emsp; becomes
&emsp; $\displaystyle J_{f_n}$

These notations however only serve to clutter the logic above. It is easier to consider the partial derivatives as being potentially vectors or even matrices when needed.

<br>

# Deep Neural Networks
---

In theory, there is no specific need for Deep Learning to prefer any specific form of functions $f_1 \dots f_n$ to compose together in a pipeline. In practice though, parameterized linear functions separated by simple non-linearities make the learning of the parameters $\theta_1 \dots \theta_n$ easier and more efficient.

<br>

### Partial derivatives of matrices

If we write the partial derivatives of a multi-layer perceptron, we see that they are basically matrix products, which can be parallelized on GPU as easily in the backward pass as in the forward pass:

&emsp; $\displaystyle y = A x \implies \frac{\partial y}{\partial x} = A \;\; \text{and} \;\; \frac{\partial y}{\partial A} = \begin{pmatrix} x^T \\ \vdots \\ x^T \end{pmatrix}$
&emsp; because each output dimension of $y$ is of the form $row^T x$, where $row$ is the corresponding row of $A$

To see how to handle point-wise (dimension-wise) non-linearities such as sigmoids, let us look at a simple **logistic regression** pipeline, where $x \in \mathbb{R}^2$ and $y \in \mathbb{R}^2$:

&emsp; $z = \sigma(y) = \begin{pmatrix} \sigma(y_1) \\ \sigma(y_2) \end{pmatrix} \;\; \text{with} \;\; y = A x \;\; \text{and} \;\; A=\begin{pmatrix} a & b \\ c & d \end{pmatrix}$

The partial derivative of $z$ with respect to $y$ is a Jacobian (because $z$ and $y$ both have two dimensions):

&emsp; $\displaystyle \frac{\partial z}{\partial y} = J_z(y) = \begin{pmatrix} \nabla_y \sigma(y_1)^T \\ \nabla_y \sigma(y_2)^T \end{pmatrix} = \begin{pmatrix} \sigma'(y_1) & 0 \\ 0 & \sigma'(y_2) \end{pmatrix}$

So all of our linear layers and point-wise operations can be implemented by matrix multiplications, some of which (for the non-linearities) are multiplications by diagonal matrices (which are very fast, since they reduce to element-wise products).
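
The diagonal structure of this Jacobian can be checked numerically. The sketch below uses illustrative helper names (`numerical_jacobian`, `pointwise_sigmoid`) and arbitrary values for $y$ and for the upstream gradient; it is a sanity check, not how a deep learning library computes it:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def pointwise_sigmoid(y):
    return [sigmoid(v) for v in y]

def numerical_jacobian(func, y, eps=1e-6):
    """J[i][j] = d func(y)[i] / d y[j], estimated by central differences."""
    size_out = len(func(y))
    jac = [[0.0] * len(y) for _ in range(size_out)]
    for j in range(len(y)):
        plus, minus = list(y), list(y)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(size_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac

y = [0.3, -1.2]
jacobian = numerical_jacobian(pointwise_sigmoid, y)

# Multiplying an upstream gradient (standing for de/dz) by a diagonal
# Jacobian is the same as an element-wise product with sigma'(y).
upstream = [0.5, -2.0]
via_matrix = [sum(upstream[i] * jacobian[i][j] for i in range(2)) for j in range(2)]
via_elementwise = [u * sigmoid_prime(v) for u, v in zip(upstream, y)]
print(jacobian, via_matrix, via_elementwise)
```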

<br>

### Efficient Back Propagation

We can now apply these partial derivatives to a deep learning pipeline to see how efficiently we can compute the gradients necessary for gradient descent:

&emsp; $x$
&emsp; $\mapsto$
&emsp; $h_1 = A_1 x$
&emsp; $\mapsto$
&emsp; $h_2 = \sigma(h_1)$
&emsp; $\mapsto$
&emsp; $h_3 = A_2 h_2$
&emsp; $\mapsto$
&emsp; $e = loss(h_3)$

The equations of back-propagation become:

&emsp; $\displaystyle \frac{\partial e}{\partial A_2} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial A_2} = \frac{\partial e}{\partial h_3} \times \begin{pmatrix} h_2^T \\ \vdots \\ h_2^T \end{pmatrix}$

&emsp; $\displaystyle \frac{\partial e}{\partial A_1} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial A_1} = \frac{\partial e}{\partial h_3} \times A_2 \times \begin{pmatrix} \sigma'(h_{11}) & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & \sigma'(h_{1n}) \end{pmatrix} \times \begin{pmatrix} x^T \\ \vdots \\ x^T \end{pmatrix}$

All of these operations can be performed at high throughput on modern CPUs and GPUs. This is why Deep Learning has focused on linear layers separated by sigmoids, rectified linear units, or similar simple non-linear functions.
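
The pipeline above can be written out for a tiny network. In this sketch the squared-error loss, the matrix sizes and all numeric values are assumptions made for illustration; the gradient with respect to each matrix is expressed as an outer product, whose row $i$ is the upstream gradient component $i$ times the (transposed) layer input:

```python
import math

def matvec(a, v):
    return [sum(p * q for p, q in zip(row, v)) for row in a]

def vecmat(v, a):
    """Row vector times matrix: (v^T A)_j = sum_i v_i * A_ij."""
    return [sum(v[i] * a[i][j] for i in range(len(v))) for j in range(len(a[0]))]

def outer(u, v):
    return [[p * q for q in v] for p in u]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(a1, a2, x):
    h1 = matvec(a1, x)
    h2 = [sigmoid(v) for v in h1]
    h3 = matvec(a2, h2)
    return h1, h2, h3

def loss(h3, target):
    return 0.5 * sum((p - q) ** 2 for p, q in zip(h3, target))

def backward(a1, a2, x, target):
    _, h2, h3 = forward(a1, a2, x)
    d_h3 = [p - q for p, q in zip(h3, target)]             # de/dh3
    d_a2 = outer(d_h3, h2)                                 # de/dA2
    d_h2 = vecmat(d_h3, a2)                                # de/dh3 x A2
    d_h1 = [g * s * (1.0 - s) for g, s in zip(d_h2, h2)]   # diagonal Jacobian of sigma
    d_a1 = outer(d_h1, x)                                  # de/dA1
    return d_a1, d_a2

def numeric_grad_a1(a1, a2, x, target, i, j, eps=1e-6):
    """Central finite difference on one entry of A1, used only as a check."""
    up = [row[:] for row in a1]
    down = [row[:] for row in a1]
    up[i][j] += eps
    down[i][j] -= eps
    return (loss(forward(up, a2, x)[2], target)
            - loss(forward(down, a2, x)[2], target)) / (2 * eps)

A1 = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]]   # 3 x 2
A2 = [[0.6, -0.1, 0.3], [-0.5, 0.8, 0.2]]     # 2 x 3
x, target = [1.0, -2.0], [0.5, -0.5]
d_A1, d_A2 = backward(A1, A2, x, target)
print(d_A1)
print(d_A2)
```

Note that the backward pass only uses matrix-vector products, outer products and element-wise products, mirroring the cost of the forward pass.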


<br>

### Good generalization and expressivity

This focus on linear units could have come at the price of expressivity, but this is not the case, quite the contrary. It has been shown that 2 linear layers with a sigmoid non-linearity in between are enough to approximate any continuous function, given enough hidden units. Deeper networks can achieve the same, often with better generalization.

Linear layers are also known for their good generalization. They are simple and therefore follow Occam's razor. When they work on training data (when the explanation of the training data is a linear model), they empirically also generalize well on unseen data.

<br>

### Toward different architectures

It is worth noting that Deep Learning is increasingly using functions and architectures that do not fall into the multi-layer perceptron category: many architectures are not linear layers separated by sigmoids or rectified linear units. For instance, a CNN uses convolutions (see the sections below).

<br>

# Optimization considerations
---

Gradient descent is not guaranteed to find the global minimum of the error. The parameters of the different layers will likely not be optimal. The result will also greatly depend on the starting point.

<br>

### Saddle points

Gradient descent will get stuck at points where the gradient $\nabla f$ is zero. Such a point is a minimum if the Hessian $H$ is positive definite ($\forall x \ne 0, x^T H x > 0$), a maximum if the Hessian is negative definite ($\forall x \ne 0, x^T H x < 0$), and a saddle point if the Hessian has both positive and negative eigenvalues.
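
As a small illustration, the sketch below classifies a critical point of a two-variable function from the eigenvalues of its symmetric Hessian. The function names and the tolerance are illustrative assumptions; the case where an eigenvalue is zero is reported separately, since the Hessian alone does not decide it:

```python
import math

def symmetric_2x2_eigenvalues(h):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]] in closed form."""
    (a, b), (_, c) = h
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify_critical_point(h, tol=1e-12):
    low, high = symmetric_2x2_eigenvalues(h)
    if low > tol:
        return "minimum"      # positive definite
    if high < -tol:
        return "maximum"      # negative definite
    if low < -tol and high > tol:
        return "saddle"       # eigenvalues of both signs
    return "degenerate"       # a zero eigenvalue: second order test is inconclusive

print(classify_critical_point([[2.0, 0.0], [0.0, 2.0]]))    # Hessian of x^2 + y^2
print(classify_critical_point([[2.0, 0.0], [0.0, -2.0]]))   # Hessian of x^2 - y^2
print(classify_critical_point([[0.0, 1.0], [1.0, 0.0]]))    # Hessian of x * y
```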

In high dimensional problems, it is believed that the main issue is saddle points. Indeed, if we take a random function, the chance that a critical point is a minimum along all dimensions shrinks as $2^{-D}$ where $D$ is the number of dimensions (1 chance in 2 for the function to grow along each direction).

In addition to being rare, the local minima are, in the case of deep neural networks, empirically not very far from the global minimum, and so the main danger really comes from saddle points. There are many ways to avoid getting stuck at them, among which:

* gradient descent momentum (to keep going in the same direction in case of zero derivatives)
* stochastic gradient descent (the estimate of the gradient will introduce noise and avoid being stuck)
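
The second point can be illustrated on the toy function $f(x, y) = x^2 - y^2$, which has a saddle point at the origin. This is only a sketch: the Gaussian noise added to the gradient is an assumption standing in for mini-batch noise, and the function is unbounded below along $y$, so it only shows the escape itself and not convergence to a minimum:

```python
import random

def grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)."""
    return 2.0 * x, -2.0 * y

def descend(x, y, lr=0.1, steps=50, noise=0.0, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = grad(x, y)
        # A noisy gradient estimate stands in for the mini-batch noise of SGD.
        gx += rng.gauss(0.0, noise)
        gy += rng.gauss(0.0, noise)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Starting exactly on the x axis, exact gradient descent slides into the saddle.
exact = descend(1.0, 0.0)
# With a small amount of noise, the y coordinate is pushed off the axis and grows.
noisy = descend(1.0, 0.0, noise=1e-3)
print(exact, noisy)
```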

<br>

### Optimization vs Generalization

Another reason why local minima are not as terrible as they might first seem is that in Machine Learning, we are as much interested in optimizing the loss function as in optimizing the **performance out of sample**, that is generalizing to non-observed data. The global minimum might not be the optimal solution from a generalization perspective.

Furthermore, we never quite reach the lowest points, or even local minima, because of **early stopping**. We generally interrupt the training loop before it converges, by watching the error on the validation set. Optimization remains really important, but mostly to avoid the saddle points or the few bad local minima.

<br>

### Multi-dimensional escape routes

Another thing that fails us as humans is our low dimensional intuition when dealing with high dimensional spaces. Our intuitions about local minima are also low dimensional, as we cannot visualize high dimensional spaces.

The classic demonstration of this is that, in high dimensional spaces, most of the volume of a ball is located near its surface. Indeed, the volume of a ball grows as the radius to the power of the number of dimensions, and so the outer 1% shell close to the surface quickly occupies most of the volume:

&emsp; $\displaystyle V \propto R^D \implies \frac{(R + dR)^D}{R^D} = \big(1 + \frac{dR}{R} \big)^D \underset{D \rightarrow \infty}{\longrightarrow} \infty$
&emsp; (ratio of volumes goes to infinity)
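
To put numbers on this, the short sketch below computes the fraction of the volume of a unit ball that lies within 1% of its surface, using only the fact that the volume is proportional to $R^D$ (the function name is illustrative):

```python
def outer_shell_fraction(dim, thickness=0.01):
    """Fraction of a unit ball's volume within `thickness` of its surface.

    The volume is proportional to R**dim, so the inner ball of radius
    (1 - thickness) holds a fraction (1 - thickness)**dim of the total.
    """
    return 1.0 - (1.0 - thickness) ** dim

for dim in (2, 3, 10, 100, 1000):
    print(dim, round(outer_shell_fraction(dim), 4))
```

In 2 dimensions the shell holds about 2% of the volume, while in 1000 dimensions it holds almost all of it.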

Similarly, in high dimensional spaces, unlike in low dimensional spaces, and especially with a noisy gradient estimate, there seems to always be some kind of escape route from a local minimum (at least with high probability).

<br>

### Initialization

Gradient descent will converge to different solutions depending on the initial conditions. Although we argued that local minima are less dangerous in high dimensional spaces, it remains important to avoid the most difficult areas in the landscape of our multi-dimensional (1 dimension per parameter) loss function.

To that end, a lot of effort is being put into finding the best way to initialize the weights of each layer of a neural network. The following defaults are taken from the PyTorch documentation:

* Linear layers: $w \sim U \big(-\sqrt{\frac{1}{N}}, \sqrt{\frac{1}{N}} \big)$ where $N$ is the number of input features

* CNN layers: $w \sim U \big(-\sqrt{\frac{1}{K \times C}}, \sqrt{\frac{1}{K \times C}} \big)$ where $K$ is the number of elements in the kernel and $C$ the number of input channels

In both cases, the goal is to ensure that the average weight is 0 (symmetric between positive and negative values) and that the variance of the outputs stays of order 1, independently of $N$, when the inputs have mean 0 and variance 1. For a variable $u$ uniform on $[-a, a]$:

&emsp; $\displaystyle \mathbb{V}[u] = \frac{1}{2a} \int_{-a}^{a} u^2 \, du = \frac{1}{2a} \Big[\frac{u^3}{3} \Big]_{-a}^{a} = \frac{a^2}{3}$
&emsp; $\implies$
&emsp; $\displaystyle \mathbb{V}[w] = \frac{1}{3 N}$

Then, for inputs $x$ independent of the weights, with $\mathbb{E}[x] = 0$ and $\mathbb{V}[x] = 1$, and since $\mathbb{E}[w] = 0$:

&emsp; $\displaystyle \mathbb{V}[x w] = \mathbb{V}[x] \, \mathbb{V}[w] + \mathbb{V}[x] \, \mathbb{E}[w]^2 + \mathbb{E}[x]^2 \, \mathbb{V}[w] = \mathbb{V}[w]$
&emsp; $\implies$
&emsp; $\displaystyle \mathbb{V}\Big[\sum_n x_n w_n\Big] = N \mathbb{V}[w] = \frac{1}{3}$

Indeed, neural networks like their inputs to be normalized around 0 with a variance of order 1, and in the absence of batch normalization, a careful initialization of the weights is necessary to ensure that this holds at every layer.
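
The value of $\frac{1}{3}$ derived above can be checked with a quick simulation. This sketch uses only the Python standard library rather than PyTorch itself, and assumes standard normal inputs that are independent of the weights; the sample sizes are arbitrary:

```python
import math
import random

def sample_output(n_inputs, rng):
    """One pre-activation sum_n x_n * w_n with x ~ N(0, 1) and w ~ U(-a, a)."""
    bound = math.sqrt(1.0 / n_inputs)
    return sum(rng.gauss(0.0, 1.0) * rng.uniform(-bound, bound)
               for _ in range(n_inputs))

def empirical_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
n_inputs = 100
outputs = [sample_output(n_inputs, rng) for _ in range(4000)]
variance = empirical_variance(outputs)
print(variance)  # expected to be close to 1/3, whatever the value of n_inputs
```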

<br>

# Parameters sharing
---

Deep Learning is about training a chain of simple functions together, adjusting their parameters jointly, so that the overall composition of these simple functions approximates and generalizes an unknown function from data.

<br>

### Limiting the hypothesis space

The shape of the chain of functions to learn defines the shape of the overall search space we are looking into. This search space is called the **hypothesis space**. The larger the hypothesis space, the more powerful it is (it can model more functions), but the harder it is to search.

If we have a good idea of the kind of functions we are looking for, we can restrict the hypothesis space to the relevant parts, getting the best of both worlds. This can be accomplished in many ways:

1. focus the learning on some specific part and hardcode the rest
2. select the appropriate depth for the network
3. pick some specific function forms at some depth
4. share some parameters between functions or tasks

In particular, the first point is one of the important design decisions: **what do we want to approximate?** For instance, in the game of Chess, we could learn the value of a board and leave the rest to a minimax algorithm. Or we could learn how to select the next move, and use it as a rollout policy for MCTS. Or we could put everything into the neural network and let it play.

<br>

### Encoding our beliefs through invariance

One of the major ways to facilitate learning is to share parameters. We can share the parameters between different tasks. For instance, in a deep neural network used for regression, the lower layers are shared between the output dimensions, and so their **weights are tuned to answer multiple tasks**. This encodes our belief that these tasks have some lower level representation in common.

Another way to share parameters, as used in recurrent networks or convolutional neural nets, is to force the weights used for different parts of a sequence, or different parts of an image, to be the same. This encodes our belief that the lower levels of our model should be **invariant to the position in the sequence or the image**.

It turns out that encoding our beliefs rather than letting our model explore an unconstrained territory helps a lot:

1. restricting the hypothesis space helps with the stability of learning
2. sharing parameters makes the parameters learn more generic representations (more profound)
3. sharing parameters increases the statistical strength of these parameters (and helps generalizing)

In particular for point 3, learning a convolution that detects the edges of an image, applied generically at each position of the image, means that the parameters of the convolution will have been trained not just on $N$ examples, but on roughly $N \times H \times W$ image patches, where $W$ and $H$ refer to the width and height of each image.

In addition to this, what has been learned on the center of the image will apply in the same way to the top left corner of the image. The knowledge is generalized and, most importantly, we do not need to multiply the data points to make sure that edge detection works everywhere in the image.
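
A one-dimensional sketch makes this concrete. The kernel below is a hand-set edge detector (an assumption for illustration, not a learned one): the same two weights are applied at every position, so a pattern that moves in the input simply moves in the output:

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: the same kernel weights at every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_kernel = [-1.0, 1.0]  # hypothetical edge detector: 2 shared weights

signal = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
shifted = [0, 0] + signal[:-2]  # same pattern moved 2 positions to the right

response = conv1d(signal, edge_kernel)
shifted_response = conv1d(shifted, edge_kernel)

# A dense layer mapping the same input to the same output size would need
# one weight per (input, output) pair instead of 2 shared weights.
dense_weights = len(signal) * len(response)

print(response)
print(shifted_response)
print(len(edge_kernel), "shared weights versus", dense_weights, "dense weights")
```

The response to the shifted signal is the shifted response, without the model ever having seen the pattern at that position.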

<br>

# Convolutional networks
---

* what is convolution?
* how it works
* what kind of invariance it enforces

<br>

# Recurrent networks
---

<br>

### Back propagation

Let us consider a chain of 3 successive calls to the function $f$ (for a sequence of 3 elements):

&emsp; $x_0, h_0$
&emsp; $\mapsto$
&emsp; $h_1 = f(x_0, h_0,\theta)$
&emsp; $\mapsto$
&emsp; $h_2 = f(x_1, h_1,\theta)$
&emsp; $\mapsto$
&emsp; $h_3 = f(x_2, h_2,\theta)$
&emsp; $\mapsto$
&emsp; $e = loss(h_3)$

To adjust the parameters $\theta$ of the function $f$, in order to minimize the loss via gradient descent, we have to compute the partial derivative of the loss with respect to $\theta$:

&emsp; $\displaystyle \frac{\partial e}{\partial \theta} = \frac{\partial e}{\partial h_3} \big( \frac{\partial h_3}{\partial \theta} + \frac{\partial h_3}{\partial h_2} \big( \frac{\partial h_2}{\partial \theta} + \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta} \big) \big)$

&emsp; $\displaystyle \frac{\partial e}{\partial \theta} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial \theta} + \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial \theta} + \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta}$

We can see that the prefixes of the summed derivatives match up, and so once again, the most efficient way to compute the derivatives is to do a **backward pass**, memoizing the intermediate results and accumulating the contribution of each time step.
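
The sum above can be sketched for a scalar recurrent function. The choice $f(x, h, \theta) = \tanh(\theta h + x)$, the loss $e = \frac{1}{2} h_3^2$ and the numeric values are assumptions made for this illustration, and the finite difference is only a sanity check:

```python
import math

def f(x, h, theta):
    return math.tanh(theta * h + x)

def forward(xs, h0, theta):
    """Apply the same function f (same theta) at every step of the sequence."""
    hs = [h0]
    for x in xs:
        hs.append(f(x, hs[-1], theta))
    return hs

def loss(h_last):
    return 0.5 * h_last ** 2

def grad_theta(xs, h0, theta):
    """Backward pass through time: one running prefix, one summed contribution per step."""
    hs = forward(xs, h0, theta)
    upstream = hs[-1]                        # de/dh_T for the loss 0.5 * h_T**2
    total = 0.0
    for t in reversed(range(len(xs))):
        local = 1.0 - hs[t + 1] ** 2         # tanh' at step t
        total += upstream * local * hs[t]    # prefix times the direct dh_{t+1}/dtheta
        upstream = upstream * local * theta  # prefix extended by dh_{t+1}/dh_t
    return total

xs, h0, theta = [0.5, -0.3, 0.8], 0.1, 0.9
analytic = grad_theta(xs, h0, theta)

eps = 1e-6
numeric = (loss(forward(xs, h0, theta + eps)[-1])
           - loss(forward(xs, h0, theta - eps)[-1])) / (2 * eps)
print(analytic, numeric)
```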

<br>

### Gradient vanishing and explosion

The longer the dependency, the more difficult it will be to transmit information via the hidden state. To see why, we can isolate the component of the gradient with respect to $\theta$ that comes from the first step (the one that consumes $x_0$):

&emsp; $\displaystyle \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta}$

If the function $f$ were linear, we could summarize this as:

&emsp; $f(x, h) = W \begin{pmatrix} x \\ h \end{pmatrix} = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} \begin{pmatrix} x \\ h \end{pmatrix}$
&emsp; $\implies$
&emsp; $h_n = W_{21} x_{n-1} + W_{22} h_{n-1}$

&emsp; $h_n = W_{21} x_{n-1} + W_{22} h_{n-1} = W_{21} x_{n-1} + W_{22} (W_{21} x_{n-2} + W_{22} h_{n-2})$

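We see that the $n^{th}$ value of $h$ depends on $h_0$ through $W_{22}^n$. Assuming the square matrix $W_{22}$ is diagonalizable (which is the case for almost all square matrices), we can write $W_{22} = V \Lambda V^{-1}$ and so $W_{22}^n = V \Lambda^n V^{-1}$. The components along eigenvalues of modulus greater than 1 therefore explode, while those along eigenvalues of modulus smaller than 1 vanish.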
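
A small numerical sketch of this effect follows. The matrices are hypothetical: they are built as $V \Lambda V^{-1}$ with $V = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ so that their eigenvalues are known in advance:

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matrix_power(m, n):
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, m)
    return result

def max_abs_entry(m):
    return max(abs(v) for row in m for v in row)

def make_w22(l1, l2):
    """V diag(l1, l2) V^{-1} for V = [[1, 1], [0, 1]]: eigenvalues are l1 and l2."""
    return [[l1, l2 - l1], [0.0, l2]]

vanishing = make_w22(0.5, 0.9)   # all eigenvalues below 1 in modulus
exploding = make_w22(0.5, 1.1)   # one eigenvalue above 1 in modulus

for steps in (1, 10, 50):
    print(steps,
          max_abs_entry(matrix_power(vanishing, steps)),
          max_abs_entry(matrix_power(exploding, steps)))
```

Even eigenvalues as close to 1 as 0.9 and 1.1 lead to factors of roughly 0.005 and 117 after 50 steps.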

<br>

# Structuring a Neural Net
----

* any structure as long as it is differentiable?