In [None]:
# structure of ANN
# equations for back-prop
# typical error functions (cross entropy, MSE loss)
# why optim in high dimension works

<br>

# What is Deep Learning?
---

<br>

### Limitations of manually selected basis functions

Many Machine Learning techniques (such as linear regression, support vector machines, etc.) are linear models at their core. To deal with inputs that are not linearly separable, we transform each input $x$ into $\phi(x) = (\phi_1(x) \dots \phi_n(x))$, so that the transformed inputs become linearly separable again. The $\phi_i$ are a fixed set of **basis functions** that define this new space.

The pipeline of the learning algorithm is therefore:

&emsp; $x \mapsto \phi(x) \mapsto f(\phi(x), \theta)$
&emsp; where $f$ is our hypothesis set and $\theta$ the parameters to learn

While powerful, this approach has several limitations: we need to identify a good set of basis functions for our problem, the number of basis functions is limited (the more there are, the more computation is needed), and their form is restricted (kernel SVMs allow an infinite number of basis functions, but all of the same shape).
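As a minimal sketch of the pipeline $x \mapsto \phi(x) \mapsto f(\phi(x), \theta)$, consider an XOR-like problem. The data set, the choice of basis functions in `phi`, and the weights in `theta` are all assumptions made for illustration: the point is only that a hand-picked product feature makes the classes separable by a linear hypothesis.

```python
# Hypothetical example: XOR-like data is not linearly separable in (x1, x2),
# but becomes separable once the product x1 * x2 is added as a basis function.

def phi(x):
    """Hand-picked basis functions: the raw inputs plus their product."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def f(features, theta):
    """Linear hypothesis on top of the basis functions: sign of a dot product."""
    score = sum(w * v for w, v in zip(theta, features))
    return 1 if score > 0 else -1

# Inputs in {-1, +1}^2, labelled +1 when the two coordinates differ (XOR).
inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]

# With phi, a single weight on the product term separates the two classes.
theta = (0.0, 0.0, -1.0)
predictions = [f(phi(x), theta) for x in inputs]
print(predictions == labels)
```

Here the useful basis function had to be guessed by hand, which is exactly the limitation described above.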

<br>

### Deep Learning

Deep Learning is an attempt to remove the need to handcraft and manually select the basis functions, by letting the learning algorithm find them by itself. Instead of learning only at the last stage of the pipeline described above, **learning is done at every stage of the pipeline**:

&emsp; $x \longmapsto h_1 = f_1(x,\theta_1) \longmapsto h_2 = f_2(h_1, \theta_2) \longmapsto \; \cdots \; \longmapsto h_n = f_n(h_{n-1}, \theta_n)$

During training, all the parameters $\theta_i$ are typically **learned jointly**. An alternative would have been to learn the parameters one at a time (for instance by cycling between them).

This separate optimization is in fact sometimes used in Deep Learning as well, when doing **transfer learning**. We use a bigger task (or many small tasks, which we call **multi-task learning**) to learn the parameters of the lower levels, before transferring them to a smaller task that only tunes the higher-level parameters.

<br>

### Deep Neural Networks

In theory, Deep Learning has no specific need to prefer any particular form for the functions $f_1 \dots f_n$ composed together in the pipeline described above. In practice though, we have to make compromises in order to make the learning of the parameters $\theta_1 \dots \theta_n$ possible.

First, we have to find a way to learn these parameters. The typical approach in Deep Learning is gradient descent: each function must be differentiable in its input and parameters, to allow the parameters to be tuned to minimize a **loss function** which represents our objective (what we want to minimize, for instance the distance between our prediction and the actual results).

Second, we want to make this learning as efficient as possible. In that regard, linear functions separated by non-linearities are both easier and faster to optimize than arbitrary functions. This is why Deep Learning has focused on linear layers separated by sigmoids, rectified linear units, or similar simple non-linear functions.
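The structure described above (linear layers separated by simple non-linearities) can be sketched in a few lines of plain Python. The layer sizes, the random initialization, and the helper names `linear`, `relu`, `init_layer` and `forward` are assumptions chosen for illustration, not a reference implementation.

```python
import math
import random

def linear(h, weights, bias):
    """Linear layer: one weighted sum (plus bias) per output unit."""
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]

def relu(h):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in h]

def init_layer(n_in, n_out, rng):
    """Random weights scaled by 1 / sqrt(n_in), zero biases (assumed scheme)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Compose the layers: h_i = relu(W_i h_{i-1} + b_i), except for the last one."""
    h = x
    for i, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if i < len(layers) - 1:
            h = relu(h)
    return h

rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 2, rng)]
output = forward([1.0, -2.0, 0.5, 3.0], layers)
print(len(output))
```

Leaving the last layer without a non-linearity is a common choice when the output feeds directly into a loss function, but it is only one possible design.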

It is worth noting, though, that Deep Learning increasingly uses functions that do not fall into the typical "neural network" category: a CNN, for instance, does not exactly follow the perceptron model.

<br>

### Why not learn one big function instead?

This can actually be done. The universal approximation theorem states that a two-layer neural network (a single hidden layer followed by an output layer) can approximate any continuous function on a compact domain to arbitrary precision, given enough hidden units. There are nevertheless many practical and theoretical advantages to learning a composition of small functions rather than one big function.

The first advantage has to do with the size of the model. A two-layer neural network can approximate such functions, but for some of them it may require an exponentially growing number of parameters to do so. Shallow and wide networks are also often associated with overfitting, and it is **more robust to learn a composition of simple functions rather than one big function**.

The second advantage has to do with parameter sharing and reuse. If we learn small functions, the lower-layer functions can potentially be recombined into new neural networks that perform similar tasks. In fact, one way to make a model learn more general functions is precisely to **share the simple functions between several tasks**.

The big drawback of increasing depth rather than width (or the complexity of the composed functions) is that it presents challenges for numerical optimization. Several mechanisms help with this (batch normalization, residual connections, rectified linear units), but in general, the deeper the network, the more difficult it is to train.

<br>

# Back propagation
---

As mentioned before, the typical way to train a Deep Learning model is gradient descent: inputs are fed into the model, and each output is compared to its expected value. The discrepancies are aggregated into an error, and the parameters of the model are adjusted to reduce this error:

&emsp; $\displaystyle \theta_i \leftarrow \theta_i - \alpha \, \frac{\partial e}{\partial \theta_i}$
&emsp; where $e$ is the error and $\alpha$ is the learning rate
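The update rule can be illustrated on a toy error with a single parameter. The quadratic error, the learning rate and the starting point below are arbitrary assumptions; only the update line mirrors the formula above.

```python
def error(theta):
    """Toy error with a single parameter, minimal at theta = 3."""
    return (theta - 3.0) ** 2

def d_error(theta):
    """Analytical derivative of the toy error with respect to theta."""
    return 2.0 * (theta - 3.0)

alpha = 0.1   # learning rate (assumed value)
theta = 0.0   # arbitrary starting point

for _ in range(100):
    # theta <- theta - alpha * de/dtheta
    theta = theta - alpha * d_error(theta)

print(abs(theta - 3.0) < 1e-6)
```

On this convex toy problem each step shrinks the distance to the minimum by a constant factor; real models offer no such guarantee.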

We call the computation of the loss the **forward pass**. If we consider a chain of functions 3 layers deep $(f_1, f_2, f_3)$ with respective parameters $(\theta_1, \theta_2, \theta_3)$ followed by a loss function, this is what it looks like:

&emsp; $x$
&emsp; $\mapsto$
&emsp; $h_1 = f_1(x,\theta_1)$
&emsp; $\mapsto$
&emsp; $h_2 = f_2(h_1,\theta_2)$
&emsp; $\mapsto$
&emsp; $h_3 = f_3(h_2,\theta_3)$
&emsp; $\mapsto$
&emsp; $e = loss(h_3)$

Now, we have to compute the partial derivative of the loss with respect to each parameter:

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_3} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial \theta_3}$

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_2} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial \theta_2}$

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_1} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta_1}$

We can see that these derivatives share common prefixes, and so the most efficient way to compute them is by doing a **backward** pass, going in the opposite direction to the forward pass (starting with the higher layers), and memoizing the intermediate products so that the next set of parameter gradients can be computed efficiently.
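The following sketch applies this backward pass to a chain of three scalar layers. The layer form $h = \tanh(\theta \, h_{prev})$, the squared-error loss and all numeric values are assumptions made for illustration; the variable `upstream` plays the role of the memoized prefix $\partial e / \partial h_i$. A finite-difference estimate is included as a sanity check.

```python
import math

def f(h_prev, theta):
    """Hypothetical scalar layer, chosen only for illustration."""
    return math.tanh(theta * h_prev)

def df_dh(h_prev, theta):
    return theta * (1.0 - math.tanh(theta * h_prev) ** 2)

def df_dtheta(h_prev, theta):
    return h_prev * (1.0 - math.tanh(theta * h_prev) ** 2)

def forward_backward(x, thetas, target):
    # Forward pass: keep every intermediate value for the backward pass.
    hs = [x]
    for theta in thetas:
        hs.append(f(hs[-1], theta))
    e = (hs[-1] - target) ** 2

    # Backward pass: `upstream` holds de/dh_i, the prefix shared by all
    # gradients of the lower layers.
    upstream = 2.0 * (hs[-1] - target)
    grads = [0.0] * len(thetas)
    for i in reversed(range(len(thetas))):
        grads[i] = upstream * df_dtheta(hs[i], thetas[i])
        upstream = upstream * df_dh(hs[i], thetas[i])
    return e, grads

x, target = 0.5, 0.2
thetas = [0.8, -1.2, 0.5]
e, grads = forward_backward(x, thetas, target)

def error_at(theta_1):
    """Error as a function of the first layer's parameter only."""
    return forward_backward(x, [theta_1] + thetas[1:], target)[0]

# Central finite difference for the gradient of the first layer's parameter.
eps = 1e-6
numeric = (error_at(thetas[0] + eps) - error_at(thetas[0] - eps)) / (2 * eps)
print(abs(numeric - grads[0]) < 1e-6)
```

Each layer is visited once in each direction, so the cost of computing all gradients is of the same order as the cost of the forward pass.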

<br>

### Multiple parameters

The notation used above is that of partial derivatives. It applies directly when each parameter $\theta_i$ is a single variable (not a vector) and when the output of each function $f_i$ is a real number. The algorithm still works with higher-dimensional parameters and outputs: we just need to use gradients and Jacobians instead of partial derivatives.

Instead of the partial derivative of the loss with respect to the parameter $\theta_i$, we use the gradient of the loss with respect to $\theta_i$, which collects all the partial derivatives. Instead of the intermediate partial derivatives, we use Jacobian matrices (where each row is the gradient of one output dimension):

&emsp; $\displaystyle \frac{\partial e}{\partial \theta_i}$
&emsp; becomes
&emsp; $\displaystyle \nabla_{\theta_i} e \in \mathbb{R}^N$
&emsp;
&emsp; and
&emsp; 
&emsp; $\displaystyle \frac{\partial h_n}{\partial h_{n-1}}$
&emsp; becomes
&emsp; $\displaystyle J_{f_n} \in \mathbb{R}^{M \times N}$

This notation, however, only clutters the logic above. It is easier to think of the partial derivatives as potentially being vectors or even matrices when needed.
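To make the vector case concrete, here is a small sketch with a linear layer $h = W h_{prev}$, whose Jacobian with respect to $h_{prev}$ is $W$ itself. The matrix `W`, the sum-of-squares loss and the helper `vjp` are assumptions for illustration. Propagating the gradient one layer back amounts to multiplying it by the Jacobian.

```python
def layer(h_prev, W):
    """Linear layer h = W h_prev; its Jacobian with respect to h_prev is W."""
    return [sum(w * v for w, v in zip(row, h_prev)) for row in W]

def vjp(grad_out, jacobian):
    """Multiply a gradient of size M (as a row vector) by an M x N Jacobian."""
    n = len(jacobian[0])
    return [sum(g * row[j] for g, row in zip(grad_out, jacobian)) for j in range(n)]

W = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]          # M = 2 outputs, N = 3 inputs
h_prev = [1.0, 1.0, 1.0]
h = layer(h_prev, W)            # [3.0, 2.0]

# Assumed loss: e = sum of squares of h, so the gradient of e w.r.t. h is 2 h.
grad_h = [2.0 * v for v in h]   # [6.0, 4.0]

# Gradient of e with respect to h_prev, obtained through the Jacobian.
grad_h_prev = vjp(grad_h, W)    # [6.0, 8.0, 12.0]
print(grad_h_prev)
```

The backward pass never needs the product of two Jacobians: it only ever multiplies a gradient vector by one Jacobian at a time.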

<br>

# Optimization considerations
---

Gradient descent is not guaranteed to find the global minimum of the error. The parameters of the different layers will likely not be optimal, and the result can depend heavily on the starting point.

* we do not search the global minimum
* importance of selecting a good starting point (randomization)
* there are few local minima (exponentially few)
* the saddle points are the real problem
* often a way out in high dimension
* stochastic gradient descent
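The role of saddle points and of noise can be illustrated on a two-parameter toy error. The function $e(x, y) = x^2 + y^4/4 - y^2/2$ (a saddle at the origin, minima at $y = \pm 1$), the step size and the Gaussian noise used as a crude stand-in for stochastic gradient descent are all assumptions made for this sketch.

```python
import random

def grad(x, y):
    """Gradient of e(x, y) = x**2 + y**4 / 4 - y**2 / 2 (saddle at the origin)."""
    return 2.0 * x, y ** 3 - y

def descend(x, y, alpha=0.1, steps=500, noise=0.0, rng=None):
    """Gradient descent, optionally with Gaussian noise added to the gradient."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        if rng is not None:
            # Crude stand-in for the noise of stochastic gradient descent.
            gx += rng.gauss(0.0, noise)
            gy += rng.gauss(0.0, noise)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

# Started exactly on the line y = 0, plain gradient descent stalls at the saddle.
stuck = descend(1.0, 0.0)

# A little gradient noise pushes the iterate off that line, toward a minimum.
escaped = descend(1.0, 0.0, noise=0.01, rng=random.Random(0))
print(stuck, escaped)
```

In this toy setting a random starting point would also avoid the problem, since the saddle is only reached from the measure-zero line $y = 0$.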

<br>

# Feed forward networks
---

<br>

# Parameters sharing
---

* todo: and explain CNN and the implementation with gradient

<br>

# Recurrent networks
---

<br>

### Back propagation

Let us consider a chain of 3 successive calls to the function $f$ (for a sequence of 3 elements):

&emsp; $x_0, h_0$
&emsp; $\mapsto$
&emsp; $h_1 = f(x_0, h_0,\theta)$
&emsp; $\mapsto$
&emsp; $h_2 = f(x_1, h_1,\theta)$
&emsp; $\mapsto$
&emsp; $h_3 = f(x_2, h_2,\theta)$
&emsp; $\mapsto$
&emsp; $e = loss(h_3)$

To adjust the parameters $\theta$ of the function $f$, in order to minimize the loss via gradient descent, we have to compute the partial derivative of the loss with respect to $\theta$:

&emsp; $\displaystyle \frac{\partial e}{\partial \theta} = \frac{\partial e}{\partial h_3} \big( \frac{\partial h_3}{\partial \theta} + \frac{\partial h_3}{\partial h_2} \big( \frac{\partial h_2}{\partial \theta} + \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta} \big) \big)$

&emsp; $\displaystyle \frac{\partial e}{\partial \theta} = \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial \theta} + \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial \theta} + \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta}$

We can see that the summed terms share common prefixes, and so once again, the most efficient way to compute the derivative is by doing a **backward** pass, memoizing the intermediate products and accumulating the contribution of each time step.
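The same idea can be sketched for the recurrent case. The scalar cell $f(x, h, \theta) = \tanh(\theta h + x)$, the squared-error loss and the numeric values are assumptions for illustration. Because the same $\theta$ is used at every step, the backward pass accumulates one contribution per time step.

```python
import math

def f(x, h, theta):
    """Hypothetical recurrent cell with a single scalar parameter."""
    return math.tanh(theta * h + x)

def run(xs, h0, theta, target):
    # Forward pass through the sequence, storing every hidden state.
    hs = [h0]
    for x in xs:
        hs.append(f(x, hs[-1], theta))
    e = (hs[-1] - target) ** 2

    # Backward pass: accumulate the contribution of each time step.
    upstream = 2.0 * (hs[-1] - target)          # de/dh_3
    grad_theta = 0.0
    for t in reversed(range(len(xs))):
        local = 1.0 - hs[t + 1] ** 2             # tanh'(.) at step t
        grad_theta += upstream * local * hs[t]   # de/dh_{t+1} * dh_{t+1}/dtheta
        upstream = upstream * local * theta      # de/dh_t
    return e, grad_theta

xs, h0, theta, target = [0.3, -0.1, 0.4], 0.5, 0.9, 0.1
e, grad_theta = run(xs, h0, theta, target)

# Central finite difference as a sanity check.
eps = 1e-6
numeric = (run(xs, h0, theta + eps, target)[0]
           - run(xs, h0, theta - eps, target)[0]) / (2 * eps)
print(abs(numeric - grad_theta) < 1e-6)
```

The three terms added to `grad_theta` correspond, in order, to the three terms of the expanded sum above.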

<br>

### Gradient vanishing and explosion

The longer the dependency, the more difficult it will be to transmit information via the hidden state. To see why, we can isolate the component of the gradient with respect to $\theta$ that comes from $x_0$:

&emsp; $\displaystyle \frac{\partial e}{\partial h_3} \times \frac{\partial h_3}{\partial h_2} \times \frac{\partial h_2}{\partial h_1} \times \frac{\partial h_1}{\partial \theta}$

If the function $f$ were linear, we could summarize this as:

&emsp; $f(x, h) = W \begin{pmatrix} x \\ h \end{pmatrix} = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} \begin{pmatrix} x \\ h \end{pmatrix}$
&emsp; $\implies$
&emsp; $h_n = W_{21} x_{n-1} + W_{22} h_{n-1}$

&emsp; $h_n = W_{21} x_{n-1} + W_{22} h_{n-1} = W_{21} x_{n-1} + W_{22} (W_{21} x_{n-2} + W_{22} h_{n-2})$

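We see that the $n^{th}$ value of $h$ depends on $h_0$ through $W_{22}^n$. Assuming $W_{22}$ is diagonalizable (being square is not sufficient, but almost all square matrices are diagonalizable over the complex numbers), we can write $W_{22} = V \Lambda V^{-1}$, and so $W_{22}^n = V \Lambda^n V^{-1}$. As $n$ grows, the powers of the eigenvalues with magnitude greater than 1 explode, while those of the eigenvalues with magnitude smaller than 1 vanish.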
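A small numerical sketch makes this visible. The two $2 \times 2$ matrices below are hypothetical stand-ins for $W_{22}$, chosen so that their eigenvalues are easy to read (for a symmetric matrix with diagonal $a$ and off-diagonal $b$, the eigenvalues are $a + b$ and $a - b$).

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def power_norm(w, n):
    """Largest absolute entry of w**n, a rough measure of its magnitude."""
    result = w
    for _ in range(n - 1):
        result = matmul(result, w)
    return max(abs(v) for row in result for v in row)

shrinking = [[0.5, 0.1], [0.1, 0.5]]   # eigenvalues 0.6 and 0.4
growing = [[1.2, 0.1], [0.1, 1.2]]     # eigenvalues 1.3 and 1.1

vanished = power_norm(shrinking, 50)
exploded = power_norm(growing, 50)
print(vanished, exploded)
```

After 50 steps the first matrix power is numerically negligible while the second has grown by several orders of magnitude, which is the behaviour the gradient inherits in the linear case.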

<br>

# Structuring a Neural Net
----

* any structure as long as it is differentiable?