## Fundamentals of deep learning

**Deep Learning** (DL) is a class of Machine Learning methods that aims at learning feature hierarchies. DL is not *the* solution, but a useful set of tools for building A.I.

DL has been applied successfully to the following kinds of data, each of which can be viewed as a hierarchical structure:

- Vision 
  pixel -> edge -> texton -> super-pixel -> part -> object
- Text 
  character -> word -> NP/VP/... -> clause -> sentence -> story
- Audio
  sample -> spectral band -> formant -> motif -> phone -> word

Models that capture such hierarchies usually need to be deep.

We will introduce two main deep learning methods and their implementations.

- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)

### 1. A Neural Network

![](figs/nn.png)
<div align="center">*Example of a neural network with two hidden layers.*</div>


A neural network consists of an input layer, several hidden layers and an output layer.
Each layer is followed by an [**activation function**](http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks). Three types are in common use: sigmoid, tanh and ReLU. Here we use ReLU,
which introduces non-linearity:

$$f(x)=\max(0, x)$$

Because ReLU is piecewise linear, a ReLU network acts as a local linear approximation of the function being learned. Multiple layers can yield exponential savings in the number of parameters through parameter sharing.

Let us write out explicitly the computation from the input $x$ to the output $o$, where $W_i$ and $b_i$ are the weights and biases of layer $i$. This process is called **Forward Propagation**.

- $h_1 = \max(0, W_1x + b_1)$
- $h_2 = \max(0, W_2h_1 + b_2)$
- ...
- $o = \max(0, W_nh_{n-1} + b_n)$
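
The forward pass above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the helper names `relu`, `affine` and `forward`, the toy weights, and the choice of nested lists instead of an array library are all assumptions made to keep the example dependency-free.

```python
def relu(v):
    """Element-wise ReLU: max(0, x) for every entry of the vector v."""
    return [max(0.0, x) for x in v]


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def forward(x, layers):
    """Apply h_k = max(0, W_k h_{k-1} + b_k) for each (W_k, b_k) in layers."""
    h = x
    for W, b in layers:
        h = relu(affine(W, h, b))
    return h


# Hypothetical toy network: 2 inputs -> 3 hidden units -> 2 outputs.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.5]),
    ([[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]], [0.2, -0.3]),
]
o = forward([1.0, 2.0], layers)
print(o)
```

Each `(W, b)` pair stands for one layer, so adding a layer only means appending another pair to `layers`.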

Next, we use the **softmax** function to turn the outputs into probabilities. For $c$ classes, this can be written as:
$$p(c_j=1|x) = \frac{e^{o_j}}{\sum_{i=1}^{c}e^{o_i}}$$
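
As a rough sketch, the softmax formula can be computed as follows. The function name `softmax` and the example scores are illustrative. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math


def softmax(o):
    """Return p_j = exp(o_j) / sum_i exp(o_i) for a list of scores o."""
    m = max(o)  # shift by the max score to avoid overflow in exp()
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([2.0, 1.0, 0.1])
print(p, sum(p))
```

The returned values are positive and sum to one, and a larger score always receives a larger probability.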

We now have a probability vector as the output. Since we also have the label vector $y$, we can use a **loss** function (or **cost** function) to measure the distance between these two vectors:
$$L(x, y; \theta) = - \sum_j y_j \log p(c_j=1|x)$$
This loss function is also called **cross-entropy**.
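
A minimal sketch of this loss, assuming that $y$ is a one-hot label vector and that the probabilities have already been computed. The function name `cross_entropy` and the example values are illustrative only.

```python
import math


def cross_entropy(p, y):
    """L = -sum_j y_j * log(p_j) for probabilities p and a label vector y."""
    # Terms with y_j == 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for classes with zero probability.
    return -sum(y_j * math.log(p_j) for p_j, y_j in zip(p, y) if y_j > 0)


p = [0.7, 0.2, 0.1]   # predicted probabilities (hypothetical values)
y = [1.0, 0.0, 0.0]   # one-hot label: the true class is class 0
loss = cross_entropy(p, y)
print(loss)
```

With a one-hot $y$, the loss reduces to the negative log-probability assigned to the true class, so it is zero only when that class receives probability one.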

The training (learning) process mainly consists of minimizing the loss function, which relies on another important method, [**back propagation**](http://neuralnetworksanddeeplearning.com/chap2.html).
Back propagation is about understanding how changing the weights and biases in a network changes the loss function.
It is based on the chain rule:
- $\frac{\partial L}{\partial W_n} = \frac{\partial L}{\partial o}\frac{\partial o}{\partial W_n}$, $\frac{\partial L}{\partial h_{n-1}} = \frac{\partial L}{\partial o}\frac{\partial o}{\partial h_{n-1}}$
- ...
- $\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial h_2}\frac{\partial h_2}{\partial W_2}$, $\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_2}\frac{\partial h_2}{\partial h_1}$
- $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_1}\frac{\partial h_1}{\partial W_1}$

Following this procedure, we can compute $\partial L/\partial W_i$ and $\partial L/\partial b_i$ for every layer $i$.
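
To make the chain rule concrete, here is a hedged sketch of back propagation for a network with a single hidden layer, using ReLU on both layers followed by softmax and cross-entropy. It assumes a one-hot label $y$, in which case the gradient of the loss with respect to $o$ simplifies to $p - y$. The function names, the toy weights and the finite-difference check at the end are illustrative assumptions, not part of the original text.

```python
import math


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def loss_and_grads(x, y, W1, b1, W2, b2):
    """Return the cross-entropy loss and its gradients for a two-layer ReLU net."""
    # Forward propagation, keeping the pre-activations z1 and z2 for later.
    z1 = affine(W1, x, b1)
    h1 = [max(0.0, v) for v in z1]
    z2 = affine(W2, h1, b2)
    o = [max(0.0, v) for v in z2]
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    p = [e / total for e in exps]
    loss = -sum(y_j * math.log(p_j) for p_j, y_j in zip(p, y) if y_j > 0)

    # Backward pass: dL/do = p - y for softmax + cross-entropy with one-hot y.
    d_o = [p_j - y_j for p_j, y_j in zip(p, y)]
    # ReLU passes the gradient through only where its input was positive.
    d_z2 = [g if z > 0 else 0.0 for g, z in zip(d_o, z2)]
    d_W2 = [[g * h for h in h1] for g in d_z2]       # outer product d_z2 h1^T
    d_b2 = d_z2
    d_h1 = [sum(W2[i][j] * d_z2[i] for i in range(len(W2)))
            for j in range(len(h1))]                 # W2^T d_z2
    d_z1 = [g if z > 0 else 0.0 for g, z in zip(d_h1, z1)]
    d_W1 = [[g * v for v in x] for g in d_z1]        # outer product d_z1 x^T
    d_b1 = d_z1
    return loss, (d_W1, d_b1, d_W2, d_b2)


# Hypothetical toy values: 2 inputs -> 2 hidden units -> 2 classes.
x, y = [1.0, 2.0], [1.0, 0.0]
W1, b1 = [[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1]
W2, b2 = [[1.0, 0.5], [0.2, 0.4]], [0.0, 0.1]
loss, (d_W1, d_b1, d_W2, d_b2) = loss_and_grads(x, y, W1, b1, W2, b2)

# Sanity check: compare one analytic gradient with a central finite difference.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss_and_grads(x, y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(x, y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(d_W1[0][1], numeric)
```

The two printed numbers should agree closely. Comparing against a finite difference like this is a simple way to catch mistakes in a hand-written backward pass.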

We use the [**Stochastic Gradient Descent**](http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/) method for optimization, which updates the parameters $\theta$ as
$$\theta = \theta - \alpha \nabla_{\theta} L(\theta)$$
where $\alpha$ is the learning rate.
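
The update rule can be illustrated on a problem much smaller than a neural network. The sketch below is an assumption-laden toy example: it fits a line $y = wx + b$ to noiseless data with a squared-error loss, drawing one random example per step. The names `sgd_step`, `data` and the values of `alpha` and the step count are hypothetical choices for illustration.

```python
import random


def sgd_step(theta, grad, alpha):
    """One update theta <- theta - alpha * grad, applied element-wise."""
    return [t - alpha * g for t, g in zip(theta, grad)]


# Hypothetical toy problem: noiseless samples from y = 3x + 1.
random.seed(0)
data = [(x, 3.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

theta = [0.0, 0.0]  # parameters [w, b]
alpha = 0.05        # learning rate
for step in range(2000):
    x, y = random.choice(data)           # "stochastic": one random example per step
    err = theta[0] * x + theta[1] - y    # prediction error on that example
    grad = [err * x, err]                # gradient of 0.5 * err**2 w.r.t. [w, b]
    theta = sgd_step(theta, grad, alpha)

print(theta)
```

For a neural network the same `sgd_step` would be applied to every $W_i$ and $b_i$, with the gradients supplied by back propagation instead of the closed-form expression used here.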