# Deep Learning

## <center> Three steps

### Step1, Function set(Model):

input layer: feature X

hidden layer:
* input: output of the previous layer
* neuron: each neuron is a function that takes all inputs and combines them into one number, then passes that number through an activation function to produce its output
* output: the output of each neuron
* structure: fully connected, etc.

output layer:
* input: output of the last hidden layer
* function: self designed
* output: required output
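
As a rough illustration of the function set described above, here is a minimal sketch of a fully connected network in plain Python. The helper name `dense`, the sigmoid hidden activation, the identity output function and all weight values are assumptions chosen only for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron computes activation(w . x + b)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b   # one number per neuron
        outputs.append(activation(z))
    return outputs

# Hypothetical toy parameters: 2 features -> 2 hidden neurons -> 1 output.
x = [1.0, -1.0]
hidden = dense(x, [[1.0, -2.0], [-1.0, 1.0]], [1.0, 0.0], sigmoid)
y = dense(hidden, [[2.0, -1.0]], [0.5], lambda z: z)   # identity as output function
```

Changing the weights and biases selects a different function from the same function set, which is what training searches over.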


### Step2, Loss function:

self-designed Loss function

### Step3, find the best function:

Gradient: 

$$
\begin{bmatrix}
    \frac{\partial L}{\partial w_1} \\
    \frac{\partial L}{\partial w_2} \\
    \frac{\partial L}{\partial w_3} \\  
    ... \\
    \frac{\partial L}{\partial b_1} \\
    \frac{\partial L}{\partial b_2} \\
    ... \\
\end{bmatrix}
$$
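
One common way to use this gradient vector is gradient descent, $\theta \leftarrow \theta - \eta \nabla L$. The sketch below assumes a toy model $y = wx + b$ with a squared-error loss; the data, the learning rate `eta` and the step count are made up for illustration.

```python
def loss(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data)

def gradient(w, b, data):
    """Return [dL/dw, dL/db] for the squared-error loss above."""
    dw = sum(2 * (w * x + b - y) * x for x, y in data)
    db = sum(2 * (w * x + b - y) for x, y in data)
    return [dw, db]

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # toy points on y = 2x + 1
w, b, eta = 0.0, 0.0, 0.05                    # eta is an assumed learning rate
for _ in range(2000):
    dw, db = gradient(w, b, data)
    w, b = w - eta * dw, b - eta * db         # step against the gradient
```

After the loop, `w` and `b` should be close to 2 and 1, the values that generated the toy data.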

## <center> Backpropagation

### How to compute gradient (partial derivative)?

* parameters: $\theta = \{w_1, w_2, w_3,..., b_1, b_2,...\}$

* our work: 

    $l^1$ is the loss between $y^1$ and $\hat{y}^1$, $l^2$ is the loss between $y^2$ and $\hat{y}^2$, and so on; $L(\theta)$ is their sum over the $N$ training examples.

$$
\frac{\partial L(\theta)}{\partial w} = \displaystyle\sum^{N}_{n = 1}\frac{\partial l^n}{\partial w} = \displaystyle\sum^{N}_{n = 1}\frac{\partial z}{\partial w} \frac{\partial l^n}{\partial z}
$$

* this motivates a new algorithm: backpropagation

### two steps of Backpropagation
### step 1: forward pass:

compute all $\frac{\partial z}{\partial w}$, where $z$ is the weighted input computed inside a neuron. When $z$ is a linear model such as $z = wx$:

$$
\frac{\partial z}{\partial w} = \text{coefficient of the parameter } w \text{, which is } x \text{ in our example}
$$ 

### step 2: backward pass:

run the neural network in reverse

input:  

$$
\frac{\partial l}{\partial z}\text{, where z is in the output layer}
$$

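hidden layer:

* input: $w_1 \frac{\partial l}{\partial z_1} + w_2 \frac{\partial l}{\partial z_2}$, where $z_1$ and $z_2$ belong to the neurons this neuron feeds into and $w_1$, $w_2$ are the connecting weights. We only need to compute the partial derivatives at the output layer directly, then use them as the input of the next layer in the reversed network
* activation function: the derivative of the activation function used in the forward network, $\sigma '(z)$; we use the sigmoid function as the example here
* output: the partial derivative $\frac{\partial l}{\partial z}$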

$$
\frac{\partial l}{\partial z} = \sigma '(z)[w_1 \frac{\partial l}{\partial z_1} + w_2 \frac{\partial l}{\partial z_2}]
$$
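
The two passes can be checked numerically on a tiny network. The sketch below assumes one sigmoid hidden neuron feeding two linear output neurons and a squared-error loss; these choices and all the numbers are assumptions for the example, not part of the notes above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, w1, w2, x):
    z = w * x + b                  # hidden pre-activation
    a = sigmoid(z)                 # hidden output
    return z, a, w1 * a, w2 * a    # z, a, z1, z2

def loss(w, b, w1, w2, x, t1, t2):
    _, _, z1, z2 = forward(w, b, w1, w2, x)
    return 0.5 * (z1 - t1) ** 2 + 0.5 * (z2 - t2) ** 2

def grad_w(w, b, w1, w2, x, t1, t2):
    _, a, z1, z2 = forward(w, b, w1, w2, x)
    dz_dw = x                                    # forward pass: dz/dw = x
    dl_dz1, dl_dz2 = z1 - t1, z2 - t2            # backward pass starts at the output layer
    dsig = a * (1.0 - a)                         # sigma'(z) for the sigmoid
    dl_dz = dsig * (w1 * dl_dz1 + w2 * dl_dz2)   # the hidden-layer formula above
    return dz_dw * dl_dz

params = dict(w=0.5, b=-0.2, w1=1.5, w2=-0.7, x=2.0, t1=1.0, t2=0.0)
analytic = grad_w(**params)
eps = 1e-6
numeric = (loss(**{**params, "w": params["w"] + eps})
           - loss(**{**params, "w": params["w"] - eps})) / (2 * eps)
```

`analytic` comes from backpropagation and `numeric` from a finite difference; the two should agree to several decimal places.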



## <center> Activation function

### Useful material, some basic understanding of activation functions

https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

### why activation function

Reason 1:

Imagine our NN doesn't have an activation function.

The NN could be written as $Y = W_3 \cdot (W_1 \cdot X + W_2 \cdot X)$ (one hidden layer, two neurons).

Wait, this is still a linear function, no matter how many hidden layers or neurons you add. Extra layers give no new power, while a difficult problem needs a non-linear, more complex function. That is what deep learning is for.

Now add activation functions.

The NN becomes $Y = g(W_3 \cdot (f(W_1 \cdot X) + f(W_2 \cdot X)))$. It is now non-linear because the activation functions $f$ and $g$ are not linear.
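
A quick way to see Reason 1 is to compare a network without activation functions against a single multiplication. The sketch below uses scalar weights with made-up values and assumes sigmoid for both $f$ and $g$.

```python
import math

W1, W2, W3 = 0.8, -1.5, 2.0   # hypothetical scalar weights

def linear_net(x):
    # One hidden layer, two neurons, no activation function.
    return W3 * (W1 * x + W2 * x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nonlinear_net(x):
    # Same structure, with sigmoid standing in for both f and g.
    return sigmoid(W3 * (sigmoid(W1 * x) + sigmoid(W2 * x)))

collapsed_weight = W3 * (W1 + W2)   # the whole linear net is one multiplication
x = 3.0
same = abs(linear_net(x) - collapsed_weight * x) < 1e-12
# A linear map satisfies net(1) + net(2) == net(3); the non-linear net does not.
additive = abs(nonlinear_net(1.0) + nonlinear_net(2.0) - nonlinear_net(3.0)) < 1e-12
```

Here `same` is true and `additive` is false: the network without activations collapses to one weight, while the one with activations is no longer a linear map.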
    
Reason 2:

An activation function helps decide which neurons fire and which matter more to the output. Neurons that are not activated output zero, so the NN effectively gets lighter: fewer neurons contribute to each output (the number of parameters itself does not change).

### various activation function and their problem

**linear activation function**

A linear function is something like $f(x) = kx + b$; its plot is a straight line.

Problems:
* it changes nothing: the result is the same as an NN without activation functions.

* the derivative of a linear function is a constant ($k$ here), so the gradient does not depend on $x$ when we tune the parameters.

* no bound: the range is $(-\infty, +\infty)$

**sigmoid function**

$f(x) = \frac{1}{1 + e^{-x}}$

advantages:

* it is bounded, so it won't blow up the activations; the range is $(0, 1)$

* y changes rapidly when x is in $(-2, +2)$, as the plot shows, which makes a clear distinction between x values in that region.

* non-linear (it is a curve)

* its gradient is not constant

Problems:

* Vanishing Gradient Problem: look at either end of the sigmoid function. The y value approaches 0 or 1 and barely changes, so the gradient in those regions is very small. The derivative is at most 0.25 (at $x = 0$), so every time a value passes through a sigmoid the gradient shrinks.
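
To put numbers on the vanishing gradient, the sketch below evaluates the sigmoid derivative $f'(x) = f(x)(1 - f(x))$ at the centre and in the tail, then multiplies the best-case slope across 10 layers. The choice of 10 layers and of $x = 6$ is arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)     # largest possible slope, reached at x = 0
tail = sigmoid_grad(6.0)     # slope in the saturated region
# Best case: every layer sits at the peak, yet 10 layers still shrink the signal.
best_case_10_layers = peak ** 10
```

Even in the best case the gradient is scaled by less than one millionth after 10 sigmoid layers, and saturated neurons make it far smaller.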

**Tanh function**

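$f(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,\text{sigmoid}(2x) - 1$

It is actually a scaled and shifted sigmoid function!

advantages:
    
* similar to the sigmoid function, but the range is $(-1, +1)$ and the output is centred on 0

* the gradient is steeper than sigmoid's (at most 1 instead of 0.25)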

Problems:

* vanishing gradient problem
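
The identity between tanh and the sigmoid can be checked numerically. This sketch compares `2 * sigmoid(2x) - 1` against `math.tanh` on a grid of points and evaluates the tanh derivative $1 - \tanh^2(x)$; the grid itself is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# Largest difference between the two formulas on x = -5.0, -4.9, ..., 5.0
max_gap = max(abs(tanh_from_sigmoid(x / 10) - math.tanh(x / 10))
              for x in range(-50, 51))
steepest = tanh_grad(0.0)    # 1.0, compared with 0.25 for the sigmoid
small = tanh_grad(5.0)       # the gradient still vanishes in the tails
```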

**ReLU**

$$
f(x) =
  \begin{cases}
     x  &  \text{if } x \geq 0 \\
     0  &  \text{if } x < 0
  \end{cases}
$$

advantages:

* It is non-linear overall, because it behaves differently in the two regions (x greater than 0 or less than 0)

* it mitigates the vanishing gradient problem: within the positive region it is linear, so the gradient is exactly 1 and does not shrink

* it makes your NN sparse. When the input of a neuron is less than 0, its output is 0, so this neuron contributes nothing to the output (fewer neurons are firing). This is actually what we want, because each neuron can focus on a small region of the input. Fewer neurons take part in each computation.

* fast to compute (no exponential, fewer neurons are firing)

problems:

* the gradient is 0 when x is less than 0. A neuron whose input is always negative stops learning entirely (often called the "dying ReLU" problem).

**Leaky ReLU**

$$
f(x) =
  \begin{cases}
     x  &  \text{if } x \geq 0 \\
     0.01x  &  \text{if } x < 0
  \end{cases}
$$

**Parametric ReLU**

$$
f(x) =
  \begin{cases}
     x  &  \text{if } x \geq 0 \\
     ax  &  \text{if } x < 0
  \end{cases}
$$
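
The three ReLU variants differ only in what they do for negative inputs, which a few lines of Python make concrete. The gradient helpers use the convention that the slope at exactly 0 is 1, an assumption that matches the $x \geq 0$ branch of the definitions.

```python
def relu(x):
    return x if x >= 0 else 0.0

def leaky_relu(x, slope=0.01):
    return x if x >= 0 else slope * x

def parametric_relu(x, a):
    # Same shape as leaky ReLU, but `a` is a parameter learned during training.
    return x if x >= 0 else a * x

def relu_grad(x):
    # Zero gradient for negative inputs: the source of "dead" neurons.
    return 1.0 if x >= 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    # A small non-zero gradient keeps negative-input neurons trainable.
    return 1.0 if x >= 0 else slope
```

For example, `relu(-3.0)` is 0 with gradient 0, while `leaky_relu(-3.0)` is about -0.03 with gradient 0.01.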

### How to choose them?

* Sigmoid and Tanh functions work well for classifiers (see their graphs: y changes significantly around 0).

* ReLU works in almost all scenarios.

## <center> how to train DNN

### Have a bad result on Training data

### Solution:
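* New activation function:
    * Vanishing Gradient Problem: sigmoid-like activations shrink the gradient layer by layer, so the alternatives below are used instead.
    
    * ReLU (Rectified Linear Unit): the gradient is 1 for every active neuron, so the active part of the network behaves like a linear model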
    $$
    \sigma(z) =
        \begin{cases}
          z & \quad \text{if } z>=0 \\
          0 & \quad \text{if } z<0 
        \end{cases}             
    $$
    
    * Leaky ReLU:
    $$
    \sigma(z) =
        \begin{cases}
          z & \quad \text{if } z>=0 \\
          0.01z & \quad \text{if } z<0 
        \end{cases}             
    $$
    
    * Parametric ReLU:
    $$
    \sigma(z, \alpha) =
        \begin{cases}
          z & \quad \text{if } z>=0 \\
          \alpha z & \quad \text{if } z<0 
        \end{cases}             
    $$
    
    * Maxout (ReLU is a special case of Maxout):
    
    group the outputs of each layer; the output of the activation function is the largest value in each group. With two values per group, one of them fixed at 0, this is exactly ReLU.
    
* Adaptive Learning Rate:
    * RMSProp ($g^t$ is the gradient at step $t$):
    $$
    \begin{aligned}
    w^{t+1} &\leftarrow w^t - \frac{\eta}{\sigma^t} g^t , \\
    \sigma^0 &= |g^0| , \\
    \sigma^t &= \sqrt{\alpha(\sigma^{t-1})^2 + (1 - \alpha)(g^t)^2}
    \end{aligned}
    $$
    
    * Momentum:
    $$
    \begin{aligned}
    w^{t+1} &= w^{t} + v^{t+1}, \\
    v^0 &= 0, \\
    v^{t+1} &= \lambda v^t - \eta g^t
    \end{aligned}
    $$
    
    * Adam:
    
    RMSProp + Momentum
    
    increase training speed
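
The RMSProp and Momentum updates above can be sketched on a toy problem. The code below assumes a one-parameter loss $f(w) = w^2$ and made-up values for $\eta$, $\alpha$ and $\lambda$; taking $\sigma^0$ as the absolute value of the first gradient is also an assumption, made so that the step size stays positive.

```python
import math

def grad(w):
    return 2.0 * w          # gradient of the toy loss f(w) = w^2

def rmsprop(w, eta=0.1, alpha=0.9, steps=100):
    sigma = None
    for _ in range(steps):
        g = grad(w)
        if sigma is None:
            sigma = abs(g)                       # sigma^0 = |g^0|
        else:
            sigma = math.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
        if sigma == 0.0:
            break                                # already at the minimum
        w = w - eta / sigma * g                  # step size adapts to past gradients
    return w

def momentum(w, eta=0.1, lam=0.9, steps=200):
    v = 0.0                                      # v^0 = 0
    for _ in range(steps):
        v = lam * v - eta * grad(w)              # v^{t+1} = lambda v^t - eta g^t
        w = w + v                                # w^{t+1} = w^t + v^{t+1}
    return w

w_rms = rmsprop(1.0)
w_mom = momentum(1.0)
```

Both runs start at `w = 1.0` and end near the minimum at 0. With a fixed `eta`, RMSProp keeps taking steps of roughly constant size, so it hovers around the minimum rather than settling exactly on it.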

### Have a bad result on Testing data
### Solution:
* Early Stopping:
Stop training when the loss on the validation set no longer decreases

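* Regularization: add a penalty on the size of the weights to the loss.
    * L2 (here $\|\theta\|_2$ denotes the sum of squared weights): 
    $$
    \begin{aligned}
        L^{\prime}(\theta) &= L(\theta) + \lambda\frac{1}{2}\|\theta\|_2, \\
        \|\theta\|_2 &= (w_1)^2 + (w_2)^2 + ..., \\
        \frac{\partial{L}^{\prime}}{\partial{w}} &= \frac{\partial{L}}{\partial{w}} + \lambda w
    \end{aligned}
    $$
    * L1:
    $$
    \begin{aligned}
        L^{\prime}(\theta) &= L(\theta) + \lambda\|\theta\|_1, \\
        \|\theta\|_1 &= |w_1| + |w_2| + ..., \\
        \frac{\partial{L}^{\prime}}{\partial{w}} &= \frac{\partial{L}}{\partial{w}} + \lambda\, \text{sgn}(w)
    \end{aligned}
    $$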
    
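* Dropout:

    During training, each neuron has a p% chance of being dropped before each parameter update.

    At test time no neuron is dropped, and the final weights should be multiplied by (1 - p%).

    Dropout effectively trains many thinned networks and uses their average as the final model.

    Hence it gives a good result on testing data.
    
    Dropout can be added to any layer, including CNN layers.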

## <center> Hyper-parameters in DNN

### reference

https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-i-hyper-parameter-8129009f131b

library:

Hyperas, Hyperopt

### what are Hyper-parameters

A hyper-parameter is a configurable value that must be set before training the model.

### Hyper-parameters:

**Number of Layers**

* large number: over-fitting, vanishing or exploding gradient problems

* low number: high bias, model too simple

* depends on the data size

**Number of hidden units per layer**

* depends on the data size

**Activation Function**

* sigmoid and Tanh may do well for shallow networks

**Optimizer**

* SGD works well for shallow networks, but may get stuck at saddle points and local minima

* Adam is generally a good default

* Adagrad for sparse data

**Learning Rate**

* try powers of 10: 0.001, 0.01, 0.1, 1...

**Initialization**

* does not play a decisive role, but He-normal/uniform initialization is a reasonable choice

* avoid zero or any constant value
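
For reference, He initialization draws weights with variance $2 / \text{fan\_in}$, where fan_in is the number of inputs to the layer. The sketch below is a plain-Python illustration with hypothetical helper names and layer sizes, not the API of any particular library.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Weights ~ N(0, 2 / fan_in), the He-normal scheme."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def he_uniform(fan_in, fan_out, rng):
    """Weights ~ U(-limit, limit) with limit = sqrt(6 / fan_in), same variance."""
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
w = he_normal(fan_in=200, fan_out=100, rng=rng)
flat = [v for row in w for v in row]
variance = sum(v * v for v in flat) / len(flat)   # should be close to 2 / 200 = 0.01
```

Because every weight is drawn at random, no two neurons start out identical, which is the point of avoiding zero or constant initial values.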

**Batch Size**

* try powers of 2

**Number of Epochs**

* ...

**Dropout**

* ...

**L1/L2 Regularization**

* L1 is a stronger penalty than L2 (it tends to push weights to exactly zero); try coefficients in powers of 10: 0.0001, 0.001...


## <center> Convolutional Neural Network,  CNN

### What can CNN do?

1. recognize a small region of the whole image

2. recognize the same pattern when it appears in different regions (using the same parameters)

3. subsampling (keeping only part of the pixels) does not change the object


### Convolution, for 1 and 2
* filter: each filter detects a small pattern. 
    * a filter is a matrix; the dot product of the filter and a patch of the image measures how strongly that patch matches the filter's pattern. Equal products mean the two patches match the pattern equally well, and a larger number means a closer match
* stride: the distance the filter moves at each step
* output: the output of a filter is a feature map, one value per patch of the image, which the next layer can use as its input.
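
The sliding dot product described above can be sketched in a few lines. The function below assumes a square filter and no padding; the 4x4 image and the 2x2 diagonal-detecting filter are toy values chosen for the example.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    f = len(kernel)                                  # square filter assumed
    out_h = (len(image) - f) // stride + 1
    out_w = (len(image[0]) - f) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the filter and one patch of the image.
            row.append(sum(image[top + a][left + b] * kernel[a][b]
                           for a in range(f) for b in range(f)))
        feature_map.append(row)
    return feature_map

# Toy 4x4 image with a diagonal line, and a 2x2 filter that responds to diagonals.
image = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]
kernel = [[1, -1],
          [-1, 1]]
feature_map = conv2d(image, kernel)
```

The largest values in `feature_map` sit on the diagonal, where the patches match the filter, and the same filter weights are reused at every position.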

### Max Pooling, for 3
* group the results of the dot products between the filter and the image patches
* output the max of each group
* the max value marks where the pattern was recognised most strongly, which is the useful part for the neural network
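
Max pooling as listed above can be sketched as follows. The 2x2 window with stride 2 is a common choice but an assumption here, as are the feature-map values.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Group the feature map into size x size windows and keep each window's max."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 2],
               [2, 0, 3, 6]]
pooled = max_pool2d(feature_map)
```

The 4x4 map shrinks to 2x2 while the strongest response of each region is kept.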

### More applications:
* Deep Dream
* AlphaGo
* audio recognition
* Text

### what does CNN learn?

Since we use filters to detect patterns, what the machine learns is which kinds of pictures or inputs activate a given filter, that is, which inputs most resemble the pattern we want to find, say a mouse or an eye.

## <center> Hyper-parameters in CNN

### Hyper-parameters

**kernel/Filter Size**

* small filters collect local information. Use them if you think what differentiates objects are small, local features (5 * 5, 3 * 3)

* large filters look at global, high-level information. Use them if you think a large number of pixels is necessary for the network to recognize the object (11 * 11, 9 * 9)

* generally use odd sizes 

* (3 * 3, 5 * 5, 7 * 7) are often used for small-sized images


**padding**

* adds zeros around the image as extra rows and columns, which keeps the output the same size as the input after filtering

* set 'padding = same' to enable padding when you think the border is important

* the default is 'padding = valid', where the output size is $\lceil (n-f+1)/s \rceil$, where n is the input dimension, f is the filter size and s is the stride length
    
* padding for maxpooling, reference: https://www.pico.net/kb/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-tensorflow
    * padding = 'valid',
        * `out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))`
        * `out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))`
    * padding = 'same',
        * `out_height = ceil(float(in_height) / float(strides[1]))`
        * `out_width = ceil(float(in_width) / float(strides[2]))`
        * `pad_along_height = max((out_height - 1) * strides[1] + filter_height - in_height, 0)`
        * `pad_along_width = max((out_width - 1) * strides[2] + filter_width - in_width, 0)`
        * `pad_top = pad_along_height // 2`
        * `pad_bottom = pad_along_height - pad_top`
        * `pad_left = pad_along_width // 2`
        * `pad_right = pad_along_width - pad_left`
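
The size formulas above translate directly into code. The sketch below works on one dimension at a time; the function names are hypothetical and the 28-pixel input is only an example.

```python
import math

def valid_output_size(in_size, filter_size, stride):
    """Output length along one dimension with 'valid' padding (no zeros added)."""
    return math.ceil((in_size - filter_size + 1) / stride)

def same_output_size(in_size, stride):
    """Output length along one dimension with 'same' padding."""
    return math.ceil(in_size / stride)

def same_padding(in_size, filter_size, stride):
    """Return (pad_before, pad_after) along one dimension for 'same' padding."""
    out_size = same_output_size(in_size, stride)
    pad_total = max((out_size - 1) * stride + filter_size - in_size, 0)
    pad_before = pad_total // 2
    return pad_before, pad_total - pad_before
```

For a 28-pixel input and a 3-pixel filter with stride 1, 'valid' gives 26 outputs, while 'same' keeps 28 outputs by adding one zero on each side.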

**stride**

* 1 or 2

**Number of Channels/filters**

* For an image, channels means the number of color channels (3 for RGB), but inside a CNN it is the number of filters in a layer

* a greater number means more features learnt and more chance to over-fit, and vice versa

* should be low in the first layers, which detect low-level features, then increase in deeper layers

* start with a small value, then gradually increase it while the width of the generated feature maps is reduced

* or stay the same

* 32-64-128... or 32-32-64-64...

**Pooling-layer parameters**

* 2 * 2 or 3 * 3

### Tips

* keep adding until over-fit, then try *dropout, regularization, batch norm, data augmentation...*

* try classic networks like LeNet, AlexNet, VGG-16, VGG-19 etc...

* try Conv-Pool-Conv-Pool or try Conv-Conv-Pool-Conv-Conv-Pool 

## <center> Why Deep Learning?

* Deep Learning uses fewer parameters and less data
* Modularization

## <center> Batch Normalization

### what is it?

https://arxiv.org/pdf/1805.11604.pdf

**Recall:**

Normalization brings data onto a similar scale, which removes the influence of differing scales and speeds up training.

Batch Normalization centres the data by subtracting the mean, then divides the result by the standard deviation. After BN, the normalised data has mean 0 and standard deviation 1.



### In NN

**where to use them?**

Typically, BN uses the whole batch to compute the mean and standard deviation. Since it normalizes the input of a layer, we apply it between layers (either before or after the activation function). 

If you use it before the activation function, it may mitigate the vanishing gradient problem, because the data stays in the central region (imagine the plot of the sigmoid function).

**How to do backpropagation**

Let's look at a simple NN with only one hidden layer with parameters $W, B$; the inputs are $X$.

The training process is:
1. $Z_1 = WX + B$
2. $Z_2 = \frac{Z_1 - \mu_{x}}{\sigma_{x}}$, where $\mu_x$ is the mean and $\sigma_x$ the standard deviation (SD) of $Z_1$ over the $m$ samples in the batch. This tells us that $\mu_x$ and $\sigma_x$ depend on $X$.
3. $Y = f(Z_2)$, where $f()$ is an activation function

Hence, when we do backpropagation, the gradient must also flow through $\mu_x$ and $\sigma_x$.
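
The three training steps can be sketched for a single feature. The small constant `eps`, the ReLU activation and all the numbers below are assumptions for the example; `eps` is a commonly used guard against division by zero and is not part of the formulas above.

```python
import math

def batch_norm(z_batch, eps=1e-5):
    """Normalize one feature over a batch: subtract the mean, divide by the SD."""
    m = len(z_batch)
    mu = sum(z_batch) / m
    var = sum((z - mu) ** 2 for z in z_batch) / m
    sigma = math.sqrt(var + eps)       # eps avoids division by zero (assumed value)
    return [(z - mu) / sigma for z in z_batch], mu, sigma

w, b = 3.0, 10.0                       # hypothetical layer parameters
x_batch = [1.0, 2.0, 3.0, 4.0]
z1 = [w * x + b for x in x_batch]      # step 1: Z1 = WX + B
z2, mu, sigma = batch_norm(z1)         # step 2: normalize with batch statistics
y = [max(z, 0.0) for z in z2]          # step 3: activation (ReLU here)
```

Since `mu` and `sigma` are computed from the batch itself, changing any input changes every normalized value, which is why backpropagation has to account for them.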

**Tips**

* use a lower dropout probability

* for activation functions that perform better with a different mean or SD, we can extend the NN.
    * using the example given above.
    * keep $Z_1$, $Z_2$
    * add one more layer, $Z_3 = \gamma \cdot Z_2 + \beta$
    * this new layer can change the mean and SD.
    * Don't worry that $\gamma$ and $\beta$ simply undo $\sigma$ and $\mu$: $\gamma$ and $\beta$ are independent of $X$. In other words, they are learned parameters.
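
The effect of the extra layer $Z_3 = \gamma \cdot Z_2 + \beta$ is easy to verify: it sets the mean to $\beta$ and scales the SD by $\gamma$. The values of `gamma` and `beta` below are made up; in a real network they would be learned.

```python
def scale_and_shift(z2_batch, gamma, beta):
    """Z3 = gamma * Z2 + beta, with gamma and beta learned like any other weight."""
    return [gamma * z + beta for z in z2_batch]

z2 = [-1.5, -0.5, 0.5, 1.5]            # already centred (mean 0)
z3 = scale_and_shift(z2, gamma=2.0, beta=3.0)
mean_z3 = sum(z3) / len(z3)
sd_z2 = (sum(z * z for z in z2) / len(z2)) ** 0.5
sd_z3 = (sum((z - mean_z3) ** 2 for z in z3) / len(z3)) ** 0.5
```

Here the mean moves from 0 to `beta` and the SD is multiplied by `gamma`, regardless of what the input batch looked like before normalization.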

### reference

https://stackoverflow.com/questions/34716454/where-do-i-call-the-batchnormalization-function-in-keras

https://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/

https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/

### Benefits

* reduces training time, and makes very deep nets trainable
    * allows large learning rates: BN makes the network more stable during training, so learning rates much larger than normal can be used. 
    * fewer exploding/vanishing gradients with sigmoid, tanh, etc.
    
* It can be used with most network types: multilayer perceptrons, CNN, RNN...

* can make training deep networks less sensitive to the weight initialization.
    * initial $W_1 = w$, $W_2 = k \cdot w$
    * output: $Z_1 = W_1 \cdot X = w \cdot X$ and $Z_2 = W_2 \cdot X = k \cdot w \cdot X$; the outputs are significantly different.
    * with BN, $\hat{Z}_1 = \frac{Z_1 - \mu_x}{\sigma_x} = \frac{w \cdot X - \mu_x}{\sigma_x}$, where $\mu_x$ and $\sigma_x$ are the mean and SD of $w \cdot X$
    * $\hat{Z}_2 = \frac{Z_2 - k \cdot \mu_x}{k \cdot \sigma_x} = \frac{k \cdot w \cdot X - k \cdot \mu_x}{k \cdot \sigma_x} = \hat{Z}_1$ (for $k > 0$)
    * same results
* reduces the need for regularization.
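
The claim about weight initialization can be checked numerically: scaling the weight by a positive factor $k$ leaves the normalized output unchanged. The batch, the weight `w` and the factor `k` below are arbitrary example values.

```python
def normalize(values):
    """Subtract the batch mean and divide by the batch SD."""
    m = len(values)
    mu = sum(values) / m
    sd = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    return [(v - mu) / sd for v in values]

x_batch = [0.5, 1.0, 2.0, 4.0]
w, k = 0.3, 100.0                       # second initialization is k times larger
z1 = [w * x for x in x_batch]
z2 = [k * w * x for x in x_batch]
max_gap = max(abs(a - b) for a, b in zip(normalize(z1), normalize(z2)))
```

`max_gap` is zero up to floating-point error, even though `z2` is 100 times larger than `z1` before normalization.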
