# Deep Learning: An Overview


## Neural Networks and Deep Learning

### Commonly Seen Networks
1. Standard Neural Network
2. CNN
3. RNN
4. Custom Hybrid

### Logistic Regression  
Given $x \in \Re^{n_x}$, we want $\hat{y} = P(y=1 \mid x)$.  
Parameters: $w \in \Re^{n_x}, b \in \Re$  
Output: $\hat{y} = \sigma (w^Tx+b)$, where  
$\sigma(z) = \frac{1}{1+e^{-z}}$

Loss (Error) Function  
$L(\hat{y}, y) = -[y\log\hat{y}+(1-y)\log(1-\hat{y})]$  
Cost Function  
$J(w,b) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]$
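A minimal NumPy sketch of the prediction and the cost follows. The function names `sigmoid` and `cost` and the toy data are illustrative assumptions, and examples are assumed to be stored as the columns of `X`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Cross-entropy cost J(w, b), averaged over the m columns of X."""
    m = X.shape[1]
    Y_hat = sigmoid(w.T @ X + b)          # predictions, shape (1, m)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return float(np.sum(losses) / m)

# Toy data (made up for illustration): n_x = 2 features, m = 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0
J = cost(w, b, X, Y)   # with w = 0 and b = 0 every prediction is 0.5
```

With all-zero parameters each prediction is 0.5, so the cost equals $\log 2$ regardless of the labels.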

### GD (Gradient Descent)  
Repeat {  
    $w := w - \alpha\frac{dJ(w)}{dw}$  
}  
where $\alpha$ is the learning rate.

### Normalizing  
$x_{normalized} =  \frac{x}{\left \| x \right \|}$  


### Softmax  
For $x \in \Re^{1 \times n}$, $softmax(x) = softmax([x_1, \dots, x_n])$
$= [\frac{e^{x_1}}{\sum_{j}e^{x_j}}, \dots, \frac{e^{x_n}}{\sum_{j}e^{x_j}}]$ 
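The formula can be sketched in NumPy as below. Subtracting the row maximum before exponentiating is an addition not present in the formula; it leaves the result unchanged but avoids overflow. The function name and test values are illustrative.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax of a 2-D array of shape (rows, n)."""
    # Subtracting the row maximum does not change the result
    # but keeps np.exp from overflowing.
    shifted = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])
s = softmax(x)   # each row sums to 1
```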

### Useful `np` Functions  
```python
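import numpy as np

# Example inputs; every function below is vectorized
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[0.5, 1.5], [2.5, 3.5]])

# e^x, element-wise
np.exp(x)

# ||x|| of each row, kept as a column so it broadcasts against x
np.linalg.norm(x, axis=1, keepdims=True)

# sum of each row
np.sum(x, axis=1, keepdims=True)

# dot product (matrix product for 2-D arrays)
np.dot(x, y)

# outer product (inputs are flattened first)
np.outer(x, y)

# element-wise multiplication
np.multiply(x, y)

# reshape to (rows, cols)
np.reshape(x, (4, 1))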
```

### Common Steps for pre-processing a new dataset are:  
1. Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, ...)
2. Reshape the datasets so that each example becomes a vector of size (num_px\*num_px\*3, 1)
3. Standardize the data. (For images, dividing every pixel by 255, the maximum value of a color channel, is usually enough.)
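The three steps above can be sketched as follows. The array `train_x_orig` is a randomly generated stand-in for a real image set, and its layout `(m, num_px, num_px, 3)` is an assumption.

```python
import numpy as np

# Hypothetical toy dataset: m = 4 RGB images of 8 x 8 pixels, values 0..255.
rng = np.random.default_rng(0)
train_x_orig = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

# Step 1: read off the dimensions.
m_train, num_px = train_x_orig.shape[0], train_x_orig.shape[1]

# Step 2: flatten each image into a column, giving shape (num_px*num_px*3, m_train).
train_x_flat = train_x_orig.reshape(m_train, -1).T

# Step 3: scale pixel values into [0, 1].
train_x = train_x_flat / 255.0
```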

### To Implement a NN:  
1. Initialize (w, b)
2. Optimize the loss iteratively to learn the parameters (w, b)   
    - compute the cost and its gradient  
    - update the parameters using gradient descent  
3. Use the learned (w, b) to predict the labels for a given set of examples
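For the simplest case, logistic regression, the three steps might look like the sketch below. The function names, the default hyper-parameters and the toy data are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, num_iterations=1000, learning_rate=0.1):
    """Fit logistic regression with batch gradient descent."""
    n_x, m = X.shape
    w, b = np.zeros((n_x, 1)), 0.0            # 1. initialize
    for _ in range(num_iterations):           # 2. optimize
        A = sigmoid(w.T @ X + b)              #    forward pass
        dw = X @ (A - Y).T / m                #    gradient of the cost w.r.t. w
        db = np.sum(A - Y) / m                #    gradient of the cost w.r.t. b
        w -= learning_rate * dw               #    gradient descent update
        b -= learning_rate * db
    return w, b

def predict(w, b, X):
    """3. Label an example 1 when the predicted probability exceeds 0.5."""
    return (sigmoid(w.T @ X + b) > 0.5).astype(int)

# Toy, linearly separable data (made up): the label is 1 when the feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = train(X, Y)
predictions = predict(w, b, X)
```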

### For Better Algorithm Efficiency and Accuracy  
1. Preprocessing the dataset is important.
2. Implement each function separately, then combine them into a model.
3. Tuning the learning rate (a hyper-parameter) can make a big difference to the algorithm.

### Shallow Neural Network  
For each layer $i$, with $A^{[0]} = X$ {  
$Z^{[i]} = W^{[i]}A^{[i-1]}+b^{[i]},$  
$A^{[i]} = \sigma(Z^{[i]})$  
}

### Activation Functions
1. sigmoid:  
$$
a(x) = \frac{1}{1+e^{-x}}
$$
2. tanh:  
$$
a(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}
$$
3. ReLU(Rectified Linear Unit):  
$$
    a(x) =max(0,x)
$$
4. Leaky ReLU:  
$$
a(x) = max(0.01*x,x)
$$

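### Why do we need a non-linear activation function?  
With a linear activation function, every layer is a linear map, and a composition of linear maps is itself linear. However many layers are stacked, the network can only compute a linear function of its input.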

### Derivatives of activation functions
1. sigmoid:  
$$
g'(z)=g(z)[1-g(z)]
$$
2. tanh:
$$
g'(z)=1-[\tanh(z)]^2
$$
3. ReLU:  
$$
g'(z) = 
\left\{\begin{matrix}
0 & ,\text{if}\ z < 0 \\ 
1 & ,\text{if}\ z > 0 \\
\text{undefined} & ,\text{if}\ z = 0 
\end{matrix}\right.
$$
4. Leaky ReLU:
$$
g'(z) = 
\left\{\begin{matrix}
0.01 & ,\text{if}\ z < 0 \\ 
1 & ,\text{if}\ z > 0 \\
\text{undefined} & ,\text{if}\ z = 0 
\end{matrix}\right.
$$
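One way to sanity-check these derivatives is to compare them with a central-difference approximation, as in the sketch below. The function names are illustrative, and returning 0 for the ReLU derivative at exactly $z = 0$ is a common convention rather than something the formulas prescribe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

# Analytic derivatives, matching the formulas above.
def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Undefined at z == 0; this sketch returns 0 there by convention.
    return (z > 0).astype(float)

def d_leaky_relu(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

def numeric_derivative(f, z, eps=1e-6):
    """Central-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # avoids the kink at z = 0
```

Calling `numeric_derivative(sigmoid, z)` should agree closely with `d_sigmoid(z)`, and likewise for the other three pairs.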

### General GD Algorithm
1. Forward Propagation (with $A^{[0]} = X$):
$$
Z^{[i]} = W^{[i]}A^{[i-1]} + b^{[i]},\\  
A^{[i]} = g^{[i]}(Z^{[i]})
$$
2. Backward Propagation ($*$ is the element-wise product):  
$$
dZ^{[i]} = W^{[i+1]T}dZ^{[i+1]} * g'^{[i]}(Z^{[i]})\\  
dW^{[i]} = \frac{1}{m}dZ^{[i]}A^{[i-1]T}\\ 
db^{[i]} = \frac{1}{m}\sum dZ^{[i]}
$$
3. Update Weights:
$$
W^{[i]} = W^{[i]} - \alpha * dW^{[i]}\\
b^{[i]} = b^{[i]} - \alpha * db^{[i]}
$$
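The three steps can be sketched for a small two-layer network as below. The choice of tanh for the hidden layer and sigmoid with cross-entropy for the output, as well as all names and the toy data, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(params, X, Y):
    """One forward and backward pass for a 2-layer network (tanh, then sigmoid)."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    cost = float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)
    # Backward propagation
    dZ2 = A2 - Y                                  # sigmoid output with cross-entropy loss
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # element-wise product with tanh'(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return cost, (dW1, db1, dW2, db2)

def update(params, grads, alpha):
    """Gradient descent step: parameter = parameter - alpha * gradient."""
    return tuple(p - alpha * g for p, g in zip(params, grads))

# Toy problem (made up): 2 inputs, 3 hidden units, 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
Y = (X[0:1] + X[1:2] > 0).astype(float)
params = (rng.standard_normal((3, 2)) * 0.1, np.zeros((3, 1)),
          rng.standard_normal((1, 3)) * 0.1, np.zeros((1, 1)))
cost_before, grads = forward_backward(params, X, Y)
for _ in range(200):
    _, grads = forward_backward(params, X, Y)
    params = update(params, grads, alpha=0.5)
cost_after, _ = forward_backward(params, X, Y)
```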



## Improving Deep NN - Hyper-parameters tuning, Regularization and Optimization  

### Setting Up Machine Learning Applications
#### Hyper-parameters
1. \# layers
2. \# hidden units
3. learning rate
4. activation functions

| Condition   | Train Set (%)    | Dev Set (Cross Validation) (%)   | Test Set (Optional) (%)   |
| :------------- | :------------- | :------------- |:------------- |
| Small amount of data     | 60      | 20    | 20    |
| Big data    | 98    | 1    | 1    |

#### Make sure the Dev and Test sets come from the same distribution
The test set estimates how the model will perform on the real target, so the dev set that guides decisions must follow the same distribution.

#### Bias And Variance
- Bias is measured by the gap between training set performance and the baseline (generally human-level performance).
- Variance is measured by the gap between dev set performance and training set performance.

#### Solution for high Bias and Variance
1. High Bias:
    - Bigger Network
    - Train Longer
    - NN architecture research
2. High Variance:
    - More Data
    - Regularization
    - NN architecture research

### Regularization
- $L_2$ Regularization:
$$
\frac{\lambda}{2m}\left \| w \right \|^{2}_{2} = \frac{\lambda}{2m}\sum_{j = 1}^{n_x} w_j^2 = \frac{\lambda}{2m}w^Tw
$$

- $L_1$ Regularization: 
$$
\frac{\lambda}{2m}\sum_{j = 1}^{n_x}|w_j|=\frac{\lambda}{2m}\left \| w \right \|_1
$$
- Frobenius Norm (the matrix form, used for the weights $W^{[l]}$ of layer $l$):
$$
\left \| W^{[l]} \right \|^2_F = \sum_{i = 1}^{n^{[l]}}\sum_{j = 1}^{n^{[l-1]}}(w_{ij}^{[l]})^2
$$
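As a rough illustration, the squared-norm penalty with the $\frac{\lambda}{2m}$ factor and the extra gradient term it produces could be computed as below. The helper names and example matrices are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared Frobenius norms of all weight matrices."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def l2_gradient_term(W, lambd, m):
    """Extra term the penalty adds to dW: (lambda / m) * W."""
    return lambd / m * W

W1 = np.array([[1.0, -2.0], [0.0, 3.0]])
W2 = np.array([[2.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.5, m=10)   # 0.5 / 20 * (14 + 5)
```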

#### How does regularization prevent overfitting?
Regularization weakens the influence of some hidden units, which reduces the network's capacity to overfit.  
The regularization parameter pushes the weights toward zero, so each unit operates closer to the linear region of its activation and the overall function becomes closer to linear.

### Dropout Regularization
#### What's a dropout?
- Randomly eliminating some hidden units during training
- No dropout when making predictions 

#### Why does dropout work?
- A unit cannot rely on any single input feature, because that feature may be dropped, so the weights are spread out.  
- Each layer can use its own `keep_prob` value (dropout is usually not applied to the input layer).
- If there is no overfitting, there is no need for dropout.
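A minimal sketch of dropout on one layer's activations follows. It uses the "inverted dropout" variant, where the kept activations are divided by `keep_prob`; the function names and the toy input are assumptions.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A; returns the result and the mask."""
    mask = rng.random(A.shape) < keep_prob   # True for units that are kept
    A_dropped = A * mask / keep_prob         # rescale so the expected value is unchanged
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Reuse the same mask and scaling on the gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 10000))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
```

At prediction time neither function is called, which matches the rule that dropout is used only during training.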

### Other ways to regularize
- Data Augmentation
    - flipping horizontally
    - random distortion/zooming
- Early Stopping

### Orthogonalization 
1. Optimize the cost function:
    - Gradient descent
    - Momentum
    - RMSprop
    - Adam
2. Avoid overfitting:
    - Regularization
    - Getting more data
3. Early stopping couples the two problems above: it affects both the optimization of the cost and overfitting, so they can no longer be handled independently.
4. Bias $\to$ error rate & Variance $\to$ overfitting


## Setting up your optimization problem 
 
### Normalizing training sets
Normalizing the training set gives all features a similar scale, so the cost surface is more symmetric and gradient descent can make similar progress in every direction, which speeds up training.

### Gradients Vanishing/Exploding

When the number of layers is very large, the repeated multiplication by the stacked weight matrices can make activations and gradients grow or shrink exponentially with depth. They become extremely large (possibly overflowing) or extremely small, which is called exploding or vanishing gradients.

### Weights Initialization 
- A [reference](https://zhuanlan.zhihu.com/p/25110150)
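- Zero Initialization: every hidden unit computes the same output and receives the same update, so the network is no more expressive than a linear model.
- Random Initialization: breaks symmetry, but is hard to train when the scale is poorly chosen.
- Xavier Initialization: $random*\sqrt{\frac{1}{n^{[l-1]}}}$, where $n^{[l-1]}$ is the number of units in the previous layer
- He Initialization: $random*\sqrt{\frac{2}{n^{[l-1]}}}$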
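The two scaled schemes could be sketched as below, assuming that "random" means a standard normal draw and that biases start at zero. The function name and the layer sizes are illustrative.

```python
import numpy as np

def initialize(layer_dims, method="he", seed=0):
    """Return a list of (W, b) pairs for layers of the given sizes."""
    rng = np.random.default_rng(seed)
    numerator = {"xavier": 1.0, "he": 2.0}[method]
    params = []
    for n_prev, n_curr in zip(layer_dims[:-1], layer_dims[1:]):
        # Scale standard normal draws by sqrt(numerator / size of previous layer).
        W = rng.standard_normal((n_curr, n_prev)) * np.sqrt(numerator / n_prev)
        b = np.zeros((n_curr, 1))            # zeros are fine for the biases
        params.append((W, b))
    return params

params = initialize([1000, 500, 1], method="he")
W1, b1 = params[0]
```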


### Gradient Checking can be an efficient way to debug a neural network

A numerical **approximation of the gradients** is typically compared against the gradients from back-propagation.

#### Gradient Checking Notes
1. Don't use it in training, ONLY for debugging.
2. If the algorithm fails the gradient check, look at individual components to identify the bug.
3. Remember to include the regularization term when calculating the gradient.
4. It doesn't work with dropout (set the dropout parameter `keep_prob` to `1` while checking).
5. Run at random initialization; perhaps again after some training.
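A gradient check might be sketched as follows for a parameter vector `theta`. The two-sided difference, the relative-error formula and the toy cost are assumptions chosen for illustration; a real network would first flatten all parameters into one vector.

```python
import numpy as np

def gradient_check(f, grad_f, theta, eps=1e-7):
    """Relative difference between analytic and two-sided numerical gradients.

    theta is assumed to be a 1-D float array.
    """
    analytic = grad_f(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        numeric[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return np.linalg.norm(analytic - numeric) / (
        np.linalg.norm(analytic) + np.linalg.norm(numeric))

# Toy cost (made up): J(theta) = sum(theta^2), whose gradient is 2 * theta.
def toy_cost(theta):
    return float(np.sum(theta ** 2))

theta = np.array([1.0, -2.0, 3.0])
good = gradient_check(toy_cost, lambda t: 2 * t, theta)
bad = gradient_check(toy_cost, lambda t: 3 * t, theta)   # deliberately wrong gradient
```

A very small value such as `good` suggests the analytic gradient is right, while a large value such as `bad` points to a bug.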

### Practice Instructions
#### Initialization
##### Zero Initialization
- The weights $W^{[l]}$ should be initialized randomly to break symmetry; all-zero weights do not.
- It is okay to initialize $b^{[l]}$ to zeros; symmetry is still broken so long as $W^{[l]}$ is initialized randomly.

##### Random Initialization 
- Initializing the weights to very large random values doesn't work well: the cost can already be infinite at iteration 0.
- Initializing the weights to smaller values avoids this problem.

##### Xavier Initialization
$$
random\_number*\sqrt{\frac{1}{n^{[l-1]}}}
$$
where $n^{[l-1]}$ is the dimension of the previous layer.

##### He Initialization 
$$
random\_number*\sqrt{\frac{2}{n^{[l-1]}}}
$$

##### Tips
1. Different initializations lead to different results.
2. Random initialization is used to break symmetry and make sure different hidden units can learn different things.
3. Don't initialize to values that are too large.
4. He Initialization works well for networks with ReLU activations.

#### Regularization
##### $L_2$ Regularization
It relies on the assumption that a model with small weights is simpler than a model with large weights.  

$L_2$ Regularization affects: 
- The cost function: 
    - A regularization term is added to the cross-entropy cost.
- The back-propagation function: 
    - There are extra terms in the gradients with respect to the weight matrices.
- The weights, which end up smaller ("weight decay"): 
    - Weights are pushed toward smaller values.

##### Dropout
- A regularization technique
- Only use dropout during training, never at test time.
- Apply dropout in both forward and backward propagation.
- Divide the kept activations by `keep_prob` to compensate for the units that were shut off, so the expected value of the activations stays the same.

#### Tips
- Regularization will help reduce overfitting problem
- Regularization will drive weights to lower values.
- $L_2$ Regularization and Dropout are two very effective regularization techniques.
    

## Optimization Algorithms
### Mini-Batch vs Batch Training
Problem to address: the dataset is so large that one pass over it costs a lot of time and computation.

The mini-batch method splits the training set into batches and updates the weights once per batch, so the cost does not decrease smoothly from one update to the next.  
The batch size should be neither 1 (no vectorization, one example at a time) nor m (batch training, which loses efficiency on large datasets), but some value in between.
#### Methodology
- If the training set is small (typically fewer than a thousand examples), choose batch training.
- If the training set is large, powers of 2 are recommended as the batch size, to fit the way computer memory is organized.
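Building mini-batches could be sketched as below, assuming examples are stored as columns and that the data is shuffled before being split. The function name, the default batch size of 64 and the toy arrays are illustrative.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then split them into batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    order = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, order], Y[:, order]
    return [(X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# Toy data: 150 examples, where the first feature equals the label.
X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64)
```

With 150 examples and a batch size of 64, the last batch is smaller than the others (22 examples).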

### Exponentially weighted averages
(Also known as exponentially weighted moving averages)  
The current value depends on several previous ones, with the window controlled by the parameter $\beta$ (roughly the last $\frac{1}{1-\beta}$ values). This technique is used to speed up convergence.
$$
V_t = \beta V_{t-1} + (1-\beta)Q_t
$$
**Bias Correction**  
Because $V_0 = 0$, the first few averages are much smaller than the true values, that is, biased toward zero.
To remove this bias, divide by $1-\beta^t$:
$$
\frac{V_t}{1-\beta^t}
$$
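The effect of the correction is easy to see on a constant signal, as in this sketch. The function name `ewma` and the starting value $V_0 = 0$ are assumptions.

```python
import numpy as np

def ewma(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average V_t = beta * V_{t-1} + (1 - beta) * Q_t."""
    v, out = 0.0, []
    for t, q in enumerate(values, start=1):
        v = beta * v + (1 - beta) * q
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

q = np.full(5, 10.0)                        # a constant signal of 10
raw = ewma(q, bias_correction=False)        # starts far below 10
corrected = ewma(q)                         # stays at 10 from the first step
```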


### Optimization Algorithms
#### Momentum
##### On iteration t:
Compute dw,db on the current mini-batch and
$$
V_{dw} = \beta V_{dw} + (1-\beta)dw \\
V_{db} = \beta V_{db} + (1-\beta)db \\
W = W - \alpha V_{dw}, b = b - \alpha V_{db}
$$

##### Hyper-parameters
$\alpha, \beta(0.9\ by\ convention)$

#### RMSprop
##### On iteration t:
Compute dw,db on the current mini-batch and
$$
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2 \\
S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2 \\
W = W - \alpha \frac{dw}{\sqrt{S_{dw}+\epsilon}}, b = b - \alpha \frac{db}{\sqrt{S_{db}+\epsilon}}
$$
$\epsilon$ keeps the denominator away from zero when $S_{dw}$ or $S_{db}$ is close to zero.

##### Hyper-parameters
$\alpha, \beta_2, \epsilon$

#### Adam Optimization
##### On iteration t:
Before the first iteration, initialize $V_{dw}=0, S_{dw}=0, V_{db}=0, S_{db}=0$.  
Compute dw,db on the current mini-batch and
1. The momentum part:  
$$
V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dw \\
V_{db} = \beta_1 V_{db} + (1-\beta_1)db \\
$$
2. The RMS prop part:
$$
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2 \\
S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2 \\
$$

3. Bias Correction:
$$
V_{dw}^{corrected} = \frac{V_{dw}}{1-\beta_1^t}, V_{db}^{corrected} = \frac{V_{db}}{1-\beta_1^t} \\
S_{dw}^{corrected} = \frac{S_{dw}}{1-\beta_2^t}, S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}
$$

4. Update the parameters:
$$
W = W - \alpha \frac{V_{dw}^{corrected}}{\sqrt{{S_{dw}^{corrected}+\epsilon}}}, \\
b = b - \alpha \frac{V_{db}^{corrected}}{\sqrt{{S_{db}^{corrected}+\epsilon}}}
$$

##### Hyper-parameters
| Parameter Name    | Suggested Value     | Meaning    |  
| :---------| :--------| :--------|
| $\alpha$    | needs to be tuned    | learning rate    |
| $\beta_1$    | 0.9    | decay rate of the exponentially weighted average of $dw$    | 
| $\beta_2$    | 0.999    | decay rate of the exponentially weighted average of $dw^2$     |  
| $\epsilon$    | $10^{-8}$    | avoids division by zero    |
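The four Adam steps for a single parameter array could be sketched as follows. The function name `adam_step`, the toy objective and the learning rate of 0.01 are assumptions; $\epsilon$ is placed inside the square root to match the update formula in these notes.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array; t counts from 1."""
    v = beta1 * v + (1 - beta1) * dw            # momentum part
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSprop part
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / np.sqrt(s_hat + eps)
    return w, v, s

# Toy objective (made up): J(w) = sum(w^2), so dw = 2 * w.
w = np.array([1.0, -3.0])
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.01)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so every parameter moves by roughly $\alpha$ whatever the gradient's magnitude.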


#### Learning Rate Decay
Problem to address: the learning rate should become smaller as training progresses over thousands of iterations.
- One way:
$$
\alpha = \frac{1}{1 + decay\_rate * epoch\_num}\alpha_0
$$
$\alpha_0$ and the decay rate are additional hyper-parameters.

- Other ways:  
Exponential decay
$$
\alpha = 0.95^{epoch\_num} * \alpha_0
$$
where `0.95` is only an example; the base should be slightly less than one.  
Square-root decay
$$
\alpha = \frac{k}{\sqrt{epoch\_num}}*\alpha_0, \\
or\\
\alpha = \frac{k}{\sqrt{t}}*\alpha_0
$$
where $k$ is a constant and $t$ is the mini-batch number.
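The schedules can be written as small functions, as sketched below. The function names are illustrative, and exponential decay is taken here to mean that the base is raised to the power of the epoch number, which is an assumption about the intended formula.

```python
def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, base, epoch_num):
    """alpha = base ** epoch_num * alpha0, with base slightly below 1."""
    return base ** epoch_num * alpha0

def sqrt_decay(alpha0, k, epoch_num):
    """alpha = k / sqrt(epoch_num) * alpha0, for epoch_num >= 1."""
    return k / epoch_num ** 0.5 * alpha0

# Learning rate for epochs 0..3 with alpha0 = 0.2 and decay_rate = 1.
schedule = [inverse_decay(0.2, 1.0, epoch) for epoch in range(4)]
```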

#### Practice Instructions
##### Mini-Batch(for the samples):
- Shuffling and Partition are two steps required to build mini-batches
- Powers of two are often chosen as the mini-batch size, like 2, 4, 8, ...

##### Momentum
 - The velocity is initialized with zeros, so it takes a few iterations to build up and start taking bigger steps.
 - If $\beta=0$, momentum is disabled and the update is plain gradient descent.
 - Choosing $\beta$: 
     - The larger $\beta$ is, the smoother the steps become.
     - Common values range from 0.8 to 0.999; 0.9 is a usual default.
 - Tips:
     - Momentum takes past gradients into account to smooth out the steps of GD.
     - It can be applied with batch gradient descent, mini-batch GD or stochastic GD(SGD).
     - You have to tune a momentum parameter $\beta$ and a learning rate $\alpha$


#### Adam
The advantages of Adam:
- Relatively low memory requirements
- Usually works well even with little tuning of hyper-parameters(except for $\alpha$)


## Hyper-parameter Tuning

### Tuning Priority
P = Priority:  
P1. Tuning $\alpha$.  
P2.1. Tuning $\beta$.  
Default order: $\beta_1$, $\beta_2$, $\epsilon$  
P2.2. Mini-batch size.  
P2.3. Number of hidden units.  
P3.1. Number of layers.  
P3.2. Learning rate decay. 

Try random values rather than a grid. Random search tries more distinct values of each hyper-parameter, so it explores the important ones more thoroughly when we do not know in advance which ones matter.

Summary: use random sampling rather than grid search, and optionally use a coarse-to-fine sampling scheme.

### Using an appropriate scale to pick hyper-parameters

For $\alpha \in [0.0001,1]$, do not sample uniformly over this range. Instead, sample on a logarithmic scale: use candidate values like 0.0001, 0.001, 0.01, 0.1, ... or draw
$$
r = -4 * random, \quad random \in [0,1] \\
\alpha = 10^{r}
$$
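Sampling the exponent uniformly spreads the candidates evenly across the decades, as this sketch illustrates. The function name, the seed and the sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_learning_rate(rng, low_exp=-4, high_exp=0):
    """Draw alpha = 10^r with r uniform in [low_exp, high_exp)."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r

alphas = np.array([sample_learning_rate(rng) for _ in range(1000)])

# Roughly a quarter of the samples land in each decade, e.g. below 0.001.
fraction_in_lowest_decade = np.mean(alphas < 1e-3)
```

Uniform sampling over $[0.0001, 1]$ would instead place about 90% of the candidates between 0.1 and 1.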

### Hyper-parameter for exponentially weighted averages
The value of $\beta$ sets how many past values matter; for example, $\beta = 0.9$ averages over roughly the last 10 values.  
A suggested range is $1-\beta \in [0.001,0.1]$, that is, $\beta \in [0.9, 0.999]$. Implementation example:  
```python
import numpy as np

r = np.random.uniform(-3, -1)  # exponent sampled uniformly in [-3, -1]
beta = 1 - 10**r               # beta falls in [0.9, 0.999]
```
Why not search on a linear scale?  
When $\beta$ is close to 1, the result becomes much more sensitive to small changes in $\beta$, so the search should concentrate on that region.


### Hyper-parameter tuning in practice: Panda way vs. Caviar  
With plenty of computational resources, train many models in parallel with different hyper-parameters and pick the best one. This is called the Caviar way. If resources are limited, babysit a single model instead, adjusting its hyper-parameters as training progresses. This is called the Panda way.



### Batch Normalization
For hidden layer $l$, can we normalize the input of layer $l$ to make learning faster? Recall how the inputs are normalized:  
$$
\mu = \frac{1}{m}\sum_i x^{(i)} \\
x = x - \mu \\
\sigma^2 = \frac{1}{m}\sum_i {(x^{(i)})}^2 \\
x = \frac{x}{\sigma}
$$ 

Implementation of batch normalization:   
Given some intermediate values in a neural network layer, $z^{(1)},...,z^{(m)}$:  
$$
\mu = \frac{1}{m}\sum_i z^{(i)} \\
\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2 \\
z_{norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}} \\
\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta
$$
The parameters $\gamma$ and $\beta$ are learnable. If $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$, then $\tilde{z}^{(i)} = z^{(i)}$.

Note that $\gamma$ and $\beta$ must be updated during training along with the other parameters.
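The forward computation of batch normalization might be sketched as below, assuming one row per hidden unit and one column per example in the mini-batch. The function name and the toy data are illustrative, and the backward pass is omitted.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each row (unit) of Z over the mini-batch, then scale and shift."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, mu, var

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 64)) * 5.0 + 2.0      # 3 hidden units, mini-batch of 64
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)

# Choosing gamma = sqrt(var + eps) and beta = mu recovers the original values.
Z_identity, _, _ = batch_norm_forward(Z, np.sqrt(var + 1e-8), mu)
```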

#### Why does the Batch Normalization work?
Batch Normalization reduces the amount by which the distribution of the hidden unit values shifts around, so later layers see more stable inputs.
##### Batch Normalization as regularization 
- Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
- This adds some noise to the values $z^{[l]}$ within that mini-batch. It is similar to dropout, which adds noise to each hidden layer's activations by randomly shutting down some units.
- The regularization effect is only SLIGHT, so do not rely on batch normalization as a regularizer.  
PS: As the mini-batch size becomes larger, the noise decreases and the regularization effect becomes weaker.  

### Multi-class classification  
#### Softmax Regression
Let $C$ be the number of classes.  
The output probabilities must add up to 1.  
Say layer $l$ has $Z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$.  
The activation function is:
$$
t = e^{z^{[l]}}, \\
a^{[l]}_i = \frac{t_i}{\sum_{j=1}^{C}t_j}
$$

### Deep Learning frameworks
#### Famous Frameworks
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- MXNet
- PaddlePaddle
- TensorFlow
- Theano
- Torch

#### When choosing a Deep Learning framework, Consider:
- Ease of programming(development and deployment)
- Running speed
- Truly open (open source with good governance)


#### When using Tensorflow, the common steps are:
1. Define variables and placeholders
2. Define a cost function.
3. Define a train method.
4. Initialize the variables
5. Get the session and iterate.

#### Practice Instructions
- Create placeholders (TensorFlow 1.x API):
```python
import tensorflow as tf  # TensorFlow 1.x
x = tf.placeholder(tf.float32, name='x')
```

- Specify the computation graph for the operations you want to compute:
```python
Y = tf.matmul(W, X) + b
sig = tf.sigmoid(Y)
```

- Create the session:
```python
with tf.Session() as session:
    ...
```

- Run the session, using a feed dictionary if necessary to supply values for the placeholders:
```python
session.run(sig, feed_dict={x: ...})  # map each placeholder to its value
```



## Structuring Machine Learning Projects
### Why ML Strategy?
When a model has been built but its accuracy is not yet good enough, the question is where to start in order to improve it most effectively.

### Orthogonalization
Improvements should be orthogonalized: each adjustment should target one aspect of performance without affecting the others.


### Setting up goal
#### Single number evaluation metric
Use a single number as the evaluation metric, so that any two models can be ranked directly.  

- $F_1$ Score (the harmonic mean of precision $P$ and recall $R$):
$$
F_1 = \frac{2}{\frac{1}{P}+\frac{1}{R}}
$$
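As a small worked example, the $F_1$ score can rank two classifiers when neither wins on both precision and recall. The precision and recall values below are made up, and returning 0 when either input is 0 is a convention assumed for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)

# Hypothetical classifiers: A has better recall, B has better precision.
f1_a = f1_score(0.95, 0.90)
f1_b = f1_score(0.98, 0.85)
best = "A" if f1_a > f1_b else "B"
```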

#### Train/Dev/Test Distributions
- The test set gives the final evaluation of the model: how much confidence you can have in your algorithm.
- The dev set and its metric should match the test set, so that iterating and optimizing against the dev set makes progress toward the real goal.

#### Size of Dev and Test sets
The larger the total dataset, the larger the share that can go to the train set.  
It can be okay not to have a test set.

#### We could add extra weights in the metric for special cases, such as misclassified pornographic images

### Human Level Performance.
#### Why Human Level Performance?
1. Machine learning has recently become effective enough to compete with human-level performance.
2. The workflow of building a machine learning system is much more efficient when the task is something humans can also do.
3. Humans are good at many tasks, so human-level performance is an approximation of the Bayes optimal error.

To improve a model that is still worse than human-level performance (or some estimate of the Bayes optimal level):
- Get labeled data from humans.
- Gain insight from manual error analysis, e.g. why did a person get this right when the model did not?
- Better analysis of bias/variance.

#### Avoidable bias and variance.
Say we have this example (all values are error rates in %):    

| Object    | Scenario 1    | Scenario 2    |  
| :--------| :--------| :--------|  
| Human Error    | 1    | 7.5    |
| Train Error    | 8    | 8    |
| Dev Error    | 10    | 10    |
| Solution    | Focus on bias    | Focus on variance     |

In the example, the gap between human-level error and the train error is the "avoidable bias", and the gap between the train error and the dev error is the variance.

#### Understanding human level performance
##### Human level as a proxy for Bayes Error
In the following medical image classification example, suppose:

| Object    | Error    |  
| :--------| :--------|  
| Typical human    | 3%    |  
| Typical doctor    | 1%    |  
| Experienced doctor    | 0.7%    |  
| Team of experienced doctors    | 0.5%    |

So what is "human-level performance"?  
The last row, an error of 0.5%. (As a proxy for Bayes error, use the best performance humans can achieve.)

When choosing a model for the next iteration or for production, choose the one whose performance is closest to human-level performance.


##### Surpassing Human Level Performance
A model or system surpasses human-level performance when it does better than humans can achieve.  
Examples:  
- Online advertising
- Product Recommendation
- Logistics
- Loan Approvals
- ...


##### Improving your model performances.
- To eliminate avoidable bias:
    - Train bigger model
    - Train longer/better optimization algorithms.(Momentum, RMS prop, Adam,...)
    - Neural Network Architecture/Hyper-parameter search(RNN,CNN).
- To eliminate variance:
    - More data
    - Regularization($L_2$, Dropout, data augmentation)
    - Neural Network Architecture/Hyper-parameter search.
    




-----  

<a rel="license" style="text-decoration:none" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">
    <div style="margin-top:0.5em;margin-bottom:1em;">
        <img alt="Creative Comments" style="display:inline;"  src="https://mirrors.creativecommons.org/presskit/icons/cc.svg"/>
        <img alt="Attribution" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/by.svg"/>
        <img alt="Non-Commercial" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/nc.svg"/>
        <img alt="Non-Commercial" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/nc-jp.svg"/>
        <img alt="Share Alike" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/sa.svg"/>
     </div>
</a>
<br />
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

CREATED BY [ArvinSiChuan](mailto:arvinsc@foxmail.com?subject=Deep%20Learning%20An%20Overview), 06-Apr-2018.  
Updated at 06-Apr-2018, VERSION SNAPSHOT-1.0.0