# Neural Networks

## Activation Functions

[Activation Functions: Neural Networks](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)

### Softmax

$$softmax(z) = softmax(z +c) \qquad  \text{(c is a constant)}$$

$$softmax(z_i) = \frac{e^{z_i}}{\sum_j{e^{z_j}}}$$

Assume $\hat y = softmax(z)$, $y$ is a one-hot vector that marks only the correct class $k$ with 1, and $L(z)$ is the cross-entropy loss:

$$L(z) = - \sum_i y_i log(\hat{y}_i) = - log(\hat y_k) \tag{1} $$

$$\frac{\partial{L(z)}}{\partial{z}}= \hat y - y, \qquad \frac{\partial{L(z)}}{\partial{z_k}}=\hat{y}_k-1 \tag{2} $$

> e.g. $\hat y = [0.015,0.866,0.117 ], \ y =[0,1,0].$ <br>
If $\hat{y}_2 = 0.866$ is the true output, $\frac{\partial{L(z)}}{\partial{z_2}}=0.866-1=-0.134, \ \frac{\partial{L(z)}}{\partial{z}} = [0.015,-0.134,0.117]$
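
A minimal Python sketch of equations (1) and (2), using only the standard library. The logits `z = [1.0, 5.0, 3.0]` are a hypothetical choice (they give probabilities close to the example above), and the function names are illustrative rather than taken from any library.

```python
import math

def softmax(z):
    """Numerically stable softmax; subtracting max(z) relies on softmax(z) = softmax(z + c)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """Cross-entropy between predicted probabilities y_hat and a one-hot label y."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

def grad_wrt_logits(y_hat, y):
    """Gradient of the cross-entropy with respect to the logits z, i.e. y_hat - y."""
    return [p - t for p, t in zip(y_hat, y)]

z = [1.0, 5.0, 3.0]   # hypothetical logits
y = [0, 1, 0]         # one-hot label: the second class is correct
y_hat = softmax(z)
print("y_hat:", y_hat)
print("loss :", cross_entropy(y_hat, y))
print("grad :", grad_wrt_logits(y_hat, y))
```

Only the component of the true class is negative, so a gradient descent step raises that logit and lowers the others.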


### Sigmoid

<img src="images/sigmoid.png" style="width: 400px;"/>

$$sigmoid(z) = \frac{1}{1 + e^{-z}}$$

The main reason we use the sigmoid function is that its output lies in the range **(0, 1)**. It is therefore especially useful for models that have to predict a probability **as an output**. Since a probability only takes values between 0 and 1, sigmoid is a natural choice.

The function is **differentiable**. That means we can find the slope of the sigmoid curve at any point.

The function is **monotonic** but its derivative is not.

The logistic sigmoid function can cause a neural network to **get stuck** during training: for inputs of large magnitude the curve saturates and its gradient is close to zero.

The **softmax** function is a more generalized logistic activation function which is used for **multiclass classification**.
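
As a small illustration of the sigmoid properties listed in this section, the sketch below evaluates the sigmoid and its derivative $sigmoid(z)(1 - sigmoid(z))$ at a few hypothetical inputs. The function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written so that math.exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and shrinks toward 0 on both sides (it is not monotonic),
# which is why saturated units learn very slowly.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```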

### Tanh

<img src="images/tanh.jpeg" style="width: 400px;"/>

$$tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

The range of the tanh function is **(-1, 1)**. tanh is also sigmoidal (S-shaped). The advantage is that negative inputs are mapped to strongly negative outputs **(zero-mean, derivative slope steepest around 0).**

The function is **monotonic & differentiable** while its derivative is not monotonic.

The tanh function is normally superior to sigmoid in hidden layers.
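
The sketch below compares tanh and sigmoid on a symmetric set of hypothetical inputs, to illustrate the zero-centered output and the steeper slope around 0. It uses the standard identities $tanh'(z) = 1 - tanh^2(z)$ and $tanh(z) = 2\,sigmoid(2z) - 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical, symmetric around 0
tanh_mean = sum(math.tanh(z) for z in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(z) for z in inputs) / len(inputs)

print("mean tanh output   :", tanh_mean)      # centered on 0
print("mean sigmoid output:", sigmoid_mean)   # centered on 0.5
print("slope of tanh at 0 :", tanh_grad(0.0))
print("slope of sigmoid at 0:", sigmoid(0.0) * (1.0 - sigmoid(0.0)))
```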

### ReLU & Leaky ReLU

<img src="images/sigmoid_vs_relu.png" style="width: 600px;"/>

$$ReLU(z) = max(0, z)$$


\begin{equation}
    \frac{\partial ReLU(z)}{\partial z}=\begin{cases}
        0, & \text{if $z<0$}.\\
        1, & \text{otherwise}.
    \end{cases}
\end{equation}



**Non-differentiability at zero** is not a big deal in practice (the input is rarely exactly zero; it is usually a very small decimal such as $10^{-10}$).

The real issue is that every negative input is mapped to zero, which reduces the model's ability to fit or train on the data properly, because negative values are not represented in the output at all. **Leaky ReLU** was introduced to address this.

<img src="images/relu_vs_leakyrelu.jpeg" style="width: 600px;"/>

<center>Fig : ReLU v/s Leaky ReLU</center>

Leaky ReLU is $f(z) = max(az, z)$ with a small slope $a$ for negative inputs. The leak extends the range of the ReLU function. Usually, the value of $a$ is 0.01 or so.

When $a$ is not fixed at 0.01 but drawn at random during training, the function is called Randomized ReLU.

Therefore the range of the Leaky ReLU is $(-\infty, \infty)$.

Both Leaky and Randomized ReLU functions are monotonic. Their derivatives are monotonic as well.
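
A minimal sketch of ReLU and Leaky ReLU with their derivatives, assuming the common default slope `a = 0.01` and the convention that the derivative at exactly 0 is taken to be 1. The function names are illustrative.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; the value at z == 0 is set to 1 by convention."""
    return 0.0 if z < 0 else 1.0

def leaky_relu(z, a=0.01):
    """Leaky ReLU: identity for z >= 0, small slope a for z < 0."""
    return z if z >= 0 else a * z

def leaky_relu_grad(z, a=0.01):
    return 1.0 if z >= 0 else a

for z in (-3.0, -0.5, 0.0, 2.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For negative inputs ReLU passes back a zero gradient, while Leaky ReLU still passes back a small one.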

### Cheatsheet & Derivative

<img src="images/act_fun_cheetsheet.png" style="width: 800px;"/>

<center>Fig: Activation Function Cheatsheet</center><br>

<img src="images/af_derivative.png" style="width: 600px;"/>

## Fine-tune

### Bias vs Variance

<img src="images/general_formulation.png" style="width: 800px;"/>           

<img src="images/general_formulation_ex.png" style="width: 800px;"/>

<img src="images/bias_variance.png" style="width: 800px;"/>

1. Fit the training set well on the cost function
    - If it doesn’t fit well, using a bigger neural network or switching to a better optimization algorithm might help.
2. Fit the dev-train set well on the cost function
    - If it doesn't fit well, perform manual error analysis to understand the error differences. Make or collect data similar to the dev and test sets. Artificial data synthesis may help, but beware of accidentally simulating data from only a tiny subset of the space of all possible examples. Never do development on the test set, to avoid overfitting to it.
3. Fit the development set well on the cost function
    - If it doesn’t fit well, regularization or a bigger training set might help.
4. Fit the test set well on the cost function
    - If it doesn’t fit well, a bigger development set might help.
5. Perform well in the real world
    - If it doesn’t perform well, the development/test set is not set correctly or the cost function is not evaluating the right thing.
    
### Strategies

1. Use a single evaluation metric, such as F1 instead of separate recall and precision.

2. With N metrics: pick 1 optimizing metric and treat the other N-1 as satisficing, e.g. among the algorithms whose running time is below a limit, choose the one with the best evaluation metric.

3. Randomly shuffle the data so that the dev set and test set come from the same distribution, and choose them to reflect the data you expect to get in the future and consider important to do well on.

4. 70%-30%, 60%-20%-20% and 70%-15%-15% are common splits for data sets with fewer than 10,000 examples. With over 1M examples, the data can be divided as 98%-1%-1%.

5. When actual performance is not satisfactory, change the evaluation metric/cost function or the dev/test set.
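
Strategies 1 and 2 can be sketched as follows. The candidate models, their scores and the 100 ms limit are made-up illustrative numbers, and `pick_model` is a hypothetical helper, not part of any library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pick_model(candidates, max_runtime_ms):
    """Keep models that satisfy the runtime limit, then maximize F1 among them."""
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: f1_score(c["precision"], c["recall"]))

# Hypothetical candidates: runtime is the satisficing metric, F1 the optimizing one.
candidates = [
    {"name": "A", "precision": 0.95, "recall": 0.90, "runtime_ms": 80},
    {"name": "B", "precision": 0.98, "recall": 0.85, "runtime_ms": 95},
    {"name": "C", "precision": 0.99, "recall": 0.95, "runtime_ms": 1500},
]
best = pick_model(candidates, max_runtime_ms=100)
print(best["name"], f1_score(best["precision"], best["recall"]))
```

Model C has the highest F1 but is excluded because it violates the runtime constraint.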

### Error Analysis

1. Make a table that counts the percentage of errors attributable to each suspected cause.

2. Consider incorrectly labeled training examples, and add them to the error analysis table.

3. If labels need correcting, apply the correction to both the dev and test sets.

4. If starting on a new system, build it quickly, then iterate.
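
A minimal sketch of the error analysis table from step 1, assuming each misclassified dev example has been tagged by hand with one or more suspected causes. The tags and examples are invented for illustration.

```python
from collections import Counter

def error_table(tagged_errors):
    """Percentage of misclassified examples carrying each tag (an example may have several tags)."""
    counts = Counter(tag for tags in tagged_errors for tag in tags)
    total = len(tagged_errors)
    return {tag: 100.0 * n / total for tag, n in counts.most_common()}

# Hypothetical hand-tagged errors from a dev set.
tagged_errors = [
    ["blurry"],
    ["mislabeled"],
    ["blurry", "dog"],
    ["blurry"],
]
for tag, pct in error_table(tagged_errors).items():
    print(f"{tag:<12}{pct:5.1f}%")
```

The percentages can add up to more than 100 because one example may have several suspected causes.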


### Norm Penalties

$$ J(\theta) = L(\hat{y}^{(i)}, y^{(i)})$$

$$ \tilde J(\theta) = {J}(\theta)+ \Omega(\theta) $$

$$ \theta^{*} = \arg\min_{\theta} \frac{1}{m} \left(\sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \sum_{j=1}^l \Omega(\theta^{[j]})\right), \quad \text{where $m$ is the number of examples and $l$ is the number of layers} $$



#### L2 Norm (Ridge)

$$ \Omega(\theta) = \frac{\lambda}{2} {||\omega||}_2^2 $$

$$ {||\omega||}_2^2 = {\omega}^T \omega $$

$$ \nabla_{\omega}\tilde {J}(\omega) = \lambda \omega  + \nabla_{\omega}J(\omega) $$

For the gradient descent update of $\omega$ with learning rate $\alpha$:

$$ \omega \leftarrow \omega - \alpha(\lambda \omega + \nabla_{\omega}J(\omega)) $$

$$ \omega \leftarrow (1 - \alpha\lambda)\omega - \alpha \nabla_{\omega}J(\omega) $$

>"weight decay": every step shrinks $\omega$ by the factor $(1 - \alpha\lambda)$. The shrinkage has a greater effect along directions in which the eigenvalues of the Hessian matrix are small.
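
A small sketch checking that the two forms of the L2 update above give the same result. The weights, gradient, learning rate `lr` ($\alpha$) and `lam` ($\lambda$) are arbitrary illustrative values.

```python
def l2_step(w, grad, lr, lam):
    """w <- w - lr * (lam * w + grad)"""
    return [wi - lr * (lam * wi + gi) for wi, gi in zip(w, grad)]

def weight_decay_step(w, grad, lr, lam):
    """w <- (1 - lr * lam) * w - lr * grad"""
    decay = 1.0 - lr * lam
    return [decay * wi - lr * gi for wi, gi in zip(w, grad)]

w = [0.5, -1.2, 3.0]       # hypothetical weights
grad = [0.1, -0.3, 0.0]    # hypothetical gradient of the unregularized loss
print(l2_step(w, grad, lr=0.1, lam=0.01))
print(weight_decay_step(w, grad, lr=0.1, lam=0.01))
```

With a zero data gradient (the last component), the weight is simply multiplied by $(1 - \alpha\lambda)$.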

#### L1 Norm (LASSO)

$$ \Omega(\theta) = \lambda {||\omega||}_1 $$

$$ {||\omega||}_1 = \sum_i |{\omega}_i| $$

$$ \nabla_{\omega}\tilde {J}(\omega) = \lambda sign(\omega) + \nabla_{\omega}J(\omega) $$

For the gradient descent update of $\omega$ with learning rate $\alpha$:

$$ \omega \leftarrow \omega - \alpha(\lambda sign(\omega) + \nabla_{\omega}J(\omega)) $$

>Compared with L2 regularization, L1 regularization tends to **output sparser solutions** and is used for **feature selection**. L1 regularization drives some of the weight parameters to zero, which means the corresponding features can be safely ignored.
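
A minimal sketch of the L1 update rule above, with illustrative values. It shows that the penalty moves every nonzero weight toward zero by the same fixed amount $\alpha\lambda$, regardless of the weight's size, which is what pushes small weights to zero. Plain subgradient steps like this can oscillate around zero rather than land on it exactly.

```python
def sign(x):
    """Returns -1, 0 or 1."""
    return (x > 0) - (x < 0)

def l1_step(w, grad, lr, lam):
    """w <- w - lr * (lam * sign(w) + grad)"""
    return [wi - lr * (lam * sign(wi) + gi) for wi, gi in zip(w, grad)]

w = [0.5, -0.03, 0.0]      # hypothetical weights
grad = [0.0, 0.0, 0.0]     # zero data gradient, to isolate the effect of the penalty
w_next = l1_step(w, grad, lr=0.1, lam=0.1)
print(w_next)              # each nonzero weight moved toward 0 by lr * lam = 0.01
```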

### Dropout

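- Used to address overfitting.

- Dropout comes in two versions: direct (rarely used) and inverted. (Only inverted dropout is described here.)

- Dropout means that, while training a deep learning network, units of the network are temporarily dropped from the network with a certain probability. (Note: only temporarily.)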


<img src="images/dropout.png" style="width: 400px;"/>


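#### How dropout works

[Dropout regularization](https://www.jianshu.com/p/257d3da535ab)

A typical neural network is trained by propagating the input forward through the network and then propagating the error backward. Dropout intervenes in this process: it randomly deletes some of the hidden-layer units and then runs the forward and backward passes on what remains.

In summary, the process can be broken into these steps:

- Randomly delete some hidden neurons in the network, keeping the input and output neurons unchanged.

- Propagate the input forward through the modified network, then propagate the error backward through the modified network.

- For the next batch of training examples, repeat the steps above.

The activations are scaled during the training phase, while the test phase is left unchanged.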

<img src="images/inverted_dropout.png" style="width: 800px;"/>
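
A minimal sketch of inverted dropout for one layer's activations, assuming a keep probability `keep_prob` and a plain Python list of activations. The function name and signature are illustrative.

```python
import random

def inverted_dropout(activations, keep_prob, training=True, rng=random):
    """Zero each activation with probability 1 - keep_prob and rescale the survivors."""
    if not training:
        # Test time: the layer is left unchanged, no mask and no scaling.
        return list(activations)
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)   # scale up so the expected value stays the same
        else:
            out.append(0.0)             # unit dropped for this pass only
    return out

rng = random.Random(0)
a = [1.0] * 10
print(inverted_dropout(a, keep_prob=0.8, rng=rng))
print(inverted_dropout(a, keep_prob=0.8, training=False))
```

Dividing by `keep_prob` during training keeps the expected activation unchanged, which is why nothing needs to be rescaled at test time.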

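#### Why dropout works

- Each time the weights are updated on the samples fed into the network, hidden nodes appear at random with a certain probability, so there is no guarantee that any two hidden nodes appear together every time. Weight updates therefore no longer depend on the joint action of hidden nodes with a fixed relationship. This prevents situations where some features are only effective in the presence of other specific features, and reduces complex co-adaptation between neurons.

- Because nodes are randomly deleted every time, the output of the next node no longer relies so heavily on any one node of the previous layer. In other words, it will not assign too much weight to any single node of the previous layer, which has an effect similar to the weight shrinkage of L2 regularization.

- Dropout can be seen as a form of model averaging over a large number of different networks. Different networks overfit in different ways and to different degrees, but because they share one loss function they are effectively optimized together and averaged, which helps prevent overfitting fairly effectively. For every sample fed into the network (a single sample or a batch), the corresponding network structure is different, yet all of these structures share the weights of the hidden nodes. This averaging architecture has been found to be very useful for reducing overfitting.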

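#### Tips for using dropout

- Use dropout in the layers where overfitting is likely.

- Dropout can also be used as a way of adding noise, applied directly to the input. Set the keep probability of the input layer to a value closer to 1 so that the input does not change too much.

- With inverted dropout, do not apply dropout at validation time.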

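#### Drawbacks of dropout

- A well-defined loss function decreases at every iteration, but dropout randomly deletes nodes each time, so the network being trained is different every time. The loss function is no longer well defined and is in some sense hard to compute, so we lose it as a debugging tool.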

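#### Current usage of dropout

Dropout is currently used heavily in fully connected networks, where the rate is usually set by hand to 0.5 or 0.3. In convolutional hidden layers it is used less, partly because convolution is itself sparse and because sparsifying ReLU activations are widely used.
Overall, the dropout rate is a hyperparameter that has to be tried out for the specific network and application domain.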

### Early Stopping (not really recommended by Andrew Ng, because it affects bias and variance at the same time, which makes tuning inconvenient)

The test-set log loss starts to rise as training continues.

<img src="images/early_stop.png" style="width: 600px;"/>

One way to think of early stopping is as a very efficient hyperparameter selection
algorithm. In this view, the number of training steps is just another hyperparameter.

<img src="images/para_earlystop.png" style="width: 600px;" />

<img src="images/meta_earlystop.png" style="width: 600px;" />
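
A minimal sketch of early stopping as a selection rule over the number of training epochs. The `patience` parameter (how many epochs without improvement are tolerated) and the loss values are assumptions made for illustration; they are not taken from the figures above.

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and keep the best checkpoint.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: they fall, then rise as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping(val_losses, patience=2)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

In a real training loop the parameters saved at `best_epoch` would be restored after stopping.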

## Optimization

### Batch Normalization

[Implementing Batch Normalization in Tensorflow](https://r2rt.com/implementing-batch-normalization-in-tensorflow.html)

[BN2015 paper](https://arxiv.org/pdf/1502.03167v3.pdf)

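TensorFlow has an easy-to-use `batch_normalization` layer in the `tf.layers` module (TensorFlow 1.x). Just be sure to wrap your training step in the following `tf.control_dependencies` block, so that the moving-average update ops run, and it will work.

```python
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.minimize(loss)  # assumed: `optimizer` and `loss` are defined elsewhere
```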


To normalize a value across a batch (i.e., to batch normalize the value), we subtract the batch mean, $\mu_B$, and divide the result by the batch standard deviation, $\sqrt{\sigma_B^2 + \epsilon}$. Note that a small constant 
$\epsilon$ is added to the variance in order to avoid dividing by zero.

Thus, the initial batch normalizing transform of a given value, $z_i$ ($i^{th}$ linear output of the given layer) is:

$$ BN_{initial}(z_i)= \frac{z_i-\mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$

$$ \mu_B = \frac{1}{m} \sum_i z_i $$

$$ \sigma_B^2 = \frac{1}{m} \sum_i (z_i - \mu_B)^2 $$

Because the batch normalizing transform given above restricts the inputs to the activation function to a prescribed normal distribution, this can limit the representational power of the layer. Therefore, we allow the network to undo the batch normalizing transform by multiplying by a new scale parameter $\gamma$ and adding a new shift parameter $\beta$. $\gamma$ and $\beta$ are learnable parameters.

Adding $\gamma$ and $\beta$ produces the following final batch normalizing transform (**Note: each layer has its own $\gamma^{[l]}$ and $\beta^{[l]}$**):

$$ BN(z_i)=\gamma(\frac{z_i-\mu_B}{\sqrt{\sigma_B^2 + \epsilon}}) + \beta $$

At test time, the overall $\mu$ and $\sigma^2$ are obtained from an exponentially weighted average of the per-batch $\mu_B,\sigma_B^2$ seen during training.
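
A minimal pure-Python sketch of the transform above for a single unit, including the exponentially weighted averages used at test time. The `momentum` value of 0.9, the `running` dictionary and the function names are assumptions for illustration, not the TensorFlow implementation.

```python
import math

def batch_norm_train(z, gamma, beta, running, momentum=0.9, eps=1e-5):
    """Normalize a batch of values z with batch statistics and update the running averages."""
    m = len(z)
    mu = sum(z) / m
    var = sum((v - mu) ** 2 for v in z) / m
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in z]

def batch_norm_test(z, gamma, beta, running, eps=1e-5):
    """At test time, use the running averages instead of batch statistics."""
    return [gamma * (v - running["mean"]) / math.sqrt(running["var"] + eps) + beta for v in z]

running = {"mean": 0.0, "var": 1.0}
batch = [1.0, 2.0, 3.0, 4.0]          # hypothetical linear outputs of one unit
print(batch_norm_train(batch, gamma=1.0, beta=0.0, running=running))
print(running)
print(batch_norm_test([2.5], gamma=1.0, beta=0.0, running=running))
```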

### Xavier 

[An Explanation of Xavier Initialization](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)

Helps mitigate exploding/vanishing gradients.

$$ Var(W) = \frac{1}{n_{in}} \quad \text{OR} \quad Var(W) = \frac{2}{n_{in}+n_{out}} \quad \text{(sigmoid and tanh)}, \qquad Var(W) = \frac{2}{n_{in}} \quad \text{(ReLU)} $$

Assume:

$$Y=W_1X_1+W_2X_2+⋯+W_nX_n$$

$$ Var(W_iX_i)=E[X_i]^2Var(W_i)+E[W_i]^2Var(X_i)+Var(W_i)Var(X_i) $$

Now if our inputs and weights both have mean 0, that simplifies to

$$ Var(W_iX_i) = Var(W_i)Var(X_i) $$

Then if we make a further assumption that the $X_i$ and $W_i$ are all independent and identically distributed, we can work out that the variance of Y is

$$ Var(Y) = Var(W_1X_1+W_2X_2+⋯+W_nX_n) = nVar(W_i)Var(X_i) $$

Or in words: the variance of the output is the variance of the input, but scaled by $nVar(W_i)$. So if we want the variance of the input and output to be the same, that means $nVar(W_i)$ should be 1. Which means the variance of the weights should be

$$ Var(W_i)=\frac{1}{n}=\frac{1}{n_{in}} $$

Glorot & Bengio’s formula needs a tiny bit more work. If you go through the same steps for the backpropagated signal, you find that you need

$$ Var(W_i)=\frac{1}{n_{out}} $$

to keep the variance of the input gradient & the output gradient the same. These two constraints can only be satisfied simultaneously if $n_{in}=n_{out}$, so as a compromise, Glorot & Bengio take the average of the two:

$$ Var(W_i) = \frac{2}{n_{in}+n_{out}}$$

The assumption most worth talking about is the “linear neuron” bit. This is justified in Glorot & Bengio’s paper because immediately after initialization, the parts of the traditional nonlinearities (tanh, sigmoid) that are being explored are the bits close to zero, where the gradient is close to 1. For the more recent rectifying nonlinearities, that doesn’t hold, and in a more recent paper He, Zhang, Ren and Sun build on Glorot & Bengio and suggest using

$$ Var(W) = \frac{2}{n_{in}} $$

instead. This makes sense: a rectified linear unit is zero for half of its input, so you need to double the weight variance to keep the signal’s variance constant.
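
The variance argument can be checked with a small simulation. The sketch below assumes zero-mean Gaussian weights and unit-variance Gaussian inputs for a single linear neuron; the fan-in of 64, the trial count and the helper names are arbitrary illustrative choices.

```python
import random

def xavier_std(n_in):
    """Standard deviation for Var(W) = 1 / n_in."""
    return (1.0 / n_in) ** 0.5

def glorot_std(n_in, n_out):
    """Standard deviation for Var(W) = 2 / (n_in + n_out)."""
    return (2.0 / (n_in + n_out)) ** 0.5

def he_std(n_in):
    """Standard deviation for Var(W) = 2 / n_in."""
    return (2.0 / n_in) ** 0.5

def output_variance(n_in, w_std, trials=1000, seed=0):
    """Empirical variance of Y = sum_i W_i X_i with W ~ N(0, w_std^2) and X ~ N(0, 1)."""
    rng = random.Random(seed)
    ys = [sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, 1.0) for _ in range(n_in))
          for _ in range(trials)]
    mean = sum(ys) / trials
    return sum((y - mean) ** 2 for y in ys) / trials

n_in = 64
print("Var(Y), Var(W) = 1/n_in:", output_variance(n_in, xavier_std(n_in)))  # close to Var(X) = 1
print("Var(Y), Var(W) = 1     :", output_variance(n_in, 1.0))               # roughly n_in times larger
```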

### Gradient Checking

$$ f'(\theta) = \lim_ {\epsilon \to 0} \frac{f(\theta + \epsilon)- f(\theta - \epsilon)}{2\epsilon}, \qquad \text{approximation error for a small finite } \epsilon: \ O(\epsilon^2) $$

<img src="images/gradient_checking.png" style="width: 800px;"/>
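
A minimal gradient-checking sketch based on the two-sided difference above. The test function `f`, its hand-derived gradient and the relative-error measure are assumptions chosen for illustration.

```python
import math

def numerical_gradient(f, theta, eps=1e-5):
    """Two-sided finite-difference estimate of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

def relative_error(a, b):
    """||a - b|| / (||a|| + ||b||), a scale-free measure of disagreement."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm = math.sqrt(sum(x * x for x in a)) + math.sqrt(sum(y * y for y in b))
    return diff / norm if norm else 0.0

def f(theta):
    # Hypothetical cost: theta0^2 + 3 * theta0 * theta1
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def analytic_grad(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.5, -2.0]
err = relative_error(numerical_gradient(f, theta), analytic_grad(theta))
print("relative error:", err)   # a tiny value suggests the analytic gradient is right
```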

### Mini-Batch gradient descent

Assume 5000 mini-batches of 1000 examples each for 1 epoch.

for t = 1, 2, ..., 5000 (mini-batches):

1. Forward prop on $X^{\{t\}}$: $A^{[l]} = g^{[l]}(W^{[l]}A^{[l-1]} + b^{[l]})$ with $A^{[0]} = X^{\{t\}}$ ($l$ indexes layers), vectorized over the 1000 examples.

2. Compute cost $J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000}L(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2 * 1000} \sum_l||w^{[l]}||_2^2$

3. Backprop to compute the gradients of the cost $J^{\{t\}}$, $dw^{[l]},db^{[l]}$, and update the parameters.

<img src="images/mini_batch.png" style="width:800px"/>
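
The loop above can be sketched on a toy problem. To keep it short, the sketch assumes a one-parameter linear model $\hat y = wx$ with squared loss instead of a multi-layer network; the data, batch size, learning rate and helper names are illustrative.

```python
import random

def mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the examples and yield them in consecutive mini-batches."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        chunk = idx[start:start + batch_size]
        yield [X[i] for i in chunk], [Y[i] for i in chunk]

def train_epoch(w, X, Y, batch_size, lr, lam, seed=0):
    """One pass over the data: one gradient step per mini-batch."""
    for xb, yb in mini_batches(X, Y, batch_size, seed):
        m = len(xb)
        # Gradient of (1/m) * sum 0.5 * (w*x - y)^2 + (lam / (2m)) * w^2
        grad = sum((w * x - y) * x for x, y in zip(xb, yb)) / m + lam * w / m
        w -= lr * grad
    return w

X = [i / 100 for i in range(1, 101)]   # hypothetical inputs
Y = [2.0 * x for x in X]               # targets generated with a true weight of 2
w = 0.0
for epoch in range(20):
    w = train_epoch(w, X, Y, batch_size=10, lr=0.5, lam=0.0, seed=epoch)
print("learned w:", w)
```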

### RMSprop
**Note: each layer has its own $S_{db},S_{dw}$**
<img src="images/rmsprop.png" style="width:800px"/>
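
A minimal scalar sketch of an RMSprop step, assuming the common formulation in which $S_{dw}$ is an exponentially weighted average of squared gradients. The hyperparameter values and the toy cost $f(w) = w^2$ are illustrative choices.

```python
import math

def rmsprop_step(w, dw, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update for a single parameter w with gradient dw."""
    s = beta * s + (1 - beta) * dw * dw        # running average of squared gradients
    w = w - lr * dw / (math.sqrt(s) + eps)     # step scaled by the gradient's recent magnitude
    return w, s

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w, s = 5.0, 0.0
for _ in range(2000):
    w, s = rmsprop_step(w, 2 * w, s)
print("w after 2000 steps:", w)
```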

### Adam
**Note: each layer has its own $V_{dw},V_{db},S_{dw},S_{db}$**
<img src="images/adam.png" style="width:800px"/>
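
A minimal scalar sketch of an Adam step, assuming the usual formulation with a momentum term `v`, a squared-gradient term `s` and bias correction. The default hyperparameters shown are the commonly quoted ones, and the toy cost $f(w) = w^2$ is an illustrative choice.

```python
import math

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step counter."""
    v = beta1 * v + (1 - beta1) * dw           # momentum-like average of gradients
    s = beta2 * s + (1 - beta2) * dw * dw      # RMSprop-like average of squared gradients
    v_hat = v / (1 - beta1 ** t)               # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w, v, s = 5.0, 0.0, 0.0
history = []
for t in range(1, 11):
    w, v, s = adam_step(w, 2 * w, v, s, t)
    history.append(w)
print(history)
```

Thanks to the bias correction, the very first step has a size close to `lr` regardless of the gradient's scale.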

### Learning Rate Decay

1. $\alpha = \frac{\alpha_0}{1+ (decay\_rate) * (epoch\_num)}$, where $\alpha_0$ is the initial learning rate

2. Exponential decay: $\alpha=\alpha_0*{0.95}^{epoch\_num} $

3. $\alpha = \frac{k}{\sqrt{epoch\_num}}\alpha_0$ or $\alpha = \frac{k}{\sqrt{batch\_num}}\alpha_0$

4. Manual tuning
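
The first three schedules can be sketched as plain functions of the epoch number. Here `alpha0` denotes the initial learning rate, and all numeric values are illustrative.

```python
import math

def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha0 * base ** epoch_num"""
    return alpha0 * base ** epoch_num

def sqrt_decay(alpha0, k, epoch_num):
    """alpha0 * k / sqrt(epoch_num), defined for epoch_num >= 1"""
    return alpha0 * k / math.sqrt(epoch_num)

alpha0 = 0.2
for epoch in range(1, 5):
    print(epoch,
          inverse_time_decay(alpha0, decay_rate=1.0, epoch_num=epoch),
          exponential_decay(alpha0, epoch),
          sqrt_decay(alpha0, k=1.0, epoch_num=epoch))
```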

### Hyperparameter tuning

<img src="images/paras_tunning.png" style="width:600px"/>

## Transfer Learning
[《Deep Learning With Just a Little Data - YouTube》by Mike Bernico](http://t.cn/RuAW7AX)

When transfer learning makes sense:

1. Task A and B have the same input x.

2. You have a lot more data for Task A than Task B.

3. Low level features from A could be helpful for learning B.

## End-to-End Learning

Pros:
- Let the data speak
    - By having a pure machine learning approach, the neural network will learn from x to y. It will be able to find which statistics are in the data, rather than being forced to reflect human preconceptions.
- Less hand-designing of components needed 
    - It simplifies the design work flow.
    
Cons:
- Large amount of labeled data
    - It cannot be used for every problem as it needs a lot of labeled data.
- Excludes potentially useful hand-designed component
    - Data and any hand-designed components or features are the 2 main sources of knowledge for a learning algorithm. If the data set is small, then a hand-designed system is a way to inject manual knowledge into the algorithm.
    
**Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?**

## CNN
## RNN
## RL