# Neural Networks

## Activation Functions

[Activation Functions: Neural Networks](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)

### Softmax

$$softmax(z)_i = \frac{e^{z_i}}{\sum_j{e^{z_j}}}$$

$$softmax(z) = softmax(z + c) \qquad \text{($c$ is a constant added to every component of $z$)}$$
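
The shift-invariance identity above is what makes a numerically stable softmax possible. The following is a minimal sketch in plain Python, assuming the common choice $c = -\max_j z_j$; the function name and the sample inputs are illustrative only.

```python
import math

def softmax(z):
    """Softmax of a list of floats."""
    # Subtracting max(z) corresponds to the shift c = -max(z). It leaves the
    # result unchanged but keeps exp() from overflowing on large inputs.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
p = softmax(z)
p_shifted = softmax([v + 100.0 for v in z])

# Both calls should agree up to floating-point rounding.
print(p)
print(p_shifted)
```

Without the shift, `math.exp(1000.0)` raises `OverflowError`, while the shifted version handles such inputs.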

Assume $\hat y = softmax(z)$, $y$ is a one-hot vector with a 1 only at the correct class $k$, and $L(z)$ is the cross-entropy loss:

$$L(z) = - \sum_i y_i \log(\hat{y}_i) = - \log(\hat y_k) \tag{1} $$

$$\frac{\partial{L(z)}}{\partial{z}}= \hat y - y, \qquad \frac{\partial{L(z)}}{\partial{z_j}} = \begin{cases} \hat{y}_k - 1, & j = k \\ \hat{y}_j, & j \neq k \end{cases} \tag{2} $$

> e.g. $\hat y = [0.015,0.866,0.117 ], \ y =[0,1,0].$ <br>
If $\hat{y}_2 = 0.866$ is the true output, then $\frac{\partial{L(z)}}{\partial{z_2}}=0.866-1=-0.134$ and $\frac{\partial{L(z)}}{\partial{z}} = [0.015,-0.134,0.117]$
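
The gradient in equation (2) can be checked numerically. The sketch below compares $\hat y - y$ with a central finite-difference estimate. The logits `[0.0, 4.0, 2.0]` are a hypothetical choice that yields roughly the $\hat y$ of the example above; all helper names are illustrative.

```python
import math

def softmax(z):
    shift = max(z)  # stabilising shift, does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L(z) = -sum_i y_i * log(softmax(z)_i) for a one-hot y."""
    y_hat = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

def grad_cross_entropy(z, y):
    """Analytic gradient dL/dz = y_hat - y."""
    return [pi - yi for pi, yi in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for j in range(len(z)):
        z_plus = z[:j] + [z[j] + eps] + z[j + 1:]
        z_minus = z[:j] + [z[j] - eps] + z[j + 1:]
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))
    return grad

z = [0.0, 4.0, 2.0]  # hypothetical logits
y = [0, 1, 0]        # one-hot label, class 2 is correct

print([round(g, 3) for g in grad_cross_entropy(z, y)])
print([round(g, 3) for g in numeric_grad(z, y)])
```

Only the component of the correct class is negative, so a gradient descent step raises that logit and lowers the others.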


### Sigmoid

<img src="images/sigmoid.png" style="width: 400px;"/>

$$sigmoid(z) = \frac{1}{1 + e^{-z}}$$

The main reason to use the sigmoid function is that its output lies in the interval **(0, 1)**. It is therefore a natural choice for models that have to predict a probability **as an output**, since a probability always lies between 0 and 1.

The function is **differentiable**, which means we can find the slope of the sigmoid curve at any point.

The function is **monotonic**, but its derivative is not.

The logistic sigmoid function can cause a neural network to **get stuck** during training: for inputs of large magnitude the curve flattens, the gradient becomes close to zero, and the weights barely update.

The **softmax** function is a generalization of the logistic sigmoid that is used for **multiclass classification**.
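
The properties listed above can be seen in a few lines of code. This is a minimal sketch, not a reference implementation: it uses the standard identity $sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))$, and the branch on the sign of `z` is an assumption added here to avoid overflow in `exp`.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, arranged so that exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Derivative sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  grad={sigmoid_grad(z):.5f}")
```

The output values rise steadily with `z` (monotonic), while the gradient peaks at `z = 0` and shrinks toward zero on both sides, which is one common explanation for training getting stuck.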

### Tanh

<img src="images/tanh.jpeg" style="width: 400px;"/>

$$tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

The range of the tanh function is **(-1, 1)**. tanh is also sigmoidal (S-shaped). Its advantage is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped near zero **(zero-centered output, with the steepest slope around 0).**

The function is **monotonic and differentiable**, while its derivative is not monotonic.

tanh usually works better than sigmoid in hidden layers.
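
One way to see how tanh relates to sigmoid is the identity $tanh(z) = 2 \, sigmoid(2z) - 1$, together with the derivative $tanh'(z) = 1 - tanh^2(z)$. The sketch below is illustrative only; the helper names are not from the original notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def tanh_from_sigmoid(z):
    """tanh written as a rescaled, recentred sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

def tanh_grad(z):
    """Derivative tanh'(z) = 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-2.0, 0.0, 0.7, 2.0):
    print(f"z={z:5.1f}  tanh={tanh(z):.5f}  via sigmoid={tanh_from_sigmoid(z):.5f}")

# Slope at the origin: tanh is steeper than sigmoid there.
print(tanh_grad(0.0), sigmoid_grad(0.0))
```

The larger slope around zero and the zero-centered output are the usual arguments for preferring tanh in hidden layers.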

### ReLU & Leaky ReLU

<img src="images/sigmoid_vs_relu.png" style="width: 600px;"/>

$$ReLU(z) = max(0, z)$$


\begin{equation}
    \frac{\partial ReLU(z)}{\partial z}=\begin{cases}
        0, & \text{if $z<0$}.\\
        1, & \text{if $z>0$}.
    \end{cases}
\end{equation}



Being **non-differentiable at zero** is not a big deal in practice: the input is rarely exactly zero, more often a very small decimal such as $10^{-10}$.

The real issue is that every negative input is mapped to zero. The gradient there is zero as well, so the affected units stop learning from those inputs, which reduces the model's ability to fit the data. This is the motivation for **leaky ReLU**.

<img src="images/relu_vs_leakyrelu.jpeg" style="width: 600px;"/>

<center>Fig: ReLU vs. Leaky ReLU</center>

$$LeakyReLU(z) = max(az, z), \qquad 0 < a < 1$$

The leak extends the range of the ReLU function by giving negative inputs a small slope $a$. Usually, the value of $a$ is 0.01 or so.

Therefore the range of Leaky ReLU is $(-\infty, \infty)$.

When $a$ is not fixed but drawn at random during training, the function is called Randomized ReLU.

Both Leaky and Randomized ReLU are monotonic, and so are their derivatives.
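
A minimal sketch of ReLU, Leaky ReLU, and their derivatives on scalars follows. The default slope `a=0.01` is taken from the text above. The value returned for the derivative at exactly `z == 0` is a convention chosen for this sketch, since the derivative is undefined there.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Undefined at z == 0; this sketch returns 1.0 there by convention.
    return 0.0 if z < 0 else 1.0

def leaky_relu(z, a=0.01):
    # Negative inputs keep a small slope a instead of being zeroed out.
    return z if z > 0 else a * z

def leaky_relu_grad(z, a=0.01):
    return 1.0 if z > 0 else a

for z in (-3.0, -0.5, 0.0, 2.5):
    print(f"z={z:5.1f}  relu={relu(z):5.2f}  leaky={leaky_relu(z):7.4f}  "
          f"d_relu={relu_grad(z):.2f}  d_leaky={leaky_relu_grad(z):.2f}")
```

For negative inputs the ReLU gradient is zero, so no learning signal passes through, while Leaky ReLU still passes a gradient of `a`.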

### Cheatsheet & Derivative

<img src="images/act_fun_cheetsheet.png" style="width: 800px;"/>

<center>Fig: Activation Function Cheatsheet</center><br>

<img src="images/af_derivative.png" style="width: 600px;"/>

## Fine-tune

### Bias vs Variance

<img src="images/bias_variance.png" style="width: 800px;"/>

### Norm Penalties

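$$ J(\theta) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) $$

$$ \tilde J(\theta) = {J}(\theta)+ \Omega(\theta) $$

$$ \theta^* = \arg\min_{\theta} \tilde J(\theta) = \arg\min_{\theta} \left( \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \sum_{j=1}^l \Omega(\omega^{[j]}) \right) $$

Here $m$ is the number of training examples, $l$ is the number of layers, and $\omega^{[j]}$ denotes the weights of layer $j$; the penalty $\Omega(\theta)$ is the sum of the per-layer penalties.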



#### L2 Norm (Ridge)

$$ \Omega(\theta) = \frac{\lambda}{2} {||\omega||}_2^2 $$

$$ {||\omega||}_2^2 = {\omega}^T \omega $$

$$ \nabla_{\omega}\tilde {J}(\omega) = \lambda \omega  + \nabla_{\omega}J(\omega) $$

Gradient descent update of $\omega$ with learning rate $\alpha$:

$$ \omega \leftarrow \omega - \alpha(\lambda \omega + \nabla_{\omega}J(\omega)) $$

$$ \omega \leftarrow (1 - \alpha\lambda)\omega - \alpha \nabla_{\omega}J(\omega) $$
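
The two forms of the update above are algebraically identical, which a short numeric check makes concrete. This is a sketch only: the weights, gradient, learning rate `lr` ($\alpha$), and penalty strength `lam` ($\lambda$) are hypothetical values.

```python
def l2_step(w, grad, lr, lam):
    """w <- w - lr * (lam * w + grad), applied element-wise."""
    return [wi - lr * (lam * wi + gi) for wi, gi in zip(w, grad)]

def l2_step_decay_form(w, grad, lr, lam):
    """Equivalent form: w <- (1 - lr * lam) * w - lr * grad."""
    decay = 1.0 - lr * lam
    return [decay * wi - lr * gi for wi, gi in zip(w, grad)]

w = [0.5, -1.2, 3.0]      # hypothetical weights
grad = [0.1, -0.4, 0.25]  # hypothetical gradient of the unregularized loss J

print(l2_step(w, grad, lr=0.1, lam=0.01))
print(l2_step_decay_form(w, grad, lr=0.1, lam=0.01))
```

With a zero gradient the step reduces to multiplying every weight by `1 - lr * lam`, so larger weights lose more in absolute terms.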

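> "Weight decay": each step first shrinks $\omega$ by the factor $(1 - \alpha\lambda)$. Around the minimum of $J$, the shrinkage is strongest along eigenvector directions of the Hessian matrix with small eigenvalues, while directions with large eigenvalues are relatively unaffected.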

#### L1 Norm (LASSO)

$$ \Omega(\theta) = \lambda {||\omega||}_1 $$

$$ {||\omega||}_1 = \sum_i |{\omega}_i| $$

$$ \nabla_{\omega}\tilde {J}(\omega) = \lambda sign(\omega) + \nabla_{\omega}J(\omega) $$

Gradient descent update of $\omega$ with learning rate $\alpha$:

$$ \omega \leftarrow \omega - \alpha(\lambda sign(\omega) + \nabla_{\omega}J(\omega)) $$
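
To make the L1 update above concrete, the sketch below applies one step with a zero loss gradient and contrasts it with an L2 step. All numbers are hypothetical and the helper names are illustrative. `sign(0)` is taken to be 0, a common convention for the subgradient at zero.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def l1_step(w, grad, lr, lam):
    """w <- w - lr * (lam * sign(w) + grad), applied element-wise."""
    # A plain step like this can overshoot zero for very small weights;
    # this sketch does not handle that case.
    return [wi - lr * (lam * sign(wi) + gi) for wi, gi in zip(w, grad)]

def l2_step(w, grad, lr, lam):
    """w <- w - lr * (lam * w + grad), shown for comparison."""
    return [wi - lr * (lam * wi + gi) for wi, gi in zip(w, grad)]

w = [10.0, 0.1]     # one large and one small hypothetical weight
zero = [0.0, 0.0]   # zero loss gradient isolates the effect of the penalty

print(l1_step(w, zero, lr=0.1, lam=0.5))  # each weight moves toward 0 by lr * lam
print(l2_step(w, zero, lr=0.1, lam=0.5))  # each weight is scaled by 1 - lr * lam
```

The L1 penalty pulls every nonzero weight toward zero by the same amount regardless of its size, so small weights reach zero quickly. The L2 penalty shrinks weights in proportion to their size and rarely makes them exactly zero.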

> Compared with L2 regularization, L1 regularization tends to **output sparser solutions** and is used for **feature selection**. L1 regularization drives some of the weights to exactly zero, which means the corresponding features can be safely ignored.

### Dropout

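- Used to address overfitting.

- Dropout comes in two versions: direct (rarely used) and inverted. (Only inverted dropout is described here.)

- Dropout means that, while a deep network is being trained, units of the network are temporarily dropped from the network with a given probability. (Note: temporarily.)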


<img src="images/dropout.png" style="width: 400px;"/>


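#### How dropout works

[Dropout regularization](https://www.jianshu.com/p/257d3da535ab)

In a typical neural network, training consists of propagating the input forward through the network and then propagating the error backward. Dropout modifies this process by randomly removing some of the hidden units before each pass.

The procedure can be broken down into these steps:

- Randomly remove some of the hidden neurons in the network, keeping the input and output neurons unchanged

- Propagate the input forward through the modified network, then propagate the error backward through the modified network

- Repeat the above for the next batch of training examples

The activations are scaled during training and left unchanged at test time.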
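
The forward pass of inverted dropout can be sketched in plain Python as follows. The function name, the `keep_prob` parameter (the probability of keeping a unit), and the `rng` argument are assumptions made for this illustration. Kept activations are divided by `keep_prob` during training so that their expected value is unchanged, which is why nothing needs to be rescaled at test time.

```python
import random

def inverted_dropout(activations, keep_prob, training=True, rng=random):
    """Apply inverted dropout to a list of activations."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training or keep_prob == 1.0:
        # Validation/test: the layer passes activations through unchanged.
        return list(activations)
    out = []
    for a in activations:
        keep = rng.random() < keep_prob  # keep each unit with probability keep_prob
        # Scale kept units by 1 / keep_prob; dropped units output 0.
        out.append(a / keep_prob if keep else 0.0)
    return out

hidden = [0.2, 1.5, -0.7, 0.9, 0.0, 2.1]  # hypothetical hidden activations
print(inverted_dropout(hidden, keep_prob=0.8, rng=random.Random(0)))
print(inverted_dropout(hidden, keep_prob=0.8, training=False))
```

A new random mask is drawn on every call, which matches the step of repeating the procedure for each batch. Backpropagation would reuse the same mask and scaling, which this sketch omits.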

<img src="images/inverted_dropout.png" style="width: 800px;"/>

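#### Why dropout works

- Each time the weights are updated for a sample fed into the network, every hidden node is present only with a certain probability, so there is no guarantee that any two hidden nodes appear together every time. Weight updates therefore no longer depend on the joint action of hidden nodes with a fixed relationship. This prevents features from being useful only in the presence of particular other features and reduces complex co-adaptation between neurons.

- Because nodes are removed at random each time, the output of a node depends less on any single node in the previous layer. When assigning weights, it will not put too much weight on any one node of the previous layer, which has an effect similar to the weight shrinkage of L2 regularization.

- Dropout can be viewed as a form of model averaging over a large number of different networks. Different networks overfit in different ways and to different degrees, but because they share one loss function they are effectively optimized together and averaged, which is fairly effective at preventing overfitting. For each input fed to the network (a single sample or a batch), the corresponding network structure is different, yet all of these structures share the weights of the hidden nodes. This kind of averaging has been found to be very useful for reducing overfitting.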

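#### Tips for using dropout

- Use dropout in the layers where overfitting is likely to occur

- Dropout can also be used as a way to add noise by applying it directly to the input. For the input layer, set the keep probability to a value closer to 1 so that the input does not change too much

- With inverted dropout, do not apply dropout during validation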

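#### Drawbacks of dropout

- A well-defined loss function is expected to decrease at every iteration. Dropout removes nodes at random each time, so a different network is trained at every step; the loss function is no longer well defined and is to some extent hard to compute, so we lose a debugging tool.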

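#### Current usage of dropout

Dropout is currently used heavily in fully connected networks, typically with a manually chosen rate of 0.5 or 0.3. In convolutional hidden layers it is used less, because convolution is itself sparse and the sparsifying ReLU function is widely used.
Overall, dropout is a hyperparameter that has to be tuned for the specific network and application domain.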

### Early Stopping

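Beyond a certain point, the log loss on the test set starts to grow as training continues, typically while the training loss is still falling. Early stopping halts training around that point.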

<img src="images/early_stop.png" style="width: 600px;"/>

One way to think of early stopping is as a very efficient hyperparameter selection
algorithm. In this view, the number of training steps is just another hyperparameter.
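
Treating the number of training steps as a hyperparameter can be sketched with a simple patience rule: keep the step with the lowest validation loss and stop once it has not improved for `patience` consecutive evaluations. The function name, the `patience` parameter, and the loss values are assumptions made for this illustration, not necessarily the exact algorithm shown in the figures.

```python
def best_step_with_patience(val_losses, patience):
    """Return (best_step, best_loss) chosen by patience-based early stopping.

    val_losses[i] is the validation loss measured at evaluation i.
    """
    best_loss = float("inf")
    best_step = -1
    waited = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` evaluations in a row
    return best_step, best_loss

# Hypothetical validation curve: falls, then rises, then dips again late.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(best_step_with_patience(losses, patience=3))
print(best_step_with_patience(losses, patience=5))
```

A small `patience` stops early and can miss a later improvement, while a large one trains longer. In a real training loop the model parameters would also be saved at each new best step.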

<img src="images/para_earlystop.png" style="width: 600px;" />

<img src="images/meta_earlystop.png" style="width: 600px;" />

## CNN
## RNN
## RL