# Challenge - 01: (Document 2)

# Activation Function 

### What is an Activation Function?

1. Activation functions help determine the output of a neural network. An activation function is attached to each neuron in the network and determines whether that neuron should be activated, based on whether the neuron's input is relevant to the model's prediction.
2. In a neural network, inputs are fed into the neurons of the input layer. Each connection has a weight; the inputs are multiplied by their weights and summed, the activation function is applied to the result, and the output is passed to the next layer.
3. In simple terms, an activation function maps a neuron's weighted input to its output, and its gradient influences how the weights are adjusted during training.
4. Neural networks use non-linear activation functions, which let the network learn complex patterns in data and approximate almost any function, and so help it produce accurate predictions.
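
The points above can be sketched for a single neuron. This is a minimal illustration, not a full network: the function name `neuron_output`, the bias term, and all numeric values are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the example non-linear activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

linear = neuron_output(inputs, weights, bias, lambda z: z)  # no activation
activated = neuron_output(inputs, weights, bias, sigmoid)   # with sigmoid
print(linear, activated)
```

Without the activation, stacking such neurons would still compute only a linear function of the inputs, which is why a non-linear activation is applied to the weighted sum.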

### Commonly used activation functions are 

### 1. **Sigmoid function**

1. The sigmoid function is a smooth function that is easy to differentiate.
2. Its output lies in the open interval (0, 1).
3. The output can be read loosely as a probability, but strictly speaking it should not be treated as one.
4. It can be thought of as the firing rate of a neuron. In the middle, where the slope is relatively large, is the sensitive area of the neuron.
5. On the sides, where the slope is very gentle, is the neuron's inhibitory area.

$$\sigma (x) = \frac{1}{1+e^{-x}}$$

$$\text{where } \sigma(x) \in (0, 1),\\
\text{and } x \in (-\infty, +\infty)$$
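
A minimal sketch of the formula above in plain Python. The branch on the sign of `x` is an implementation choice added here (not part of the definition) so that `math.exp` never receives a large positive argument and overflows.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid: 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid(x))
```

Mathematically the output never reaches 0 or 1, although in floating point it rounds to those values for inputs of very large magnitude.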

![image.png](attachment:image.png)



#### Observations:

1) As the input moves away from the origin, the gradient of the function shrinks quickly toward zero. During backpropagation, the chain rule is used to compute the derivative of the loss with respect to each weight w. Whenever the chain passes through a sigmoid function, the factor it contributes is very small, and the chain may pass through many sigmoid functions. As a result, the computed gradient for the weight w becomes tiny, which hinders the optimization of that weight. This problem is called gradient saturation or the vanishing gradient problem.

2) The function output is not centered on 0, which reduces the efficiency of weight updates.

3) The sigmoid function requires computing an exponential, which is relatively slow for computers.
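
The first observation can be checked numerically. The sketch below uses the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, whose maximum is 0.25 at $x = 0$. The "chain of layers" product is a deliberate simplification that ignores the weights and assumes every sigmoid sits at its best case of 0.25.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient shrinks as the input moves away from the origin.
for x in (0.0, 2.0, 5.0, 10.0):
    print("x =", x, "gradient =", sigmoid_grad(x))

# Best-case product of sigmoid derivatives along a chain of n layers
# (simplified: weights are ignored, each factor is the maximum 0.25).
for n in (1, 5, 10):
    print("layers =", n, "upper bound on product =", 0.25 ** n)
```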


#### Advantages of the Sigmoid Function:

1. Smooth gradient, preventing "jumps" in output values.
2. Output values bounded between 0 and 1, normalizing the output of each neuron.
3. Clear predictions, i.e., outputs very close to 1 or 0 for inputs of large magnitude.


#### Sigmoid has three major disadvantages:
* Prone to vanishing gradients
* Function output is not zero-centered
* Exponential operations are relatively time consuming


###  2. **Hyperbolic tangent activation function**

1. Tanh is the hyperbolic tangent function.
2. The curves of the tanh function and the sigmoid function are relatively similar.
3. When the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates.
4. The difference is the output interval.
5. The output interval of tanh is (-1, 1), and the whole function is zero-centered, which is better than sigmoid.

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$$\text{where } \tanh(x) \in (-1, 1),\\
\text{and } x \in (-\infty, +\infty)$$
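
The similarity between the two curves can be made concrete: tanh is a rescaled and shifted sigmoid, $\tanh(x) = 2\sigma(2x) - 1$. The sketch below checks this identity numerically against the standard library's `math.tanh`; the helper name `tanh_from_sigmoid` is made up for the example.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh_from_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, close to 0 for large |x|."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, math.tanh(x), tanh_from_sigmoid(x), tanh_grad(x))
```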


![image.png](attachment:image.png)

### 3. **Rectified linear unit activation function (ReLU)**

1. The ReLU function simply takes the maximum of its input and zero. Note that it is not differentiable everywhere (there is a kink at x = 0), but a sub-gradient can be used there.
2. Although ReLU is simple, it is an important achievement of recent years.
3. The ReLU (Rectified Linear Unit) function is currently one of the most popular activation functions.
4. The ReLU formula and curve are as follows:


$$ReLU(x) = \max(x, 0)$$

$$\text{where } ReLU(x) \in [0, +\infty),\\
\text{and } x \in (-\infty, +\infty)$$
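
A minimal sketch of ReLU and a sub-gradient for it. The value returned at exactly `x == 0` is a convention; 0 is assumed here, and other choices in [0, 1] are equally valid sub-gradients.

```python
def relu(x):
    """ReLU: the maximum of the input and zero."""
    return max(x, 0.0)

def relu_grad(x):
    """A sub-gradient of ReLU: 1 for positive inputs, 0 otherwise.

    The value at x == 0 is a convention (0 is assumed here).
    """
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, relu(x), relu_grad(x))
```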

![image.png](attachment:image.png)

#### Advantages:

1) When the input is positive, there is no gradient saturation problem.

2) The calculation is much faster. The ReLU function is piecewise linear (linear on each side of 0), so both the forward and the backward pass are much faster than for sigmoid and tanh, which need to compute an exponential.

#### Disadvantages:

1) When the input is negative, ReLU outputs zero and is completely inactive. This is not a problem in forward propagation, where some regions are simply sensitive and others are not. In backpropagation, however, the gradient for a negative input is exactly zero, so a neuron whose input stays negative stops updating and "dies". This is similar to the saturation problem of the sigmoid and tanh functions.

2) The output of the ReLU function is either 0 or a positive number, which means that the ReLU function is not zero-centered.

### 4. **Leaky ReLU function**

$$ 
leaky\_relu(x, \alpha) = \left\{\begin{matrix} 
x & x\geq 0 \\ 
\alpha x & x \lt 0 
\end{matrix}\right.
$$

$$\text{where } x \in (-\infty, +\infty)$$


![image.png](attachment:image.png)

1. To solve the Dead ReLU problem, it was proposed to set the negative half of ReLU to 0.01x instead of 0.
2. Another intuitive idea is a parameter-based method, **Parametric ReLU**: $f(x) = \max(\alpha x, x)$ (for $\alpha \le 1$), where $\alpha$ can be learned through backpropagation.
3. In theory, Leaky ReLU has all the advantages of ReLU and avoids the Dead ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
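
A minimal sketch of the Leaky ReLU formula above, with the default slope $\alpha = 0.01$ mentioned in the text. The gradient function follows the same branches as the formula, so the value at `x == 0` is taken from the `x >= 0` branch.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for x >= 0, alpha otherwise (never zero)."""
    return 1.0 if x >= 0 else alpha

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient on the negative side is `alpha` rather than 0, a neuron with negative input still receives a (small) update.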

### 5. **ELU (Exponential Linear Units) function**

$$elu(x, \alpha) = \left\{\begin{matrix} x & x\geq0\\ \alpha \cdot (e^{x} - 1) & x \lt 0 \end{matrix}\right.$$

$$\alpha = \text{scalar slope of the negative section}$$



![image.png](attachment:image.png)


ELU was also proposed to solve the problems of ReLU. ELU keeps the advantages of ReLU, and in addition:

* No Dead ReLU issues
* The mean of the output is close to 0, zero-centered

One small problem is that it is slightly more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
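
A minimal sketch of the ELU formula above. The default `alpha=1.0` is an assumption for illustration, and `math.expm1` is used in place of `exp(x) - 1` only for better floating-point accuracy near zero.

```python
import math

def elu(x, alpha=1.0):
    """ELU: x for x >= 0, alpha * (e^x - 1) otherwise."""
    return x if x >= 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """Gradient of ELU: 1 for x >= 0, alpha * e^x otherwise."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, elu(x), elu_grad(x))
```

For very negative inputs the output saturates at `-alpha`, and the exponential in the negative branch is the extra computation mentioned above.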

### 6. **PReLU (Parametric ReLU)**

1. PReLU is also an improved version of ReLU. In the negative region, PReLU has a small slope, which can also avoid the problem of ReLU death.
2. Compared to ELU, PReLU is a linear operation in the negative region.
3. Although the slope is small, it does not tend to 0, which is a certain advantage.

$$f(y_i) = \left\{\begin{matrix} y_i & y_i>0\\ \alpha_i \cdot y_i & y_i \leq 0 \end{matrix}\right.$$

In the PReLU formula, the parameter $\alpha$ is generally a number between 0 and 1, and it is usually relatively small, such as a few hundredths. When $\alpha$ is fixed at a small constant such as 0.01, PReLU becomes Leaky ReLU, so Leaky ReLU can be regarded as a special case of PReLU.

Above, $y_i$ is any input on the $i$-th channel and $\alpha_i$ is the negative slope.
* if $\alpha_i=0$, f becomes ReLU
* if $\alpha_i$ is a small fixed positive value, f becomes leaky ReLU
* if $\alpha_i$ is a learnable parameter, f becomes PReLU
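
To illustrate what "learnable" means here, the sketch below takes one gradient-descent step on $\alpha$ for a single input. The squared-error loss, the learning rate, and all numeric values are assumptions chosen for the example; they are not part of the PReLU definition.

```python
def prelu(y, alpha):
    """PReLU: y for y > 0, alpha * y otherwise."""
    return y if y > 0 else alpha * y

def prelu_grad_alpha(y):
    """Derivative of PReLU with respect to alpha: 0 for y > 0, y otherwise."""
    return 0.0 if y > 0 else y

# Hypothetical training example and hyperparameters.
alpha = 0.25               # initial negative slope (assumed)
y, target, lr = -2.0, -0.2, 0.1

out = prelu(y, alpha)                         # forward pass
# Loss L = 0.5 * (out - target)^2, so dL/dalpha = (out - target) * d(out)/d(alpha)
grad = (out - target) * prelu_grad_alpha(y)
alpha_new = alpha - lr * grad                 # one gradient-descent step

print("output before:", out, "alpha after step:", alpha_new,
      "output after:", prelu(y, alpha_new))
```

Setting `alpha` to 0 in `prelu` recovers ReLU, and keeping it fixed at a small positive constant recovers Leaky ReLU, matching the cases listed above.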

### 7. **Softmax activation function**

$$S(x_j)=\frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, where\ j = 1,2, \cdots, K $$


For an arbitrary real vector of length K, Softmax compresses it into a real vector of length K whose elements each lie in the range (0, 1) and sum to 1.

It also has many applications in Multiclass Classification and neural networks. Softmax is different from the normal max function: the max function only outputs the largest value, and Softmax ensures that smaller values have a smaller probability and will not be discarded directly. It is a "max" that is "soft".
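
A minimal sketch of Softmax in plain Python. Subtracting the maximum before exponentiating is a common stability trick added here (it is not in the formula above); it leaves the result unchanged because the common factor cancels in the ratio. The input scores are made-up values.

```python
import math

def softmax(xs):
    """Softmax of a list of real numbers, shifted by max(xs) for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical class scores
probs = softmax(scores)
print(probs, sum(probs))
```

Unlike a hard max, the smaller scores still receive a non-zero share of the probability mass.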

The denominator of the Softmax function combines all factors of the original output value, which means that the different probabilities obtained by the Softmax function are related to each other.
In the case of binary classification, for Sigmoid, there are:

$$p(y= 1|x) = \frac{1}{1+e^{-\theta^Tx}}$$

$$p(y= 0|x) = 1- p(y= 1|x)  =\frac{e^{-\theta^Tx}}{1+e^{-\theta^Tx}}$$

For Softmax with K = 2, there are:

$$p(y= 1|x) = \frac{e^{\theta_1^Tx}}{e^{\theta_0^Tx} + e^{\theta_1^Tx}} = \frac{1}{1+e^{(\theta_0^T-\theta_1^T)x}} = \frac{1}{1+e^{-\beta x}}$$


$$p(y= 0|x) = \frac{e^{\theta_0^Tx}}{e^{\theta_0^Tx} + e^{\theta_1^Tx}} = \frac{e^{(\theta_0^T-\theta_1^T)x}}{1+e^{(\theta_0^T-\theta_1^T)x}} = \frac{e^{-\beta x}}{1+e^{-\beta x}}$$


where

$$\beta = - (\theta_0^T-\theta_1^T)$$

It can be seen that in the case of binary classification, Softmax reduces to Sigmoid.
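
The equivalence can also be checked numerically. In the sketch below, `theta0`, `theta1`, and `x` are arbitrary made-up vectors; the Softmax probability of class 1 is compared with the Sigmoid of $\beta x$, where $\beta = \theta_1^T - \theta_0^T$ as defined above.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(xs):
    """Softmax of a list of real numbers, shifted by max(xs) for stability."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical parameters and input.
theta0 = [0.2, -0.5]
theta1 = [1.0, 0.3]
x = [0.7, -1.2]

beta = [t1 - t0 for t0, t1 in zip(theta0, theta1)]  # beta = -(theta0 - theta1)

p1_softmax = softmax([dot(theta0, x), dot(theta1, x)])[1]
p1_sigmoid = sigmoid(dot(beta, x))
print(p1_softmax, p1_sigmoid)
```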
    


### 8. Swish

Swish is an activation function defined as

$f(x) = x \cdot \sigma(\beta x)$, where $\beta$ is a learnable parameter.

Nearly all implementations do not use the learnable parameter, in which case the activation function is $f(x) = x \cdot \sigma(x)$ ("Swish-1").

![](https://paperswithcode.com/media/methods/Screen_Shot_2020-05-27_at_2.02.25_PM.png)
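
A minimal sketch of Swish with `beta` as a plain function argument. Making `beta` actually learnable would require a training loop, which is outside the scope of this example; the default `beta=1.0` corresponds to "Swish-1".

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); beta = 1 gives Swish-1."""
    return x * sigmoid(beta * x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, swish(x))
```

For large positive inputs Swish approaches the identity, and for large negative inputs it approaches 0, with a small dip below zero in between.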

### 9. Softplus

Softplus is an activation function 

$$f(x) = \log(1+e^x)$$

It can be viewed as a smooth version of ReLU.

![](https://paperswithcode.com/media/methods/Screen_Shot_2020-05-27_at_2.07.07_PM.png)
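
A minimal sketch of Softplus next to ReLU. The rewritten form $\max(x, 0) + \log(1 + e^{-|x|})$ is algebraically equal to $\log(1 + e^x)$ and is used here only to avoid overflow for large inputs; it is an implementation choice, not part of the definition.

```python
import math

def softplus(x):
    """Stable log(1 + e^x), computed as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    """ReLU, for comparison."""
    return max(x, 0.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, softplus(x), relu(x))
```

Softplus stays slightly above ReLU everywhere and the gap shrinks as the input moves away from zero, which is the sense in which it is a smooth version of ReLU.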