# Introduction

In my opinion, ReLU is the best default activation function for the hidden layers of a neural network. It is cheaper to compute than activation functions like sigmoid and tanh, it often lets the model converge faster, and it is less likely to cause the vanishing gradient problem.

Fukushima first used ReLU in a paper published in 1969, 6 years before the Cognitron paper, in a so-called analog threshold element (see Equation 2 and Figure 3):

![image.png](attachment:image.png)

K. Fukushima, "Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements," in IEEE Transactions on Systems Science and Cybernetics, vol. 5, no. 4, pp. 322-333, Oct. 1969, doi: 10.1109/TSSC.1969.300225.

ReLU stands for Rectified Linear Unit. 

Although it looks like a linear function, ReLU is only piecewise linear, so it is non-linear overall. It has a simple derivative (0 for negative inputs and 1 for positive inputs), which allows for backpropagation while keeping the function computationally efficient.

The main catch here is that the ReLU function does not activate all the neurons at the same time. 

A neuron is deactivated (it outputs 0) only if the output of its linear transformation is less than 0.

![image.png](attachment:image.png)

Mathematically, ReLU is defined as:

$$ f(x) = \max(0, x) $$
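
The definition above can be sketched in a few lines of NumPy. This is only a minimal illustration; the function name `relu` and the sample inputs are my own choices, not part of any particular library.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

# Sample inputs chosen for illustration only.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # negative inputs are clipped to 0, positive inputs pass through
```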

- Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.

- ReLU accelerates the convergence of gradient descent towards a minimum of the loss function due to its linear, non-saturating property for positive inputs.

The negative side of the graph makes the gradient value zero. As a result, during the backpropagation process, the weights and biases of some neurons are not updated. This can create dead neurons that never get activated.

This is known as ***the Dying ReLU*** problem.

![image.png](attachment:image.png)

To solve this problem, we can use Leaky ReLU, Parametric ReLU, and Exponential Linear Unit (ELU) activation functions.
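
To make the dead-neuron effect concrete, here is a small sketch with a single hypothetical neuron whose pre-activation is negative for every input. The weight `w`, bias `b`, inputs, and the upstream gradient of 1 are all made-up values for illustration.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 otherwise (0 at x = 0 by convention)."""
    return (x > 0).astype(float)

# Hypothetical neuron: z = w * inputs + b, with a large negative bias.
inputs = np.array([0.1, 0.5, 1.0, 2.0])
w, b = 1.0, -5.0
z = w * inputs + b              # every pre-activation is negative

# Pretend the loss gradient w.r.t. the activation is 1 for each sample.
upstream = np.ones_like(z)
grad_w = np.sum(upstream * relu_grad(z) * inputs)
grad_b = np.sum(upstream * relu_grad(z))
print(grad_w, grad_b)           # both are 0.0, so gradient descent cannot move w or b
```

Because both gradients are exactly zero, no update step can bring this neuron back into the active region for these inputs.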

# Leaky ReLU

***Leaky ReLU*** is an improved version of the ReLU function that addresses *the Dying ReLU* problem by using a small positive slope in the negative area.

![image.png](attachment:image.png)

Mathematically it can be represented as:

$$ f(x) = \max(0.01x, x) $$
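
A minimal NumPy sketch of this formula follows. The name `leaky_relu` and the `slope` argument are illustrative; the default of 0.01 matches the equation above, and the `np.where` form is equivalent to the `max` form as long as the slope is below 1.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x > 0, slope * x otherwise."""
    return np.where(x > 0, x, slope * x)

x = np.array([-10.0, -1.0, 0.0, 5.0])
print(leaky_relu(x))  # negative inputs are scaled by 0.01 instead of being zeroed
```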

The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it **does enable backpropagation**, even for **negative input** values.

By making this minor modification for negative input values, the gradient of the left side of the graph comes out to be a non-zero value. Therefore, we would no longer encounter dead neurons in that region. 

Here is the derivative of the Leaky ReLU function. 

![image.png](attachment:image.png)
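
The derivative can also be checked numerically. The sketch below assumes a slope of 0.01 and compares a hand-written gradient against a central finite difference; the helper names are hypothetical, and the test points deliberately avoid $x = 0$, where the function has a kink.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    """Derivative: 1 for x > 0, `slope` otherwise (the value at x = 0 is a convention)."""
    return np.where(x > 0, 1.0, slope)

x = np.array([-3.0, -0.5, 0.5, 3.0])
eps = 1e-6
# Central finite difference as an independent check of the analytic gradient.
numeric = (leaky_relu(x + eps) - leaky_relu(x - eps)) / (2 * eps)
print(np.allclose(numeric, leaky_relu_grad(x)))
```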

The limitations that this function faces include:
- The predictions may not be consistent for negative input values. 
- The gradient for negative values is a small value that makes the learning of model parameters time-consuming.

# Parametric ReLU

***Parametric ReLU*** is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis.

This function treats the slope of the negative part of the function as a learnable parameter `a`. By performing backpropagation, the most appropriate value of `a` is learnt.

![image.png](attachment:image.png)

Mathematically, it can be represented as:

$$ f(x) = \max(ax, x) $$

Where $a$ is the slope for negative input values.
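
To show how `a` can be learnt, here is a rough sketch of the forward pass and a single gradient-descent step on `a`. The initial value 0.25, the learning rate, the inputs, and the upstream gradient of 1 are assumptions made for illustration; the `np.where` form matches $\max(ax, x)$ as long as $a \leq 1$.

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: x for x > 0, a * x otherwise."""
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    """Derivative of the output w.r.t. a: x for x <= 0, 0 for x > 0."""
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -1.0, 0.5, 3.0])
a = 0.25                       # assumed initial slope
upstream = np.ones_like(x)     # pretend dL/d(output) = 1 for every element

# Only the negative inputs contribute to the gradient of a.
grad_a = np.sum(upstream * prelu_grad_a(x))
lr = 0.1                       # assumed learning rate
a_new = a - lr * grad_a
print(grad_a, a_new)
```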

The parameterized ReLU function is used when the **leaky ReLU** function ***still fails at solving the problem of dead neurons***, and the relevant information is not successfully passed to the next layer. 

This function’s limitation is that it may perform differently for different problems depending on the value of the slope parameter $a$, much like the learning rate in the gradient descent algorithm.

# Exponential Linear Unit (ELU)

***Exponential Linear Unit***, or ELU for short, is also a variant of ReLU that modifies the slope of the negative part of the function. 

ELU uses an *exponential curve* to define the negative values, unlike the Leaky ReLU and Parametric ReLU functions, which use a straight line.

![image.png](attachment:image.png)

Mathematically, it can be represented as:

$$ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases} $$

Where $\alpha$ is a hyperparameter that scales the function when $x$ is less than or equal to 0; the output approaches $-\alpha$ for large negative inputs.
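
The piecewise definition can be sketched in NumPy as follows. The name `elu` and the default `alpha=1.0` are my own choices for illustration. Since `np.where` evaluates both branches, the exponential is only applied to `min(x, 0)` to avoid overflow warnings for large positive inputs.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    # expm1(v) computes exp(v) - 1; clamping at 0 keeps the unused branch finite.
    negative_branch = alpha * np.expm1(np.minimum(x, 0.0))
    return np.where(x > 0, x, negative_branch)

x = np.array([-1000.0, -1.0, 0.0, 2.0])
print(elu(x))  # the first value is very close to -alpha, the last is unchanged
```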

ELU is a strong alternative for ReLU because of the following advantages:
- ELU bends smoothly and saturates gradually towards $-\alpha$ for negative inputs, whereas ReLU has a sharp kink at 0.
- It avoids the dead ReLU problem by using an exponential curve for negative input values, which keeps the gradient non-zero. This helps the network nudge weights and biases in the right direction.

The limitations of the ELU function are as follows:
- It increases the computational time because of the exponential operation included
- The $\alpha$ parameter is not learnt, unlike the slope in Parametric ReLU
- Exploding gradient problem

![image.png](attachment:image.png)

Mathematically, the derivative of ELU can be represented as:

$$ f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ f(x) + \alpha & \text{if } x \leq 0 \end{cases} $$

Where $\alpha$ is the same hyperparameter as in the definition of ELU, since $\alpha e^x = f(x) + \alpha$ when $x$ is less than or equal to 0.
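
The derivative of ELU is 1 for positive inputs and $f(x) + \alpha$ otherwise, because $\alpha e^x = \alpha(e^x - 1) + \alpha$. The sketch below checks this identity against a central finite difference; the function names and the default `alpha=1.0` are illustrative assumptions.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, elu(x) + alpha otherwise."""
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-3.0, -1.0, -0.1, 0.5, 2.0])
eps = 1e-6
# Central finite difference as an independent check of the analytic gradient.
numeric = (elu(x + eps) - elu(x - eps)) / (2 * eps)
print(np.allclose(numeric, elu_grad(x)))
```

Note that the gradient stays strictly positive for negative inputs, although it shrinks towards 0 as the output saturates near $-\alpha$.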