Activation functions help determine the output of a neural network. An activation function is attached to each neuron in the network and determines whether that neuron should be activated, based on whether the neuron’s input is relevant for the model’s prediction.

Some activation functions also normalize the output of each neuron to a bounded range, such as 0 to 1 or -1 to 1.

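In a neural network, inputs are fed into the neurons in the input layer. Each connection into a neuron has a weight. The neuron multiplies each input by its weight, sums the results (usually adding a bias term), and passes that sum through the activation function. The result is the output of the neuron, which is transferred to the next layer.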

The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold.
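As a rough illustration of the "gate" idea, the sketch below computes a single neuron's weighted sum and passes it through a step function. The names `step` and `neuron_output`, the bias term, and all numeric values are assumptions made for this example; they do not come from the notebook.

```python
def step(z, threshold=0.0):
    """Step activation: 1 when the weighted sum exceeds the threshold, else 0."""
    return 1 if z > threshold else 0


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical inputs and weights, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]

# Weighted sum is about 0.1, which is above the threshold: the neuron fires.
print(neuron_output(inputs, weights, bias=0.0, activation=step))   # 1
# A bias of -0.5 pushes the sum to about -0.4: the neuron stays off.
print(neuron_output(inputs, weights, bias=-0.5, activation=step))  # 0
```

Swapping `step` for any of the functions defined later (for example `sigmoid`) changes how the gate responds without touching the rest of the neuron.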

In [5]:
import math

A neuron's weighted sum can take any value from -infinity to infinity, which makes it hard to reach a decision in classification. For that reason, **an activation function is used**.

> **1. Sigmoid Function**

sigmoid(x) = 1/(1+e^(-x))<br>
The sigmoid function is also known as the logistic function.

In [3]:
def sigmoid(x):
    return 1 / (1+math.exp(-x))

In [4]:
sigmoid(100)

1.0

In [5]:
sigmoid(1)

0.7310585786300049

In [6]:
sigmoid(-21)

7.582560422162385e-10

In [7]:
sigmoid(.8)

0.6899744811276125

In [8]:
sigmoid(0)

0.5

> Sigmoid values range from 0 to 1.<br> The derivative of sigmoid ranges from 0 to 0.25.
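The 0.25 bound follows from the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), which peaks at x = 0. A minimal sketch to check this numerically; `sigmoid_derivative` is a helper name introduced here, not part of the original notebook:

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of sigmoid, written in terms of the sigmoid value itself."""
    s = sigmoid(x)
    return s * (1 - s)


print(sigmoid_derivative(0))    # 0.25, the maximum
print(sigmoid_derivative(10))   # very small: the function has saturated
print(sigmoid_derivative(-10))  # very small on the negative side as well
```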

* Advantages of the sigmoid function:

1. Smooth gradient, preventing “jumps” in output values.
2. Output values bound between 0 and 1, normalizing the output of each neuron.
3. Clear predictions, i.e., outputs very close to 1 or 0 for inputs of large magnitude.

* Sigmoid has three major disadvantages:

1. Prone to **vanishing gradients**: the derivative is at most 0.25 and approaches 0 for inputs of large magnitude
2. Function output is not zero-centered
3. Computing the exponential is relatively time-consuming
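To see why sigmoid is prone to vanishing gradients, consider a chain of sigmoid layers: backpropagation multiplies one derivative per layer, and each factor is at most 0.25. The sketch below is a simplification that assumes every weight equals 1 so that only the activation derivatives matter; `chained_gradient` is a hypothetical helper written for this example.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights assumed to be 1)."""
    grad = 1.0
    for z in pre_activations:
        grad *= sigmoid_derivative(z)
    return grad


# Best case: every layer sits at z = 0, where the derivative is largest.
best_case = chained_gradient([0.0] * 10)
print(best_case)                       # 0.25 ** 10, already below 1e-6
# Typical case: layers away from zero shrink the gradient even faster.
print(chained_gradient([2.0] * 10))
```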

> **2. tanh function**

tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))<br>
A common rule of thumb: use sigmoid in the output layer (for example, for binary classification) and prefer tanh elsewhere.

In [10]:
def tanh(x):
    return (math.exp(x)-math.exp(-x))/(math.exp(x)+math.exp(-x))

In [11]:
tanh(-56)

-1.0

In [12]:
tanh(50)

1.0

In [15]:
tanh(0)

0.0

In [13]:
tanh(1)

0.7615941559557649

> Values range from -1 to 1.<br> The derivative of tanh ranges from 0 to 1.<br>
**Zero-centered**
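The derivative range can be checked with the identity tanh'(x) = 1 - tanh(x)^2. The sketch below uses the standard library's `math.tanh`, which also avoids the overflow that a hand-written formula built on `math.exp` hits for very large inputs; `tanh_derivative` is a name introduced here for illustration.

```python
import math


def tanh_derivative(x):
    """Derivative of tanh, written in terms of the tanh value itself."""
    t = math.tanh(x)
    return 1 - t * t


print(tanh_derivative(0))   # 1.0, the maximum
print(tanh_derivative(3))   # close to 0: the function has saturated
print(math.tanh(1000))      # 1.0, with no overflow
```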


> **3. ReLU (Rectified Linear Unit) Function** - Most Popular


ReLU = max(0,x)<br>
derivative = 0 (for x < 0) or 1 (for x > 0)<br><br>
For hidden layers, if you are not sure which activation function to use, just use ReLU as your default choice.

In [16]:
def relu(x):
    return max(0,x)

In [17]:
relu(-7)

0

In [18]:
relu(20)

20

* Advantages:

1. When the input is positive, there is no gradient saturation problem.

2. Computation is much faster. ReLU is piecewise linear, so both the forward and the backward pass are much cheaper than with sigmoid and tanh, which need to compute exponentials.

* Disadvantages:

1. When the input is negative, ReLU outputs zero and is completely inactive. In forward propagation this is not a problem in itself: the neuron simply responds to some inputs and not to others. In backpropagation, however, the gradient for a negative input is exactly zero, so the neuron's weights receive no update, which resembles the saturation problem of sigmoid and tanh. A neuron whose input is negative for every training example stops learning altogether. (**Dead Neuron**)

2. The output of ReLU is either 0 or a positive number, which means that ReLU is not zero-centered.
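The dead neuron effect can be sketched with a single weight and one gradient step. Everything beyond `relu` is an assumption made for this example: the helper names, the learning rate, the upstream gradient of 1.0, and the convention of using 0 for the derivative at exactly x = 0, where it is undefined.

```python
def relu(x):
    return max(0, x)


def relu_derivative(x):
    # Undefined at exactly 0; this sketch uses 0 there by convention.
    return 1 if x > 0 else 0


def weight_update(weight, x, upstream_grad=1.0, lr=0.1):
    """One gradient-descent step for a neuron computing relu(weight * x)."""
    z = weight * x
    grad = upstream_grad * relu_derivative(z) * x
    return weight - lr * grad


# Positive pre-activation: the gradient flows and the weight changes.
print(weight_update(0.5, 2.0))
# Negative pre-activation: the gradient is zero and the weight is stuck.
print(weight_update(-0.5, 2.0))   # -0.5
```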

> **4. Leaky ReLU** - Commonly used ReLU variant


Leaky ReLU = max(0.1x, x), so that negative inputs keep a small nonzero slope (this addresses the dead neuron problem)<br>
derivative = 0.1 (for x < 0) or 1 (for x > 0)

In [19]:
def leaky_relu(x):
    return max(0.1*x,x)

In [22]:
leaky_relu(-100)

-10.0

In [23]:
leaky_relu(8)

8

* Advantages:
1. No dead neurons: the gradient stays nonzero for negative inputs
2. Computation is very fast
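A small sketch of why the neuron cannot die: the derivative for negative inputs equals the slope instead of 0. The `slope` parameter and the `leaky_relu_derivative` helper are additions for this example; the default of 0.1 matches the notebook's definition above, and other small values are also used in practice.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU with a configurable negative slope (0.1 as in the notebook)."""
    return x if x > 0 else slope * x


def leaky_relu_derivative(x, slope=0.1):
    """1 for positive inputs, the slope otherwise, so the gradient never becomes 0."""
    return 1.0 if x > 0 else slope


print(leaky_relu(-100))             # -10.0
print(leaky_relu_derivative(-100))  # 0.1
print(leaky_relu_derivative(8))     # 1.0
```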


> **5. ELU (Exponential Linear Unit)**

A combination of ReLU and an exponential.<br>
**Closer to zero-centered**: negative outputs pull the mean activation toward zero.



<pre>f(x) = x              ; if x > 0
     = alpha(e^x - 1) ; otherwise</pre>

Derivative:<br>
1 if x > 0, otherwise alpha * e^x (positive whenever alpha > 0)

In [6]:
def Elu(alpha,x):
    if x>0:
        return x
    else:
        return alpha*(math.exp(x)-1)

In [7]:
alpha = 0.01
print(Elu(alpha,5))
print(Elu(alpha,-2))

5
-0.008646647167633872
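The ELU derivative can be sketched the same way: it is 1 for positive inputs and alpha * e^x otherwise, which for x <= 0 equals f(x) + alpha. The lowercase names `elu` and `elu_derivative`, the argument order, and the default `alpha=1.0` are choices made for this example and differ from the notebook's `Elu(alpha, x)`.

```python
import math


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1)


def elu_derivative(x, alpha=1.0):
    """1 for positive inputs, alpha * e^x otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


print(elu_derivative(5))    # 1.0
print(elu_derivative(-2))   # e^-2, small but not zero
# For x <= 0 the derivative equals elu(x) + alpha, up to rounding.
print(elu(-2) + 1.0)
```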


> **6. PReLU (Parametric ReLU)**

<pre>
f(yᵢ) = yᵢ   ; if yᵢ > 0
      = aᵢyᵢ ; if yᵢ <= 0 ; aᵢ is a learnable parameter (like alpha)
</pre>

Above, yᵢ is the input on the ith channel and aᵢ is the slope applied to negative inputs.
* if aᵢ=0, f becomes ReLU
* if aᵢ is a small fixed positive constant, f becomes leaky ReLU
* if aᵢ is a learnable parameter, f becomes PReLU
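What "learnable" means can be sketched with one manual gradient step on the slope. For a negative input the derivative of f with respect to a is the input itself, so a is updated just like a weight. The helper names, the starting slope, the learning rate, and the upstream gradient of 1.0 are all hypothetical values picked for this example.

```python
def prelu(y, a):
    """PReLU for a single channel: y for positive inputs, a * y otherwise."""
    return y if y > 0 else a * y


def prelu_grad_a(y):
    """Derivative of prelu with respect to the slope a."""
    return 0.0 if y > 0 else y


a = 0.25         # hypothetical starting slope
y = -2.0         # a negative input, so the slope takes part in the output
upstream = 1.0   # assumed gradient arriving from the next layer
lr = 0.1         # assumed learning rate

print(prelu(y, a))                             # -0.5
a_new = a - lr * upstream * prelu_grad_a(y)    # the slope moves like any other weight
print(a_new)                                   # about 0.45
print(prelu(y, 0.0))                           # a = 0 recovers ReLU on negative inputs
```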

**Note**: Each of these activation functions has its own advantages and disadvantages. No general rule says which activation function will work well and which will not for a given problem. The best choice must be found by experiment.