# Activation Function

Activation functions help determine the output of a neural network. An activation function is attached to each neuron in the network and determines whether that neuron should be activated, based on whether its input is relevant for the model's prediction.  
In short, it is the function applied to a node's input to produce the node's output. It is also known as a transfer function.

> Activation functions also help normalize the output of each neuron, for example to a range between 0 and 1 or between -1 and 1.


![Screenshot%202021-06-03%20141833.png](attachment:Screenshot%202021-06-03%20141833.png)

In a neural network, inputs are fed into the neurons in the input layer. Each connection into a neuron has a weight; the neuron multiplies each input by its weight, sums the results (usually adding a bias), and passes the result on to the next layer.

The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold.


> Neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.

### Why do we use activation functions with neural networks?

An activation function determines the output of a neuron, such as a yes-or-no decision. Depending on the function, it maps the resulting values into a range such as 0 to 1 or -1 to 1.  
Activation functions (AFs) **perform computations in the hidden layers and then transfer the result to the output layer**. The primary purpose of AFs is to **introduce non-linear properties into the neural network.**  

They **convert the linear input signals of a node into non-linear output signals to facilitate the learning of high order polynomials that go beyond one degree for deep networks**. Most AFs used in practice are differentiable, at least almost everywhere, which is what lets them work during backpropagation.

### What is the need for non-linearity?

If activation functions are not applied, the output signal is a linear function of the input, that is, a polynomial of degree one. Linear equations are easy to solve, but they have limited expressive power and cannot learn complex functional mappings from data. Thus, without AFs, a neural network would be a linear regression model with limited abilities.  

This is certainly not what we want from a neural network. Neural networks are expected to model highly complicated relationships. Without AFs, they cannot learn and model complicated data such as images, speech, video, and audio.  

AFs help neural networks make sense of complicated, high-dimensional, non-linear data sets. Such networks have an intricate architecture: they contain multiple hidden layers between the input and output layers.  

![Screenshot%202021-06-03%20143022.png](attachment:Screenshot%202021-06-03%20143022.png)

Imagine a neural network without activation functions. In that case, every neuron only performs a linear transformation on its inputs using the weights and biases. Although linear transformations make the neural network simpler, this network would be less powerful and would not be able to learn complex patterns from the data.

A neural network without an activation function is essentially just a linear regression model, because a composition of linear transformations is itself a linear transformation.

Thus we apply a non-linear transformation to the inputs of the neuron, and this non-linearity is introduced by an activation function.
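
To make the "just a linear model" point concrete, here is a minimal sketch in plain Python. It stacks two single-neuron layers with no activation and shows that they compute the same thing as one linear layer. The weights, biases, and function names are illustrative assumptions, not values from any real network.

```python
def linear(w, b, x):
    """One neuron with no activation function: w * x + b."""
    return w * x + b

# Two stacked "layers" without an activation function (illustrative values).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    return linear(w2, b2, linear(w1, b1, x))

# The same mapping collapsed into a single linear layer:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def one_layer(x):
    return linear(w_combined, b_combined, x)

for x in (-2.0, 0.0, 1.0, 3.5):
    print(x, two_layers(x), one_layer(x))
```

However many such layers are stacked, the result can always be rewritten as a single `w * x + b`, so depth adds no expressive power until a non-linearity is placed between the layers.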

## Commonly used activation functions

###  Binary Step Function

The first thing that comes to mind for an activation function is a threshold-based classifier, i.e. whether or not the neuron should be activated based on the value from the linear transformation.  

In other words, if the input to the activation function is greater than a threshold, the neuron is activated; otherwise it is deactivated, i.e. its output is not passed on to the next hidden layer. Let us look at it mathematically:
![Screenshot%202021-06-03%20215801.png](attachment:Screenshot%202021-06-03%20215801.png)


![Screenshot%202021-06-03%20215850.png](attachment:Screenshot%202021-06-03%20215850.png)

![Screenshot%202021-06-03%20220004.png](attachment:Screenshot%202021-06-03%20220004.png)
Gradients are calculated to update the weights and biases during backpropagation. Since the gradient of the step function is zero everywhere except at the threshold (where it is undefined), the weights and biases don't update.
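
As a minimal sketch, the step function and its gradient can be written in Python as follows. The threshold of 0, the `>=` comparison, and the function names are assumptions made for illustration.

```python
def binary_step(x, threshold=0.0):
    """Return 1 if x reaches the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

def binary_step_gradient(x):
    """The step is flat on both sides of the threshold, so the gradient is 0.

    Exactly at the threshold the derivative is undefined; 0 is returned
    there by convention.
    """
    return 0.0

print(binary_step(5), binary_step(-1))  # 1 0
print(binary_step_gradient(5), binary_step_gradient(-1))
```

Because `binary_step_gradient` returns 0 for every input, any weight update that is proportional to it is also 0.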

### Linear or Identity Activation Function

We saw the problem with the step function: its gradient is zero, because the output does not vary with x on either side of the threshold. Instead of a binary function, we can use a linear function, defined as:  

**f(x)=ax**

![Screenshot%202021-06-03%20220200.png](attachment:Screenshot%202021-06-03%20220200.png)
**Output:  (16, -8)**
What do you think the derivative will be in this case? When we differentiate the function with respect to x, the result is the coefficient of x, which is a constant.  

**f'(x) = a**

![Screenshot%202021-06-03%20220308.png](attachment:Screenshot%202021-06-03%20220308.png)
Although the gradient here does not become zero, it is a constant that does not depend on the input value x at all. This implies that the weights and biases will be updated during backpropagation, but the updating factor is the same regardless of the input.  

In this scenario, the neural network will not really improve the error, since the gradient is the same for every iteration. The network will not be able to train well or capture complex patterns in the data. A linear function is therefore best reserved for simple tasks where interpretability is highly desired.  
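
A minimal sketch of the linear activation and its derivative. The slope `a = 4` and the inputs 4 and -2 are assumptions, chosen because they reproduce the output (16, -8) quoted above.

```python
def linear_function(x, a=4):
    """Linear (identity-like) activation: f(x) = a * x."""
    return a * x

def linear_gradient(x, a=4):
    """f'(x) = a, a constant that ignores x entirely."""
    return a

print(linear_function(4), linear_function(-2))      # 16 -8
print(linear_gradient(4), linear_gradient(-1000))   # 4 4
```

The second line of output shows the issue described above: the gradient is the same no matter what input the neuron receives.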

## Non-linear Activation Function

Non-linear activation functions are the most widely used activation functions. Non-linearity makes the graph look something like this:
![Screenshot%202021-06-03%20220954.png](attachment:Screenshot%202021-06-03%20220954.png)
It makes it easy for the model to generalize or adapt to a variety of data and to differentiate between outputs.  
The main terms needed to understand non-linear functions are:  
* Derivative or differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.
* Monotonic function: a function that is either entirely non-increasing or entirely non-decreasing.

Non-linear activation functions are mainly divided on the basis of their range or curves:  

### **Sigmoid or Logistic Activation Function**



The function formula and chart are as follows

![Screenshot%202021-06-03%20143633.png](attachment:Screenshot%202021-06-03%20143633.png)


The sigmoid function was the most frequently used activation function in the early days of deep learning. It is a smooth function that is easy to differentiate.

The sigmoid function is a non-linear AF used primarily in feedforward neural networks. It is a differentiable real function, defined for all real input values, with a positive derivative everywhere and a specific degree of smoothness. The sigmoid function appears in the output layer of deep learning models and is used for predicting probability-based outputs. It is one of the most widely used non-linear activation functions. Sigmoid maps its input to the range between 0 and 1, and its graph is 'S' shaped. The derivative of the sigmoid function is what learning algorithms use during training.
This essentially means that when multiple neurons use the sigmoid function as their activation function, the output is non-linear as well.

The function itself has certain defects.

1) When the input moves away from the origin, the gradient of the function becomes very small, almost zero. During backpropagation, the chain rule is used to compute the derivative of the loss with respect to each weight w. Whenever backpropagation passes through a sigmoid function, the factor it contributes to this chain is very small, and the chain may pass through many sigmoid functions. As a result, the computed gradient for the weight w becomes tiny, which is not conducive to optimizing the weight. This problem is called gradient saturation or the vanishing gradient problem.

2) <span class="mark">The function output is not zero-centered, which reduces the efficiency of weight updates.</span>

3) The sigmoid function performs exponential operations, which are slower for computers.


**Advantages of Sigmoid Function:**

1. Smooth gradient, preventing "jumps" in output values.
2. Output values bound between 0 and 1, normalizing the output of each neuron.
3. Clear predictions, i.e. very close to 1 or 0.


**Disadvantages of Sigmoid Function:**
* Prone to vanishing gradients
* <span class="mark">Function output is not zero-centered. If the values passed to the next layer are not zero-centered, the model takes more time to converge and to reach a minimum.</span>
* Exponential operations are relatively time-consuming
* Slow convergence


![Screenshot%202021-06-03%20144546.png](attachment:Screenshot%202021-06-03%20144546.png)

![Screenshot%202021-06-03%20144746.png](attachment:Screenshot%202021-06-03%20144746.png)
The gradient values are significant in the range -3 to 3, but the graph gets much flatter in other regions. This implies that inputs greater than 3 or less than -3 have very small gradients. As the gradient value approaches zero, the network is not really learning.  

Additionally, the sigmoid function is not symmetric around zero, so the outputs of all the neurons have the same sign. This can be addressed by scaling and shifting the sigmoid function, which is exactly what the tanh function does.
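
The behaviour described above can be checked with a short sketch in plain Python. It uses the standard identity that the derivative of the sigmoid is `s * (1 - s)`, where `s` is the sigmoid output; the function names and sample inputs are illustrative.

```python
import math

def sigmoid(x):
    """Logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for x = 0 and shrinks quickly outside [-3, 3].
for x in (-6, -3, 0, 3, 6):
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 4))
```

Every output is positive, which is the "not zero-centered" property, and the derivative column falls toward zero as the input moves away from the origin.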

### **Tanh or hyperbolic tangent Activation Function**

The tanh function is another type of AF. It is a smooth, zero-centered function with a range between -1 and 1. The tanh function formula and curve are as follows:

![Screenshot%202021-06-03%20145713.png](attachment:Screenshot%202021-06-03%20145713.png)

The tanh function is much more extensively used than the sigmoid function since it delivers better training performance for multilayer neural networks. The <span class="mark">biggest advantage of the tanh function is that it produces a zero-centered output</span>, thereby supporting the backpropagation process. The tanh function has been mostly used in recurrent neural networks for natural language processing and speech recognition tasks.  

However, <span class="mark">the tanh function, too, has a limitation: just like the sigmoid function, it cannot solve the vanishing gradient problem</span>. Also, the tanh function attains its maximum gradient of 1 only when the input value is 0. Everywhere else the gradient is smaller, so neurons whose inputs drift far from zero saturate and can effectively stop learning.   


Tanh is the hyperbolic tangent function. The curves of the tanh function and the sigmoid function are relatively similar, so let's compare them. For both, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference is the output interval. 

The output interval of tanh is (-1, 1), and the whole function is zero-centered, which is better than sigmoid.

In general binary classification problems, the tanh function is used for the hidden layers and the sigmoid function is used for the output layer. However, these choices are not fixed: the specific activation function to use must be analyzed according to the specific problem, or found by experimentation.
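
The relationship between the two functions can be sketched in plain Python. The code below uses the standard identity tanh(x) = 2·sigmoid(2x) − 1, which is one way to read "tanh is a scaled and shifted sigmoid"; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh expressed as a rescaled sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2, which equals 1 only at x = 0."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, round(tanh_from_sigmoid(x), 4), round(tanh_derivative(x), 4))
```

The outputs are symmetric around zero, while the derivative still shrinks toward zero for large positive or negative inputs, so saturation remains.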



![Screenshot%202021-06-03%20150055.png](attachment:Screenshot%202021-06-03%20150055.png)

### **ReLU function**

The ReLU function is another non-linear activation function that has gained popularity in the deep learning domain. ReLU stands for Rectified Linear Unit. <span class="mark">The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time.</span>  
This means that a neuron is deactivated only if the output of the linear transformation is less than 0. The ReLU function formula and curve are as follows:

![Screenshot%202021-06-03%20150713.png](attachment:Screenshot%202021-06-03%20150713.png)
Range: [0, ∞)

<span class="mark">For negative input values, the result is zero, which means the neuron does not get activated</span>. Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient than the sigmoid and tanh functions.  
The ReLU function simply takes the maximum of 0 and its input. Note that it is not differentiable over its whole domain (the derivative is undefined at 0), but we can take a sub-gradient, as shown in the figure above. Although ReLU is simple, it has been an important development in recent years. 

The ReLU (Rectified Linear Unit) function is currently the more popular activation function. Compared with the sigmoid and tanh functions, it has the following **advantages, which explain why it is more widely used than other activation functions:**

1) When the input is positive, there is no gradient saturation problem. The biggest advantage of ReLU is indeed the non-saturation of its gradient, which greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid and tanh functions.  
2) ReLU does not activate all the neurons at the same time. As a result, during backpropagation, the weights and biases of inactive neurons are not updated.

3) The calculation is much faster. The ReLU function is piecewise linear, so both the forward and backward passes are much faster than with sigmoid and tanh, which need to compute exponentials.

**Disadvantages:**

1) When the input is negative, ReLU is completely inactive: for any negative input, the output is zero. In forward propagation this is not a problem, since some regions are sensitive and some are insensitive. But in backpropagation, the gradient for a negative input is exactly zero, which is the same kind of problem that the sigmoid and tanh functions have.

2) The output of the ReLU function is either 0 or a positive number, which means that ReLU is not a zero-centered function.
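
A minimal sketch of ReLU and its sub-gradient in plain Python. Returning 0 for the gradient at exactly x = 0 is a convention assumed here, since the derivative is undefined at that point.

```python
def relu(x):
    """ReLU: the maximum of 0 and the input."""
    return max(0.0, x)

def relu_gradient(x):
    """Sub-gradient of ReLU: 1 for positive inputs, 0 otherwise.

    The value at x == 0 is a convention; the true derivative is undefined there.
    """
    return 1.0 if x > 0 else 0.0

for x in (-5.0, -0.1, 0.0, 0.1, 5.0):
    print(x, relu(x), relu_gradient(x))
```

For positive inputs the gradient is always 1, so it does not saturate; for negative inputs it is 0, which is the source of the "dead neuron" disadvantage listed above.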

![Screenshot%202021-06-03%20151622.png](attachment:Screenshot%202021-06-03%20151622.png)


![Screenshot%202021-06-03%20151719.png](attachment:Screenshot%202021-06-03%20151719.png)

#### What Is the Saturating Gradient Problem?
According to the Cambridge Dictionary, saturation means "the act or result of filling a thing or place completely so that no more can be added".  

In this context, it refers to a function for which a bigger input will not lead to a relevant increase in output. So if the **gradient is saturated (meaning it is extremely close to zero)**, a bigger upstream gradient doesn't lead to a bigger current gradient when applying the chain rule.  

**When gradient saturation occurs, the model does not learn properly and therefore fails to predict well.** 

In neural networks, activation functions such as the logistic (sigmoid) and hyperbolic tangent functions map any real value to a bounded range. For example, the sigmoid function, S(x) = 1/(1 + e^(-x)), maps real values x to values between 0 and 1. To approach the boundaries 0 or 1, inputs x of large magnitude (negative or positive) are required. A neuron is therefore said to be saturated when extremely large weights cause it to produce outputs very close to the range boundary, where the gradient is close to zero. If the gradient is constantly 0, no learning will take place in the neural network. Likewise, if the gradient is constantly 1, it most likely means that the neuron is over-fitting on training data and will likely perform poorly on test data.   

If you use sigmoid-like activation functions, such as sigmoid and tanh, then after some epochs of training the linear part of each neuron will have values that are very large in magnitude, whether positive or negative. Consequently, the input to the sigmoid-like function in each neuron, which adds the non-linearity, will be far from the center of that function.
In those regions, the gradient is very small, so after numerous iterations the weights are updated very slowly. This is why we use the ReLU activation function, whose gradient does not have this problem for positive inputs. In other words, saturating means that after some epochs in which learning happens relatively fast, the value of the linear part moves far from the center of the sigmoid, and from then on it takes a long time to update the weights because the gradient is small.
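
To see how quickly the chain rule shrinks a gradient, here is a rough sketch in plain Python. It multiplies the sigmoid derivatives along a chain of layers and deliberately ignores the weights, which is a simplifying assumption; the depth of 10 and the pre-activation values are illustrative.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def gradient_through_layers(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_derivative(z)
    return product

# Best case: every neuron sits at the center of the sigmoid (derivative 0.25).
best_case = gradient_through_layers([0.0] * 10)
# Saturated case: every neuron receives a large input.
saturated = gradient_through_layers([5.0] * 10)

print(best_case)   # 0.25 ** 10, already below one millionth
print(saturated)   # many orders of magnitude smaller still
```

Even in the best case the factor passed back through ten sigmoid layers is 0.25 to the power 10, so the earliest layers receive almost no learning signal.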

https://datascience.stackexchange.com/questions/27665/what-is-saturating-gradient-problem

#### Why is ReLU not differentiable?
ReLU is differentiable at every point except 0. At z = 0 the left derivative is 0 and the right derivative is 1, so the two do not agree and the derivative is undefined there.

#### Why does ReLU die?
If our learning rate (α) is set too high, there is a significant chance that our new weights will end up in the highly negative value range, since a large number is subtracted from our old weights. These negative weights result in negative inputs for ReLU, thereby causing the dying ReLU problem. Once the inputs are negative, the gradient is zero, so the weights stop updating and the neuron cannot recover.

### Leaky ReLU function
Leaky ReLU is an improved version of the ReLU function. As we saw, for the ReLU function the gradient is 0 for x<0, which deactivates the neurons in that region.  
Leaky ReLU is an attempt to solve this dying ReLU problem: instead of outputting 0 for negative inputs, it outputs a·x with a small slope a.  
The leak extends the range of the ReLU function. Usually, the value of a is 0.01 or so.  
When a is instead sampled at random during training, the function is called Randomized ReLU.  
Therefore the range of Leaky ReLU is (-infinity, infinity).  
![Screenshot%202021-06-03%20152647.png](attachment:Screenshot%202021-06-03%20152647.png)
By making this small modification, the gradient of the left side of the graph comes out to be a non zero value. Hence we would no longer encounter dead neurons in that region. Here is the derivative of the Leaky ReLU function  



![Screenshot%202021-06-03%20152803.png](attachment:Screenshot%202021-06-03%20152803.png)

![Screenshot%202021-06-03%20152958.png](attachment:Screenshot%202021-06-03%20152958.png)

To solve the dead ReLU problem, it was proposed to set the negative half of ReLU to 0.01x instead of 0. Another intuitive idea is a parameter-based method, Parametric ReLU: f(x) = max(αx, x), where α can be learned through backpropagation. In theory, Leaky ReLU has all the advantages of ReLU and no dead ReLU problem, but in practice it has not been fully proven that Leaky ReLU is always better than ReLU.
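
A minimal sketch of Leaky ReLU and its derivative in plain Python, using the slope a = 0.01 mentioned above. Treating x = 0 as part of the positive branch is a convention assumed here.

```python
def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for non-negative inputs, a * x for negative inputs."""
    return x if x >= 0 else a * x

def leaky_relu_gradient(x, a=0.01):
    """Derivative: 1 on the positive side, the small slope a on the negative side."""
    return 1.0 if x >= 0 else a

for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(x, leaky_relu(x), leaky_relu_gradient(x))
```

The gradient on the negative side is small but never zero, so a neuron that receives negative inputs can still be updated.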

### ELU (Exponential Linear Units) function
Exponential Linear Unit, or ELU for short, is also a variant of the Rectified Linear Unit (ReLU) that modifies the slope of the negative part of the function. Unlike the Leaky ReLU and Parametric ReLU functions, which use a straight line, ELU uses an exponential curve to define the negative values. It is defined as:

![Screenshot%202021-06-03%20154137.png](attachment:Screenshot%202021-06-03%20154137.png)

![Screenshot%202021-06-03%20154216.png](attachment:Screenshot%202021-06-03%20154216.png)
ELU was also proposed to solve the problems of ReLU. ELU has all the advantages of ReLU, and:

* No Dead ReLU issues
* The mean of the output is close to 0, zero-centered

One small problem is that it is slightly more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
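
A minimal sketch of ELU in plain Python, assuming the common form f(x) = x for x > 0 and f(x) = α(e^x − 1) otherwise. The default α = 1.0 is an assumption for illustration.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_gradient(x, alpha=1.0):
    """Derivative: 1 for positive inputs, alpha * e^x otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, round(elu(x), 4), round(elu_gradient(x), 4))
```

Negative inputs produce negative outputs that level off near -α, which pulls the mean output toward zero, and the exponential is the extra computation mentioned above.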

### PReLU (Parametric ReLU)
This is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis. The parameterised ReLU, as the name suggests, introduces a new parameter as the slope of the negative part of the function. Here's how the ReLU function is modified to incorporate the slope parameter:

![Screenshot%202021-06-03%20154902.png](attachment:Screenshot%202021-06-03%20154902.png)
When the value of a is fixed to 0.01, the function acts as a Leaky ReLU function. However, in a parameterised ReLU function, 'a' is also a trainable parameter. The network learns the value of 'a' for faster and better convergence.

The derivative of the function is the same as for the Leaky ReLU function, except that the value 0.01 is replaced with the value of a:

$$
f'(x) = \begin{cases} 1, & x \ge 0 \\ a, & x < 0 \end{cases}
$$

The parameterized ReLU function is used when the Leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.

PReLU is also an improved version of ReLU. In the negative region, PReLU has a small slope, which also avoids the dying ReLU problem. Compared to ELU, PReLU is a linear operation in the negative region. Although the slope is small, it does not tend to 0, which is a certain advantage.


Looking at the formula of PReLU, the parameter α is generally a number between 0 and 1, and it is usually relatively small. When α is fixed at 0.01, PReLU reduces to Leaky ReLU, which can be regarded as a special case of PReLU.

Above, yᵢ is any input on the ith channel and aᵢ is the negative slope, which is a learnable parameter.
* if aᵢ = 0, f becomes ReLU
* if aᵢ is a small fixed positive value (such as 0.01), f becomes Leaky ReLU
* if aᵢ is a learnable parameter, f becomes PReLU
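
What "learnable" means for the slope can be sketched in plain Python. The code below computes the derivative of the PReLU output with respect to a and applies one gradient-descent update. The starting slope, the learning rate, and the upstream gradient are hypothetical values chosen for illustration.

```python
def prelu(x, a):
    """PReLU: x for non-negative inputs, a * x for negative inputs."""
    return x if x >= 0 else a * x

def prelu_gradient_wrt_a(x):
    """d f / d a: 0 on the positive side, x on the negative side."""
    return 0.0 if x >= 0 else x

# One illustrative update of the slope parameter.
a = 0.25                 # hypothetical starting slope
learning_rate = 0.1      # hypothetical learning rate
x = -2.0                 # a negative input, so the slope matters
upstream_gradient = 1.0  # hypothetical dLoss/df arriving from later layers

a_new = a - learning_rate * upstream_gradient * prelu_gradient_wrt_a(x)
print(prelu(x, a), a_new)
```

Only negative inputs contribute to the update of a, because the slope has no effect on the output for non-negative inputs.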

### Swish (A Self-Gated) Function
Swish is a lesser-known activation function that was discovered by researchers at Google. Swish is described as comparable to ReLU in computational cost and shows better performance than ReLU on deeper models. Swish is unbounded above and bounded below: its values range from about -0.28 to infinity. The function is defined as  

![Screenshot%202021-06-03%20210233.png](attachment:Screenshot%202021-06-03%20210233.png)

The formula is:

$$
f(x) = x \cdot \mathrm{sigmoid}(x) = \frac{x}{1 + e^{-x}}
$$
![Screenshot%202021-06-03%20211321.png](attachment:Screenshot%202021-06-03%20211321.png)

As you can see, the curve of the function is smooth and the function is differentiable at all points. This is helpful during the model optimization process and is considered to be one of the reasons that Swish outperforms ReLU.   
 
A unique fact about this function is that the Swish function is not monotonic. This means that the value of the function may decrease even when the input values are increasing. Let's look at the Python code for the Swish function:   

    import numpy as np

    def swish_function(x):
        # Swish is x * sigmoid(x), i.e. x / (1 + e^(-x)); note the plus sign.
        return x / (1 + np.exp(-x))

    swish_function(-67), swish_function(4)

Output (approximately):

    (-5.35e-28, 3.928)

Swish's design was inspired by the use of sigmoid functions for gating in LSTMs and highway networks. It uses the same value as both the input and the gate, which simplifies the gating mechanism and is called **self-gating**. Swish is reported to pay off mainly on deep models, with more than about 40 layers.

The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.

1) Being unbounded above helps prevent the gradient from gradually approaching 0 during slow training, which causes saturation. At the same time, being bounded below has advantages, because functions that are bounded below can have a strong regularization effect, and large negative inputs are squashed toward zero.

2) Smoothness also plays an important role in optimization and generalization.

###  Softmax

Softmax function is often described as a combination of multiple sigmoids. We know that sigmoid returns values between 0 and 1, which can be treated as probabilities of a data point belonging to a particular class. Thus sigmoid is widely used for binary classification problems.  

The softmax function can be used for multiclass classification problems. This function returns the probability of a data point belonging to each individual class. Here is its mathematical expression:

![Screenshot%202021-06-03%20212846.png](attachment:Screenshot%202021-06-03%20212846.png)

While building a network for a multiclass problem, the output layer would have as many neurons as the number of classes in the target. For instance if you have three classes, there would be three neurons in the output layer. Suppose you got the output from the neurons as [1.2 , 0.9 , 0.75]. 

Applying the softmax function over these values, you will get the following result – [0.42 ,  0.31, 0.27]. These represent the probability for the data point belonging to each class. Note that the sum of all the values is 1. Let us code this in python
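
A minimal sketch of that calculation in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result.

```python
import math

def softmax(values):
    """Softmax: exponentiate each value and normalize so the results sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow for large inputs
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.2, 0.9, 0.75])
print([round(p, 2) for p in probabilities])  # [0.42, 0.31, 0.27]
print(sum(probabilities))
```

The largest input receives the largest probability, and the three values sum to 1 up to floating-point rounding.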


###   Maxout

One relatively popular choice is the Maxout neuron (introduced by Goodfellow et al.), which generalizes the ReLU and its leaky version. A Maxout neuron computes max(w1·x + b1, w2·x + b2). Notice that both ReLU and Leaky ReLU are special cases of this form (for example, for ReLU we have w1, b1 = 0). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks.

The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a learnable activation function.

Maxout can be seen as adding a layer of activation function to the deep learning network, which contains a parameter k. Compared with ReLU, sigmoid, etc., this layer is special in that it adds k neurons and then outputs the largest activation value.


![Screenshot%202021-06-03%20214725.png](attachment:Screenshot%202021-06-03%20214725.png)

A maxout activation function takes the maximum of its inputs, which is why it is called maxout.
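
As a rough sketch, maxout over k linear pieces can be written for a scalar input in plain Python, with ReLU and Leaky ReLU recovered as special cases. Restricting the input to a scalar and the particular weights used are simplifying assumptions for illustration; in a real network the weights are learned.

```python
def maxout(x, weights, biases):
    """Maxout over k affine pieces of a scalar input: max_i (w_i * x + b_i)."""
    return max(w * x + b for w, b in zip(weights, biases))

def relu_via_maxout(x):
    # Two pieces: the constant 0 (w1 = 0, b1 = 0) and the identity (w2 = 1, b2 = 0).
    return maxout(x, weights=[0.0, 1.0], biases=[0.0, 0.0])

def leaky_relu_via_maxout(x, a=0.01):
    # Two pieces: a * x and x. For 0 < a < 1 the max is x when x >= 0, else a * x.
    return maxout(x, weights=[a, 1.0], biases=[0.0, 0.0])

for x in (-3.0, 0.0, 2.0):
    print(x, relu_via_maxout(x), leaky_relu_via_maxout(x))
```

With k = 2 and these fixed weights the unit behaves like ReLU or Leaky ReLU; letting the weights and biases be learned is what makes maxout a learnable activation function.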

A maxout layer is simply a layer where the activation function is the max of the inputs. As stated in the paper, even an MLP with 2 maxout units can approximate any continuous function arbitrarily well, given enough linear pieces per unit. They give a couple of reasons as to why maxout may be performing well, but the main reason they give is the following:  

Dropout can be thought of as a form of model averaging in which a random subnetwork is trained at every iteration and in the end the weights of the different random networks are averaged. Since one cannot average the weights explicitly, an approximation is used. This approximation is exact for a linear network.   
In maxout, they do not drop the inputs to the maxout layer. Thus the identity of the input outputting the max value for a data point remains unchanged. The dropout therefore only happens in the linear part of the MLP, but one can still approximate any function because of the maxout layer.  
As the dropout happens in the linear part only, they conjecture that this leads to more efficient model averaging, as the averaging approximation is exact for linear networks. 
![Screenshot%202021-06-03%20215332.png](attachment:Screenshot%202021-06-03%20215332.png)

###  Softplus

![Screenshot%202021-06-03%20212228.png](attachment:Screenshot%202021-06-03%20212228.png)

The softplus function is similar to the ReLU function, but it is relatively smooth. Like ReLU, it suppresses one side (negative inputs). Its output range is (0, +inf).

Softplus function: **f(x) = ln(1 + exp(x))**  
The derivative of softplus is **f′(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x))**, which is the logistic (sigmoid) function.
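
A minimal sketch in plain Python that implements softplus and checks numerically that its derivative matches the logistic function. The finite-difference step size and the test point are illustrative choices.

```python
import math

def softplus(x):
    """Softplus: ln(1 + e^x)."""
    return math.log1p(math.exp(x))

def softplus_derivative(x):
    """Analytic derivative of softplus, which is the logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

# Compare the analytic derivative with a central finite difference at x = 1.
h = 1e-6
numeric = (softplus(1.0 + h) - softplus(1.0 - h)) / (2 * h)
print(numeric, softplus_derivative(1.0))
```

The two printed values agree to several decimal places, which supports the identity stated above.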



## **------------------------------------------------------NOTE--------------------------------------------------------------**

## **Generally speaking, these activation functions each have their own advantages and disadvantages. No rule states in advance which activation functions will work well and which will not; this has to be established by experiment.**

* Sigmoid functions and their combinations generally work better in the case of classifiers.
* Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
* The ReLU function is a general-purpose activation function and is used in most cases these days.
* If we encounter dead neurons in our networks, the Leaky ReLU function is a good first choice.
* Always keep in mind that the ReLU function should only be used in the hidden layers.
* As a rule of thumb, you can begin with the ReLU function and then move to other activation functions if ReLU doesn't provide optimum results.

## Why is the derivative (differentiation) used?
When updating the curve, we need to know in which direction and by how much to change it, and both depend on the slope. That is why differentiation is used in almost every part of machine learning and deep learning.

![Screenshot%202021-06-03%20224059.png](attachment:Screenshot%202021-06-03%20224059.png)
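
The role of the slope can be sketched with a tiny gradient-descent loop in plain Python. The loss function (w − 3)², the learning rate, and the number of steps are all hypothetical choices for illustration.

```python
def loss(w):
    """A toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def loss_gradient(w):
    """Derivative of the toy loss: its sign gives the direction, its size the step."""
    return 2.0 * (w - 3.0)

w = 0.0              # hypothetical starting weight
learning_rate = 0.1  # hypothetical learning rate

for step in range(50):
    # Move against the slope: downhill on the loss curve.
    w -= learning_rate * loss_gradient(w)

print(w, loss(w))
```

When the slope is negative the weight increases, when it is positive the weight decreases, and the updates shrink as the slope flattens near the minimum.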