# Activation Functions

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Sigmoid Function

The sigmoid function produces an "S" shaped curve. It maps any real value to a value between 0 and 1.

Equation: f(x) = 1/(1+$e^{-x}$)

Range: (0,1)

Equation Graph 

<img src="Desktop/Activation_function_logo/sigmoid.jpg" style="width:200px;height:200px"/>

Derivative: f'(x) = (1 / (1+$e^{-x}$)) * (1 - 1 / (1+$e^{-x}$)) = f(x) * (1 - f(x))

Derivative graph

<img src="Desktop/Activation_function_logo/sigmoid_der.jpg" style="width:200px;height:200px"/>

Advantage:
1. The function is differentiable, meaning we can find the slope of the sigmoid curve at any point.
2. Output values lie between 0 and 1, normalizing the output of each neuron.



Disadvantage:
1. It suffers from vanishing gradients: for very high or very low values of x, there is almost no change in the prediction.
2. The output is not zero centered.
3. It is computationally expensive.
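
As a rough illustration of the sigmoid equation and its derivative, here is a minimal Python sketch that uses only the standard library. The names `sigmoid` and `sigmoid_derivative` are illustrative, and the branch on the sign of `x` is an added assumption to avoid overflow for large negative inputs, not part of the formula itself.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), arranged to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) < 1 and cannot overflow
    return z / (1.0 + z)


def sigmoid_derivative(x: float) -> float:
    """Derivative f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative is largest at 0 and shrinks toward 0 for large |x|,
# which is the vanishing gradient behaviour described above.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_derivative(x))
```

The derivative peaks at 0.25 when x = 0, so gradients passed through several sigmoid layers can shrink quickly.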

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Tanh / Hyperbolic tangent

Tanh is very similar to the sigmoid function; the only difference is that it is symmetric around the origin. Its values range from -1 to 1, so the inputs to the next layer will not always have the same sign.

Equation: f(x) = n = tanh(x) = ($e^{x}$ - $e^{-x}$) / ($e^{x}$ + $e^{-x}$)

Range: (-1,1)

Equation Graph

<img src="Desktop/Activation_function_logo/tanh.jpg" style="width:200px;height:200px"/>


Derivative: f'(x) = 1 - $n^{2}$, where n = tanh(x)

Derivative graph:

<img src="Desktop/Activation_function_logo/tanh_der.jpg" style="width:200px;height:200px"/>

Advantage:
1. It is zero centered, which makes it easier to model inputs that have strongly negative, neutral, and strongly positive values.
2. It generally works better than the sigmoid function because its output is zero centered.

Disadvantage:
1. It also suffers from the vanishing gradient problem.
2. It also converges slowly.
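
The tanh derivative can be sketched in a few lines of Python. This is only an illustration: `tanh_derivative` is a hypothetical helper name, while `math.tanh` comes from the standard library.

```python
import math


def tanh_derivative(x: float) -> float:
    """Derivative f'(x) = 1 - n^2, where n = tanh(x)."""
    n = math.tanh(x)
    return 1.0 - n * n


# Outputs are symmetric around the origin, and the derivative
# approaches 0 for large |x| (vanishing gradient).
for x in (-3.0, 0.0, 3.0):
    print(x, math.tanh(x), tanh_derivative(x))
```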

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# ReLU (Rectified Linear Unit)

ReLU is one of the most common activation functions. Its main advantage is that it doesn't activate all the neurons at the same time: a neuron is deactivated if the output of the linear transformation is less than 0.

Equation: f(x) = max(0,x)

Range: [0,+${\infty}$)

Equation Graph:

<img src="Desktop/Activation_function_logo/relu.jpg" style="width:200px;height:200px"/>

Derivative: 
<br>
f'(x) = 1, if x >= 0 (strictly, the derivative is undefined at x = 0; using 1 there is a convention)
<br>
        0, if x < 0
                    
Derivative Graph:

<img src="Desktop/Activation_function_logo/relu_der.jpg" style="width:200px;height:200px"/>

Advantage:
1. Computationally efficient, allows the network to converge very quickly.
2. It has a derivative function and allows for back-propagation.

Disadvantage:
1. When inputs are negative, the gradient of the function is zero.
2. For negative values, no gradient flows back during back-propagation, so the affected neurons stop learning (the dying ReLU problem).
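
A minimal Python sketch of ReLU and its derivative follows. The function names are illustrative, and the value of the derivative at exactly x = 0 is a convention (taken here as 1, matching the piecewise definition above).

```python
def relu(x: float) -> float:
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_derivative(x: float) -> float:
    """Slope of ReLU: 1 for x >= 0 (by convention at 0), otherwise 0."""
    return 1.0 if x >= 0 else 0.0


# Negative inputs give both a zero output and a zero gradient,
# so no learning signal passes through for those inputs.
for x in (-2.0, 0.0, 3.5):
    print(x, relu(x), relu_derivative(x))
```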

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Leaky ReLU

Leaky ReLU is one attempt to fix the dying ReLU problem. Leaky units have a very small gradient instead of a zero gradient when the input is negative, giving the network a chance to continue learning.

Equation: f(x) = max(0.01x, x)

Range: (-${\infty}$, +${\infty}$)

Equation Graph:

<img src="Desktop/Activation_function_logo/lrelu.jpg" style="width:300px;height:200px"/>

Derivative: 
<br>
f'(x) = 0.01  if x<0
<br>
        1    otherwise
                     
Derivative Graph:

<img src="Desktop/Activation_function_logo/lrelu_der.jpg" style="width:200px;height:200px"/>

Advantage:
1. This variation of ReLU has a small positive slope in the negative region, so it enables back-propagation even for negative input values.

Disadvantage:
1. It doesn't provide consistent predictions for negative input values.
2. If the learning rate is very high, updates can overshoot and kill the neuron.
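
The following is a small Python sketch of Leaky ReLU under the assumption that the negative slope is exposed as a parameter `alpha` (a hypothetical name) with the default 0.01 used in the equation above.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: equals max(alpha * x, x) when 0 < alpha < 1."""
    return x if x >= 0 else alpha * x


def leaky_relu_derivative(x: float, alpha: float = 0.01) -> float:
    """Slope: alpha for x < 0, otherwise 1."""
    return alpha if x < 0 else 1.0


# Unlike ReLU, negative inputs keep a small non-zero gradient.
for x in (-100.0, -1.0, 0.0, 5.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```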

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Exponential Linear Unit (ELU)

ELU is very similar to ReLU except for negative inputs. For negative inputs, ELU curves smoothly and saturates toward -a, whereas ReLU has a sharp corner at zero and outputs exactly 0. It tends to drive the cost toward zero faster and produce more accurate results.

Equation:
<br>
f(x) = x            if x > 0
<br>
a.($e^{x}$  - 1)  if x <= 0

Equation Graph:

<img src="Desktop/Activation_function_logo/ELU.jpg" style="width:200px;height:200px"/>

Derivative:
<br>
f'(x) = 1  if x > 0
<br>
a.($e^{x}$  - 1) + a = a.$e^{x}$  if x <= 0

Derivative Graph:
<img src="Desktop/Activation_function_logo/elu_der.jpg" style="width:200px;height:200px"/>

Advantage:
1. Avoids the dead ReLU problem and produces negative outputs, which help the network nudge the weights and biases in the right direction.
2. Produces activations instead of letting them be zero when calculating the gradient.

Disadvantage:
1. Introduces longer computation time because of the exponential operation.
2. It does not avoid the exploding gradient problem.
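
A hedged Python sketch of ELU and its derivative is shown here. The default `a = 1.0` is an assumption chosen for illustration, and `math.expm1` is used as a numerically careful way to compute e^x - 1.

```python
import math


def elu(x: float, a: float = 1.0) -> float:
    """ELU: x for x > 0, otherwise a * (e^x - 1)."""
    return x if x > 0 else a * math.expm1(x)


def elu_derivative(x: float, a: float = 1.0) -> float:
    """Slope: 1 for x > 0, otherwise a * e^x (equal to elu(x) + a)."""
    return 1.0 if x > 0 else a * math.exp(x)


# For very negative inputs the output saturates near -a.
for x in (-50.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_derivative(x))
```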

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Linear

It is a straight-line function, where the activation is proportional to the input (the weighted sum from the neuron).

Equation: f(x) = m * x, where x is the input and m is a constant slope

Equation Graph:
<img src="Desktop/Activation_function_logo/linear.jpg" style="width:200px;height:200px"/>

Derivative: f'(x) = m

Derivative Graph:

<img src="Desktop/Activation_function_logo/linear_der.jpg" style="width:200px;height:200px"/>

Advantage:
1. It gives a range of activations, so it is not a binary activation.
2. We can connect a few neurons together, and if more than one fires, we can take the max and decide based on that.

Disadvantage:
1. The derivative is constant, so the gradient has no relationship with x.
2. If there is an error in prediction, the changes made by back-propagation are constant and do not depend on the change in input, Δx.

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Swish

Swish is formulated as x times sigmoid(x). ReLU outputs 0 for negative inputs, so no gradient can be back-propagated through them; swish partially handles this problem.

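Equation: f(x) = x / (1+$e^{-x}$)
<br>
y = x * $\sigma(x)$, where $\sigma(x)$ = 1/(1+$e^{-x}$) is the sigmoid function

Equation Graph:

<img src="Desktop/Activation_function_logo/swish.jpg" style="width:300px;height:200px"/>

Derivative: y' = x' * $\sigma(x)$ + x * $\sigma'(x)$ = $\sigma(x)$ + x * $\sigma(x)$ * (1 - $\sigma(x)$)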

Derivative Graph:

<img src="Desktop/Activation_function_logo/swish_der.jpg" style="width:300px;height:200px"/>

Advantage:
1. It can mitigate the vanishing gradient problem.
2. It works better than ReLU to some extent.

Disadvantage:
1. Computation is considerably more expensive for both the forward pass and back-propagation.
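
As a sketch of how swish and its product-rule derivative could be computed, the following Python uses only the standard library. The helper names are illustrative, and the overflow-safe form of `sigmoid` is an added assumption rather than part of the definition.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


def swish_derivative(x: float) -> float:
    """Product rule: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


# Small negative inputs produce small negative outputs with a
# non-zero gradient, unlike ReLU.
for x in (-5.0, -1.0, 0.0, 1.0):
    print(x, swish(x), swish_derivative(x))
```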

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Softplus

Softplus is a newer function than tanh and sigmoid. It is an alternative to these traditional functions because it is differentiable
and its derivative is easy to derive.
The outputs of sigmoid and tanh have upper and lower limits, whereas softplus produces outputs in the range (0, +${\infty}$).

Equation: f(x) = ln(1+${e^x}$)

Equation Graph:
<img src="Desktop/Activation_function_logo/softplus.jpg" style="width:300px;height:200px"/>

Derivative: f'(x) = 1/(1+$e^{-x}$)

Derivative Graph:
<img src="Desktop/Activation_function_logo/softplus_der.jpg" style="width:300px;height:200px"/>

Advantage:
1. This function is much smoother.
2. It is unilaterally suppressed, and has a wide acceptance domain.

Disadvantage:
1. Because of the logarithm operation, it is computationally intensive and is rarely used.
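
Here is a minimal Python sketch of softplus and its derivative. The rearranged form `max(x, 0) + log1p(exp(-|x|))` is mathematically equal to ln(1 + e^x) and is used here as an assumption to avoid overflow; the function names are illustrative.

```python
import math


def softplus(x: float) -> float:
    """Softplus ln(1 + e^x), rewritten so exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_derivative(x: float) -> float:
    """Derivative of softplus, which is the sigmoid 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Softplus behaves like a smooth version of ReLU.
for x in (-5.0, 0.0, 5.0):
    print(x, softplus(x), softplus_derivative(x))
```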

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Maxout

Maxout selects the maximum of its inputs. It enjoys all the benefits of ReLU without its drawback (dying ReLU).

Equation: f(x) = max($w_1^T x + b_1$, $w_2^T x + b_2$)

Equation Graph:
<img src="Desktop/Activation_function_logo/max.jpg" style="width:300px;height:200px"/>

Advantage:
1. Its fitting ability is very strong: it can fit any convex function (given enough linear pieces).
2. It has the advantages of linearity and non-saturation.

Disadvantage:
1. Each neuron has two sets of parameters, so the parameter count per neuron is doubled, which sharply increases the overall number of parameters.
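
The idea can be sketched in Python as below, assuming each linear piece applies its own weight vector and bias to the same input vector, as in the usual definition of maxout. The function name `maxout` and its argument layout are hypothetical choices made for this illustration.

```python
from typing import Sequence


def maxout(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> float:
    """Return the largest of the linear pieces w_k . x + b_k."""
    pieces = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(pieces)


# Two pieces, x and -x, give the convex function |x|.
print(maxout([-3.0], [[1.0], [-1.0]], [0.0, 0.0]))

# Two pieces, x and 0, reproduce ReLU.
print(maxout([-3.0], [[1.0], [0.0]], [0.0, 0.0]))
```

Each extra piece adds another full set of weights and a bias, which is where the growth in parameter count comes from.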
 

<img src="Desktop/Activation_function_logo/blank.jpg" style="width:50px;height:50px"/>

# Softmax

It calculates the probability distribution of an event over "n" different events.

Equation: f($x_i$) = g(x)/h(x) = $e^{x_i}$ / $\sum_{j=0}^{n-1} e^{x_j}$
    
Range: (0,1)

Equation Graph:
<img src="Desktop/Activation_function_logo/softmax.jpg" style="width:300px;height:200px"/>

Derivative: f'(x) = (g'(x)h(x) - h'(x)g(x)) / ${[h(x)]^2}$


Advantage:
1. It can handle multiple classes.
2. It is useful for output neurons: typically it is used only at the output layer, for neural networks that need to classify inputs into multiple categories.
   
Disadvantage:
1. It doesn't support null rejection.
2. It will not work well if your data is not linearly separable, when softmax is applied directly to linear combinations of the inputs.
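
A minimal Python sketch of softmax over a list of scores is given below. Subtracting the maximum score before exponentiating is an added assumption for numerical stability; it does not change the result because the common factor cancels in the ratio. The name `softmax` is illustrative.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Softmax: e^{x_i} divided by the sum of e^{x_j} over all j."""
    m = max(xs)  # shift by the maximum so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# The outputs are positive and sum to 1, so they can be read as
# class probabilities; the largest score gets the largest share.
probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```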