### Activation Functions


- We can have an **Activation Function** in a simple neural network, as shown below:

<img src="images/activation1.png" width=800>


- We can also have an **Activation Function** in a neural network with a **hidden layer**, as shown below:

<img src="images/activation2.png" width=800>


- Now if we remove the **Activation Function** from both the **hidden layer** and **output layer**, then we will get a **Linear Equation** where the output is just a **Weighted Sum** of the input features.

<img src="images/activation3.png" width=800>
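
The collapse into a single **Linear Equation** can be checked with a small sketch. The weights and biases below (`w1`, `b1`, `w2`, `b2`) are made-up values for illustration only, and each layer is reduced to one neuron with one input to keep the arithmetic readable.

```python
# Hypothetical weights and biases, chosen only for illustration.
w1, b1 = 2.0, 1.0    # hidden layer
w2, b2 = -3.0, 0.5   # output layer

def two_linear_layers(x):
    """Hidden layer followed by output layer, both without an activation function."""
    hidden = w1 * x + b1
    return w2 * hidden + b2

def single_linear_layer(x):
    """One linear equation using the combined weight and bias."""
    w = w2 * w1
    b = w2 * b1 + b2
    return w * x + b

# Both functions give the same output for every input.
for x in [-2.0, 0.0, 3.5]:
    print(x, two_linear_layers(x), single_linear_layer(x))
```

Because the two functions always agree, the extra layer adds no expressive power unless a non-linear function sits between the layers.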


- In real life, complex problems cannot always be solved using **Linear Equations**.
- To solve such problems we need **Non Linear Equations**.
- **Activation Functions** are what introduce this non-linearity into the network.

- The **sigmoid** function is a smooth S-shaped curve whose output always lies between 0 and 1.

<img src="images/activation4.png" width=800>

- Another function, **tanh**, has a similar shape, but its output lies between -1 and 1 instead of between 0 and 1.


<img src="images/activation5.png" width=800>


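- **A common rule of thumb is to use *sigmoid* in the output layer and *tanh* in all other places.** In the output layer of a binary classifier, a value between 0 and 1 can be read as a probability. Everywhere else **tanh** is usually better than **sigmoid** because ***its outputs have a mean close to zero***, so ***it centers the data*** passed on to the next layer.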

- The problem with **sigmoid** and **tanh** shows up in their **derivative**. The **derivative** is the slope of the function: it tells how much the output changes for a small change in the input. Because these are **Non Linear** functions, the slope is different at each point on the curve.

<img src="images/activation6.png" width=800>

- For inputs with a large magnitude, like `4` or `-4`, the curve is almost flat, so the slope is close to zero and **y** barely changes. This creates a problem in the learning process: when the **derivatives** are close to zero, the weight updates during **Back Propagation** become tiny and learning becomes very slow. This is called the **Vanishing Gradients** problem.
- So remember: both **sigmoid** and **tanh** have this **Vanishing Gradients** problem, and as a result the learning process can become very slow.
- To overcome this problem another **Activation Function** was introduced, named **ReLU**. If the input is less than zero the output is zero; otherwise the output is the input itself. In other words, the output is the maximum of `0` and the input.

<img src="images/activation7.png" width=800>

- **Remember: for hidden layers, when you are not sure which *Activation Function* to use, set *ReLU* as the default choice**.
- **ReLU** also has a form of the **Vanishing Gradients** problem: for inputs less than zero its slope is exactly zero. To overcome this there is another flavor of ***ReLU*** named ***Leaky ReLU***. Instead of outputting `0` for negative inputs, it multiplies the input by a small constant such as `0.1`, so the slope never becomes exactly zero.


<img src="images/activation8.png" width=800>



**Summary of all the *Activation Functions*:**

<img src="images/activation9.png" width=1000>


- Which ***Activation Function*** to use depends on the situation: try the candidates and keep the one that gives the best results for the problem.

### *`sigmoid function`*

- It maps any number to a value between `0` and `1`.

In [1]:
import math

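def sigmoid(x):
    # Squashes x into the range (0, 1).
    # Note: math.exp(-x) raises OverflowError for very negative x (roughly x < -709).
    return 1 / (1 + math.exp(-x))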

In [2]:
# let's check with value 100 (the true result is slightly below 1 but rounds to 1.0)

sigmoid(100)

1.0

In [3]:
# let's check with value 1

sigmoid(1)

0.7310585786300049

In [4]:
# let's check with value -56

sigmoid(-56)

4.780892883885469e-25
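
The **Vanishing Gradients** problem described above can be checked numerically. The sketch below uses the standard identity that the slope of sigmoid at `x` equals `sigmoid(x) * (1 - sigmoid(x))`; the helper name `sigmoid_derivative` is introduced here for illustration and is not part of the original notebook.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    """Slope of the sigmoid curve at x."""
    s = sigmoid(x)
    return s * (1 - s)

# The slope peaks at x = 0 (value 0.25) and shrinks quickly as |x| grows.
for x in [0, 2, 4, 10]:
    print(x, sigmoid_derivative(x))
```

The slope never exceeds `0.25`, and by `x = 4` it is already below `0.02`, which is why weight updates become so small when a neuron saturates.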

### *`tanh`*

- It maps any number to a value between `-1` and `1`.

In [5]:
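def tanh(x):
    # Squashes x into the range (-1, 1).
    # Note: math.exp overflows for large |x| (roughly above 709); the built-in math.tanh(x) avoids this.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))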

In [6]:
# let's check with value 100

tanh(100)

1.0

In [7]:
# let's check with value 1

tanh(1)

0.7615941559557649

In [8]:
# let's check with value -56

tanh(-56)

-1.0
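
The earlier claim that **tanh** centers the data while **sigmoid** does not can be illustrated with a small sketch. The input list is an arbitrary symmetric sample chosen for illustration, and the sketch uses the standard library's `math.tanh` instead of the hand-written `tanh` above. The helper `tanh_derivative` is likewise an illustrative addition.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Arbitrary inputs, symmetric around zero.
inputs = [-3, -2, -1, 0, 1, 2, 3]

# Average output of each activation over the same inputs.
tanh_mean = sum(math.tanh(x) for x in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(x) for x in inputs) / len(inputs)

print("tanh mean:   ", tanh_mean)     # very close to 0
print("sigmoid mean:", sigmoid_mean)  # very close to 0.5

def tanh_derivative(x):
    """Slope of tanh at x, using the identity 1 - tanh(x)**2."""
    return 1 - math.tanh(x) ** 2

# Like sigmoid, the slope of tanh fades for inputs far from zero.
print(tanh_derivative(0), tanh_derivative(4))
```

For inputs centered on zero, tanh outputs average out to roughly zero while sigmoid outputs average out to roughly `0.5`. The slope of tanh still shrinks toward zero for large inputs, so it shares the **Vanishing Gradients** problem.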

### *`ReLU`*

In [9]:
def relu(x):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return max(0, x)

In [10]:
# let's check with value 100

relu(100)

100

In [11]:
# let's check with value -56

relu(-56)

0
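
The slope of **ReLU** can be sketched in the same way. The helper `relu_derivative` is an illustrative addition, not part of the original notebook. The slope is undefined at exactly `0`, and returning `0` there is an assumed convention.

```python
def relu(x):
    return max(0, x)

def relu_derivative(x):
    """Slope of ReLU at x.

    The slope is undefined at exactly 0; returning 0 there is an
    assumed convention for this sketch.
    """
    return 1 if x > 0 else 0

# The slope stays at 1 for any positive input, however large,
# but it is exactly 0 for every negative input.
for x in [-56, -1, 1, 100]:
    print(x, relu(x), relu_derivative(x))
```

The constant slope of `1` for positive inputs is what avoids the saturation seen with sigmoid and tanh, while the zero slope for negative inputs is the weakness that Leaky ReLU addresses.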

### *`Leaky ReLU`*

In [12]:
def leaky_relu(x):
    # For negative x, 0.1 * x is larger than x, so the input is scaled down instead of zeroed.
    return max(0.1 * x, x)

In [13]:
# let's check with value -100

leaky_relu(-100)

-10.0

In [14]:
# let's check with value 8

leaky_relu(8)

8
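
A slightly more general sketch makes the negative-side slope a parameter. The parameter name `alpha` and the helper `leaky_relu_derivative` are assumptions introduced for illustration. The default of `0.1` matches the cell above, though other small values are also used in practice.

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU with a configurable slope (alpha) for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_derivative(x, alpha=0.1):
    """Slope of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1 if x > 0 else alpha

# The slope for negative inputs is small but never exactly zero.
for x in [-100, -1, 8]:
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Writing the function with an explicit branch instead of `max(alpha * x, x)` keeps it correct for any `alpha`, whereas the `max` form only works when `alpha` is below `1`.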