## Activation Function
An activation function, in the context of artificial neural networks and deep learning, is a mathematical function applied to the output of a neuron or a set of neurons to transform it (some functions also squash it into a bounded range); the result is then passed forward to the subsequent layer. It introduces non-linearity into the network, allowing it to model complex relationships between inputs and outputs. In other words, a neural network without a non-linear activation function is equivalent to a single linear model (such as linear regression), no matter how many layers it has.

Activation functions are typically applied element-wise to the output of each neuron in a neural network layer. They transform the weighted sum of inputs plus a bias term into an output signal that is passed to the next layer.
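
To illustrate why the non-linearity matters, here is a minimal sketch showing that two stacked layers without an activation collapse into a single affine map, while a ReLU between them generally breaks that equivalence. The layer sizes, the random weights, and all variable names are illustrative assumptions and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical two-layer network: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x = rng.normal(size=(5, 3))  # batch of 5 input vectors

# Without an activation, the two layers compose into one affine map
two_layers = (x @ W1 + b1) @ W2 + b2
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined
print(np.allclose(two_layers, one_layer))

# With a ReLU between the layers, the output generally differs from that affine map
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```

The first comparison holds for any weights, because the composition of affine maps is affine. The second fails as soon as at least one hidden pre-activation is negative.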

There are two types of activation function:
1. Linear activation function.
2. Non-linear activation function.

***1.1.1 Linear activation function:***
With a linear activation function, the output is not restricted to any range: it spans from $-\infty$ to $\infty$. For example:

$$f(x)= x+5$$

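***1.1.2 Non-linear activation function:***
A non-linear activation function has a derivative that varies with its input, so backpropagation can use it to learn non-linear relationships, and stacking layers adds expressive power instead of collapsing into a single linear map. For example:
$$f(x) = w x^2 + b$$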

![Activation-Function:](../images/activation-function.png)

In [4]:
import numpy as np
import math

### 1. Sigmoid activation function [0,1]:
The sigmoid non-linearity takes a real-valued number and “squashes” it into the range between 0 and 1.
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Advantage:
1. Squashes numbers to the range [0,1]
2. Historically popular, since it has a nice interpretation as a saturating “firing rate” of a neuron

Disadvantage:
1. Saturated neurons “kill” the gradients
    - For inputs of large magnitude (very positive or very negative) the gradient is close to 0, which stalls the weight updates, especially in deep networks where these small gradients are multiplied together.
2. Sigmoid outputs are not zero-centered
    - If the data coming into a neuron is always positive (e.g. $x>0$ elementwise in $f=w^Tx+b$), then during backpropagation the gradients on the weights $w$ will be either all positive or all negative (depending on the gradient of the whole expression $f$).
3. $\exp()$ is somewhat expensive to compute
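
The saturation problem can be made concrete with a small sketch that evaluates the sigmoid derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, at a few points. The helper names `sigmoid` and `sigmoid_grad` and the sample inputs are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(x):
    # Same formula as in the text, vectorized with NumPy
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
for x, g in zip(xs, sigmoid_grad(xs)):
    print(f"x = {x:6.1f}   gradient = {g:.6f}")
```

The gradient peaks at 0.25 for $x = 0$ and is already below $10^{-4}$ at $|x| = 10$, which is the "killed" gradient described in the list.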

In [51]:
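def sigmoid_function(x):
    # Numerically stable sigmoid: math.exp(-x) overflows for large negative x,
    # so use the equivalent form exp(x) / (1 + exp(x)) when x < 0.
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)

In [52]:
sigmoid_function(10)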

0.9999546021312976

### 2. Tanh activation function [-1,1]
$\tanh(x) = 2\sigma(2x) - 1$ or $\tanh(x) = \frac{2}{1+e^{-2x}} - 1$
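
As a quick numerical check of the identity above, the following sketch compares `np.tanh` with the rescaled sigmoid on a grid of points. The `sigmoid` helper and the grid are assumptions chosen for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 11)
lhs = np.tanh(xs)                     # library implementation of tanh
rhs = 2.0 * sigmoid(2.0 * xs) - 1.0   # tanh written as a scaled, shifted sigmoid
print(np.allclose(lhs, rhs))
```

Since tanh is just a rescaled sigmoid, it inherits the same saturation behaviour at large $|x|$.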

Advantage:
1. Squashes numbers to the range [-1,1]
2. Zero-centered (nice)

Disadvantage:
1. Still kills gradients when saturated

In [53]:
def tanh_function(x):
    thf= math.tanh(x)
    # print(thf)
    return thf

In [54]:
tanh_function(-1000)

-1.0

### 3. ReLU activation function [0, ∞)
$f(x) = \max(0, x)$

Advantage:
- Does not saturate (in the positive region)
- Does not kill the gradient for positive inputs.
    - The gradient is zeroed only for negative inputs, that is, on half of the input range.
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Arguably more biologically plausible than sigmoid.

Disadvantage:
- Output is not zero-centered
- If the weights are not initialized well, a large share of the neurons (perhaps as many as 75%) can be dead, which wastes computation, although the network still works. Reducing this is an active area of research.
    - To mitigate this issue, people sometimes initialize all the biases to a small positive value such as 0.01
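
The "dead neuron" problem can be sketched as follows: a unit whose pre-activation is negative for every sample outputs 0 and passes back a zero gradient, so its weights stop updating. The helper names and the pre-activation values are hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 where x > 0 and 0 elsewhere (0 at x == 0 by convention)
    return (x > 0).astype(float)

# Hypothetical pre-activations for one batch: rows are samples, columns are units.
# The second unit is negative for every sample, so it is "dead" on this batch.
pre_activations = np.array([[ 0.5, -1.2],
                            [ 2.0, -0.3],
                            [-0.7, -4.0]])

print(relu(pre_activations))
# Number of samples through which each unit passes a gradient
print(relu_grad(pre_activations).sum(axis=0))
```

A small positive bias shifts the pre-activations upward, which makes it less likely that a unit starts out in this all-negative state.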


In [55]:
def relu_function(x):
    relu=max(0.0, x)
    # print("Relu Function:", relu)
    return relu

In [56]:
relu_function(-10)

0.0

### 4. Leaky ReLU function:
$f(x) = \max(\alpha x, x)$, where $\alpha$ is a small positive constant (e.g. 0.01)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Will not “die”
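
A vectorized sketch of the formula and its gradient shows why the unit cannot die: the slope on the negative side is $\alpha$ and never exactly zero. The function names and the default `alpha=0.01` are assumptions made for this example.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # For 0 < alpha < 1, max(alpha * x, x) equals x for x > 0 and alpha * x otherwise
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for x > 0 and alpha elsewhere, so it is never exactly zero
    return np.where(x > 0, 1.0, alpha)

xs = np.array([-1000.0, -1.0, 0.0, 3.0])
print(leaky_relu(xs))
print(leaky_relu_grad(xs))
```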

In [2]:
def leaky_relu_function(x):
    if x>0:
        return x
    else:
        return .01*x

In [3]:
leaky_relu_function(-1000)

-10.0

### 5. Exponential ReLU (ELU) function:
$f(x) = x$ if $x > 0$, otherwise $\alpha(e^{x} - 1)$

- All benefits of ReLU
- Closer to zero-mean outputs
- Negative saturation regime, compared with Leaky ReLU, adds some robustness to noise
- Computation requires $\exp()$
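
The negative saturation can be seen in a vectorized sketch: for very negative inputs the output levels off at $-\alpha$ instead of growing without bound. The function name `elu` and the default `alpha=1.0` (a common choice) are assumptions for this illustration.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) otherwise.
    # np.where evaluates both branches, so np.minimum keeps the exponential
    # from overflowing on large positive inputs.
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

xs = np.array([-100.0, -1.0, 0.0, 2.0])
print(elu(xs))
```

`np.expm1(x)` computes $e^{x} - 1$ directly, which is more accurate than `np.exp(x) - 1` for inputs close to zero.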

In [5]:
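def exp_relu_function(x, alpha=0.01):
    '''
    ELU: x for x > 0, alpha * (exp(x) - 1) otherwise.
    alpha=0.01 keeps the value used in this notebook; alpha=1.0 is a more common choice.
    '''
    if x>0:
        return x
    else:
        return alpha*(np.exp(x)-1)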

In [6]:
exp_relu_function(-10)

-0.009999546000702375

### 6. Maxout activation:
- $maxout(x) = \max(w_1^T x + b_1, w_2^T x + b_2)$
- Generalizes ReLU and Leaky ReLU
- Does not die
- Problems:
    - Doubles the number of parameters per neuron.
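
To see how maxout generalizes ReLU, the sketch below fixes the first affine piece to zero ($w_1 = 0$, $b_1 = 0$), in which case the maximum reduces to $\max(0, w_2^T x + b_2)$. The function name `maxout`, the random inputs, and the variables `w` and `c` are hypothetical.

```python
import numpy as np

def maxout(x, W, b):
    # x: (n_samples, n_features), W: (n_features, k), b: (k,)
    # Returns the max over the k affine pieces for each sample
    return np.max(x @ W + b, axis=1)

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(size=(6, 3))
w, c = rng.normal(size=3), 0.5  # hypothetical weights and bias of a single ReLU unit

# Piece 1 is identically zero, piece 2 is the usual affine map
W = np.column_stack([np.zeros(3), w])
b = np.array([0.0, c])

relu_out = np.maximum(0.0, x @ w + c)
print(np.allclose(maxout(x, W, b), relu_out))
```

Setting the first piece to $\alpha$ times the second would give Leaky ReLU in the same way.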

In [2]:
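def maxout_function(x, weights, biases):
    # One affine map per column of weights: shape (n_samples, n_pieces)
    linear_outputs=np.dot(x, weights)+biases
    # Maxout keeps the largest piece for each sample
    max_output= np.max(linear_outputs, axis=1)
    return max_output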

In [7]:
x = np.array([[1, 2, 3], [4, 5, 6]])  # Input
weights = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # Weights
biases = np.array([0.1, 0.2])  # Biases

output = maxout_function(x, weights, biases)
print("Output:", output)

Output: [3.  6.6]


### Key Point:
- Use ReLU, and be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh, but don’t expect much
- Don’t use sigmoid

### 7. Softmax Function:
The softmax function is a commonly used activation function in machine learning, particularly in multi-class classification problems. It takes a vector of real numbers as input and transforms them into a probability distribution over multiple classes. The softmax function ensures that the output probabilities sum up to 1.

$f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$

Here,
- $z$ is the input vector
- $j$ is the index of the output class, and $k$ runs over all classes in the multi-class classifier.

Example:
$z = \left[ \begin{array}{r} 8 \\ 5 \\ 0 \end{array}\right]$<br>

Calculation:

$e^{z_1}=e^8 \approx 2981$<br>
$e^{z_2}=e^5 \approx 148.4$<br>
$e^{z_3}=e^0= 1.0$<br>
$\sum_k e^{z_k}= e^8+e^5+e^0 \approx 3130.4$ <br>
$f_1(z) = \frac{e^{z_1}}{\sum_k e^{z_k}} \approx 2981/3130.4 \approx 0.9523$<br>
$f_2(z) = \frac{e^{z_2}}{\sum_k e^{z_k}} \approx 148.4/3130.4 \approx 0.0474$<br>
$f_3(z) = \frac{e^{z_3}}{\sum_k e^{z_k}} \approx 1/3130.4 \approx 0.0003$<br>

It is informative to check that we have three output values which are all valid probabilities, that is they lie between 0 and 1, and they sum to 1.
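
The hand calculation can be reproduced with a short sketch that translates the formula directly. It also checks that adding the same constant to every input leaves the result unchanged. The function and variable names are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    # Direct translation of the formula: exponentiate, then normalize
    exps = np.exp(z)
    return exps / np.sum(exps)

z = np.array([8.0, 5.0, 0.0])
probs = softmax(z)
print(np.round(probs, 4))  # compare with the hand calculation
print(probs.sum())

# Adding the same constant to every input leaves the output unchanged,
# which is why implementations often subtract max(z) before exponentiating
print(np.allclose(softmax(z - np.max(z)), probs))
```

Subtracting the maximum keeps every exponent at or below zero, so `np.exp` cannot overflow even for very large inputs.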


In [1]:
# Softmax function
import numpy as np
def softmax(x):
    # Subtract the max before exponentiating: same result, but avoids overflow
    e = np.exp(x - np.max(x, axis=0))
    return e / np.sum(e, axis=0)

x = np.array([2.0, 1.0, 0.1])
outputs = softmax(x)
print('softmax numpy:', outputs)

softmax numpy: [0.65900114 0.24243297 0.09856589]


### 8. Argmax Function:
The argmax function is a mathematical function that returns the argument (input) that maximizes a given function or expression. In the context of machine learning and classification tasks, the argmax function is commonly used to determine the class with the highest predicted probability or score.

$$\arg\max_j f_j(z)$$

$z = \left[ \begin{array}{r} 8 \\ 5 \\ 0 \end{array}\right]$<br>
$\arg\max_j f_j(z) = 1$ (the first class), which corresponds to the one-hot vector $\left[ \begin{array}{r} 1 \\ 0 \\ 0 \end{array}\right]$
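
The relationship between the index, the one-hot vector, and softmax can be sketched as below. Note that NumPy indices are 0-based, so the first class has index 0. The variable names are assumptions for this illustration.

```python
import numpy as np

z = np.array([8.0, 5.0, 0.0])

index = np.argmax(z)          # position of the largest score (0-based)
one_hot = np.zeros_like(z)
one_hot[index] = 1.0          # one-hot vector marking that position

# Softmax is monotonic, so it preserves the position of the maximum
exps = np.exp(z - z.max())
probs = exps / np.sum(exps)
print(index, one_hot, np.argmax(probs))
```

Because softmax does not change which entry is largest, the predicted class can be read from the raw scores without computing the probabilities.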

In [1]:
import numpy as np

scores = np.array([3, 4, 2, 5])
argmax_index = np.argmax(scores)

print(argmax_index) 

3
