# Activation Functions

Activation functions fall into two groups, depending on where they are used in an ML model.

1. Activation functions that are used in **output layers** of ML models. Their primary purpose is to **squash the value into a bounded range, such as 0 to 1**.
   1. Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$
   2. Softmax: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$
   3. Tanh: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
2. Activation functions that are used in **hidden layers** of neural networks. Their primary purpose is to **provide non-linearity, without which neural networks cannot model non-linear relationships**. Such a function should ideally be **non-linear**, so the network can learn non-linear relationships; **unbounded**, to enable faster learning and avoid saturating early; and **continuously differentiable**.
   1. ReLU: $f(x) = \max(0, x)$
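
The need for non-linearity can be illustrated with a minimal sketch. The scalar "layers" and their weights below are hypothetical values chosen only for this example: two stacked linear layers collapse into a single linear map, while placing a ReLU between them does not.

```python
# Hypothetical scalar "layers"; the weights are illustrative assumptions.
def linear(w, b):
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)

# Without an activation, layer2(layer1(x)) equals one linear map with
# w = w2 * w1 and b = w2 * b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
stacked_linear = [layer2(layer1(x)) for x in xs]
single_linear = [collapsed(x) for x in xs]
with_relu = [layer2(relu(layer1(x))) for x in xs]

print(stacked_linear)  # [15.5, 9.5, 3.5, -2.5, -8.5]
print(single_linear)   # [15.5, 9.5, 3.5, -2.5, -8.5]
print(with_relu)       # [0.5, 0.5, 0.5, -2.5, -8.5]
```

With the ReLU in place, the output is flat for the first three inputs and then falls, a shape that no single linear map can produce.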

## 1. ReLU

What are the advantages of ReLU over the sigmoid function in deep neural networks?

1. **Reduced likelihood of vanishing gradient**: Write a ReLU unit as $h = \max(0, a)$, where $a$ is the unit's pre-activation input (for example $a = Wx + b$). When $a > 0$, the gradient of $h$ has a constant value of 1. In contrast, the gradient of a sigmoid becomes increasingly small as the absolute value of its input increases. The constant gradient of ReLU results in faster learning.
2. **Sparsity**: Sparsity arises when $a \le 0$, because the unit then outputs exactly 0. The more such units a layer contains, the sparser the resulting representation.
3. **Better convergence performance**
4. **More computationally efficient**: ReLU needs only a comparison with 0, whereas sigmoid requires evaluating an exponential.

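```python
import torch

relu = torch.nn.ReLU()  # element-wise max(0, x)
A = torch.randn(5)      # five samples from a standard normal distribution
print(A)
ans = relu(A)           # negative entries become 0; the rest pass through unchanged
print(ans)
```

Output:

```
tensor([0.1498, 0.4028, 2.1596, 1.0192, 0.9494])
tensor([0.1498, 0.4028, 2.1596, 1.0192, 0.9494])
```

In this run every sampled value happened to be positive, so ReLU returned the input unchanged. Any negative entry would have been replaced by 0.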
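
The gradient and sparsity claims in the list above can be checked numerically. The following is a minimal sketch in plain Python, kept free of dependencies. The input values are arbitrary assumptions, and setting the ReLU derivative to 0 at exactly $x = 0$ is a convention rather than something stated in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x); the value at exactly 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0

# Arbitrary pre-activation values, chosen only for illustration.
inputs = [-10.0, -2.0, 0.5, 2.0, 10.0]
for x in inputs:
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

# Sparsity: fraction of units whose ReLU output is exactly zero.
relu_outputs = [max(0.0, x) for x in inputs]
sparsity = sum(1 for h in relu_outputs if h == 0.0) / len(relu_outputs)
print(sparsity)
```

The sigmoid gradient shrinks towards 0 at both ends of the input range, while the ReLU gradient stays at 1 for every positive input.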


## 2. Sigmoid and Softmax

Sigmoid formula: $f(x) = \frac{1}{1+e^{-x}}$, and its derivative: $f'(x) = f(x)(1-f(x))$
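
A short worked derivation, included to show where the derivative identity above comes from. It applies only the chain rule to the sigmoid definition:

$$
\begin{aligned}
f(x) &= (1 + e^{-x})^{-1} \\
f'(x) &= -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = f(x)\,(1 - f(x))
\end{aligned}
$$

The last step uses $1 - f(x) = \frac{e^{-x}}{1 + e^{-x}}$. Because $f(x)(1 - f(x))$ peaks at $f(x) = \frac{1}{2}$, the derivative is at most $\frac{1}{4}$ (at $x = 0$) and tends to 0 as $|x|$ grows.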

Softmax formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$, where $x_1, \dots, x_n$ are the inputs for the $n$ classes

What is the difference between sigmoid and softmax functions?

1. The sigmoid function is used in the output layer of a binary classification model. It squashes a single output into the range 0 to 1, and the result is interpreted as the probability that the input belongs to class 1.
2. The softmax function is used in the output layer of a multi-class classification model. It squashes every output into the range 0 to 1 and makes the outputs sum to 1, so each one is interpreted as the probability that the input belongs to the corresponding class.
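
The contrast between the two functions can be sketched in plain Python. The logits below are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the formula above (it does not change the result). The sketch also shows that a two-class softmax over the logits $[x, 0]$ gives the same probability as sigmoid applied to $x$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    # Subtracting the max avoids overflow in exp; the ratios are unchanged.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-class model.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Two-class case: softmax over [x, 0] matches sigmoid(x).
two_class = softmax([1.5, 0.0])
print(two_class[0], sigmoid(1.5))
```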

## 3. Tanh

Tanh formula: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. Unlike sigmoid, its output lies between -1 and 1 and is centred on 0.
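
As a quick check of the formula, the sketch below evaluates it directly and compares it with Python's built-in `math.tanh`. The sample points are arbitrary assumptions. It also checks the identity $\tanh(x) = 2\,\sigma(2x) - 1$, where $\sigma$ is the sigmoid, which shows that tanh is a rescaled and shifted sigmoid. Evaluating the formula directly overflows for very large $|x|$, so this is an illustration rather than a production implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_formula(x):
    # Direct translation of (e^x - e^-x) / (e^x + e^-x).
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# Arbitrary sample points, chosen only for illustration.
xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
values = [tanh_from_formula(x) for x in xs]

for x, v in zip(xs, values):
    print(f"x={x:5.1f}  formula={v:.6f}  math.tanh={math.tanh(x):.6f}")

# tanh as a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
rescaled = [2.0 * sigmoid(2.0 * x) - 1.0 for x in xs]
```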

## References

- [https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks](https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks)
- [https://towardsdatascience.com/fantastic-activation-functions-and-when-to-use-them-481fe2bb2bde](https://towardsdatascience.com/fantastic-activation-functions-and-when-to-use-them-481fe2bb2bde)