# Activation Functions

Activation functions are mathematical operations applied to the output of a neural network layer. They introduce non-linearity into the network, enabling it to learn complex patterns and relationships in the data.

### Sigmoid Function
- Formula: $f(x) = \frac{1}{1 + e^{-x}}$, the output is in the range (0, 1).
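- It squashes the input values between 0 and 1, which can be interpreted as probabilities. However, its derivative is at most 0.25 and approaches 0 for inputs of large magnitude, so it suffers from the vanishing gradient problem (explained below). Its outputs are also not centered around 0. For these reasons it is rarely used in hidden layers nowadays.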
- Derivative: $f'(x) = f(x) \cdot (1 - f(x))$
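
The formula and derivative above can be checked numerically. The following is a minimal sketch in plain Python; the function names `sigmoid` and `sigmoid_grad` are illustrative choices, and the naive formula is not protected against overflow for very large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative computed with the identity f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 (value 0.25) and shrinks towards 0 on both sides.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  f(x)={sigmoid(x):.5f}  f'(x)={sigmoid_grad(x):.5f}")
```

The shrinking derivative at both ends of the input range is what the vanishing gradient discussion refers to.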

### Tanh Function
- Formula: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, the output is in the range (-1, 1).
- It squashes the input values between -1 and 1, which helps in centering the data around 0. It also suffers from the vanishing gradient problem.
- Derivative: $f'(x) = 1 - f(x)^2$
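
As a sanity check on the derivative, the identity $f'(x) = 1 - f(x)^2$ can be compared against a finite-difference estimate. This is a small sketch using only the standard library; `tanh_grad` and `numeric_grad` are hypothetical helper names, and the step size `h` is an arbitrary choice.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh via the identity f'(x) = 1 - f(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def numeric_grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# The analytic and numeric derivatives should agree closely at each point.
for x in (-3.0, 0.0, 3.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):.5f}  "
          f"analytic={tanh_grad(x):.6f}  numeric={numeric_grad(math.tanh, x):.6f}")
```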

### ReLU Function
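- Formula: $f(x) = \max(0, x)$, the output is in the range $[0, \infty)$.
- It is the most widely used activation function in hidden layers. It is computationally efficient and helps in mitigating the vanishing gradient problem. However, it suffers from the dying ReLU problem: a unit whose input stays negative outputs 0, receives a gradient of 0, and therefore stops updating.
- Derivative: $f'(x) = 1$ if $x > 0$, else $0$ (the derivative is undefined at $x = 0$; in practice it is usually taken to be 0 there)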
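
A minimal plain-Python sketch of ReLU and its derivative follows. The names `relu` and `relu_grad` are illustrative, and returning 0 for the derivative at $x = 0$ is a convention assumed here, since the derivative is not defined at that point.

```python
def relu(x: float) -> float:
    """ReLU, f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, otherwise 0.

    The value at x == 0 is a convention (assumed to be 0 here).
    """
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.5):
    print(f"x={x:5.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")
```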

### Leaky ReLU Function
- Formula: $f(x) = \max(\alpha x, x)$, where $\alpha$ is a small positive constant (e.g. 0.01). The output is in the range $(-\infty, \infty)$.
- It is similar to ReLU but allows a small, non-zero gradient when the input is negative. This helps alleviate the dying ReLU problem.
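
The small negative-side gradient can be made concrete with a short sketch. It assumes $0 < \alpha < 1$ (so that $\max(\alpha x, x)$ selects the right branch) and uses the hypothetical names `leaky_relu` and `leaky_relu_grad`; the derivative at $x = 0$ is set to $\alpha$ by convention.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU, f(x) = max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative: 1 for x > 0, alpha otherwise (value at 0 is a convention)."""
    return 1.0 if x > 0 else alpha


# For a negative input the gradient is small but non-zero, so the unit can still learn.
for x in (-5.0, 2.0):
    print(f"x={x:5.1f}  f(x)={leaky_relu(x):.2f}  f'(x)={leaky_relu_grad(x):.2f}")
```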

### Softmax Function
- Formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$, the output is in the range (0, 1) and the sum of all outputs is 1.
- It is used in the output layer of a neural network for multi-class classification problems. It converts the raw scores into probabilities, making it easier to interpret the output.
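
The formula can be sketched in plain Python as below. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio. The name `softmax` and the example scores are illustrative.

```python
import math


def softmax(scores):
    """Softmax over a list of raw scores.

    The maximum is subtracted first so that math.exp never overflows;
    this does not change the output.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical raw scores for three classes
print(probs)
print(sum(probs))  # sums to 1 up to floating-point rounding
```

Larger raw scores map to larger probabilities, so the ordering of the classes is preserved.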

-----

### What is vanishing gradient?

As we add more and more hidden layers, backpropagation becomes less and less useful in passing information to the earlier layers. By the chain rule, the gradient reaching an early layer is a product of per-layer terms, so when those terms are smaller than 1 the gradient shrinks as it is passed back and becomes small relative to the weights of the network.

The vanishing gradient problem is particularly pronounced when using activation functions like sigmoid and tanh, whose derivatives become very small for inputs of large magnitude (strongly positive or strongly negative).
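
A toy illustration of the effect, under strong simplifying assumptions: a chain of scalar sigmoid units stacked on top of each other with all weights fixed to 1 and no biases. The gradient of the final output with respect to the input is then a product of sigmoid derivatives, each at most 0.25. The function `chain_gradient` is a hypothetical helper, not a model of a real network.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(x: float, depth: int) -> float:
    """d(output)/d(input) for `depth` stacked scalar sigmoids with unit weights."""
    grad = 1.0
    a = x
    for _ in range(depth):
        a = sigmoid(a)            # forward pass through one layer
        grad *= a * (1.0 - a)     # chain rule: multiply by that layer's derivative
    return grad


# The gradient shrinks roughly geometrically as the chain gets deeper.
for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient={chain_gradient(0.0, depth):.3e}")
```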

## 1. ReLU

What are the advantages of ReLU over the sigmoid function in deep neural networks?

1. **Reduced likelihood of vanishing gradient**: Write a ReLU unit as $h = \max(0, a)$, where $a$ is the pre-activation (the weighted input to the unit). When $a > 0$, the gradient has a constant value of 1. In contrast, the gradient of a sigmoid becomes increasingly small as the absolute value of its input increases. The constant gradient of ReLUs results in faster learning.
2. **Sparsity**: Sparsity arises when $a \le 0$, where the unit outputs exactly 0. The more such units exist in a layer, the sparser the resulting representation.
3. **Better convergence performance**
4. **More computationally efficient**: ReLU only needs a comparison with 0, whereas sigmoid requires evaluating an exponential.
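
The first two points can be illustrated with a small sketch. It assumes a unit's output is $h = \max(0, a)$ for pre-activation $a$ (this notation and the sample values are assumptions made for illustration) and compares ReLU with sigmoid on the same inputs.

```python
import math

# Hypothetical pre-activation values a for six units in one layer.
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.5, 4.0]

relu_out = [max(0.0, a) for a in pre_activations]
sigmoid_out = [1.0 / (1.0 + math.exp(-a)) for a in pre_activations]

# Sparsity: ReLU outputs exactly 0 whenever a <= 0; sigmoid never does.
relu_zeros = sum(1 for h in relu_out if h == 0.0)
sigmoid_zeros = sum(1 for h in sigmoid_out if h == 0.0)
print(f"zero outputs: relu={relu_zeros}, sigmoid={sigmoid_zeros}")

# Gradients: ReLU stays at 1 for a > 0, while the sigmoid gradient decays.
relu_grads = [1.0 if a > 0 else 0.0 for a in pre_activations]
sigmoid_grads = [s * (1.0 - s) for s in sigmoid_out]
for a, rg, sg in zip(pre_activations, relu_grads, sigmoid_grads):
    print(f"a={a:5.1f}  relu_grad={rg:.1f}  sigmoid_grad={sg:.4f}")
```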

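```python
import torch

# Element-wise ReLU module: negative entries become 0, positive entries pass through
relu = torch.nn.ReLU()

A = torch.randn(5)  # five samples from a standard normal distribution
print(A)

ans = relu(A)
print(ans)
```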

## 2. Sigmoid and Softmax

Sigmoid formula: $f(x) = \frac{1}{1+e^{-x}}$, and its derivative: $f'(x) = f(x)(1-f(x))$

Softmax formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

What is the difference between sigmoid and softmax functions?

1. The sigmoid function is used in the output layer of a binary classification model. It squashes a single raw score into the range (0, 1), and the result is interpreted as the probability of the input belonging to class 1.
2. The softmax function is used in the output layer of a multi-class classification model. It squashes each of the $n$ raw scores into the range (0, 1) and normalizes them so that they sum to 1. The outputs are interpreted as the probabilities of the input belonging to each class.
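
One way to see how the two functions relate: the sigmoid of a score $x$ equals the first output of a two-class softmax over the scores $(x, 0)$. The sketch below checks this numerically and also shows that sigmoids applied independently to several scores do not sum to 1, while softmax outputs do. The helper names and the example scores are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Sigmoid as a two-class softmax with the second score fixed at 0.
x = 1.7
two_class = softmax([x, 0.0])
print(sigmoid(x), two_class[0])

# Independent sigmoids are not normalized across classes; softmax is.
scores = [2.0, 1.0, 0.1]
sigmoid_sum = sum(sigmoid(s) for s in scores)
softmax_sum = sum(softmax(scores))
print(sigmoid_sum, softmax_sum)
```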

## References

- [https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks](https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks)
- [https://towardsdatascience.com/fantastic-activation-functions-and-when-to-use-them-481fe2bb2bde](https://towardsdatascience.com/fantastic-activation-functions-and-when-to-use-them-481fe2bb2bde)