***
**Disclaimer**: This notebook is lengthy; <u>use the links below to navigate, if need be</u>.

## [Sigmoid](#sigmoid)
## [ReLU](#relu)
## [Leaky ReLU](#leaky_relu)
## [History](#history)
## [Landmarks of CNN in CV](#landmarks)

**In the second cell of this notebook, we will create a custom function that computes a 3D convolution using the NumPy library and explicit FOR loops to iterate over the input.**

It's worth noting that **there are more efficient and optimized ways to implement convolutions** using libraries like <span style="font-size: 11pt; color: orange; font-weight: normal">**NumPy**</span> or frameworks like <span style="font-size: 11pt; color: orange; font-weight: normal">**TensorFlow**</span> or <span style="font-size: 11pt; color: orange; font-weight: normal">**PyTorch**</span>. These libraries provide highly optimized functions and operations for convolutions, taking advantage of vectorized computations and parallel processing.
***


## <a id="sigmoid"></a>Sigmoid:

The sigmoid (logistic) function "squashes" its input into the range (0, 1). This makes it suitable as the output activation for binary classification and for multi-label classification, where each class is scored independently.

Sigmoid saturates for inputs of large magnitude: its gradient approaches zero there, which contributes to the vanishing gradient problem and makes deep neural networks harder to train.

**Application**: Sigmoid is commonly used in <u>binary classification tasks</u>.

$$\text{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$$

```python
# TensorFlow
import tensorflow as tf
tf.sigmoid(x)

# PyTorch
import torch
torch.sigmoid(x)

# NumPy implementation:
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
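
The saturation described above can be checked numerically. The sketch below is illustrative only: `sigmoid_grad` is a helper name introduced here, and it relies on the standard identity $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(xs)
print(grads)
```

The gradient peaks at 0.25 for `x = 0` and drops to roughly 4.5e-5 at `|x| = 10`, so a chain of saturated sigmoid units multiplies many very small factors together during backpropagation.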

## <a id="relu"></a>ReLU (Rectified Linear Unit):
    
ReLU returns the input unchanged if it is positive, and 0 otherwise. It is widely used as the activation function in the hidden layers of deep neural networks, especially in convolutional neural networks (CNNs).
    
ReLU is computationally efficient, as it only involves a simple thresholding operation. Because its gradient is exactly 1 for positive inputs, it mitigates the vanishing gradient problem and tends to accelerate convergence in deep neural networks. It has been successful in training deep architectures that generalize well.
    
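**NOTE**: ReLU can suffer from the "dying ReLU" problem: if a neuron's pre-activation is negative for every input, both its output and its gradient are zero, so its weights stop updating and the neuron can stay inactive for the rest of training. A network with many such dead neurons propagates a significant number of zero-valued gradients during backpropagation.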
    
$$\text{ReLU}(x) = \max(0, x)$$

```python
# TensorFlow
import tensorflow as tf
tf.nn.relu(x)

# PyTorch code:
import torch
torch.relu(x)

# NumPy implementation:
import numpy as np
def relu(x):
    return np.maximum(0, x)
```
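
To make the "dying ReLU" note concrete, here is a minimal sketch of a single neuron whose bias has drifted strongly negative. The weights, bias, and the helper `relu_grad` are hypothetical values and names chosen purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Common subgradient convention: 1 where x > 0, else 0 (including x == 0).
    return (x > 0).astype(float)

# Hypothetical neuron with a strongly negative bias.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 4))
w = np.array([0.5, -0.3, 0.2, 0.1])
b = -10.0

z = inputs @ w + b          # pre-activations, all negative for these inputs
dead_grad = relu_grad(z)    # gradient of ReLU with respect to z
print(relu(z).max(), dead_grad.sum())
```

Every output and every local gradient is zero here, so no gradient reaches `w` or `b` through this neuron and gradient descent cannot move it back into the active region.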

## <a id="leaky_relu"></a>Leaky ReLU:
    
Leaky ReLU is a variation of ReLU that introduces a small slope for negative values. Because the gradient for negative inputs is small but non-zero, gradients keep flowing, which addresses the "dying ReLU" problem where neurons become inactive during training. This helps prevent dead neurons and can improve model performance compared to standard ReLU.
    
Leaky ReLU is used as an alternative to ReLU, especially in models where the "dying ReLU" problem is prevalent. It can be beneficial in scenarios where standard ReLU fails to provide satisfactory results.

**NOTE**: Leaky ReLU introduces an additional hyperparameter, the negative slope $\alpha$, which has to be tuned. The choice of slope can affect the model's performance.
    
Leaky ReLU leaves positive values unchanged and scales negative values by a small slope $\alpha$, typically a small fraction such as 0.01.
    
$$\text{LeakyReLU}(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha \cdot x, & \text{if } x < 0 \end{cases}$$
    
```python
# TensorFlow
import tensorflow as tf
tf.nn.leaky_relu(x, alpha)

# PyTorch
import torch
torch.nn.functional.leaky_relu(x, negative_slope=alpha)

# NumPy implementation:
import numpy as np
def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)
```
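
The claim that Leaky ReLU keeps gradients flowing can be seen from its derivative. The helper `leaky_relu_grad` below is an illustrative name, not a library function, and it follows the piecewise definition given above.

```python
import numpy as np

def leaky_relu_grad(x, alpha=0.01):
    # Slope is 1 for x >= 0 and alpha for x < 0, matching the piecewise definition.
    return np.where(x >= 0, 1.0, alpha)

z = np.array([-5.0, -0.1, 0.0, 3.0])
print(leaky_relu_grad(z))   # negative inputs keep a slope of alpha instead of 0
```

Unlike plain ReLU, no entry of the gradient is exactly zero, so a neuron with negative pre-activations still receives a (small) weight update.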

## Softmax:
    
Softmax is commonly used in the final (output) layer for multi-class classification. It normalizes the outputs into a probability distribution that sums up to 1, enabling class predictions. 
    
Softmax computes the exponential of each input element and normalizes the results to obtain a probability distribution. This provides a clear interpretation of the model's confidence in each class prediction.
    
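**NOTE**: Because of the exponential, softmax amplifies the differences between input values and is sensitive to outliers. A naive implementation can overflow for large inputs, which is why the maximum is usually subtracted before exponentiating. Softmax by itself also does nothing to correct for class imbalance.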
    
$$\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{K}\exp(x_j)}$$ for $i = 1, 2, \ldots, K$, where $K$ is the number of classes.
    

```python
# TensorFlow
import tensorflow as tf
tf.nn.softmax(x)

# PyTorch code:
import torch
torch.nn.functional.softmax(x, dim=1)

# NumPy implementation:
import numpy as np
def softmax(x):
    exps = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)
```
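
The NumPy version above subtracts the row maximum before exponentiating. The sketch below shows why, comparing it against a naive variant; the names `softmax_naive` and `softmax_stable` and the example logits are chosen here for illustration.

```python
import numpy as np

def softmax_naive(x):
    exps = np.exp(x)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def softmax_stable(x):
    # Subtracting the row maximum leaves the result unchanged mathematically
    # but keeps every exponent <= 0, so exp cannot overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)   # exp(1000) overflows to inf, and inf / inf is nan

stable = softmax_stable(logits)
print(naive, stable, stable.sum(axis=-1))
```

The naive version returns `nan` for every class, while the shifted version returns the same probabilities it would for the logits `[0, 1, 2]`.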


## Tanh:
The hyperbolic tangent function, $\tanh$, is an S-shaped function that squashes its input into the range (-1, 1).

Because its output is zero-centered and can be negative, tanh helps in capturing negative correlations in the data, which makes it a common choice for the hidden layers of neural networks.

Tanh is often used as an activation function in recurrent neural networks (RNNs), autoencoders, and models where negative correlations in the data are important.
    
**NOTE**: Tanh shares sigmoid's main drawback: it saturates for inputs of large magnitude, where gradients are close to zero. This slows down training and contributes to the vanishing gradient problem.
    
$$\text{tanh}(x) = \frac{{\exp(x) - \exp(-x)}}{{\exp(x) + \exp(-x)}}$$

```python
# TensorFlow
import tensorflow as tf

tf.nn.tanh(x)

# PyTorch
import torch

torch.tanh(x)

# NumPy
import numpy as np

np.tanh(x)
```
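
Tanh's kinship with sigmoid can be made explicit: it is a rescaled, shifted sigmoid, $\tanh(x) = 2\,\sigma(2x) - 1$, and its derivative is $1 - \tanh^2(x)$. The short check below is a sketch of both facts on a handful of sample points.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)

# Identity: tanh(x) = 2 * sigmoid(2x) - 1
rescaled = 2.0 * sigmoid(2.0 * x) - 1.0

# Derivative: d/dx tanh(x) = 1 - tanh(x)^2, which peaks at 1 when x = 0
tanh_grad = 1.0 - np.tanh(x) ** 2
print(np.allclose(np.tanh(x), rescaled), tanh_grad.max(), tanh_grad[0])
```

The peak gradient of 1 (versus 0.25 for sigmoid) is one reason tanh often trains better in hidden layers, but the gradient at `|x| = 4` is already close to zero, which is the saturation mentioned in the note.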

## Linear
    
The linear activation function, also known as the identity function, returns its input unchanged, without any non-linear transformation. It is typically used in the output layer of regression models, or wherever a neural network needs to output an unbounded continuous value.
    
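**NOTE**: Linear activation lacks non-linearity, which limits the representational power of the neural network: a stack of layers that use only linear activations collapses into a single linear transformation, so it cannot model complex relationships between inputs and outputs.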
    
$$\text{linear}(x) = x$$

```python
def linear(x):
    return x
```
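
The limitation in the note can be verified directly. In this sketch, two dense layers with identity activations are compared against a single equivalent layer; the layer shapes and the random weights are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical dense layers with identity (linear) activations.
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

x = rng.normal(size=(4, 3))
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping expressed as a single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```

However many linear layers are stacked, the result is still one matrix multiplication plus a bias, which is why a non-linear activation is needed between layers.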

## ELU (Exponential Linear Units):
    
ELU is an activation function that addresses the vanishing gradient problem while allowing negative values. It provides smooth outputs for negative values and encourages the network to have mean activations close to zero.   
    
**NOTE**: ELU introduces additional computation due to the exponential function, which can slightly slow down the training process, and an additional hyperparameter, $\alpha$, which has to be tuned.
    
ELU is similar to ReLU for positive values but uses an exponential function for negative values, controlled by the parameter $\alpha$.
    
$$\text{ELU}(x) = \begin{cases} x, & \text{if } x \geq 0 \\
\alpha \cdot (\exp(x) - 1), & \text{if } x < 0 \end{cases}$$
    
```python
# TensorFlow (tf.nn.elu takes no alpha argument; it uses alpha = 1)
import tensorflow as tf
tf.nn.elu(x)

# PyTorch
import torch
torch.nn.functional.elu(x, alpha=alpha)

# NumPy implementation:
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))
```
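
The statement that ELU pushes mean activations toward zero can be illustrated on synthetic data. The sketch assumes zero-mean, unit-variance pre-activations, which is an idealization; the `elu` variant here additionally clamps the argument of `exp` so that large positive inputs cannot overflow in the branch that `np.where` discards.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def elu(x, alpha=1.0):
    # Clamp the exp argument to <= 0; the positive branch does not use it anyway.
    return np.where(x >= 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1))

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)   # assumed zero-mean, unit-variance pre-activations

print(relu(z).mean(), elu(z).mean())
```

Under this assumption the mean ReLU output is roughly 0.40, while the mean ELU output with `alpha=1.0` is roughly 0.16, since the negative branch partially offsets the positive one.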

## PReLU (Parametric Rectified Linear Units):
    
PReLU is an extension of ReLU that introduces a learnable parameter $\alpha$ for the negative slope. It provides flexibility and adaptability to the data during training, potentially improving model performance.

PReLU is useful in scenarios where ReLU fails to capture the non-linear relationships in the data. It has been applied to various deep learning models, including CNNs and neural machine translation.    
    
**NOTE**: PReLU increases the model's complexity due to the additional learnable parameter. It requires careful initialization and can be prone to overfitting.

$$\text{PReLU}(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha \cdot x, & \text{if } x < 0 \end{cases}$$

```python
import numpy as np

# The forward pass has the same form as Leaky ReLU ...
def prelu(x, alpha=0.01):
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

# ... but in PReLU, alpha is a parameter that is learned during training.
# Implementing that learning step is out of the scope of this article.
```
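
For readers curious about what "learnable" means here, the following toy sketch fits $\alpha$ by plain gradient descent. It is not how a framework implements PReLU; the target slope of 0.25, the learning rate, and the helper `prelu_grad_alpha` are all assumptions made for this example. It uses the fact that $\partial\,\text{PReLU}(x) / \partial\alpha$ equals $x$ for $x < 0$ and 0 otherwise.

```python
import numpy as np

def prelu(x, alpha):
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

def prelu_grad_alpha(x):
    # d PReLU / d alpha = x for x < 0, and 0 otherwise.
    return np.minimum(0.0, x)

# Toy setup (hypothetical): recover a target slope of 0.25 from data.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
target = prelu(x, 0.25)

alpha, lr = 0.01, 0.5
for _ in range(200):
    error = prelu(x, alpha) - target
    # Gradient of the mean squared error with respect to alpha (chain rule).
    grad = np.mean(2.0 * error * prelu_grad_alpha(x))
    alpha -= lr * grad

print(alpha)
```

In a real network the same gradient is obtained by backpropagation and `alpha` is updated alongside the weights, either as one shared value or one value per channel.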

## GELU (Gaussian Error Linear Units):
    
GELU is an activation function that weights its input by the cumulative distribution function (CDF) of the standard Gaussian distribution, $\Phi(x)$.

GELU has been particularly successful in transformer-based architectures, such as the Transformer models for machine translation, language understanding, and other natural language processing tasks. It has shown improved performance compared to traditional activation functions like ReLU in such scenarios.

GELU behaves like the identity for large positive inputs and smoothly approaches zero for large negative inputs, with a smooth transition around the origin governed by the Gaussian CDF.

$$\text{GELU}(x) = x \cdot \Phi(x) = 0.5 \cdot x \cdot \left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$



The erf (error function) is a special function that cannot be expressed in terms of elementary functions like polynomials, exponentials, or trigonometric functions. It is commonly denoted as $\text{erf}(x)$ and is defined as follows:

$$ \text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} dt $$

The error function is an odd function, symmetric around the origin, with a range from -1 to 1. It is primarily used in statistics and probability calculations involving normal distributions, such as calculating cumulative probabilities, quantiles, or the complementary error function (erfc).

In the context of the GELU activation function, the error function is used to express the CDF of the standard Gaussian distribution, $\Phi(x) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$, which gives GELU its smooth non-linear behavior.
    
```python
# Unfortunately, due to complexity,
# implementation of GELU is out of the scope of this article
```
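
Even though a full treatment is out of scope here, the formula itself is short enough to sketch. The example assumes the standard definition $\text{GELU}(x) = x \cdot \Phi(x)$ and uses `math.erf` from the Python standard library, wrapped with `np.vectorize`; the function names `gelu_exact` and `gelu_tanh` are illustrative. The second function is the widely used tanh-based approximation.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)   # element-wise wrapper around the stdlib erf

def gelu_exact(x):
    # x * Phi(x), with Phi written via the error function
    x = np.asarray(x, dtype=float)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation of the same function
    x = np.asarray(x, dtype=float)
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))

xs = np.linspace(-4.0, 4.0, 81)
print(np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs))))
```

The two versions agree closely over this range. `np.vectorize` is a convenience loop rather than a fast path, so for large arrays a library routine such as `scipy.special.erf` would be the more practical choice.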