# Activation functions

## Team members:

- Technical writer - Arman
- Designer of interactive plots - Sabina
- Designer of quizzes - Aidos

## Introduction

As we've journeyed through the world of MLPs and their layers, you've seen how these networks work their magic. But there's one crucial element that ties it all together: activation functions.

```{figure} images/activation_f.gif
:align: center
```

An activation function is applied to a neuron's pre-activation, that is, the weighted sum of its inputs plus a bias, and determines how strongly the neuron is activated. Activation functions are (almost everywhere) differentiable operators that transform input signals into outputs, and most of them add nonlinearity. Because activation functions are fundamental to deep learning, let’s briefly survey some common ones.

## The importance of activation functions in neural networks

Activation functions, essential in neural networks, allow for more than mere linear transformations of inputs. Without them, a neural network, however many hidden layers it has, would collapse into a single linear model, because a composition of linear (affine) maps is itself linear. Such a model is incapable of handling complex tasks. Non-linear activations let the network pass on relevant information and suppress the irrelevant, significantly enhancing its learning capability.
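The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch with hand-picked weights (the names `W1`, `W2`, `W_merged` and all values are illustrative assumptions, not taken from a trained model): two layers without an activation equal one merged linear layer, while inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

# Hypothetical weights and biases of a tiny 2 -> 2 -> 1 network.
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)

x = np.array([2.0, 1.0])

# Two layers applied one after another, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

# With a ReLU between the layers, the merged layer no longer matches.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(two_layers, one_layer, with_relu)  # -> [0.] [0.] [1.]
```

With these particular weights the purely linear network outputs 0 for every input (`W_merged` is all zeros), whereas the ReLU version computes $|x_1 - x_2|$, a function no single linear layer can represent.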

````{admonition} The historical beginning of the activation functions
:class: dropdown

### Binary step activation function

The binary step function compares its input with a threshold to decide whether a neuron should be activated. If the input is greater than or equal to the threshold (0 in the formula below), the neuron is activated; otherwise it is deactivated, meaning that its output is not passed on to the next hidden layer.

```{math}
:label: binary_step
    f(x) =
    \begin{cases}
    0, & \text{if } x < 0,\\
    1, & \text{if } x \geq 0.
    \end{cases}
```

### Linear and non-linear activation functions

A network that uses only the linear (identity) activation function is equivalent to a linear regression model. This is because the linear activation function simply outputs the input that it receives, without applying any transformation.

In a neural network, the output of a neuron is computed using the following equation:

```{math}
:label: non_linear_act
    output = activation(inputs \cdot weights + bias)
```

Non-linear activation functions overcome the limitations of linear ones in two ways. Their derivatives depend on the input, which gives backpropagation an informative signal, and stacking them allows deep architectures to model complex, non-linear combinations of the inputs.

````
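The neuron equation from the note above can be sketched in a few lines. This is only an illustration: the helper names `binary_step` and `neuron_output`, as well as the input, weight and bias values, are assumptions made for the example.

```python
import numpy as np

def binary_step(z):
    """Return 1.0 where z >= 0 and 0.0 elsewhere (threshold at 0)."""
    return np.where(z >= 0, 1.0, 0.0)

def neuron_output(inputs, weights, bias, activation):
    """Compute activation(inputs . weights + bias) for a single neuron."""
    return activation(np.dot(inputs, weights) + bias)

inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, 0.1])

# The weighted sum is about 0.1, so the bias decides on which side of the threshold we land.
fires = neuron_output(inputs, weights, bias=-0.05, activation=binary_step)
silent = neuron_output(inputs, weights, bias=-0.5, activation=binary_step)

print(float(fires), float(silent))  # -> 1.0 0.0
```

Passing the activation as an argument makes it easy to swap the binary step for any of the functions discussed later.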

<span style="display:none" id="q_binary">W3sicXVlc3Rpb24iOiAiV2hhdCBhcmUgc29tZSBsaW1pdGF0aW9ucyBvZiB1c2luZyBhIGJpbmFyeSBzdGVwIGZ1bmN0aW9uPyIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAiVGhleSBhbGxvdyBiYWNrcHJvcGFnYXRpb24iLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8sIHRoYXQgaXMgaW5jb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJJdCBjYW5ub3QgcHJvdmlkZSBtdWx0aS12YWx1ZSBvdXRwdXRzIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4gSXQgY2FuJ3QgYmUgdXNlZCBmb3IgbXVsdGktY2xhc3MgY2xhc3NpZmljYXRpb24gcHJvYmxlbXMifSwgeyJhbnN3ZXIiOiAiVGhlIGdyYWRpZW50IG9mIHRoZSBzdGVwIGZ1bmN0aW9uIGlzIHplcm8iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb25lIG9mIHRoZSBhYm92ZSIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJbmNvcnJlY3QuIn1dfV0=</span>

In [3]:
from jupyterquiz import display_quiz
display_quiz("#q_binary")


## Sigmoid function

The sigmoid function takes any real value as input and outputs a value in the open interval (0, 1).

The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.


```{math}
:label: sigmoid
    \sigma(x) = \frac{1}{1 + e^{-x}}
```

Derivative of sigmoid function:

```{math}
:label: sigmoid_derivative
    \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
```

<span style="display:none" id="q_derivative">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyB0aGUgcmVzdWx0IG9mIHRoZSBkZXJpdmF0aXZlIG9mIHRoZSBzaWdtb2lkIGZ1bmN0aW9uIHdoZW4geCA9IDA/IiwgInR5cGUiOiAibnVtZXJpYyIsICJhbnN3ZXJzIjogW3sidHlwZSI6ICJ2YWx1ZSIsICJ2YWx1ZSI6IDAuMjUsICJjb3JyZWN0IjogdHJ1ZSwgImZlZWRiYWNrIjogIkNvcnJlY3QhIFRoZSBkZXJpdmF0aXZlIG9mIHRoZSBzaWdtb2lkIGZ1bmN0aW9uIGF0ICR4ID0gMCQgaXMgaW5kZWVkICRmJygwKSA9IDAuMjUkLiJ9LCB7InR5cGUiOiAiZGVmYXVsdCIsICJmZWVkYmFjayI6ICJUaGF0J3Mgbm90IHRoZSBjb3JyZWN0LiBUcnkgYWdhaW4uICJ9XX1d</span>

In [None]:
from jupyterquiz import display_quiz
display_quiz("#q_derivative")


`````{admonition} Pros
:class: tip
- **Smooth gradient**: The sigmoid function is differentiable at every point. This smooth gradient prevents sudden changes in the output values, which is helpful for gradient-based optimization methods.
- **Output range**: It maps inputs to a range between 0 and 1, making it useful for models where the output needs to be interpreted as a probability.
`````

```{admonition} Cons
:class: warning
- **Vanishing gradient problem**: For very high or very low input values, the gradient of the sigmoid function becomes very small, almost zero. This greatly slows down the learning process or even stops it completely, as the weights update very slowly.
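- **Not zero-centered**: The output of the sigmoid function is always positive rather than centered around zero. As a result, the gradients of all weights feeding into a neuron of the next layer share the same sign (all positive or all negative), which can cause inefficient zig-zagging updates during optimization.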
- **Computationally expensive**: The exponential function used in sigmoid is more computationally intensive compared to other alternatives.
```
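The vanishing gradient issue listed above is easy to see numerically. The sketch below evaluates the sigmoid derivative at a few hand-picked points (the sample points and the depth of 10 layers are arbitrary choices for illustration).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_derivative(xs)

for x, g in zip(xs, grads):
    print(f"x = {x:6.1f}   sigmoid'(x) = {g:.6f}")

# The derivative never exceeds 0.25, so a chain of 10 sigmoid layers
# scales the gradient by at most 0.25**10 (ignoring the weights).
upper_bound_10_layers = 0.25 ** 10
print(f"{upper_bound_10_layers:.2e}")
```

The derivative peaks in the middle of the curve and is already below $10^{-4}$ at $|x| = 10$, which is why saturated sigmoid units learn very slowly.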

<span style="display:none" id="q_sigmoid">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyByZWxhdGVkIHRvIHNpZ21vaWQgZnVuY3Rpb24/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJHaXZlcyBhIGNsZWFyIHByZWRpY3Rpb24oY2xhc3NpZmljYXRpb24pIHdpdGggMSAmIDAiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb3QgYSB6ZXJvLWNlbnRyaWMgZnVuY3Rpb24iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb3JtYWxseSB1c2VkIGFzIHRoZSBvdXRwdXQgb2YgYSBiaW5hcnkgcHJvYmFiaWxpc3RpYyBmdW5jdGlvbi4iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb24tbm9ybWFsaXplZCBmdW5jdGlvbiIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJbmNvcnJlY3QsIHRoaXMgaXMgb25lIG9mIHRoZSBiZXN0IG5vcm1hbGl6ZWQgZnVuY3Rpb24uIn1dfV0=</span>

In [7]:
from jupyterquiz import display_quiz
display_quiz("#q_sigmoid")


## Tanh function (Hyperbolic Tangent)

The tanh activation function is similar to the sigmoid in that it maps input values to an s-shaped curve. But, in the tanh function, the range is (-1, 1) and is centered at 0. This addresses one of the issues with the sigmoid function.

```{math}
:label: tanh
    \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
```

The output of the tanh activation function is [zero-centered](https://stackoverflow.com/questions/59540276/why-in-preprocessing-image-data-we-need-to-do-zero-centered-data); hence we can easily interpret the output values as strongly negative, neutral, or strongly positive.

Derivative of tanh function:

```{math}
:label: tanh_derivative
\tanh'(x) = 1 - \tanh^2(x)
```

Because this derivative approaches 0 as $|x|$ grows, tanh also faces the problem of vanishing gradients, just like the sigmoid activation function. Near the origin, however, the gradient of tanh is steeper than that of sigmoid: its maximum is 1 at $x = 0$, compared with 0.25 for sigmoid.

`````{admonition} Pros
:class: tip
- **Zero-centered**: Unlike the sigmoid function, tanh outputs range from -1 to 1, making it zero-centered. This can help with the convergence of the gradient descent during training, as it avoids bias in the gradients.
- **Stronger gradients**: Compared to the sigmoid function, tanh often yields stronger gradients, as the slope can be steeper. This can lead to faster learning in some cases.
`````

```{admonition} Cons
:class: warning
- **Vanishing gradient problem**: Tanh suffers from the vanishing gradient problem, similar to sigmoid. For inputs with large magnitudes, the function saturates, making the gradient near zero.
- **Not suitable for all layers**: Due to its nature, tanh might not be suitable for use in the output layer for certain types of problems, like binary classification, where outputs like probabilities are required.
```
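To make the comparison with sigmoid concrete, here is a small sketch. It relies on the standard identity $\tanh(x) = 2\sigma(2x) - 1$, which is not stated in the text above and is added here as background; the evaluation points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-4, 4, 9)

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
rescaled_sigmoid = 2.0 * sigmoid(2.0 * xs) - 1.0
same_curve = np.allclose(np.tanh(xs), rescaled_sigmoid)

# Gradients at the origin: tanh is four times steeper than sigmoid.
tanh_grad_at_0 = 1.0 - np.tanh(0.0) ** 2
sigmoid_grad_at_0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))

# Far from the origin tanh saturates, so its gradient is close to zero.
tanh_grad_at_5 = 1.0 - np.tanh(5.0) ** 2

print(same_curve, tanh_grad_at_0, sigmoid_grad_at_0)  # -> True 1.0 0.25
```

The identity explains both observations at once: tanh inherits the s-shape and the saturation of sigmoid, but its output is stretched to (-1, 1) and centered at 0.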

<span style="display:none" id="q_tanh">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyBhZHZhbnRhZ2VzIG9mIHRhbmggZnVuY3Rpb24/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJMaWtlIGEgc2lnbW9pZCwgaXQgaGFzIGEgbGltaXRlZCByYW5nZSBvZiB2YWx1ZXMiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJDb21wdXRhdGlvbmFsbHkgaW5leHBlbnNpdmUgZnVuY3Rpb24iLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8sIGJlY2F1c2UgaXQgaXMgZXhwb25lbnRpYWwgaW4gbmF0dXJlLiJ9LCB7ImFuc3dlciI6ICJVbmxpa2UgdGhlIHNpZ21vaWQsIHRoZSByYW5nZSBvZiB2YWx1ZXMgaXMgc3ltbWV0cmljYWwiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJaZXJvLWNlbnRyaWMgZnVuY3Rpb24iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9XX1d</span>

In [9]:
from jupyterquiz import display_quiz
display_quiz("#q_tanh")


## ReLU function

ReLU (Rectified Linear Unit) is a more modern and widely used activation function. It replaces negative values with 0 and leaves positive values unchanged. Because its gradient is exactly 1 for every positive input, it does not saturate there, which mitigates vanishing gradients during backpropagation, and it is cheaper to compute than sigmoid or tanh.

```{math}
:label: relu
f(x) = \max(0, x)
```

<span style="display:none" id="q_relu">W3sicXVlc3Rpb24iOiAiSW4gdGhlIFJlTFUgYWN0aXZhdGlvbiBmdW5jdGlvbiwgd2hhdCBpcyB0aGUgb3V0cHV0IHdoZW4gdGhlIGlucHV0IGlzIG5lZ2F0aXZlPyIsICJ0eXBlIjogIm51bWVyaWMiLCAicHJlY2lzaW9uIjogMiwgImFuc3dlcnMiOiBbeyJ0eXBlIjogInZhbHVlIiwgInZhbHVlIjogMCwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4ifSwgeyJ0eXBlIjogInJhbmdlIiwgInZhbHVlIjogMC4wLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7InR5cGUiOiAiZGVmYXVsdCIsICJmZWVkYmFjayI6ICJUaGF0J3MgaW5jb3JyZWN0LiBUcnkgYWdhaW4uIn1dfV0=</span>

`````{admonition} Pros
:class: tip
- **Computational efficiency**: ReLU is computationally simple as it only requires a max function, which makes it faster to compute than sigmoid or tanh, especially in deep networks.
- **Mitigates the vanishing gradient problem**: Unlike sigmoid and tanh, ReLU does not saturate for positive inputs, so it partially addresses the vanishing gradient problem. Gradients do not vanish as quickly because the slope is constant (equal to 1) for positive values.
- **Sparsity**: ReLU leads to sparse activations; in other words, for any given input, only some of the neurons are activated (those with positive pre-activation). This sparsity can make the network more efficient and may make it less prone to overfitting.
- **Improved convergence**: ReLU can help neural networks converge faster compared to sigmoid and tanh, due to its linear, non-saturating form.
`````

```{admonition} Cons
:class: warning
- **Dying ReLU problem**: For inputs less than zero, the gradient is zero, which can lead to dead neurons that stop learning entirely during training. This is known as the "dying ReLU" problem.
- **Fragile to outliers and noise**: Since ReLU is unbounded for positive values, it can be sensitive to outliers and noise in the data, leading to unstable training, especially with high learning rates.
- **Inappropriate for negative values**: Since ReLU completely blocks negative inputs, it may not be suitable for tasks where negative input values carry important information.
- **Not zero-centered**: Like sigmoid, the output of ReLU is not zero-centered. This can potentially lead to optimization issues during training.
```

In [11]:
from jupyterquiz import display_quiz
display_quiz("#q_relu")


### The Dying ReLU problem


Derivative of ReLU function:

```{math}
:label: relu_derivative
f'(x) = 
\begin{cases} 
0 & \text{if } x < 0 \\
1 & \text{if } x > 0 \\
\text{undefined} & \text{if } x = 0 
\end{cases}
```

On the negative side of ReLU the gradient is zero. For this reason, during backpropagation the weights and biases of a neuron whose pre-activation is negative are not updated. If this happens for every training example, the neuron becomes a dead neuron that never gets activated again.
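A dead neuron can be reproduced with a tiny hand-made example. The sketch below assumes a single ReLU neuron trained with a mean squared error loss; the data, weights and the large negative bias are invented for illustration, and the derivative at $x = 0$ is set to 0 by convention.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Convention used here: treat the undefined derivative at z == 0 as 0.
    return (z > 0).astype(float)

# Hypothetical data: 3 samples with 2 features each.
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]])
y = np.array([1.0, 2.0, 0.5])

w = np.array([0.1, 0.1])
b = -5.0  # a large negative bias, e.g. left behind by one oversized update

z = X @ w + b  # pre-activations: negative for every sample
a = relu(z)    # activations: all zeros

# Backpropagation for the loss 0.5 * mean((a - y) ** 2).
dz = (a - y) * relu_grad(z) / len(y)
grad_w = X.T @ dz
grad_b = dz.sum()

print(z, a, grad_w, grad_b)
```

Although the predictions are wrong for every sample, both gradients are exactly zero, so gradient descent cannot move `w` or `b` and the neuron stays dead.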

### Leaky ReLU

Leaky ReLU is an improved version of the ReLU function that addresses the dying ReLU problem by giving the negative region a small positive slope.

```{math}
:label: leaky_relu
f(x) = \max(0.01x, x)
```

Leaky ReLU shares the advantages of ReLU and, in addition, enables backpropagation even for negative input values, because the gradient there is small but non-zero.

`````{admonition} Pros
:class: tip
- **Solves dying ReLU problem**: Unlike standard ReLU, Leaky ReLU allows a small, non-zero gradient when the unit is not active (for negative input values), which helps to keep the neurons alive and mitigate the dying ReLU problem.
- **Handles negative inputs**: By allowing a small gradient for negative values, it utilizes information from negative inputs which might be useful in some cases.

`````

```{admonition} Cons
:class: warning
- **Parameter tuning**: The slope for negative values is fixed in advance (a small value like 0.01). This value may not be optimal for all problems and might need to be tuned as a hyperparameter.
- **Not zero-centered**: Like ReLU, the output is not zero-centered, which can potentially lead to optimization issues during training.
- **Inconsistent predictions**: The predictions may not be consistent for negative input values.
```
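A minimal sketch of Leaky ReLU and its derivative, using the slope 0.01 from the formula above (the function names and the sample inputs are illustrative assumptions; the derivative at exactly 0 is set to the negative-side slope by convention):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Return x for positive inputs and slope * x otherwise."""
    return np.where(x > 0, x, slope * x)

def leaky_relu_derivative(x, slope=0.01):
    """Return 1 for positive inputs and the small slope otherwise."""
    return np.where(x > 0, 1.0, slope)

xs = np.array([-3.0, -1.0, 0.0, 2.0])
outputs = leaky_relu(xs)
grads = leaky_relu_derivative(xs)

print(outputs)  # values: -0.03, -0.01, 0.0, 2.0
print(grads)    # values: 0.01, 0.01, 0.01, 1.0
```

Since the gradient is never exactly zero, a neuron with negative pre-activations still receives (small) weight updates.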

<span style="display:none" id="q_leaky_relu">W3sicXVlc3Rpb24iOiAiSW4gYSBMZWFreSBSZUxVIGFjdGl2YXRpb24gZnVuY3Rpb24gd2l0aCBhIGxlYWt5IGZhY3RvciBvZiAwLjEsIGlmIHRoZSBpbnB1dCBpcyAtMC41LCB3aGF0IGlzIHRoZSBvdXRwdXQ/IiwgInR5cGUiOiAibnVtZXJpYyIsICJwcmVjaXNpb24iOiAyLCAiYW5zd2VycyI6IFt7InR5cGUiOiAidmFsdWUiLCAidmFsdWUiOiAtMC4wNSwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4ifSwgeyJ0eXBlIjogImRlZmF1bHQiLCAiZmVlZGJhY2siOiAiVGhhdCdzIGluY29ycmVjdC4gVHJ5IGFnYWluLiJ9XX1d</span>

In [13]:
from jupyterquiz import display_quiz
display_quiz("#q_leaky_relu")


### Parametric ReLU

Parametric ReLU (PReLU) treats the slope of the negative part of the function as a learnable parameter $\alpha$. The most appropriate value of $\alpha$ is learnt through backpropagation, together with the weights. The $\max$ form of the formula holds for $\alpha \leq 1$.

```{math}
:label: parametric_relu
f(x) = \max(\alpha x, x)
```

Parametric ReLU is used when Leaky ReLU still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.

`````{admonition} Pros
:class: tip
- **Adaptive Learning**: In PReLU, the slope of the negative part is learned during training. This adaptability can lead to better performance as the function can adapt to the specificities of the data.
`````

```{admonition} Cons
:class: warning
- **Risk of overfitting**: The additional learnable parameters can lead to overfitting, especially in small datasets.
- **Increased complexity**: The added complexity of learning the parameters for the negative slope can increase the training time and computational cost.
- **Dependency on slope parameter**: The function may perform differently on different problems, depending on the value of the slope parameter.
```
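How $\alpha$ can be learnt is sketched below with plain gradient descent on a toy problem. Everything here is an assumption made for illustration: the targets are generated with a "true" slope of 0.25, the loss is mean squared error, and the learning rate and step count are arbitrary. The key ingredient is the derivative of the output with respect to $\alpha$, which equals $x$ for $x \leq 0$ and 0 for $x > 0$.

```python
import numpy as np

def prelu(x, alpha):
    """Parametric ReLU: x for positive inputs, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    """Derivative of prelu(x, alpha) with respect to alpha."""
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -1.0, 0.5, 3.0])
target = prelu(x, alpha=0.25)  # toy targets produced with slope 0.25

alpha = 0.01          # initial slope, as in Leaky ReLU
learning_rate = 0.1

for _ in range(200):
    error = prelu(x, alpha) - target
    # Gradient of 0.5 * mean(error ** 2) with respect to alpha.
    grad = np.mean(error * prelu_grad_alpha(x))
    alpha -= learning_rate * grad

print(round(float(alpha), 4))  # -> 0.25
```

Only the negative inputs contribute to the gradient of $\alpha$, which is why the slope is learnt from the part of the data that plain ReLU would discard.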

## SoftMax

Before exploring the ins and outs of the Softmax activation function, we should focus on its building block: the sigmoid/logistic activation function that works on calculating probability values. 

The output of the sigmoid function lies in the range of 0 to 1, so it can be interpreted as a probability.

````{admonition} Question
:class: important
Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with it?

```{admonition} Answer
:class: hint, dropdown
We can’t.

The above values don’t make sense as class probabilities: for mutually exclusive classes, the output probabilities should sum to 1, but these values sum to 3.8.
```
````

The Softmax function can be viewed as a generalization of the sigmoid to multiple classes. It calculates relative probabilities: like the sigmoid/logistic activation function, it returns a probability for each class, but it also normalizes these probabilities so that they sum to 1.

```{math}
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
```

```{note}
SoftMax is most commonly used as an activation function for the last layer of the neural network in the case of multi-class classification. 
```

````{admonition} A simple example of how SoftMax works
:class: dropdown
Assume that you have three classes, meaning that there would be three neurons in the output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].

Applying the softmax function over these values to give a probabilistic view will result in the following outcome: [0.58, 0.23, 0.19]. 

Softmax itself does not return 0s and 1s. To obtain a prediction, we take the index of the largest probability (argmax), here 0.58. So the output would be the class corresponding to the first neuron (index 0) out of three.
````
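The numbers in the example above can be reproduced with a short sketch. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is added here as an assumption and is not part of the formula in the text.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in np.exp
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([1.8, 0.9, 0.68])
probs = softmax(logits)

# The prediction is the index of the largest probability (argmax).
predicted_class = int(np.argmax(probs))

print(np.round(probs, 2))  # approximately 0.58, 0.23, 0.19
print(predicted_class)     # -> 0
```

The same function also handles very large scores without overflow: `softmax(np.array([1000.0, 1000.0]))` returns two equal probabilities of 0.5.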

```{figure} images/softmax.jpg
:align: center
```

## Activation functions plot

In [2]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return np.where(x > 0, 1, 0)

def parametric_relu(x, alpha=0.01):
    # The max form matches PReLU only for alpha <= 1.
    return np.maximum(alpha * x, x)

def parametric_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

x_values = np.linspace(-5, 5, 100)


sigmoid_trace = go.Scatter(x=x_values, y=sigmoid(x_values), name='Sigmoid', mode='lines', line=dict(color='brown') )
sigmoid_derivative_trace = go.Scatter(x=x_values, y=sigmoid_derivative(x_values), name='Sigmoid Derivative', mode='lines')

tanh_trace = go.Scatter(x=x_values, y=tanh(x_values), name='Tanh', mode='lines')
tanh_derivative_trace = go.Scatter(x=x_values, y=tanh_derivative(x_values), name='Tanh Derivative', mode='lines')

relu_trace = go.Scatter(x=x_values, y=relu(x_values), name='ReLU', mode='lines')
relu_derivative_trace = go.Scatter(x=x_values, y=relu_derivative(x_values), name='ReLU Derivative', mode='lines')

initial_alpha = 0.05
parametric_relu_trace = go.Scatter(x=x_values, y=parametric_relu(x_values, alpha=initial_alpha), name=f'Parametric ReLU (alpha = {initial_alpha})', mode='lines')
parametric_relu_derivative_trace = go.Scatter(x=x_values, y=parametric_relu_derivative(x_values, alpha=initial_alpha), name='Parametric ReLU Derivative', mode='lines')

fig = make_subplots(rows=1, cols=1)

fig.add_trace(sigmoid_trace)
fig.add_trace(sigmoid_derivative_trace)
fig.add_trace(tanh_trace)
fig.add_trace(tanh_derivative_trace)
fig.add_trace(relu_trace)
fig.add_trace(relu_derivative_trace)
fig.add_trace(parametric_relu_trace)
fig.add_trace(parametric_relu_derivative_trace)


fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="f(x)")
fig.update_layout(title_text="Activation functions and their derivatives", title_x=0.4)

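# Each visibility mask follows the order in which the traces were added:
# (function, derivative) for sigmoid, tanh, ReLU and parametric ReLU.
# "All" shows the four functions and hides their derivatives.
buttons = [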
    dict(label="All", method="update", args=[{"visible": [True, False, True, False, True, False, True, False]}]),
    dict(label="Sigmoid", method="update", args=[{"visible": [True, True, False, False, False, False, False, False]}]),
    dict(label="Tanh", method="update", args=[{"visible": [False, False, True, True, False, False, False, False]}]),
    dict(label="ReLU", method="update", args=[{"visible": [False, False, False, False, True, True, False, False]}]),
    dict(label="Parametric ReLU", method="update", args=[{"visible": [False, False, False, False, False, False, True, True]}]),
]

fig.update_layout(
    updatemenus=[
        {
            "buttons": buttons,
            "direction": "down",
            "pad": {"r": 10, "t": 10},  
            "showactive": True,
            "x": 0.33,
            "xanchor": "left",
            "y": 1.163,
            "yanchor": "top"
        },
    ]
)

fig.show()