# Activation functions

## Team members:

- Technical writer - Arman
- Designer of interactive plots - Sabina
- Designer of quizzes - Aidos

## What are activation functions?

An activation function determines the output of a neuron: the neuron computes the weighted sum of its inputs, adds a bias, and passes the result through the activation function, which decides whether, and how strongly, the neuron is activated. Most activation functions are differentiable operators that transform input signals to outputs, and most of them add nonlinearity. Because activation functions are fundamental to deep learning, let’s briefly survey some common ones.

```{figure} images/activation_f.gif
:align: center
```

## Why do Neural Networks Need an Activation Function?

Activation functions, essential in neural networks, allow for more than mere linear transformations of inputs. Without them, a neural network, regardless of its hidden layers, would be reduced to a linear model, incapable of handling complex tasks. These functions enable the network to focus on relevant information and ignore the irrelevant, significantly enhancing its learning capability.
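
The collapse to a linear model can be checked numerically. The following is a minimal sketch, assuming two fully connected layers with hypothetical shapes and randomly drawn weights (the names `W1`, `b1`, `W2`, `b2` are illustrative only): without an activation in between, the two layers behave exactly like a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

# Two "layers" with no activation function (shapes are arbitrary assumptions)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers without any nonlinearity
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))
```

Inserting a non-linear function between the two layers breaks this equivalence, which is what lets deeper networks represent more than a linear model.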

## 3 Types of Neural Network Activation Functions

Let's examine the most popular types of neural network activation functions to solidify our knowledge of activation functions in practice. The three main types are:

1. Binary step
2. Linear activation
3. Non-linear activation

### Binary step activation function

The binary step function depends on a threshold value that decides whether a neuron should be activated or not. The input fed to the activation function is compared to the threshold (0 in the formula below); if the input is greater than or equal to it, the neuron is activated, otherwise it is deactivated, meaning that its output is not passed on to the next hidden layer.

```{math}
:label: binary_step
    f(x) =
    \begin{cases}
    0, & \text{if } x < 0,\\
    1, & \text{if } x \geq 0.
    \end{cases}
```

<span style="display:none" id="q_binary">W3sicXVlc3Rpb24iOiAiV2hhdCBhcmUgc29tZSBsaW1pdGF0aW9ucyBvZiB1c2luZyBhIGJpbmFyeSBzdGVwIGZ1bmN0aW9uPyIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAiVGhleSBhbGxvdyBiYWNrcHJvcGFnYXRpb24iLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8sIHRoYXQgaXMgaW5jb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJJdCBjYW5ub3QgcHJvdmlkZSBtdWx0aS12YWx1ZSBvdXRwdXRzIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4gSXQgY2FuJ3QgYmUgdXNlZCBmb3IgbXVsdGktY2xhc3MgY2xhc3NpZmljYXRpb24gcHJvYmxlbXMifSwgeyJhbnN3ZXIiOiAiVGhlIGdyYWRpZW50IG9mIHRoZSBzdGVwIGZ1bmN0aW9uIGlzIHplcm8iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb25lIG9mIHRoZSBhYm92ZSIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJbmNvcnJlY3QuIn1dfV0=</span>

In [None]:
from jupyterquiz import display_quiz
display_quiz("#q_binary")

In [7]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def binary_step_function(x, threshold):
    # Use >= so that the code matches the formula above (f(x) = 1 at the threshold)
    return np.where(x >= threshold, 1, 0)

def plot_binary_step(threshold):
    x = np.linspace(-10, 10, 1000)
    y = binary_step_function(x, threshold)

    fig = make_subplots(rows=1, cols=1, subplot_titles=['Binary step activation function'])
    trace = go.Scatter(x=x, y=y, mode='lines', name=f'Threshold = {threshold}', line=dict(color='#FDAB9F'))

    fig.add_trace(trace)

    fig.update_layout(
        xaxis_title='x',
        yaxis_title='f(x)',
        showlegend=True,
        legend=dict(x=0, y=1, traceorder='normal'),
        height=500,
        width=700,
    )

    return fig

initial_threshold = 0.0

fig = go.FigureWidget(plot_binary_step(initial_threshold))

def update_threshold(threshold):
    with fig.batch_update():
        fig.data[0].y = binary_step_function(fig.data[0].x, threshold)
        fig.layout.title.text = f'Binary step activation function (threshold = {threshold})'

fig



### Linear activation function

The linear activation function, also referred to as "no activation" or "identity function," is a function where the activation is directly proportional to the input. This function does not modify the weighted sum of the input and simply returns the value it was given. 

```{math}
:label: linear_activation
    f(x) = x
```

<span style="display:none" id="q_linear">W3sicXVlc3Rpb24iOiAiV2hhdCBhcmUgc29tZSBsaW1pdGF0aW9ucyBvZiB1c2luZyBhIGxpbmVhciBhY3RpdmF0aW9uIGZ1bmN0aW9uPyIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAiSXRcdTIwMTlzIG5vdCBwb3NzaWJsZSB0byB1c2UgYmFja3Byb3BhZ2F0aW9uIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4gVGhlIGRlcml2YXRpdmUgb2YgdGhlIGZ1bmN0aW9uIGlzIGEgY29uc3RhbnQgYW5kIGhhcyBubyByZWxhdGlvbiB0byB0aGUgaW5wdXQgeCJ9LCB7ImFuc3dlciI6ICJUaGUgbGFzdCBsYXllciByZW1haW5zIGxpbmVhciByZWdhcmRsZXNzIG9mIG5ldHdvcmsgZGVwdGguIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4gSXQgZWZmZWN0aXZlbHkgcmVkdWNlcyB0aGUgbmV0d29yayB0byBvbmUgbGF5ZXIuIn0sIHsiYW5zd2VyIjogIkl0IG91dHB1dHMgdGhlIGlucHV0IHZhbHVlIHdpdGhvdXQgYW55IHRyYW5zZm9ybWF0aW9uLiIsICJjb3JyZWN0IjogdHJ1ZSwgImZlZWRiYWNrIjogIkNvcnJlY3QuIn0sIHsiYW5zd2VyIjogIkl0IGlzIGhpZ2hseSBlZmZlY3RpdmUgZm9yIGFsbCB0eXBlcyBvZiBuZXVyYWwgbmV0d29yayBhcmNoaXRlY3R1cmVzIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdCJ9XX1d</span>

In [None]:
from jupyterquiz import display_quiz
display_quiz("#q_linear")

In [8]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from ipywidgets import interact, widgets

def linear_activation_function(x):
    return x

def plot_linear_activation():
    x = np.linspace(-10, 10, 1000)
    y = linear_activation_function(x)

    fig = make_subplots(rows=1, cols=1, subplot_titles=['Linear Activation Function'])
    trace = go.Scatter(x=x, y=y, mode='lines', name='Linear Activation', line=dict(color='#FDAB9F'))

    fig.add_trace(trace)

    fig.update_layout(
        xaxis_title='x',
        yaxis_title='f(x)',
        showlegend=True,
        legend=dict(x=0, y=1, traceorder='normal'),
        height=500,
        width=700,
    )

    return fig


fig_linear = go.FigureWidget(plot_linear_activation())
widgets.VBox([fig_linear])



### Non-Linear activation function

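A network that uses only the linear activation function is equivalent to a linear regression model, no matter how many layers it has. This is because the linear activation function simply outputs the input that it receives, without applying any transformation, so stacked layers collapse into a single linear map.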

In a neural network, the output of a neuron is computed using the following equation:

```{math}
:label: non_linear_act
    \text{output} = \text{activation}(\text{inputs} \cdot \text{weights} + \text{bias})
```

Non-linear activation functions overcome the limitations of linear ones: their derivatives depend on the input, which makes backpropagation informative, and stacking layers yields non-linear combinations of the inputs, which supports deep, complex architectures.
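
The neuron equation above can be sketched in a few lines. This is only an illustration: the helper `neuron_output` and the input, weight, and bias values are made up for the example, and sigmoid stands in for any non-linear activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Hypothetical helper: weighted sum plus bias, passed through an activation."""
    z = np.dot(inputs, weights) + bias
    return activation(z)

# Arbitrary example values (assumptions, not taken from a real model)
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, -0.2])
bias = 0.1

# With the identity ("linear") activation the neuron returns the weighted sum itself
linear_out = neuron_output(inputs, weights, bias, lambda z: z)

# With a non-linear activation the same sum is squashed into (0, 1)
sigmoid_out = neuron_output(inputs, weights, bias, sigmoid)

print(linear_out, sigmoid_out)
```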

## Sigmoid function

This function takes any real value as input and outputs values in the range of 0 to 1. 

The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.


```{math}
:label: sigmoid
    \sigma(x) = \frac{1}{1 + e^{-x}}
```

Derivative of sigmoid function:

```{math}
:label: sigmoid_derivative
    \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
```

As the plot of the derivative shows, the gradient values are only significant in the range -3 to 3, and the curve gets much flatter in other regions.

This implies that for inputs greater than 3 or less than -3, the function has very small gradients. As the gradient approaches zero, the network learns very slowly or stops learning altogether; this is the [Vanishing gradient](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) problem.
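
To make the numbers concrete, here is a small sketch that evaluates the sigmoid derivative at a few points. The derivative peaks at 0.25 (at $x = 0$), so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer; the depth of 10 used below is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# The derivative shrinks quickly as the input moves away from zero
for value in [0.0, 3.0, 6.0]:
    print(f"sigmoid'({value}) = {sigmoid_derivative(value):.4f}")

# Upper bound on the gradient factor contributed by 10 stacked sigmoids
max_gradient = sigmoid_derivative(0.0)
print(max_gradient ** 10)
```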


<span style="display:none" id="q_sigmoid">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyByZWxhdGVkIHRvIHNpZ21vaWQgZnVuY3Rpb24/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJHaXZlcyBhIGNsZWFyIHByZWRpY3Rpb24oY2xhc3NpZmljYXRpb24pIHdpdGggMSAmIDAiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb3QgYSB6ZXJvLWNlbnRyaWMgZnVuY3Rpb24iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb3JtYWxseSB1c2VkIGFzIHRoZSBvdXRwdXQgb2YgYSBiaW5hcnkgcHJvYmFiaWxpc3RpYyBmdW5jdGlvbi4iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJOb24tbm9ybWFsaXplZCBmdW5jdGlvbiIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJbmNvcnJlY3QsIHRoaXMgaXMgb25lIG9mIHRoZSBiZXN0IG5vcm1hbGl6ZWQgZnVuY3Rpb24uIn1dfV0=</span>

In [None]:
from jupyterquiz import display_quiz
display_quiz("#q_sigmoid")

In [25]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Sigmoid and sigmoid derivative functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

x = np.linspace(-7, 7, 100)
sigmoid_values = sigmoid(x)
sigmoid_derivative_values = sigmoid_derivative(x)

# Create initial plots with titles
fig = make_subplots(rows=1, cols=1)

sigmoid_trace = go.Scatter(x=x, y=sigmoid_values, mode='lines', name='Sigmoid function')
sigmoid_derivative_trace = go.Scatter(x=x, y=sigmoid_derivative_values, mode='lines', name='Sigmoid derivative', line=dict(color='#FDAB9F'))

# Set plot titles
sigmoid_title = "Sigmoid Function"
sigmoid_derivative_title = "Sigmoid Derivative"

# Create buttons for interaction
buttons = [
    {
        "label": "Sigmoid",
        "method": "update",
        "args": [{"visible": [True, False]}, {"title": sigmoid_title}],
    },
    {
        "label": "Sigmoid Derivative",
        "method": "update",
        "args": [{"visible": [False, True]}, {"title": sigmoid_derivative_title}],
    },
    {
        "label": "Both",
        "method": "update",
        "args": [{"visible": [True, True]}, {"title": "Both Plots"}],
    },
]

# Update layout with buttons and increased pad
fig.update_layout(
    updatemenus=[
        {
            "buttons": buttons,
            "direction": "down",
            "pad": {"r": 10, "t": -20},  # Increased top padding
            "showactive": True,
            "x": 0.33,
            "xanchor": "left",
            "y": 1.163,
            "yanchor": "top"
        },
    ],
    title=sigmoid_title,  # Set the initial title
    xaxis_title="X-axis",
    yaxis_title="f(x)",
    showlegend=True,
    legend=dict(x=0, y=1, traceorder='normal'),
    height=500,
    width=700,
)

# Add initial traces to the figure
fig.add_trace(sigmoid_trace)
fig.add_trace(sigmoid_derivative_trace)

# Show the initial figure
fig.show()


## Tanh function (Hyperbolic Tangent)

The tanh activation function is similar to the sigmoid in that it maps input values to an s-shaped curve. However, the tanh output range is (-1, 1) and is centered at 0, which addresses one of the issues with the sigmoid function: its outputs are not zero-centered.

```{math}
:label: tanh
    \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
```

The output of the tanh activation function is [Zero centered](https://stackoverflow.com/questions/59540276/why-in-preprocessing-image-data-we-need-to-do-zero-centered-data); hence we can easily map the output values as strongly negative, neutral, or strongly positive. 

Derivative of tanh function:

```{math}
:label: tanh_derivative
\tanh'(x) = 1 - \tanh^2(x)
```

As the derivative shows, tanh also faces the vanishing gradient problem, similar to the sigmoid activation function. In addition, the gradient of the tanh function is much steeper than that of the sigmoid function.
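
The comparison with sigmoid can be checked numerically. The sketch below, on an arbitrarily chosen grid, compares the peak derivatives (1 for tanh versus 0.25 for sigmoid) and verifies the identity $\tanh(x) = 2\sigma(2x) - 1$, which shows that tanh is a rescaled and shifted sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-7, 7, 141)  # grid that includes (approximately) x = 0

tanh_derivative = 1.0 - np.tanh(x) ** 2
sigmoid_derivative = sigmoid(x) * (1 - sigmoid(x))

# Peak gradient of each function over the grid
print(tanh_derivative.max(), sigmoid_derivative.max())

# tanh expressed through the sigmoid
rescaled_sigmoid = 2 * sigmoid(2 * x) - 1
print(np.allclose(np.tanh(x), rescaled_sigmoid))
```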

<span style="display:none" id="q_tanh">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyBhZHZhbnRhZ2VzIG9mIHRhbmggZnVuY3Rpb24/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJMaWtlIGEgc2lnbW9pZCwgaXQgaGFzIGEgbGltaXRlZCByYW5nZSBvZiB2YWx1ZXMiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJDb21wdXRhdGlvbmFsbHkgaW5leHBlbnNpdmUgZnVuY3Rpb24iLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8sIGJlY2F1c2UgaXQgaXMgZXhwb25lbnRpYWwgaW4gbmF0dXJlLiJ9LCB7ImFuc3dlciI6ICJVbmxpa2UgdGhlIHNpZ21vaWQsIHRoZSByYW5nZSBvZiB2YWx1ZXMgaXMgc3ltbWV0cmljYWwiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7ImFuc3dlciI6ICJaZXJvLWNlbnRyaWMgZnVuY3Rpb24iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9XX1d</span>

In [None]:
from jupyterquiz import display_quiz
display_quiz("#q_tanh")

```{note}
Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero centered, and the gradients are not restricted to move in a certain direction. Therefore, in practice, the tanh nonlinearity is usually preferred to the sigmoid nonlinearity.
```

In [35]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1.0 - tanh(x)**2

x = np.linspace(-7, 7, 100)
tanh_values = tanh(x)
tanh_derivative_values = tanh_derivative(x)

fig = make_subplots(rows=1, cols=1)

tanh_trace = go.Scatter(x=x, y=tanh_values, mode='lines', name='tanh')
tanh_derivative_trace = go.Scatter(x=x, y=tanh_derivative_values, mode='lines', name='tanh derivative', line=dict(color='#FDAB9F'))

tanh_title = "Hyperbolic Tangent Function"
tanh_derivative_title = "Hyperbolic Tangent Derivative"

buttons = [
    {
        "label": "tanh",
        "method": "update",
        "args": [{"visible": [True, False]}, {"title": tanh_title}],
    },
    {
        "label": "tanh Derivative",
        "method": "update",
        "args": [{"visible": [False, True]}, {"title": tanh_derivative_title}],
    },
    {
        "label": "Both",
        "method": "update",
        "args": [{"visible": [True, True]}, {"title": "Both Plots"}],
    },
]

fig.update_layout(
    updatemenus=[
        {
            "buttons": buttons,
            "direction": "down",
            "pad": {"r": 10, "t": -20},  
            "showactive": True,
            "x": 0.4,
            "xanchor": "left",
            "y": 1.163,
            "yanchor": "top"
        },
    ],
    title=tanh_title, 
    xaxis_title="X-axis",
    yaxis_title="tanh(x)",
    showlegend=True,
    legend=dict(x=0, y=1, traceorder='normal'),
    height=500,
    width=700,
)

fig.add_trace(tanh_trace)
fig.add_trace(tanh_derivative_trace)

fig.show()

## ReLU function

ReLU (Rectified Linear Unit) is a more modern and widely used activation function. It replaces negative values with 0 and leaves positive values unchanged. Because its gradient is 1 for every positive input, it helps mitigate the vanishing gradient problem during backpropagation, and it is cheap to compute.

```{math}
:label: relu
f(x) = \max(0, x)
```

<span style="display:none" id="q_relu">W3sicXVlc3Rpb24iOiAiSW4gdGhlIFJlTFUgYWN0aXZhdGlvbiBmdW5jdGlvbiwgd2hhdCBpcyB0aGUgb3V0cHV0IHdoZW4gdGhlIGlucHV0IGlzIG5lZ2F0aXZlPyIsICJ0eXBlIjogIm51bWVyaWMiLCAicHJlY2lzaW9uIjogMiwgImFuc3dlcnMiOiBbeyJ0eXBlIjogInZhbHVlIiwgInZhbHVlIjogMCwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4ifSwgeyJ0eXBlIjogInJhbmdlIiwgInZhbHVlIjogMC4wLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0LiJ9LCB7InR5cGUiOiAiZGVmYXVsdCIsICJmZWVkYmFjayI6ICJUaGF0J3MgaW5jb3JyZWN0LiBUcnkgYWdhaW4uIn1dfV0=</span>

In [1]:
from jupyterquiz import display_quiz
display_quiz("#q_relu")


In [41]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return np.where(x > 0, 1, 0)

x = np.linspace(-3, 3, 100)
relu_values = relu(x)
relu_derivative_values = relu_derivative(x)

fig = make_subplots(rows=1, cols=1)

relu_trace = go.Scatter(x=x, y=relu_values, mode='lines', name='ReLU function')
relu_derivative_trace = go.Scatter(x=x, y=relu_derivative_values, mode='lines', name='ReLU derivative', line=dict(color='#FDAB9F'))

relu_title = "Rectified Linear Unit (ReLU) Function"
relu_derivative_title = "ReLU Derivative"

buttons = [
    {
        "label": "ReLU",
        "method": "update",
        "args": [{"visible": [True, False]}, {"title": relu_title}],
    },
    {
        "label": "ReLU Derivative",
        "method": "update",
        "args": [{"visible": [False, True]}, {"title": relu_derivative_title}],
    },
    {
        "label": "Both",
        "method": "update",
        "args": [{"visible": [True, True]}, {"title": "Both Plots"}],
    },
]

fig.update_layout(
    updatemenus=[
        {
            "buttons": buttons,
            "direction": "down",
            "pad": {"r": 10, "t": 10},  
            "showactive": True,
            "x": 0.33,
            "xanchor": "left",
            "y": 1.163,
            "yanchor": "top"
        },
    ],
    title=relu_title, 
    xaxis_title="X-axis",
    yaxis_title="f(x)",
    showlegend=True,
    legend=dict(x=0, y=1, traceorder='normal'),
    height=500,
    width=700,
)

fig.add_trace(relu_trace)
fig.add_trace(relu_derivative_trace)

fig.show()


### The Dying ReLU problem


Derivative of ReLU function:

```{math}
:label: relu_derivative
f'(x) = 
\begin{cases} 
0 & \text{if } x < 0 \\
1 & \text{if } x > 0 \\
\text{undefined} & \text{if } x = 0 
\end{cases}
```

On the negative side, the ReLU gradient is zero. For this reason, during backpropagation, the weights and biases of some neurons are not updated. This can create dead neurons that never get activated.
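
A dead neuron can be reproduced with a toy setup. The sketch below is only an illustration under made-up assumptions: a single neuron with hand-picked weights, a large negative bias, and a random batch of inputs in $[0, 1)$. Every pre-activation is negative, so both the output and the gradient with respect to the weights are zero for the whole batch.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return np.where(z > 0, 1.0, 0.0)

rng = np.random.default_rng(0)
inputs = rng.uniform(0, 1, size=(100, 3))  # hypothetical batch of features in [0, 1)
weights = np.array([0.5, -0.3, 0.2])       # arbitrary weights
bias = -5.0                                # large negative bias: pre-activations stay below zero

z = inputs @ weights + bias

# Gradient of the neuron output with respect to each weight, averaged over the batch
grad_weights = (relu_derivative(z)[:, None] * inputs).mean(axis=0)

print(relu(z).max())  # the neuron never fires
print(grad_weights)   # no gradient signal reaches the weights
```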

### Leaky ReLU

Leaky ReLU is an improved version of the ReLU function that addresses the Dying ReLU problem by having a small positive slope in the negative region.

```{math}
:label: leaky_relu
f(x) = \max(0.01x, x)
```

The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables backpropagation even for negative input values.

```{admonition} Limitations
:class: warning
The predictions may not be consistent for negative input values.
```
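
As a quick check of the difference from ReLU, the following sketch evaluates both derivatives on a few hand-picked inputs, assuming the slope $0.01$ from the formula above. For negative inputs the ReLU gradient is exactly zero, while the Leaky ReLU gradient is small but non-zero.

```python
import numpy as np

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])  # arbitrary sample inputs

outputs = leaky_relu(x)
relu_gradients = relu_derivative(x)
leaky_gradients = leaky_relu_derivative(x)

print(outputs)          # negative inputs are scaled by alpha instead of zeroed
print(relu_gradients)   # zero for negative inputs
print(leaky_gradients)  # alpha for negative inputs
```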

<span style="display:none" id="q_leaky_relu">W3sicXVlc3Rpb24iOiAiSW4gYSBMZWFreSBSZUxVIGFjdGl2YXRpb24gZnVuY3Rpb24gd2l0aCBhIGxlYWt5IGZhY3RvciBvZiAwLjEsIGlmIHRoZSBpbnB1dCBpcyAtMC41LCB3aGF0IGlzIHRoZSBvdXRwdXQ/IiwgInR5cGUiOiAibnVtZXJpYyIsICJwcmVjaXNpb24iOiAyLCAiYW5zd2VycyI6IFt7InR5cGUiOiAidmFsdWUiLCAidmFsdWUiOiAtMC4wNSwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdC4ifSwgeyJ0eXBlIjogImRlZmF1bHQiLCAiZmVlZGJhY2siOiAiVGhhdCdzIGluY29ycmVjdC4gVHJ5IGFnYWluLiJ9XX1d</span>

In [2]:
from jupyterquiz import display_quiz
display_quiz("#q_leaky_relu")


In [48]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

x = np.linspace(-3, 3, 100)
alpha = 0.01
leaky_relu_values = leaky_relu(x, alpha=alpha)
leaky_relu_derivative_values = leaky_relu_derivative(x, alpha=alpha)

fig = make_subplots(rows=1, cols=1)

leaky_relu_trace = go.Scatter(x=x, y=leaky_relu_values, mode='lines', name='Leaky ReLU function')
leaky_relu_derivative_trace = go.Scatter(x=x, y=leaky_relu_derivative_values, mode='lines', name='Leaky ReLU derivative', line=dict(color='#FDAB9F'))

leaky_relu_title = f"Leaky Rectified Linear Unit (Leaky ReLU) Function, alpha = {alpha}"
leaky_relu_derivative_title = "Leaky ReLU Derivative"

buttons = [
    {
        "label": "Leaky ReLU",
        "method": "update",
        "args": [{"visible": [True, False]}, {"title": leaky_relu_title}],
    },
    {
        "label": "Leaky ReLU Derivative",
        "method": "update",
        "args": [{"visible": [False, True]}, {"title": leaky_relu_derivative_title}],
    },
    {
        "label": "Both",
        "method": "update",
        "args": [{"visible": [True, True]}, {"title": "Both Plots"}],
    },
]

fig.update_layout(
    updatemenus=[
        {
            "buttons": buttons,
            "direction": "down",
            "pad": {"r": 10, "t": 10},
            "showactive": True,
            "x": 0.33,
            "xanchor": "left",
            "y": 1.163,
            "yanchor": "top"
        },
    ],
    title=leaky_relu_title, 
    xaxis_title="X-axis",
    yaxis_title="Value",
    showlegend=True,
    legend=dict(x=0, y=1, traceorder='normal'),
    height=500,
    width=700,
)

fig.add_trace(leaky_relu_trace)
fig.add_trace(leaky_relu_derivative_trace)

fig.show()


### Parametric ReLU

Parametric ReLU treats the slope of the negative part of the function as a learnable parameter $\alpha$. The most appropriate value of $\alpha$ is learnt through backpropagation.

```{math}
:label: parametric_relu
f(x) = \max(\alpha x, x)
```

The parameterized ReLU function is used when the Leaky ReLU function still fails at solving the problem of dead neurons, and the relevant information is not successfully passed to the next layer. 

```{admonition} Limitations
:class: warning
The function may perform differently for different problems depending on the value of the slope parameter $\alpha$.
```
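
How $\alpha$ can be learnt is sketched below on a toy problem. Everything here is an assumption made for illustration: the targets are generated with a "true" slope of 0.25, the loss is the mean squared error, and plain gradient descent with a hand-picked learning rate updates $\alpha$. The key fact is that the derivative of the output with respect to $\alpha$ is $x$ for negative inputs and 0 otherwise.

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d f / d alpha: x on the negative side, 0 on the positive side
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -1.0, 0.5, 1.5])  # toy inputs
target = prelu(x, 0.25)               # toy targets from a hypothetical true slope of 0.25

alpha = 0.01         # initial slope, same as the Leaky ReLU default
learning_rate = 0.1  # hand-picked for this toy problem

for _ in range(200):
    error = prelu(x, alpha) - target
    # Gradient of the mean squared error with respect to alpha
    grad = np.mean(2 * error * prelu_grad_alpha(x))
    alpha -= learning_rate * grad

print(alpha)  # moves from 0.01 towards the slope that generated the targets
```

In a real network $\alpha$ is updated together with the weights and biases by the same optimizer; this sketch isolates the $\alpha$ update only.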

In [51]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

x = np.linspace(-3, 3, 100)


alpha_values = [0.01, 0.05, 0.1, 0.2, 0.5]

fig = make_subplots(rows=1, cols=1)


for alpha in alpha_values:
    leaky_relu_values = leaky_relu(x, alpha=alpha)
    leaky_relu_derivative_values = leaky_relu_derivative(x, alpha=alpha)

    fig.add_trace(
        go.Scatter(x=x, y=leaky_relu_values, mode='lines', name=f'Parametric ReLU alpha={alpha}'),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=x, y=leaky_relu_derivative_values, mode='lines', name=f'Derivative alpha={alpha}', line=dict(color='#FDAB9F')),
        row=1, col=1
    )

# Update layout
fig.update_layout(
    title="Parametric Rectified Linear Unit (Parametric ReLU)",
    xaxis_title="X-axis",
    yaxis_title="Value",
    showlegend=True,
    legend=dict(x=0, y=1, traceorder='normal'),
    height=500,
    width=700,
    updatemenus=[
        {
            "buttons": [
                {"label": f"Alpha {alpha}", "method": "update", "args": [{"visible": [alpha == a for a in alpha_values for _ in (0, 1)]}] }
                for alpha in alpha_values
            ],
            "direction": "down",
            "pad": {"r": 10, "t": 10},
            "showactive": True,
            "x": 0.33,
            "xanchor": "left",
            "y": 1.15,
            "yanchor": "top"
        },
    ]
)

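# Show only the two traces of the first alpha initially so the figure does not start empty;
# the dropdown then switches between alpha values
for i, trace in enumerate(fig.data):
    trace.visible = i < 2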

fig.show()


## SoftMax

Before exploring the ins and outs of the Softmax activation function, we should focus on its building block: the sigmoid/logistic activation function that works on calculating probability values. 

The output of the sigmoid function was in the range of 0 to 1, which can be thought of as probability. 

````{admonition} Question
:class: important
Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with it?

```{admonition} Answer
:class: hint, dropdown
We can’t.

The above values don’t make sense as class probabilities: they sum to 3.8, whereas the probabilities of all the classes should sum to 1.
```
````
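
The issue in the question above can be reproduced in a short sketch. As an assumption for illustration, the five values are treated as outputs of independent sigmoids, their logits are recovered with the inverse sigmoid, and a normalization over all logits is applied instead.

```python
import numpy as np

sigmoid_outputs = np.array([0.8, 0.9, 0.7, 0.8, 0.6])

# Independent sigmoid outputs do not form a probability distribution
total = sigmoid_outputs.sum()
print(total)

# Recover the underlying logits (inverse of the sigmoid)
logits = np.log(sigmoid_outputs / (1 - sigmoid_outputs))

# Normalizing the exponentiated logits jointly gives values that sum to 1
normalized = np.exp(logits) / np.exp(logits).sum()
print(normalized.sum())
```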

The Softmax function is described as a combination of multiple sigmoids. It calculates the relative probabilities. Similar to the sigmoid/logistic activation function, the SoftMax function returns the probability of each class. 

```{math}
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
```

```{note}
SoftMax is most commonly used as an activation function for the last layer of the neural network in the case of multi-class classification. 
```

````{admonition} Simple example of how SoftMax works
:class: dropdown
Assume that you have three classes, meaning that there would be three neurons in the output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].

Applying the softmax function over these values to give a probabilistic view will result in the following outcome: [0.58, 0.23, 0.19]. 

Softmax itself returns these probabilities, not 0s and 1s. To get a class prediction, we take the index of the largest probability (an argmax), which assigns 1 to that index and 0 to the other two. So the output would be the class corresponding to the 1st neuron (index 0) out of three.
````
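
The worked example above can be reproduced with a few lines of NumPy. This is a minimal sketch; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)  # improves numerical stability, result is unchanged
    exp_values = np.exp(shifted)
    return exp_values / exp_values.sum()

logits = np.array([1.8, 0.9, 0.68])  # outputs of the three neurons from the example
probabilities = softmax(logits)

print(np.round(probabilities, 2))  # [0.58 0.23 0.19]
print(probabilities.sum())         # the probabilities sum to 1
print(np.argmax(probabilities))    # 0, the index of the predicted class
```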

In [52]:
import numpy as np
import plotly.graph_objects as go

def softmax(x):
    exp_values = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=-1, keepdims=True)

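# Note: softmax treats all 100 points as one vector of logits,
# so the plotted values are small and together sum to 1
x = np.linspace(-5, 5, 100)
softmax_values = softmax(x)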
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=softmax_values, mode='lines', name='Softmax', line=dict(color='#FDAB9F')))

fig.update_layout(
    title='Softmax function',
    xaxis=dict(title='x'),
    yaxis=dict(title='Probability'),
    showlegend=True
)

fig.show()
