<center>
    <tr>
    <td><img src="images/Quansight_Logo_Lockup_1.png" width="25%"></img></td>
    </tr>
</center>

---
# Common Activation Functions
---

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

## Perceptron Model

A perceptron model takes an input, computes its weighted sum, and either fires or doesn't.  A single perceptron can implement *linearly separable* functions.  The decision of when to fire (or not) is an important one.  We discuss activation functions as the mechanism that determines whether or not a neuron fires.
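
As a rough illustration of the description above, here is a minimal sketch of a single perceptron.  The function name `perceptron` and the weights and bias are hypothetical values chosen for this example so that the unit computes logical AND, a linearly separable function; the bias plays the role of a negative threshold.

```python
import torch

def perceptron(x, w, b):
    """Return 1.0 (fire) when the weighted sum w.x + b is non-negative, else 0.0."""
    weighted_sum = torch.dot(w, x) + b
    return (weighted_sum >= 0).float()

# Hypothetical weights and bias, picked by hand so the unit implements AND
w = torch.tensor([1.0, 1.0])
b = torch.tensor(-1.5)

inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
and_outputs = [perceptron(torch.tensor(p), w, b).item() for p in inputs]
print(and_outputs)  # [0.0, 0.0, 0.0, 1.0]
```

Only the input `(1, 1)` gives a weighted sum above the threshold, so it is the only case where the neuron fires.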

## Step function

The step function is the simplest model for an activation function.  The neuron fires (its output equals 1) if the weighted sum is greater than or equal to some threshold.

$$
y = 
\begin{cases}
1 & x \ge \mathrm{threshold} \\
0 & \mathrm{otherwise}
\end{cases}
$$

Here $y$ represents neuron activation, $x$ is the weighted sum input to the neuron, and $\mathrm{threshold}$ is the neuron-specific threshold value.

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
threshold = 2.5

y = np.where(x >= threshold, 1, 0)

plt.figure(figsize=(5, 5))
plt.plot(x, y, 'r');
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Step function');

This activation function is used in the *Perceptron Learning Algorithm*, proposed by Frank Rosenblatt in the late 1950s (building on the McCulloch-Pitts neuron model of 1943) and later analyzed in depth by Minsky and Papert in 1969.

The step function can work well for binary classification problems; however, how would we use it for a multi-class classification problem?  What if more than one neuron is activated for a given input example?  It would be difficult to determine which neuron is "more" activated than the others.  It would be good to have a mechanism where neuron activation is not binary, so that a neuron can fire anywhere between 0 percent and 100 percent.

## Linear function

One choice for non-binary activations is to use a linear function.  Neuron activation is then proportional to its input (i.e., the weighted sum).

$$
y = c x
$$

Here $y$ represents neuron activation, $x$ is the weighted sum input to the neuron, and $c$ is a constant.

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
c = 2

plt.figure(figsize=(5, 5))
plt.plot(x, c*x, 'r-', label='output');  # plot against x, not the sample index
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Linear function');

While this function solves the binary activation problem, it suffers from the following issues:

- The gradient is constant, and doesn't depend upon the input.
- The function is linear, and this means that even if we have multiple layers, the network can only represent linear functions.
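
The second point can be checked numerically.  The sketch below stacks two `nn.Linear` layers with no non-linearity in between and compares the result with a single linear map built from their weights.  The layer sizes, the seed, and the batch of random inputs are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (sizes are arbitrary)
layer1 = nn.Linear(3, 4)
layer2 = nn.Linear(4, 2)

# The equivalent single linear map: W = W2 W1, b = W2 b1 + b2
W = layer2.weight @ layer1.weight
b = layer2.weight @ layer1.bias + layer2.bias

x = torch.randn(5, 3)
stacked = layer2(layer1(x))
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-5))
```

Because the two computations agree, adding more linear layers does not add representational power; a non-linear activation between the layers is what breaks this equivalence.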

## Sigmoid

$$
y = \frac{1}{1 + \exp(-x)}
$$

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
m = nn.Sigmoid()

plt.figure(figsize=(5, 5))
plt.plot(x, m(x), 'r-', label='output');  # plot against x, not the sample index
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Sigmoid');

### Pros

- Step-like smooth function
- Non-binary activation
- Non-linear, so we can stack layers together and can approximate non-linear functions
- Values lie between $0$ and $1$
- Historically one of the most widely used activation functions in neural networks

### Cons

- Suffers from the *vanishing gradients* problem: if the input $x$ is too far from $0$, the gradient becomes very small, and training slows down or stops completely.
- One way around this problem is to scale the inputs $x$ to avoid ranges where the output becomes flat.
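
To make the vanishing gradient concrete, the following sketch uses autograd to evaluate the Sigmoid gradient at a few hand-picked inputs and compares it with the analytic derivative $y(1 - y)$.  The sample points are arbitrary.

```python
import torch

# Hand-picked sample points, moving away from 0
x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)

# Summing lets a single backward pass produce dy/dx for every element
y.sum().backward()
sigmoid_grads = x.grad

# Analytic derivative of the sigmoid: y * (1 - y)
analytic = (y * (1 - y)).detach()

print(sigmoid_grads)
print(torch.allclose(sigmoid_grads, analytic))
```

The gradient peaks at $0.25$ for $x = 0$ and drops below $10^{-4}$ by $x = 10$, which is why large-magnitude inputs stall learning.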

## Tanh

$$
\begin{aligned}
y &= \tanh(x) \\
&= \frac{2}{1 + \exp(-2x)} - 1
\end{aligned}
$$

Tanh has properties similar to Sigmoid, but its output lies between $-1$ and $1$ and is centered at $0$.  Its gradient around $0$ is steeper than that of Sigmoid.

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
m = nn.Tanh()

plt.figure(figsize=(5, 5))
plt.plot(x, m(x), 'r-', label='output');  # plot against x, not the sample index
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Tanh');
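
The relationship between Tanh and Sigmoid can be verified directly.  This sketch checks the identity $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$, which follows from the formula above, and compares the two gradients at $x = 0$.  The sample points are arbitrary.

```python
import torch

x = torch.linspace(-3, 3, 7)

# tanh(x) written in terms of the sigmoid, following the formula above
tanh_from_sigmoid = 2 * torch.sigmoid(2 * x) - 1
identity_holds = torch.allclose(torch.tanh(x), tanh_from_sigmoid, atol=1e-6)

# Gradients of both functions at x = 0
x0 = torch.tensor(0.0, requires_grad=True)
(tanh_grad,) = torch.autograd.grad(torch.tanh(x0), x0)
(sigmoid_grad,) = torch.autograd.grad(torch.sigmoid(x0), x0)

print(identity_holds, tanh_grad.item(), sigmoid_grad.item())
```

At $x = 0$ the Tanh gradient is $1$ while the Sigmoid gradient is $0.25$, which is what "steeper around $0$" means here.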

## ReLU

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
m = nn.ReLU()

plt.figure(figsize=(5, 5))
plt.plot(x, m(x), 'r');
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('ReLU');

$$
y = \max(0, x)
$$

### Pros

- Most widely used activation function in deep neural networks.
- ReLU is *non-linear*, so we can stack layers to approximate non-linear functions.
- It has the "sparse activation" property, which is very useful.  Sparse activation refers to the fact that a number of neurons will be turned off (note that for inputs less than $0$, ReLU is off).  Sparse activations make networks more efficient.
- Computationally less expensive than Sigmoid or Tanh.

### Cons

- ReLU suffers from the *dying ReLU* problem.  This happens when the input to ReLU is less than $0$: the gradient for inputs less than $0$ is $0$, so no update flows back through the neuron.  This can cause a neuron to stop learning.
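
The zero gradient for negative inputs is easy to see with autograd.  This is a small sketch with hand-picked sample inputs.

```python
import torch

# Two negative and two positive sample inputs
x = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)

# Summing lets a single backward pass produce dy/dx for every element
torch.relu(x).sum().backward()
relu_grads = x.grad

print(relu_grads)  # tensor([0., 0., 1., 1.])
```

A neuron whose weighted sum stays negative for every training example receives no gradient signal and therefore cannot recover.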

## Leaky ReLU

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
m = nn.LeakyReLU(0.1)

plt.figure(figsize=(5, 5))
plt.plot(x, m(x), 'r');
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Leaky ReLU');

- Attempts to address the dying ReLU problem
- Doesn't offer the "sparse activation" property, since negative inputs still produce non-zero outputs
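
For comparison with ReLU, the sketch below evaluates `nn.LeakyReLU` and its gradient on a few hand-picked inputs, using the same negative slope of `0.1` as the plot above.

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)
m = nn.LeakyReLU(0.1)

y = m(x)
y.sum().backward()
leaky_outputs = y.detach()
leaky_grads = x.grad

print(leaky_outputs)  # negative inputs are scaled by 0.1 rather than zeroed
print(leaky_grads)    # gradient is 0.1 for negative inputs, 1 for positive inputs
```

Negative inputs keep a small, non-zero gradient, so the neuron can still be updated; the price is that its output is never exactly zero for those inputs.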

## Softmax

In [None]:
x = torch.randn(5)
m = nn.Softmax(dim=0)

plt.figure(figsize=(5, 5))
plt.plot(x,'b-.',label='input vector')
plt.plot(m(x),'r-',label='output vector');
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Softmax');
plt.legend();

- Used in multi-class classification problems
- Turns an input vector into a probability distribution
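
As a sketch of what "probability distribution" means here, the code below computes Softmax by hand using the standard definition $y_i = \exp(x_i) / \sum_j \exp(x_j)$ and compares it with `nn.Softmax`.  The input vector is made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import torch
import torch.nn as nn

# Made-up scores for three classes
x = torch.tensor([2.0, 1.0, 0.1])

# Manual softmax; subtracting the max avoids overflow in exp
shifted = torch.exp(x - x.max())
manual = shifted / shifted.sum()

builtin = nn.Softmax(dim=0)(x)

print(manual, manual.sum())
print(torch.allclose(manual, builtin))
```

All outputs are positive and sum to $1$, and the largest input receives the largest probability.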

## Softshrink

In [None]:
x = torch.FloatTensor(np.linspace(-10, 10, 100))
m = nn.Softshrink(lambd=2.5)

plt.figure(figsize=(5, 5))
#plt.plot(x,'b-.',label='input vector')
plt.plot(x, m(x),'r-',label='output vector');
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Softshrink');
plt.legend();
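
The plot can be reproduced from the piecewise definition of Softshrink: the output is $x - \lambda$ for $x > \lambda$, $x + \lambda$ for $x < -\lambda$, and $0$ otherwise.  The sketch below implements that definition with `torch.where` and compares it with `nn.Softshrink`, using the same $\lambda = 2.5$ as above.  The sample inputs are hand-picked.

```python
import torch
import torch.nn as nn

lambd = 2.5
x = torch.tensor([-5.0, -2.5, -1.0, 0.0, 1.0, 2.5, 5.0])

# Piecewise definition: shrink towards zero by lambd, zero inside [-lambd, lambd]
manual = torch.where(
    x > lambd,
    x - lambd,
    torch.where(x < -lambd, x + lambd, torch.zeros_like(x)),
)

builtin = nn.Softshrink(lambd=lambd)(x)

print(manual)
print(torch.allclose(manual, builtin))
```

Inputs with magnitude at most $\lambda$ are mapped to exactly $0$, and larger inputs are pulled towards $0$ by $\lambda$.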

## Which activation function to use?

- Sigmoid works well for the output layer of a binary classifier
- ReLU works well for the hidden layers of deep networks
- Rule of thumb: pick an activation function that leads to faster training ...
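
One way these guidelines might look in practice is sketched below: a small binary classifier with ReLU in the hidden layer and Sigmoid on the output.  The layer sizes, the seed, and the random input batch are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 4 input features, 8 hidden units, 1 output probability
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),      # hidden-layer activation
    nn.Linear(8, 1),
    nn.Sigmoid(),   # squashes the output into (0, 1)
)

x = torch.randn(3, 4)  # a batch of 3 random examples
probs = model(x)

print(probs.shape)
```

Each row of `probs` can be read as the predicted probability of the positive class.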
