# Chapter 4: Activation Functions

The activation function mimics a neuron “firing” or “not firing” based on its input.
There are two types of activation functions:

- Activation functions used in hidden layers (usually the same for all hidden neurons, though this is not required).
- Activation functions used in the output layer.

## 4.1. The Step Activation Function

It is rarely used nowadays because its output is not very informative: the return value discards most of the information about the input (an input of 3 produces the same output as an input of 3000).

<center><img src='./image/4-1.png' style='width: 60%'/></center>
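
The loss of information can be seen in a minimal sketch, assuming the common definition in which the neuron outputs 1 for any input above 0 and 0 otherwise (the helper name `step` and the threshold of 0 are illustrative assumptions):

```python
def step(x):
    # Fire (1) for any positive input, stay silent (0) otherwise.
    # The threshold of 0 is an assumption made for illustration.
    return 1 if x > 0 else 0

for value in [-2, 0, 3, 3000]:
    print(value, "->", step(value))
```

Both 3 and 3000 map to the same output, so the magnitude of the input cannot be recovered from the result.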

## 4.2. The Linear Activation Function

This activation function is usually applied to the last layer’s output in the case of a regression model.

<center><img src='./image/4-2.png' style='width: 60%'/></center>

## 4.3. The Sigmoid Activation Function

This activation function is more granular than the step activation function.

Unlike the step function, the return value retains information about the input, so this function works better with NNs.

The Sigmoid function will be used later as an output layer’s activation function.

<center><img src='./image/4-3.png' style='width: 60%'/></center>
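
For comparison with the step function, here is a small sketch using the standard Sigmoid definition $y = \frac{1}{1 + e^{-x}}$ (the helper name `sigmoid` and the sample inputs are illustrative, not taken from the text above):

```python
import numpy as np

def sigmoid(x):
    # Squash any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(inputs))
```

Different positive inputs now give different outputs, which is the extra granularity the step function lacks.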

The sigmoid function, historically used in hidden layers, was eventually replaced by the Rectified Linear Unit (ReLU) activation function.

## 4.4. The Rectified Linear Activation Function

ReLU is the most widely used activation function, mainly because of its speed and efficiency.

The Sigmoid activation function is not the most complicated, but it’s still more challenging to compute than the ReLU activation function.

The ReLU activation function is extremely close to being a linear activation function while remaining nonlinear, due to the bend at 0. This simple property is very effective.

<center><img src='./image/4-4.png' style='width: 60%'/></center>

## 4.5. Why Use Activation Functions?

For a NN to fit a nonlinear function, it needs to contain two or more hidden layers, and those hidden layers need to use a nonlinear activation function.

A nonlinear function cannot be represented well by a straight line.

An example of a linear problem in real life:

- The cost of some number of shirts, given the cost of an individual shirt.

An example of a nonlinear problem in real life:

- The price of a home, which depends on size, location, the time of year it is put up for sale, number of rooms, yard, neighborhood, and so on.

Consider a neural network with 2 hidden layers of 8 neurons each. The result of training this model will look like:

<center><img src='./image/4-5.png' style='width: 60%'/></center>

A neural network with 2 hidden layers of 8 neurons each, using the linear activation function:

<center><img src='./image/4-6.png' style='width: 60%'/></center>

The same NN architecture, using the rectified linear activation function:

<center><img src='./image/4-7.png' style='width: 60%'/></center>

## 4.6. Linear Activation in the Hidden Layers

Consider a NN in which every activation function is linear, $y = x$.

<center><img src='./image/4-8.png' style='width: 60%'/></center>

No matter what we do or how many layers we have, this NN can only depict linear relationships if we use linear activation functions.

The entire network is then a linear function as well.
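
A quick numerical check makes this concrete. The sketch below assumes two layers with randomly chosen weights and biases (the shapes, the seed and the variable names are illustrative only) and shows that stacking them is the same as applying one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with the linear activation y = x (random values, for illustration only)
W1, b1 = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=(1, 1))

x = np.linspace(-1, 1, 5).reshape(-1, 1)

# Forward pass through both layers
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```

Because the combined weights `W` and bias `b` are themselves constants, adding more linear layers never produces anything beyond a straight line.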

## 4.7. ReLU Activation in a Pair of Neurons

How can the rectified linear activation function suddenly map nonlinear relationships and functions?

<center><img src='./image/4-9.png' style='width: 60%'/><font color='gray'><i>Single neuron with a weight of 0 and a bias of 0.</i></font></center>

<center><img src='./image/4-10.png' style='width: 60%'/><font color='gray'><i>Single neuron with a weight of 1 and a bias of 0.</i></font></center>

<center><img src='./image/4-11.png' style='width: 60%'/><font color='gray'><i>Single neuron with a weight of 1 and a bias of 0.5.</i></font></center>

<center><img src='./image/4-12.png' style='width: 60%'/><font color='gray'><i>Single neuron with a weight of -1 and a bias of 0.5. (See when this neuron deactivates)</i></font></center>

<center><img src='./image/4-13.png' style='width: 60%'/><font color='gray'><i>A pair of neurons: the 2nd neuron’s bias does no offsetting, and the output is linear.</i></font></center>

<center><img src='./image/4-14.png' style='width: 60%'/><font color='gray'><i>A pair of neurons: the 2nd neuron has a bias of 1, which shifts the overall function vertically.</i></font></center>

<center><img src='./image/4-15.png' style='width: 60%'/><font color='gray'><i>A pair of neurons: with a weight of -2 on the 2nd neuron, the pair has both an activation and a deactivation point. When both neurons are activated, their “area of effect” comes into play and the pair produces values in the granular, variable range of the output. If either neuron in the pair is inactive, the pair produces non-variable output.</i></font></center>

<center><img src='./image/4-16.png' style='width: 60%'/><font color='gray'><i>A pair of neurons: the 2nd neuron has a weight of -2; the “area of effect”.</i></font></center>
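
The behavior in the last two figures can be reproduced with a small sketch. The weights and biases below are assumptions pieced together from the captions (weight -1 and bias 0.5 on the first neuron, weight -2 and bias 1 on the second); the figures themselves may use different values, and the names `relu` and `neuron_pair` are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def neuron_pair(x, w1=-1.0, b1=0.5, w2=-2.0, b2=1.0):
    # The first neuron feeds the second; both use ReLU.
    # The default weights and biases are assumptions chosen to mirror the captions.
    h = relu(w1 * x + b1)
    return relu(w2 * h + b2)

xs = np.array([-1.0, 0.0, 0.25, 0.5, 1.0])
print(neuron_pair(xs))
```

With these values the output is constant at 0 for inputs up to 0, constant at 1 for inputs from 0.5 onward, and varies only in between, where both neurons are active. That middle region is the pair's area of effect.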

## 4.8. ReLU Activation in the Hidden Layers

We will fit the sine wave function using 2 hidden layers of 8 neurons each, and <font color='green'>hand-tune the values to fit the curve</font>.

We do this by working with 1 pair of neurons at a time, which means 1 neuron from each layer individually.

For simplicity, assume that the layers are not densely connected: each neuron from the first hidden layer connects to only one neuron from the second hidden layer.

The model takes 1 input ($x$) and returns 1 output, $y = \sin(x)$.

The output layer uses the Linear activation function.

The hidden layers use the ReLU activation function.

<center><img src='./image/4-17.png' style='width: 60%'/><font color='gray'><i>Start with the first pair of neurons, set all weights to 0.</i></font></center>

<center><img src='./image/4-18.png' style='width: 60%'/><font color='gray'><i>Set weights of the hidden layer neurons and the output neuron to 1.</i></font></center>

<center><img src='./image/4-19.png' style='width: 60%'/><font color='gray'><i>Adjust the weight for the first neuron of the first layer to 6.</i></font><font color='green'><i> The initial slope is correct.</i></font><font color='red'><i> A problem is this function never ends</i></font><font color='gray'><i> as this neuron pair never deactivates. We want the deactivation to occur where the red fitment line diverges initially from the green sine wave (~ 0.7).</i></font></center>

<center><img src='./image/4-20.png' style='width: 60%'/><font color='gray'><i>Increase the bias for the 2nd neuron to 0.7. This offsets the overall function vertically. </i></font></center>

<center><img src='./image/4-21.png' style='width: 60%'/><font color='gray'><i>Set the weight for the 2nd neuron to -1, causing the deactivation point to occur where we want.</i></font></center>

<center><img src='./image/4-22.png' style='width: 60%'/><font color='gray'><i>Flip this slope back by setting the weight of the connection to the output neuron to -1. Now we need to offset this up a bit.</i></font></center>

<center><img src='./image/4-23.png' style='width: 60%'/><font color='gray'><i>We will use the first 7 pairs of neurons in the hidden layers to create the sine wave’s shape, </i></font><font color='green'><i>then the bottom pair to offset everything vertically.</i></font><font color='gray'><i> Now we have completed the first section.</i></font></center>
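
The first section built so far can be written out as a rough sketch. The weights (6, -1, -1) and the bias of 0.7 come from the captions above; the size of the vertical offset supplied by the bottom pair is not stated, so the value 0.7 used here is an assumption, as are the helper names:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def first_section(x, offset=0.7):
    # First pair: weight 6 into the 1st neuron, weight -1 and bias 0.7 on the 2nd,
    # and weight -1 on the connection to the output neuron (values from the captions).
    h1 = relu(6.0 * x)
    h2 = relu(-1.0 * h1 + 0.7)
    # The vertical offset supplied by the bottom pair is assumed to be 0.7 here.
    return -1.0 * h2 + offset

xs = np.array([0.0, 0.05, 0.1, 0.5, 1.0])
print(first_section(xs))
```

Under these assumptions the output rises from 0 with a slope of 6 and then stays flat at 0.7 once the pair deactivates, which is the shape the remaining pairs build on.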

<center><img src='./image/4-24.png' style='width: 60%'/><font color='gray'><i>Set all weights for this 2nd pair of neurons to 1, including the output neuron.</i></font><font color='red'><i> The 2nd pair begins activation too soon</i></font><font color='gray'><i>, which impacts the “area of effect” of the top pair that we already aligned.</i></font></center>

<center><img src='./image/4-25.png' style='width: 60%'/><font color='gray'><i>We want the 2nd pair to start influencing the output where the first pair deactivates, so we want to adjust the function horizontally.</i></font></center>

<center><img src='./image/4-26.png' style='width: 60%'/><font color='gray'><i>Flip the 2nd pair’s function segment by flipping the weight to the output neuron.</i></font></center>

<center><img src='./image/4-27.png' style='width: 60%'/><font color='gray'><i>Use the bottom pair to fix the vertical offset.</i></font></center>

<center><img src='./image/4-28.png' style='width: 60%'/><font color='gray'><i>Continue with the same method: begin the activation for the 3rd pair of hidden layer neurons where we wish the slope to start going down.</i></font></center>

<center><img src='./image/4-29.png' style='width: 60%'/><font color='gray'><i>Repeat the process for each section, giving us a final result.</i></font></center>

<center><img src='./image/4-30.png' style='width: 60%'/><font color='gray'><i>When we pass data through, a pair’s area of effect comes into play only when both of its neurons are activated by the input. With input 0.08, the only pairs activated are the top ones, as this is their area of effect.</i></font></center>

<center><img src='./image/4-31.png' style='width: 60%'/><font color='gray'><i>With input 0.51, the 4th pair of neurons is activated. Even without any of the other weights, with ReLU we can fit the sine wave pretty well.</i></font></center>

<center><img src='./image/4-32.png' style='width: 60%'/><font color='gray'><i>If we enable all of the weights now and allow a mathematical optimizer to train, we can see even better fitment.</i></font></center>

<center><img src='./image/4-33.png' style='width: 60%'/><font color='gray'><i>More neurons can enable more unique areas of effect. This is why we need two or more hidden layers and nonlinear activation functions to map nonlinear problems.</i></font></center>






## 4.9. ReLU Activation Function Code

A simple version:

In [1]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
output = []
for i in inputs:
    if i > 0:
        output.append(i)
    else:
        output.append(0)
print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


A simpler version:

In [2]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
output = []
for i in inputs:
    output.append(max(0, i))
print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


The new rectified linear activation class:

In [3]:
import numpy as np

# ReLU activation
class Activation_ReLU:

    # Forward pass
    def forward(self, inputs):

        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

Apply this activation function to the dense layer’s outputs:

In [4]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()

# Dense layer
class Layer_Dense:
    
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    
    # Forward pass
    def forward(self, inputs):
        
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Forward pass through activation func, takes in output from previous layer
activation1.forward(dense1.output)

# Let's see output of the first few samples:
print(activation1.output[:5])

[[0.         0.         0.        ]
 [0.         0.00011395 0.        ]
 [0.         0.00031729 0.        ]
 [0.         0.00052666 0.        ]
 [0.         0.00071401 0.        ]]


## 4.10. The Softmax Activation Function

ReLU is

- Unbounded.
- Not normalized with other units: values can be anything, so an output of `[12, 99, 318]` has no context.
- Exclusive: each output is independent of the others.

We want this model to be a classifier, so we want an activation function meant for classification. One of these is the Softmax activation function.

- It can take in non-normalized, or uncalibrated, inputs and produce a normalized distribution of probabilities for our classes (they add up to 1).
- The predicted class is associated with the output neuron that returns the largest <b>confidence score</b>.

The Softmax function is

$$
S_{i,j}\ =  \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}}
$$

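$e \approx 2.71828182846$ is Euler's number, the base of the exponential function.

$z_{i,j}$ is a single output value: the index $i$ is the current sample and the index $j$ is the current output in this sample. The denominator sums the exponentiated values over all $L$ outputs of sample $i$.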

<center><img src='./image/4-34.png' style='width: 60%'/><font color='gray'><i>Graph of an exponential function.</i></font></center>


In [5]:
# Values from the previous output in Chapter 2, section 2.6
layer_outputs = [4.8, 1.21, 2.385]

# e - we use E here to match a common coding style where constants are uppercased
E = 2.71828182846     # you can also use math.e

# For each value in a vector, calculate the exponential value
exp_values = []
for output in layer_outputs:
    exp_values.append(E ** output)     # ** - power operator in Python

print('Exponentiated values:')
print(exp_values)

Exponentiated values:
[121.51041751893969, 3.3534846525504487, 10.85906266492961]


In [6]:
# Now normalize values
norm_base = sum(exp_values)
norm_values = []

for value in exp_values:
    norm_values.append(value / norm_base)

print('Normalized exponentiated values:')
print(norm_values)
print('Sum of normalized values:', sum(norm_values))


Normalized exponentiated values:
[0.8952826639573506, 0.024708306782070668, 0.08000902926057876]
Sum of normalized values: 1.0


We can perform the same set of operations with the use of NumPy:

In [7]:
import numpy as np

# Values from the earlier example, when we described what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print('Exponentiated values:')
print(exp_values)

# Now normalize values
norm_values = exp_values / np.sum(exp_values)
print('Normalized exponentiated values:')
print(norm_values)
print('Sum of normalized values:', np.sum(norm_values))


Exponentiated values:
[121.51041752   3.35348465  10.85906266]
Normalized exponentiated values:
[0.89528266 0.02470831 0.08000903]
Sum of normalized values: 0.9999999999999999


The results are the same up to floating-point rounding, but with NumPy the calculation is faster and the code is easier to read.

To train in batches, we need to convert this functionality to accept layer outputs in batches.

In [8]:
# Softmax activation
class Activation_Softmax:
    
    # Forward pass
    def forward(self, inputs):
        
        # Get unnormalized probabilities (max subtraction to avoid “dead neurons” and “exploding”)
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities
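
The `axis=1, keepdims=True` arguments are what make the class work on batches. The sketch below uses a made-up batch of 3 samples (the numbers are purely illustrative) to show the shapes involved and that each row ends up summing to 1:

```python
import numpy as np

# A hypothetical batch of 3 samples with 3 outputs each
layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

# axis=1 reduces across the columns, giving one value per row (per sample)
row_sums = np.sum(layer_outputs, axis=1)
# keepdims=True keeps the result as a column, so it broadcasts against the batch
row_sums_column = np.sum(layer_outputs, axis=1, keepdims=True)
print(row_sums.shape, row_sums_column.shape)

# The same steps as in the class above, applied to the whole batch at once
exp_values = np.exp(layer_outputs - np.max(layer_outputs, axis=1, keepdims=True))
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
print(probabilities.sum(axis=1))
```

Without `keepdims=True` the per-sample sums would have shape `(3,)` and would be broadcast along the wrong dimension when dividing a `(3, 3)` batch.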


Exploding values: it doesn’t take a very large input (a mere 1000) for the exponential to overflow.

In [9]:
import numpy as np

print(np.exp(1))
print(np.exp(10))
print(np.exp(100))
print(np.exp(1000))

2.718281828459045
22026.465794806718
2.6881171418161356e+43
inf



In [10]:
import numpy as np

print(np.exp(-np.inf), np.exp(0))

0.0 1.0
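
The two cells above explain the max subtraction: once the largest input is subtracted, every exponent lies between $-\infty$ and 0, so every exponentiated value lies between 0 and 1. A minimal sketch of the difference this makes for a single sample (the helper names `softmax_naive` and `softmax_stable` and the inputs are illustrative assumptions):

```python
import numpy as np

def softmax_naive(z):
    # Exponentiate directly; large inputs overflow to inf
    exp_values = np.exp(z)
    return exp_values / np.sum(exp_values)

def softmax_stable(z):
    # Subtract the largest input first, so the largest exponent is exp(0) = 1
    exp_values = np.exp(z - np.max(z))
    return exp_values / np.sum(exp_values)

z = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow/invalid warnings the naive version would otherwise emit
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(z)
stable = softmax_stable(z)

print(naive)
print(stable)
```

The naive version divides `inf` by `inf` and returns `nan` for every class, while the max-subtracted version returns a valid distribution.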


We can subtract any value from all of the inputs, and it will not change the output.
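
A short derivation of why, using the Softmax definition above and an arbitrary constant $c$ (a symbol introduced here only for illustration):

$$
\frac{e^{z_{i,j} - c}}{\sum_{l=1}^{L} e^{z_{i,l} - c}}
= \frac{e^{-c}\, e^{z_{i,j}}}{e^{-c} \sum_{l=1}^{L} e^{z_{i,l}}}
= \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}}
= S_{i,j}
$$

The factor $e^{-c}$ appears in both the numerator and the denominator and cancels, so choosing $c$ as the largest input of the sample changes nothing except the size of the intermediate values.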

In [11]:
softmax = Activation_Softmax()
softmax.forward([[1, 2, 3]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]


In [12]:
softmax.forward([[-2, -1, 0]])  # subtracted 3 - max from the list
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]


What happens if we divide the layer’s output data `[1, 2, 3]` by 2?

In [13]:
softmax.forward([[0.5, 1, 1.5]])
print(softmax.output)

[[0.18632372 0.30719589 0.50648039]]


The output confidences changed due to the nonlinear nature of exponentiation. That’s why we need to scale all of the input data to a NN in the same way.

In [14]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (as we take output of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Make a forward pass through activation function it takes the output of first dense layer here
activation1.forward(dense1.output)

# Make a forward pass through second Dense layer it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass through activation function it takes the output of second dense layer here
activation2.forward(dense2.output)

# Let's see output of the first few samples:
print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]]


The distribution of predictions is almost equal: each sample has a confidence of about 0.33 (33%) for each class. This results from the random initialization of weights (small values drawn from a normal distribution) and zeroed biases.

`argmax` returns the indices of the maximum values along an axis. The confidence score can be as important as the class prediction itself. `argmax` of `[0.22, 0.6, 0.18]` is the same as the argmax for `[0.32, 0.36, 0.32]`, but 60% confidence is much better than a 36% confidence.
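
A small sketch of that comparison, treating the two example vectors as a batch of two Softmax outputs (the variable names are illustrative):

```python
import numpy as np

softmax_outputs = np.array([[0.22, 0.6, 0.18],
                            [0.32, 0.36, 0.32]])

# Index of the largest confidence in each row (the predicted class)
predictions = np.argmax(softmax_outputs, axis=1)
# The confidence attached to each prediction
confidences = np.max(softmax_outputs, axis=1)

print(predictions)
print(confidences)
```

Both rows predict class 1, but only the confidences reveal how much more certain the first prediction is.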

The full code:

In [15]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()

# Dense layer
class Layer_Dense:
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

# ReLU activation
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities; axis=1 takes the max across each row, i.e. per sample
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (as we take output of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Make a forward pass through activation function it takes the output of first dense layer here
activation1.forward(dense1.output)

# Make a forward pass through second Dense layer it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass through activation function it takes the output of second dense layer here
activation2.forward(dense2.output)

# Let's see output of the first few samples:
print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.3333332  0.3333332  0.33333364]
 [0.3333329  0.33333293 0.3333342 ]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]


Although neurons are interconnected, they each have their respective weights and biases and are not “normalized” with each other.

Our example model is currently random. We need a way to calculate how wrong the NN is at current predictions and begin adjusting weights and biases to decrease error over time.

Our next step is to quantify how wrong the model is through what’s defined as a loss function.