# Chapter 4: Activation Functions

- in general, your neural network will have two types of activation functions: one used in the hidden layers and one used in the output layer
- the purpose of an activation function is to mimic a neuron "firing" or not, based on the input it receives
- as discussed, the simplest version of this is the step function
- if `weights * inputs + biases` results in a value greater than 0, the neuron will "fire" an output of 1, otherwise it will output a 0
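
The step function described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `step` and the example inputs, weights, and bias are made up for the sketch and do not come from earlier chapters.

```python
import numpy as np


def step(weighted_sum):
    """Return 1 where the weighted sum is greater than 0, otherwise 0."""
    return np.where(np.asarray(weighted_sum) > 0, 1, 0)


# Hypothetical single neuron with three inputs (values chosen for illustration)
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.8, -0.5])
bias = 2.0

# weights * inputs + bias for one neuron
weighted_sum = np.dot(weights, inputs) + bias
print(weighted_sum)        # roughly 2.3, which is greater than 0
print(step(weighted_sum))  # the neuron "fires"

# The same function applied to several weighted sums at once
print(step(np.array([-0.2, 0.0, 2.3])))
```

Note that an input of exactly 0 does not fire, because the condition is "greater than 0".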

### Rectified Linear Activation Function (hidden layers)
- the step function is not ideal because we want something a bit more granular, with finer control over the outputs than simply "1" or "0"
- historically, a common activation function for neural networks was the Sigmoid activation function, but it has largely been replaced in hidden layers by the *Rectified Linear Activation Function* (ReLU)
---
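
For comparison with the step function, here is a minimal sketch of the Sigmoid activation mentioned above, assuming the standard definition `1 / (1 + e^-x)`. The function name `sigmoid` and the sample values are chosen for this illustration only.

```python
import numpy as np


def sigmoid(x):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


# Illustrative inputs spanning negative and positive values
values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
outputs = sigmoid(values)
print(outputs)
```

Unlike the step function, the output changes smoothly with the input, which is the "more granular" behavior the notes ask for.
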
- the ReLU is very simple: it is quite literally y=x, clipped at 0
- if x is less than 0, y=0, otherwise y=x
- the ReLU generally outperforms the Sigmoid activation function in hidden layers and is the most commonly used default activation function in neural networks
- the ReLU has very straightforward code: 

In [33]:
# ReLU activation function
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

output = []
for i in inputs:
    if i > 0:
        output.append(i)
    else:
        output.append(0)

output

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]

- there's actually a NumPy equivalent we can use called `np.maximum`
- we'll incorporate this in our rectified linear activation class:

In [0]:
import numpy as np

class Activation_ReLU:

    def forward(self, inputs):
        # Element-wise maximum of 0 and each input: negative values become 0
        self.output = np.maximum(0, inputs)
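
As a quick check, the sketch below compares the loop version of the ReLU with `np.maximum` on the same example inputs used earlier. The variable names `loop_output` and `numpy_output` are introduced here for illustration.

```python
import numpy as np

inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

# Loop version: keep positive values, replace everything else with 0
loop_output = [i if i > 0 else 0 for i in inputs]

# Vectorized version: element-wise maximum of 0 and each input
numpy_output = np.maximum(0, inputs)

print(numpy_output)
print(np.array_equal(numpy_output, loop_output))  # True
```

Because `np.maximum` works element-wise, the same call also handles a 2D batch of layer outputs without any changes.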

- let's apply this activation function to the code we have so far:

In [0]:
import numpy as np

np.random.seed(0)

def create_data(points, classes):
    X = np.zeros((points*classes, 2))
    y = np.zeros(points*classes, dtype='uint8')
    for class_number in range(classes):
        ix = range(points*class_number, points*(class_number+1))
        r = np.linspace(0.0, 1, points)  # radius
        t = np.linspace(class_number*4, (class_number+1)*4, points) + np.random.randn(points)*0.05
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = class_number
    return X, y

class Layer_Dense:

    def __init__(self, inputs, neurons):
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:

    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

---

In [36]:
# Create dataset
X, y = create_data(100, 3)

# Create Dense layer with 2 input features and 3 output values (3 neurons)
dense1 = Layer_Dense(2, 3)  # first dense layer, 2 inputs (each sample has 2 features), 3 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Forward pass through the activation function, which takes the output of the previous layer
activation1.forward(dense1.output)

# Let's see the output of the first few samples:
print(activation1.output[:5])

[[0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 9.17448014e-05 0.00000000e+00]
 [0.00000000e+00 2.34361157e-04 0.00000000e+00]
 [0.00000000e+00 4.45243326e-04 0.00000000e+00]
 [0.00000000e+00 6.15104009e-04 0.00000000e+00]]


- as you can see, negative values have been **clipped** (modified to be 0)
- that's all there is to the ReLU
- the second type of activation function is for your output layer

### Softmax Activation Function (output layer)
- in our case, we want our model to be a classifier, so we want an activation function designed for classification: let's use the *Softmax activation function*
- the softmax activation function on the output layer can take in non-normalized inputs and compute a normalized distribution of probabilities for our classes
- in the case of classification, we want to see a prediction of which class the network thinks the input represents
- this distribution returned by the softmax activation function represents **confidence** scores for each class, which will add up to 1
- the predicted class is associated with the output neuron with the largest confidence score
---
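
Written out for a single sample, the softmax described above is commonly defined as follows. The symbols are notation chosen for this note: $z_j$ is the $j$-th output of the layer, $L$ is the number of classes, and $S_j$ is the resulting confidence score for class $j$.

$$
S_j = \frac{e^{z_j}}{\sum_{l=1}^{L} e^{z_l}}
$$

The numerator is the exponentiation step and the denominator is the normalization step, so the scores $S_1, \dots, S_L$ are all positive and sum to 1.
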
- let's break down the softmax activation function into simple pieces with Python code
- to start, here are some example outputs from a neural network layer: 

In [0]:
layer_outputs = [4.8, 1.21, 2.385]

- the first step is to "exponentiate" the outputs using Euler's number, *e*, which is roughly *2.71828182846* and referred to as the “exponential growth” number

In [0]:
# e - mathematical constant; we use E here to match a common coding style where constants are uppercased
E = 2.71828182846  # you can also use math.e

In [39]:
# For each value in a vector, calculate the exponential value
exp_values = []
for output in layer_outputs:
    exp_values.append(E ** output)  # ** - power operator in Python
print('exponentiated values:')
print(exp_values)

exponentiated values:
[121.51041751893969, 3.3534846525504487, 10.85906266492961]


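- exponentiation serves multiple purposes: it makes every value positive while preserving the order of the inputs, which lets us treat the normalized results as probabilities, and it leads to a more meaningful loss later on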
- once we've exponentiated, we want to convert these values to a probability distribution

In [40]:
# Now normalize values
norm_base = sum(exp_values)  # We sum all values
norm_values = []
for value in exp_values:
    norm_values.append(value / norm_base) 
print('normalized exponentiated values:')
print(norm_values)

print('sum of normalized values:', sum(norm_values))

normalized exponentiated values:
[0.8952826639573506, 0.024708306782070668, 0.08000902926057876]
sum of normalized values: 1.0
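
To see why the exponentiation step matters, the sketch below normalizes a vector that contains negative values, once without and once with exponentiation. The `raw` values are hypothetical and chosen only to show the effect.

```python
import numpy as np

# Hypothetical layer outputs, two of them negative
raw = np.array([4.8, -1.21, -2.385])

# Dividing by the plain sum produces values outside the range [0, 1]
naive = raw / np.sum(raw)
print(naive)

# Exponentiating first makes every value positive and keeps the ordering
exp_values = np.exp(raw)
probabilities = exp_values / np.sum(exp_values)
print(probabilities)
print(np.sum(probabilities))
```

The naive version cannot be read as a probability distribution because it contains negative entries, while the exponentiated version is positive everywhere and still ranks the outputs in the same order.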


- we can do everything we just did above with NumPy

In [41]:
# softmax activation function
import numpy as np

layer_outputs = [4.8, 1.21, 2.385]  # the example outputs we used earlier when describing what a neural network is

# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print('exponentiated values:')
print(exp_values)

# Now normalize values
norm_values = exp_values / np.sum(exp_values)
print('normalized exponentiated values:')
print(norm_values)
print('sum of normalized values:', np.sum(norm_values))

exponentiated values:
[121.51041752   3.35348465  10.85906266]
normalized exponentiated values:
[0.89528266 0.02470831 0.08000903]
sum of normalized values: 0.9999999999999999


- along with dead neurons (whose values become 0), another common issue is "exploding" values: with so much multiplication and exponentiation being applied, values can become astronomically large

In [0]:
# `inputs` is assumed to be a 2D array with one row per sample
# get unnormalized probabilities
exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
# normalize them for each sample
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

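- here, we have some new functions and a newly added step: subtracting the max value from the inputs
- `np.exp()` does the exponentiation part, while `np.max()` and `np.sum()` both take the `axis` and `keepdims` arguments
- for a 2D array/matrix, summing with `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row
- first, let's see the *default* for `axis`, which is `None`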

In [45]:
layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

print("sum without axis")
print(np.sum(layer_outputs))

sum without axis
18.172


- with no axis specified, we are just summing *all* the values
- next, we will sum the columns:

In [46]:
print("another way to think of it w/ a matrix == axis 0: columns:")
print(np.sum(layer_outputs, axis=0))

another way to think of it w/ a matrix == axis 0: columns:
[15.11   0.451  2.611]


- next, we will sum the rows: 

In [49]:
print("so we can sum axis 1, but note the current shape:")
print(np.sum(layer_outputs, axis=1))

so we can sum axis 1, but note the current shape:
[8.395 7.29  2.487]


- we got the sums we expected, one per sample, but as a flat 1D array of shape `(3,)`
- we're trying to sum all the outputs from a layer for each sample in a batch, and then divide each row by its own sum, so the sums need to be a column of shape `(3, 1)` that lines up with the rows of the input
- we can accomplish this by using `keepdims=True`

In [50]:
print("sum axis 1, but keep the same dimensions as input:")
print(np.sum(layer_outputs, axis=1, keepdims=True))

sum axis 1, but keep the same dimensions as input:
[[8.395]
 [7.29 ]
 [2.487]]
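
The following sketch shows why the shape matters once we divide by the sums. It reuses the `layer_outputs` array from above; the other variable names are introduced here for illustration.

```python
import numpy as np

layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

row_sums_flat = np.sum(layer_outputs, axis=1)                 # shape (3,)
row_sums_kept = np.sum(layer_outputs, axis=1, keepdims=True)  # shape (3, 1)
print(row_sums_flat.shape, row_sums_kept.shape)

# Shape (3, 1) broadcasts across columns: row i is divided by the sum of row i
per_row = layer_outputs / row_sums_kept
print(np.sum(per_row, axis=1))

# Shape (3,) broadcasts across rows instead: column j is divided by the sum of row j
mismatched = layer_outputs / row_sums_flat
print(np.sum(mismatched, axis=1))
```

With a square batch like this one, the mismatched division raises no error and silently gives wrong results, since the rows no longer sum to 1. With a non-square batch, the shapes would not broadcast at all and NumPy would raise an error.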


- now let's combine all of this into a softmax class: 

In [0]:
# Softmax activation
class Activation_Softmax:

    # Forward pass
    def forward(self, inputs):

        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

        self.output = probabilities

- finally, in `exp_values` we included the subtraction of the largest of the inputs before we did the exponentiation to combat the "exploding" values problem
- here is an example of how easily values can become large:

In [55]:
np.exp(1)

2.718281828459045

In [56]:
np.exp(10)

22026.465794806718

In [59]:
np.exp(100)

2.6881171418161356e+43
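
Pushing the inputs further shows the actual failure: a 64-bit float overflows to `inf` once the input to `np.exp` goes a little past 709, and dividing `inf` by `inf` gives `nan`. The sketch below compares a naive softmax with the max-subtracted version on a hypothetical batch. The warnings NumPy would normally emit are silenced with `np.errstate` for the demonstration.

```python
import numpy as np

# Hypothetical batch: the second row contains very large values
inputs = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])

# Naive softmax: exponentiate directly (overflow warnings silenced for the demo)
with np.errstate(over='ignore', invalid='ignore'):
    naive_exp = np.exp(inputs)
    naive = naive_exp / np.sum(naive_exp, axis=1, keepdims=True)
print(naive)

# Stable softmax: subtract each row's maximum before exponentiating
shifted = inputs - np.max(inputs, axis=1, keepdims=True)
stable_exp = np.exp(shifted)
stable = stable_exp / np.sum(stable_exp, axis=1, keepdims=True)
print(stable)
```

The naive version returns `nan` for the second row, while the stable version returns the same distribution for both rows, because the two rows differ only by a constant offset.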

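- what we know about the exponential function is that its output tends to 0 as the input decreases to negative infinity, and its output is 1 when the input is 0
- after subtracting the largest input, every value is less than or equal to 0, so every exponentiated value falls between 0 and 1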

In [60]:
np.exp(-np.inf), np.exp(0)

(0.0, 1.0)
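
Subtracting the maximum would be of little use if it changed the result, and it does not. For a single sample with outputs $z_1, \dots, z_L$ and any constant $c$ (here $c = \max_l z_l$; the symbols are notation chosen for this note), the shared factor $e^{-c}$ cancels:

$$
\frac{e^{z_j - c}}{\sum_{l=1}^{L} e^{z_l - c}}
= \frac{e^{-c}\, e^{z_j}}{e^{-c} \sum_{l=1}^{L} e^{z_l}}
= \frac{e^{z_j}}{\sum_{l=1}^{L} e^{z_l}}
$$

So the shifted softmax produces exactly the same probabilities as the unshifted one, while keeping the values passed to the exponential from growing large.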

- now let's use our softmax activation function and see our neural network thus far: 

In [61]:
# Create dataset
X, y = create_data(100, 3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)  # first dense layer, 2 inputs (each sample has 2 features), 3 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (the outputs of the previous layer) and 3 output values
dense2 = Layer_Dense(3, 3)  # second dense layer, 3 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

#===========================================================================#

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Make a forward pass through activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass through second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass through activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Let's see the output of the first few samples:
print(activation2.output[:5])

[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]


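- every confidence score is close to 1/3 because the model is untrained: the weights were initialized to very small random values and the biases to 0, so the inputs to the softmax are all close to 0
- to determine which class the model has chosen as its prediction, we usually perform an `argmax`, which returns the index of the class with the highest confidence in the output distribution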
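
A minimal sketch of that `argmax` step with NumPy follows. The `softmax_outputs` values are hypothetical and stand in for the output of a trained model.

```python
import numpy as np

# Hypothetical softmax outputs for a batch of three samples (each row sums to 1)
softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.08, 0.9]])

# axis=1 picks the index of the largest confidence within each row (sample)
predictions = np.argmax(softmax_outputs, axis=1)
print(predictions)  # [0 1 2]
```

If two classes tie for the highest confidence, `np.argmax` returns the first of them.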

### Full Code

In [62]:
import numpy as np


np.random.seed(0)


# Our sample dataset
def create_data(n, k):
    X = np.zeros((n*k, 2))  # data matrix (each row = single example)
    y = np.zeros(n*k, dtype='uint8')  # class labels
    for j in range(k):
        ix = range(n*j, n*(j+1))
        r = np.linspace(0.0, 1, n)  # radius
        t = np.linspace(j*4, (j+1)*4, n) + np.random.randn(n)*0.2  # theta
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = j
    return X, y


# Dense layer
class Layer_Dense:

    # Layer initialization
    def __init__(self, inputs, neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))

    # Forward pass
    def forward(self, inputs):
        # Calculate output values from input ones, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases


# ReLU activation
class Activation_ReLU:

    # Forward pass
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)


# Softmax activation
class Activation_Softmax:

    # Forward pass
    def forward(self, inputs):

        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

        self.output = probabilities


# Create dataset
X, y = create_data(100, 3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)  # first dense layer, 2 inputs (each sample has 2 features), 3 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (the outputs of the previous layer) and 3 output values
dense2 = Layer_Dense(3, 3)  # second dense layer, 3 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Make a forward pass of our training data thru this layer
dense1.forward(X)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Let's see the output of the first few samples:
print(activation2.output[:5])

[[0.33333333 0.33333333 0.33333333]
 [0.33333317 0.33333318 0.33333364]
 [0.33333289 0.33333292 0.3333342 ]
 [0.33333259 0.33333264 0.33333477]
 [0.33333233 0.33333239 0.33333528]]


- at this point, we've completed what we need for forward-passing data through our model
- we used the rectified linear activation function (ReLU) for the hidden layer, which works on a per-neuron basis
- additionally, we used the softmax activation function for the output layer because it accepts non-normalized values as inputs and outputs a probability distribution, which we'll use as confidence scores for each class
---
- next, we need to calculate how wrong the neural network is and begin adjusting weights and biases to decrease error over time
- thus, our next step is to quantify how wrong the model is through a *loss function*