In [8]:
import numpy as np
import matplotlib.pyplot as plt
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

class dense_layer:
    def __init__(self, n_inputs, n_neurons):
        # weights are created with shape (n_inputs, n_neurons), so forward() needs no transpose
        self.weights = 0.01*np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

**Activation functions** are applied to each layer's output so that a network with two or more layers can approximate nonlinear functions.  
A neural network generally uses two kinds of activation function: one for the hidden layers and another for the output layer.  
### Why?
As noted above, activation functions let a neural network model nonlinear relationships. A nonlinear function is one that cannot be represented by a straight line; the sine function is one example. Real-world problems are usually nonlinear, because they are complex.  
Without activation functions, each neuron only computes a weighted sum of its inputs plus a bias, so the whole network, however many layers it has, can only represent a linear function of its input.
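
The point about missing activation functions can be checked numerically. The sketch below is not part of the original notebook; the layer sizes, the random seed, and the names `W1`, `b1`, `W2`, `b2` are assumptions chosen only for illustration. It shows that two dense layers stacked without an activation give the same result as one dense layer with combined weights and biases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dense layers with no activation in between (hypothetical sizes).
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))

X = rng.normal(size=(5, 2))  # batch of 5 samples with 2 features each

# Forward pass through both layers, one after the other.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single dense layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined

print(np.allclose(two_layers, one_layer))  # True
```

Adding more layers does not change this: without a nonlinearity in between, the stack always collapses to a single linear layer.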


We have several choices:

### The step activation function

The purpose of this activation function is to mimic a neuron "firing" or "not firing" based on its input. The simplest version is a step function: if the neuron's weighted sum plus bias is greater than 0, the neuron outputs 1; otherwise it outputs 0. This is rarely used nowadays.
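
A minimal sketch of the step function described above, applied to a few sample values. The helper name `step` is an assumption for illustration and does not appear elsewhere in this notebook.

```python
import numpy as np

def step(x):
    # 1 where the input is strictly greater than 0, 0 otherwise
    return np.where(np.asarray(x) > 0, 1, 0)

inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
print(step(inputs))  # [0 1 0 1 0 1 1 0]
```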

### The linear activation function

A linear activation function is the equation of a line, for instance y=x. It is usually applied to the last layer's output in regression models.

### The sigmoid activation function

The problem with the step function is that it is not informative: its result is always either 0 or 1, so we lose information about how far the neuron's output was from the threshold. It is better to have a function with more granular results. The original solution to this problem is the **sigmoid** activation function, which maps any input to a value between 0 and 1. Because it is smooth and strictly increasing, different inputs give different outputs, so we can "trace back" from the result to the original output value.
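
The sigmoid is commonly defined as $y = \frac{1}{1 + e^{-x}}$. The sketch below is an illustration rather than code from the original notebook, and the helper name `sigmoid` is an assumption. It shows that inputs which a step function would all map to 0 or all to 1 still get distinct sigmoid values.

```python
import numpy as np

def sigmoid(x):
    # squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

inputs = [-2.7, -1.0, 0.0, 1.1, 3.3]
outputs = sigmoid(inputs)

print(outputs)
print(sigmoid(0.0))  # 0.5, the midpoint of the curve
```

For very large negative inputs `np.exp` can overflow and emit a warning; this simple version does not guard against that.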

### Rectified linear activation function
The **rectified linear unit activation function** (ReLU) is a simpler function: y=x for x > 0, and y=0 otherwise.

---
## Coding ReLU

In [4]:
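inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
output = []

# plain Python: keep positive values, replace everything else with 0
for i in inputs:
    output.append(max(0, i))
print(output)

# the same operation vectorized with NumPy
output = np.maximum(0, inputs)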
print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]
[0.  2.  0.  3.3 0.  1.1 2.2 0. ]


In [9]:
class activation_relu:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
        
X, y = spiral_data(samples=100, classes=3)
dense1 = dense_layer(2,3)

activation1 = activation_relu()

dense1.forward(X)
activation1.forward(dense1.output)

print(activation1.output[:5])

[[0.         0.         0.        ]
 [0.         0.00011395 0.        ]
 [0.         0.00031729 0.        ]
 [0.         0.00052666 0.        ]
 [0.         0.00071401 0.        ]]


### The softmax activation function

Since we want our model to be a classifier, we want the activation function of the output layer to suit classification. One such function is the **softmax** function. For this purpose, the issue with ReLU is that its output is:
1. not normalized (i.e., the values can be anything)
2. exclusive (i.e., each output is independent of the others)

For classification, we want the output to reflect a prediction of which class the network thinks the input represents. A softmax activation function produces **confidence scores** for each class, which add up to 1. For instance, an output of \[0.45, 0.55\] predicts the 2nd class. The scores also tell us how confident the prediction is; here the two values are close, so the confidence is low.  
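
Written out, the usual definition of softmax for one sample is the following, where $z_j$ is the $j$-th output of the layer and $S_j$ the resulting confidence score (these symbols are introduced here only for illustration):

$$S_j = \frac{e^{z_j}}{\sum_{l} e^{z_l}}$$

Every $S_j$ is positive because of the exponential, and the scores sum to 1 because each one is divided by the same total.
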
Let's code this:

In [13]:
layer_outputs = [4.8, 1.21, 2.385]
# first we exponentiate the outputs
exp_values = np.exp(layer_outputs)
    
print(exp_values)
# we then convert these values to a probability distribution
# by dividing each exponentiated value by the sum of all exponentiated values
norm_values = exp_values / np.sum(exp_values)

print(norm_values)

[121.51041752   3.35348465  10.85906266]
[0.89528266 0.02470831 0.08000903]


In [17]:
#let's code a similar model for batches

class activation_softmax:
    def forward(self, inputs):
        # subtract each row's maximum before exponentiating to avoid overflow;
        # this shift does not change the resulting probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        
        self.output = probabilities


layer_outputs = np.array([[4.8, 1.21, 2.385],
                         [8.9, -1.81, 0.2],
                         [1.41, 1.051, 0.026]])

softmax = activation_softmax()
softmax.forward(layer_outputs)

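# axis=1 gives the predicted class for each sample;
# without it, argmax returns a single index into the flattened array
np.argmax(softmax.output, axis=1)


array([0, 0, 0])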
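
To see why the maximum is subtracted inside `forward`, the sketch below compares a softmax without the shift against one with it. The function names `softmax_naive` and `softmax_stable` and the offset of 1000 are assumptions made for this illustration. Adding the same constant to every value in a row should not change the softmax result, but the unshifted version overflows.

```python
import numpy as np

def softmax_naive(z):
    # exponentiate directly: overflows for large inputs
    exp_values = np.exp(z)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(z):
    # shift each row so its largest value is 0 before exponentiating
    exp_values = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

small = np.array([[4.8, 1.21, 2.385]])
large = small + 1000.0  # same differences between values, much larger magnitude

# For moderate inputs both versions agree.
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True

# For large inputs exp overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))  # [[nan nan nan]]

# The shifted version is unaffected by the offset.
print(np.allclose(softmax_stable(large), softmax_stable(small)))  # True
```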

In [18]:
X, y = spiral_data(samples=100, classes=3)
dense1 = dense_layer(2, 3)
activation1 = activation_relu()
dense2 = dense_layer(3,3)
activation2 = activation_softmax()

dense1.forward(X)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]]
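
Every row above is close to 1/3 for each class. A likely explanation is that the weights are initialized with a factor of 0.01 and the network is untrained, so the values reaching the softmax are all very close to zero. The sketch below is an illustration with made-up inputs (the `logits` array and the `scale` values are assumptions). It shows how softmax output approaches a uniform distribution as its inputs shrink.

```python
import numpy as np

def softmax(z):
    exp_values = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0]])  # hypothetical layer outputs

# Shrinking the inputs shrinks the differences between them,
# and the confidence scores move toward 1/3 each.
for scale in (1.0, 0.01, 0.0001):
    print(scale, softmax(scale * logits))
```

With `scale=1.0` the third class clearly dominates, while at `scale=0.0001` the three scores are nearly indistinguishable.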
