### Single Neuron
Every neuron has its own weighted connection to every neuron in the previous layer, as well as its own bias
The following two cells each calculate the output of a single neuron

In [30]:
inputs = [1.2, 5.1, 2.1]
weights = [3.1, 2.1, 8.7]
bias = 3

output = inputs[0] * weights[0] + inputs[1] * weights[1] + inputs[2] * weights[2] + bias
output

35.7

In [31]:
inputs = [1, 2, 3]
weights = [0.2, 0.8, -0.5]
bias = 2

output = inputs[0] * weights[0] + inputs[1] * weights[1] + inputs[2] * weights[2] + bias
output

2.3

### A Single Full Layer
To model a layer of 3 neurons with 4 inputs each, we need 3 unique weight sets, each with 4 values, and 3 unique biases

In [29]:
inputs = [1, 2, 3, 2.5]
weights1 = [0.2, 0.8, -0.5, 1.0]
weights2 = [0.5, -0.91, 0.26, -0.5]
weights3 = [-0.26, -0.27, 0.17, 0.87]

bias1 = 2
bias2 = 3
bias3 = 0.5

outputs = [
    inputs[0]*weights1[0] + inputs[1]*weights1[1] + inputs[2]*weights1[2] + inputs[3]*weights1[3] + bias1,
    inputs[0]*weights2[0] + inputs[1]*weights2[1] + inputs[2]*weights2[2] + inputs[3]*weights2[3] + bias2,
    inputs[0]*weights3[0] + inputs[1]*weights3[1] + inputs[2]*weights3[2] + inputs[3]*weights3[3] + bias3,
]

outputs

[4.8, 1.21, 2.385]

The whole crux of deep learning is figuring out the weights and biases that give the output values we want
The following rewrites the same calculation with loops, so that it scales to any number of neurons and inputs

In [33]:
inputs = [1, 2, 3, 2.5]
weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]
          ]

biases = [2,
          3,
          0.5,
         ]

layer_outputs = []

for neuron_weights, neuron_bias in zip(weights, biases):
    neuron_output = 0
    for n_input, weight in zip(inputs, neuron_weights):
        neuron_output += n_input * weight
    neuron_output += neuron_bias
    
    layer_outputs.append(neuron_output)

layer_outputs

[4.8, 1.21, 2.385]

### Shape
The shape of an array is the size of each of its dimensions, listed from the outermost to the innermost

In [35]:
array = [1,5,6,2]
shape = (4, )

In [37]:
array = [[1,5,6,2],
         [3,2,1,3]]
shape = (2, 4, )

In [38]:
array = [[[1,5,6,2],
         [3,2,1,3]],
         [[1,5,6,2],
         [3,2,1,3]],
         [[1,5,6,2],
         [3,2,1,3]]
        ]
shape = (3, 2, 4, )

A 2D array is a matrix
These arrays need to be homologous: along each dimension, every sub-array must have the same length as its siblings
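
A quick way to check these shapes is to let NumPy report them. This is a small sketch that assumes NumPy is installed; the variable names `vector`, `matrix` and `tensor` are illustrative and do not come from the cells above.

```python
import numpy as np

vector = np.array([1, 5, 6, 2])
matrix = np.array([[1, 5, 6, 2],
                   [3, 2, 1, 3]])
# Stack three copies of the matrix to get a 3D array
tensor = np.stack([matrix, matrix, matrix])

print(vector.shape)  # (4,)
print(matrix.shape)  # (2, 4)
print(tensor.shape)  # (3, 2, 4)
```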

## Dot Product
How do we multiply the weights with the inputs efficiently?
The dot product of two vectors of the same length is the sum of the products of their corresponding elements
The result is a single scalar

In [39]:
import numpy as np

In [44]:
inputs = [1, 2, 3, 2.5]
weights = [0.2, 0.8, -0.5, 1.0]
bias = 2

output = np.dot(inputs, weights) + bias #Here the order doesn't matter, as both arrays are of single dimension
output

4.8
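
To make the definition concrete, the same value can be reproduced without NumPy. This is a minimal sketch; the helper name `dot` is made up for illustration and is not part of any library.

```python
def dot(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

inputs = [1, 2, 3, 2.5]
weights = [0.2, 0.8, -0.5, 1.0]
bias = 2

# 1*0.2 + 2*0.8 + 3*(-0.5) + 2.5*1.0 = 2.8, then the bias is added
output = dot(inputs, weights) + bias
print(output)
```

Up to floating-point rounding, this matches the 4.8 returned by `np.dot` above.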

### Dot Product of a Layer of Neurons

In [45]:
inputs = [1, 2, 3, 2.5]
weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]
          ]

biases = [2, 3, 0.5]

outputs = np.dot(weights, inputs) + biases
outputs

array([4.8  , 1.21 , 2.385])

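Now that the weights are a matrix (a list of weight vectors), the order of the arguments matters for the dot product
Because we have 3 sets of weights, we want the result indexed by weight set, so the weights come first: a (3, 4) matrix dotted with a (4,) vector gives one output per neuron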
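
The effect of the argument order can be checked directly. The sketch below reuses the values from the cell above and only inspects shapes; the flag `order_matters` is an illustrative name.

```python
import numpy as np

inputs = np.array([1, 2, 3, 2.5])                 # shape (4,)
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])  # shape (3, 4)

# (3, 4) dot (4,) -> (3,): one value per neuron
print(np.dot(weights, inputs).shape)

# (4,) dot (3, 4) fails: the 4 inputs do not line up with the 3 rows
try:
    np.dot(inputs, weights)
    order_matters = False
except ValueError as err:
    order_matters = True
    print("ValueError:", err)
```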

### Batches
Because most of the calculations are independent of each other, it makes sense to process multiple sets of inputs together as a single batch
This is also why neural networks are usually run on GPUs, which have hundreds to thousands of cores, rather than on CPUs, which have far fewer

Another reason to batch is that showing the network several samples at once helps it generalize to the data
If it tries to adjust the weights and biases for every single input set, the updates may jump around very chaotically

Batch sizes that are too large can hurt as well, for example by overfitting
32 is a common go-to batch size

In [51]:
inputs = [[1, 2, 3, 2.5],
          [2.0, 5.0, -1.0, 2.0],
          [-1.5, 2.7, 3.3, -0.8],
         ]
# size = (3, 4)
weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]
          ]
#size = (3, 4)

biases = [2, 3, 0.5]

outputs = np.dot(inputs, weights) + biases
outputs

ValueError: shapes (3,4) and (3,4) not aligned: 4 (dim 1) != 3 (dim 0)

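The shapes are not aligned because we can't take the dot product of a (3, 4) matrix with another (3, 4) matrix: the number of columns of the first must equal the number of rows of the second
We need the weights matrix to be (4, 3)
Swapping the rows and columns of a matrix this way is known as a **Transpose**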

#### Transpose

In [50]:
inputs = [[1, 2, 3, 2.5],
          [2.0, 5.0, -1.0, 2.0],
          [-1.5, 2.7, 3.3, -0.8],
         ]
# size = (3, 4)

weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]
          ]
#size = (3, 4)
weights = np.array(weights).T
#size = (4, 3)

biases = [2, 3, 0.5]

outputs = np.dot(inputs, weights) + biases
outputs

array([[ 4.8  ,  1.21 ,  2.385],
       [ 8.9  , -1.81 ,  0.2  ],
       [ 1.41 ,  1.051,  0.026]])
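
As a sanity check, the batched result should equal what we get by pushing each sample through the layer one at a time. The sketch below assumes the same values as the cell above; the names `batched` and `one_by_one` are illustrative.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])
biases = np.array([2, 3, 0.5])

# Whole batch at once: (3, 4) dot (4, 3) -> (3, 3)
batched = np.dot(inputs, weights.T) + biases

# One sample at a time: (3, 4) dot (4,) -> (3,) for each row of inputs
one_by_one = np.array([np.dot(weights, sample) + biases for sample in inputs])

print(np.allclose(batched, one_by_one))  # True
```

Each row of the batched output belongs to one sample, and each column to one neuron.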

### Second Layer
We're going to need another set of weights and biases

In [56]:
inputs = [[1, 2, 3, 2.5],
          [2.0, 5.0, -1.0, 2.0],
          [-1.5, 2.7, 3.3, -0.8],
         ]

weights1 = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]
          ]
weights1 = np.array(weights1).T

biases1 = [2, 3, 0.5]


weights2 = [[0.1, -0.14, 0.5],
            [-0.5, 0.12, -0.33],
            [-0.44, 0.73, -0.13]
          ]
weights2 = np.array(weights2).T

biases2 = [-1, 2, -0.5]

layer1_outputs = np.dot(inputs, weights1) + biases1
layer2_outputs = np.dot(layer1_outputs, weights2) + biases2

layer2_outputs

array([[ 0.5031 , -1.04185, -2.03875],
       [ 0.2434 , -2.7332 , -5.7633 ],
       [-0.99314,  1.41254, -0.35655]])

### Scalable Number of Layers
When you save a model, you're saving its weights and biases. Loading a model just means setting the weights and biases to the saved values

Initializing a neural network means setting the weights to small random numbers, generally between -1 and 1. The biases are often just started at 0, as in the class below

In [64]:
np.random.seed(0)

# The standard name for the input set in machine learning is X
X = [[1, 2, 3, 2.5],
     [2.0, 5.0, -1.0, 2.0],
     [-1.5, 2.7, 3.3, -0.8],
    ]

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        # np.random.randn draws from a Gaussian distribution centred on 0
            # it accepts the shape as its parameters
        # we need to know 2 things about the shape from the programmer when this layer is created
            # n_inputs: the number of features coming in per input set
            # n_neurons: how many neurons we have in this layer
        # multiplying by 0.1 bounds the weights even tighter around 0
        # the weights are shaped (n_inputs, n_neurons), so the transpose is no longer necessary
        self.biases = np.zeros((1, n_neurons))
        # np.zeros accepts the shape as a tuple for the first parameter
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

layer1 = Layer_Dense(4, 5)
# since the input is X, we need to set the n_inputs accordingly
layer2 = Layer_Dense(5,2)
# since the input is the output from the previous layer, the input is the size of the previous layer
# the output can be of any size we want


layer1.forward(X)
layer2.forward(layer1.output)
layer2.output

array([[ 0.148296  , -0.08397602],
       [ 0.14100315, -0.01340469],
       [ 0.20124979, -0.07290616]])

The layer2 output has shape (3, 2): 3 rows because that's the size of the batch of input data, and 2 columns because layer2 has 2 neurons
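
To see how the shapes flow through the two layers, we can print them at each step. This sketch repeats the `Layer_Dense` class from the cell above so that it runs on its own, and only looks at shapes, not values.

```python
import numpy as np

np.random.seed(0)

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

X = np.array([[1, 2, 3, 2.5],
              [2.0, 5.0, -1.0, 2.0],
              [-1.5, 2.7, 3.3, -0.8]])

layer1 = Layer_Dense(4, 5)
layer2 = Layer_Dense(5, 2)
layer1.forward(X)
layer2.forward(layer1.output)

print(X.shape)               # (3, 4): 3 samples, 4 features each
print(layer1.weights.shape)  # (4, 5): 4 inputs feeding 5 neurons
print(layer1.output.shape)   # (3, 5): 3 samples, 5 neuron outputs each
print(layer2.output.shape)   # (3, 2): 3 samples, 2 neuron outputs each
```

The first dimension stays at the batch size all the way through; only the second dimension changes from layer to layer.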

# Activation Function
Comes after the weights and biases: it is applied to each neuron's weighted sum plus bias, and determines the neuron's output
Ensures the output is within a reasonable range

## Step Function
A function that outputs 1 if x is more than 0, and 0 otherwise
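
A minimal sketch of the step function in NumPy. The function name `step` is illustrative, and mapping x = 0 to 0 is one possible convention.

```python
import numpy as np

def step(x):
    # 1 where x > 0, otherwise 0 (x == 0 maps to 0 under this convention)
    return np.where(np.asarray(x) > 0, 1, 0)

print(step([-2.0, -0.1, 0.0, 0.3, 5.0]))  # [0 0 0 1 1]
```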

## Sigmoid Function
* More reliable than the step function, with a more granular output
* Squashes whatever the input was into the range between 0 and 1
* The granularity is very useful when calculating the error
* Suffers from the vanishing gradient problem
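
The sketch below shows the sigmoid and its gradient side by side. The function names are illustrative; the gradient formula s(x) * (1 - s(x)) is the standard derivative of the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

print(sigmoid(x))           # strictly between 0 and 1, exactly 0.5 at x = 0
print(sigmoid_gradient(x))  # peaks at 0.25 for x = 0, close to 0 at both ends
```

The tiny gradients at the two ends are where the vanishing gradient problem comes from: once the input is far from 0, the output barely reacts to changes in it.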

## Rectified Linear Unit Activation Function
* If x is more than 0, return x, else return 0
* Much faster than Sigmoid
* Much less prone to the vanishing gradient problem than Sigmoid
* The most popular activation function for hidden layers

# Why Activation Function?
* Just using weights and biases means we're effectively using a linear activation function (return x)
* A network with only linear activation functions can't fit a non-linear function
    * Such as a straight line trying to fit a sine wave
* The Rectified Linear is ALMOST linear, but the bend at 0 makes it non-linear enough to be able to fit non-linear functions
* Honestly just rewatch [this](https://www.youtube.com/watch?v=gmjzbpSVY1A&list=PLQVvvaa0QuDcjD5BAw2DxE6OF2tius3V3&index=5) video if you need a refresher
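
One way to see why linear layers alone are not enough: two stacked layers with no activation in between collapse into a single linear layer. This is a sketch with random made-up weights; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(5, 4))    # 5 samples, 4 features
w1 = rng.normal(size=(4, 6))
b1 = rng.normal(size=(1, 6))
w2 = rng.normal(size=(6, 3))
b2 = rng.normal(size=(1, 3))

# Two layers with no activation function in between
two_linear = np.dot(np.dot(x, w1) + b1, w2) + b2

# The same mapping written as one layer
w_combined = np.dot(w1, w2)
b_combined = np.dot(b1, w2) + b2
one_linear = np.dot(x, w_combined) + b_combined

print(np.allclose(two_linear, one_linear))  # True

# With ReLU in between, the network is no longer that single linear layer
with_relu = np.dot(np.maximum(0, np.dot(x, w1) + b1), w2) + b2
print(np.allclose(with_relu, one_linear))
```

However many linear layers are stacked, the result is still one straight-line mapping, so the extra layers add nothing until a non-linear activation sits between them.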

In [77]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

X, y = spiral_data(100, 3) # 100 samples per class, 3 classes (300 samples in total)

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        
class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

layer1 = Layer_Dense(2, 5) # each spiral sample has 2 features (its x and y coordinates), so n_inputs is 2
activation1 = Activation_ReLU()

layer1.forward(X)
activation1.forward(layer1.output)
activation1.output

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.65504505e-04,
        4.56846210e-05],
       [0.00000000e+00, 5.93469958e-05, 0.00000000e+00, 2.03573116e-04,
        6.10024377e-04],
       ...,
       [1.13291524e-01, 0.00000000e+00, 0.00000000e+00, 8.11079666e-02,
        0.00000000e+00],
       [1.34588361e-01, 0.00000000e+00, 3.09493970e-02, 5.66337556e-02,
        0.00000000e+00],
       [1.07817926e-01, 0.00000000e+00, 0.00000000e+00, 8.72561932e-02,
        0.00000000e+00]], dtype=float32)

We see a lot of 0 values in the batch, but that's not really a problem because the weights and biases will be altered later on as the model is optimized

If too many neurons keep outputting 0, we can initialize the biases to a non-zero value

### Softmax Activation Function
Let's say our network gave us the output

In [78]:
layer_outputs = [4.8, 1.21, 2.385]

* If the model is meant to classify images as dog, cat, or neither, we want the output layer to be a probability distribution, which cannot be the case if the neurons are unbounded and independent of one another
* We COULD just sum it all up and divide each value by the total, but negative values break this. If ReLU is used to get rid of them, it clips every negative to 0, regardless of whether it was -1 or -20 or -9000, and it becomes impossible to learn how wrong the output was
    * You've lost all meaning once you've clipped
* Just using a linear activation function does not get rid of the negative problem either
* We can make all numbers positive, without losing the meaning, through **exponentiation**

In [83]:
from math import e

exp_values = [e**x for x in layer_outputs]

exp_values

[121.51041751873483, 3.353484652549023, 10.859062664920513]

Next step would be to normalize the outputs, i.e. turn it into a probability distribution by dividing all the values by the total

In [86]:
norm_base = sum(exp_values)
norm_values = [x/norm_base for x in exp_values]

print(norm_values)
sum(norm_values)

[0.8952826639572619, 0.024708306782099374, 0.0800090292606387]


0.9999999999999999

###### The same, but done in numpy

In [89]:
from math import e
import numpy as np

exp_values = np.exp(layer_outputs)
norm_values = exp_values / np.sum(exp_values)

print(norm_values)
np.sum(norm_values)

[0.89528266 0.02470831 0.08000903]


0.9999999999999999

##### In Summary
1. Took the set of values from the output layer, e.g. \[dog, cat, human\]
2. Exponentiated every value with base e
3. Normalized it all, by dividing each value by the sum
4. Output

The combination of exponentiation and the normalization is the **Softmax Function**
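
Written as a formula for a single sample, the two steps combine as follows. The symbols are assumptions introduced here for illustration: $z_j$ is the raw output of neuron $j$ in the output layer, and $S_j$ is the resulting probability.

$$
S_j = \frac{e^{z_j}}{\sum_{l} e^{z_l}}
$$

Every $S_j$ is positive and the values sum to 1, which is what makes the result usable as a probability distribution.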

Next up, we convert this code to support batches

In [101]:
layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

exp_values = np.exp(layer_outputs) # nothing changes with this step

#for the normalization, we want to iterate over the 3 items in the batch and divide them by the sum individually
norm_bases = np.sum(exp_values, axis = 1, keepdims = True) 
# axis = 0 sums down each column, axis = 1 sums across each row
# keepdims = True keeps the result as a (3, 1) column, so it lines up with exp_values for the division

norm_bases

array([[ 135.72296484],
       [7333.35859605],
       [   7.98280655]])

In [103]:
norm_values = exp_values / norm_bases

print(np.sum(norm_values, axis = 1))
norm_values

[1. 1. 1.]


array([[8.95282664e-01, 2.47083068e-02, 8.00090293e-02],
       [9.99811129e-01, 2.23163963e-05, 1.66554348e-04],
       [5.13097164e-01, 3.58333899e-01, 1.28568936e-01]])

* One problem with this approach is that, because we're exponentiating, the output does not have to be very high for e^output to cause an overflow
* We can solve this problem by first subtracting the largest value in each sample (row) from every value in that sample
    * This makes every input to exp at most 0, so the exponentiated values fall between 0 and 1
    * This still returns identical normalized values, while protecting us from an overflow
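
The overflow and the fix can be demonstrated side by side. This is a sketch; the function names `softmax_naive` and `softmax_stable` and the large input values are made up for illustration.

```python
import numpy as np

def softmax_naive(z):
    exp_values = np.exp(z)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(z):
    # Subtract each row's own maximum, so the largest exponent is e**0 = 1
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

small = np.array([[4.8, 1.21, 2.385]])
large = np.array([[1000.0, 999.0, 998.0]])

# For ordinary inputs both versions agree
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True

# For large inputs e**1000 overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))  # [[nan nan nan]]

# The shifted version stays finite and still sums to 1
print(softmax_stable(large))
```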

### All so far put together

In [108]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        
class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
        
class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis = 1, keepdims = True))
        probabilities = exp_values / np.sum(exp_values, axis = 1, keepdims = True)
        self.output = probabilities
        
X, y = spiral_data(100, 3) # 100 samples per class, 3 classes (300 samples in total)

dense1 = Layer_Dense(2, 3)
activation1 = Activation_ReLU()

dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)

activation2.output[:5]


array([[0.33333334, 0.33333334, 0.33333334],
       [0.33331734, 0.3333183 , 0.33336434],
       [0.3332888 , 0.33329153, 0.33341965],
       [0.33325943, 0.33326396, 0.33347666],
       [0.33323312, 0.33323926, 0.33352762]], dtype=float32)

# Training
* We now want to know not just what's right and what's wrong, but also how right and how wrong
    * We do so through a **Loss Function**

### Loss Function
* Since the neural network does not output a discrete classification, but a probability distribution, it does not make sense to throw away the probabilities and only try to optimize for the accuracy of the classes
* One simple loss function is the **Mean Absolute Error**
    * The mean of the absolute differences between the predicted values and the target values
* In general, the loss function of choice for classification is the **Categorical Cross-Entropy**
    * Take the negative sum, over every value in the distribution, of the target value multiplied by the log of the predicted value
    * Very convenient for backpropagation and optimization
    * The formula simplifies when the targets are **One-Hot Encoded**
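
For comparison with cross-entropy, the mean absolute error takes only a couple of lines. The two arrays below are made-up values for illustration.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])  # made-up target distribution
y_pred = np.array([0.7, 0.1, 0.2])  # made-up prediction

# Mean of the absolute differences: (0.3 + 0.1 + 0.2) / 3
mae = np.mean(np.abs(y_pred - y_true))
print(mae)
```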

#### One-Hot Encoding
* Have a vector as long as the number of classes, filled with 0s except at the target index, which holds a 1
* e.g.:
    * Classes = 3
    * Label = 0
    * One-Hot = \[1, 0, 0\]
    * Prediction = \[0.7, 0.1, 0.2\]
    * $L = -\sum_j y_j \log(\hat{y}_j) = -(1 \cdot \log(0.7) + 0 \cdot \log(0.1) + 0 \cdot \log(0.2)) = -(-0.3567 + 0 + 0) = 0.3567$, where $\log$ is the natural logarithm
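
The example above can be reproduced in NumPy. This is a sketch: building the one-hot vector from a row of `np.eye` is one convenient option, and the names `loss_full` and `loss_short` are illustrative.

```python
import numpy as np

n_classes = 3
label = 0
prediction = np.array([0.7, 0.1, 0.2])

# Row `label` of the identity matrix is the one-hot vector for that label
one_hot = np.eye(n_classes)[label]
print(one_hot)  # [1. 0. 0.]

# The full sum over all classes ...
loss_full = -np.sum(one_hot * np.log(prediction))
# ... equals the negative log of the confidence at the target index
loss_short = -np.log(prediction[label])

print(loss_full, loss_short)
```

Because every term except the target one is multiplied by 0, only the predicted confidence for the correct class ends up mattering.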
    

In [111]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = [0, 1, 1]
# the correct class index for each item in the batch: class 0 for the first item, class 1 for the second and third

logs = -np.log(softmax_outputs[[0,1,2], [class_targets]])
# numpy arrays can receive a list of indices you're interested in
# for the first dimension, we're picking all of the rows, 0 to 2
# for the second dimension, we're only picking the target class of each row
# because the one-hot targets multiply the confidences of the wrong choices by 0, we can just ignore them altogether

print(logs)
np.mean(logs)

[[0.35667494 0.69314718 0.10536052]]


0.38506088005216804

* The problem with this approach arises when our model gives a prediction of 0 for the right class
    * -log(0) is infinite, and this infinitely wrong output will mess up the mean of the loss for the entire batch
    * One solution is to clip the probabilities by a small insignificant amount, such as between 1e-7 and 1 - 1e-7
        * The upper end is clipped by the same amount so that the clipping does not bias the loss in one direction
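
The effect of clipping can be seen on a few made-up confidences. This is a sketch; the array `y_pred` holds hypothetical confidences for the correct class, not output from the network above.

```python
import numpy as np

y_pred = np.array([0.0, 0.5, 1.0])  # made-up confidences for the correct class

# Without clipping, the first entry is inf, which would make the mean inf too
with np.errstate(divide="ignore"):
    print(-np.log(y_pred))

# With clipping, every loss is finite; the worst case is about 16.12
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
losses = -np.log(y_pred_clipped)
print(losses)
print(np.mean(losses))
```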

### Put Together

In [118]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        
class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
        
class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis = 1, keepdims = True))
        probabilities = exp_values / np.sum(exp_values, axis = 1, keepdims = True)
        self.output = probabilities

class Loss:
    def calculate(self, outputs, y):
        # y is the intended target values
        sample_losses = self.forward(outputs, y)
        data_loss = np.mean(sample_losses)
        return data_loss
    
class Loss_CategoricalCrossEntropy(Loss):
    #inheriting from the base Loss class
    def forward(self, y_pred, y_true):
        # y_pred will come from the neural network
        # y_true will come from the training set
        samples = len(y_pred)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)
        
        if len(y_true.shape) == 1:
            # means scalar class values have been passed
            correct_confidences = y_pred_clipped[range(samples), y_true]
        elif len(y_true.shape) == 2:
            # one hot encoded values have been passed
            correct_confidences = np.sum(y_pred_clipped * y_true, axis = 1)
            
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
        
        
X, y = spiral_data(100, 3) # 100 samples per class, 3 classes (300 samples in total)

dense1 = Layer_Dense(2, 3)
activation1 = Activation_ReLU()

dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])

loss_function = Loss_CategoricalCrossEntropy()
loss = loss_function.calculate(activation2.output, y)

loss

[[0.33333334 0.33333334 0.33333334]
 [0.33331734 0.3333183  0.33336434]
 [0.3332888  0.33329153 0.33341965]
 [0.33325943 0.33326396 0.33347666]
 [0.33323312 0.33323926 0.33352762]]


1.098445
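
This value is about what we would expect from an untrained network. The printed probabilities are all close to 1/3, and a model that always gives the correct class a confidence of exactly 1/3 would score the loss computed below. The variable names are illustrative.

```python
import numpy as np

n_classes = 3
uniform_confidence = 1 / n_classes

# Cross-entropy when the correct class always gets confidence 1/3
baseline_loss = -np.log(uniform_confidence)
print(baseline_loss)  # about 1.0986
```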

Our goal is now to decrease this loss
