Calculating Network Error with Loss
- loss function = cost function
- loss measures how far the network's output is from the correct answer; it is the model's error. Ideally, loss should be zero.
- Classification network outputs are akin to the network's confidence in each class, so we want to increase confidence in the correct class (i.e., move the correct neuron closer to 1) and decrease misplaced confidence
- For the current task, we will use categorical cross entropy loss; other types of network output call for different loss functions => Mean squared error (regression), Binary cross entropy loss (sigmoid activation function w/ two mutually exclusive classes and a single output neuron; aka log loss)

Categorical Cross Entropy loss (Note: did some extra reading outside the book)
- used for multiple mutually exclusive classes in classification task, thus commonly used with a softmax activation layer
- cross entropy measures the difference between two probability distributions; in our case, the output distribution of the network and the actual ground truth distribution.
- categorical comes from the fact that the ground truth distribution is category based (i.e., there is only one correct category and not varying degrees of correctness/probability e.g., one hot encoded or sparse)
- categorical cross entropy loss = -sum(ground_truth_value(i) * log(predicted_value(i))); where i indexes the values in a sample's softmax output and the matching ground truth, which is one hot encoded
- one hot encoded is an array/matrix where the correct value or desired values are 1 and the rest are 0
- so the above equation in categorical cross entropy loss results in all the wrong classes being multiplied by 0 and the correct class being multiplied by 1. This results in simplification in code to just -log(predicted_value_of_correct_class)
- going forward references to log in the book mean natural log (ln = log with base e)
- Ultimately => goal is to calculate average categorical cross entropy loss for each training batch
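
Written out in notation, the points above look roughly like this. The symbols are my own choice rather than the book's: y_{i,j} is the one hot ground truth for sample i and class j, \hat{y}_{i,j} is the matching softmax output, k is the index of the correct class for sample i, and N is the number of samples in the batch.

```latex
L_i = -\sum_{j} y_{i,j} \, \log\left(\hat{y}_{i,j}\right) = -\log\left(\hat{y}_{i,k}\right)
\qquad
L_{\text{batch}} = \frac{1}{N} \sum_{i=1}^{N} L_i
```

The second equality holds only because y_{i,j} is 1 for j = k and 0 everywhere else.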

Further math intuition => 
- -log(x) is downward sloping, and where x = 1, -log(x) = 0; this works because if the network predicts 1 for the correct class, then the loss is 0 => 1 * -log(1) = 0
- As confidence decreases (lower output value), the loss approaches infinity; there is an asymptote at x = 0. This will present a problem later in the book (need to keep a predicted probability of exactly 0 from being passed to log, because log(0) is undefined)
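
A quick sketch of how -log(x) behaves as confidence in the correct class drops. The confidence values are made up for illustration.

```python
import math

# Hypothetical confidences for the correct class, from certain to nearly zero
confidences = [1.0, 0.9, 0.5, 0.1, 0.001]

# Loss for each confidence: -ln(confidence)
losses = [-math.log(c) for c in confidences]

# The loss is 0 at confidence 1.0 and grows as confidence falls
# (Python shows the loss at confidence 1.0 as -0.0, which equals 0)
for c, l in zip(confidences, losses):
    print(c, l)
```

The growth is not linear: going from 0.1 to 0.001 adds far more loss than going from 1.0 to 0.9.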

A simple example:

In [3]:
import math
# An example output from the output layer of the neural network
softmax_output = [0.7, 0.1, 0.2]

# Ground truth

target_output = [1, 0, 0]
loss = -(math.log(softmax_output[0])*target_output[0] +
math.log(softmax_output[1])*target_output[1] +
math.log(softmax_output[2])*target_output[2])
print(loss)

#simplification - see notes above for explanation => don't need to include other terms besides one in desired 
# ground truth because they go to 0

loss = -math.log(softmax_output[0])
print(loss)

0.35667494393873245
0.35667494393873245


Dynamically Taking the Log of Desired Index Point 
- we have the layer output in an array and the correct answers (ground truth) for each sample in a list
- this list can be one-hot-encoded (explained and exemplified above), or sparse (below)
- sparse means that the ground-truth array contains numbers representing the correct classes, such as 0 = dog, 1 = cat, 2 = human. So [1, 0, 2] would correspond to 3 feature samples whose ground truth outputs are cat, dog, human, as opposed to one-hot encoding, where cat would be [0, 1, 0]. So a sparse array is one-dimensional, whereas a one hot encoded array for a batch is two-dimensional
- in the examples below, the loss is averaged over the batch; this also applies to one-hot targets, it just is not shown
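
A small sketch of moving between the two encodings, using the dog/cat/human example above. The variable names and the use of np.eye are my own choices, not from the book.

```python
import numpy as np

# Sparse targets: one class index per sample (0 = dog, 1 = cat, 2 = human)
sparse_targets = np.array([1, 0, 2])
n_classes = 3  # assumed number of classes

# Sparse -> one-hot: row k of the identity matrix is the one-hot vector for class k
one_hot_targets = np.eye(n_classes, dtype=int)[sparse_targets]
print(one_hot_targets)

# One-hot -> sparse: position of the 1 in each row
recovered = np.argmax(one_hot_targets, axis=1)
print(recovered)

# Sparse is 1D, one-hot is 2D
print(sparse_targets.ndim, one_hot_targets.ndim)
```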

In [7]:
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

#where the value represents a class; e.g., 0 = dog, 1 = cat, 2 = human; so dog, cat, cat here
class_targets = [0, 1, 1] #sparse encoding

#one way to get this

#for each row in the outputs, get the value in that row corresponding with the correct class, aka for each row, get column
for targ_idx, distribution in zip(class_targets, softmax_outputs):
    print(distribution[targ_idx])

#even faster using numpy => index with [row_numbers, col_numbers]; listing every row here because we want a value from each output
print(softmax_outputs[[0, 1, 2], class_targets])

#so since we want to get the target value at each row, we always want to get each row, so can make further dynamic
print(softmax_outputs[range(len(softmax_outputs)), class_targets])
#range len counts off each row for the length of the softmax outputs array

##full sparse simplification and log and average of loss => averaging applies to one-hot too
neg_log = -np.log(softmax_outputs[range(len(softmax_outputs)), class_targets])
average_loss = np.mean(neg_log)
print(average_loss)

0.7
0.5
0.9
[0.7 0.5 0.9]
[0.7 0.5 0.9]
0.38506088005216804


Handling One hot and sparse ground truth encodings
- to make the network as flexible as possible and able to handle multiple ground truth formats (one-hot and sparse), we implement the code below
- can test whether the ground truth array is one hot or sparse by looking at its dimensions; a 2D array is one hot because each row is a list of 1s and 0s for the hot and cold classes of the respective feature sample. Sparse is 1D because each value in the array gives the ground truth class for its respective feature sample, which is also the column (neuron) index in that sample's row of the output array (see examples above); so if the class is 0, the output position is the first spot in the row, and so on and so forth
- implementation: np arrays have a shape attribute, a tuple describing their dimensions. If the tuple has length 1, the array is 1D; if it has length 2, the array is 2D, etc.
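
To see what shape reports for each encoding, here is a minimal sketch. target_format is a hypothetical helper written for these notes, not something from the book.

```python
import numpy as np

sparse_targets = np.array([0, 1, 1])
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# shape is a tuple; its length is the number of dimensions
print(sparse_targets.shape, len(sparse_targets.shape))
print(one_hot_targets.shape, len(one_hot_targets.shape))

def target_format(y_true):
    """Hypothetical helper: name the encoding based on the number of dimensions."""
    if len(y_true.shape) == 1:
        return "sparse"
    if len(y_true.shape) == 2:
        return "one-hot"
    raise ValueError("expected a 1D or 2D target array")

print(target_format(sparse_targets), target_format(one_hot_targets))
```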

In [12]:
import numpy as np
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0]])


#implementation see notes above

if len(class_targets.shape) == 1: #if 1D
    correct_confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
elif len(class_targets.shape) == 2: #if 2D
    correct_confidences = np.sum(softmax_outputs*class_targets, axis = 1) #axis = 1 to sum the values of each row
neg_log = -np.log(correct_confidences)

average_loss = np.mean(neg_log)

print(average_loss)


0.38506088005216804


Log of 0:
- the natural log is undefined at 0 => intuitively, negative exponents are fractions, so the more negative the exponent on e gets, the closer the result gets to 0, but it never touches; no exponent gives exactly 0, and as x approaches 0, ln(x) heads towards negative infinity
- sometimes the network will return 0 if it has no confidence in the correct class. In this case, np.log returns -np.inf (so the loss, -np.log, is np.inf), which when put into averages (e.g., our average of losses) will cause the average to also be infinity
- simply adding a very small value to the confidence (e.g., -ln(x + small_val)) to prevent 0 does not work because it shifts the function to the left, so when confidence is 1 (100%), the loss will be slightly negative, which indicates negative error and does not make sense. This overall biases confidence towards 1 as well, which could result in lower accuracy?
- solution is to constrain the possible range of confidences to values between very_small_value and (1 - very_small_value), using np.clip
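
A sketch comparing the three situations from the points above: shifting by a small value, taking the log of 0 unprotected, and clipping. The name eps and the sample confidences are my own assumptions.

```python
import numpy as np

eps = 1e-7  # assumed "very small value"

# Shifting: -ln(x + eps) goes slightly negative when confidence is 1
shifted_loss_at_one = -np.log(1.0 + eps)
print(shifted_loss_at_one)

# Unprotected log of 0 gives an infinite loss, which makes the mean infinite
# (np.errstate only silences the divide-by-zero warning here)
with np.errstate(divide="ignore"):
    raw_losses = -np.log(np.array([0.0, 0.5, 1.0]))
print(np.mean(raw_losses))

# Clipping both ends keeps every loss finite and positive
clipped = np.clip(np.array([0.0, 0.5, 1.0]), eps, 1 - eps)
clipped_losses = -np.log(clipped)
print(clipped_losses)
```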

In [16]:
#clipping example
y_pred = [0,1]
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
print(y_pred_clipped)

[1.000000e-07 9.999999e-01]


Classes for Loss
- creating a common loss class and a child loss class for categorical cross entropy, presumably because we will have different loss functions later in the book

In [17]:
class Loss:

    def calculate(self, output, y): #output of softmax layer, y is y_true

        sample_losses = self.forward(output, y) #forward function of child loss class
        data_loss = np.mean(sample_losses) #average loss over batch
        return data_loss
    

class Loss_CategoricalCrossentropy(Loss): #inherits from the Loss class
    
    def forward(self, y_pred, y_true):
        
        samples = len(y_pred) #number of samples, only used for 1D (sparse) indexing; not sure why it's here and not in the if statement, but following the book on this
        
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        if len(y_true.shape) == 1: #if 1D aka sparse
            correct_confidences = y_pred_clipped[range(samples), y_true]
        
        elif len(y_true.shape) == 2: #if 2D aka 1 hot encoded
            correct_confidences = np.sum(y_pred_clipped*y_true, axis = 1)

        negative_log_likelihoods = -np.log(correct_confidences)
        
        return negative_log_likelihoods
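
To check that the forward method above treats both target formats the same way, here is a standalone sketch. categorical_cross_entropy is a hypothetical function that mirrors that method so the example runs on its own, and the batch is made up, with the last sample given zero confidence in its correct class.

```python
import numpy as np

def categorical_cross_entropy(y_pred, y_true):
    """Hypothetical standalone version of the forward pass above."""
    y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
    if len(y_true.shape) == 1:      # sparse labels
        correct_confidences = y_pred_clipped[range(len(y_pred)), y_true]
    elif len(y_true.shape) == 2:    # one-hot labels
        correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
    else:
        raise ValueError("y_true must be 1D or 2D")
    return -np.log(correct_confidences)

# Last sample has zero confidence in its correct class (class 0)
y_pred = np.array([[0.7, 0.1, 0.2],
                   [0.1, 0.5, 0.4],
                   [0.0, 1.0, 0.0]])
sparse = np.array([0, 1, 0])
one_hot = np.eye(3)[sparse]  # same labels, one-hot encoded

sparse_losses = categorical_cross_entropy(y_pred, sparse)
one_hot_losses = categorical_cross_entropy(y_pred, one_hot)
print(sparse_losses)
print(one_hot_losses)
print(np.mean(sparse_losses))  # finite thanks to clipping
```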


Full Code up to this point

In [20]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()

# Dense layer
class Layer_Dense:
    
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

# ReLU activation
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

# Common loss class
class Loss:
    # Calculates the data and regularization losses
    # given model output and ground truth values
    def calculate(self, output, y):
        
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        
        # Return loss
        return data_loss
    
class Loss_CategoricalCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        # Clip data to prevent log(0)
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        # Probabilities for target values -
        # only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples),y_true]

        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped*y_true, axis=1)

        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    

# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (as we take output
# of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Perform a forward pass of our training data through this layer
dense1.forward(X)
# Perform a forward pass through activation function
# it takes the output of first dense layer here
activation1.forward(dense1.output)

# Perform a forward pass through second Dense layer
# it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Perform a forward pass through activation function (softmax)
# it takes the output of second dense layer here
activation2.forward(dense2.output)

# Let's see output of the first few samples:
print(activation2.output[:5])

# it takes the output of second dense layer activation and returns loss
loss = loss_function.calculate(activation2.output, y)

print("loss", loss)

###See below cells for further discussion on accuracy calculation
# Calculate accuracy from output of activation2 and targets
# calculate values along axis 1 (across the columns of each row)
predictions = np.argmax(activation2.output, axis=1)

if len(y.shape) == 2:
    y = np.argmax(y, axis=1)

accuracy = np.mean(predictions == y)
# Print accuracy
print('acc:', accuracy)


[[0.33333334 0.33333334 0.33333334]
 [0.33333316 0.3333332  0.33333364]
 [0.33333287 0.3333329  0.33333418]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
loss 1.0986104
acc: 0.34


Accuracy Calculation:
- accuracy measures the fraction of samples whose class was correctly identified; accuracy = number_correct/number_of_samples
- to get network prediction for a sample, get the location of max of the softmax output array => so need to take the argmax value of each row in the batch. Can be done with np.argmax, which lets you specify a max on a specific axis, and returns the location of that max
- softmax output is slightly different than above for the purpose of example

In [19]:
import numpy as np
# Probabilities of 3 samples
softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])

# Target (ground-truth) labels for 3 samples
class_targets = np.array([0, 1, 1])

# Getting the location in the output row of the predicted class
predictions = np.argmax(softmax_outputs, axis=1)

# If targets (ground truths) are one-hot encoded - convert them to 1D (also done with argmax for each row)
if len(class_targets.shape) == 2:
    class_targets = np.argmax(class_targets, axis=1) #returns location in row of 1 hot encoded value

#Returns True (1) where the locations are the same for predicted and class targets, meaning network was correct, False (0) otherwise
#averaging this is accuracy bc count of correct = sum of true (1)
accuracy = np.mean(predictions == class_targets) 
print('acc:', accuracy)

acc: 0.6666666666666666
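
The same calculation should give the same answer whichever way the targets are encoded. A sketch of that check, where accuracy is a hypothetical helper wrapping the steps above:

```python
import numpy as np

def accuracy(softmax_outputs, class_targets):
    """Hypothetical helper wrapping the accuracy steps above."""
    predictions = np.argmax(softmax_outputs, axis=1)
    if len(class_targets.shape) == 2:   # one-hot -> class indices
        class_targets = np.argmax(class_targets, axis=1)
    return np.mean(predictions == class_targets)

softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])
sparse_targets = np.array([0, 1, 1])
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

acc_sparse = accuracy(softmax_outputs, sparse_targets)
acc_one_hot = accuracy(softmax_outputs, one_hot_targets)
print(acc_sparse, acc_one_hot)
```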
