Calculating Network Error with Loss
- loss function = cost function
- loss measures how far the network's output is from the correct answer; it is the model's error, so ideally loss should be zero.
- Classification network outputs are akin to the network's confidence in each class, so we want to increase confidence in the correct class (i.e., move the correct neuron closer to 1) and decrease misplaced confidence in the wrong classes
- For the current task we will use categorical cross entropy loss; other types of network outputs call for different functions => Mean squared error (regression), Binary cross entropy loss (sigmoid activation function w/ two mutually exclusive classes and a single output neuron; aka log loss)
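
To make the two alternative losses named above concrete, here is a minimal sketch. All numbers are made up for illustration and the variable names are my own, not from the book.

```python
import numpy as np

# Mean squared error for a regression output (values are made up)
predictions = np.array([2.5, 0.0, 2.0])
targets = np.array([3.0, -0.5, 2.0])
mse = np.mean((predictions - targets) ** 2)
print(mse)

# Binary cross entropy for a single sigmoid output neuron per sample
# (values are made up; each output is the predicted probability of class 1)
sigmoid_outputs = np.array([0.9, 0.2, 0.6])
binary_targets = np.array([1, 0, 1])
bce = np.mean(-(binary_targets * np.log(sigmoid_outputs)
                + (1 - binary_targets) * np.log(1 - sigmoid_outputs)))
print(bce)
```

Both losses are averaged over the samples, the same way the categorical cross entropy loss is averaged per batch further down.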

Categorical Cross Entropy loss (Note: did some extra reading outside the book)
- used for multiple mutually exclusive classes in classification task, thus commonly used with a softmax activation layer
- cross entropy measures the difference between two probability distributions; in our case, the output distribution of the network and the actual ground truth distribution.
- categorical comes from the fact that the ground truth distribution is category based (i.e., there is only one correct category and not varying degrees of correctness/probability e.g., one hot encoded or sparse)
- categorical cross entropy loss = -sum(ground_truth_value(i) * log(predicted_value(i))); where i indexes the values in the softmax output vector and the ground truth vector is one hot encoded
- one hot encoded is an array/matrix where the correct value or desired values are 1 and the rest are 0
- so the above equation in categorical cross entropy loss results in all the wrong classes being multiplied by 0 and the correct class being multiplied by 1. This results in simplification in code to just -log(predicted_value_of_correct_class)
- going forward references to log in the book mean natural log (ln = log with base e)
- Ultimately => goal is to calculate average categorical cross entropy loss for each training batch

Further math intuition => 
- -log(x) is downward sloping, and at x = 1, -log(x) = 0; this works because if the network predicts 1 for the correct class, the loss is 0 => 1 * -log(1) = 0
- As confidence decreases (lower output value), the loss approaches infinity; there is an asymptote at x = 0. This will present a problem later in the book because log(0) is undefined (need to add a very small value to the predicted probability so we never pass 0)
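
A minimal sketch of one way to guard against log(0), assuming we clip the predictions into a small range before taking the log. The epsilon value of 1e-7 and the example batch are my own assumptions, not values taken from these notes.

```python
import numpy as np

# Hypothetical batch where the second sample gives probability 0 to the correct class
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.0, 1.0, 0.0]])
class_targets = np.array([0, 0])  # sparse labels: class 0 is correct for both samples

# epsilon is an assumed small constant; clipping on both ends keeps values in (0, 1)
epsilon = 1e-7
clipped = np.clip(softmax_outputs, epsilon, 1 - epsilon)

# Pick the confidence of the correct class in each row, then take -log
correct_confidences = clipped[np.arange(len(clipped)), class_targets]
losses = -np.log(correct_confidences)
print(losses)  # both values are finite; the second is large but not infinite
```

Without the clip, the second sample would produce -log(0), which numpy reports as inf and which would make the batch average inf as well.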

A simple example:

In [3]:
import math
# An example output from the output layer of the neural network
softmax_output = [0.7, 0.1, 0.2]

# Ground truth

target_output = [1, 0, 0]
loss = -(math.log(softmax_output[0])*target_output[0] +
         math.log(softmax_output[1])*target_output[1] +
         math.log(softmax_output[2])*target_output[2])
print(loss)

#simplification - see notes above for explanation => don't need to include other terms besides one in desired 
# ground truth because they go to 0

loss = -math.log(softmax_output[0])
print(loss)

0.35667494393873245
0.35667494393873245


Dynamically Taking the Log of Desired Index Point 
- we have the layer output in an array and the correct answers for the layer in a list
- this list can be one-hot-encoded (explained and exemplified above), or sparse (below)
- sparse means that the ground-truth array contains numbers representing the correct classes, such as 0 = dog, 1 = cat, 2 = human. So [1, 0, 2] would correspond to 3 feature samples whose ground truth outputs are cat, dog, human. As opposed to one-hot-encoded, where cat would be [0, 1, 0]. So a sparse array will be 1D (one label per sample), whereas a one-hot-encoded array will be 2D (one row per sample)
- in the example below, the loss is averaged over the batch; this also applies to one-hot targets, it just was not shown above
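
A small sketch of how the two encodings relate, using the [1, 0, 2] example from the notes above. The np.eye indexing trick and the variable names are my own choice for illustration.

```python
import numpy as np

# Sparse labels from the notes above: cat, dog, human
class_targets = np.array([1, 0, 2])
n_classes = 3  # assumed number of classes (dog, cat, human)

# Row k of the identity matrix is the one-hot vector for class k
one_hot_targets = np.eye(n_classes, dtype=int)[class_targets]
print(one_hot_targets)

# argmax along each row recovers the sparse labels
recovered = np.argmax(one_hot_targets, axis=1)
print(recovered)

# Sparse is 1D, one-hot is 2D
print(class_targets.shape, one_hot_targets.shape)
```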

In [7]:
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

#where the value represents a class; e.g., 0 = dog, 1 = cat, 2 = human; so dog, cat, cat here
class_targets = [0, 1, 1] #sparse encoding

#one way to get this

#for each row in the outputs, get the value in that row corresponding with the correct class, aka for each row, get column
for targ_idx, distribution in zip(class_targets, softmax_outputs):
    print(distribution[targ_idx])

#even faster using numpy => index with [row_numbers, col_numbers]; list every row because we want one value per output
print(softmax_outputs[[0, 1, 2], class_targets])

#so since we want to get the target value at each row, we always want to get each row, so can make further dynamic
print(softmax_outputs[range(len(softmax_outputs)), class_targets])
#range len counts off each row for the length of the softmax outputs array

##full sparse simplification and log and average of loss => averaging applies to one-hot too
neg_log = -np.log(softmax_outputs[range(len(softmax_outputs)), class_targets])
average_loss = np.mean(neg_log)
print(average_loss)

0.7
0.5
0.9
[0.7 0.5 0.9]
[0.7 0.5 0.9]
0.38506088005216804


Handling One hot and sparse ground truth encodings
- to make the network as flexible as possible and able to handle multiple ground truth formats (one-hot and sparse), we implement the code below
- can test whether the ground truth array is one hot or sparse by looking at its dimensions; a 2D array is one hot because each output row is a list of 1s and 0s for the hot and cold classes of the respective feature sample. Sparse is 1D because each value in the array gives the ground truth class for its respective feature sample, and that value is also the column (neuron) index in the output array (see examples above), so if the class is 0 then the output position is the first spot in the row, and so on
- implementation: np arrays have a shape attribute, which describes their dimensionality. If shape is a tuple of length 1, the array is 1D; if it is a tuple of length 2, the array is 2D, etc.

In [12]:
import numpy as np
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0]])


#implementation see notes above

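if len(class_targets.shape) == 1: #if 1D => sparse labels
    correct_confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
elif len(class_targets.shape) == 2: #if 2D => one-hot labels
    correct_confidences = np.sum(softmax_outputs*class_targets, axis = 1) #axis = 1 to sum the values of each row
else: #anything else is not a supported ground truth format
    raise ValueError("class_targets must be 1D (sparse) or 2D (one-hot)")
neg_log = -np.log(correct_confidences)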

average_loss = np.mean(neg_log)

print(average_loss)


0.38506088005216804
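
The pieces above could be pulled together into one reusable function. This is only a sketch: the function name, the epsilon default of 1e-7, and the clipping step are my own assumptions rather than the book's implementation.

```python
import numpy as np

def categorical_cross_entropy(softmax_outputs, class_targets, epsilon=1e-7):
    """Average categorical cross entropy loss for a batch (hypothetical helper)."""
    softmax_outputs = np.asarray(softmax_outputs, dtype=float)
    class_targets = np.asarray(class_targets)

    # Clip so that log(0) can never occur; epsilon is an assumed small constant
    clipped = np.clip(softmax_outputs, epsilon, 1 - epsilon)

    if class_targets.ndim == 1:
        # Sparse labels: pick the correct-class column in every row
        correct_confidences = clipped[np.arange(len(clipped)), class_targets]
    elif class_targets.ndim == 2:
        # One-hot labels: multiplying zeroes out the wrong classes, then sum each row
        correct_confidences = np.sum(clipped * class_targets, axis=1)
    else:
        raise ValueError("class_targets must be 1D (sparse) or 2D (one-hot)")

    return np.mean(-np.log(correct_confidences))

outputs = [[0.7, 0.1, 0.2],
           [0.1, 0.5, 0.4],
           [0.02, 0.9, 0.08]]

sparse_loss = categorical_cross_entropy(outputs, [0, 1, 1])
one_hot_loss = categorical_cross_entropy(outputs, [[1, 0, 0], [0, 1, 0], [0, 1, 0]])
print(sparse_loss, one_hot_loss)
```

Both calls use the same batch as the cells above, so the sparse and one-hot results should agree with each other and with the average loss printed earlier.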
