Back Propagation
- beginning with a simple example where we just seek to minimize the output of a single neuron
- goal is to figure out how much each input, weight, and bias impacts the neuron function (and eventually the network)
- to do this, need to use the chain rule and take the derivative with respect to each input, weight, and bias (only 1 bias here)
- only the derivatives with respect to weights and biases are used to minimize loss, but the derivatives with respect to inputs are needed as well because they chain one layer to the layer before it (more in the next bullet point)
- the layers are chained together via the input derivatives: take the derivative of each later layer with respect to its input, and then, at the layer we are interested in, take the derivative of that layer with respect to its weight or bias. A change in a weight or bias changes that layer's output, which is the input to the next layer. dfunction/dweight = dlayer2(layer1output)/dlayer1 * dlayer1/dweight. So it's really just the chain rule: need the derivative of the outer function with respect to its input, which is the inner function, then can take the derivative of the inner function with respect to whatever parameter you want, in this case a weight or bias. So the input is acting as the chain (the chain rule!)
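A minimal sketch of the chaining idea above, assuming two hypothetical one-neuron linear layers (the names `layer1`, `layer2`, `w1`, `w2` and all the values are made up for illustration, they are not from the book). It compares the chain-rule gradient with respect to the first layer's weight against a finite-difference estimate.

```python
# Hypothetical two-layer chain: output = layer2(layer1(x))
x = 2.0              # input to layer 1
w1, b1 = 0.5, 1.0    # layer 1 weight and bias
w2, b2 = -3.0, 0.25  # layer 2 weight and bias


def layer1(x, w1, b1):
    return w1 * x + b1


def layer2(h, w2, b2):
    return w2 * h + b2


# Chain rule: d(output)/d(w1) = d(layer2)/d(layer1 output) * d(layer1)/d(w1)
dlayer2_dh = w2   # derivative of layer 2 wrt its input is its weight
dlayer1_dw1 = x   # derivative of layer 1 wrt its weight is its input
chain_grad = dlayer2_dh * dlayer1_dw1

# Finite-difference check: nudge w1 and see how the final output moves
eps = 1e-6
out_plus = layer2(layer1(x, w1 + eps, b1), w2, b2)
out_minus = layer2(layer1(x, w1 - eps, b1), w2, b2)
numeric_grad = (out_plus - out_minus) / (2 * eps)

print(chain_grad, numeric_grad)
```

The two numbers should agree closely, which is the sense in which the derivative wrt the inner layer's output acts as the link in the chain.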

Back Prop on 1 Neuron
- Example neuron function, where x0, x1, x2 are the inputs, w0, w1, w2 are their respective weights, and b is the bias: y = relu(w0 * x0 + w1 * x1 + w2 * x2 + b) = max(w0 * x0 + w1 * x1 + w2 * x2 + b, 0)
- can be broken down even further by treating each weight * input as its own function; the book does this. See the function, where sum() is the sum of the weight * input products and the bias, and mul() is weight * input. So the derivative of the full neuron with respect to x0 would be: deriv from next layer wrt its input * dReLU()/dsum() * dsum()/dmul() * dmul()/dx0

In [1]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
# ReLU activation function
y = max(z, 0)
# Backward pass
# The derivative from the next layer
dvalue = 1.0

''' Example of how this comes together
dtwoneurons/dx0 = dnext_layer/dReLU * dReLu/dsum() * dsum()/dmul() * dmul/dx0
'''

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print("Next Layer times Relu:", drelu_dz)

# Partial derivatives of the sum, the chain rule; deriv of a plain sum wrt each term is just 1 (think about it)
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db
print("Next Layer, RelU, and sum for each sum and bias:", drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)
# Partial derivatives of the multiplication, the chain rule
#short cut to derivative here is that the deriv wrt to weight is just input value and wrt to input is just the weight (flip-flop)
dmul_dx0 = w[0] 
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2

#note that the full deriv wrt the bias was already calculated above (drelu_db); the bias's own derivative is 1, since it is just added in the sum function
print("Full wrt inputs, weights:",drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)

Next Layer times Relu: 1.0
Next Layer, RelU, and sum for each sum and bias: 1.0 1.0 1.0 1.0
Full wrt inputs, weights: -3.0 1.0 -1.0 -2.0 2.0 3.0
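One way to sanity-check the hand-derived gradients above is a finite-difference estimate. This is only a sketch: the helpers `neuron` and `numeric_grad` are hypothetical names introduced here, and the step size `eps` is an arbitrary small value.

```python
def neuron(x, w, b):
    # ReLU applied to the weighted sum of the inputs plus the bias
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)


x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias
eps = 1e-6             # small nudge for the central difference


def numeric_grad(values, index, wrt):
    # Central difference on a single input ("x") or weight ("w")
    plus = list(values)
    plus[index] += eps
    minus = list(values)
    minus[index] -= eps
    if wrt == "x":
        return (neuron(plus, w, b) - neuron(minus, w, b)) / (2 * eps)
    return (neuron(x, plus, b) - neuron(x, minus, b)) / (2 * eps)


dx_numeric = [numeric_grad(x, i, "x") for i in range(3)]
dw_numeric = [numeric_grad(w, i, "w") for i in range(3)]
db_numeric = (neuron(x, w, b + eps) - neuron(x, w, b - eps)) / (2 * eps)

print(dx_numeric, dw_numeric, db_numeric)
```

The estimates should land close to the weights (for the inputs), the inputs (for the weights), and 1 (for the bias), matching the flip-flop shortcut.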


Ultimately Simplifying the Code Above
- taking out all the multiplying by 1 etc. just leaves you with derivative of next_layer * ReLU * w0 (or x0, or 1 for the bias, etc.)

In [7]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
# ReLU activation function
y = max(z, 0)
# Backward pass
# The derivative from the next layer
dvalue = 1.0

''' Example of how this comes together
dtwoneurons/dx0 = dnext_layer/dReLU * dReLu/dsum() * dsum()/dmul() * dmul/dx0

z = sum() + b

now simplified to dnext_layer/dReLU * dReLu/dz * dz/dx[0]
'''

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print("Next Layer times Relu:", drelu_dz)

# Partial derivatives of the multiplication, the chain rule
#short cut to derivative here is that the deriv wrt to weight is just input value and wrt to input is just the weight (flip-flop)

drelu_dx0 = dvalue * (1. if z > 0 else 0.) * w[0]
drelu_dw0 = dvalue * (1. if z > 0 else 0.) * x[0]
drelu_dx1 = dvalue * (1. if z > 0 else 0.) * w[1]
drelu_dw1 = dvalue * (1. if z > 0 else 0.) * x[1]
drelu_dx2 = dvalue * (1. if z > 0 else 0.) * w[2]
drelu_dw2 = dvalue * (1. if z > 0 else 0.) * x[2]
drelu_db = dvalue * (1. if z > 0 else 0.) * 1
print("Full wrt inputs, weights:",drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)
print("wrt bias", drelu_db)

Next Layer times Relu: 1.0
Full wrt inputs, weights: -3.0 1.0 -1.0 -2.0 2.0 3.0
wrt bias 1.0


Decreasing the output of neuron (manually); run cell above for gradient values

In [8]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2] # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2] # gradients on weights
db = drelu_db # gradient on bias...just 1 bias here

#current weights and bias
print("Current Weights", w, b)

#applying a small negative multiple of the gradient with respect to each weight, i.e. how much the final output changes wrt a change in that weight
#negative because we want to decrease the output of the neuron
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

#new weight and bias
print("New weights", w, b)

# Multiplying inputs by new weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding
z = xw0 + xw1 + xw2 + b
# ReLU activation function
y = max(z, 0)
print("New Output", y, "Old Output", 6)


Current Weights [-3.0, -1.0, 2.0] 1.0
New weights [-3.001, -0.998, 1.997] 0.999
New Output 5.985 Old Output 6


So basically gradient descent is this (much of this insight came from reading outside the book):
- find how much each weight impacts the loss/target function (derivative wrt weight aka gradient)
- multiply each of those impacts (gradients) by the same small constant => so weights with larger gradients get larger step values, and smaller gradients get smaller ones (e.g., .001 * 1 < .001 * 2)
- subtract that amount from the current weight value. Subtract because we are trying to minimize. Weights with larger impacts on the function output (i.e., larger gradient magnitude) will change by more, whereas ones with smaller gradients will change less
- In other words: you are changing the weights that have the most impact by more than the less impactful ones, thus "moving in the steepest direction" (as all the blogs say). You are subtracting the slope times some constant from the weight, so the steeper the slope (bigger gradient), the bigger the change to that weight.
- as the gradients shrink, you make smaller overall changes to the weights, so you are effectively taking smaller steps the closer you get to the minimum of the function (gradients approaching zero)
- the weights are changed while the inputs are held constant => changing the most impactful weights by more reduces the loss function more quickly
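The steps above can be sketched as a small loop on the same single neuron. This is a hedged illustration rather than the book's code: the `learning_rate` name, the five iterations, and the `outputs` list are assumptions made here, and the gradient from the "next layer" is assumed to be 1 as in the cells above.

```python
x = [1.0, -2.0, 3.0]   # inputs, held constant
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias
learning_rate = 0.001  # same step size as the manual update

outputs = []
for step in range(5):
    # Forward pass
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = max(z, 0.0)
    outputs.append(y)

    # Backward pass (gradient from the next layer assumed to be 1)
    drelu_dz = 1.0 if z > 0 else 0.0
    dw = [drelu_dz * xi for xi in x]  # deriv wrt each weight is its input
    db = drelu_dz                     # deriv wrt the bias is 1

    # Step against the gradient: bigger gradients move their weight more
    w = [wi - learning_rate * dwi for wi, dwi in zip(w, dw)]
    b -= learning_rate * db

print(outputs)
```

In this toy case the gradient wrt each weight is just the (constant) input, so every step lowers the output by the same amount for as long as the ReLU stays active; shrinking steps only show up when the gradients themselves shrink.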

This was the basic process of minimizing the output of a neuron; in reality we want to minimize the loss of a network. The next-layer aspect of this is kind of ignored (it is 1), although if you included the next layer, wouldn't we really be minimizing the value of the next layer?

Backpropagation Between Layers
- so in a layer, each neuron outputs a gradient with respect to each input
- the sum of the gradients for each respective input is the full derivative for that input. This leverages the principle that derivatives sum linearly. So the partial derivatives for input 1 across 3 neurons can be summed together for the full derivative of input 1 for the whole layer => gained this insight thru additional reading outside the book
- So here can just sum the weights of each neuron that correspond to each respective input (this assumes the gradient passed back from the following layer is 1; otherwise need to multiply by it first per the chain rule) => this is the shortcut approach.

Backpropagation Between Layers
- Note: did not find the book explanation particularly clear, so the following is my interpretation of it
- just like each neuron produces an output on the forward pass, on the backward pass each neuron produces a gradient with respect to its inputs, so each following layer has a gradient with respect to the layer before it (i.e., the previous layer occurs earlier in the forward pass). To continue backpropagation/the chain rule, need to calculate the full derivative wrt each of the previous layer's neurons to backpropagate through them. This is done by summing the following layer's partial derivatives with respect to that inputting neuron. Can do this because derivatives add (i.e., the derivative of x + x + x is the same as the derivative of 3x); in other words, partial derivatives wrt the same variable sum linearly/are additive.
- Example: for neuron A in the current layer, all neurons in the following layer have a partial derivative associated with neuron A in the current layer. To do the chain rule and backpropagate through neuron A in the current layer, need to know the full derivative (i.e., not just partial) with respect to neuron A. The full derivative is the sum of the partial derivatives (as seen in earlier chapters). Thus to get the full derivative wrt neuron A's output, need to sum the partial derivatives wrt A's output of each of the following layer's neurons.
- a good visual metaphor is like bundling a bunch of frayed wires (partial derivatives) together into 1 big cable (full derivative)
- so, given a following, current, and input layer: to get the partial derivatives wrt an input layer neuron, need to do the chain rule. So multiply the gradient passed back from the following layer to a current-layer neuron by that neuron's derivative wrt the input neuron's output, which is just the weight that the current-layer neuron applies to that input neuron. This multiplication gives you the partial derivative wrt that input neuron for a given current neuron. Repeat this for each neuron in the current layer and sum. This then gives you the full derivative wrt that input neuron, which you can use to continue backpropagating thru the network. Repeat for all neurons in the layers.
- (ignore if you are confused) basically you are calculating the gradient wrt the inputs of the layer, so each full derivative wrt 1 input is one entry of the larger gradient over all of the input neurons' output values
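The "frayed wires into one cable" idea can be sketched with NumPy. The values below are hypothetical, and note that `weights` is laid out here as (neurons, inputs), i.e. not transposed, purely to make each neuron's row easy to read; that layout is an assumption of this sketch.

```python
import numpy as np

# Hypothetical current layer: 3 neurons, each with 4 weights (one per input)
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])  # shape (neurons, inputs)

# Hypothetical gradient arriving from the following layer, one value per neuron
dvalues = np.array([1.0, 2.0, 3.0])

# Each row is one neuron's partial derivatives wrt the 4 inputs (the frayed wires):
# passed-in gradient for that neuron * that neuron's weights
partials = dvalues[:, np.newaxis] * weights  # shape (3, 4)

# Summing over the neurons bundles them into the full derivative per input (the cable)
dinputs = partials.sum(axis=0)  # shape (4,)

print(partials)
print(dinputs)
```

The summed result is the same thing a dot product of `dvalues` with `weights` would give, which is why the later cells can collapse the multiply-and-sum into one call.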

![backprop_diagram.jpg](attachment:backprop_diagram.jpg)

Overview of Example from Book
- imagine a layer of 3 neurons that have 4 inputs each (so 4 weights each). There are layers that follow the layer of 3, but these are not shown
- we want to get the gradient of the function with respect to each of the 4 inputs. So we need the partial derivative of the function wrt each of the inputs
- the layer of 3 receives a gradient from the rest of the greater function, where each partial derivative is the derivative wrt the output of one neuron in the layer of 3.
- so each of these partial derivatives is passed to its respective neuron in the layer of 3 to continue the chain rule.
- the partial derivative is multiplied by the respective neuron's weights (if taking the deriv wrt input), which results in each neuron having a gradient for each input
- sum each neuron's partial derivative wrt the same input to get the full derivative wrt that input (explained above). That full derivative is one partial derivative in the larger gradient of that layer
- presumably this is also how we got the input gradient to the layer of 3 as well


- nuance of the example is that the weight matrix is transposed because we have it naturally transposed (ie. it is defined transposed rather than being .T) in the actual layer objects.
- using shortcut of derivative wrt input is weight


In [1]:
import numpy as np

# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# a vector of 1s
dvalues = np.array([[1., 1., 1.]])
# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# Sum weights related to the given input multiplied by
# the gradient related to the given neuron
dx0 = sum([weights[0][0]*dvalues[0][0], weights[0][1]*dvalues[0][1],weights[0][2]*dvalues[0][2]])
dx1 = sum([weights[1][0]*dvalues[0][0], weights[1][1]*dvalues[0][1],weights[1][2]*dvalues[0][2]])
dx2 = sum([weights[2][0]*dvalues[0][0], weights[2][1]*dvalues[0][1],weights[2][2]*dvalues[0][2]])
dx3 = sum([weights[3][0]*dvalues[0][0], weights[3][1]*dvalues[0][1],weights[3][2]*dvalues[0][2]])
#note that because the weights are transposed, the dvalues are only multiplied by their respective neuron
#this is as should be, so dvalues[0][2], the third item in dvalues is always multiplied by the third item of the row in the transposed matrix
# aka the third dvalue is only multiplied by third neuron weights (the 3rd row in untransposed)

#gradient of total function with respect to the inputs
dinputs = np.array([dx0, dx1, dx2, dx3])
print(dinputs)

[ 0.44 -0.38 -0.07  1.37]


Simplification:
- multiplying and summing this stuff can just be boiled down to the dot product
- PLEASE NOTE: because dvalues is row oriented, the weights have to be transposed again - they are transposed upon their variable definition to simulate how they will be in the network object, BUT THEN must be transposed again to make the dot product work
- this is because dvalues is a row array of shape (1,3) (or (3,) when indexed as dvalues[0]), so a weights array of shape (4,3) (as defined, already transposed) does not align dimensions properly for the dot product. So have to transpose again; this way the multiplication and addition also work out to give the correct partial gradients/chain rule (i.e. a neuron's weights are only multiplied by the dvalue corresponding to that neuron, then the partial derivatives are summed across neurons wrt a particular input)
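A quick shape check makes the double-transpose point concrete. This sketch reuses the example's `dvalues` and `weights`; the `shapes_align` flag is a hypothetical name added here just to record whether the un-transposed dot product works.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.]])                  # shape (1, 3): 1 sample, 3 neurons
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T  # shape (4, 3): inputs x neurons

print(dvalues.shape, weights.shape, weights.T.shape)

# (1, 3) dot (4, 3) fails: the inner dimensions 3 and 4 do not match
try:
    np.dot(dvalues, weights)
    shapes_align = True
except ValueError:
    shapes_align = False

# (1, 3) dot (3, 4) works and yields one gradient value per input
dinputs = np.dot(dvalues, weights.T)
print(shapes_align, dinputs.shape)
```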

In [17]:
import numpy as np
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# a vector of 1s
dvalues = np.array([[1., 1., 1.]])
# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# sum weights of given input
# and multiply by the passed-in gradient for this neuron
#same results as above

#NOTE transpose again
dinputs = np.dot(dvalues[0], weights.T)
print(dinputs)

[ 0.44 -0.38 -0.07  1.37]


Batch of Samples:
- Just follow methodology above, except now with batch, can be seen in dvalues
- note once again, the double transpose as described above
- results in a batch of gradients that are each wrt their particular samples

In [18]:
import numpy as np
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# sum weights of given input
# and multiply by the passed-in gradient for this neuron
#results in batch of gradients wrt to inputs for each sample

#NOTE: as discussed above, double transpose
dinputs = np.dot(dvalues, weights.T)
print(dinputs)

[[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]


Gradients with respect to weights and biases are different!
- Book does not cover this topic, or at least not anywhere that I saw, but I have figured it out hahahaha!
- see below for explanation

Gradients With Respect to Weights
- This is NOT like the gradient with respect to inputs. Using the intuition from gradients wrt inputs (i.e. summing the partial derivatives with respect to the same input across neurons) is not correct here, as explained below.
- Used ChatGPT socratically to help me walk thru this, so credit to it.

What is actually happening:
- firstly, for inputs in non-transposed form, each row represents an input sample and each column a feature. For dvalues, each row holds the layer's dvalues for one sample in the batch, and each column represents a neuron. Also remember that the derivative wrt a weight is its input
- we are transposing the inputs and making them the first term of the dot product (see code). What this does: multiplies each row of inputs.T by each column of dvalues. As a result, each dvalue is only multiplied by input values from its own sample. For example, the first dvalues row, which corresponds to the first sample's dvalues, is only multiplied by values in the first inputs.T column, which corresponds to the first input sample.
- this makes intuitive sense, you should only be multiplying derivatives for each sample with their own sample
- but after multiplying, we are summing. Why? What does the sum mean?
- The sum means that we are adding up each neuron's gradients wrt the weights across the samples in the batch. For example, for the first input feature of the first neuron, we are multiplying the first row of inputs.T by the first column of dvalues. Each product is the partial derivative with respect to the first neuron's weight on that feature for one sample. In other words, for each sample, this is the first neuron's gradient with respect to its weight on the first feature. Then, by summing them all together, we are assessing the net cumulative impact that this specific weight of the neuron has on the loss function.
- the output matrix rows represent the features and the columns represent the neurons. Each value is the gradient, aggregated across the batch, wrt one neuron's weight on a particular feature
- intuitive example behind the sum: let's say we have two features per sample. Across the batch, the first feature is -1 half the time and 1 the other half. The second feature is always 1. Assuming the same dvalue for every sample, summing across the batch gives 0 for the first feature's weight gradient, whereas the second's is a large positive number. Thus, the second feature is much more informative towards reducing loss. An update to the weights on the second feature via the learning rate will result in a much greater impact on the loss function
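The multiply-then-sum described above can be unrolled to show what the dot product hides. This is a sketch using the same example arrays as the book code; `per_sample` is a hypothetical name for the un-summed, per-sample gradients.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])            # (samples, neurons)
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])   # (samples, features)

# One (features, neurons) gradient matrix per sample: the outer product of
# that sample's inputs with that same sample's dvalues
per_sample = np.einsum('sf,sn->sfn', inputs, dvalues)  # (samples, features, neurons)

# Summing over the sample axis aggregates the batch
dweights_summed = per_sample.sum(axis=0)

# The single dot product gives the same (features, neurons) matrix
dweights_dot = np.dot(inputs.T, dvalues)

print(per_sample.shape)
print(np.allclose(dweights_summed, dweights_dot))
```

The 3D `per_sample` array is what you would be left with if the batch were not summed; the dot product never materializes it.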


Other considerations
- Now I better understand why feature scaling or the balance of features in a dataset might be important. Very large feature values or unbalanced data will result in too much/too little attention paid to that feature relative to others.
- instead of summing, you can also average, for example if you have uneven batch sizes or data with lots of outliers/noise, so that the aggregate gradients are a bit smoother across samples, as big values would cause lumpy sums for the features (i.e. sums for the same feature would be very different across batches). Intuitively though, assuming equal batch sizes, averaging is the same as dividing the learning rate by the batch size (just moving the 1/N from the sum to the learning rate).
- Summing is computationally simple (as you can see, we can use the dot product)
- summing is standard convention and expected for gradient descent algorithms. I think this makes sense (at least for batches) as you can really only optimize 1 set of summed gradients without it just becoming stochastic gradient descent; it would seemingly defeat the purpose of batches to not sum
- aggregating across the batch increases stability (i.e. less lumpiness as described above), which may help make convergence smoother, with less jumping around in loss
- Not summing a batch would result in a 3D array, where each sample would have its own (features x neurons) matrix of weight gradients
- it is self-evident why this is different from the methodology for the gradient wrt inputs (chain rule); summing that way would combine the impacts that each input has on each neuron in the layer, which is the opposite of the goal. We want to keep those separate here, to isolate the impact that each neuron's weight on a feature has across the batch.
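The claim that averaging is equivalent to shrinking the learning rate by the batch size can be checked directly. This is a sketch under assumptions: the starting `weights` of zeros and the `learning_rate` value are made up for illustration.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

weights = np.zeros((4, 3))  # hypothetical starting weights (features, neurons)
learning_rate = 0.01        # hypothetical step size
n_samples = len(inputs)

dweights_sum = np.dot(inputs.T, dvalues)   # summed over the batch
dweights_mean = dweights_sum / n_samples   # averaged over the batch

# Averaged gradient with the full learning rate...
update_a = weights - learning_rate * dweights_mean
# ...matches the summed gradient with the learning rate divided by N
update_b = weights - (learning_rate / n_samples) * dweights_sum

print(np.allclose(update_a, update_b))
```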

Book code:
- see above for detailed explanation
- remember the fundamentals: each dvalue for each sample is only ever multiplied by derivs wrt to weights for the same corresponding sample
- output is the sum of the gradient wrt the weight for each feature of each individual neuron. So columns are neurons and rows are features. Values are the sums of their gradients
- we do this because we want to aggregate the impact across the batch that a weight update would have

In [22]:
import numpy as np

# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# for each weight, multiply its input by the passed-in gradient for its neuron
# and sum over the samples
#transpose gets inputs in correct orientation for desired result of dot product
dweights = np.dot(inputs.T, dvalues)
print(dweights)

#output for example: the values in the first row are the sums of the gradients across
#the samples in the batch for each neuron wrt the first feature. They are the same because every neuron has the same dvalue within each sample

[[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]


Gradient with respect to biases
- derivative wrt biases is 1
- technically could be doing np.dot([[1,1,1]], dvalues) to follow the logic from above (one 1 per sample); that is, we want to sum the aggregate impact of the biases. But since multiplying by 1 does nothing, can just skip to the sum.
- want to sum down each column, because each column represents a neuron, so axis=0
- keepdims to keep it as a (1,3) row array
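A short check that the dot product with a row of ones and the plain column sum agree. The `ones` array is an assumption of this sketch: it holds one 1 per sample, standing in for the derivative of the sum with respect to each bias.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])       # (samples, neurons)

# One 1 per sample: the derivative wrt the bias for every sample
ones = np.ones((1, dvalues.shape[0]))    # shape (1, samples)

dbiases_dot = np.dot(ones, dvalues)                   # (1, 3) dot (3, 3) -> (1, 3)
dbiases_sum = np.sum(dvalues, axis=0, keepdims=True)  # shortcut: sum each column

print(dbiases_dot, dbiases_sum)
```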

In [29]:
import numpy as np
# Passed-in gradient from the next layer
# for the purpose of this example we're going to use
# an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
# One bias for each neuron
# biases are the row vector with a shape (1, neurons)
biases = np.array([[2, 3, 0.5]]) #the derivatives of these is [[1, 1, 1]]


# dbiases - sum values, do this over samples (first axis), keepdims
# since this by default will produce a 1D array -
# the book explained this in chapter 4
dbiases = np.sum(dvalues, axis=0, keepdims=True)

print(dbiases)

[[6. 6. 6.]]


Gradient of ReLU Function
- remember that the derivative of the max(x,0) function wrt its input is 1 when x > 0 and 0 when x <= 0
- in this basic implementation below, we are defining an array of zeros that has the same shape as the layer output, then setting it to 1 where the output is greater than 0. This represents the drelu gradient
- multiply by the passed-in gradient (i.e. dvalues) to get the full drelu wrt inputs. This is part of the chain rule
- a simplification can be done (not shown here but in the next example), where we copy dvalues and zero out the entries where the layer output is <= 0

In [31]:
import numpy as np
# Example layer output
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# ReLU activation's derivative
drelu = np.zeros_like(z)
drelu[z > 0] = 1

print(drelu)
# The chain rule
drelu *= dvalues
print(drelu)

[[1 1 0 0]
 [1 0 0 1]
 [0 1 1 0]]
[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]


Simplification of Relu
- since the derivative is 1 when x > 0, we can skip the multiplication step, and just set the dvalues array to 0 where the layer output is less than or equal to 0. This effectively gives us the drelu gradient
- copy dvalues first so that the original variable is not modified

In [32]:
import numpy as np
# Example layer output
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])
# ReLU activation's derivative
# with the chain rule applied
drelu = dvalues.copy()
drelu[z <= 0] = 0
print(drelu)

[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]


Layer Forward and Backward Example
- Please note that for the gradient passed in to the layer, this example uses the output of the layer itself, which is not done in practice (I confirmed this with ChatGPT). Notice that dvalues = relu_outputs
- Also, to avoid confusion, note that dvalues has been replaced by drelu in all the gradient calcs for inputs, weights, and biases. This is again intuitive: the dvalues have just been adjusted by the ReLU derivative, and everything still applies in the same fashion as explained above.
- only actually end up using dweights and dbiases. dinputs is not used because there is no earlier layer to pass it to; dinputs would be the dvalues passed to the layer before this one (I think)

In [34]:
import numpy as np


# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# One bias for each neuron
# biases are the row vector with a shape (1, neurons)
biases = np.array([[2, 3, 0.5]])

# Forward pass
layer_outputs = np.dot(inputs, weights) + biases  # Dense layer
relu_outputs = np.maximum(0, layer_outputs)  # ReLU activation

# Backward pass - for this example we're using ReLU's output
# as passed-in gradients (we're minimizing this output)
dvalues = relu_outputs #JUST FOR THIS EXAMPLE, WOULD HAVE AN ACTUAL GRADIENT PASSED IN

# Backpropagation and optimization

# ReLU activation's derivative with the chain rule applied
drelu = dvalues.copy()
drelu[layer_outputs <= 0] = 0


# Dense layer
# dinputs - multiply by weights
dinputs = np.dot(drelu, weights.T) #note drelu now rather than dvalues as seen in examples
# dweights - multiply by inputs
dweights = np.dot(inputs.T, drelu) #note drelu now rather than dvalues as seen in examples
# dbiases - sum values, do this over samples (first axis), keepdims
# since this by default will produce a 1D array -
# the book explained this in chapter 4
dbiases = np.sum(drelu, axis=0, keepdims=True) #note drelu now rather than dvalues as seen in examples

# Update parameters
weights += -0.001 * dweights
biases += -0.001 * dbiases

print(weights) #results in transposed weights
print(biases)
#dinputs are not used to update the current layer; presumably they are passed to the previous (earlier) layer as its passed-in gradient


[[ 0.179515   0.5003665 -0.262746 ]
 [ 0.742093  -0.9152577 -0.2758402]
 [-0.510153   0.2529017  0.1629592]
 [ 0.971328  -0.5021842  0.8636583]]
[[1.98489  2.997739 0.497389]]


Updates to Layer Objects
- need to store inputs in a variable because when doing the backwards pass, need them to calculate gradient wrt to weights
- need to add backward functions to the layer object and relu object, using principles discussed above

In [None]:
import numpy as np

class Layer_Dense:
    # Layer initialization
    def __init__(self, inputs, neurons):
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))
    # Forward pass
    def forward(self, inputs):
        self.inputs = inputs #need to store these to compute gradients wrt weights
        self.output = np.dot(inputs, self.weights) + self.biases
    
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)

class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        self.output = np.maximum(0, inputs)
    
    # Backward pass
    def backward(self, dvalues):
        # Since we need to modify the original variable,
        # let's make a copy of the values first
        self.dinputs = dvalues.copy()
        # Zero gradient where input values were zero or negative
        self.dinputs[self.inputs <= 0] = 0
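A usage sketch for the two classes, chaining their `backward` methods and comparing one weight gradient against a finite-difference estimate. The classes are repeated in compact form only so the sketch runs on its own. The inputs, weights, and biases are taken from the earlier forward/backward example, while the `total_output` helper, the choice of "sum of ReLU outputs" as the quantity being differentiated, and the indices `i`, `j` are assumptions made here for illustration.

```python
import numpy as np


class Layer_Dense:
    def __init__(self, inputs, neurons):
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))

    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.dot(inputs, self.weights) + self.biases

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        self.dinputs = np.dot(dvalues, self.weights.T)


class Activation_ReLU:
    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0


X = np.array([[1, 2, 3, 2.5],
              [2., 5., -1., 2],
              [-1.5, 2.7, 3.3, -0.8]])

dense = Layer_Dense(4, 3)
# Overwrite the random initialization with fixed values so the check is repeatable
dense.weights = np.array([[0.2, 0.8, -0.5, 1],
                          [0.5, -0.91, 0.26, -0.5],
                          [-0.26, -0.27, 0.17, 0.87]]).T
dense.biases = np.array([[2, 3, 0.5]])
relu = Activation_ReLU()


def total_output():
    # Hypothetical scalar to differentiate: the sum of all ReLU outputs
    dense.forward(X)
    relu.forward(dense.output)
    return relu.output.sum()


# Backward pass: the gradient of a plain sum wrt each ReLU output is 1
total_output()
relu.backward(np.ones_like(relu.output))
dense.backward(relu.dinputs)  # ReLU's dinputs become the dense layer's dvalues

# Finite-difference check on a single weight (feature i, neuron j)
eps = 1e-6
i, j = 1, 1
original = dense.weights[i, j]
dense.weights[i, j] = original + eps
plus = total_output()
dense.weights[i, j] = original - eps
minus = total_output()
dense.weights[i, j] = original
numeric = (plus - minus) / (2 * eps)

print(dense.dweights[i, j], numeric)
```

Passing `relu.dinputs` into `dense.backward` is the chain rule in object form: each object's `dinputs` is the `dvalues` of whatever came before it in the forward pass.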

Derivative of the Loss Function:
- firstly, the derivative of the log function: f(x) = ln(h(x)); f'(x) = 1/h(x) * h'(x)
- please note that the book calls ln log, and some online resources may interpret log vs ln differently
- so to be clear, we are using natural log in the function (ln), but the book calls it log
- Full derivative: dL/dy_hat = -y/y_hat; where y_hat is the predicted value of the target category and y is y true. See book for full math; please note that this drops the sum, because remember, the one-hot encoded values for the non-target categories are 0 and just cancel out
- see example code, which skips over the forward pass
- note that this example code includes preparation for gradient normalization by dividing the gradient by the number of samples. Benefits of averaging loss: (1) Normalization: as previously discussed, summing would result in the loss being dependent on the sample size, (2) Normalization with the average allows comparability between batches and different models. PLEASE NOTE THAT we did not sum yet. This is only the division part; an optimizer, which will be discussed later, performs the sum.
- SEE HERE. THE WRITING ABOVE was my initial perception of the book's explanation. But, thinking further, if we are performing gradient normalization at the loss stage, then we are performing it at all stages of backpropagation, because the chain rule involves multiplying all the derivatives. So the division at the loss function is passed through to all steps of the backprop when we take derivatives with respect to weights and biases. It is not conceptually as important for the deriv wrt inputs, because while it is also averaged, it is ultimately just part of the chain to get us to a weight/bias derivative. And the reference to the summing is, I think at this point, misleading because we have been summing the gradients all along via the dot products.
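To see why the division by the number of samples belongs with the loss gradient, the sketch below compares -y/y_hat divided by N with a finite-difference gradient of the mean loss. The prediction and label arrays are hypothetical, `mean_loss` is a helper name introduced here, and each prediction is nudged independently (no re-normalization), which matches how the derivative above is written.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples and 3 classes
y_hat = np.array([[0.7, 0.1, 0.2],
                  [0.1, 0.5, 0.4],
                  [0.02, 0.9, 0.08]])
# One-hot true labels
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0]])
samples = len(y_hat)


def mean_loss(pred):
    # Categorical cross-entropy per sample, averaged over the batch
    return np.mean(-np.sum(y_true * np.log(pred), axis=1))


# Analytic gradient: -y / y_hat, divided by the number of samples
analytic = -y_true / y_hat / samples

# Finite-difference gradient of the mean loss wrt each prediction
eps = 1e-7
numeric = np.zeros_like(y_hat)
for idx in np.ndindex(*y_hat.shape):
    plus = y_hat.copy()
    plus[idx] += eps
    minus = y_hat.copy()
    minus[idx] -= eps
    numeric[idx] = (mean_loss(plus) - mean_loss(minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```

Because the loss being reported is the mean over the batch, its gradient already carries the 1/N factor, and that factor then rides along through every later step of the chain rule.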

In [1]:
class Loss:
    # Calculates the data and regularization losses
    # given model output and ground truth values
    def calculate(self, output, y):
        
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        
        # Return loss
        return data_loss

In [20]:
class Loss_CategoricalCrossentropy(Loss):
    #NOTE: skipping over the forward code 
    # Backward pass
    def backward(self, dvalues, y_true):
        
        # Number of samples (i.e. how many rows are there)
        samples = len(dvalues)
        
        # Number of labels in every sample
        # We'll use the first sample to count them
        labels = len(dvalues[0]) #how many columns are there?
        
        # If labels are sparse, turn them into one-hot vector
        if len(y_true.shape) == 1:#if the array is 1D i.e. (3,) vs (1,3), so list shape rather than array
            #remember - sparse arrays are just list of index positions of category/target variable
            #so this creates a nxn array of 0s with 1 down the diagonal, then when you pass the true indexes, the array is transformed
            #into only the rows with index so for 3 targets [3] => [0,0,1]
            y_true = np.eye(labels)[y_true]

        # Calculate gradient - dvalues here are the network's outputs (the predictions, y_hat)
        self.dinputs = -y_true / dvalues
        # Normalize gradient
        self.dinputs = self.dinputs / samples  # note that we are just dividing here, no summing yet
        # per the book, the sum is performed later (possible because the denominators are all the same)
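
To make the `np.eye(labels)[y_true]` trick and the resulting gradient concrete, here is a minimal sketch with hypothetical labels and predictions (the numbers are mine, chosen only for illustration).

```python
import numpy as np

y_true = np.array([2, 0, 1])  # sparse labels: one class index per sample
labels = 3                    # number of classes (columns of the output)

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity with the labels picks one row per sample
one_hot = np.eye(labels)[y_true]
print(one_hot)

# Hypothetical softmax outputs for the same three samples
y_hat = np.array([[0.1, 0.2, 0.7],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])

# Normalized loss gradient: only the target column is non-zero in each row
dinputs = -one_hot / y_hat / len(y_hat)
print(dinputs)
```

Each row of `dinputs` ends up with a single non-zero entry, at the target class, which is the "dropped sum" from the derivative notes.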

Softmax Activation Derivative:
- firstly, derivative of division (quotient rule): f(x) = g(x)/h(x); f'(x) = (g'(x) * h(x) - g(x) * h'(x)) / (h(x))^2
- second, derivative of e^x wrt x is e^x
- see attached image for full handwritten derivation
- this derivation introduces the Kronecker delta function, which in the general case is 1 when i = j and 0 when i != j. In our case the indices are j and k, so it is 1 when j = k and 0 when j != k
- final result of the derivation: Sij*kronecker_deltajk - Sij * Sik
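
For reference, a typed-up sketch of the derivation, assuming the usual softmax definition where S_ij is e^(z_ij) divided by the sum of e^(z_il) over l. The symbol z_ij (input j of sample i) is my own notation and may not match the book's. It applies the quotient rule, using the fact that the derivative of the numerator wrt z_ik is non-zero only when j = k:

$$
\begin{aligned}
\frac{\partial S_{i,j}}{\partial z_{i,k}}
&= \frac{\delta_{jk}\, e^{z_{i,j}} \sum_{l} e^{z_{i,l}} - e^{z_{i,j}}\, e^{z_{i,k}}}{\left(\sum_{l} e^{z_{i,l}}\right)^{2}} \\
&= \delta_{jk}\, \frac{e^{z_{i,j}}}{\sum_{l} e^{z_{i,l}}} - \frac{e^{z_{i,j}}}{\sum_{l} e^{z_{i,l}}} \cdot \frac{e^{z_{i,k}}}{\sum_{l} e^{z_{i,l}}} \\
&= S_{i,j}\,\delta_{jk} - S_{i,j}\, S_{i,k}
\end{aligned}
$$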

![softmax_derivative.jpg](attachment:softmax_derivative.jpg)

Further discussion & code implementation of softmax derivative:
- Each input influences each output because softmax divides by the sum of all the exponentiated values in the softmax layer. So for each output node in the softmax, there is a vector of partial derivatives wrt the inputs. Think about it: if we change the input we are differentiating with respect to, the other output nodes do not stay constant, because that value is part of the denominator for every node. By changing one input value, we inherently change all the outputs, thus every node has a partial derivative wrt that input, even when the input belongs to another node.
- So, this is why we work out the equation so that we can use the Kronecker delta. It basically marks when the derivative of the numerator will be 0, indicating that we are not differentiating with respect to the input of the current softmax output node. The equations work out nicely into the form we have, quite cool. Kind of makes you wonder if they knew it would work out or were kind of like - oh hey, we can use the Kronecker delta here to make stuff easier.
- Now, as discussed, since each softmax node has its own gradient, when we take the derivative wrt each input we receive a 2D array, where each row represents a gradient wrt an input value. So for example, the first row will be the partial derivatives of each softmax node wrt the first input, and the first value in that row will be the one where j=k, or Kronecker delta = 1, or the numerator is e^(first input), or the derivative of the numerator != 0, any way you want to think about it. (The matrix is symmetric, so reading it by rows or by columns gives the same numbers.)
- See below for example code and more explanation

In [28]:
#output of single sample
# full formula Sij*kronecker_deltajk - Sij * Sik

import numpy as np
softmax_output = [0.7, 0.1, 0.2] #sample i

#### First term of derivative Sij*kronecker_deltajk ###

#reshaping the array into a column so that we can do the np.eye multiplication
softmax_output = np.array(softmax_output).reshape(-1, 1)
print(softmax_output) 

#One way to get the first term - see below
print(np.eye(softmax_output.shape[0]))
print(softmax_output * np.eye(softmax_output.shape[0]))

#the above can also be accomplished with:
print("First Term:")
print(np.diagflat(softmax_output))

# this diagflat output represents the first term in the equation of the softmax derivative
# this array is equal to Sij when j=k, and 0 when j != k; the zeros come from the kronecker delta
# np.eye can be thought of as the kronecker delta array: it is 1 when row num = col num, aka j=k

##### Second Term of Derivative Sij * Sik ####
print("Second Term:")
print(np.dot(softmax_output, softmax_output.T))

# np.dot(column_vector, row_vector) is an outer product, so there is no summing really. Just multiplying first by first, then first by second, etc.

### Equation Altogether ####
print("Whole equation in 1 line:")
print(np.diagflat(softmax_output) - np.dot(softmax_output, softmax_output.T))


[[0.7]
 [0.1]
 [0.2]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[[0.7 0.  0. ]
 [0.  0.1 0. ]
 [0.  0.  0.2]]
First Term:
[[0.7 0.  0. ]
 [0.  0.1 0. ]
 [0.  0.  0.2]]
Second Term:
[[0.49 0.07 0.14]
 [0.07 0.01 0.02]
 [0.14 0.02 0.04]]
Whole equation in 1 line:
[[ 0.21 -0.07 -0.14]
 [-0.07  0.09 -0.02]
 [-0.14 -0.02  0.16]]
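
As a sanity check on the formula above, here is a sketch that compares the analytic Jacobian with a numerical one built from finite differences. The `softmax` helper and the input vector `z` are my own illustrative choices, not from the book.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, -0.5, 0.3])  # hypothetical softmax inputs for one sample
s = softmax(z).reshape(-1, 1)

# Analytic Jacobian: Sij*kronecker_deltajk - Sij * Sik
analytic = np.diagflat(s) - np.dot(s, s.T)

# Numerical Jacobian: nudge one input at a time and watch every output move
eps = 1e-6
numeric = np.empty((3, 3))
for k in range(3):
    step = np.zeros(3)
    step[k] = eps
    # column k holds d(output j) / d(input k) for every j
    numeric[:, k] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```

Every column of the numerical Jacobian is non-zero, which matches the note that changing one input moves all of the outputs.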


Softmax Output and Jacobian Matrix - How does this get passed back?
- the softmax derivative results in a 2D array (the Jacobian matrix) for each sample. So across a batch, we have a 3D array, where each entry along the first axis is the 2D Jacobian matrix for one sample.
- Each row represents the derivatives wrt its respective presynaptic input, so the first row is the partial derivatives of each softmax node with respect to the first input.
- But we cannot backpropagate this Jacobian array as is. It wouldn't work functionally: the previous layer needs a 1D gradient per sample (one value per input), so that each presynaptic neuron can be multiplied by its respective value.
- So we just apply the chain rule, similar to how derivatives wrt other inputs are passed back.
- first step - pass back the loss function dvalues (i.e. loss func derivative with respect to each softmax node). Since each softmax node only impacts its respective loss function node, this is a simple element-wise multiplication
- then for each softmax layer input, we sum these products (the softmax layer's partial derivatives wrt that input, each already multiplied by its dvalue) to get the full derivative with respect to that input. So we are summing the row of the Jacobian matrix after the multiplication
- For example, the first row represents the derivative of each softmax node with respect to the first input. So if we want the full derivative wrt the first input, we must sum that row after multiplying by the dvalues (because the total derivative wrt an input is the sum of its contributions through every output it affects)
- repeat the multiplication and summing process for each row in the Jacobian matrix
- this is just like applying the chain rule for the other layers that we previously read about, except that the softmax inputs are not impacting the postsynaptic layer via weights, but via the softmax summation denominator
- so we can simplify this, for one sample, into the dot product of the Jacobian matrix and the loss dvalues
- see below for diagram

![softmax_backprop.jpg](attachment:softmax_backprop.jpg)
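
The multiply-then-sum steps described above can be sketched for one sample as follows. The `dvalues` vector is a hypothetical un-normalized loss gradient for a sample whose true class is 0; the softmax output reuses the numbers from the earlier example.

```python
import numpy as np

s = np.array([0.7, 0.1, 0.2]).reshape(-1, 1)  # softmax output, one sample
jacobian = np.diagflat(s) - np.dot(s, s.T)

# Hypothetical loss gradient wrt each softmax output (true class is 0)
dvalues = np.array([-1.0 / 0.7, 0.0, 0.0])

# Step by step: scale each entry by the matching loss gradient, then sum each row
step_by_step = np.sum(jacobian * dvalues, axis=1)

# The same thing as a single dot product
dot_version = np.dot(jacobian, dvalues)

print(step_by_step)
print(dot_version)
```

Both versions should give the same 1D gradient, one value per softmax input, which is the shape the previous layer expects.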

Code Notes:
- np.empty_like(dvalues) just creates an uninitialized matrix, self.dinputs, to store the gradients of each sample
- enumerate allows you to iterate over the values in a list and know their index as well, so `for index, value in enumerate(...)` would give 0, first value; 1, second value; etc.
- this example enumerates the zipped values, so we have an index number while we iterate over two lists in parallel.
- so the for loop says: for each sample output of the softmax, calculate the gradients, aka the Jacobian matrix
- then, calculate the full derivative wrt each input of the softmax (as previously discussed), storing the resulting gradient as a row in dinputs
- dinputs represents the gradients output by the softmax backward pass, where each row is the gradient of a sample.
- in other words, it is what we described above, just looping over each sample's loss gradient, calculating the Jacobian for that sample, and then outputting the gradients
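
A tiny sketch of how `enumerate(zip(...))` unpacks, using plain lists as hypothetical stand-ins for `self.output` and `dvalues`:

```python
outputs = [[0.7, 0.1, 0.2], [0.1, 0.5, 0.4]]  # stand-in for self.output
grads = [[-1.4, 0.0, 0.0], [0.0, -2.0, 0.0]]  # stand-in for dvalues

seen = []
for index, (single_output, single_dvalues) in enumerate(zip(outputs, grads)):
    # index counts the samples; the tuple unpacks one row from each list
    seen.append((index, single_output, single_dvalues))
    print(index, single_output, single_dvalues)
```

Each pass through the loop gets the sample number plus that sample's row from both lists.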

In [None]:
class Activation_Softmax:

    # Backward pass
    def backward(self, dvalues):
        # Create uninitialized array
        self.dinputs = np.empty_like(dvalues)
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # same softmax derivative calculations as in the previously shown code
            single_output = single_output.reshape(-1, 1)
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            
            # Calculate sample-wise gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)

Simplification - Combining log loss and softmax into 1 derivative
- see the picture below for full explanation
- summary: we can substitute the softmax output variables for the loss inputs, as they are the same values. This lets us rewrite the gradient of the loss wrt the softmax inputs in terms of loss variables. The big moves in the equation are: breaking the loss summation into the term where j = k, leaving the remaining summation intact where j != k; then substituting the loss variable into the softmax derivative (the one with the Kronecker delta); then substituting these into the split summation and simplifying from there. Note that during simplification we drop the summation because the one-hot encoding of the loss just results in multiplying by zero. See the loss derivative notes for further detail (same concept).
- ultimate output is y_hat - y_truth
- the reason for doing this is that it is faster

![softmax_and_loss_derivative.jpg](attachment:softmax_and_loss_derivative.jpg)
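
A typed-up sketch of the same simplification, in my own notation (z for the softmax inputs, y-hat for the softmax outputs, y for the one-hot truth), so it may not match the handwritten version line for line. The last step assumes one-hot labels, whose entries sum to 1 for each sample:

$$
\begin{aligned}
\frac{\partial L_i}{\partial z_{i,k}}
&= \sum_{j} \frac{\partial L_i}{\partial \hat{y}_{i,j}} \cdot \frac{\partial \hat{y}_{i,j}}{\partial z_{i,k}} \\
&= \sum_{j} \left(-\frac{y_{i,j}}{\hat{y}_{i,j}}\right) \left(\hat{y}_{i,j}\,\delta_{jk} - \hat{y}_{i,j}\,\hat{y}_{i,k}\right) \\
&= -y_{i,k} + \hat{y}_{i,k} \sum_{j} y_{i,j} \\
&= \hat{y}_{i,k} - y_{i,k}
\end{aligned}
$$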

Combined code for derivatives:
- in the backward section we check whether the ground truth array is one-hot encoded, and if so change it to discrete. This is the opposite of the loss function implementation, where we check for discrete and change to one-hot encoded
- this is because with discrete labels, each category index in the ground truth array corresponds to a column index of the output array. For example: one_hot = [[1,0,0],[0,1,0],[0,0,1]] => discrete_array = [0,1,2]. Use argmax(axis=1) to achieve this, which gets the location of the max in each row
- so we can use the discrete array to index the output array to get what the network is outputting for the ground truth class.
- then we can subtract 1 from the network's output for the ground truth class, because the ground truth value there will always be 1 (everywhere else it is 0, so those outputs are left unchanged)
- this also has gradient normalization as we previously discussed
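
A minimal sketch of the indexing described above, using the same sample outputs as the gradient check later in these notes. The one-hot array is my own encoding of the targets [0, 1, 1].

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 1, 0]])

# One-hot -> discrete: position of the 1 in each row
y_true = np.argmax(one_hot, axis=1)
samples = len(softmax_outputs)

# Pair row i with column y_true[i] and subtract 1 there only
dinputs = softmax_outputs.copy()
dinputs[range(samples), y_true] -= 1
dinputs = dinputs / samples

# Should match the full (y_hat - y) / samples computed with one-hot labels
print(np.allclose(dinputs, (softmax_outputs - one_hot) / samples))
```

Indexing with the discrete labels touches only one entry per row, which avoids building and subtracting a full one-hot matrix.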

In [None]:
###example class that handles loss for both categorical cross entropy and softmax

class Activation_Softmax_Loss_CategoricalCrossentropy():
    # Creates activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()

    # Forward pass
    def forward(self, inputs, y_true):
        # Output layer's activation function
        self.activation.forward(inputs)
        # Set the output
        self.output = self.activation.output
        # Calculate and return loss value
        return self.loss.calculate(self.output, y_true)
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)

        # If labels are one-hot encoded,
        # turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        
        # Copy so we can safely modify
        self.dinputs = dvalues.copy()
    
        # For each row in dinputs, get what the network has for the correct class and subtract 1
        self.dinputs[range(samples), y_true] -= 1
        
        # Normalize gradient
        self.dinputs = self.dinputs / samples

Code of Softmax, Categorical Cross Entropy, and Combination
- be sure to run loss function cell a few cells up

In [3]:
# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        
        self.output = probabilities
    # Backward pass
    def backward(self, dvalues):
        # Create uninitialized array
        self.dinputs = np.empty_like(dvalues)
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # Flatten output array
            single_output = single_output.reshape(-1, 1)
            # Calculate Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate sample-wise gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)


class Loss_CategoricalCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Probabilities for target values -
        # only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples),y_true]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of labels in every sample
        # We'll use the first sample to count them
        labels = len(dvalues[0])
        # If labels are sparse, turn them into one-hot vector
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]
        
        # Calculate gradient
        self.dinputs = -y_true / dvalues
        # Normalize gradient
        self.dinputs = self.dinputs / samples


class Activation_Softmax_Loss_CategoricalCrossentropy():
    # Creates activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()

    # Forward pass
    def forward(self, inputs, y_true):
        # Output layer's activation function
        self.activation.forward(inputs)
        # Set the output
        self.output = self.activation.output
        # Calculate and return loss value
        return self.loss.calculate(self.output, y_true)
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)

        # If labels are one-hot encoded,
        # turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        
        # Copy so we can safely modify
        self.dinputs = dvalues.copy()
    
        # For each row in dinputs, get what the network has for the correct class and subtract 1
        self.dinputs[range(samples), y_true] -= 1
        
        # Normalize gradient
        self.dinputs = self.dinputs / samples



A check that both forms return the same values for gradients
- be sure to run cell with loss object creation a few cells up and other functions above

In [5]:
#be sure to run cell with loss object creation a few cells up
import numpy as np
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = np.array([0, 1, 1])

softmax_loss = Activation_Softmax_Loss_CategoricalCrossentropy()
softmax_loss.backward(softmax_outputs, class_targets)
dvalues1 = softmax_loss.dinputs

activation = Activation_Softmax()
activation.output = softmax_outputs

#be sure to run cell with loss object creation a few cells up
loss = Loss_CategoricalCrossentropy()

loss.backward(softmax_outputs, class_targets)
activation.backward(loss.dinputs)
dvalues2 = activation.dinputs

print('Gradients: combined loss and activation:')
print(dvalues1)
print('Gradients: separate loss and activation:')
print(dvalues2)

Gradients: combined loss and activation:
[[-0.1         0.03333333  0.06666667]
 [ 0.03333333 -0.16666667  0.13333333]
 [ 0.00666667 -0.03333333  0.02666667]]
Gradients: separate loss and activation:
[[-0.1         0.03333333  0.06666667]
 [ 0.03333333 -0.16666667  0.13333333]
 [ 0.00666667 -0.03333333  0.02666667]]


Time difference between the two methods
- be sure to run cell with loss object creation a few cells up and other functions above
- in this run, combining the derivatives is about 2.3x faster (the printed value is the ratio t2/t1, separate time over combined time)

In [8]:
import numpy as np
from timeit import timeit

#be sure to run cell with loss object creation a few cells up

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = np.array([0, 1, 1])
def f1():
    softmax_loss = Activation_Softmax_Loss_CategoricalCrossentropy()
    softmax_loss.backward(softmax_outputs, class_targets)
    dvalues1 = softmax_loss.dinputs
def f2():
    activation = Activation_Softmax()
    activation.output = softmax_outputs
    #be sure to run cell with loss object creation a few cells up

    loss = Loss_CategoricalCrossentropy()
    loss.backward(softmax_outputs, class_targets)
    activation.backward(loss.dinputs)
    dvalues2 = activation.dinputs

t1 = timeit(lambda: f1(), number=10000)
t2 = timeit(lambda: f2(), number=10000)
print(t2/t1)

2.3499585963398837


Full model code up to this point
- Note that when running the code, we do not use the categorical cross-entropy loss and softmax layers separately. We now use the combined object to leverage the optimized derivative

In [15]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()
import matplotlib.pyplot as plt

# Dense layer
class Layer_Dense:
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases
    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)

# ReLU activation
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        self.inputs = inputs
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        # Since we need to modify original variable,
        # let’s make a copy of values first
        self.dinputs = dvalues.copy()
        # Zero gradient where input values were zero or negative
        self.dinputs[self.inputs <= 0] = 0

# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        
        self.output = probabilities
    # Backward pass
    def backward(self, dvalues):
        # Create uninitialized array
        self.dinputs = np.empty_like(dvalues)
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # Flatten output array
            single_output = single_output.reshape(-1, 1)
            # Calculate Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate sample-wise gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)

# Common loss class
class Loss:
    # Calculates the data and regularization losses
    # given model output and ground truth values
    def calculate(self, output, y):
        
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        
        # Return loss
        return data_loss
    
class Loss_CategoricalCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Probabilities for target values -
        # only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples),y_true]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of labels in every sample
        # We'll use the first sample to count them
        labels = len(dvalues[0])
        # If labels are sparse, turn them into one-hot vector
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]
        
        # Calculate gradient
        self.dinputs = -y_true / dvalues
        # Normalize gradient
        self.dinputs = self.dinputs / samples


class Activation_Softmax_Loss_CategoricalCrossentropy():
    # Creates activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()

    # Forward pass
    def forward(self, inputs, y_true):
        # Output layer's activation function
        self.activation.forward(inputs)
        # Set the output
        self.output = self.activation.output
        # Calculate and return loss value
        return self.loss.calculate(self.output, y_true)
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)

        # If labels are one-hot encoded,
        # turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        
        # Copy so we can safely modify
        self.dinputs = dvalues.copy()
    
        # For each row in dinputs, get what the network has for the correct class and subtract 1
        self.dinputs[range(samples), y_true] -= 1
        
        # Normalize gradient
        self.dinputs = self.dinputs / samples

Run the cell above before running the one below

In [16]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)
# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Create second Dense layer with 3 input features (as we take output
# of previous layer here) and 3 output values (output values)
dense2 = Layer_Dense(3, 3)
# Create Softmax classifier’s combined loss and activation
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()
# Perform a forward pass of our training data through this layer
dense1.forward(X)
# Perform a forward pass through activation function
# takes the output of first dense layer here
activation1.forward(dense1.output)

# Perform a forward pass through second Dense layer
# takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Perform a forward pass through the activation/loss function
# takes the output of second dense layer here and returns loss
loss = loss_activation.forward(dense2.output, y)


# Let’s see output of the first few samples:
print(loss_activation.output[:5])
# Print loss value
print('loss:', loss)
# Calculate accuracy from output of loss_activation and targets
# calculate values along first axis
predictions = np.argmax(loss_activation.output, axis=1)

if len(y.shape) == 2:
    y = np.argmax(y, axis=1)
accuracy = np.mean(predictions == y)

# Print accuracy
print('acc:', accuracy)

# Backward pass
loss_activation.backward(loss_activation.output, y)
dense2.backward(loss_activation.dinputs)
activation1.backward(dense2.dinputs)
dense1.backward(activation1.dinputs)
# Print gradients
print(dense1.dweights)
print(dense1.dbiases)
print(dense2.dweights)
print(dense2.dbiases)

[[0.33333334 0.33333334 0.33333334]
 [0.33333316 0.3333332  0.33333364]
 [0.33333287 0.3333329  0.33333418]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
loss: 1.0986104
acc: 0.34
[[ 1.5766358e-04  7.8368575e-05  4.7324404e-05]
 [ 1.8161036e-04  1.1045571e-05 -3.3096316e-05]]
[[-3.6055347e-04  9.6611722e-05 -1.0367142e-04]]
[[ 5.4410957e-05  1.0741142e-04 -1.6182236e-04]
 [-4.0791339e-05 -7.1678100e-05  1.1246944e-04]
 [-5.3011299e-05  8.5817286e-05 -3.2805994e-05]]
[[-1.0732794e-05 -9.4590941e-06  2.0027626e-05]]
