Back Propagation
- beginning with a simple example where we just seek to minimize the output of a single neuron
- goal is to figure out how much each input, weight, and bias impacts the neuron function (and eventually the network)
- to do this, use the chain rule and take the derivative with respect to each input, weight, and bias (only 1 bias here)
- only the derivatives with respect to weights and biases are used to minimize loss, but the derivatives with respect to inputs are needed as well, because they are what chains one layer to the layer before it (more in the next bullet point)
- we are chaining the layers together via the input derivative: take the derivative of each later layer with respect to its input, then, at the layer we are interested in, take the derivative of that layer with respect to its weight or bias. A change in a weight or bias changes that layer's output, which is the input to the next layer. For a weight in layer 1: dlayer2/dweight = dlayer2/dlayer1 * dlayer1/dweight. So it is really just the chain rule: we need the derivative of the outer function with respect to its input, which is the inner function, and then the derivative of the inner function with respect to whatever parameter we want, in this case a weight or bias. So the input is acting as the chain (the chain rule!)
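
A minimal sketch of this chaining, assuming a made-up two-layer setup with one input, one weight, and one bias per layer (the names layer1, layer2 and all the numbers are hypothetical, not from the book). It multiplies the outer derivative by the inner derivative and compares the result with a finite-difference estimate:

```python
# Hypothetical two-layer chain: layer1 is linear, layer2 is a ReLU neuron.
def layer1(x, w1, b1):
    return w1 * x + b1

def layer2(h, w2, b2):
    return max(w2 * h + b2, 0.0)  # ReLU

# Made-up values, chosen so the ReLU is active
x, w1, b1 = 2.0, 0.5, 1.0
w2, b2 = -1.5, 4.0

h = layer1(x, w1, b1)      # output of layer 1 = input of layer 2
z2 = w2 * h + b2           # pre-activation of layer 2
out = layer2(h, w2, b2)

# Outer function wrt its input (the "chain" link between the layers)
dout_dh = (1.0 if z2 > 0 else 0.0) * w2
# Inner function wrt the parameters we care about
dh_dw1 = x
dh_db1 = 1.0
# Chain rule: outer derivative times inner derivative
dout_dw1 = dout_dh * dh_dw1
dout_db1 = dout_dh * dh_db1

# Finite-difference estimate of the same derivative, as a sanity check
eps = 1e-6
numeric_dw1 = (layer2(layer1(x, w1 + eps, b1), w2, b2) - out) / eps

print("chain rule:", dout_dw1, dout_db1)
print("finite difference wrt w1:", numeric_dw1)
```

The two estimates for w1 should agree to several decimal places as long as the ReLU stays on the same side of zero.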

Back Prop on 1 Neuron
- Example neuron function, where x0, x1, x2 are the inputs, w0, w1, w2 their respective weights, and b is the bias: y = relu(w0 * x0 + w1 * x1 + w2 * x2 + b) = max(w0 * x0 + w1 * x1 + w2 * x2 + b, 0)
- can be broken down even further by treating each weight * input as its own function; the book does this. In the function, sum() is the sum of the weight * input products and the bias, and mul() is a single weight * input. So the derivative of the full neuron with respect to x0 would be: deriv from next layer wrt its input * dReLU()/dsum() * dsum()/dmul() * dmul()/dx0
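
Written out as a formula, the chain in the last bullet looks roughly like this. It is a sketch in the notation of these notes, assuming mul_0 = w_0 x_0, sum = mul_0 + mul_1 + mul_2 + b, and dvalue for the derivative arriving from the next layer:

```latex
\frac{\partial\,\text{out}}{\partial x_0}
  = \text{dvalue}
    \cdot \frac{\partial\,\mathrm{ReLU}(\text{sum})}{\partial\,\text{sum}}
    \cdot \frac{\partial\,\text{sum}}{\partial\,\text{mul}_0}
    \cdot \frac{\partial\,\text{mul}_0}{\partial x_0}
  = \text{dvalue} \cdot \mathbf{1}[\text{sum} > 0] \cdot 1 \cdot w_0
```

The same chain with the last factor swapped gives the weight derivative: the final term becomes x_0 instead of w_0.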

In [1]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
# ReLU activation function
y = max(z, 0)
# Backward pass
# The derivative from the next layer
dvalue = 1.0

''' Example of how this comes together
dtwoneurons/dx0 = dnext_layer/dReLU * dReLu/dsum() * dsum()/dmul() * dmul/dx0
'''

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print("Next Layer times Relu:", drelu_dz)

# Partial derivatives of the sum, the chain rule; the deriv of a plain sum wrt each of its terms is just 1 (think about it)
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db
print("Next Layer, RelU, and sum for each sum and bias:", drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)
# Partial derivatives of the multiplication, the chain rule
# shortcut: the deriv of weight * input wrt the weight is just the input value, and wrt the input it is just the weight (flip-flop)
dmul_dx0 = w[0] 
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2

# note: the full deriv wrt the bias was already calculated above (drelu_db); dsum/db is 1, since the bias is just added inside the sum function
print("Full wrt inputs, weights:",drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)

Next Layer times Relu: 1.0
Next Layer, RelU, and sum for each sum and bias: 1.0 1.0 1.0 1.0
Full wrt inputs, weights: -3.0 1.0 -1.0 -2.0 2.0 3.0
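
One way to sanity-check these analytic gradients is a finite-difference comparison. The sketch below is an illustration, not something from the book; the helpers neuron and central_diff are hypothetical names. It bumps each input, weight, and the bias by a small eps and measures how much the neuron output moves:

```python
# Same neuron as above, written as a function of its inputs, weights, and bias
def neuron(x, w, b):
    return max(sum(xi * wi for xi, wi in zip(x, w)) + b, 0.0)

x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias
eps = 1e-6             # small bump for the numerical estimate

def central_diff(f, values, i):
    """Estimate df/dvalues[i] by bumping entry i up and down by eps."""
    up = list(values)
    up[i] += eps
    down = list(values)
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)

num_dx = [central_diff(lambda v: neuron(v, w, b), x, i) for i in range(3)]
num_dw = [central_diff(lambda v: neuron(x, v, b), w, i) for i in range(3)]
num_db = (neuron(x, w, b + eps) - neuron(x, w, b - eps)) / (2 * eps)

# These should be close to w, x, and 1 respectively (the flip-flop rule)
print("numeric wrt inputs: ", [round(g, 4) for g in num_dx])
print("numeric wrt weights:", [round(g, 4) for g in num_dw])
print("numeric wrt bias:   ", round(num_db, 4))
```

The check only works this cleanly because z is well above 0 here; right at z = 0 the ReLU has a kink and the numerical estimate would not match the 0-or-1 rule.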


Ultimately Simplifying the Code Above
- taking out all the multiplying by 1 etc. just leaves you with derivative of next_layer * ReLU * w0 (or x0 for the weight, 1 for the bias, etc.)

In [7]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
# ReLU activation function
y = max(z, 0)
# Backward pass
# The derivative from the next layer
dvalue = 1.0

''' Example of how this comes together
dtwoneurons/dx0 = dnext_layer/dReLU * dReLu/dsum() * dsum()/dmul() * dmul/dx0

z = sum() + b

now simplified to dnext_layer/dReLU * dReLu/dz * dz/dx[0]
'''

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print("Next Layer times Relu:", drelu_dz)

# Partial derivatives of the multiplication, the chain rule
# shortcut: the deriv of weight * input wrt the weight is just the input value, and wrt the input it is just the weight (flip-flop)

drelu_dx0 = dvalue * (1. if z > 0 else 0.) * w[0]
drelu_dw0 = dvalue * (1. if z > 0 else 0.) * x[0]
drelu_dx1 = dvalue * (1. if z > 0 else 0.) * w[1]
drelu_dw1 = dvalue * (1. if z > 0 else 0.) * x[1]
drelu_dx2 = dvalue * (1. if z > 0 else 0.) * w[2]
drelu_dw2 = dvalue * (1. if z > 0 else 0.) * x[2]
drelu_db = dvalue * (1. if z > 0 else 0.) * 1
print("Full wrt inputs, weights:",drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)
print("wrt bias", drelu_db)

Next Layer times Relu: 1.0
Full wrt inputs, weights: -3.0 1.0 -1.0 -2.0 2.0 3.0
wrt bias 1.0
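
The repeated lines above can be folded into a loop. A minimal sketch, assuming a hypothetical helper called neuron_backward (not from the book) that returns the gradients as lists:

```python
def neuron_backward(x, w, b, dvalue=1.0):
    """Gradients of a single ReLU neuron wrt its inputs, weights, and bias."""
    # Forward pass: weighted sum plus bias
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    # Next-layer derivative times the ReLU derivative (1 if z > 0, else 0)
    dz = dvalue * (1.0 if z > 0 else 0.0)
    grad_x = [dz * wi for wi in w]  # wrt each input: the matching weight
    grad_w = [dz * xi for xi in x]  # wrt each weight: the matching input
    grad_b = dz                     # wrt the bias: times 1
    return grad_x, grad_w, grad_b

grad_x, grad_w, grad_b = neuron_backward([1.0, -2.0, 3.0], [-3.0, -1.0, 2.0], 1.0)
print(grad_x, grad_w, grad_b)
# prints: [-3.0, -1.0, 2.0] [1.0, -2.0, 3.0] 1.0
```

These are the same numbers as the cell above, just grouped by inputs and weights instead of interleaved.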


Decreasing the output of neuron (manually); run cell above for gradient values

In [8]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2] # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2] # gradients on weights
db = drelu_db # gradient on bias...just 1 bias here

#current weights and bias
print("Current Weights", w, b)

# applying a small negative step along each gradient, i.e. along how much the final output changes wrt a change in each weight and the bias
# negative because we want to decrease the output of the neuron
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

#new weight and bias
print("New weights", w, b)

# Multiplying inputs by new weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding
z = xw0 + xw1 + xw2 + b
# ReLU activation function
y = max(z, 0)
print("New Output", y, "Old Output", 6)


Current Weights [-3.0, -1.0, 2.0] 1.0
New weights [-3.001, -0.998, 1.997] 0.999
New Output 5.985 Old Output 6
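
Repeating the same update a few times shows the output continuing to drop. A minimal sketch, assuming the same 0.001 step size and recomputing the gradient on every pass (the loop and the names learning_rate and outputs are illustrative additions):

```python
x = [1.0, -2.0, 3.0]   # input values (held fixed)
w = [-3.0, -1.0, 2.0]  # starting weights
b = 1.0                # starting bias
learning_rate = 0.001  # same small step as in the cell above

outputs = []
for step in range(5):
    # Forward pass
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = max(z, 0.0)
    outputs.append(y)
    # Backward pass with dvalue = 1: ReLU derivative, then the flip-flop rule
    dz = 1.0 if z > 0 else 0.0
    # Step each weight and the bias against its gradient
    w = [wi - learning_rate * dz * xi for wi, xi in zip(w, x)]
    b -= learning_rate * dz

# The output starts at 6.0 and shrinks by roughly 0.015 per step
print(outputs)
```

Each step lowers z by about learning_rate * (x0^2 + x1^2 + x2^2 + 1) = 0.001 * 15, which matches the drop from 6 to 5.985 seen above.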


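This was the basic process of minimizing the output of a neuron; in reality we want to minimize the loss of a network. The next-layer aspect of this is kind of ignored here (dvalue is just 1). If the next layer were included, wouldn't we really be minimizing the value of the next layer? In effect, yes: dvalue is the derivative of whatever final value we are minimizing with respect to this neuron's output, so with a real next layer (or a loss) feeding back into dvalue, the same steps would reduce that final value instead of just this neuron's output.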