# Dev Notebook - Backpropagation 

In this notebook I examine how backpropagation works, referencing the examples provided in [NNFS](https://nnfs.io/).

In [2]:
import nnfs 
import numpy as np

import matplotlib.pyplot as plt

nnfs.init()

#### Backpropagation - simplified 

Backpropagation through ReLU, based on the example in NNFS, to ensure a solid understanding of the underlying math (partial derivatives and the chain rule) and mechanisms.

In [4]:
# simulating a forward pass 

x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0

# Multiplying inputs by weights
xw0 = x[0]* w[0]
xw1 = x[1]* w[1]
xw2 = x[2]* w[2]

# Summing weighted inputs and bias
z = xw0 + xw1 + xw2 + b

# applying relu
y = max(z, 0)


If we represent the forward pass as a function we can say:

$$\text{ReLU}\left(\sum[\text{inputs}\cdotp\text{weights}]+\text{bias}\right)$$

We now need the partial derivative of this function with respect to each input, each weight, and the bias. For example, to know the effect that $x_0$ has on the outcome, we would need:

$$ \frac{\partial}{\partial x_0}\left[\text{ReLU}\left(\sum[\text{inputs}\cdotp\text{weights}]+\text{bias}\right)\right] = \frac{d \text{ReLU()}}{d \text{sum()}}\cdot\frac{\partial\text{sum()}}{\partial mul(x_0,w_0)}\cdot\frac{\partial mul(x_0,w_0)}{\partial x_0} $$
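
As a quick worked instance of this chain, the values from the forward pass cell ($x = [1, -2, 3]$, $w = [-3, -1, 2]$, $b = 1$) can be plugged in. Since $z > 0$, the ReLU derivative is 1, the derivative of the sum is 1, and the derivative of the product with respect to $x_0$ is $w_0$:

$$ z = (1)(-3) + (-2)(-1) + (3)(2) + 1 = 6 $$

$$ \frac{\partial}{\partial x_0}\text{ReLU}(z) = \frac{d \text{ReLU()}}{d \text{sum()}}\cdot\frac{\partial\text{sum()}}{\partial mul(x_0,w_0)}\cdot\frac{\partial mul(x_0,w_0)}{\partial x_0} = 1 \cdot 1 \cdot w_0 = -3 $$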


In [5]:
# The backward pass

# derivative passed back from the next layer (the upstream gradient)
d_val = 1.0

# the derivative of relu wrt z 
d_relu_dz = d_val * (0,1)[z>0] # == 1 if z > 0, else 0 

# Recall the derivative of a sum operator is always 1 
# derivative of the sum wrt x_n*w_n 
d_sum_dxwn = 1
d_relu_dxw0 = d_relu_dz * d_sum_dxwn
d_relu_dxw1 = d_relu_dz * d_sum_dxwn
d_relu_dxw2 = d_relu_dz * d_sum_dxwn

# derivative of the sum wrt b (bias) 
d_sum_db = 1
d_relu_db = d_relu_dz * d_sum_db

# Recall the derivative of a product wrt one factor is the other factor 
d_mul_dx0 = w[0]
d_mul_dx1 = w[1]
d_mul_dx2 = w[2]
d_relu_dx0 = d_mul_dx0 * d_relu_dxw0
d_relu_dx1 = d_mul_dx1 * d_relu_dxw1
d_relu_dx2 = d_mul_dx2 * d_relu_dxw2

d_mul_dw0 = x[0]
d_mul_dw1 = x[1]
d_mul_dw2 = x[2]
d_relu_dw0 = d_mul_dw0 * d_relu_dxw0
d_relu_dw1 = d_mul_dw1 * d_relu_dxw1
d_relu_dw2 = d_mul_dw2 * d_relu_dxw2

# Simplifying the above we can rewrite as:
d_relu_dx0 = d_val * (0,1)[z>0] * w[0]


In [6]:
# Optimized code for the backward pass 
# (yes, variables are being shadowed, but that is okay: this section is just for learning and is not final code)
d_val = 1.0

d_x = [d_val*(0,1)[z>0]*_w for _w in w] # upstream gradient * derivative of ReLU * the weight paired with each input
d_w = [d_val*(0,1)[z>0]*_x for _x in x] # upstream gradient * derivative of ReLU * the input paired with each weight
d_b = d_val * (0,1)[z>0] # upstream gradient * derivative of ReLU (the derivative of the sum is always 1)
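
One way to gain confidence in these formulas is to compare them against a central-difference estimate. The sketch below is only illustrative: `neuron` and `numerical_grad` are hypothetical helper names that do not come from NNFS, and the step size `eps` is an arbitrary choice.

```python
# Finite-difference check of the single-neuron gradients (illustrative sketch).
x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0

def neuron(x, w, b):
    """Forward pass: ReLU(sum(x_i * w_i) + b)."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)

def numerical_grad(f, values, eps=1e-6):
    """Central-difference estimate of df/dvalues[i] for each i."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        up[i] += eps
        down = list(values)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

# Analytic gradients, same formulas as the optimized cell (with d_val = 1.0)
z = sum(xi * wi for xi, wi in zip(x, w)) + b
relu_grad = 1.0 if z > 0 else 0.0
d_x = [relu_grad * wi for wi in w]
d_w = [relu_grad * xi for xi in x]
d_b = relu_grad

# Numerical estimates for comparison
num_d_x = numerical_grad(lambda v: neuron(v, w, b), x)
num_d_w = numerical_grad(lambda v: neuron(x, v, b), w)
num_d_b = (neuron(x, w, b + 1e-6) - neuron(x, w, b - 1e-6)) / 2e-6

print(d_x, num_d_x)
print(d_w, num_d_w)
print(d_b, num_d_b)
```

The analytic and numerical values should agree to within floating-point noise. Agreement would break down near `z == 0`, where ReLU is not differentiable.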

#### Backpropagation - A layer of neurons 

Considering multiple neurons in a layer rather than just one 

In [7]:
# dummy gradients passed back from the next layer 
d_val = np.array([[1.,1.,1.]])

weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# gradient wrt each input: sum over neurons of (weight * upstream gradient)
d_x0 = sum([weights[0][i] * d_val[0][i] for i in range(weights.shape[1])])
d_x1 = sum([weights[1][i] * d_val[0][i] for i in range(weights.shape[1])])
d_x2 = sum([weights[2][i] * d_val[0][i] for i in range(weights.shape[1])])
d_x3 = sum([weights[3][i] * d_val[0][i] for i in range(weights.shape[1])])
d_x = np.array([d_x0, d_x1, d_x2, d_x3])
d_x


array([ 0.44, -0.38, -0.07,  1.37])

In [8]:
# optimizing the above code and accounting for batches of samples we get:

d_val = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

d_x = np.dot(d_val, weights.T)
d_x

array([[ 0.44, -0.38, -0.07,  1.37],
       [ 0.88, -0.76, -0.14,  2.74],
       [ 1.32, -1.14, -0.21,  4.11]], dtype=float32)

In [9]:
# To calculate the gradients wrt the weights we consider the input values 
inputs = np.array([[1, 2, 3, 2.5],
                    [2., 5., -1., 2],
                    [-1.5, 2.7, 3.3, -0.8]])

d_w = np.dot(inputs.T, d_val)
d_w


array([[ 0.5,  0.5,  0.5],
       [20.1, 20.1, 20.1],
       [10.9, 10.9, 10.9],
       [ 4.1,  4.1,  4.1]], dtype=float32)

In [10]:
# Calculating the derivative of the bias 
d_b = np.sum(d_val, axis=0, keepdims=True)
d_b

array([[6., 6., 6.]])

In [11]:
# output for the linear component 
z = np.array([[1,2,-3,-4], [2,-7,-1,3], [-1, 2,5,-1]])
d_val = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Calculating the derivative of ReLU 
d_relu = np.zeros(z.shape)
d_relu[z>0] = 1
d_relu *= d_val
d_relu 

array([[ 1.,  2.,  0.,  0.],
       [ 5.,  0.,  0.,  8.],
       [ 0., 10., 11.,  0.]], dtype=float32)

At this point I will go back and update the Linear Layer and the ReLU with backward code.
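
A minimal sketch of what that update might look like is below. The class names `Dense` and `ReLU`, their attribute names, and the zero biases are assumptions made for illustration, not the actual classes from this project or from NNFS. The `backward` methods reuse the three matrix expressions and the ReLU mask worked out in the cells above.

```python
import numpy as np

class Dense:
    """Hypothetical dense layer; names are illustrative only."""
    def __init__(self, weights, biases):
        self.weights = np.asarray(weights, dtype=float)  # shape (n_inputs, n_neurons)
        self.biases = np.asarray(biases, dtype=float)    # shape (1, n_neurons)

    def forward(self, inputs):
        self.inputs = np.asarray(inputs, dtype=float)    # cached for the backward pass
        self.output = self.inputs @ self.weights + self.biases
        return self.output

    def backward(self, d_val):
        self.d_weights = self.inputs.T @ d_val                # gradient wrt weights
        self.d_biases = np.sum(d_val, axis=0, keepdims=True)  # gradient wrt biases
        self.d_inputs = d_val @ self.weights.T                # gradient wrt inputs
        return self.d_inputs

class ReLU:
    """Hypothetical ReLU activation; names are illustrative only."""
    def forward(self, inputs):
        self.inputs = np.asarray(inputs, dtype=float)
        self.output = np.maximum(0, self.inputs)
        return self.output

    def backward(self, d_val):
        # Pass the upstream gradient through only where the input was positive
        self.d_inputs = np.where(self.inputs > 0, d_val, 0.0)
        return self.d_inputs

# Same data as the cells above; the zero biases are an assumption
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
d_val = np.array([[1., 1., 1.],
                  [2., 2., 2.],
                  [3., 3., 3.]])

dense = Dense(weights, np.zeros((1, 3)))
dense.forward(inputs)
dense.backward(d_val)

relu = ReLU()
relu.forward(np.array([[1, 2, -3, -4], [2, -7, -1, 3], [-1, 2, 5, -1]]))
relu.backward(np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]))

print(dense.d_weights)
print(dense.d_biases)
print(dense.d_inputs)
print(relu.d_inputs)
```

Caching `inputs` during `forward` is the design choice that matters here: both the weight gradient and the ReLU mask depend on values that are only available in the forward pass.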

#### Backpropagation of CCE Loss

The derivative of the CCE loss with respect to each predicted value is:

$$ \frac{\partial L_i}{\partial \hat{y}_{i,j}} = -\frac{y_{i,j}}{\hat{y}_{i,j}} $$

I will now add this directly to the loss function.
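
As a rough sketch of how this derivative could be computed for a batch, the function below applies the formula element-wise. The name `cce_backward` and its signature are hypothetical, and the predictions are assumed to be non-zero. If the batch loss is taken as the mean over samples, the result would additionally be divided by the number of samples.

```python
import numpy as np

def cce_backward(y_pred, y_true):
    """Per-sample gradient of categorical cross-entropy wrt the predictions.

    y_pred: (n_samples, n_classes) predicted probabilities, assumed non-zero.
    y_true: one-hot array of the same shape, or a 1-D array of class indices.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true)
    if y_true.ndim == 1:
        # Turn sparse class indices into one-hot rows
        y_true = np.eye(y_pred.shape[1])[y_true]
    # dL_i / dy_hat_ij = -y_ij / y_hat_ij
    return -y_true / y_pred

# Made-up predictions and labels for illustration
y_pred = np.array([[0.7, 0.1, 0.2],
                   [0.1, 0.5, 0.4]])
y_true = np.array([0, 1])

d_loss = cce_backward(y_pred, y_true)
print(d_loss)
```

Only the entry for the true class of each sample is non-zero, because $y_{i,j}$ is 0 everywhere else.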

#### Backpropagation of Softmax activation

$$ \frac{\partial S_{i,j}}{\partial Z_{i,k}} = S_{i,j} \cdot (\delta_{j,k} - S_{i,k})$$

In [12]:
# Test implementation 

# Softmax output 
so = [0.7, 0.1, 0.2] 
so = np.array(so).reshape(-1, 1)
np.diagflat(so)

array([[0.7, 0. , 0. ],
       [0. , 0.1, 0. ],
       [0. , 0. , 0.2]])

In [13]:
np.dot(so, so.T)

array([[0.49, 0.07, 0.14],
       [0.07, 0.01, 0.02],
       [0.14, 0.02, 0.04]], dtype=float32)

In [14]:
np.diagflat(so) - np.dot(so, so.T)

array([[ 0.20999999, -0.07      , -0.14      ],
       [-0.07      ,  0.09      , -0.02      ],
       [-0.14      , -0.02      ,  0.16      ]])
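
The matrix above is the Jacobian of the softmax output for a single sample. A minimal sketch of how it might be used in a backward pass follows: the Jacobian is multiplied by the gradient arriving from the loss. The function name `softmax_backward` and the upstream gradient values are made up for illustration.

```python
import numpy as np

def softmax_backward(s, d_val):
    """Gradient wrt the softmax inputs for ONE sample.

    s:     softmax output for the sample, shape (n_classes,)
    d_val: upstream gradient wrt the softmax output, shape (n_classes,)
    """
    s = np.asarray(s, dtype=float).reshape(-1, 1)
    # Jacobian: J[j, k] = S_j * (delta_jk - S_k)
    jacobian = np.diagflat(s) - s @ s.T
    # The Jacobian is symmetric, so J @ d_val equals J.T @ d_val
    return jacobian @ np.asarray(d_val, dtype=float)

so = np.array([0.7, 0.1, 0.2])
d_val = np.array([1.0, 0.0, 0.0])  # made-up upstream gradient

d_z = softmax_backward(so, d_val)
print(d_z)
```

For a batch, the same calculation would be repeated per sample, since each sample has its own Jacobian.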

#### Backward pass for L1 and L2 regularization

$$ L_1' = \frac{\partial}{\partial w_m} \lambda\sum_n|w_n| = \lambda\frac{\partial}{\partial w_m} |w_m| = \lambda
    \begin{cases}
        1 & w_m > 0 \\
        -1 & w_m < 0
    \end{cases}
$$ 

$$ L_2' = \frac{\partial}{\partial w_m} \lambda\sum_n w_n^2 = \lambda\frac{\partial}{\partial w_m}w_m^2 = 2\lambda w_m $$

In [4]:
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

dl1 = np.ones_like(weights)
dl1[weights < 0 ] = -1

dl1

array([[ 1.,  1., -1.,  1.],
       [ 1., -1.,  1., -1.],
       [-1., -1.,  1.,  1.]])
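
The cell above computes only the sign term of the L1 derivative. A short sketch of the full gradients, including the $\lambda$ factor and the L2 case, might look like the following. The names `lambda_l1` and `lambda_l2` and their values are hypothetical, and a weight of exactly 0 is treated as positive, matching the `np.ones_like` approach above.

```python
import numpy as np

weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

lambda_l1 = 0.01  # hypothetical L1 strength
lambda_l2 = 0.01  # hypothetical L2 strength

# L1: lambda * (+1 where w >= 0, -1 where w < 0)
dl1 = np.ones_like(weights)
dl1[weights < 0] = -1
d_l1 = lambda_l1 * dl1

# L2: 2 * lambda * w
d_l2 = 2 * lambda_l2 * weights

print(d_l1)
print(d_l2)
```

In a layer's backward pass these terms would be added to the weight gradient that comes from the data loss.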

#### Backpropagation of Dropout

When the output of the binomial function, $r_i$, is 1, the function is the previous layer's output $z$ scaled by $\frac{1}{1-p}$, so its derivative is:

$$ f'(z, p)  = \frac{\partial}{\partial z}\left[ \frac{z}{1-p} \right] = \frac{1}{1-p}\cdot\frac{\partial}{\partial z} z = \frac{1}{1-p} $$

where $p$ is the fraction of neurons we intend to zero.

When the output is 0, the function's output is 0, and so is the partial derivative:

$$ f'(z, p) = 0 $$

so we can denote the derivative of dropout as: 

$$ \frac{\partial}{\partial z_i}\text{Dropout} = \begin{cases} 
                                                    \frac{1}{1-p} & r_i = 1 \\
                                                    0 & r_i = 0 
                                                \end{cases}
                                                = \frac{r_i}{1-p} $$
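
The derivation above can be sketched in a few lines of NumPy. This is only an illustration: the input values, the dropout rate, and the seed are made up, and the mask `r` is drawn with `Generator.binomial` using a keep probability of `1 - p`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

p = 0.2  # dropout rate: fraction of neurons to zero
z = np.array([[1.0, -2.0, 3.0, 0.5],
              [0.3, 4.0, -1.5, 2.0]])  # made-up previous layer output

# Forward pass: r_i = 1 keeps a neuron, r_i = 0 drops it
r = rng.binomial(1, 1 - p, size=z.shape)
output = z * r / (1 - p)

# Backward pass: upstream gradient times r_i / (1 - p)
d_val = np.ones_like(z)  # dummy upstream gradient
d_z = d_val * r / (1 - p)

print(r)
print(d_z)
```

The same mask `r` must be used in both passes, so a layer implementation would need to store it during the forward pass.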