# CSS

In [1]:
from IPython.display import HTML
style = """
<style>
.expo {
  line-height: 150%;
}

.visual {
  width: 400px;
}

</style>
"""
HTML(style)

## Now what?

We have made our prediction and computed our loss, $L$. Now what?

Recall: each "step" is just a function applied to some input that results in some output.

### Now what?

If we write out what we just did in terms of mathematical functions, we could write it as:

\begin{align}
A &= a(x, V) \\
B &= b(A) \\
C &= c(B, W) \\
P &= p(C) \\
L &= l(P)
\end{align}

So, say we have a neural net with just one hidden layer. We could write the loss of a neural net on a given observation $ x $ as:

$$ L = l(p(c(b(a(x, V)), W))) $$
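
The composition above can be sketched in NumPy as a chain of five assignments. This is a minimal sketch under stated assumptions: the shapes (3 inputs, 4 hidden units, 1 output), the random weights, and the values of `x` and `y` are hypothetical, and the functions are taken to be a matrix product, a sigmoid, a matrix product, a sigmoid, and a squared-error loss.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: one observation with 3 features and a scalar target.
rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0, 0.0]])   # shape (1, 3)
y = np.array([[1.0]])             # shape (1, 1)
V = rng.normal(size=(3, 4))       # input -> hidden weights (assumed shape)
W = rng.normal(size=(4, 1))       # hidden -> output weights (assumed shape)

A = np.dot(x, V)                  # A = a(x, V)
B = sigmoid(A)                    # B = b(A)
C = np.dot(B, W)                  # C = c(B, W)
P = sigmoid(C)                    # P = p(C)
L = 0.5 * (y - P) ** 2            # L = l(P)

print(A.shape, B.shape, C.shape, P.shape, L.shape)
```

Each intermediate value is kept in its own variable because the backward pass reuses them.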

### Now what?

Mathematically, we _want_ to change the weights in such a way that the loss will be reduced during the next iteration. The equations:

$$ W = W - \frac{\partial l}{\partial W}$$

$$ V = V - \frac{\partial l}{\partial V}$$

do this.

### Now what?

Notice that this "makes sense":

* If $\frac{\partial l}{\partial W}$ is a positive number, then we want to _decrease_ the weight, since increasing the weight would _increase_ our loss. That is exactly what the equation $ W = W - \frac{\partial l}{\partial W}$ does.
* Similarly, if $\frac{\partial l}{\partial W}$ is a negative number, then we want to _increase_ the weight, since increasing the weight would _decrease_ our loss. In both cases, the equation $ W = W - \frac{\partial l}{\partial W}$ works.
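
To see both cases from the list above in action, here is a small sketch with a toy one-weight loss. The loss function is hypothetical (it is not the network's loss); it was chosen only so that the arithmetic is easy to follow.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (hypothetical, for illustration only)."""
    return 0.25 * (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 0.5 * (w - 3.0)

for w in (5.0, 1.0):
    g = grad(w)        # positive when w = 5.0, negative when w = 1.0
    w_new = w - g      # same form as W = W - dl/dW
    print(f"w={w}, gradient={g:+.1f}, new w={w_new}, loss {loss(w)} -> {loss(w_new)}")
```

Starting from `w = 5.0` the gradient is positive and the weight moves down to `4.0`; starting from `w = 1.0` the gradient is negative and the weight moves up to `2.0`. In both cases the loss falls from `1.0` to `0.25`.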

## Backpropagation - setup

Now we want to make our neural net smarter by updating its weights. We've seen that to do that, we need to compute $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial V}$. How do we do this?

Well, we know that 

$$ L = l(p(c(b(a(x, V)), W))) $$

### Backpropagation - setup

Our good friend the chain rule tells us that: 

$$ \frac{\partial L}{\partial W} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial W}  $$

and 

$$ \frac{\partial L}{\partial V} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial B} * \frac{\partial b}{\partial A} * \frac{\partial a}{\partial V}  $$

Each one of these partial derivatives turns out to be simple!

## Backpropagation - step 1:

First, let's compute:

$$ \frac{\partial l}{\partial P} $$

### Backpropagation - step 1:

Since 

$$ L = l(P) = \frac{1}{2}(y - P)^2 $$

Then:

$$ \frac{\partial l}{\partial P} = -(y - P)$$

### Backpropagation - step 1:

And coding this up is simply:

In [15]:
dLdP = -(y - P)
array_print(dLdP)

The array:
 [[-0.61]]
The dimensions are 1 row and 1 column
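
One way to sanity-check a derivative like this is to compare it with a finite-difference estimate. The sketch below does that for the loss; the values of `y` and `P` are assumptions, chosen so that $-(y - P)$ comes out to the $-0.61$ printed above.

```python
def loss(P, y):
    """Squared-error loss for a single prediction."""
    return 0.5 * (y - P) ** 2

y, P = 1.0, 0.39   # assumed values, not taken from the notebook's data
eps = 1e-6

# Central difference: (l(P + eps) - l(P - eps)) / (2 * eps)
numeric = (loss(P + eps, y) - loss(P - eps, y)) / (2 * eps)
analytic = -(y - P)

print(numeric, analytic)
```

The two numbers should agree to many decimal places, which is a quick confirmation that the sign of the derivative is right.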


## Where are we

<div class="visual">
    <img src='img/neural_net_4_loss_grad.png'>
</div>

## Backpropagation - step 2:

Next, let's compute:

$$ \frac{\partial p}{\partial C} $$

Recall that:

$$ P = \begin{bmatrix} p_1 \end{bmatrix} = p(c) = \sigma(c) $$

### A digression on the sigmoid

If

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Then 

$$\sigma'(x) = \sigma(x) * (1 - \sigma(x))$$
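
For readers who want to see where this identity comes from, here is a short derivation that uses only the definition of $\sigma$ above and the chain rule:

$$
\begin{align}
\sigma'(x) &= \frac{d}{dx} \left(1 + e^{-x}\right)^{-1} \\
&= \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} \\
&= \frac{1}{1 + e^{-x}} * \frac{e^{-x}}{1 + e^{-x}} \\
&= \sigma(x) * (1 - \sigma(x))
\end{align}
$$

The last step uses $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$.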

### Backpropagation - step 2:

So if

$$ p(c) = \sigma(c) $$

then

$$ p'(c) = \sigma(c) * (1 - \sigma(c)) $$

### Backpropagation - step 2:

So, coding this up is simply:

In [16]:
dPdC = sigmoid(C) * (1-sigmoid(C))
array_print(dPdC)

The array:
 [[ 0.24]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_prediction_grad.png'>
</div>

### Backpropagation - step 3:

Next we want to compute:

$$ \frac{\partial c}{\partial W} $$

### Backpropagation - step 3:

Recall that:


$$
\begin{align}
C &= \begin{bmatrix} c_1 \end{bmatrix} \\ 
&= c(W) \\
&= w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4
\end{align}
$$

### Backpropagation - step 3:

Now recall that by 

$$ \frac{\partial c}{\partial W} $$

we mean:

$$ \begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} $$

### Backpropagation - step 3:

But since 

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

$ \frac{\partial c}{\partial w_{11}} $, for example, is just $b_1$, $ \frac{\partial c}{\partial w_{21}} $ is just $b_2$, etc.

### Backpropagation - step 3:

Thus,

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} $$

Which is just $ B^T$.

### Backpropagation - step 3:

So, coding this up is simply:

In [17]:
dCdW = B.T
array_print(dCdW)

The array:
 [[ 0.33]
 [ 0.75]
 [ 0.27]
 [ 0.63]]
The dimensions are 4 rows and 1 column


Note that this has the same dimensions as `W`, which is what we want.

## Computing $\frac{\partial L}{\partial W}$

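Now computing $\frac{\partial L}{\partial W}$ is simply a matter of multiplying the three pieces together, as the chain rule prescribes. The two $1 \times 1$ factors $\frac{\partial l}{\partial P}$ and $\frac{\partial p}{\partial C}$ are multiplied elementwise, and the result is matrix-multiplied by $\frac{\partial c}{\partial W}$, which gives a gradient with the same shape as $W$.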

In [18]:
dLdW = np.dot(dCdW, dLdP * dPdC)
array_print(dLdW, 3)

The array:
 [[-0.048]
 [-0.109]
 [-0.039]
 [-0.091]]
The dimensions are 4 rows and 1 column
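
A common way to check a result like this is a numerical gradient check: nudge each weight slightly and measure how the loss changes. The sketch below is one way to do it, not the notebook's own code. `B` and `W` use the rounded values printed in this section, `y = 1` is an assumption, and the helper `loss_given_W` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_given_W(W, B, y):
    """Loss as a function of W, holding the hidden activations B fixed."""
    P = sigmoid(np.dot(B, W))
    return (0.5 * (y - P) ** 2).item()

B = np.array([[0.33, 0.75, 0.27, 0.63]])          # rounded, shape (1, 4)
W = np.array([[-0.5], [0.65], [-0.41], [-1.03]])  # rounded, shape (4, 1)
y = np.array([[1.0]])                             # assumed target

# Analytic gradient, following steps 1 to 3.
C = np.dot(B, W)
P = sigmoid(C)
dLdP = -(y - P)
dPdC = sigmoid(C) * (1 - sigmoid(C))
dCdW = B.T
dLdW = np.dot(dCdW, dLdP * dPdC)

# Numerical gradient: central difference on one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    step = np.zeros_like(W)
    step[i, 0] = eps
    numeric[i, 0] = (loss_given_W(W + step, B, y)
                     - loss_given_W(W - step, B, y)) / (2 * eps)

print(np.allclose(dLdW, numeric, atol=1e-6))
```

If the analytic and numerical gradients disagree, the usual suspects are a missing transpose or a matrix product used where an elementwise product was intended.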


## Backpropagation - step 4:

By the same logic that we applied in Step 3, since:

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

we have:

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} = B^T $$

### Backpropagation - step 4:

And again, since:

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

We have:
    
$$ \frac{\partial c}{\partial B} =
\begin{bmatrix}\frac{\partial c}{\partial b_1} &
                  \frac{\partial c}{\partial b_2} &
                  \frac{\partial c}{\partial b_3} &
                  \frac{\partial c}{\partial b_4}
                  \end{bmatrix} = \begin{bmatrix}w_{11} &
                  w_{21} &
                  w_{31} &
                  w_{41}
                  \end{bmatrix} = W^T $$

### Backpropagation - step 4:

So coding this up simply gives:

In [19]:
dCdB = W.T
array_print(dCdB)

The array:
 [[-0.5   0.65 -0.41 -1.03]]
The dimensions are 1 row and 4 columns


## Where are we

<div class="visual">
    <img src='img/neural_net_4_c_grad.png'>
</div>

### Backpropagation - step 5:

Next, we want to compute:

$$ \frac{\partial b}{\partial A} $$

### Backpropagation - step 5:

But since:

$$ B = b(A) = \sigma(A) $$

Which is really just shorthand for:

$$ B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} = \begin{bmatrix} \sigma(a_1) \\ \sigma(a_2) \\ \sigma(a_3) \\ \sigma(a_4) \end{bmatrix} $$

### Backpropagation - step 5:

Since we know that:
    
$$ \sigma'(A) = \sigma(A) * (1 - \sigma(A)) $$ 

Then:
    
$$ \frac{\partial b}{\partial A} = \begin{bmatrix} \sigma(a_1) * (1 - \sigma(a_1)) \\ 
\sigma(a_2) * (1 - \sigma(a_2)) \\ 
\sigma(a_3) * (1 - \sigma(a_3)) \\
\sigma(a_4) * (1 - \sigma(a_4)) \end{bmatrix} = \sigma(A) * (1 - \sigma(A))$$

### Backpropagation - step 5:

In [20]:
dBdA = sigmoid(A) * (1-sigmoid(A))
array_print(dBdA)

The array:
 [[ 0.22  0.19  0.2   0.23]]
The dimensions are 1 row and 4 columns
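
It may help to see why an elementwise product is enough here. Each $b_i$ depends only on its own $a_i$, so the full Jacobian of $b$ with respect to $A$ is diagonal, and multiplying by a diagonal matrix is the same as multiplying elementwise. The sketch below illustrates this; the values of `A` and of the upstream gradient `g` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = np.array([[-0.7, 1.1, -1.0, 0.5]])   # hypothetical pre-activations, shape (1, 4)
g = np.array([[0.3, -0.2, 0.1, 0.4]])    # hypothetical upstream gradient dL/dB, shape (1, 4)

dBdA = sigmoid(A) * (1 - sigmoid(A))     # elementwise derivative, shape (1, 4)

# Full Jacobian: entry (i, j) is d b_i / d a_j, which is zero unless i == j.
jacobian = np.diag(dBdA.ravel())         # shape (4, 4)

via_matrix = np.dot(g, jacobian)         # chain rule written as a matrix product
via_elementwise = g * dBdA               # the shortcut used in this notebook

print(np.allclose(via_matrix, via_elementwise))
```

Storing only the diagonal avoids building a $4 \times 4$ matrix that is mostly zeros.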


## Where are we

<div class="visual">
    <img src='img/neural_net_4_b_grad.png'>
</div>

### Backpropagation - step 6:

Finally, we want to compute the most involved of our partial derivatives:

$$ \frac{\partial a}{\partial V} $$

### Backpropagation - step 6:

Recalling that:

$$ a(X, V) = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}$$

### Backpropagation - step 6:

But $ a(X, V) $ is itself shorthand for the equations:

$$ x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} = a_1 $$
$$ x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} = a_2 $$
$$ x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} = a_3 $$
$$ x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} = a_4 $$

### Backpropagation - step 6:

So $ \frac{\partial a}{\partial V} $ is really shorthand for:

$$ \begin{bmatrix}\frac{\partial a}{\partial v_{11}} & \frac{\partial a}{\partial v_{12}} & \frac{\partial a}{\partial v_{13}} & \frac{\partial a}{\partial v_{14}} \\
\frac{\partial a}{\partial v_{21}} & \frac{\partial a}{\partial v_{22}} & \frac{\partial a}{\partial v_{23}} & \frac{\partial a}{\partial v_{24}} \\
\frac{\partial a}{\partial v_{31}} & \frac{\partial a}{\partial v_{32}} & \frac{\partial a}{\partial v_{33}} & \frac{\partial a}{\partial v_{34}} \\
\end{bmatrix} $$

### Backpropagation - step 6:

But, note that focusing on just $a_1$ for example:

$$ \frac{\partial a_1}{\partial v_{11}} = x_1 $$
$$ \frac{\partial a_1}{\partial v_{21}} = x_2 $$
$$ \frac{\partial a_1}{\partial v_{31}} = x_3 $$

Since again,

$$ x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} = a_1 $$

### Backpropagation - step 6:

whereas $v_{11}$ does not appear in the equations for the other outputs, so for $a_2$ and $a_3$, for example:

$$ \frac{\partial a_2}{\partial v_{11}} = 0 $$
$$ \frac{\partial a_3}{\partial v_{11}} = 0 $$

Since, for example:

$$ x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} = a_2 $$

### Backpropagation - step 6:

So if we write: 
    
$$ A = \begin{bmatrix}a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} $$

Then $\frac{\partial a}{\partial V}$ ends up being:

$$ \frac{\partial a}{\partial V} = \begin{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\end{bmatrix} $$

### Backpropagation - step 6:

Every column is the same vector, so in the matrix multiplication that follows it is enough to write:

$$ \frac{\partial a}{\partial V} = x^T $$

### Backpropagation - step 6:

Which is of course easy to code as:

In [21]:
dAdV = x.T
array_print(dAdV)

The array:
 [[1]
 [0]
 [0]]
The dimensions are 3 rows and 1 column
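
To make the "just $x^T$" shorthand concrete: if `g` is the gradient of the loss with respect to $A$, then entry $(i, j)$ of the gradient with respect to $V$ is $x_i * g_j$, which is exactly the matrix product of $x^T$ and `g`. The sketch below checks this numerically. The values of `x`, `g`, and `V` are hypothetical, and `f` is a stand-in scalar function whose gradient with respect to $A$ equals `g` by construction.

```python
import numpy as np

x = np.array([[1.0, 2.0, -1.0]])               # hypothetical input, shape (1, 3)
g = np.array([[0.3, -0.2, 0.1, 0.4]])          # hypothetical upstream gradient dL/dA, shape (1, 4)
V = np.arange(12, dtype=float).reshape(3, 4)   # hypothetical weights, shape (3, 4)

def f(V):
    """Stand-in scalar whose gradient with respect to A = x.V is exactly g."""
    return np.sum(np.dot(x, V) * g)

# Analytic gradient: (3, 1) times (1, 4) gives (3, 4), the shape of V.
analytic = np.dot(x.T, g)

# Numerical gradient: central difference on one entry of V at a time.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        step = np.zeros_like(V)
        step[i, j] = eps
        numeric[i, j] = (f(V + step) - f(V - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Row $i$ of the result is `g` scaled by $x_i$, so an input that is zero leaves its row of $V$ with a zero gradient.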


## Where are we

<div class="visual">
    <img src='img/neural_net_4_a_grad.png'>
</div>

## Computing $\frac{\partial l}{\partial V}$

To compute $\frac{\partial l}{\partial V}$, we multiply together all of the partial derivatives we've calculated, using elementwise multiplication for factors of the same shape (such as the sigmoid derivatives) and matrix multiplication where the shapes differ (the $W^T$ and $x^T$ factors):

### Computing $\frac{\partial l}{\partial V}$

In [22]:
dLdV = np.dot(dAdV, np.dot(dLdP * dPdC, dCdB) * dBdA)
array_print(dLdV)

The array:
 [[ 0.02 -0.02  0.01  0.03]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]]
The dimensions are 3 rows and 4 columns


Note that this has the same shape as $V$, which is what we want!
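
As with $W$, the full expression can be checked against finite differences. The sketch below is self-contained and uses assumed values: random weights from a seeded generator, `x = [1, 0, 0]` to match the output of `dAdV` above, and `y = 1`. The helper `loss` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, V, W):
    """Forward pass followed by the squared-error loss, returned as a float."""
    P = sigmoid(np.dot(sigmoid(np.dot(x, V)), W))
    return (0.5 * (y - P) ** 2).item()

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0, 0.0]])
y = np.array([[1.0]])              # assumed target
V = rng.normal(size=(3, 4))        # hypothetical weights
W = rng.normal(size=(4, 1))        # hypothetical weights

# Forward pass, keeping the intermediate values.
A = np.dot(x, V)
B = sigmoid(A)
C = np.dot(B, W)
P = sigmoid(C)

# Backward pass, following steps 1 to 6.
dLdP = -(y - P)
dPdC = P * (1 - P)
dCdB = W.T
dBdA = B * (1 - B)
dAdV = x.T
dLdV = np.dot(dAdV, np.dot(dLdP * dPdC, dCdB) * dBdA)

# Numerical gradient: central difference on one entry of V at a time.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        step = np.zeros_like(V)
        step[i, j] = eps
        numeric[i, j] = (loss(x, y, V + step, W) - loss(x, y, V - step, W)) / (2 * eps)

print(np.allclose(dLdV, numeric, atol=1e-6))
```

With this `x`, only the first row of the gradient is nonzero, the same pattern as in the output above.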

## Updating the weights

Updating the weights can now be done simply:

In [23]:
W -= dLdW
V -= dLdV
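
Putting the pieces together, the forward pass, the backward pass, and the update can be repeated in a loop. The sketch below is illustrative rather than the notebook's own training code: the weights are random, `y = 1` is an assumption, and the number of iterations is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0, 0.0]])
y = np.array([[1.0]])              # assumed target
V = rng.normal(size=(3, 4))        # hypothetical starting weights
W = rng.normal(size=(4, 1))        # hypothetical starting weights

losses = []
for _ in range(10):
    # Forward pass.
    A = np.dot(x, V)
    B = sigmoid(A)
    C = np.dot(B, W)
    P = sigmoid(C)
    losses.append((0.5 * (y - P) ** 2).item())

    # Backward pass. Both gradients are computed before either weight changes.
    dLdP = -(y - P)
    dPdC = P * (1 - P)
    dBdA = B * (1 - B)
    dLdW = np.dot(B.T, dLdP * dPdC)
    dLdV = np.dot(x.T, np.dot(dLdP * dPdC, W.T) * dBdA)

    # Update, exactly as in the cell above.
    W -= dLdW
    V -= dLdV

print(losses[0], losses[-1])
```

The recorded loss should shrink from one iteration to the next, which is the behaviour the update equations were designed to produce.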