# Introduction to Neural Networks

Neural networks are a flexible class of models for learning complex, high-dimensional functions. First, we will apply the model in Pytorch, and then implement it by hand, using multivariable calculus to derive the parameter updates.

In this example, we will apply a neural network to an image classification problem. We have a dataset containing images of clothes, with each image having a label describing the item of clothing in the image. Each $28 \times 28$-pixel image is represented as a length $784$ vector $\mathbf{x}$. We use bold font to indicate a vector. There are 10 possible items of clothing (e.g. t-shirt, hat) in our dataset, so $y$ is an integer in the range $[0, 9]$. Together, our sample space is pairings of images and labels, $(\mathbf{x}_i, y_i) \sim (\mathcal{X}, \mathcal{Y})$ (i.e. a single image-label pair indexed by $i$ can be drawn from the sample space of all pairs of images and labels).


We want to learn the conditional distribution $p(y| \mathbf{x})$, i.e. what is the probability of the label $y$ given an image $\mathbf{x}$. For example, what is the probability that an image $\mathbf{x}$ is of a t-shirt? In this case, we want to learn a function $f(\mathbf{x}) : \mathbf{x} \rightarrow \mathbf{y}$. It will take a vector of length $784$ and output a vector of length $10$, with each element of the output vector assigning some weight related to the probability of image $\mathbf{x}$ having a particular label $y$. These unnormalised weights output by the model are called 'logits'.

<p align="center">
    <img src="./nn-introduction/nn-image-1.png" width="400">
</p>

In this example, we will first follow the [Pytorch](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) 'Getting Started' page to train a simple model for this classification task. Then, we will gain a deeper understanding of how these models work by implementing the model ourselves in Numpy, which means calculating the derivatives needed for model training by hand.

## Coding up a Neural Network in Pytorch

This section exactly follows [the Pytorch introduction](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html).

### The data

We will use the FashionMNIST dataset, containing $28 \times 28$ images of clothing with 10 possible labels. In total there are $60,000$ images in the training set and $10,000$ images in the test set.

<p align="center">
    <img src="./nn-introduction/FashionMNIST.png" width="500">
</p>

In [7]:
%%capture
import torch
from torch import nn
# DataLoader is an iterable around DataSet, which stores samples and their corresponding labels.
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",  # root directory
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)


In [8]:

batch_size = 64

# Create data loaders
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break


Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64


Our data is a 4D array with dimensions:
$$
\text{(batch size, num channels, image height, image width)}
$$

The images are greyscale, so there is a single channel (an RGB image would have three). Each image in the batch has a single label (e.g. $y = \text{t-shirt}$), so we have
$64$ labels per batch.

**Mini-batch** training is a way to reduce the noise associated with updating the parameters
after running a single image through the network. During mini-batch training, the mini-batch of images (here $64$ images) is passed through the network without updating any parameters, and the average loss over the entire batch is calculated.

Next, backpropagation takes the derivative of this average loss with respect
to the parameters. This gives a lower-variance estimate of the
gradients than using a single sample.

An **epoch** is one complete pass over the training dataset.
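To make the variance claim concrete, here is a minimal sketch. It does not use the real network: the per-sample "gradients" are hypothetical draws (a true value of $1.0$ plus unit Gaussian noise), and the names `per_sample_grads`, `single_sample_std` and `batch_mean_std` are illustrative only. It also counts the parameter updates in one epoch, assuming the last, smaller batch is kept.

```python
import math
import numpy as np

n_train, batch_size = 60_000, 64

# Number of parameter updates in one epoch (the last batch is smaller
# when batch_size does not divide n_train, and is assumed to be kept).
batches_per_epoch = math.ceil(n_train / batch_size)

rng = np.random.default_rng(0)

# Hypothetical per-sample gradients of a single parameter:
# true gradient 1.0 plus unit-variance noise. Each row is one mini-batch.
per_sample_grads = 1.0 + rng.standard_normal((10_000, batch_size))

# Spread of the gradient estimate from one sample vs. from a batch average
single_sample_std = per_sample_grads[:, 0].std()
batch_mean_std = per_sample_grads.mean(axis=1).std()

print(f"Updates per epoch: {batches_per_epoch}")
print(f"Std of single-sample estimate: {single_sample_std:.3f}")
print(f"Std of batch-mean estimate:    {batch_mean_std:.3f}")
```

Under these assumptions the standard deviation of the batch-mean estimate shrinks by roughly $\sqrt{64} = 8$ relative to a single sample.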

### The Model


In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device}")

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()

        self.flatten = nn.Flatten()

        # Each nn.Linear computes yᵢ = ∑ⱼ xⱼ Wᵀⱼᵢ + bᵢ, i.e. y = xW^T + b,
        # where W is a (output_features, input_features) matrix of weights and b is a vector of biases,
        # so we can easily control the output shape. The nonlinearity φ (here ReLU) is a separate
        # module applied between the linear layers. Layers are fully connected.
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)




Using cuda
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)



#### Optimising the Model

We use the cross-entropy loss, which is covered in more detail in the next section.

In [10]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# We can see we have Weights, Biases, Weights, Biases, Weights, Biases (3 layers)
print(f"The model parameters")
for name, param in model.named_parameters():
      print(
          f"Name: {name}\n"
          f"Type: {type(param)}\n"
          f"Size: {param.size()}\n"
      )

The model parameters
Name: linear_relu_stack.0.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512, 784])

Name: linear_relu_stack.0.bias
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512])

Name: linear_relu_stack.2.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512, 512])

Name: linear_relu_stack.2.bias
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512])

Name: linear_relu_stack.4.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([10, 512])

Name: linear_relu_stack.4.bias
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([10])



### Training the Model

We train the model over the (batched) training data, updating the parameters based on the derivative of the loss with respect to the parameters.

In [11]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train(mode=True)  # put into 'training mode'

    for batch, (X, y) in enumerate(dataloader):

        X, y = X.to(device), y.to(device)

        # pred is a (64, 10) tensor of logits for this batch
        # y is a (64,) tensor of class indices
        # Cross entropy loss https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
        pred = model(X)
        loss = loss_fn(pred, y)

        loss.backward()
        optimizer.step()  # perform one step θt <- f(θ_{t-1})
        optimizer.zero_grad()  # zero the accumulated gradients, ready for the next step

train(train_dataloader, model, loss_fn, optimizer)

### Testing the Model

Now run the model over the test data and compute the accuracy:

In [12]:
def test(dataloader, model, loss_fn):
    """Evaluate the model on a dataloader and print the accuracy and average loss."""
    size = len(dataloader.dataset)
    num_batches = len(dataloader)

    model.eval()  # Go from train to eval mode

    test_loss, correct = 0, 0

    with torch.no_grad():  # turn off gradient tracking, which is not needed for evaluation

        for batch, (X, y) in enumerate(dataloader):

            X, y = X.to(device), y.to(device)

            pred = model(X)
            # pred_i = torch.argmax(torch.softmax(pred, dim=1), dim=1)
            pred_i = pred.argmax(1)  # softmax is monotonic, so it preserves the ordering of the logits and the argmax is unchanged
            correct += (pred_i == y).type(torch.float).sum().item()
            test_loss += loss_fn(pred, y).item()

        test_loss /= num_batches
        correct /= size
        print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

print("Epoch 1")
test(test_dataloader, model, loss_fn)

for i in range(5):
    print(f"Epoch {i + 2}")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)


Epoch 1
Test Error: 
 Accuracy: 48.6%, Avg loss: 2.163036 

Epoch 2
Test Error: 
 Accuracy: 55.4%, Avg loss: 1.900973 

Epoch 3
Test Error: 
 Accuracy: 61.5%, Avg loss: 1.528527 

Epoch 4
Test Error: 
 Accuracy: 64.0%, Avg loss: 1.254724 

Epoch 5
Test Error: 
 Accuracy: 65.1%, Avg loss: 1.087651 

Epoch 6
Test Error: 
 Accuracy: 66.1%, Avg loss: 0.980660 



### What is the Model Learning?

1) We are learning p(y|x). We are not learning p(y, x), p(x), p(y) etc.
2) One interpretation is that the model learns decision boundaries across a 784-dimensional space.
3) Another interpretation is that our weight matrices are learning 10 templates, but not quite.
4) link the CS module

TODO:
https://arxiv.org/abs/1311.2901
http://neuralnetworksanddeeplearning.com/
https://cs231n.github.io/linear-classify/
https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
https://www.jasonosajima.com/backprop.html
+ the initialisation paper

## Training a Neural Network by Hand - A 'Simple' Example

Now, we will implement the same model that we created in Pytorch, but this time 'by hand' in Numpy. To do this, we will have to calculate the derivatives of the loss function with respect to the parameters, in order to update them during training.
First, we will review a simplified version of the model that contains only weights, with no biases or nonlinear functions.

### The Model

Let $l^1$, $l^2$ and $l^3$ be vector-valued functions representing the layers $1$, $2$, $3$ of the network. $l^1$ takes our length $784$ row vector $\mathbf{x}$ and returns a length $512$ vector. $l^2$ takes this length $512$ vector as input and returns another length $512$ vector. Our final layer $l^3$ takes this length $512$ vector as input and outputs a length $10$ vector.
\begin{aligned}
    &l^1(\mathbf{x}) = \mathbf{x}W^1 \ \ \ \ \text{(1, 784) x (784, 512) = (1, 512)} \\
    &l^2(\mathbf{l^1}) = \mathbf{l^1}W^2 \ \ \ \ \text{(1, 512) x (512, 512) = (1, 512)} \\
    &l^3(\mathbf{l^2}) = \mathbf{l^2}W^3 \ \ \ \ \text{(1, 512) x (512, 10) = (1, 10)}
\end{aligned}

All vectors are row vectors. The $W$ are our matrices of weights with shape (input, output): $W^1$ is $(784, 512)$,  $W^2$ is $(512, 512)$ and $W^3$ is $(512, 10)$. We can therefore index individual weights as $W^3_{i, o}$ where $i$ is the index of the input node (i.e. the node the weight connects from), and $o$ is the index of the output node (i.e. the node the weight connects to).

<p align="center">
    <img src="./nn-introduction/nn-image-labelled.png" width="400">
</p>

Note that in practice this 'network' reduces to a single linear classifier of shape $(784, 10)$, so of course in reality we would never structure it like this. However, we will do it in this way first as a nice example.

*(An aside)*: This notation is a little non-standard (often vectors are column vectors, and the weight matrix has shape (output, input)) but it makes the derivations below much simpler. Also, the layers are functions $l^3(\cdot)$ but can be treated as vectors $\mathbf{l^3}$ when evaluated. We will typically write them as vectors.

### The Loss

The $(1, 10)$ vector of logits output from the final layer (also called 'scores', one for each label) is used to compute the loss. We will use the cross-entropy loss:

$$
L(\mathbf{l^3}, y) = -\log \dfrac{ \exp{ l^3_y }}{ \sum_k \exp{ l^3_k }}
$$

Here $l^3_k$ is the $k$th element of our vector of logits, and $y$ is the index of the correct label for this image. The term after the $\log$ is the [softmax function](https://en.wikipedia.org/wiki/Softmax_function#:~:text=The%20softmax%20function%2C%20also%20known,used%20in%20multinomial%20logistic%20regression) which normalises the logits to probabilities. Therefore, given an input image $\mathbf{x}$ the model returns a probability distribution over the labels $y$. [This article](https://cs231n.github.io/linear-classify/) has a nice, deeper discussion of the Cross Entropy Loss.

Clearly, we want the probability of the correct label to be high (ideally $1$ for the correct label, $0$ for every other label). So this loss makes intuitive sense: we are maximising the probability the network returns for the correct label. Here, we equivalently minimise the negative log probability.
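As a small illustration of the loss, the sketch below evaluates the cross-entropy on a toy vector of three logits (the values and the helper names `softmax` and `cross_entropy` are made up for this example). Subtracting the maximum logit before exponentiating is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Normalise a 1D vector of logits to probabilities."""
    shifted = logits - np.max(logits)  # stability shift, cancels in the ratio
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(logits, y):
    """Negative log probability assigned to the correct label index y."""
    return -np.log(softmax(logits)[y])

logits = np.array([2.0, 0.5, -1.0])  # toy logits for 3 labels

loss_if_label_0 = cross_entropy(logits, 0)  # model favours label 0
loss_if_label_2 = cross_entropy(logits, 2)  # model disfavours label 2

# With no information (all logits equal) the loss is log(number of labels)
loss_uniform = cross_entropy(np.zeros(3), 1)

print(loss_if_label_0, loss_if_label_2, loss_uniform)
```

The loss is small when the correct label already has the largest logit and large when it has the smallest, which matches the intuition above.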


### Predicting a label from an image - the forward pass

To predict a label for an image $\mathbf{x}$, we apply the defined model to compute the logits:

\begin{aligned}
\mathbf{l^3} &= l^3(l^2(l^1(\mathbf{x}))) \\
             &= ((\mathbf{x}W^1) W^2) W^3 \\
             &=  \mathbf{x}W^1 W^2 W^3 \\
\end{aligned}

The predicted label is the one that maximises the probability as computed by the softmax function:

$$
\hat{y} = \arg\max_{y} \, \mathrm{softmax}(\mathbf{l^3})_y
$$

In other words, the predicted label is the one the model assigns the highest probability to (in the $\arg\max$, $y$ ranges over all label indices rather than denoting the correct label).
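The forward pass above can be sketched in a few lines of Numpy. The input and weights here are random stand-ins (not trained values), and `W_combined` is an illustrative name. The sketch also checks that the three weight matrices collapse into a single $(784, 10)$ linear map, because matrix multiplication is associative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for an image and the three weight matrices (input, output)
x = rng.random((1, 784))
W1 = rng.standard_normal((784, 512)) * np.sqrt(1 / 784)
W2 = rng.standard_normal((512, 512)) * np.sqrt(1 / 512)
W3 = rng.standard_normal((512, 10)) * np.sqrt(1 / 512)

# Layer-by-layer forward pass
l1 = x @ W1    # (1, 512)
l2 = l1 @ W2   # (1, 512)
l3 = l2 @ W3   # (1, 10) logits

# The same map as a single linear classifier
W_combined = W1 @ W2 @ W3      # (784, 10)
l3_combined = x @ W_combined   # (1, 10)

# Softmax is monotonic, so the argmax of the logits is the predicted label
y_hat = int(np.argmax(l3))

print(l3.shape, W_combined.shape, y_hat)
```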

### Backpropagation in our simple network

We want to find the set of weights that minimise the loss function. The loss function is a multivariate function over the space of parameters. Therefore, we can minimise the loss with respect to the parameters through [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). For this, we need to compute the derivative of the loss with respect to the parameters.

For example, take $W^3_{1,2}$ that connects the first neuron from layer 2 to the second neuron in layer 3. How does a small change in this weight change the output of our loss function? A small change in $W^3_{1,2}$ will result in a small change in the second neuron in layer 3 ($l^3_2$)—which will directly affect the loss function $L(\mathbf{l^3})$. This 'chain' of dependencies is captured by the chain rule:

\begin{aligned}
&\dfrac{\partial L(\mathbf{l^3})}{\partial W^3_{1,2}} = \dfrac{\partial L(\mathbf{l^3})}{\partial l^3_2}  \dfrac{\partial l^3_2 }{\partial W^3_{1,2}}
\end{aligned}

Instead of focusing on a single node in layer 3 and a single weight, we can write the derivative with respect to the whole vector of layer 3 nodes and the whole weight matrix. Because we have $10$ layer 3 nodes, $\dfrac{\partial L}{\partial \mathbf{l^3}}$ will be a $(1, 10)$ vector of partial derivatives where each element indicates how the loss changes with a layer 3 node $l^3_n$.

Also, from now on, we will write the loss as $L$ instead of $L(\mathbf{l^3})$ for brevity, but it is useful to remember it is a function of the output layer. Together:

\begin{aligned}
&\dfrac{\partial L}{\partial W^3} = \dfrac{\partial L}{\partial \mathbf{l^3}}   \dfrac{\partial \mathbf{l^3} }{\partial W^3} 
\end{aligned}

and for the other weights:

\begin{aligned}
&\dfrac{\partial L}{\partial W^2} = \dfrac{\partial L}{\partial \mathbf{l^3}}  \dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}  \dfrac{\partial \mathbf{l^2} }{\partial W^2}  \\
&\dfrac{\partial L}{\partial W^1} = \dfrac{\partial L}{\partial \mathbf{l^3}}  \dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}  \dfrac{\partial \mathbf{l^2}}{\partial \mathbf{l^1}}  \dfrac{\partial \mathbf{l^1} }{\partial W^1}
\end{aligned}

A visualisation of $\dfrac{\partial L}{\partial W^2_{1,1}}$:

<p align="center">
    <img src="./nn-introduction/nn-image-derivatives.png" width="500">
</p>

#### Understanding each term with [matrix calculus](https://en.wikipedia.org/wiki/Matrix_calculus)

To review what these terms are, because the notation is doing a lot of heavy lifting and hiding complexity:

$\dfrac{\partial L}{\partial W^3}$: $L$ is a scalar-valued function (it takes as an input a vector of length $10$, the output of layer 3, and returns a scalar). We take the derivative of this with respect to a matrix. This is a $(512, 10)$ matrix (the size of $W^3$) where each entry indicates how an individual weight $W^3_{i,o}$ affects the loss.

$\dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}$ is the derivative of a vector-valued function (it outputs a vector of length $10$) with respect to a vector (of length $512$). This is the Jacobian, with each element indicating how one element of $\mathbf{l^2}$ affects one element of $\mathbf{l^3}$. As layer 3 has $10$ neurons and layer 2 has $512$, we will have a $(10, 512)$ matrix of partial derivatives.

$\dfrac{\partial \mathbf{l^2} }{\partial W^2}$ captures how a small change in each weight in $W^2$ will affect each dimension in $\mathbf{l^2}$. As $W^2$ is $(512, 512)$ and $\mathbf{l^2}$ has $512$ elements, we will have a $(512, 512, 512)$, rank-3 tensor!

#### Computing the derivatives of the loss with respect to $W^3$

Fortunately the structure of our network means we can simplify this a lot as many of these partial derivatives are zero. First we will look in detail at computing the derivative of the loss with respect to the weights connected to the final layer:

$$
\dfrac{\partial L}{\partial W^3} = \dfrac{\partial L}{\partial \mathbf{l^3}}   \dfrac{\partial \mathbf{l^3} }{\partial W^3} 
$$

##### $\dfrac{\partial L}{\partial W^3}$

This is the exact derivative we want to compute to update the weights for the third layer, $W^3$. It will be a $(512, 10)$ matrix, with each element being how a small change in that weight affects the loss e.g. (we drop the superscript for brevity):

$$
    \dfrac{\partial L }{\partial W} =
    \begin{bmatrix}
        \dfrac{ \partial L }{ \partial W_{1,1} } & \dfrac{ \partial L }{ \partial W_{1,2} } & ... & \dfrac{ \partial L }{ \partial W_{1,10} }\\
        \dfrac{ \partial L }{ \partial W_{2,1} } & \ddots & & \vdots \\
        \vdots \\
        \dfrac{ \partial L }{ \partial W_{512,1} } & \dots & &   \dfrac{ \partial L }{ \partial W_{512,10} }
    \end{bmatrix}
$$


##### $\dfrac{\partial L}{\partial \mathbf{l^3}}$

This is the derivative of a scalar-valued loss function with respect to a vector. It is a $10$-dimensional vector of partial derivatives, where each entry captures how the loss changes with respect to a change in that node of layer 3.

We can evaluate the derivative of the cross entropy loss directly:

$$
\dfrac{\partial L }{\partial l^3_n} = \text{softmax}(l^3)_n - \delta_{n, y}
$$

 
Where $\delta$ is the [Kronecker delta](https://en.wikipedia.org/wiki/Kronecker_delta) that equals $1$ when $n=y$ and $0$ otherwise. This says that our vector of derivatives is $\text{softmax}(l^3)_n$ for all output logits in layer 3 that are for the incorrect labels, and $\text{softmax}(l^3)_n - 1$ for the index of the true label $y$. 

See **Appendix 1** for the derivation. In full (shown as a column vector here) this looks like:

$$
\dfrac{ \partial L}{ \partial \mathbf{l^3} } =
\begin{bmatrix}
    \dfrac{\partial L}{\partial l^3_1} \\
    \dfrac{\partial L}{\partial l^3_2} \\
    \vdots\\
    \dfrac{\partial L}{\partial l^3_{10}}
\end{bmatrix}
=
\begin{bmatrix}
    \text{softmax}(l^3)_1 - \delta_{1, y} \\
    \text{softmax}(l^3)_2 - \delta_{2, y} \\
    \vdots \\
    \text{softmax}(l^3)_{10} - \delta_{10, y}
\end{bmatrix}
$$
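One way to gain confidence in this result is a finite-difference check. The sketch below compares the analytic gradient $\text{softmax}(l^3)_n - \delta_{n,y}$ against a central-difference estimate on random logits. The logits, the label `y = 3` and the step size `eps` are arbitrary choices for illustration.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

def loss(logits, y):
    """Cross-entropy loss for a 1D vector of logits and label index y."""
    return -np.log(softmax(logits)[y])

rng = np.random.default_rng(0)
logits = rng.standard_normal(10)  # random stand-in for the layer 3 output
y = 3                             # arbitrary 'correct' label index

# Analytic gradient: softmax minus a one-hot vector at the true label
analytic = softmax(logits)
analytic[y] -= 1.0

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for n in range(logits.size):
    step = np.zeros_like(logits)
    step[n] = eps
    numeric[n] = (loss(logits + step, y) - loss(logits - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

Note that the entries of the gradient sum to zero, since the softmax probabilities sum to $1$ and exactly one entry has $1$ subtracted.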


##### $\dfrac{\partial \mathbf{l^3} }{\partial W^3}$

Like the tensor described above, this is a rank-3 tensor, here with the shape $(512, 10, 10)$ (how each element $l^3_n$ changes with each weight $W^3_{i,o}$). How can we even deal with this in our computation?

A strategy is to break the problem down to look at individual elements of $\dfrac{\partial L}{\partial W^3_{i,o}}$ and compute these using the total derivative rule (**Appendix 2**):

\begin{align}
\dfrac{\partial L}{\partial W^3_{i,o}}
&= \dfrac{\partial L}{\partial l^3} \dfrac{\partial l^3}{\partial W^3_{i,o}}  \\
&= \sum_n \dfrac{\partial L}{\partial l^3_n} \dfrac{\partial l^3_n}{\partial W^3_{i,o}} \\
&= \dfrac{\partial L}{\partial l^3_o} \dfrac{\partial l^3_o}{\partial W^3_{i,o}} \\
&= \dfrac{\partial L}{\partial l^3_o} l^2_i
\end{align}


The notation can get a little tricky to follow here. In words, to find out how the loss changes with respect to a single weight $W^3_{i,o}$ we will see how changing this weight changes layer 3, and then how changing layer 3 changes the loss. To compute this, we sum over every node $n$ in layer 3: how the weight $W^3_{i,o}$ affects the node $l^3_n$, multiplied by how a change in this node affects the loss. This weight only connects to one node, $o$, and so the effect of changing that weight on all other nodes is zero.

So we can take the usual scalar derivative here: $\dfrac{\partial l^3_o}{\partial W^3_{i,o}} = \dfrac{\partial}{\partial W^3_{i,o}} \sum_j l^2_j W^3_{j,o} = l^2_i$, since only the $j = i$ term of the sum depends on $W^3_{i,o}$. This makes intuitive sense: when we make a small change $\delta W^3_{i,o}$, the output $l^3_o$ changes by $l^2_i \, \delta W^3_{i,o}$, so the rate of change is exactly the value of the input node, $l^2_i$.
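To see how sparse the rank-3 tensor is, the sketch below builds it explicitly for a tiny layer (4 inputs, 3 outputs, chosen only so the tensor is easy to inspect) and contracts it with a stand-in upstream gradient. The names `T`, `dL_dW_full` and `dL_dW_outer` are illustrative, and the upstream gradient is random rather than coming from a real loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3  # tiny stand-ins for 512 and 10

l2 = rng.standard_normal((1, n_in))       # input to the layer
dL_dl3 = rng.standard_normal((1, n_out))  # stand-in upstream gradient

# T[i, o, n] = d l3_n / d W_{i,o}, which is l2_i when n == o and 0 otherwise
T = np.zeros((n_in, n_out, n_out))
for o in range(n_out):
    T[:, o, o] = l2[0]

# Total derivative: sum over the layer 3 nodes n
dL_dW_full = np.einsum("n,ion->io", dL_dl3[0], T)

# The same result without ever building the tensor
dL_dW_outer = l2.T @ dL_dl3

nonzero_fraction = np.count_nonzero(T) / T.size
print(nonzero_fraction)
```

Only a fraction $1 / n_{out}$ of the tensor entries are non-zero, which is why the full contraction reduces to such a simple expression.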


##### Putting this all together: $\dfrac{\partial L}{\partial W^3} = \dfrac{\partial L}{\partial \mathbf{l^3}} \dfrac{\partial \mathbf{l^3} }{\partial W^3}$

We can collect the derivatives together in a matrix. All the weights belong to the $W^3$ matrix, so we omit the superscript inside the matrix for brevity, and we abbreviate softmax as $\text{sm}$:

$$
\dfrac{ \partial L}{ \partial W^3} =
\begin{bmatrix}
\dfrac{\partial L}{ \partial l^3_1 } \dfrac{\partial l^3_1 }{ \partial W_{1,1} } 
& \dfrac{\partial L}{ \partial l^3_2 } \dfrac{\partial l^3_2 }{ \partial W_{1,2} } 
& \dots 
& \dfrac{\partial L}{ \partial l^3_{10} } \dfrac{\partial l^3_{10} }{ \partial W_{1, 10} }  \\
\dfrac{\partial L}{ \partial l^3_1 } \dfrac{\partial l^3_1 }{ \partial W_{2,1} } 
& \ddots \\
\vdots \\
\dfrac{\partial L}{ \partial l^3_1 } \dfrac{\partial l^3_1 }{ \partial W_{512,1} } 
& \dots & 
& \dfrac{\partial L}{ \partial l^3_{10} } \dfrac{\partial l^3_{10} }{ \partial W_{512,10} }
\end{bmatrix}
=
\begin{bmatrix}
[\text{sm}(l^3)_1 - \delta_{1, y}] l^2_1 &
[\text{sm}(l^3)_2 - \delta_{2, y}] l^2_1
& \dots
& [\text{sm}(l^3)_{10} - \delta_{10, y}] l^2_1 \\
[\text{sm}(l^3)_1 - \delta_{1, y}] l^2_2 & \ddots & & \vdots \\
\vdots \\
[\text{sm}(l^3)_1 - \delta_{1, y}] l^2_{512} & \dots &
& [\text{sm}(l^3)_{10} - \delta_{10, y}] l^2_{512}
\end{bmatrix}
$$

i.e. for each weight, we see the effect of changing that weight on the layer 3 node it is connected to and multiply it with the effect of changing that layer 3 node on the loss.

We can write all of this as the [outer product](https://en.wikipedia.org/wiki/Outer_product) of two vectors (recall all vectors are row vectors here, so the transpose is to a column vector):


$$
\dfrac{\partial L}{\partial W^3} = (\mathbf{l^2})^T \dfrac{\partial L}{\partial \mathbf{l^3}}
$$
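The outer-product formula can be checked numerically on a small layer. In the sketch below the dimensions (5 inputs, 4 outputs), the label `y = 2` and the random values are arbitrary illustrative choices. Each weight is perturbed in turn and the change in the loss is compared to the corresponding entry of $(\mathbf{l^2})^T \dfrac{\partial L}{\partial \mathbf{l^3}}$.

```python
import numpy as np

def softmax(row):
    """Softmax of a single (1, n) row vector."""
    exps = np.exp(row - np.max(row))
    return exps / np.sum(exps)

def loss_from_W3(W3, l2, y):
    """Cross-entropy loss as a function of the final weight matrix."""
    return -np.log(softmax(l2 @ W3)[0, y])

rng = np.random.default_rng(0)
n_in, n_out, y = 5, 4, 2
l2 = rng.standard_normal((1, n_in))
W3 = rng.standard_normal((n_in, n_out))

# Analytic gradient: outer product of the layer input and dL/dl3
dL_dl3 = softmax(l2 @ W3)
dL_dl3[0, y] -= 1.0
dL_dW3 = l2.T @ dL_dl3  # (n_in, n_out)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W3)
for i in range(n_in):
    for o in range(n_out):
        step = np.zeros_like(W3)
        step[i, o] = eps
        numeric[i, o] = (
            loss_from_W3(W3 + step, l2, y) - loss_from_W3(W3 - step, l2, y)
        ) / (2 * eps)

print(np.max(np.abs(dL_dW3 - numeric)))
```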


#### Computing the derivatives with respect to $W^2$

Next, we are interested in how changing a weight in layer 2 $W^2_{i,o}$ changes the output of layer 2, how this changes layer 3, and how this affects the loss.

$$
\dfrac{\partial L}{\partial W^2} = \dfrac{\partial L}{\partial \mathbf{l^3}}  \dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}  \dfrac{\partial \mathbf{l^2} }{\partial W^2}
$$

We have seen similar terms to all of these above, except for the middle term on the right hand side, the Jacobian.


##### $\dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}$

This is the Jacobian. It is $(10, 512)$ and contains the derivative of each element in $\mathbf{l^3}$ with respect to each element in $\mathbf{l^2}$:

\begin{aligned}
\dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}} = 
\begin{bmatrix}
    \dfrac{\partial l^3_1}{\partial l^2_1} & \dfrac{\partial l^3_1}{\partial l^2_2} & \dots & \dfrac{\partial l^3_1}{\partial l^2_{512}} \\
    \dfrac{\partial l^3_2}{\partial l^2_1} & \ddots & & \vdots \\
    \vdots & & \\
    \dfrac{\partial l^3_{10}}{\partial l^2_1} & \dots & & \dfrac{\partial l^3_{10}}{\partial l^2_{512}} 
\end{bmatrix}
\end{aligned}

We can compute $\dfrac{ \partial L }{ \partial \mathbf{l^2}} = \dfrac{ \partial L }{ \partial \mathbf{l^3}} \dfrac{ \partial  \mathbf{l^3} }{ \partial \mathbf{l^2}}$ as matrix multiplication with shapes $(1, 10) \times (10, 512)$, with the resulting $(1, 512)$ vector being:

\begin{aligned}
\begin{bmatrix}
    \sum_o \dfrac{ \partial L }{ \partial l^3_o } \dfrac{\partial l^3_o }{ \partial l^2_1 } &
    \sum_o \dfrac{ \partial L }{ \partial l^3_o } \dfrac{\partial l^3_o }{ \partial l^2_2 } &
    \cdots &
    \sum_o \dfrac{ \partial L }{ \partial l^3_o } \dfrac{\partial l^3_o }{ \partial l^2_{512} }
\end{bmatrix}
\end{aligned}

Each node of layer 2 is connected to every node of layer 3, so to understand how a change in a layer 2 node affects the loss, we sum over its effect on every node in layer 3. While the derivatives don't go to zero like in the $W^3$ case above, each individual derivative is simple to compute:

$$
\dfrac{\partial l^3_o }{ \partial l^2_i } = \dfrac{\partial}{ \partial l^2_i } \sum_j l^2_j W^3_{j, o} = W^3_{i, o}
$$

And so we can compute this with the matrix vector calculation:

$$
\dfrac{ \partial L }{ \partial \mathbf{l^2} } = \dfrac{ \partial L}{ \partial \mathbf{l^3}} (W^3)^T
$$ 
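As a quick check that the Jacobian really is $(W^3)^T$, the sketch below estimates it column by column with finite differences on a tiny layer. The dimensions (5 inputs, 3 outputs) and the random upstream gradient `dL_dl3` are stand-ins for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 5, 3  # tiny stand-ins for 512 and 10

l2 = rng.standard_normal((1, n_in))
W3 = rng.standard_normal((n_in, n_out))

# Numerical Jacobian with shape (outputs, inputs): entry [o, i] = d l3_o / d l2_i
eps = 1e-6
jacobian = np.zeros((n_out, n_in))
for i in range(n_in):
    step = np.zeros_like(l2)
    step[0, i] = eps
    jacobian[:, i] = (((l2 + step) @ W3) - ((l2 - step) @ W3))[0] / (2 * eps)

# Propagating a stand-in upstream gradient back through the layer
dL_dl3 = rng.standard_normal((1, n_out))
dL_dl2 = dL_dl3 @ W3.T  # (1, n_in)

print(jacobian.shape, dL_dl2.shape)
```

Because the layer is linear, the Jacobian does not depend on the value of $\mathbf{l^2}$ at all, only on the weights.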


##### $\dfrac{\partial \mathbf{l^2} }{\partial W^2}$

Now, we have the multiplication of the $(1, 512)$ vector $\dfrac{ \partial L }{ \partial \mathbf{l^2} }$ with the $(512, 512, 512)$ rank-3 tensor $\dfrac{\partial \mathbf{l^2} }{\partial W^2}$. In tensor multiplication, we need to be explicit about which dimensions we sum over.

Because each $W^2_{i, o}$ is connected to a single layer 2 node $l^2_o$, things are the same as in the $W^3$ case and most of the derivatives go to zero. Therefore using the same logic, we have:


$$
\dfrac{ \partial L }{ \partial W^2 } = (\mathbf{l^1})^T \dfrac{ \partial L }{ \partial \mathbf{l^2} } 
$$


i.e. the outer product of a $(512, 1)$ vector by a $(1, 512)$ vector to give us a $(512, 512)$ matrix of derivatives. 

#### Computing the derivatives with respect to $W^1$

Since we have gone through all of the hard work of inspecting what is going on under the hood, we can simplify the notation going forward. This compact notation highlights the 'chain' aspect of the chain rule:

$$
\dfrac{\partial L}{\partial W^1} = \dfrac{\partial L}{\partial \mathbf{l^3}}  \dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}  \dfrac{\partial \mathbf{l^2}}{\partial \mathbf{l^1}}  \dfrac{\partial \mathbf{l^1} }{\partial W^1}
$$

We are using the total derivative rule to condense all intermediate calculations, just like above:

First:

$$
\mathbf{g_1} = \dfrac{\partial L}{\partial \mathbf{l^2}} = \dfrac{\partial L}{\partial \mathbf{l^3}}  \dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}} = \dfrac{\partial L}{\partial \mathbf{l^3}} (W^3)^T
$$

$(1, 512) = (1, 10) \times (10, 512)$

$$
\mathbf{g_2} = \dfrac{\partial L}{\partial \mathbf{l^1}} =  \dfrac{\partial L}{\partial \mathbf{l^2}} \dfrac{\partial \mathbf{l^2}}{\partial \mathbf{l^1}} = \mathbf{g_1}(W^2)^T
$$

$(1, 512) = (1, 512) \times (512, 512)$ 

Finally we have the outer product calculation:

$$
\dfrac{\partial L}{\partial W^1} = \dfrac{\partial L}{\partial \mathbf{l^1}} \dfrac{\partial \mathbf{l^1}}{\partial W^1} = (\mathbf{x})^T \mathbf{g_2}
$$

$(784, 512) = (784, 1) \times (1, 512)$
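The whole backward chain can be sketched and checked on a scaled-down network. The layer sizes below (6, 5, 5, 4), the label `y = 1` and the random weights are stand-ins so the check runs instantly. The structure mirrors the three equations above, and a single entry of $\dfrac{\partial L}{\partial W^1}$ is compared against a finite-difference estimate.

```python
import numpy as np

def softmax(row):
    exps = np.exp(row - np.max(row))
    return exps / np.sum(exps)

def forward(x, W1, W2, W3):
    l1 = x @ W1
    l2 = l1 @ W2
    l3 = l2 @ W3
    return l1, l2, l3

def loss(x, y, W1, W2, W3):
    return -np.log(softmax(forward(x, W1, W2, W3)[2])[0, y])

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 6))         # stand-in for the (1, 784) image
W1 = rng.standard_normal((6, 5)) * 0.5
W2 = rng.standard_normal((5, 5)) * 0.5
W3 = rng.standard_normal((5, 4)) * 0.5
y = 1

# Backward pass: chain the gradients from the loss towards the input
l1, l2, l3 = forward(x, W1, W2, W3)
dL_dl3 = softmax(l3)
dL_dl3[0, y] -= 1.0
g1 = dL_dl3 @ W3.T   # dL/dl2
g2 = g1 @ W2.T       # dL/dl1
dL_dW1 = x.T @ g2    # same shape as W1

# Finite-difference check of one arbitrary entry of dL/dW1
eps = 1e-6
step = np.zeros_like(W1)
step[2, 3] = eps
numeric = (loss(x, y, W1 + step, W2, W3) - loss(x, y, W1 - step, W2, W3)) / (2 * eps)

print(dL_dW1[2, 3], numeric)
```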

**It's worth reflecting on this. We have such complexity here. And it reduces really nicely.**
It's awesome to see how chaining the derivatives, backwards from the final layer, greatly simplifies computing the derivatives for our complex, fully connected network.

It is also very easy to compute, because we already have our weight matrices and we compute the output of each layer as part of our forward pass!

## The Code

In [None]:
import numpy as np

run = False

class MyBasicNetwork:
    def __init__(self, learning_rate=0.02):

        self.a = learning_rate

        # Define weight matrices with shape (input dim, output dim), matching
        # the row-vector convention used in the derivation above.
        # Use zero-mean Xavier init (good for sigmoid; it has little
        # effect here as we don't use activation functions,
        # but useful for comparison).
        self.W1 = np.random.randn(784, 512) * np.sqrt(1 / 784)
        self.W2 = np.random.randn(512, 512) * np.sqrt(1 / 512)
        self.W3 = np.random.randn(512, 10) * np.sqrt(1 / 512)

    def loss(self, l3, y):
        p = self.softmax(l3)[0][y]
        return -np.log( p + 1e-15)

    def softmax(self, vec):
        C = np.max(vec)
        return np.exp(vec - C) / np.sum(np.exp(vec - C))

    def predict(self, x):
        # forward pass through the network
        x = x.reshape(1, x.size)

        l1 = x @ self.W1
        l2 = l1 @ self.W2
        l3 = l2 @ self.W3

        pred = np.argmax(self.softmax(l3))

        return pred, l1, l2, l3

    def update_weights(self, x, y, verbose=False):

        _, l1, l2, l3 = self.predict(x)

        loss = self.loss(l3, y)

        if verbose:
            print(f"Loss: {loss}")

        # Compute the derivatives
        dloss_dl3 = self.softmax(l3) 
        dloss_dl3[0][y] -= 1

        dloss_dW3 = l2.T @ dloss_dl3       # (512, 10) = (512, 1) x (1, 10)

        dloss_dl2 = dloss_dl3 @ self.W3.T  # (1, 512) = (1, 10) x (10, 512)
        dloss_dW2 = l1.T @ dloss_dl2       # (512, 512) = (512, 1) x (1, 512)

        dloss_dl1 = dloss_dl2 @ self.W2.T  # (1, 512) = (1, 512) x (512, 512)
        dloss_dW1 = x.T @ dloss_dl1        # (784, 512) = (784, 1) x (1, 512)

        self.W3 -= self.a * dloss_dW3
        self.W2 -= self.a * dloss_dW2
        self.W1 -= self.a * dloss_dW1

# We won't run this here because it is very slow,
# but it gives an accuracy of ~73%
if run:
    # Initialise and train the model (no batching)
    model = MyBasicNetwork()


    for i, (X, y) in enumerate(training_data):

        x = np.asarray(X[0, :, :])
        y = int(y)

        model.update_weights(x, y)

        if i % 1000 == 0:
            print(f"Training iteration: sample: {i}")

    # Check the model accuracy
    results = np.empty(len(test_data))

    for i, (X, y) in enumerate(test_data):

        x = np.asarray(X[0, :, :])

        results[i] = model.predict(x)[0] == y

    print(f"Percent Correct: {np.mean(results) * 100}%")

## A Better Model

We will now extend the model to a standard neural network by including a bias term and activation function. These are simple changes that improve the performance of the model.

In our simple version, we essentially had $\mathbf{l} = \mathbf{x}W$ where $W$ is a $(784, 10)$ matrix. One way to interpret this is a matrix encoding $10$ hyperplanes (their normal vectors) in a $784$-dimension space. The dot product between the image vector $\mathbf{x}$ and a column of this $W$ captures the alignment between that hyperplane's normal vector and the image vector.

A simple change we will make is to include a bias term $\mathbf{b}$. We can interpret this as adding an offset to each hyperplane, meaning our data does not need to be centered at zero.

We will make the network nonlinear by passing the output of each node through a nonlinear activation function. While each layer still applies a linear transformation, the composition of these transformations with nonlinear activations makes the overall mapping nonlinear. As a result, the network can represent complex, nonlinear decision boundaries in high-dimensional space.

Another way to think about the benefit of nonlinear activation functions is to think of each node as an individual function, and the composition across layers as the composition of these functions. In the linear case, they are compositions of linear functions, which are still linear. For example, (omitting biases here for convenience) the output of layer 2 node 1 is $\sum_o \sum_i x_i W^1_{i,o} W^2_{o, 1}$, which is still linear. This node is a function mapping a vector to a scalar (it's linear, so a hyperplane); its domain is $\mathbf{x} \in \mathcal{X}$. If we add nonlinear activations, the functions this node can represent are much more flexible, as they are the composition of nonlinear functions. [This](https://www.youtube.com/watch?v=CqOfi41LfDw&t=925s) is a brilliant video on this idea.

Our network is now:

$$
\begin{aligned}
    l^1 &= \phi(\mathbf{x}W^1 + \mathbf{b^1}) \\
    l^2 &= \phi(\mathbf{l^1}W^2 + \mathbf{b^2}) \\
    l^3 &= \mathbf{l^2}W^3 + \mathbf{b^3} \\
\end{aligned}
$$

where $\phi(\cdot)$ is our nonlinear activation function and $\mathbf{b^i}$ is a $(1, n)$ vector of biases, one for each of the $n$ nodes in layer $i$. The pre-activation output $\mathbf{l^{i-1}}W^i+\mathbf{b^i}$ of each layer (taking $\mathbf{l^0} = \mathbf{x}$) is still a vector (e.g. $(1, 512)$ for layers 1 and 2), and we apply the nonlinear activation function element-wise to this vector.
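
As a quick shape check, the forward pass above can be sketched in NumPy as follows. The random stand-in image, the `0.01` weight scale and the seed are assumptions for illustration only; the layer sizes match those used later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(z):
    # Sigmoid, applied element-wise
    return 1 / (1 + np.exp(-z))

# One flattened 28 x 28 image (a random stand-in, not real data)
x = rng.random((1, 784))

# Weights are (input dim, output dim); biases are (1, output dim)
W1, b1 = rng.normal(size=(784, 512)) * 0.01, np.zeros((1, 512))
W2, b2 = rng.normal(size=(512, 512)) * 0.01, np.zeros((1, 512))
W3, b3 = rng.normal(size=(512, 10)) * 0.01, np.zeros((1, 10))

l1 = phi(x @ W1 + b1)    # (1, 512)
l2 = phi(l1 @ W2 + b2)   # (1, 512)
l3 = l2 @ W3 + b3        # (1, 10) logits, no activation on the last layer

print(l1.shape, l2.shape, l3.shape)  # (1, 512) (1, 512) (1, 10)
```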

We use the sigmoid function because it has interesting derivatives, but in general ReLU is preferred in larger, modern networks due to the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). We don't apply the sigmoid function to the last layer, as we feed its output directly to the cross-entropy loss, which already performs a similar mapping with the softmax function.

Now, we will take the derivatives of the loss with respect to the weights *and* biases. 

### Computing the derivatives

#### Derivatives of the sigmoid activation function

The sigmoid function maps the interval $(-\infty, \infty)$ to $(0, 1)$.

$$
\phi(z(t)) = \dfrac{1}{1 + e^{-z(t)}}
$$

Let $z$ be some arbitrary function of the variable $t$. Then, by the chain rule, the derivative is:

$$
\dfrac{d}{dt} (1 + e^{-z(t)})^{-1} =  \dfrac{ e^{-z(t)} }{ (1 + e^{-z(t)})^2 } \dfrac{d}{dt} z(t)
$$

So, we will need to include these additional terms in our derivative computations.
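
A minimal sketch to sanity-check this derivative numerically, taking $z(t) = t$ so that the inner derivative is $1$. The function names and the test grid are assumptions for illustration. It also checks the equivalent form $\phi(z)(1 - \phi(z))$, which is not used in this text but is algebraically identical.

```python
import numpy as np

def phi(z):
    return 1 / (1 + np.exp(-z))

def dphi_dz(z):
    # The form derived above: e^{-z} / (1 + e^{-z})^2
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Central finite-difference approximation to the derivative
numeric = (phi(z + eps) - phi(z - eps)) / (2 * eps)

print(np.allclose(dphi_dz(z), numeric, atol=1e-8))     # True
print(np.allclose(dphi_dz(z), phi(z) * (1 - phi(z))))  # True
print(dphi_dz(0.0))                                    # 0.25
```

The slope is largest at $z = 0$, where it equals $0.25$, so every sigmoid layer scales the gradient passing through it by at most a quarter. This is one way to see where the vanishing gradient problem comes from.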

#### $W^3$ and $\mathbf{b^3}$

As we don't apply the sigmoid to the last layer, this is exactly the same as the simple example:

$$
\dfrac{\partial L}{\partial W^3} = (\mathbf{l^2})^T \dfrac{\partial L}{\partial \mathbf{l^3}}
$$


The derivative of layer 3 with respect to the **bias** is $1$: $\dfrac{\partial l^3_o}{\partial b^3_o} = \dfrac{\partial }{\partial b^3_o} \left( \sum_i l^2_i W^3_{i, o} + b^3_o \right) = 1$. This makes intuitive sense: when we make a small change $\delta b^3_o$, the output of layer 3 changes by exactly $\delta b^3_o$, because the bias is simply added to the output. Therefore:

$$
\dfrac{\partial L}{\partial \mathbf{b^3}} = \dfrac{\partial L}{\partial \mathbf{l^3}}
$$
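
The two layer 3 gradients can be checked against finite differences with a short sketch. The tiny layer sizes, the seed and the helper `numeric_grad` are assumptions for illustration; the expression used for $\partial L / \partial \mathbf{l^3}$ is the softmax output with $1$ subtracted at the correct label, as in the simple example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def numeric_grad(f, A, eps=1e-6):
    # Central finite differences: perturb one entry of A at a time (in place)
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = f()
        A[idx] = old - eps
        down = f()
        A[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

# Tiny stand-in sizes (assumed): 4 layer-2 nodes and 3 classes
l2 = rng.random((1, 4))
W3 = rng.normal(size=(4, 3))
b3 = rng.normal(size=(1, 3))
y = 1

def loss():
    l3 = l2 @ W3 + b3
    return -np.log(softmax(l3)[0, y])

# Analytic gradients from the formulas above
dL_dl3 = softmax(l2 @ W3 + b3)
dL_dl3[0, y] -= 1
dL_dW3 = l2.T @ dL_dl3   # (4, 3) = (4, 1) x (1, 3)
dL_db3 = dL_dl3          # (1, 3)

print(np.allclose(dL_dW3, numeric_grad(loss, W3), atol=1e-6))
print(np.allclose(dL_db3, numeric_grad(loss, b3), atol=1e-6))
```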


#### $W^2$ and $\mathbf{b^2}$

As in the simple case, the change in the loss with a change in the weights of layer 2 is:

$$
\dfrac{\partial L}{\partial W^2} = \dfrac{\partial L}{\partial \mathbf{l^3}}  \dfrac{\partial \mathbf{l^3}}{\partial \mathbf{l^2}}  \dfrac{\partial \mathbf{l^2} }{\partial W^2}
$$


##### $\dfrac{ \partial L }{ \partial \mathbf{l^2}}$

Because layer 3 does not have an activation function (and the derivative of the new bias term goes to zero) this is identical to the simple case:

$$
\dfrac{ \partial L }{ \partial \mathbf{l^2}} = \dfrac{ \partial L }{ \partial \mathbf{l^3}} (W^3)^T
$$


##### $\dfrac{ \partial L }{ \partial W^2 }$

Next, we compute the derivative of the loss with respect to the layer 2 weight matrix and bias. Starting with the weights, and again considering each weight in isolation:

$$
\dfrac{ \partial L }{\partial W^2_{i, o}} = \dfrac{ \partial L }{ \partial l^2_o } \dfrac{ \partial l^2_o }{ \partial W^2_{i, o}}
$$

We calculated the first term above, so looking at the second term:

$$
\dfrac{ \partial l^2_o }{ \partial W^2_{i, o}} = \dfrac{ \partial }{ \partial W^2_{i, o}} \phi( \sum_n l^1_n W^2_{n, o} + b^2_o)
$$

The derivative of the term inside $\phi$ with respect to the weight $W^2_{i,o}$ is $l^1_i$, as every term of the sum with $n \neq i$ has zero derivative (that weight does not connect to those layer 1 nodes). Recalling the form of the derivative of the sigmoid function above, and letting $\hat{l^2_o} = \sum_i l^1_i W^2_{i, o} + b^2_o$:

$$
\dfrac{ \partial l^2_o }{ \partial W^2_{i,o}} = \dfrac{ e^{-\hat{l^2_o}} }{ (1 + e^{-\hat{l^2_o}})^2 } \ l^1_i
$$

We can implement this in matrix form. $\mathbf{\hat{l^2}}$ is a $(1, 512)$ vector, so element-wise multiplication is what we need here (officially called the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), with notation $\circ$). Bringing this all together:

$$
\begin{aligned}
    \dfrac{ \partial L }{\partial W^2} = (\mathbf{l^1})^T   \bigg(  \dfrac{ e^{-\mathbf{\hat{l^2}}} }{ (1 + e^{-\mathbf{\hat{l^2}}})^2 }   \circ \dfrac{ \partial L }{ \partial \mathbf{l^2} } \bigg)
\end{aligned}
$$

$$
(512, 512) = (512, 1) \times (1, 512) \circ (1, 512)
$$

So in fact, the derivative looks very similar to before we had the activation function, except now we have these extra terms from the chain rule.

The bias calculation is simple: if we take the above derivative with respect to $b^2_o$, the inside derivative is $1$:

$$
    \dfrac{ \partial L }{\partial \mathbf{b^2}} =  \dfrac{ e^{-\mathbf{\hat{l^2}}} }{ (1 + e^{-\mathbf{\hat{l^2}}})^2 } \circ \dfrac{ \partial L }{ \partial \mathbf{l^2} }
$$
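
A minimal sketch that checks the layer 2 formulas against finite differences. The tiny layer sizes, the seed and the helper `numeric_grad` are assumptions for illustration, and layer 1's output is replaced by a fixed random vector.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(z):
    return 1 / (1 + np.exp(-z))

def dphi(z):
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def numeric_grad(f, A, eps=1e-6):
    # Central finite differences: perturb one entry of A at a time (in place)
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = f()
        A[idx] = old - eps
        down = f()
        A[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

# Tiny stand-in sizes (assumed): 5 layer-1 nodes, 4 layer-2 nodes, 3 classes
l1 = rng.random((1, 5))
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=(1, 4))
W3, b3 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))
y = 2

def loss():
    l2 = phi(l1 @ W2 + b2)
    l3 = l2 @ W3 + b3
    return -np.log(softmax(l3)[0, y])

# Analytic gradients from the formulas above
l2_hat = l1 @ W2 + b2
l3 = phi(l2_hat) @ W3 + b3
dL_dl3 = softmax(l3)
dL_dl3[0, y] -= 1
dL_dl2 = dL_dl3 @ W3.T             # (1, 4)
dL_db2 = dphi(l2_hat) * dL_dl2     # (1, 4), element-wise product
dL_dW2 = l1.T @ dL_db2             # (5, 4) = (5, 1) x (1, 4)

print(np.allclose(dL_dW2, numeric_grad(loss, W2), atol=1e-6))
print(np.allclose(dL_db2, numeric_grad(loss, b2), atol=1e-6))
```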


#### $W^1$ and $\mathbf{b^1}$

The derivation for layer 1 follows all the same ideas that we explored for layer 2.

$$
\dfrac{\partial L }{ \partial W^1} = \dfrac{\partial L }{ \partial \mathbf{l^3} } \dfrac{\partial \mathbf{l^3} }{ \partial \mathbf{l^2} } \dfrac{\partial \mathbf{l^2} }{ \partial \mathbf{l^1} } \dfrac{\partial \mathbf{l^1} }{ \partial \mathbf{W^1} }
$$

We just need to compute $\dfrac{\partial L }{ \partial \mathbf{l^1} }$ again, this time with the nonlinear activation function to deal with.

#### $\dfrac{\partial L}{ \partial \mathbf{l^1} }$ 

Again, this will be a $(1, 512)$ vector where each element is how the loss changes with a small change in the corresponding node of layer 1. Similar to the simple case, we can compute each element individually using the total derivative rule:

$$
\begin{aligned}
\dfrac{\partial L}{\partial l^1_i }  &= \sum_o \dfrac{\partial L}{\partial l^2_o } \dfrac{\partial l^2_o}{\partial l^1_i } 
\end{aligned}
$$

and note the output of a single layer 2 node is:

$$
l^2_o = \phi \left( \sum_n l^1_n W^2_{n, o} + b^2_o \right)
$$

For the derivative of the term inside the activation function, this is exactly the same as the simple case. If we take the derivative of $\sum_n l^1_n W^2_{n, o} + b^2_o$ with respect to a specific input node $l^1_i$, all terms are zero except the one where $n = i$. Therefore, the derivative with respect to $l^1_i$ is $W^2_{i,o}$. 

Again, we will introduce the notation $\mathbf{\hat{l^2}} = \mathbf{l^1}W^2 + \mathbf{b^2}$, i.e. $\mathbf{\hat{l^2}}$ is the output of layer 2 before we put it through the activation function. 

Putting this together with the derivative of the activation function, as above:

$$
\begin{aligned}
\dfrac{\partial L}{\partial l^1_i } &= \sum_o \dfrac{\partial L}{\partial l^2_o } \dfrac{\partial l^2_o}{\partial l^1_i }  \\
 &= \sum_o \dfrac{\partial L}{\partial l^2_o }  \dfrac{\partial }{\partial l^1_i } \phi({\hat{l^2_o}} ) \\
&= \sum_o \dfrac{\partial L}{\partial l^2_o } \dfrac{ e^{-\hat{l^2_o}} }{ (1 + e^{-\hat{l^2_o}})^2 } W^2_{i, o}
\end{aligned}
$$


So we can represent this in matrix form as:

$$
\dfrac{ \partial L }{ \partial \mathbf{l^1} } = \bigg( \dfrac{ \partial L }{ \partial \mathbf{l^2} }  \circ  \dfrac{ e^{-\mathbf{\hat{l^2}}} }{ (1 + e^{-\mathbf{\hat{l^2}}})^2 } \bigg) (W^2)^T
$$
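
To see that the matrix form is just the element-wise sum written compactly, the two can be compared directly. This is a sketch with assumed tiny sizes and random stand-in values for the upstream gradient and the pre-activation.

```python
import numpy as np

rng = np.random.default_rng(2)

def dphi(z):
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

n1, n2 = 5, 4                        # nodes in layer 1 and layer 2 (assumed sizes)
dL_dl2 = rng.normal(size=(1, n2))    # stand-in for the upstream gradient
l2_hat = rng.normal(size=(1, n2))    # stand-in for the layer 2 pre-activation
W2 = rng.normal(size=(n1, n2))

# Element-wise form: for each layer 1 node i, sum over the layer 2 nodes o
loop = np.zeros((1, n1))
for i in range(n1):
    for o in range(n2):
        loop[0, i] += dL_dl2[0, o] * dphi(l2_hat[0, o]) * W2[i, o]

# Matrix form: Hadamard product, then multiply by the transposed weights
matrix = (dL_dl2 * dphi(l2_hat)) @ W2.T   # (1, 5) = (1, 4) * (1, 4) x (4, 5)

print(np.allclose(loop, matrix))  # True
```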


#### So putting it all together for $W^1$ and $\mathbf{b^1}$

Again, we use $\hat{\mathbf{l^1}} = \mathbf{x}W^1 + \mathbf{b^1}$ for the output of layer 1 before inputting to the activation function. 

So as in the simple case, we will go through these piece-by-piece:

$$
\begin{aligned}
    \mathbf{g_1} &= \dfrac{\partial L }{ \partial \mathbf{l^2} } = \dfrac{\partial L }{ \partial \mathbf{l^3} } \dfrac{\partial \mathbf{l^3} }{ \partial \mathbf{l^2} }  \\
    &=  \dfrac{ \partial L }{ \partial \mathbf{l^3} } (W^3)^T
\end{aligned}
$$

$(1, 512) = (1, 10) \times (10, 512)$ 

$$
\begin{aligned}
\mathbf{g_2} &= \dfrac{ \partial L }{ \partial \mathbf{l^1 }} \\ 
&= \mathbf{g_1} \dfrac{\partial \mathbf{l^2} }{ \partial \mathbf{l^1} } \\
&= \bigg( \mathbf{g_1} \circ \dfrac{ e^{-\hat{\mathbf{l^2}} } }{ (1 + e^{-\hat{\mathbf{l^2}}})^2 } \bigg) (W^2)^T
\end{aligned} 
$$

$(1, 512) = (1, 512) \circ (1, 512) \times (512, 512)$


$$
\begin{aligned}
\dfrac{\partial L}{\partial W^1} &= \mathbf{g_2} \dfrac{\partial \mathbf{l^1} }{ \partial W^1 }\\
&= \mathbf{x}^T \bigg( \dfrac{ e^{-\hat{\mathbf{l^1}} } }{ (1 + e^{-\hat{\mathbf{l^1}} })^2 } \circ \mathbf{g_2} \bigg)
\end{aligned}
$$

$$
(784, 512) = (784, 1) \times (1, 512) \circ (1, 512)
$$

When taking the derivative with respect to the bias, all steps are the same except that in the last step $\mathbf{x}^T$ is replaced by $1$ (to see this, compare the derivatives of $\hat{l^1_o} = \sum_i x_i W^1_{i, o} + b^1_o$ with respect to $b^1_o$ and $W^1_{i,o}$):

$$
\dfrac{ \partial L}{ \partial \mathbf{b^1}} = \dfrac{ e^{-\hat{\mathbf{l^1}} } }{ (1 + e^{-\hat{\mathbf{l^1}} })^2 } \circ \mathbf{g_2}
$$
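
The full chain for layer 1 can be checked end to end on a tiny network. This is a sketch only: the sizes, the seed and the helper `numeric_grad` are assumptions for illustration, while `g1` and `g2` follow the piece-by-piece definitions above.

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(z):
    return 1 / (1 + np.exp(-z))

def dphi(z):
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def numeric_grad(f, A, eps=1e-6):
    # Central finite differences: perturb one entry of A at a time (in place)
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = f()
        A[idx] = old - eps
        down = f()
        A[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

# Tiny stand-in sizes (assumed): 6 inputs, 5 and 4 hidden nodes, 3 classes
x = rng.random((1, 6))
W1, b1 = rng.normal(size=(6, 5)), rng.normal(size=(1, 5))
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=(1, 4))
W3, b3 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))
y = 0

def loss():
    l1 = phi(x @ W1 + b1)
    l2 = phi(l1 @ W2 + b2)
    l3 = l2 @ W3 + b3
    return -np.log(softmax(l3)[0, y])

# Forward pass, keeping the pre-activation values
l1_hat = x @ W1 + b1
l1 = phi(l1_hat)
l2_hat = l1 @ W2 + b2
l2 = phi(l2_hat)
l3 = l2 @ W3 + b3

# Backward pass, piece by piece
dL_dl3 = softmax(l3)
dL_dl3[0, y] -= 1
g1 = dL_dl3 @ W3.T                 # dL/dl2, shape (1, 4)
g2 = (g1 * dphi(l2_hat)) @ W2.T    # dL/dl1, shape (1, 5)
dL_db1 = dphi(l1_hat) * g2         # (1, 5)
dL_dW1 = x.T @ dL_db1              # (6, 5) = (6, 1) x (1, 5)

print(np.allclose(dL_dW1, numeric_grad(loss, W1), atol=1e-6))
print(np.allclose(dL_db1, numeric_grad(loss, b1), atol=1e-6))
```

A check like this is cheap on a small network and catches most indexing or transpose mistakes before training on the full-size model.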


This network is implemented below, and we see the accuracy in this case has increased by 10 percentage points, from 73% to 83%.


In [None]:

run = False

class MyBetterNetwork:
    def __init__(self, learning_rate=0.02):

        self.a = learning_rate

        # Define weight matrices with shape (input dim, output dim),
        # so that the forward pass is x @ W.
        # Use zero-mean Xavier-style init (good for sigmoid).
        # This makes a huge difference vs uniform.
        self.W1 = np.random.randn(784, 512) * np.sqrt(1 / 784)
        self.W2 = np.random.randn(512, 512) * np.sqrt(1 / 512)
        self.W3 = np.random.randn(512, 10) * np.sqrt(1 / 512)

        self.b1 = np.zeros((1, 512))
        self.b2 = np.zeros((1, 512))
        self.b3 = np.zeros((1, 10))

    def loss(self, l3, y):
        p = self.softmax(l3)[0][y]
        return -np.log( p + 1e-15)

    def softmax(self, vec):
        C = np.max(vec)
        return np.exp(vec - C) / np.sum(np.exp(vec - C))

    def predict(self, x):
        # forward pass through the network
        x = x.reshape(1, x.size)

        l1_hat = x @ self.W1 + self.b1
        l1 = self.phi(l1_hat)

        l2_hat = l1 @ self.W2 + self.b2
        l2 = self.phi(l2_hat)

        l3 = l2 @ self.W3 + self.b3

        pred = np.argmax(self.softmax(l3))

        return pred, l1_hat, l1, l2_hat, l2, l3

    def phi(self, vec):
        return 1 / (1 + np.exp(-vec))

    def dphi_dvec(self, vec):
        return np.exp(-vec) / (1 + np.exp(-vec))**2

    def update_weights(self, x, y, verbose=False):

        x = x.reshape(1, x.size)

        _, l1_hat, l1, l2_hat, l2, l3 = self.predict(x)

        loss = self.loss(l3, y)

        if verbose:
            print(f"Loss: {loss}")

        # Compute the derivatives
        dloss_dl3 = self.softmax(l3)
        dloss_dl3[0][y] -= 1

        dloss_dW3 = l2.T @ dloss_dl3
        dloss_db3 = dloss_dl3

        dloss_dl2 = dloss_dl3 @ self.W3.T                               # (1, 512) = (1, 10) x (10, 512)
        dloss_dW2 = l1.T @ (self.dphi_dvec(l2_hat) * dloss_dl2)         # (512, 512) = (512, 1) x (1, 512) * (1, 512)
        dloss_db2 = self.dphi_dvec(l2_hat) * dloss_dl2                  # (1, 512) = (1, 512) * (1, 512)

        dloss_dl1 = (dloss_dl2 * self.dphi_dvec(l2_hat)) @ self.W2.T    # (1, 512) = (1, 512) * (1, 512) x (512, 512)
        dloss_dW1 = x.T @ (self.dphi_dvec(l1_hat) * dloss_dl1)          # (784, 512) = (784, 1) x (1, 512) * (1, 512)
        dloss_db1 = self.dphi_dvec(l1_hat) * dloss_dl1                  # (1, 512) = (1, 512) * (1, 512)

        self.W3 -= self.a * dloss_dW3
        self.W2 -= self.a * dloss_dW2
        self.W1 -= self.a * dloss_dW1

        self.b3 -= self.a * dloss_db3
        self.b2 -= self.a * dloss_db2
        self.b1 -= self.a * dloss_db1

# We won't run this here because it is very slow,
# but it gives an accuracy of ~83%
if run:
        
    # Initialise and train the model (no batching)
    model = MyBetterNetwork()

    for i, (X, y) in enumerate(training_data):

        x = np.asarray(X[0, :, :])
        y = int(y)

        model.update_weights(x, y, verbose=False)

        if i % 1000 == 0:
            print(f"Training iteration: sample: {i}")

    # Check the model accuracy
    results = np.empty(len(test_data))

    for i, (X, y) in enumerate(test_data):

        x = np.asarray(X[0, :, :])

        results[i] = model.predict(x)[0] == y

    print(f"Percent Correct: {np.mean(results) * 100}%")

**Appendix 1**

The cross entropy loss is:

$$
\begin{aligned}
    L &= -\log \dfrac{ \exp{ l_{3, y} } }{ \sum_k \exp{ l_{3, k} } } \\
    &= -\bigg[ \log \exp{ l_{3, y} } - \log \sum_k \exp{ l_{3, k} } \bigg] \\
    &=  \log \sum_k \exp{ l_{3, k} } - l_{3, y}
\end{aligned}
$$

(by the log laws), i.e. we take the layer 3 logit that matches the correct label $y$, normalise it to a probability with the softmax function and take the negative log.

Let's start by taking the derivative with respect to $l_{3, y}$, where this is shorthand for $l_{3, i}$ with $i=y$, i.e.
the layer 3 logit for the label that is correct for this image. We are asking: how does a small change in this logit affect the loss?

$$
\begin{aligned}
\dfrac{ \partial L }{ \partial l_{3, y} } &= \dfrac{ \partial }{ \partial l_{3, y} } \left( \log \sum_k \exp{ l_{3, k} } - l_{3, y} \right) \\
&=  \dfrac{ \partial }{ \partial l_{3, y} } \log \sum_k \exp{ l_{3, k} } - \dfrac{ \partial }{ \partial l_{3, y} }  l_{3, y}  \\
&= \dfrac{1}{ \sum_k \exp{ l_{3, k} } }  \dfrac{ \partial }{ \partial l_{3, y} } \sum_k \exp{ l_{3, k} } - 1
\end{aligned}
$$

(by the chain rule and the derivative of $\log x$). Note that the last term, $-1$, is instead $0$ when we differentiate with respect to a logit $l_{3, i}$ with $i \neq y$ (because $l_{3, y}$ is then treated as a constant).

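We see that in the sum, the derivative of $\exp{ l_{3, k} }$ w.r.t. $l_{3, i}$ is $\exp{ l_{3, i} }$ when $k = i$ and $0$ otherwise (as the other logits are treated as constants).
So whichever dimension $i$ of $\mathbf{l_3}$ we differentiate with respect to, we get $\text{softmax}(\mathbf{l_3})_i$ as the first term, but only when $i = y$ do we get $-1$ in the second term. That is, $\dfrac{ \partial L }{ \partial l_{3, i} }$ is $\text{softmax}(\mathbf{l_3})_i - 1$ when $i = y$ and $\text{softmax}(\mathbf{l_3})_i$ otherwise.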
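
A minimal numerical check of this result, using the log-sum-exp form of the loss from the derivation. The example logits, the label and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def cross_entropy(l3, y):
    # Log-sum-exp minus the correct-label logit (last line of the derivation)
    return np.log(np.sum(np.exp(l3))) - l3[y]

l3 = np.array([2.0, -1.0, 0.5, 0.1])   # example logits (assumed values)
y = 2                                  # example correct label

# Analytic gradient: softmax, with 1 subtracted at the correct label
analytic = softmax(l3)
analytic[y] -= 1

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(l3)
for i in range(l3.size):
    step = np.zeros_like(l3)
    step[i] = eps
    numeric[i] = (cross_entropy(l3 + step, y) - cross_entropy(l3 - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
print(np.isclose(analytic.sum(), 0.0))            # True
```

The gradient entries sum to zero because the softmax probabilities sum to $1$ and exactly one entry has $1$ subtracted.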

**Appendix 2**

In multivariate calculus, the total derivative rule captures how a function changes with respect to a variable that affects the function in multiple ways (through multiple intermediary functions).

For example, let's say we have the variable $t$ and let:

$$
w = f(x(t), y(t))
$$

The total derivative rule tells us how the output $w$ changes with a small change in $t$. Intuitively, it is the sum of how $\delta t$ changes $w$ through $x(t)$ and how $\delta t$ changes $w$ through $y(t)$:

$$
\dfrac{dw}{dt} = \dfrac{\partial w}{ \partial x(t)} \dfrac{\partial x(t)}{ \partial t} + \dfrac{\partial w}{ \partial y(t)} \dfrac{\partial y(t)}{ \partial t} 
$$

Note this is exactly analogous to our setup with the layers and weights. A small change in a node in layer 2 will affect the loss through every node in layer 3. Let's look at node 1 from layer 2 ($l^2_1$) as an example. Here the layer 3 nodes are functions taking $l^2_1$ (which plays the role of $t$ in the above example) as an input:

$$
\text{loss} = L( \ l^3_1(l^2_1), \ l^3_2(l^2_1), \ ..., \ l^3_{10}(l^2_1) \ )
$$ 

$$
\dfrac{\partial L}{ \partial l^2_1 } = \dfrac{\partial L}{ \partial l^3_1 } \dfrac{ \partial l^3_1}{ \partial l^2_1 } + \dfrac{\partial L}{ \partial l^3_2 } \dfrac{\partial l^3_2 }{ \partial l^2_1 } + ... + \dfrac{\partial L}{ \partial l^3_{10} } \dfrac{\partial l^3_{10} }{ \partial l^2_1 }
$$

This is exactly how we deal with these derivatives in the main text.
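
As a small worked example of the total derivative rule, the sketch below picks concrete functions for $x(t)$, $y(t)$ and $f$ and compares the rule with a finite-difference estimate. The specific functions and the value of $t$ are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical example functions
def x(t):
    return np.sin(t)

def y(t):
    return t ** 2

def f(x_val, y_val):
    return x_val * y_val + y_val ** 2

def w(t):
    return f(x(t), y(t))

t = 0.7

# Partials of f: df/dx = y and df/dy = x + 2y
# Inner derivatives: dx/dt = cos(t) and dy/dt = 2t
total = y(t) * np.cos(t) + (x(t) + 2 * y(t)) * 2 * t

# Central finite-difference estimate of dw/dt
eps = 1e-6
numeric = (w(t + eps) - w(t - eps)) / (2 * eps)

print(np.isclose(total, numeric))  # True
```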