# Introduction to Neural Networks

Neural networks are a very flexible class of models for learning complex, high-dimensional functions. First, we will apply the model in PyTorch, and then implement it by hand, using multivariable calculus to derive the parameter updates.

In this example, we will apply a neural network to an image classification problem. We have a dataset containing images of clothes, with each image having a label describing the item of clothing in the image. Each $28 \times 28$-pixel image is represented as a length $784$ vector $\mathbf{x}$. We use bold font to indicate a vector. There are 10 possible items of clothing (e.g. t-shirt, sneaker) in our dataset, so $y$ is an integer in the range $[0, 9]$. Together, our sample space is pairings of images and labels, $(\mathbf{x}_i, y_i) \sim (\mathcal{X}, \mathcal{Y})$ (i.e. a single image-label pair indexed by $i$ can be drawn from the sample space of all pairs of images and labels).

We want to learn the conditional distribution $p(y | \mathbf{x})$, i.e. the probability of the label $y$ given an image $\mathbf{x}$. For example, what is the probability that the image $\mathbf{x}$ is of a t-shirt? To do this, we learn a function $f : \mathbf{x} \mapsto \mathbf{y}$ that takes a vector of length $784$ and outputs a vector of length $10$, with each element of the output vector assigning some weight related to the probability of image $\mathbf{x}$ having a particular label $y$. These unnormalised weights output by the model are called 'logits'.
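To make the idea of unnormalised logits concrete, here is a minimal sketch using only the Python standard library. The three logit values are made up for illustration (they do not come from any model in this document), and the helper name `softmax` is our own.

```python
import math

def softmax(logits):
    """Convert a list of unnormalised logits into probabilities."""
    m = max(logits)  # subtracting the max does not change the result, but avoids overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 3-class problem
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

print(probs)       # three positive numbers
print(sum(probs))  # sums to 1 (up to floating point error)
```

The logits can be any real numbers, whereas the softmax output is positive and sums to one, so it can be read as a probability for each label.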

<img src="nn-image-1.png" width="400">

In this example, we will first follow the [PyTorch](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) 'Quickstart' tutorial to train a simple model for this classification task. Then, we will get a firmer grip on how these models work by implementing the model ourselves in NumPy, which requires deriving the gradients needed for model training by hand.

## Coding up a Neural Network in PyTorch

This section exactly follows [the PyTorch introduction](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html).

### The data

We will use the FashionMNIST dataset, containing $28 \times 28$ greyscale images of clothing with 10 possible labels. In total there are $60,000$ images in the training set and $10,000$ images in the test set.

<img src="FashionMNIST.png" width="500">

In [50]:
import torch
from torch import nn
# DataLoader wraps an iterable around a Dataset, which stores samples and their corresponding labels.
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",  # root directory
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)


In [51]:
batch_size = 64

# Create data loaders
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break


Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64


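This makes sense: the shape of our batch is (batch size, number of channels, image height, image width). There is a single channel because the images are greyscale rather than RGB. For each image in the batch we have a single integer label (e.g. the index for 't-shirt').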
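The model below flattens each image into a length 784 vector before the first linear layer. As a rough NumPy sketch of that reshaping (the batch here is random numbers standing in for real images, and the variable names are our own):

```python
import numpy as np

# Stand-in for one batch: 64 greyscale images of 28 x 28 pixels (random values, not real data)
rng = np.random.default_rng(0)
X = rng.random((64, 1, 28, 28))

# Keep the batch dimension and flatten channel, height and width into one axis
X_flat = X.reshape(X.shape[0], -1)

print(X.shape)       # (64, 1, 28, 28)
print(X_flat.shape)  # (64, 784)
```

Each row of `X_flat` is one image laid out pixel by pixel, which is the vector $\mathbf{x}$ used in the rest of this document.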

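When training, the batch size is the number of samples passed through the model before its parameters are updated (here, 64).

An epoch is one full pass through the training dataset.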

### The Model

The appealing thing is that we only need to define the inputs, the outputs, and an architecture that can be moulded into the function of interest. This set-up is very flexible: the architecture is treated as a black box (for now), so we don't need to think about what is going on inside. Instead, we 'shape' it through iterative training of the weights on our labelled data. Once we have a good mould, we are done!


In [52]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device}")

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()

        self.flatten = nn.Flatten()

        # nn.Linear computes yᵢ = ∑ⱼ xⱼ Wᵢⱼ + bᵢ, so together y = xW^T + b,
        # where W is a (output_features, input_features) set of weights and b is a vector of biases,
        # so we can easily control the output shape. Layers are fully connected.
        # The non-linearity φ is applied separately by nn.ReLU.
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)




Using cpu
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
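As a quick sanity check on the architecture printed above, we can count its trainable parameters by hand. This is a small standard-library sketch; the layer sizes are read off the printed model, and the variable names are our own.

```python
# (in_features, out_features) for each Linear layer in the printed model
layer_shapes = [(784, 512), (512, 512), (512, 10)]

# Each layer has in * out weights plus one bias per output unit
counts = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
total = sum(counts)

print(counts)  # [401920, 262656, 5130]
print(total)   # 669706
```

Almost all of the parameters sit in the first two weight matrices; the ReLU and Flatten layers have no parameters at all.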



#### What is the Model Learning?

Let's look again at what the model is learning. To fully characterise the dataset, we would need the joint distribution between the images and the labels, $p(\mathbf{x}, y)$. TODO: draw this out in a simple case.

The joint distribution also gives us information about the relative frequencies of the labels $y$ and the images $\mathbf{x}$. We could recover $p(y | \mathbf{x})$ from it by dividing by $p(\mathbf{x})$, i.e. $p(y | \mathbf{x}) = p(\mathbf{x}, y) / p(\mathbf{x})$, where $p(\mathbf{x})$ is obtained by marginalising out $y$.

However, learning the joint distribution is much more complicated, and it is the aim of generative models.

In our case, we take a shortcut and learn $p(y | \mathbf{x})$ directly. Roughly, the model partitions the $28 \times 28 = 784$-dimensional input space into 10 regions, one per class. The decision boundaries between the regions are hypersurfaces in this space. Of course, this tells us nothing about $p(\mathbf{x}, y)$, because the model never learns $p(\mathbf{x})$, but it is a direct way to obtain $p(y | \mathbf{x})$.

TODO: add a simple drawing. In reality, the boundary is not a simple plane but a manifold in a much higher-dimensional space.

TODO: it is still not clear exactly what is being learned. Does the network effectively funnel an unseen $\mathbf{x}$ towards an already-seen $\mathbf{x}$ and then map it to $p(y | \mathbf{x})$?

https://arxiv.org/abs/1311.2901
http://neuralnetworksanddeeplearning.com/
https://cs231n.github.io/linear-classify/
https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
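The relationship between the joint and the conditional distribution discussed in this section can be illustrated with a tiny discrete example. The table below is entirely hypothetical (three possible inputs and two labels), chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows are 3 possible inputs x, columns are 2 labels y
p_xy = np.array([[0.10, 0.30],
                 [0.25, 0.05],
                 [0.20, 0.10]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginalise out y to get p(x)
p_y = p_xy.sum(axis=0)                 # marginalise out x to get p(y)
p_y_given_x = p_xy / p_x               # conditional p(y | x), one row per input x

print(p_y_given_x)  # every row sums to 1
print(p_y)          # the label frequencies, which p(y | x) alone does not contain
```

A discriminative model only ever learns the rows of `p_y_given_x`; the information in `p_x` is discarded, which is why the joint distribution cannot be recovered from it.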

In [53]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# We can see we have Weights, Biases, Weights, Biases, Weights, Biases (3 layers)
print(f"The model parameters")
for name, param in model.named_parameters():
      print(
          f"Name: {name}\n"
          f"Type: {type(param)}\n"
          f"Size: {param.size()}\n"
          f"e.g. {param[:5]}\n\n")

The model parameters
Name: linear_relu_stack.0.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512, 784])
e.g. tensor([[ 0.0242, -0.0248, -0.0166,  ..., -0.0256,  0.0325, -0.0028],
        [ 0.0264,  0.0106, -0.0349,  ..., -0.0113, -0.0121,  0.0221],
        [ 0.0038, -0.0243,  0.0197,  ..., -0.0243, -0.0306,  0.0113],
        [-0.0337, -0.0340, -0.0342,  ...,  0.0348, -0.0187, -0.0099],
        [ 0.0065,  0.0300,  0.0117,  ...,  0.0249, -0.0121,  0.0218]],
       grad_fn=<SliceBackward0>)


Name: linear_relu_stack.0.bias
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512])
e.g. tensor([-0.0340, -0.0322, -0.0116,  0.0172, -0.0129], grad_fn=<SliceBackward0>)


Name: linear_relu_stack.2.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512, 512])
e.g. tensor([[ 0.0256, -0.0107, -0.0299,  ..., -0.0403,  0.0318, -0.0249],
        [ 0.0342, -0.0205,  0.0428,  ...,  0.0127,  0.0360, -0.0173],
        [-0.0149,  0.0350, -0.0221,  ..., 

### Training and Evaluating the Model

In [54]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train(mode=True)  # put into 'training mode'

    for batch, (X, y) in enumerate(dataloader):

        X, y = X.to(device), y.to(device)

        # pred is a (64, 10) tensor of logits for this batch
        # y is a (64,) tensor of class indices
        # Cross entropy loss https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
        pred = model(X)
        loss = loss_fn(pred, y)

        loss.backward()
        optimizer.step()  # perform one step θt <- f(θ_{t-1})
        optimizer.zero_grad()  # zero the accumulated gradients, ready for the next step

        if batch % 100 == 0:
            loss = loss.item()  # Note `loss` is a 0-dim tensor, we use `item()` to get the Python scalar
            current = (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

train(train_dataloader, model, loss_fn, optimizer)

loss: 2.311416  [   64/60000]
loss: 2.297227  [ 6464/60000]
loss: 2.284765  [12864/60000]
loss: 2.277797  [19264/60000]
loss: 2.245504  [25664/60000]
loss: 2.223144  [32064/60000]
loss: 2.229235  [38464/60000]
loss: 2.200769  [44864/60000]
loss: 2.204567  [51264/60000]
loss: 2.150434  [57664/60000]


In [55]:
def test(dataloader, model, loss_fn):
    """Evaluate the model on a dataloader and print the accuracy and average loss."""
    size = len(dataloader.dataset)
    num_batches = len(dataloader)

    model.eval()  # Go from train to eval mode

    test_loss, correct = 0, 0

    with torch.no_grad():  # this turns off gradient tracking, which saves memory and time

        for batch, (X, y) in enumerate(dataloader):

            X, y = X.to(device), y.to(device)

            pred = model(X)
            # pred_i = torch.argmax(torch.exp(pred) / torch.sum(torch.exp(pred)), axis=1)
            pred_i = pred.argmax(1)  # softmax is monotonic, so the argmax of the logits equals the argmax of the probabilities
            correct += (pred_i == y).type(torch.float).sum().item()
            test_loss += loss_fn(pred, y).item()

        test_loss /= num_batches
        correct /= size
        print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

print("Epoch 1")
test(test_dataloader, model, loss_fn)

for i in range(5):
    print(f"Epoch {i + 2}")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)


Epoch 1
Test Error: 
 Accuracy: 36.9%, Avg loss: 2.159129 

Epoch 2
loss: 2.175913  [   64/60000]
loss: 2.156889  [ 6464/60000]
loss: 2.109339  [12864/60000]
loss: 2.123337  [19264/60000]
loss: 2.051893  [25664/60000]
loss: 2.000705  [32064/60000]
loss: 2.027452  [38464/60000]
loss: 1.956121  [44864/60000]
loss: 1.970606  [51264/60000]
loss: 1.868530  [57664/60000]
Test Error: 
 Accuracy: 49.2%, Avg loss: 1.889209 

Epoch 3
loss: 1.925272  [   64/60000]
loss: 1.883625  [ 6464/60000]
loss: 1.786613  [12864/60000]
loss: 1.827395  [19264/60000]
loss: 1.697302  [25664/60000]
loss: 1.655534  [32064/60000]
loss: 1.676771  [38464/60000]
loss: 1.589110  [44864/60000]
loss: 1.623531  [51264/60000]
loss: 1.493784  [57664/60000]
Test Error: 
 Accuracy: 62.1%, Avg loss: 1.532667 

Epoch 4
loss: 1.596773  [   64/60000]
loss: 1.552969  [ 6464/60000]
loss: 1.422226  [12864/60000]
loss: 1.491803  [19264/60000]
loss: 1.362407  [25664/60000]
loss: 1.363337  [32064/60000]
loss: 1.370365  [38464/60000]
lo

Note on generative vs discriminative models: here we only learn $p(y | \mathbf{x})$, not $p(\mathbf{x}, y)$. We simply perform maximum likelihood on $p(y | \mathbf{x})$.



## Training a Neural Network by Hand - A 'Simple' Example

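PyTorch computes the gradients for us with automatic differentiation. Here we derive and implement them by hand, which gets very fiddly.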

### The Model

Let $l_1$, $l_2$ and $l_3$ be vector-valued functions representing our layers. $l_1$ takes our length $784$ row vector $\mathbf{x}$ and returns a length $512$ vector. $l_2$ takes as input, and outputs, a length $512$ vector. Finally, $l_3$ takes a length $512$ vector as input and outputs a length $10$ vector. To make the mathematics clear, we will first use a very simple model that has no biases and no non-linear activation functions.

\begin{aligned}
    &l_1(\mathbf{x}) = \mathbf{x}W_1 \ \ \ \ \text{(1, 784) x (784, 512) = (1, 512)} \\
    &l_2(\mathbf{l_1}) = \mathbf{l_1}W_2 \ \ \ \ \text{(1, 512) x (512, 512) = (1, 512)} \\
    &l_3(\mathbf{l_2}) = \mathbf{l_2}W_3 \ \ \ \ \text{(1, 512) x (512, 10) = (1, 10)} \\
\end{aligned}

The $W$ are our matrices of weights. PyTorch stores these with shape (outputs, inputs); here we instead take them as shape (inputs, outputs), so $W_1$ is (784, 512), $W_2$ is (512, 512) and $W_3$ is (512, 10). We index individual weights as $W_{3, ij}$ where $i$ is the $i$th neuron in the input layer, and $j$ is the $j$th neuron in the output layer. All vectors are row vectors. This is a little non-standard but this notation makes the derivations below much simpler. The shapes of these vector-matrix multiplications are shown on the right. We go from a length 784 vector $\mathbf{x}$ to a length 10 vector $\mathbf{l_3}$ as expected.

Here, we already run into some notational difficulty. Our layers are functions, which return a vector when evaluated at some input. We then feed this vector into the next function. So each layer is both a function and, once evaluated, a vector. We use bold font (e.g. $\mathbf{l_1}$) to indicate the evaluated vector, and drop the bracket notation as it is too verbose. In the derivations below we always work with the evaluated vector, not the function symbol.

Finally, we take the (1, 10) shape vector of logits and use these to compute our loss function, the cross-entropy loss:

$$
L(\mathbf{l_3}, y) = -\log \dfrac{ \exp{ l_{3, y} } }{ \sum_k \exp{ l_{3, k} } }
$$

Here $l_{3, k}$ is the $k$-indexed element of our vector of logits from layer 3, and $y$ is the index of the correct label for this image. You may recognise the term after the $\log$ as the [softmax function](https://en.wikipedia.org/wiki/Softmax_function#:~:text=The%20softmax%20function%2C%20also%20known,used%20in%20multinomial%20logistic%20regression), which normalises the logits to probabilities. Therefore, we are computing the probability of the input image $\mathbf{x}$ being of label $y$ according to our model.

We are aiming to maximise the log probability associated with the correct label, $y$. This makes intuitive sense: we have an image $\mathbf{x}$ and are computing a set of probabilities, one for each of the 10 labels. Of course, we want to maximise the probability we assign to the correct label $y$, as we know in this supervised context that it is the true label for this image. Here we will equivalently minimise the negative log probability.
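The cross-entropy loss defined above can be sketched in a few lines of NumPy. The logits are hypothetical values for a toy 3-class problem, and the function name `cross_entropy` is our own. The implementation uses the identity $L = \log \sum_k \exp l_{3,k} - l_{3,y}$ and shifts the logits by their maximum, which leaves the loss unchanged but avoids overflow.

```python
import numpy as np

def cross_entropy(logits, y):
    """Negative log softmax probability of the correct class index y."""
    shifted = logits - np.max(logits)  # shifting all logits equally does not change the softmax
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return log_sum_exp - shifted[y]

logits = np.array([1.0, 3.0, 0.5])  # hypothetical logits for 3 classes

loss_if_label_1 = cross_entropy(logits, y=1)  # correct label has the largest logit
loss_if_label_2 = cross_entropy(logits, y=2)  # correct label has the smallest logit

print(loss_if_label_1, loss_if_label_2)
```

The loss is small when the largest logit belongs to the correct label and large when it does not, which is exactly the behaviour we want to penalise during training.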



### Predicting a label from an image - the forward pass

For this simple network, it is easy to use our model to take an image $\mathbf{x}$ (input), map it to our output of 10 logits and use these to predict a label $\hat{y}$. First we apply the network to the input data:

\begin{aligned}
\mathbf{l_3} &= l_3(l_2(l_1(\mathbf{x}))) \\
             &= ((\mathbf{x}W_1) W_2) W_3 \\
             &=  \mathbf{x}W_1 W_2 W_3 \\
\end{aligned}

We then take this output, and the predicted label is the one that maximises the probability:

$$
\hat{y} = \arg\max_{y} \, \mathrm{softmax}(\mathbf{l}_3)_y
$$

i.e. the predicted label is the one that is assigned the highest probability according to our output layer (we take advantage of the fact that the possible labels, $\hat{y}$, are defined as indices, so we can use them to index into the layer 3 logits).
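The forward pass and prediction described above can be sketched in NumPy as follows. The input and weights are random stand-ins (not trained values), the weights use the (inputs, outputs) shape convention of this section, and the helper `softmax` is our own.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # shift for numerical stability
    return e / np.sum(e)

rng = np.random.default_rng(0)
x = rng.random((1, 784))               # stand-in image as a row vector
W1 = rng.normal(0, 0.01, (784, 512))   # random, untrained weights
W2 = rng.normal(0, 0.01, (512, 512))
W3 = rng.normal(0, 0.01, (512, 10))

l1 = x @ W1    # (1, 784) x (784, 512) = (1, 512)
l2 = l1 @ W2   # (1, 512) x (512, 512) = (1, 512)
l3 = l2 @ W3   # (1, 512) x (512, 10)  = (1, 10)

probs = softmax(l3)
y_hat = int(np.argmax(probs))  # predicted label index

print(l3.shape, y_hat)
```

Because this simple network has no non-linearities, the three matrices could be multiplied together once into a single (784, 10) matrix that gives the same logits. This is one reason real networks place a non-linear function between layers.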

### Backpropagation in our simple network

TODO: go through optimisation here and the whole point of calculating these derivatives is to update the weights

Next, we need to know how to update our weight matrices ($W_3$, $W_2$, $W_1$) during training. For example, take $W_{3, 12}$, which connects the first neuron in layer 2 to the second neuron in layer 3. How does a small change in this weight change the output of our loss function? A small change in $W_{3, 12}$ will result in a small change in the second neuron in layer 3 ($l_{3, 2}$), which will directly affect the loss function $L(\mathbf{l_3})$. This 'chain' of dependencies is captured by the chain rule:

\begin{aligned}
&\dfrac{\partial L(\mathbf{l_3})}{\partial W_3} = \dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}  \dfrac{\partial \mathbf{l_3} }{\partial W_3}
\end{aligned}

i.e. the change in the loss function due to a small change in a weight equals the effect of a small change in the weight on layer 3, and then effect of the corresponding change in layer 3 on the loss function. We can express the derivatives of other weight matrices similarly:

\begin{aligned}
&\dfrac{\partial L(\mathbf{l_3})}{\partial W_2} = \dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}  \dfrac{\partial \mathbf{l_3}}{\partial \mathbf{l_2}}  \dfrac{\partial \mathbf{l_2} }{\partial W_2}  \\
&\dfrac{\partial L(\mathbf{l_3})}{\partial W_1} = \dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}  \dfrac{\partial \mathbf{l_3}}{\partial \mathbf{l_2}}  \dfrac{\partial \mathbf{l_2}}{\partial \mathbf{l_1}}  \dfrac{\partial \mathbf{l_1} }{\partial W_1}
\end{aligned}

TODO: check notation
TODO: CHECK AGAINST https://www.jasonosajima.com/backprop.html

Now, this notation is doing some extremely heavy lifting, and hiding a lot of complexity. Take $\dfrac{\partial L(\mathbf{l_3})}{\partial W_3}$. $L$ is a scalar-valued function (it takes as input a vector of length 10 and returns a scalar). We want to take the derivative of this w.r.t. a matrix: we want to know how each element $W_{3, ij}$ affects the loss function $L$. Therefore this will be a matrix of size (512, 10) (the size of $W_3$). How about $\dfrac{\partial \mathbf{l_3}}{\partial \mathbf{l_2}}$? This is the derivative of a vector-valued function (i.e. it outputs a vector, of length 10) with respect to a vector! So we ask, for each element $l_{3, j}$, how does element $l_{2, i}$ affect it? As layer 3 has 10 neurons and layer 2 has 512, we will have a (10, 512) matrix of partial derivatives. And for $\dfrac{\partial \mathbf{l_2} }{\partial W_2}$ we ask how a small change in each weight in $W_2$ will affect each dimension in $\mathbf{l_2}$: we will have a (512, 512, 512) rank-3 tensor!

Fortunately the structure of our network means many of these partial derivatives are zero, and we can simplify this a lot. Below, we go through each of these in turn, before implementing this network in Python. Note that for brevity we often write $\partial L(\mathbf{l_3})$ as $\partial L$. TODO: only for $W_3$ at first.


#### $\dfrac{\partial L(\mathbf{l_3})}{\partial W_3}$

This is the exact derivative we want to compute to update the weights for the third layer, $W_3$. It will be a (512, 10) matrix, with each element being how a small change in that weight affects the loss, i.e. (omitting the layer subscript on the individual weights for brevity)

$$
    \dfrac{\partial L }{\partial W_3} =
    \begin{bmatrix}
        \dfrac{ \partial L }{ \partial W_{1,1} } & \dfrac{ \partial L }{ \partial W_{1,2} } & ... & \dfrac{ \partial L }{ \partial W_{1,10} }\\
        \dfrac{ \partial L }{ \partial W_{2,1} } & \ddots & & \vdots \\
        \vdots \\
        \dfrac{ \partial L }{ \partial W_{512,1} } & \dots & &   \dfrac{ \partial L }{ \partial W_{512,10} }
    \end{bmatrix}
$$

Remember in our convention, the first index of the weight matrix is the input, the second is the output. So for example, $W_{512, 10}$ is the weight from layer 2 neuron 512 to layer 3 unit 10.

Add a picture of the neurons?

Now that we are clear that the matrix of partial derivatives takes this form, we will compute its contents using the chain rule $\dfrac{\partial L(\mathbf{l_3})}{\partial W_3} = \dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}  \dfrac{\partial \mathbf{l_3} }{\partial W_3}$.


#### $\dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}$

This is the derivative of the scalar-valued loss function with respect to a vector (the 10 units of layer 3). So it will be a 10-dimensional vector of
partial derivatives, as below. Recall that we index the $i$th unit of a layer (e.g. the third layer) with the notation $l_{3, i}$.
We can evaluate the derivative of the cross-entropy loss directly: it is $\text{softmax}(\mathbf{l_3})_i - 1$ when $i = y$
(i.e. for the logit that matches the index of the correct label, $y$), and $\text{softmax}(\mathbf{l_3})_i$ for all other units in layer 3,
when $i \neq y$. We can write both cases as $\text{softmax}(\mathbf{l_3})_i - \delta_{i, y}$ where $\delta$ is the
[Kronecker delta](https://en.wikipedia.org/wiki/Kronecker_delta). This is quite a curious result; see **Appendix 1** for the derivation.

To summarise, the partial derivative of the loss w.r.t. layer 3 is:

$$
\dfrac{ \partial L}{ \partial \mathbf{l_3} } =
\begin{bmatrix}
    \dfrac{\partial L}{\partial l_{3, 1}} \\
    \dfrac{\partial L}{\partial l_{3, 2}} \\
    \vdots\\
    \dfrac{\partial L}{\partial l_{3, 10}}
\end{bmatrix}
=
\begin{bmatrix}
    \text{softmax}(\mathbf{l_3})_1 - \delta_{1, y} \\
    \text{softmax}(\mathbf{l_3})_2 - \delta_{2, y} \\
    \vdots \\
    \text{softmax}(\mathbf{l_3})_{10} - \delta_{10, y}
\end{bmatrix}
$$
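A useful habit when deriving gradients by hand is to compare them against a finite-difference approximation. The sketch below does this for the result above, using four hypothetical logits rather than ten to keep the output readable; the function names are our own.

```python
import numpy as np

def softmax(l3):
    e = np.exp(l3 - np.max(l3))
    return e / np.sum(e)

def loss(l3, y):
    shifted = l3 - np.max(l3)
    return np.log(np.sum(np.exp(shifted))) - shifted[y]

l3 = np.array([0.2, -1.0, 1.5, 0.0])  # hypothetical logits
y = 2                                 # hypothetical correct label

# Analytic gradient: softmax(l3)_i - delta_{i, y}
analytic = softmax(l3)
analytic[y] -= 1.0

# Numerical gradient by central differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(l3)
for i in range(l3.size):
    step = np.zeros_like(l3)
    step[i] = eps
    numeric[i] = (loss(l3 + step, y) - loss(l3 - step, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The two vectors agree to several decimal places. Note also that the entries of the analytic gradient sum to zero, because the softmax probabilities sum to one.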

#### $\dfrac{\partial \mathbf{l_3} }{\partial W_3}$

Here we have the derivative of a vector with respect to a matrix. We are asking how a small change in every weight in $W_3$ (a 512 by 10 matrix) affects each dimension of the vector $\mathbf{l_3}$. So we have a (512, 10, 10) set of partial derivatives, a rank-3 tensor! How can we even deal with this in our computation?

Luckily, the structure of our layered neural network means that this can be reduced. The key is that we never need this tensor on its own, only the full derivative $\dfrac{\partial L(l_3)}{\partial W_3}$. The strategy is to look at individual elements of $\dfrac{\partial L(l_3)}{\partial W_3}$, as in the matrix above, and fill in each one, e.g. $\dfrac{\partial L(l_3)}{\partial W_{3, ij}}$, using the total derivative rule (**Appendix B**):

\begin{align}
\dfrac{\partial L(l_3)}{\partial W_{3, ij}}
&= \dfrac{\partial L}{\partial l_{3}} \dfrac{\partial l_{3}}{\partial W_{3,ij}}  \\
&= \sum_k \dfrac{\partial L}{\partial l_{3,k}} \dfrac{\partial l_{3,k}}{\partial W_{3,ij}} \\
&= \dfrac{\partial L}{\partial l_{3,j}} \dfrac{\partial l_{3,j}}{\partial W_{3,ij}} \\
&= \dfrac{\partial L}{\partial l_{3,j}} l_{2, i}
\end{align}

TODO: explain this in great detail with words. It is a nightmare to follow the indices, but quite clear in terms of neurons, inputs and outputs (show image). The sum collapses because $\dfrac{\partial l_{3,k}}{\partial W_{3,ij}}$ is $ 0 \ \forall \ j \neq k$, i.e. a weight from unit $i$ in layer 2 to unit $j$ in layer 3 has no effect on the $k$th unit in layer 3 unless $j = k$. The remaining term follows from the scalar derivative rule, since only the $m = i$ term of the sum depends on $W_{3, ij}$:

\begin{aligned}
    \dfrac{\partial l_{3,j}}{\partial W_{3,ij}} = \dfrac{\partial }{\partial W_{3,ij}} \sum_m l_{2, m} W_{3, mj} =  l_{2, i}
\end{aligned}

Also note that this term does not depend on $j$: all weights leaving unit $l_{2, i}$ share the same factor $l_{2, i}$, so every entry in row $i$ of $\dfrac{\partial L}{\partial W_3}$ is scaled by $l_{2, i}$. This makes sense, as the effect of each of these weights on its target unit $j$ depends only on the value of $l_{2, i}$ that it carries forward.
TODO: explain this more intuitively, have an image
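To see how sparse the rank-3 tensor $\partial \mathbf{l_3} / \partial W_3$ really is, we can build it numerically for a tiny layer. This sketch uses hypothetical sizes (4 inputs, 3 outputs) and random values; `jac[i, j, k]` holds the derivative of output unit $k$ with respect to weight $W_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 3                     # tiny hypothetical layer sizes
l2 = rng.normal(size=n_in)             # stand-in for the previous layer's output
W3 = rng.normal(size=(n_in, n_out))    # (inputs, outputs) convention

# jac[i, j, k] = d l3[k] / d W3[i, j], estimated by finite differences
eps = 1e-6
jac = np.zeros((n_in, n_out, n_out))
for i in range(n_in):
    for j in range(n_out):
        W_plus = W3.copy()
        W_plus[i, j] += eps
        jac[i, j, :] = (l2 @ W_plus - l2 @ W3) / eps

# The derivation predicts l2[i] when j == k and 0 otherwise
expected = np.einsum("i,jk->ijk", l2, np.eye(n_out))

print(np.round(jac[0], 3))  # a diagonal matrix with l2[0] on the diagonal
```

Only the entries with $j = k$ are non-zero, which is why the full tensor never needs to be stored in practice.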


#### Putting this all together: $\dfrac{\partial L(\mathbf{l_3})}{\partial W_3} = \dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}} \dfrac{\partial \mathbf{l_3} }{\partial W_3}$

Note that all the weights here belong to $W_3$, and we omit the layer subscript for brevity. We now know that we can simplify the expression so that each entry only involves the neuron in layer 3 that the weight actually connects to (as above).

$$
\dfrac{ \partial L}{ \partial W_3 } =
\begin{bmatrix}
\dfrac{\partial L}{ \partial l_{3, 1} } \dfrac{\partial l_{3, 1} }{ \partial W_{1,1} } 
& \dfrac{\partial L}{ \partial l_{3, 2} } \dfrac{\partial l_{3, 2} }{ \partial W_{1,2} } 
& \dots 
& \dfrac{\partial L}{ \partial l_{3, 10} } \dfrac{\partial l_{3, 10} }{ \partial W_{1, 10} }  \\
\dfrac{\partial L}{ \partial l_{3, 1} } \dfrac{\partial l_{3, 1} }{ \partial W_{2,1} } 
& \ddots & & \vdots \\
\vdots & & & \\
\dfrac{\partial L}{ \partial l_{3, 1} } \dfrac{\partial l_{3, 1} }{ \partial W_{512,1} } 
& \dots & 
& \dfrac{\partial L}{ \partial l_{3, 10} } \dfrac{\partial l_{3, 10} }{ \partial W_{512,10} }
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial L}{ \partial l_{3, 1} } l_{2, 1} & 
\dfrac{\partial L}{ \partial l_{3, 2} } l_{2, 1}
& \dots & \dfrac{\partial L}{ \partial l_{3, 10} } l_{2, 1} \\
\dfrac{\partial L}{ \partial l_{3, 1} } l_{2, 2} & \ddots & & \vdots \\
\vdots & & & \\
\dfrac{\partial L}{ \partial l_{3, 1} } l_{2, 512} & \dots & & \dfrac{\partial L}{ \partial l_{3, 10} } l_{2, 512}
\end{bmatrix}
$$

Which we can write as follows (recalling that we are using row vectors, so $\mathbf{l_2}^T$ is a column vector and $\dfrac{\partial L}{\partial \mathbf{l_3}}$ is treated as a row vector):

$$
\dfrac{\partial L}{\partial W_3} = \mathbf{l_2}^T \dfrac{\partial L}{\partial \mathbf{l_3}} \ \ \ \ \text{(512, 1) x (1, 10) = (512, 10)}
$$

i.e. the outer product of the layer 2 outputs of the network and our vector of partial derivatives of the loss with respect to the output layer 3 (as above).
It is remarkable that all this complexity reduces to something so simple! TODO: return to this.
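The outer-product result for the layer 3 weights can be checked numerically. The sketch below uses small hypothetical sizes (5 units feeding 3 outputs) and random values, and compares the outer product against central finite differences of the loss; all names are our own.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def loss_fn(l2, W3, y):
    """Cross-entropy loss of the logits l3 = l2 @ W3 for the correct label y."""
    return -np.log(softmax(l2 @ W3)[y])

rng = np.random.default_rng(2)
l2 = rng.normal(size=5)          # stand-in for the layer 2 output
W3 = rng.normal(size=(5, 3))     # (inputs, outputs) convention
y = 1                            # hypothetical correct label

# Analytic gradient: outer product of l2 and dL/dl3
g3 = softmax(l2 @ W3)
g3[y] -= 1.0
dL_dW3 = np.outer(l2, g3)        # shape (5, 3), the same as W3

# Numerical gradient, perturbing one weight at a time
eps = 1e-6
numeric = np.zeros_like(W3)
for i in range(W3.shape[0]):
    for j in range(W3.shape[1]):
        step = np.zeros_like(W3)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(l2, W3 + step, y) - loss_fn(l2, W3 - step, y)) / (2 * eps)

max_error = np.max(np.abs(dL_dW3 - numeric))
print(max_error)  # very small
```

The gradient has the same shape as the weight matrix it updates, which is a helpful check whenever the shapes get confusing.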






#### $\dfrac{\partial \mathbf{l_3}}{\partial \mathbf{l_2}}$

This is the Jacobian of layer 3 with respect to layer 2, a (10, 512) matrix with entries $\dfrac{\partial l_{3, k}}{\partial l_{2, i}} = W_{3, ik}$, i.e. $W_3^T$. TODO: show it exactly. Again, we will collapse this when taking the derivative of the loss
to make the calculations easier, using the total derivative rule for each element. We do this below.


TODO: just drop the (l3) and make it clear somewhere to remember that it is a function of just l3.

#### Putting the Layer 2 weights together: $\dfrac{\partial L(\mathbf{l_3})}{\partial W_2} = \dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}  \dfrac{\partial \mathbf{l_3}}{\partial \mathbf{l_2}}  \dfrac{\partial \mathbf{l_2} }{\partial W_2} $

Now this gets complicated again, more so than above. While $W_{2, 11}$, say, only affects neuron 1 in layer 2, neuron 1 in layer 2
is connected to all 10 neurons in layer 3. So $l_{2, 1}$ affects the loss through all of these. Therefore, we can proceed
as above but we need to compute this iteratively. First, we compute $\dfrac{\partial L}{\partial \mathbf{l_2}}$, noting 
that this is a derivative of a scalar-valued loss function with respect to a vector. So we will have a length 512 vector
of partial derivatives. We can compute each entry with the total derivative rule:

$$
\dfrac{ \partial L }{ \partial l_{2, i} } = \sum_k \dfrac{ \partial L }{ \partial l_{3, k} } \dfrac{\partial l_{3, k} }{ \partial l_{2, i} }
$$

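In terms of units: unit $i$ of layer 2 feeds every unit $k$ of layer 3, so its effect on the loss is summed over all 10 of them. This does not collapse to a single term as it did for $W_3$. However, since $l_{3, k} = \sum_i l_{2, i} W_{3, ik}$, the derivative $\dfrac{\partial l_{3, k}}{\partial l_{2, i}}$ is simply the weight $W_{3, ik}$, so the sum is a vector-matrix product and we can
recover the whole vector with:

$$
\dfrac{\partial L}{\partial \mathbf{l_2}} = \dfrac{\partial L}{\partial \mathbf{l_3}} W_3^T \ \ \ \ \text{(1, 10) x (10, 512) = (1, 512)}
$$

Then, exactly as for $W_3$, each weight $W_{2, ij}$ only affects unit $j$ of layer 2, and $\dfrac{\partial l_{2, j}}{\partial W_{2, ij}} = l_{1, i}$, so we can again express the result as an outer product:

$$
\dfrac{\partial L}{\partial W_2} = \mathbf{l_1}^T \dfrac{\partial L}{\partial \mathbf{l_2}} \ \ \ \ \text{(512, 1) x (1, 512) = (512, 512)}
$$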
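The layer 2 gradient can be checked numerically in the same way as the layer 3 gradient. This is a sketch under the (inputs, outputs) weight convention of this section, with small hypothetical layer sizes (6, 5, 4, 3) and random values; all names are our own.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def loss_fn(x, W1, W2, W3, y):
    """Cross-entropy loss of the three-layer linear network for the correct label y."""
    return -np.log(softmax(x @ W1 @ W2 @ W3)[y])

rng = np.random.default_rng(3)
x = rng.normal(size=6)
W1 = rng.normal(size=(6, 5)) * 0.5
W2 = rng.normal(size=(5, 4)) * 0.5
W3 = rng.normal(size=(4, 3)) * 0.5
y = 0  # hypothetical correct label

# Forward pass
l1 = x @ W1
l2 = l1 @ W2
l3 = l2 @ W3

# Backward pass: push the gradient back one layer, then take the outer product
g3 = softmax(l3)
g3[y] -= 1.0                # dL/dl3
g2 = g3 @ W3.T              # dL/dl2, sums over all layer 3 units
dL_dW2 = np.outer(l1, g2)   # shape (5, 4), the same as W2

# Numerical check by central differences
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        step = np.zeros_like(W2)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(x, W1, W2 + step, W3, y) - loss_fn(x, W1, W2 - step, W3, y)) / (2 * eps)

max_error = np.max(np.abs(dL_dW2 - numeric))
print(max_error)  # very small
```

The vector `g2` is worth keeping around: the same pattern (multiply by the transposed weights, then take an outer product with the layer input) gives the gradient for the layer before it.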




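Similarly for the first layer, we have $\dfrac{\partial L}{\partial \mathbf{l_1}} = \dfrac{\partial L}{\partial \mathbf{l_2}} W_2^T$ and $\dfrac{\partial L}{\partial W_1} = \mathbf{x}^T \dfrac{\partial L}{\partial \mathbf{l_1}}$, a (784, 512) matrix. TODO: check this computation by hand.

Note how we can reuse many of these computations: $\dfrac{\partial L}{\partial \mathbf{l_3}}$ is needed for every layer, and $\dfrac{\partial L}{\partial \mathbf{l_2}}$ is needed for both $W_2$ and $W_1$.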

## Layer 1 weights, using nicer notation (use g to represent the chain of derivatives as we move forward)
This should be significantly less confusing and more concise!


Same idea for L2, but add in the  extra term

Then list all derivatives

and implement.

TODO: image of units
TODO: indicate W_ij is from unit i to j
i.e. image of the matrix

TODO: we should be writing L(l3, y)!!
TODO: make sure that when we index a vector, it is not bold!









See **Appendix 1** for the derivation of $\dfrac{\partial L(\mathbf{l_3})}{\partial \mathbf{l_3}}$; the notation $\delta_{i, y}$ is the Kronecker delta. The derivation is the only way to see where this result comes from.

Note that this notation is doing a lot of heavy lifting! This is really key and often skipped; things work out nicely for our simple function but that is not always the case. For example, $\dfrac{\partial L}{\partial \mathbf{l_3}}$ is the derivative of a scalar-valued function with respect to a vector, $\dfrac{\partial \mathbf{l_3}}{\partial \mathbf{l_2}}$ is the derivative of a vector-valued function with respect to a vector, and $\dfrac{\partial \mathbf{l_3}}{\partial W_3}$ is the derivative of a vector-valued function with respect to a matrix! See **Appendix 2** for a full exploration of this. Similarly, the products in the chain rule are not scalar multiplications; their form depends on the shapes of the terms involved.

TODO: A perceptron lesson. Each perceptron learns a hyperplane, so we are stacking lots of hyperplanes together in a network and summing them! Beautiful! And then non-linear... and deep...




In [56]:
import numpy as np

class MyNeuralNetwork:
    def __init__(self, learning_rate=0.05):

        self.a = learning_rate

        # Weight matrices are stored as (output dim, input dim), the PyTorch convention,
        # so the forward pass multiplies by the transpose: l = x @ W.T
        # A symmetric range around zero keeps the initial logits small.
        self.W1 = np.random.uniform(-0.05, 0.05, (512, 28*28))
        self.W2 = np.random.uniform(-0.05, 0.05, (512, 512))
        self.W3 = np.random.uniform(-0.05, 0.05, (10, 512))

    def softmax(self, l3):
        e = np.exp(l3 - np.max(l3))  # subtract the max to avoid overflow
        return e / np.sum(e)

    def loss(self, l3, y):
        # Cross-entropy loss: log-sum-exp of the logits minus the correct logit
        shifted = l3 - np.max(l3)
        return np.log(np.sum(np.exp(shifted))) - shifted[y]

    def update_weights(self, x, y):
        # x is a length 784 vector, y is an integer label

        # Forward pass
        l1 = x @ self.W1.T   # (512,)
        l2 = l1 @ self.W2.T  # (512,)
        l3 = l2 @ self.W3.T  # (10,)

        loss = self.loss(l3, y)
        print(f"Loss: {loss}")

        # Backward pass: derivative of the loss w.r.t. each layer's output
        dloss_dl3 = self.softmax(l3)
        dloss_dl3[y] -= 1                 # softmax(l3) - one_hot(y), (10,)
        dloss_dl2 = dloss_dl3 @ self.W3   # (10,) x (10, 512) = (512,)
        dloss_dl1 = dloss_dl2 @ self.W2   # (512,) x (512, 512) = (512,)

        # Each weight gradient is an outer product of the gradient at the layer's
        # output and the layer's input, with shape (output dim, input dim)
        dloss_dW3 = np.outer(dloss_dl3, l2)  # (10, 512)
        dloss_dW2 = np.outer(dloss_dl2, l1)  # (512, 512)
        dloss_dW1 = np.outer(dloss_dl1, x)   # (512, 784)

        self.W3 -= self.a * dloss_dW3
        self.W2 -= self.a * dloss_dW2
        self.W1 -= self.a * dloss_dW1

        return loss

    def train_step(self, x, y):

        return self.update_weights(x, y)
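As a self-contained sketch of the whole procedure (forward pass, backward pass and weight update), the following trains a tiny three-layer linear network on a single made-up example. It is independent of the class above: it uses the (inputs, outputs) weight convention from the derivation, hypothetical layer sizes (8, 6, 6, 3), random data, and a learning rate chosen only for illustration.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def cross_entropy(l3, y):
    shifted = l3 - np.max(l3)
    return np.log(np.sum(np.exp(shifted))) - shifted[y]

rng = np.random.default_rng(0)
sizes = (8, 6, 6, 3)  # hypothetical layer sizes: input, two hidden layers, output
W1, W2, W3 = (rng.uniform(-0.5, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:]))

x = rng.random(8)  # a single made-up input
y = 2              # its made-up label
lr = 0.02          # illustrative learning rate

losses = []
for step in range(100):
    # Forward pass
    l1 = x @ W1
    l2 = l1 @ W2
    l3 = l2 @ W3
    losses.append(cross_entropy(l3, y))

    # Backward pass: gradient w.r.t. each layer output, reusing earlier results
    g3 = softmax(l3)
    g3[y] -= 1.0
    g2 = g3 @ W3.T
    g1 = g2 @ W2.T

    # Gradient descent update; each weight gradient is an outer product
    W3 -= lr * np.outer(l2, g3)
    W2 -= lr * np.outer(l1, g2)
    W1 -= lr * np.outer(x, g1)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

All three layer gradients are computed before any weight is changed, so that every update uses the weights from the same forward pass.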







**Appendix 1**

The cross entropy loss is:

\begin{aligned}
    L &= -\log \dfrac{ \exp{ l_{3, y} } }{ \sum_k \exp{ l_{3, k} } } \\
    &= -\bigg[ \log \exp{ l_{3, y} } - \log \sum_k \exp{ l_{3, k} } \bigg] \\
    &=  \log \sum_k \exp{ l_{3, k} } - l_{3, y}
\end{aligned}

(by the log laws). i.e. we take the logit  of layer 3 that matches the correct label $y$, normalise it to a probability
with the softmax function and take the negative log.

Let's start by taking the derivative with respect to $l_{3, y}$, where this is shorthand for $l_{3, i}$ with $i=y$, i.e.
the layer 3 logit for the label that is correct for this image. We are asking: how does a small change in this logit affect the loss?

\begin{aligned}
\dfrac{ \partial L }{ \partial l_{3, y} } &= \dfrac{ \partial }{ \partial l_{3, y} } \left( \log \sum_k \exp{ l_{3, k} } - l_{3, y} \right) \\
&=  \dfrac{ \partial }{ \partial l_{3, y} } \log \sum_k \exp{ l_{3, k} } - \dfrac{ \partial }{ \partial l_{3, y} }  l_{3, y}  \\
&= \dfrac{1}{ \sum_k \exp{ l_{3, k} } }  \dfrac{ \partial }{ \partial l_{3, y} } \sum_k \exp{ l_{3, k} } - 1
\end{aligned}

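(by the derivative of $\log x$ and the chain rule). Only the $k = y$ term of the sum depends on $l_{3, y}$, so $\dfrac{ \partial }{ \partial l_{3, y} } \sum_k \exp{ l_{3, k} } = \exp{ l_{3, y} }$ and

$$
\dfrac{ \partial L }{ \partial l_{3, y} } = \dfrac{ \exp{ l_{3, y} } }{ \sum_k \exp{ l_{3, k} } } - 1 = \text{softmax}(\mathbf{l_3})_y - 1
$$

Now take any other logit $l_{3, i}$ with $i \neq y$. The first term follows in the same way: the derivative of $\exp{ l_{3, k} }$ w.r.t. $l_{3, i}$ is $\exp{ l_{3, i} }$ when $k = i$ and $0$ otherwise (as all other logits are treated as constants). The second term is now $0$, because $l_{3, y}$ does not depend on $l_{3, i}$.
So whichever dimension $i$ of $\mathbf{l_3}$ we differentiate with respect to, we get $\text{softmax}(\mathbf{l_3})_i$ from the first term, but only when $i = y$ do we get $-1$ from the second term:

$$
\dfrac{ \partial L }{ \partial l_{3, i} } = \text{softmax}(\mathbf{l_3})_i - \delta_{i, y}
$$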




