# Introduction to Neural Networks

Neural networks are a very flexible class of models for learning complex, high-dimensional functions. For example, suppose we have a dataset (X, y)
where each x \in X is a 28 x 28 image and each y \in Y is an integer label taking one of 10 values. Our model could learn a function
that takes x as an input and outputs the most likely label, i.e. f : x \rightarrow y. Typically, however, we will learn the conditional distribution
p(y|x). Here y is a discrete random variable with 10 possible values and x is an image, so p(y|x) is a 10-element discrete probability distribution.
Note that the model does not output this probability distribution directly. Instead, it maps x to a vector of 'logits' l \in R^10, with one entry per label,
which we normalise with the softmax function to obtain p(y|x).

Therefore our model will approximate this function.
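
As a minimal sketch of the normalisation step described above, the snippet below turns a vector of 10 logits into a probability distribution with the softmax function. The logit values are made up for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    """Normalise a 1-D array of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # softmax is unchanged by a constant shift
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical logits for one image, one entry per class
logits = np.array([1.2, -0.3, 0.5, 3.1, 0.0, -1.7, 0.8, 2.2, -0.5, 0.1])
probs = softmax(logits)

print(probs.sum())            # approximately 1.0
print(int(np.argmax(probs)))  # 3, the index of the largest logit
```

Because softmax preserves the ordering of the logits, the most likely class can be read off either vector.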

In this example, we will follow the [PyTorch](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) 'Getting Started' page to
train a simple model. Then, we will get a deeper grip on how these models work by implementing the model ourselves in NumPy, which requires calculating
the model update step ourselves.

## Coding up a Neural Network in PyTorch

This section follows [the PyTorch introduction](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
almost exactly, so the notes will be light here.


### The data

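We will use the Fashion-MNIST dataset (loaded below with `datasets.FashionMNIST`), which contains 28x28 greyscale images of clothing items and their associated labels, e.g. an image of a T-shirt or top, with the label 0,
an image of a pair of trousers, with the label 1, and so on, for 60,000 training images and 10,000 test images.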

We will split the dataset into training and testing data.


In [3]:
import torch
from torch import nn
# DataLoader wraps an iterable around a Dataset, which stores samples and their corresponding labels.
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",  # root directory
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)



In [4]:
batch_size = 64

# Create data loaders
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break


Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64


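This makes sense: our batch has shape (batch size, number of channels, image height, image width). There is a single channel because the images are greyscale rather than RGB. For each image in the batch we have a single integer label (e.g. the label for 'T-shirt').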

### The Model

The appealing thing is that we only need to define the inputs, the outputs and an architecture that can be moulded into the function of interest. This set-up is remarkably flexible: for now we treat the architecture as a black box and do not need to think about what is going on inside it. Instead, we 'shape' it by iteratively training the weights on our known data. Once we have a good mould, we are done!


In [10]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device}")

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()

        self.flatten = nn.Flatten()

        # Each Linear layer computes yᵢ = ∑ⱼ xⱼ Wᵀⱼᵢ + bᵢ, so together y = xW^T + b,
        # where W is an (output_features, input_features) matrix of weights and b is a vector of biases,
        # so we can easily control the output shape. Layers are fully connected.
        # The non-linearity φ (here ReLU) is applied as a separate module after each hidden layer.
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)




Using cuda
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
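
The layer sizes printed above imply a total parameter count, which can be worked out by hand. The sketch below does this in plain Python; the `layer_shapes` list is simply the three `Linear` layers from the printed model, written out manually.

```python
# (in_features, out_features) for each Linear layer in the printed model
layer_shapes = [(784, 512), (512, 512), (512, 10)]

total = 0
for in_features, out_features in layer_shapes:
    weights = in_features * out_features  # one weight per input-output pair
    biases = out_features                 # one bias per output unit
    total += weights + biases
    print(f"{in_features:>4d} -> {out_features:>3d}: {weights + biases} parameters")

print(f"Total: {total}")  # Total: 669706
```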



#### What is the Model Learning?

Let's look again at what the model is learning. To fully characterise the dataset, we would need the joint distribution
between the data and the labels, p(x, y).

(Figure: the joint distribution drawn out for a simple case.)

The joint distribution also gives us information about the relative frequencies of the labels y and of the images x, because either variable can be marginalised out, e.g. p(x) = \sum_y p(x, y). We could then convert to p(y|x) by dividing by p(x), i.e. p(y|x) = p(x, y) / p(x), the definition of conditional probability that underlies Bayes' rule.

However, learning the joint distribution is a much more complicated problem. It is what generative models aim to learn.
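
To make the relationship between the joint and the conditional distribution concrete, here is a toy sketch with a made-up joint distribution over 3 possible inputs and 2 labels. The numbers are illustrative only and are not taken from any dataset. Marginalising over y gives p(x), and dividing the joint by p(x) gives p(y|x).

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows are 3 possible inputs x, columns are 2 labels y
p_xy = np.array([
    [0.10, 0.30],
    [0.25, 0.05],
    [0.20, 0.10],
])

p_x = p_xy.sum(axis=1)              # marginalise out y to get p(x)
p_y = p_xy.sum(axis=0)              # marginalise out x to get p(y), the label frequencies
p_y_given_x = p_xy / p_x[:, None]   # divide each row by p(x) to get p(y|x)

print(p_x)          # approximately [0.4, 0.3, 0.3]
print(p_y)          # approximately [0.55, 0.45]
print(p_y_given_x)  # each row sums to 1
```

Note that p(y|x) alone cannot be turned back into the joint: the row sums p(x) are lost in the division.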

In our case, we will take a shortcut and learn only p(y|x). Roughly speaking, the model learns features of the 28*28-dimensional input space and partitions that space
into 10 classes. The decision boundaries are hyper-surfaces in this space. Of course, this tells us nothing about p(x, y), because p(x, y) = p(y|x) p(x) and the model never learns p(x). It is, however, a direct way of obtaining p(y|x).

(Figure: a simple drawing.)

In reality, the boundary is not a simple plane but a manifold in a much higher-dimensional space.

An open question remains about what exactly is being learned: does the network effectively funnel an unseen x onto an already-known x and then map it to p(y|x)?

https://arxiv.org/abs/1311.2901
http://neuralnetworksanddeeplearning.com/
https://cs231n.github.io/linear-classify/
https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

In [6]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# We can see we have Weights, Biases, Weights, Biases, Weights, Biases (3 layers)
print("The model parameters")
for name, param in model.named_parameters():
    print(
        f"Name: {name}\n"
        f"Type: {type(param)}\n"
        f"Size: {param.size()}\n"
        f"e.g. {param[:5]}\n\n")

The model parameters
Name: linear_relu_stack.0.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512, 784])
e.g. tensor([[ 0.0192,  0.0357, -0.0006,  ..., -0.0333, -0.0235,  0.0033],
        [-0.0218,  0.0026,  0.0204,  ...,  0.0252, -0.0259, -0.0091],
        [ 0.0164, -0.0108,  0.0284,  ..., -0.0228,  0.0354,  0.0250],
        [ 0.0121, -0.0232, -0.0101,  ..., -0.0092,  0.0006,  0.0351],
        [-0.0173,  0.0102,  0.0229,  ...,  0.0273, -0.0032,  0.0100]],
       device='cuda:0', grad_fn=<SliceBackward0>)


Name: linear_relu_stack.0.bias
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512])
e.g. tensor([ 0.0227,  0.0245,  0.0095,  0.0304, -0.0260], device='cuda:0',
       grad_fn=<SliceBackward0>)


Name: linear_relu_stack.2.weight
Type: <class 'torch.nn.parameter.Parameter'>
Size: torch.Size([512, 512])
e.g. tensor([[-0.0316, -0.0208, -0.0376,  ..., -0.0434,  0.0045, -0.0224],
        [ 0.0171, -0.0396,  0.0083,  ..., -0.0402, -0.0002, -0.0139],
 

### Training and Evaluating the Model

In [7]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train(mode=True)  # put into 'training mode'

    for batch, (X, y) in enumerate(dataloader):

        X, y = X.to(device), y.to(device)

        # pred is a (batch size, 10) tensor of logits for this batch
        # y is a (batch size,) tensor of integer class labels
        # Cross entropy loss: https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
        pred = model(X)
        loss = loss_fn(pred, y)

        loss.backward()
        optimizer.step()  # perform one step θt <- f(θ_{t-1})
        optimizer.zero_grad()  # zero the accumulated gradients, ready for the next step

        if batch % 100 == 0:
            loss = loss.item()  # `loss` is a 0-dimensional tensor; `item()` returns it as a Python float
            current = (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

train(train_dataloader, model, loss_fn, optimizer)

loss: 2.297308  [   64/60000]
loss: 2.285923  [ 6464/60000]
loss: 2.268851  [12864/60000]
loss: 2.269377  [19264/60000]
loss: 2.253051  [25664/60000]
loss: 2.225462  [32064/60000]
loss: 2.232326  [38464/60000]
loss: 2.198354  [44864/60000]
loss: 2.198508  [51264/60000]
loss: 2.172440  [57664/60000]
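
For intuition about what `optimizer.step()` does in the training loop above, the sketch below applies the plain gradient-descent update θ ← θ − lr · ∇L to a toy loss L(θ) = ||θ − target||². It only illustrates the basic SGD rule without momentum or weight decay; the `target`, the starting point, the learning rate and the number of steps are all made up for illustration.

```python
import numpy as np

lr = 1e-1
target = np.array([1.0, -2.0])   # hypothetical minimiser of the toy loss
theta = np.zeros(2)              # initial parameters

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    return 2.0 * (theta - target)   # analytic gradient of the toy loss

losses = []
for step in range(50):
    losses.append(loss(theta))
    theta = theta - lr * grad(theta)   # the gradient-descent update

print(losses[0], losses[-1])   # the loss shrinks towards zero
print(theta)                   # close to `target`
```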


In [8]:
def test(dataloader, model, loss_fn):
    """Evaluate the model on a dataloader and print the accuracy and average loss."""
    size = len(dataloader.dataset)
    num_batches = len(dataloader)

    model.eval()  # Go from train to eval mode

    test_loss, correct = 0, 0

    with torch.no_grad():  # this turns off gradient tracking, which saves time and memory

        for batch, (X, y) in enumerate(dataloader):

            X, y = X.to(device), y.to(device)

            pred = model(X)
            # Equivalent but unnecessary: pred_i = torch.softmax(pred, dim=1).argmax(dim=1)
            pred_i = pred.argmax(1)  # softmax is monotonic, so the largest logit is also the largest probability
            correct += (pred_i == y).type(torch.float).sum().item()
            test_loss += loss_fn(pred, y).item()

        test_loss /= num_batches
        correct /= size
        print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

print("Epoch 1")
test(test_dataloader, model, loss_fn)

for i in range(5):
    print(f"Epoch {i + 2}")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)


Epoch 1
Test Error: 
 Accuracy: 43.0%, Avg loss: 2.161597 

Epoch 2
loss: 2.166717  [   64/60000]
loss: 2.156315  [ 6464/60000]
loss: 2.102063  [12864/60000]
loss: 2.127685  [19264/60000]
loss: 2.076803  [25664/60000]
loss: 2.019103  [32064/60000]
loss: 2.050828  [38464/60000]
loss: 1.969575  [44864/60000]
loss: 1.976774  [51264/60000]
loss: 1.917304  [57664/60000]
Test Error: 
 Accuracy: 58.8%, Avg loss: 1.904784 

Epoch 3
loss: 1.932660  [   64/60000]
loss: 1.901088  [ 6464/60000]
loss: 1.786190  [12864/60000]
loss: 1.840296  [19264/60000]
loss: 1.725365  [25664/60000]
loss: 1.674734  [32064/60000]
loss: 1.705723  [38464/60000]
loss: 1.596955  [44864/60000]
loss: 1.619671  [51264/60000]
loss: 1.533497  [57664/60000]
Test Error: 
 Accuracy: 59.7%, Avg loss: 1.537534 

Epoch 4
loss: 1.597141  [   64/60000]
loss: 1.561808  [ 6464/60000]
loss: 1.412652  [12864/60000]
loss: 1.498589  [19264/60000]
loss: 1.372249  [25664/60000]
loss: 1.364505  [32064/60000]
loss: 1.383409  [38464/60000]
lo

A note on generative versus discriminative models: here we learn only p(y|x), not p(x, y). Training is simply maximum likelihood on p(y|x).



## Training a Neural Network by Hand - A Simple Example

Pytorch...

### The Model

Let $l_1$, $l_2$ and $l_3$ be vector-valued functions representing our layers. $l_1$ takes our length $784$ row vector $\mathbf{x}$ and returns a length $512$ vector. $l_2$ takes a length $512$ vector as input and outputs a length $512$ vector. Finally, $l_3$ takes a length $512$ vector as input and outputs a length $10$ vector. To make the mathematics clear, we will first use a very simple model that has no biases or non-linear rectifying functions.

\begin{aligned}
    &l_1(\mathbf{x}) = \mathbf{x}W_1^T \ \ \ \ \text{(1, 784) x (784, 512) = (1, 512)} \\
    &l_2(\mathbf{l_1}) = \mathbf{l_1}W_2^T \ \ \ \ \text{(1, 512) x (512, 512) = (1, 512)} \\
    &l_3(\mathbf{l_2}) = \mathbf{l_2}W_3^T \ \ \ \ \text{(1, 512) x (512, 10) = (1, 10)} \\
\end{aligned}

The $W$ are our matrices of weights with shape (outputs, inputs). $W_1$ is (512, 784), $W_2$ is (512, 512) and $W_3$ is (10, 512). The shapes of these vector-matrix multiplications are shown on the right. We go from a length 784 vector to a length 10 vector as expected.
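
A quick shape check of the three matrix products, as a sketch with random stand-in values (the weights here are random, not trained):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((1, 784))            # one flattened 28 x 28 image as a row vector
W1 = rng.normal(size=(512, 784))    # weights stored as (outputs, inputs)
W2 = rng.normal(size=(512, 512))
W3 = rng.normal(size=(10, 512))

l1 = x @ W1.T    # (1, 784) x (784, 512) = (1, 512)
l2 = l1 @ W2.T   # (1, 512) x (512, 512) = (1, 512)
l3 = l2 @ W3.T   # (1, 512) x (512, 10)  = (1, 10)

print(l1.shape, l2.shape, l3.shape)  # (1, 512) (1, 512) (1, 10)
```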

Here, we already run into some notational difficulty. Our layers are functions, which when we evaluate at some input we get a vector. We then feed this vector into the next function. So our layers are both functions, and when evaluated are vectors. We will use the bold font to indicate the evaluated function. We will drop the bracket notation as it is too verbose.

Finally, we take the (1, 10) shape vector of logits and use these to compute our loss function, the cross-entropy loss:

$$
L(\mathbf{l_3}, y) = -\log \dfrac{ \exp{ \mathbf{ l_{3, y} }}}{ \sum_k \exp{ \mathbf{l_{3, k} }}}
$$

Here $\mathbf{l_{3, k}}$ is the $k$-indexed element of our vector of logits from layer 3, and $y$ is the index of the correct label for this image. You may recognise the term after the $\log$ as the [softmax function](https://en.wikipedia.org/wiki/Softmax_function), which normalises the logits to probabilities. Therefore, we are computing the probability of the input image $\mathbf{x}$ being of label $y$ according to our model.

We are therefore aiming to maximise the log probability associated with the correct label, $y$. This makes intuitive sense: we have an image $\mathbf{x}$ and compute a set of probabilities, one for each of the 10 labels. We want to maximise the probability we assign to $y$ because, in this supervised context, we know it is the true label for this image. Equivalently, we minimise the negative log probability.
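
The cross-entropy loss for a single example can be sketched in a few lines of NumPy. The logits below are hypothetical, and the `cross_entropy` helper is written for illustration rather than taken from any library. It works in log space and subtracts the maximum logit, which gives the same value as the formula above but avoids overflow in `exp`.

```python
import numpy as np

def cross_entropy(logits, y):
    """Negative log softmax probability of the correct class y for one example."""
    shifted = logits - np.max(logits)                      # avoids overflow in exp
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log of the softmax
    return -log_probs[y]

logits = np.array([2.0, 0.5, -1.0])   # hypothetical logits for a 3-class problem

print(cross_entropy(logits, 0))  # smaller loss: class 0 has the largest logit
print(cross_entropy(logits, 2))  # larger loss: class 2 has the smallest logit

# With all logits equal, every class gets probability 1/K and the loss is log(K)
print(cross_entropy(np.zeros(10), 3), np.log(10))
```

The value log(10) ≈ 2.30 is close to the first loss printed in the PyTorch training log earlier (2.297), which is what we would expect from an untrained model that spreads its probability roughly evenly over the 10 classes.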



### Predicting a label from an image - the forward pass

For this simple network, it is easy to use our model to take an image $\mathbf{x}$ (input), map it to our output of 10 logits and use these to predict a label $\hat{y}$. First we apply the network to the input data:

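\begin{aligned}
\mathbf{l_3} &= l_3(l_2(l_1(\mathbf{x}))) \\
             &= \mathbf{x} W_1^T W_2^T W_3^T
\end{aligned}

Then we take the index of the largest logit as the predicted label, $\hat{y} = \arg\max_k \mathbf{l_{3,k}}$.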
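
The forward pass can be sketched in NumPy as follows. The weights are random stand-ins with the (outputs, inputs) shapes defined earlier, and `x` is a random vector standing in for a flattened image, so the predicted label here is meaningless; the point is only the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in weights with shape (outputs, inputs); a trained model would supply real values
W1 = rng.uniform(-0.05, 0.05, (512, 784))
W2 = rng.uniform(-0.05, 0.05, (512, 512))
W3 = rng.uniform(-0.05, 0.05, (10, 512))

def forward(x):
    """Map a flattened image (length 784) to a length-10 vector of logits."""
    l1 = x @ W1.T
    l2 = l1 @ W2.T
    l3 = l2 @ W3.T
    return l3

x = rng.random(784)               # stand-in for a flattened image
logits = forward(x)
y_hat = int(np.argmax(logits))    # predicted label: index of the largest logit

# Without biases or non-linearities, the three layers collapse to a single matrix
W_combined = W3 @ W2 @ W1         # shape (10, 784)
print(y_hat, np.allclose(logits, x @ W_combined.T))
```

The final check reflects a property of this simplified model only: a stack of purely linear layers is itself a linear map.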





How can we possibly make sense of these multiplications, (1, 10) x (10, 512) x (10, 512, 512)? Well, the first two work and, fortunately, the third decomposes because the majority of its elements are zero. The strategy is to break the problem down and look at individual elements of $\dfrac{\partial L(l_3)}{\partial W_3}$, as in the image above, filling in each one term-wise, e.g. $\dfrac{\partial L(l_3)}{\partial W_{3, ij}}$, and using the total derivative rule:

TODO: handle the transpose here, and add a large image of the matrix.

Then show how we can compute this with the outer product.

The same idea applies to layer 2, with an extra term added.

Then list all the derivatives

and implement them.


\begin{align}
\dfrac{\partial L(l_3)}{\partial W_{3, ij}}
&= \dfrac{\partial L(l_3)}{\partial l_{3}} \dfrac{\partial l_{3}}{\partial W_{3,ij}}  \\
&= \sum_k \dfrac{\partial L(l_3)}{\partial l_{3,k}} \dfrac{\partial l_{3,k}}{\partial W_{3,ij}}
\end{align}

The nice thing about this is that $\dfrac{\partial l_{3,k}}{\partial W_{3,ij}} = 0 \ \forall \ j \neq k$: the weight $W_{3,ij}$, which connects unit $i$ in layer 2 to unit $j$ in layer 3, has no effect on the $k$th unit in layer 3 unless it connects to that unit, i.e. unless $j = k$. Only the $k = j$ term of the sum survives, and it is

\begin{aligned}
    \dfrac{\partial l_{3,j}}{\partial W_{3,ij}} = \dfrac{\partial \sum_m l_{2, m} W_{3, mj} }{\partial W_{3,ij}} =  l_{2, i}
\end{aligned}

so that $\dfrac{\partial L(l_3)}{\partial W_{3,ij}} = \dfrac{\partial L(l_3)}{\partial l_{3,j}} \, l_{2, i}$.
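
Collecting the product of $l_{2,i}$ and $\dfrac{\partial L(l_3)}{\partial l_{3,j}}$ over all pairs $(i, j)$ gives an outer product. The sketch below checks this numerically on a small hypothetical layer (5 inputs, 3 outputs). It assumes the softmax-minus-one-hot form of $\dfrac{\partial L(l_3)}{\partial l_3}$, the same expression used in the NumPy code later in this document. Note that the code stores `W3` with shape (outputs, inputs), so the weight from unit $i$ to unit $j$ is `W3[j, i]`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, y = 5, 3, 1              # small hypothetical sizes and correct label

l2 = rng.normal(size=n_in)            # stand-in for the layer-2 output
W3 = rng.normal(size=(n_out, n_in))   # W3[j, i] links unit i to unit j

def loss_from_W3(W):
    l3 = l2 @ W.T
    shifted = l3 - np.max(l3)
    return -(shifted[y] - np.log(np.sum(np.exp(shifted))))

# Analytic gradient: dL/dW3[j, i] = dL/dl3[j] * l2[i], i.e. an outer product
l3 = l2 @ W3.T
exps = np.exp(l3 - np.max(l3))
dloss_dl3 = exps / np.sum(exps)
dloss_dl3[y] -= 1.0
analytic = np.outer(dloss_dl3, l2)    # shape (n_out, n_in), the same as W3

# Numerical gradient by central finite differences
eps = 1e-6
numeric = np.zeros_like(W3)
for j in range(n_out):
    for i in range(n_in):
        step = np.zeros_like(W3)
        step[j, i] = eps
        numeric[j, i] = (loss_from_W3(W3 + step) - loss_from_W3(W3 - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # a very small number
```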

TODO: add an image of the units, indicating that $W_{ij}$ connects unit $i$ to unit $j$, i.e. an image of the matrix.









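See **Appendix 1** for the derivation of $\dfrac{\partial L(l_3)}{\partial l_3}$. The result is $\dfrac{\partial L(l_3)}{\partial l_{3,i}} = \dfrac{\exp{\mathbf{l_{3,i}}}}{\sum_k \exp{\mathbf{l_{3,k}}}} - \delta_{iy}$, where $\delta_{iy}$ is the Kronecker delta (1 if $i = y$ and 0 otherwise). The only way to see this is to work through the derivation.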
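
As a numerical sanity check, the sketch below compares the standard result $\dfrac{\partial L}{\partial l_{3,i}} = \mathrm{softmax}(l_3)_i - \delta_{iy}$ against central finite differences of the cross-entropy loss. The logits and the label are hypothetical, and the helper functions are written here for illustration.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

def cross_entropy(logits, y):
    shifted = logits - np.max(logits)
    return -(shifted[y] - np.log(np.sum(np.exp(shifted))))

logits = np.array([0.3, -1.2, 2.0, 0.7])   # hypothetical logits
y = 2                                      # hypothetical correct label

# Analytic gradient: softmax minus the Kronecker delta (a one-hot vector at y)
delta = np.zeros_like(logits)
delta[y] = 1.0
analytic = softmax(logits) - delta

# Numerical gradient by central finite differences
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    step = np.zeros_like(logits)
    step[i] = eps
    numeric[i] = (cross_entropy(logits + step, y) - cross_entropy(logits - step, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The entry for the correct label is negative and all other entries are positive, so a gradient-descent step raises the correct logit and lowers the rest.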

Note that this notation is doing a lot of heavy lifting! This is really key and often skipped: things work out nicely for our simple function, but that is not always the case. For example, $\dfrac{\partial L(l_3)}{\partial l_3}$ is the derivative of a scalar-valued function with respect to a vector, $\dfrac{\partial l_3}{\partial l_2}$ is the derivative of a vector-valued function with respect to a vector, and $\dfrac{\partial l_3}{\partial W_3}$ is the derivative of a vector-valued function with respect to a matrix! See **Appendix 2** for a full exploration of this. Similarly, the products between these terms are not scalar multiplications; the kind of product depends on the shapes of the terms involved.
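
The different shapes of these derivatives can be made concrete with a small sketch. The layer sizes (4 inputs, 3 outputs) and all values are hypothetical, and `W3` is stored as (outputs, inputs). The derivative of a vector with respect to a matrix is a mostly-zero three-dimensional array, and contracting it with the upstream gradient gives the same answer as a simple outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                    # small hypothetical layer sizes

l2 = rng.normal(size=n_in)
W3 = rng.normal(size=(n_out, n_in))   # (outputs, inputs)

# Vector w.r.t. vector: the Jacobian dl3/dl2 has shape (n_out, n_in) and equals W3
dl3_dl2 = W3.copy()

# Vector w.r.t. matrix: dl3[k]/dW3[j, i] has shape (n_out, n_out, n_in).
# It equals l2[i] when k == j and is zero otherwise.
dl3_dW3 = np.zeros((n_out, n_out, n_in))
for k in range(n_out):
    dl3_dW3[k, k, :] = l2

# Scalar w.r.t. vector: the upstream gradient dL/dl3 has shape (n_out,)
dloss_dl3 = rng.normal(size=n_out)    # stand-in values

# Contracting over k with the sparse tensor matches a simple outer product
via_tensor = np.einsum("k,kji->ji", dloss_dl3, dl3_dW3)
via_outer = np.outer(dloss_dl3, l2)

print(dl3_dl2.shape, dl3_dW3.shape, np.allclose(via_tensor, via_outer))
# (3, 4) (3, 3, 4) True
```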

TODO: A perceptron lesson. Each perceptron learns a hyperplane, so in a network we are stacking lots of hyperplanes together and summing them. Then add the non-linearities, and depth.




In [9]:
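import numpy as np

class MyNeuralNetwork:
    def __init__(self, learning_rate=0.05):

        self.a = learning_rate

        # Define weight matrix (output dim, input dim) by convention.
        # Small weights centred on zero keep the initial logits small, so the softmax does not saturate.
        self.W1 = np.random.uniform(-0.05, 0.05, (512, 28*28))
        self.W2 = np.random.uniform(-0.05, 0.05, (512, 512))
        self.W3 = np.random.uniform(-0.05, 0.05, (10, 512))

    def loss(self, l3, y):
        # Cross-entropy loss for a single example. Subtracting the maximum logit
        # leaves the softmax unchanged but avoids overflow in exp().
        shifted = l3 - np.max(l3)
        return -(shifted[y] - np.log(np.sum(np.exp(shifted))))

    def update_weights(self, x, y):
        # x is a flattened image of shape (784,), y is the integer index of the correct label

        # Forward pass
        l1 = x @ self.W1.T  # (512,)
        l2 = l1 @ self.W2.T  # (512,)
        l3 = l2 @ self.W3.T  # (10,)

        loss = self.loss(l3, y)
        print(f"Loss: {loss}")

        # Backward pass. Pay very careful attention to the shapes: each weight
        # gradient must have the same shape as its weight matrix.
        # dL/dl3 = softmax(l3) - one_hot(y), shape (10,)
        exps = np.exp(l3 - np.max(l3))
        dloss_dl3 = exps / np.sum(exps)
        dloss_dl3[y] -= 1

        # Propagate the gradient back through the layers (dl3/dl2 = W3, dl2/dl1 = W2)
        dloss_dl2 = dloss_dl3 @ self.W3  # (10,) @ (10, 512) = (512,)
        dloss_dl1 = dloss_dl2 @ self.W2  # (512,) @ (512, 512) = (512,)

        # dL/dW[j, i] = dL/dl[j] * input[i], i.e. an outer product
        dloss_dW3 = np.outer(dloss_dl3, l2)  # (10, 512)
        dloss_dW2 = np.outer(dloss_dl2, l1)  # (512, 512)
        dloss_dW1 = np.outer(dloss_dl1, x)   # (512, 784)

        self.W3 -= self.a * dloss_dW3
        self.W2 -= self.a * dloss_dW2
        self.W1 -= self.a * dloss_dW1

    def train_step(self, x, y):

        self.update_weights(x, y)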
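
One way to gain confidence in hand-derived gradients is to compare them against finite differences. The sketch below implements the backward pass for a small three-layer linear network using outer products and checks every weight gradient numerically. The layer sizes (6, 5, 4, 3), the input and the label are hypothetical and chosen small so that the check runs quickly; it is a standalone illustration rather than a test of the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = (6, 5, 4, 3)      # hypothetical sizes: input, layer 1, layer 2, layer 3
x = rng.random(sizes[0])  # stand-in input vector
y = 2                     # stand-in correct label
W1 = rng.uniform(-0.5, 0.5, (sizes[1], sizes[0]))
W2 = rng.uniform(-0.5, 0.5, (sizes[2], sizes[1]))
W3 = rng.uniform(-0.5, 0.5, (sizes[3], sizes[2]))

def loss_fn(W1, W2, W3):
    l3 = x @ W1.T @ W2.T @ W3.T
    shifted = l3 - np.max(l3)
    return -(shifted[y] - np.log(np.sum(np.exp(shifted))))

def backprop(W1, W2, W3):
    l1 = x @ W1.T
    l2 = l1 @ W2.T
    l3 = l2 @ W3.T
    exps = np.exp(l3 - np.max(l3))
    d3 = exps / np.sum(exps)
    d3[y] -= 1.0          # dL/dl3 = softmax - one-hot
    d2 = d3 @ W3          # dL/dl2
    d1 = d2 @ W2          # dL/dl1
    # Each weight gradient is the outer product of the upstream gradient and the layer input
    return np.outer(d1, x), np.outer(d2, l1), np.outer(d3, l2)

def numeric_grad(index, eps=1e-6):
    weights = [W1, W2, W3]
    W = weights[index]
    grad = np.zeros_like(W)
    for pos in np.ndindex(W.shape):
        original = W[pos]
        W[pos] = original + eps
        up = loss_fn(*weights)
        W[pos] = original - eps
        down = loss_fn(*weights)
        W[pos] = original              # restore the weight
        grad[pos] = (up - down) / (2 * eps)
    return grad

analytic = backprop(W1, W2, W3)
errors = [np.max(np.abs(analytic[k] - numeric_grad(k))) for k in range(3)]
print(errors)   # three very small numbers
```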





