# Understanding Neural Networks

Remember that machine learning models are mathematical functions that perform some input-output transformation.
The goal of machine learning models is to learn the function they should represent.

More complex problems have more complex input-output relationships, and that means they need to be represented by more complicated models.
Models with the ability to represent more complicated functions are said to have a higher _capacity_.

> Neural networks are a type of model that can learn to represent very complicated input-output relationships

If you know how a logistic regression model is built and trained, then you already understand everything you need to understand how neural networks work.

Logistic regression works by:
- Applying a linear transformation followed by a non-linear activation function (the sigmoid, or the softmax for multiclass problems)
- Using the chain rule of differentiation to determine how changes to model parameters change the output, that is, to determine the gradient
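
The two steps above can be sketched in PyTorch as follows. This is a minimal illustration that assumes the binary case with a sigmoid activation; the class name `LogisticRegression` and the example tensors are made up for this sketch.

```python
import torch

class LogisticRegression(torch.nn.Module):
    def __init__(self, num_inputs):
        super().__init__()
        self.linear = torch.nn.Linear(num_inputs, 1)  # weights and bias

    def forward(self, X):
        # linear transformation followed by a non-linear activation (sigmoid)
        return torch.sigmoid(self.linear(X))

torch.manual_seed(0)
logistic_model = LogisticRegression(num_inputs=4)
X_example = torch.randn(8, 4)                      # 8 examples, 4 features (random, illustrative)
y_example = torch.randint(0, 2, (8, 1)).float()    # random binary labels

probabilities = logistic_model(X_example)
loss = torch.nn.functional.binary_cross_entropy(probabilities, y_example)
loss.backward()  # chain rule: fills in the gradient of the loss for every parameter

print(probabilities.shape)
print(logistic_model.linear.weight.grad.shape)
```

A neural network repeats the first step several times before producing an output; the second step stays the same.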

> The idea behind neural networks is that by applying sequential transformations, we can represent a more complicated function. 

![](./images/Linear%20Regression%20vs%20NN%20Graphical%20Model.png)

We call these intermediate layers of transformation, between the input and the output, _hidden layers_.

The typical diagram of a neural network shows the individual nodes in each hidden layer.

> In a neural network, each sequential layer combines features to build more complex features

![](./images/Linear%20Regression%20vs%20NN.png)

And these neural networks have parameters between layers similar to those found in linear models:

![](./images/NN%20Showing%20Weights.png)

We will get onto how each value in each layer is calculated shortly.

The width and depth of a neural network make up part of the _neural network architecture_. Neural networks can have any number of hidden layers, with any width, and any number of inputs and outputs.

![](./images/NN%20with%20Many%20Layers.png)

That means that you can train neural networks to represent functions with inputs and outputs of any size.

![](./images/Different%20Vanilla%20NN%20Architectures.png)

There are a few more details to go into (particularly the activation function used below), but first, take a look at how easily a neural network can be built in PyTorch:

In [None]:
import torch

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.activation_function = torch.nn.functional.relu
        self.hidden_layer_1 = torch.nn.Linear(num_inputs, 256)
        self.hidden_layer_2 = torch.nn.Linear(256, 128)
        self.hidden_layer_3 = torch.nn.Linear(128, 128)
        # you could have many more hidden layers here
        self.output_layer = torch.nn.Linear(128, num_outputs)

    def forward(self, X):
        hidden_layer_1_activations = self.activation_function(self.hidden_layer_1(X))
        hidden_layer_2_activations = self.activation_function(self.hidden_layer_2(hidden_layer_1_activations))
        hidden_layer_3_activations = self.activation_function(self.hidden_layer_3(hidden_layer_2_activations))
        output = self.output_layer(hidden_layer_3_activations)
        return output

Even better would be to use `torch.nn.Sequential`, which avoids having to come up with a clear name for each layer.
It takes in a number of different torch modules (transformations) and returns a callable object.
When called, each of these transformations is applied sequentially to the input.

In [None]:
import torch

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(num_inputs, 256),
            torch.nn.ReLU(), # relu activation function
            torch.nn.Linear(256, 128),
            torch.nn.ReLU(), # relu activation function
            torch.nn.Linear(128, 128),
            torch.nn.ReLU(), # relu activation function
            # you could have many more hidden layers here
            torch.nn.Linear(128, num_outputs)
        )

    def forward(self, X):
        return self.layers(X)

Note that it would have been tough to write out by hand the gradient of this model's output with respect to its parameters, but PyTorch takes care of all of this for us in the background.
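
Because width and depth are just numbers, the list of layers can also be generated in a loop. The following is a minimal sketch; `build_mlp` and its arguments are hypothetical names chosen for illustration, not part of PyTorch.

```python
import torch

def build_mlp(num_inputs, hidden_widths, num_outputs):
    """Hypothetical helper: one Linear + ReLU pair per entry in hidden_widths."""
    layers = []
    previous_width = num_inputs
    for width in hidden_widths:
        layers.append(torch.nn.Linear(previous_width, width))
        layers.append(torch.nn.ReLU())
        previous_width = width
    # output layer has no activation, as in the models above
    layers.append(torch.nn.Linear(previous_width, num_outputs))
    return torch.nn.Sequential(*layers)

shallow_wide = build_mlp(10, [512], 3)             # one wide hidden layer
deep_narrow = build_mlp(10, [32, 32, 32, 32], 3)   # four narrow hidden layers

example_batch = torch.randn(5, 10)  # 5 examples, 10 features
print(shallow_wide(example_batch).shape)
print(deep_narrow(example_batch).shape)
```

Both networks map 10 inputs to 3 outputs; only the hidden part of the architecture differs.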

## What are those activation functions?

> An _activation function_ is any non-linear function applied to the linear output of a hidden layer

Here are some of the most common activation functions:

![](./images/Common%20Activation%20Functions.png)

> It is important that the activation function is differentiable (at least almost everywhere; ReLU, for example, has a kink only at zero), like every module in our neural networks, because the model will be trained using gradient descent
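
To make this concrete, the sketch below evaluates three widely used activation functions on a few sample values and asks PyTorch for their gradients. The choice of ReLU, sigmoid and tanh is an assumption; the figure above may show a different set.

```python
import torch

# sample pre-activation values from -3 to 3
z = torch.linspace(-3.0, 3.0, steps=7, requires_grad=True)

activation_functions = {
    "relu": torch.relu,
    "sigmoid": torch.sigmoid,
    "tanh": torch.tanh,
}

for name, function in activation_functions.items():
    output = function(z)
    # gradient of each output element with respect to its own input
    (gradient,) = torch.autograd.grad(output.sum(), z)
    print(name, "values:", output.detach(), "gradients:", gradient)
```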

With the activation functions, the full forward pass (for a single hidden layer neural network) computes the following:

![](./images/NN%20Full%20Forward%20Pass%20Equation.png)
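
Written out by hand, a single-hidden-layer forward pass might look like the sketch below. It assumes the usual form (linear layer, activation, linear layer) with ReLU as the activation; the names `hidden_weights`, `hidden_bias`, `output_weights` and `output_bias` are illustrative and may not match the notation in the figure.

```python
import torch

torch.manual_seed(0)
num_inputs, hidden_width, num_outputs = 4, 8, 1

# randomly initialised parameters, one weight matrix and bias vector per layer
hidden_weights = torch.randn(num_inputs, hidden_width)
hidden_bias = torch.zeros(hidden_width)
output_weights = torch.randn(hidden_width, num_outputs)
output_bias = torch.zeros(num_outputs)

def forward_pass(X):
    # hidden layer: linear transformation followed by the activation function
    hidden_activations = torch.relu(X @ hidden_weights + hidden_bias)
    # output layer: linear transformation of the hidden activations
    return hidden_activations @ output_weights + output_bias

X_batch = torch.randn(5, num_inputs)  # 5 examples
print(forward_pass(X_batch).shape)
```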

### Why do we need activation functions?

> If you leave out the activation function, the forward pass of a neural network is equivalent to a single linear transformation. You added all those parameters, and didn't increase the capacity of the model!

Below is a simple proof. 

_Note that biases are removed for simplification (a weighted sum plus bias can be rewritten as a single matrix multiplication, so the example below still holds)_

![](./images/Why%20you%20Need%20Activation%20Functions.png)
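
The same argument can be checked numerically. This is a small sketch with arbitrary shapes and random weights (biases omitted, as in the proof): two stacked linear layers give the same result as one layer whose weight matrix is the product of the two, and adding a ReLU in between breaks that equivalence.

```python
import torch

torch.manual_seed(0)
X_batch = torch.randn(5, 4)          # 5 examples, 4 features
first_weights = torch.randn(4, 8)    # "hidden layer" weights
second_weights = torch.randn(8, 3)   # "output layer" weights

# two linear layers applied one after the other, no activation in between
two_layers = (X_batch @ first_weights) @ second_weights
# a single linear layer using the combined weight matrix
one_layer = X_batch @ (first_weights @ second_weights)
print("same without activation:", torch.allclose(two_layers, one_layer, atol=1e-4))

# with a non-linearity between the layers the network no longer collapses
with_relu = torch.relu(X_batch @ first_weights) @ second_weights
print("same with ReLU:", torch.allclose(with_relu, one_layer, atol=1e-4))
```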

In general, stick with ReLU as your activation function.

> There are so many activation functions that work comparably well that research on new activation functions has become uninteresting

## Neural networks build hierarchical representations of the data

> It's easier to learn input-output transformations step by step (layer by layer), rather than all at once.

For example:
- it's easy to learn a function that detects a line at a certain angle
- it is much harder to learn a function that detects the make and model of a car in an image, directly from the raw pixels, all at once

Neural networks learn simple functions layer by layer. Their output is then the input to the next layer. The inputs to each successive layer are more meaningful and high level, and they can be combined to represent even more complex features.

![](./images/nnfeaturevis.png)

- The first hidden layer computes features that are combinations of the inputs
    - e.g. combine raw pixel values to create edges
- The next hidden layer computes more complex features, that are combinations of the previous layer
    - e.g. combine edges to create shapes
- Subsequent layers continue to compute more complex features, that are combinations of the previous layer
    - e.g. combine shapes to represent objects
- And so on
    - e.g. deeper layers combine objects to build abstract, meaningful representations
- The output layer can easily combine the meaningful representations to make predictions
    - The high level features extracted by the deeper layers are the kind of things that can eventually be used to make direct predictions from.
        - e.g. if you know that an image contains a wheel, you can be confident in a prediction of the image containing a car
        - but you can't tell this just by looking at the value of the pixels

> Neural networks turn low-level (raw, simple) features into high-level (complex, abstract, meaningful) features before using them to make a prediction

Simple models try to make predictions directly from a combination of the raw inputs, in a single transformation. 
Neural networks, on the other hand, apply sequential intermediate transformations, where each layer builds on the output of the layer before.

> Neural networks are different to many other types of models because they build hierarchical representations of data

## What do the values of the activations represent?

The function that a neural network represents is determined by the parameters in each layer, which are initialised randomly. That means, to start with, the neural network is a random function. We'll train it shortly using gradient descent.

> The values of activations in layers represent the presence of features in the input, and deeper layers represent more complex features
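
One way to look at these values is to record the output of every layer for a small batch. The sketch below assumes a small `torch.nn.Sequential` stand-in; the name `small_network` and its layer sizes are illustrative. Since it is untrained, the activations are random features rather than meaningful ones.

```python
import torch

torch.manual_seed(0)
small_network = torch.nn.Sequential(
    torch.nn.Linear(4, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)

current = torch.randn(2, 4)  # 2 examples, 4 features
layer_outputs = []
with torch.no_grad():  # only inspecting values, no gradients needed
    for layer in small_network:
        current = layer(current)  # output of this layer is the input to the next
        layer_outputs.append(current)
        print(type(layer).__name__, tuple(current.shape))
```

Each row of a hidden layer's output holds one value per node in that layer, for one example in the batch.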

## Neural networks learn their own representations for the data

Whereas simple models are forced to make a prediction based on the way that data is provided to them, neural networks learn how to represent the data through the hidden layers, so that the final layer receives useful inputs.

> Neural networks learn to extract useful features from the input, by learning for themselves what it would be useful for each activation to represent 

## Notice that neural networks can have lots of parameters

- Vector outputs of hidden layers = weight matrices instead of just weight vectors
- Wider hidden layers = larger weight matrices
- More hidden layers = more weight matrices

These additional parameters are part of what gives neural networks a much greater representational capacity than simple (e.g. linear) models.

In [None]:
def count_parameters(model):
    n_params = 0
    for param in model.parameters():
        n_params += param.numel()
    print("# model parameters:", n_params)

model = NeuralNetwork(3, 1)
count_parameters(model)
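
The count can also be worked out by hand, assuming each `torch.nn.Linear(a, b)` layer holds an `a × b` weight matrix plus `b` biases (which is the case with its default `bias=True`). For `NeuralNetwork(3, 1)` the layer sizes are 3, 256, 128, 128 and 1:

```python
# layer sizes of NeuralNetwork(3, 1): inputs, three hidden layers, outputs
layer_sizes = [3, 256, 128, 128, 1]

expected_params = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = fan_in * fan_out  # one weight per input-output pair
    biases = fan_out            # one bias per output
    print(f"{fan_in} -> {fan_out}: {weights} weights + {biases} biases")
    expected_params += weights + biases

print("total:", expected_params)
```

The total comes to 50,561, which should agree with the number printed by `count_parameters`.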

## Neural Network Architectures

These neural networks that simply process vectors through linear layers with activation functions are often referred to as:
- "Vanilla" neural networks
- Feedforward neural networks
- Multi-layer perceptrons (MLP)

There are many more advanced neural network architectures that specialise in different ways. Common ones you may have heard of include:
- Convolutional Neural Networks
- Recurrent Neural Networks
- Transformers

## Training neural networks

Neural networks can be trained end-to-end using gradient descent and the chain rule of differentiation.

A key thing to notice is that for many of the gradient calculations, some terms reappear for different parameters.

![](./images/Backpropagation.png)

> Backpropagation specifically refers to the algorithm that caches these repeated terms, so that they don't have to be recomputed.

PyTorch handles this all efficiently under the hood when you call `.backward()` on any variable computed from the model's output (probably the loss) to populate the parameter gradients.
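
As a small illustration, the sketch below checks that parameter gradients are empty before `.backward()` is called and populated afterwards. The model `tiny_model` and the random data are placeholders made up for this example.

```python
import torch

torch.manual_seed(0)
tiny_model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)
example_features = torch.randn(6, 4)  # 6 examples, 4 features
example_labels = torch.randn(6, 1)

# before the backward pass, no parameter has a gradient
print(all(param.grad is None for param in tiny_model.parameters()))

example_loss = torch.nn.functional.mse_loss(tiny_model(example_features), example_labels)
example_loss.backward()  # backpropagation through every layer

# afterwards, each parameter has a gradient of the same shape as itself
for name, param in tiny_model.named_parameters():
    print(name, tuple(param.grad.shape))
```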

Now, you understand everything required to train a neural network in PyTorch.




This cell just creates a dummy dataset.

In [None]:
def create_dummy_dataset(num_examples, num_features):
    X = torch.randn((num_examples, num_features))
    y = torch.randn((num_examples, 1)) # 1 label each
    return X, y

def create_dummy_dataloader(X, y, batch_size=4):
    def create_batches(data):
        # slice the data into consecutive batches; the final batch may be smaller
        return [
            data[start: start + batch_size]
            for start in range(0, len(data), batch_size)
        ]
    batched_X = create_batches(X)
    batched_y = create_batches(y)
    return list(zip(batched_X, batched_y))

X, y = create_dummy_dataset(10, 4)
dataloader = create_dummy_dataloader(X, y)
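
PyTorch also ships its own batching utilities, which could be used in place of a hand-written dataloader. A minimal sketch, assuming the same kind of random tensors as above (the variable names are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

random_features = torch.randn(10, 4)  # 10 examples, 4 features
random_labels = torch.randn(10, 1)    # 1 label each

# TensorDataset pairs up features and labels; DataLoader splits them into batches
builtin_dataloader = DataLoader(
    TensorDataset(random_features, random_labels),
    batch_size=4,
    shuffle=True,  # reshuffle the examples every epoch
)

for batch_features, batch_labels in builtin_dataloader:
    print(batch_features.shape, batch_labels.shape)
```

With 10 examples and a batch size of 4, the last batch holds the 2 leftover examples.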

Before we train the model, note that the labels in this dataset are random, so there is no real relationship to learn and the loss does not decay as it would with a real dataset.

Overall, the training loop is the same as for a simple linear model:

In [None]:
def train(model, dataloader, epochs=10):
    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(epochs):
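        for batch in dataloader:
            features, labels = batch
            prediction = model(features)  # forward pass

            loss = torch.nn.functional.mse_loss(prediction, labels)
            loss.backward()  # backpropagation populates each parameter's .grad
            print("Loss:", loss.item())

            optimiser.step()  # gradient descent update
            optimiser.zero_grad()  # reset gradients so they don't accumulate across batches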

X, y = create_dummy_dataset(100, 4) # 100 examples, 4 features each
dataloader = create_dummy_dataloader(X, y)
model = NeuralNetwork(4, 1)
train(model, dataloader)