# Understanding PyTorch

Consider implementing a linear regression model from scratch, along the lines of that shown below:


In [None]:
import numpy as np

class LinearRegression:
    def __init__(self, num_inputs, num_outputs):
        self.weights = np.random.randn(num_outputs, num_inputs)
        self.bias = np.random.randn(num_outputs)

    def predict(self, X):
        return X @ self.weights + self.bias

    def get_gradients(self, X):
        grad_w = X
        grad_b = 1
        return grad_w, grad_b


Linear regression is a useful model, but it is unlikely to be able to tackle most problems of interest, because it can only represent simple linear input-output relationships.

Lots of real world prediction problems require a more complicated model, that can represent a more complex input-output function.
So we will probably want to replace the predict method with some other mathematical function.

The problem with the above implementation, is that:
1. Every time you change the model, you need to change how the gradients are calculated
    - The faster we can change the model, the faster we can experiment and determine what doesn't work
2. More complex models have more complex gradient calculations
    - It's easier to make mistakes when implementing the gradient calculations from scratch
    - It might take a bit of research to find out how to calculate the gradient for functions you aren't familiar with
    - Many lines of code might be required to compute individual gradients for models that apply many transformations

These problems become a big concern when dealing with large and complicated models, which are the kind of models that are used to tackle many modern AI problems.
Pytorch is designed to directly address these, specifically for the application of building deep learning models (neural networks).

> PyTorch automatically computes gradients of the different transformations applied by the model, so that you don't have to.

PyTorch may alse be described as:
- A deep learning framework
- A library containing common utilities for deep learning and gradient-based optimisation
- A library for performing efficient mathematical computations

But...

> PyTorch's main feature is its ability to perform automatic differentiation.

I don't expect you to understand this yet, but the equivalent implementation in PyTorch would be as shown below. 

In [4]:
import torch

class PyTorchLinearRegression(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.linear_layer = torch.nn.Linear(num_inputs, num_outputs)

    def forward(self, X):
        return self.linear_layer(X)

As you can see, the PyTorch implementation does not require you to define how to compute the gradients of the output with respect to the model parameters.

Aside from that, there are three key differences to address between the from-scratch implementation and the PyTorch implementation:
1. The model inherits from a class named `torch.nn.Module`
1. The method that takes in the inputs and makes a prediction has a special name `forward`
1. Instead of defining the parameters, we define an attribute of the class that represents the transformation
    - This is done using a class from PyTorch
    - The layer here contains both the bias and the weights
    - The instance of the class is callable, and when called on the inputs it performs the linear transformation (weighted sum of inputs + bias)

All of these points are really a result of the model inheriting from `torch.nn.Module`

The name "module" here refers to building blocks of computation which need to store an internal state (like parameters)
- The entire linear regression model is a module
- The linear layer itself (torch.nn.Linear(in, out)) is a module

This parent class does a lot of important things under the hood, which we will uncover as we need them going forward, but some key things to notice are:
1. When we call the model (by doing `model(input_data)`), `torch.nn.Module` tells it to run its `forward` method on the input data
1. `torch.nn.Module` module looks for any attributes of the class that also inherit from `torch.nn.Module`

The parameter values are initialised randomly by PyTorch and can be found inside the instance of the linear layer:

In [40]:
model = PyTorchLinearRegression(3, 1)
print("Weight:", model.linear_layer.weight)
print("Bias:", model.linear_layer.bias)

Parameter containing:
tensor([[ 0.0430, -0.2506, -0.3324]], requires_grad=True)
Parameter containing:
tensor([-0.0165], requires_grad=True)


## So where are the gradients?

Firstly, I need to introduce the main data type used in PyTorch: the _torch tensor_.

The model input must be a `torch.tensor`. The model output will be a `torch.tensor`. The model parameters are all `torch.tensors`.

Quickly scan the docs [here](https://pytorch.org/docs/stable/tensors.html).


In [64]:
import torch.nn.functional as F

X = torch.tensor([[1.0, 2.0, 3.0]]) # create a torch tensor
y = torch.tensor([[2.3]]) # create a label for this example

model = PyTorchLinearRegression(3, 1) # initialise the model
prediction = model(X) # TODO call the model on the data
loss = F.mse_loss(prediction, y)

print(type(prediction)) # TODO print the type of the model output

<class 'torch.Tensor'>


> Calling `any_variable.backward()` on any torch tensor populates the `.grad` attribute of all torch tensors which contribute to it, recursively. The value of a torch tensor's `.grad` attribute is the value of the gradient of `any_variable` with respect to that torch tensor.


In [65]:
prediction.backward()

Take the directed graph of the simplest machine learning model, linear regression, and its loss:

![linear](../../images/Linear%20Regression%20Directed%20Graph%20Backward%20Pass.gif)

_Note: See detail 1 below for why not all nodes have backward arrows_

### 2 more details of `.backward()`

#### Detail 1: Which tensors get their gradients populated?

Gradients are populated recursively, for not only all of the variables that contributed to whatever you called `.backward` on, but to whatever variables contributed to those, and so on...

However, not all tensors get their gradients populated.

Every `torch.tensor` has a `requires_grad` attribute, which is either `True` or `False`. The `requires_grad` attribute determines whether the gradient should be computed.
    - Our end goal is to use the gradient to optimise model parameters, so the `.grad` attribute of the model weights and bias will be `True`
    - We do not care about optimising the input data, which is fixed, so its `.grad` attribute will be `False`
    - Any `torch.tensor` computed from a tensor with `.requires_grad`

In [66]:
print("Input data `requires_grad`:", X.requires_grad)
print("Weights `requires_grad`:", model.linear_layer.weight.requires_grad)
print("Bias `requires_grad`:", model.linear_layer.bias.requires_grad)
print("Prediction `requires_grad`:", prediction.requires_grad)

Input data `requires_grad`: False
Weights `requires_grad`: True
Bias `requires_grad`: True
Prediction `requires_grad`: True


![](../../images/Linear%20Regression%20Directed%20Graph%20Backward%20Pass%20Requires%20Grad.png)

If we know the (differentiable) function that was used to compute a tensor (e.g. add, softmax, matrix multiply), then we know the function that tells us the gradient of the function for any input value.
We call this the gradient function.

> Every function has a corresponding gradient function that tells you how steep the original function is for any given inputs. This is the function we need to use in the backward pass.

![](../../images/Corresponding%20Gradient%20Functions.png)

Mathematicians found the gradient functions for different operations, and programmers implemented them as part of PyTorch.
A lot of the code under the hood of PyTorch defines the gradient functions for different operations.

About the gradient function:
- The resulting tensor from an operation stores that operation's gradient function in its `grad_fn` attribute.
- The gradient function represents the backward pass.
- Every PyTorch tensor that is computed from a tensor with `requires_grad=True` will have a `grad_fn` attribute.
- The gradient function should be called on the inputs to the forward pass to compute the `.grad` attribute.

In [67]:
t = torch.tensor(3.0)
t.requires_grad = True
prediction = t * X.T + 1

print("Input data `grad_fn`:", X.grad_fn)
print("Weights `grad_fn`:", model.linear_layer.weight.grad_fn)
print("Bias `grad_fn`:", model.linear_layer.bias.grad_fn)
print("Prediction `grad_fn`:", prediction.grad_fn)

Input data `grad_fn`: None
Weights `grad_fn`: None
Bias `grad_fn`: None
Prediction `grad_fn`: <AddBackward0 object at 0x7fd338fdaf70>



Remember... the gradients are populated recursively. 

Imagine:
1. You compute $b = f(a)$ then $c = g(b)$
1. You call `c.backward()`

Common misconception: `b.grad` $=\frac{\partial b}{\partial a}$

In fact: `b.grad` $=\frac{\partial c}{\partial b}\frac{\partial b}{\partial a}$

That is, `.backward()`...
- ...not only computes the rate of change of the immediate output with respect to the input
- ...but multiplies that with all of the other gradients 

> The value of `.grad` is populated by the multiplication of all gradients between whatever you called `.backward()` on and the tensor whos `.grad` you are populating. I.e. the chain rule of differentiation is applied.

![](../../images/Linear%20Regression%20Directed%20Graph%20Backward%20Pass%20Chain%20Rule.png)

## Tensor shapes: Specifically, PyTorch's batch dimension

You may have noticed that earlier, when defining our dataset (just a single example), `X`, it was a list of lists, each containing 3 numbers, rather than just a list of 3 numbers.
The inner list represents an example datapoint with 3 features.
The outer list represents the whole dataset, and the dataset will always contain more than just a single example.

> In PyTorch, the first dimension is the batch dimension

That means, if you have a dataset of 100 examples with 4 features each, then it will have size (100, 4).

This is important because it's expected by models and many other functions in PyTorch.

## Create a Dummy Dataloader

During training using mini-batch gradient descent, we will iterate through batches of data.

The cell below creates a larger dataset, and splits the data up into batches that we can iterate through.

Each of those batches has size (`batch_size`, `num_features`). Going forward we will reference this size as `(B, N)`.

Don't worry about the exact code inside the cell below, there's a much better way to build PyTorch dataloaders, but just understand that it splits the dataset into batches.

In [95]:
def create_dummy_dataset(num_examples, num_features):
    X = torch.randn((num_examples, num_features))
    y = torch.randn((num_examples, 1)) # 1 label each
    return X, y

def create_dummy_dataloader(X, y, batch_size=4):
    def create_batches(data):
        return [
            data[idx*batch_size: (idx+1) * batch_size] if (idx + 1) * batch_size < len(data)
            else data[idx*batch_size:]
            for idx in range(len(data) // batch_size)
        ]
    batched_X = create_batches(X)
    batched_y = create_batches(y)
    return list(zip(batched_X, batched_y))
    
X, y = create_dummy_dataset(10, 4)
print(X, y)

dataloader = create_dummy_dataloader(X, y)
for idx, batch in enumerate(dataloader):
    print(f'Batch {idx}')
    X, y = batch
    print("Features:")
    print(X)
    print("Labels:")
    print(y)
    print()

tensor([[-1.0984,  1.6435,  0.2389, -0.6266],
        [-0.5094, -0.3000, -0.4245,  1.2744],
        [-1.0762, -0.5869, -0.3151, -0.6500],
        [-1.2662, -0.5093,  0.4892, -0.7936],
        [ 0.2077, -1.5511,  1.6262,  0.9098],
        [ 1.2605, -0.0414, -1.4537, -1.3826],
        [-1.2499, -0.8110, -1.2698, -0.7196],
        [-0.7010,  0.1051,  0.7586, -0.4480],
        [-0.8535, -0.1496, -0.4446,  0.0243],
        [-0.9587, -0.8736,  0.2331,  0.4560]]) tensor([[-1.3960],
        [-0.9889],
        [ 0.0250],
        [-0.7487],
        [ 0.3950],
        [ 0.3530],
        [ 0.5453],
        [ 0.8167],
        [-0.8880],
        [-0.1232]])
Batch 0
Features:
tensor([[-1.0984,  1.6435,  0.2389, -0.6266],
        [-0.5094, -0.3000, -0.4245,  1.2744],
        [-1.0762, -0.5869, -0.3151, -0.6500],
        [-1.2662, -0.5093,  0.4892, -0.7936]])
Labels:
tensor([[-1.3960],
        [-0.9889],
        [ 0.0250],
        [-0.7487]])

Batch 1
Features:
tensor([[ 0.2077, -1.5511,  1.6262,  0.90

## Optimisation in PyTorch

> PyTorch includes many typical gradient based optimisation algorithms

Here's how you can get the classic stochastic gradient descent optimiser in PyTorch:

In [96]:
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)


All PyTorch optimisers take in 2 key parameters:
- The parameters which they will be used to optimise
- `lr`: The (initial) learning rate

About the optimiser:
- Under the hood, each PyTorch optimiser defines its corresponding parameter update rule.
- All that the optimiser requires (other than potentially some intenal parameters), is the gradient of each parameter, stored in that parameters `.grad` attribute.
- When the `.step()` method of an optimiser is called, it iterates through each of the parameters passed in upon initialisation, and updates them using the `.grad` attribute and the parameter update rule.

Here's how one optimisation step would be performed:

In [97]:
model = PyTorchLinearRegression(3, 1)
optimiser = torch.optim.SGD(model.parameters(), lr=1)
prediction = model(X)

print("Initial parameter value:", model.linear_layer.weight)
print("Initial grad:", model.linear_layer.weight.grad)

prediction.backward() # populate gradients

print("grad after `.backward()`", model.linear_layer.weight.grad)

optimiser.step()
print("Final parameter value:", model.linear_layer.weight)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x4 and 3x1)

One important thing to know about `.grad` and the optimiser:
- When `.backward()` is called and `.grad` is populated, the previous value is not replaced new one
- instead, it is accumulated (added to)
- this can be useful in rare occaisions
- most of the time, you should make sure to call `optimiser.zero_grad()` after `optimiser.step()`
    - that's because the old `.grad` value is now meaningless, since that was the gradient of the loss with respect to the tensor at a previous parameter value, which has since been updated by `optimiser.zero_grad()`
    - `optimiser.zero_grad()` iterates through all of the parameters tracked by the optimiser, then sets their `.grad` attribute to zero

In [98]:
print(model.linear_layer.weight.grad)
optimiser.zero_grad()
print(model.linear_layer.weight.grad)

None
None


Putting all of that together, here's how we would implement a very basic training loop to optimise our PyTorch model (missing many fancy things we will introduce later):

In [102]:
from random import shuffle

def train(model, dataloader, epochs=10):
    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(epochs):
        shuffle(dataloader)
        for batch in dataloader:
            features, labels = batch
            prediction = model(features)
            loss = torch.nn.functional.mse_loss(prediction, labels)
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()
        print("Loss:", loss.item())
        print()
            

X, y = create_dummy_dataset(100, 4) # 100 examples, 4 features each
dataloader = create_dummy_dataloader(X, y)
model = PyTorchLinearRegression(4, 1)
train(model, dataloader)

Loss: 0.5567338466644287

Loss: 0.7092782855033875

Loss: 1.0389024019241333

Loss: 0.8658571839332581

Loss: 0.8924563527107239

Loss: 0.6892538070678711

Loss: 0.24911728501319885

Loss: 1.357698917388916

Loss: 0.2516809105873108

Loss: 0.274871289730072



Note that because the data is random, the loss doesn't improve visibly. This notebook focuses on understanding PyTorch rather than introducing specific datasets that require more code. Try this on a real dataset to visualise the performance improvement.

## Defining what we mean by the forward and backward pass

Forward pass: 
- Going from input values to a prediction

Backward pass:
- Determining the gradient of the model
- Essentially computing the derivative of the model with respect to its parameters