## Part A - Linear Regression from scratch with PyTorch

After showing what the data looks like (from Notion), refer to the article so people can see their example and how it looks visually (also emphasise how important their blog posts and articles are)!

Also mention above the heading Part A & Part B

Ask Arvindra if I should cut out Part A (for the presentation at least)

Disclaimer about the data

In [52]:
import numpy as np
import torch

Our model will predict the number of sales for 2 products (target variables) by looking at the number of Sales People, the number of Social Media Posts, as well as the amount of money spent on Google Ads (input variables or features) per month. Here’s the training data:

<img src="LinearRegressionData.png" style="width:800px;height:200px">

This walk-through is inspired by the team at [Jovian.ml](jovian.ml) [[1](https://medium.com/@aakashns/linear-regression-with-pytorch-3dde91d60b50)]. I followed their course, which helped me immensely, and I particularly liked that their example has multiple target variables.

`sold_product1 = w11 * people + w12 * ads + w13 * posts + b1`

`sold_product2 = w21 * people + w22 * ads + w23 * posts + b2`

Each target variable is estimated as a weighted sum of the input variables, offset by a constant known as the bias.
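
As a quick illustration of the first formula, here is the weighted sum worked out by hand for the first row of the training data. The weights and bias are hypothetical values chosen only to make the arithmetic easy to follow; they are not the values the model will learn.

```python
# First row of the training data: (people, ads, posts)
people, ads, posts = 8.0, 500.0, 6.0

# Hypothetical weights and bias for product 1 (illustrative values only)
w11, w12, w13, b1 = 0.5, 0.0625, 0.25, 1.0

# Weighted sum of the inputs plus the bias: 4.0 + 31.25 + 1.5 + 1.0
sold_product1 = w11 * people + w12 * ads + w13 * posts + b1
print(sold_product1)  # 37.75, compared with a target of 40
```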

We use optimisation techniques to adjust the weights and biases so that the predictions improve. A widely used technique for this is gradient descent.

In [53]:
# Input (people, ads, posts)
inputs = np.array([[8, 500, 6], 
                   [5, 200, 15], 
                   [6, 100, 12], 
                   [4, 300, 8], 
                   [6, 250, 3]], dtype='float32')

In [54]:
# Targets (product1, product2)
targets = np.array([[40, 20], 
                    [25, 15], 
                    [15, 10], 
                    [20, 15], 
                    [25, 20]], dtype='float32')

In [55]:
# Convert inputs and targets to tensors
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)
print(inputs)
print(targets)

tensor([[  8., 500.,   6.],
        [  5., 200.,  15.],
        [  6., 100.,  12.],
        [  4., 300.,   8.],
        [  6., 250.,   3.]])
tensor([[40., 20.],
        [25., 15.],
        [15., 10.],
        [20., 15.],
        [25., 20.]])


We start by randomly initialising the weight matrix and the bias vector. The first row of the weights and the first element of the biases are used to predict the first target variable.

In [56]:
# Randomly initialise the weights and biases
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
print(w)
print(b)

tensor([[-1.0462,  1.0804,  0.2616],
        [ 0.4201, -0.8799,  0.2832]], requires_grad=True)
tensor([0.3321, 2.2773], requires_grad=True)


Here we have used `torch.randn` to create 2 tensors: `w`, a (2 x 3) matrix, and `b`, a vector with 2 elements. Both are filled with random values drawn from a standard normal distribution.


In [57]:
def model(x):
    return x @ w.t() + b
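
To see why `x @ w.t() + b` gives one prediction per product for every row of input, it can help to check the shapes involved. The sketch below uses a placeholder input of ones and an arbitrary seed (both are assumptions made only so the example runs on its own).

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the sketch repeatable

w = torch.randn(2, 3, requires_grad=True)  # one row of weights per target
b = torch.randn(2, requires_grad=True)     # one bias per target

x = torch.ones(5, 3)   # stand-in for 5 rows of (people, ads, posts)
out = x @ w.t() + b    # (5, 3) @ (3, 2) -> (5, 2); b is broadcast across the rows

print(w.shape, b.shape, out.shape)
```

Transposing `w` turns it into a (3 x 2) matrix, so each input row is multiplied by both sets of weights at once, and the 2-element bias is added to every row.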

In [58]:
# Calculate pred
preds = model(inputs)
print(preds)

tensor([[ 533.7485, -432.6068],
        [ 215.1112, -167.3523],
        [ 105.2371,  -79.7928],
        [ 322.3698, -257.7433],
        [ 264.9480, -214.3244]], grad_fn=<AddBackward0>)


Because the weights and biases were initialised randomly, the predictions are nowhere near the targets. We therefore use a __loss function__ to evaluate the model's performance, and then adjust the weights and biases to reduce the loss.

In [59]:
# MSE loss
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

This function calculates the Mean Squared Error (MSE), which is our loss function. It takes the element-wise difference between the 2 tensors (preds and targets), then squares each element so that positive and negative errors cannot cancel each other out. The squared errors are then averaged and returned as a single number.

`torch.sum` - sums every element in a tensor.

`Tensor.numel()` - returns the number of elements in a tensor.
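
A small sanity check, sketched on made-up numbers: the hand-written `mse` should agree with PyTorch's built-in `F.mse_loss`, which averages the squared differences by default. The `preds` and `targets` values here are illustrative only.

```python
import torch
import torch.nn.functional as F

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

# Made-up predictions and targets, small enough to verify by hand
preds = torch.tensor([[42.0, 18.0], [20.0, 15.0]])
targets = torch.tensor([[40.0, 20.0], [25.0, 15.0]])

manual = mse(preds, targets)          # (4 + 4 + 25 + 0) / 4 = 8.25
builtin = F.mse_loss(preds, targets)  # default reduction is the mean
print(manual, builtin)
```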

In [60]:
# Compute loss
loss = mse(preds, targets)
print(loss)

tensor(81254., grad_fn=<DivBackward0>)


Taking the square root of this value gives the root mean squared error, which tells us roughly how far off a typical prediction is from its target, in the same units as the targets.

The targets themselves range from 10 to 40, so this loss is very high.

The lower the loss, the better the model fits the training data.
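
For the loss of 81254 printed above, the square root can be worked out directly. This is only the arithmetic on the printed value, not a re-run of the model.

```python
import math

mse_loss = 81254.0          # loss value printed above
rmse = math.sqrt(mse_loss)  # back on the same scale as the targets
print(round(rmse, 1))       # roughly 285
```

An error of about 285 units on targets that are at most 40 confirms that the randomly initialised model is far from useful.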

In [61]:
# Compute gradients
loss.backward()

In [62]:
# Gradients for weights (the derivative of the loss with respect to the weights, d(loss)/dw)
print(w)
print(w.grad)

tensor([[-1.0462,  1.0804,  0.2616],
        [ 0.4201, -0.8799,  0.2832]], requires_grad=True)
tensor([[  1618.2268,  88923.6328,   2007.1614],
        [ -1513.6586, -82431.4375,  -1882.6719]])


Calculating `d(loss)/dw` (the gradients of the loss with respect to the weights) is useful because it tells us in which direction to change each weight to reduce the loss. Since the loss is a quadratic function of the weights and biases, repeatedly moving against the gradients leads towards the set of weights where the loss is lowest.

If a gradient element is positive:
- increasing the corresponding weight’s value slightly increases the loss.
- decreasing the corresponding weight’s value slightly decreases the loss.

If a gradient element is negative:
- increasing the corresponding weight’s value slightly decreases the loss.
- decreasing the corresponding weight’s value slightly increases the loss.

For a small change to a weight element, the resulting change in loss is approximately __proportional__ to the gradient of the loss w.r.t. that element.

__We adjust the weights by subtracting a small quantity proportional to the gradient.__
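
The rule can be tried on a toy problem with a single weight. The loss `(w - 3)**2` and the step size `0.1` are assumptions chosen for illustration; they are unrelated to the sales data.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 3.0) ** 2   # toy loss with its minimum at w = 3
loss.backward()
print(w.grad)           # d(loss)/dw = 2 * (w - 3) = -6 at w = 0

lr = 0.1                # illustrative step size
with torch.no_grad():
    w -= lr * w.grad    # the gradient is negative, so w increases

new_loss = (w - 3.0) ** 2
print(loss.item(), new_loss.item())
```

Because the gradient is negative, subtracting a small multiple of it moves `w` up towards 3, and the loss drops.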

The gradients of the weights and biases then have to be reset to 0, because PyTorch accumulates gradients across calls to `.backward()`.
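
The accumulation is easy to observe on a single value. A minimal sketch, using a toy function `x * x` chosen only for illustration:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

(x * x).backward()
first = x.grad.item()    # d(x^2)/dx = 2x = 4

(x * x).backward()       # a second call adds to the stored gradient
second = x.grad.item()   # 4 + 4 = 8

x.grad.zero_()           # reset in place before the next step
third = x.grad.item()    # 0
print(first, second, third)
```

Without the reset, every update would use the sum of all gradients computed so far rather than the gradient of the current loss.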

In [63]:
# Adjust weights & reset gradients
with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()

We use `torch.no_grad()` to tell PyTorch not to track the updates to the weights and biases, as these updates should not be included when the next gradients are computed. We then reset the gradients to 0, because PyTorch would otherwise add the next gradients to the existing ones.

The __learning rate__ (here `1e-5`) is an important hyperparameter that ensures the weights and biases aren't changed too drastically in a single step. It is typically a small number that we multiply the gradients by.
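
To get a feel for why the learning rate matters, here is a sketch of plain gradient descent on the toy loss `f(w) = w**2`. The function `descend` and both learning rates are hypothetical and chosen only to show the contrast.

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on the toy loss f(w) = w**2."""
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w = w - lr * grad
    return w

w_small = descend(lr=0.1)  # each step multiplies w by 0.8, so w shrinks towards 0
w_large = descend(lr=1.1)  # each step multiplies w by -1.2, so |w| keeps growing
print(w_small, w_large)
```

With the small learning rate the weight settles near the minimum at 0, while the large one overshoots further on every step and diverges.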

In [64]:
print(w)
print(b)

tensor([[-1.0624,  0.1912,  0.2415],
        [ 0.4353, -0.0556,  0.3020]], requires_grad=True)
tensor([0.3295, 2.2798], requires_grad=True)


In [65]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(747.4018, grad_fn=<DivBackward0>)


We have applied only a single step of gradient descent, and can already see that the loss has decreased a lot!

We can continue to reduce the loss by repeating this process, where each pass over the training data is called an __epoch__.

In [66]:
# Train for 100 epochs
for i in range(100):
    preds = model(inputs)
    loss = mse(preds, targets)
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()

In [67]:
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(25.4727, grad_fn=<DivBackward0>)


In [68]:
preds

tensor([[42.4408, 22.7007],
        [18.6433, 15.0181],
        [ 7.0086, 11.5332],
        [27.6112, 15.5023],
        [19.2839, 13.3669]], grad_fn=<AddBackward0>)

In [69]:
targets

tensor([[40., 20.],
        [25., 15.],
        [15., 10.],
        [20., 15.],
        [25., 20.]])

## Part B - Using PyTorch built-ins for Linear Regression

In [70]:
import torch.nn as nn
# torch.nn contains utility classes for building neural networks

In [71]:
# Input (people, ads, posts)
inputs = np.array([[8, 500, 6], [5, 200, 15], [6, 100, 12],
                   [4, 300, 8], [6, 250, 3], [6, 300, 8], 
                   [7, 250, 20], [8, 350, 15], [8, 150, 8], 
                   [6, 150, 24], [5, 400, 20], [7, 350, 22], 
                   [9, 300, 16], [8, 50, 18], [8, 150, 20]], 
                  dtype='float32')

# Targets (product1, product2)
targets = np.array([[40, 20], [25, 15], [15, 10], 
                    [20, 15], [25, 20], [35, 25], 
                    [30, 25], [35, 25], [20, 12], 
                    [15, 10], [35, 10], [30, 17], 
                    [30, 26], [10, 16], [20, 14]], 
                   dtype='float32')

inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)

As you can see, we are using a larger dataset (15 training examples) which we can split into smaller batches.

`TensorDataset` - wraps the input and target tensors so that indexing it returns a tuple (input, target) (otherwise known as (feature, label)) for the selected rows. It provides the standard API that PyTorch uses for working with many different types of datasets. [[2](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset)]

`DataLoader` - useful when working with large datasets as you can split the data into batches of a predefined size while training. It also provides other utilities like shuffling and random sampling of the data.

Setting the `shuffle` flag to `True` reshuffles the training data before the batches are created in each epoch. Randomising the input to the optimisation algorithm in this way usually helps the loss decrease faster.

In [72]:
from torch.utils.data import TensorDataset

In [73]:
# Define dataset
train_ds = TensorDataset(inputs, targets)
train_ds[0:3]

(tensor([[  8., 500.,   6.],
         [  5., 200.,  15.],
         [  6., 100.,  12.]]),
 tensor([[40., 20.],
         [25., 15.],
         [15., 10.]]))

In [74]:
from torch.utils.data import DataLoader

In [75]:
# Define DataLoader
batch_size = 5
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

In [76]:
# How we typically use the DataLoader
for xb, yb in train_dl:
    print(xb)
    print(yb)
    break

tensor([[  4., 300.,   8.],
        [  7., 350.,  22.],
        [  8., 350.,  15.],
        [  6., 100.,  12.],
        [  5., 200.,  15.]])
tensor([[20., 15.],
        [30., 17.],
        [35., 25.],
        [15., 10.],
        [25., 15.]])
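
As a rough check of how batching divides the data, the sketch below builds a `DataLoader` over placeholder tensors with the same shapes as the 15-row dataset. The names `demo_ds` and `demo_dl` and the placeholder values are assumptions made so the example runs on its own.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder tensors with the same shapes as the 15-row dataset
demo_inputs = torch.arange(45, dtype=torch.float32).reshape(15, 3)
demo_targets = torch.arange(30, dtype=torch.float32).reshape(15, 2)

demo_ds = TensorDataset(demo_inputs, demo_targets)
demo_dl = DataLoader(demo_ds, batch_size=5, shuffle=True)

print(len(demo_ds), len(demo_dl))  # 15 examples split into 3 batches of 5

# Every batch pairs 5 input rows with their 5 matching target rows
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in demo_dl]
print(batch_shapes)
```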


In Part A, we initialised the weights and biases manually, but we can do this more conveniently with PyTorch's `nn.Linear` class.

In [77]:
# Define the model
model = nn.Linear(3, 2)
print(model.weight)
print(model.bias)

Parameter containing:
tensor([[ 0.5690, -0.0496, -0.0284],
        [-0.5325, -0.0467,  0.1187]], requires_grad=True)
Parameter containing:
tensor([0.2707, 0.3805], requires_grad=True)


In [78]:
# Parameters containing all of the weights and biases
list(model.parameters())

[Parameter containing:
 tensor([[ 0.5690, -0.0496, -0.0284],
         [-0.5325, -0.0467,  0.1187]], requires_grad=True),
 Parameter containing:
 tensor([0.2707, 0.3805], requires_grad=True)]
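
`nn.Linear` applies the same formula as the hand-written model from Part A. A minimal sketch to confirm this, using a random placeholder input (the names `layer` and `x` are illustrative):

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)  # 3 input features, 2 outputs
x = torch.randn(5, 3)    # random stand-in for 5 rows of inputs

# Same formula as the Part A model: inputs times transposed weights, plus bias
manual = x @ layer.weight.t() + layer.bias
matches = torch.allclose(layer(x), manual, atol=1e-5)
print(matches)
```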

In [79]:
preds = model(inputs)
preds

tensor([[-20.1504, -26.5200],
        [ -7.2311,  -9.8430],
        [ -1.6166,  -6.0609],
        [-12.5620, -14.8117],
        [ -8.8020, -14.1347],
        [-11.4241, -15.8766],
        [ -8.7154, -12.6499],
        [-12.9650, -18.4462],
        [ -2.8454,  -9.9357],
        [ -4.4373,  -6.9722],
        [-17.2940, -18.5908],
        [-13.7326, -17.0831],
        [ -9.9442, -16.5247],
        [  1.8313,  -4.0785],
        [ -3.1859,  -8.5118]], grad_fn=<AddmmBackward>)

The `nn.functional` package contains many useful loss functions and several other utilities.

In [80]:
import torch.nn.functional as F

In [81]:
# Define the loss function
loss_fn = F.mse_loss

In [82]:
loss = loss_fn(model(inputs), targets)
print(loss)

tensor(1212.4377, grad_fn=<MseLossBackward>)


Instead of manually updating the weights and biases using their gradients, we can use an optimiser. Here we use `optim.SGD`, where SGD stands for Stochastic Gradient Descent.

The learning rate can also be specified, so we can control the amount by which the parameters are modified.

In [83]:
# Define the optimizer
opt = torch.optim.SGD(model.parameters(), lr=1e-5)
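
To connect the optimiser back to Part A, the sketch below checks that one `opt.step()` of plain SGD (no momentum or weight decay) matches the manual update `w -= w.grad * lr`. The random data, the seed, and the names `layer` and `demo_opt` are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only to make the sketch repeatable
layer = nn.Linear(3, 2)
demo_opt = torch.optim.SGD(layer.parameters(), lr=1e-5)

x = torch.randn(5, 3)  # random stand-in inputs
y = torch.randn(5, 2)  # random stand-in targets

loss = F.mse_loss(layer(x), y)
loss.backward()

# What the manual update from Part A would produce
expected_w = layer.weight.detach() - 1e-5 * layer.weight.grad

demo_opt.step()        # update the parameters using their gradients
same = torch.allclose(layer.weight.detach(), expected_w)
demo_opt.zero_grad()   # reset the gradients for the next step
print(same)
```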

We generally define a utility function `fit` that trains the model for a given number of epochs.

In [84]:
# Utility function to train the model
def fit(num_epochs, model, loss_fn, opt):
    
    # Repeat for given number of epochs
    for epoch in range(num_epochs):
        
        # Train with batches of data
        for xb,yb in train_dl:
            
            # 1 - Generate the predictions
            pred = model(xb)
            
            # 2 - Calculate the loss
            loss = loss_fn(pred, yb)
            
            # 3 - Compute the gradients
            loss.backward()
            
            # 4 - Update the parameters using gradients
            opt.step()
            
            # 5 - Reset the gradients to zero
            opt.zero_grad()
        
        # Print out the progress
        if (epoch+1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

In [85]:
fit(100, model, loss_fn, opt)

Epoch [10/100], Loss: 69.4871
Epoch [20/100], Loss: 71.2876
Epoch [30/100], Loss: 48.2677
Epoch [40/100], Loss: 59.9832
Epoch [50/100], Loss: 22.2297
Epoch [60/100], Loss: 38.4646
Epoch [70/100], Loss: 40.3895
Epoch [80/100], Loss: 91.7548
Epoch [90/100], Loss: 53.3781
Epoch [100/100], Loss: 19.6952
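
Note that `fit` prints the loss of the last batch in each epoch, which is one reason the printed values jump around rather than decreasing steadily. One possible way to get a steadier number is to average the loss over the whole dataset. The helper `dataset_loss` below is a hypothetical addition, sketched on placeholder data and a fresh model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

def dataset_loss(model, loss_fn, dl):
    """Hypothetical helper: mean loss over every example in a DataLoader."""
    total, count = 0.0, 0
    with torch.no_grad():  # evaluation only, no gradients needed
        for xb, yb in dl:
            # loss_fn returns a per-batch mean, so weight it by the batch size
            total += loss_fn(model(xb), yb).item() * len(xb)
            count += len(xb)
    return total / count

# Placeholder data and model so the sketch runs on its own
demo_x, demo_y = torch.randn(15, 3), torch.randn(15, 2)
demo_dl = DataLoader(TensorDataset(demo_x, demo_y), batch_size=5)
demo_model = nn.Linear(3, 2)

full_loss = dataset_loss(demo_model, F.mse_loss, demo_dl)
print(full_loss)
```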


In [86]:
# Generate predictions
preds = model(inputs)
preds

tensor([[43.5830, 27.3620],
        [18.9134, 12.9657],
        [11.7514,  5.9728],
        [25.8749, 17.8681],
        [23.1098, 12.9351],
        [27.0564, 16.9283],
        [24.0415, 16.0191],
        [32.2245, 20.3997],
        [16.6989,  7.1447],
        [15.8383, 11.4294],
        [34.3989, 25.8026],
        [31.7742, 22.3330],
        [28.9890, 17.1910],
        [ 9.2068,  3.3394],
        [16.9396,  9.6534]], grad_fn=<AddmmBackward>)

In [87]:
# Compare with targets
targets

tensor([[40., 20.],
        [25., 15.],
        [15., 10.],
        [20., 15.],
        [25., 20.],
        [35., 25.],
        [30., 25.],
        [35., 25.],
        [20., 12.],
        [15., 10.],
        [35., 10.],
        [30., 17.],
        [30., 26.],
        [10., 16.],
        [20., 14.]])