In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
import torch
from torch import Tensor, nn, optim

## Optimisers
Optimisers are classes responsible for updating parameters according to their effect on the loss.
The general update rule is:

$$\theta_{t+1} = \theta_t - \gamma\nabla_{\theta_t}\!\mathcal{L}$$

The gradient of the loss with respect to the parameter $\theta$, $\nabla_{\theta_t}\!\mathcal{L}$, can be computed by backpropagating the loss value to the parameter. The learning rate $\gamma$ controls how much the parameter is changed by the gradient, i.e. how large a step the optimiser takes. Because the update subtracts the gradient, the parameter moves in the direction that decreases the loss, so the loss must be defined such that minimising it results in better performance.
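
As a minimal sketch of the update rule above, the cell below applies one step by hand to a toy problem. The single parameter `theta`, the quadratic loss, and the value of `gamma` are all assumptions chosen for illustration; they are not part of the model used later.

```python
import torch

# Hypothetical toy problem: one parameter and the loss L = (theta - 3)^2
theta = torch.tensor(1.0, requires_grad=True)
gamma = 0.1  # learning rate

loss = (theta - 3.0) ** 2
loss.backward()  # autograd stores dL/dtheta = 2 * (theta - 3) in theta.grad

# Apply theta_{t+1} = theta_t - gamma * grad; the update itself is not tracked by autograd
with torch.no_grad():
    theta_new = theta - gamma * theta.grad

grad_value = theta.grad.item()
theta_new_value = theta_new.item()
print(grad_value, theta_new_value)
```

The gradient is negative here, so subtracting it moves `theta` upwards, towards the minimum of the loss at 3.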

The loss is often evaluated on a batch of data points, rather than on all of the data or on just one point. This is a trade-off between the speed of each update and how precisely the loss is evaluated. The stochastic nature of the batches can also reduce the chance that the DNN overfits to the training data.
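
To illustrate the batching idea, the sketch below evaluates a loss on the full dataset and on batches of five points. The data, the one-layer model, and the batch size are assumptions made up for this example. With equally sized batches and a mean-reduced loss, the average of the batch losses should match the full-data loss, while each individual batch loss is only a noisy estimate of it.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data: 20 points with 3 features and random binary targets
x = torch.randn(20, 3)
y = (torch.rand(20, 1) > 0.5).float()

model = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()  # averages over the points it is given

with torch.no_grad():  # only evaluating here, no gradients needed
    full_loss = loss_fn(model(x), y)
    # Tensor.split(5) yields consecutive chunks of 5 rows each
    batch_losses = [loss_fn(model(xb), yb) for xb, yb in zip(x.split(5), y.split(5))]

mean_batch_loss = torch.stack(batch_losses).mean()
print(full_loss, batch_losses, mean_batch_loss)
```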

PyTorch implements a variety of optimisers (https://pytorch.org/docs/stable/optim.html). See https://ruder.io/optimizing-gradient-descent/index.html for a good overview.

We'll stick with the standard SGD for now. When instantiating an `Optimizer`, the parameters to be optimised must be provided, along with the necessary hyper-parameters of the optimisation algorithm. We'll start with a very simple set of layers:

In [11]:
model = nn.Sequential(nn.Linear(3,10), nn.ReLU(), nn.Linear(10,1), nn.Sigmoid())
model.state_dict()

OrderedDict([('0.weight',
              tensor([[ 0.4969, -0.4201,  0.1253],
                      [-0.4167,  0.0887,  0.2076],
                      [-0.1430,  0.3111, -0.0112],
                      [-0.4380,  0.1722,  0.2635],
                      [ 0.4037,  0.2770, -0.1404],
                      [ 0.2496, -0.2851,  0.5548],
                      [-0.3452, -0.0351,  0.4498],
                      [ 0.3398, -0.3619,  0.1209],
                      [ 0.0627, -0.4354, -0.2994],
                      [ 0.1811, -0.0945, -0.2087]])),
             ('0.bias',
              tensor([-0.4521,  0.1363, -0.3722, -0.1289,  0.4030,  0.3634, -0.1745,  0.3683,
                       0.1704, -0.5234])),
             ('2.weight',
              tensor([[-0.2498, -0.1656, -0.1275, -0.0864,  0.2547,  0.2727, -0.0485,  0.0450,
                       -0.0448, -0.1232]])),
             ('2.bias', tensor([-0.1799]))])

To optimise the parameters of the model, we pass its `.parameters()` generator to the optimiser constructor. The optimiser keeps references to the parameters, so it can always access them.

In [12]:
opt = optim.SGD(params=model.parameters(), lr=1e-2)
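
One way to check what the optimiser received is to inspect its `param_groups` attribute. The sketch below rebuilds the same model and optimiser so that it runs on its own; the variable names `group`, `same_objects`, `n_tensors` and `lr` are illustrative.

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 1), nn.Sigmoid())
opt = optim.SGD(params=model.parameters(), lr=1e-2)

# With a single iterable of parameters, the optimiser builds one parameter group
group = opt.param_groups[0]

# The optimiser holds the very same tensor objects as the model, not copies
same_objects = all(p_opt is p_model for p_opt, p_model in zip(group["params"], model.parameters()))
n_tensors = len(group["params"])  # two weights and two biases
lr = group["lr"]
print(same_objects, n_tensors, lr)
```

Because the tensors are shared, an update made by the optimiser is immediately visible through the model.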

We also need a loss function.

In [13]:
loss_fn = nn.BCELoss()

Now we pass some data through the network to get a prediction:

In [17]:
inputs = torch.randn(20,3)
inputs[10:] += 0.25
targets = torch.zeros(20,1)
targets[10:] = 1

In [18]:
preds = model(inputs)
preds

tensor([[0.3966],
        [0.4726],
        [0.5016],
        [0.5888],
        [0.4748],
        [0.4353],
        [0.5002],
        [0.4654],
        [0.4718],
        [0.4801],
        [0.4923],
        [0.4844],
        [0.4952],
        [0.6057],
        [0.5413],
        [0.5354],
        [0.5824],
        [0.5801],
        [0.4812],
        [0.5211]], grad_fn=<SigmoidBackward0>)

Now we compute the loss value:

In [20]:
loss = loss_fn(preds, targets)
loss

tensor(0.6451, grad_fn=<BinaryCrossEntropyBackward0>)

At this point we want to ensure that the parameters do not have any gradient value, e.g. left over from previous updates. In this case, we can see that the `.grad` attributes are `None`.

In [22]:
model[0].weight, model[0].weight.grad

(Parameter containing:
 tensor([[ 0.4969, -0.4201,  0.1253],
         [-0.4167,  0.0887,  0.2076],
         [-0.1430,  0.3111, -0.0112],
         [-0.4380,  0.1722,  0.2635],
         [ 0.4037,  0.2770, -0.1404],
         [ 0.2496, -0.2851,  0.5548],
         [-0.3452, -0.0351,  0.4498],
         [ 0.3398, -0.3619,  0.1209],
         [ 0.0627, -0.4354, -0.2994],
         [ 0.1811, -0.0945, -0.2087]], requires_grad=True),
 None)

Just in case, though, we will call `zero_grad()` to ensure that they are all zero or `None`.

In [23]:
opt.zero_grad()

In [24]:
model[0].weight.grad

Now we can backpropagate the gradient of the loss:

In [25]:
loss.backward()

Now when we check the gradients on the parameters, we'll see that they are non-zero:

In [26]:
model[0].weight.grad

tensor([[ 0.0234, -0.0042, -0.0055],
        [ 0.0189,  0.0129, -0.0122],
        [ 0.0031,  0.0020, -0.0027],
        [ 0.0047,  0.0028, -0.0109],
        [-0.0433, -0.0168, -0.0162],
        [-0.0418, -0.0342, -0.0020],
        [ 0.0043,  0.0025, -0.0015],
        [-0.0071, -0.0045, -0.0005],
        [ 0.0057,  0.0015,  0.0026],
        [-0.0003,  0.0034,  0.0056]])

The values of the parameters haven't changed yet. We need to perform an update step with the optimiser:

In [27]:
opt.step()

In [28]:
model[0].weight

Parameter containing:
tensor([[ 0.4967, -0.4201,  0.1254],
        [-0.4169,  0.0886,  0.2077],
        [-0.1431,  0.3111, -0.0112],
        [-0.4380,  0.1722,  0.2636],
        [ 0.4042,  0.2772, -0.1402],
        [ 0.2500, -0.2848,  0.5548],
        [-0.3452, -0.0352,  0.4498],
        [ 0.3399, -0.3619,  0.1209],
        [ 0.0626, -0.4354, -0.2995],
        [ 0.1811, -0.0945, -0.2088]], requires_grad=True)
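
As a sanity check on what `opt.step()` does, the sketch below compares the updated weights against the general update rule applied by hand. It assumes plain SGD with no momentum or weight decay, as configured above, and rebuilds the model and data with a fixed seed so that it runs independently; the resulting numbers will therefore differ from the outputs shown in this notebook.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
lr = 1e-2
model = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 1), nn.Sigmoid())
opt = optim.SGD(params=model.parameters(), lr=lr)
loss_fn = nn.BCELoss()

inputs = torch.randn(20, 3)
targets = torch.zeros(20, 1)
targets[10:] = 1

opt.zero_grad()
loss_fn(model(inputs), targets).backward()

# Snapshot the weights and their gradient before the update
w_before = model[0].weight.detach().clone()
grad = model[0].weight.grad.clone()

opt.step()

# Plain SGD should give w_after = w_before - lr * grad
w_after = model[0].weight.detach()
expected = w_before - lr * grad
matches = torch.allclose(w_after, expected, atol=1e-7)
print(matches)
```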

The parameters have now been updated slightly. They still hold their gradients, though, and since `backward()` adds to any existing gradients rather than overwriting them, it is important that we always zero them before backpropagating the loss.

In [29]:
model[0].weight.grad

tensor([[ 0.0234, -0.0042, -0.0055],
        [ 0.0189,  0.0129, -0.0122],
        [ 0.0031,  0.0020, -0.0027],
        [ 0.0047,  0.0028, -0.0109],
        [-0.0433, -0.0168, -0.0162],
        [-0.0418, -0.0342, -0.0020],
        [ 0.0043,  0.0025, -0.0015],
        [-0.0071, -0.0045, -0.0005],
        [ 0.0057,  0.0015,  0.0026],
        [-0.0003,  0.0034,  0.0056]])
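
The effect of forgetting to zero the gradients can be seen in a small sketch. It rebuilds the model and data with a fixed seed so that it is self-contained, and it never calls `opt.step()`, so the parameters stay fixed and every backward pass produces the same gradient. The names `grad_once`, `grad_twice` and `grad_after_zero` are illustrative.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 1), nn.Sigmoid())
opt = optim.SGD(params=model.parameters(), lr=1e-2)
loss_fn = nn.BCELoss()

inputs = torch.randn(20, 3)
targets = torch.zeros(20, 1)
targets[10:] = 1

# First backward pass: the gradients are populated
loss_fn(model(inputs), targets).backward()
grad_once = model[0].weight.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
loss_fn(model(inputs), targets).backward()
grad_twice = model[0].weight.grad.clone()

# Zeroing first restores the single-pass gradient
opt.zero_grad()
loss_fn(model(inputs), targets).backward()
grad_after_zero = model[0].weight.grad.clone()

accumulated = torch.allclose(grad_twice, 2 * grad_once)
reset = torch.allclose(grad_after_zero, grad_once)
print(accumulated, reset)
```

In a training loop this suggests the usual order for each batch: `opt.zero_grad()`, forward pass, `loss.backward()`, then `opt.step()`.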