In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
import torch
from torch import Tensor, nn

## NN module
The `nn` module contains classes, functions, and other modules for creating neural networks from smaller building blocks. The majority of classes inherit from `nn.Module`, which provides functionality for tracking learnable parameters and acting on input tensors.

A class inheriting from `nn.Module` will have a:
- `forward()` method, where it can act on incoming tensors. The `__call__` method calls `forward`, so once instantiated, the object can be called directly, e.g. `my_module(x)` will pass `x` to `my_module.forward(x)` and return the output.
- `parameters()` method, which provides a recursive generator that yields all `nn.Parameter`s stored by the `Module` and any other `nn.Module`s it stores
- `state_dict()` method, which returns a dictionary of the current values of all `nn.Parameter`s and registered buffers stored by the `Module` and any other `nn.Module`s it stores
- `to()` method, which will recursively place all other `nn.Module`s and `nn.Parameter`s stored by the `Module` onto the specified device
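
As a rough illustration of the four methods listed above, the sketch below uses a hypothetical toy module called `Scale` (not part of PyTorch) that holds a single learnable factor. The device selection at the end is an assumption about the machine it runs on, and falls back to the CPU.

```python
import torch
from torch import nn


class Scale(nn.Module):
    """Hypothetical toy module: multiplies its input by one learnable factor."""

    def __init__(self):
        super().__init__()
        self.factor = nn.Parameter(torch.tensor([2.0]))

    def forward(self, x):
        return x * self.factor


scale = Scale()
x = torch.ones(3)

out = scale(x)  # calling the object dispatches to forward()
n_params = len(list(scale.parameters()))  # a single parameter: factor
keys = list(scale.state_dict().keys())  # the state dict is keyed by attribute name

# Move to a GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
scale = scale.to(device)
param_device = next(scale.parameters()).device.type
```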


### nn.Parameter
`nn.Parameter`s are basically just `Tensor`s with `requires_grad=True`, except that when they are declared as attributes of an `nn.Module`, they are treated specially: they are returned by the `parameters()` generator and stored in the `state_dict`. As we'll see later, optimisers in PyTorch are initialised using the `parameters()` generator, so `nn.Parameter`s will be updated by gradient descent. Additionally, loading and saving of an `nn.Module` is done via its `state_dict`, so the values of `nn.Parameter`s will be loaded and saved, too.

In [3]:
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()  # The super constructor must always be called, otherwise no parameters can be assigned
        self.tensor_a = torch.tensor([3.], requires_grad=True)  # here we declare a tensor with gradient
        self.param_b = nn.Parameter(torch.tensor([2.]))  # here we declare a parameter with gradient

In [4]:
module = MyModule()

In [5]:
list(module.parameters())  # note that only param_b is listed, tensor_a is ignored

[Parameter containing:
 tensor([2.], requires_grad=True)]

In [6]:
module.state_dict()  # similarly, only param_b is listed

OrderedDict([('param_b', tensor([2.]))])

Parameters can also be included as `nn.ParameterList` and `nn.ParameterDict` classes, which act similarly to lists and dictionaries, except that they will also be identified as parameters of the module.
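
A minimal sketch of the two containers follows. `MultiParam` and its attribute names are hypothetical, and the plain Python list is included only to show what is not registered.

```python
import torch
from torch import nn


class MultiParam(nn.Module):
    """Hypothetical module that stores its parameters in containers."""

    def __init__(self):
        super().__init__()
        self.plist = nn.ParameterList([nn.Parameter(torch.zeros(2)) for _ in range(3)])
        self.pdict = nn.ParameterDict({"scale": nn.Parameter(torch.ones(1))})
        self.plain = [nn.Parameter(torch.zeros(2))]  # plain list: not registered with the module


multi = MultiParam()
names = [name for name, _ in multi.named_parameters()]  # names are prefixed by the container attribute
```

Only the entries held in `plist` and `pdict` should appear in `names`; the parameter in the plain list is invisible to `parameters()` and the `state_dict`.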

### Buffers
Sometimes we have values that we want to keep constant during optimisation, but also want to be included in the `state_dict` such that they can be easily loaded and saved. Such values can be registered as *buffers*:

In [7]:
class MyModule(nn.Module):
    def __init__(self, value):
        super().__init__()
        self.tensor_a = torch.tensor([3.], requires_grad=True)
        self.param_b = nn.Parameter(torch.tensor([2.]))
        self.register_buffer('buffer_c', value)  # register the buffer with a given name

In [8]:
module = MyModule(Tensor([-1]))

In [9]:
list(module.parameters())  # buffer_c isn't included as a parameter

[Parameter containing:
 tensor([2.], requires_grad=True)]

In [10]:
module.state_dict()  # but is included in the state dict

OrderedDict([('param_b', tensor([2.])), ('buffer_c', tensor([-1.]))])

In [11]:
module.buffer_c  # the buffer appears as an attribute with the name that was provided when it was registered

tensor([-1.])

## Common classes
There are many different classes implemented in PyTorch. See https://pytorch.org/docs/stable/nn.html for the full list. Described below are a few common examples.

### Linear layers

A common class is `nn.Linear`, which implements the linear transform `w.x+b`, where `w` and `b` are learnable parameters. For a batch of inputs `x` this is computed as `x @ w.T + b`. These can be used for the "hidden" layers in feed-forward DNNs.

In [12]:
lin = nn.Linear(in_features=4, out_features=6)  # the layer expects 4 features in and will output 6 features

In [13]:
lin.state_dict()  # it has a weight (6,4) and a bias (6), which are initialised at random

OrderedDict([('weight',
              tensor([[-0.4148, -0.0767, -0.4162,  0.3248],
                      [ 0.2116, -0.0338, -0.4803, -0.0719],
                      [-0.1532,  0.0357,  0.0993, -0.3531],
                      [-0.4223,  0.0337,  0.4757,  0.2106],
                      [ 0.4832,  0.2941, -0.3082, -0.1026],
                      [ 0.1516, -0.0272,  0.4565, -0.2877]])),
             ('bias',
              tensor([ 0.3766,  0.3292,  0.1382,  0.4352,  0.1774, -0.0696]))])

In [14]:
x = torch.randn(10,4)
x = lin(x)  # this calls the forward method of the linear layer, which applies the linear transformation to the incoming x tensor
x.shape, x.grad_fn  # note that the linear transform was broadcast across the first dimension of x, and that x now has a grad function

(torch.Size([10, 6]), <AddmmBackward0 at 0x7fb69fed7610>)
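
To make the shapes concrete, the sketch below reproduces the layer's output by hand. It assumes nothing beyond the `(out_features, in_features)` weight layout shown in the `state_dict` above.

```python
import torch
from torch import nn

lin = nn.Linear(in_features=4, out_features=6)
x = torch.randn(10, 4)

# The weight has shape (out_features, in_features), hence the transpose
manual = x @ lin.weight.T + lin.bias
same = torch.allclose(lin(x), manual, atol=1e-6)
```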

### Activation layers
Sometimes, the classes don't have any learnable parameters, but it is more convenient to treat them as `nn.Module`s. Activation functions are typical examples:

In [15]:
act = nn.ReLU()
act.state_dict()

OrderedDict()

In [16]:
act(x)

tensor([[0.0000, 0.6011, 0.4731, 0.0000, 1.2012, 0.0174],
        [0.1432, 1.0412, 0.3407, 0.0000, 0.8050, 0.0000],
        [0.4012, 0.0000, 0.0143, 1.1460, 0.0000, 0.2359],
        [1.0377, 0.1786, 0.5834, 0.8358, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.8740, 0.0025, 0.2736],
        [1.1782, 1.1063, 0.0000, 0.0000, 0.1309, 0.0000],
        [0.4244, 0.0000, 0.0000, 1.4880, 0.0000, 0.0387],
        [0.0000, 0.0000, 0.6371, 1.6151, 0.0000, 0.4471],
        [0.0000, 0.8299, 0.0000, 0.0000, 1.6888, 0.0000],
        [0.7403, 0.8747, 0.1704, 0.0000, 1.0006, 0.0000]],
       grad_fn=<ReluBackward0>)

See https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity for the full list of activation functions implemented in PyTorch.

### Sequential
Above, we took some data and passed it through a linear layer and then through an activation function. This is a very common action in a neural network, and it can be convenient to group layers and modules into an `nn.Sequential` class. It takes multiple `nn.Module`s, and when its `forward` method is called, it feeds the input to the first module, then feeds each output into the next module in turn, finally returning the output of the last module.

In [17]:
lin_act = nn.Sequential(lin, act)

In [18]:
lin_act(torch.randn(10,4))

tensor([[1.3807e+00, 7.3050e-01, 1.9314e-01, 2.8503e-01, 2.8928e-01, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 6.3162e-01, 7.5493e-01, 2.8447e-01, 6.5368e-01],
        [5.0934e-01, 3.4157e-01, 0.0000e+00, 4.7379e-01, 3.3435e-01, 0.0000e+00],
        [0.0000e+00, 3.5082e-01, 1.0966e-03, 2.1265e-01, 5.4601e-01, 2.3216e-01],
        [0.0000e+00, 1.8708e-01, 2.3001e-01, 4.2887e-01, 0.0000e+00, 4.1373e-01],
        [9.2773e-01, 5.5664e-02, 1.6467e-01, 9.8114e-01, 0.0000e+00, 0.0000e+00],
        [2.6894e-01, 5.0144e-01, 5.4058e-01, 1.7945e-01, 3.9264e-01, 1.3366e-02],
        [0.0000e+00, 2.9057e-01, 2.3775e-01, 3.2878e-01, 0.0000e+00, 3.2577e-01],
        [9.4376e-01, 2.2823e-01, 7.9454e-01, 7.4341e-01, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 3.7771e-01, 0.0000e+00, 2.6012e-02, 8.9576e-01, 4.1871e-01]],
       grad_fn=<ReluBackward0>)

In [19]:
lin_act.state_dict()  # the parameters of the linear layer are still contained in the state_dict of the sequential module

OrderedDict([('0.weight',
              tensor([[-0.4148, -0.0767, -0.4162,  0.3248],
                      [ 0.2116, -0.0338, -0.4803, -0.0719],
                      [-0.1532,  0.0357,  0.0993, -0.3531],
                      [-0.4223,  0.0337,  0.4757,  0.2106],
                      [ 0.4832,  0.2941, -0.3082, -0.1026],
                      [ 0.1516, -0.0272,  0.4565, -0.2877]])),
             ('0.bias',
              tensor([ 0.3766,  0.3292,  0.1382,  0.4352,  0.1774, -0.0696]))])

### Module lists and dicts
Similar to `nn.ParameterList` and `nn.ParameterDict`, `nn.ModuleList` and `nn.ModuleDict` can be used to contain multiple modules and have them be recognised by the parent `nn.Module` as modules:

In [20]:
mlist = nn.ModuleList([lin, act])

In [21]:
isinstance(mlist, nn.Module)

True

In [22]:
mlist(x)  # does not act like a Sequential

NotImplementedError: Module [ModuleList] is missing the required "forward" function

In [23]:
x = torch.rand(10,4)
for m in mlist: x = m(x)  # but can be iterated through
x

tensor([[0.3164, 0.3684, 0.0000, 0.3805, 0.6748, 0.0000],
        [0.2212, 0.0958, 0.0000, 0.6889, 0.2064, 0.0025],
        [0.0000, 0.0178, 0.0300, 0.6268, 0.3265, 0.3645],
        [0.2917, 0.0000, 0.0000, 0.8613, 0.0167, 0.0012],
        [0.0000, 0.0000, 0.0000, 0.7378, 0.4767, 0.1339],
        [0.0000, 0.2813, 0.0000, 0.3501, 0.6843, 0.1061],
        [0.1327, 0.0000, 0.0000, 0.7980, 0.0554, 0.1386],
        [0.2633, 0.0000, 0.0000, 0.8758, 0.1623, 0.0000],
        [0.3105, 0.0335, 0.0074, 0.7968, 0.0000, 0.0321],
        [0.2800, 0.0278, 0.0000, 0.7994, 0.1069, 0.0000]],
       grad_fn=<ReluBackward0>)
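
A typical use of `nn.ModuleList` is inside another module whose number of layers is only known at construction time. The sketch below is one way this might look; `Stack`, `width`, and `depth` are hypothetical names chosen for illustration.

```python
import torch
from torch import nn


class Stack(nn.Module):
    """Hypothetical module applying a configurable number of linear + ReLU blocks."""

    def __init__(self, width, depth):
        super().__init__()
        # ModuleList registers every layer, so their parameters are tracked by the parent
        self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
        self.act = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.act(layer(x))
        return x


stack = Stack(width=4, depth=3)
out = stack(torch.randn(10, 4))
n_tensors = len(list(stack.parameters()))  # one weight and one bias per layer
```

Had the layers been kept in a plain Python list instead, they would still run in `forward`, but their parameters would be missing from `parameters()` and the `state_dict`.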

## Forwards pass
When building a new module, it is necessary to implement the forwards pass, which defines how the parameters and child modules will affect the incoming tensors. When dealing with high-level PyTorch, the backwards pass will be automatically implemented.

In [24]:
class LinAct(nn.Module):
    def __init__(self, nin, nout):
        super().__init__()
        self.lin = nn.Linear(nin, nout)
        self.act = nn.ReLU()
        
    def forward(self, x):
        x = self.lin(x)
        x = self.act(x)
        return x

In [25]:
lin_act = LinAct(2,4)

In [26]:
x = torch.randn(5,2)
lin_act(x)

tensor([[0.7515, 0.9448, 0.4860, 0.0000],
        [0.7264, 0.0000, 0.3230, 0.0000],
        [0.0744, 0.1151, 0.0000, 0.0000],
        [0.9014, 0.0000, 0.5478, 0.0000],
        [0.2433, 1.8337, 0.0000, 0.8897]], grad_fn=<ReluBackward0>)

In [27]:
lin_act.state_dict()

OrderedDict([('lin.weight',
              tensor([[-0.4735, -0.1305],
                      [ 0.1630, -0.6295],
                      [-0.5533, -0.2001],
                      [ 0.6462, -0.2662]])),
             ('lin.bias', tensor([ 0.3504,  0.0600, -0.0548, -0.4007]))])

The forward method is not compiled ahead of time; it runs as ordinary Python each time it is called, meaning that conditionals can be used to change what actions are applied, depending on the data provided at runtime:

In [28]:
class LinAct(nn.Module):
    def __init__(self, nin, nout):
        super().__init__()
        self.lin = nn.Linear(nin, nout)
        self.act = nn.ReLU()
        self.zero_replacement = nn.Parameter(Tensor([-3]))
        
    def forward(self, x):
        x = self.lin(x)
        x = self.act(x)
        x[x<=0] = self.zero_replacement  # if any values are less than or equal to 0, replace them with a learnable default value
        return x

In [29]:
lin_act = LinAct(2,4)

In [30]:
x = torch.randn(5,2)
lin_act(x)

tensor([[-3.0000,  0.6434, -3.0000, -3.0000],
        [-3.0000,  0.0659,  1.2936,  0.1982],
        [-3.0000,  0.2788,  0.5140, -3.0000],
        [-3.0000,  0.3862,  0.2030, -3.0000],
        [-3.0000,  0.2865,  0.6796,  0.0125]], grad_fn=<IndexPutBackward0>)
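
One caveat, stated here as an assumption about autograd rather than something demonstrated above: ReLU's backward pass reuses the layer's output, so overwriting that output in place may raise an error once `backward()` is called. A minimal out-of-place sketch using `torch.where` avoids the question entirely; `LinActWhere` is a hypothetical name.

```python
import torch
from torch import nn


class LinActWhere(nn.Module):
    """Hypothetical variant of LinAct that builds a new tensor instead of assigning in place."""

    def __init__(self, nin, nout):
        super().__init__()
        self.lin = nn.Linear(nin, nout)
        self.act = nn.ReLU()
        self.zero_replacement = nn.Parameter(torch.tensor([-3.0]))

    def forward(self, x):
        x = self.act(self.lin(x))
        # Select the learnable default wherever the ReLU output is zero; x itself is untouched
        return torch.where(x <= 0, self.zero_replacement, x)


model = LinActWhere(2, 4)
out = model(torch.randn(5, 2))
out.sum().backward()  # populates .grad on the linear layer and on zero_replacement
```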

## Initialisation
Suitable initialisation of the parameters in a neural network is key to making it trainable quickly (or at all). PyTorch provides a default init scheme, but this isn't guaranteed to be suitable for the networks you create; the scheme should vary according to, at least, the activation function used. The general rule-of-thumb is that data sampled from a unit Gaussian should retain a unit-Gaussian shape when passed through the DNN. To see why this is important, check out this interactive demo: https://www.deeplearning.ai/ai-notes/initialization/index.html.

Two of the most common schemes are:
- Xavier Glorot: weights are sampled from either a uniform distribution between ±sqrt(6/(nin+nout)), or a Gaussian with mean 0 and std sqrt(2/(nin+nout)). This is used for e.g. linear, sigmoid, softmax, and tanh activation functions.
- Kaiming He: weights are sampled from either a uniform distribution between ±gain·sqrt(3/nin), or a Gaussian with mean 0 and std gain·sqrt(1/nin), where the gain depends on the activation function (sqrt(2) for ReLU, 1 for a linear layer). This is used for e.g. ReLU, PReLU, Swish, and Mish activation functions.
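
The scale factors can be checked numerically. The sketch below assumes PyTorch's `nn.init` implementations, in which `kaiming_normal_` multiplies the standard deviation by an activation-dependent gain (sqrt(2) for ReLU); the layer sizes are arbitrary.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
nin, nout = 512, 512  # arbitrary sizes, large enough for stable statistics

gain = nn.init.calculate_gain("relu")  # activation-dependent scale factor

# Kaiming normal: std = gain / sqrt(nin), using fan-in mode (the default)
w_kaiming = torch.empty(nout, nin)
nn.init.kaiming_normal_(w_kaiming, nonlinearity="relu")
expected_std = gain / math.sqrt(nin)
measured_std = w_kaiming.std().item()

# Xavier uniform: every weight lies within ±sqrt(6 / (nin + nout))
w_xavier = torch.empty(nout, nin)
nn.init.xavier_uniform_(w_xavier)
xavier_bound = math.sqrt(6 / (nin + nout))
max_abs = w_xavier.abs().max().item()
```

Comparing `measured_std` with `expected_std`, and `max_abs` with `xavier_bound`, gives a quick sanity check that a layer was initialised with the intended scheme.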

As part of my LUMIN package, I maintain a list of applicable init schemes here https://github.com/GilesStrong/lumin/blob/master/lumin/nn/models/initialisations.py

The bias of a linear layer can generally be initialised to zeros.

A full list of the init schemes in PyTorch is found here: https://pytorch.org/docs/stable/nn.init.html

In [32]:
lin = nn.Linear(3,5)
lin.state_dict()

OrderedDict([('weight',
              tensor([[ 0.1572, -0.1353, -0.2506],
                      [ 0.2733,  0.2149, -0.2999],
                      [ 0.4583, -0.3283,  0.4848],
                      [-0.0452, -0.4359, -0.3869],
                      [-0.0950, -0.5202, -0.2576]])),
             ('bias', tensor([-0.1350,  0.3089, -0.0082,  0.2809,  0.3997]))])

This will set initial values in-place (note the _ at the end of the function name to indicate it's an in-place operation). Since we expect to feed the output into a ReLU, we need to specify 'relu' for the nonlinearity. For no discernibly good reason at all, there is a default value of 'leaky_relu', so don't forget to correct this.

In [33]:
nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')

Parameter containing:
tensor([[ 0.4379, -0.3443, -0.4662],
        [ 0.2523,  1.2330, -1.5220],
        [ 0.6170, -0.8882,  0.0229],
        [ 0.5918, -0.2919,  0.6845],
        [ 0.2057,  0.2468, -0.6487]], requires_grad=True)

Let's zero the bias

In [34]:
nn.init.zeros_(lin.bias)

Parameter containing:
tensor([0., 0., 0., 0., 0.], requires_grad=True)

In [36]:
lin.state_dict()

OrderedDict([('weight',
              tensor([[ 0.4379, -0.3443, -0.4662],
                      [ 0.2523,  1.2330, -1.5220],
                      [ 0.6170, -0.8882,  0.0229],
                      [ 0.5918, -0.2919,  0.6845],
                      [ 0.2057,  0.2468, -0.6487]])),
             ('bias', tensor([0., 0., 0., 0., 0.]))])

When writing a new module, I generally make sure that the layers are correctly initialised just after they are declared. However, it is still possible to reinitialise an instantiated module by recursively searching through it for different layers, like:

In [37]:
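def init_net(model:nn.Module):
    r'''Recursively initialise fully-connected ReLU network with Kaiming and zero bias'''
    if isinstance(model,nn.Linear):
        nn.init.kaiming_normal_(model.weight, nonlinearity='relu')  # Kaiming-normal weights, suited to ReLU
        if model.bias is not None: nn.init.zeros_(model.bias)  # layers built with bias=False have no bias
    for l in model.children(): init_net(l)  # descend into child modules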

However, you now lose a bit of control, and it's easy to use the "wrong" init scheme depending on the layer.
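
The same recursive traversal is also available through `nn.Module.apply`, which calls a function on every submodule and then on the module itself. A minimal sketch follows, with `init_linear` and the small `net` being hypothetical names; the caveat about applying the "wrong" scheme to a layer still holds.

```python
import torch
from torch import nn


def init_linear(module: nn.Module) -> None:
    """Kaiming-initialise linear layers; other module types are left untouched."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)


net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
net.apply(init_linear)  # visits each child module, then net itself
```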

## Saving/loading
After training a neural network (or even during), we often want to save its parameters. This is done via the `state_dict`.

In [39]:
state = lin_act.state_dict()
state

OrderedDict([('zero_replacement', tensor([-3.])),
             ('lin.weight',
              tensor([[ 0.0758,  0.0297],
                      [ 0.0983,  0.0961],
                      [-0.2321, -0.4204],
                      [-0.0091, -0.3525]])),
             ('lin.bias', tensor([-0.3967,  0.2413,  0.6331, -0.2551]))])

In [40]:
torch.save(state, '03_save.pt')

Now later we may want to reload the saved parameters. Note that we have only saved the parameters, and not the class or code itself. So we would need to re-instantiate the network with random parameters, and then load the saved params into it:

In [42]:
new_lin_act = LinAct(2,4)
new_lin_act.state_dict()

OrderedDict([('zero_replacement', tensor([-3.])),
             ('lin.weight',
              tensor([[ 0.5269,  0.5570],
                      [ 0.4566,  0.4674],
                      [-0.2778,  0.6261],
                      [-0.2094,  0.2372]])),
             ('lin.bias', tensor([0.5638, 0.0331, 0.1616, 0.1378]))])

In [43]:
loaded_state = torch.load('03_save.pt')

In [44]:
new_lin_act.load_state_dict(loaded_state)

<All keys matched successfully>

In [45]:
new_lin_act.state_dict()

OrderedDict([('zero_replacement', tensor([-3.])),
             ('lin.weight',
              tensor([[ 0.0758,  0.0297],
                      [ 0.0983,  0.0961],
                      [-0.2321, -0.4204],
                      [-0.0091, -0.3525]])),
             ('lin.bias', tensor([-0.3967,  0.2413,  0.6331, -0.2551]))])

A trickier, but more flexible and user-convenient, export method is the PyTorch packager: https://pytorch.org/docs/stable/package.html. This allows one to export the saved parameters, the code necessary to produce the modules they relate to, and relevant code from any dependencies that users may not want to install.

## nn.functional
Many operations/layers are presented as classes, but most are also available as functions in `torch.nn.functional` (https://pytorch.org/docs/stable/nn.functional.html), commonly imported as `F`. Without moving to a full functional-programming approach, it is still possible/useful to use these functions in your modules, e.g.:

In [48]:
import torch.nn.functional as F

In [49]:
class LinAct(nn.Module):
    def __init__(self, nin, nout):
        super().__init__()
        self.lin = nn.Linear(nin, nout)
        
    def forward(self, x):
        x = self.lin(x)
        x = F.relu(x)  # Note rather than have the ReLU be an object of the module, we use the function version
        return x

In [64]:
x = torch.randn(4,2)
x = LinAct(2,6)(x)
x

tensor([[0.0000, 0.5719, 0.0000, 0.0000, 0.2449, 0.4093],
        [0.0000, 0.1497, 0.1155, 0.1138, 0.0000, 1.1983],
        [0.0000, 0.1622, 0.0000, 0.0000, 0.1617, 0.5982],
        [0.0000, 0.8241, 0.0000, 0.0000, 0.1513, 0.6360]],
       grad_fn=<ReluBackward0>)

A fully functional approach would be something like:

In [65]:
w = nn.init.kaiming_normal_(torch.empty(6,2))
b = torch.zeros(6)

In [84]:
x = torch.randn(4,2)
F.relu(F.linear(x, weight=w, bias=b))

tensor([[0.0000, 0.0000, 0.0000, 0.9819, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.1321, 2.9878, 0.0000, 1.7430],
        [0.0218, 0.0000, 0.0000, 0.0000, 0.0367, 0.0000],
        [0.2847, 0.0000, 0.0000, 0.0000, 0.5401, 0.0000]])

Or even:

In [85]:
torch.clamp_min(x@w.T+b, 0)

tensor([[0.0000, 0.0000, 0.0000, 0.9819, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.1321, 2.9878, 0.0000, 1.7430],
        [0.0218, 0.0000, 0.0000, 0.0000, 0.0367, 0.0000],
        [0.2847, 0.0000, 0.0000, 0.0000, 0.5401, 0.0000]])