## Neural Network Module

So far we have looked at tensors, their properties, and basic operations on them. These are especially useful to be familiar with if we are building the layers of our network from scratch. We will use them in Assignment 3, but moving forward, we will use the predefined blocks in the `torch.nn` module of `PyTorch`. We will then put these blocks together to create complex networks. Let's start by importing this module with an alias so that we don't have to type `torch.nn` every time we use it.

In [1]:
import torch
import torch.nn as nn

### **Linear Layer**
We can use `nn.Linear(H_in, H_out)` to create a linear layer. It takes a tensor of shape `(N, *, H_in)` and outputs a tensor of shape `(N, *, H_out)`, where `*` denotes an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b` on the last dimension of the input, where the weight `A` and the bias `b` are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

In [5]:
# Create the inputs
input = torch.ones(2, 3, 4)
# Shape (N, *, H_in): here N = 2, * = 3, H_in = 4

# Make a linear layer transforming (N, *, H_in) dimensional inputs to
# (N, *, H_out) dimensional outputs
linear = nn.Linear(4, 2)
linear_output = linear(input)
linear_output

tensor([[[-0.3042, -0.1397],
         [-0.3042, -0.1397],
         [-0.3042, -0.1397]],

        [[-0.3042, -0.1397],
         [-0.3042, -0.1397],
         [-0.3042, -0.1397]]], grad_fn=<ViewBackward0>)

The last dimension of the input needs to match the `H_in` of the linear layer.

In [6]:
list(linear.parameters()) # Ax + b

[Parameter containing:
 tensor([[ 0.1587,  0.0453,  0.0508, -0.3680],
         [ 0.2049,  0.2626, -0.3730, -0.0281]], requires_grad=True),
 Parameter containing:
 tensor([-0.1910, -0.2061], requires_grad=True)]
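
To make the `Ax+b` operation concrete, here is a minimal sketch that recomputes the layer's output by hand. It relies on `nn.Linear` storing its weight with shape `(H_out, H_in)`, which matches the `2 x 4` weight printed above. The variable names (`x`, `manual`, `no_bias`) and the seed are illustrative choices, not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

x = torch.ones(2, 3, 4)
linear = nn.Linear(4, 2)

# The weight has shape (H_out, H_in), so the layer computes
# x @ weight.T + bias over the last dimension of x.
manual = x @ linear.weight.T + linear.bias
matches = torch.allclose(linear(x), manual, atol=1e-6)
print(matches)

# With bias=False the layer only has a weight parameter.
no_bias = nn.Linear(4, 2, bias=False)
print(no_bias.bias)                     # None
print(len(list(no_bias.parameters())))  # 1
```

The transpose is why the last dimension of the input has to equal `H_in`.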

In [None]:
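# Data of shape [batch_size, feature_dim], here feature_dim = 4
# Output of shape [batch_size, output_dim], here output_dim = 2

# The linear layer maps feature_dim -> output_dim;
# its weight has shape (output_dim, feature_dim)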

### **Other Module Layers**
There are several other preconfigured layers in the `nn` module. Some commonly used examples are `nn.Conv2d`, `nn.ConvTranspose2d`, `nn.BatchNorm1d`, `nn.BatchNorm2d`, `nn.Upsample` and `nn.MaxPool2d` among many others. We will learn more about these as we progress in the course. For now, the only important thing to remember is that we can treat each of these layers as plug and play components: we will be providing the required dimensions and `PyTorch` will take care of setting them up.
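
As a rough illustration of the plug and play idea, the sketch below chains a few of these layers and prints the resulting shapes. The input size, channel counts, and kernel sizes are arbitrary choices made for this example only.

```python
import torch
import torch.nn as nn

# A batch with one 3-channel 16x16 "image": (N, C, H, W)
images = torch.randn(1, 3, 16, 16)

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
norm = nn.BatchNorm2d(num_features=8)
pool = nn.MaxPool2d(kernel_size=2)
upsample = nn.Upsample(scale_factor=2)

features = norm(conv(images))  # channels 3 -> 8, spatial size kept by padding=1
pooled = pool(features)        # spatial size halved
restored = upsample(pooled)    # spatial size doubled again

print(features.shape)
print(pooled.shape)
print(restored.shape)
```

We only specify the dimensions; the layers create and initialize their own weights.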

### **Activation Function Layer**
We can also use the `nn` module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples of activation functions are `nn.ReLU()`, `nn.Sigmoid()` and `nn.LeakyReLU()`. Activation functions operate on each element separately, so the output tensor has the same shape as the tensor we pass in.

In [7]:
linear_output

tensor([[[-0.3042, -0.1397],
         [-0.3042, -0.1397],
         [-0.3042, -0.1397]],

        [[-0.3042, -0.1397],
         [-0.3042, -0.1397],
         [-0.3042, -0.1397]]], grad_fn=<ViewBackward0>)

In [8]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

tensor([[[0.4245, 0.4651],
         [0.4245, 0.4651],
         [0.4245, 0.4651]],

        [[0.4245, 0.4651],
         [0.4245, 0.4651],
         [0.4245, 0.4651]]], grad_fn=<SigmoidBackward0>)
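
The other activation functions mentioned above are used the same way. The following sketch applies `nn.ReLU` and `nn.LeakyReLU` to a small hand-picked tensor; the input values and the `negative_slope` of `0.1` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

values = torch.tensor([[-2.0, 0.0, 3.0]])

relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.1)

relu_out = relu(values)         # negative entries become 0
leaky_out = leaky_relu(values)  # negative entries are scaled by 0.1

print(relu_out)
print(leaky_out)
print(relu_out.shape == values.shape)
```

Both outputs keep the shape of `values`, since each element is transformed independently.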

### **Putting the Layers Together**
So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use `nn.Sequential`, which does exactly that.

In [9]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

input = torch.ones(2,3,4)
output = block(input)
output

tensor([[[0.5690, 0.5417],
         [0.5690, 0.5417],
         [0.5690, 0.5417]],

        [[0.5690, 0.5417],
         [0.5690, 0.5417],
         [0.5690, 0.5417]]], grad_fn=<SigmoidBackward0>)
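
To check that `nn.Sequential` really is just chaining the layers, this small sketch pulls the layers back out of the block by index and calls them one by one. The variable names are illustrative.

```python
import torch
import torch.nn as nn

block = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid())
x = torch.ones(2, 3, 4)

# Layers inside a Sequential can be indexed like a list.
linear, sigmoid = block[0], block[1]

# Calling the layers one by one gives the same result as calling the block.
step_by_step = sigmoid(linear(x))
same = torch.allclose(block(x), step_by_step)
print(same)
```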

### Custom Modules

Instead of using the predefined modules, we can also build our own by extending the `nn.Module` class. For example, we can build `nn.Linear` (which also extends `nn.Module`) on our own using the tensors introduced earlier! We can also build new, more complex modules, such as a custom neural network. You will practice these in a later assignment.

To create a custom module, the first thing we have to do is extend `nn.Module`. We can then initialize our parameters in the `__init__` function, starting with a call to the `__init__` function of the superclass. All class attributes we define that are `nn` module objects are registered as submodules, and their parameters can be learned during training. Plain tensors are not parameters, but they can be turned into parameters by wrapping them in the `nn.Parameter` class.

All classes extending `nn.Module` are also expected to implement a `forward(x)` function, where `x` is a tensor. This is the function that is called when an input is passed to our module, such as in `model(x)`.

In [10]:
class MultilayerPreceptron(nn.Module):

  def __init__(self, input_size, hidden_size):
    # Call to the __init__ function of the super class
    super().__init__()

    # Bookkeeping: Saving the initialization parameters
    self.input_size = input_size
    self.hidden_size = hidden_size

    # Defining our model
    # There isn't anything specific about the naming of `self.model`
    # It could be something arbitrary
    self.model = nn.Sequential(
        nn.Linear(self.input_size, self.hidden_size),
        nn.ReLU(),
        nn.Linear(self.hidden_size, self.input_size),
        nn.Sigmoid()
    )

  def forward(self, x):
    output = self.model(x)
    return output

In [12]:
# Make a sample input
input = torch.randn(2, 5)

# Create our model
model = MultilayerPreceptron(5, 3)

# Pass our input through our model
model(input)

tensor([[0.6355, 0.5599, 0.7260, 0.7225, 0.6926],
        [0.6583, 0.5405, 0.7435, 0.7074, 0.6545]], grad_fn=<SigmoidBackward0>)

In [13]:
list(model.named_parameters())

[('model.0.weight',
  Parameter containing:
  tensor([[-0.2593,  0.3496,  0.2565, -0.0212,  0.1479],
          [-0.3470,  0.2189,  0.2929, -0.3131,  0.2926],
          [-0.1784, -0.0630, -0.1378, -0.1042, -0.2313]], requires_grad=True)),
 ('model.0.bias',
  Parameter containing:
  tensor([ 0.1755,  0.4140, -0.3019], requires_grad=True)),
 ('model.2.weight',
  Parameter containing:
  tensor([[ 0.3475, -0.1371,  0.2973],
          [-0.1524,  0.2370, -0.2218],
          [ 0.5438,  0.1234, -0.5754],
          [-0.0465,  0.3260,  0.5406],
          [-0.2919,  0.5702,  0.0846]], requires_grad=True)),
 ('model.2.bias',
  Parameter containing:
  tensor([0.4732, 0.0418, 0.3950, 0.5584, 0.2759], requires_grad=True))]
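
The difference between plain tensors and `nn.Parameter` can be seen in a hand-written linear layer. This is only a sketch: the class name `MyLinear`, the `scale` attribute, and the initialization scheme are hypothetical choices for this example and are not how `nn.Linear` is actually implemented.

```python
import torch
import torch.nn as nn

class MyLinear(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, in_features, out_features):
        super().__init__()
        # Wrapping tensors in nn.Parameter registers them as learnable.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # A plain tensor attribute is NOT registered as a parameter.
        self.scale = torch.ones(1)

    def forward(self, x):
        return (x @ self.weight.T + self.bias) * self.scale

layer = MyLinear(4, 2)
names = [name for name, _ in layer.named_parameters()]
print(names)  # ['weight', 'bias']
print(layer(torch.ones(2, 3, 4)).shape)
```

Only `weight` and `bias` show up in `named_parameters()`, so an optimizer given `layer.parameters()` would never update `scale`.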

## Optimization
We have shown how gradients are calculated with the `backward()` function. Having the gradients isn't enough for our models to learn; we also need to know how to update the parameters of our models. This is where optimizers come in. The `torch.optim` module contains several optimizers that we can use. Some popular examples are `optim.SGD` and `optim.Adam`. When initializing an optimizer, we pass our model parameters, which can be accessed with `model.parameters()`, telling the optimizer which values it will be optimizing. Optimizers also have a learning rate (`lr`) parameter, which determines how big of an update will be made in every step. Different optimizers have different hyperparameters as well.

In [14]:
import torch.optim as optim
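
To see what a single optimizer step does, here is a minimal sketch using plain `optim.SGD` on one hand-made parameter. The toy loss `w ** 2` and the learning rate are arbitrary choices for this example.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
sgd = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # d(loss)/dw = 2 * w = 2.0
loss.backward()        # fills w.grad
sgd.step()             # w <- w - lr * w.grad = 1.0 - 0.1 * 2.0 = 0.8
print(w)

sgd.zero_grad()        # clear the gradient before the next iteration
```

The learning rate scales the gradient before it is subtracted, which is why a larger `lr` means a bigger update per step.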

After we have our optimizer, we can define a `loss` that we want to optimize for. We can either define the loss ourselves, or use one of the predefined loss functions in `PyTorch`, such as `nn.BCELoss()`. Let's put everything together now! We will start by creating some dummy data.

In [15]:
# Create the y data
y = torch.ones(10, 5)

# Add some noise to our goal y to generate our x
# We want our model to predict our original data, despite the noise
x = y + torch.randn_like(y)
x

tensor([[-0.9643,  0.8302, -0.3312,  0.5233,  0.8402],
        [ 0.7868,  1.3532,  0.4634,  2.2927,  1.9026],
        [ 1.1020,  1.7708,  0.9847,  0.4543,  0.5615],
        [-0.0502,  1.3032,  1.2882,  1.5763, -0.1869],
        [-0.6478, -0.4904, -0.1440,  0.0830,  0.4464],
        [ 1.0662,  1.4892,  1.5628,  1.8956,  0.7481],
        [ 0.2444,  1.1906,  1.5316,  0.3316,  0.9216],
        [ 0.8383,  1.1177,  0.2687, -1.2358, -0.0720],
        [ 0.3813,  0.7183,  0.9800,  1.9383,  1.1709],
        [ 0.8400, -0.4514,  0.4877,  2.1526,  1.5566]])

Now, we can define our model, optimizer and the loss function.

In [17]:
# Instantiate the model
model = MultilayerPreceptron(5, 3)

# Define the optimizer
adam = optim.Adam(model.parameters(), lr=1e-1)

# Define loss using a predefined loss function
loss_function = nn.BCELoss()

# Calculate how our model is doing
y_pred = model(x)
loss_function(y_pred, y).item()

0.6285327672958374
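
For reference, `nn.BCELoss()` computes the binary cross-entropy between predictions in `(0, 1)` and targets in `[0, 1]`, averaged over all elements by default. The sketch below compares it with the formula written out by hand on a few made-up values; the numbers are illustrative and unrelated to the model above.

```python
import torch
import torch.nn as nn

probs = torch.tensor([0.9, 0.2, 0.7])    # predictions in (0, 1)
targets = torch.tensor([1.0, 0.0, 1.0])  # targets in [0, 1]

builtin = nn.BCELoss()(probs, targets)

# The same quantity by hand: -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -(targets * torch.log(probs)
           + (1 - targets) * torch.log(1 - probs)).mean()

print(builtin.item(), manual.item())
```

This is also why the model ends with `nn.Sigmoid()`: the loss expects its inputs to be probabilities.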

Let's see if we can have our model achieve a smaller loss. Now that we have everything we need, we can set up our training loop.

In [26]:
# Set the number of epochs, which determines the number of training iterations
n_epoch = 100

for epoch in range(n_epoch):
  # Set the gradients to 0
  adam.zero_grad()

  # Get the model predictions
  y_pred = model(x)

  # Get the loss
  loss = loss_function(y_pred, y)

  # Print stats
  print(f"Epoch {epoch}: training loss: {loss}")

  # Compute the gradients
  loss.backward()

  # Take a step to optimize the weights
  adam.step()

Epoch 0: training loss: 0.046968575567007065
Epoch 1: training loss: 0.040236763656139374
Epoch 2: training loss: 0.035454392433166504
Epoch 3: training loss: 0.031939342617988586
Epoch 4: training loss: 0.029276227578520775
Epoch 5: training loss: 0.027201585471630096
Epoch 6: training loss: 0.025465665385127068
Epoch 7: training loss: 0.023145359009504318
Epoch 8: training loss: 0.020726410672068596
Epoch 9: training loss: 0.018294090405106544
Epoch 10: training loss: 0.015935909003019333
Epoch 11: training loss: 0.013725428842008114
Epoch 12: training loss: 0.011714156717061996
Epoch 13: training loss: 0.009930022060871124
Epoch 14: training loss: 0.008380299434065819
Epoch 15: training loss: 0.007056694012135267
Epoch 16: training loss: 0.0059408340603113174
Epoch 17: training loss: 0.005009099375456572
Epoch 18: training loss: 0.004236235283315182
Epoch 19: training loss: 0.0035977547522634268
Epoch 20: training loss: 0.003071337006986141
Epoch 21: training loss: 0.002637388650327

In [27]:
list(model.parameters())

[Parameter containing:
 tensor([[-0.0182,  1.6984,  1.4604,  1.5639,  1.4134],
         [ 0.1076, -0.2563,  0.1508, -0.1516, -0.2936],
         [-0.4619, -0.8205, -1.1260, -0.0103, -0.0219]], requires_grad=True),
 Parameter containing:
 tensor([ 2.3666, -0.2720, -0.9283], requires_grad=True),
 Parameter containing:
 tensor([[ 2.0084,  0.0029,  1.1707],
         [ 2.4318, -0.4874,  0.3727],
         [ 2.3818, -0.1265,  0.4126],
         [ 2.1378, -0.5300,  0.2120],
         [ 2.3393,  0.0768,  0.3778]], requires_grad=True),
 Parameter containing:
 tensor([2.2316, 1.7545, 2.1739, 1.4753, 2.1327], requires_grad=True)]

In [28]:
y_pred = model(x)
y_pred

tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9984, 0.9989, 0.9992, 0.9974, 0.9991],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9994, 0.9997, 0.9998, 0.9991, 0.9997],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000]], grad_fn=<SigmoidBackward0>)

In [29]:
# Create test data and check how our model performs on it
x2 = y + torch.randn_like(y)
y_pred = model(x2)
y_pred

tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9999, 0.9999, 1.0000, 0.9998, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000]], grad_fn=<SigmoidBackward0>)

Great! Looks like our model almost perfectly learned to filter out the noise from the `x` that we passed in!
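
One way to put a number on this is to compare the loss on fresh noisy data before and after training. The sketch below is a self-contained variant of the steps above, with some assumptions of its own: it uses a bare `nn.Sequential` with the same layer sizes instead of the custom class, an arbitrary seed, and `torch.no_grad()` to skip gradient tracking during evaluation.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 5), nn.Sigmoid())
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-1)

y = torch.ones(10, 5)
x_train = y + torch.randn_like(y)
x_test = y + torch.randn_like(y)  # fresh noise the model never trains on

# Evaluation does not need gradients, so we disable tracking.
with torch.no_grad():
    loss_before = loss_function(model(x_test), y).item()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_function(model(x_train), y)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    loss_after = loss_function(model(x_test), y).item()

print(f"test loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because every target in this toy setup is 1, a low loss can also be reached by predicting values near 1 regardless of the input, so this is mainly a check that the training pipeline works end to end.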