# Building Models with PyTorch

## `torch.nn.Module` and `torch.nn.Parameter`

[torch.nn](https://pytorch.org/docs/stable/nn.html)

One important behavior of `torch.nn.Module` is registering parameters. If a particular `Module` subclass has learnable weights, these weights are expressed as instances of `torch.nn.Parameter`. The `Parameter` class is a subclass of `torch.Tensor`, with the special behavior that when an instance is assigned as an attribute of a `Module`, it is added to the list of that module's parameters. These parameters may be accessed through the `parameters()` method on the `Module` class.
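
To make the registration behavior concrete, here is a minimal sketch. The class `Registered` and its attribute names are hypothetical and exist only for this illustration; the point is that a tensor wrapped in `Parameter` shows up in `named_parameters()`, while a plain tensor attribute does not.

```python
import torch

class Registered(torch.nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        # Wrapped in Parameter: registered automatically on attribute assignment.
        self.weight = torch.nn.Parameter(torch.zeros(3))
        # Plain tensor attribute: stored on the module but not registered.
        self.scratch = torch.zeros(3)

demo = Registered()
registered_names = [name for name, _ in demo.named_parameters()]
print(registered_names)
```

Only `weight` should be listed, which is also why an optimizer built from `demo.parameters()` would never see `scratch`.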

In [1]:
import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()

print('The model:')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)

The model:
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Just one layer:
Linear(in_features=200, out_features=10, bias=True)


Model params:
Parameter containing:
tensor([[-0.0678, -0.0083,  0.0389,  ...,  0.0846, -0.0746,  0.0811],
        [-0.0356, -0.0681,  0.0407,  ..., -0.0334,  0.0926, -0.0103],
        [-0.0768,  0.0274, -0.0580,  ...,  0.0773, -0.0267, -0.0276],
        ...,
        [-0.0742,  0.0702, -0.0173,  ...,  0.0899,  0.0311, -0.0975],
        [ 0.0210, -0.0501,  0.0672,  ..., -0.0244, -0.0752, -0.0604],
        [-0.0501, -0.0739, -0.0244,  ..., -0.0244,  0.0633, -0.0640]],
       requires_grad=True)
Parameter containing:
tensor([ 0.0908,  0.0186, -0.0268, -0.0434, -0.0570, -0.0222, -0.0604,  0.0200,
        -0.0494, -0.0108,  0.0860,  0.0659, -0.0044, -0.0891, -0.0466, -0.0179,
         0.0430,  0.0358, -0.0626, -0.09

## Common Layer Types

### Linear Layers

In [2]:
lin = torch.nn.Linear(3, 2)     # 3 inputs, 2 outputs
x = torch.rand(1, 3)
print('Input:')
print(x)

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x)
print('\n\nOutput:')
print(y)

Input:
tensor([[0.1672, 0.9918, 0.5229]])


Weight and Bias parameters:
Parameter containing:
tensor([[ 0.3893, -0.5746, -0.2660],
        [-0.5176, -0.0223,  0.3729]], requires_grad=True)
Parameter containing:
tensor([ 0.4292, -0.5554], requires_grad=True)


Output:
tensor([[-0.2146, -0.4691]], grad_fn=<AddmmBackward0>)


If you do the matrix multiplication of `x` by the linear layer’s weights, and add the biases, you’ll find that you get the output vector `y`.
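
The check described above can be sketched as follows. It uses fresh random values rather than the ones printed earlier, and relies on `nn.Linear` storing its weight with shape `(out_features, in_features)`, so the input is multiplied by the transposed weight.

```python
import torch

lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)

# nn.Linear stores its weight as (out_features, in_features),
# so the input is multiplied by the transposed weight matrix.
manual = x @ lin.weight.T + lin.bias
y = lin(x)

print(torch.allclose(manual, y, atol=1e-6))
```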

One other important feature to note: when we printed the layer’s parameters by iterating over `lin.parameters()`, each one reported itself as a `Parameter` (which is a subclass of `Tensor`) and let us know that it’s tracking gradients with autograd (`requires_grad=True`). This is a default behavior for `Parameter` that differs from `Tensor`.
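
A small sketch of that difference in defaults, using throwaway tensors named `plain` and `wrapped` (names chosen here for illustration):

```python
import torch

plain = torch.rand(2, 3)
wrapped = torch.nn.Parameter(torch.rand(2, 3))

print(plain.requires_grad)                 # plain tensors do not track gradients by default
print(wrapped.requires_grad)               # Parameters do
print(isinstance(wrapped, torch.Tensor))   # a Parameter is still a Tensor
```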

### Convolutional Layers

Convolutional layers are built to handle data with a high degree of spatial correlation. They are very commonly used in computer vision, where they detect close groupings of features which they then compose into higher-level features. They pop up in other contexts too - for example, in NLP applications, where a word’s immediate context (that is, the other words nearby in the sequence) can affect the meaning of a sentence.

In [3]:
import torch.nn.functional as F


class LeNet(torch.nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        # 1 input image channel (black & white), 6 output channels,
        # 5x5 square convolution kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)  # equivalent to Conv2d(1, 6, (5, 5))
        # Constructor arguments, in order: number of input channels,
        # number of output channels (features), and window (kernel) size.
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

Let's break down what's happening in the convolutional layer:

1. The first argument to a convolutional layer’s constructor is the number of input channels.

2. The second argument is the number of output features.

3. The third argument is the window or kernel size. A single integer gives a square kernel; a tuple such as `(3, 5)` gives a 3x5 convolution kernel.

The output of a convolutional layer is an activation map - a spatial representation of the presence of features in the input tensor. Given a 1x32x32 black & white input image (the size implied by the `16 * 6 * 6` input of `fc1`), conv1 will give us an output tensor of 6x28x28; 6 is the number of features, and 28 is the height and width of our map. (The 28 comes from the fact that when scanning a 5-pixel window over a 32-pixel row, there are only 28 valid positions.)
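
The "valid positions" arithmetic can be written down as a small helper. `conv_output_size` is a hypothetical function written for this note, not a PyTorch API, and it assumes a dilation of 1; the 1x32x32 input is likewise an assumption.

```python
import torch

def conv_output_size(n, kernel, stride=1, padding=0):
    """Number of window positions along one spatial dimension (dilation of 1 assumed)."""
    return (n + 2 * padding - kernel) // stride + 1

conv1 = torch.nn.Conv2d(1, 6, 5)
image = torch.rand(1, 1, 32, 32)   # assumed batch of one 1x32x32 image
activation_map = conv1(image)

print(conv_output_size(32, 5))     # positions for a 5-pixel window on a 32-pixel row
print(activation_map.shape)        # batch, features, height, width
```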

We then pass the output of the convolution through a ReLU activation function (more on activation functions later), then through a max pooling layer. The max pooling layer takes features near each other in the activation map and groups them together. It does this by reducing the tensor, merging every 2x2 group of cells in the output into a single cell, and assigning that cell the maximum value of the 4 cells that went into it. This gives us a lower-resolution version of the activation map, with dimensions 6x14x14.

Our next convolutional layer, conv2, expects 6 input channels (corresponding to the 6 features sought by the first layer), has 16 output channels, and a 3x3 kernel. It puts out a 16x12x12 activation map, which is again reduced by a max pooling layer to 16x6x6. Prior to passing this output to the linear layers, it is reshaped to a 16 * 6 * 6 = 576-element vector for consumption by the next layer.
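
The shapes quoted in the last few paragraphs can be traced with standalone layers. This is a sketch that assumes a single 1x32x32 input and records the shape after each stage in a dictionary called `shapes` (a name introduced here).

```python
import torch
import torch.nn.functional as F

conv1 = torch.nn.Conv2d(1, 6, 5)
conv2 = torch.nn.Conv2d(6, 16, 3)

x = torch.rand(1, 1, 32, 32)       # assumed input: batch of one 1x32x32 image
shapes = {}

x = F.relu(conv1(x))
shapes['conv1'] = tuple(x.shape)
x = F.max_pool2d(x, 2)
shapes['pool1'] = tuple(x.shape)
x = F.relu(conv2(x))
shapes['conv2'] = tuple(x.shape)
x = F.max_pool2d(x, 2)
shapes['pool2'] = tuple(x.shape)
x = torch.flatten(x, start_dim=1)  # keep the batch dimension, flatten the rest
shapes['flat'] = tuple(x.shape)

for stage, shape in shapes.items():
    print(stage, shape)
```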

There are convolutional layers for addressing 1D, 2D, and 3D tensors. There are also many more optional arguments for a conv layer constructor, including stride length (e.g., only scanning every second or every third position in the input), padding (so you can scan out to the edges of the input), and more. See the [documentation](https://pytorch.org/docs/stable/nn.html#convolution-layers) for more information.
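
As a rough illustration of these options, the sketch below applies `padding` and `stride` to an assumed 32x32 input and runs a 1D convolution over an assumed 32-sample signal. The layer sizes are arbitrary choices for the example.

```python
import torch

image = torch.rand(1, 1, 32, 32)

padded = torch.nn.Conv2d(1, 6, 5, padding=2)   # pad 2 pixels on every side
strided = torch.nn.Conv2d(1, 6, 5, stride=2)   # move the window 2 pixels at a time

padded_shape = tuple(padded(image).shape)      # padding keeps the 32x32 resolution
strided_shape = tuple(strided(image).shape)    # stride 2 roughly halves it

signal = torch.rand(1, 1, 32)                  # (batch, channels, length)
conv1d = torch.nn.Conv1d(1, 6, 5)
signal_shape = tuple(conv1d(signal).shape)

print(padded_shape, strided_shape, signal_shape)
```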

### Recurrent Layers

Recurrent neural networks (or RNNs) are used for sequential data - anything from time-series measurements from a scientific instrument to natural language sentences to DNA nucleotides. An RNN does this by maintaining a hidden state that acts as a sort of memory for what it has seen in the sequence so far.
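
One way to see the hidden state acting as memory is to feed a sequence one step at a time, carrying the state forward, and compare with feeding the whole sequence at once. This is a sketch with arbitrary sizes (input size 4, hidden size 8, sequence length 5), all of which are assumptions for the example.

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=8)
sequence = torch.rand(5, 1, 4)      # (seq_len, batch, input_size)

# Whole sequence in one call.
full_out, _ = lstm(sequence)

# One step at a time, passing the (hidden, cell) state along.
state = None
step_outputs = []
for step in sequence:
    out, state = lstm(step.unsqueeze(0), state)
    step_outputs.append(out)
stepwise_out = torch.cat(step_outputs, dim=0)

print(torch.allclose(full_out, stepwise_out, atol=1e-5))
```

If the state were reset to `None` at every step, the two results would generally differ, because each step would no longer see what came before it.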

In [4]:
# LSTM (long short-term memory)
# GRU (gated recurrent unit)
class LSTMTagger(torch.nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

https://pytorch.org/tutorials/beginner/nlp/

https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

The constructor has four arguments:

1. `vocab_size` is the number of words in the input vocabulary. Each word is a one-hot vector (or unit vector) in a `vocab_size`-dimensional space.

2. `tagset_size` is the number of tags in the output set.

3. `embedding_dim` is the size of the embedding space for the vocabulary. An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in the space.

4. `hidden_dim` is the size of the LSTM’s memory.
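
To see how the four sizes shape the data, here is a sketch that runs the same three layers outside the class. The concrete numbers and the word indices in `sentence` are hypothetical, chosen only so the shapes are easy to read.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes chosen only for illustration.
vocab_size, tagset_size = 20, 3
embedding_dim, hidden_dim = 6, 8

word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([4, 0, 17, 9, 2])   # five word indices, each < vocab_size

embeds = word_embeddings(sentence)                         # (5, embedding_dim)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))      # (5, 1, hidden_dim)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))   # (5, tagset_size)
tag_scores = F.log_softmax(tag_space, dim=1)               # log-probabilities per word

print(embeds.shape, lstm_out.shape, tag_scores.shape)
```

Each row of `tag_scores` holds log-probabilities over the tag set for one word, so exponentiating a row and summing it gives 1.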

### Transformers

https://pytorch.org/docs/stable/nn.html#transformer-layers

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

## Other Layers and Functions

https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html#other-layers-and-functions

### Data Manipulation Layers

**Max pooling** reduces a tensor by combining cells and assigning the maximum value of the input cells to the output cell, as we saw in the LeNet example. Its twin, min pooling, assigns the minimum value instead.

In [5]:
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)       # 3x3 window: one maximum per 3x3 block of the 6x6 input
print(maxpool_layer(my_tensor))

tensor([[[0.2374, 0.6107, 0.7623, 0.4106, 0.6409, 0.9780],
         [0.5889, 0.3431, 0.4157, 0.0494, 0.6288, 0.7908],
         [0.8304, 0.3715, 0.3963, 0.4810, 0.1997, 0.9967],
         [0.5372, 0.2004, 0.9509, 0.9904, 0.4673, 0.6919],
         [0.7814, 0.0688, 0.2617, 0.3582, 0.4661, 0.0269],
         [0.2980, 0.1115, 0.4167, 0.2535, 0.3754, 0.9538]]])
tensor([[[0.8304, 0.9967],
         [0.9509, 0.9904]]])
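
The example above covers max pooling only. As far as we are aware, `torch.nn` has no dedicated min-pooling layer, so the sketch below is a workaround rather than an official API: it negates the input, max-pools, and negates the result. A fixed `arange` tensor is used so the values are easy to check by hand.

```python
import torch

grid = torch.arange(36.0).view(1, 6, 6)   # values 0..35 laid out row by row
maxpool_layer = torch.nn.MaxPool2d(3)

max_pooled = maxpool_layer(grid)          # largest value in each 3x3 block
min_pooled = -maxpool_layer(-grid)        # min pooling via negation (workaround)

print(max_pooled)
print(min_pooled)
```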


**Normalization layers** re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting you use higher learning rates without exploding/vanishing gradients.

In [6]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())

tensor([[[19.1950, 22.5907, 10.8352,  8.4007],
         [12.7303,  5.9058,  7.0304, 14.9840],
         [14.0016,  9.8467, 19.7862, 12.3650],
         [14.6087, 22.4253, 11.5003, 11.9954]]])
tensor(13.6376)
tensor([[[ 6.7600e-01,  1.2587e+00, -7.5847e-01, -1.1762e+00],
         [ 6.7565e-01, -1.1201e+00, -8.2420e-01,  1.2687e+00],
         [ 4.7868e-04, -1.1367e+00,  1.5836e+00, -4.4744e-01],
         [-1.1975e-01,  1.6677e+00, -8.3058e-01, -7.1736e-01]]],
       grad_fn=<NativeBatchNormBackward0>)
tensor(-5.2154e-08, grad_fn=<MeanBackward0>)
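
What the layer computed above can be reproduced by hand. This sketch assumes a freshly constructed `BatchNorm1d` in training mode, where the learnable scale starts at 1 and the shift at 0, so the output is the input minus the per-channel mean, divided by the square root of the per-channel (biased) variance plus `eps`.

```python
import torch

my_tensor = torch.rand(1, 4, 4) * 20 + 5
norm_layer = torch.nn.BatchNorm1d(4)       # training mode by default
normed_tensor = norm_layer(my_tensor)

# Statistics are taken per channel (dim 1), over the batch and length dimensions.
mean = my_tensor.mean(dim=(0, 2), keepdim=True)
var = ((my_tensor - mean) ** 2).mean(dim=(0, 2), keepdim=True)   # biased variance
manual = (my_tensor - mean) / torch.sqrt(var + norm_layer.eps)

print(torch.allclose(normed_tensor, manual, atol=1e-4))
```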


**Dropout layers** are a tool for encouraging sparse representations in your model - that is, pushing it to do inference with less data.

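Dropout layers work by randomly setting parts of the input tensor to zero during training. This forces the model to learn against this masked or reduced dataset. The surviving values are scaled up by `1 / (1 - p)` so that the expected magnitude is unchanged, which is why the output below contains values greater than 1. Dropout is turned off for inference, that is, when the module is in evaluation mode (`eval()`).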

https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html

In [7]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[1.4551, 0.0000, 0.0000, 1.4674],
         [0.0000, 1.1179, 0.6502, 0.0000],
         [0.0000, 0.0000, 1.3803, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.7054]]])
tensor([[[1.4551, 0.3403, 0.3280, 0.0000],
         [0.4372, 0.0000, 0.6502, 0.7847],
         [0.9113, 0.5016, 1.3803, 0.0000],
         [0.8670, 0.0000, 0.5093, 0.7054]]])
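
The difference between training and inference behavior can be sketched with a tensor of ones, which makes the scaling of the surviving values easy to see. Switching modes with `train()` and `eval()` is done explicitly here for clarity.

```python
import torch

my_tensor = torch.ones(1, 4, 4)
dropout = torch.nn.Dropout(p=0.4)

dropout.train()                   # modules start in training mode; made explicit here
train_out = dropout(my_tensor)    # each entry is either 0 or 1 / (1 - 0.4)

dropout.eval()                    # inference mode: dropout passes the input through
eval_out = dropout(my_tensor)

print(train_out.unique())
print(torch.equal(eval_out, my_tensor))
```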


### Activation Functions

`torch.nn` has modules encapsulating all of the major activation functions, including `ReLU` and its many variants, `Tanh`, `Hardtanh`, `Sigmoid`, and more. It also includes other functions, such as `Softmax`, that are most useful at the output stage of a model.
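
A few of these applied to the same small input, as a sketch. The input values are arbitrary, and `dim=1` for `Softmax` is an explicit choice made here so that each row is normalized.

```python
import torch

x = torch.tensor([[-2.0, 0.0, 3.0]])

relu = torch.nn.ReLU()
tanh = torch.nn.Tanh()
sigmoid = torch.nn.Sigmoid()
softmax = torch.nn.Softmax(dim=1)    # normalize across the feature dimension

relu_out = relu(x)                   # negatives clamped to zero
tanh_out = tanh(x)                   # squashed into (-1, 1)
sigmoid_out = sigmoid(x)             # squashed into (0, 1)
softmax_out = softmax(x)             # non-negative, each row sums to 1

print(relu_out, tanh_out, sigmoid_out, softmax_out, sep='\n')
```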

### Loss Functions

**Loss functions** tell us how far a model’s prediction is from the correct answer. PyTorch contains a variety of loss functions, including the common `MSELoss` (mean squared error, the average of the squared differences between prediction and target), `CrossEntropyLoss` and `NLLLoss` (negative log likelihood loss), both useful for classifiers, and others.
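
A short sketch of these losses on hand-picked values. The tensors are made up for the example; the second half illustrates that `CrossEntropyLoss` on raw scores matches `NLLLoss` applied to `log_softmax` of the same scores.

```python
import torch
import torch.nn.functional as F

prediction = torch.tensor([[1.0, 2.0], [3.0, 5.0]])
target = torch.tensor([[1.0, 0.0], [3.0, 1.0]])
# Squared differences are 0, 4, 0, 16, so their mean is 5.
mse = torch.nn.MSELoss()(prediction, target)

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])   # raw class scores
labels = torch.tensor([0, 2])                                # correct class per row
cross_entropy = torch.nn.CrossEntropyLoss()(logits, labels)
nll = torch.nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)

print(mse.item())
print(torch.allclose(cross_entropy, nll))
```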