# Building Models with PyTorch

This notebook is based on the fourth video in the [PyTorch Beginner Series](https://www.youtube.com/playlist?list=PL_lsbAsL_o2CTlGHgMxNrKhzP97BaG9ZN) by Brad Heintz on YouTube. The video covers the basic PyTorch concepts used across several deep learning tasks and shows how they come together to make PyTorch a robust machine learning framework. The notebook that accompanies the video is available [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).


In [1]:
# Import libraries here
import json

import numpy as np
import torch
import torch.nn.functional as F
import torch.optim as optim
from torch import Tensor

## Build a Simple Model

This model is similar to the one built in notebook-03.


In [2]:
class TinyModel(torch.nn.Module):
    """A simple model created to set a baseline."""

    def __init__(self, *args, **kwargs) -> None:
        super(TinyModel, self).__init__(*args, **kwargs)

        # Setup layers and activations
        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
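        # Converts output to probabilities. No `dim` is passed, so PyTorch
        # chooses the dimension implicitly and emits a deprecation warning.
        self.softmax = torch.nn.Softmax()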

    def forward(self, x: Tensor) -> Tensor:
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)

        return x

In [3]:
# Initialize the model
tiny_model = TinyModel()
print(f'The Model Architecture:\n{tiny_model}\n')
print(f'Layer `linear1`:\n{tiny_model.linear1}\n')
print(f'Layer `linear2`:\n{tiny_model.linear2}')

The Model Architecture:
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)

Layer `linear1`:
Linear(in_features=100, out_features=200, bias=True)

Layer `linear2`:
Linear(in_features=200, out_features=10, bias=True)


In [4]:
# Print model parameters
print('~~~ Model Parameters ~~~')
for param in tiny_model.parameters():
    print(param)

~~~ Model Parameters ~~~
Parameter containing:
tensor([[-0.0096, -0.0139,  0.0813,  ..., -0.0623,  0.0220,  0.0795],
        [-0.0685, -0.0007, -0.0208,  ..., -0.0624,  0.0247, -0.0416],
        [-0.0643,  0.0150,  0.0839,  ...,  0.0790, -0.0351,  0.0725],
        ...,
        [ 0.0953, -0.0047, -0.0872,  ...,  0.0153, -0.0308, -0.0769],
        [ 0.0222,  0.0118,  0.0024,  ..., -0.0430,  0.0779, -0.0577],
        [ 0.0057, -0.0811, -0.0907,  ...,  0.0742, -0.0705, -0.0153]],
       requires_grad=True)
Parameter containing:
tensor([ 8.3996e-02,  1.0357e-02, -2.4370e-02,  4.0140e-02, -7.2262e-02,
         6.9584e-02, -1.2005e-02, -5.4845e-02, -7.5443e-02, -7.2718e-02,
         1.4722e-02, -3.8378e-02, -1.4451e-02,  8.2732e-02,  8.9670e-02,
        -7.0884e-02,  5.8302e-03,  6.3268e-02, -8.9240e-02,  5.4744e-02,
        -1.7199e-02,  8.8408e-02, -2.3642e-02,  6.5555e-02,  1.0773e-02,
         4.0726e-02,  4.9739e-03, -8.2377e-02, -2.7889e-02, -8.6999e-03,
        -6.3434e-02,  3.0250e-02

In [5]:
# Print parameters for `linear1`
print('~~~ Parameters for `linear1` ~~~')
for param in tiny_model.linear1.parameters():
    print(param)

~~~ Parameters for `linear1` ~~~
Parameter containing:
tensor([[-0.0096, -0.0139,  0.0813,  ..., -0.0623,  0.0220,  0.0795],
        [-0.0685, -0.0007, -0.0208,  ..., -0.0624,  0.0247, -0.0416],
        [-0.0643,  0.0150,  0.0839,  ...,  0.0790, -0.0351,  0.0725],
        ...,
        [ 0.0953, -0.0047, -0.0872,  ...,  0.0153, -0.0308, -0.0769],
        [ 0.0222,  0.0118,  0.0024,  ..., -0.0430,  0.0779, -0.0577],
        [ 0.0057, -0.0811, -0.0907,  ...,  0.0742, -0.0705, -0.0153]],
       requires_grad=True)
Parameter containing:
tensor([ 8.3996e-02,  1.0357e-02, -2.4370e-02,  4.0140e-02, -7.2262e-02,
         6.9584e-02, -1.2005e-02, -5.4845e-02, -7.5443e-02, -7.2718e-02,
         1.4722e-02, -3.8378e-02, -1.4451e-02,  8.2732e-02,  8.9670e-02,
        -7.0884e-02,  5.8302e-03,  6.3268e-02, -8.9240e-02,  5.4744e-02,
        -1.7199e-02,  8.8408e-02, -2.3642e-02,  6.5555e-02,  1.0773e-02,
         4.0726e-02,  4.9739e-03, -8.2377e-02, -2.7889e-02, -8.6999e-03,
        -6.3434e-02,  3.

In [6]:
# Print parameters for `linear2`
print('~~~ Parameters for `linear2` ~~~')
for param in tiny_model.linear2.parameters():
    print(param)

~~~ Parameters for `linear2` ~~~
Parameter containing:
tensor([[ 0.0220, -0.0115, -0.0099,  ..., -0.0230, -0.0522, -0.0582],
        [-0.0649, -0.0156, -0.0100,  ...,  0.0078,  0.0496, -0.0472],
        [ 0.0267,  0.0307,  0.0091,  ..., -0.0699, -0.0444,  0.0642],
        ...,
        [-0.0156, -0.0348, -0.0549,  ..., -0.0410, -0.0562, -0.0690],
        [-0.0145, -0.0271,  0.0367,  ..., -0.0088, -0.0289,  0.0588],
        [ 0.0653, -0.0488,  0.0368,  ...,  0.0137, -0.0194,  0.0349]],
       requires_grad=True)
Parameter containing:
tensor([ 0.0287, -0.0619,  0.0022, -0.0518,  0.0398,  0.0272,  0.0683, -0.0197,
         0.0049, -0.0497], requires_grad=True)
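
The printed tensors are easier to reason about by shape and size. The sketch below counts the parameters of each layer; it uses a `torch.nn.Sequential` stand-in with the same layer sizes as `TinyModel` (an assumption made so the block runs on its own), so the parameter names are the positional ones that `Sequential` assigns.

```python
import torch

# Stand-in with the same layer sizes as `TinyModel` (100 -> 200 -> 10).
model = torch.nn.Sequential(
    torch.nn.Linear(100, 200),
    torch.nn.ReLU(),
    torch.nn.Linear(200, 10),
)

# `named_parameters()` yields (name, tensor) pairs; `numel()` counts elements.
param_counts = {name: p.numel() for name, p in model.named_parameters()}
for name, count in param_counts.items():
    print(f'{name:10s} -> {count}')

total_params = sum(param_counts.values())
print(f'Total trainable parameters: {total_params}')
```

Each `Linear(in, out)` layer holds `in * out` weights plus `out` biases, which is why the first layer dominates the total.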


## Examining Layer Types

Some common layer types are listed below:

- Linear layers - also called fully-connected layers; every input influences every output.
- Convolutional layers - handle data with a high degree of spatial correlation.
- Recurrent layers - handle sequential data by keeping a memory in hidden states.
- Transformers - multi-purpose networks with built-in attention heads, encoders, decoders, etc.
- Data manipulation layers
  - Max/Average pooling layers - reduce a tensor by combining neighboring cells into their max/average value.
  - Normalization layers - re-center and rescale the output of one layer before passing it to the next.
  - Dropout layers - randomly set inputs to 0 during training, encouraging sparse representations in the model.

Some associated functions that are important in building a model:

- Activation functions - introduce non-linearity into the model and determine whether a neuron is activated.
- Loss functions - measure how far the model's predictions are from the targets; the weights are optimized to reduce this value.
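
As a small illustration of the two kinds of functions listed above, the sketch below applies a ReLU activation and a mean-squared-error loss to hand-picked values. The choice of ReLU and `MSELoss` is only an example; any other activation or loss could be substituted.

```python
import torch

# Activation: ReLU zeroes negative values and passes positive values through.
z = torch.tensor([-1.0, 0.0, 2.0])
activated = torch.nn.functional.relu(z)
print(f'ReLU output: {activated}')

# Loss: mean squared error between predictions and targets.
predictions = torch.tensor([1.0, 2.0])
targets = torch.tensor([1.0, 4.0])
loss = torch.nn.MSELoss()(predictions, targets)
print(f'MSE loss: {loss.item()}')  # ((1 - 1)^2 + (2 - 4)^2) / 2
```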


### Linear Layers


In [7]:
# Define a linear layer
linear = torch.nn.Linear(3, 2)

# Define inputs
x = torch.rand(1, 3)
print(f'Inputs:\n{x}\n')

# Print the weights and bias
print('~~~ Weights and Bias for the Linear Layer ~~~')
for param in linear.parameters():
    print(param)

# Produce outputs
y = linear(x)
print(f'\nOutputs:\n{y}')

Inputs:
tensor([[0.0988, 0.2758, 0.8293]])

~~~ Weights and Bias for the Linear Layer ~~~
Parameter containing:
tensor([[ 0.5479,  0.1983,  0.2888],
        [-0.2375, -0.3624,  0.3960]], requires_grad=True)
Parameter containing:
tensor([-0.0918, -0.2877], requires_grad=True)

Outputs:
tensor([[ 0.2565, -0.0828]], grad_fn=<AddmmBackward0>)
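
A linear layer computes `y = x @ W.T + b`. The sketch below redoes that arithmetic by hand with the rounded inputs, weights, and bias printed above, so the result should agree with the printed output only to about three decimal places.

```python
import torch

# Values copied (rounded to four decimals) from the output above.
x = torch.tensor([[0.0988, 0.2758, 0.8293]])
weight = torch.tensor([[0.5479, 0.1983, 0.2888],
                       [-0.2375, -0.3624, 0.3960]])  # shape (out_features, in_features)
bias = torch.tensor([-0.0918, -0.2877])

# Manual computation of the linear map.
y_manual = x @ weight.T + bias

# The functional form of the same operation.
y_functional = torch.nn.functional.linear(x, weight, bias)

print(f'Manual     : {y_manual}')
print(f'Functional : {y_functional}')
```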


### Convolutional Layers


In [8]:
# Define a convolutional neural network
class ConvNet(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super(ConvNet, self).__init__(*args, **kwargs)

        # Define model architecture
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 5)
        self.fc1 = torch.nn.Linear(16 * 5 * 5, 120)
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x: Tensor) -> Tensor:
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

In [9]:
# Initialize the CNN
conv_net = ConvNet()
print(f'The Model Architecture:\n{conv_net}\n')

# Define inputs
x = torch.rand(1, 1, 32, 32)
print(f'Inputs:\n{x}\n')

# Produce outputs
y = conv_net(x)
print(f'Outputs:\n{y}')

The Model Architecture:
ConvNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

Inputs:
tensor([[[[0.6978, 0.4626, 0.4278,  ..., 0.2858, 0.8567, 0.2354],
          [0.7500, 0.8984, 0.8596,  ..., 0.1333, 0.5216, 0.0551],
          [0.5411, 0.1402, 0.4467,  ..., 0.9585, 0.0554, 0.3797],
          ...,
          [0.6499, 0.0406, 0.0228,  ..., 0.5594, 0.4459, 0.3524],
          [0.9065, 0.2543, 0.0755,  ..., 0.4392, 0.7752, 0.3953],
          [0.3245, 0.0428, 0.3254,  ..., 0.4777, 0.3953, 0.3578]]]])

Outputs:
tensor([[-0.0524,  0.0297, -0.1074, -0.0212, -0.0262, -0.1134,  0.0113,  0.0020,
         -0.0375, -0.1123]], grad_fn=<AddmmBackward0>)
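
The `16 * 5 * 5` input size of `fc1` follows from how each step shrinks a 32x32 image. The sketch below traces the tensor shape through the two convolution and pooling stages, using freshly initialized layers with the same sizes as `ConvNet`.

```python
import torch
import torch.nn.functional as F

conv1 = torch.nn.Conv2d(1, 6, 5)
conv2 = torch.nn.Conv2d(6, 16, 5)

x = torch.rand(1, 1, 32, 32)               # (batch, channels, height, width)
a = conv1(x)                               # 5x5 kernel, no padding: 32 -> 28
b = F.max_pool2d(F.relu(a), 2)             # 2x2 pooling: 28 -> 14
c = conv2(b)                               # 5x5 kernel, no padding: 14 -> 10
d = F.max_pool2d(F.relu(c), 2)             # 2x2 pooling: 10 -> 5
flat = d.view(-1, 16 * 5 * 5)              # flatten for the linear layers

for name, t in [('input', x), ('conv1', a), ('pool1', b),
                ('conv2', c), ('pool2', d), ('flat', flat)]:
    print(f'{name:6s}: {tuple(t.shape)}')
```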


### Recurrent Layers


In [10]:
# Define a recurrent neural network with LSTM cells
class LSTMTagger(torch.nn.Module):
    def __init__(
        self,
        embedding_dim: int,
        hidden_size: int,
        vocab_size: int,
        tagset_size: int,
    ) -> None:
        super(LSTMTagger, self).__init__()

        # Set hidden dimensions
        self.hidden_size = hidden_size

        # Define word embeddings
        self.word_embeddings = torch.nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embedding_dim,
        )

        # Define LSTM cell
        self.lstm = torch.nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
        )

        # Setup a hidden layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_size, tagset_size)

    def forward(self, sentence: Tensor) -> Tensor:
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)

        return tag_scores

In [11]:
# Setup training data
train_data = [
    ('The dog ate the apple'.split(), ['DET', 'NN', 'V', 'DET', 'NN']),
    ('Everybody read that book'.split(), ['NN', 'V', 'DET', 'NN']),
    ('The apple ate the book'.split(), ['DET', 'NN', 'V', 'DET', 'NN']),
    ('Everybody read the apple'.split(), ['NN', 'V', 'DET', 'NN']),
]

# Mapping words to indices
word_indices = {}
for sentence, _ in train_data:
    for word in sentence:
        if word not in word_indices:
            word_indices[word] = len(word_indices)
print(f'Word Indices = {json.dumps(word_indices, indent=4)}')

# Mapping tags to indices
tag_indices = {'DET': 0, 'NN': 1, 'V': 2}
print(f'Tag Indices = {json.dumps(tag_indices, indent=4)}')

Word Indices = {
    "The": 0,
    "dog": 1,
    "ate": 2,
    "the": 3,
    "apple": 4,
    "Everybody": 5,
    "read": 6,
    "that": 7,
    "book": 8
}
Tag Indices = {
    "DET": 0,
    "NN": 1,
    "V": 2
}


In [12]:
def encode_sequence(seq: list[str], indices: dict[str, int]) -> Tensor:
    """
    Converts a sequence of words to a tensor of indices based on the given mapping.

    Args:
        seq (list[str]): A list of words to be encoded.
        indices (dict[str, int]):\
            A dictionary mapping words to their corresponding indices.

    Returns:
        Tensor: A tensor containing the indices of the words in the input sequence.
    """
    idxs = [indices[word] for word in seq]
    return torch.tensor(idxs, dtype=torch.long)

In [13]:
# Initialize the LSTM model
lstm_tagger = LSTMTagger(
    embedding_dim=6,
    hidden_size=6,
    vocab_size=len(word_indices),
    tagset_size=len(tag_indices),
)
print(f'The Model Architecture:\n{lstm_tagger}')

The Model Architecture:
LSTMTagger(
  (word_embeddings): Embedding(9, 6)
  (lstm): LSTM(6, 6)
  (hidden2tag): Linear(in_features=6, out_features=3, bias=True)
)
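
To see what `forward` does to a sentence, the sketch below pushes the indices for "The dog ate the apple" through standalone layers with the same sizes as the model above and records the shape at each step. The layers here are freshly initialized copies, not the `lstm_tagger` instance.

```python
import torch
import torch.nn.functional as F

embedding = torch.nn.Embedding(num_embeddings=9, embedding_dim=6)
lstm = torch.nn.LSTM(input_size=6, hidden_size=6)
hidden2tag = torch.nn.Linear(6, 3)

# Indices of "The dog ate the apple" in the word mapping above.
sentence = torch.tensor([0, 1, 2, 3, 4], dtype=torch.long)

embeds = embedding(sentence)                      # (5, 6): one vector per word
lstm_in = embeds.view(len(sentence), 1, -1)       # (5, 1, 6): (seq_len, batch, features)
lstm_out, (h_n, c_n) = lstm(lstm_in)              # (5, 1, 6): one hidden state per word
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))  # (5, 3): one score per tag
tag_scores = F.log_softmax(tag_space, dim=1)      # log-probabilities over the 3 tags

# Exponentiating log-probabilities gives probabilities that sum to 1 per word.
row_sums = tag_scores.exp().sum(dim=1)
print(tuple(embeds.shape), tuple(lstm_out.shape), tuple(tag_scores.shape))
print(row_sums)
```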


In [14]:
# Setup the loss function and optimizer
loss_fn = torch.nn.NLLLoss()
optimizer = optim.SGD(lstm_tagger.parameters(), lr=0.001)

# Setup prediction collection
evaluation_results = {}

# Train the model
N_EPOCHS = 100
for epoch in range(N_EPOCHS):
    for sentence, tags in train_data:
        # Prepare the inputs and targets
        lstm_tagger.zero_grad()
        sentence_encoded = encode_sequence(sentence, word_indices)
        targets = encode_sequence(tags, tag_indices)

        # Perform forward pass
        tag_scores = lstm_tagger(sentence_encoded)
        predictions = tag_scores.argmax(dim=1)
        evaluation_results[' '.join(sentence)] = dict(
            targets=targets.numpy().tolist(),
            predictions=predictions.numpy().tolist(),
        )

        # Compute loss and perform backpropagation
        loss = loss_fn(tag_scores, targets)
        loss.backward()
        optimizer.step()

    # Print training progress
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1:3d}/{N_EPOCHS}], Loss: {loss.item():.4f}')

Epoch [ 10/100], Loss: 1.1323
Epoch [ 20/100], Loss: 1.1284
Epoch [ 30/100], Loss: 1.1247
Epoch [ 40/100], Loss: 1.1211
Epoch [ 50/100], Loss: 1.1175
Epoch [ 60/100], Loss: 1.1140
Epoch [ 70/100], Loss: 1.1107
Epoch [ 80/100], Loss: 1.1074
Epoch [ 90/100], Loss: 1.1042
Epoch [100/100], Loss: 1.1011


In [15]:
# Get the prediction evaluation results
for sentence, result in evaluation_results.items():
    print(f'Sentence: "{sentence}"')
    targets = result['targets']
    predictions = result['predictions']
    print(f'    Targets     : {targets}')
    print(f'    Predictions : {predictions}')

Sentence: "The dog ate the apple"
    Targets     : [0, 1, 2, 0, 1]
    Predictions : [1, 2, 1, 2, 2]
Sentence: "Everybody read that book"
    Targets     : [1, 2, 0, 1]
    Predictions : [1, 1, 1, 1]
Sentence: "The apple ate the book"
    Targets     : [0, 1, 2, 0, 1]
    Predictions : [1, 1, 1, 2, 1]
Sentence: "Everybody read the apple"
    Targets     : [1, 2, 0, 1]
    Predictions : [1, 1, 2, 1]


In [16]:
# Compute the accuracy of the model
correct_predictions = 0
total_predictions = 0
for sentence, result in evaluation_results.items():
    targets = result['targets']
    predictions = result['predictions']

    correct_predictions += (
        (np.array(targets) == np.array(predictions)).sum()
    )
    total_predictions += len(predictions)

accuracy_score = correct_predictions / total_predictions
print(f'Accuracy of the Model: {(accuracy_score * 100):.4f}%')

Accuracy of the Model: 33.3333%
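
The loss above barely moves in 100 epochs, and the predictions were recorded during training rather than after it. One plausible cause of the slow progress is the small learning rate (0.001). The sketch below is an experiment under that assumption: it retrains a compact copy of the tagger (the class name `SmallTagger`, the learning rate 0.1, and the 200 epochs are all illustrative choices, not values from this notebook) and measures the mean loss before and after training with gradients disabled.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# The same toy data, already encoded with the word and tag indices above.
data = [
    ([0, 1, 2, 3, 4], [0, 1, 2, 0, 1]),  # The dog ate the apple
    ([5, 6, 7, 8], [1, 2, 0, 1]),        # Everybody read that book
    ([0, 4, 2, 3, 8], [0, 1, 2, 0, 1]),  # The apple ate the book
    ([5, 6, 3, 4], [1, 2, 0, 1]),        # Everybody read the apple
]


class SmallTagger(torch.nn.Module):
    """Hypothetical compact copy of the tagger architecture above."""

    def __init__(self) -> None:
        super().__init__()
        self.word_embeddings = torch.nn.Embedding(9, 6)
        self.lstm = torch.nn.LSTM(6, 6)
        self.hidden2tag = torch.nn.Linear(6, 3)

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)


def mean_loss(model: torch.nn.Module, loss_fn: torch.nn.Module) -> float:
    """Average loss over all sentences, computed without tracking gradients."""
    with torch.no_grad():
        losses = [loss_fn(model(torch.tensor(w)), torch.tensor(t)) for w, t in data]
    return torch.stack(losses).mean().item()


model = SmallTagger()
loss_fn = torch.nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed value, 100x the one above

loss_before = mean_loss(model, loss_fn)
for epoch in range(200):  # assumed epoch count
    for words, tags in data:
        optimizer.zero_grad()
        loss = loss_fn(model(torch.tensor(words)), torch.tensor(tags))
        loss.backward()
        optimizer.step()
loss_after = mean_loss(model, loss_fn)

print(f'Mean loss before training: {loss_before:.4f}')
print(f'Mean loss after training : {loss_after:.4f}')
```

Evaluating in a separate pass after training, as `mean_loss` does, avoids mixing predictions made with different versions of the weights.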


In [17]:
# Compile the LSTM model to TorchScript, a static representation
lstm_script = torch.jit.script(lstm_tagger)

# Save the model script locally for future use
lstm_script.save('../models/04_lstm_tagger.pt')
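
A saved script can be restored with `torch.jit.load` without the original Python class being available. The sketch below shows the save and load round trip on a small stand-in module written to a temporary file (both the stand-in and the file name are assumptions for illustration); the path used in the cell above would be passed to `torch.jit.load` in the same way.

```python
import os
import tempfile

import torch

# Stand-in module; the tagger above would be scripted and saved the same way.
original = torch.nn.Linear(3, 2)
scripted = torch.jit.script(original)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_script.pt')  # hypothetical file name
    scripted.save(path)
    restored = torch.jit.load(path)

# The restored module reproduces the original outputs.
x = torch.rand(4, 3)
original_out = original(x)
restored_out = restored(x)
print(torch.allclose(original_out, restored_out))
```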

### Data Manipulation Layers

These layers mostly have no learnable weights of their own (batch normalization, which learns a scale and shift by default, is the exception) but are essential for manipulating tensors between other layers:

- Average/Max pooling layers
- Normalization layers
- Dropout layers


In [18]:
# Define a tensor
tensor_0 = torch.rand(1, 6, 6)
print(f'Tensor 0:\n{tensor_0}')

# Create pooling layers
avg_pooling_layer = torch.nn.AvgPool2d(3)
max_pooling_layer = torch.nn.MaxPool2d(3)
print(f'\nAverage-Pooled Tensor:\n{avg_pooling_layer(tensor_0)}')
print(f'\nMax-Pooled Tensor:\n{max_pooling_layer(tensor_0)}')

Tensor 0:
tensor([[[0.1469, 0.8377, 0.9813, 0.2753, 0.5628, 0.3891],
         [0.3284, 0.9396, 0.9312, 0.2181, 0.8076, 0.9192],
         [0.9497, 0.9007, 0.4809, 0.8573, 0.6149, 0.5760],
         [0.4669, 0.8119, 0.9940, 0.2087, 0.5927, 0.1092],
         [0.2479, 0.2736, 0.3748, 0.9859, 0.8099, 0.1101],
         [0.6605, 0.7027, 0.3157, 0.1883, 0.5325, 0.2906]]])

Average-Pooled Tensor:
tensor([[[0.7218, 0.5800],
         [0.5387, 0.4253]]])

Max-Pooled Tensor:
tensor([[[0.9813, 0.9192],
         [0.9940, 0.9859]]])


In [19]:
# Set the kernel size
kernel_size = 3

# Compute the dimensions of the output tensor
_, H, W = tensor_0.size()
H_out, W_out = H // kernel_size, W // kernel_size

# Setup pooled tensors
avgs = torch.zeros(1, H_out, W_out)
maxs = torch.zeros(1, H_out, W_out)

for i in range(H_out):
    for j in range(W_out):
        # Extract the current (kernel_size x kernel_size) window
        window = tensor_0[
            0,
            (i * kernel_size) : ((i + 1) * kernel_size),
            (j * kernel_size) : ((j + 1) * kernel_size),
        ]

        # Calculate the average and
        # max values of the window
        avgs[0, i, j] = window.mean()
        maxs[0, i, j] = window.max()

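# Print the manually computed tensors
# Note: `torch.equal` requires exact equality, so float rounding in the mean
# can report False even when the printed values match; `torch.allclose`
# is the tolerant alternative.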
print(f'Tensor 0:\n{tensor_0}')
print(f'\nAverage-Pooled Tensor (manual):\n{avgs}')
print(f'Is it the same?: {torch.equal(avg_pooling_layer(tensor_0), avgs)}')
print(f'\nMax-Pooled Tensor (manual):\n{maxs}')
print(f'Is it the same?: {torch.equal(max_pooling_layer(tensor_0), maxs)}')

Tensor 0:
tensor([[[0.1469, 0.8377, 0.9813, 0.2753, 0.5628, 0.3891],
         [0.3284, 0.9396, 0.9312, 0.2181, 0.8076, 0.9192],
         [0.9497, 0.9007, 0.4809, 0.8573, 0.6149, 0.5760],
         [0.4669, 0.8119, 0.9940, 0.2087, 0.5927, 0.1092],
         [0.2479, 0.2736, 0.3748, 0.9859, 0.8099, 0.1101],
         [0.6605, 0.7027, 0.3157, 0.1883, 0.5325, 0.2906]]])

Average-Pooled Tensor (manual):
tensor([[[0.7218, 0.5800],
         [0.5387, 0.4253]]])
Is it the same?: False

Max-Pooled Tensor (manual):
tensor([[[0.9813, 0.9192],
         [0.9940, 0.9859]]])
Is it the same?: True
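
The `False` reported for average pooling most likely comes from floating-point rounding, since the printed values agree to four decimals. The sketch below contrasts exact comparison (`torch.equal`) with tolerant comparison (`torch.allclose`) and then checks a vectorized manual average pool against `AvgPool2d`. The reshape trick is one possible way to form the 3x3 windows, not the method used above.

```python
import torch

# Two values that differ by roughly one part in a million.
a = torch.tensor([1.0])
b = torch.tensor([1.000001])
exact = torch.equal(a, b)        # bit-for-bit comparison
close = torch.allclose(a, b)     # comparison within default tolerances
print(f'equal: {exact}, allclose: {close}')

# Manual 3x3 average pooling on a 6x6 tensor via reshape:
# (1, 6, 6) -> (1, 2, 3, 2, 3), then average over the two window axes.
tensor = torch.rand(1, 6, 6)
manual = tensor.reshape(1, 2, 3, 2, 3).mean(dim=(2, 4))
pooled = torch.nn.AvgPool2d(3)(tensor)
print(f'Pooling matches within tolerance: {torch.allclose(pooled, manual)}')
```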


In [20]:
# Define another tensor
tensor_1 = torch.rand(1, 4, 4) * 20 + 5
print(f'Tensor 1:\n{tensor_1}')
print(f'Mean : {tensor_1.mean()}')
print(f'Std  : {tensor_1.std()}')

# Setup a normalization layer
normalization_layer = torch.nn.BatchNorm1d(4)
normalized_tensor = normalization_layer(tensor_1)
print(f'\nNormalized Tensor:\n{normalized_tensor}')
print(f'Mean : {normalized_tensor.mean()}')
print(f'Std  : {normalized_tensor.std()}')

Tensor 1:
tensor([[[14.1492, 16.2734, 18.0320, 17.1201],
         [ 8.4800, 15.2588, 15.1151,  8.5726],
         [13.1176, 19.5021, 18.3224,  8.0068],
         [10.3068,  8.2278, 20.0766, 17.6381]]])
Mean : 14.262460708618164
Std  : 4.286467552185059

Normalized Tensor:
tensor([[[-1.5615, -0.0837,  1.1398,  0.5054],
         [-1.0137,  1.0214,  0.9783, -0.9859],
         [-0.3545,  1.0430,  0.7848, -1.4733],
         [-0.7622, -1.1842,  1.2207,  0.7257]]],
       grad_fn=<NativeBatchNormBackward0>)
Mean : -2.2351741790771484e-08
Std  : 1.032794713973999


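Batch normalization standardizes each channel (each row here) to a mean of 0 and a standard deviation of 1. The overall standard deviation printed above is about 1.03 rather than exactly 1 because `Tensor.std()` applies Bessel's correction by default, while batch normalization uses the biased estimate; with 16 elements the ratio is sqrt(16/15) ≈ 1.0328.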
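
The per-channel behavior can be checked directly. The sketch below rebuilds a tensor of the same shape with a fixed seed (so the values differ from those printed above) and computes the mean and the biased standard deviation of each channel after normalization.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

x = torch.rand(1, 4, 4) * 20 + 5       # (batch, channels, length)
bn = torch.nn.BatchNorm1d(4)           # modules start in training mode
y = bn(x)

# Statistics per channel, taken over the batch and length dimensions.
channel_mean = y.mean(dim=(0, 2))
channel_std_biased = y.std(dim=(0, 2), unbiased=False)

# Statistic over the whole tensor with the default (unbiased) estimator.
overall_std_unbiased = y.std()

print(f'Per-channel mean        : {channel_mean}')
print(f'Per-channel std (biased): {channel_std_biased}')
print(f'Overall std (unbiased)  : {overall_std_unbiased}')
```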


In [21]:
# Define a third tensor
tensor_2 = torch.rand(1, 4, 4)
print(f'Tensor 2:\n{tensor_2}')

# Create dropout layers
dropout_layer_1 = torch.nn.Dropout(p=0.4)
print(f'\nDropout (p=0.4):\n{dropout_layer_1(tensor_2)}')

dropout_layer_2 = torch.nn.Dropout(p=1.0)
print(f'\nDropout (p=1.0):\n{dropout_layer_2(tensor_2)}')

Tensor 2:
tensor([[[0.9159, 0.5101, 0.0268, 0.1150],
         [0.1693, 0.8296, 0.4148, 0.4501],
         [0.8585, 0.5178, 0.8615, 0.1808],
         [0.4003, 0.0933, 0.0819, 0.4586]]])

Dropout (p=0.4):
tensor([[[0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.7501],
         [0.0000, 0.8630, 0.0000, 0.3013],
         [0.6671, 0.1555, 0.1365, 0.7644]]])

Dropout (p=1.0):
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])


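Each element is zeroed with probability `p`, and the surviving elements are scaled by `1 / (1 - p)` so that the expected value is unchanged (with `p=0.4`, 0.4501 becomes 0.7501). With `p=1.0` every element is zeroed. Dropout is only active in training mode.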
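
The scaling and the difference between training and evaluation mode can be verified on a tensor of ones. This is a minimal sketch; the seed and the tensor length are arbitrary choices.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

p = 0.4
dropout = torch.nn.Dropout(p=p)
x = torch.ones(1000)

# Training mode: elements are randomly zeroed and survivors are rescaled.
dropout.train()
train_out = dropout(x)
kept = train_out[train_out != 0]
print(f'Fraction zeroed : {1 - kept.numel() / x.numel():.3f}')
print(f'Surviving value : {kept[0].item():.4f} (expected {1 / (1 - p):.4f})')

# Evaluation mode: dropout is a no-op.
dropout.eval()
eval_out = dropout(x)
print(f'Unchanged in eval mode: {torch.equal(eval_out, x)}')
```

Calling `model.eval()` on a full model switches every dropout layer inside it to this no-op behavior.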
