<h1>PyTorch Tutorial Part 2</h1>
<h2>Build the Neural Network</h2>
<a href="https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html">Here.</a>


In [1]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

<h3>Get device for Training</h3>
<p>We want to be able to train our model on a hardware accelerator like the GPU or MPS, if available. Let’s check to see if torch.cuda or torch.backends.mps are available, otherwise we use the CPU.</p>

In [2]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


<h3>Define the Class</h3>
<p>We define our neural network by subclassing nn.Module, and initialize the neural network layers in __init__. Every nn.Module subclass implements the operations on input data in the forward method.</p>

In [3]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

<p>We create an instance of NeuralNetwork, and move it to the device, and print its structure.</p>

In [4]:
model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


<p>To use the model, we pass it the input data. This executes the model’s forward, along with some background operations. Do not call model.forward() directly!

Calling the model on the input returns a 2-dimensional tensor with dim=0 corresponding to each output of 10 raw predicted values for each class, and dim=1 corresponding to the individual values of each output. We get the prediction probabilities by passing it through an instance of the nn.Softmax module.</p>

In [5]:
X = torch.rand(1, 28, 28, device=device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)  # softmax normalizes the logits into probabilities (it is not the softplus curve)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")

Predicted class: tensor([2])
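<p>One way to see the "background operations" mentioned above is with a forward hook. The sketch below uses a small stand-in layer (the names <code>layer</code>, <code>calls</code> and <code>record_call</code> are illustrative, not from the tutorial) and shows that a hook registered on a module runs when the module is called, but not when forward() is invoked directly.</p>

```python
import torch
from torch import nn

# Hypothetical toy module, used only to illustrate the call path.
layer = nn.Linear(4, 2)
calls = []

def record_call(module, inputs, output):
    # Forward hooks are among the extra steps handled by nn.Module.__call__.
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_call)
x = torch.rand(3, 4)

layer(x)          # goes through __call__, so the hook fires
layer.forward(x)  # bypasses __call__, so the hook is skipped

handle.remove()
print(calls)  # [(3, 2)]
```

<p>Only one entry is recorded, which is why calling the model object is the recommended way to run it.</p>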


<h3>Model Layers</h3>
<p>Let’s break down the layers in the FashionMNIST model. To illustrate it, we will take a sample minibatch of 3 images of size 28x28 and see what happens to it as we pass it through the network.</p>

In [6]:
input_image = torch.rand(3,28,28)
print(input_image.size())

torch.Size([3, 28, 28])


In [7]:
##nn.Flatten
#convert each 2D 28*28 image into a contiguous array of 784 pixel values
#(minibatch dimension (at dim=0) is maintained)
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())

torch.Size([3, 784])
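<p>As a rough cross-check, the same result can be obtained by reshaping by hand. This sketch assumes the default start_dim=1 of nn.Flatten; the variable names are illustrative.</p>

```python
import torch
from torch import nn

batch = torch.rand(3, 28, 28)

flat_a = nn.Flatten()(batch)                # module form, keeps dim 0
flat_b = batch.reshape(batch.size(0), -1)   # manual reshape, same layout
flat_c = torch.flatten(batch, start_dim=1)  # functional form

print(flat_a.shape)
print(torch.equal(flat_a, flat_b), torch.equal(flat_a, flat_c))
```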


In [8]:
##nn.Linear
#The linear layer is a module that applies a linear transformation to the input
#using its stored weights and biases
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.size())

torch.Size([3, 20])
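<p>The linear transformation can be written out by hand as <code>x @ W.T + b</code>. The sketch below is a minimal check, with an arbitrary seed and illustrative names, that this matches what nn.Linear computes. Note that the weight is stored with shape (out_features, in_features).</p>

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

layer = nn.Linear(in_features=28 * 28, out_features=20)
x = torch.rand(3, 28 * 28)

# Manual version of the layer: multiply by the transposed weight, add the bias.
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(layer(x), manual, atol=1e-5))
```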


In [10]:
##nn.ReLU
#ReLU curves - squiggles! See StatQuest notes
#Non-linear activations are what create the complex mappings between the model’s
#inputs and outputs. They are applied after linear transformations to introduce
#nonlinearity, helping neural networks learn a wide variety of phenomena.

#In this model, we use nn.ReLU between our linear layers, but there are other
#activations that introduce non-linearity in your model.
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

Before ReLU: tensor([[0.0504, 0.0000, 0.0000, 0.3182, 0.1382, 0.0901, 0.0534, 0.4403, 0.0253,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0663, 0.1126, 0.0000,
         0.7353, 0.0000],
        [0.2159, 0.0000, 0.0097, 0.3971, 0.0000, 0.0000, 0.0000, 0.1731, 0.0462,
         0.0000, 0.0000, 0.0430, 0.3793, 0.0272, 0.0000, 0.0000, 0.2539, 0.0000,
         0.5289, 0.0000],
        [0.2683, 0.0000, 0.0631, 0.3418, 0.0000, 0.0000, 0.0000, 0.0000, 0.0490,
         0.1616, 0.0777, 0.0000, 0.0875, 0.1322, 0.0000, 0.0000, 0.1325, 0.0000,
         0.7407, 0.0000]], grad_fn=<ReluBackward0>)


After ReLU: tensor([[0.0504, 0.0000, 0.0000, 0.3182, 0.1382, 0.0901, 0.0534, 0.4403, 0.0253,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0663, 0.1126, 0.0000,
         0.7353, 0.0000],
        [0.2159, 0.0000, 0.0097, 0.3971, 0.0000, 0.0000, 0.0000, 0.1731, 0.0462,
         0.0000, 0.0000, 0.0430, 0.3793, 0.0272, 0.0000, 0.0000, 0.2539, 0.0000,
         0.5289, 0.0000],
       

In [11]:
##nn.Sequential
#nn.Sequential is an ordered container of modules. The data is passed through
#all the modules in the same order as defined. You can use sequential
#containers to put together a quick network like seq_modules.
seq_modules = nn.Sequential(
    flatten,
    layer1,
    nn.ReLU(),
    nn.Linear(20, 10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)

In [12]:
##nn.Softmax
#The last linear layer of the neural network returns logits - raw values in
#[-infty, infty] - which are passed to the nn.Softmax module. The logits are
#scaled to values [0, 1] representing the model’s predicted probabilities for
# each class. dim parameter indicates the dimension along which the values must
#sum to 1.
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
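<p>To make the scaling concrete, here is a small sketch with hand-picked logits (the values are made up for illustration). It compares nn.Softmax with the explicit formula exp(x_i) / sum_j exp(x_j) and checks that each row sums to 1 along dim=1.</p>

```python
import torch
from torch import nn

# Illustrative logits: two samples, three classes.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.0, 0.0, 0.0]])

probs = nn.Softmax(dim=1)(logits)

# Explicit formula, normalizing across the class dimension.
manual = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)

print(probs)
print(probs.sum(dim=1))  # each row sums to 1
```

<p>Equal logits, as in the second row, give a uniform distribution over the classes.</p>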

<p>Links to all below:</p>
<ul>
  <li><a href="https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html">Flatten</a></li>
  <li><a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html">Linear Layer</a></li>
  <li><a href="https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html">ReLU</a></li>
  <li><a href="https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html">Sequential</a></li>
  <li><a href="https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html">Softmax</a></li>
</ul>

<h3>Model Params</h3>
<p>Many layers inside a neural network are parameterized, i.e. have associated weights and biases that are optimized during training. Subclassing nn.Module automatically tracks all fields defined inside your model object, and makes all parameters accessible using your model’s parameters() or named_parameters() methods.

In this example, we iterate over each parameter, and print its size and a preview of its values.</p>

In [13]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Model structure: NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[-0.0133, -0.0011, -0.0217,  ...,  0.0325, -0.0033,  0.0252],
        [ 0.0128, -0.0061, -0.0239,  ...,  0.0173, -0.0172, -0.0331]],
       grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([ 0.0038, -0.0113], grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[-0.0431,  0.0236,  0.0379,  ..., -0.0360,  0.0149, -0.0180],
        [-0.0427, -0.0036,  0.0158,  ...,  0.0115,  0.0034,  0.0238]],
       grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.bias | 
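<p>A common use of parameters() is counting how many values the model has to learn. The sketch below rebuilds the same layer stack as a plain nn.Sequential (an assumption made to keep the snippet self-contained) and sums the element counts.</p>

```python
import torch
from torch import nn

# Same layer sizes as linear_relu_stack above.
stack = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

total = sum(p.numel() for p in stack.parameters())
trainable = sum(p.numel() for p in stack.parameters() if p.requires_grad)

# (784*512 + 512) + (512*512 + 512) + (512*10 + 10) = 669706
print(total, trainable)  # 669706 669706
```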

<h2>Automatic Differentiation</h2>
<h3>with Autograd</h3>
<a href="https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html">Here.</a>

<p>When training neural networks, the most frequently used algorithm is <b>back propagation</b>. In this algorithm, parameters (model weights) are adjusted according to the <b>gradient</b> of the loss function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradient for any computational graph.

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:</p>

In [14]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

<h3>Tensors, Functions and Computational Graph</h3>
<p>See diagram on web page</p>
<p>In this network, w and b are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of loss function with respect to those variables. In order to do that, we set the requires_grad property of those tensors.</p>

<p><em>You can set the value of requires_grad when creating a tensor, or later by using x.requires_grad_(True) method.</em></p>
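<p>A minimal sketch of the second option, switching tracking on after the tensor already exists (the tensor <code>t</code> is illustrative):</p>

```python
import torch

t = torch.randn(3)        # requires_grad defaults to False
before = t.requires_grad

t.requires_grad_(True)    # in-place switch; the trailing underscore marks it
after = t.requires_grad

(t * 2).sum().backward()  # d/dt of sum(2 * t) is 2 for every element
print(before, after, t.grad)
```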

<p>A function that we apply to tensors to construct the computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the grad_fn property of a tensor. You can find more information about Function in the <a href="https://pytorch.org/docs/stable/autograd.html#function">documentation</a>.</p>




In [15]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7ebb3712f0d0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7ebb3712eb90>


<h3>Computing Gradients</h3>
<p>To optimize weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to parameters, namely, we need:</p>
<p>$\frac{\partial\, loss}{\partial w}$ and $\frac{\partial\, loss}{\partial b}$</p>

  <p>under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad:</p>
  

In [16]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0159, 0.0447, 0.2784],
        [0.0159, 0.0447, 0.2784],
        [0.0159, 0.0447, 0.2784],
        [0.0159, 0.0447, 0.2784],
        [0.0159, 0.0447, 0.2784]])
tensor([0.0159, 0.0447, 0.2784])
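<p>These values can be checked by hand. Assuming the default mean reduction of binary_cross_entropy_with_logits, the derivative of the loss with respect to each logit is (sigmoid(z) - y) / 3, the gradient for b equals that vector, and the gradient for w is the outer product of x with it. Since x is all ones, every row of w.grad is the same, which matches the output above. The sketch below repeats the setup with an arbitrary seed and compares autograd against this formula.</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

with torch.no_grad():
    # Derivative of the mean BCE-with-logits loss with respect to z.
    dloss_dz = (torch.sigmoid(z) - y) / z.numel()
    expected_w_grad = torch.outer(x, dloss_dz)  # chain rule: dz_j/dw_ij = x_i
    expected_b_grad = dloss_dz                  # chain rule: dz_j/db_j = 1

print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
print(torch.allclose(b.grad, expected_b_grad, atol=1e-6))
```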


<h4>Note:</h4>
<ul>
  <li>We can only obtain the grad properties for the leaf nodes of the computational graph, which have requires_grad property set to True. For all other nodes in our graph, gradients will not be available.</li>
  <li>We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.</li>
</ul>
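<p>Both notes can be seen in a short sketch (tensor names are illustrative). An intermediate tensor only keeps its gradient if we opt in with retain_grad(), and a second backward() on the same graph raises a RuntimeError unless the first call was made with retain_graph=True.</p>

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # leaf tensor
h = x.pow(2)      # intermediate (non-leaf) tensor
h.retain_grad()   # opt in to storing h.grad during backward
loss = h.sum()

loss.backward()   # no retain_graph, so saved buffers are freed afterwards
leaf_grad = x.grad.clone()     # d(loss)/dx = 2 * x
nonleaf_grad = h.grad.clone()  # d(loss)/dh = 1

try:
    loss.backward()            # second pass over the freed graph
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(leaf_grad, nonleaf_grad, second_call_failed)
```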

<h3>Disabling Gradient Tracking</h3>
<p>By default, all tensors with requires_grad=True are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with torch.no_grad() block:</p>

In [17]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


In [18]:
##Another way is with the detach() method
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False


<p>Reasons to disable grad tracking:</p>
<ul>
  <li>To mark some parameters in your neural network as <b>frozen parameters</b>.</li>
  <li>To <b>speed up computations</b> when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.</li>
</ul>
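<p>As a sketch of the frozen-parameter case, the snippet below turns off requires_grad for the first layer of a small made-up network (the network and its sizes are assumptions for illustration). After backward(), only the unfrozen layer has gradients.</p>

```python
import torch
from torch import nn

# Hypothetical two-layer network, used only to illustrate freezing.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

for p in net[0].parameters():
    p.requires_grad_(False)  # freeze the first linear layer

net(torch.rand(5, 4)).sum().backward()

frozen_has_grad = any(p.grad is not None for p in net[0].parameters())
head_has_grad = all(p.grad is not None for p in net[2].parameters())
print(frozen_has_grad, head_has_grad)  # False True
```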

<h3>More on Computational Graphs</h3>
<p>Conceptually, autograd keeps a record of data (tensors) and all executed operations in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.</p>

<p>In a forward pass, autograd does two things simultaneously:</p>
<ul>
  <li>run the requested operation to compute a resulting tensor</li>
  <li>maintain the operation’s gradient function in the DAG.</li>
</ul>

<p>The backward pass kicks off when .backward() is called on the DAG root. autograd then:</p>
<ul>
  <li>computes the gradients from each .grad_fn,</li>
  <li>accumulates them in the respective tensor’s .grad attribute</li>
  <li>using the chain rule, propagates all the way to the leaf tensors.</li>
</ul>

<p><em>Note: <b>DAGs are dynamic in PyTorch.</b> An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.</em></p>
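<p>A small sketch of what a dynamic graph allows: an ordinary Python loop decides how many operations are recorded, so the graph, and therefore the gradient, differs from one call to the next. The helper <code>repeated_double</code> is a made-up example function.</p>

```python
import torch

def repeated_double(x, n_steps):
    # Plain Python control flow: the graph gets n_steps multiply nodes.
    out = x
    for _ in range(n_steps):
        out = out * 2
    return out

grads = []
for n_steps in (1, 3):
    x = torch.tensor(1.0, requires_grad=True)  # fresh leaf, fresh graph
    repeated_double(x, n_steps).backward()
    grads.append(x.grad.item())                # derivative is 2 ** n_steps

print(grads)  # [2.0, 8.0]
```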

<h3>Optional reading</h3>
<p> See web page for diagrams</p>
<p>In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute a so-called Jacobian product, rather than the actual gradient.</p>

In [19]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])


<p>Notice that when we call backward for the second time with the same argument, the value of the gradient is different. This happens because when doing backward propagation, PyTorch <b>accumulates the gradients</b>, i.e. the value of computed gradients is added to the grad property of all leaf nodes of computational graph. If you want to compute the proper gradients, you need to zero out the grad property before. In real-life training an optimizer helps us to do this.</p>

<p><em>Note: Previously we were calling backward() function without parameters. This is essentially equivalent to calling backward(torch.tensor(1.0)), which is a useful way to compute the gradients in case of a scalar-valued function, such as loss during neural network training.</em></p>
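<p>To illustrate what the argument to backward() does, the sketch below compares out.backward(v) with an explicit product of v and the full Jacobian, computed with torch.autograd.functional.jacobian. The function <code>f</code> and the vector <code>v</code> are arbitrary choices for illustration.</p>

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Elementwise square, so the Jacobian is diag(2 * x).
    return x.pow(2)

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 0.1])

out = f(x)
out.backward(v)               # accumulates v^T J into x.grad

J = jacobian(f, x.detach())   # full 3x3 Jacobian, for comparison only
print(x.grad)
print(v @ J)
```

<p>Both lines print the same vector, 2 * x * v, without autograd ever building the full Jacobian in the backward() case.</p>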

<h2>Optimisation</h2>
<a href="https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html">Here</a>

<p>Now that we have a model and data it’s time to train, validate and test our model by optimizing its parameters on our data. Training a model is an iterative process; in each iteration the model makes a guess about the output, calculates the error in its guess (loss), collects the derivatives of the error with respect to its parameters (as we saw in the previous section), and <b>optimizes</b> these parameters using gradient descent. For a more detailed walkthrough of this process, check out <a href="https://www.youtube.com/watch?v=tIeHLnjs5U8">this video on backpropagation from 3Blue1Brown</a>.</p>

<h3>Pre-requisite Code</h3>
<p>Load the code from the previous tutorials on Datasets and DataLoaders and on building the model.</p>



In [20]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()


<h3>Hyperparameters</h3>
<p>Hyperparameters are adjustable parameters that let you control the model optimization process. Different hyperparameter values can impact model training and convergence rates.</p>

<p>Define the following hyperparams for training:</p>
<ul>
  <li>Number of Epochs - number of times to iterate over dataset</li>
  <li>Batch Size - number of data samples propagated through network before the params are updated</li>
  <li>Learning rate - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning speed, while large values may result in unpredictable behavior during training.</li>
</ul>

In [21]:
learning_rate = 1e-3
batch_size = 64
epochs = 5
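<p>A quick piece of arithmetic shows how these values interact. It assumes the 60,000-image FashionMNIST training split and a DataLoader that keeps the final partial batch (the default); the variable names are illustrative.</p>

```python
import math

n_train = 60_000  # size of the FashionMNIST training split
batch_size = 64
epochs = 5

steps_per_epoch = math.ceil(n_train / batch_size)             # parameter updates per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size     # size of the final, smaller batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)  # 938 32 4690
```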

<h3>Optimisation Loop</h3>
<p>Once we set our hyperparameters, we can then train and optimize our model with an optimization loop. Each iteration of the optimization loop is called an epoch.</p>

<p>Epochs have 2 main parts:</p>
<ul>
  <li>The Train Loop - iterate over the training dataset and try to converge to optimal parameters.</li>
  <li>The Validation/Test Loop - iterate over the test dataset to check if model performance is improving.</li>
</ul>


<h3>Training Loop Concepts</h3>
<h4>Loss Function</h4>
<p>A loss function measures the degree of dissimilarity between the obtained result and the target value, and it is the loss function that we want to minimize during training. To calculate the loss, we make a prediction using the inputs of a given data sample and compare it against the true label.</p>

<p>Common loss functions include <a href="https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss">nn.MSELoss (Mean Square Error)</a> for regression tasks and <a href="https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss">nn.NLLLoss (Negative Log Likelihood)</a> for classification.</p>
<p><a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss">nn.CrossEntropyLoss</a> combines nn.LogSoftmax and nn.NLLLoss.</p>

<p>We pass our model’s output logits to nn.CrossEntropyLoss, which will normalize the logits and compute the prediction error.</p>
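<p>The relationship between these losses can be checked on made-up data. The sketch below uses random logits and arbitrary target labels (both are assumptions for illustration) and compares nn.CrossEntropyLoss on raw logits with nn.NLLLoss applied to nn.LogSoftmax output.</p>

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

logits = torch.randn(4, 10)            # 4 samples, 10 classes, raw scores
targets = torch.tensor([1, 0, 9, 3])   # illustrative class indices

ce = nn.CrossEntropyLoss()(logits, targets)

# Same computation in two explicit steps.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())
```

<p>The two numbers agree, which is why the model returns raw logits and leaves the normalization to the loss.</p>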