# PyTorch and Neural Networks 2

[YuJa recording of lecture](https://uci.yuja.com/V/Video?v=4417722&node=14870050&a=1646074732&autoplay=1)

Topics mentioned at the board (not in this notebook):
* Importance of using activation functions to break linearity.
* Common choices of activation functions: sigmoid and relu.
* Concept of *one hot encoding*.
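
The first board topic can be checked numerically. The following is a minimal sketch, not code from the lecture: the layer sizes (4, 3, 2) and the random input are made up for illustration. It shows that two `nn.Linear` layers stacked with no activation between them compute the same function as a single affine map, and that inserting a sigmoid breaks that equivalence.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two Linear layers with no activation in between (sizes are made up).
first = nn.Linear(4, 3)
second = nn.Linear(3, 2)
stacked = nn.Sequential(first, second)

# The same map written as a single affine function x -> x @ W.T + b.
W = second.weight @ first.weight              # shape (2, 4)
b = second.weight @ first.bias + second.bias  # shape (2,)

x = torch.randn(5, 4)
collapsed = x @ W.T + b
same_without_activation = torch.allclose(stacked(x), collapsed, atol=1e-6)

# Inserting a sigmoid between the layers breaks this equivalence.
with_sigmoid = nn.Sequential(first, nn.Sigmoid(), second)
same_with_activation = torch.allclose(with_sigmoid(x), collapsed, atol=1e-6)

print(same_without_activation, same_with_activation)
```

Without an activation function, adding more linear layers would not let the network represent anything a single linear layer could not.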

In [1]:
from tqdm.std import tqdm, trange
from tqdm import notebook
notebook.tqdm = tqdm
notebook.trange = trange

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch import nn
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader

In [2]:
# Load the data
training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

This is the second video in 3Blue1Brown's *Neural Networks* series on YouTube; it covers *gradient descent*. Recommended clips:
* 0:25-1:24
* 3:18-4:05
* 5:15-7:50

<iframe width="560" height="315" src="https://www.youtube.com/embed/IHZwWFHWa-w" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

This is what we finished with on Monday:

In [3]:
class ThreeBlue(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(784,10)
        )

    def forward(self,x):
        y = self.flatten(x)
        z = self.layers(y)
        return z

We instantiate an object of this class as follows.

In [4]:
wed = ThreeBlue()
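
To see what the two pieces of this single-layer model do to the shape of the data, here is a small sketch. It assumes a made-up batch of three random 28 x 28 inputs rather than real MNIST images, and it uses standalone `nn.Flatten` and `nn.Linear` objects instead of the `ThreeBlue` class.

```python
import torch
from torch import nn

flatten = nn.Flatten()
linear = nn.Linear(784, 10)

# A made-up batch of 3 "images", each 28 x 28 (random values, not MNIST).
x = torch.rand(3, 28, 28)

flat = flatten(x)      # keeps the batch dimension, flattens the rest
out = linear(flat)     # one row of 10 numbers per image

print(flat.shape)  # torch.Size([3, 784])
print(out.shape)   # torch.Size([3, 10])
```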

In class (see the YuJa recording above), we gradually built up to the following code.  It was designed to match the 3Blue1Brown video's neural network.

In [7]:
class ThreeBlue(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(784,16),
            nn.Sigmoid(),
            nn.Linear(16,16),
            nn.Sigmoid(),
            nn.Linear(16,10),
            nn.Sigmoid()
        )

    def forward(self,x):
        x = x/255
        y = self.flatten(x)
        z = self.layers(y)
        return z

In [8]:
wed = ThreeBlue()

Here are the weights and biases for this neural network.  When we talk about fitting or training a neural network, we mean adjusting the weights and biases to try to minimize some loss function.

In [10]:
for p in wed.parameters():
    print(p.shape)

torch.Size([16, 784])
torch.Size([16])
torch.Size([16, 16])
torch.Size([16])
torch.Size([10, 16])
torch.Size([10])


In [11]:
for p in wed.parameters():
    print(p.numel())

12544
16
256
16
160
10


These counts add up to 13002, the same number of weights and biases that appeared in the 3Blue1Brown videos.

In [14]:
sum([p.numel() for p in wed.parameters()])

13002

You can even do the same thing without the square brackets.  This is secretly using a *generator expression* instead of a list comprehension.

In [15]:
sum(p.numel() for p in wed.parameters())

13002
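
The same total can be worked out from the layer sizes alone, without building the network. This sketch assumes only the sizes 784, 16, 16, 10 used in the class above; each `Linear` layer contributes one weight per (input, output) pair plus one bias per output.

```python
# Layer sizes of the network above: 784 -> 16 -> 16 -> 10.
sizes = [784, 16, 16, 10]

# Each Linear layer has n_in * n_out weights plus n_out biases.
# zip(sizes, sizes[1:]) pairs each layer's input size with its output size.
total = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(total)  # 13002
```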

In [16]:
wed

ThreeBlue(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (layers): Sequential(
    (0): Linear(in_features=784, out_features=16, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=16, out_features=16, bias=True)
    (3): Sigmoid()
    (4): Linear(in_features=16, out_features=10, bias=True)
    (5): Sigmoid()
  )
)

In the line that begins `self.layers = ` above, we specified that each `ThreeBlue` object has a `layers` attribute.  Here is that attribute for the case of `wed`.

In [20]:
wed.layers

Sequential(
  (0): Linear(in_features=784, out_features=16, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=16, out_features=16, bias=True)
  (3): Sigmoid()
  (4): Linear(in_features=16, out_features=10, bias=True)
  (5): Sigmoid()
)

For example, you can access the element at index 2 of `wed.layers` (the second `Linear` layer, since indexing starts at 0) using subscripting: `wed.layers[2]`.

In [21]:
wed.layers[2]

Linear(in_features=16, out_features=16, bias=True)

In [22]:
wed.layers[2].weight.shape

torch.Size([16, 16])

In [23]:
wed.layers[2].bias.shape

torch.Size([16])
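
The weight shape is ordered as (outputs, inputs), which is easy to miss when both sizes are 16. The sketch below uses a made-up `nn.Linear(16, 10)` layer and a random input so the two dimensions differ, and checks that the layer computes `x @ weight.T + bias`.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(16, 10)
print(layer.weight.shape)  # torch.Size([10, 16]), i.e. (out_features, in_features)

x = torch.randn(5, 16)                      # made-up batch of 5 inputs
by_hand = x @ layer.weight.T + layer.bias   # same computation written out
matches = torch.allclose(layer(x), by_hand, atol=1e-6)
print(matches)
```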

On Monday, we had to divide by 255 every time we passed data into our neural network.  Today, we've put that step directly into the `forward` method of the neural network; it's the line `x = x/255`.

In [29]:
wed(training_data.data)[:3]

tensor([[0.5128, 0.3936, 0.5649, 0.5723, 0.5577, 0.5520, 0.5960, 0.5789, 0.4498,
         0.5246],
        [0.5123, 0.3924, 0.5644, 0.5736, 0.5568, 0.5522, 0.5969, 0.5799, 0.4494,
         0.5249],
        [0.5124, 0.3922, 0.5646, 0.5733, 0.5572, 0.5532, 0.5970, 0.5803, 0.4490,
         0.5274]], grad_fn=<SliceBackward0>)
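
The division by 255 matters because `training_data.data` holds the raw pixel values as 8-bit integers from 0 to 255. A small sketch with three made-up pixel values shows what `x/255` does to the data type and the range:

```python
import torch

# Made-up pixel values stored the way `training_data.data` stores them.
pixels = torch.tensor([[0, 128, 255]], dtype=torch.uint8)

scaled = pixels / 255   # true division promotes to a floating-point tensor

print(pixels.dtype)  # torch.uint8
print(scaled.dtype)  # torch.float32
print(scaled.min().item(), scaled.max().item())  # 0.0 1.0
```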

In [30]:
y_pred = wed(training_data.data)

In [31]:
training_data.targets[:3]

tensor([5, 0, 4])

To match the 3Blue1Brown video, we are going to convert the targets, which are integers like `5`, into length 10 vectors like `[0,0,0,0,0,1,0,0,0,0]`.  This procedure is called *one-hot encoding*, and it also exists in scikit-learn.

In [32]:
from torch.nn.functional import one_hot

In [33]:
one_hot(training_data.targets[:3], num_classes=10).to(torch.float)

tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

In [34]:
y_true = one_hot(training_data.targets, num_classes=10).to(torch.float)

In [35]:
y_true.shape

torch.Size([60000, 10])
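
One-hot encoding can be undone with `argmax`, which returns the position of the largest entry in each row. This sketch reuses the three targets shown above as a hard-coded tensor, so it does not need the MNIST download.

```python
import torch
from torch.nn.functional import one_hot

labels = torch.tensor([5, 0, 4])            # the first three MNIST targets shown above
encoded = one_hot(labels, num_classes=10)   # integer tensor of 0s and 1s

# argmax along dim=1 returns the position of the 1 in each row.
decoded = encoded.argmax(dim=1)

print(decoded)  # tensor([5, 0, 4])
```

The same `argmax(dim=1)` call could plausibly be applied to the network's output to read off the digit with the largest score, although that step was not covered in this notebook.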

Using Mean-Squared Error on the probabilities for this classification problem is not considered the best approach, but it is easy to understand, and we will follow this approach for now to match the 3Blue1Brown video.

In [36]:
loss_fn = nn.MSELoss()

Here is the performance of the randomly initialized model.  The output of this sort of loss function is not so easy to analyze in isolation.  The important thing is that if we can lower this number, then the model is performing better (on the training data).

In [38]:
loss_fn(y_pred, y_true)

tensor(0.2792, grad_fn=<MseLossBackward0>)
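
With its default settings, `nn.MSELoss` averages the squared differences over every entry of the two tensors. The sketch below checks this on made-up predictions and targets (4 samples, 10 classes), not on the MNIST tensors above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up predictions and one-hot targets for 4 samples and 10 classes.
y_pred = torch.rand(4, 10)
y_true = torch.zeros(4, 10)
y_true[torch.arange(4), torch.tensor([5, 0, 4, 1])] = 1.0

by_hand = ((y_pred - y_true) ** 2).mean()   # average over all 40 entries
built_in = nn.MSELoss()(y_pred, y_true)

matches = torch.allclose(by_hand, built_in)
print(matches)
```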

Here we try to find better weights and biases using *gradient descent*.  Try to get comfortable with these steps (they can take some time to internalize).

In [39]:
optimizer = torch.optim.SGD(wed.parameters(), lr=0.1)

There aren't yet any gradients associated with the parameters of the model (the weights and biases).

In [40]:
for p in wed.parameters():
    print(p.grad)

None
None
None
None
None
None


In [41]:
loss = loss_fn(y_pred, y_true)

Still no gradients.

In [42]:
for p in wed.parameters():
    print(p.grad)

None
None
None
None
None
None


In [43]:
loss.backward()

The line `loss.backward()` told PyTorch to compute the gradient of the loss with respect to each of the 13002 weights and biases, and to store the result in each parameter's `grad` attribute.

In [44]:
for p in wed.parameters():
    print(p.grad)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
tensor([-1.8313e-04,  6.0323e-05, -7.9682e-05,  1.3652e-04,  2.8588e-04,
         5.4873e-05,  5.6339e-04, -4.3447e-04,  8.5402e-05,  4.7174e-05,
         2.8684e-04,  2.4752e-04,  3.4093e-04,  1.2426e-04,  7.8342e-05,
         3.8503e-04])
tensor([[ 1.5513e-03,  1.4997e-03,  1.4607e-03,  1.5071e-03,  1.4121e-03,
          1.5751e-03,  1.5298e-03,  1.5007e-03,  1.4708e-03,  1.5008e-03,
          1.6073e-03,  1.6327e-03,  1.5269e-03,  1.4076e-03,  1.5441e-03,
          1.3184e-03],
        [ 2.4013e-03,  2.3361e-03,  2.2776e-03,  2.3523e-03,  2.1835e-03,
          2.4395e-03,  2.3866e-03,  2.3222e-03,  2.2672e-03,  2.3256e-03,
          2.4822e-03,  2.5124e-03,  2.3493e-03,  2.1990e-03,  2.4188e-03,
          2.0566e-03],
        [-2.216
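
The same sequence (no gradient, then `backward()`, then a gradient) can be checked by hand on a one-parameter example. This sketch uses a made-up parameter `w` and the loss (w - 1)^2, whose derivative is 2(w - 1).

```python
import torch

# One made-up parameter w, with loss(w) = (w - 1)^2.
w = torch.tensor(3.0, requires_grad=True)
print(w.grad)        # None: nothing has been computed yet

loss = (w - 1) ** 2
loss.backward()      # fills in w.grad with d(loss)/dw = 2 * (w - 1)

print(w.grad)        # tensor(4.)
```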

The next line adjusts the weights and biases by adding a multiple of the negative gradient.  We are trying to minimize the loss: the gradient points in the direction of fastest ascent, so the negative gradient points in the direction of fastest descent.  The multiple we use is the *learning rate* `lr` that we specified when we created the optimizer above, so each parameter `p` is replaced by `p - lr * p.grad`.

In [45]:
optimizer.step()

In [46]:
wed(training_data.data)[:3]

tensor([[0.5100, 0.3918, 0.5617, 0.5692, 0.5546, 0.5489, 0.5928, 0.5757, 0.4474,
         0.5216],
        [0.5095, 0.3906, 0.5612, 0.5705, 0.5537, 0.5491, 0.5936, 0.5768, 0.4471,
         0.5219],
        [0.5096, 0.3905, 0.5614, 0.5701, 0.5540, 0.5501, 0.5938, 0.5771, 0.4466,
         0.5244]], grad_fn=<SliceBackward0>)
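
For plain `torch.optim.SGD` with no momentum or weight decay, as used here, the update made by `optimizer.step()` can be reproduced by hand. The sketch below uses two copies of a made-up two-entry parameter and a sum-of-squares loss; it is only meant to illustrate the update rule, not the training of `wed`.

```python
import torch

lr = 0.1

# Two copies of the same made-up parameter.
p_manual = torch.tensor([1.0, -2.0], requires_grad=True)
p_optim = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([p_optim], lr=lr)

# Same loss for both: sum of squares, whose gradient is 2 * p.
(p_manual ** 2).sum().backward()
(p_optim ** 2).sum().backward()

# Update by hand: new value = old value - lr * gradient.
with torch.no_grad():
    p_manual -= lr * p_manual.grad

optimizer.step()

matches = torch.allclose(p_manual, p_optim)
print(matches)
print(p_optim)
```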

We now want to repeat that procedure.  Here we will repeat it 10 times, but often we will want to repeat it many more times.  What we hope is that the loss value is decreasing.

In [47]:
epochs = 10

for i in range(epochs):
    y_true = one_hot(training_data.targets, num_classes=10).to(torch.float)
    y_pred = wed(training_data.data)
    loss = loss_fn(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(loss)

tensor(0.2767, grad_fn=<MseLossBackward0>)
tensor(0.2741, grad_fn=<MseLossBackward0>)
tensor(0.2717, grad_fn=<MseLossBackward0>)
tensor(0.2692, grad_fn=<MseLossBackward0>)
tensor(0.2668, grad_fn=<MseLossBackward0>)
tensor(0.2644, grad_fn=<MseLossBackward0>)
tensor(0.2620, grad_fn=<MseLossBackward0>)
tensor(0.2597, grad_fn=<MseLossBackward0>)
tensor(0.2574, grad_fn=<MseLossBackward0>)
tensor(0.2551, grad_fn=<MseLossBackward0>)
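
The call to `optimizer.zero_grad()` inside the loop is there because PyTorch adds new gradients to whatever is already stored in `.grad` rather than overwriting it. The sketch below shows this on a made-up one-parameter example; it clears the gradient with `w.grad.zero_()` as a stand-in for `optimizer.zero_grad()`.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)   # made-up parameter

((w - 1) ** 2).backward()
first = w.grad.item()        # 4.0

# A second backward pass without clearing adds to the stored gradient.
((w - 1) ** 2).backward()
accumulated = w.grad.item()  # 8.0

# Clearing first gives the gradient of the latest loss only.
w.grad.zero_()
((w - 1) ** 2).backward()
fresh = w.grad.item()        # 4.0

print(first, accumulated, fresh)  # 4.0 8.0 4.0
```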


An important thing to point out is that if we run the same code again, we won't be starting back at the beginning.  Each time we run this training procedure, it will begin where the last training procedure left off.

In [48]:
epochs = 100

for i in range(epochs):
    y_true = one_hot(training_data.targets, num_classes=10).to(torch.float)
    y_pred = wed(training_data.data)
    loss = loss_fn(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i%2 == 0:
        print(loss)

tensor(0.2528, grad_fn=<MseLossBackward0>)
tensor(0.2484, grad_fn=<MseLossBackward0>)
tensor(0.2441, grad_fn=<MseLossBackward0>)
tensor(0.2399, grad_fn=<MseLossBackward0>)
tensor(0.2358, grad_fn=<MseLossBackward0>)
tensor(0.2318, grad_fn=<MseLossBackward0>)
tensor(0.2280, grad_fn=<MseLossBackward0>)
tensor(0.2242, grad_fn=<MseLossBackward0>)
tensor(0.2206, grad_fn=<MseLossBackward0>)
tensor(0.2170, grad_fn=<MseLossBackward0>)
tensor(0.2136, grad_fn=<MseLossBackward0>)
tensor(0.2102, grad_fn=<MseLossBackward0>)
tensor(0.2069, grad_fn=<MseLossBackward0>)
tensor(0.2038, grad_fn=<MseLossBackward0>)
tensor(0.2007, grad_fn=<MseLossBackward0>)
tensor(0.1977, grad_fn=<MseLossBackward0>)
tensor(0.1948, grad_fn=<MseLossBackward0>)
tensor(0.1920, grad_fn=<MseLossBackward0>)
tensor(0.1893, grad_fn=<MseLossBackward0>)
tensor(0.1866, grad_fn=<MseLossBackward0>)
tensor(0.1840, grad_fn=<MseLossBackward0>)
tensor(0.1815, grad_fn=<MseLossBackward0>)
tensor(0.1791, grad_fn=<MseLossBackward0>)
tensor(0.17

Notice how the loss is steadily decreasing.  That's the best result we can hope for.  If we were to choose a learning rate that was much too big, the performance would be very different.  Here we set `lr=500` which is much too big.

In [96]:
wed = ThreeBlue()

In [97]:
optimizer = torch.optim.SGD(wed.parameters(), lr=500)

In [98]:
epochs = 10

for i in range(epochs):
    y_true = one_hot(training_data.targets, num_classes=10).to(torch.float)
    y_pred = wed(training_data.data)
    loss = loss_fn(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(loss)

tensor(0.2877, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)
tensor(0.1000, grad_fn=<MseLossBackward0>)


Here it improves for one iteration of gradient descent, and then it seems to get stuck.
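
One plausible reading of the value 0.1000, offered here as an assumption rather than something verified in lecture, is that the huge first step pushed the final sigmoids so far negative that every output is essentially 0. The sketch below checks two facts consistent with that reading, using three made-up labels: an all-zero prediction has a mean-squared error of exactly 1/10 against one-hot targets, and the sigmoid's gradient is practically zero far from the origin, which would leave gradient descent with almost nothing to follow.

```python
import torch
from torch import nn
from torch.nn.functional import one_hot

# One-hot targets for three made-up labels.
y_true = one_hot(torch.tensor([5, 0, 4]), num_classes=10).to(torch.float)

# If every output were 0, each row would contribute one squared error of 1
# out of 10 entries, so the mean-squared error would be 1/10.
all_zero = torch.zeros_like(y_true)
stuck_loss = nn.MSELoss()(all_zero, y_true)
print(stuck_loss)  # tensor(0.1000)

# The sigmoid is almost flat far from 0, so its gradient there is tiny.
z = torch.tensor(-50.0, requires_grad=True)
torch.sigmoid(z).backward()
print(z.grad.item() < 1e-10)  # True
```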