### The Pipeline

So far in this series, we have learned about tensors and about building neural networks in PyTorch. We are now ready to begin the training process.

1. Prepare the data
2. Build the model
3. __Train the model__ 
4. Analyze the model's results

Training is essentially gradient descent: we calculate the gradient of the loss with respect to the weights and use it to update the weights so that the resulting loss moves closer to a minimum.
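
To make the idea concrete, here is a minimal sketch of gradient descent on a toy one-parameter loss. The loss function, starting weight, and learning rate are all assumptions chosen for illustration; they are not part of the network trained later.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an arbitrary choice)."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # initial weight (arbitrary)
learning_rate = 0.1  # step size (arbitrary)
history = [loss(w)]  # loss before any update

for _ in range(50):
    w -= learning_rate * grad(w)  # move against the gradient
    history.append(loss(w))

print(w, history[-1])
```

Each update shrinks the distance to the minimum by a constant factor here, so the recorded losses decrease at every step. Real networks have far less regular loss surfaces, so this monotonic behaviour should not be expected in general.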

### Training

During training, we get a __batch__ and pass it forward through the network. Once the output is obtained, we compare the predicted output to the actual labels. Knowing how far the predicted values are from the actual labels, we tweak the weights inside the network so that the values the network predicts move closer to the true values (labels). All of this is for a single batch, and we repeat this process for every batch until we have covered every sample in our training set.

After we've completed this process for all of the batches and passed over every sample in our training set, we say that an __epoch__ is complete. During the entire training process, we do as many epochs as necessary to reach our desired level of accuracy. 
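
As a quick sanity check on the relationship between batches and epochs, the sketch below counts the weight updates in one epoch. The helper name `batches_per_epoch` is hypothetical; the numbers assume the 60,000-sample training set and batch size of 100 used later, and that a final partial batch is still processed.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover every sample once.

    A final, smaller batch is counted too, hence the ceiling.
    """
    return math.ceil(num_samples / batch_size)

num_samples = 60000  # size of the training set used in this section
batch_size = 100     # batch size used in this section
num_epochs = 10      # number of epochs used in the final training run

steps_per_epoch = batches_per_epoch(num_samples, batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With these values, one epoch corresponds to 600 weight updates, and ten epochs to 6000.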

__Training a neural net:__
1. Get batch from the training set.
2. Pass batch to network.
3. Calculate the __loss__ (difference between the predicted values and the true values).
4. Calculate the gradient of the loss function w.r.t. the network's weights.
5. Update the weights using the gradients to reduce the loss.
6. Repeat steps 1-5 until one epoch is completed.
7. Repeat steps 1-6 for as many epochs as required to reach the desired loss.

We already know how to do steps 1 and 2. The loss function in step 3 is chosen according to the problem. Steps 4 and 5 are performed by backpropagation and an optimization algorithm, respectively. Steps 6 and 7 are just standard Python loops (the training loop).
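
Before bringing in PyTorch, the seven steps can be sketched without any framework. The example below is an illustration under assumed conditions: a hypothetical toy dataset on the line y = 2x + 1, a two-parameter linear model, a mean-squared-error loss, and hand-derived gradients in place of backpropagation.

```python
import random

random.seed(0)  # fixed seed so the shuffling is reproducible

# Hypothetical toy data: points on the line y = 2x + 1, no noise.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(-20, 20)]

def mean_squared_error(w, b, samples):
    """Average squared error of the line w*x + b over the samples."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

w, b = 0.0, 0.0      # the model's "weights"
learning_rate = 0.1
batch_size = 8

initial_loss = mean_squared_error(w, b, data)

for epoch in range(50):                                 # step 7: several epochs
    random.shuffle(data)
    for start in range(0, len(data), batch_size):       # step 6: every batch
        batch = data[start:start + batch_size]          # step 1: get a batch
        errors = [(w * x + b) - y for x, y in batch]    # step 2: forward pass
        loss = sum(e * e for e in errors) / len(batch)  # step 3: batch loss
        # step 4: gradients of the batch loss w.r.t. w and b
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = sum(2 * e for e in errors) / len(batch)
        # step 5: update the weights against the gradients
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b, data)
print(w, b, initial_loss, final_loss)
```

In PyTorch, step 4 is handled by `loss.backward()` and step 5 by `optimizer.step()`, so the loop body becomes much shorter.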

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, channels=1): # default grayscale
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=channels, out_channels=6, kernel_size=5) 
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
        
        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120) # ((28-5+1)/2 -5 +1)/2 = 4
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)
        
    def forward(self, t):
        # (1) input layer
        t = t
        print('initial shape', t.shape)
        
        # (2) hidden conv layer
        t = self.conv1(t)
        print(' after conv1 shape', t.shape)
        t = F.relu(t) # activation='relu' in tf.keras
        t = F.max_pool2d(t, kernel_size=2, stride=2)
        print('after maxpool1 shape', t.shape)
        
        # (3) hidden conv layer
        t = self.conv2(t)
        print('after conv2 shape', t.shape)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)
        print('after maxpool2 shape', t.shape)

        # (4) hidden linear layer
        t = t.reshape(-1, 12*4*4)
        print('after reshape shape', t.shape)
        t = self.fc1(t)
        print('after fc1 shape', t.shape)
        t = F.relu(t) # activation='relu' in tf.keras
        
        # (5) hidden linear layer
        t = self.fc2(t)
        print('after fc2 shape', t.shape)
        t = F.relu(t)
        
        # (6) output layer
        t = self.out(t)
        print('after out shape', t.shape)
        #t = F.softmax(t, dim=1) # first index is batch
        return t
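
The `12*4*4` input size of `fc1` follows from how each convolution and pooling layer shrinks the 28x28 input. The sketch below reproduces that arithmetic; `conv_out` and `pool_out` are hypothetical helper names, and the formulas assume no padding and no dilation, as in the network above.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution (no dilation assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1

def pool_out(size, kernel_size=2, stride=2):
    """Output height/width of a max-pooling layer (no padding assumed)."""
    return (size - kernel_size) // stride + 1

size = 28                    # FashionMNIST images are 28x28
size = conv_out(size, 5)     # conv1 with kernel_size=5
after_conv1 = size
size = pool_out(size)        # 2x2 max pool with stride 2
after_pool1 = size
size = conv_out(size, 5)     # conv2 with kernel_size=5
after_conv2 = size
size = pool_out(size)        # 2x2 max pool with stride 2
after_pool2 = size

flat_features = 12 * size * size  # 12 output channels from conv2

print(after_conv1, after_pool1, after_conv2, after_pool2, flat_features)
```

The intermediate sizes (24, 12, 8, 4) and the flattened size of 192 agree with the shapes printed by the `forward` method.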

In [2]:
import torchvision
import torchvision.transforms as transform

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',
    download=True,
    transform=transform.ToTensor())

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
batch = next(iter(train_loader)) # get one batch
images, labels = batch

# initialize a network
network = Network() 

In [3]:
def get_num_correct(preds, labels):
    return (preds.argmax(dim=1) == labels).sum()

#### Cross entropy loss

`F.cross_entropy` combines __log-softmax with the negative log-likelihood loss and averages the result over the batch__. This is why the softmax activation in the network's output layer above is commented out. Because of the averaging, comparing absolute losses across different batch sizes requires multiplying each loss by its batch size. The example below uses a batch size of 2.

In [4]:
F.cross_entropy(torch.tensor([[3, 6, 2], [1, 1, 2]]).float(), torch.tensor([2, 1]))

tensor(2.8087)

In [5]:
import numpy as np
softmax = lambda a: np.exp(a) / np.exp(a).sum()

(-np.log(softmax(np.array([3,6,2])))[2] + -np.log(softmax(np.array([1,1,2])))[1])/2 # averages

2.8086643088447403
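
The same calculation can be written per sample, which makes the averaging and the "multiply by the batch size" remark explicit. This is a sketch in plain Python, not PyTorch's implementation; the helper name `cross_entropy_per_sample` is hypothetical, and it uses the log-sum-exp form for numerical stability.

```python
import math

def cross_entropy_per_sample(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

batch_logits = [[3.0, 6.0, 2.0], [1.0, 1.0, 2.0]]  # same values as above
batch_targets = [2, 1]

per_sample = [
    cross_entropy_per_sample(logits, target)
    for logits, target in zip(batch_logits, batch_targets)
]
mean_loss = sum(per_sample) / len(per_sample)  # averaged over the batch
summed_loss = mean_loss * len(per_sample)      # recover the un-averaged total

print(per_sample, mean_loss, summed_loss)
```

The mean agrees with the value returned by `F.cross_entropy` above, and multiplying it by the batch size recovers the sum of the per-sample losses.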

#### Backpropagation and Gradient Descent

In [6]:
import torch.optim as optim
optimizer = optim.Adam(network.parameters(), lr=0.01) # optimizer has access to network parameters

In [7]:
network.conv1.weight.grad is None

True

In [8]:
preds = network(images)
loss = F.cross_entropy(preds, labels) 
print('loss: ', loss) 
print('no. correct:', get_num_correct(preds, labels)) # out of 100

initial shape torch.Size([100, 1, 28, 28])
 after conv1 shape torch.Size([100, 6, 24, 24])
after maxpool1 shape torch.Size([100, 6, 12, 12])
after conv2 shape torch.Size([100, 12, 8, 8])
after maxpool2 shape torch.Size([100, 12, 4, 4])
after reshape shape torch.Size([100, 192])
after fc1 shape torch.Size([100, 120])
after fc2 shape torch.Size([100, 60])
after out shape torch.Size([100, 10])
loss:  tensor(2.3098, grad_fn=<NllLossBackward0>)
no. correct: tensor(4)


In [9]:
loss.backward() # backprop: walks backward from the loss through the network's computation graph
network.conv1.weight.grad.shape # gradients are populated after one backward pass

torch.Size([6, 1, 5, 5])

In [10]:
optimizer.step() # update the weights from the new gradient values, according to Adam, to reduce the loss
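
To give a feel for what `optimizer.step()` does with a gradient, here is a sketch of the standard Adam update rule for a single scalar parameter. It is a simplified illustration, not PyTorch's code: the function name `adam_step` is hypothetical, and options such as weight decay and AMSGrad are left out. The learning rate matches the `lr=0.01` used above; the beta and epsilon values are the commonly used defaults.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are running averages of the gradient and the squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0  # arbitrary starting value, zeroed optimizer state
param, m, v = adam_step(param, grad=2.0, m=m, v=v, t=1)

print(param)
```

On the very first step the bias correction makes the update size roughly equal to the learning rate, whatever the gradient's magnitude, so the parameter here moves from 1.0 to about 0.99.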

In [11]:
preds = network(images) # run new predictions
loss = F.cross_entropy(preds, labels) 
print('loss after backprop: ', loss, 'no. correct:', get_num_correct(preds, labels))

initial shape torch.Size([100, 1, 28, 28])
 after conv1 shape torch.Size([100, 6, 24, 24])
after maxpool1 shape torch.Size([100, 6, 12, 12])
after conv2 shape torch.Size([100, 12, 8, 8])
after maxpool2 shape torch.Size([100, 12, 4, 4])
after reshape shape torch.Size([100, 192])
after fc1 shape torch.Size([100, 120])
after fc2 shape torch.Size([100, 60])
after out shape torch.Size([100, 10])
loss after backprop:  tensor(2.2894, grad_fn=<NllLossBackward0>) no. correct: tensor(15)


The number of correct predictions has increased (from 4 to 15 out of 100) after a single optimizer step, and the loss has dropped from 2.3098 to 2.2894!

### Training one batch

Collect everything in one place:

In [12]:
# compile the neural net
network = Network()
optimizer = optim.Adam(network.parameters(), lr=0.01)


# loss
loss = F.cross_entropy(network(images), labels)
print("Step 0:")
print(loss.item())
print(get_num_correct(network(images), labels))

# backprop
loss.backward()  # compute gradients
optimizer.step() # update weights using gradients to minimize loss

# recalculating loss based on new weights
loss = F.cross_entropy(network(images), labels)
print("\nStep 1:")
print(loss.item())
print(get_num_correct(network(images), labels))

initial shape torch.Size([100, 1, 28, 28])
 after conv1 shape torch.Size([100, 6, 24, 24])
after maxpool1 shape torch.Size([100, 6, 12, 12])
after conv2 shape torch.Size([100, 12, 8, 8])
after maxpool2 shape torch.Size([100, 12, 4, 4])
after reshape shape torch.Size([100, 192])
after fc1 shape torch.Size([100, 120])
after fc2 shape torch.Size([100, 60])
after out shape torch.Size([100, 10])
Step 0:
2.3137776851654053
initial shape torch.Size([100, 1, 28, 28])
 after conv1 shape torch.Size([100, 6, 24, 24])
after maxpool1 shape torch.Size([100, 6, 12, 12])
after conv2 shape torch.Size([100, 12, 8, 8])
after maxpool2 shape torch.Size([100, 12, 4, 4])
after reshape shape torch.Size([100, 192])
after fc1 shape torch.Size([100, 120])
after fc2 shape torch.Size([100, 60])
after out shape torch.Size([100, 10])
tensor(9)
initial shape torch.Size([100, 1, 28, 28])
 after conv1 shape torch.Size([100, 6, 24, 24])
after maxpool1 shape torch.Size([100, 6, 12, 12])
after conv2 shape torch.Size([100,

### Training a single epoch

In [13]:
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100) 

network = Network()
optimizer = optim.Adam(network.parameters(), lr=0.01)

total_loss = 0
total_correct = 0
for batch in train_loader:
    images, labels = batch 

    preds = network(images) 
    loss = F.cross_entropy(preds, labels) 

    optimizer.zero_grad() 
    loss.backward()  # calculate gradients
    optimizer.step() # update weights from the gradients using Adam

    total_loss += loss.item()
    total_correct += get_num_correct(preds, labels)
    
print(
    "epoch:", 0, 
    "total_correct:", total_correct, 
    "loss:", total_loss
)

Streaming output truncated to the last 5000 lines.

Setting the gradients to zero with `optimizer.zero_grad()` (line 14 of the cell above) is necessary because `loss.backward()` _adds_ the newly calculated gradients to the existing `.grad` values instead of overwriting them.

In [14]:
optimizer.zero_grad()

In [15]:
network.conv1.weight.grad.sum()

tensor(0.)
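
The accumulation behaviour is easiest to see in isolation. The sketch below uses a single scalar tensor rather than the network; the variable names are chosen for illustration only.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()            # d(3w)/dw = 3
after_first = w.grad.item()   # gradient after one backward pass

(3 * w).backward()            # the same gradient is added again
after_second = w.grad.item()  # accumulated, not overwritten

w.grad.zero_()                # reset in place, as optimizer.zero_grad() does for its parameters
after_reset = w.grad.item()

print(after_first, after_second, after_reset)
```

Without the reset, every batch's update would be computed from the sum of all gradients seen so far, rather than from the current batch alone.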

### Training with multiple epochs

In [16]:
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)

network = Network()
optimizer = optim.Adam(network.parameters(), lr=0.01)

for epoch in range(10):    
    total_loss = 0
    total_correct = 0
    
    for batch in train_loader:    
        images, labels = batch 
        preds = network(images)
        loss = F.cross_entropy(preds, labels) # loss carries a grad_fn, which is what lets
                                              # loss.backward() below compute gradients
        optimizer.zero_grad() # reset all gradients to zero
        loss.backward() # calculate gradients
        optimizer.step() # update weights

        total_loss += loss.item()
        total_correct += get_num_correct(preds, labels)

    print(
        "epoch", epoch, 
        "total_correct:", total_correct, 
        "loss:", total_loss
    )

Streaming output truncated to the last 5000 lines.

In [18]:
53277/60000 # train accuracy

0.88795
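
The per-epoch totals printed by the loop can be turned into more comparable numbers. In the sketch below, the helper name `epoch_summary` is hypothetical, and the `total_loss` value is a made-up placeholder, since the actual loss output is not shown above. Only the count of 53277 correct predictions out of 60000 comes from this section.

```python
def epoch_summary(total_correct, total_loss, num_samples, num_batches):
    """Convert running totals from one epoch into rates.

    total_loss is assumed to be a sum of per-batch mean losses, as in the
    training loop above, so it is divided by the number of batches.
    """
    return {
        "accuracy": total_correct / num_samples,
        "mean_batch_loss": total_loss / num_batches,
    }

summary = epoch_summary(
    total_correct=53277,  # from the final epoch above
    total_loss=180.0,     # placeholder value, NOT a real result
    num_samples=60000,
    num_batches=600,      # 60000 samples / batch size of 100
)

print(summary)
```

Dividing by the number of batches keeps the loss comparable across runs with different batch sizes, which the raw `total_loss` sum is not.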