# HarperNet Sprint 1 - Story 3

## Part 1: Develop Network Architecture
* Determine which model to use
* Understand whether we need to determine forward pass behaviour
* Develop alternatives for FC layers
* Explore fine tuning and Feature Extractor

## Part 2: Develop training function
* Start with Udacity code, understand it and compare with alternatives (PyTorch tutorial)
* Understand how to record loss and error, and how to plot this
* Develop early stopping

## Part 3: Develop inference function
* How to display different probabilities 


In [1]:
import torch
from torch import nn
from torchvision import models

## Part 1: Network Architecture

In [2]:
# here is an example of CNN architecture as a class, updated and commented
## explain everything that is going on here, and try it with a different model - squeezenet
import torch
from torch import nn
from torchvision import models

class ResNetCNN(nn.Module):
    
    # class_size relates to the final layer 
    def __init__(self, class_size):
        
        # required boilerplate: run nn.Module's own __init__ first so that PyTorch
        # can register the layers we assign to self below
        super(ResNetCNN, self).__init__()
        
        # here we instantiate the resnet50 model
        # pretrained=True downloads the weights learned on ImageNet
        resnet = models.resnet50(pretrained=True)
        
        # parameters() returns a generator, so we loop and call requires_grad_(False) on each parameter
        for param in resnet.parameters():
            param.requires_grad_(False)
        
        # does order matter? here is what we have done:
        # 1. instantiated a model as resnet, with the pretrained weights
        # 2. determined that this model does not need training 
        # 3. created a list of the model layers (or params, or children!) except the last FC layer
        # 4. stored this list as the layers in the sequential element of the model 
        # 5. added a new layer, nn.Linear, which takes in the size of the final fc layer we want in resnet
        # and returns the class_size number as output
        modules = list(resnet.children())[:-1]
        
        # note the difference between resnet (the local, full model) and self.resnet (the truncated copy stored on the module)
        self.resnet = nn.Sequential(*modules)
        self.category = nn.Linear(resnet.fc.in_features, class_size)

    def forward(self, images):
        
        # features is the output of passing the image batch through the frozen resnet layers
        features = self.resnet(images)
        
        # we then need to flatten the pooled output to (batch_size, 2048) for the linear layer
        features = features.view(features.size(0), -1)
        
        ## no softmax here: we return raw scores (logits), which is what nn.CrossEntropyLoss expects
        ## softmax is only needed when we want to display probabilities
        features = self.category(features)
        
        return features
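
On the softmax question above (and Part 3's "how to display different probabilities"), here is a minimal sketch. The logits are made-up numbers standing in for the output of `ResNetCNN.forward`, not values from a real model. It suggests that softmax can be applied at inference time only, since it does not change which class scores highest.

```python
import torch

# hypothetical logits for a batch of 3 images and 2 classes,
# standing in for the output of ResNetCNN.forward
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2],
                       [0.3, -0.4]])

# softmax over dim=1 turns each row into probabilities that sum to 1
probs = torch.softmax(logits, dim=1)

# the predicted class is the same whether we take the argmax of logits or of probs
preds_from_logits = logits.argmax(dim=1)
preds_from_probs = probs.argmax(dim=1)

print(probs)
print(preds_from_probs)
```

So the forward pass can keep returning raw scores for training, and the probabilities are computed separately when we want to show them.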

In [3]:
# instantiate the model 
class_size = 2                  # add in variable here that will update with the size of class  

# create model as cnn variable
cnn = ResNetCNN(class_size)

# we can print out the model 
# print(cnn)

In [33]:
# we can print out the resnet model like this 
# (list(models.resnet50().children()))

[Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False),
 BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
 ReLU(inplace),
 MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False),
 Sequential(
   (0): Bottleneck(
     (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
     (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
     (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace)
     (downsample): Sequential(
       (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
       (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affi

In [36]:
# trying to understand what the final layer activation function should be
# (list(models.vgg19().children()))

In [30]:
# here we compare the final layer of resnet with the final layer of cnn
print('the final layer of the resnet model:', list(models.resnet50().children())[-1:])
print('the final layer of the CNN model:', list(cnn.children())[-1:])

the final layer of the resnet model: [Linear(in_features=2048, out_features=1000, bias=True)]
the final layer of the CNN model: [Linear(in_features=2048, out_features=2, bias=True)]


### Next Steps 
* Try this with a different model, such as `VGG` and with `Squeezenet`
* Look at different options for final layers 

### widki
* It must be the case that when we download a model, we get:
    * the model structure - accessible it seems via `children` and `parameters`
    * the model weights. `parameters()` is how we access them: it yields the weight and bias tensors of every layer, which is why setting `param.requires_grad_(False)` on each one stops those weights being trained. Frozen layers are still used in the forward pass; they just receive no gradient updates.
    * new layers are set to `requires_grad=True` by default
        * this is the default for any newly constructed layer, and is not related to `pretrained=True`
* More on Sequential models 
* **Understand what loss function to apply on the final layer**
    * Is this required in the transfer learning scenario? 
    * It looks like we need to specify in the optimizer which parameters to update
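
A small sketch of how the weights can be reached, using a toy `nn.Linear` layer rather than a downloaded model (the layer sizes are arbitrary assumptions). It shows `named_parameters()`, `state_dict()`, and the default value of `requires_grad`.

```python
import torch
from torch import nn

# toy layer standing in for a model's final layer (sizes are arbitrary)
layer = nn.Linear(4, 2)

# named_parameters() yields (name, tensor) pairs; the tensors are the weights themselves
names = [name for name, _ in layer.named_parameters()]

# newly constructed layers are trainable by default
trainable_by_default = all(p.requires_grad for p in layer.parameters())

# state_dict() maps the same names to the underlying tensors
state = layer.state_dict()
weight_shape = tuple(state['weight'].shape)

# freezing: requires_grad_ is an in-place method called on each parameter
for p in layer.parameters():
    p.requires_grad_(False)
frozen = not any(p.requires_grad for p in layer.parameters())

print(names, weight_shape, trainable_by_default, frozen)
```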

In [None]:
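# let's look at other options for building a model

# assumption: model is a VGG-style network whose classifier has 7 layers (index 6 is the last)
# vgg19 and n_classes = 2 are stand-ins, as this cell did not define them
model = models.vgg19(pretrained=True)
n_classes = 2

# in_features returns the input size of the layer in question
n_inputs = model.classifier[6].in_features

# using Sequential lets us specify our activation functions alongside the layers,
# so we can replace the final layer without writing a class
# note: LogSoftmax as the last layer pairs with nn.NLLLoss, not nn.CrossEntropyLoss
model.classifier[6] = nn.Sequential(
    nn.Linear(n_inputs, 256), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(256, n_classes), nn.LogSoftmax(dim=1))

model.classifier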

In [None]:
# from the tutorial - final layer feature extractor - note no change to final layer function
from torch import optim
from torch.optim import lr_scheduler

# use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_conv = models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False

# Parameters of newly constructed modules have requires_grad=True by default
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, 2)

model_conv = model_conv.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that only parameters of the final layer are being optimized,
# as opposed to the fine tuning example below.
optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)

In [None]:
# from tutorial - fine tuning 

model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)

model_ft = model_ft.to(device)
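
To make the difference between the two tutorial set-ups concrete, here is a sketch that counts trainable parameters. It uses a tiny made-up `nn.Sequential` as the "pretrained" network, and `count_trainable` and `make_backbone` are hypothetical helper names, so only the pattern carries over to resnet18.

```python
import torch
from torch import nn

def count_trainable(model):
    """Number of individual weights that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def make_backbone():
    # toy stand-in for a pretrained network: 8 inputs, 3 original classes
    return nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 3))

# feature extractor: freeze everything, then swap in a new final layer
feature_extractor = make_backbone()
for p in feature_extractor.parameters():
    p.requires_grad = False
feature_extractor[2] = nn.Linear(4, 2)   # new layers are trainable by default

# fine tuning: swap the final layer but leave every layer trainable
fine_tuned = make_backbone()
fine_tuned[2] = nn.Linear(4, 2)

print('feature extractor:', count_trainable(feature_extractor))
print('fine tuning:', count_trainable(fine_tuned))
```

In the feature extractor only the new head's 10 weights train; in fine tuning all 46 do.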

### Summary
* Need to trial and error on the activation function for the final layer; start with none and see if we need to update.

## Part 2: Develop training function

* Understand interplay between final layer (e.g. softmax) output and loss function (e.g. `nn.NLLLoss`)
* Set up a training and validation loop and a means of recording them
* Specify optimizer, criterion and learning rate.
    * looks like we can use a LR scheduler - explore
        * see tutorial; we pass our optimizer to the scheduler and use it when we call step
    * optimizer will need to be set on the parameters we wish to train - some or all 
    * need to know a bit more about error 
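
A sketch of how the optimizer and the LR scheduler fit together, using the same `StepLR` settings as the tutorial (`step_size=7`, `gamma=0.1`) on a toy layer. No real training happens here; the loop only records the learning rate at the start of each epoch.

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(4, 2)   # toy model, just so the optimizer has parameters
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

lrs = []
for epoch in range(15):
    # the learning rate currently held by the optimizer
    lrs.append(optimizer.param_groups[0]['lr'])

    # the training batches would go here; optimizer.step() is called once per batch
    optimizer.step()

    # the scheduler is stepped once per epoch, after the optimizer
    scheduler.step()

print(lrs)
```

The recorded rate stays at 0.001 for epochs 0 to 6, drops to 0.0001 for epochs 7 to 13, and drops again at epoch 14.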

### Loss Function
* explore what this function returns and what the backward calculation is 
* item() returns a python number we can use as a non-tensor 
* Cross Entropy Loss: losses are averaged across observations for each minibatch.

In [None]:
# example used from TL tutorial
criterion = nn.CrossEntropyLoss()

In [None]:
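# if the final layer is nn.LogSoftmax then we pair it with NLLLoss,
# which expects log-probabilities rather than raw scores or plain softmax output
criterion = nn.NLLLoss()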

In [None]:
# this returns a loss value 
loss = criterion(outputs,labels)

# loss returns a tensor, loss.item() returns a float
type(loss.item())
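
A quick check of the interplay between the final layer and the loss function, on random made-up logits. It suggests that `nn.CrossEntropyLoss` on raw scores gives the same value as `nn.LogSoftmax` followed by `nn.NLLLoss`, so we pick one pairing or the other, not both.

```python
import torch
from torch import nn

torch.manual_seed(0)

# made-up raw scores for 5 samples and 3 classes, with made-up labels
logits = torch.randn(5, 3)
labels = torch.tensor([0, 2, 1, 1, 0])

# option 1: raw scores straight into CrossEntropyLoss
ce = nn.CrossEntropyLoss()(logits, labels)

# option 2: LogSoftmax as the final layer, then NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

# both are zero-dimensional tensors; .item() gives a python float
print(ce.item(), nll.item())
```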

### Optimizer
* Various options to be used to update the weights per the losses
* Might be certain rules of thumb that will help us

In [43]:
# calling parameters() returns a generator over the layer's weight and bias tensors - wrap it in list() to inspect them
cnn.category.parameters()

<generator object Module.parameters at 0x13810ee08>

In [42]:
# this will give us all the parameters from the model, to be used for fine tuning
cnn.parameters()

<generator object Module.parameters at 0x13810eb48>

In [4]:
import torch.optim as optim

# to be used at a later date, lets just try SGD at first
#optimizer = optim.Adam(cnn.category.parameters())

In [None]:
# example used from TL tutorial
criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

In [6]:
# lets set a SGD optimizer, as per egs
optimizer_SGD = optim.SGD(cnn.category.parameters(), lr=0.001, momentum=0.9)

In [17]:
# this will show us which parameters require training - same as the last layer, so not too interesting?
for params in optimizer_SGD.param_groups[0]['params']:
    if params.requires_grad:
        print(params.shape)

torch.Size([2, 2048])
torch.Size([2])


* our optimizer parameters change based on whether we are fine tuning or feature extracting - we only optimize our final layer parameters for the latter, and all of them for the former. 
    * I *think* I understand - we could of course optimize for all the layers but most of them are frozen so what would be the point?
* What is our best optimizer? and parameters? What are default, good rules of thumb? and what 
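
On the "what would be the point?" thought above, a sketch with a toy two-layer model (the sizes and data are made up). It passes all parameters to the optimizer, frozen ones included, and checks which weights actually move after one step.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# toy frozen "backbone" and trainable "head"
backbone = nn.Linear(4, 3)
head = nn.Linear(3, 2)
for p in backbone.parameters():
    p.requires_grad_(False)
model = nn.Sequential(backbone, head)

# hand the optimizer every parameter, frozen or not
optimizer = optim.SGD(model.parameters(), lr=0.1)

before_backbone = backbone.weight.clone()
before_head = head.weight.clone()

# one training step on made-up data
x = torch.randn(6, 4)
y = torch.tensor([0, 1, 0, 1, 1, 0])
loss = nn.CrossEntropyLoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# frozen parameters never get a gradient, so the optimizer leaves them alone
backbone_unchanged = torch.equal(backbone.weight, before_backbone)
head_changed = not torch.equal(head.weight, before_head)
print(backbone_unchanged, head_changed)
```

So passing only the final layer's parameters gives the same updates here; it mainly makes the intent explicit and saves the optimizer from iterating over tensors it will skip.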

### Training Function
* add in validation loop
* add in ability to move to GPU
* learn more about how optimizer and criterion work - what they return, what the different options are, and the rules of thumb for both

In [None]:
# basic function - we can build on this by passing parameters for the model, the optimizer, the criterion and the data loaders

def train(n_epochs):
    
    # loop over the dataset multiple times
    for epoch in range(n_epochs):  

        # initiate a running loss total 
        running_loss = 0.0
        
        for batch_i, data in enumerate(train_loader):
            # get the input images and their corresponding labels
            inputs, labels = data       

            # zero the parameter (weight) gradients
            optimizer.zero_grad()

            # forward pass to get outputs
            outputs = net(inputs)

            # calculate the loss ## what does criterion return? 
            loss = criterion(outputs, labels)

            # backward pass to calculate the parameter gradients
            loss.backward()

            # update the parameters
            optimizer.step()

            # print loss statistics
            # to convert loss into a scalar and add it to running_loss, we use .item()
            running_loss += loss.item()
            if batch_i % 1000 == 999:    # print every 1000 mini-batches
                print('Epoch: {}, Batch: {}, Avg. Loss: {}'.format(epoch + 1, batch_i+1, running_loss/1000))
                running_loss = 0.0

    print('Finished Training')

### Thoughts...
* Think about what we are getting when we train a model:
    * the model weights for inference
    * the overall model accuracy and loss, the validation and training versions.
    * **widki** which output from the `model` or `train_loader` these are captured in
    * actually, what does `train_loader` return? Lots of documentation to look up!
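
A sketch of what a loader returns, using a made-up dataset of 10 blank "images" wrapped in a `TensorDataset` (our real loader would wrap an image folder instead). Each iteration yields one `(inputs, labels)` pair of tensors for a batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# made-up data: 10 blank 3-channel 8x8 images with alternating labels
images = torch.zeros(10, 3, 8, 8)
labels = torch.arange(10) % 2

train_loader = DataLoader(TensorDataset(images, labels), batch_size=4)

# each item from the loader is an (inputs, labels) pair for one batch
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in train_loader]

print(len(train_loader))        # number of batches per epoch
print(batch_shapes)
```

Note the last batch holds only 2 samples, so dividing a count of correct predictions by `train_loader.batch_size` understates accuracy for that batch.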

In [None]:
# more complicated train function from tutorial
import copy
import time

def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    
    # record the time the training started
    since = time.time()

    # snapshot the model weights as a dictionary. state_dict() returns references to the
    # live tensors, so copy.deepcopy is needed to keep an independent copy as training continues
    best_model_wts = copy.deepcopy(model.state_dict())
    
    best_acc = 0.0

    for epoch in range(num_epochs):
        # here we could have put epoch + 1 and left num_epochs alone
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        # this seems like a good way of looping through the different phases
        for phase in ['train', 'val']:
            
            if phase == 'train':
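                # scheduler.step() advances the scheduler by one epoch; StepLR then multiplies
                # the optimizer's learning rate by gamma every step_size epochs
                # note: since PyTorch 1.1 this should be called after optimizer.step(),
                # i.e. at the end of the training phase rather than the start
                scheduler.step()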
                
                # we need to set the model modes differently 
                model.train()  # Set model to training mode
            
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            # each train and val phase is a complete turn of the dataloader, batch_size etc.
            for inputs, labels in dataloaders[phase]:
                
                # .to(device) does nothing when device is the CPU, so it is safe to leave in
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                
                # some of this matches the code above, which made no predictions as it used only the loss
                # set_grad_enabled turns gradient tracking on or off - on only if phase == 'train'
                # in the val phase no gradients are tracked: we pass the inputs through the model,
                # make a prediction and calculate a loss, which we still need for reporting
                # that is why outputs, preds and loss sit inside the with block for both phases,
                # and only backward() and step() are guarded by if phase == 'train'
            
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        # print a blank line between epochs
        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load the best weights back into the model (in place)
    model.load_state_dict(best_model_wts)
    
    # this returns the model with the best validation weights loaded
    return model
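
On the `copy.deepcopy` question, a sketch with a toy layer. It simulates one weight update and compares a plain `state_dict()` reference with a deep-copied snapshot.

```python
import copy
import torch
from torch import nn

model = nn.Linear(2, 1)   # toy model

live = model.state_dict()          # tensors here share memory with the model
snapshot = copy.deepcopy(live)     # independent copy of the current weights

# simulate a training update by changing the weights in place
with torch.no_grad():
    model.weight.add_(1.0)

# the plain state_dict follows the model; the deep copy keeps the old values
live_tracks_model = torch.equal(live['weight'], model.weight)
snapshot_is_frozen = not torch.equal(snapshot['weight'], model.weight)
print(live_tracks_model, snapshot_is_frozen)
```

Without the deep copy, `best_model_wts` would silently turn into the final epoch's weights.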


In [None]:
# my function as saved in train_validate.py

## add in early stopping
## add in GPU options
import copy

def train(n_epochs, train_loader, val_loader, model, optimizer, criterion, print_batch=False, print_epoch=True):

    # initiate a best accuracy variable and a best model weights dictionary
    # these sit outside the epoch loop so they are not reset every epoch
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    # loop over the dataset multiple times
    for epoch in range(n_epochs):  
        
        print('Epoch {}/{}:'.format(epoch + 1, n_epochs))
        print('-' * 10)
       
        # initiate a running loss total for train and validation sets 
        running_loss = 0.0
        val_running_loss = 0.0
        
        # initiate a running accuracy total for train and validation sets
        running_accuracy = 0.0
        val_running_accuracy = 0.0
        
        for batch_i, (inputs, labels) in enumerate(train_loader):
            
            # set the model to training mode - this matters for dropout and batch norm layers
            model.train()      

            # zero the parameter (weight) gradients
            optimizer.zero_grad()

            # forward pass to get outputs
            outputs = model(inputs)
            
            ## ACCURACY
            # get predictions to help determine accuracy
            _, predictions = torch.max(outputs, 1)
            
            # get the correct predictions total by comparing predicted with actual
            correct_predictions = torch.sum(predictions == labels).item()
            
            # get an accuracy per batch
            acc_per_batch = correct_predictions / train_loader.batch_size
            
            # calculate a running total of accuracy
            running_accuracy += correct_predictions
            
            # and get an average by dividing this by the size of the dataset
            running_acc_avg = running_accuracy / (train_loader.batch_size * (batch_i + 1))

            ## LOSS
            # calculate the loss 
            loss = criterion(outputs, labels)

            # backward pass to calculate the parameter gradients
            loss.backward()
            
            # store the loss value as a python number in a variable 
            loss_per_batch = loss.item()
            
            # update the parameters
            optimizer.step()

            # keep a running total of our losses 
            running_loss += loss_per_batch
            
            # and an average per batch 
            running_loss_avg = running_loss / float (batch_i + 1)
            
            if print_batch:
                print('Batch {}: Accuracy: {:.4f}; Loss: {:.4f}'.format(batch_i + 1,acc_per_batch, loss_per_batch))
                
        if print_epoch:        
            print('Loss: {:.4f}; Accuracy: {:.4f}'.format(running_loss_avg, running_acc_avg))  
    
        for batch_ii, (val_inputs, val_labels) in enumerate(val_loader):
            
            # no gradients are needed for validation, so turn gradient tracking off
            with torch.no_grad():
                # eval mode switches dropout off and makes batch norm use its running statistics
                # there is no need to zero gradients here as none are calculated
                model.eval()

                # forward pass to get outputs
                val_outputs = model(val_inputs)

                ## ACCURACY
                # get predictions to help determine accuracy
                _, val_predictions = torch.max(val_outputs, 1)

                # get the correct predictions total by comparing predicted with actual
                val_correct_predictions = torch.sum(val_predictions == val_labels).item()

                # get an accuracy per batch
                val_acc_per_batch = val_correct_predictions / val_loader.batch_size

                # calculate a running total of accuracy
                val_running_accuracy += val_correct_predictions

                # and get an average by dividing this by the size of the dataset
                val_running_acc_avg = val_running_accuracy / (val_loader.batch_size * (batch_ii + 1))

                ## LOSS
                # calculate the loss  - we don't need to calculate the loss.backward or optimizer step
                val_loss = criterion(val_outputs, val_labels)

                # store the loss value as a python number in a variable 
                val_loss_per_batch = val_loss.item()

                # keep a running total of our losses 
                val_running_loss += val_loss_per_batch

                # and an average per batch 
                val_running_loss_avg = val_running_loss / float (batch_ii + 1)
                
                if print_batch:
                    print('VAL: Batch {}: Accuracy: {:.4f}; Loss: {:.4f}'
                          .format(batch_ii + 1, val_acc_per_batch, val_loss_per_batch))

        # after the full validation pass, keep the weights if this epoch is the best so far
        if val_running_acc_avg > best_acc:
            best_acc = val_running_acc_avg
            best_model_wts = copy.deepcopy(model.state_dict())
            # print('Model weights updated')
                    
        if print_epoch:        
            print('VAL: Loss: {:.4f}; Accuracy: {:.4f}'.format(val_running_loss_avg, val_running_acc_avg)) 
            print()

    print('Finished Training')
    
    # load best model weights
    model.load_state_dict(best_model_wts)
    print('Best model weights loaded')
    
    return model
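
For the "add in early stopping" to-do, one possible sketch. `EarlyStopping`, `patience` and `min_delta` are hypothetical names of our own, not a PyTorch API, and the validation losses are made up. The idea is to stop once the validation loss has failed to improve for `patience` epochs in a row.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience        # epochs to wait without improvement
        self.min_delta = min_delta      # smallest drop that counts as improvement
        self.best_loss = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# made-up validation losses, one per epoch
val_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

In `train` this check would sit at the end of each epoch, after the validation loop, with a `break` out of the epoch loop. With these numbers the loop stops at epoch 5 and never sees the later 0.5, which is the usual trade-off when choosing `patience`.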

In [None]:
# for loop exploration
# here we can see that, if for loops are at the same level, they each complete a full loop 
# before moving to the next for loop. 
# the top level means that one loop is completed before moving to the next, in this case the other two 
# loops are completed before we go on to the top loop's second value and start again 

## the lesson: our val set needs to be on the same level as our train set

loop_1 = 0
loop_2 = 10

for outer in range(2):
    print('new loop')

    # first inner loop runs to completion
    for _ in range(10):
        loop_1 += 1
        print(loop_1)
    
    print('end loop 1')

    # second inner loop only starts once the first has finished
    for _ in range(10):
        loop_2 += 1
        print(loop_2)
    
    print('end loop 2')

print('end main loop')

#### Exploring Accuracy

In [None]:
## we get outputs per batch - widki these are our predictions
## then pass those outputs to the loss function

resnet.eval()

for i, (data, labels) in enumerate(train_loader):
    # this returns one batch's worth of output values
    outputs = resnet(data)

In [None]:
# we then need a way to determine what the predictions are - using the torch.max function
import torch

# if we just call torch.max on outputs 
torch.max(outputs)

In [None]:
# if we just call torch.max on outputs with dim = 1 then it returns two values
# Returns the maximum value of each row of the input tensor in the given dimension dim. 
# The second return value is the index location of each maximum value found (argmax).
torch.max(outputs, 1)

In [None]:
# this returns just the predictions
_, pred = torch.max(outputs, 1)
pred

#### Comparing the predicted and actual values
* view() returns a different shape for the tensor - looks like we need to specify an integer for this
* view_as() returns a different shape for the tensor we perform the method on, and lets us specify another tensor as the desired shape
* One way of comparing true vs predicted is element-wise equality: we compare pred with the label data
* We can do this a few ways, similar to numpy.
* When we put in our equality comparator, we can just specify labels. This seems to work. The code below seems to use a belt and braces approach, so that if labels and preds are different sizes, we change labels to the size of preds to help with the equality comparison.

In [None]:
n1 = torch.Tensor([2,2,2,0])
n2 = torch.Tensor([1,1,1,0])

In [None]:
## eq compares two tensors for equality, elementwise. it returns True where they match and False where they do not (1 and 0 in older PyTorch versions)

# this will return a comparison of n1 and n2
torch.eq(n1,n2)

In [None]:
# as will this 
n1.eq(n2)

In [None]:
# so why do we need to use this code?
# View this tensor as the same size as other. self.view_as(other) is equivalent to self.view(other.size()).
# Please see view() for more information about view.
# Parameters:	other (torch.Tensor) – The result tensor has the same size as other.

labels.size()
labels.data.size()
labels.data.view_as(pred).size()

In [None]:
## there seems to be a few ways to then compare with the actuals

# correct_tensor = pred.eq(target.data.view_as(pred))
pred.eq(labels.data.view_as(pred))
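
A sketch of a case where the `view_as` belt and braces does matter, with made-up tensors. If the labels arrive as a column of shape `(4, 1)` while the predictions have shape `(4,)`, a plain `eq` broadcasts to a 4x4 grid instead of comparing element by element.

```python
import torch

pred = torch.tensor([1, 0, 1, 1])              # shape (4,)
labels = torch.tensor([[1], [0], [0], [1]])    # shape (4, 1), e.g. a column of targets

# without reshaping, broadcasting compares every prediction with every label
broadcast_shape = tuple(pred.eq(labels).shape)

# view_as reshapes labels to match pred, giving a true element-wise comparison
correct = pred.eq(labels.view_as(pred))
n_correct = correct.sum().item()

print(broadcast_shape, n_correct)
```

When labels and predictions already share a shape, as in our loaders, `pred == labels` is enough.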

In [None]:
# lets sum this up 
sum(n1.eq(n2))

# and compare with the len of the data
# note we need to use .double() to convert to a float tensor, or .item() to convert to a python number
sum(n1.eq(n2)).double() / len(n1)

In [None]:
torch.sum(pred == labels).item() / len(pred)

In [None]:
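# why use .data? This is an interesting question
# labels.type() and labels.data.type() return the same thing - see notes 
# .data is a leftover from the pre-0.4 Variable API; comparing against labels directly gives the same result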

In [None]:
# statistics
running_loss += loss.item() * inputs.size(0)
running_corrects += torch.sum(preds == labels.data)
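
A sketch of why the tutorial multiplies the loss by `inputs.size(0)`, on made-up logits split into batches of 4, 4 and 2. Since `CrossEntropyLoss` averages within each batch, weighting by batch size and dividing by the dataset size recovers the loss over the whole dataset, even when the last batch is smaller.

```python
import torch
from torch import nn

torch.manual_seed(0)

# made-up model outputs and labels for a dataset of 10 samples
logits = torch.randn(10, 2)
labels = torch.tensor([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
criterion = nn.CrossEntropyLoss()

batch_size = 4
running_loss = 0.0
running_corrects = 0

for start in range(0, len(labels), batch_size):
    outputs = logits[start:start + batch_size]
    targets = labels[start:start + batch_size]

    loss = criterion(outputs, targets)          # mean loss over this batch
    _, preds = torch.max(outputs, 1)

    # weight the batch mean by the number of samples in the batch
    running_loss += loss.item() * outputs.size(0)
    running_corrects += torch.sum(preds == targets).item()

epoch_loss = running_loss / len(labels)
epoch_acc = running_corrects / len(labels)

# reference values computed over the whole dataset in one go
full_loss = criterion(logits, labels).item()
full_acc = (logits.argmax(dim=1) == labels).float().mean().item()

print(epoch_loss, full_loss, epoch_acc, full_acc)
```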

#### Saving the model 

In [None]:
# this gives us our weights, which we can save as a variable and load later
best_model_wts = model.state_dict()
    
# load best model weights
model.load_state_dict(best_model_wts)
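
To keep the weights beyond the current session, the state dictionary can also be written to disk. A sketch with a toy layer and a temporary directory; the file name `checkpoint.pt` is an arbitrary choice of ours.

```python
import os
import tempfile
import torch
from torch import nn

model = nn.Linear(4, 2)   # toy model standing in for the trained network

# write the weights to a file (hypothetical file name in a temp directory)
path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')
torch.save(model.state_dict(), path)

# later: build a model with the same architecture and load the weights into it
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))

same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), restored.state_dict().values()))
print(same)
```

The file holds only the weights, so the architecture has to be rebuilt in code (for us, `ResNetCNN(class_size)`) before calling `load_state_dict`.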